
Ott Data Project

The document presents a case study on OTT media services, focusing on the analysis of content viewership and its influencing factors. It includes various statistical analyses, including univariate and bivariate analyses, to understand the distribution of content views, genres, and the impact of release timing on viewership. The study aims to identify key drivers for first-day content viewership to enhance user engagement on the ShowTime platform.

Uploaded by

Daffini Ruby
Copyright
© All Rights Reserved

OTT DATA

CASE STUDY

xxxxx

TABLE OF CONTENTS
Introduction
Problem Statement
1. What does the distribution of content views look like?
2. What does the distribution of genres look like?
3. The day of the week on which content is released generally plays a key role in the viewership. How does the viewership vary with the day of release?
4. How does the viewership vary with the season of release?
5. What is the correlation between trailer views and content views?
A. Univariate analysis
1. Continuous variable data
2. Categorical variable data
B. Bivariate Analysis
1. Heat map correlation for continuous data
2. Two Continuous data
3. One Continuous and One Categorical Data
C. Outlier Treatment
1. Before outlier treatment (visitor)
2. After outlier treatment (visitor)
3. Before outlier treatment (ad_impression)
4. After outlier treatment (ad_impression)
D. Data preparation for modeling
1. Value changing from integers to string
2. Dropping off the output variable
3. Creating dummy variables
4. Splitting the data
E. Model building – linear regression
1. Model coefficients
1.1 R-squared value
1.2 Coefficients
1.3 P-value (P>|t|)
1.4 VIF values
1.5 Ad_impressions
1.6 Genre_Romance
2. After removal of all genre dummy variables
2.1 Dayofweek_Tuesday
2.2 Dayofweek_Monday
2.3 Dayofweek_Thursday
3. After removal of all dummy variables
3.1 Training data result
3.2 Testing data result
F. Assumptions of linear regression model
1. Test for linearity and independence
2. Test for normality
3. Test for homoscedasticity
G. Final Model
1. Training Performance result
2. Testing Performance result

Conclusion and Recommendations

List of Tables:
1. Dataset Details
2. Statistical data details
3. Major_sports_event value changed to categorical data
4. Dropping off the output variable from input data
5. Creating the dummy variable for all the categorical data
6. OLS Regression result for original data
7. VIF data
8. OLS Regression result after removing Ad_impression
9. VIF Values after removing Ad_impression
10. OLS Regression result after removing Genre_Romance
11. OLS Regression result after removing all Genre dummy variable
12. OLS Regression result after removing Dayofweek_Tuesday
13. OLS Regression result after removing Dayofweek_Monday
14. OLS Regression result after removing Dayofweek_Thursday
15. OLS Regression result after removing all dummy variables
16. Training data result
17. Testing data result
18. Final OLS Regression model
19. Training Performance result
20. Testing Performance result

List of Figures
1. Boxplot and Histplot of content views
2. Barplot of distribution of genres
3. Barplot of visitors and daysofweek
3.1 Pairplot of views_content and daysofweek
4. Boxplot for visitors vs. season
4.1 Histplot for visitors vs. season
4.2 Pairplot for visitors and views_content vs. season
5. Pairplot for view_trailers vs. views_content
6. Boxplot and Histplot for ad_impressions
7. Boxplot and Histplot for Major_sports_events
8. Boxplot and Histplot for Views_trailer
9. Boxplot and Histplot for Views_content
10. Pie-chart and countplot for season
11. Bargraph for Genre
12. Bargraph for dayofweek
13. Continuous data correlation
14. Pairplot for views_content and views_trailer
15. Pairplot for visitors and views_trailer
16. Boxplot for Visitors and Season
17. Boxplot for visitors with outliers
18. Boxplot for visitors without outliers
19. Boxplot for ad_impressions with outliers
20. Boxplot for ad_impressions without outliers
21. Residual plot for Fitted values and Residuals
22. Histplot for Normality of residuals
23. QQ plot
OTT MEDIA SERVICE:
Introduction:
An over-the-top (OTT) media service is a media service offered directly to viewers
via the internet. The term is most synonymous with subscription-based video-on-
demand services that offer access to film and television content, including existing
series acquired from other producers, as well as original content produced
specifically for the service. They are typically accessed via websites on personal
computers, apps on smartphones and tablets, or televisions with integrated Smart
TV platforms.
Objective:
ShowTime is an OTT service provider and offers a wide variety of content
(movies, web shows, etc.) for its users. They want to determine the driver variables
for first-day content viewership so that they can take necessary measures to
improve the viewership of the content on their platform. Some of the reasons for
the decline in viewership of content would be the decline in the number of people
coming to the platform, decreased marketing spend, content timing clashes,
weekends and holidays, etc.
Data Description:

Variable             Definition
Visitors             Average number of visitors, in millions, to the platform in the past week
Ad_impressions       Number of ad impressions, in millions, across all ad campaigns for the content (running and completed)
Major_sports_event   Any major sports event on the day
Marital_status       Marital status (Married & Single)
Genre                Genre of the content
Dayofweek            Day of the release of the content
Season               Season of the release of the content
Views_trailer        Number of views, in millions, of the content trailer
Views_content        Number of first-day views, in millions, of the content

Data Information:

Table 1: dataset details


Note:
- The data contains 1000 observations and 8 variables.
- No values are missing in the dataset.
- Visitors, Ad_impressions, Major_sports_event, Marital_status, Views_trailer and Views_content hold numeric data, while Genre, Dayofweek and Season hold categorical data. The data types present are integer, float and object.
- There are 5 numeric (*float* and *int* type) and 3 string (*object* type) columns in the data.
- The target variable is Views_content, which is of *float* type.

Statistical summary of the data:

Table 2: Statistical data details
- The content views vary between 0.22 and 0.89.
- The maximum number of visitors is about 2.34 million.
Missing values in Dataset:
- There are no missing values in the dataset.
- There are no duplicate values in the dataset.
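The checks reported above (number of observations and variables, data types, missing values and duplicates) can be sketched with pandas. This is only an illustration on a tiny made-up sample: the column names follow the data description, but the values are hypothetical and are not taken from the ShowTime dataset.

```python
import pandas as pd

# Hypothetical sample rows; the real dataset has 1000 observations.
df = pd.DataFrame({
    "visitors": [1.67, 1.46, 1.47, 1.85],
    "ad_impressions": [1113.81, 1498.41, 1079.19, 1342.77],
    "major_sports_event": [0, 1, 1, 1],
    "genre": ["Horror", "Thriller", "Thriller", "Sci-Fi"],
    "dayofweek": ["Wednesday", "Friday", "Wednesday", "Friday"],
    "season": ["Spring", "Fall", "Fall", "Fall"],
    "views_trailer": [56.70, 52.69, 48.74, 49.81],
    "views_content": [0.51, 0.32, 0.39, 0.44],
})

n_rows, n_cols = df.shape                    # observations, variables
n_missing = int(df.isnull().sum().sum())     # total number of missing cells
n_duplicates = int(df.duplicated().sum())    # number of fully duplicated rows

# Split the columns into numeric and non-numeric (string) ones.
numeric_cols = df.select_dtypes(include="number").columns.tolist()
string_cols = df.select_dtypes(exclude="number").columns.tolist()

print(n_rows, n_cols, n_missing, n_duplicates)
print(numeric_cols)
print(string_cols)
```

On the full dataset the same calls would be expected to report 1000 rows, 8 columns, no missing cells and no duplicated rows, in line with the notes above.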
Problem Statement
1. What does the distribution of content views look like?

Figure 1. Boxplot and Histplot of content views


The above graph shows that
- The data is slightly right skewed, with many outliers present in the dataset.
- The mean value of content views is between 0.4 and 0.5.
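Right skew can also be checked numerically rather than read off the histogram. The sketch below computes the moment-based sample skewness for a short list of made-up view counts (hypothetical values chosen to lie in the reported 0.22 to 0.89 range); a positive result, together with a mean above the median, points to a right-skewed distribution.

```python
from statistics import mean, median


def sample_skewness(values):
    """Moment-based skewness: third central moment / (second central moment)^1.5."""
    n = len(values)
    m = mean(values)
    m2 = sum((v - m) ** 2 for v in values) / n
    m3 = sum((v - m) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Hypothetical first-day views (in millions), not the real data.
views = [0.22, 0.35, 0.40, 0.42, 0.45, 0.47, 0.50, 0.55, 0.70, 0.89]

skew = sample_skewness(views)
print("skewness:", round(skew, 3))
print("mean:", round(mean(views), 3), "median:", round(median(views), 3))
```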
2. What does the distribution of genres look like?

Figure 2. Barplot of distribution of genres


The above graph shows that
- The Comedy and Thriller genres each have a share of 11.4%, which suggests that users are mostly interested in watching comedy and thriller shows.
- Drama is 0.4 percentage points higher than Romance. About 10% of the visitors prefer other genre types such as Sci-Fi, Horror and Action.
3. The day of the week on which content is released generally plays a key role in
the viewership. How does the viewership vary with the day of release?

Figure 3. Barplot of visitors and daysofweek
The above graph indicates that
- The number of visitors on the OTT platform was higher on Saturday than on other days.
- On Wednesday and Friday, the number of visitors is quite similar, ranging from 1.2 to 1.8.

Figure 3.1 Pairplot of views_content and daysofweek

- The graph shows that there is a relationship between content views and the day of the week.
- About 75 percent of the content was viewed on Saturday, Wednesday and Friday. The graph also shows that content views on Friday, Saturday and Sunday are noticeably higher than on other days.

4. How does the viewership vary with the season of release?

Figure 4 Boxplot for visitors vs season

Figure 4.1 Histplot for visitors vs season

Figure 4.2 Pairplot for visitors and views_content vs season


The above graphs show that
- Viewership is strongly related to the season of release. For instance, the majority of OTT users spend more time on the platform in Spring and Winter.
- The boxplot shows some outliers in the data.
5. What is the correlation between trailer views and content views?

Figure 5 Pairplot for view_trailers vs views_content


- The above graph shows that the data is heavily right skewed and that trailer views and content views are positively correlated.
- This indicates that the majority of the trailers were watched by the users.
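The strength of the relationship between trailer views and content views can be summarised with the Pearson correlation coefficient. The following is a minimal sketch using NumPy on a few made-up value pairs (hypothetical numbers, not the study data); a coefficient close to +1 corresponds to the strong positive relationship described above.

```python
import numpy as np

# Hypothetical trailer views and first-day content views (both in millions).
views_trailer = np.array([48.7, 52.7, 56.7, 60.1, 75.3, 110.4])
views_content = np.array([0.39, 0.41, 0.45, 0.47, 0.55, 0.68])

# np.corrcoef returns the 2x2 correlation matrix; the off-diagonal
# entry is the Pearson correlation between the two variables.
r = np.corrcoef(views_trailer, views_content)[0, 1]
print("Pearson r:", round(float(r), 3))
```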

A. UNIVARIATE ANALYSIS FOR CONTINUOUS VARIABLE DATA:
AD_IMPRESSIONS:

Figure 6 : Boxplot and Histplot for ad_impressions


The figure shows that the data is heavily right skewed. The values of ad_impressions start at about 1000 and the mean value is 1450.
MAJOR_SPORTS_EVENTS:

Figure 7 : Boxplot and Histplot for Major_sports_events

Whether a major sports event occurred is encoded as 0 or 1. The mean value is 0.4 and the median value is 0.
VIEWS_TRAILER:

Figure 8 : Boxplot and Histplot for Views_trailer


The above graph illustrates that
- The data has many outliers and is heavily right skewed.
- Some trailers were watched far more often than others. The median value is approximately 55.
VIEWS_CONTENT:

Figure 9 : Boxplot and Histplot for Views_content
The above graph illustrates that
- The data has outliers and appears to follow a bimodal distribution.
- For some content, the first-day views were high; the median value is 0.48.

UNIVARIATE ANALYSIS FOR CATEGORICAL VARIABLE DATA:


SEASON:

Figure 10 : Pie-chart and countplot for season


- The data indicates that content viewing was slightly higher in the Winter season than in other seasons.

GENRE:

Figure 11 : Bargraph for Genre
The above graph illustrates that
- The data shows which genres users prefer to watch.
- Visitors' preference for Sci-Fi, Horror, Romance and Action is roughly equal, while the Others category accounts for around 25.5% of the overall data.

DAYOFWEEK:

Figure 12 : Bargraph for dayofweek
The above graph illustrates that
- The data shows the share of users on weekdays and on weekends.
- On Friday about 36.9% of the users were visiting the OTT platform, and on Wednesday the share was 33.2%. In contrast, on weekends it was only 15%.

BIVARIATE ANALYSIS
HEAT MAP CORRELATION FOR CONTINUOUS DATA:

Figure 13: Continuous data correlation

From the above heat map, we find a positive correlation between Views_content and Views_trailer.
This correlation is strong and plays a key role in users visiting the platform and viewing the content.

BIVARIATE ANALYSIS - TWO CONTINUOUS DATA


VIEWS_CONTENT AND VIEWS_TRAILER

Figure 14: Pairplot for views_content and views_trailer

- Views_content and Views_trailer have a strong positive correlation.
- The distributions in this pairplot appear to be right skewed.


ONE CONTINUOUS AND ONE CATEGORICAL DATA
VISITORS AND VIEWS_TRAILER VS SEASON

Figure 15: Pairplot for visitors and views_trailer

- Views_content and Views_trailer have a strong positive correlation.
VISITORS AND SEASON

Figure 16: Boxplot for Visitors and Season

- The above graph shows that the number of visitors is about the same in Summer and Winter.

C. OUTLIER TREATMENT:
BEFORE OUTLIER TREATMENT (VISITOR):

Figure 17: Boxplot for visitors with outliers

AFTER OUTLIER TREATMENT (VISITOR):

Figure 18: Boxplot for visitors without outliers

- The above graphs show that a few outliers were detected, so we applied outlier treatment to the data.
- After the treatment, the dataset contains 980 records.
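The report does not state which rule was used for the outlier treatment. Since the number of records drops after each step, one plausible reading is that rows outside the boxplot whiskers were removed. The sketch below assumes the common 1.5 x IQR rule; both the rule and the sample values are assumptions made for illustration.

```python
import numpy as np


def drop_iqr_outliers(values, k=1.5):
    """Keep only values inside [Q1 - k*IQR, Q3 + k*IQR] (assumed rule)."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    mask = (values >= lower) & (values <= upper)
    return values[mask], mask


# Hypothetical visitor counts (in millions) with one extreme value.
visitors = [1.2, 1.5, 1.6, 1.7, 1.7, 1.8, 1.9, 2.0, 2.34, 4.5]

kept, mask = drop_iqr_outliers(visitors)
print("records before:", len(visitors), "after:", len(kept))
```

In a DataFrame, the boolean mask would be used to filter whole rows so that all columns stay aligned.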

BEFORE OUTLIER TREATMENT (AD_IMPRESSION):

Figure 19: Boxplot for ad_impressions with outliers

AFTER OUTLIER TREATMENT (AD_IMPRESSION):

Figure 20: Boxplot for ad_impressions without outliers

- After the treatment, the dataset contains 968 records.

D. DATA PREPARATION FOR MODELING:


VALUE CHANGING FROM INTEGERS TO STRING:

Table 3: Major_sports_event value changed to categorical data


- We want to predict views_content for the OTT data.
- Before we proceed to build a model, we have to encode the categorical features.
- We will split the data into train and test sets so that we can evaluate the model built on the train data.
- We will build a Linear Regression model using the train data and then check its performance.
DROPPING OFF THE OUTPUT VARIABLE:

Table 4: dropping off the output variable from input data


CREATING DUMMY VARIABLES:

Table 5: Creating the dummy variable for all the categorical data
SPLITTING THE DATA:
The overall data was split into train and test sets for building and evaluating the model.
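The preparation steps above (recoding the sports-event flag, separating the output variable, creating dummy variables and splitting the data) can be sketched as follows. The sample values, the "no"/"yes" labels, the 70/30 split ratio and the random seed are all assumptions made for this illustration; the report does not state them.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with a subset of the columns.
df = pd.DataFrame({
    "visitors": [1.67, 1.46, 1.47, 1.85, 1.46, 1.52, 1.71, 1.49, 1.79, 1.58],
    "major_sports_event": [0, 1, 1, 1, 0, 1, 0, 1, 0, 0],
    "season": ["Spring", "Fall", "Fall", "Fall", "Winter",
               "Summer", "Spring", "Winter", "Summer", "Fall"],
    "views_content": [0.51, 0.32, 0.39, 0.44, 0.46, 0.41, 0.50, 0.38, 0.52, 0.45],
})

# Recode the 0/1 flag as a string category (labels are assumed).
df["major_sports_event"] = df["major_sports_event"].map({0: "no", 1: "yes"})

# Separate the input variables from the output variable.
X = df.drop(columns="views_content")
y = df["views_content"]

# One dummy column per category level, dropping the first level of each
# variable so the dummies are not perfectly collinear with the intercept.
X = pd.get_dummies(X, drop_first=True, dtype=float)

# Shuffle the row positions and hold out 30% of them for testing.
rng = np.random.default_rng(1)
order = rng.permutation(len(X))
n_test = int(round(0.3 * len(X)))
test_idx, train_idx = order[:n_test], order[n_test:]
X_train, X_test = X.iloc[train_idx], X.iloc[test_idx]
y_train, y_test = y.iloc[train_idx], y.iloc[test_idx]

print(list(X.columns))
print(X_train.shape, X_test.shape)
```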

E. MODEL BUILDING – LINEAR REGRESSION:

Table 6: OLS Regression result for original data


1. MODEL COEFFICIENTS
R-squared value:
- The model can explain 78.8% of the variance in the training set.
Coefficients:
- The coefficients tell us how a one-unit change in X affects Y.
- The sign of a coefficient indicates whether the relationship is positive or negative. In this dataset, for instance, a one-unit increase in visitors is associated with a 0.1286 increase in views_content, whereas a one-unit increase in a predictor with a negative coefficient (here -0.0637) is associated with a 0.0637 decrease in views_content.
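To make the meaning of the coefficients and of R-squared concrete, the sketch below fits an ordinary least squares model with NumPy on synthetic data. The data are generated with known, made-up coefficients (0.13 for visitors and -0.06 for a 0/1 event flag), so the fit can be compared against them; none of these numbers come from the report's regression tables.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Synthetic predictors: a continuous one and a 0/1 flag.
visitors = rng.uniform(1.2, 2.3, n)
sports_event = rng.integers(0, 2, n)
noise = rng.normal(0.0, 0.02, n)

# Synthetic target built from assumed coefficients.
views_content = 0.25 + 0.13 * visitors - 0.06 * sports_event + noise

# Design matrix with an intercept column, then the least squares fit.
X = np.column_stack([np.ones(n), visitors, sports_event])
beta, *_ = np.linalg.lstsq(X, views_content, rcond=None)

# R-squared = 1 - (residual sum of squares / total sum of squares).
fitted = X @ beta
ss_res = np.sum((views_content - fitted) ** 2)
ss_tot = np.sum((views_content - views_content.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot

print("intercept, visitors, sports_event:", np.round(beta, 4))
print("R-squared:", round(float(r_squared), 3))
```

The positive coefficient on visitors and the negative one on the event flag are read exactly as described above: the change in the target for a one-unit change in that predictor, with the other predictor held constant.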

P-value (P>|t|):
- If the p-value is greater than 0.05, the predictor variable is not statistically significant.
- In our dataset, ad_impressions, genre_Comedy, genre_Drama, genre_Horror, genre_Others, genre_Romance, genre_Sci-Fi, genre_Thriller, dayofweek_Monday and dayofweek_Tuesday all have p-values greater than 0.05. So, we need to analyze which of these variables have little impact on the y variable. A variable with little influence can be eliminated from the model.
VIF values:

Table 7: VIF data


- The VIF values indicate whether the data has multicollinearity or not.
- If the VIF of a variable exceeds 5, we assume that the data has multicollinearity.
- In our dataset, the VIF values of all the predictor variables are less than 5, so we assume that the data does not have multicollinearity.
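For reference, the VIF of a predictor is 1 / (1 - R²), where R² comes from regressing that predictor on all the other predictors. The sketch below implements this definition directly with NumPy on synthetic columns; the data and the helper name `vif` are illustrative assumptions, not the code used for Table 7.

```python
import numpy as np


def vif(X):
    """Variance inflation factor of each column of X (predictors only)."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    factors = []
    for j in range(p):
        target = X[:, j]
        # Regress column j on the remaining columns plus an intercept.
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ coef
        r2 = 1 - (resid @ resid) / np.sum((target - target.mean()) ** 2)
        factors.append(1.0 / (1.0 - r2))
    return np.array(factors)


rng = np.random.default_rng(0)
a = rng.normal(size=300)
b = rng.normal(size=300)
c = a + 0.1 * rng.normal(size=300)   # almost a copy of a

print(np.round(vif(np.column_stack([a, b])), 2))      # unrelated predictors
print(np.round(vif(np.column_stack([a, b, c])), 2))   # a and c are collinear
```

With the threshold of 5 used above, the unrelated pair stays well below it, while the nearly duplicated columns end up far above it.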

Ad_impressions:
- Removing Ad_impressions does not affect the R-squared. So, we removed this variable, which is not influencing the y variable (views_content).

Table 8: OLS Regression result after removing Ad_impression

Table 9: VIF Values after removing Ad_impression

Genre_Romance:

Table 10: OLS Regression result after removing Genre_Romance
- After removal of genre_Romance, we still find that the p-value is greater than 0.05 for some variables.
- So, we removed all the remaining genre dummy variables (genre_Comedy, genre_Drama, genre_Horror, genre_Others, genre_Sci-Fi and genre_Thriller).
2. AFTER REMOVAL OF ALL GENRE DUMMY VARIABLES:

Table 11: OLS Regression result after removing all Genre dummy variable
- After removal of all Genre dummy variables, we still find that the p-value is greater than 0.05 for dayofweek_Monday and dayofweek_Tuesday.
2.1 Dayofweek_Tuesday:

Table 12: OLS Regression result after removing Dayofweek_Tuesday
- After removal of the dayofweek_Tuesday variable, we still find that the p-value is greater than 0.05 for dayofweek_Monday and dayofweek_Thursday.
2.2 Dayofweek_Monday:

Table 13: OLS Regression result after removing Dayofweek_Monday

- After removal of the dayofweek_Monday variable, we still find that the p-value is greater than 0.05 for dayofweek_Thursday.
2.3 Dayofweek_Thursday:

Table 14: OLS Regression result after removing Dayofweek_Thursday


- After removal of the dayofweek_Thursday variable, we find that the p-values of the remaining variables are less than 0.05.
- We removed all the dummy variables which do not have any influence over the output variable (views_content).

3. AFTER REMOVAL OF ALL DUMMY VARIABLES:

Table 15: OLS Regression result after removing all dummy variables
3.1 TRAINING DATA RESULT:

Table 16: Training data result


3.2 TESTING DATA RESULT:

Table 17: Testing data result


- We can see that the RMSE values on the train and test datasets are comparable. So, our model is not suffering from overfitting.
- MAE indicates that our current model is able to predict views_content within a mean error of 0.041 on the test data.
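For reference, RMSE is the square root of the mean squared error and MAE is the mean of the absolute errors. A minimal sketch on made-up actual and predicted values (hypothetical numbers, not the model's output) is:

```python
import math


def rmse(actual, predicted):
    """Root mean squared error."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def mae(actual, predicted):
    """Mean absolute error."""
    absolute = [abs(a - p) for a, p in zip(actual, predicted)]
    return sum(absolute) / len(absolute)


# Hypothetical first-day views (in millions) and model predictions.
actual = [0.50, 0.40, 0.45, 0.60]
predicted = [0.46, 0.44, 0.45, 0.52]

print("RMSE:", round(rmse(actual, predicted), 4))
print("MAE:", round(mae(actual, predicted), 4))
```

Computing both metrics on the train set and on the test set and comparing them is what supports the overfitting statement above: a test error far larger than the train error would suggest overfitting.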

F. ASSUMPTIONS OF LINEAR REGRESSION MODEL:
1. TEST FOR LINEARITY AND INDEPENDENCE:

Figure 21: Residual plot for Fitted values and Residuals

- We observed that the pattern is slightly distorted and the data points seem to be randomly distributed.
- The residuals show no clear pattern, so there is no sign of non-linearity.

2. TEST FOR NORMALITY:

Figure 22: Histplot for Normality of residuals

- The residual terms appear to be normally distributed.

Figure 23: QQ plot

- Most of the points lie on the straight line.

SHAPIRO-WILK TEST:

Null hypothesis: Data is normally distributed


Alternate hypothesis: Data is not normally distributed
P-value:

- The p-value is greater than 0.05, so we fail to reject the null hypothesis and assume that the residuals are normally distributed.

3. TEST FOR HOMOSCEDASTICITY:


GOLDFELD-QUANDT TEST:
Null hypothesis: Residuals are homoscedastic
Alternate hypothesis: Residuals are heteroscedastic
P-value:

- The p-value is greater than 0.05, so we fail to reject the null hypothesis and assume that the residuals are homoscedastic.
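The idea behind the Goldfeld-Quandt test can be sketched as follows: split the observations into two groups, fit a separate regression on each, and compare the residual variances through an F ratio. The sketch below computes only that ratio, on synthetic data, and simply splits the rows at the midpoint in the order given; the p-value (which needs the F distribution) is omitted. In practice a library routine such as `het_goldfeldquandt` in statsmodels would normally be used instead.

```python
import numpy as np


def goldfeld_quandt_f(y, X):
    """Ratio of residual variances from OLS fits on the two halves of the data."""
    y = np.asarray(y, dtype=float)
    X = np.asarray(X, dtype=float)
    half = len(y) // 2

    def rss_and_df(y_part, X_part):
        coef, *_ = np.linalg.lstsq(X_part, y_part, rcond=None)
        resid = y_part - X_part @ coef
        return resid @ resid, len(y_part) - X_part.shape[1]

    rss1, df1 = rss_and_df(y[:half], X[:half])
    rss2, df2 = rss_and_df(y[half:], X[half:])
    return (rss2 / df2) / (rss1 / df1)


rng = np.random.default_rng(0)
n = 400
x = np.sort(rng.uniform(1.0, 2.5, n))          # rows ordered by the predictor
X = np.column_stack([np.ones(n), x])

# Constant noise level versus noise that grows with x (synthetic data).
y_homo = 0.2 + 0.13 * x + rng.normal(0, 0.05, n)
y_hetero = 0.2 + 0.13 * x + rng.normal(0, 0.05, n) * x ** 3

f_homo = goldfeld_quandt_f(y_homo, X)
f_hetero = goldfeld_quandt_f(y_hetero, X)
print("F ratio, constant variance:", round(float(f_homo), 2))
print("F ratio, growing variance:", round(float(f_hetero), 2))
```

A ratio near 1 is consistent with homoscedastic residuals, while a ratio far from 1 points to heteroscedasticity.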

G. FINAL MODEL:

Table 18: Final OLS Regression model

OBSERVATION:
- The R-squared of the model is 0.784 and the adjusted R-squared is 0.781, which shows that the model is able to explain about 78% of the variance in the data, which is quite reasonable.
- A one-unit increase in visitors results in a 0.1275 increase in views_content, all other variables held constant.
- views_content is higher for content released on Saturday (dayofweek_Saturday) than on the other days.
- views_content is higher for content released in Summer (season_Summer) than in Spring and Winter.

TRAINING PERFORMANCE RESULT:

Table 19: Training Performance result

TESTING PERFORMANCE RESULT:

Table 20: Testing Performance result

CONCLUSIONS AND RECOMMENDATIONS:


- The model is able to explain ~78% of the variation in the data, which is good.
- This indicates that the model is good for prediction as well as inference purposes.
- If the number of people visiting the OTT platform increases by one unit, views_content increases by 0.1275 units, all other variables held constant.
- The RMSE values on the train and test datasets are comparable and low. So, our model is suffering from neither overfitting nor underfitting.
- MAE indicates that our current model is able to predict views_content within a mean error of 0.041 on the test data.
- Content released on Friday and Wednesday performed best when compared with other days. So, the OTT platform can use these days to release top content or new movies, which will increase the time users spend on the platform.
- Since views_content decreases when a major sports event takes place, the company can check the sports calendar before releasing content, because viewership is lower on such days.

- The OTT platform can look to increase the amount of Comedy and Thriller content, as these are the most watched genres on the platform.
- The OTT platform can gather data about its users, such as age, gender, geographical location and occupation, to better understand which kinds of web series and movies each group of users prefers to watch.
- The number of users visiting the OTT platform is quite high during Summer and Winter, so releasing new movies and series in those seasons, together with a discount on the OTT platform, could draw more people to the platform.
