
Capstone – Facebook Comment Volume Prediction: Final Report

Anuja Krishnan

Table of Contents

1 Problem Statement
2 Need of study
3 Business/Social Opportunity
4 Data structure and description
   4.1 Data collection
   4.2 Visual Description
   4.3 Data Attributes
5 Exploratory Analysis
   5.1 Removal of unwanted variables
   5.2 Outlier treatment
   5.3 Missing value treatment
   5.4 Variable transformation
   5.5 Univariate and Bivariate Analysis
   5.6 Identifying the significant variables
6 Insights from Exploratory Data Analysis
7 Model building
8 Multiple Linear Regression
9 Decision Trees - CART
10 Random Forests
11 Extreme Gradient Boosting
12 Bagging
13 Model comparison and insights
14 Business insights and recommendations

1 Problem Statement

The rise of social networking has drawn intense public attention over the past two decades. For both small businesses and large corporations, social media plays a key role in brand building and customer communication. Facebook is one of the social networking sites that firms use to make themselves visible and approachable to customers. Facebook's advertising revenue in the United States in 2018 is estimated at 14.89 billion USD, against 18.95 billion USD outside the United States. Other categories such as news, communication, commenting, marketing, banking and entertainment also generate enormous volumes of social media content every minute.

User generated content describes any form of content, such as text, messages, video, images and other media, that is created by the end users of an online system. Recent research estimates that the highest user engagement on Facebook comes from user generated posts.

In the current project, we focus on predicting the number of comments a user generated post will receive within a given number of hours.

2 Need of study

The amount of data added to the network grows day by day, and it is a gold mine for researchers who want to understand the intricacies of user behavior and user engagement.

In this project, we use the most active social networking service, Facebook, and in particular Facebook Pages, for analysis. Our research is oriented towards estimating the comment volume that a post is expected to receive in the next few hours. Analyzing comment volume helps us understand the dynamic behavior of users towards Facebook posts.

Before continuing to the problem of comment volume prediction, some domain specific concepts are discussed below:

 Post/Feed: Individual stories published on a page by the page administrators.
 Comment: An important activity on social sites, with the potential to turn a post into a discussion forum. The extent to which readers are inspired to leave comments on a post is one measure of its popularity and of the interest it generates.

3 Business/Social Opportunity

As mentioned in the problem statement, Facebook's advertising revenue is around 18 billion dollars outside the United States and 14.9 billion dollars within it. By predicting comment volume, we can assess the following factors:

 Popularity of a particular page
 Number of times a post has been shared or promoted
 Number of comments in a given time window
 Impact of post length and publication day on the number of comments/likes

This prediction is also of great importance from a scientific perspective, because of its potential to help us understand the thoughts and feelings of people based on their behavior on social media. The number of comments on a Facebook post can be used as an indicator of interest in the subject of the page and of the relevance of the post's content. Therefore, by formulating a model that predicts the number of Facebook comments from page and post information, we can gain insight into the thoughts and feelings of people active on social media, which can be used by advertisers and marketers to design more effective marketing strategies.

4 Data structure and description

4.1 Data collection


The given dataset consists of a variety of features associated with user generated posts on Facebook. It gives a basic idea of how many likes a particular page/category has received, how many times users have visited or checked in to a page, and how long it took a post to generate comments or shares.
The dataset is a 'Facebook comment volume' record captured over a period of time, containing 32,759 rows and 43 variables.

4.2 Visual Description


The dataset has been analyzed to understand its basic features, as below:

Structure of dataset: The dataset consists of 43 variables, of which two – Post published weekday and Base datetime weekday – are character ("char") columns; the remaining variables are numeric.

Target variable: The Target Variable is the number of comments received on each post; this is the quantity the models are built to predict.

Feature category: Each variable in the dataset falls into one of the following feature groups:

 Page features: Common page-level measures such as category, likes, check-in count etc.
 Essential features: The pattern of comments from different users on the post at various time intervals, relative to a randomly selected base date/time
 Derived features: Features derived from the essential features and aggregated by page, by calculating the minimum, maximum, average, median and standard deviation of the essential features
 Other features: The remaining features that help to predict the comment volume for each page category, including information about the source of the page and the date/time window for the next H hours

4.3 Data Attributes

The features of the given dataset with description and category are as follows:

 Page Features/Likes:

 Page Popularity – The popularity of the page, expressed as the number of likes the page has received, which supports specific comments, posts, status updates etc.
 Page Check-ins – Describes how many individuals have so far visited the place associated with the page (an institution, venue, theatre etc.)
 Page Talking About – Defines the daily interest of individuals in the page (measured by visitor activities such as comments, likes and shares of posts)
 Page Category – Defines the category of the source (Place, Brand, Institution etc.)

 Derived Features
These features (Feature 5 to Feature 29) are aggregated by page, by calculating the minimum, maximum, average, median and standard deviation of the essential features

 Essential Features
This includes the pattern of comments in various time frame.
Some of the given essential features are as below: -
 CC1: The total comment count before the selected base date/time.
 CC2: The comment count in the last 24 hours relative to the selected base date/time.
 CC3: The comment count in the window from 48 to 24 hours before the selected base date/time.
 CC4: The number of comments in the first 24 hours after publication of the post, but before the base date/time.
 CC5: The difference between CC2 and CC3.

 Other Features
This includes some document related features as below: -
 Base Time – The time selected in order to simulate the scenario.
 Post Promotion Status – To reach more people in the News Feed, individuals can promote their posts; this feature indicates whether the post was promoted (1) or not (0).
 Post Length – The character count of the post.
 Post Share Count – The number of times the post has been shared, i.e. how many people shared it on their timelines.
 H Local – The number of hours H for which the target variable (comments received) is counted.

 Weekday features
 Post Published weekday – Represents the day (Sunday ... Saturday) on which the post was published.
 Base Date Time weekday – Represents the day (Sunday ... Saturday) of the selected base date/time.

 Target Variable
 Comments – The number of comments in next H Hrs (H Represents H Local)

5 Exploratory Analysis

The exploratory analysis of any dataset includes a few basic steps, such as removal of irrelevant variables, treatment of missing values and outliers, dimension reduction (if required) and visual analysis.

5.1 Removal of unwanted variables

The first step of data cleansing is to identify the insignificant variables. This is done with a step-by-step approach, starting with the variables that do not provide any unique information to the dataset.

In the given dataset, there are two such variables:

 Column 1: ID
 Column 38: Post Promotion Status

Column 1 represents serial numbers, and column 38 does not contain any values.

5.2 Outlier treatment


Analyzing the dataset with box plots shows that almost all the variables are highly skewed, which makes their distributions very uneven. This is evident from the boxplots given below.

For better plots, a sample outlier treatment was applied to the variables showing the largest gap between the 95th and the 100th percentile. An outlier is capped if its value falls below the first quartile - 1.5*IQR or above the third quartile + 1.5*IQR.

Before the outlier treatment, the data was divided into two subsets for cleansing – one containing the character variables and the other containing the numeric variables. The snippet below was then used to define an outlier function that caps outliers using the inter-quartile range.
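Since the original code snippet is not reproduced in this report, the following is a minimal sketch of such an IQR-based capping function (the data frame name facebook_num is illustrative):

cap_outliers <- function(x) {
  q1  <- quantile(x, 0.25, na.rm = TRUE)
  q3  <- quantile(x, 0.75, na.rm = TRUE)
  iqr <- q3 - q1
  lower <- q1 - 1.5 * iqr        # cap floor: Q1 - 1.5*IQR
  upper <- q3 + 1.5 * iqr        # cap ceiling: Q3 + 1.5*IQR
  x[x < lower] <- lower
  x[x > upper] <- upper
  x
}

# Apply the capping function to every column of the numeric subset
facebook_num[] <- lapply(facebook_num, cap_outliers)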

Figure: boxplots of the variables with outliers (left) and after outlier treatment (right)

5.3 Missing value treatment

In the given dataset, missing values have been identified in the variables listed below. The total number of missing values is 15,649.

Variable              Missing values
Page likes            3208
Page talking about    3255
Page Checkins         3255
Page category         3024
Feature 7             1679
Feature 10            1632
Feature 13            1643
Feature 15            1692
Feature 18            1605
Feature 20            1600
Feature 22            1601
Feature 25            1600
Feature 27             1598
Feature 29            1600
CC1                   3199
CC4                   3198
CC5                   3200

Treatment of missing values:

There are several R packages that can be used to treat missing values, such as DMwR (knnImputation) and mice. The package used for this dataset is mice.

MICE: mice, short for Multivariate Imputation by Chained Equations, is an R package that provides advanced features for missing value treatment. It implements imputation in two steps: mice() builds the imputation model and complete() generates the completed data. The mice(df) call produces multiple complete copies of df, each with different imputations of the missing data, and complete() returns one or several of these datasets, with the default being the first.
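As a sketch of this two-step workflow (the exact arguments used in the project are not documented, so m, maxit and the seed below are illustrative):

library(mice)

# Step 1: build the imputation model (several completed copies of the data)
imputation <- mice(facebook_num, m = 5, maxit = 5, seed = 123)

# Step 2: extract one completed dataset (the first imputation by default)
facebook_complete <- complete(imputation)

# Verify how many missing values remain per column
colSums(is.na(facebook_complete))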

After treatment with the mice package, it was observed that the dataset still had missing values in the columns Feature 7, Feature 20 and Feature 15. These were further imputed with means and random numbers to generate the final complete dataset.

5.4 Variable transformation

The next step of data cleansing was reducing the dimensionality of the dataset. At this stage the dataset had 41 variables, and of these, Feature 5 to Feature 29 were taken as a subset. The reason for treating them separately was the lack of any explanation of what these variables represent. The subset was further reduced by removing Feature 10 and Feature 15 (which contained only zeros) and normalized to the range (0, 1). These variables were then subjected to Principal Component Analysis.
Principal Component Analysis (PCA) is a method of dimensionality reduction: correlated independent variables in the dataset are grouped into common factors by transforming them into a new set of components, known as the principal components. The main steps of PCA include:

 Deriving the eigenvalues and eigenvectors
 Creating a scree plot to determine the optimal number of factors
 Creating the reduction plot

PCA was applied to Feature 5 – 29, and these variables were reduced to three factors, named Factor 1, Factor 2 and Factor 3.
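A minimal sketch of this step, assuming base-R prcomp on the normalized Feature 5 – 29 subset (the column names and the retention of exactly three components follow the text above; the original script is not shown):

# Names of the derived features (illustrative; adjust to the actual column names)
feature_subset <- setdiff(paste0("Feature", 5:29), c("Feature10", "Feature15"))

# Normalize the derived features to the (0, 1) range
normalise <- function(x) (x - min(x)) / (max(x) - min(x))
derived <- as.data.frame(lapply(facebook_complete[, feature_subset], normalise))

# Principal Component Analysis
pca_fit <- prcomp(derived)
summary(pca_fit)                      # proportion of variance per component
screeplot(pca_fit, type = "lines")    # scree plot to choose the number of factors

# Retain the first three components as Factor 1-3
factors <- as.data.frame(pca_fit$x[, 1:3])
names(factors) <- c("Factor1", "Factor2", "Factor3")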

5.5 Univariate and Bivariate Analysis
Univariate and bivariate analysis has been conducted on all the variables; a few of the plots are described below.

Univariate analysis

Figure: histograms and boxplots of the page-level variables

The first set of histograms and boxplots covers the page-related variables – Page likes, Page checkins, Page talking about and Page category.
 Page likes: The number of likes ranges from 0 to 55 lakh (5.5 million)
 Page checkins: The page check-ins range from 0 to 28,000
 Page talking about: The value ranges from 0 to 60 lakh (6 million)
 Page Category: The page category ranges from 1 to 68

The second univariate analysis was done on the comment variables (CC1, CC2, CC3, CC4):
 CC1 – Total comments before the selected base time: the value ranges from 0 to 258
 CC2 – Comments in the last 24 hours: the value ranges from 0 to 107
 CC3 – Comments in the 48-to-24-hour window: the value ranges from 0 to 96
 CC4 – Comments in the first 24 hours: the value ranges from 0 to 246

A further univariate analysis was performed on the factors derived from the Principal Component Analysis.
 Factor 1: The value of Factor 1 ranges from -1.6 to 3.07
 Factor 2: The value of Factor 2 ranges from -5 to 6.722
 Factor 3: The value of Factor 3 ranges from -2.5 to 85

Univariate analysis of the post-related variables shows:
 Post length: The post length (in characters) ranges from 0 to 513
 Post share count: The post share count ranges from a minimum of 1 up to 452

Univariate analysis of the time variables shows:

 Base time: The base time extends up to 72 hours
 H.Local: The target variable (number of comments received) is counted over 24 hours

Target variable: The target variable ranges from 0 (no comments) to 1,305. There are outliers in the target variable, but they have not been treated, keeping their business significance in mind.

Bivariate Analysis:

Bivariate analysis was performed between the individual variables and the number of comments, to check for any linear relationship between them.

Figures: scatter plots of the number of comments against Page likes, Page checkins, Page talking about, Page category, comments before the selected base time (CC1), comments in the last 24 hours (CC2), comments in the 48-to-24-hour window (CC3), comments in the first 24 hours (CC4), post length, base date/time weekday and post published weekday, plus Page category against post length.

Observations:
 The relationships between the target variable and the remaining variables are non-linear
 The maximum number of comments was posted on Wednesday, Monday and Tuesday
 With increasing post length, the number of comments starts to decrease

5.6 Identifying the significant variables

The next step of data cleansing is to identify the significant variables in the dataset. A combination of the correlation plot, VIF values and linear regression p-values has been used to select the important features of the dataset for model building.

The correlation plot and VIF values were calculated for the numeric variables of the derived dataset, with the results below.

The VIF values from the linear regression in the first iteration were as below:

Variable             VIF
Page likes           2.186531
Page Checkins        1.029816
Page talking about   2.945906
Page Category        1.145452
CC1                  18.420032
CC2                  3.722549
CC3                  4.112315
CC4                  18.349000
CC5                  4.796095
Base time            1.358041
Post length          1.014453
Post share count     1.659210
Factor 1             2.160712
Factor 2             1.476951
Factor 3             1.159841

The dataset was subjected to multiple iterations of linear regression to identify the significant variables. The variables CC1 and CC4 have very high VIFs (above 18), and hence they were removed from the dataset; CC2, CC3 and CC5 were retained.

Below is the correlation matrix and VIF for the new dataset:

> vif(model2)
Page.likes Page.Checkins Page.talking.about
1.804575 1.060851 1.606352
Page.Category Factor1 Factor2
1.159538 2.461764 1.196483
Factor3 CC2 CC3
1.032741 2.451173 2.656034
CC5 Base.Time Post.Length
2.918733 1.280638 1.013197
Post.Share.Count
1.561916
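Output of this form can be produced with the car package's vif() function on a linear model fitted to the reduced variable set. The model below is a sketch; the exact formula used in the project is not shown in the report:

library(car)

model2 <- lm(Target.Variable ~ Page.likes + Page.Checkins + Page.talking.about +
               Page.Category + Factor1 + Factor2 + Factor3 + CC2 + CC3 + CC5 +
               Base.Time + Post.Length + Post.Share.Count,
             data = facebook_complete)

vif(model2)   # values well above 5-10 would flag multicollinearity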

6 Insights from Exploratory Data Analysis:

The dataset consists of 32,759 rows and 43 columns. Although the dataset did not appear unbalanced, it contained certain features whose origin and meaning were not explained.

Below are the insights from the analysis:

Graphical Analysis:

 The univariate histograms show skewness for Page likes, Page checkins, Page talking about, CC1, CC2, CC3, CC4, H local and the target variable

 The remaining variables tend to have a normal distribution.

The bivariate analysis throws light on some interesting aspects, as below:

 The maximum number of Page likes for published posts was obtained on Wednesday, followed by Sunday and Friday. This implies that more likes are generated on weekdays, particularly mid-week

 As the length of the post increases, the count of likes goes down. This implies that the audience is not interested in reading long posts. The same trend can be observed for the count of post shares

 In most scenarios, posts with more likes tend to have more comments. Based on the above observations, we can say that the longer the post, the fewer the comments.

To recapitulate, the target should be user generated content of limited length, posted, possibly, at the start of a weekend. This would generate more comments and thereby help in understanding the behavior of the user.

Variable analysis:

All the variables were subjected to a multicollinearity check, and it was understood that:

 Feature 5 – 29 had no background as to how they were derived and were highly correlated. Hence these were normalized and converted into 3 factors for easier analysis

 Further, the variables CC1 and CC4 were removed owing to their high VIFs, and a final dataset was derived having the following variables – Page likes, Page Checkins, Page Category, Page Talking about, CC2, CC3, CC5, Base time, Post Length, Post Share Count, Factor 1, Factor 2, Factor 3, Post published weekday, Base datetime weekday and H.local

7 Model building

The purpose of this section is to build various models to identify the most important factors of the dataset that drive comments on user generated posts.
A few regression modeling methods have been applied to the dataset:
 Multiple Linear Regression (MLR)
 Random Forest
 Classification and Regression trees (CART)
 Extreme Gradient Boosting (XGBoost)
 Bagging
The model building is followed by Model performance measures. Since the output variable is continuous,
the popular model performance measures would be:
 Root Mean Square Error (RMSE): The standard deviation of the residuals (prediction errors).
Residuals are a measure of how far from the regression line data points are; RMSE is a
measure of how spread out these residuals are.
 Mean Absolute Error (MAE): The mean absolute error of a model with respect to a test set is the mean of the absolute values of the individual prediction errors over all instances in the test set
 Adjusted R2: The adjusted R2 gives the percentage of variation explained by only those independent variables that actually affect the dependent variable. The adjusted R-squared increases only if a new term improves the model more than would be expected by chance

Data preparation:

The first step of any model building is splitting the dataset into training and test sets. In the current scenario, the data has been split in a 70:30 ratio – 70% for the training data and 30% for the test data. The dimensions of the train and test data are:

Train: 22,931 rows, 17 columns
Test: 9,828 rows, 17 columns

8 Multiple Linear Regression

Multiple linear regression is used to estimate the relationship between two or more independent variables and one dependent variable. It can be used to determine:

 How strong the relationship is between two or more independent variables and one
dependent variable (e.g. how rainfall, temperature, and amount of fertilizer added affect
crop growth).

 The value of the dependent variable at a certain value of the independent variables (e.g. the
expected yield of a crop at certain levels of rainfall, temperature, and fertilizer addition).

The multiple linear regression runs on the below assumptions:


 Regression residuals must be normally distributed.
 A linear relationship is assumed between the dependent variable and the independent
variables.
 The residuals are homoscedastic and approximately rectangular-shaped.
 Absence of multicollinearity is assumed in the model, meaning that the independent variables
are not too highly correlated.

Model building and tuning:

In the first iteration of the linear regression model, it was observed that the parameters – Page likes, Page checkins, Page Category, Base DateTime Weekday and Post Published Weekday – have p-values > 0.05, and hence they were removed from the model.

The model parameters were as below:


 Degrees of freedom: 22904
 p-value : < 2.2e-16
 Residual Standard Error(RSE) :30.14
 F-Statistic : 240.7
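The kind of call behind these numbers is sketched below, assuming base-R lm() on the training data: a first iteration on all retained predictors, followed by the tuned refit described in the next subsection. The formula mirrors the variables named in the report; the original script is not reproduced.

# First iteration: all retained predictors
mlr_model <- lm(Target.Variable ~ ., data = Train_data_Facebook)
summary(mlr_model)    # p-values, residual standard error, F-statistic

# Tuned model: drop the predictors with p-values above 0.05
mlr_tuned <- lm(Target.Variable ~ Page.Talking.about + Factor1 + Factor2 + Factor3 +
                  CC2 + CC3 + CC5 + Base.Time + Post.Length + Post.Share.Count + H.local,
                data = Train_data_Facebook)
summary(mlr_tuned)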

Tuning:

The model was tuned by removing the insignificant variables – Page likes, Page checkins, Page Category, Base date time weekday and Post published weekday – as they have p-values greater than 0.05. The output from the tuned model is:

                     Estimate     Std. Error   t value    Pr(>|t|)
(Intercept)          3.631e+00    2.637e+00      1.377    0.16487
Page.Talking.about   6.524e-06    2.173e-06      3.003    0.002675
Factor1              4.867e+00    2.884e-01     16.876    <2e-16
Factor2              2.939e+00    2.168e-01     13.557    <2e-16
Factor3              2.936e+00    2.265e-01     12.962    <2e-16
CC2                  1.689e-01    8.522e-03     19.815    <2e-16
CC3                 -1.517e-01    9.522e-03    -15.936    <2e-16
CC5                  2.255e-02    8.073e-03      2.793    0.005231
Base.time           -2.024e-01    1.078e-02    -18.772    <2e-16
Post.length         -4.983e-03    1.455e-03     -3.447    0.000567
Post.Share.Count     3.779e-02    1.666e-03     22.677    <2e-16
H.local              3.170e-01    1.078e-01      2.940    0.003290

Observations:

The observations from linear regression were as follows:

 F-statistic and associated p-value:

The p-value is < 2.2e-16, which is highly significant. This means that at least one of the predictor variables is significantly related to the outcome variable. The F-statistic value is very high: 564.5 on 11 and 22,919 degrees of freedom.

 Adjusted R-square:

The adjusted R-square value is 0.2128, meaning that only about 21% of the variance is explained by the final model.

 Residual Standard Error:

The RSE, which gives a measure of prediction error, is calculated from this model as 30.15

Interpretations:

For a given predictor variable, the coefficient can be interpreted as the average effect on y of a one-unit increase in that predictor, holding all other predictors fixed.

With this, the regression equation is:

Target.Comment = 3.631 + 6.524e-06(Page.Talking.about) + 4.867(Factor1) + 2.939(Factor2) + 2.936(Factor3) + 1.689e-01(CC2) - 1.517e-01(CC3) + 2.255e-02(CC5) - 2.024e-01(Base.time) - 4.983e-03(Post.Length) + 3.779e-02(Post.Share.Count) + 3.170e-01(H.Local)

 The following coefficients are negative – CC3, Base.Time and Post.Length: a one-unit increase in any of these, holding the others fixed, decreases the predicted number of comments. This implies that with an increase in base time or post length, the number of comments decreases.

 The following coefficients are positive – Page Talking about, Factor1, Factor 2, Factor 3, CC2, CC5, Post Share Count and H.Local: with an increase in these variables, the comment volume increases.

Insights:

Based on the intercept and estimate values, below are the conclusions:

 When the Post Share count increases, the number of comments increases. In order to generate revenue, user generated posts should be shared more frequently.

 Similarly, H.Local represents the number of hours for which comments are counted. As this window increases, the number of comments increases.

 The next positive factors are CC2 and CC5, where CC5 is the difference between comments in the last 24 hours (CC2) and comments in the 48-to-24-hour window (CC3); that is, posts gain more comments at the beginning of the second day.

 The number of comments decreases with increases in post length, page checkins and page likes. Page likes and page checkins increase with time, and as a result the number of comments tends to decrease.

 In order to generate revenue out of user generated content, institutions should focus on sharing these posts more frequently within the first half of the second day and keeping the posts to a limited length.

Model Performance Measures:

The below model performance measures were calculated for the MLR model:
 Root Mean Square Error - 34.78%
 R2 – 15.9%
 Mean Absolute Error – 10.39%
 Mean Absolute Percentage Error – Inf

The output shows Inf for the MAPE measure. The reason is that there are zeros among the observed values: when the dependent variable can take zero as one of its values, MAPE cannot be used as an error measure, and other error measures should be used instead.
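A sketch of how these measures can be computed on the test set from the tuned model (the expressions are written out by hand here; they are not taken from the original script):

pred   <- predict(mlr_tuned, newdata = Test_data_Facebook)
actual <- Test_data_Facebook$Target.Variable

rmse <- sqrt(mean((actual - pred)^2))         # root mean square error
mae  <- mean(abs(actual - pred))              # mean absolute error
mape <- mean(abs((actual - pred) / actual))   # becomes Inf when any actual value is 0

c(RMSE = rmse, MAE = mae, MAPE = mape)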

9 Decision Trees - CART

Decision tree analysis is a general predictive modelling tool with applications spanning a number of different areas. In general, decision trees are constructed via an algorithmic approach that identifies ways to split a data set based on different conditions. It is one of the most widely used and practical methods for supervised learning.
In the given case study, we design a CART model for the Facebook dataset.

Classification and Regression Trees, commonly known as the CART model, was introduced by Leo Breiman, Jerome Friedman, Richard Olshen and Charles Stone. CART is an umbrella term for the following types of trees:
 Classification trees : where the target variable is categorical, and the tree is used to identify
the "class" within which a target variable would likely fall into.
 Regression trees: where the target variable is continuous, and tree is used to predict its value

The CART algorithm is structured as a sequence of questions, the answers to which determine what the next question, if any, should be. The result of these questions is a tree-like structure whose ends are terminal nodes, at which point there are no more questions.
The main elements of CART (and any decision tree algorithm) are:
 Rules for splitting data at a node based on the value of one variable;
 Stopping rules for deciding when a branch is terminal and can be split no more; and
 Finally, a prediction for the target variable in each terminal node.
In the given case study, we will build a CART model for the dependent variable – Target.Variable with
respect to the other independent variables.

Model building and tuning:

The tree obtained in the first iteration of CART is shown below.

The tree is built with the regression method, denoted by method = "anova" in the code; this method is used because we are trying to predict a numeric/continuous value.
As per the output, the below splits have been identified:
 CC2 forms the root node of the tree and the first split is made at CC2 < 69
 The second split is conditional: if CC2 < 69, the next split is on Base time >= 1, and if CC2 >= 69, the next split is on Base.Time >= 6
 The next split is made on Post Share count < 300, with Base time >= 9 and < 299
 The branch with Post Share count > 299 is further split on Page category >= 9, which is further divided on Page likes, Factor 2 and Post published weekday (towards the end)
 Similarly, based on several parameters, the tree has been built with 14 terminal nodes.

Model Tuning:

The CART model with CP = 0 resulted in a tree with 10 terminal nodes, but such CART trees tend to overfit the data. Therefore, pruning is applied to the CART tree. Pruning, in simple terms, is cutting the tree back: every branch that is added must add value, and pruning can be enforced by requiring each split to decrease the error by at least a given amount. The complexity parameter (CP) is used to decide the optimal tree size: if the cost of adding another split from the current node is above the value of CP, then tree building does not continue. Equivalently, tree construction continues only while each split decreases the overall lack of fit by at least a factor of CP.

The CP plot is as below:

As per the complexity parameter table, the minimum relative error is at 9 splits. However, a tree with that many terminal nodes would still tend to overfit the data, and hence we examine the plot. In the plot, the line becomes roughly flat somewhere between CP values of 0.021 and 0.031, so we select a CP of 0.027 and construct the pruned tree.
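The workflow described above can be sketched with the rpart package as follows; only method = "anova" and the pruning value cp = 0.027 come from the report, the remaining settings are illustrative:

library(rpart)
library(rpart.plot)

# Regression tree: method = "anova" because the target is continuous
cart_model <- rpart(Target.Variable ~ ., data = Train_data_Facebook,
                    method = "anova", control = rpart.control(cp = 0))

printcp(cart_model)   # complexity parameter table
plotcp(cart_model)    # CP plot used to choose the pruning threshold

# Prune at the CP value chosen from the plot
cart_pruned <- prune(cart_model, cp = 0.027)
rpart.plot(cart_pruned)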

The pruned tree is :

Observations:

 In the pruned tree, the number of terminal nodes has decreased to 6


 The parent (root) node is created on the attribute Comments in the last 24 hrs (CC2), with the split at 69
 The variables actually used in tree construction are a) Comments in last 24 hours, b) Base time and c) Post Share count
 The maximum number of splits happens on Post Share count, which is also the last node on which the splits for the terminal nodes occur

Insights:

The insights with CART model:


 The observations tend to correlate with those from the multiple linear regression
 The importance of CC2 and Post Share count implies that the greater the share count, the greater the number of comments, and that a post gains most of its comments in the first 24 hours.

Model Performance measures:

The below model performance measures were calculated for the CART model:
 Root Mean Square Error - 31.02%
 R2 – 33.67%
 Mean Absolute Error – 6.85%

10 Random Forests
Random Forest is one of the most popular and powerful ensemble methods used today in machine learning. An ensemble method, or ensemble learning algorithm, aggregates multiple outputs made by a diverse set of predictors to obtain better results. Random forests select observations and specific features to build multiple decision trees and then average the results across these trees. Random forests use the bootstrap aggregating (bagging) algorithm, which generates new training subsets of the original data. Each subset is of the same size and is sampled with replacement; this means individual instances can be used repeatedly during the training stage.

The initial model calculated the below parameters:


 Number of trees (ntree) = 501
 No. of variables at each split(mtry) – 3
 Mean of squared residuals – 635.33
 % var explained – 44.99%
 Number of trees in which least MSE is achieved - 454

Observations:

In the first iteration of the random forest, the below plot is obtained:

Model Tuning:

Model tuning is performed by selecting appropriate values of mtry, the number of trees (ntree) and the minimum size of terminal nodes (nodesize).
mtry: The mtry value is selected with the formula:
floor(sqrt(ncol(Train_data_Facebook) - 1))

The mtry value obtained was 4.

Number of trees: In general, the more trees, the better the model; the number of trees typically ranges from 300 up to 1000. Based on the plot obtained from the initial random forest, the number of trees can be optimized to 301, as the error curve starts to flatten at around 300 trees.
nodesize: The best tree is obtained with a smaller node size, but the node size can be increased to reduce CPU usage and make the algorithm run faster. The node size selected was 10.
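A sketch of the tuned fit, assuming the randomForest package (ntree = 301, mtry = 4 and nodesize = 10 follow the values above; the seed is illustrative, and the weekday columns are assumed to be stored as factors):

library(randomForest)

set.seed(123)
mtry_val <- floor(sqrt(ncol(Train_data_Facebook) - 1))   # evaluates to 4 here

rf_tuned <- randomForest(Target.Variable ~ ., data = Train_data_Facebook,
                         ntree = 301, mtry = mtry_val, nodesize = 10,
                         importance = TRUE)

print(rf_tuned)        # mean of squared residuals and % variance explained
plot(rf_tuned)         # error versus number of trees
varImpPlot(rf_tuned)   # variable importance plot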

Based on the new parameters, the model was retrained and the plot below was obtained:

Importance in the random forest: The importance() function in randomForest returns, for each independent variable, a set of measures indicating how important that variable is. The below plot was obtained for importance:

Observations:

The tuned model produced the below parameters:


 Number of trees (ntree) - 301
 No. of variables at each split(mtry) – 4
 Mean of squared residuals – 638.87
 % var explained – 44.69%

Insights:

 The % variance explained by the random forest is 41.37%, which is good compared with the linear regression model, where the R2 value is around 15%
 The most important variables are Base time, CC2, Post Share Count and Page talking about, which correlates with the output of the linear regression
 Base date/time weekday and Post published weekday also have considerable importance

Model performance measures:

The below model performance measures were calculated for the Random Forest model:
 Root Mean Square Error - 28.84%
 R2 – 43%
 Mean Absolute Error – 5.4%

11 Extreme Gradient Boosting

Boosting refers to a group of algorithms that utilize weighted averages to turn weak learners into stronger learners. Unlike bagging, where each model is run independently and the outputs are aggregated at the end without preference for any model, boosting is a two-step approach: first, subsets of the original data are used to produce a series of averagely performing models, and then their performance is "boosted" by combining them using a particular cost function (e.g. a majority vote).

In the present dataset, we focus on extreme gradient boosting (XGBoost).

One important aspect of XGBoost is that the training and test data have to be numeric and supplied in matrix form.

Observations and tuning:

The XGBoost model was validated over different values of eta, max_depth and nrounds to choose the optimum values of 0.1, 25 and 50 respectively, based on the lowest RMSE:

eta     max_depth   nrounds   RMSE
0.01    15          25        39.91
0.3     15          25        34.17
0.1     20          25        33.61
0.1     25          25        33.44
0.1     25          30        33.23
0.1     25          40        32.76
0.1     25          50        32.63
0.1     25          60        33.60
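A sketch of the tuned fit with the xgboost package (the data must be numeric matrices; eta = 0.1, max_depth = 25 and nrounds = 50 follow the table above, the remaining arguments are illustrative):

library(xgboost)

predictors <- setdiff(names(Train_data_Facebook), "Target.Variable")
train_x <- data.matrix(Train_data_Facebook[, predictors])
train_y <- Train_data_Facebook$Target.Variable
test_x  <- data.matrix(Test_data_Facebook[, predictors])

xgb_model <- xgboost(data = train_x, label = train_y,
                     eta = 0.1, max_depth = 25, nrounds = 50,
                     objective = "reg:squarederror", verbose = 0)

pred_xgb <- predict(xgb_model, test_x)

# Variable importance plot
imp <- xgb.importance(model = xgb_model)
xgb.plot.importance(imp)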

The importance plot was built for the tuned XGBoost model as below:

Insights:

The insights are:


 The XGBoost model is the next best model after the Random Forest, and the important parameters are Post.length, Base.time, Factor1, Factor3, CC2, Page talking about, Post Share Count, Page likes and Factor 2
 The least important variables are the weekday variables
 These results correlate with the observations from linear regression and CART

Model Performance measures:

The below model performance measures were calculated for the Extreme gradient boosting model:
 Root Mean Square Error - 30.08%
 R2 – 37.7%
 Mean Absolute Error – 5.96%

12 Bagging
"Bagging", or bootstrap aggregation, is a type of ensemble learning used to build machine learning models. Bagging decreases the variance of the prediction by generating additional training data from the original dataset, using sampling with repetition to produce multiple sets of the same cardinality/size as the original data.

The bagging works as below:

 It creates randomized samples of the data set (just like random forest) and grows a tree on each sample of the original data; the remaining roughly 1/3 of the sample (the out-of-bag observations) is used to estimate an unbiased OOB error.
 Unlike random forest, it considers all the features at a node (for splitting).
 Once the trees are fully grown, it uses averaging or voting to combine the resultant predictions.

Model building and tuning:

The bagging was performed with different values of ntree – 1, 3, 5, seq(10, 200, 10) – to find the optimum value.

Bagging performs 25 bootstrap samples by default, and a greater number of trees needs to be built to achieve stable results. Hence, as a measure to fine-tune the single bagging model, we pass 10 to 50 bagged trees and calculate the RMSE at each iteration, as sketched below.
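A minimal sketch of that tuning loop, assuming the ipred package's bagging() (whose default of 25 bootstrap replicates, nbagg = 25, matches the text above; the grid and helper names are illustrative):

library(ipred)

rmse_for_nbagg <- function(nb) {
  fit  <- bagging(Target.Variable ~ ., data = Train_data_Facebook,
                  nbagg = nb, coob = TRUE)   # coob = TRUE also returns the OOB error
  pred <- predict(fit, newdata = Test_data_Facebook)
  sqrt(mean((Test_data_Facebook$Target.Variable - pred)^2))
}

nbagg_grid  <- seq(10, 50, by = 10)
rmse_values <- sapply(nbagg_grid, rmse_for_nbagg)
data.frame(nbagg = nbagg_grid, RMSE = rmse_values)   # choose the nbagg with the lowest RMSE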

Tuning:

The bagging model was tuned with 10 to 50 trees, and the lowest RMSE was obtained for ntree = 20: the RMSE was around 30.83% for ntree = 10 and 30.23% for ntree = 20. Hence the value of 20 was chosen for ntree.
The out-of-bag error rate is 26.97%.

Model Performance measures:

The below model performance measures were calculated for the bagging model:
 Root Mean Square Error - 30.23%
 R2 – 37.27%
 Mean Absolute Error – 6.5%

13 Model comparison and insights

Various model performance measures were calculated for all the models as given below:

Model Name                   RMSE      MAE      Adjusted R-squared

Multiple Linear Regression   34.78%    10.47%   15.9%
CART                         31.02%    6.85%    33.67%
Random Forest                28.84%    5.4%     43%
XGBoost                      30.08%    5.96%    37.7%
Bagging                      30.23%    6.5%     37.27%

Insights:

Based on the above table, below are the conclusions:

 The best model is the Random Forest, as its RMSE is the lowest and around 43% of the variance due to the predictor variables is explained by this model
 Multiple Linear Regression (MLR) and CART are the least accurate of the five models
 Across the models, the important variables are – Base.Time, Post.Share.Count, CC3, CC5, Factor1, Factor2, Page likes and Page talking about

14 Business insights and recommendations


The business insights are derived from the models in combination with the observations from the exploratory data analysis. The below insights are derived:

 With an increase in post length, the number of comments decreases
 The most comments are generated on Tuesday, Monday and Wednesday, and the comments are generated within 24 to 36 hours of the post
 The fewest comments are obtained on posts more than 48 hours after the base time and post published weekday
 The most important variables are Post Share count, Page talking about, Base time, Comments in the last 24 hours and Post length

Therefore, the advice to the business would be:

 Focus on posting content of limited length.
 Encourage sharing of posts as frequently as possible, because sharing ensures that the post reaches a wider audience and hence more comments can be expected.
 Since most of the comments are obtained in the last 24 hours and on Wednesday, followed by Tuesday and Monday, the focus should be on publishing posts at the beginning of the week (preferably Monday).
