Facebook Comment Volume Prediction Final Report
Anuja Krishnan
Table of Contents
1 Problem Statement
2 Need of study
3 Business/Social Opportunity
5 Exploratory Analysis
7 Model building
10 Random Forests
12 Bagging
1 Problem Statement
The trend towards social networking has drawn high public attention over the past two decades. For both small businesses and large corporations, social media plays a key role in brand building and customer communication. Facebook is one of the social networking sites that firms use to make themselves real to customers. It is estimated that Facebook's advertising revenue in 2018 stood at 14.89 billion USD in the United States, against 18.95 billion USD outside it. Other categories like news, communication, commenting, marketing, banking, entertainment etc. also generate huge volumes of social media content every minute.
User generated content describes any form of content, such as text, messages, video, images and other media, that is generated by the end users of an online system. As per recent research, it has been estimated that the highest user engagement is generated from user generated posts on Facebook.
In the current project, we focus on predicting the number of comments received for a user generated post in a given set of hours.
2 Need of study
The amount of data added to the network increases day by day, and it is a gold mine for researchers who want to understand the intricacies of user behavior and user engagement.
In this project, we use the most active social networking service, Facebook, and in particular Facebook Pages, for analysis. Our research is oriented towards estimating the comment volume that a post is expected to receive in the next few hours. Analyzing the comment volume helps understand the dynamic behavior of users towards Facebook posts.
Before continuing to the problem of comment volume prediction, some domain specific concepts
are discussed below:
Post/Feed: These are the individual stories published on a page by the administrators of that page.
Comment: Commenting is an important activity on social sites that gives a post the potential to become a discussion forum; the extent to which readers are inspired to leave comments on a document/post is one measure of its popularity and of the interest it generates.
3 Business/Social Opportunity
As mentioned in the problem statement, Facebook's advertising revenue is around 19 billion dollars outside the United States and 14.9 billion dollars within it. By predicting the comment volume, we can identify the factors that drive this engagement, as discussed below.
This prediction is of great importance from a scientific perspective because of the great potential to understand the thoughts and feelings of people based on their behavior on social media. The number of comments on a Facebook post can be used as a factor for understanding interest in the subject of the page and the relevance of the post's content. Therefore, by formulating a model to predict the number of Facebook comments based on page and post information, we can gain insight into the thoughts and feelings of people active on social media, which can be used by advertisers and marketers to design more effective marketing strategies.
Structure of dataset: The dataset consists of 43 variables, of which two – Post published weekday and Base datetime weekday – are character columns; the remaining variables are numeric.
Target variable: The Target Variable is the quantity being predicted – the number of comments received on each post.
Feature Category: Each variable in the dataset falls into one of the feature groups below:
Page features: Common measures related to pages, such as category, likes, share count etc.
Essential features: The pattern of comments from different users on the post at various time intervals with respect to a randomly selected base date/time.
Derived features: Features derived primarily from the essential features and aggregated by page, by calculating the min, max, average, median and standard deviation of the essential features.
Other features: The remaining features that help to predict the comment volume for each page category, including information documenting the source of the page and the date/time window for the next H hours.
The features of the given dataset with description and category are as follows:
Derived Features
These features (Feature 5 - Feature 29) are aggregated by page by calculating the minimum, maximum, average, median and standard deviation of the essential features.
Essential Features
This includes the pattern of comments in various time frames.
Some of the given essential features are listed below:
CC1: The Total comment count before selected base date/time.
CC2: Comments count in last 24 hours relative to selected base date/time.
CC3: Comments count in last 48 hours to last 24 hours relative to base date/time.
CC4: The number of comments in the first 24 hours after the publication of the post but before the base date/time.
CC5: The difference between CC2 and CC3.
Other Features
This includes some document related features, as below:
Base Time – The selected time, used to simulate the scenario.
Post Promotion Status – To reach more people with posts in News Feed, individuals promote their posts; this feature indicates whether the post is promoted (1) or not (0).
Post Length – The character count of the post.
Post Share Count – The number of shares of the post, i.e. how many people have shared this post on their timelines.
H Local – The H hours for which the target variable (comments received) is recorded.
Weekday features
Post Published weekday – Represents the day (Sunday ... Saturday) on which the post was published.
Base Date Time weekday – Represents the day (Sunday ... Saturday) of the selected base date/time.
Target Variable
Comments – The number of comments in the next H hours (H is given by H Local).
5 Exploratory Analysis
The exploratory analysis of any dataset includes a few basic steps, such as removal of irrelevant variables, treatment of missing values and outliers, dimension reduction (if required) and visual analysis.
The first step towards data cleansing is to identify the insignificant variables. This is done with a step-by-step approach, beginning with identifying the variables which do not provide any unique information to the dataset.
In the given dataset, there are two variables which do not add any information:
Column 1: ID
Column 38: Post Promotion Status
Column 1 represents serial numbers and column 38 does not have any values.
For better plots, we applied a sample outlier treatment to the variables with the largest gap between the 95th and the 100th percentile. A value is capped if it lies below the first quartile - 1.5*IQR or above the third quartile + 1.5*IQR. A snippet along the lines of the one below was used to define an outlier function that caps outliers using the inter-quartile range.
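The original snippet did not survive extraction, so the following is a minimal sketch of what such an IQR-based capping function could look like in R (object names such as numeric_vars are assumptions):

cap_outliers <- function(x) {
  q1  <- quantile(x, 0.25, na.rm = TRUE)   # first quartile
  q3  <- quantile(x, 0.75, na.rm = TRUE)   # third quartile
  iqr <- q3 - q1
  lower <- q1 - 1.5 * iqr
  upper <- q3 + 1.5 * iqr
  x[x < lower] <- lower                    # cap low-side outliers at Q1 - 1.5*IQR
  x[x > upper] <- upper                    # cap high-side outliers at Q3 + 1.5*IQR
  x
}

# Applied column-wise to the numeric subset (hypothetical object name):
# numeric_vars <- as.data.frame(lapply(numeric_vars, cap_outliers))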
Before outlier treatment, the data was divided into two subsets for data cleansing – one containing the character variables and the other containing the numeric variables.
In the given dataset, missing values have been identified in the following variables:
Page likes: 3208
Page talking about: 3255
Page Checkins: 3255
CC4: 3198
CC5: 3200
There are several packages in R which can be used to treat missing values, such as those providing knnImputation and mice. The package used for this dataset is the MICE package.
MICE: mice, short for Multivariate Imputation by Chained Equations, is an R package that provides advanced features for missing value treatment. It implements the imputation in a slightly uncommon two-step way, using mice() to build the model and complete() to generate the completed data. The mice(df) function produces multiple complete copies of df, each with different imputations of the missing data. The complete() function returns one or several of these datasets, with the default being the first.
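As a rough illustration of the two-step workflow described above (a sketch, assuming the numeric subset is stored in a data frame called numeric_vars):

library(mice)

imp <- mice(numeric_vars, m = 5, method = "pmm", seed = 123)  # step 1: build the imputation model
numeric_vars_complete <- complete(imp, 1)                     # step 2: extract the first completed dataset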
After treatment with the mice package, it was observed that the dataset still had missing values in the columns Feature 7, Feature 20 and Feature 15. These were further imputed with the mean and random numbers to generate the final complete dataset.
The next step in data cleansing was reducing the dimensions of the dataset. After dropping the two irrelevant columns, the dataset had 41 variables, and of these, Feature 5-29 were taken as a subset. The reason for doing this was the lack of any explanation of what these variables represent. These variables were further reduced by removing Feature 10 and Feature 15 (which had 0 as their only value) and normalized to a range of (0, 1). Further, they were subjected to Principal Component Analysis.
The PCA was applied to Feature 5-29 and they were reduced into 3 factors, which were named Factor 1, Factor 2 and Factor 3.
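A minimal sketch of this reduction step, assuming the derived features are named Feature.5 to Feature.29 in a data frame called facebook_data (both names are assumptions):

# Keep Feature 5-29, dropping the all-zero Feature 10 and Feature 15
derived <- facebook_data[, paste0("Feature.", setdiff(5:29, c(10, 15)))]

# Normalize each column to the (0, 1) range
range01 <- function(x) (x - min(x)) / (max(x) - min(x))
derived_scaled <- as.data.frame(lapply(derived, range01))

# Principal component analysis; keep the first three components as factors
pca <- prcomp(derived_scaled)
factors <- as.data.frame(pca$x[, 1:3])
names(factors) <- c("Factor1", "Factor2", "Factor3")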
5.5 Univariate and Bivariate Analysis
The univariate and bivariate analysis has been conducted on all the variables; a few plots are shown below.
Univariate analysis
The first histograms and boxplots are those of the page related factors – Page likes, Page checkins, Page talking about and Page Category.
Page likes: The number of likes ranges between 0 and 55L.
Page checkins: The page check-ins range from 0 to 28000.
Page talking about: The value ranges from 0 to 60L.
Page Category: The page category ranges from 1 to 68.
The second univariate analysis was done on the comment variables (CC1, CC2, CC3, CC4).
CC1 – Total comments before selected base time: The value ranges from 0 to 258.
CC2 – Comments in last 24 hours: The value ranges from 0 to 107.
CC3 – Comments in last 48 to 24 hours: The value ranges from 0 to 96.
CC4 – Comments in first 24 hours: The value ranges from 0 to 246.
A similar univariate analysis was performed on the factors derived from the Principal Component Analysis.
Factor 1: The value of Factor 1 ranges from -1.6 to 3.07.
Factor 2: The value of Factor 2 ranges from -5 to 6.722.
Factor 3: The value of Factor 3 ranges from -2.5 to 85.
The above plots show the univariate analysis of the variables associated with the post:
Post length: The post length ranges from 0 to 513 characters.
Post share count: The post share count ranges from a minimum of 1 up to 452.
Target Variable: The target variable ranges from 0 (no comments) to 1305. There are outliers in the target variable, but they have not been treated, keeping their business significance in mind.
Bivariate Analysis:
The bivariate analysis is performed between the different variables and the number of comments to understand any linear relationship between them.
Plots: comments before selected base time vs. no. of comments; comments in last 24 hours vs. no. of comments; comments in last 48 to 24 hours vs. no. of comments; comments in first 24 hours vs. no. of comments; post length vs. no. of comments; base date time weekday vs. no. of comments.
Observations:
The relationship between the target variable and the rest of the variables is non-linear.
The maximum number of comments were posted on Wednesday, Monday and Tuesday.
With increasing post length, the number of comments starts to decrease.
5.6 Identifying the significant variables
The next step of data cleansing is to identify the significant variables in the dataset. A combination of the correlation plot, VIF values and linear regression p-values has been used to derive the important features of the dataset for model building.
The correlation plot and VIF values were calculated for the numerical variables of the derived dataset, and the results are below.
The VIF values from the linear regression in the first iteration were as below:
Page likes: 2.186531
Page Checkins: 1.029816
Page talking about: 2.945906
Page Category: 1.145452
CC1: 18.420032
CC2: 3.722549
CC3: 4.112315
CC4: 18.349000
CC5: 4.796095
Base time: 1.358041
Post length: 1.014453
Post share count: 1.659210
Factor 1: 2.160712
Factor 2: 1.476951
Factor 3: 1.159841
The dataset was subjected to multiple iterations of linear regression to identify the significant variables. The variables CC1 and CC4 have significantly high VIF values (above 18) and were therefore removed from the dataset.
Below is the correlation matrix and VIF for the new dataset:
> vif(model2)
Page.likes Page.Checkins Page.talking.about
1.804575 1.060851 1.606352
Page.Category Factor1 Factor2
1.159538 2.461764 1.196483
Factor3 CC2 CC3
1.032741 2.451173 2.656034
CC5 Base.Time Post.Length
2.918733 1.280638 1.013197
Post.Share.Count
1.561916
The dataset consists of 32759 rows and 43 columns. Although the dataset did not seem unbalanced, there were certain features in the dataset that had no explanation for their existence.
Graphical Analysis:
The univariate histograms of the variables show skewness in Page likes, Page checkins, Page talking about, CC1, CC2, CC3, CC4, H local and Target.Variable.
The bivariate analysis throws light on some interesting aspects, as below:
The maximum number of Page.Likes for published posts was obtained on Wednesday, followed by Sunday and Friday. This implies that more likes are generated on weekdays, or mid-week.
As the length of the post increases, the count of likes goes down. This implies that the audience is not interested in reading long posts. The same trend can be observed in the count of post shares as well.
In most scenarios, posts which have more likes tend to have more comments. Based on the above observations, we can say that the longer the post, the fewer the comments.
To recapitulate, the target should be user generated content of limited length, posted, possibly, around the start of a weekend. This would generate more comments and thereby help in understanding the behavior of the user.
Variable analysis:
All the variables were subjected to a multicollinearity check, and it was observed that:
Feature 5-29 had no background as to how they were derived and were highly correlated. Hence these were normalized and converted into 3 factors for easier analysis.
Further, the variables CC1 and CC4 were removed owing to their high VIFs, and a final dataset was derived having the following variables – Page likes, Page Checkins, Page Category, Page Talking about, CC2, CC3, CC5, Base time, Post Length, Post Share Count, Factor 1, Factor 2, Factor 3, Post published weekday, Base datetime weekday and H.local.
7 Model building
This section of the Facebook Comment Volume project builds various models to identify the most important factors in the dataset that drive comments on user generated posts.
A few regression modelling methods have been applied to the dataset:
Multiple Linear Regression (MLR)
Random Forest
Classification and Regression trees (CART)
Extreme Gradient Boosting (XGBoost)
Bagging
The model building is followed by Model performance measures. Since the output variable is continuous,
the popular model performance measures would be:
Root Mean Square Error (RMSE): The standard deviation of the residuals (prediction errors).
Residuals are a measure of how far from the regression line data points are; RMSE is a
measure of how spread out these residuals are.
Mean Absolute Error (MAE): The mean absolute error of a model with respect to a test set is the mean of the absolute values of the individual prediction errors over all instances in the test set.
Adjusted R2: The adjusted R2 tells you the percentage of variation explained by only the independent variables that actually affect the dependent variable. The adjusted R-squared increases only if the new term improves the model more than would be expected by chance.
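The three measures above can be computed directly from the observed and predicted values; the helper functions below are a sketch rather than the exact code used in the project:

rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))
mae  <- function(actual, predicted) mean(abs(actual - predicted))

# Adjusted R-squared for n observations and p predictors
adj_r2 <- function(actual, predicted, p) {
  n  <- length(actual)
  r2 <- 1 - sum((actual - predicted)^2) / sum((actual - mean(actual))^2)
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}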
Data preparation:
The first step towards any model building is splitting the dataset into train and test dataset. In the
current scenario, the data has been split in 70-30 ratio – 70% for the train data and 30% for the test
data. The dimensions of the train and test data are:
Train: 22931 rows, 17 columns
Test: 9828 rows, 17 columns
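A minimal sketch of the 70-30 split, assuming the cleaned data frame is called Facebook_data (the seed value is an assumption; Train_data_Facebook is the name used later in the report):

set.seed(123)
idx <- sample(seq_len(nrow(Facebook_data)), size = 0.7 * nrow(Facebook_data))
Train_data_Facebook <- Facebook_data[idx, ]   # ~22931 rows, 17 columns
Test_data_Facebook  <- Facebook_data[-idx, ]  # ~9828 rows, 17 columns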
Multiple linear regression is used to estimate the relationship between two or more independent variables and one dependent variable. We can use multiple linear regression when we want to determine:
How strong the relationship is between two or more independent variables and one
dependent variable (e.g. how rainfall, temperature, and amount of fertilizer added affect
crop growth).
The value of the dependent variable at a certain value of the independent variables (e.g. the
expected yield of a crop at certain levels of rainfall, temperature, and fertilizer addition).
In the first iteration of the linear regression model, it was observed that the parameters Page likes, Page checkins, Page Category, Base DateTime Weekday and Post Published Weekday have p-values > 0.05, and hence they were removed from the model.
Tuning:
The model was tuned by removing the insignificant variables – Page likes, Page checkins, Page Category, Base date time weekday and Post published weekday – as they have p-values greater than 0.05. The key outputs from the tuned model are summarized below.
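A sketch of how such a tuned model could be refit (the formula and object names are assumptions; variable names follow the cleaned dataset described earlier, and vif() comes from the car package):

library(car)

mlr_tuned <- lm(Target.Variable ~ Page.talking.about + CC2 + CC3 + CC5 +
                  Base.Time + Post.Length + Post.Share.Count +
                  Factor1 + Factor2 + Factor3 + H.Local,
                data = Train_data_Facebook)
summary(mlr_tuned)   # coefficients, p-values, F-statistic, adjusted R-squared
vif(mlr_tuned)       # check that multicollinearity stays low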
Observations:
The model p-value is < 2.2e-16, which is highly significant. This means that at least one of the predictor variables is significantly related to the outcome variable. The F-statistic value is very high: 564.5 on 11 and 22919 degrees of freedom.
Adjusted R-square:
The Adjusted R-square value is 0.2128, meaning that only about 21% of the variance is explained by the final model.
The RSE estimate, which gives a measure of prediction error, is calculated as 30.15 for this model.
Interpretations:
For a given predictor variable, the coefficient can be interpreted as the average effect on y of a one-unit increase in that predictor, holding all other predictors fixed.
The following predictors have negative coefficients – CC3, Base.Time and Post.Length. This implies that as base time and post length increase, the number of comments decreases.
The following predictors have positive coefficients – Page talking about, Factor1, Factor2, Factor3, CC2, CC5, Post Share Count and H.Local. As these variables increase, the expected comment volume increases.
Insights:
Based on the intercept and estimate values, below are the conclusions:
When the Post Share count increases, the number of comments increases. In order to
generate revenue, user generated posts should be shared more frequently.
Similarly, H.Local represents the number of hours for which comments are received; as this window increases, the number of comments increases.
The next influential factors are CC2 and CC5, where CC5 is the difference between comments in the last 24 hours and comments between 48 and 24 hours before the base time; that is, posts that gain momentum at the beginning of the second day attract more comments.
The number of comments decreases with increases in post length, page checkins and page likes. Page likes and page checkins increase with time, and as a result the number of comments tends to decrease.
In order to generate revenue from user generated content, institutions should focus on sharing these posts more frequently within the first half of the second day and keep posts to a limited length.
The below model performance measures were calculated for the MLR model:
Root Mean Square Error - 34.78%
R2 – 15.9%
Mean Absolute Error – 10.39%
Mean Absolute Percentage Error – Inf
The output shows Inf for the MAPE measure because there are zeros in the observed values. When the dependent variable can take zero as one of its values, MAPE cannot be used as an error measure; other error measures should be used instead.
Decision Tree Analysis is a general, predictive modelling tool that has applications spanning a number
of different areas. In general, decision trees are constructed via an algorithmic approach that
identifies ways to split a data set based on different conditions. It is one of the most widely used and
practical methods for supervised learning.
In the given case study, we will design a CART model for the Facebook dataset.
Classification and Regression Trees, commonly known as the CART model, were introduced by Leo Breiman, Jerome Friedman, Richard Olshen and Charles Stone. CART is an umbrella term referring to the following types of trees:
Classification trees: where the target variable is categorical, and the tree is used to identify the "class" into which the target variable would most likely fall.
Regression trees: where the target variable is continuous, and the tree is used to predict its value.
The CART algorithm is structured as a sequence of questions, the answers to which determine what the next question, if any, should be. The result of these questions is a tree-like structure whose ends are terminal nodes, at which point there are no more questions.
The main elements of CART (and any decision tree algorithm) are:
Rules for splitting data at a node based on the value of one variable;
Stopping rules for deciding when a branch is terminal and can be split no more; and
Finally, a prediction for the target variable in each terminal node.
In the given case study, we build a CART model for the dependent variable Target.Variable with respect to the other independent variables.
The tree is built as a regression tree, denoted by method = "anova" in the code, since we are trying to predict a numeric/continuous value.
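A minimal sketch of this step (the formula and control settings are illustrative; the weekday columns are assumed to be coded as factors):

library(rpart)
library(rpart.plot)

cart_model <- rpart(Target.Variable ~ .,
                    data = Train_data_Facebook,
                    method = "anova",                  # regression tree for a continuous target
                    control = rpart.control(cp = 0))   # grow a full tree, pruned later
rpart.plot(cart_model)   # visualize the splits described below
printcp(cart_model)      # complexity parameter table used for pruning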
As per the output, the below splits have been identified:
CC2 forms the root node of the tree and the first split is made at CC2 < 69.
The second split is conditional: if CC2 < 69, the next split is with respect to Base time >= 1, and if CC2 > 69, the next split is with respect to Base.Time >= 6.
The next split is done with respect to Post Share Count < 300, with Base Time >= 9 and < 299.
The Post Share Count > 299 branch is further split on Page Category >= 9, which is further divided with respect to Page likes, Factor 2 and Post published weekday (towards the end).
Similarly, based on several parameters, the tree has been built with 14 terminal nodes.
Model Tuning:
The CART model output with CP = 0 resulted in a tree with 10 terminal nodes, but such CART trees tend to overfit the data. Therefore, pruning is applied to the CART tree. Pruning, in simple terms, is cutting back the tree: every branch that is added must add value, and pruning requires that each split decrease the error by at least a specified amount. To decide the optimal tree size, the complexity parameter (CP) is used. If the cost of adding another variable to the decision tree from the current node is above the value of CP, then tree building does not continue. Equivalently, tree construction does not continue unless it would decrease the overall lack of fit by a factor of CP.
As per the complexity parameter table, the minimum relative error is at split = 9. However, a tree with that many terminal nodes would still tend to overfit the data, and hence we plot the graph. In the graph, the curve flattens out somewhere between CP values of 0.021 and 0.031, so we select a CP of 0.027 and construct the pruned tree.
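A short sketch of the pruning step at the chosen complexity parameter (continuing from the cart_model object sketched above):

plotcp(cart_model)                            # cp vs. cross-validated relative error
pruned_cart <- prune(cart_model, cp = 0.027)  # prune at the CP selected above
rpart.plot(pruned_cart)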
The pruned tree is:
Observations:
Insights:
The below model performance measures were calculated for the CART model:
Root Mean Square Error - 31.02%
R2 – 33.67%
Mean Absolute Error – 6.85%
10 Random Forests
Random Forest is one of the most popular and powerful ensemble methods used today in machine learning. An ensemble method, or ensemble learning algorithm, aggregates multiple outputs made by a diverse set of predictors to obtain better results. Random forests select
observations and specific features to build multiple decision trees and then average the results
across these trees. Random forests use the bootstrap aggregating or bagging algorithm which
generates new training subsets of original data. Each subset is of the same size and is sampled with
replacement. This method allows several instances to be used repeatedly for the training stage given
that we are sampling with replacement.
Observations:
In the first iteration of the random forest, the below plot is obtained:
Model Tuning:
The model tuning is performed by selecting appropriate values of mtry, the number of trees (ntree) and the minimum size of terminal nodes (nodesize).
Mtry: The mtry value is selected using the formula below:
floor(sqrt(ncol(Train_data_Facebook) - 1))
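A sketch of the tuned forest (the ntree and nodesize values here are illustrative rather than the exact values used; importance = TRUE enables the importance measures discussed next):

library(randomForest)

mtry_val <- floor(sqrt(ncol(Train_data_Facebook) - 1))

rf_model <- randomForest(Target.Variable ~ .,
                         data = Train_data_Facebook,
                         mtry = mtry_val,
                         ntree = 500,         # illustrative number of trees
                         nodesize = 10,       # illustrative minimum terminal node size
                         importance = TRUE)
print(rf_model)          # includes % variance explained
importance(rf_model)     # %IncMSE and IncNodePurity per variable
varImpPlot(rf_model)     # the importance plot referred to below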
Based on the new parameters, the forest was re-trained and the below plot was obtained.
Importance in Random Forest: the importance() function returns, for each independent variable, measures indicating how important that variable is to the model. The below plot was obtained for importance:
Observations:
Insights:
The % variance explained by the Random Forest is 41.37%, which is good compared to the linear regression model, where the R2 value is around 15%.
The top-most important variables are Base time, CC2, Post Share Count and Page talking about, which correlates with the output of the linear regression.
Base datetime weekday and Post published weekday also have considerable importance.
The below model performance measures were calculated for the Random Forest model:
Root Mean Square Error - 28.84%
R2 – 43%
Mean Absolute Error – 5.4%
Boosting refers to a group of algorithms that use weighted averages to turn weak learners into stronger learners. Unlike bagging, where each model runs independently and the outputs are aggregated at the end without preference to any model, boosting is a two-step approach: one first uses subsets of the original data to produce a series of averagely performing models and then "boosts" their performance by combining them using a particular cost function (such as a majority vote).
One important aspect of XGBoost is that the train and test datasets have to be supplied as numeric matrices.
The XGBoost model was validated over different values of eta, nrounds and max_depth; the optimum values of 0.01, 50 and 25 respectively were chosen based on the lowest RMSE.
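A sketch of this setup (object names are assumptions; model.matrix is used here to one-hot encode the character columns into a numeric matrix):

library(xgboost)

x_train <- model.matrix(Target.Variable ~ . - 1, data = Train_data_Facebook)
y_train <- Train_data_Facebook$Target.Variable
x_test  <- model.matrix(Target.Variable ~ . - 1, data = Test_data_Facebook)

xgb_model <- xgboost(data = x_train, label = y_train,
                     eta = 0.01, nrounds = 50, max_depth = 25,  # tuned values reported above
                     objective = "reg:squarederror",            # "reg:linear" on older xgboost versions
                     verbose = 0)

pred_xgb <- predict(xgb_model, x_test)
xgb.importance(model = xgb_model)   # feature importance behind the plot below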
The importance plot was built for the tuned XGBoost as below:
Insights:
The below model performance measures were calculated for the Extreme gradient boosting model:
Root Mean Square Error - 30.08%
R2 – 37.7%
Mean Absolute Error – 5.96%
12 Bagging
Bagging" or bootstrap aggregation is a specific type of machine learning process that uses ensemble
learning to evolve machine learning models. Bagging is a way to decrease the variance of your
prediction by generating additional data for training from your original dataset using combinations
with repetitions to produce multisets of the same cardinality/size as your original data.
Bagging works as follows:
It creates randomized samples of the data set (just like random forest) and grows trees on a
different sample of the original data. The remaining 1/3 of the sample is used to estimate
unbiased OOB error.
It considers all the features at a node (for splitting).
Once the trees are fully grown, it uses averaging or voting to combine the resultant
predictions.
Bagging was performed with different values of ntree – 1, 3, 5, seq(10, 200, 10) – to find the optimum value.
Bagging performs 25 bootstrap samples by default, but a greater number of trees may be required to achieve stabilized results. Hence, to fine-tune the single bagging model, we pass 10-50 bagged trees and calculate the RMSE at each iteration, as sketched below.
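A sketch of this tuning loop using the bagging() function from the ipred package (object names and the RMSE helper are assumptions):

library(ipred)

rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

ntree_grid <- c(1, 3, 5, seq(10, 200, 10))
rmse_by_ntree <- sapply(ntree_grid, function(n) {
  bag_fit <- bagging(Target.Variable ~ ., data = Train_data_Facebook,
                     nbagg = n, coob = TRUE)           # coob = TRUE keeps the out-of-bag error
  pred <- predict(bag_fit, newdata = Test_data_Facebook)
  rmse(Test_data_Facebook$Target.Variable, pred)
})

# Refit at the ntree value with the lowest RMSE (20 in this report)
bag_model <- bagging(Target.Variable ~ ., data = Train_data_Facebook,
                     nbagg = 20, coob = TRUE)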
Tuning:
The bagging model was tuned with 10-50 trees, and the least RMSE was obtained for ntree = 20. The RMSE was around 30.83% for ntree = 10 and 30.23% for ntree = 20; hence the value of 20 was chosen for ntree.
The Out of Bag error rate is 26.97%.
The below model performance measures were calculated for the bagging model:
Root Mean Square Error - 30.23%
R2 – 37.27%
Mean Absolute Error – 6.5%
Various model performance measures were calculated for all the models. The values reported in the preceding sections are summarized below:
Model            RMSE     R2       MAE
MLR              34.78%   15.9%    10.39%
CART             31.02%   33.67%   6.85%
Random Forest    28.84%   43%      5.4%
XGBoost          30.08%   37.7%    5.96%
Bagging          30.23%   37.27%   6.5%
Insights:
The best model is Random Forest, as its RMSE is the lowest and around 43% of the variance due to the predictor variables is explained by this model.
The Multiple Linear Regression (MLR) and CART models are the least accurate of the five models.
Based on the models, the important variables are Base.Time, Post.Share.Count, CC3, CC5, Factor1, Factor2, Page likes and Page talking about.
With increasing post length the number of comments decreases, while posts that are shared more frequently receive more comments.
The most comments are generated on Tuesday, Monday and Wednesday, and comments are generated within 24 to 36 hours of the post.
The fewest comments are obtained on posts more than 48 hours after the base time and post published weekday.
The most important variables are Post Share Count, Page talking about, Base time, Comments in last 24 hours and Post length.