
Project

Linear regression was used to predict yearly customer spending on an e-commerce platform based on engagement metrics. The key predictors were average session length, time spent on the app and website, and length of membership. The data was cleaned by removing NA values and duplicate rows. Variables like customer email and avatar that did not predict spending were removed. Two linear regression models were created - one using all engagement predictors and one using just average session length. Both aimed to uncover patterns in customer behavior and spending to help optimize the customer experience and increase revenue.


Predictive Modelling with Linear Regression

Project 1 (Multivariate Statistics STAT8031 - Fall 2023 - Section 1)
Salma Shaheen

Student ID: 8913789


Table of Contents

1. Data Set: E-commerce Customers
2. Initial Modeling for E-commerce Dataset
3. Diagnostics for E-commerce Dataset
4. Model Selection for E-commerce Dataset
5. Predictions and Summary

Data Set: E-commerce Customers
The dataset appears to be a simulated collection of customer records from an e-commerce setting,
likely intended for analysis or modeling. It contains 500 observations. Each observation represents
a distinct customer, and the variables describe different aspects of that customer's interactions and
behavior on the e-commerce platform. The variables are as follows:

1. Email: This variable contains the email addresses of the customers, serving as unique
identifiers for each customer in the dataset.
2. Address: This variable includes the physical addresses of the customers, providing additional
contact information.
3. Avatar: This variable represents the chosen avatars or profile pictures of the customers,
potentially for personalization or identification purposes.
4. Avg. Session Length: This variable records the average length of time each customer spends
per session on the e-commerce platform.
5. Time on App: This variable represents the time spent by each customer using the
e-commerce application.
6. Time on Website: This variable indicates the time spent by each customer on the
e-commerce website.
7. Length of Membership: This variable denotes the duration for which each customer has
been a member of the e-commerce platform.
8. Yearly Amount Spent: This variable serves as the quantitative response, representing the
total amount of money spent by each customer on the e-commerce platform within a year.

The dataset includes both quantitative variables, such as time metrics and the amount spent, and
qualitative variables, such as email addresses, addresses, and avatars.

In the e-commerce dataset, the following variables can be considered for regression analysis:
• Avg. Session Length
• Time on App
• Time on Website
• Length of Membership
• Yearly Amount Spent

These variables are suitable for regression analysis as they are quantitative in nature, providing
measurable insights into customer behavior and engagement on the e-commerce platform. Utilizing
these variables in a regression analysis can help in understanding the relationships between
customer engagement metrics and the amount spent yearly on the platform. By examining how
changes in these variables influence the yearly amount spent, businesses can gain valuable insights
into customer preferences, behavior patterns, and the effectiveness of the e-commerce platform in
driving customer spending.

In the e-commerce dataset, there is at least one categorical variable, which is the "Avatar" variable.
This variable represents the avatars or profile pictures chosen by the customers. Categorical variables
are those whose values fall into distinct categories or groups rather than on a numeric scale.

Analyzing the impact of this categorical variable on the amount spent might provide insights into
the influence of visual representation or personalization on customer engagement and spending
patterns, and so inform improvements to the user experience on the platform. However, using it as a
predictor in a regression analysis is not straightforward.

A categorical predictor must first be encoded as dummy (indicator) variables before it can enter a
linear regression. This process creates binary variables that represent the presence or absence of
each category. (Logistic regression, by contrast, is used when the response itself is categorical; it
is not needed here because the response, Yearly Amount Spent, is quantitative.) The "Avatar"
variable, as it stands, is therefore not directly usable in this analysis without appropriate preprocessing.
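
The dummy encoding described above can be sketched in base R as follows. This is only an illustration on a made-up data frame: the avatar labels, the spending values, and the names toy, X, and fit are assumptions, not part of the project data.

```r
# Toy data: the avatar labels and spending values are made up for illustration.
toy <- data.frame(
  Avatar = factor(c("Blue", "Green", "Red", "Green", "Blue", "Red")),
  Spent  = c(510, 480, 530, 470, 505, 540)
)

# model.matrix() expands the factor into indicator (dummy) columns.
# With the default treatment contrasts, the first level ("Blue") is the
# baseline and gets no column of its own.
X <- model.matrix(~ Avatar, data = toy)
print(X)

# lm() applies the same encoding automatically when given a factor predictor.
fit <- lm(Spent ~ Avatar, data = toy)
print(coef(fit))
```

With k categories, this encoding produces k - 1 indicator columns, so a variable with many distinct levels adds many coefficients to the model.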

The following are the things I hope to predict and learn using this dataset:

• The primary aim is to predict the yearly amount spent by customers based on their
engagement metrics and membership duration.
• By using predictors such as the time spent on the app and website, session length,
and length of membership, it becomes possible to uncover patterns and trends that drive
customer spending behavior.
• Understanding the factors that influence customer spending can offer valuable insights for
e-commerce businesses to optimize their platform, improve user experience, and tailor
marketing strategies to boost customer engagement and maximize revenue.

Through comprehensive analysis and modeling techniques, it is possible to uncover hidden patterns
within the data, such as identifying customer segments with varying spending habits, preferences, or
levels of engagement. By uncovering these insights, businesses can refine their marketing strategies,
streamline user experiences, and enhance customer satisfaction, ultimately leading to increased
customer loyalty and higher overall revenue generation. Understanding the dynamics of customer
spending within the e-commerce platform can provide businesses with a competitive edge in a
rapidly evolving and highly competitive online marketplace.

Initial Modeling for E-commerce Dataset

For the e-commerce dataset, the "Yearly Amount Spent" serves as the dependent (response)
variable.

Based on the e-commerce dataset, several variables seem crucial as predictors for understanding
customer spending behavior. The "Time on App" and "Time on Website" are likely to be significant,
as they represent the duration of customer interactions on these platforms, indicating the level of
engagement. The "Length of Membership" is another essential predictor as it can provide insights
into customer loyalty and the potential impact of long-term engagement on spending. The “average
session length” is another useful predictor in understanding customer behavior and potentially
predicting the yearly amount spent by customers on the e-commerce platform. The average session
length reflects the duration of time customers spend during each interaction or session, providing
insights into their engagement and activity levels. Longer average session lengths might indicate
higher levels of customer involvement, interest, or satisfaction with the platform, potentially leading
to increased spending.

Additionally, to allow for possible non-linear relationships, it might be beneficial to include
squared or cubed terms of these variables to account for any curvilinear patterns. An interaction term
between "Time on App" and "Time on Website" could capture any combined effect of these
variables on the yearly amount spent.
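
As a rough sketch of how squared and interaction terms are written in an R model formula, the example below uses simulated data. The variables app, web, and spent are hypothetical stand-ins, not the real columns, and the response is generated with a known curvature.

```r
set.seed(1)
n <- 200
# Simulated stand-ins for the engagement variables (not the real data).
app <- rnorm(n, mean = 12, sd = 1)
web <- rnorm(n, mean = 37, sd = 1)
# The response is generated with curvature in `app` and no true interaction.
spent <- 100 + 5 * app + 2 * app^2 + 0.5 * web + rnorm(n, sd = 5)
sim <- data.frame(app, web, spent)

# I(app^2) adds a squared term; app:web adds the interaction term.
fit <- lm(spent ~ app + I(app^2) + web + app:web, data = sim)
print(round(coef(fit), 3))
```

The I() wrapper is needed because ^ has a special meaning inside a formula; without it, app^2 would not be treated as arithmetic squaring.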

On the other hand, variables such as the customer's email, address, or avatar might not be directly
useful for predicting spending behavior and could be excluded from the model to simplify the
analysis and avoid unnecessary complexity.

Cleaning Data:
First, I imported the data into R. After that, I assigned it to a shorter name for convenience.

    DT = E_commerece_Customers

I then noticed that many cells contained "NA" (this may simply have happened when I
converted the data file into Excel). So, I removed the rows containing "NA" values by using the following code:

    library(dplyr)
    # na.omit() drops every row that contains at least one NA
    Remove_NA = na.omit(DT)
    Remove_NA

To make sure that no rows are duplicated, I used the following command:

    # distinct() (from dplyr) keeps only the unique rows
    Remove_duplicate = distinct(Remove_NA)
    Remove_duplicate

Because variables such as the customer's email, address, and avatar are unlikely to be directly useful for
predicting spending behavior, I removed these columns and stored the result as Remove_columns.
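
The command for this step is not shown in the text. A minimal sketch of dropping such identifier columns, on a made-up two-row data frame, might look like the following. The column names Email, Address, and Avatar are taken from the variable list above; the data values and the names toy, drop_cols, and toy_clean are assumptions for illustration.

```r
# Toy stand-in for the cleaned data; values are made up for illustration.
toy <- data.frame(
  Email   = c("a@example.com", "b@example.com"),
  Address = c("1 Main St", "2 Side St"),
  Avatar  = c("Blue", "Green"),
  `Avg. Session Length` = c(33.1, 34.5),
  `Yearly Amount Spent` = c(490.2, 512.7),
  check.names = FALSE  # keep the spaces in the column names
)

# Drop the identifier columns by name (base R).
drop_cols <- c("Email", "Address", "Avatar")
toy_clean <- toy[, !(names(toy) %in% drop_cols), drop = FALSE]
print(names(toy_clean))

# With dplyr loaded, an equivalent call would be:
# toy_clean <- select(toy, -Email, -Address, -Avatar)
```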

I then assigned the result to a new name and used the following code to check that the data is in the expected form,
with 500 observations and five variables:

    DTC = Remove_columns
    DTC
    head(DTC)   # first six rows
    str(DTC)    # column names and types
    nrow(DTC)   # number of observations

Modeling:

Linear Regression model for Avg. Session Length, Time on App, Time on Website,
Length of Membership & Yearly Amount Spent:

The following code is used to fit the regression model:

    model1 = lm(`Yearly Amount Spent` ~ `Avg. Session Length` + `Time on App` +
                  `Time on Website` + `Length of Membership`, data = DTC)
    model1

The coefficients are:

    Term                     Estimate
    (Intercept)            -1051.5943
    Avg. Session Length       25.7343
    Time on App               38.7092
    Time on Website            0.4367
    Length of Membership      61.5773
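
To show how these coefficients turn into a prediction, the sketch below plugs them into the fitted equation for one hypothetical customer. The coefficients are the ones reported above; the predictor values in x are made up for illustration and do not come from the data.

```r
# Coefficients of model1 as reported above.
b <- c(intercept = -1051.5943,
       session   = 25.7343,
       app       = 38.7092,
       website   = 0.4367,
       member    = 61.5773)

# Hypothetical customer (illustrative values, not taken from the data).
x <- c(session = 33, app = 12, website = 37, member = 3.5)

# Predicted spending = intercept + sum of (coefficient * predictor value).
pred <- unname(b["intercept"] + sum(b[names(x)] * x))
print(pred)
```

Each coefficient is the estimated change in Yearly Amount Spent for a 1-unit increase in that predictor while the other predictors are held fixed.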

Linear Regression model for Avg. Session Length & Yearly Amount Spent:

The following code is used to fit the regression line:

    m1 = lm(`Yearly Amount Spent` ~ `Avg. Session Length`, data = DTC)
    m1

The coefficients are:

    Intercept    Slope
    -438.56      28.37

This indicates that for each 1-unit increase in Avg. Session Length, the Yearly Amount Spent
increases by 28.37 on average.

The code for the scatter plot with the regression line is:

    plot(DTC$`Avg. Session Length`, DTC$`Yearly Amount Spent`,
         main = "Yearly Amount vs Average Length", xlab = "Average Length",
         ylab = "Yearly Amount")
    abline(lm(`Yearly Amount Spent` ~ `Avg. Session Length`, data = DTC), col = "blue")

The result is:

[Figure: scatter plot of Yearly Amount Spent against Avg. Session Length with the fitted regression line]

This relationship seems to be nonlinear. So, the model after adding a squared term is:

    DTCSqrModel1 = lm(`Yearly Amount Spent` ~ `Avg. Session Length`
                      + I(`Avg. Session Length`^2), data = DTC)
    coef(DTCSqrModel1)

The coefficients are:

    Intercept      Avg. Session Length    I(Avg. Session Length^2)
    334.6321065    -18.4748455            0.7090415
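
Whether a squared term is worth keeping can also be checked numerically rather than by eye. The sketch below, on simulated data with a known curvature, compares a straight-line fit with a quadratic fit using an F-test for nested models. The variables x and y and the model names are hypothetical, and this check is not something the original analysis reports.

```r
set.seed(2)
n <- 300
x <- runif(n, 0, 10)                               # simulated predictor
y <- 5 + 2 * x + 0.5 * x^2 + rnorm(n, sd = 2)      # response with genuine curvature
sim <- data.frame(x, y)

linear    <- lm(y ~ x, data = sim)
quadratic <- lm(y ~ x + I(x^2), data = sim)

# F-test for the nested models: a small p-value favours keeping the squared term.
cmp <- anova(linear, quadratic)
print(cmp)

# Adjusted R-squared of each fit, as a second view of the same comparison.
print(c(linear    = summary(linear)$adj.r.squared,
        quadratic = summary(quadratic)$adj.r.squared))
```

The same two calls could be applied to a straight-line model and its squared-term counterpart fitted on DTC.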

Linear Regression model for Time on App & Yearly Amount Spent:

The following code is used to fit the regression line:

    m2 = lm(`Yearly Amount Spent` ~ `Time on App`, data = DTC)
    m2

The coefficients are:

    Intercept    Slope
    19.21        39.83

This indicates that for each 1-unit increase in Time on App, the Yearly Amount Spent
increases by 39.83 on average.

The code for the plot is:

    plot(DTC$`Time on App`, DTC$`Yearly Amount Spent`,
         main = "Yearly Amount vs Time on App", xlab = "Time on App",
         ylab = "Yearly Amount")
    abline(lm(`Yearly Amount Spent` ~ `Time on App`, data = DTC), col = "blue")

The result is:

[Figure: scatter plot of Yearly Amount Spent against Time on App with the fitted regression line]

This relationship seems to be nonlinear. So, the model after adding a squared term is:

    DTCSqrModel2 = lm(`Yearly Amount Spent` ~ `Time on App`
                      + I(`Time on App`^2), data = DTC)
    coef(DTCSqrModel2)

The coefficients are:

    Intercept     Time on App    I(Time on App^2)
    217.111103    6.646338       1.381877

Linear Regression model for Time on Website & Yearly Amount Spent:

The following code is used to fit the regression line:

    m3 = lm(`Yearly Amount Spent` ~ `Time on Website`, data = DTC)
    m3

The coefficients are:

    Intercept    Slope
    506.9961     -0.2073

This indicates that for each 1-unit increase in Time on Website, the Yearly Amount Spent
decreases by 0.2073 on average.

The code for the plot is:

    plot(DTC$`Time on Website`, DTC$`Yearly Amount Spent`,
         main = "Yearly Amount vs Time on Website", xlab = "Time on Website",
         ylab = "Yearly Amount")
    abline(lm(`Yearly Amount Spent` ~ `Time on Website`, data = DTC), col = "blue")

The result is:

[Figure: scatter plot of Yearly Amount Spent against Time on Website with the fitted regression line]

This relationship seems to be nonlinear. So, the model after adding a squared term is:

    DTCSqrModel3 = lm(`Yearly Amount Spent` ~ `Time on Website`
                      + I(`Time on Website`^2), data = DTC)
    coef(DTCSqrModel3)

The coefficients are:

    Intercept       Time on Website    I(Time on Website^2)
    2289.685154     -96.467231         1.298474

Linear Regression model for Length of Membership & Yearly Amount Spent:

The following code is used to fit the regression line:

    m4 = lm(`Yearly Amount Spent` ~ `Length of Membership`, data = DTC)
    m4

The coefficients are:

    Intercept    Slope
    272.40       64.22

This indicates that for each 1-unit increase in Length of Membership, the Yearly Amount Spent
increases by 64.22 on average.

The code for the plot is:

    plot(DTC$`Length of Membership`, DTC$`Yearly Amount Spent`,
         main = "Yearly Amount vs Length of Membership", xlab = "Length of Membership",
         ylab = "Yearly Amount")
    abline(lm(`Yearly Amount Spent` ~ `Length of Membership`, data = DTC), col = "blue")

The result is:

[Figure: scatter plot of Yearly Amount Spent against Length of Membership with the fitted regression line]

This relationship seems to be linear.

Diagnostics for E-commerce Dataset
To check the major assumptions of the linear model (model1), we will use residual analysis
(a residuals vs fitted values plot), a histogram of the residuals, and a Quantile-Quantile (QQ) plot.

The residuals vs fitted values plot:

The code in R is:

    model1Resids = model1$residuals
    model1Resids
    model1fitted = model1$fitted.values
    model1fitted
    plot(model1fitted, model1Resids, main = "Scatter Plot",
         xlab = "Fitted Values", ylab = "Residuals")

The resulting plot is:

Figure 1 (Residuals vs Fitted Values)

The homoscedasticity assumption appears to be violated: the variance (i.e., spread) of the residuals decreases as
the fitted values increase. Further, the model as constructed has a hard time predicting
certain observations in the data, which show up as outliers.
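
A visual impression of non-constant variance can be backed up with a simple test. The sketch below computes the studentized (Koenker) form of the Breusch-Pagan statistic by hand in base R, on simulated data where the noise is built to grow with the predictor. The data and the names x, y, fit, and aux are assumptions for illustration; this test is not part of the original analysis.

```r
set.seed(3)
n <- 400
x <- runif(n, 1, 10)
# Simulated heteroscedastic data: the noise sd grows with x.
y <- 2 + 3 * x + rnorm(n, sd = 0.5 * x)
fit <- lm(y ~ x)

# Auxiliary regression of the squared residuals on the predictor.
aux <- lm(residuals(fit)^2 ~ x)

# Test statistic: n * R^2 of the auxiliary regression, compared with a
# chi-squared distribution (df = number of predictors in the auxiliary model).
bp_stat <- n * summary(aux)$r.squared
bp_p    <- pchisq(bp_stat, df = 1, lower.tail = FALSE)
print(c(statistic = bp_stat, p.value = bp_p))
```

A small p-value is evidence against constant residual variance. For model1, the auxiliary regression would use the four predictors, with df = 4.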

Histogram:

The code in R is:

    hist(model1Resids, main = "Histogram of Residuals", xlab = "Residuals",
         ylab = "Frequency")

The result is:

Figure 2 (Histogram)

The histogram of the residuals might appear normal at first glance, but it is actually "long tailed". A
distribution like this generates outliers more often than a normal distribution does. This
suggests that our residuals are not normally distributed.

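Quantile-Quantile plot:

The code in R is:

    qqnorm(model1Resids)
    qqline(model1Resids)   # reference line for comparison with a normal distribution

The result is:

[Figure: normal QQ plot of the model1 residuals]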

The QQ plot suggests the same as the histogram. If the residuals were normal, the points
would fall close to a straight line. Instead, we see the effects of the outliers.
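
The reading of the histogram and QQ plot can be complemented with a formal normality test. The sketch below applies the Shapiro-Wilk test to two simulated samples, one normal and one long-tailed, and draws a QQ plot with a reference line. The samples are made up for illustration and are not the residuals of model1.

```r
set.seed(4)
# Simulated residuals: one normal sample and one long-tailed sample (t with 3 df).
normal_res <- rnorm(500)
heavy_res  <- rt(500, df = 3)

# Shapiro-Wilk test: a small p-value is evidence against normality.
p_normal <- shapiro.test(normal_res)$p.value
p_heavy  <- shapiro.test(heavy_res)$p.value
print(c(normal = p_normal, long_tailed = p_heavy))

# QQ plot with a reference line for the long-tailed sample.
qqnorm(heavy_res, main = "Normal QQ plot (simulated long-tailed residuals)")
qqline(heavy_res, col = "blue")
```

For the project model, the same call would be shapiro.test(model1Resids).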

Model Selection for E-commerce Dataset
Now, we will consider the problem of choosing among several different models.

First, we train a linear model with response Yearly Amount Spent and a single predictor, Avg. Session
Length, using 10-fold cross-validation.

The code is:

    library(stargazer)
    library(caret)
    library(leaps)
    set.seed(12)   # makes the random fold assignment reproducible

    small_model = train(form = `Yearly Amount Spent` ~ `Avg. Session Length`,
                        data = DTC, method = "lm",
                        trControl = trainControl(method = "cv", number = 10))
    small_model

The resampling results are:

    RMSE        R squared    MAE
    74.01713    0.1314221    57.38752

Next, we train a linear model with response Yearly Amount Spent and all four remaining variables
as predictors, again with 10-fold cross-validation.

The code is:

    Model2 = train(form = `Yearly Amount Spent` ~ `Avg. Session Length` +
                     `Time on App` + `Time on Website` + `Length of Membership`,
                   data = DTC, method = "lm",
                   trControl = trainControl(method = "cv", number = 10))
    Model2

The resampling results are:

    RMSE        R squared    MAE
    9.963529    0.9832007    7.955738
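
To make the idea behind 10-fold cross-validation concrete, the sketch below carries it out by hand in base R on simulated data. It illustrates the procedure only and is not the caret implementation; the data frame sim and the helper cv_rmse are hypothetical.

```r
set.seed(12)
n <- 200
# Simulated data standing in for the real predictors and response.
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
sim$y <- 10 + 4 * sim$x1 + 2 * sim$x2 + rnorm(n, sd = 1)

k <- 10
# Randomly assign each row to one of k folds of equal size.
fold <- sample(rep(1:k, length.out = n))

# Hypothetical helper: cross-validated RMSE for a given model formula.
cv_rmse <- function(formula, data) {
  rmse <- numeric(k)
  for (i in 1:k) {
    train_set <- data[fold != i, ]   # fit on the other k - 1 folds
    test_set  <- data[fold == i, ]   # evaluate on the held-out fold
    fit <- lm(formula, data = train_set)
    err <- test_set$y - predict(fit, newdata = test_set)
    rmse[i] <- sqrt(mean(err^2))
  }
  mean(rmse)   # average the per-fold RMSE values
}

rmse_small <- cv_rmse(y ~ x1, sim)
rmse_full  <- cv_rmse(y ~ x1 + x2, sim)
print(c(small = rmse_small, full = rmse_full))
```

Because every observation is held out exactly once, the resulting RMSE reflects performance on data the model did not see during fitting.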

Next, we will use best subset selection for Yearly Amount Spent and plot the result on the adjusted R-squared ("adjr2") scale.

The code is:

    # Model2 is refit here with the same formula as before
    Model2 = train(form = `Yearly Amount Spent` ~ `Avg. Session Length` +
                     `Time on App` + `Time on Website` + `Length of Membership`,
                   data = DTC, method = "lm",
                   trControl = trainControl(method = "cv", number = 10))
    subsetmodel = regsubsets(`Yearly Amount Spent` ~ `Avg. Session Length` +
                               `Time on App` + `Time on Website` + `Length of Membership`,
                             data = DTC, nvmax = 5)
    plot(subsetmodel, scale = "adjr2")
    summary(subsetmodel)

    # number of predictors in the subset with the highest adjusted R-squared
    bestM = which.max(summary(subsetmodel)$adjr2)
    bestM

The plot is:

[Figure: best subset selection plot on the adjusted R-squared scale]

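Since bestM is 3, the best subset contains three predictors. The code for the new model, which drops Time on Website, is:

    ModelN = train(form = `Yearly Amount Spent` ~ `Avg. Session Length` +
                     `Time on App` + `Length of Membership`,
                   data = DTC, method = "lm",
                   trControl = trainControl(method = "cv", number = 10))
    ModelN

The resampling results are:

    RMSE        R squared    MAE
    9.950781    0.9841474    7.938899

The RMSE and MAE are slightly lower than for the previous model (9.950781 vs 9.963529 and
7.938899 vs 7.955738), and R squared is slightly higher (0.9841474 vs 0.9832007). So, this is the
better model, although the differences are small.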
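
Best subset selection by adjusted R-squared can be illustrated without the leaps package by fitting every subset of predictors directly. The sketch below does this in base R on simulated data in which one predictor (x3) has no effect by construction. The names sim, preds, subsets, and best are hypothetical.

```r
set.seed(5)
n <- 300
# Simulated data: x3 has no effect on y by construction.
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n), x4 = rnorm(n))
sim$y <- 5 + 3 * sim$x1 + 2 * sim$x2 + 4 * sim$x4 + rnorm(n)

preds <- c("x1", "x2", "x3", "x4")

# Enumerate every non-empty subset of the predictors.
subsets <- unlist(lapply(seq_along(preds),
                         function(m) combn(preds, m, simplify = FALSE)),
                  recursive = FALSE)

# Fit each candidate model and record its adjusted R-squared.
adjr2 <- sapply(subsets, function(s) {
  summary(lm(reformulate(s, response = "y"), data = sim))$adj.r.squared
})

best <- subsets[[which.max(adjr2)]]
print(best)
print(max(adjr2))
```

With four predictors there are only 15 candidate models, so exhaustive enumeration is cheap. Adjusted R-squared penalizes extra predictors only mildly, so an irrelevant variable can still be selected by chance.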

Predictions and Summary
To calculate predicted values, we use the following code:

    # predictions for the same customers that were used to fit the model
    predicted_values = predict(ModelN, newdata = DTC)
    predicted_values

We can then compare these with the actual values with the help of this code:

    comparison = data.frame(Actual = DTC$`Yearly Amount Spent`,
                            Predicted = predicted_values)

Thus, from the results we observe:

• The predicted values are the model's estimated responses for the supplied data points.
• RMSE (root mean squared error) summarizes the typical size of the prediction errors; lower values indicate more accurate predictions.
• R-squared measures the proportion of the variance in the dependent variable that is
predictable from the independent variables.
• MAE (mean absolute error) is the average of the absolute errors between predicted and observed values.
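
The three measures listed above can be computed directly from a column of actual values and a column of predictions. The sketch below uses five made-up pairs of values, not output from ModelN, and the variable names are assumptions.

```r
# Small made-up example of actual and predicted values.
actual    <- c(500, 520, 480, 610, 450)
predicted <- c(510, 515, 470, 600, 465)

err  <- actual - predicted
rmse <- sqrt(mean(err^2))      # root mean squared error
mae  <- mean(abs(err))         # mean absolute error
# R-squared as 1 - (sum of squared errors) / (total sum of squares)
r2   <- 1 - sum(err^2) / sum((actual - mean(actual))^2)

print(c(RMSE = rmse, MAE = mae, R2 = r2))
```

The same three lines could be applied to the Actual and Predicted columns of the comparison data frame.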
