Project
Data Set: E-commerce Customers
The dataset appears to be a simulated collection of customer data from an e-commerce setting,
likely intended for analysis or modeling. It contains 500 observations. Each observation appears
to represent a distinct customer, and the variables describe different aspects of that customer's
interactions and behavior on the e-commerce platform. Here is a detailed breakdown of the
dataset:
1. Email: This variable contains the email addresses of the customers, serving as unique
identifiers for each customer in the dataset.
2. Address: This variable includes the physical addresses of the customers, providing additional
contact information.
3. Avatar: This variable represents the chosen avatars or profile pictures of the customers,
potentially for personalization or identification purposes.
4. Avg. Session Length: This variable records the average length of time each customer spends
per session on the e-commerce platform.
5. Time on App: This variable represents the time spent by each customer using the
e-commerce application.
6. Time on Website: This variable indicates the time spent by each customer on the
e-commerce website.
7. Length of Membership: This variable denotes the duration for which each customer has
been a member of the e-commerce platform.
8. Yearly Amount Spent: This variable serves as the quantitative response, representing the
total amount of money spent by each customer on the e-commerce platform within a year.
The dataset includes both quantitative variables, such as time metrics and the amount spent, and
qualitative variables, such as email addresses, addresses, and avatars.
In the e-commerce dataset, the following variables can be considered for regression analysis:
Avg. Session Length
Time on App
Time on Website
Length of Membership
Yearly Amount Spent
These variables are suitable for regression analysis as they are quantitative in nature, providing
measurable insights into customer behavior and engagement on the e-commerce platform. Utilizing
these variables in a regression analysis can help in understanding the relationships between
customer engagement metrics and the amount spent yearly on the platform. By examining how
changes in these variables influence the yearly amount spent, businesses can gain valuable insights
into customer preferences, behavior patterns, and the effectiveness of the e-commerce platform in
driving customer spending.
In the e-commerce dataset, there is at least one categorical variable, which is the "Avatar" variable.
This variable represents the chosen avatars or profile pictures of the customers. Categorical variables
are those that have distinct categories or groups, and in this dataset, the "Avatar" variable is a
categorical attribute representing different visual representations chosen by the customers.
Analyzing the impact of this categorical variable on customer behavior or the amount spent might
provide insights into the potential influence of visual representations on customer engagement and
spending patterns. Including categorical variables in the analysis can help understand the role of
visual cues or personalization in customer interactions and provide valuable insights for improving
the user experience on the e-commerce platform. While it might be possible to use this variable as a
categorical predictor, it's important to note that including it in a regression analysis might not be
straightforward.
Using a categorical variable as a predictor in a linear regression typically requires encoding it as
dummy or indicator variables. (This is different from logistic regression, which models a categorical
response.) The encoding creates binary variables that represent the presence or absence of each
category. However, the "Avatar" variable, as it stands, may not be directly applicable for this type of
analysis without appropriate preprocessing.
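As an illustration of the dummy encoding described above, here is a minimal sketch in base R. The avatar labels and spending values are made up for the example and are not taken from the dataset.

```r
# Toy data: avatar labels and spending values are hypothetical.
toy <- data.frame(
  Avatar = factor(c("Blue", "Green", "Red", "Green", "Blue", "Red")),
  Spent  = c(480, 510, 495, 530, 470, 505)
)

# model.matrix() expands the factor into 0/1 indicator columns.
# The first level ("Blue") is the baseline and gets no column of its own.
X <- model.matrix(~ Avatar, data = toy)
print(X)

# lm() applies the same encoding automatically to factor predictors.
fit <- lm(Spent ~ Avatar, data = toy)
print(coef(fit))
```

Each non-baseline coefficient is the difference in mean spending between that category and the baseline. A factor with many levels would add one column per extra level, which is part of why such a variable needs care before modeling.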
I hope to use this dataset for the following:
The primary aim is to predict the yearly amount spent by customers based on their
engagement metrics and membership duration.
By using predictors such as the time spent on the app and website, session length,
and length of membership, it becomes possible to uncover patterns and trends that drive
customer spending behavior.
Understanding the factors that influence customer spending can offer valuable insights for
e-commerce businesses to optimize their platform, improve user experience, and tailor
marketing strategies to boost customer engagement and maximize revenue.
Through comprehensive analysis and modeling techniques, it is possible to uncover hidden patterns
within the data, such as identifying customer segments with varying spending habits, preferences, or
levels of engagement. By uncovering these insights, businesses can refine their marketing strategies,
streamline user experiences, and enhance customer satisfaction, ultimately leading to increased
customer loyalty and higher overall revenue generation. Understanding the dynamics of customer
spending within the e-commerce platform can provide businesses with a competitive edge in a
rapidly evolving and highly competitive online marketplace.
Initial Modeling for E-commerce Dataset
For the e-commerce dataset, the "Yearly Amount Spent" serves as the dependent (response)
variable.
Based on the e-commerce dataset, several variables seem crucial as predictors for understanding
customer spending behavior. The "Time on App" and "Time on Website" are likely to be significant,
as they represent the duration of customer interactions on these platforms, indicating the level of
engagement. The "Length of Membership" is another essential predictor as it can provide insights
into customer loyalty and the potential impact of long-term engagement on spending. The "Avg.
Session Length" is another useful predictor for understanding customer behavior and potentially
predicting the yearly amount spent by customers on the e-commerce platform. The average session
length reflects the duration of time customers spend during each interaction or session, providing
insights into their engagement and activity levels. Longer average session lengths might indicate
higher levels of customer involvement, interest, or satisfaction with the platform, potentially leading
to increased spending.
On the other hand, variables such as the customer's email, address, or avatar might not be directly
useful for predicting spending behavior and could be excluded from the model to simplify the
analysis and avoid unnecessary complexity.
Cleaning Data:
First, I imported the data into R. After that, I gave the data a shorter name for convenience.
DT = E_commerece_Customers
After that, I noticed that many cells contained “NA” (this may simply have happened when I
converted the data file into Excel). So, I removed the rows with “NA” values by using the following code:
library(dplyr)
Remove_NA = na.omit(DT)
Remove_NA
To make sure that no rows are duplicated, I used the following command:
Remove_duplicate = distinct(Remove_NA)
Remove_duplicate
Because variables such as the customer's email, address, and avatar are unlikely to be directly useful
for predicting spending behavior, I removed these columns with the following command:
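A minimal sketch of this column-removal step in base R, using a small made-up data frame in place of the real data. The values are hypothetical, and the exact command used in the project may have differed.

```r
# Hypothetical stand-in for Remove_duplicate (values are invented).
Remove_duplicate <- data.frame(
  Email = c("a@example.com", "b@example.com"),
  Address = c("1 Main St", "2 Oak Ave"),
  Avatar = c("Blue", "Green"),
  `Time on App` = c(12.6, 11.1),
  `Yearly Amount Spent` = c(587.9, 392.2),
  check.names = FALSE  # keep the column names with spaces as they are
)

# Drop the three identifier-like columns and keep everything else.
drop_cols <- c("Email", "Address", "Avatar")
Remove_columns <- Remove_duplicate[, !(names(Remove_duplicate) %in% drop_cols),
                                   drop = FALSE]

# With dplyr loaded, an equivalent call would be:
#   select(Remove_duplicate, -Email, -Address, -Avatar)
str(Remove_columns)
```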
I renamed the data again and used the following code to check that the data is in the expected form,
with 500 observations and five variables:
DTC = Remove_columns
DTC
head(DTC)
str(DTC)
nrow(DTC)
Modeling:
Linear Regression model for Avg. Session Length, Time on App, Time on Website,
Length of Membership & Yearly amount spent:
Coefficient             Estimate
Intercept              -1051.5943
Avg. Session Length       25.7343
Time on App               38.7092
Time on Website            0.4367
Length of Membership      61.5773
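For reference, a model of this form can be fitted with lm(). The sketch below uses simulated data, because the real data frame is not reproduced here. The predictor means, the noise level, and the names DTC_sim and model1_sim are assumptions made only for illustration, and the simulated coefficients are rounded versions of the estimates above.

```r
set.seed(1)
n <- 500

# Simulated predictors: the means and spreads are assumptions, not the real data.
DTC_sim <- data.frame(
  `Avg. Session Length`  = rnorm(n, mean = 33,  sd = 1),
  `Time on App`          = rnorm(n, mean = 12,  sd = 1),
  `Time on Website`      = rnorm(n, mean = 37,  sd = 1),
  `Length of Membership` = rnorm(n, mean = 3.5, sd = 1),
  check.names = FALSE
)

# Simulated response built from rounded versions of the reported coefficients.
DTC_sim$`Yearly Amount Spent` <- -1050 +
  26  * DTC_sim$`Avg. Session Length` +
  39  * DTC_sim$`Time on App` +
  0.4 * DTC_sim$`Time on Website` +
  62  * DTC_sim$`Length of Membership` +
  rnorm(n, mean = 0, sd = 10)

# Multiple linear regression with all four predictors.
model1_sim <- lm(`Yearly Amount Spent` ~ `Avg. Session Length` + `Time on App` +
                   `Time on Website` + `Length of Membership`, data = DTC_sim)
print(coef(model1_sim))
```

Each slope is read as the expected change in Yearly Amount Spent for a one-unit increase in that predictor while the other predictors are held fixed.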
Linear Regression model for Avg. Session Length & Yearly amount spent :
Intercept Slope
-438.56 28.37
This indicates that a 1-unit increase in Avg. Session Length is associated with an average increase
of 28.37 in Yearly Amount Spent.
This relationship seems to be nonlinear. So, the model after a suitable transformation is:
Coefficient                  Estimate
Intercept                  334.632106
Avg. Session Length        -18.4748455
I(Avg. Session Length^2)     0.7090415
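A quadratic term like the one in these coefficient names is added to an lm() formula with I(). The sketch below uses simulated data; the variable names x and y, the range of x, and the noise level are assumptions for illustration only.

```r
set.seed(2)

# Simulated data with a mildly curved relationship (all values are hypothetical).
x <- runif(200, min = 29, max = 36)
y <- 335 - 18.5 * x + 0.71 * x^2 + rnorm(200, mean = 0, sd = 20)
toy <- data.frame(x = x, y = y)

# Straight-line fit and quadratic fit; I() protects ^ inside the formula.
fit_lin  <- lm(y ~ x, data = toy)
fit_quad <- lm(y ~ x + I(x^2), data = toy)
print(coef(fit_quad))

# F-test comparing the two nested models.
print(anova(fit_lin, fit_quad))
```

The anova() comparison gives a rough check of whether the squared term improves the fit enough to justify the extra parameter.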
Linear Regression model for Time on App & Yearly amount spent:
Intercept Slope
19.21 39.83
This indicates that a 1-unit increase in Time on App is associated with an average increase of
39.83 in Yearly Amount Spent.
This relationship seems to be nonlinear. So, the model after a suitable transformation is:
Linear Regression model for Time on Website & Yearly amount spent:
Intercept Slope
506.9961 -0.2073
This indicates that a 1-unit increase in Time on Website is associated with an average decrease of
0.2073 in Yearly Amount Spent.
The code for plot is:
# Scatter plot of yearly amount spent against time on website
plot(DTC$`Time on Website`, DTC$`Yearly Amount Spent`,
main = "Yearly Amount vs Time on Website", xlab = "Time on Website",
ylab = "Yearly Amount")
# Overlay the fitted simple regression line
abline(lm(`Yearly Amount Spent` ~ `Time on Website`, data = DTC), col = "blue")
This relationship seems to be nonlinear. So, the model after a suitable transformation is:
Linear Regression model for Length of Membership & Yearly amount spent:
m4= lm(DTC$`Yearly Amount Spent` ~ DTC$`Length of Membership`)
m4
Intercept Slope
272.40 64.22
This indicates that a 1-unit increase in Length of Membership is associated with an average increase
of 64.22 in Yearly Amount Spent.
The code for plot is:
# Scatter plot of yearly amount spent against length of membership
plot(DTC$`Length of Membership`, DTC$`Yearly Amount Spent`,
main = "Yearly Amount vs Length of Membership", xlab = "Length of Membership",
ylab = "Yearly Amount")
# Overlay the fitted simple regression line
abline(lm(`Yearly Amount Spent` ~ `Length of Membership`, data = DTC), col =
"blue")
Diagnostics for E-commerce Dataset
To run diagnostics on the linear regression model in R (model1), we will use residual analysis
(a residuals vs. fitted values plot), a histogram, and a quantile-quantile (QQ) plot to check the major
assumptions of the linear model.
model1Resids= model1$residuals
model1Resids
model1fitted=model1$fitted.values
model1fitted
plot(model1fitted, model1Resids, main="Scatter Plot", xlab = "Fitted Values", ylab =
"Residuals")
The homoscedasticity assumption is violated. The variance (i.e. spread) of the residuals decreases as
the predicted values increase. Further, the model as constructed is having a hard time predicting
certain observations in the data, leading to outliers.
Histogram:
Figure 2 (Histogram)
The histogram for the residuals might appear normal at first glance but it is actually “long tailed”. It
is more common for a distribution like this to generate outliers than a normal distribution. This
suggests that our residuals are not normally distributed.
Quantile-Quantile plot:
qqnorm(model1Resids)
qqline(model1Resids)  # reference line that normal residuals would follow
The QQ plot suggests the same as the histogram. If the residuals were normal, the points
would cluster around the line. Instead, we see the effects of the outliers.
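The histogram and QQ plot checks can be sketched together as follows. This is an illustration on simulated data with deliberately heavy-tailed errors, not a re-run of model1; the names fit and res and all simulated values are assumptions.

```r
set.seed(3)

# Simulated regression with heavy-tailed (t-distributed) errors, so the
# residuals show the kind of long tails discussed above.
x <- rnorm(300)
y <- 2 + 3 * x + rt(300, df = 3)
fit <- lm(y ~ x)
res <- resid(fit)

# Histogram of residuals: look for long tails on either side.
hist(res, breaks = 30, main = "Histogram of residuals", xlab = "Residuals")

# QQ plot with a reference line: points drifting away from the line at the
# ends indicate tails heavier than a normal distribution.
qqnorm(res)
qqline(res, col = "blue")
```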
Model Selection for E-commerce Dataset
Now, we will consider the problem of choosing among several different models.
First, we train a linear model with response Yearly Amount Spent and a single predictor, Avg. Session
Length, using 10-fold cross-validation.
Next, we train a linear model with response Yearly Amount Spent and every other variable
as a predictor, again using 10-fold cross-validation.
Finally, we use best subset selection for Yearly Amount Spent and plot the result on the adjusted
R-squared (adjr2) scale.
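To make the idea of best subset selection concrete, the sketch below enumerates every subset of four predictors in base R and ranks them by adjusted R-squared. The data are simulated and the names x1 to x4 are hypothetical; this is a hand-rolled illustration of the criterion, not the leaps implementation.

```r
set.seed(4)
n <- 200

# Simulated data: x3 has no effect on y by construction.
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n), x4 = rnorm(n))
d$y <- 5 + 2 * d$x1 + 3 * d$x2 + 1.5 * d$x4 + rnorm(n)

predictors <- c("x1", "x2", "x3", "x4")
results <- data.frame(size = integer(), vars = character(), adjr2 = numeric())

# Fit one model per non-empty subset of predictors (15 models in total).
for (k in seq_along(predictors)) {
  for (vars in combn(predictors, k, simplify = FALSE)) {
    fit <- lm(reformulate(vars, response = "y"), data = d)
    results <- rbind(results, data.frame(
      size  = k,
      vars  = paste(vars, collapse = " + "),
      adjr2 = summary(fit)$adj.r.squared
    ))
  }
}

# Subset with the highest adjusted R-squared.
best <- results[which.max(results$adjr2), ]
print(best)
```

Adjusted R-squared penalizes extra predictors, so a variable that adds little explanatory power tends to be left out of the top-ranked subset.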
The code is:
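library(caret)  # train(), trainControl()
library(leaps)  # regsubsets()
# Linear model with all four predictors, evaluated by 10-fold cross-validation
Model2 = train(form = `Yearly Amount Spent` ~ `Avg. Session Length` +
`Time on App` + `Time on Website` + `Length of Membership`,
data = DTC, method = "lm", trControl = trainControl(method = "cv",
number = 10))
# Best subset selection over the same predictors
subsetmodel = regsubsets(`Yearly Amount Spent` ~ `Avg. Session Length` +
`Time on App` + `Time on Website` + `Length of Membership`, data =
DTC, nvmax = 5)
plot(subsetmodel, scale = "adjr2")
summary(subsetmodel)
# Number of predictors in the subset with the highest adjusted R-squared
bestM = which.max(summary(subsetmodel)$adjr2)
bestM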
Since the bestM value is 3, the code for the new model is:
data = DTC, method = "lm", trControl = trainControl(method = "cv", number
= 10))
ModelN
The RMSE and MAE values are slightly lower than those of the previous model, so this is the
better model.
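To show what the cross-validated RMSE and MAE measure, here is a minimal base-R sketch of 10-fold cross-validation on simulated data. The helper cv_errors, the variables x1 and x2, and all values are hypothetical; the sketch averages per-fold errors and is not a reproduction of caret's internals.

```r
set.seed(5)
n <- 200
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
d$y <- 1 + 2 * d$x1 + 0.5 * d$x2 + rnorm(n)

# Hypothetical helper: k-fold cross-validated RMSE and MAE for an lm() formula.
cv_errors <- function(formula, data, k = 10) {
  # Randomly assign each row to one of k folds of (nearly) equal size.
  folds <- sample(rep(seq_len(k), length.out = nrow(data)))
  response <- all.vars(formula)[1]
  rmse <- numeric(k)
  mae <- numeric(k)
  for (i in seq_len(k)) {
    fit <- lm(formula, data = data[folds != i, ])      # train on k - 1 folds
    held_out <- data[folds == i, ]                     # test on the remaining fold
    err <- held_out[[response]] - predict(fit, newdata = held_out)
    rmse[i] <- sqrt(mean(err^2))
    mae[i] <- mean(abs(err))
  }
  c(RMSE = mean(rmse), MAE = mean(mae))
}

small <- cv_errors(y ~ x1, d)
full  <- cv_errors(y ~ x1 + x2, d)
print(rbind(small, full))
```

Because every prediction is made on rows the model did not see during fitting, lower values here indicate better out-of-sample accuracy rather than a closer fit to the training data.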
Predictions and Summary
To make Predictions or calculate predicted values, we will use the following code:
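A minimal sketch of how predictions can be obtained from a fitted linear model with predict(). The training data are simulated, the two predictors are chosen only for illustration, and the names train_sim, fit, and new_customers are hypothetical; this is not the project's final model.

```r
set.seed(6)
n <- 100

# Simulated training data (all values are hypothetical).
train_sim <- data.frame(
  `Time on App`          = rnorm(n, mean = 12,  sd = 1),
  `Length of Membership` = rnorm(n, mean = 3.5, sd = 1),
  check.names = FALSE
)
train_sim$`Yearly Amount Spent` <- 40 * train_sim$`Time on App` +
  60 * train_sim$`Length of Membership` + rnorm(n, mean = 0, sd = 10)

fit <- lm(`Yearly Amount Spent` ~ `Time on App` + `Length of Membership`,
          data = train_sim)

# New customers must use the same column names as the training data.
new_customers <- data.frame(
  `Time on App`          = c(11, 13),
  `Length of Membership` = c(2, 5),
  check.names = FALSE
)

# Point predictions with 95% prediction intervals (columns fit, lwr, upr).
pred <- predict(fit, newdata = new_customers, interval = "prediction")
print(pred)
```

The prediction interval is wider than a confidence interval for the mean because it also accounts for the scatter of individual customers around the regression surface.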