MLM Report Customer Churn

Uploaded by Manish Mohapatra

Project 1: Report (Sample)

Project Title: Segmentation of Consumer Data

1. Objectives:
PO2 | PS2: Identification of Appropriate Number of Segments or Clusters

2. Data Description:

2.1 Data Source, Size, Shape:

2.1.1 Data Source:

Kaggle -> Predict Purity and Price of Honey

https://www.kaggle.com/datasets/stealthtechnologies/predict-purity-and-price-of-honey

2.1.2 Data Size:

15,839 KB (approximately 15.8 MB)

2.1.3 Data Shape:

The shape of the dataset is (247903, 11): 247,903 rows (records) and 11 columns (variables).

2.1. Data Source, Size, Shape


 2.1.1. Data Source (Website Link): The dataset was obtained from Dataset Source
Link.
 2.1.2. Data Size: The size of the dataset is approximately XX MB.
 2.1.3. Data Shape (Dimension: Number of Variables | Number of Records): The
dataset consists of 14 variables and XX records.

Description of Variables

 Row Number: Sequential numbering of rows in the dataset.


 Customer Id: Unique identifier for each customer.
 Surname: Last name of the customer.
 Credit Score: Numerical variable representing the credit score of the
customer.
 Geography: Categorical variable indicating the country where the customer
resides.
 Gender: Categorical variable indicating the gender of the customer.
 Age: Numerical variable representing the age of the customer.
 Tenure: Numerical variable representing the number of years the customer
has been with the company.
 Balance: Numerical variable representing the balance in the customer's
account.
 Number Of Products: Numerical variable representing the number of
products the customer has.
 Has Credit Card: Binary variable indicating whether the customer has a credit
card (1 for Yes, 0 for No).
 Is Active Member: Binary variable indicating whether the customer is an
active member (1 for Yes, 0 for No).
 Estimated Salary: Numerical variable representing the estimated salary of the
customer.
 Exited: Binary variable indicating whether the customer has churned (1 for
Yes, 0 for No).

2.2. Description of Variables


2.2.1. Index Variable(s)
 Index: Sequential numbering of rows in the dataset. (I1, I2, ...)
2.2.2. Variables or Features having Categories | Categorical Variables or Features (CV)
2.2.2.1. Variables or Features having Nominal Categories | Categorical Variables or Features -
Nominal Type
 Geography: Country where the customer resides. (CNV1)
 Gender: Gender of the customer. (CNV2)
 Number Of Products: Number of products the customer has. (CNV3)
 Has Credit Card: Indicates whether the customer has a credit card. (CNV4)
 Is Active Member: Indicates whether the customer is an active member. (CNV5)
 Exited: Indicates whether the customer has churned. (CNV6)
2.2.2.2. Variables or Features having Ordinal Categories | Categorical Variables or Features -
Ordinal Type
 None
2.2.3. Non-Categorical Variables or Features
 Credit Score: Numerical variable representing the credit score of the customer.
(NCV1)
 Age: Numerical variable representing the age of the customer. (NCV2)
 Tenure: Numerical variable representing the number of years the customer has been
with the company. (NCV3)
 Balance: Numerical variable representing the balance in the customer's account.
(NCV4)
 Estimated Salary: Numerical variable representing the estimated salary of the
customer. (NCV5)


2.2 Description of Variables

2.2.1. Index Variable: index

2.2.2. Categorical Variables: Pollen_analysis

2.2.2.1. Nominal Variables: Pollen_analysis

2.2.2.2. Ordinal Variables: There are no explicitly ordinal variables in the dataset.

2.2.3. Non-Categorical Variables: CS (Color Score), Density, WC (Water Content), pH, EC (Electrical
Conductivity), F (Fructose Level), G (Glucose Level), Viscosity, Purity and Price.

2.3. Descriptive Statistics

2.3.1 Descriptive Statistics: Categorical Variables or Features

2.3.1.1. Count | Frequency Statistics

2.3.1.2. Proportion (Relative Frequency) Statistics

2.3.2. Descriptive Statistics: Non-Categorical Variables or Features

2.3.2.1. Measures of Central Tendency

2.3.2.2. Measures of Dispersion

2.3.2.3. Correlation Statistics (with Test of Correlation)

3. Analysis of Data
3.1. Data Pre-Processing

3.1.1 Missing Data Statistics and Treatment

Data Transformation & Rescaling [Treatment of Outliers]

Treatment of Outliers: There are no significant outliers in this dataset.

Pre-Processed Dataset

1. Pre-Processed Categorical Data Subset: df_cat_ppd

2. Pre-Processed Non-Categorical Data Subset: df_noncat_ppd

3. Pre-Processed Dataset: df_ppd

The pre-processed dataset, df_ppd, encompasses all variables after outlier treatment and
preprocessing procedures.

3.1.1.1.1 Missing Data Statistics: The maximum number of missing columns in any record is 0.

3.1.1.1.2 Missing Data Treatment: Records

3.1.1.1.2.1. Removal of Records with More Than 50% Missing Data: None

3.1.1.2.1. Missing Data Statistics (Categorical Variables or Features): None

3.1.1.2.2. Missing Data Treatment: Categorical Variables or Features

3.1.1.2.2.1. Removal of Variables or Features with More Than 50% Missing Data: None

3.1.1.3.1. Missing Data Statistics (Non-Categorical Variables or Features): None

3.1.1.3.2. Missing Data Treatment: Non-Categorical Variables or Features

3.1.1.3.2.1. Removal of Variables or Features with More Than 50% Missing Data: None

3.1.1.3.2.2. Imputation of Missing Data using Descriptive Statistics: Mean | Median:

For imputing missing data, I used two common strategies based on descriptive statistics: mean and median imputation. Given the absence of outliers in the dataset, both methods provide robust estimates of the central tendency of the data. They preserve the overall distribution of each variable while filling in missing values, ensuring that the analysis is not unduly influenced by incomplete data.
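This imputation step can be sketched with pandas; the frame and column names below are illustrative, not taken from the actual dataset:

```python
import pandas as pd

# Toy frame with missing values; column names are illustrative.
df = pd.DataFrame({"Age": [35.0, None, 41.0, 29.0],
                   "Balance": [1000.0, 2500.0, None, 4000.0]})

# Mean imputation where the distribution is roughly symmetric,
# median imputation where skew or outliers are suspected.
df["Age"] = df["Age"].fillna(df["Age"].mean())
df["Balance"] = df["Balance"].fillna(df["Balance"].median())
```

Because there are no outliers here, the two strategies give very similar central estimates, which is why either is acceptable for this dataset.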

3.1.2 Numeric Coding of Data

Numerical Encoding of Categorical Data

1. Since categorical variables in the dataset are nominal, we apply label encoding to transform them
into numerical representations.

2. Label Encoding: Label encoding assigns a unique numerical label to each category within a
categorical variable.

3. Mapping: Below is the mapping of original categories to their corresponding numerical labels.
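A minimal label-encoding sketch, assuming pandas; the category values echo the Geography variable described earlier, and the mapping shown is illustrative rather than the report's actual mapping:

```python
import pandas as pd

# Illustrative nominal column.
df = pd.DataFrame({"Geography": ["France", "Spain", "Germany", "France"]})

# Label encoding: assign each category a unique integer code.
codes, uniques = pd.factorize(df["Geography"], sort=True)
df["Geography_oe"] = codes

# Mapping of original categories to their numerical labels.
mapping = {cat: i for i, cat in enumerate(uniques)}
print(mapping)  # {'France': 0, 'Germany': 1, 'Spain': 2}
```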
3.1.3. Outlier Statistics and Treatment (Scaling | Transformation) [No Outliers]

3.1.3.1.1. Outlier Statistics (Non-Categorical Variables or Features): CS (Color Score), Density, WC (Water Content), pH, EC (Electrical Conductivity), F (Fructose Level), G (Glucose Level), Viscosity, Purity and Price.

3.1.3.1.2.1 Outlier Treatment: Non-Categorical Variables or Features: Not Applicable

3.1.3.1.2.2 Normalization using Min-Max Scaler: CS (Color Score), Density, WC (Water Content), pH,
EC (Electrical Conductivity), F (Fructose Level), G (Glucose Level), Viscosity, Purity and Price.
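Min-Max normalization can be sketched directly in pandas (equivalent in effect to scikit-learn's MinMaxScaler); the columns and values below are a small illustrative subset:

```python
import pandas as pd

# Illustrative subset of non-categorical columns.
df = pd.DataFrame({"pH": [3.2, 4.5, 6.1],
                   "Viscosity": [200.0, 350.0, 500.0]})

# Min-Max normalization: rescale every column to the [0, 1] range.
df_norm = (df - df.min()) / (df.max() - df.min())
```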

3.1.3.1.2.3. Data Bifurcation [Training & Testing Datasets]

1. The dataset is partitioned into two subsets: training and testing datasets.

2. The training dataset contains 75% of the complete data.

3. The testing dataset contains the remaining 25%.
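The 75/25 bifurcation can be sketched as follows (a pandas-only equivalent of scikit-learn's train_test_split with test_size=0.25; the frame is illustrative):

```python
import pandas as pd

# Illustrative dataset of 100 records.
df = pd.DataFrame({"x": range(100)})

# Randomly sample 75% for training; the remaining 25% becomes the test set.
train = df.sample(frac=0.75, random_state=42)
test = df.drop(train.index)
```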


3.2. Data Analysis

3.2.1.1. PO1 | PS1:: Unsupervised Machine Learning Clustering Algorithm: K-Means (Base Model) |
Metrics Used - Euclidean Distance

3.2.1.2. PO1 | PS1:: Unsupervised Machine Learning Clustering Algorithms: {DBSCAN | BIRCH |
OPTICS} (Comparison Models: At Least One) | Metrics Used - Euclidean Distance
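A sketch of fitting the base model and one comparison model, assuming scikit-learn; the two-blob data and the eps/min_samples values are illustrative, not the report's settings:

```python
import numpy as np
from sklearn.cluster import KMeans, DBSCAN

# Two well-separated synthetic blobs (illustrative data only).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.3, (50, 2)),
               rng.normal(5.0, 0.3, (50, 2))])

# Base model: K-Means with Euclidean distance.
km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Comparison model: DBSCAN infers the cluster count from density
# (eps and min_samples here are assumed values for this toy data).
db_labels = DBSCAN(eps=1.0, min_samples=5).fit_predict(X)
```

Unlike K-Means, DBSCAN does not need k in advance and can label low-density points as noise (-1), which is why it serves as a useful comparison model.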
3.2.2.1.1. PO2 | PS2:: Clustering Model Performance Evaluation: Silhouette Score | Davies-Bouldin Score (Base Model: K-Means)

To determine the best clustering model based on the Davies-Bouldin (DB) score and Silhouette (SS)
score, we need to consider the following:

Silhouette Score (SS): A higher Silhouette score indicates better separation between clusters. The
Silhouette score ranges from -1 to 1, where a score closer to 1 indicates better clustering.

Davies-Bouldin Score (DB): A lower Davies-Bouldin score indicates better clustering. The DB score
measures the average similarity between each cluster and its most similar cluster, where a lower
score indicates better separation between clusters.

For k=2: SS score is 0.589 and DB score is 0.552. For k=3: SS score is 0.524 and DB score is 0.603. For
k=4: SS score is 0.462 and DB score is 0.684. For k=5: SS score is 0.430 and DB score is 0.732.

Since we want to maximize the Silhouette score and minimize the Davies-Bouldin score, the best
clustering model would be the one with the highest Silhouette score and the lowest Davies-Bouldin
score.

In this case, for k=2, the clustering model has the highest Silhouette score (0.589) and the lowest
Davies-Bouldin score (0.552). Therefore, the clustering model with k=2 is likely the best choice based
on both the Silhouette and Davies-Bouldin scores.
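The selection logic described above (maximize SS, minimize DB across candidate k) can be sketched as follows, assuming scikit-learn; the three-blob synthetic data is illustrative, so the scores will differ from those reported:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, davies_bouldin_score

# Three well-separated synthetic blobs (illustrative, not the report's data).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(i * 4.0, 0.5, (40, 2)) for i in range(3)])

# Evaluate candidate cluster counts: maximize SS, minimize DB.
scores = {}
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = (silhouette_score(X, labels),
                 davies_bouldin_score(X, labels))

# A simple combined criterion (assumed here): highest SS minus DB.
best_k = max(scores, key=lambda k: scores[k][0] - scores[k][1])
```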

However, we select the k=3 model as the final clustering model for this subset, consistent with the elbow analysis.

Descriptive statistics analysis:


1. *Geography_oe* and *Gender_oe*:
- *Geography_oe*: The majority of customers (mean ≈ 0.74) are associated with one of the
encoded geographical regions.
- *Gender_oe*: The average gender encoding is around 0.55, indicating a relatively
balanced representation of male and female customers.

2. *Number Of Products_oe*, *Has Credit Card_oe*, and *Is Active Member_oe*:


- *Number Of Products_oe*: On average, customers have approximately 0.53 products.
- *Has Credit Card_oe*: About 70% of customers have a credit card (mean ≈ 0.71).
- *Is Active Member_oe*: Roughly 52% of customers are active members (mean ≈ 0.52).

3. *Exited_oe*:
- This binary column indicates whether a customer has exited the service (1) or not (0).
- Only about 20% of customers have exited (mean ≈ 0.20).
4. *Tenure*, *Balance*, *Estimated Salary*, *Age_mmnorm*, and *Credit Score_mmnorm*:
- *Tenure*: The average customer tenure is approximately 5 years.
- *Balance*: The mean balance is around $77,852.
- *Estimated Salary*: Customers' estimated salaries average at $101,012.
- *Age_mmnorm*: The normalized age (between 0 and 1) has an average value of
approximately 0.60.
- *Credit Score_mmnorm*: The normalized credit score averages around 0.28.

5. *Variability*:
- The standard deviation (std) provides a measure of variability for each feature.
- For example, the high std in *Balance* suggests significant variation in account balances.

6. *Percentiles*:
- The 25th percentile (25%) represents the lower quartile, while the 75th percentile (75%)
represents the upper quartile.
- These values give insights into the distribution of data.

2) Analysis of the count and frequency table:

1. *Variable Category*:

- The first column represents different variable categories.

2. *Count*:

- The second column indicates the count or frequency of each category.


- For example, the category "Index" appears once, and other categories have varying counts.

3. *Frequency*:

- The third column represents the relative frequency (proportion) of each category.

- It shows how often each category occurs in the dataset.

4. *Observations*:

- The dataset seems to contain a mix of categorical and numerical variables.

- Categories like "Exited," "Has Credit Card," and "Is Active Member" are binary (0 or 1).

- The "Exited" category has a relatively low frequency (around 20%).

- Other numerical variables (e.g., "Tenure," "Balance," "Estimated Salary," "Age_mmnorm," and "Credit

Score_mmnorm") are not explicitly labeled but likely correspond to continuous data.

This analysis is based on the provided data; further context or domain knowledge would support a deeper interpretation.

3) Analysis of the dataset sample:


1. *Credit Score*:

- The credit scores range from 502 to 850.

- The average credit score appears to be around 650.

2. *Geography*:

- Customers are primarily from three countries: France, Spain, and Germany.

- The majority of customers are from France.

3. *Gender*:

- The dataset includes both male and female customers.

- The gender distribution is not specified, but we can infer it from the "Female" entries.

4. *Age*:

- Customer ages range from 28 to 43 years.

- The average age seems to be around 35-40 years.


5. *Tenure*:

- Customer tenure (duration of association) varies from 1 to 8 years.

- The average tenure is approximately 5 years.

6. *Balance*:

- Account balances range from 0 to 159,660.80.

- The mean balance is around $77,852.

7. *Number Of Products*:

- Customers have 1 to 3 products.

- The average number of products is approximately 2.

8. *Has Credit Card* and *Is Active Member*:

- Most customers have a credit card (1.0) and are active members (1.0).

9. *Estimated Salary*:

- Estimated salaries vary from $38,190.78 to $113,931.57.


- The average estimated salary is around $101,000.

10. *Exited*:

- The "Exited" column indicates whether a customer has exited the service (1) or not (0).

- About 20% of customers have exited.



5) Analysis of the correlation matrix and coefficients:

1. *Correlation Matrix*:

- The correlation matrix shows the relationships between different features in the dataset.

- Each cell represents the correlation coefficient between two variables.

- Here are the interpretations for the given features:

- *Tenure*:

- Weak negative correlation with *Credit Score_mmnorm* (-0.01).

- No significant correlation with other features.

- *Balance*:

- Weak positive correlation with *Credit Score_mmnorm* (0.03).

- No significant correlation with other features.


- *Estimated Salary*:

- Weak negative correlation with *Age_mmnorm* (-0.01).

- No significant correlation with other features.

- *Age_mmnorm*:

- No significant correlation with other features.

- *Credit Score_mmnorm*:

- Weak positive correlation with *Balance* (0.03).

- Weak negative correlation with *Tenure* (-0.01).

2. *Correlation Coefficients*:

- The correlation coefficients provide a numerical value for the strength and direction of the relationship between two variables.

- A coefficient of 1 indicates a perfect positive correlation, while -1 indicates a perfect negative correlation.

- Here are some notable coefficients:

- *Balance* and *Credit Score_mmnorm*: A weak positive correlation (0.03) suggests that higher balances might be associated with slightly better credit scores.

- *Tenure* and *Credit Score_mmnorm*: A weak negative correlation (-0.01) implies that longer tenure is very slightly associated with lower credit scores, though the effect is negligible.

Correlation does not imply causation: these relationships reflect statistical patterns and may not indicate any causal link.
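A test of correlation of the kind referenced in section 2.3.2.3 can be sketched with scipy's pearsonr, which returns both the coefficient and a p-value; the data below is synthetic and illustrative, loosely echoing the Balance and Credit Score variables:

```python
import numpy as np
from scipy.stats import pearsonr

# Synthetic, weakly related variables (illustrative values only).
rng = np.random.default_rng(1)
balance = rng.normal(77852, 30000, 500)
credit = 650 + 0.0001 * (balance - balance.mean()) + rng.normal(0, 50, 500)

# pearsonr returns the coefficient r and the p-value of the test of correlation.
r, p = pearsonr(balance, credit)
```

A coefficient near zero with a large p-value, as here, is consistent with the report's conclusion that these relationships are weak and not practically meaningful.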

K-Means: Chart Analysis

1. Chart Analysis:
o The image displays a line graph titled “Elbow Curve for Optimal K.”
o Key elements of the chart:
 X-axis: Represents the values of K (ranging from 1 to 10).
 Y-axis: Represents the Within Cluster Sum Squared Distance (WCSS),
scaled at an exponential level (indicated by ‘1e13’ at the top left
corner).
 Blue ‘x’ markers indicate data points corresponding to each value of K.
 The curve shows a sharp decline from K=1 until around K=3 or K=4,
after which it flattens out significantly.
 The “elbow” point (where the curve starts to look like a straight line)
suggests the optimal K for clustering.
2. Interpretation:
o The elbow method helps us determine the optimal number of clusters by
observing the rate of decrease in WCSS as K increases.
o In this case, the curve resembles an “elbow,” indicating that K=3 or K=4
might be the optimal choice for clustering based on the dataset.
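The elbow computation behind such a chart can be sketched as follows, assuming scikit-learn; the synthetic data is illustrative, so the WCSS values will not match the chart's 1e13 scale:

```python
import numpy as np
from sklearn.cluster import KMeans

# Three synthetic blobs (illustrative data only).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(i * 5.0, 0.5, (50, 2)) for i in range(3)])

# WCSS (inertia) for K = 1..10; the "elbow" marks where gains flatten out.
wcss = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
        for k in range(1, 11)]
```

Plotting K against these values reproduces the curve's shape: a sharp drop up to the true cluster count, then a near-flat tail.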
SS and DB Scores:
1. Silhouette Score (SS):
o The Silhouette Score measures how similar an object is to its own cluster
(cohesion) compared to other clusters (separation).
o It ranges from -1 to 1:
 A high positive value (close to 1) indicates that the sample is well
clustered.
 A value near 0 suggests overlapping clusters.
 A negative value indicates that the sample might have been assigned to
the wrong cluster.
o In this case, the Silhouette Score for 2 clusters is approximately 0.46.
o Interpretation:
 A score above 0.5 is generally considered good, so this score falls
slightly short.
 It implies that the data points within the clusters are not very well
separated.
2. Davies-Bouldin Index (DB):
o The Davies-Bouldin Index evaluates the average similarity between each
cluster and its most similar cluster.
o Lower values indicate better separation between clusters.
o In this case, the DB Index for 2 clusters is approximately 0.92.
o Interpretation:
 A lower DB Index suggests better-defined clusters.
 However, a value close to 1 indicates some overlap or suboptimal
clustering.
3. Overall Assessment:
o The SS and DB scores provide complementary insights:
 SS focuses on individual data points’ cohesion and separation.
 DB considers the overall cluster separation.
o Considering both scores, the K-means model with 2 clusters may not be
ideal for this data.
o Other values of K, or additional metrics, could be evaluated to determine
the optimal number of clusters.
