MLM Report Customer Churn
1. Objectives:
PO2 | PS2: Identification of Appropriate Number of Segments or Clusters
2. Data Description:
2.1 Data Source, Size, Records, Shape:
https://www.kaggle.com/datasets/stealthtechnologies/predict-purity-and-price-of-honey
Size: The dataset contains 247903 rows (data points) and 11 columns (variables).
Shape: The shape of the dataset is (247903, 11), i.e., 247903 rows and 11 columns.
Description of Variables
2.2.2.2 Ordinal Variables: There are no explicitly ordinal variables in the dataset.
2.2.3. Non-Categorical Variables: CS (Color Score), Density, WC (Water Content), pH, EC (Electrical
Conductivity), F (Fructose Level), G (Glucose Level), Viscosity, Purity and Price.
3. Analysis of Data
3.1. Data Pre-Processing
Treatment of Outliers:
Pre-Processed Dataset
The pre-processed dataset, df_ppd, encompasses all variables after outlier treatment and
preprocessing procedures.
3.1.1.1.1 Missing Data Statistics: The maximum number of missing columns in any record is 0.
3.1.1.1.2.1. Removal of Records with More Than 50% Missing Data: None
3.1.1.2.2.1. Removal of Variables or Features with More Than 50% Missing Data: None
For imputing missing data in our dataset, we considered two common strategies: mean and median
imputation based on descriptive statistics. Given the absence of outliers in the dataset, both mean
and median imputation provide robust estimates of the central tendency of the data. These
methods allow us to maintain the overall distribution of the variables while filling in missing values,
ensuring that the analysis is not unduly influenced by incomplete data.
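The two imputation strategies can be sketched with pandas; the column names and values below are illustrative stand-ins, not the actual dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical columns with missing entries (not the real dataset)
df = pd.DataFrame({
    "pH": [6.8, np.nan, 7.1, 6.9],
    "Viscosity": [5000.0, 5200.0, np.nan, 5100.0],
})

# Mean imputation: replace each missing value with its column mean
df_mean = df.fillna(df.mean())

# Median imputation: replace each missing value with its column median
df_median = df.fillna(df.median())
```

With no outliers present, the two strategies produce nearly identical fills, which is why either is acceptable here.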
1. Since categorical variables in the dataset are nominal, we apply label encoding to transform them
into numerical representations.
2. Label Encoding: Label encoding assigns a unique numerical label to each category within a
categorical variable.
3. Mapping: Below is the mapping of original categories to their corresponding numerical labels.
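A minimal sketch of how such a category-to-label mapping can be produced with pandas; the category names are hypothetical examples, not taken from the report.

```python
import pandas as pd

# Hypothetical nominal variable (example categories, not from the dataset)
s = pd.Series(["Clover", "Wildflower", "Clover", "Acacia"])

# factorize assigns integer labels 0, 1, 2, ... in order of first appearance
codes, uniques = pd.factorize(s)

# Mapping of original categories to their numerical labels
mapping = {cat: i for i, cat in enumerate(uniques)}
```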
3.1.3. Outlier Statistics and Treatment (Scaling | Transformation) [No Outliers]
3.1.3.1.2.2 Normalization using Min-Max Scaler: CS (Color Score), Density, WC (Water Content), pH,
EC (Electrical Conductivity), F (Fructose Level), G (Glucose Level), Viscosity, Purity and Price.
1. The dataset is partitioned into two subsets: training and testing datasets.
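The normalization and partitioning steps above can be sketched with scikit-learn; the synthetic matrix and the 80/20 split ratio are assumptions standing in for the actual features.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))  # stand-in for the numeric features

# Min-Max normalization maps each column onto [0, 1]
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)

# Partition into training and testing subsets (80/20 ratio assumed)
X_train, X_test = train_test_split(X_scaled, test_size=0.2, random_state=42)
```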
3.2.1.1. PO1 | PS1: Unsupervised Machine Learning Clustering Algorithm: K-Means (Base Model) |
Metric Used: Euclidean Distance
3.2.1.2. PO1 | PS1: Unsupervised Machine Learning Clustering Algorithms: {DBSCAN | BIRCH |
OPTICS} (Comparison Models: At Least One) | Metric Used: Euclidean Distance
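A sketch of fitting the base K-Means model alongside two comparison models; the synthetic blobs and the DBSCAN hyperparameters (eps, min_samples) are illustrative assumptions, not tuned values from the report.

```python
from sklearn.cluster import DBSCAN, Birch, KMeans
from sklearn.datasets import make_blobs

# Synthetic, well-separated blobs stand in for the real feature matrix
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# Base model: K-Means (uses Euclidean distance internally)
kmeans_labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)

# Comparison models; eps and min_samples are guesses that would need tuning
dbscan_labels = DBSCAN(eps=1.0, min_samples=5).fit_predict(X)
birch_labels = Birch(n_clusters=3).fit_predict(X)
```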
3.2.2.1.1. PO2 | PS2: Clustering Model Performance Evaluation: Silhouette Score | Davies-Bouldin
Score (Base Model: K-Means)
To determine the best clustering model based on the Davies-Bouldin (DB) score and Silhouette (SS)
score, we need to consider the following:
Silhouette Score (SS): A higher Silhouette score indicates better separation between clusters. The
Silhouette score ranges from -1 to 1, where a score closer to 1 indicates better clustering.
Davies-Bouldin Score (DB): A lower Davies-Bouldin score indicates better clustering. The DB score
measures the average similarity between each cluster and its most similar cluster, where a lower
score indicates better separation between clusters.
For k=2: SS score is 0.589 and DB score is 0.552. For k=3: SS score is 0.524 and DB score is 0.603. For
k=4: SS score is 0.462 and DB score is 0.684. For k=5: SS score is 0.430 and DB score is 0.732.
Since we want to maximize the Silhouette score and minimize the Davies-Bouldin score, the best
clustering model would be the one with the highest Silhouette score and the lowest Davies-Bouldin
score.
In this case, for k=2, the clustering model has the highest Silhouette score (0.589) and the lowest
Davies-Bouldin score (0.552). Therefore, the clustering model with k=2 is likely the best choice based
on both the Silhouette and Davies-Bouldin scores.
However, we will consider the k=3 clustering model as the best model for our clustering subset.
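The Silhouette and Davies-Bouldin evaluation described above can be sketched as follows, with synthetic blobs standing in for the pre-processed clustering subset:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import davies_bouldin_score, silhouette_score

# Synthetic data in place of the actual subset
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

scores = {}
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    # Higher Silhouette and lower Davies-Bouldin indicate better clustering
    scores[k] = (silhouette_score(X, labels), davies_bouldin_score(X, labels))

best_k = max(scores, key=lambda k: scores[k][0])  # k with highest Silhouette
```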
3. *Exited_oe*:
- This binary column indicates whether a customer has exited the service (1) or not (0).
- Only about 20% of customers have exited (mean ≈ 0.20).
4. *Tenure*, *Balance*, *Estimated Salary*, *Age_mmnorm*, and *Credit Score_mmnorm*:
- *Tenure*: The average customer tenure is approximately 5 years.
- *Balance*: The mean balance is around $77,852.
- *Estimated Salary*: Customers' estimated salaries average $101,012.
- *Age_mmnorm*: The normalized age (between 0 and 1) has an average value of approximately 0.60.
- *Credit Score_mmnorm*: The normalized credit score averages around 0.28.
5. *Variability*:
- The standard deviation (std) provides a measure of variability for each feature.
- For example, the high std in *Balance* suggests significant variation in account balances.
6. *Percentiles*:
- The 25th percentile (25%) represents the lower quartile, while the 75th percentile (75%)
represents the upper quartile.
- These values give insights into the distribution of data.
1. *Variable Category*:
- The first column lists each variable category.
2. *Count*:
- The second column gives the number of records in each category.
3. *Frequency*:
- The third column represents the relative frequency (proportion) of each category.
4. *Observations*:
- Categories like "Exited," "Has Credit Card," and "Is Active Member" are binary (0 or 1).
- Other numerical variables (e.g., "Tenure," "Balance," "Estimated Salary," "Age_mmnorm," and "Credit
Score_mmnorm") are not explicitly labeled but likely correspond to continuous data.
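The descriptive statistics and frequency tables behind the observations above can be reproduced in outline with pandas; the small frame below is an illustrative stand-in for the churn subset.

```python
import pandas as pd

# Tiny illustrative frame; real values come from the churn subset
df = pd.DataFrame({
    "Exited": [0, 1, 0, 0, 1],
    "Tenure": [5, 3, 8, 2, 6],
})

# count, mean, std, min, quartiles (25%/50%/75%), max per column
desc = df.describe()

# Relative frequency (proportion) of each category of a binary variable
freq = df["Exited"].value_counts(normalize=True)
```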
This analysis is based on the provided data; further context or domain knowledge would support a deeper understanding.
1. *Credit Score*:
2. *Geography*:
- Customers are primarily from three countries: France, Spain, and Germany.
- The majority of customers are from France.
3. *Gender*:
- The gender distribution is not specified directly, but it can be inferred from the "Female" entries.
4. *Age*:
5. *Tenure*:
6. *Balance*:
7. *Number Of Products*:
8. *Has Credit Card | Is Active Member*:
- Most customers have a credit card (1.0) and are active members (1.0).
9. *Estimated Salary*:
10. *Exited*:
- The "Exited" column indicates whether a customer has exited the service (1) or not (0).
1. *Correlation Matrix*:
- The correlation matrix shows the relationships between different features in the dataset.
- *Tenure*:
- *Balance*:
- *Age_mmnorm*:
- *Credit Score_mmnorm*:
2. *Correlation Coefficients*:
- The correlation coefficients provide a numerical value for the strength and direction of the
correlation.
- *Tenure* and *Credit Score_mmnorm*: A weak negative correlation (-0.01) implies that tenure has essentially no linear relationship with credit score.
Remember that correlation does not imply causation: these relationships reflect statistical patterns
and may not indicate any causal link.
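A sketch of how such a correlation matrix can be computed with pandas; the columns echo the report's features, but the values are synthetic placeholders.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-ins for a few of the report's features
df = pd.DataFrame({
    "Tenure": rng.integers(0, 11, 200),
    "Balance": rng.normal(77_852, 30_000, 200),
    "Credit Score_mmnorm": rng.random(200),
})

# Pairwise Pearson correlation coefficients, each in [-1, 1]
corr = df.corr()
```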
K-Means
1. Chart Analysis:
- The image displays a line graph titled "Elbow Curve for Optimal K."
- Key elements of the chart:
  - X-axis: Represents the values of K (ranging from 1 to 10).
  - Y-axis: Represents the Within-Cluster Sum of Squared Distances (WCSS), scaled exponentially (indicated by '1e13' at the top left corner).
  - Blue 'x' markers indicate data points corresponding to each value of K.
  - The curve shows a sharp decline from K=1 until around K=3 or K=4, after which it flattens out significantly.
  - The "elbow" point (where the curve starts to look like a straight line) suggests the optimal K for clustering.
2. Interpretation:
- The elbow method helps determine the optimal number of clusters by observing the rate of decrease in WCSS as K increases.
- In this case, the curve resembles an "elbow," indicating that K=3 or K=4 might be the optimal choice for clustering based on the dataset.
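The WCSS values behind the elbow curve can be computed as a sketch, using the `inertia_` attribute of a fitted KMeans model; synthetic blobs stand in for the actual dataset.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data in place of the real feature matrix
X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

# WCSS for K = 1..10; scikit-learn exposes it as the fitted model's inertia_
wcss = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
        for k in range(1, 11)]
# The "elbow" is where successive drops in WCSS level off.
```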
SS AND DB SCORE:
1. Silhouette Score (SS):
- The Silhouette Score measures how similar an object is to its own cluster (cohesion) compared to other clusters (separation).
- It ranges from -1 to 1:
  - A high positive value (close to 1) indicates that the sample is well clustered.
  - A value near 0 suggests overlapping clusters.
  - A negative value indicates that the sample might have been assigned to the wrong cluster.
- In our case, the Silhouette Score for 2 clusters is approximately 0.46.
- Interpretation:
  - A score above 0.5 is generally considered good, so our score falls slightly short.
  - It implies that the data points within the clusters are not very well separated.
2. Davies-Bouldin Index (DB):
- The Davies-Bouldin Index evaluates the average similarity between each cluster and its most similar cluster.
- Lower values indicate better separation between clusters.
- In our case, the DB Index for 2 clusters is approximately 0.92.
- Interpretation:
  - A lower DB Index suggests better-defined clusters.
  - However, a value close to 1 indicates some overlap or suboptimal clustering.
3. Overall Assessment:
- The SS and DB scores provide complementary insights:
  - SS focuses on individual data points' cohesion and separation.
  - DB considers the overall cluster separation.
- Considering both scores, the K-Means model with 2 clusters might not be ideal for this data.
- Other values of K, or additional metrics, may be worth exploring to determine the optimal number of clusters.