MLM Report Customer Churn
1. Objectives:
PO2 | PS2: Identification of Appropriate Number of Segments or Clusters
2. Data Description:
2.1 Data Source, Size, Records, Shape:
https://www.kaggle.com/datasets/stealthtechnologies/predict-purity-and-price-of-honey
Size: The dataset contains 247903 rows (data points) and 11 columns (variables).
Shape: The shape of the dataset is (247903, 11), i.e., 247903 rows and 11 columns.
Description of Variables
2.2.2.2 Ordinal Variables: There are no explicitly ordinal variables in the dataset.
2.2.3. Non-Categorical Variables: CS (Color Score), Density, WC (Water Content), pH, EC (Electrical
Conductivity), F (Fructose Level), G (Glucose Level), Viscosity, Purity and Price.
3. Analysis of Data
3.1. Data Pre-Processing
Treatment of Outliers:
Pre-Processed Dataset
The pre-processed dataset, df_ppd, encompasses all variables after outlier treatment and
preprocessing procedures.
3.1.1.1.1 Missing Data Statistics: The maximum number of missing columns in any record is 0.
3.1.1.1.2.1. Removal of Records with More Than 50% Missing Data: None
3.1.1.2.2.1. Removal of Variables or Features with More Than 50% Missing Data: None
For imputing missing data in our dataset, we considered two common strategies: mean and median
imputation based on descriptive statistics. Given the absence of outliers in the dataset, both mean
and median imputation provide robust estimates of the central tendency of the data. These
methods allow us to maintain the overall distribution of the variables while filling in missing values,
ensuring that the analysis is not unduly influenced by incomplete data.
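The two imputation strategies can be sketched with pandas; the column names and values below are illustrative stand-ins, not the actual dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical columns with missing entries (not the real dataset)
df = pd.DataFrame({
    "pH": [6.8, np.nan, 7.1, 6.9],
    "Viscosity": [5000.0, 5200.0, np.nan, 5100.0],
})

# Mean imputation: replace each missing value with its column mean
df_mean = df.fillna(df.mean())

# Median imputation: replace each missing value with its column median
df_median = df.fillna(df.median())
```

With no outliers present, the two strategies produce nearly identical fills, which is why either is acceptable here.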
1. Since categorical variables in the dataset are nominal, we apply label encoding to transform them
into numerical representations.
2. Label Encoding: Label encoding assigns a unique numerical label to each category within a
categorical variable.
3. Mapping: Below is the mapping of original categories to their corresponding numerical labels.
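A minimal sketch of how such a category-to-label mapping can be produced with pandas; the category names are hypothetical examples, not taken from the report.

```python
import pandas as pd

# Hypothetical nominal variable (example categories, not from the dataset)
s = pd.Series(["Clover", "Wildflower", "Clover", "Acacia"])

# factorize assigns integer labels 0, 1, 2, ... in order of first appearance
codes, uniques = pd.factorize(s)

# Mapping of original categories to their numerical labels
mapping = {cat: i for i, cat in enumerate(uniques)}
```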
3.1.3. Outlier Statistics and Treatment (Scaling | Transformation) [No Outliers]
3.1.3.1.2.2 Normalization using Min-Max Scaler: CS (Color Score), Density, WC (Water Content), pH,
EC (Electrical Conductivity), F (Fructose Level), G (Glucose Level), Viscosity, Purity and Price.
1. The dataset is partitioned into two subsets: training and testing datasets.
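The normalization and partitioning steps above can be sketched with scikit-learn; the synthetic matrix and the 80/20 split ratio are assumptions standing in for the actual features.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))  # stand-in for the numeric features

# Min-Max normalization maps each column onto [0, 1]
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)

# Partition into training and testing subsets (80/20 ratio assumed)
X_train, X_test = train_test_split(X_scaled, test_size=0.2, random_state=42)
```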
3.2.1.1. PO1 | PS1: Unsupervised Machine Learning Clustering Algorithm: K-Means (Base Model) |
Metric Used: Euclidean Distance
3.2.1.2. PO1 | PS1: Unsupervised Machine Learning Clustering Algorithms: {DBSCAN | BIRCH |
OPTICS} (Comparison Models: At Least One) | Metric Used: Euclidean Distance
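A sketch of fitting the base K-Means model alongside two comparison models; the synthetic blobs and the DBSCAN hyperparameters (eps, min_samples) are illustrative assumptions, not tuned values from the report.

```python
from sklearn.cluster import DBSCAN, Birch, KMeans
from sklearn.datasets import make_blobs

# Synthetic, well-separated blobs stand in for the real feature matrix
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# Base model: K-Means (uses Euclidean distance internally)
kmeans_labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)

# Comparison models; eps and min_samples are guesses that would need tuning
dbscan_labels = DBSCAN(eps=1.0, min_samples=5).fit_predict(X)
birch_labels = Birch(n_clusters=3).fit_predict(X)
```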
3.2.2.1.1. PO2 | PS2: Clustering Model Performance Evaluation: Silhouette Score | Davies-Bouldin
Score (Base Model: K-Means)
To determine the best clustering model based on the Davies-Bouldin (DB) score and Silhouette (SS)
score, we need to consider the following:
Silhouette Score (SS): A higher Silhouette score indicates better separation between clusters. The
Silhouette score ranges from -1 to 1, where a score closer to 1 indicates better clustering.
Davies-Bouldin Score (DB): A lower Davies-Bouldin score indicates better clustering. The DB score
measures the average similarity between each cluster and its most similar cluster, where a lower
score indicates better separation between clusters.
For k=2: SS score is 0.589 and DB score is 0.552. For k=3: SS score is 0.524 and DB score is 0.603. For
k=4: SS score is 0.462 and DB score is 0.684. For k=5: SS score is 0.430 and DB score is 0.732.
Since we want to maximize the Silhouette score and minimize the Davies-Bouldin score, the best
clustering model would be the one with the highest Silhouette score and the lowest Davies-Bouldin
score.
In this case, for k=2, the clustering model has the highest Silhouette score (0.589) and the lowest
Davies-Bouldin score (0.552). Therefore, the clustering model with k=2 is likely the best choice based
on both the Silhouette and Davies-Bouldin scores.
However, we will consider the k=3 clustering model as the best model for our clustering subset.
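The Silhouette and Davies-Bouldin evaluation described above can be sketched as follows, with synthetic blobs standing in for the pre-processed clustering subset:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import davies_bouldin_score, silhouette_score

# Synthetic data in place of the actual subset
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

scores = {}
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    # Higher Silhouette and lower Davies-Bouldin indicate better clustering
    scores[k] = (silhouette_score(X, labels), davies_bouldin_score(X, labels))

best_k = max(scores, key=lambda k: scores[k][0])  # k with highest Silhouette
```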
3. *Exited_oe*:
- This binary column indicates whether a customer has exited the service (1) or not (0).
- Only about 20% of customers have exited (mean ≈ 0.20).
4. *Tenure*, *Balance*, *Estimated Salary*, *Age_mmnorm*, and *Credit Score_mmnorm*:
- *Tenure*: The average customer tenure is approximately 5 years.
- *Balance*: The mean balance is around $77,852.
- *Estimated Salary*: Customers' estimated salaries average $101,012.
- *Age_mmnorm*: The normalized age (between 0 and 1) has an average value of approximately 0.60.
- *Credit Score_mmnorm*: The normalized credit score averages around 0.28.
5. *Variability*:
- The standard deviation (std) provides a measure of variability for each feature.
- For example, the high std in *Balance* suggests significant variation in account balances.
6. *Percentiles*:
- The 25th percentile (25%) represents the lower quartile, while the 75th percentile (75%)
represents the upper quartile.
- These values give insights into the distribution of data.
1. *Variable Category*:
- The first column lists each variable category.
2. *Count*:
- The second column gives the number of records in each category.
3. *Frequency*:
- The third column represents the relative frequency (proportion) of each category.
4. *Observations*:
- Categories like "Exited," "Has Credit Card," and "Is Active Member" are binary (0 or 1).
- Other numerical variables (e.g., "Tenure," "Balance," "Estimated Salary," "Age_mmnorm," and "Credit
Score_mmnorm") are not explicitly labeled but likely correspond to continuous data.
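The descriptive statistics and frequency tables behind the observations above can be reproduced in outline with pandas; the small frame below is an illustrative stand-in for the churn subset.

```python
import pandas as pd

# Tiny illustrative frame; real values come from the churn subset
df = pd.DataFrame({
    "Exited": [0, 1, 0, 0, 1],
    "Tenure": [5, 3, 8, 2, 6],
})

# count, mean, std, min, quartiles (25%/50%/75%), max per column
desc = df.describe()

# Relative frequency (proportion) of each category of a binary variable
freq = df["Exited"].value_counts(normalize=True)
```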
This analysis is based on the provided data; further context or domain knowledge would support a deeper understanding.
1. *Credit Score*:
2. *Geography*:
- Customers are primarily from three countries: France, Spain, and Germany.
- The majority of customers are from France.
3. *Gender*:
- The gender distribution is not specified directly, but it can be inferred from the "Female" entries.
4. *Age*:
5. *Tenure*:
6. *Balance*:
7. *Number Of Products*:
8. *Has Credit Card | Is Active Member*:
- Most customers have a credit card (1.0) and are active members (1.0).
9. *Estimated Salary*:
10. *Exited*:
- The "Exited" column indicates whether a customer has exited the service (1) or not (0).
1. *Correlation Matrix*:
- The correlation matrix shows the relationships between different features in the dataset.
- *Tenure*:
- *Balance*:
- *Age_mmnorm*:
- *Credit Score_mmnorm*:
2. *Correlation Coefficients*:
- The correlation coefficients provide a numerical value for the strength and direction of the
correlation.
- *Tenure* and *Credit Score_mmnorm*: A weak negative correlation (-0.01) implies that tenure has essentially no linear relationship with credit score.
Remember that correlation does not imply causation: these relationships reflect statistical patterns
and may not indicate any causal link.
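A sketch of how such a correlation matrix can be computed with pandas; the columns echo the report's features, but the values are synthetic placeholders.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-ins for a few of the report's features
df = pd.DataFrame({
    "Tenure": rng.integers(0, 11, 200),
    "Balance": rng.normal(77_852, 30_000, 200),
    "Credit Score_mmnorm": rng.random(200),
})

# Pairwise Pearson correlation coefficients, each in [-1, 1]
corr = df.corr()
```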
K-Means
1. Chart Analysis:
- The image displays a line graph titled "Elbow Curve for Optimal K."
- Key elements of the chart:
  - X-axis: Represents the values of K (ranging from 1 to 10).
  - Y-axis: Represents the Within-Cluster Sum of Squared Distances (WCSS), scaled exponentially (indicated by '1e13' at the top left corner).
  - Blue 'x' markers indicate data points corresponding to each value of K.
  - The curve shows a sharp decline from K=1 until around K=3 or K=4, after which it flattens out significantly.
  - The "elbow" point (where the curve starts to look like a straight line) suggests the optimal K for clustering.
2. Interpretation:
- The elbow method helps determine the optimal number of clusters by observing the rate of decrease in WCSS as K increases.
- In this case, the curve resembles an "elbow," indicating that K=3 or K=4 might be the optimal choice for clustering based on the dataset.
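The WCSS values behind the elbow curve can be computed as a sketch, using the `inertia_` attribute of a fitted KMeans model; synthetic blobs stand in for the actual dataset.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data in place of the real feature matrix
X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

# WCSS for K = 1..10; scikit-learn exposes it as the fitted model's inertia_
wcss = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
        for k in range(1, 11)]
# The "elbow" is where successive drops in WCSS level off.
```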
SS AND DB SCORE:
1. Silhouette Score (SS):
- The Silhouette Score measures how similar an object is to its own cluster (cohesion) compared to other clusters (separation).
- It ranges from -1 to 1:
  - A high positive value (close to 1) indicates that the sample is well clustered.
  - A value near 0 suggests overlapping clusters.
  - A negative value indicates that the sample might have been assigned to the wrong cluster.
- In our case, the Silhouette Score for 2 clusters is approximately 0.46.
- Interpretation:
  - A score above 0.5 is generally considered good, so our score falls slightly short.
  - It implies that the data points within the clusters are not very well separated.
2. Davies-Bouldin Index (DB):
- The Davies-Bouldin Index evaluates the average similarity between each cluster and its most similar cluster.
- Lower values indicate better separation between clusters.
- In our case, the DB Index for 2 clusters is approximately 0.92.
- Interpretation:
  - A lower DB Index suggests better-defined clusters.
  - However, a value close to 1 indicates some overlap or suboptimal clustering.
3. Overall Assessment:
- The SS and DB scores provide complementary insights:
  - SS focuses on individual data points' cohesion and separation.
  - DB considers the overall cluster separation.
- Considering both scores, the K-Means model with 2 clusters might not be ideal for this data.
- Other values of K, or additional metrics, may be worth exploring to determine the optimal number of clusters.