Data Investigation
EDA the Right Way
Vivek Chaudhary
Anirudh Dayma
John Gabriel
Manvendra Singh
Table of Contents
ACKNOWLEDGEMENT
PREFACE
ROAD MAP
KAGGLE EXPLORATION
CHURN PREDICTION
IMBALANCE CLASSIFICATION
WORDS OF WISDOM
BOT DETECTION
Acknowledgement
I would love to express my special thanks to John, Anirudh and Manvendra for
their continuous dedication to making this book a good one. Thanks to Naresh
Talwar for supporting me during my hard times.
Special love & thanks to my lovely dog “Kittu” & “Chuggu” for being a part of my
unreached destination.
“Kittu” & “Chuggu” will always be missed.
- Vivek Chaudhary
I would like to thank my Mumma, Papa and my Sisters for pushing me whenever
I got demotivated. It would not have been possible to complete this book
without their support. I would also like to thank the co-authors for making this
happen.
And last but not least, I would like to thank all my friends. Romit and Pushkar,
you rock!!!
- Anirudh Dayma
First of all, I would like to thank God, with whose grace we were able to write
and complete this book, and also the co-authors of this book. I am really
thankful for the support that I have received from my parents, my mentor Miss
Sutithi Chakraborty, and my friends.
- Manvendra Singh
Special thanks to Sinku Kumar for his time and for helping us when we got stuck.
Thanks for your selfless support and efforts.
Gratitude to Swati Mishra for helping us out with Applied statistics code
snippets.
We would also like to thank the below people for devoting their precious time
to proofreading this book (names in alphabetical order).
Ayesha Khan
Bharadwaj Narayanam
Ritama Daw
Urmi Shah
Also, thanks to everyone who was associated with this book directly or
indirectly. We hope we didn’t miss anyone.
Preface
Did you ever wonder how you decide which graph to plot, such as a histogram or
a pie chart, and which techniques you can use to take a first look at your
dataset? Interestingly, EDA is the process that tells you which one fits best,
and much more beyond that.
Let's say you have three different features with 20%, 40%, and 60% null values.
Most of us would simply delete the features with more than 40% or 50% null
values. But this is not how it is actually done. There are multiple techniques
to fill null values, and which technique to choose is something we come to know
only by developing a deep understanding of the problem statement. This is what
the EDA process is about, and it is what this book covers.
For example, let's say you delete a feature which has around 60% null values,
but that feature turns out to be really important for building your model. By
this small mistake of deleting the feature, we lose important information, and
this is going to impact our model. That is why research, patience, and critical
thinking with common sense about the problem statement help us tackle such
situations, and this too is a part of EDA.
Let's understand in layman's terms what we want to deliver to our audience. Say
you want to cook Rajma and you get the recipe from your mom. The ingredients
include garam masala, ginger, salt, ghee, and oil. But what would happen to the
taste of the Rajma if you don't know how much of each ingredient to add? We all
know that cooking Rajma is a technique we have learned from our mom, but what
quantity of each ingredient to add, and when to add it, is something that should
be known. This book answers the questions of what, when, and how much to add
for a good-tasting Rajma (before building a model).
Did you ever think about why, whenever you deal with a dataset, Machine
Learning is needed? And if it is needed, in which situations can your model be
at its best, and where might it fail?
In this book, our focus is not on building models; rather, we focus on the
techniques you should follow for EDA when you deal with any problem statement.
This book contains the basic approaches that we, as Data Scientists, have to
deal with. Most of us have questions about how to apply feature engineering,
and we answer them in this book by working on use cases from different domains.
Who is this book for?
The book "Data Investigation - EDA the Right Way" is going to help all the
beginners and intermediates who are struggling and find EDA a huge challenge.
This book is for the Data Science enthusiasts who have just jumped into this
field and are busy plotting graphs and filling NaNs with mean/median/mode. It
is also for people who have undergone learning but fall short when it comes to
answering "Why this visualization?" At the same time, it is for people who
think EDA is just about flaunting fancy visualizations. So this book is for
everyone who has a basic knowledge of Python/ML and thinks he/she knows EDA, as
it will help them understand and apply EDA the right way.
One main misconception people have is that Machine Learning is all about
building models. But they fail to understand that the model helps only to a
certain extent; beyond that, it is the features provided to the model that
help. Machine learning works on the principle of "Garbage In, Garbage Out": if
we feed poor features to the model, the performance of the model will also be
poor. So, to build efficient models, one should focus primarily on the
features, use EDA to analyse them, and finally come up with the important
features that should be fed to the model. In this book, we will describe how to
make assumptions, validate those assumptions, and then come up with useful
features to build a model.
Road map
Let’s discuss what you are going to learn from this book and what it contains
for the readers.
As we all know that Data Science is booming, but we believe that, ‘Data
Science is not Everyone’s Cup of Tea.’
People tend to follow the wrong practice of focusing on building a Machine
Learning model with 92% or 95% accuracy without checking whether their model is
learning or merely memorizing.
Suppose you are teaching your kid how to identify a car, and you describe it by
saying that a car has four wheels. Whenever the kid sees a four-wheeled object,
it would identify it as a car. But do you remember the Mr. Bean show? There was
a blue-colored car with 3 wheels.
Figure 1: Car
If the kid has memorized then he/she will fail to identify it as a car and the beach
bikes with 4 wheels would be identified as a car. In this scenario, we could say
that the kid has memorized as he/she gets confused when he/she sees some
variation. But if the kid would have learned then he/she would have identified
the objects correctly. How the kid learns depends on how well we teach the kid.
Image source - https://i.stack.imgur.com/OuWxL.jpg
Similar is the case with Model building, the model is analogous to the kid and it
depends on us how we train the model.
For example, most people blindly apply one-hot encoding or label encoding
when dealing with categorical features to build their Model hoping to get good
accuracy, they don’t bother to think if that encoding technique is going to help
or not. The focus should be to make the model learn and not memorize.
Let’s say you are trying to teach the same kid to identify an apple and if you just
focus on the red color, again the kid would fail when it comes across apples that
are green or have a slightly different shade of red. The kid might identify red
cherries as apples because the kid knows that whatever is red should be an
apple. While making the kid understand, we should also focus on the size and
some other variants of apple which have different color/characteristics. So size
along with other characteristics becomes an important feature when dealing
with the above problem. The motive of this example was to make you
understand the importance of feature selection. You should be in a position to
identify which feature is important and which isn’t.
Most Data Science enthusiasts are scared to do research and don't bother to
understand the intuition behind encoding techniques, so they fail to understand
their application. For example, if you only know about one-hot encoding and
label encoding, you are going to apply these two every time, which is a wrong
practice. Just because you fear researching and experimenting, you won't
explore the other techniques you could use. By doing some research you might
come across new techniques; it might happen that those techniques don't help
you, but in the process you have learned about different encoding techniques,
their advantages and disadvantages, and where you can and cannot apply them.
This field is all about exploring and your willingness to make mistakes and learn
from those mistakes.
Did you get that?
No?
Let's make it clearer with another example. Suppose you have a Maths exam in 2
days and you are running out of time, so you won't be covering all the topics.
You would cover only those topics which you feel are important from an exam
perspective and focus on those. What if things don't go the way you want, and
all the questions come from the topics you have skipped? You would score poorly
or might even fail.
You should not only focus on the good scenarios but also try to focus on the bad
scenarios as well.
And this is how most Data Science enthusiasts work: they just focus on the good
scenarios.
Imagine that you get a chance to work on a project where you have to detect
whether something is a bot or not. Most people would focus on building a model
that detects a bot, which is comparatively easy, but their model would fail
when it isn't a bot, and that is the difficult part. Learning means being able
to identify both when it is a bot and when it isn't.
What comes to your mind when you hear the term Exploratory Data Analysis
(EDA)?
Exploratory Data Analysis isn’t only about plotting graphs but it's way beyond
that. As the name ‘Exploratory’ suggests it is more about exploring the data and
making yourself familiar with the data, it is about asking questions to the data,
making assumptions, and validating them using statistical tests.
For example, if you plan to go on a vacation to Shimla with your family, you
are not going to book the tickets directly. Instead, you would look for
different places you could visit as per your family discussion, and then select
some of them based on your budget. In layman's terms, this is EDA.
If you explore the place you plan to visit before you actually visit it, then
why don't you do the same with your problem statement and the data in hand?
Remember, EDA isn't about who plots more graphs; it is about who has understood
the data well. It is about drawing insights from the data.
Data Science always revolves around Why’s. Before plotting a graph you should
be clear with ‘What are you looking for in the data’, you should have some
question in hand, some assumptions made, and then check if the data looks the
way you assume it to be. Your assumptions might be wrong but they would lead
you to the correct conclusion. The graph which you plot is an answer to the
questions you have.
Applying common-sense and making Assumptions is one of the key skills you
should try to master as a Data Scientist.
It’s pretty simple, Research helps you improve your common sense and
Common-sense helps you create assumptions.
But remember Assumptions are always wrong until they have statistical
significance. And this is where statistical tests help us and that’s the reason we
have given a major emphasis on Statistical tests in this book and have it as the
very first chapter because to validate your assumptions you need statistical
tests.
For example, if someone told us that Virat Kohli's strike rate is similar to
Rohit Sharma's strike rate, then, being Data Scientists, we wouldn't agree with
that statement unless we looked into the data and checked its statistical
significance.
In this scenario, before getting into coding, do some research about the
problem statement and make some assumptions; even the wrong assumptions might
take you to the correct solution.
The main motive of this book is to make you aware of some of the weapons which
you could use to torture your data and ultimately draw insights from it.
If statistical tests bother you, trust us by the end of the first chapter you would
feel pretty comfortable with statistical tests and you would get a clear idea
about where to use which test.
Assumptions -
An important thing to do before dealing with any problem statement is making
some assumptions. Assumption is a thing that is accepted as true or as certain
to happen, without proof. It is a key skill that is ignored by many. People are
fascinated by models and hence ignore the importance of making assumptions.
Assumptions could be of two types - technical assumptions or business
assumptions.
Technical assumptions:
Many analytical models or algorithms rely on a certain set of assumptions. This
means that before using a model or algorithm we should ensure that our data
abides by the assumptions it makes. A technical assumption could be that a
model expects no collinearity in the data being fed to it; some models are
affected by missing data and may also be sensitive to outliers.
EDA helps us explore our data and thereby know it well. During EDA, the various
technical assumptions are assessed, which in turn helps us select the best
model to use. If we fail to do such an assessment, we might use a model whose
assumptions are violated by the data in hand, leading to poor predictions. This
might also lead to incorrect conclusions, which could have negative
implications for the business.
For example, Linear Regression (Ordinary Least Squares) has a set of
assumptions which the data should abide by before we apply it to our data.
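As a rough illustrative sketch (the file name "features.csv" and the 'income'
column below are our own made-up examples, not from this problem), a quick
technical sanity check with pandas could look like this:
import pandas as pd

# df is a hypothetical DataFrame of numeric features
df = pd.read_csv("features.csv")  # assumed file name, for illustration only

# 1. Collinearity check: large absolute correlations hint at collinear features
corr = df.corr().abs()
high_corr = corr[(corr > 0.9) & (corr < 1.0)].stack()
print(high_corr)

# 2. Missing-data check: models sensitive to missing values need these treated first
print(df.isnull().sum().sort_values(ascending=False))

# 3. Outlier check (simple IQR rule) for a single column, e.g. 'income'
q1, q3 = df['income'].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df['income'] < q1 - 1.5 * iqr) | (df['income'] > q3 + 1.5 * iqr)]
print(len(outliers), "potential outliers in 'income'")
Checks like these tell us early on which technical assumptions our data might
violate and which models we should be careful with.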
Business assumptions:
Business assumptions play a vital role when it comes to solving real-world
problems. These assumptions are not visible but could be understood by having
a deep sense of how the business works and what is the problem that we are
trying to solve.
It is very much important to have the correct domain knowledge before dealing
with data. Having a sound knowledge about the business makes things easier.
Myth
It is believed that a Data Scientist's job is just to build models.
And so, Data Science enthusiasts run behind building models as soon
as they get the data. But it should be understood that not all
problem statements require a model to provide the solution. Many
use cases don't require any model building; they just require
getting insights from the data and providing a remedy to the problem
statement.
Consider that your client is a YouTuber who comes up with exciting food
recipes. During the current Covid-19 situation, as most of us are staying
indoors, he thinks of making videos on some exotic dishes. He assumes that
since most people are indoors and, due to the lockdown, avoid ordering food
online, videos on cool desserts and exotic dishes would target a larger
audience and might get him more viewers and subscribers.
So, he puts in all his effort and starts making videos. But to his surprise,
things do not go as he expected. He was expecting that after making videos on
exotic dishes he would get more viewers and subscribers, but in reality the
number of viewers decreased substantially.
In such a case, model building is not required. We need to analyze the data,
gather insights from the data, and then come up with a possible solution. We
could ask the client to gather information about his subscribers and viewers. We
can then explore this information and try to get an intuition about the data.
In our case, the client assumes that his viewers might be interested in exotic
dishes, but we all know that preparing exotic dishes consumes a lot of time,
requires too much effort, and some dishes might need ingredients that are not
readily available amid the lockdown. It might well be that his viewers are lazy
and not at all interested in investing so much time in cooking; they are okay
with simple dishes that can be cooked quickly.
These things can be validated once we have data about the preferences of the
viewers. So, we should have a relevant set of questions which we could suggest
to our client to ask the viewers. Only by asking relevant questions, we would get
answers which could help us to unwrap some of the mystery.
From the above use case, we come to know that it isn’t only about model
building, it’s also about understanding the business domain and then suggesting
the client ways by which they could gather data, that will help us to provide a
solution to the problem statement. Now consider that after your suggestion, the
client thinks of releasing a form with certain questions and hopes that the
answer to those questions would help us to analyze the data and eventually help
in making some decisions.
It is evident that out of all the viewers the client has, only a few would fill
the form and genuinely answer the questions. Since many won't fill the form,
the data we get will not be the population data; rather, it will be a sample of
the data.
Terminology alert
Population - A population dataset contains all members of a specified group
(the entire list of possible data values). In our case, if we could get the data for
all our viewers then the data would be called population data.
Sample - A sample dataset contains a part, or a subset, of a population. The
size of a sample is always less than the size of the population from which it is
taken. In our case, only a set of viewers would fill the form, so it is a sample
data.
So, it is clear that the data we would be getting will be sample data and not the
population data. With the sample data in hand, we cannot make a decision
about the population data as it is not the entire data. The data collected is a
subset of the population.
Why cannot we make decisions based on the sample data, why do we care so
much about population?
There are 2 reasons for this -
1. The sample is a subset of the population, so it won’t tell us the entire story about
the data, and without knowing the entire story we cannot make decisions.
2. The sample might be biased, there is a possibility that out of all the viewers who
have filled the form, most of them feel that exotic dishes are time-consuming
but there could be many viewers who are interested in exotic dishes but they
didn’t want to submit the form and answer the questions.
So, the sample might be biased, and there is a lot of information that the sample
might have possibly missed, so just by using the sample, we shouldn’t make
decisions.
Now it is clear that just by analyzing the sample we cannot come up with
conclusions for the entire population.
Hypothesis:
A hypothesis is a proposition made as a basis for reasoning, without any
assumption of its truth.
Certain words in this definition are emphasized for a reason, and you will see
why as we proceed. There are two types of hypothesis - the Null hypothesis and
the Alternate hypothesis.
Null Hypothesis:
Let us try to understand this with the help of an example. We want to check
whether or not there is a difference between the average income of Indian
employees in the years 2019 and 2020. So as the word null means zero or no,
the null hypothesis would be there is no difference or zero difference between
the average income of the 2 years.
The null hypothesis can also be a proposition that has been made earlier and that
proposition is accepted.
Mathematically -
The average income for the year 2019 = The average income for the year 2020
Alternate Hypothesis:
The alternate hypothesis would say that there is a difference between the two
values (which value is larger and which value is smaller is a different question
but there is a difference).
In our above example, it would be that both the incomes differ.
Mathematically:
The average income for the year 2019 ≠ The average income for the year 2020.
An alternate hypothesis can also be a statement that differs from the proposition
that is accepted by the people.
Real-life example
It has been proposed by Trump that anti-malaria tablets could cure Covid-19.
This proposition would become our null hypothesis. It has not been proved that
it cures Covid-19, it is only proposed and there is no significant proof of it.
Now a researcher comes up and says that no, the anti-malaria tablet doesn’t
cure Covid-19. This would become our alternate hypothesis.
When we deal with hypotheses, we never prove that a hypothesis is correct; we
only prove that another hypothesis is incorrect.
Similar to our judicial system, where a person is innocent until proven guilty,
if we cannot prove a hypothesis wrong, it is accepted until it is proved
incorrect.
The researcher would have to come up with significant results to prove his
proposition. If he fails to do so, we would continue to accept (fail to reject)
the claim that anti-malaria tablets cure Covid-19.
The hypothesis test that we perform assumes that the null hypothesis is true.
If the null hypothesis is true, then we should get a high probability. If we
get a very low probability, less than 5%, then we are fairly sure that our
assumption that the null hypothesis is true is incorrect, and hence we reject
the null hypothesis when the p-value is less than 0.05. The p-value is the
probability of observing what we have seen (for example, a difference between
the two means) purely due to random chance, assuming the null hypothesis is
true. If this probability is small, we are confident that the difference is not
due to random chance, so we reject the null hypothesis.
Note: The threshold is generally 0.05 i.e. 5% and we can reduce it to 1% as well
if we want to reduce the chances of making an incorrect decision. For example,
if we are performing a hypothesis test for a drug then we need to reduce the
chances of making an incorrect decision. In such cases, we might keep the
threshold as 0.01 or 1%.
Hypothesis testing at its core checks whether our statistic belongs
to the null hypothesis distribution or some other distribution. If it
does not belong to our null hypothesis distribution, we say that our
statistic comes from some other distribution and we reject the null
hypothesis.
This is how statisticians talk. If you are able to understand the image
referenced below, then you can say that you have understood the terminology
well. In case you don't, read the chapter again and we hope you will.
Now, let us have a look at some of the statistical tests and also get an
understanding of their usage.
But before we dive into the statistical tests, it is important to understand how
data is represented and different levels of measurements as it plays a key role
in identifying which statistical test to use.
Image Source - https://miro.medium.com/max/2920/1*0T7xtPuohhs7VJl9CfLYvQ.png
Levels of Measurement
The way a set of data is measured is called its level of measurement. There
exist four levels of measurement: Nominal, Ordinal, Interval, and Ratio (the
Interval and Ratio levels are sometimes called Continuous or Scale).
Why is it important?
We need to understand the different levels of measurement because how the
research question is phrased, together with the level of measurement of the
variables involved, dictates which statistical test is appropriate.
Sampling Techniques
Before starting with sampling let’s revisit what is a population and sample.
Population: According to Wikipedia, a population is a set of similar items or
events which is of interest for some question or experiment.
To understand it better let’s take an example, suppose we want to find out what
is the average salary of Data Scientists in India. So, all Data scientists in India
become our population or we can call it a subject of study.
Sample: A sample is a set of individuals or objects collected or selected from a
statistical population.
Our population under study is the set of all Data Scientists in India, but we
have limited time and resources. We cannot go and knock on the door of every
Data Scientist in India and ask their salary. So, we randomly choose some
observations from our population.
Let’s say we randomly choose 100 Data Scientists and get their salary details, so
this becomes our sample with 100 observations, while the population is all the
Data Scientists in India.
Biased Sampling:
Biased Sampling is the worst method of all the Sampling methods since it doesn’t
give an equal chance to all the observations in the population. It occurs when
some set of observations of the population are favored over others.
They are of 2 types:
Convenience Sample - Only Includes people who are easy to reach.
For example, if we want to receive feedback from 5 customers and we have
around 100 customers. So, let’s say that we select our first 5 customers or
customers whose name starts with a particular alphabet. Here not every
customer is getting a chance so we call it Convenience Sample. The observations
of the sample were chosen as per our convenience.
Voluntary Response Sample- Consists of people who have nominated
themselves out of their own will.
For example, if we ask which all customers are interested in filling the feedback
form and around 5 customers are ready to do so on their will, then it becomes
a Voluntary Response Sample. Those with a strong interest are the ones who are
most likely to fill the feedback form.
A good sample is one that is representative of the entire population and gives
every observation an equal chance of being chosen.
Unbiased Sampling:
Unbiased Sampling is the best of all the Sampling methods since it gives equal
chance to every observation in the population.
Different types of Unbiased Sampling methods:
Simple Random Sampling (SRS) - SRS is a sampling technique where every item
in the population has an equal chance or likelihood of being selected.
For example, we could randomly choose customers and ask them to fill the form.
We could also use a random number table and then choose our observations
accordingly. So, in this way, all customers will get an equal chance of being
chosen with no bias*.
* Nothing is completely free from bias; some bias is present unintentionally. We aim to have minimal bias.
Image Source - https://research-methodology.net/wp-content/uploads/2015/04/Simple-random-sampling.png
Systematic Sampling: Systematic random sampling is the random sampling
method that requires selecting samples based on a system of intervals in a
numbered population.
For example, from all the customers we decide to call every 3rd customer for
feedback. So, we arrange them systematically and choose every 3rd customer
for the feedback.
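As a small illustrative sketch (the customer table and column name below are
made up by us), both techniques can be expressed in a few lines of pandas:
import pandas as pd

# hypothetical customer table, for illustration only
customers = pd.DataFrame({"customer_id": range(1, 101)})

# Simple Random Sampling: every customer has an equal chance of being picked
srs_sample = customers.sample(n=5, random_state=42)

# Systematic Sampling: pick every 3rd customer from the ordered list
systematic_sample = customers.iloc[::3]

print(srs_sample)
print(systematic_sample.head())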
One-Tailed tests
Suppose you run a potato chips company and you claim that the packet of chips
weighs not less than 250 gms but the customers claim that the weight is less
than 250 gms.
The null hypothesis would be - Weight >= 250 gms
And the alternate hypothesis would be - Weight < 250 gms
The one-tailed test could be left-tailed or right-tailed depending on the
formulated hypothesis.
Two-Tailed tests
Consider the same example as above, but now let's say you are claiming that the
weight of the chips packet is exactly 250 gms and the customers claim that
there is a difference: it might be less than 250 gms or more than 250 gms, but
it is not exactly 250 gms.
The Null hypothesis would be - Weight = 250
And the alternate hypothesis would be - Weight ≠ 250
The figure below depicts one-tailed and two-tailed tests. The two plots at the
top represent a left-tailed test and a right-tailed test, and the plot at the
bottom depicts a two-tailed test.
The area shaded in blue is the critical region; if our test statistic falls in
the critical region (equivalently, if the p-value is below the significance
level), we reject the Null hypothesis.
Figure 10: One-tailed and two-tailed tests
Image Source - https://miro.medium.com/max/619/1*Zu0iou9DD-zIZSOZjsUeEA.png
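As a hedged sketch of the chips example (the packet weights below are made up,
and the alternative argument requires SciPy 1.6 or newer), a one-tailed
one-sample t-test could be run like this:
import numpy as np
from scipy import stats

# hypothetical packet weights in grams, for illustration only
weights = np.array([247, 252, 248, 246, 250, 249, 245, 251, 244, 248])

# H0: mean weight >= 250, H1: mean weight < 250  (left-tailed test)
tstat, pvalue = stats.ttest_1samp(weights, popmean=250, alternative="less")
print("t statistic:", tstat, "p value:", pvalue)

# For the two-tailed version (H0: mean = 250, H1: mean != 250), simply drop the
# alternative argument, which defaults to "two-sided".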
Some tests assume that the groups being compared have similar variance. So, we
should check whether the test we intend to apply makes this assumption; if it
does, we first need to run a statistical test to compare the variances, and
only then apply our intended statistical test.
Types of data
Depending on its type, data could broadly be divided into two categories:
numerical (quantitative) and categorical (qualitative). The type of data in
hand, together with the question being asked, guides which statistical test to
use.
Comparing means -
The T-test is used to compare means, there are many variants of the T-test.
ANOVA is also used to compare means and it also has different variants.
One sample T-test -
This test is used to check if there is a significant difference between the
sample mean and the population mean. It assumes that the data comes from a
normal distribution and is used when we have one sample under observation. But
if the sample size is large enough, then we don't need to worry about the
normality assumption either (more about this in the Central Limit Theorem
section).
Example - Suppose we want to check whether the energy intake of 12 people is
significantly different from a theoretical value; we could use a t-test
(assuming the required assumptions are adhered to by the data).
# Let's take an example of 12 persons whose energy intake in KJ is as below:
import numpy as np
from scipy import stats

energy = np.array([5260, 5470, 5640, 6180, 6390, 6515, 6805, 7515, 7515, 7610,
                   8230, 8770])
# Our population mean is supposed to be 7725. Let's calculate the sample mean:
print(energy.mean())  # 6825.0

# Run the one sample t-test against the hypothesized population mean of 7725
tstat, pvalue = stats.ttest_1samp(energy, 7725)
print("One sample t test")
print(f"tstatistic: {tstat}, p value: {pvalue}")
# Output
# One sample t test
# tstatistic: -2.791930833083929, p value: 0.01752590611238294
Suppose you want to check whether the average pay for software engineers is 4.5
lac, and you have a group of 15 friends with the same designation whose average
pay comes out to be 4.4 lac. The group of 15 friends becomes your sample, the
sample mean is 4.4 lac and the hypothesized population mean is 4.5 lac, so you
could run a one sample t-test to check if there is a significant difference
between the two.
Two sample Independent T-test -
This test is used to check whether the means of two independent samples differ
significantly. For example, consider the measurements of the two independent
groups below.
group1 = np.array([9210, 11510, 12790, 11850, 9970, 8790, 9690, 9680, 9190, 9970,
                   8790, 9690])
group2 = np.array([7530, 7480, 8080, 10150, 8400, 10880, 6130, 7900, 7050, 7480,
                   7580, 8110])

# Run the two sample (independent) t-test
tstat, pvalue = stats.ttest_ind(group1, group2)
print("Two sample t test")
print(f"tstatistic: {tstat}, p value: {pvalue}")
# Output -
# Two sample t test
# tstatistic: 3.872200562257873, p value: 0.0008232715728032911
Since the p-value is less than 0.05, we can reject the null hypothesis at 5% level
of significance.
This means that there is a significant difference between the two means.
Paired T-test -
The use of this test could be best understood with the help of an example.
Consider that you own a fitness company and you want to check if the weight
loss program that you have launched is actually beneficial.
To check this, you pick 20 people at random and calculate their average weight
and it comes out to be ‘X’, post the weight loss program you calculate the
average weight of the same people and call this as ‘Y’.
Now you check if there is a significant difference between X and Y, if yes then
the weight loss program of yours has actually helped.
This means that we use Paired T-test when we want to compare means of 2
samples from the same population (People under observation are the same,
hence population is the same but their weight values have changed hence we
say that we are dealing with 2 different samples).
# Let's check if there is any effect of the weight loss program
pre = np.array([92, 67, 78, 81, 69, 87, 96, 87, 100])
post = np.array([88, 68, 80, 77, 66, 70, 88, 79, 98])
# Run a paired t-test on the before/after weights of the same people
tstat, pvalue = stats.ttest_rel(pre, post)
print(f"tstatistic: {tstat}, p value: {pvalue}")
Since the p-value is less than 0.05, we can reject the null hypothesis at 5% level
of significance.
This means that there is a significant difference between the weights before and
after the weight loss program.
ANOVA -
T-tests are used when we want to compare 2 means, but what if we want to
compare more than 2 means, in such a case ANOVA comes to our rescue. We
use ANOVA when we want to compare more than two means.
ANOVA (Analysis Of Variance) - the name is quite misleading, and people get
carried away by it. Some people assume that ANOVA is used to analyze variance,
which is WRONG. We know it is the name which leads to such a conclusion, but
the fact is that ANOVA assumes there is no significant difference between the
variances of the groups/samples. So here is the catch: before proceeding with
ANOVA we need to check whether the variances are similar, and to do this we
have tests which compare variances. Also, similar to the t-test, ANOVA assumes
normality of the data.
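As a small sketch (the three groups below are made up for illustration), a
one-way ANOVA can be run with SciPy; Levene's test is shown first as one way to
check the similar-variance assumption:
import numpy as np
from scipy import stats

# hypothetical test scores of three groups, for illustration only
group_a = np.array([82, 85, 88, 75, 79, 90, 84])
group_b = np.array([78, 74, 80, 72, 77, 81, 76])
group_c = np.array([88, 92, 85, 90, 87, 93, 89])

# Check the equal-variance assumption before ANOVA
stat, p = stats.levene(group_a, group_b, group_c)
print("Levene test p value:", p)  # a large p value suggests the variances are similar

# One-way ANOVA: are the three group means all equal?
fstat, pvalue = stats.f_oneway(group_a, group_b, group_c)
print("F statistic:", fstat, "p value:", pvalue)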
Comparing Variances
Fisher’s test -
This test could be used to compare the variance of 2 samples coming from a
population that is normally distributed. It could be used to validate the
assumption related to similar variance before applying ANOVA or any other test.
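SciPy does not ship a ready-made two-sample F-test, so a minimal hand-rolled
sketch (using made-up samples) could look like this:
import numpy as np
from scipy import stats

# hypothetical samples, for illustration only
sample1 = np.array([22.1, 19.8, 21.5, 23.0, 20.4, 22.8, 21.1])
sample2 = np.array([20.9, 21.2, 20.5, 21.0, 20.8, 21.4, 20.6])

# F statistic: ratio of the two sample variances
f_stat = np.var(sample1, ddof=1) / np.var(sample2, ddof=1)
dfn, dfd = len(sample1) - 1, len(sample2) - 1

# Two-tailed p value from the F distribution
p_value = 2 * min(stats.f.sf(f_stat, dfn, dfd), stats.f.cdf(f_stat, dfn, dfd))
print("F statistic:", f_stat, "p value:", p_value)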
Correlation tests
Pearson’s Correlation Coefficient -
This test is used to check the association between two quantitative variables.
It also has a set of assumptions, such as no outliers being present in the
variables under observation and the presence of paired values. For example, if
this test is applied on weight and height variables, then each observation used
should have a value for both the weight and the height variable. Some more
assumptions should be adhered to before using this statistical test.
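A tiny sketch with made-up weight and height values:
import numpy as np
from scipy import stats

# hypothetical paired observations, for illustration only
weight = np.array([55, 61, 67, 72, 78, 84, 90])
height = np.array([155, 160, 165, 170, 174, 179, 185])

# Pearson's correlation coefficient and its p value
r, pvalue = stats.pearsonr(weight, height)
print("correlation:", r, "p value:", pvalue)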
Non-Parametric tests -
There are scenarios where it becomes quite difficult to adhere to the
assumptions made by the parametric tests. This is where non-parametric tests
come to our rescue, but one thing worth noting is that parametric tests let us
draw stronger conclusions than non-parametric tests.
Comparing medians -
Wilcoxon signed-rank test -
In most cases, it is used as a possible alternative to the Paired T-test when
the data doesn't follow the assumptions required by the t-test.
This test aims at comparing the median to a hypothesized median; when the data
is skewed or contains outliers, the median is preferred over the mean.
grp_mba = [4.5, 6.5, 3.98, 4.2, 5.3, 4.25, 5.4]
grp_ms = [4.6, 6.1, 3.7, 4.8, 5.1, 5.6, 5.8]
stat, pvalue = stats.wilcoxon(grp_mba, grp_ms)  # treating the two lists as paired observations
print(f"statistic: {stat}, p value: {pvalue}")
Correlation tests –
Spearman’s Correlation -
It is used when one or both the variables aren’t normally distributed and mostly
used as an alternative to Pearson’s Correlation test. It is mostly used when the
data is ordinal. It calculates the strength and direction of the association
between the 2 variables.
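A quick sketch with made-up ordinal ratings (again, only for illustration):
from scipy import stats

# hypothetical ordinal ratings given by two reviewers, for illustration only
reviewer1 = [1, 2, 3, 4, 5, 6, 7]
reviewer2 = [2, 1, 4, 3, 6, 5, 7]

# Spearman's rank correlation: strength and direction of the monotonic association
rho, pvalue = stats.spearmanr(reviewer1, reviewer2)
print("rho:", rho, "p value:", pvalue)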
Chi-square test -
It is used to check the association between two categorical variables. The
frequency of each category for one categorical variable is compared across the
categories of the other categorical variable. The data is represented using a
contingency table where each row represents categories of one variable and
each column represents categories of the other variable. The cells contain the
corresponding frequencies of the categories.
For Chi-square test
Null hypothesis - Both the variables are not associated
Alternate hypothesis - Both the variables are associated
Significance level = 0.05
Let us try to understand this with the help of an example -
We will try to identify if there is an association between stress level and
drinking habits.
                 Regular     Occasional   Doesn't   Total
                 drinkers    drinkers     drink
Low stress       14          9            2         25
Medium stress    11          8            3         22
High stress      28          13           12        53
Total            53          30           17        100
From the above table, we can see that we have 100 observations, and 25% of
people have low stress, 53% of people are Regular drinkers (assuming the null
hypothesis is true).
So, the number of observations with Regular drinkers and low stress is 0.53 x
0.25 x 100 = 13.25. This is the expected number of observations if the variables
are not associated. Similarly, we will calculate for the rest.
We will get the calculated value after we do the calculations and this value will
be compared to the value we obtain from the chi-square table. But to obtain
value from the table, we need degrees of freedom (dof).
Let us try to understand degrees of freedom with the help of a small example,
suppose you have to find 3 numbers such that their sum is 10, call these
numbers as a, b and c. We are free to choose any value for variable ‘a’, consider
we chose it as 3, similarly we could choose any value for variable ‘b’ and suppose
we chose a value for ‘b’ as 5 then we are not free to choose any value we wish
for variable ‘c’. As soon as we fix the values for ‘a’ and ‘b’ we are bounded by
constraints and hence the value of ‘c’ comes out to be 2 so that the sum comes
up as 10.
We had 3 variables and we were free to vary the value of 2 variables so in this
case, we could say we have 2 degrees of freedom i.e. no. of variables – 1.
Similarly, for a contingency table the degrees of freedom work out to
(number of rows - 1) x (number of columns - 1), which for our 3 x 3 table is
(3 - 1) x (3 - 1) = 4.
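In practice the whole calculation can be delegated to SciPy; as a sketch using
the table above:
import numpy as np
from scipy import stats

# observed frequencies from the stress level vs drinking habits table
observed = np.array([[14, 9, 2],
                     [11, 8, 3],
                     [28, 13, 12]])

chi2, pvalue, dof, expected = stats.chi2_contingency(observed)
print("chi-square:", chi2, "p value:", pvalue, "dof:", dof)
print("expected counts:\n", expected)  # e.g. 13.25 for low stress / regular drinkers

# If the p value is below 0.05 we reject the null hypothesis and conclude that
# stress level and drinking habits are associated.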
Central Limit Theorem
The Central Limit Theorem is one of the most widely used theorems. It is used
quite often, but its use is abstract and hence people don't even realize that
they are using it. This theorem becomes very important when we are dealing with
non-normal distributions, so it is important to have some knowledge of it
before performing hypothesis testing.
The definition of the Central Limit theorem might seem quite technical so let us
try to understand it by thinking in the following manner. We draw a sample of
size ‘n’ from a population, using the sample we can calculate the sample mean
and standard deviation of the sample.
Let’s suppose that we do not have any idea about the distribution of the
population, we are also not aware of the mean and standard deviation of the
population. Assume that the sample size is greater than 30 and we want to apply
a parametric test on this sample, but the sample comes from a population
whose distribution is let’s say non-normal.
Should we go ahead with a non-parametric test? Hang on for a minute before
answering this question.
Let us assume that we have drawn some more samples from the population and
have calculated their means (called as sample mean). When we try to plot the
sample means, the distribution that we would get would be close to a normal
distribution and as we increase the sample size, the distribution is more likely to
resemble a normal distribution (a bell-shaped curve). This is what the Central
limit theorem tries to convey.
It also conveys that the mean of the distribution obtained from the sample
means would be equal to the population mean irrespective of the distribution
of the population.
Formal definition -
The Central limit theorem says that the mean of the sampling distribution of the
sample means is equal to the population mean irrespective of the distribution of
the population and when the sample size is greater than 30.
Let us try understanding the meaning of the highlighted terms. "Sampling
distribution" means that the distribution is made up of samples, and the latter
part, "sample means", implies that the distribution is made up of the means of
those samples. In the Central Limit Theorem we create several samples of size
greater than 30, calculate the mean of each sample, and then plot those means.
Takeaways –
1. The mean of the sampling distribution of sample means is equal to the
population mean.
2. Also, the sampling distribution of the sample means follows a normal
distribution if the sample size is greater than 30.
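The two takeaways are easy to verify with a small simulation; the sketch below
(our own illustration, with an arbitrarily chosen skewed exponential population
and sample sizes) draws many samples and looks at the distribution of their
means:
import numpy as np

rng = np.random.default_rng(0)

# a non-normal (right-skewed) population with mean 5
population = rng.exponential(scale=5, size=100_000)

# draw many samples of size > 30 and record each sample's mean
sample_means = [rng.choice(population, size=50).mean() for _ in range(5_000)]

print("population mean:", population.mean())
print("mean of the sample means:", np.mean(sample_means))  # close to the population mean
# plotting sample_means with a histogram would show an approximately bell-shaped curve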
So, you might be wondering how this is going to help us. Suppose we run a
t-test: it assumes that the sample comes from a normal distribution. What we
are really investigating is the sample mean, so another way to state the same
thing is that the test assumes the sample mean should come from a normal
distribution.
When we plot the sample means, it will follow a normal distribution (this is
according to CLT) and for the sample which we had in hand at the very first place,
we could say even that sample comes from the sampling distribution of sample
means (distribution of sample means). We know that this distribution would be
normal, so the sample mean in hand comes from a normal distribution and
hence the normality assumption holds true.
That is why we had said that as the sample size increases, we are less concerned
about the normality assumption as it is automatically taken care of by the
Central Limit Theorem.
Example - Suppose we have a Gender feature which contains 2 values Male and
Female, we need to check if this feature has any impact on the target
(continuous variable). We could run a t-test to check if Gender being Male or
Female has an impact on the target column. If their means are not significantly
different, we could possibly say that the Gender column with Male and Female
values doesn’t have an impact on the target (we need to take variance into
account as well, also need to check for the assumptions of t-test before applying
it).
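A minimal sketch of that idea (the DataFrame, 'Gender' values and 'target'
numbers below are made up by us purely for illustration):
import pandas as pd
from scipy import stats

# hypothetical data, for illustration only
df = pd.DataFrame({
    "Gender": ["Male", "Female", "Male", "Female", "Male", "Female", "Male", "Female"],
    "target": [200, 240, 210, 260, 190, 250, 205, 245],
})

male_vals = df.loc[df["Gender"] == "Male", "target"]
female_vals = df.loc[df["Gender"] == "Female", "target"]

# Two sample t-test: does the mean of the target differ between the two groups?
tstat, pvalue = stats.ttest_ind(male_vals, female_vals)
print("t statistic:", tstat, "p value:", pvalue)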
But this is just to give you an idea about how we could go ahead building and
validating assumptions using statistical tests.
Bonus content
Feel free to get access to some of the mini Ebooks we have created. You can get
it at https://github.com/Dataebook/Ebook
Kaggle Exploration
This chapter deals with the exploration of Kaggle: why Kaggle is important,
some of the challenges people face and how to overcome them, and, at the end, a
Kaggle problem statement. The entire code can be found on Github1.
Why Kaggle?
There are many ways to learn and practice Data Science, then why does Kaggle
hold a special place? Below are some of the questions we would like to ask.
Are you new to learning Data Science?
Are you the person who likes to learn Data Science through the application?
Assumption making is an important skill, do you want to broaden your
assumption making skill?
Would you want to get a community and corporate support while you’re
learning to solve a Kaggle project?
Would you like to participate in competition post-learning or while learning Data
Science?
If these are some of the questions you have in mind then the best solution we
would suggest is Kaggle2 - a platform only related to machine learning, data
science, deep learning or AI stuff. Kaggle is a great place to try out something
you newly learned and it will be very beneficial to you. Kaggle is not about
competing with others in competitions rather it’s a great platform to learn from
the work other people have done.
Points to remember while exploring Kaggle:
Don't panic when you hear the term 'competition' when you are learning.
Kaggle is more about learning.
1 Github link - https://github.com/Dataebook/KaggleExploration
2 Kaggle link - https://www.kaggle.com/
Mistakes made while exploring Kaggle
1. Just looking and not experimenting - Just having a look at how a kernel has
solved a problem is not going to help any learner to learn anything. One has
to take the code and execute it on a local machine for better understanding.
Data science is an applied field, and the best way to solidify skills is by
practicing and not looking.
2. Failing to make prior assumptions - Without making prior assumptions and
validating them, you won’t be in a place to make correct decisions, try to
explore the data as much as possible and #FeelLikeABoss
3. Failing to understand the kernel's point of view - Whenever you are trying
to understand the solution in a kernel, you should put yourself in the place of
the kernel contributor and try to understand why that person made a certain set
of assumptions and what their thinking process behind the solution might have
been. Try to put yourself in the shoes of the kernel contributor.
4. Sticking to one Kernel’s solution - You should not restrict yourself to one
kernel, try to explore as many kernels as possible, try to understand how the
sample problem has been solved by different people, you would find
something to learn from each one of them.
5. Spending too much time on theory - Many beginners fall into the trap of
spending too much time on theory, whether it be math related (linear
algebra, statistics, etc.) or machine learning related (algorithms, derivations,
etc.). This approach is inefficient for 3 main reasons:
First, it’s slow and daunting.
Second, you won’t retain the concepts as well.
Finally, there’s a greater risk that you’ll become demotivated and give
up if you don’t see how your learning connects to the real world.
6. Starting to use algorithms without knowing the math behind them - Blindly
using algorithms hurts when a certain algorithm is not working as you wish and
you cannot find the exact reason for it, because you don't know the intuition
behind the algorithm.
7. Using the same algorithm for all the datasets - For example, Random Forest
or XGBoost may be good and work very well, but it is important to check other
models as well.
How to overcome these mistakes?
1. Change the way of approaching a problem statement - Approach a problem
statement by writing your assumptions. This makes one understand the
domain very well. Make sure you make at least 5-10 assumptions before
getting data in hand.
2. Research why ML/DL/AI is required - Every learner should be able to find
out how machine learning can help any industry (For eg, the Banking sector)
to address their problem. One can examine that by reading blogs or articles
to find why machine learning or deep learning is needed to solve that
particular problem.
For example, let's say you are participating in a hackathon to predict a loan
amount using machine learning. The problem statement looks pretty clear, so you
start working with the dataset directly without knowing the business case; you
will surely face difficulty while dealing with that particular hackathon.
First, think about why machine learning is needed and research how machine
learning is used in the banking sector for predicting loan amounts.
Understanding the business problem may in turn give you insights for feature
engineering, and you will observe the change in yourself when you approach
problems in this manner.
3. Learn to be comfortable with partial knowledge - You’ll naturally fill in the
gaps as you progress.
4. Understand the Strength and Weakness of an algorithm - It is always
suggested to know the landscape of modern machine learning algorithms
and their strengths and weaknesses.
5. Practice: It is very important to constantly practice problem-solving.
Challenges faced while dealing with Kaggle kernels
1. It would be quite challenging to understand the Kernel’s approach while
learning.
2. The code used by the kernel might be new to a fresher. For example, we may
have learned matplotlib and seaborn for visualization, but some kernels use
Plotly, and understanding that would be a bit difficult.
Note: Remember that these are the common challenges faced by every
learner. Every master was once a learner.
“Every Olympic diver needed to learn how to swim first, and so should you.”
Our Experience
At the beginning, as soon as we read the problem statement, we would take the
dataset in hand and try to understand the given features. But later we realized
it is always better to understand the domain first before handling the dataset.
We started making some assumptions and it helped us to get more acquainted
with the data. Assumption making also helps in developing storytelling skills. It
has helped us and would definitely work for you as well, give it a try.
While learning, we thought EDA is only about filling null values and putting it
into Charts. But later we realized EDA is much more beyond those fancy
visualizations. "EDA is a process of approaching the problem statement before
Model building".
There was a Kaggle project which we weren't able to understand, and we got
stuck trying to understand the code completely. We spent many days
understanding it. Eventually we found that the best way to understand the code,
as well as the approach used in the kernel, is to put ourselves in the position
of the kernel contributor and see things from his/her perspective. If we don't
understand the code, we try to understand the approach and store it in our
mind. At any point in time, if a similar kind of approach is needed, we take
the help of that code and adapt it to our requirement to complete our work.
Because Data Science is not only about coding; it's more about giving solutions
to problems which impact people's lives.
Now, let’s take one Kaggle problem and try to understand the EDA step by step.
The focus of the chapter is not to build a model, rather tell you about how to
approach a problem statement.
Note – We won’t be explaining the entire code in this book as it isn’t practically
possible, we would just be covering the important code snippets. For the
entire code, refer to the Github1 repo.
1 Github link - https://github.com/Dataebook/KaggleExploration
4. Check whether all members of the house have the same poverty level.
5. Check if there is a house without a family head.
6. Set the poverty level of the members and the head of the house within a
family.
7. Count how many null values are existing in columns.
8. Remove the null value from the target variable.
9. Predict the accuracy using a random forest classifier.
10. Check the accuracy using the Random forest with cross-validation.
Note: After reading the Problem Statement itself, we should make some
assumptions without looking at the data.
Assumptions:
The given problem statement focuses on the social-economic side. We can solve
any kind of problem from any domain with the help of enough data.
Let’s start making our assumptions one by one for our problem statement
before looking into the dataset:
Since the problem statement mentions household observable attributes, let’s
list down some of the attributes in our house and observable things. We have
listed a few, based on our thought process.
Refrigerator
Mobile phones
Laptop
Desktop
Water facility
Toilet facility
Wall type
Ceiling type
Fan, Light
Electricity
Stove
Cylinder
Kitchen room
No. of rooms/house
Type of Door lock
Number of persons living per house
Number of persons working/house
Number of persons dependent on others per house - like children,
grandma, grandpa (Age < 18 or Age > 65)
Own house or rented house
Some of the observable attributes are mentioned above. You can list others
from your end in case if we have missed any.
Let’s start making our assumptions.
1. If a household holds all the above-mentioned attributes then definitely they
should not be in the extreme poverty line.
2. Though a household holds all the above-mentioned attributes, if the
dependency rate is high, maybe we can categorize them in a moderate
poverty line.
3. If the wall is in bad condition, then it simply means that the household is not
having enough money to renovate that. With this assumption, we can say
that the household should be extremely poor. But it also depends on whether
the house is owned or rented and other similar factors like that.
4. If there is no power supply for a household, for a long time, we can say that
they do not have money to pay and get an EB connection for their home and
so even they are in the extreme poverty line.
5. If a house has around 5 members, only one person is working, the walls,
ceilings, and floor are of poor material, and the home is rented, then the
household should be in the extreme poverty line.
6. If a house has only one room and no space for the kitchen, then definitely he
should not be having good sanitary facilities. This kind of household should
be focused on our program.
7. If the number of rooms is more than 3, then we can say that this household
must own his house or pay his rent regularly. We can ignore this household
as he is not in an extreme poverty line.
8. If a house doesn’t have children and pregnant ladies, these people can
manage themselves somehow. But if only one person is earning and
considering he is living in a rented room and he has children and pregnant
women, it becomes difficult for him to manage everything. So depending on
his income, we can categorize him and target that household.
9. If there is no toilet facility for a household, it means that this particular
household should be in the extreme poverty line.
10.If there is no one working in a house, and if more persons are living in that
house, their daily food becomes a huge challenge.
These are the few assumptions we have made before looking into the dataset.
Now with these assumptions let’s look into the dataset and move forward.
What do we have to do? Identify what the problem statement is all about -
We need to identify the "level of income qualification" needed for families in
Latin America based on household-level attributes, and we also have to maximize
the accuracy score across all the categories. A generalized solution would be
helpful to IDB and other institutions working towards helping the economically
weaker sections of society.
This is a supervised multi-class classification machine learning problem:
Supervised: Provided with the labels for the training data
Multi-class classification: Labels are discrete values with 4 classes
Each row represents one individual and each column is a feature, either unique
to the individual, or for the household of the individual. The training set has one
additional column, Target, which represents the poverty level on a 1-4 scale and
is the label for the competition. A value of 1 is the most extreme poverty.
# Assuming the train and test CSVs have been loaded into train_data and test_data
print("Train Dataset - Rows, Columns:", train_data.shape)
print("Test Dataset - Rows, Columns:", test_data.shape)
# Output
# Train Dataset - Rows, Columns: (9557, 143)
# Test Dataset - Rows, Columns: (23856, 142)
We are provided with 2 datasets; one is a train and the other one is a test. So,
whatever changes we are doing to the training dataset, everything has to be
done to the Test dataset as well. For eg, let’s say if you find any null value in the
Train dataset, and you are treating it in the training dataset and forgot to do the
same in the test dataset, then our model would definitely go for a toss. Do you
agree? So, make sure that you treat the Train and Test dataset equally when you
are provided with two datasets - one with a Target column and the other one
without a Target column.
Understanding the features –
We have been provided with a Train and a Test dataset which are almost the
same, except for the presence of the 'Target' variable in the Train dataset.
Now, let's understand all the features by looking at their data types in the
Train dataset.
The explanations for all 143 columns can be found on Github1, but a few to note
are below:
Id: a unique identifier for each individual (this feature can’t be used to
train our model, but can be used to identify a person).
idhogar: a unique identifier for each household. This variable is not an
important feature but will be used to group individuals by the household
as all individuals in a household will have the same identifier.
parentesco1: indicates if this person is the head of the household.
Target: the label, which should be the same for all members in a
household.
If we observe each row with parentesco1 column, we would understand that
every record is on the individual level, with each individual having unique
features and also information about their household. To create a dataset for the
task, we'll have to aggregate individual’s data for each household. Moreover,
we have to make predictions for every individual in the test set, but "only the
heads of household are used in scoring", which means we want to predict
poverty on a household basis.
1 Github link - https://github.com/Dataebook/KaggleExploration
Let’s look at the different values target contains and their frequency
train_data['Target'].value_counts()
# Output
# 4 5996
# 2 1597
# 3 1209
# 1 755
# Name: Target, dtype: int64
missmap = train_data.isnull().sum().to_frame()
missmap = missmap.sort_values(0, ascending = False)
missmap.head()
# Output
# rez_esc = 7928
# v18q1 = 7342
# v2a1 = 6860
# meaneduc = 5
# SQBmeaned = 5
Here we can see 5 columns with null values in the entire dataset. Let's look at
the data types and apply our common sense to fill the null values.
Fields with missing values -
rez_esc = Years behind in school
We can check for the unique values of the columns and then identify if the
column is a mixture of alphabetical and numerical values.
By doing this check, we observe that there are 3 Object data type columns with
mixed values: 'dependency', 'edjefe', and 'edjefa'.
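One simple way to run this check (a sketch using the usual pandas calls on the
train_data DataFrame from above) is to look at the object-typed columns and
their unique values:
# Columns stored as generic Python objects are candidates for mixed values
obj_cols = train_data.select_dtypes(include="object").columns
print(obj_cols)

for col in ['dependency', 'edjefe', 'edjefa']:
    print(col, train_data[col].unique()[:10])  # a mix of numbers and 'yes'/'no' strings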
According to the documentation for these columns:
dependency: Calculated as the ratio of ‘number of members of the household
younger than 19 or older than 64’ to ‘the number of members of the household
between 19 and 64’.
edjefe: years of education of the male head of household, based on the
interaction of escolari (years of education), head of household and
gender, yes=1 and no=0
For these three variables, it seems that "yes" = 1 and "no" = 0, so we can fix them by mapping the values accordingly.
Once we have ensured that no column is a mixture of alphabets and numbers, we can go ahead with the null value treatment. Remember that if the mixed values are not converted first, the approach will be wrong.
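A minimal sketch of that mapping (train_data and test_data are the dataframe names assumed here):
# Convert the three mixed object columns to numeric before treating nulls
mapping = {'yes': 1, 'no': 0}
for df in (train_data, test_data):
    for col in ['dependency', 'edjefe', 'edjefa']:
        df[col] = df[col].replace(mapping).astype(float)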
Anyone younger than 7 or older than 19 presumably has no years behind, so the value should be set to 0. For this variable, if the individual is over 19 and has a missing value, or is younger than 7 and has a missing value, we can set it to zero.
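A hedged sketch of that rule, assuming the individual's age is stored in a column named age:
# Outside the 7-19 school-age range, a missing 'years behind' can be set to 0
for df in (train_data, test_data):
    outside_school_age = (df['age'] < 7) | (df['age'] > 19)
    df.loc[outside_school_age & df['rez_esc'].isnull(), 'rez_esc'] = 0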
Null Value treatment for the 'v18q1' column -
Applying Common Sense:
v18q1 - (total nulls: 7342) = Number of tablets the household owns
Here a missing value most likely means the household simply owns no tablet, so it can be filled with 0.
Why only look at the null counts? Let's also look at a few rows with nulls in v2a1 (Monthly rent payment). The columns related to monthly rent payment are listed below:
tipovivi1 = 1 own and fully-paid house
tipovivi3 = 1 rented
tipovivi4 = 1 precarious
If a household owns its house outright (tipovivi1 = 1), there is no rent to pay, so a missing v2a1 can reasonably be set to 0 as well.
Let's look at the columns related to average years of education for adults (18+):
edjefe -> years of education of the male head of household, based on the
interaction of escolari (years of education), head of household and
gender, yes=1 and no=0.
edjefa -> years of education of the female head of household, based on
the interaction of escolari (years of education), head of household and
gender, yes=1 and no=0.
instlevel1 = 1 -> no level of education
instlevel2 = 1 -> incomplete primary
data = train_data[train_data['meaneduc'].isnull()].head()
columns=['edjefe','edjefa','instlevel1','instlevel2']
data[columns][data[columns]['instlevel1']>0].describe()
From this we can see that meaneduc is null for individuals who have no level of education (instlevel1 = 1), i.e. the mean education is effectively 0. So, we can replace these NaNs with 0 to fix the meaneduc column.
Things may not be crystal clear here; refer to the Github link for the entire code with a proper explanation.
Null Value treatment for ‘SQBmeaned’ column -
Applying Common Sense:
SQBmeaned - (total nulls: 5) = Square of the mean years of education of
adults (>=18) in the household.
This column is just the square of the mean years of education of adults.
data = train_data[train_data['SQBmeaned'].isnull()].head()
columns=['edjefe','edjefa','instlevel1','instlevel2']
data[columns][data[columns]['instlevel1']>0].describe()
The above code tells us that SQBmeaned is null for individuals who have no level of education (instlevel1 = 1). So, we can replace these NaNs with 0 to fix the 'SQBmeaned' column.
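A minimal sketch of both fills, in line with the observations above:
# Fill the handful of missing education aggregates with 0
for df in (train_data, test_data):
    df['meaneduc'] = df['meaneduc'].fillna(0)
    df['SQBmeaned'] = df['SQBmeaned'].fillna(0)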
Almost all the null values have been treated properly with the domain and data
understanding.
Note: Observe that all the null values in this problem statement end up being replaced with zero. Do you think they were replaced just like that? No. They were replaced based on domain knowledge and common sense, and this is how null value treatment has to be done.
Column Definitions
As a part of the analysis, we have to define the columns that are at an individual
level and a household level using the data descriptions. There is simply no other
way to identify which variables are at the household level, other than going
through the variables themselves in the data description.
We'll define different variables because we need to treat some of them
differently. Once we have the variables defined on each level, we can start
aggregating them as needed.
The process is as follows:
1. Break variables at a household level and an individual level.
2. Find suitable aggregations for the individual-level data.
Ordinal variables can use statistical aggregations.
Boolean variables can also be aggregated but with fewer stats.
3. Join the individual aggregations back to the household-level data (a minimal sketch of steps 2 and 3 follows below).
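A rough sketch of steps 2 and 3, using escolari only as an example of an individual-level column (the full code on Github aggregates many more):
# Aggregate an individual-level column per household ...
ind_agg = (train_data.groupby('idhogar')['escolari']
           .agg(['min', 'max', 'mean'])
           .add_prefix('escolari_')
           .reset_index())
# ... and join it back to the household-level rows (the heads of household)
heads = train_data[train_data['parentesco1'] == 1]
household = heads.merge(ind_agg, on='idhogar', how='left')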
With the help of correlation scores, features that show a correlation greater than 0.9 with another feature are identified and removed to avoid redundancy. Similarly, you have to look for redundant variables in your dataset and remove them before building the model.
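One common way to do this (a sketch; df stands for whichever dataframe you are pruning at this stage) is to scan the upper triangle of the absolute correlation matrix:
import numpy as np

corr = df.select_dtypes('number').corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape), k=1).astype(bool))
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
df = df.drop(columns=to_drop)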
Note: Remember whatever changes you are doing to the train dataset, it has
to be applied to the test dataset as well.
For complete code with detailed steps, refer to Github1.
1. Github link - https://github.com/Dataebook/KaggleExploration
Churn Prediction
Brief about the problem statement
Churn prediction is all about retaining customers. When you offer a service, there will be a group of people who are not really interested in availing the service you offer. Such customers are more likely to stop using your service in the near future; these customers are referred to as churned customers.
This is not just true for services; it happens with products as well. Customers might start disassociating themselves from a product they have been associated with for a long time, and the reason could be anything.
Customer churn happens when customers stop doing business with a company.
Did you ever wonder why service providers like AIRTEL and IDEA give different offers to different customers even though they have millions of customers? You might have got an offer for Rs 450 while a friend or family member got a similar offer at a cheaper price from the same service provider.
Building assumptions and use of common sense plays a major role in the Data
Science domain, so we need to use our common sense to come up with
assumptions and statistical tests would help us to validate those assumptions.
Before diving into the data, it is important to understand the importance of the problem. What change will you bring about by solving it? If the change isn't big, it won't motivate you to solve the problem. That is why it is important to understand the weight of the business problem you are going to solve.
Interact with your client and ask how your work is going to affect their organization; trust me, the client will be happy to see that you are interested in understanding their business and not just working like a bot. The importance of the business problem will also keep you motivated throughout the process, because there will be many instances where the challenges you face might demotivate you.
Retaining a customer is cheap compared to acquiring new customers. Existing customers build a certain level of trust in the company once they have been associated with it for some time. Acquiring new customers, on the other hand, is costly, because money has to be spent on drawing their attention to your company.
New customers are hesitant when it comes to establishing a new relationship; it is only natural that we, too, might be hesitant to try a new product. To deal with this, the company might offer some services for free, after which there is a chance that the new customer gets engaged with the company. But this isn't easy: out of around 1000 potential customers, only a handful might turn into leads. The conversion from potential customer to lead is quite challenging, and the conversion ratio is quite low.
This is what Michael Redbord, general manager of Service Hub at HubSpot, has to say about churn:
“In a subscription-based business, even a small rate of monthly/quarterly churn
will compound quickly over time. Just 1 percent monthly churn translates to
almost 12 percent yearly churn. Given that it’s far more expensive to acquire a
new customer than to retain an existing one, businesses with high churn rates
will quickly find themselves in a financial hole as they have to devote more and
more resources to new customer acquisition.”
Customer Segmentation -
Here the customers are segmented into groups depending on the type of services they avail, their needs, level of engagement, monetary value, feedback, and many other things. The customers in a category share common beliefs and behaviour patterns, which allows us to focus on these groups/categories rather than on the entire set of customers. If we try to focus on the entire customer base at once, different customers will have faced different issues; but customers with similar issues can be grouped and targeted collectively. So instead of building one model on the entire set of customers, we can build models that are specific to a group/category and represent that segment. This approach is observed to give far better results than targeting all customers in the same way.
Observation window -
We aim to get some idea about the behavior of the customers who have churned
and then check if our existing customers are following the same pattern. If we
get to see a similar pattern then this means that a specific customer might churn.
So we observe the behavior of customers for a particular period (called a window) that ends at a specific point in time, and we make predictions about a period, or window, that starts after the observation window has ended. The observation window is the customer history - how the customer behaved in the past - while the prediction window, also called the performance window, is where we try to predict whether the customer will churn based on that history.
Spotify incorporated a similar approach when they were new in the market. They offered a free membership for a month and checked whether people were using the app in the second week after registration. If a customer was not using it in the second week, that customer was likely to churn. So they had the 3rd and 4th weeks in hand (called re-engagement weeks), where they tried to re-engage the customers who had stopped using the service after the 2nd week. In their case the observation window was 2 weeks, because they had observed that customers are most likely to churn in the 2nd week. This was specific to their business, and it varies from business to business.
So the correct choice of window size comes with business knowledge and experience. A shorter observation window might give you ample time for re-engagement, but you might not be able to correctly identify the behavior of the customer.
Our data has around 90% samples for class 0 and 10% samples for class 1.
From the below plot we can see that our data has a huge imbalance
Sensitivity (Recall) = True Positives / (True Positives + False Negatives)
3. Specificity - It tells us how well our model predicts the negative labels out of all the actual negative labels; in other words, it is the fraction of actual negatives that were correctly identified.
Specificity = True Negatives / (True Negatives + False Positives)
4. False Positive Rate (FPR) - It is the ratio of labels that were incorrectly classified
as positive (when they are actually negative) to the total number of negative
labels.
FPR (1 - Specificity) = False Positives / (True Negatives + False Positives)
5. Precision - It is the ratio of correctly predicted positive labels to all the labels predicted as positive by our model (correct + incorrect). It tells us how correctly our model has predicted the positive labels out of all the labels that it has predicted as positive. It is also the fraction of relevant instances among the retrieved instances.
Precision = True Positives / (True Positives + False Positives)
Let us understand Precision and Recall with a real-life example –
In the image above we have searched for 'hp pavilion', so the ratio of relevant suggestions to all the rendered suggestions becomes our Precision. Recall would be the ratio of the relevant suggestions rendered to all the relevant suggestions that exist. The suggestions we can see aren't the only relevant suggestions; there are many other terms/strings that are relevant to the search query.
So we should thoroughly understand the problem statement before deciding
which metric to use.
Suppose we build an AI system to list corrupt people. In this case we would be interested in high Recall, as we would want to identify every corrupt person that exists; missing any of them is an injustice and a harm to society.
If correctly identifying the positives is our goal, then we should go with Sensitivity (Recall); if our goal is to correctly identify the negatives, then we should go with Specificity. So the choice of metric completely depends on what you aim to achieve.
We should not use Accuracy when the data is
imbalanced. Why?
Suppose that for the above dataset, with features X and target y, we build a model which always predicts the majority class; let us see what happens.
We will fill y_pred with the mode of the target column i.e. y
import numpy as np
from scipy import stats
from sklearn.metrics import accuracy_score
# Populate y_pred with the value which has max freq. (the mode of y)
y_pred = np.full(shape=y.shape, fill_value=stats.mode(y)[0][0])
accuracy_score(y, y_pred)
# Output - 0.9001
This gives us an accuracy of 90% as we have classified all the samples as the
majority class, but we have badly failed to classify any of the samples of the
minority class.
from sklearn.metrics import precision_score, recall_score
# Print Precision and Recall
print('Precision: {:.2f}, Recall: {:.2f}'.format(precision_score(y, y_pred),
                                                 recall_score(y, y_pred)))
# Output - Precision: 0.00, Recall: 0.00
Now suppose we flip the labels, so that our data has around 90% samples for class 1 and 10% samples for class 0.
Figure 19: Imbalanced data
We can see that earlier label 1 was our minority class which has now become
our majority class. Let us do the same that we have done above and then check
the metrics.
# Doing the same as above; y_new is the target after flipping the labels
y_pred_new = np.full(shape=y_new.shape, fill_value=stats.mode(y_new)[0][0])
accuracy_score(y_new, y_pred_new)
# Output - 0.9
The accuracy is still the same, but Precision and Recall have now shot up dramatically. High Precision and high Recall are what we strive for, so on paper our model looks fabulous - and it is performing this well without even using Machine Learning.
But we know that our model would fail badly at predicting the minority class. So our belief that Precision and Recall could help us when we have imbalanced data seems to have failed.
This doesn't mean that Precision and Recall don't help when we deal with imbalanced data - they are super useful - but they have a strong bias against label 0: they look only at the class labelled 1 and completely ignore the class labelled 0. This is the case with a lot of things in Data Science: we are taught to treat the interesting (positive) class as label 1 and to ignore label 0, and Precision and Recall behave exactly that way.
Had label 1 been the minority class and label 0 the majority class, Precision and Recall would have told us that our model is poor. We have already seen this at the start.
We usually see a severe imbalance in Fraud Detection, Churn Prediction, etc. In such cases our focus is on correctly predicting the minority class, so we want to give more importance to it. If that is the case, then we should represent the minority class with label 1; if we do so, we can trust the results given by Precision and Recall.
Generally, it is suggested to give the minority class label 1 so that Precision and Recall can be trusted.
The same applies to the F1 score, because the F1 score is calculated from Precision and Recall; when both Precision and Recall are high, the F1 score is high.
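For reference, the F1 score is the harmonic mean of the two:
F1 = 2 × (Precision × Recall) / (Precision + Recall)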
But an important thing to note is that it is very difficult to get a high F1 score, because it requires high Precision and high Recall at the same time. When we push for high Recall we are effectively saying that we want to catch all the positive labels, which also tends to increase the False Positives, and an increase in False Positives leads to a decrease in Precision (this is also evident from the formula).
Consider the churn prediction use-case: suppose 100 customers are actually going to leave our service, and we predict that just 20 customers are about to churn. Let's say 19 of those 20 are correct (they really would churn), so our Precision is quite high (19/20 = 0.95), because Precision checks how many of the 20 we predicted were correct. But Recall checks how many of the 100 actual churners we identified, and that ratio is quite low, because we have identified only 19 out of 100 (0.19).
We aim to identify the maximum number of churners correctly, so Recall makes sense here. Increasing Recall might increase the False Positives, but a False Positive is not a major concern in this use-case: if we flag a customer who won't churn as one who will, there's not much harm in that. But if we fail to identify a customer who will churn and treat them as one who won't, that is a huge problem.
So the selection of a metric depends on what we want to achieve.
Now let’s try building a model the usual way i.e. using train_test_split.
from sklearn.model_selection import train_test_split
# Train test split, stratified so that both splits keep the class ratio
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    stratify=y, random_state=1)
The y_train and y_test now have a distribution similar to the original dataset, i.e. around 90% negative samples. This happened because we used the stratify=y parameter.
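A quick check (assuming y_train and y_test are pandas Series):
# Both splits should show roughly the same 90/10 class ratio
print(y_train.value_counts(normalize=True))
print(y_test.value_counts(normalize=True))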
Oversampling
RandomOverSampler
Over-sample the minority class(es) by picking samples at random with
replacement.
from imblearn.over_sampling import RandomOverSampler
# Random over-sampling: grow the minority class to half the size of the
# majority class (sampling_strategy=0.5)
oversampled_data = RandomOverSampler(sampling_strategy=0.5)
X_over, y_over = oversampled_data.fit_resample(X_train, y_train)
After building a model we get the metrics shown above; in our case the metrics are the same because we only have 2 features, but you would observe a difference in the metrics when you apply these techniques to datasets with more features.
Pipeline
There are many scenarios where we have a fixed set of transformations to be
performed on the data, post which we would like to fit a model. All these things
need not be done separately, we can create a pipeline and add all the
transformations followed by the model we want to fit into the pipeline. Using
the pipeline makes our life easier.
Intermediate steps of the pipeline must be ‘transformations’, that is, they must
implement fit and transform methods. The final estimator only needs to
implement fit.
Below we create a pipeline where we first add an oversampler, followed by an undersampler, followed by the ML model we would like to fit.
from imblearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_validate
# Initialize the pipeline with the required steps; 'oversampled_data' and
# 'under_sampler' are the sampling objects created earlier
pipeline = Pipeline([('smote', oversampled_data),
                     ('under', under_sampler),
                     ('model', LogisticRegression())])
scoring = ['accuracy', 'precision', 'recall']
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# Evaluate the model with repeated stratified cross-validation
scores = cross_validate(pipeline, X, y, scoring=scoring, cv=cv, n_jobs=-1,
                        return_train_score=True)
print('Accuracy: {:.2f}, Precision: {:.2f}, Recall: {:.2f}'.format(
np.mean(scores['test_accuracy']),
np.mean(scores['test_precision']),
np.mean(scores['test_recall'])))
from imblearn.combine import SMOTEENN
combined_sampling = SMOTEENN()  # combines SMOTE over-sampling with ENN cleaning
Here we can observe that recall is 0.88, and since we have used cross-validation the model is evaluated across the entire dataset. This estimate is therefore more trustworthy than the one we obtain from a single train_test_split.
Incorrect practice
Sampling techniques and transformations are meant for the training data only, not for the test or validation data. The snippet below shows what the leakage looks like: the data is resampled before it is split, so synthetic samples end up in the test set.
# Incorrect practice - Leading to Data-Leakage: resampling BEFORE the split
from imblearn.over_sampling import SMOTE

oversampled_data = SMOTE(sampling_strategy=0.5)
X_res, y_res = oversampled_data.fit_resample(X, y)
X_train, X_test, y_train, y_test = train_test_split(X_res, y_res, test_size=0.3)

# We use liblinear because the documentation says that for small datasets,
# 'liblinear' is a good choice
logistic = LogisticRegression(solver='liblinear')
logistic.fit(X_train, y_train)
y_pred = logistic.predict(X_test)
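For contrast, a leakage-free sketch resamples only the training fold and leaves the test set untouched:
from imblearn.over_sampling import SMOTE
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Correct practice: split first, then resample only the training data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    stratify=y, random_state=1)
X_res, y_res = SMOTE(sampling_strategy=0.5).fit_resample(X_train, y_train)

logistic = LogisticRegression(solver='liblinear')
logistic.fit(X_res, y_res)
y_pred = logistic.predict(X_test)   # the test set stays untouched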
But this won't give us a clear idea about which website is better; we will have to run some additional tests to come to a final conclusion. You cannot just blindly replace an old website with a new one - that could have serious consequences - and you also cannot interact with the entire population, hence we go ahead with hypothesis testing.
Every Data Scientist will first understand the business problem and then apply
statistics and probability knowledge to find out the different insights to work
with.
Knowing that, and given that there are many visualization tools out there in the market, you have to ask yourself: is the job going to be easy or difficult? The job is still going to be difficult, and you have to be very smart about how you make use of such tools. Automation tools are made to reduce your time and effort, but you should be in a position to think and take a call. If you don't know why you are plotting a particular graph and what you are looking for in a particular visualization, then tools can't help you in any way.
The hypothesis you come up with might be wrong but it would help you get
a better understanding about the data.
Bot detection
Problem Statement: Build a model that can detect the Non-Human Traffic
present on a website.
Wrong practice - As the problem statement seems quite easy, most of you would directly start working on the dataset without understanding the business case in detail. That is where you would go wrong. So first, we will spend some time making ourselves familiar with the problem statement.
Why do we need to detect Bots?
No matter how good your website is, you’re always guaranteed to receive traffic
from bots at some point in time. These bots can do different things on your
website ranging from indexing web pages to scraping your content. With so
many different bots out there, how can you detect bot traffic on your website?
And should you be concerned about it?
Here are 5 reasons why you need bot detection:
Bots can steal your content - You know how hard your content was to develop. You have carefully crafted all the blog posts and pages, and all that effort can be lost in a second if you let bots access your site. Bots can scrape your website for data, information, and even pricing in no time. This data can then be used on other sites, redistributed, or even sold to someone else.
Bots can slow down your site - Bots bog down your site and overwhelm it with
inauthentic, fraudulent traffic. This results in slower page load times for your
actual paying customers, which could affect their level of satisfaction or even
deter them from buying or visiting altogether.
Bots can threaten your website. Malicious bots can hack your website, insert
inappropriate links and content, or even crash your site altogether. This can hurt
your traffic, your customers, and your sales.
A bot can take up extra time and money. Many bots spend their time posting
spam comments to websites and blogs. While this may not seem like a huge
issue, it can be quite frustrating. You’ll have to spend hours each month sorting
through these comments to separate the human commenters from the
fraudulent ones, which takes you and your resources away from actually running
your business. If you don’t remove these spam comments, they end up annoying
your readers and possibly leading them away from your site.
Bots can mess up your analytics. Analytics is highly important to a website
owner. They tell you how your site is performing, where traffic is coming from,
and what you might want to tweak throughout the site. Unfortunately, if you
have a significant amount of bots accessing your site, this can throw your
analytics into upheaval. You won’t have a clear picture of your site’s
performance or your next steps for improvement, and you won’t be able to tell
what’s real and what’s fake.
So, we have done a bit of research about the problem statement and we have understood why it is important to detect a bot. You are advised to research at your end as well: try to read about how bots can be detected and the techniques used by different people out there. This will help you develop an understanding of how you should proceed and will give you a starting point to think further.
Now we are going to look at how this problem can be solved by a wrong
approach and then by the right approach using Feature Engineering and making
assumptions.
Note - Our main motive is to show you the right approach to solving any project. We will focus on making assumptions and will not discuss the code in detail; it can be downloaded from the Github1 link. So feel free to go through the code, and you can get in touch with the authors on Linkedin with your doubts.
1. Github link - https://github.com/Dataebook/BotdetectionSnippet
Wrong Approach
At first, we will see how this problem statement can be dealt with using the wrong approach, as this is what most of us would do -
1. We might delete some columns which we think are not useful. This is the approach most people follow: if a column has more than 40% null values, we just delete it.
# Dropping useless columns & columns with too many nulls
data.drop('Unnamed: 0', axis=1, inplace=True)
data.drop('device_type', axis=1, inplace=True)
3. Then we may check the different values of year and date.
Now, we will again check for nulls and we won't find any, because we have dropped all of them without understanding whether those particular columns make sense for our business problem or not.
We have a big dataset with millions of records and we can't work with the entire dataset at once, so we may check which country contributes the maximum traffic with the help of the code below.
plt.figure(figsize=(15,8))
sns.countplot(y=data['intgrtd_mngmt_name'])
plt.title('intgrtd_mngmt_name',size=20)
After seeing the output we come to know that most of the IPs are from India,
USA, and Japan.
4. Visualizations we may look at -
Which is the most used OS?
Which website is visited the most?
5. Without any understanding of the problem statement, we may drop some more columns because we think they are not useful, and we do this without any research.
useless = ['city','st','sec_lvl_domn','operating_sys','wk',
'mth','yr']
data.drop(useless,axis=1,inplace=True)
6. In the end, we might decide to work only with some of the top countries (traffic-wise). After this we would proceed to build a model, and our model would be able to detect when it is a bot but would fail to detect when it isn't a bot, which is just as important.
We didn't observe whether our model is learning or memorizing. The saddest part is that we finished this project with millions of data points in just a few hours without any research, and yet we feel accomplished simply because we have applied Machine Learning algorithms.
Note - You can find the notebook with the wrong solution on Github1.
1. Github link - https://github.com/Dataebook/BotdetectionSnippet
Right Approach
We will try to understand every feature and its importance by doing some research on the internet. We will also check whether there are existing ways of solving this use-case, as they could give us some ideas and broaden our thinking.
We will try to understand the behavior of the bots and will try to observe it quite
closely as it would help us in feature engineering and also to make assumptions.
Features Overview
ctry_name - Represents the country name.
user_agent - Gives information about the browser that was used to hit the
URL and version type of user agent.
VISIT - Tells the number of times a particular ip_address visited an URL.
wk – Week info.
Page vw ts – Tells about the time when the page was visited.
4. Does user_agent make sense while detecting bot, if yes, think how?
5. Difference between wk, mth, yr, and page vw ts, and are all the four
features important?
7. Does sec lvl domain make any sense for our use-case?
9. As we know, we have 10 million+ records, and for sure we can't deal with all the IPs; at the same time we may have a lot of duplicates, so how will we deal with that?
Note: You will get the required .ipynb file on Github1.
We can't work on all the IPs that visit our website in a day, because many of them visit just once. So we try to filter the IPs that have a high number of views or that visit too many times. The rules for detecting bots mention that bots show a similar pattern when visiting any website.
We can assume that a normal human won't visit or view a particular website more than 24 times a day - just an assumption, since a day has 24 hours.
1. Github link - https://github.com/Dataebook/BotdetectionSnippet
import pandas as pd

# IP addresses that have total views greater than 24 in a day
ip_views = pd.DataFrame(data_ibm.groupby('ip_addr').VIEWS.sum().sort_values())
unique_ip_address = list(ip_views[ip_views.VIEWS > 24].index)
# Limiting the dataset to those rows whose IP is present in unique_ip_address
new_data = data_ibm[data_ibm.ip_addr.isin(unique_ip_address)]
# Output
# No. Of unique ip's 7231
# ['38d87886d615dd8e5f3f92d4b3bc7c344e4125633e6ea0cc90f70a5bffc1a69a',
# '2d514edec300dea1ee1eae5170bd1dd24c6e628d2f28074ec7ffe62ccb009b00',
# 'bc47449f582bde3943caa85c67a59a7c2b5dee4d2800a4ee8723e065d68eb74e',
# '7a8211f17123bbd84bbfd914498104a1f42932a691f5eb6299fe7217e3dc67a3',
# 'ee1c6a74446bbf39ac19431e415c431e4c6e47f9a415bbd514e3c6d1acb6386b',
# '14fdc36060a6c319e7f616157cc48d83c253caccac6ac1d2838de56c1e23ce6d',
# '5134b48b14c000e886c74619ee11cccb1dbe98c6ed3c3dc82550a7a33bc6d9ee',
# '16ebc267de6c5c886c7c515fbac4b9137abe0611f8ceba82835faa44913e1ad1',
# '23e225f92cf2669e1aa550a7e4a92efa943474e02c78aea18bb35774032bf497',
# '13656abd7d885ddee912bd9d8a96a2feed0362a8383ac7527d74e777e3d40ab0']
Step 1 -
We have come up with a new feature called Bounce_rate, which captures how many times a particular IP has hit the URL on an hourly basis - much like frequency, the number of vibrations per second.
We have also created new features for the hours of the day, from 0_hour to 23_hour, which help us visualize in which hour of the day a particular IP address visits the URL and help us monitor the behaviour of an IP. For example, if one IP visits the URL more than 15 times in an hour, we can assume it may be a bot.
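A hedged sketch of these hour features, assuming the page-view timestamp column is named page_vw_ts and parses as a datetime:
# Derive the hour of day from the page-view timestamp
data_ibm['page_vw_ts'] = pd.to_datetime(data_ibm['page_vw_ts'])
data_ibm['hour'] = data_ibm['page_vw_ts'].dt.hour

# Hits per IP per hour, pivoted into 0_hour ... 23_hour columns
hourly = data_ibm.groupby(['ip_addr', 'hour']).size().unstack(fill_value=0)
hourly.columns = ['{}_hour'.format(h) for h in hourly.columns]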
Just give a thought to what you just read.
We can then gather information about the unique IPs: visits per hour, the corresponding bounce rate of each IP, and information related to its origin. Using all this information we can construct a dataframe on which we can train a model.
We have introduced some more features, like hour_avg and daily_avg, which help us identify how often a particular IP visits the URL in a particular hour, and based on that we calculate the bounce_rate.
Download the .ipynb file from Github for better understanding.
This book is meant to make you do the work yourself, so give it a thought, do some research about what we are discussing, and get back to us with any queries.
Step 2 -
Here we will build a dataset called the Global Dataset. It will contain the full history of every ip_addr that has visited us before, and if an IP visits us again, its important values are appended to the global_dataset.
This dataset will be the most important part of the program because it will help in labeling the classes as bot or not; it contains the information that is used in the next part, where we label the classes.
Importance of ‘Global Dataset’ -
For example, if you productionize the model and an IP comes in, then:
1. It will be compared with the existing dataset. If that particular IP address has visited us earlier at some point in time, then based on the past behavior of the IP address the model will decide whether it is a bot or not.
2. What if the particular IP address is not present in the past data? Then it won't be sent directly to the model. First it will be sent to the global dataset, where its daily, hourly, and weekend activity is calculated. We have also created attributes like hourly_avg and daily_avg to calculate the exact bounce rate, based on which the model will predict whether it's a bot or not.
Just think about this
Let's say you have created a new attribute bounce_rate, and with the help of hourly_avg and daily_avg you calculate its exact value. But when you productionize the model, the unseen data coming in won't have the new features that you created with the help of research and understanding. So before feeding data to the model you have to write custom code or create a pipeline which converts the original attributes at production time into the attributes you created using feature engineering (feature engineering is not a fixed recipe; research about the business problem combined with common sense is what gets it done). So, how will you tackle such things?
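One way to tackle it, sketched under the assumption that the raw traffic rows carry a VIEWS column, is to wrap the feature engineering in a scikit-learn style transformer so the exact same conversion runs in production; the formulas below are placeholders, not the ones used in the notebook:
from sklearn.base import BaseEstimator, TransformerMixin

class TrafficFeatures(BaseEstimator, TransformerMixin):
    """Turns raw traffic attributes into the engineered ones the model expects."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        X = X.copy()
        X['hourly_avg'] = X['VIEWS'] / 24                        # placeholder formula
        X['daily_avg'] = X['VIEWS']                              # placeholder formula
        X['bounce_rate'] = X['hourly_avg'] / X['daily_avg'].clip(lower=1)
        return X

Such a transformer can then sit in front of the classifier in a Pipeline, so production data goes through the same conversion as the training data.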
Step 3 -
In this part, we will label the IPs in the "new_ip_data" dataset with the help of the "Global Dataset" and the rules.
With the help of research and an understanding of bot behavior, we have come up with some rules which will help our model achieve a scalable solution.
We will write a function that returns a dataframe with all the historical data of a single IP address (don't worry - download the .ipynb file from Github and play around with the code to make yourself comfortable).
In the end, after a lot of research (which takes time and patience), understanding, and common sense, we have come up with some rules and better ways to understand how your model should work in production.
So, give it a thought and analyze how these rules can make our model learn incrementally.
We don't believe in spoon-feeding: perform your own EDA and build your own assumptions, as there are multiple solutions to the same problem, and do get in touch with us for more discussion at any point of time.
Rule 1 - Labeling based on hour
Rule 2 - Labeling on the basis of daily_avg
Rule 3 - Labeling on the basis of weekday_avg
Rule 4 - Labeling on the basis of bounce rate
Rule 5 - Labeling on the basis of Operating system
If we don't have past information about an ip_address, then we are left with only three rules (a sketch of how they might be combined follows the list):
Hourly_average
Bounce_rate
Operating_system
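As a rough sketch of how those three fallback rules might be combined - the 15-hits-per-hour figure comes from the assumption made earlier, while the bounce-rate threshold and the list of suspicious operating systems are purely illustrative:
def label_new_ip(row, hourly_threshold=15, bounce_threshold=0.5,
                 suspicious_os=('unknown',)):
    """Fallback labelling for IPs with no history: 1 = bot, 0 = human."""
    if row['hourly_avg'] > hourly_threshold:      # rule: hourly average
        return 1
    if row['bounce_rate'] > bounce_threshold:     # rule: bounce rate
        return 1
    if row['operating_sys'] in suspicious_os:     # rule: operating system
        return 1
    return 0

new_ip_data['label'] = new_ip_data.apply(label_new_ip, axis=1)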
So, we have come up with a lot of assumptions; do download the .ipynb file from Github1. Try, fail, build, and discuss with us at any point in time.
Remember, this isn't the only possible solution. There are many other solutions available, and they can be found by doing some research and putting in the effort to solve this use-case.
1. Github link - https://github.com/Dataebook/BotdetectionSnippet
FROM THE AUTHORS DESK
Vivek Chaudhary
Creator at Dataebook || Data Scientist || Community
Builder
Linkedin1
Around August '17, a question popped up in my mind about how companies like
Samsung fix the price of their new phones & are confident enough to get a good
margin of profit from the market.
That question stuck in my mind, I started researching it, and that was my first step into the Data Science domain.
From that day, on a daily basis, I read different case studies about how Data Science is helping industries do better, which in turn helped me understand this domain closely and build expertise in it.
You have to be a research enthusiast before getting into this domain & be clear
about why you want to get into this domain, as Data Science is not Everyone’s
cup of tea.
Some of the advice I would love to deliver to the readers,
1. Don’t compare yourself with other people in the same domain, if you do then
you doubt yourself & lead yourself to nowhere except negative thoughts.
2. Sometimes we may think that Kaggle is out of our league and that we can't compete with Kaggle Grandmasters, but did you ever consider that those masters were once in your position with the same thoughts? They didn't give up; patience and enthusiasm for research are what helped them get there.
“No-one is born as an expert but one can die as an expert. You just have to
work with patience because good things take time”.
1. Linkedin Profile - https://www.linkedin.com/in/-vivek-chaudhary
3. Don’t focus on building a Machine Learning model with 95% accuracy,
instead focus on the process to build a Machine Learning model which starts
from understanding the business case, making assumptions, EDA, applied
statistics, data cleaning technique, feature encoding technique,
preprocessing before building a model.
4. Don't waste your time just learning Python for data science; instead, choose an industry, research how Data Science is helping that industry or organization, look at the existing solutions, and then figure out whether there is something new you can come up with.
Remember, only if you research can you come up with your own thoughts; otherwise you will be busy building Machine Learning models with 95% accuracy which are of no use to your client.
5. Pick any existing project with a solution, understand it, and try to analyze
the code.
Don’t worry if you don’t understand any line of code, just research about
the same & learn about it while you apply.
● For sure it requires patience and hard work, because sometimes you may have to invest 4 to 5 hours to understand a couple of lines of code; but believe me, if you follow this process for at least 2 to 4 projects, you will definitely get good at it.
6. If you want to get into this domain, first you have to believe in yourself
because if you don’t, then no one will. You can't master this domain as new
things come up every now and then, but yes, if you follow such an approach
then for sure, you will be confident enough to work in a project because
you have learnt it in a hard way.
7. Don't limit yourself to a certain technique. For example, most individuals will apply one-hot encoding or label encoding when dealing with categorical features without knowing why. Instead, if you search, you will find a lot of different techniques, and if we understand the concepts, who knows - we may even come up with some new techniques of our own.
Don’t be in a rush, because this domain needs patience and a lot of smart as
well as hard work.
Thanks for reading & you can get in touch with me at any point of time for
more discussion on the same.
Anirudh Dayma
Machine Learning Engineer | Technical Writer
Linkedin1 | Medium 2
I love to explore this field of Data Science, I also write technical articles for
Analytics Vidhya and Towards AI. My aim is to learn new things and explain
them in the simplest possible way. As Albert Einstein has rightly said that “If
you can't explain it to a 6-year-old, you don't understand it yourself”. I truly
believe in this statement of his and hence try my best to explain stuff in a
simpler way.
1. Linkedin Profile - https://www.linkedin.com/in/anirudh-dayma-457861144/
2. Medium Profile - https://medium.com/@anirudh.daymaa
After this I started exploring statistics, realized the importance of making assumptions, and that led me to hypothesis testing. Statistics is a topic which many Data Science enthusiasts skip because they don't find it interesting, and also because there are not many resources that explain statistics in a simple manner. We have tried to explain statistics by citing some real-life examples, so that after reading this book, even if you don't remember the definition, you will remember the example, which will help you remember the concept.
How to master Data Science? Well, I am still looking for that person who says
that he has mastered Data Science. You cannot master Data Science, no one can,
it is a process where you apply your knowledge to solve the problem at hand,
and one problem may have different solutions, there is no such streamlined
process that will make you the master of data science. So try to master problem-
solving and critical thinking skills. If you are a good problem solver you can be a
good Data Scientist too. And to master problem-solving skills you have to get
your hands dirty.
1. Linkedin Profile - https://www.linkedin.com/in/me-manvendra/
In the first phase, I used to learn algorithms and apply them to some dataset. And this kept going for quite some time. When I talked with some of the industry experts, I realized that no one cares about your fancy algorithms. They
want to see how you solved a given problem statement and what value it adds to society and the company. So I started solving problem statements instead of just building fancy classifiers. When you work on these problem statements and projects and get your hands dirty, that is how you learn to solve problems and develop analytical thinking and reasoning.
So my advice to anyone who is looking to start their career in this field would be
to work on projects. The learning you will have from working on projects cannot
be compared with other programs. You will become a better problem solver and
all the technology will be just a tool for you to solve these problems.
John Gabriel T J
Data Analyst || Data Science Enthusiast
Linkedin1
In a quick note, a Data Scientist is someone who can predict the future based on past patterns. Who wouldn't be glad and interested to work on making changes to the world we live in, with the help of data?
1. Linkedin Profile - https://www.linkedin.com/in/johngabrields/
It's easy to get frustrated while learning new things, and not only in Data Science. Without proper motivation it is very difficult for anyone to keep learning new stuff. In my case, however, I started understanding concepts and was able to learn better as I kept engaging myself.
Along the way, I attended a few of Vivek's sessions. After attending a few of his lectures, I was able to see things from a different perspective. All the stuff I had learnt until then, I had only learnt and never applied anywhere. After trying out his suggested way of learning, I can say that it is more practical and more interesting as well. All he said was: learn Data Science by reverse engineering.
My motivation:
“Learn from mistakes and with the lesson what you learnt, apply it in daily
life to overcome it. Don’t wish for it. Do it”.
Though my career started with being a Developer, I have now found the work that suits me best and am working towards it very seriously.
Leave a review
Please share your thoughts on this book by leaving a review on the site that
you bought it from. This would help other potential readers to make
purchasing decisions.