Project Management Methodology-Batch - 17082020-7AM
Introduction:
In this article, we will learn all the important statistical concepts which are required
for Data Science roles.
Table of Contents:
1. Difference between Parameter and Statistic
2. Statistics and its types
3. Data Types and Measurement levels
4. Moments of Business Decision
5. Central Limit Theorem (CLT)
6. Probability Distributions
7. Graphical representations
8. Hypothesis Testing
1. Difference between Parameter and Statistic
In our day-to-day work, we keep talking about the population and the sample. So, it is very important to know the terminology used to represent both.
A parameter is a number that describes the data from the population. And, a
statistic is a number that describes the data from a sample.
2. Statistics and its types
The Wikipedia definition of Statistics states that “it is a discipline that concerns the
collection, organization, analysis, interpretation, and presentation of data.”
Statistics is broadly divided into two types:
1. Descriptive Statistics
2. Inferential Statistics
Descriptive Statistics:
As the name suggests in Descriptive statistics, we describe the data using the
Mean, Standard deviation, Charts, or Probability distributions.
Inferential Statistics:
In Inferential statistics, we draw a sample from the population, analyse it, and generalize the results back to the population. For example, suppose we survey 1,000 people from a city with a population of 5 lakh (5,00,000) to find out how many own two-wheelers.
From the survey conducted, it is found that 800 people out of 1,000 (that is, 80%) own two-wheelers. So, we can infer these results to the population and conclude that about 4 lakh people out of the 5-lakh population own two-wheelers.
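As a rough sketch in Python (using the survey numbers from the example above), the inference is simply scaling the sample proportion up to the population:

# A minimal sketch of the inference above: estimate the population count
# of two-wheeler owners from the sample proportion (numbers taken from
# the example in the text).
sample_size = 1_000
two_wheeler_owners = 800
population_size = 500_000  # 5 lakh

sample_proportion = two_wheeler_owners / sample_size  # 0.80
estimated_owners = sample_proportion * population_size

print(f"Sample proportion: {sample_proportion:.0%}")
print(f"Estimated two-wheeler owners in the population: {estimated_owners:,.0f}")  # ~4 lakh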
3. Data Types and Measurement levels
Qualitative data is non-numerical. Some examples are eye colour, car brand, city, etc.
On the other hand, Quantitative data is numerical, and it is again divided into
Continuous and Discrete data.
Categorical data represents the type of data that can be divided into groups.
Examples are age group, sex, etc.
Level of Measurement
1. Nominal Scale
2. Ordinal Scale
3. Interval Scale
4. Ratio Scale
1. Nominal Scale: This scale contains the least information since the data have
names/labels only. It can be used for classification. We cannot perform
mathematical operations on nominal data because there is no numerical value to
the options (numbers associated with the names can only be used as tags).
2. Ordinal Scale: In comparison to the nominal scale, the ordinal scale has more
information because along with the labels, it has order/direction.
3. Interval Scale: It is a numerical scale. The Interval scale has more information
than the nominal, ordinal scales. Along with the order, we know the difference
between the two variables (interval indicates the distance between two entities).
4. Ratio Scale: The ratio scale has the most information about the data. Unlike the
other three scales, the ratio scale can accommodate a true zero point. The ratio
scale is simply said to be the combination of the Nominal, Ordinal, and Interval scales.
4. Moments of Business Decision
We have four moments of business decision that help us understand the data.
4.1. Measure of Central Tendency
Talks about the centrality of the data. To keep it simple, it is a part of descriptive
statistical analysis where a single value at the centre represents the entire dataset.
Mean: It is the sum of all the data points divided by the total number of values in
the dataset. The mean cannot always be relied upon because it is influenced by outliers.
Median: It is the middle value of the data when the values are arranged in order.
Unlike the mean, the median is not influenced by outliers.
Mode: It is the most repeated value in the dataset. Data with a single mode is
called unimodal, data with two modes is called bimodal, and data with more than
two modes is called multimodal.
4.2. Measure of Dispersion
Talks about the spread of the data around the central value.
Variance: It is the average squared distance of all the data points from their mean.
The problem with Variance is that the units also get squared.
Standard deviation: It is the square root of the Variance, which brings the units back to the original scale.
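A minimal Python sketch of these measures on a small made-up dataset (the values below are illustrative only):

# First two moments of business decision on a toy dataset.
import statistics

data = [12, 15, 11, 14, 15, 120]  # 120 acts as an outlier

print("Mean:", statistics.mean(data))      # pulled up by the outlier
print("Median:", statistics.median(data))  # robust to the outlier
print("Mode:", statistics.mode(data))      # most repeated value (15)
print("Variance:", statistics.variance(data))  # units are squared
print("Std dev:", statistics.stdev(data))      # back to the original units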
4.3. Skewness
It measures the asymmetry in the data. The two types of Skewness are positive (right) skewness and negative (left) skewness.
4.4. Kurtosis
Talks about the central peakedness of the distribution or the fatness of its tails. The three types of Kurtosis are mesokurtic, leptokurtic, and platykurtic.
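A short Python sketch of the third and fourth moments, assuming scipy is available; the exponential and normal samples below are only for illustration:

# Skewness and kurtosis on simulated data.
import numpy as np
from scipy.stats import skew, kurtosis

rng = np.random.default_rng(0)
right_skewed = rng.exponential(scale=2.0, size=10_000)  # long right tail
symmetric = rng.normal(loc=0.0, scale=1.0, size=10_000)

print("Skewness (exponential):", skew(right_skewed))   # positive -> right skew
print("Skewness (normal):", skew(symmetric))           # close to 0
print("Excess kurtosis (normal):", kurtosis(symmetric))        # ~0 (mesokurtic)
print("Excess kurtosis (exponential):", kurtosis(right_skewed))  # > 0 (leptokurtic)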
5. Central Limit Theorem (CLT)
Instead of analyzing the entire population data, we always take out a sample for
analysis. The problem with sampling is that the sample mean is a random variable:
it varies for different samples. And no random sample we draw can ever be an exact
representation of the population. This phenomenon is called sampling variation.
To nullify the effect of sampling variation, we use the Central Limit Theorem. According to
the Central Limit Theorem:
1. The distribution of sample means follows a normal distribution even though the
population is not normal, provided the sample size is large enough.
2. The grand average of all the sample mean values gives us the population mean.
3. Theoretically, the sample size should be 30. And practically, the condition on
the sample size (n) is n ≥ 30.
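A rough simulation of the CLT in Python, assuming numpy is available; the exponential population below is made up purely to show that sample means still look approximately normal:

# CLT demo: means of samples from a non-normal (exponential) population.
import numpy as np

rng = np.random.default_rng(42)
population = rng.exponential(scale=2.0, size=100_000)  # clearly not normal

sample_means = [rng.choice(population, size=30).mean() for _ in range(5_000)]

print("Population mean:", population.mean())
print("Grand average of sample means:", np.mean(sample_means))  # close to the population mean
print("Std dev of sample means:", np.std(sample_means))         # ~ sigma / sqrt(30)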
6. Probability distributions
Please read this article of mine about different types of Probability distributions.
7. Graphical representations
For a single variable (Univariate analysis), we have a bar plot, line plot, frequency
plot, dot plot, boxplot, and the Normal Q-Q plot.
7.1. Boxplot
A boxplot is a graphical representation of the five-number summary of the data.
The five numbers are the minimum, first Quartile (Q1), median (Q2), third Quartile
(Q3), and maximum.
The box region will contain 50% of the data. The lower 25% of the data region is
called the Lower whisker and the upper 25% of the data region is called the Upper
Whisker.
The Interquartile region (IQR) is the difference between the third and first
quartiles. IQR = Q3 – Q1.
Outliers are the data points that lie below the lower whisker limit and beyond the upper
whisker limit.
The lower whisker limit is given as Q1 – 1.5 * (IQR); points below it are outliers.
The upper whisker limit is given as Q3 + 1.5 * (IQR); points beyond it are outliers.
Boxplot (Source)
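A small Python sketch of the five-number summary and the 1.5 * IQR rule on a made-up dataset:

# Five-number summary and IQR-based outlier detection.
import numpy as np

data = np.array([2, 4, 5, 5, 6, 7, 7, 8, 9, 30])  # 30 is a deliberate outlier

q1, q2, q3 = np.percentile(data, [25, 50, 75])
iqr = q3 - q1
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr

outliers = data[(data < lower_limit) | (data > upper_limit)]
print("Q1, median, Q3:", q1, q2, q3)
print("IQR:", iqr)
print("Whisker limits:", lower_limit, upper_limit)
print("Outliers:", outliers)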
A Normal Q-Q plot is a kind of scatter plot that is plotted by creating two sets of
quantiles. It is used to check whether the data follows a normal distribution.
Normal Q-Q plot (Source)
On the x-axis, we have the theoretical quantiles (Z-scores), and on the y-axis, we have
the actual sample quantiles. If the points form a straight line, the data is said to be
approximately normal.
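A minimal Python sketch of a Normal Q-Q plot, assuming scipy and matplotlib are available; the sample below is simulated normal data:

# Normal Q-Q plot: theoretical normal quantiles on x, sample quantiles on y.
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(7)
sample = rng.normal(loc=50, scale=5, size=200)

stats.probplot(sample, dist="norm", plot=plt)
plt.title("Normal Q-Q plot")
plt.show()  # roughly a straight line -> the data looks normal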
8. Hypothesis Testing
End Notes:
Thank you for reading till the end. By the end of this article, we are familiar
with the important statistical concepts.
I hope this article is informative. Feel free to share it with your study buddies.
Project Management Methodology – CRISP DM
1. Business Problem: During certain seasons, the inflow of emergency cases is high, and we
are unable to plan for more physicians during those days because no forecasting is available
a. Business Objective: Maximize patient care
b. Business Constraints: Minimize operational expenses
2. Data Collection
a. Primary Data Sources (Data collected at source) – E.g. Surveys, Experiments,
Interviews
i. Costly & quality of data is not guaranteed
ii. Get exact data which is needed for analysis
b. Secondary Data Sources (Data already available) e.g. HMS, EMR, HRE
i. Available for free & easily
ii. May or may not have exact data needed for analysis
c. Data Types
i. Continuous data – values in decimal format make sense
ii. Discrete data – values in decimal format do not make sense
1. Categorical
a. Binary – 2 categories
b. Multiple - >2 categories
2. Count
iii. Structured vs Unstructured
iv. Streaming data (Real Time processing) vs Batch processing
v. Cross – sectional data vs Time series
vi. Balanced vs Imbalanced
d. Drawing samples from the population and analysing the sample to make statements
about the population is called Inferential Statistics. Any measure calculated on a
sample is called a sample statistic, and any measure calculated on the population is
called a population parameter
3. Data Cleaning / Exploratory Data Analysis /Feature Engineering
a. Data Cleaning / Data Preparation
i. Outlier Analysis/ Treatment 3R
1. Rectify
2. Remove
3. Retain
ii. Imputation of missing values
1. Mean
2. Median
3. Mode
iii. Normalization / Standardization – makes the data unitless & scale free (see the sketch after this outline)
1. Standardization -> Z = (X - mean) / stdev
a. Mean = 0 & stdev = 1
2. Normalization -> (X - min(X)) / (max(X) - min(X))
a. Min = 0 & Max = 1
iv. Transformation
v. Discretization / Binning / Grouping
b. Exploratory Data Analysis /Descriptive Analytics
i. First Moment Business Decision / Measure of central tendency
1. Mean – Average – It gets influenced by Outliers
2. Median – Middle value of the dataset – It is not influenced by Outliers
3. Mode – The value that repeats the maximum number of times
ii. Second Moment Business Decision / Measure of Dispersion
1. Variance – Units get squared
2. Standard Deviation – Square root of Variance; get back the original
units
3. Range – Max − Min
iii. Third Moment Business Decision – Skewness
iv. Graphical Representation
1. Univariate
a. Histogram
b. Bar plot
c. Box Plot
d. Q-Q Plot
2. Bivariate
a. Scatter Plot
b. Correlation
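As referenced in the Normalization / Standardization item above, here is a minimal Python sketch of the two scaling methods; the "age" values are made up for illustration:

# Standardization vs Normalization on a toy column.
import numpy as np

age = np.array([22.0, 25.0, 30.0, 35.0, 60.0])

# Standardization (z-score): mean 0, standard deviation 1
standardized = (age - age.mean()) / age.std()

# Normalization (min-max scaling): values between 0 and 1
normalized = (age - age.min()) / (age.max() - age.min())

print("Standardized:", standardized.round(2))
print("Normalized:", normalized.round(2))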
I hope you find this helpful and wish you the best of luck in your
data science endeavors!
There are many steps that can be taken when data wrangling and data cleaning; the
most common ones include imputing missing values, treating outliers, removing
duplicates, and normalizing or standardizing features.
Boxplot vs Histogram
L2 is less robust but has a stable solution and always one solution.
L1 is more robust but has an unstable solution and can possibly have
multiple solutions.
Q: What is cross-validation?
Cross-validation is a technique used to assess how well a model generalizes to an
independent dataset by repeatedly splitting the data into training and validation
(testing) sets.
A) Adjusted R-squared.
B) Cross-Validation
A method common to most people is cross-validation, splitting the
data into two sets: training and testing data. See the answer to the
first question for more on this.
Q: What is overfitting?
Overfitting is an error where the model ‘fits’ the training data too well,
resulting in a model with high variance and low bias. As a
consequence, an overfit model will inaccurately predict new data
points even though it has high accuracy on the training data.
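A rough Python sketch of overfitting: a high-degree polynomial fits the noisy training points almost perfectly but usually does worse on fresh test data than a simpler fit; all the data below is simulated:

# Overfitting demo: compare a moderate and a high-degree polynomial fit.
import numpy as np

rng = np.random.default_rng(0)
x_train = np.linspace(0, 1, 10)
y_train = np.sin(2 * np.pi * x_train) + rng.normal(0, 0.2, size=10)
x_test = np.linspace(0, 1, 50)
y_test = np.sin(2 * np.pi * x_test) + rng.normal(0, 0.2, size=50)

for degree in (3, 9):
    coeffs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    # the degree-9 fit typically shows a much larger gap between train and test error
    print(f"degree {degree}: train MSE {train_mse:.3f}, test MSE {test_mse:.3f}")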
Q: What is boosting?
Boosting is an ensemble method in which weak learners are trained sequentially,
with each new learner focusing on the errors of the previous ones, so that together
they form a strong learner.
For this, I’m going to look at the eight rules of probability laid
out here and the four different counting methods (see more here).
Rule #8: For any two events A and B, P(A and B) = P(A)
* P(B|A); this is called the general multiplication rule
Counting Methods
Let’s say the first card you draw from each deck is a red Ace.
This means that in the deck with 12 reds and 12 blacks, there are now
11 reds and 12 blacks. Therefore your odds of drawing another red
are equal to 11/(11+12) or 11/23.
In the deck with 24 reds and 24 blacks, there would then be 23 reds
and 24 blacks. Therefore your odds of drawing another red are equal
to 23/(23+24) or 23/47.
Since 23/47 > 11/23, the second deck with more cards has a higher
probability of drawing two cards of the same color.
1. The null hypothesis is that the coin is not biased and the
probability of flipping heads equals 50% (p = 0.5). The
alternative hypothesis is that the coin is biased and p !=
0.5.
2. Set the significance level (alpha), for example 0.05.
3. Calculate the Z-score (if the sample size is less than 30, you would
calculate the t-statistic instead).
4. Find the p-value corresponding to the test statistic.
5. If p-value > alpha, the null is not rejected and the coin is
not biased.
If p-value < alpha, the null is rejected and the coin is
biased.
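A hedged Python sketch of the test above as a two-sided one-proportion z-test; the counts (550 heads out of 1,000 flips) and alpha = 0.05 are assumptions made for illustration:

# One-proportion z-test for coin bias.
from math import sqrt
from scipy.stats import norm

n, heads = 1_000, 550
p0 = 0.5      # null hypothesis: fair coin
alpha = 0.05

p_hat = heads / n
z = (p_hat - p0) / sqrt(p0 * (1 - p0) / n)
p_value = 2 * (1 - norm.cdf(abs(z)))

print(f"z = {z:.2f}, p-value = {p_value:.4f}")
print("Reject null (coin biased)" if p_value < alpha else "Fail to reject null")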
Since a coin flip is a binary outcome, you can make an unfair coin
fair by flipping it twice. If you flip it twice, there are two outcomes
that you can bet on: heads followed by tails or tails followed by
heads. If both flips come up the same (heads-heads or tails-tails), you
discard the result and flip twice again. Because the two mixed outcomes
have the same probability regardless of the bias, betting on them is fair.
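A small Python simulation of this trick (von Neumann's method); the 0.7 bias below is an arbitrary choice:

# Simulate producing fair flips from a biased coin.
import random

def fair_flip(p_heads: float) -> str:
    """Produce a fair outcome from a coin biased with P(heads) = p_heads."""
    while True:
        first = random.random() < p_heads
        second = random.random() < p_heads
        if first and not second:
            return "heads"
        if second and not first:
            return "tails"
        # HH or TT: discard and flip twice again

results = [fair_flip(0.7) for _ in range(100_000)]
print("Proportion of heads:", results.count("heads") / len(results))  # ~0.5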
You can tell that this question is related to Bayesian theory because
of the last statement which essentially follows the structure, “What
is the probability A is true given B is true?” Therefore we need to
know the probability of it raining in London on a given day. Let’s
assume it’s 25%.
Therefore, if all three friends say that it’s raining, then there’s an
8/11 chance that it’s actually raining.
For example:
Therefore, the probability that the cards picked are not the same
number and the same color is 69.2%.
An inlier is a data observation that lies within the rest of the dataset
but is nonetheless unusual or an error. Since it lies inside the dataset, it is
typically harder to identify than an outlier, and external data is often required to
identify it. Should you identify any inliers, you can simply
remove them from the dataset to address them.
Mean/Median/Mode imputation
The method of testing depends on the cause of the spike, but you
would conduct hypothesis testing to determine if the inferred cause
is the actual cause.
Q: Give examples of data that does not have a
Gaussian distribution, nor log-normal.
Categorical data (e.g., coin flips or survey responses) and exponentially
distributed data (e.g., the time between rare events) are common examples.
The Law of Large Numbers is a theory that states that as the number
of trials increases, the average of the result will become closer to the
expected value.
E.g. the proportion of heads from flipping a fair coin 100,000 times should be
closer to 0.5 than the proportion from 100 flips.
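A quick Python sketch of the Law of Large Numbers with simulated coin flips:

# Law of Large Numbers: more flips -> proportion closer to the expected 0.5.
import numpy as np

rng = np.random.default_rng(1)
for n in (100, 100_000):
    flips = rng.integers(0, 2, size=n)  # 1 = heads, 0 = tails
    print(f"{n:>7} flips -> proportion of heads = {flips.mean():.4f}")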
There are many things that you can do to control and minimize bias.
Two common things include randomization, where participants
are assigned by chance, and random sampling, sampling in which
each member has an equal probability of being chosen.
You can use hypothesis testing to prove that males are taller on
average than females.
The null hypothesis would state that males and females are the same
height on average, while the alternative hypothesis would state that
the average height of males is greater than the average height of
females.
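A hedged Python sketch of this test as a one-sided two-sample t-test; the height samples below are simulated, not real survey data:

# Two-sample t-test: are males taller on average than females?
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
male_heights = rng.normal(loc=175, scale=7, size=200)    # cm, simulated
female_heights = rng.normal(loc=162, scale=6, size=200)  # cm, simulated

t_stat, p_value = stats.ttest_ind(male_heights, female_heights,
                                  equal_var=False, alternative="greater")
print(f"t = {t_stat:.2f}, p-value = {p_value:.2e}")
# a small p-value leads us to reject the null of equal average heights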
k (actual) = 10 infections
lambda (theoretical) = (1/100)*1787
p = 0.032372 or 3.2372%, calculated using poisson.dist() in Excel or
ppois() in R
P(3 or more heads) = P(3 heads) + P(4 heads) + P(5 heads) = 0.94
or 94%
Using Excel…
p =1-norm.dist(1200, 1020, 50, true)
p= 0.000159
x=3
mean = 2.5*4 = 10
using Excel…
p = poisson.dist(3,10,true)
p = 0.010336
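For reference, the same computations sketched with scipy instead of Excel (assuming scipy is available):

# Reproduce the Excel formulas above with scipy.
from scipy.stats import norm, poisson

# p = 1 - norm.dist(1200, 1020, 50, TRUE)
print(1 - norm.cdf(1200, loc=1020, scale=50))  # ~0.000159

# p = poisson.dist(3, 10, TRUE)  -> P(X <= 3) with lambda = 10
print(poisson.cdf(3, mu=10))                   # ~0.010336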
Since 70 is one standard deviation below the mean, take the area of
the Gaussian distribution to the left of one standard deviation below the mean,
which is approximately 16%.
See here for full tutorial on finding the Confidence Interval for Two
Independent Samples.
This query says to choose the MAX salary that isn’t equal to the
MAX salary, which is equivalent to saying to choose the second-
highest salary!
SELECT MAX(salary) AS SecondHighestSalary
FROM Employee
WHERE salary != (SELECT MAX(salary) FROM Employee)
Given a Weather table, write a SQL query to find the Ids of all dates with a
temperature higher than the previous day's (yesterday's) temperature.
+---------+------------------+------------------+
| Id(INT) | RecordDate(DATE) | Temperature(INT) |
+---------+------------------+------------------+
| 1 | 2015-01-01 | 10 |
| 2 | 2015-01-02 | 25 |
| 3 | 2015-01-03 | 20 |
| 4 | 2015-01-04 | 30 |
+---------+------------------+------------------+
SOLUTION: DATEDIFF()
In plain English, the query is saying, Select the Ids where the
temperature on a given day is greater than the temperature
yesterday.
SELECT DISTINCT a.Id
FROM Weather a, Weather b
WHERE a.Temperature > b.Temperature
AND DATEDIFF(a.Recorddate, b.Recorddate) = 1
PROBLEM #4: Department Highest Salary
The Employee table holds all employees. Every employee has an Id,
a salary, and there is also a column for the department Id.
+----+-------+--------+--------------+
| Id | Name | Salary | DepartmentId |
+----+-------+--------+--------------+
| 1 | Joe | 70000 | 1 |
| 2 | Jim | 90000 | 1 |
| 3 | Henry | 80000 | 2 |
| 4 | Sam | 60000 | 2 |
| 5 | Max | 90000 | 1 |
+----+-------+--------+--------------+
Write a SQL query to find employees who have the highest salary
in each of the departments. For the above tables, your SQL query
should return the following rows (order of rows does not matter).
+------------+----------+--------+
| Department | Employee | Salary |
+------------+----------+--------+
| IT | Max | 90000 |
| IT | Jim | 90000 |
| Sales | Henry | 80000 |
+------------+----------+--------+
SOLUTION: IN Clause
Can you write a SQL query to output the result for Mary?
+---------+---------+
| id | student |
+---------+---------+
| 1 | Abbot |
| 2 | Doris |
| 3 | Emerson |
| 4 | Green |
| 5 | Jeames |
+---------+---------+
Note:
If the number of students is odd, there is no need to change the last
one’s seat.
Miscellaneous
Q: If there are 8 marbles of equal weight and 1 marble
that weighs a little bit more (for a total of 9 marbles),
how many weighings are required to determine which
marble is the heaviest?
1. You would split the nine marbles into three groups of three
and weigh two of the groups. If the scale balances
(alternative 1), you know that the heavy marble is in the
third group of marbles. Otherwise, you’ll take the group
that is weighed more heavily (alternative 2).
2. Then you would repeat the same step, but with three groups of
one marble instead of three groups of three. So, only two weighings
are required to find the heaviest marble.
I’m not 100% sure about the answer to this question but will give
my best shot!
Let’s take the instance where there’s an increase in the prime
membership fee — there are two parties involved, the buyers and the
sellers.
Focusing on likes per user, there are two reasons why this would
have gone up. The first reason is that the engagement of users has
generally increased on average over time — this makes sense
because as time passes, active users are more likely to be loyal users
as using the platform becomes a habitual practice. The other reason
why likes per user would increase is that the denominator, the total
number of users, is decreasing. Assuming that users that stop using
the platform are inactive users, aka users with little engagement and
fewer likes than average, this would increase the average number of
likes per user.
To take it a step further, it’s possible that the ‘users with little
engagement’ are bots that Facebook has been able to detect. But
over time, Facebook has been able to develop algorithms to spot and
remove bots. If there were a significant number of bots before, this can
potentially be the root cause of this phenomenon.
80/20 rule: also known as the Pareto principle; states that 80% of
the effects come from 20% of the causes. Eg. 80% of sales come from
20% of customers.