Revision Module 1,2,3
Decision Sciences
Revision
Presented by: Dr. Anuja Shukla
Agenda
Module 1
• Quiz Module 1 (5 minutes): https://forms.gle/fKZ4zUt1cdK4jfgS8
• Doubt collection (5 minutes)
• Doubt resolution (20 minutes)
Module 2
• Quiz Module 2 (5 minutes): https://forms.gle/qD9vWqA2ziZnS1ea7
• Doubt collection (5 minutes)
• Doubt resolution (20 minutes)
Module 3
• Quiz Module 3 (5 minutes): https://forms.gle/ztguRfJhWqAWhoUKA
• Doubt collection (5 minutes)
• Doubt resolution (20 minutes)
Quiz: Final For Practice
• https://forms.gle/SPqJEteFmDvj3TLd8
Total Time: 90 minutes
Module 1
• Session 1: Basics of Statistics
• Session 2: Measure of Central Tendency
• Session 3: Probability and Probability Distributions
• Session 4: Sampling and Estimation
Descriptive and Inferential Statistics
• Descriptive Statistics is a branch of statistics that describes or summarizes a collection
of information. Descriptive statistics uses the data to provide descriptions of the
population, either through numerical calculations or graphs or tables.
• Inferential statistics makes inferences and predictions about a population based on a
sample of data taken from the population in question.
Statistics
• Descriptive Statistics: presenting, organising and summarizing data
• Inferential Statistics: drawing conclusions about a population based on data observed in a sample
Data Visualisation
• Data visualization is the graphical representation of information and data. By using
visual elements like charts, graphs, and maps, data visualization tools provide an
accessible way to see and understand trends, outliers, and patterns in data.
• Bar Graph
• Pareto Chart
• Histogram
• Line Chart
• Pie Chart
• Pivot chart
Mean
Median
Mode
Mean
• Mean is the sum of all the data values divided by the total number of sample
values.
• The population mean is commonly represented by the symbol μ (the sample mean by x̄).
• Mean = Sum of all observations/ Number of observations
Median
• Arrange the sample data in ascending order of value, from left to right; the
value in the middle is called the median.
• For an odd number of values, we have one median.
• For an even number of values, the median is the average of the two central
values.
• Suitable when the data contain outliers.
Data:
a. N is odd
4,7,9,10,12
Median= 3rd observation =9
b. N is even
4,7,9,10
Median = Average of 2nd and 3rd observation = (7+9)/2 = 8
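The two median cases above can be checked with a short Python sketch using the standard `statistics` module. The extra value 100 is a hypothetical outlier, added only to illustrate why the median is preferred when outliers are present.

```python
import statistics

odd_data = [4, 7, 9, 10, 12]   # N is odd
even_data = [4, 7, 9, 10]      # N is even

mean_odd = statistics.mean(odd_data)        # sum of observations / number of observations
median_odd = statistics.median(odd_data)    # middle (3rd) observation
median_even = statistics.median(even_data)  # average of the 2nd and 3rd observations

# Hypothetical outlier: the mean shifts a lot, the median only slightly.
with_outlier = odd_data + [100]
mean_outlier = statistics.mean(with_outlier)
median_outlier = statistics.median(with_outlier)

print("mean:", mean_odd, "median (odd N):", median_odd, "median (even N):", median_even)
print("with outlier -> mean:", mean_outlier, "median:", median_outlier)
```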
Mode
Variance
Standard deviation
Interquartile ranges
Variance
• One way of measuring the variation within a dataset can be with respect to a fixed
reference value. In the case of variance, this fixed value is the mean.
• Variance is defined as the mean of the square of the difference between data
points and the mean value of all data points within a dataset.
• Variance is a measure of variability that utilizes all the data.
• Average (approximately) of squared deviations of values from the mean.
• So, the formula of the sample variance is:
$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$
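As a minimal sketch, the sample variance with the n − 1 divisor can be computed step by step and compared with Python's built-in `statistics.variance`. The data values are reused from the median example purely for illustration.

```python
import math
import statistics

data = [4, 7, 9, 10, 12]  # illustrative values
n = len(data)

x_bar = sum(data) / n                                  # mean of the data
squared_deviations = [(x - x_bar) ** 2 for x in data]  # (x_i - mean)^2
sample_variance = sum(squared_deviations) / (n - 1)    # divide by n - 1
sample_std = math.sqrt(sample_variance)                # standard deviation

print("variance:", sample_variance, "standard deviation:", sample_std)
print("library check:", statistics.variance(data))
```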
Interquartile range
• Suitable when the data contain outliers.
• An unexpectedly large standard deviation is one sign that the data may
contain an outlier, because a single extreme value inflates the squared
deviations.
• In such cases, the interquartile range is a much better way to communicate
the variation or spread in the data.
• Quartile values are the values in a sample at the 25th, 50th and 75th
percentiles (Q1, Q2 and Q3); the 100th percentile is the maximum.
• Interquartile range = Q3 − Q1.
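The effect of an outlier on the two measures of spread can be sketched as follows. The data set and the extreme value 100 are hypothetical, and the quartiles use the default method of `statistics.quantiles`; Excel or other tools may use a slightly different quartile convention.

```python
import statistics

# Hypothetical data set, with and without one extreme value.
clean = [2, 4, 4, 5, 6, 7, 8, 9, 10, 11, 12]
with_outlier = clean + [100]

def iqr(values):
    """Interquartile range Q3 - Q1 (default quartile method of the statistics module)."""
    q1, _q2, q3 = statistics.quantiles(values, n=4)
    return q3 - q1

iqr_clean, iqr_outlier = iqr(clean), iqr(with_outlier)
sd_clean, sd_outlier = statistics.stdev(clean), statistics.stdev(with_outlier)

print("IQR:", iqr_clean, "->", iqr_outlier)
print("standard deviation:", round(sd_clean, 2), "->", round(sd_outlier, 2))
```

In this sketch the standard deviation grows several times over when the outlier is added, while the interquartile range barely changes.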
p+q=1
Types of Probability
Probability
• Classical or Theoretical Approach
• Empirical or Frequentist Approach
• Subjective Approach
Classical or Theoretical Approach
• P(H)=5052/10,000
• Following the frequentist approach, you would conclude that the probability of getting heads is
0.5052 and that of getting tails is 0.4948 for that particular coin.
Example
• Suppose an insurance company knows from past actuarial data that of all males 40
years old, about 60 out of every 100,000 will die within a 1-year period. Using this
method, the company estimates the probability of death for that age group as 60
per 100,000. Calculate the probability of death.
p = 60/100,000 = 0.0006
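Both frequentist estimates above are simple relative frequencies (number of occurrences divided by number of observations). A small sketch with the figures from the slides:

```python
# Coin example: 5052 heads observed in 10,000 tosses.
heads, tosses = 5052, 10_000
p_heads = heads / tosses
p_tails = 1 - p_heads  # p + q = 1

# Insurance example: 60 deaths per 100,000 males aged 40.
deaths, group_size = 60, 100_000
p_death = deaths / group_size

print(p_heads, round(p_tails, 4), p_death)
```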
Subjective Probability
• Discrete (Binomial Distribution)
• Continuous (Normal Distribution)
Discrete random variable
• A discrete random variable is one which may take on only a countable number of distinct
values such as 0,1,2,3,4,........
• The variables which can be counted, and do not have any decimal parts, are known as
discrete variables.
• Discrete random variables are usually (but not necessarily) counts. If a random variable can
take only a finite number of distinct values, then it must be discrete. The probability
distribution of a discrete random variable is a list of probabilities associated with each of its
possible values. It is also sometimes called the probability function or the probability mass
function.
• Examples: number of customers, roll of a die, toss of a coin, number of students in a class,
number of children in a family, the Friday night attendance at a cinema, the number of
patients in a doctor's surgery, the number of defective light bulbs in a box of ten.
• For example, the number of students in a class. A class can have 10 students or 11 students,
but it cannot have 10.25 students.
Continuous Random Variables
• A continuous random variable is one which takes an infinite number of possible
values.
• The variables which can be divided infinitely into smaller parts are known as
continuous variables. For example, a student’s height can be 1 metre or 0.99
metre, or 0.998 metre.
• Example: height, weight, the amount of sugar in an orange, the time required to
run a mile.
Density Curve
• Suppose a random variable X may take all values over an interval of real numbers.
Then the probability that X is in the set of outcomes A, P(A), is defined to be the
area above A and under a curve. The curve, which represents a function p(x), must
satisfy the following:
• 1: The curve has no negative values (p(x) ≥ 0 for all x)
• 2: The total area under the curve is equal to 1.
• A curve meeting these requirements is known as a density curve
• Link: http://www.stat.yale.edu/Courses/1997-98/101/ranvar.htm
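The two density-curve requirements can be checked numerically. The sketch below assumes the standard normal curve as the example density (any other valid density would do) and approximates the area with the trapezoid rule over the range −8 to 8, which is an arbitrary cut-off chosen because the tails beyond it are negligible.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0, sigma=1)  # assumed example density

step = 0.001
xs = [-8 + i * step for i in range(16001)]  # grid from -8 to 8
heights = [std_normal.pdf(x) for x in xs]

# Requirement 1: the curve is never negative.
non_negative = all(h >= 0 for h in heights)

# Requirement 2: the total area under the curve is 1 (trapezoid rule).
total_area = sum((heights[i] + heights[i + 1]) / 2 * step for i in range(len(xs) - 1))

# P(A) for A = [-1, 1] is the area under the curve above that interval.
area_between = std_normal.cdf(1) - std_normal.cdf(-1)

print(non_negative, round(total_area, 6), round(area_between, 4))
```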
Binomial Distribution
• The binomial probability distribution is the theoretical probability distribution of all
numbers of possible successes over a certain number of Bernoulli trials.
• A binomial experiment is a type of simple random experiment where only two mutually
exclusive outcomes are possible on any trial and those two outcomes are a success and
failure. Such trials where only one of two mutually exclusive outcomes is possible are
Bernoulli trials
• For example, flipping a coin is a Bernoulli trial, because only heads and tails are
possible. Heads could be defined as a “success” and tails could be defined as a
“failure.”
• A person with cancer who is taking a new experimental type of chemotherapy is a
Bernoulli trial, where the patient being cured is a “success” and the patient not
being cured is a “failure.”
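• The binomial probability is the probability of observing a certain number of successes (r)
over a certain number of independent Bernoulli trials (n):
$$P(X = r) = {}^{n}C_{r}\, p^{r} q^{n-r}, \qquad {}^{n}C_{r} = \frac{n!}{r!\,(n-r)!}$$
where $q = 1 - p$ and $n! = n \times (n-1) \times (n-2) \times (n-3) \times \dots \times 1$.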
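A minimal sketch of the binomial probability in Python, using `math.comb` for nCr. The function name `binomial_pmf` and the values n = 10, p = 0.5 are illustrative choices, not taken from the slides.

```python
from math import comb

def binomial_pmf(r, n, p):
    """Probability of exactly r successes in n independent Bernoulli trials."""
    q = 1 - p
    return comb(n, r) * p**r * q**(n - r)

# Illustrative values: 10 trials with success probability 0.5.
n, p = 10, 0.5
probabilities = [binomial_pmf(r, n, p) for r in range(n + 1)]

print("P(X = 5):", binomial_pmf(5, n, p))
print("sum over all r:", sum(probabilities))
```

The probabilities over all possible values of r add up to 1, as they must for any probability distribution.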
How many heads when we toss 3 coins?
• The three coins can land in eight possible ways:
HHH, HHT, HTT, HTH, THH, THT, TTH, TTT
• Sample space= {0, 1, 2, 3}
• We see just 1 case of Three Heads, but 3 cases of Two Heads, 3 cases of One Head,
and 1 case of Zero Heads. So:
• Total outcomes= 8
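The counts above can be reproduced by enumerating all outcomes of three tosses, assuming a fair coin so that every outcome is equally likely:

```python
from collections import Counter
from itertools import product

# All 2^3 sequences of heads (H) and tails (T).
outcomes = ["".join(tosses) for tosses in product("HT", repeat=3)]

# Number of outcomes for each possible number of heads.
head_counts = Counter(outcome.count("H") for outcome in outcomes)

# With equally likely outcomes, probability = cases / total outcomes.
probabilities = {k: head_counts[k] / len(outcomes) for k in sorted(head_counts)}

print(outcomes)
print(probabilities)
```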
Examples:
1. A new drug is introduced to cure a disease, it
either cures the disease (it’s successful) or it doesn’t
cure the disease (it’s a failure).
2. If you purchase a lottery ticket, you’re either
going to win money, or you aren’t.
Note: The binomial distribution applies to a process in which each trial has only two possible
outcomes (for example, true or false).
Example 1: The Binomial Probability Distribution
Ten consumers were asked to state their preferences between two types of
ice-cream. Assuming that there is no difference between two types of
ice-cream, calculate the probability that:
$$P(X = x) = \binom{n}{x} p^{x} q^{n-x} = \frac{n!}{x!\,(n-x)!}\, p^{x} q^{n-x}$$
a. What is the probability exactly 10 of the students surveyed will correctly identify
Coke or Pepsi?
b. What is the probability at least 10 of the students will correctly identify Coke or
Pepsi?
https://learn.upgrad.com/v/course/791/session/90258/segment/504954
Sample vs Population
• Population: the measurable quality is called a parameter
• Sample: the measurable quality is called a statistic
• The central limit theorem states that if you take sufficiently large random samples
(sample size ‘n’) from any population distribution with a mean μ and a finite standard
deviation σ, the distribution of sample means (or the ‘sampling distribution of
sample means’) will be approximately normal, with a mean µ and standard deviation
σ/√n.
• First, the mean of the sampling distribution is equal to the mean of
the population.
• Second, the standard deviation of the sampling distribution is equal
to the standard deviation of the population divided by the square root of the sample
size.
• Standard deviation of the sample means distribution is also referred to as the
‘standard error of the mean’, or simply the ‘standard error’, and is denoted by ‘SE’.
• Standard error for large samples (n > 30) = σ/√n.
Refer: http://onlinestatbook.com/stat_sim/sampling_dist/
Central Limit Theorem
• From the formula of the standard error, it is clear that as the sample size increases, the
sampling distribution of sample means becomes narrower and better resembles a
normal distribution.
• To summarise, the central limit theorem claims that irrespective of the probability
distribution of the population, the distribution of sample means follows a normal
distribution if the sample size is sufficiently large.
• The central limit theorem tells us that if we take many samples of size n, compute
the mean of each, and plot the frequencies of those means, we get an approximately
normal distribution. The approximation improves as the sample size n increases.
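The central limit theorem can be illustrated with a small simulation. This is only a sketch: the exponential population (mean 1, standard deviation 1), the sample size of 50, the number of samples and the random seed are all assumptions chosen for the demonstration.

```python
import math
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

POP_MEAN, POP_SD = 1.0, 1.0   # exponential population with rate 1 (strongly skewed)
SAMPLE_SIZE = 50              # n
NUM_SAMPLES = 5000            # how many samples are drawn

# Draw many samples and record the mean of each one.
sample_means = [
    statistics.fmean([random.expovariate(1.0) for _ in range(SAMPLE_SIZE)])
    for _ in range(NUM_SAMPLES)
]

mean_of_means = statistics.fmean(sample_means)
sd_of_means = statistics.stdev(sample_means)
standard_error = POP_SD / math.sqrt(SAMPLE_SIZE)  # sigma / sqrt(n)

print("mean of sample means:", round(mean_of_means, 3), "population mean:", POP_MEAN)
print("sd of sample means:", round(sd_of_means, 3), "sigma/sqrt(n):", round(standard_error, 3))
```

Even though the population is far from normal, the mean of the sample means lands close to the population mean and their spread is close to σ/√n.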
Type of Test: Z test (two tail, left tail, right tail) for a one-sample mean
• Normality assumption
• One sample
• Population standard deviation is known
• n > 30
Types of Hypothesis
The null hypothesis refers to a specified value of a population parameter, not of a sample statistic.
A null hypothesis may be rejected, but it can never be accepted based on a single test.
Tails of test
• Hypothesis Statement
Null hypothesis: The mileage is greater than or equal to 17
(as this is the default claim made by the brand )
Alternative hypothesis: The mileage is less than 17
(as this challenges the null hypothesis)
• Mathematically
Ho: Mileage (mean) ≥ 17
Hα: Mileage (mean) < 17
Formulating the Hypotheses
1. At least = more than or equal to (≥)
2. More than (>)
3. Less than (<)
4. At most = less than or equal to (≤)
Ho: Decrease in wear after 3 years ≤ 9%; Ha: Decrease in wear after 3 years > 9%
Formulating the Hypotheses
• Mr. Mohan of the Civil Engineering Department wants to test the load bearing
capacity of an old bridge which must be more than 10 tons, in that case he can
state his hypotheses as under:
• Null hypothesis H0: µ ≤ 10 tons
• Alternative hypothesis Ha: µ > 10 tons
A Broad Classification of Hypothesis Tests
Hypothesis Tests
• Tests of Association
• Tests of Differences
• Null hypothesis H0 : µ = 80
• Alternative Hypothesis Ha : µ ≠ 80
Formulating the Hypotheses
P value method
If p<=alpha , Reject Ho
If p> alpha, Fail to Reject Ho
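The p-value rule can be sketched in Python for a z statistic. The helper names `p_value_from_z` and `decide` are hypothetical, and the z values used are illustrative only.

```python
from statistics import NormalDist

def p_value_from_z(z, tail):
    """p-value of a z statistic for a 'left', 'right' or 'two' tailed test."""
    cdf = NormalDist().cdf(z)  # area to the left of z under the standard normal curve
    if tail == "left":
        return cdf
    if tail == "right":
        return 1 - cdf
    if tail == "two":
        return 2 * min(cdf, 1 - cdf)
    raise ValueError("tail must be 'left', 'right' or 'two'")

def decide(p_value, alpha=0.05):
    """Apply the rule: reject H0 if p <= alpha, otherwise fail to reject H0."""
    return "Reject H0" if p_value <= alpha else "Fail to reject H0"

for z, tail in [(-2.0, "left"), (1.0, "two")]:
    p = p_value_from_z(z, tail)
    print(z, tail, round(p, 4), decide(p))
```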
Confidence level
✓ The confidence level or reliability is the expected percentage of times that the actual value will fall
within the stated precision limits.
✓ Thus, if we take a confidence level of 95%, then we mean that there are 95 chances in 100 (or .95 in
1) that the sample results represent the true condition of the population within a specified precision
range against 5 chances in 100 (or .05 in 1) that it does not.
✓ Confidence level indicates the likelihood that the answer will fall within that range, and the
significance level indicates the likelihood that the answer will fall outside that range.
✓ We can always remember that if the confidence level is 95%, then the significance level will be (100
– 95) i.e., 5%; if the confidence level is 99%, the significance level is (100 – 99) i.e., 1%.
✓ We should also remember that the area of normal curve within precision limits for the specified
confidence level constitute the acceptance region and the area of the curve outside these limits in
either direction constitutes the rejection regions.
Z Score
✓If the z-score of the sample lies further away from the center than the critical z-
values, the null hypothesis is rejected.
✓Otherwise, the test fails to reject the hypothesis.
✓The only two possible outcomes of a hypothesis test are ‘reject the null
hypothesis’ or ‘fail to reject the null hypothesis’. This hypothesis can never be
‘accepted’.
Commonly used critical z scores
Left tail test Two tail test Right tail test
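Critical z-scores for the three kinds of test can be computed from the inverse of the standard normal distribution. A minimal sketch follows; the function name `critical_z` is hypothetical, and the confidence levels are the usual 90%, 95% and 99%.

```python
from statistics import NormalDist

def critical_z(confidence, tail):
    """Critical z-score(s) for a given confidence level and tail type."""
    alpha = 1 - confidence           # significance level
    inv = NormalDist().inv_cdf       # inverse of the standard normal CDF
    if tail == "left":
        return (inv(alpha),)
    if tail == "right":
        return (inv(1 - alpha),)
    if tail == "two":
        return (inv(alpha / 2), inv(1 - alpha / 2))
    raise ValueError("tail must be 'left', 'right' or 'two'")

for confidence in (0.90, 0.95, 0.99):
    for tail in ("left", "two", "right"):
        values = ", ".join(f"{z:+.3f}" for z in critical_z(confidence, tail))
        print(f"{confidence:.0%} {tail:>5}: {values}")
```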
Testing hypothesis : Z score
Example 2: MS EXCEL
• Hypothesis Statement
• Ho: Increase in XYZ =10 micro-units
• Hα : Increase in XYZ ≠10 micro-units
• Example 4: MS EXCEL
Left Tail test
• Imagine you’re the owner of a pizza company, and you claim that your pizzas
are more than 9 inches in diameter. But you’ve been receiving complaints
from some of your customers, who say that the pizzas are actually smaller.
Your task is to now find out whether your chefs are producing smaller pizzas.
In this case, you will conduct a ‘left-tailed test’ by checking whether your
sample mean is significantly less than 9 inches, since you’re checking
whether the complaints about smaller pizzas are true.
• Hypothesis Statement
• Null hypothesis : Pizza size is at least 9 inches (i.e. 9 or more).
• Alternative hypothesis : Pizza size is less than 9 inches
• Mathematically
• Null hypothesis : Pizza size ≥ 9.
• Alternative hypothesis : Pizza size is < 9.
Two Sample t Test
• When there is a need to compare the means of two samples, a two-
sample t-test is conducted. In such a case, the formula for the t-statistic
becomes
Types of two sample test
• Paired t test - Paired means that both samples consist of the same test subjects
• Unpaired t test- Unpaired means that both samples consist of distinct test subjects.
Example
• Let’s take the medicine example again. This time, you want to test the
default belief that the medicine affects males and females in the same
way.
• You have 15 male volunteers and 15 female volunteers. You measure
the increase in hormone XYZ in these patients, on taking the medicine.
• Male=Sample A, Female= Sample B.
• Remember, the default belief states that the medicine has the same
effect in both sexes.
• Hypothesis Statements
• Null hypothesis : Mean of the male sample and the mean of the female
sample are equal. Or
• Null hypothesis : The mean of sample A - mean of sample B = 0
• Alternative hypothesis : The mean of sample A - mean of sample B ≠ 0
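The t-statistic formula is not reproduced on the slide. As a sketch only, the unpaired statistic below uses the unequal-variance (Welch) form, t = (mean A − mean B) / √(s²A/nA + s²B/nB), which is an assumption about which variant the course uses; the paired statistic works on the per-subject differences. The function names and the small data sets are hypothetical.

```python
import math
import statistics

def unpaired_t(sample_a, sample_b):
    """t-statistic for two samples of distinct subjects (Welch form, an assumption)."""
    standard_error = math.sqrt(
        statistics.variance(sample_a) / len(sample_a)
        + statistics.variance(sample_b) / len(sample_b)
    )
    return (statistics.mean(sample_a) - statistics.mean(sample_b)) / standard_error

def paired_t(before, after):
    """t-statistic for two measurements taken on the same subjects."""
    diffs = [y - x for x, y in zip(before, after)]
    return statistics.mean(diffs) / (statistics.stdev(diffs) / math.sqrt(len(diffs)))

# Hypothetical increases in hormone XYZ (not the course data).
males = [9.8, 10.4, 10.1, 9.5, 10.9, 10.2]
females = [10.0, 10.6, 9.9, 10.3, 10.8, 10.5]
print("unpaired t:", round(unpaired_t(males, females), 3))
```

The resulting t value would then be compared with a critical t value, or converted to a p-value, as in Excel's t-test tools.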
• Example 5: Excel
Summary
1. Define the hypothesis statements: Your test will either ‘reject’ or ‘fail to reject’ the null hypothesis.
2. Collect as many data points as possible: The data points you collect will produce one sample. The size
of this single sample will depend on how many data points you take.
3. Measure the sample mean and the sample standard deviation: The standard deviation should be
calculated using the ‘n-1’ method. The STDEV function in Excel takes care of this.
4. Identify the distribution of the sample means: If the sample size is larger than 30, the distribution will
be a normal one (We’re only focusing on normal distributions for now).
5. Define the confidence level: This is the level of surety that you demand from a hypothesis test. The
higher the confidence level, the harder it is to reject the null hypothesis.
6. Find the critical z-scores of the confidence level and the test statistic or the z-score of the sample: The
z-score of the sample can be calculated by subtracting the hypothesised mean from the sample mean
and dividing the result by the standard error, i.e. the population standard deviation divided by the
square root of the sample size.
7. Compare the sample test statistic with the critical z-scores: Here, you check whether the sample
statistic is more extreme than the z-scores.
8. If the sample test statistic is more extreme than the critical z-scores, you will reject the null
hypothesis. Otherwise, you will fail to reject it.
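Steps 5 to 8 can be sketched as a single function. The numbers in the usage example (hypothesised mean 9, sample mean 8.9, population standard deviation 0.5, n = 50) are hypothetical, loosely modelled on the pizza example, and the function name is an illustrative choice.

```python
import math
from statistics import NormalDist

def one_sample_z_test(sample_mean, hypothesised_mean, population_sd, n,
                      confidence=0.95, tail="left"):
    """Return the sample z-score and the decision for a one-sample z-test."""
    standard_error = population_sd / math.sqrt(n)            # sigma / sqrt(n)
    z = (sample_mean - hypothesised_mean) / standard_error   # test statistic
    alpha = 1 - confidence
    inv = NormalDist().inv_cdf
    # Reject H0 only if z is more extreme than the critical z-score(s).
    if tail == "left":
        reject = z < inv(alpha)
    elif tail == "right":
        reject = z > inv(1 - alpha)
    elif tail == "two":
        reject = abs(z) > inv(1 - alpha / 2)
    else:
        raise ValueError("tail must be 'left', 'right' or 'two'")
    return z, ("Reject H0" if reject else "Fail to reject H0")

# Hypothetical left-tailed test: H0: mean >= 9, Ha: mean < 9.
z, decision = one_sample_z_test(8.9, 9.0, 0.5, 50)
print(round(z, 3), decision)
```

With these assumed numbers the z-score is about −1.41, which is not beyond the 95% left-tail critical value of about −1.645, so the test fails to reject the null hypothesis.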
Errors in Hypothesis test
• Type 1 error: the null hypothesis is true but is rejected (pizza size was ≥ 9, but H0 was rejected).
• Type 2 error: the null hypothesis is false but is not rejected (pizza size was not ≥ 9, but H0 was
not rejected).
Cost of Error
• However, John’s method can backfire. This is because he did not bother to check for
statistical significance. The difference in performance observed may be due to plain
old randomness. Thus, there’s a high probability that he may end up with an
inferior website colour.
• Null hypothesis (H0): Visitors that receive Layout B will not have higher
end-of-visit conversion rates compared to visitors that receive Layout A
• Alternative hypothesis (H1): Visitors that receive Layout B will have
higher end-of-visit conversion rates compared to visitors that receive
layout A
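One common way to test these hypotheses is a right-tailed two-proportion z-test with a pooled conversion rate. The slides do not state which test is used, so treat this as an assumption; the visitor and conversion counts are hypothetical.

```python
import math
from statistics import NormalDist

def two_proportion_z_test(conversions_a, visitors_a, conversions_b, visitors_b):
    """Right-tailed test of H1: conversion rate of B is higher than that of A."""
    rate_a = conversions_a / visitors_a
    rate_b = conversions_b / visitors_b
    pooled = (conversions_a + conversions_b) / (visitors_a + visitors_b)
    standard_error = math.sqrt(pooled * (1 - pooled) * (1 / visitors_a + 1 / visitors_b))
    z = (rate_b - rate_a) / standard_error
    p_value = 1 - NormalDist().cdf(z)  # area to the right of z
    return z, p_value

# Hypothetical counts: Layout A converts 100 of 1000 visitors, Layout B 130 of 1000.
z, p_value = two_proportion_z_test(100, 1000, 130, 1000)
alpha = 0.05
print(round(z, 3), round(p_value, 4), "Reject H0" if p_value <= alpha else "Fail to reject H0")
```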
Example: A/ B testing
References
➢ file:///D:/nptel/19%20HYPOTHESIS%20TESTING%20T_%20Z%20TEST%20(1).pdf
➢ https://hbr.org/2017/06/a-refresher-on-ab-testing
➢ Upgrad Study Material
➢ Kothari, C. R. (2004). Research methodology: Methods and techniques. New Age International.
➢ Malhotra, N., & Birks, D. (2007). Marketing Research: an applied approach: 3rd European Edition. Pearson education.
➢ List of test: https://help.xlstat.com/s/article/which-statistical-test-should-you-use?language=en_US
Download stat pro
https://faculty.chicagobooth.edu/jeffrey.russell/teaching/bstats/StatPro.zip
Install Stat Pro
https://www.youtube.com/watch?v=S24BV6tCkQQ
Download Real stats (for Logistic regression)
http://www.real-statistics.com/free-download/real-statistics-resource-pack/
Install Real stats (for Logistic regression)
https://www.youtube.com/watch?v=EKRjDurXau0
P value calculator
http://courses.atlas.illinois.edu/spring2016/STAT/STAT200/pnormal.html
Ab testing
https://www.surveymonkey.com/mp/ab-testing-significance-calculator/
Market Research, Naresh Malhotra
http://www.ru.ac.bd/wp-content/uploads/sites/25/2019/03/407_08_00_Malhotra-Marketing-Research-An-Applied-Orientation.pdf
XL Stat links
https://help.xlstat.com/s/article/k-means-clustering-in-excel-tutorial?language=en_US
https://help.xlstat.com/s/article/conjoint-analysis-in-excel-tutorial-new?language=en_US
https://help.xlstat.com/s/article/discriminant-analysis-in-excel-tutorial?language=en_US
Kotler
http://eprints.stiperdharmawacana.ac.id/24/1/%5BPhillip_Kotler%5D_Marketing_Management_14th_Edition%28BookFi%29.pdf
Module 3: Regression Analysis
• Covariance
• Correlation
• Difference between Covariance and Correlation
• Regression
• Simple Linear Regression
• Multiple Linear Regression
• Logistic Regression
• Hands on experience using MS EXCEL
• Hypothesis testing and Interpretation
Covariance
• There are several methods of determining the relationship between variables, but
no method can tell us for certain that a correlation is indicative of causal
relationship.
• Thus we have to answer two types of questions in bivariate or multivariate
populations viz.,
(i) Does there exist association or correlation between the two (or more) variables? If yes, of what
degree?
(ii) Is there any cause and effect relationship between the two variables in case of the bivariate
population or between one variable on one side and two or more variables on the other side in
case of multivariate population? If yes, of what degree and in which direction?
The first question is answered by the use of correlation technique and the second
question by the technique of regression.
Karl Pearson’s coefficient of correlation
• Karl Pearson’s coefficient of correlation (or simple correlation) is the most widely
used method of measuring the degree of relationship between two variables. This
coefficient assumes the following:
• that there is linear relationship between the two variables;
• that the two variables are causally related, which means that one of the variables is
independent and the other one is dependent;
• a large number of independent causes are operating in both variables so as to produce a
normal distribution.
• Karl Pearson’s coefficient of correlation can be calculated using following formula:
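The formula itself is not reproduced on the slide. A standard form of Karl Pearson's coefficient is r = cov(X, Y) / (s_X · s_Y), that is, the covariance divided by the product of the two standard deviations. The sketch below computes both covariance and correlation; the temperature and air conditioner sales figures are hypothetical.

```python
import math

def covariance(xs, ys):
    """Sample covariance of two equally long lists (n - 1 divisor)."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)

def pearson_r(xs, ys):
    """Pearson's r: covariance scaled by both standard deviations."""
    return covariance(xs, ys) / math.sqrt(covariance(xs, xs) * covariance(ys, ys))

# Hypothetical data: daily temperature (deg C) and air conditioner units sold.
temperature = [25, 28, 30, 33, 35, 38]
ac_sales = [12, 15, 19, 24, 26, 33]

print("covariance:", round(covariance(temperature, ac_sales), 2))
print("correlation:", round(pearson_r(temperature, ac_sales), 3))
```

Covariance depends on the units of the two variables, whereas the correlation coefficient is unit-free and always lies between −1 and +1.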
Correlation
Covariance Correlation
• Null Hypothesis
Null hypothesis: Temperature does not significantly influence air conditioner
sales
• Alternative Hypothesis
Alternative hypothesis : Temperature has a significant influence on the sales
➢ Disclaimer: Some data/pictures are used from internet resources just for teaching purpose
Doubts?
All the Best!
https://www.youtube.com/watch?v=Z9Gw9dIJGiA&t=86s&ab_channel=upGrad_Gmba