Inferential Statistics and Linear Regression

The central limit theorem states that as sample size increases, the sampling distribution of the sample mean approximates a normal distribution, even if the

Uploaded by

Mahesh Babu

Inferential Statistics

▪ What is inferential statistics?
▪ Making inferences about populations based on samples
Descriptive Vs Inferential Statistics
▪ Descriptive Statistics refers to
▪ Summary/Description of sample only (not population)
• Inferential Statistics refers to
– Description of population using sample
• Hypothesis Testing: Testing of statements about population based on sample characteristics
– F Test
– T Test
– Chi-square Test
• Predictions
– Regression
– Classification
Some examples of inferential statistics

• Given the IQ scores of a sample of men and women, is the IQ level of men and women the same?
• Given a crime data set (sample), check whether black and white people are equally likely to commit crime
Inferential Statistics using
Probability Density Functions(PDF)
• Box-plot vs Normal Distribution
Inferential Statistics using
Probability Density Functions(PDF)
Answers to the following questions are statistical inferences about the population:
• Will a client buy the product, or not?
• Will the medicine help the patient to recover, or not?
• Will the online ad be clicked on, or not?
• Modeling:
– Outcome of experiment: true or false
– Use Bernoulli distribution
Bernoulli Distribution:
$P(x) = p^x q^{1-x}$, $x \in \{0, 1\}$
where p is the probability of x=1, q=(1-p)
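The Bernoulli formula $P(x) = p^x q^{1-x}$ can be checked numerically. The following is a minimal sketch: the value p = 0.3 and the helper name `bernoulli_pmf` are illustrative assumptions, not part of the slides, and `scipy.stats.bernoulli` is used only as a cross-check.

```python
from scipy.stats import bernoulli

p = 0.3  # assumed probability that a client buys (illustrative value only)
q = 1 - p

def bernoulli_pmf(x, p):
    """Hypothetical helper: P(x) = p^x * q^(1-x) for x in {0, 1}."""
    return p**x * (1 - p)**(1 - x)

prob_buy = bernoulli_pmf(1, p)     # reduces to p
prob_no_buy = bernoulli_pmf(0, p)  # reduces to q

# Cross-check the hand-written formula against SciPy's implementation
scipy_buy = bernoulli.pmf(1, p)
scipy_no_buy = bernoulli.pmf(0, p)
print(prob_buy, prob_no_buy, scipy_buy, scipy_no_buy)
```

For x = 1 the formula reduces to p and for x = 0 it reduces to q, which is why a single parameter is enough to describe a yes/no outcome.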
Inferential Statistics using
Probability Density Functions(PDF)
Suppose the questions about populations, when n trials are performed, are
• What is the probability that x clients buy a product out of n clients who visited a shop?
• Will the medicine help x patients out of n patients to recover?
• Will the online ad be clicked by x persons out of n persons?
• Modeling:
– Outcome of each trial is true or false, and n trials are performed
– Use Binomial Distribution:
$P(x) = \binom{n}{x} p^x q^{n-x}$
where p is the probability of success, q=1-p
Binomial Distribution

To find the probability of getting x heads from 10 trials

from scipy.stats import binom
import matplotlib.pyplot as plt

num_trials = 10
heads_probability = .5
# P(x heads) for every possible count x = 0..num_trials
probs = [binom.pmf(i, num_trials, heads_probability)
         for i in range(num_trials + 1)]
plt.stem(list(range(num_trials + 1)), probs)
plt.show()
Poisson Distribution
Suppose the questions about populations are
• What is the probability of k road accidents happening in Kandigai in a day?
• What is the probability that k clients make a purchase in your online store every X minutes?
• What is the probability that every Yth product coming off the assembly line is defective?
• What is the probability that x users access your ML model deployed in the cloud?
When to model using the Poisson distribution:
• Events corresponding to large values of the random variable occur rarely
• Experiment outcome: 0,1,2,3,….
Poisson Distribution:
$P(x) = \frac{\lambda^x e^{-\lambda}}{x!}$
where λ is the average number of events per interval
Note that $x! \gg \lambda^x$ for very large x
Poisson Distribution

import matplotlib.pyplot as plt
from scipy.stats import poisson

rate = 3.3
probs = [poisson.pmf(i, rate) for i in range(15)]
plt.stem(list(range(15)), probs)
plt.show()
Exponential Distribution
Suppose the questions about populations are
• What is the probability that two consecutive road accidents in Kandigai happen x minutes apart?
• What is the probability that two consecutive clients make a purchase in your online store within X minutes?
• What is the probability that the time between two consecutive defective products coming off the assembly line is x minutes?
• What is the probability that the time between two consecutive users accessing your ML model deployed in the cloud is x seconds?
Exponential Distribution:
$f(x) = \lambda e^{-\lambda x}$, $x \ge 0$
● Here λ > 0 is the parameter of the distribution, often called the rate parameter. It is equal to 1/μ, where μ is the mean.
● The exponential distribution is often concerned with the amount of time until some specific event occurs.
● For example, the amount of time (beginning now) until an earthquake occurs has an exponential distribution.
● Other examples include the length, in minutes, of long distance business telephone calls, and the amount of time, in months, a car battery lasts.
● It can be shown, too, that the value of the change that you have in your pocket or purse approximately follows an exponential distribution.
Let X = amount of time (in minutes) a postal clerk spends with his or her customer. The time is known to have an exponential distribution with the average amount of time equal to four minutes.

It is given that μ = 4 minutes. To do any calculations, you must know λ, the decay parameter. λ = 1/μ. Therefore, λ = 1/4 = 0.25, and the density is $f(x) = 0.25 e^{-0.25x}$.

For example, $f(5) = 0.25 e^{-1.25} \approx 0.072$. This is the value of the density at x = 5, that is, when the postal clerk spends five minutes with a customer. It is not itself a probability.
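The postal clerk calculation can be reproduced with `scipy.stats.expon`. This is a small sketch, not the only way to do it; note that SciPy parameterises the exponential distribution by `scale`, which equals 1/λ = μ. The variable names are illustrative.

```python
from scipy.stats import expon

mu = 4.0      # average time with a customer, in minutes (from the example)
lam = 1 / mu  # decay (rate) parameter, 0.25

# SciPy uses scale = 1/λ, which here is simply μ
clerk = expon(scale=mu)

density_at_5 = clerk.pdf(5)     # f(5) = 0.25 * exp(-1.25), about 0.072
prob_within_5 = clerk.cdf(5)    # P(X <= 5) = 1 - exp(-1.25)
prob_more_than_5 = clerk.sf(5)  # P(X > 5) = exp(-1.25)

print(lam, density_at_5, prob_within_5, prob_more_than_5)
```

The density value f(5) only describes the height of the curve at x = 5. Probabilities such as P(X ≤ 5) come from the cumulative distribution function.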
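Exponential Distribution

import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

scale = 1 / 3.3  # scale = 1/λ, the mean waiting time for rate λ = 3.3
draws = np.random.exponential(scale, size=1_000_000)
# `fill` replaces the deprecated `shade` argument of kdeplot
sns.kdeplot(draws, fill=True, color='xkcd:lightish blue')
plt.show()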
Normal(Gaussian) Distribution
• What data does a Normal distribution fit?
– Data in which values far from the mean occur with low probability
• Example data where a Normal distribution fits
– Body temperature
– People's height
– Car mileage
– IQ scores
– Error distribution of observed sensor values
• Why fit a distribution?
– To infer the probability of events
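Once a Normal distribution has been fitted, probabilities of events can be read off it. The sketch below uses IQ scores as the example and assumes a mean of 100 and a standard deviation of 15. These parameter values are an assumption made for illustration and do not come from the slides.

```python
from scipy.stats import norm

iq_mean, iq_sd = 100, 15  # assumed parameters, for illustration only
iq = norm(loc=iq_mean, scale=iq_sd)

# Probability of an extreme value: IQ above 130 (two SDs above the mean)
prob_above_130 = iq.sf(130)

# Probability of a typical value: IQ within one SD of the mean
prob_85_to_115 = iq.cdf(115) - iq.cdf(85)

print(prob_above_130, prob_85_to_115)
```

Values two standard deviations above the mean are rare, while roughly two thirds of the values fall within one standard deviation, which matches the idea that extreme values have low probability.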
Sample Distribution

• If we take many samples (with the same number of observations in each sample) from a population, and calculate the mean of each sample, then the distribution of these means across the samples is called the sample distribution (the sampling distribution of the mean)

• Central Limit Theorem (CLT):
– Let m be the mean and s be the standard deviation of a population P. For large n, the sample distribution is approximately a normal distribution with mean m and standard deviation s/sqrt(n), where n is the number of observations in each sample, even if P itself is not normally distributed
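The CLT statement above can be checked by simulation. The following is a minimal sketch under stated assumptions: the population is drawn from an exponential distribution (deliberately non-normal), and the sizes, seed and variable names are illustrative choices rather than anything fixed by the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed population: 1 million values from a skewed (exponential) distribution
population = rng.exponential(scale=2.0, size=1_000_000)
m, s = population.mean(), population.std()

n = 100              # observations per sample
num_samples = 5_000  # number of samples drawn

# Draw the samples (with replacement) and compute one mean per sample
samples = rng.choice(population, size=(num_samples, n))
sample_means = samples.mean(axis=1)

print("population mean m:", m)
print("mean of sample means:", sample_means.mean())
print("s / sqrt(n):", s / np.sqrt(n))
print("SD of sample means:", sample_means.std())
```

The mean of the sample means should land close to m, and their standard deviation close to s/sqrt(n), even though the population itself is far from normal.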
Sample Distribution

[Figures: frequency distribution of a population of size 1 million; frequency distributions of four different samples (of size 100 each); frequency distribution of the sample distribution]


An Application of CLT in Data Science

• Suppose an advertising company tells a customer that an expected click-through rate (CTR) of 20% will be provided, where CTR is normally distributed.
• But the customer draws a sample of size 100, and finds that only 16 people clicked the ad.
• Can the customer take the advertising company to court for not meeting the 20% CTR?
– Ans: No, not on the basis of a single sample
– The customer should find the mean of the sampling distribution
– If the mean of the sampling distribution falls too far short of 0.2, then a complaint can be made
– As per CLT, the mean of a normally distributed population is the same as the mean of the sampling distribution
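To see why a single sample of 16 clicks out of 100 is weak evidence, one can compute how far 0.16 lies from 0.20 in standard-error units. This sketch assumes each impression is an independent click/no-click trial with probability 0.2, so that the standard error of the sample proportion is sqrt(p(1-p)/n). That modelling assumption is not stated in the slides.

```python
from math import sqrt

claimed_ctr = 0.20  # CTR promised by the advertising company
observed_ctr = 0.16 # 16 clicks out of 100 in the customer's sample
n = 100

# Assumed model: independent click / no-click trials with p = claimed_ctr
standard_error = sqrt(claimed_ctr * (1 - claimed_ctr) / n)

# Distance of the observed CTR from the claim, in standard errors
z = (observed_ctr - claimed_ctr) / standard_error

print(standard_error, z)
```

Under this assumption the observed CTR is only about one standard error below the claim, which is well within ordinary sampling variation.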
Applications of CLT in Data Science
▪ Sensor error is usually normally distributed
▪ To find the expected error that a sensor/device makes, find the mean of the sample distribution
▪ To find the SD of the error that the device makes, find the SD of the sample distribution, say ES. By CLT, ES = s/sqrt(n), where s is the population SD and n is the sample size. Therefore the population SD, s = ES*sqrt(n), can be computed from the sample distribution
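The relation s = ES*sqrt(n) can be illustrated with simulated sensor errors. This is only a sketch: the true error mean (0.5) and SD (2.0), the sample size and the seed are made-up values chosen so that the recovered estimates can be compared with something known.

```python
import numpy as np

rng = np.random.default_rng(42)

true_mean, true_sd = 0.5, 2.0  # assumed sensor error parameters (illustrative)
n = 25                         # readings per sample
num_samples = 10_000           # number of samples

# Each row is one sample of n sensor errors
errors = rng.normal(true_mean, true_sd, size=(num_samples, n))
sample_means = errors.mean(axis=1)

expected_error = sample_means.mean()  # mean of the sample distribution
es = sample_means.std(ddof=1)         # SD of the sample distribution (ES)
population_sd_estimate = es * np.sqrt(n)  # s = ES * sqrt(n)

print(expected_error, es, population_sd_estimate)
```

The recovered population SD should be close to the value used to generate the data, which is the point of the CLT relation.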
Statistical Hypothesis Testing

• What is a conjecture?
– Any statement which is either true or false
• What is a statistical hypothesis?
– A conjecture that can be tested by experiments / observations
• Eg:
– Given drug X and disease d, X is effective in treating d
– Avg monthly salary of an Indian is 10k
– Avg monthly salaries of an Indian and a Chinese are the same
– The performance of algorithms A and B is statistically the same
Statistical Hypothesis Testing
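Hypothesis Testing using Z-Test to test
population mean

Test statistic to check if the population mean is mu:
$Z = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}}$
where $\bar{x}$ is the sample mean, σ is the population standard deviation and n is the sample size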


Two-Tailed Vs One tailed Test
A factory has a machine that dispenses 80 mL of fluid into a
bottle. An employee believes that the average amount of fluid
isn’t 80 mL. Using 40 samples, he measures the average
amount dispensed by the machine to be 78 mL with a
population standard deviation of 2.5 mL. State the null and
alternative hypothesis. At a 95% confidence level, is there
enough evidence to support the idea that the machine is not
working properly?

Suppose later that further testing shows that the machine was
working properly, what type of error did the employee make
(Type 1 or Type 2)?
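The factory question can be worked through as a two-tailed Z-test with H0: population mean = 80 mL and H1: population mean ≠ 80 mL. The sketch below uses `scipy.stats.norm` for the critical value and the p-value. The variable names are illustrative.

```python
from math import sqrt
from scipy.stats import norm

mu0 = 80.0          # mean under H0 (mL)
sample_mean = 78.0  # measured average (mL)
sigma = 2.5         # population standard deviation (mL)
n = 40              # number of measurements
alpha = 0.05        # 95% confidence level

# Z test statistic: (sample mean - hypothesised mean) / standard error
z = (sample_mean - mu0) / (sigma / sqrt(n))

# Two-tailed test: compare |Z| with the critical value for alpha/2
z_crit = norm.ppf(1 - alpha / 2)
p_value = 2 * norm.sf(abs(z))
reject_h0 = abs(z) > z_crit

print(z, z_crit, p_value, reject_h0)
```

Here |Z| is far larger than the critical value of about 1.96, so H0 is rejected. If the machine later turns out to be working properly, a true null hypothesis was rejected, which is a Type 1 error.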
Z-Table
Example lookup: P(Z < 2.31) = 0.9896
Steps of Z-test for left tail to
test population mean
Note: The objective is to reject the null hypothesis when the population mean is significantly less than 50
Step 1: Formulate H0 and H1
H0: PM=50 (PM denotes Population mean)
H1: PM<50
Step 2: Select Significance Level
alpha = 5%
Step 3: Find Z* from the Z-table corresponding to the chosen alpha
Z* = -1.65
Step 4: Calculate Z test statistics
Z= (SM-PM)/(SD/sqrt(n)) (SM denotes sample mean)
Z=(46-50)/(6/2)=-4/3=-1.33
Step 5: If Z< Z*, reject H0
Here Z = -1.33 is not less than Z* = -1.65, so we cannot reject the null hypothesis


Steps of Z-test for right tail to
test population mean
Note: The objective is to reject the null hypothesis when the population mean is significantly more than 50
Step 1: Formulate H0 and H1
H0: PM=50 (PM denotes Population mean)
H1: PM>50
Step 2: Select Significance Level
alpha = 5%
Step 3: Find Z* from the Z-table corresponding to the chosen alpha
Z* = 1.65
Step 4: Calculate Z test statistics
Z= (SM-PM)/(SD/sqrt(n)) (SM denotes sample mean)
Z=(55-50)/(6/2) =5/3 =1.67
Step 5: If Z> Z*, reject H0
Here Z = 1.67 is greater than Z* = 1.65, so we reject the null hypothesis


Steps of Z-test for two tails to
test population mean
Note: The objective is to reject the null hypothesis when the population mean is significantly different from 50
Step 1: Formulate H0 and H1
H0: PM=50 (PM denotes Population mean)
H1: PM != 50
Step 2: Select Significance Level
alpha = 5%
Step 3: Find Z* from the Z-table corresponding to alpha/2 = 2.5%
Z* = 1.96
Step 4: Calculate Z test statistics
Z= (SM-PM)/(SD/sqrt(n)) (SM denotes sample mean)
Z=(55-50)/(6/2) =5/3 =1.67
Step 5: If |Z| > Z*, reject H0
When Z is positive, if Z > Z*, reject H0
When Z is negative, if Z < -Z*, reject H0

Here Z = 1.67 is less than Z* = 1.96, so we cannot reject the null hypothesis
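The left-tail, right-tail and two-tail procedures differ only in the critical value and the direction of the comparison, so they can be sketched as one function. The helper name `z_test` and its signature are hypothetical, introduced here for illustration. Critical values come from `scipy.stats.norm.ppf` instead of a printed Z-table, so they are about 1.645 rather than the rounded 1.65.

```python
from math import sqrt
from scipy.stats import norm

def z_test(sample_mean, pop_mean, sd, n, alpha=0.05, tail="two"):
    """Hypothetical helper: returns (Z, critical value, reject H0?)."""
    z = (sample_mean - pop_mean) / (sd / sqrt(n))
    if tail == "left":       # H1: PM < pop_mean
        z_crit = norm.ppf(alpha)
        reject = z < z_crit
    elif tail == "right":    # H1: PM > pop_mean
        z_crit = norm.ppf(1 - alpha)
        reject = z > z_crit
    elif tail == "two":      # H1: PM != pop_mean
        z_crit = norm.ppf(1 - alpha / 2)
        reject = abs(z) > z_crit
    else:
        raise ValueError("tail must be 'left', 'right' or 'two'")
    return z, z_crit, bool(reject)

# The three worked examples (SD = 6 and n = 4, so SD/sqrt(n) = 6/2)
left = z_test(46, 50, 6, 4, tail="left")
right = z_test(55, 50, 6, 4, tail="right")
two = z_test(55, 50, 6, 4, tail="two")
print(left, right, two)
```

The same sample mean of 55 leads to rejection in the right-tailed test but not in the two-tailed test, because the two-tailed test splits alpha across both tails and so uses a larger critical value.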


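Z-test to test difference of
means of two populations
• Two sample Z-test
– Objective: Check if the means of two populations are significantly different
• Null Hypothesis: $H_0: \mu_1 = \mu_2$
• The test statistic: $Z = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\sigma_1^2 / n_1 + \sigma_2^2 / n_2}}$
where $\bar{x}_i$, $\sigma_i$ and $n_i$ are the sample mean, population standard deviation and sample size for population i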
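A two-sample Z-test compares the difference of the two sample means with its standard error, using the standard statistic Z = (mean1 - mean2) / sqrt(sd1²/n1 + sd2²/n2) under H0: the two population means are equal. The sketch below uses made-up numbers for both samples. None of the values come from the slides.

```python
from math import sqrt
from scipy.stats import norm

# Hypothetical summary statistics for two populations (illustrative values)
mean1, sd1, n1 = 52.0, 6.0, 36
mean2, sd2, n2 = 50.0, 8.0, 64
alpha = 0.05

# Standard error of the difference of the two sample means
se_diff = sqrt(sd1**2 / n1 + sd2**2 / n2)

# Test statistic under H0: mu1 == mu2
z = (mean1 - mean2) / se_diff

# Two-tailed decision
z_crit = norm.ppf(1 - alpha / 2)
reject_h0 = abs(z) > z_crit

print(z, z_crit, reject_h0)
```

With these numbers the difference of 2 is about 1.41 standard errors, which is below the two-tailed critical value, so the means are not significantly different at the 5% level.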


T-Test
t Table
Linear Regression
• This analysis aids in understanding the relation between two or more variables.
• In the case of two variables, one is the independent variable (input) and the other is the dependent variable (predicted variable).

$y = b_0 + b_1 x$

• where $b_0$ → y intercept and $b_1$ → slope


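Linear Regression

▪ Standard Error of the Estimate = $\sqrt{\frac{\sum (\hat{y} - y)^2}{n - 2}}$

▪ $b_0$ is calculated using the mean coordinate $(\bar{x}, \bar{y})$: $b_0 = \bar{y} - b_1 \bar{x}$

▪ $b_1 = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sum (x - \bar{x})^2}$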
Linear Regression
x        y
1        2
2        4
3        5
4        4
5        5
∑= 15    ∑= 20

$\bar{x} = 3$, $\bar{y} = 4$
[Figure: scatter plot of the data]
$\bar{y} = b_0 + b_1 \bar{x}$
As per the example, to calculate the y intercept: $4 = b_0 + 0.6 \cdot 3$, so $b_0 = 2.2$
Contd.,
$x-\bar{x}$   $y-\bar{y}$   $(x-\bar{x})^2$   $(x-\bar{x})(y-\bar{y})$   $\hat{y}$   $\hat{y}-y$   $(\hat{y}-y)^2$
-2            -2            4                 4                          2.8         0.8           0.64
-1            0             1                 0                          3.4         -0.6          0.36
0             1             0                 0                          4           -1            1
1             0             1                 0                          4.6         0.6           0.36
2             1             4                 2                          5.2         0.2           0.04
∑                           10                6                                                    2.4


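Linear Regression

▪ Standard Error of the Estimate = $\sqrt{\frac{\sum (\hat{y} - y)^2}{n - 2}} = \sqrt{\frac{2.4}{5 - 2}} = 0.89$

▪ $b_0$ is calculated using the mean coordinate

▪ $b_1$ (Slope) $= \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2} = \frac{6}{10} = 0.6$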
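The worked example can be reproduced with NumPy by applying the same formulas to the five (x, y) pairs. This is a minimal sketch with illustrative variable names. It computes the slope, the intercept from the mean coordinate, the fitted values and the standard error of the estimate.

```python
import numpy as np

# Data from the worked example
x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2, 4, 5, 4, 5], dtype=float)

x_bar, y_bar = x.mean(), y.mean()  # 3 and 4

# Slope: sum of cross-deviations over sum of squared x-deviations
b1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)

# Intercept from the mean coordinate: y_bar = b0 + b1 * x_bar
b0 = y_bar - b1 * x_bar

# Fitted values and standard error of the estimate (n - 2 degrees of freedom)
y_hat = b0 + b1 * x
std_err = np.sqrt(np.sum((y_hat - y) ** 2) / (len(x) - 2))

print(b1, b0, y_hat, std_err)
```

The fitted values 2.8, 3.4, 4, 4.6 and 5.2 match the table, and the residual sum of squares of 2.4 gives the standard error of about 0.89.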
