Module 2 Feb
Module 2.1
Definition
Statistical modelling is a simplified, mathematically formalized
way to approximate reality (i.e. the process that generates your
data) and, optionally, to make predictions from this
approximation. The statistical model is the mathematical
equation that is used.
Non-deterministic (stochastic)
E.g. the weight of 1000 potatoes: a statistical approach.
Terminologies
Example:
Null: There is no difference in mean BMI between males and females.
H0: µ1 = µ2 [µ1 represents the population mean BMI for males and
µ2 represents the population mean BMI for females]
Here H0 says that the two means are equal.
Ha: µ1 ≠ µ2 [µ1 and µ2 are defined as above]
Here Ha says that the two means are not equal.
Step 3: Set the level of significance and
confidence (critical values)
• Confidence level (C)
• The significance level is the probability (risk) of
rejecting the null hypothesis when it is true. It
is denoted by 𝛼, with 𝛼 = 1 − C (so 𝛼 + C = 1), and
is generally taken as 1%, 5% or 10%.
• (1 − 𝛼) is the confidence level: the probability of
not rejecting the null hypothesis when it is true.
• Two-tailed test: Ha: µ ≠ µ0 (two-sided)
• One-tailed test: Ha: µ < µ0 (left-sided) or
Ha: µ > µ0 (right-sided)
• Rejection region
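The link between 𝛼, the tail of the test and the rejection region can be sketched in code. This is a minimal illustration for a z-test only; the function name `z_critical` and its `tail` argument are made up for this example and are not part of the original slides.

```python
from statistics import NormalDist


def z_critical(alpha, tail="two"):
    """Return the critical z value(s) for a given significance level.

    tail is "two", "left" or "right" (hypothetical argument names).
    """
    std_normal = NormalDist()  # standard normal: mean 0, SD 1
    if tail == "two":
        # Split alpha equally between the two tails
        z = std_normal.inv_cdf(1 - alpha / 2)
        return (-z, z)
    if tail == "left":
        return std_normal.inv_cdf(alpha)
    if tail == "right":
        return std_normal.inv_cdf(1 - alpha)
    raise ValueError("tail must be 'two', 'left' or 'right'")


for alpha in (0.01, 0.05, 0.10):
    lower, upper = z_critical(alpha)
    print(f"alpha={alpha:.2f}  C={1 - alpha:.2f}  "
          f"reject H0 if z < {lower:.3f} or z > {upper:.3f}")
```

For a one-tailed test the whole of 𝛼 sits in a single tail, so the critical value is closer to zero than in the two-tailed case at the same 𝛼.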
Step 4: Identify the test statistic
- Used to test the validity of the null hypothesis.
- Used to measure the evidence for or against the null
hypothesis.
Test statistic = Z-test / T-test
Z-score (if the population standard
deviation is known): z = (x̄ − µ0) / (σ / √n)
Decision criteria
• Compare the test statistic value to the critical
value; reject H0 if it falls in the rejection region
Example 1
Example 2
Example 3
• A factory has a machine that dispenses 80 ml
of fluid into a bottle. An employee believes the
average amount of fluid is not 80 ml. Using 40
samples, he measures the average amount
dispensed by the machine to be 78 ml with an
S.D. of 2.5 ml.
• a) State the null and alternative hypotheses.
• b) At a 95% confidence level, is there enough
evidence to support the idea that the
machine is not working properly?
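One possible worked solution, as a sketch. It takes H0: µ = 80 and Ha: µ ≠ 80 (two-tailed), and it assumes that because n = 40 is large, the sample S.D. can stand in for the population S.D. so that a z-test applies. That assumption is not stated in the exercise.

```python
from math import sqrt
from statistics import NormalDist

mu0 = 80.0    # mean volume (ml) claimed under H0
x_bar = 78.0  # sample mean (ml)
s = 2.5       # sample S.D. (ml), assumed to approximate sigma since n is large
n = 40        # sample size
alpha = 0.05  # 95% confidence level

# Test statistic: distance of the sample mean from mu0 in standard errors
z = (x_bar - mu0) / (s / sqrt(n))

# Two-tailed critical value
z_crit = NormalDist().inv_cdf(1 - alpha / 2)

reject_h0 = abs(z) > z_crit
print(f"z = {z:.2f}, critical value = ±{z_crit:.2f}, reject H0: {reject_h0}")
```

Here z is about −5.06, well beyond ±1.96, so under these assumptions H0 is rejected: the data support the claim that the machine is not dispensing 80 ml on average.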
T-test: Example
A company manufactures car batteries with an average life span of 2 or more
years. An engineer believes the value to be less.
Using 10 samples, he measures the average life span to be 1.8 years with an S.D.
of 0.15 years.
a) State the null and alternative hypotheses.
b) At a 99% C.L., is there enough evidence to discard the null hypothesis?
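One way to work through this example, as a sketch. It takes H0: µ ≥ 2 and Ha: µ < 2 (left-tailed) and uses a t-test because the sample is small (n = 10) and only the sample S.D. is known. The critical value 2.821 is the standard t-table entry for 9 degrees of freedom at a one-tailed 𝛼 of 0.01; it is hard-coded here to avoid depending on a third-party library.

```python
from math import sqrt

mu0 = 2.0     # minimum mean life span (years) claimed under H0
x_bar = 1.8   # sample mean (years)
s = 0.15      # sample S.D. (years)
n = 10        # sample size
df = n - 1    # degrees of freedom

# t-table value for df = 9, one-tailed alpha = 0.01 (99% C.L.)
t_crit = 2.821

# Test statistic
t = (x_bar - mu0) / (s / sqrt(n))

# Left-tailed test: reject H0 when t falls below -t_crit
reject_h0 = t < -t_crit
print(f"t = {t:.2f}, df = {df}, critical value = -{t_crit}, reject H0: {reject_h0}")
```

The statistic is about −4.22, which is below −2.821, so under these assumptions there is enough evidence at the 99% C.L. to discard H0.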
F-test
Tests the hypothesis that the variances of two populations are equal.
Degrees of freedom: (number of observations in each sample − 1)
Rejection criteria:
If the calculated F is smaller than the critical F value
(table value), H0 is accepted (not rejected); otherwise reject H0.
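The rejection rule can be sketched as follows. The helper `f_test` and both data sets are hypothetical. The sketch follows the common convention of putting the larger sample variance in the numerator, and the critical value is passed in by the caller as read from an F-table (6.39 is the upper 5% point for 4 and 4 degrees of freedom).

```python
from statistics import variance


def f_test(sample_a, sample_b, f_critical):
    """Compare two sample variances against a table value (illustrative helper)."""
    var_a, var_b = variance(sample_a), variance(sample_b)
    # Larger variance goes in the numerator so that F >= 1
    if var_a >= var_b:
        f_stat = var_a / var_b
        df_num, df_den = len(sample_a) - 1, len(sample_b) - 1
    else:
        f_stat = var_b / var_a
        df_num, df_den = len(sample_b) - 1, len(sample_a) - 1
    reject_h0 = f_stat > f_critical
    return f_stat, df_num, df_den, reject_h0


# Hypothetical measurements from two processes
a = [10, 12, 14, 16, 18]
b = [13, 14, 14, 15, 14]

f_stat, df_num, df_den, reject_h0 = f_test(a, b, f_critical=6.39)
print(f"F = {f_stat:.2f} with ({df_num}, {df_den}) df, reject H0: {reject_h0}")
```

The critical value must match the degrees of freedom and the significance level chosen for the test, so it should be looked up again whenever the sample sizes change.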
Structural Equation Modelling
• Structural equation modelling is a multivariate
statistical analysis technique that is used to analyze
structural relationships.
• This technique is a combination of factor
analysis and multiple regression analysis, and it is
used to analyze the structural relationship between
measured variables and latent constructs.
• Researchers prefer this method because it
estimates multiple, interrelated dependencies
in a single analysis.
• In this analysis, two types of variables are used:
endogenous variables and exogenous variables.
• Endogenous variables are equivalent to dependent
variables, and exogenous variables are equivalent to
independent variables.
Power Analysis
Alpha: the probability of rejecting H0 when the null hypothesis is true.
Type I error (false positive): rejecting the null hypothesis even though it is true and
should not be rejected.
Type II error (false negative): failing to reject a null hypothesis that is actually
false.
Power (also called the power of a hypothesis test) is the probability that a
test will correctly reject a false null hypothesis, i.e. the probability of a
true positive result.
Power = 1 − β, where β is the probability of a Type II error
• Low Statistical Power
• High Statistical Power
Four Parameters:
• Effect Size
• Sample Size
• Significance
• Statistical Power
An increase in effect size
increases power
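How the four parameters interact can be illustrated with a small calculation. This sketch assumes a two-sided one-sample z-test with known population S.D., and defines effect size as (true mean − null mean) / S.D. The function name `z_test_power` is made up for this example.

```python
from math import sqrt
from statistics import NormalDist


def z_test_power(effect_size, n, alpha=0.05):
    """Power of a two-sided one-sample z-test (illustrative helper).

    effect_size = (true mean - null mean) / population S.D.
    """
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    # Under the alternative, the test statistic is centred on this shift
    shift = effect_size * sqrt(n)
    # Probability of landing in the upper or the lower rejection region
    return std_normal.cdf(-z_crit + shift) + std_normal.cdf(-z_crit - shift)


# Larger effect size -> higher power (sample size and alpha fixed)
for d in (0.2, 0.5, 0.8):
    print(f"effect size {d}: power = {z_test_power(d, n=30):.3f}")

# Larger sample size -> higher power (effect size and alpha fixed)
for n in (10, 30, 100):
    print(f"n = {n}: power = {z_test_power(0.5, n):.3f}")
```

With an effect size of zero the function returns 𝛼, which matches the definition of the Type I error rate. Fixing any three of the four parameters determines the fourth.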
Sample Size
Example 2
Sampling methods
Sampling is a technique of selecting a subset of
the population in order to make statistical inferences
from it and estimate characteristics of the
whole population.
It is a time-saving and cost-effective method.
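As a small illustration of estimating a population characteristic from a subset, the sketch below draws a simple random sample without replacement. The population values are simulated and purely hypothetical.

```python
import random
from statistics import mean

random.seed(0)

# Hypothetical population: 10,000 simulated heights in cm
population = [random.gauss(170, 10) for _ in range(10_000)]

# Simple random sample of 200 units, drawn without replacement
sample = random.sample(population, 200)

print(f"population mean: {mean(population):.2f}")
print(f"sample mean:     {mean(sample):.2f}")
```

Measuring 200 units instead of 10,000 is where the saving in time and cost comes from, while the sample mean still serves as an estimate of the population mean.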
Types of Probability Sampling
Types of Non-Probability Sampling
Resampling Methods
• Resampling methods involve:
1. Repeatedly drawing a sample from the training
data.
2. Refitting the model of interest with each new
sample.
3. Examining all of the refitted models and then
drawing appropriate conclusions.
Two common resampling methods are:
• Cross Validation
• Bootstrapping
Cross Validation
Cross-Validation is used to estimate the test
error associated with a model to evaluate its
performance.
Validation set approach
Leave-one-out cross-validation
k-fold cross-validation
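A minimal sketch of k-fold cross-validation, using only the standard library. The model (a straight line fitted by least squares), the helper names `fit_line` and `k_fold_mse`, and the simulated data are all assumptions made for illustration. Setting k equal to the number of observations gives leave-one-out cross-validation.

```python
import random


def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def k_fold_mse(xs, ys, k, seed=0):
    """Estimate the test error (mean squared error) with k-fold CV."""
    indices = list(range(len(xs)))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k roughly equal parts
    fold_errors = []
    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [i for i in indices if i not in held_out]
        # Refit the model on the training folds only
        slope, intercept = fit_line([xs[i] for i in train_idx],
                                    [ys[i] for i in train_idx])
        # Evaluate on the held-out fold
        sq_err = [(ys[i] - (slope * xs[i] + intercept)) ** 2 for i in test_idx]
        fold_errors.append(sum(sq_err) / len(sq_err))
    return sum(fold_errors) / k


# Hypothetical data: a noisy straight line
rng = random.Random(1)
xs = [float(i) for i in range(50)]
ys = [2.0 * x + 1.0 + rng.gauss(0, 1) for x in xs]

print("5-fold CV estimate of MSE:", k_fold_mse(xs, ys, k=5))
print("LOOCV estimate of MSE:    ", k_fold_mse(xs, ys, k=len(xs)))
```

The validation set approach corresponds to a single split into training and held-out data; k-fold repeats that split k times so that every observation is held out exactly once.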
Bootstrap
Samples are drawn from the dataset with
replacement (allowing the same observation to
appear more than once in a sample). The
observations not drawn into a given sample
may be used as the test set.
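The procedure can be sketched as below, here used to estimate the variability of a sample mean. The data values and the helper name `bootstrap_means` are hypothetical; the sketch also counts how many observations are left out of each resample, since those are the ones available as a test set.

```python
import random
from statistics import mean, stdev


def bootstrap_means(data, n_resamples=1000, seed=0):
    """Resample with replacement and record the mean of each resample."""
    rng = random.Random(seed)
    n = len(data)
    means, left_out_counts = [], []
    for _ in range(n_resamples):
        # Draw n indices with replacement: repeats are allowed
        drawn = [rng.randrange(n) for _ in range(n)]
        means.append(mean([data[i] for i in drawn]))
        # Observations never drawn in this resample
        left_out_counts.append(n - len(set(drawn)))
    return means, left_out_counts


# Hypothetical measurements
data = [4.1, 5.3, 3.8, 6.0, 5.5, 4.9, 5.1, 4.4, 6.2, 5.0,
        4.7, 5.8, 3.9, 5.2, 4.6, 5.9, 4.3, 5.4, 4.8, 5.6]

means, left_out = bootstrap_means(data)
print(f"sample mean:              {mean(data):.3f}")
print(f"bootstrap standard error: {stdev(means):.3f}")
print(f"average fraction left out: {mean(left_out) / len(data):.3f}")
```

On average roughly a third of the observations do not appear in a given resample, which is why they can serve as a test set without being touched during fitting.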
Univariate Analysis
• It is the simplest form of data analysis, in
which the data consist of only one variable.
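A minimal sketch of what analysing a single variable can look like, using common summary statistics. The variable and its values are hypothetical, and the choice of statistics is an assumption rather than something specified in the slides.

```python
from statistics import mean, median, mode, stdev

# Hypothetical single variable: body weight in kg
weights = [52, 55, 55, 58, 60, 61, 64]

summary = {
    "n": len(weights),
    "min": min(weights),
    "max": max(weights),
    "mean": mean(weights),
    "median": median(weights),
    "mode": mode(weights),
    "stdev": stdev(weights),
}

for name, value in summary.items():
    print(f"{name}: {value}")
```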