Module 2 Feb

Uploaded by

Mhd Aslam
Copyright
© © All Rights Reserved
Statistical modelling

Module 2.1
Definition
Statistical modelling is a simplified, mathematically formalized
way to approximate reality (i.e. the process that generates your
data) and, optionally, to make predictions from this
approximation. The statistical model is the mathematical
equation that is used.
Statistical models are non-deterministic (stochastic).
E.g.: the weight of 1000 potatoes calls for a statistical approach.
Terminologies

• Dependent and explanatory variables
• Model parameter
• Model residual
Statistical hypothesis generation and testing: Definition
A hypothesis is a proposed explanation for a phenomenon
that does not fit into current theory.
Hypothesis generation is a process that begins with
an educated guess,
whereas
hypothesis testing is a process to conclude whether the
educated guess is true or false, or whether the relationship
between the variables is statistically significant or
not.
Hypothesis generation and testing: Steps involved
1. State the null and alternative hypotheses.
2. Choose the level of significance (α).
3. Find the critical value (or the p-value).
4. Find the test statistic.
5. Draw a conclusion.
• Step-1: Describe the hypotheses
• State them in simple words.
• Describe them using a population parameter (such as the mean,
median, S.D., proportion, etc.) about which a claim
(hypothesis) is made.
E.g.:
Considering Italian adults aged 18-30 living in Italy,
do males have a significantly higher mean
Body Mass Index (BMI) than females?

Here, what is the population, and what is the parameter of interest?

• The average time spent by women using social media is
more than that spent by men.
Step-2: Define the null and alternative hypotheses
Null hypothesis: there is no relationship between the two variables under consideration.
Initially, the null hypothesis is assumed to be true.
Alternative hypothesis: the complement of the null hypothesis.
Both hypotheses are defined using population parameters.

Example:
Null: There is no difference in mean BMI.

H0: µ1 = µ2 [µ1 represents the population mean BMI for males and
µ2 represents the population mean BMI for females]
Here H0 says that they are equal to each other.

Alternative: There is a significant difference in mean BMI.

Ha: µ1 ≠ µ2 [µ1 represents the population mean BMI for males and
µ2 represents the population mean BMI for females]
Here Ha says that they are not equal to each other.
Step 3: Set the level of significance and confidence (critical values)
• Confidence level (C)
• The significance level is the risk (probability) of
rejecting a null hypothesis when it is true. It
is denoted by α, where α = 1 − C, and is generally
taken as 1%, 5% or 10% (α + C = 1).
• (1 − α) is the confidence level: the probability of
not rejecting the null hypothesis when it is true.
• Two-tailed test: Ha: µ ≠ µ0 (two-sided)
• One-tailed test: Ha: µ < µ0 (left-sided) or
Ha: µ > µ0 (right-sided)
• Rejection region
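As an illustration of how the significance level fixes the rejection region, the sketch below computes critical values for two-tailed and one-tailed tests. It assumes the test statistic follows the standard normal distribution (a z-test); the function name `z_critical` is hypothetical and not part of the slides.

```python
from statistics import NormalDist

def z_critical(alpha, tail="two"):
    """Return the z critical value(s) for significance level alpha.

    tail="two"   -> Ha: mu != mu0, returns (lower, upper)
    tail="left"  -> Ha: mu <  mu0, returns a single lower bound
    tail="right" -> Ha: mu >  mu0, returns a single upper bound
    """
    std_normal = NormalDist()  # mean 0, standard deviation 1
    if tail == "two":
        # alpha is split equally between the two tails
        z = std_normal.inv_cdf(1 - alpha / 2)
        return (-z, z)
    if tail == "left":
        return std_normal.inv_cdf(alpha)
    if tail == "right":
        return std_normal.inv_cdf(1 - alpha)
    raise ValueError("tail must be 'two', 'left' or 'right'")

for alpha in (0.01, 0.05, 0.10):
    lower, upper = z_critical(alpha, "two")
    print(f"alpha={alpha:.2f}: reject H0 if z < {lower:.3f} or z > {upper:.3f}")
```

For α = 0.05 the two-tailed bounds are approximately ±1.96, the value commonly read from a z-table.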
Step 4: Identify the test statistic
- To test the validity of the null hypothesis.
- To quantify the evidence for or against the null
hypothesis.
Test statistic: z-test or t-test
Use the z-score if the population standard
deviation is known; otherwise use the t-statistic.
Decision criteria
• Compare the value of the test statistic with the
critical value.
Example 1
Example 2
Example 3
• A factory has a machine that dispenses 80 ml
of fluid into a bottle. An employee believes the
average amount of fluid is not 80 ml. Using 40
samples, he measures the average amount
dispensed by the machine to be 78 ml with an
S.D. of 2.5.
• a) State the null and alternative hypotheses.
• b) At a 95% confidence level, is there enough
evidence to support the idea that the
machine is not working properly?
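A possible worked solution to the example above, as a sketch: H0: µ = 80 and Ha: µ ≠ 80 (two-tailed). It assumes that, with n = 40, the S.D. of 2.5 can be used in a z-test; the variable names are illustrative only.

```python
from math import sqrt
from statistics import NormalDist

mu0 = 80.0    # mean volume claimed under H0 (ml)
x_bar = 78.0  # sample mean (ml)
s = 2.5       # standard deviation given in the example
n = 40        # sample size
alpha = 0.05  # 95% confidence level -> alpha = 1 - 0.95

# Test statistic: z = (x_bar - mu0) / (s / sqrt(n))
z = (x_bar - mu0) / (s / sqrt(n))

# Two-tailed critical value, since Ha is mu != 80
z_crit = NormalDist().inv_cdf(1 - alpha / 2)

reject_h0 = abs(z) > z_crit
print(f"z = {z:.2f}, critical value = ±{z_crit:.2f}")
print("Reject H0" if reject_h0 else "Fail to reject H0")
```

The statistic (about −5.06) lies far outside ±1.96, so under these assumptions H0 is rejected: there is enough evidence that the machine is not dispensing 80 ml on average.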
T-test-Example
A company manufactures car batteries with an average life span of 2 or more
years. An engineer believes the value to be less.
Using 10 samples, he measures the average life span to be 1.8 years with an S.D.
of 0.15.
a) State the null and alternative hypotheses.
b) At a 99% C.L., is there enough evidence to discard the null hypothesis?
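A possible worked solution, as a sketch: H0: µ ≥ 2 and Ha: µ < 2 (left-tailed). Because the sample is small (n = 10) and only the sample S.D. is known, a t-test with n − 1 = 9 degrees of freedom is used. The critical value 2.821 is taken from a standard t-table (one-tailed, α = 0.01, 9 d.f.); it is not stated in the slides.

```python
from math import sqrt

mu0 = 2.0     # life span claimed under H0 (years)
x_bar = 1.8   # sample mean (years)
s = 0.15      # sample standard deviation
n = 10        # sample size
df = n - 1    # degrees of freedom

# Test statistic: t = (x_bar - mu0) / (s / sqrt(n))
t = (x_bar - mu0) / (s / sqrt(n))

# One-tailed critical value for alpha = 0.01 and 9 d.f.,
# read from a standard t-table (assumption, not from the slides).
t_crit = 2.821

# Left-tailed test: reject H0 when t falls below -t_crit
reject_h0 = t < -t_crit
print(f"t = {t:.3f}, df = {df}, rejection region: t < {-t_crit}")
print("Reject H0" if reject_h0 else "Fail to reject H0")
```

Here t is about −4.216, which is inside the rejection region, so under these assumptions the null hypothesis is discarded at the 99% confidence level.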
F-test
Tests the hypothesis that the variances of two populations are equal.
F-test
Degrees of freedom: number of samples − 1 (for each sample).
Rejection criteria:
If the calculated F is smaller than the critical F
(table value), H0 is not rejected; otherwise H0 is rejected.
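The sketch below shows one way the F statistic and its degrees of freedom might be computed. It assumes the common convention of placing the larger sample variance in the numerator; the data are made up for illustration, and the table value 6.39 (upper 5% point for 4 and 4 d.f.) comes from a standard F-table rather than from the slides.

```python
from statistics import variance

def f_statistic(sample_a, sample_b):
    """Return (F, df_numerator, df_denominator) for two samples.

    The larger sample variance goes in the numerator, so F >= 1.
    """
    var_a = variance(sample_a)  # sample variance (n - 1 in the denominator)
    var_b = variance(sample_b)
    if var_a >= var_b:
        return var_a / var_b, len(sample_a) - 1, len(sample_b) - 1
    return var_b / var_a, len(sample_b) - 1, len(sample_a) - 1

# Hypothetical measurements from two populations
sample_a = [20, 22, 19, 24, 25]
sample_b = [21, 21, 22, 20, 21]

f_calc, df1, df2 = f_statistic(sample_a, sample_b)

# Table value for (4, 4) d.f. at the upper 5% point (assumption: standard F-table)
f_table = 6.39

print(f"F = {f_calc:.2f} with ({df1}, {df2}) degrees of freedom")
print("Reject H0" if f_calc > f_table else "Fail to reject H0")
```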
Structural Equation Modelling
• Structural equation modelling is a multivariate
statistical analysis technique that is used to analyze
structural relationships.
• This technique is a combination of factor
analysis and multiple regression analysis, and it is
used to analyze the structural relationship between
measured variables and latent constructs.
• Researchers prefer this method because it
estimates multiple, interrelated dependencies
in a single analysis.
• Two types of variables are used in this analysis:
endogenous variables and exogenous variables.
• Endogenous variables are equivalent to dependent
variables, and exogenous variables are equivalent to
independent variables.
Structural Equation Modelling
Power Analysis
Alpha (α): the probability of rejecting H0 when the null hypothesis is true.

Type I error (false positive): rejecting the null hypothesis even though it is true and
should not be rejected.
Type II error (false negative): failing to reject a null hypothesis that is actually
false.
Power Analysis
Power (also called the power of a hypothesis test) is the probability that a
test will correctly reject a false null hypothesis, i.e. the probability of a
true positive result.
Power = 1 − β, where β is the probability of a Type II error
• Low Statistical Power
• High Statistical Power
Four Parameters:
• Effect Size
• Sample Size
• Significance
• Statistical Power
An increase in effect size
increases power.
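The relationship between the four parameters can be sketched for one simple case. The function below assumes a right-tailed one-sample z-test with known standard deviation, where the effect size is the standardized difference d = (µ − µ0) / σ; the function name `z_test_power` is hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def z_test_power(effect_size, n, alpha=0.05):
    """Power of a right-tailed one-sample z-test.

    effect_size: standardized difference (mu_true - mu0) / sigma
    n:           sample size
    alpha:       significance level (probability of a Type I error)
    """
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha)
    # beta = P(fail to reject H0 | H0 is false)
    beta = std_normal.cdf(z_crit - effect_size * sqrt(n))
    return 1 - beta

# Power grows with the effect size (n fixed at 25) ...
for d in (0.2, 0.5, 0.8):
    print(f"d={d}, n=25: power = {z_test_power(d, 25):.3f}")

# ... and with the sample size (effect size fixed at 0.5)
for n in (10, 25, 50):
    print(f"d=0.5, n={n}: power = {z_test_power(0.5, n):.3f}")
```

Fixing any three of effect size, sample size, significance and power determines the fourth, which is why the same relation is used to choose a sample size before collecting data.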
Sample Size
Example 2
Sampling methods
Sampling is a technique of selecting a subset of
the population in order to make statistical inferences
from it and estimate characteristics of the
whole population.
It is a time-saving and cost-effective method.
Types of Probability Sampling
Types of Non-Probability Sampling
Resampling Methods
• Resampling methods involve:
1. Repeatedly drawing a sample from the training
data.
2. Refitting the model of interest with each new
sample.
3. Examining all of the refitted models and then
drawing appropriate conclusions.
Two common methods of Resampling are –
• Cross Validation
• Bootstrapping
Cross Validation
Cross-Validation is used to estimate the test
error associated with a model to evaluate its
performance.
Validation set approach
Cross Validation
Leave-one-out-cross-validation

k-fold cross-validation
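A minimal sketch of how k-fold cross-validation might split a dataset into training and test indices, using only the standard library. The function name `k_fold_indices` and the fixed seed are illustrative choices. Leave-one-out cross-validation corresponds to setting k equal to the number of observations.

```python
import random

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # shuffle once, reproducibly
    # Deal the shuffled indices into k folds of (almost) equal size
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test_idx = folds[i]
        # Every fold except the i-th is used for training
        train_idx = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        yield train_idx, test_idx

# 5-fold cross-validation on 10 observations
for fold_no, (train_idx, test_idx) in enumerate(k_fold_indices(10, 5), start=1):
    print(f"fold {fold_no}: train={sorted(train_idx)} test={sorted(test_idx)}")

# Leave-one-out cross-validation: k equals the number of observations
loocv_splits = list(k_fold_indices(10, 10))
print(f"LOOCV produces {len(loocv_splits)} splits with one test point each")
```

In practice the model is refitted on each training set and evaluated on the matching test set, and the k test errors are averaged to estimate the test error.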
Bootstrap
Samples are drawn from the dataset with
replacement (allowing the same observation to
appear more than once in a sample); the
instances not drawn into the sample
may be used as the test set.
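The bootstrap procedure described above can be sketched as follows. The data values are hypothetical, and the choice to estimate the standard error of the mean from 1000 resamples is only an illustration of what the refitted statistics may be used for.

```python
import random
from statistics import mean, stdev

def bootstrap_sample(data, rng):
    """Draw len(data) observations with replacement.

    Returns (in_bag, out_of_bag): the resampled observations and the
    observations that were never drawn (usable as a test set).
    """
    n = len(data)
    drawn = [rng.randrange(n) for _ in range(n)]  # indices, repeats allowed
    drawn_set = set(drawn)
    in_bag = [data[i] for i in drawn]
    out_of_bag = [data[i] for i in range(n) if i not in drawn_set]
    return in_bag, out_of_bag

rng = random.Random(42)  # fixed seed so the sketch is reproducible
data = [12.1, 9.8, 11.4, 10.3, 13.0, 9.5, 10.9, 11.7]  # hypothetical values

# Recompute the statistic of interest on each bootstrap sample
boot_means = []
for _ in range(1000):
    in_bag, out_of_bag = bootstrap_sample(data, rng)
    boot_means.append(mean(in_bag))

print(f"sample mean = {mean(data):.3f}")
print(f"bootstrap standard error of the mean = {stdev(boot_means):.3f}")
```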
Univariate Analysis
• It is the simplest form of Data Analysis in
which the data consists of only one variable.

• Here the data is described and the
patterns are analyzed.

Report No. | County Name | Cause of Injury | Severity of Injury
1          | County A    | Fall            | 3
2          | County B    | Auto            | 4
3          | County C    | Fall            | 6
4          | County C    | Fall            | 4
5          | County B    | Fall            | 5
6          | County A    | Violence        | 9
7          | County A    | Auto            | 3
8          | County A    | Violence        | 2
9          | County A    | Violence        | 9
10         | County B    | Auto            | 3
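As a sketch of univariate analysis on the injury reports above, each variable can be summarized on its own: frequency counts for the categorical variables and summary statistics for severity. The variable names are illustrative.

```python
from collections import Counter
from statistics import mean, median

# (report no., county, cause of injury, severity of injury), from the table
reports = [
    (1, "County A", "Fall", 3),
    (2, "County B", "Auto", 4),
    (3, "County C", "Fall", 6),
    (4, "County C", "Fall", 4),
    (5, "County B", "Fall", 5),
    (6, "County A", "Violence", 9),
    (7, "County A", "Auto", 3),
    (8, "County A", "Violence", 2),
    (9, "County A", "Violence", 9),
    (10, "County B", "Auto", 3),
]

# One variable at a time: frequency tables for the categorical variables
county_counts = Counter(county for _, county, _, _ in reports)
cause_counts = Counter(cause for _, _, cause, _ in reports)

# ... and summary statistics for the numeric variable
severity = [sev for _, _, _, sev in reports]

print("Reports per county:", dict(county_counts))
print("Reports per cause: ", dict(cause_counts))
print(f"Severity: mean={mean(severity)}, median={median(severity)}, "
      f"min={min(severity)}, max={max(severity)}")
```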
Data Analytics <Statistical Modeling> 44
Multivariate Analysis
• It is the set of statistical
techniques for analyzing data with a
focus on more than one variable.

• Here complex problems are examined, and the
relationships between the dimensions (variables)
are revealed explicitly.

Factor Analysis
Factor analysis is a technique used to study the
interrelationships among many variables.
It is an exploratory technique applied to a set of
observed variables that seeks to find the underlying
latent factors (subsets of variables) from which the
observed variables were generated.
The main purpose of factor analysis is to group a
large set of variables into fewer factors. Each
factor is a combination of many variables and
accounts for one or more components.
Steps in Factor Analysis
Step 1: Compute the correlation matrix for all
variables
Step 2: Factor extraction
Step 3: Factor rotation
Step 4: Make final decisions about the number
of underlying factors
Factor Analysis
• Here are the steps in factor analysis:
– Factor extraction
– Factor rotation

• Factor extraction can be carried out in two ways:
• Principal component analysis
• Common factor analysis
Factor Analysis
• PCA
– Unlike common factor analysis, principal component
analysis (PCA) assumes that there
is no unique variance: the total variance is equal
to the common variance.
• CFA
– Common factor analysis was developed to
express the variance shared among n observed
variables as a function of p underlying common
factors.
Factor Analysis
The observed variables are modeled as linear
combinations of the factors, plus “error” terms.
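The statement above can be written out as an equation. The notation is an assumption, chosen to match the earlier description of n observed variables and p common factors: x_i is the i-th observed variable, F_j the j-th factor, λ_ij the loading of variable i on factor j, and ε_i the error (unique) term.

```latex
x_i = \lambda_{i1} F_1 + \lambda_{i2} F_2 + \cdots + \lambda_{ip} F_p + \varepsilon_i,
\qquad i = 1, \dots, n
```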
Example: The Marketing Research Manager
prepares a questionnaire to study customer
feedback. The researcher has identified six
variables for this purpose.
They are as follows:
(A) Fuel efficiency
(B) Durability of life
(C) Comfort
(D) Spare parts availability
(E) Breakdown frequency
(F) Price
Factor Analysis
• A, B, D and E are grouped into Factor 1
• F is grouped into Factor 2
• C is grouped into Factor 3
• Factor 1 can be termed the Technical factor
• Factor 2 can be termed the Price factor
• Factor 3 can be termed the Personal factor
Non-Parametric Testing
• A non-parametric test (sometimes called a distribution-free
test) does not assume anything about the underlying
distribution.
• This testing method is used when the population data
do not follow a normal distribution.
• There are several statistical tests that can be used to assess
whether data are likely from a normal distribution.
• The most popular are the Kolmogorov-Smirnov test, the
Anderson-Darling test, and the Shapiro-Wilk test.
• Each test is essentially a goodness of fit test and compares
observed data to quantiles of the normal (or other specified)
distribution.
Non-Parametric Testing
• There are some situations when it is clear that the outcome
does not follow a normal distribution. These include
situations:
– when the outcome is an ordinal variable or a rank,
– when there are definite outliers or
– when the outcome has clear limits of detection
Non-Parametric Testing – Using Ordinal Scaled Data
Hypothesis Testing with Non-Parametric Tests
• In nonparametric tests, the hypotheses are not about
population parameters (e.g., μ=50 or μ1=μ2).
• Instead, the null hypothesis is more general. For example,
when comparing two independent groups in terms of a
continuous outcome, the null hypothesis in a parametric test
is H0: μ1 = μ2.
• In a nonparametric test the null hypothesis is that the two
populations are equal. This is often interpreted as the two
populations being equal in terms of their central tendency.
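To make the idea concrete, the sketch below computes the statistic of one widely used rank-based test for two independent groups, the Mann-Whitney U test. The slides do not name a specific test, so this choice, the function names and the data are all assumptions made for illustration. The test works on ranks rather than on the raw values, which is why it needs no normality assumption.

```python
def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks

def mann_whitney_u(group_a, group_b):
    """Return the Mann-Whitney U statistic (the smaller of U_a and U_b)."""
    n_a, n_b = len(group_a), len(group_b)
    ranks = average_ranks(list(group_a) + list(group_b))
    rank_sum_a = sum(ranks[:n_a])
    u_a = rank_sum_a - n_a * (n_a + 1) / 2
    u_b = n_a * n_b - u_a
    return min(u_a, u_b)

# Hypothetical ordinal scores (e.g. ratings on a 1-10 scale) for two groups
group_a = [3, 4, 2, 6, 2, 5]
group_b = [9, 7, 5, 10, 6, 8]

u = mann_whitney_u(group_a, group_b)
print(f"U = {u}")
```

A small U indicates that the values of one group tend to rank below those of the other; the decision is then made by comparing U with a critical value from a Mann-Whitney table for the given group sizes and α.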
Advantages of Non-Parametric Tests
• Data that are ordinal, ranked, subject to outliers or
measured imprecisely are difficult to analyze with
parametric methods; non-parametric tests are often
the only way to analyze them.

• Non-parametric tests are also easy to perform on a
dataset that does not have a normal distribution.
Activity – Spline-based model
