Advanced Statistics Day 1

This document provides an outline for an advanced statistics course. It will cover topics such as simple and multiple linear regression, logistic regression, factor analysis, path analysis, and structural equation modeling. Students will be required to write a full research paper in a publishable format. Regression analysis is commonly used in social science research to model relationships between variables and phenomena. It allows researchers to assess the interaction of multiple independent and dependent variables. The general linear model provides a framework to compare the effects of several variables on different continuous outcomes.

Uploaded by

촏교새벼

ADVANCED STATISTICS

Exequiel R. Gono Jr., PhD


Professional School
Topic Outline
◦ Review of Basic Statistics
◦ Class Orientation
◦ Topic Outline
◦ Requirements
Topic Outline
◦ Simple Linear Regression
◦ Multiple Linear Regression
◦ Simple Logistic Regression
◦ Multiple Logistic Regression
◦ Exploratory Factor Analysis
◦ Confirmatory Factor Analysis
◦ Path Analysis
◦ Structural Equation Modelling
Requirements for the Subjects
◦ Full Research Paper Publishable Format
Role of Regression Analysis in Research

◦ We want to model a certain phenomenon that influences human
behavior.
◦ Most inferential statistical procedures in social science research
are derived from a general family of statistical models called the
general linear model (GLM). A model is an estimated
mathematical equation that can be used to represent a set of
data, and linear refers to a straight line.
◦ Are our inferences valid? The best we can do is to calculate
probabilities about our inferences. Wisdom of the crowd.
General Linear Model
The General Linear Model (GLM) is a useful framework for comparing
how several variables affect different continuous variables. In its
simplest form, the GLM is described as:

Data = Model + Error (Rutherford, 2001, p.3)

The formula of the GLM:


Scatter Plot
When Should I Use Regression Analysis?
◦ Use regression analysis to describe the relationships
between a set of independent variables and the
dependent variable.
◦ Regression analysis produces a regression equation
where the coefficients represent the relationship
between each independent variable and the
dependent variable.
◦ You can also use the equation to make predictions.
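The points above (a regression equation whose coefficients describe the relationship, and which can be used for prediction) can be sketched in a few lines. This is only an illustrative sketch using ordinary least squares for one predictor; the function names and the study-hours data are hypothetical and do not come from the course material.

```python
from statistics import mean


def fit_simple_regression(x, y):
    """Return (intercept, slope) of the least-squares line y = b0 + b1 * x."""
    x_bar, y_bar = mean(x), mean(y)
    # Sum of cross-products and sum of squares around the means.
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


def predict(intercept, slope, x_new):
    """Use the fitted regression equation to predict a new value."""
    return intercept + slope * x_new


# Hypothetical data: hours studied (predictor) and exam score (criterion).
hours = [1, 2, 3, 4, 5]
score = [52, 55, 61, 64, 68]

b0, b1 = fit_simple_regression(hours, score)
print(f"score = {b0:.1f} + {b1:.1f} * hours")
print("predicted score for 6 hours:", round(predict(b0, b1, 6), 1))
```

Here the slope is the coefficient that "represents the relationship": each extra hour is associated with a change of b1 points in the predicted score.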
Why Regression Analysis?

[Figure: Predictors A, B, and C and the Criterion, shown in two panels. Left panel: less accurate, weaker prediction. Right panel: more accurate, stronger prediction.]


Popular Research Design in Social
Research
Case Studies
Experimental
Action Research
Field Surveys
Ethnography
Secondary Data
FGD/KII
Bhattacherjee (2012). Social Science Research: Principles, Methods, and Practices. University of South Florida
Use of Regression to Analyze a Wide Variety of Relationships
◦ Model multiple independent variables
◦ Assess interaction of Independent Variable and Dependent Variable
◦ Include continuous and categorical variables
◦ Use polynomial terms to model curvature

Sample Study
Do socio-economic status and race affect educational achievement?
•IV- Socio-Economic Status and Race
•DV- Educational Achievement
Do education and IQ affect earnings?
•IV- Education and IQ
•DV- Earnings
Do exercise habits and diet affect weight?
•IV- Exercise Habits and Diet
•DV- Weight
The Research Process
◦ Initial Observation (Research Question)
◦ Generate Theory
◦ Generate Hypothesis
◦ Identify the Variables
◦ Measure Variables
◦ Collect the Data
◦ Analyze the Data (Graph Data, Fit a Model)
◦ Conclusion
Operational Foundation
◦ Level- possible score (e.g., Sex- 2 levels)
◦ Group
◦ Score/Observation
◦ Groups of Participants
◦ Model- Equation
◦ Abstract to quantify and qualify
◦ "By chance"
◦ Participants- people who participate in the study
◦ True characteristics of the population
◦ Statistic- quantitative value of the sample

Levels of Measurement
◦ Nominal- describes the lowest level of measurement, where numbers are used only as labels
◦ Ordinal- the score represents some rank order
◦ Interval- the scoring rules are such that the spacing between scores reflects equal amounts of the variable
◦ Ratio- this level of measurement differs from the interval level only in that negative values are not allowed
Independent and Dependent Variable
One variable affects another.
In those hypotheses, the variable that is affected by the
other is labeled the dependent variable because it
depends on the other variable. The variable doing the
affecting, the supposed causal variable, is called the
independent variable.
Independent Variable → Dependent Variable

Predictor and Criterion Variable
Some hypotheses suppose only that variables are
related to each other.
In those cases we distinguish between the roles of
variables in the design by using the term predictor
variable for the one that is kind of "independent-y"
and criterion variable for the one we are trying to
explain or predict.
Predictor Variable → Criterion Variable
Type of Statistics
◦ Frequency analysis- is used when one is interested in the distribution of one or more variables in a single sample
   2 categories: Binomial Test
   >2 categories: Chi-Square
◦ Group comparisons- analyses that compare groups to each other
   2 categories: t test
   >2 categories: F test
◦ Repeated measures analyses- use data from groups that have been measured more than once, usually on the same variable. The comparisons are usually across time
   ANOVA (repeated)
   Time Series
◦ Correlational analyses- involve one group of participants that have been measured on more than one variable
   2 categories: Logistic Regression
   Interval: Linear Regression
Introduction to Statistical Learning
What is Statistical Learning?
◦Statistical learning refers to a vast set of tools for
understanding data.
• These tools can be classified as supervised or
unsupervised.
Supervised (input and output)
Unsupervised (Input)
Suppose we are statistical consultants hired by a client to provide advice on
how to improve sales of a particular product. We have sales data of the
product in 200 different markets, along with advertising budgets for three
different media: TV, radio, and newspaper.
◦ The advertising budgets are input variables, while sales
is an output variable. The input variables are typically
denoted using the symbol X, with a subscript to distinguish
them.
◦ The inputs are also called predictors, independent
variables, features, or sometimes just variables.
◦ Output variables are also called the response or
dependent variable.
◦ In a general set-up, we have p different predictors, 𝑋1, 𝑋2,
… , 𝑋𝑝.
◦ We assume that there is a relationship between 𝑌 and 𝑿 =
(𝑋1, 𝑋2, … , 𝑋𝑝) , which can be written in a general form
𝑌 = 𝑓 (𝑿) + 𝜀
◦ 𝑓() is some unknown function that represents the
systematic information that X provides about Y.
◦ 𝜀 is a random error term
In essence, statistical learning refers to a set of approaches
for estimating 𝑓.
We estimate 𝑓 for two main reasons: prediction and
inference.

Prediction- Using our estimate for 𝑓, which we denote by f̂,
we obtain the predicted values of 𝑌: Ŷ = f̂(𝑿).

Inference- Here our goal is not so much predicting 𝑌 as
understanding how 𝑌 changes as a function of 𝑿.
Regression Versus Classification
We refer to problems with a quantitative response as
regression problems.
Problems involving a qualitative response are referred to as
classification problems.
Researchers will focus on the response variable or the
dependent variable.
We tend to select statistical learning methods
on the basis of whether the response is
quantitative or qualitative; i.e., we might use
linear regression when quantitative and logistic
regression when qualitative.
Simple Linear Regression Analysis

A simple linear regression assumes an approximately linear
relationship between a quantitative response Y and a single
predictor variable X.

Predictor Variable/IV: 1
◦ Level of Measurement: Interval
◦ Number of Levels: Many
◦ Number of Groups: 1
Criterion Variable/Response/DV: 1
◦ Level of Measurement: Interval
◦ Number of Levels: Many
Measurement Occasions: 1 or 2
Sample Size:
◦ 108 + k – to determine the significant predictors
◦ 100 + 8k – for the best fit model
Research Design- Quantitative Research, Correlational Research, Predictive Causation
Research
Objectives-
Prediction- Sometimes researchers want to predict a score in the future, such as
administrators looking at students' high school performance to guess what their college
grade point averages will be. We tend to say that we are predicting scores.
Inference- Researchers are interested in exploring the relationship between two
variables to understand them better.

Statement of the Problem- Does students' high school performance predict their college
grade point average?
Primary Statistical Questions equation?
◦ How accurate are our guesses using the regression equation?
Example of a Study That Would Use Simple Linear Regression
Colleges don't have enough room for every high school
student who applies, and admissions offices must use some
information to try to guess who will succeed in order to make their
decisions. One popular predictor has always been SAT scores. In
the late 1960s, as the college population was changing, researchers
were interested in what the actual linear relationship was between
scores on the verbal section of the SAT, an interval-level variable
that ranged from 200 to 800, and college grade point average
(GPA) for the first year, which ranged from 0.00 to 4.00. They
collected data on both variables from a sample of about 4,000
students.
Multiple Linear Regression

A multiple linear regression assumes an approximately linear
relationship between a quantitative response Y and more than
one predictor variable X.

Predictor Variable/IV: 2+
◦ Level of Measurement: Interval
◦ Number of Levels: Many
◦ Number of Groups: 1
Criterion Variable/Response/DV: 1
◦ Level of Measurement: Interval
◦ Number of Levels: Many
Measurement Occasions: 1
Sample Size:
◦ 108 + k – to determine the significant predictors
◦ 100 + 8k – for the best fit model
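The two sample-size rules of thumb listed for linear regression are simple arithmetic, so a short sketch may help when planning a study. It assumes that k stands for the number of predictors (the slides do not define k explicitly); the function name is hypothetical.

```python
def min_sample_sizes(k):
    """Minimum sample sizes from the two rules of thumb on the slide.

    k is assumed to be the number of predictors in the regression model.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    return {
        "significant_predictors": 108 + k,  # rule: 108 + k
        "best_fit_model": 100 + 8 * k,      # rule: 100 + 8k
    }


# Example: one predictor (simple regression) versus four predictors.
for k in (1, 4):
    sizes = min_sample_sizes(k)
    print(k, sizes, "-> plan for at least", max(sizes.values()))
```

Taking the larger of the two values is a conservative choice when a study aims both to identify significant predictors and to find the best fit model.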
Research Design- Quantitative Research, Correlational Research, Predictive Causation
Research
Objectives-
Prediction- Sometimes researchers want to predict a score in the future, such as
administrators looking at students' high school performance to guess what their college
grade point averages will be. We tend to say that we are predicting scores.
Inference- Researchers are interested in exploring the relationship between two
variables to understand them better.

Statement of the Problem- What best fit model can be derived from the relationship between
Teaching Competence and Academic Performance?
Primary Statistical Questions equation?
◦ What are the relative contributions of each predictor to the criterion variable?
Statistical Assumptions

◦Linear relationship
◦Multivariate normality
◦No or little multicollinearity
◦No auto-correlation
◦Homoscedasticity
Linear relationship
◦ First, linear regression needs the relationship between the independent and dependent
variables to be linear. It is also important to check for outliers, since linear regression is
sensitive to outlier effects. The linearity assumption can best be tested with scatter
plots; the following two examples depict two cases, where no linearity and little linearity
are present.
Multivariate normality
This assumption can best be checked with a histogram or a Q-Q plot.
Normality can also be checked with a goodness-of-fit test, e.g., the
Kolmogorov-Smirnov test. When the data are not normally distributed, a
non-linear transformation (e.g., log-transformation) might fix this issue.
Multivariate normality
◦ Null hypothesis- The data do not deviate from a normal distribution.
◦ If the sig value < 0.05, the distribution is not normal.
◦ If the sig value > 0.05, the distribution is normal.
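The decision rule above (sig value below 0.05 means the data deviate from normality) can be illustrated with a rough, standard-library sketch of a one-sample Kolmogorov-Smirnov check. Two assumptions should be read loudly: the normal distribution is fitted with the sample's own mean and standard deviation, and the p-value uses the classical asymptotic approximation, which is conservative in that situation (statistical packages usually apply a correction, so their sig values will differ). The function name and both data sets are hypothetical.

```python
import math
from statistics import NormalDist, mean, stdev


def ks_normality(sample):
    """Return (D, approximate p-value) for a Kolmogorov-Smirnov normality check."""
    n = len(sample)
    fitted = NormalDist(mean(sample), stdev(sample))
    d = 0.0
    for i, x in enumerate(sorted(sample), start=1):
        cdf = fitted.cdf(x)
        # Largest gap between the empirical and the fitted normal CDF.
        d = max(d, i / n - cdf, cdf - (i - 1) / n)
    lam = (math.sqrt(n) + 0.12 + 0.11 / math.sqrt(n)) * d
    if lam < 0.3:
        return d, 1.0  # the series below is effectively 1 in this range
    p = 2 * sum((-1) ** (k - 1) * math.exp(-2 * (k * lam) ** 2)
                for k in range(1, 101))
    return d, min(max(p, 0.0), 1.0)


# Hypothetical data: evenly spaced normal quantiles versus a strongly skewed set.
roughly_normal = [NormalDist().inv_cdf((i - 0.5) / 20) for i in range(1, 21)]
skewed = [2 ** i for i in range(20)]

for name, data in (("roughly normal", roughly_normal), ("skewed", skewed)):
    d_stat, p_value = ks_normality(data)
    verdict = "normal" if p_value > 0.05 else "not normal"
    print(f"{name}: D = {d_stat:.3f}, sig = {p_value:.3f} -> {verdict}")
```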
Multicollinearity
Multicollinearity may be tested with three central criteria:
1) Correlation matrix – correlation coefficients need to be smaller than 1.
2) Tolerance – the tolerance measures the influence of one independent
variable on all other independent variables. Tolerance is defined as T = 1 –
R² for this first-step regression analysis, in which one independent variable is
regressed on all the others. With T < 0.1 there might be
multicollinearity in the data, and with T < 0.01 there certainly is.
3) Variance Inflation Factor (VIF) – the variance inflation factor of the
linear regression is defined as VIF = 1/T. With VIF > 5 there is an indication
that multicollinearity may be present; with VIF > 10 there is certainly
multicollinearity among the variables.
◦ There is no multicollinearity since T> 0.01 and VIF <5.
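The tolerance and VIF formulas above can be worked through by hand for the simplest case. The sketch below assumes exactly two predictors, where the R² from regressing one predictor on the other equals their squared Pearson correlation; with more predictors a full auxiliary regression would be needed. The function names and the data are hypothetical, and the cut-offs are the ones quoted on the slide.

```python
from statistics import mean


def pearson_r(x, y):
    """Pearson correlation between two equally long lists of scores."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    sxx = sum((a - x_bar) ** 2 for a in x)
    syy = sum((b - y_bar) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def tolerance_and_vif(x1, x2):
    """T = 1 - R^2 and VIF = 1 / T for the two-predictor case."""
    r_squared = pearson_r(x1, x2) ** 2
    tolerance = 1 - r_squared
    vif = float("inf") if tolerance == 0 else 1 / tolerance
    return tolerance, vif


def describe_vif(vif):
    """Apply the VIF cut-offs quoted on the slide."""
    if vif > 10:
        return "multicollinearity is certainly present"
    if vif > 5:
        return "multicollinearity may be present"
    return "no indication of multicollinearity"


# Hypothetical scores on two predictors.
x1 = [1, 2, 3, 4, 5]
x2 = [2, 1, 4, 3, 5]
t, vif = tolerance_and_vif(x1, x2)
print(f"T = {t:.2f}, VIF = {vif:.2f}: {describe_vif(vif)}")
```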
How to solve:
◦ If multicollinearity is found in the data, centering
the data (that is, subtracting the mean of the
variable from each score) might help to solve
the problem. However, the simplest way to
address the problem is to remove independent
variables with high VIF values.
No auto-correlation
◦ Autocorrelation occurs when the residuals are not independent from each other. For
instance, this typically occurs in stock prices, where the price is not independent from
the previous price.
◦ Use the Durbin-Watson test.
◦ Durbin-Watson’s d tests the null hypothesis that the residuals are not linearly auto-
correlated. While d can assume values between 0 and 4, values around 2 indicate no
autocorrelation. As a rule of thumb values of 1.5 < d < 2.5 show that there is no auto-
correlation in the data.
◦ However, the Durbin-Watson test only analyses linear autocorrelation and only between
direct neighbours, which are first order effects.
◦ There is no auto-correlation when 1.5 < d < 2.5.
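The Durbin-Watson statistic described above is the sum of squared differences between neighbouring residuals divided by the sum of squared residuals. A minimal sketch follows; the function names and the residual values are hypothetical, and the 1.5 to 2.5 band is the rule of thumb from the slide rather than a formal critical-value test.

```python
def durbin_watson(residuals):
    """Durbin-Watson d: squared first differences over squared residuals."""
    numerator = sum(
        (residuals[t] - residuals[t - 1]) ** 2 for t in range(1, len(residuals))
    )
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator


def no_autocorrelation(d):
    """Rule of thumb from the slide: 1.5 < d < 2.5."""
    return 1.5 < d < 2.5


# Hypothetical residuals, ordered as the cases were observed.
alternating = [1, -1, 1, -1]  # neighbours always flip sign
constant = [1, 1, 1, 1]       # neighbours are identical

for name, res in (("alternating", alternating), ("constant", constant)):
    d = durbin_watson(res)
    print(f"{name}: d = {d:.1f}, passes rule of thumb: {no_autocorrelation(d)}")
```

Values of d near 0 point to positive first-order autocorrelation and values near 4 to negative first-order autocorrelation, which is why both hypothetical series fall outside the band.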
Methods of Regressions
◦ Forced Enter (default)- All independent variables are entered
into the equation in one step, also called "forced entry".
◦ Stepwise Methods- Stepwise methods include or remove one
independent variable at each step, based (by default) on the
probability of F (p-value); alternatively, the F value can be used
instead.
◦ Hierarchical (Blockwise entry)- Predictors are selected based on
past work, and the experimenter decides in which order to
enter them into the model.
The Multiple Linear Regression Analysis
in SPSS
Research Problem.
This example is based on the FBI's 2006 crime statistics. In particular, we are
interested in the relationship between the size of the state, various property crime
rates, and the number of murders in the city. Our hypothesis is that less violent
crimes open the door to violent crimes. We also hypothesize that even when we
account for some effect of city size by comparing crime rates per 100,000
inhabitants, there is still an effect left.
◦ Conceptual Framework

Independent Variables
1. Motor vehicle theft
2. Burglary
3. Larceny Theft
4. Residence Population

Dependent Variable
Murder
Results

This shows the multiple linear regression model summary and overall fit statistics. We find that
the adjusted R² of our model is .398 with the R² = .407. This means that the linear regression
explains 40.7% of the variance in the data. The Durbin-Watson d = 2.074, which is between
the two critical values of 1.5 < d < 2.5. Therefore, we can assume that there is no first order
linear auto-correlation in our multiple linear regression data.

If we had forced all variables (Method: Enter) into the linear regression model,
we would have seen a slightly higher R² and adjusted R² (.458 and .424 respectively).
◦ The next output table is the F-test. The linear regression’s F-test has the null
hypothesis that the model explains zero variance in the dependent variable
(in other words R² = 0). The F-test is highly significant, thus we can assume that
the model explains a significant amount of the variance in murder rate.
In our stepwise multiple linear regression analysis, we find a non-significant intercept but highly
significant vehicle theft coefficient, which we can interpret as: for every 1-unit increase in vehicle
thefts per 100,000 inhabitants, we will see .014 additional murders per 100,000.

If we force all variables into the multiple linear regression, we find that only burglary and motor
vehicle theft are significant predictors. We can also see that motor vehicle theft has a higher
impact than burglary by comparing the standardized coefficients (beta = .507 versus beta =
.333).
Table 18. Empirical Analysis on the Indicator's Influence of Spiritual Programs
towards Spiritual Development

(B and Std. Error are unstandardized coefficients; Beta is the standardized coefficient.)

Variables                     B        Std. Error   Beta     t        Sig.
Constant                      1.538    0.336                 4.714    0
Beliefs about the Church      0.101    0.039        0.164    2.589    0.01*
Beliefs about my life        -0.382    0.121       -0.31    -3.17     0.002*
The Practice of Worship       0.026    0.086        0.025    0.305    0.761
The Practice of Prayer        0.391    0.105        0.309    3.738    0.00*
The Practice of Fellowship    0.202    0.444        0.289    4.623    0.00*

F Value = 13.369; P Value = 0.00; Adjusted R Square = 0.271; R Value = 0.524

Shown in Table 18 is the empirical analysis of the influence of the spiritual programs on spiritual development. Using
multiple linear regression analysis, the model was found to fit the data (F value = 13.369, P value = 0.00). This means that the regression model
results in significantly better prediction of spiritual development than the mean value alone. Further, around 27.1% of the variability of
spiritual development can be explained by the spiritual programs.
The indicators Beliefs about the Church, Beliefs about my Life, The Practice of Prayer, and The Practice of Fellowship
significantly predict the spiritual development of the students in San Pedro College.
Regression Analysis Using Dummy Variable
ANOVA VS Regression
Regression Analysis

ANOVA
Simple Logistic Regression

It predicts the probability that an observation falls
into one of two categories of a dichotomous
dependent variable based on one
independent variable that can be either continuous
or categorical.

Predictor Variable: 1
◦ Level of Measurement: Nominal+
◦ Number of Levels: 2+
◦ Number of Groups: 1
Criterion Variable: 1
◦ Level of Measurement: Nominal
◦ Number of Levels: 2
Measurement Occasions: 1
Sample Size: n = 100 + 50i, where i is the number of predictors
Research Design- Quantitative Research ,
Correlational Research, Predictive Causation
Research

Primary Statistical Question


For those at each level (or at each score or each category) of
the independent variable, what are the probabilities that they will be
in each category on the dependent variable?
Example of a Study That Would Use Simple Logistic Regression
A survey was administered to 1,431 inhabitants of a seaside
community. As an independent variable, the amount of fish eaten
regularly was assessed ("Think of all the meals you eat in a week; how
many usually include fish?"). To simplify interpretation, the researchers
chose a "cut score" on the independent variable and created a
nominal independent variable with two levels. In this example, the
researchers decided that the key point on the independent variable was
whether villagers ate two or more fish meals a week. Anything less than
that, and they were "infrequent fish eaters". As a dependent variable,
the survey included items from a depression measure. Scores above an
accepted point on the depression scale were interpreted as indicating
depression.
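The cut-score step and the primary statistical question in this example can be sketched directly. With a single two-level predictor, the probabilities a logistic regression would estimate for each level are simply the observed proportions in each group, so the sketch computes those. The function name and the eight survey records are hypothetical; only the cut score of two fish meals a week comes from the example.

```python
def depression_rates(fish_meals, depressed, cut=2):
    """Probability of depression for frequent and infrequent fish eaters.

    fish_meals: weekly fish meals per respondent.
    depressed:  1 if the respondent scored above the depression cut-off, else 0.
    cut:        respondents with at least this many fish meals are "frequent".
    """
    groups = {"frequent": [], "infrequent": []}
    for meals, outcome in zip(fish_meals, depressed):
        level = "frequent" if meals >= cut else "infrequent"
        groups[level].append(outcome)
    # Proportion depressed within each level of the dichotomized predictor.
    return {level: sum(v) / len(v) for level, v in groups.items() if v}


# Hypothetical survey records (not data from the study described above).
meals = [0, 1, 1, 3, 2, 4, 0, 5]
depression = [1, 1, 0, 0, 0, 1, 1, 0]

rates = depression_rates(meals, depression)
print(rates)
```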
Multiple Logistic Regression

It predicts the probability that an observation falls
into one of two categories of a dichotomous
dependent variable based on 2 or more
independent variables that can be either
continuous or categorical.

Predictor Variable: 2+
◦ Level of Measurement: Nominal+
◦ Number of Levels: 2+
◦ Number of Groups: 1
Criterion Variable: 1
◦ Level of Measurement: Nominal
◦ Number of Levels: 2
Measurement Occasions: 1
Sample Size: n = 100 + 50i, where i is the number of predictors
Multiple Logistic Regression
Primary Statistical Question
For those at each level (or at each score or each category) of
the independent variables, what are the probabilities that they will
be in each category on the dependent variable?
Example of a Study That Would Use Multiple Logistic Regression
Researchers were interested in the dangers of tanning in terms of the risk of
getting skin cancer. They recruited two types of people, those who had skin
cancer and those who did not, and formed one large group. Two independent
variables that should theoretically be risk factors for skin cancer were chosen
and measured at the nominal level with two levels. They were type of job
(outdoors or indoors) and ability to tan (good tanner or bad tanner). The
dependent variable of presence or absence of skin cancer was nominal with
two levels.
Assumptions
Assumption #1: Your dependent variable should be measured on
a dichotomous scale.
Assumption #2: You have one or more independent variables, which can
be either continuous (i.e., an interval or ratio variable)
or categorical (i.e., an ordinal or nominal variable)
Assumption #3: You should have independence of observations and the
dependent variable should have mutually exclusive and exhaustive
categories.
Assumption #4: There needs to be a linear relationship between any
continuous independent variables and the logit transformation of the
dependent variable.
Interpretation

Chi-square and Sig. – This is the chi-square statistic and its significance level. It plays the same role as the F test in Multiple Linear Regression.
The significance level is the probability of obtaining this chi-square statistic (65.588) if there is in fact no effect of the independent
variables, taken together, on the dependent variable.

Cox & Snell R Square and Nagelkerke R Square – These are pseudo R-squares. Logistic regression does not have an
equivalent to the R-squared that is found in OLS regression; however, many people have tried to come up with
one. There are a wide variety of pseudo-R-square statistics (these are only two of them). Because this statistic does
not mean what R-squared means in OLS regression (the proportion of variance explained by the predictors), we
suggest interpreting this statistic with great caution.
Predicted – These are the predicted values of the dependent variable based on the full logistic regression model. This
table shows how many cases are correctly predicted (132 cases are observed to be 0 and are correctly predicted to be 0;
27 cases are observed to be 1 and are correctly predicted to be 1), and how many cases are not correctly predicted (15
cases are observed to be 0 but are predicted to be 1; 26 cases are observed to be 1 but are predicted to be 0).

Overall Percentage – This gives the overall percent of cases that are correctly predicted by the model (in this case, the
full model that we specified). As you can see, this percentage has increased from 73.5 for the null model to 79.5 for
the full model.
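The percentages quoted above can be reproduced from the four cell counts of the classification table. The sketch below assumes that the null model predicts the more frequent observed category for every case, which is the usual baseline; the function name is hypothetical.

```python
def classification_summary(obs0_pred0, obs0_pred1, obs1_pred0, obs1_pred1):
    """Return (overall % correct for the full model, % correct for the null model)."""
    total = obs0_pred0 + obs0_pred1 + obs1_pred0 + obs1_pred1
    # Full model: cases on the diagonal of the table are correctly predicted.
    overall = 100 * (obs0_pred0 + obs1_pred1) / total
    # Null model (assumed): every case is predicted to be in the larger observed group.
    observed_0 = obs0_pred0 + obs0_pred1
    observed_1 = obs1_pred0 + obs1_pred1
    null = 100 * max(observed_0, observed_1) / total
    return overall, null


# Cell counts quoted in the interpretation above.
overall_pct, null_pct = classification_summary(132, 15, 26, 27)
print(f"full model: {overall_pct:.1f}% correct, null model: {null_pct:.1f}% correct")
```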
These are the values for the logistic regression equation for predicting the dependent variable from
the independent variables. They are in log-odds units. Similar to OLS regression, the prediction
equation is
log(p/(1-p)) = b0 + b1*x1 + b2*x2 + b3*x3 + b4*x4
where p is the probability of being in honours composition. Expressed in terms of the variables used
in this example, the logistic regression equation is
log(p/(1-p)) = –9.561 + 0.098*read + 0.066*science + 0.058*ses(1) – 1.013*ses(2)

Solving for p gives
p = e^(–9.561 + 0.098*read + 0.066*science + 0.058*ses(1) – 1.013*ses(2)) / (1 + e^(–9.561 + 0.098*read + 0.066*science + 0.058*ses(1) – 1.013*ses(2)))
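To see how the equation turns scores into a probability, the coefficients above can be plugged into a short sketch. It assumes that ses(1) and ses(2) are 0/1 dummy indicators for two categories of socio-economic status, with the remaining category as the reference; the function names and the student scores are hypothetical.

```python
import math

# Coefficients as reported in the logistic regression equation above.
COEF = {"intercept": -9.561, "read": 0.098, "science": 0.066,
        "ses1": 0.058, "ses2": -1.013}


def log_odds(read, science, ses1=0, ses2=0):
    """Log-odds of being in honours composition; ses1 and ses2 are assumed 0/1 dummies."""
    return (COEF["intercept"] + COEF["read"] * read + COEF["science"] * science
            + COEF["ses1"] * ses1 + COEF["ses2"] * ses2)


def probability(read, science, ses1=0, ses2=0):
    """Convert the log-odds to a probability: p = e^z / (1 + e^z)."""
    z = log_odds(read, science, ses1, ses2)
    return 1 / (1 + math.exp(-z))


# Hypothetical student: reading 60, science 60, in the reference SES category.
z = log_odds(60, 60)
p = probability(60, 60)
print(f"log-odds = {z:.3f}, probability = {p:.3f}")

# The same scores with the ses(2) dummy switched on lower the probability.
print(f"with ses(2) = 1: {probability(60, 60, ses2=1):.3f}")
```

The form 1 / (1 + e^(-z)) is algebraically the same as e^z / (1 + e^z) and is used here only because it is the more common way to write the calculation in code.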
