Advanced Statistics Day 1
[Slide diagram: Predictor A and Predictor C lead to a Conclusion, which is used to generate theory. Legend: predictor variable = independent variable; criterion variable = dependent variable, each annotated with its level of measurement and possible score.]
Type of Statistics
• Frequency analysis: used when one is interested in the distribution of one or more variables in a single sample.
◦ 2 categories: Binomial Test
◦ >2 categories: Chi-Square
• Group comparisons: analyses that compare groups to each other.
◦ 2 categories: t test
◦ >2 categories: F test
• Repeated measures analyses: use data from groups that have been measured (repeated) more than once, usually on the same variable; the comparisons are usually across time.
◦ ANOVA
◦ Time Series
• Correlational analyses: involve one group of participants that has been measured on more than one variable.
◦ 2-category criterion: Logistic Regression
◦ Interval criterion: Linear Regression
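As a rough illustration of how these four families map onto software, here is a minimal sketch using scipy.stats; the slide names only the tests, so all counts and group values below are invented.

```python
# A minimal sketch of the four test families above, using scipy.stats
# on synthetic data (all values are made up for illustration).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Frequency analysis: 2 categories -> binomial test; >2 -> chi-square
print(stats.binomtest(27, n=50, p=0.5))        # 27 "successes" out of 50 trials
print(stats.chisquare(f_obs=[18, 22, 10]))     # observed counts in 3 categories

# Group comparisons: 2 groups -> t test; >2 groups -> F test (one-way ANOVA)
g1 = rng.normal(0.0, 1, 30)
g2 = rng.normal(0.5, 1, 30)
g3 = rng.normal(1.0, 1, 30)
print(stats.ttest_ind(g1, g2))                 # t test, 2 groups
print(stats.f_oneway(g1, g2, g3))              # F test, 3 groups
```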
Introduction to Statistical Learning
What is Statistical Learning?
◦Statistical learning refers to a vast set of tools for
understanding data.
• These tools can be classified as supervised (both inputs and outputs are observed) or unsupervised (only inputs are observed).
Suppose we are statistical consultants hired by a client to provide advice on
how to improve sales of a particular product. We have sales data of the
product in 200 different markets, along with advertising budgets for three
different media: TV, radio, and newspaper.
◦ The advertising budgets are input variables, while sales is an output variable. The input variables are typically denoted using the symbol X, with a subscript to distinguish them.
◦ The inputs are also called predictors, independent variables, features, or sometimes just variables.
◦ Output variables are also called the response or dependent variables.
◦ In a general set-up, we have p different predictors, X1, X2, …, Xp.
◦ We assume that there is a relationship between Y and X = (X1, X2, …, Xp), which can be written in the general form
Y = f(X) + ε
◦ f is some unknown function that represents the systematic information that X provides about Y.
◦ ε is a random error term.
In essence, statistical learning refers to a set of approaches
for estimating 𝑓.
We estimate 𝑓 for two main reasons: prediction and
inference.
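To make "estimating f" concrete, here is a minimal sketch on synthetic advertising-style data (TV, radio, newspaper budgets predicting sales); the coefficients and noise level are invented, not the textbook's dataset.

```python
# A minimal sketch of estimating f-hat with a linear model:
# Y = f(X) + eps, on invented advertising-style data.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 200
X = rng.uniform(0, 100, size=(n, 3))                  # p = 3 predictors (budgets)
y = 5 + 0.05 * X[:, 0] + 0.10 * X[:, 1] + rng.normal(0, 1, n)  # Y = f(X) + eps

fit = sm.OLS(y, sm.add_constant(X)).fit()             # f-hat: a linear estimate of f
print(fit.params)                                     # inference: which inputs matter?
print(fit.predict(sm.add_constant(X))[:5])            # prediction: Y-hat for markets
```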
Simple Linear Regression
Measurement Occasions: 1 or 2
Statement of the Problem: Does students' high school performance predict their college grade point average?
Primary Statistical Questions
◦ What is the regression equation?
◦ How accurate are our guesses using the regression equation?
◦ Example of a Study That Would Use Simple Linear Regression
Colleges don't have enough room for every high school
student who applies, and admissions offices must use some
information to try to guess who will succeed in order to make their
decisions. One popular predictor has always been SAT scores. In the late 1960s, as the college population was changing, researchers were interested in what the actual linear relationship was between scores on the verbal section of the SAT, an interval-level variable that ranged from 200 to 800, and college grade point average (GPA) for the first year, which ranged from 0.00 to 4.00. They collected data on both variables from a sample of about 4,000 students.
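A sketch of how such a simple linear regression could be run, on simulated SAT/GPA data rather than the actual 1960s sample; the slope, noise, and sample values below are made up.

```python
# A sketch of a simple linear regression: verbal SAT (200-800)
# predicting first-year GPA (0.00-4.00), on simulated data.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
sat = rng.integers(200, 801, size=4000).astype(float)
gpa = np.clip(0.5 + 0.004 * sat + rng.normal(0, 0.5, 4000), 0.0, 4.0)

fit = sm.OLS(gpa, sm.add_constant(sat)).fit()
print(fit.params)      # intercept and slope: GPA change per SAT point
print(fit.rsquared)    # how accurate are guesses made with this equation?
```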
Multiple Linear Regression
A multiple linear regression assumes an approximately linear model between a quantitative response Y and more than one predictor variable X.
Predictor Variable/IV: 2+
◦ Level of Measurement: Interval
◦ Number of Levels: Many
◦ Number of Groups: 1
Criterion Variable/Response/DV: 1
◦ Level of Measurement: Interval
◦ Number of Levels: Many
Measurement Occasions: 1
Sample Size:
◦ 100 + 8k to test the best-fit model
◦ 108 + k to determine the significant predictors (k = number of predictors)
Research Design: Quantitative Research, Correlational Research, Predictive Causation Research
Objectives:
◦ Prediction: Sometimes researchers want to predict a score in the future, such as administrators looking at students' high school performance to guess what their college grade point averages will be. We tend to say that we are predicting scores.
◦ Inference: Researchers are interested in exploring the relationship between two variables to understand them better.
Statement of the Problem: What best-fit model can be derived from the relationship of Teaching Competence and Academic Performance?
Primary Statistical Questions
◦ What is the regression equation?
◦ What are the relative contributions of each predictor to the criterion variable?
Statistical Assumptions
◦Linear relationship
◦Multivariate normality
◦No or little multicollinearity
◦No auto-correlation
◦Homoscedasticity
Linear relationship
◦ First, linear regression needs the relationship between the independent and dependent variables to be linear. It is also important to check for outliers, since linear regression is sensitive to outlier effects. The linearity assumption can best be tested with scatter plots; the two examples on the slide depict cases where no linearity and little linearity are present.
Multivariate normality
This assumption can best be checked with a histogram or a Q-Q plot. Normality can be checked with a goodness-of-fit test, e.g., the Kolmogorov-Smirnov test. When the data are not normally distributed, a non-linear transformation (e.g., a log transformation) might fix the issue.
If the significance value of the normality test is < 0.05, the distribution is not normal.
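A minimal sketch of these normality checks (Kolmogorov-Smirnov test and Q-Q plot) together with a log transformation, on invented right-skewed data:

```python
# Normality checks from the slide: KS test, log transform, Q-Q plot.
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(3)
x = rng.lognormal(mean=0, sigma=0.8, size=500)     # right-skewed, not normal

# KS test against a normal with the sample's own mean and sd;
# sig value < .05 -> reject normality
print(stats.kstest(x, "norm", args=(x.mean(), x.std())))

x_log = np.log(x)                                   # log transform often fixes skew
print(stats.kstest(x_log, "norm", args=(x_log.mean(), x_log.std())))

stats.probplot(x_log, dist="norm", plot=plt)        # Q-Q plot of transformed data
plt.show()
```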
This shows the multiple linear regression model summary and overall fit statistics. We find that
the adjusted R² of our model is .398 with the R² = .407. This means that the linear regression
explains 40.7% of the variance in the data. The Durbin-Watson d = 2.074, which is between
the two critical values of 1.5 < d < 2.5. Therefore, we can assume that there is no first order
linear auto-correlation in our multiple linear regression data.
Had we forced all variables (Method: Enter) into the linear regression model, we would have seen a slightly higher R² and adjusted R² (.458 and .424, respectively).
◦ The next output table is the F-test. The linear regression’s F-test has the null
hypothesis that the model explains zero variance in the dependent variable
(in other words R² = 0). The F-test is highly significant, thus we can assume that
the model explains a significant amount of the variance in murder rate.
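These same fit statistics (R², adjusted R², Durbin-Watson d, and the omnibus F test) can be reproduced with statsmodels; the sketch below uses placeholder predictors and a placeholder murder-rate outcome, not the actual crime data.

```python
# A sketch of the model-summary statistics discussed above,
# on invented crime-style data (two predictors, one outcome).
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(4)
X = rng.normal(size=(50, 2))                   # e.g., burglary, vehicle theft rates
murder = 2 + 0.5 * X[:, 0] + 0.8 * X[:, 1] + rng.normal(0, 1, 50)

fit = sm.OLS(murder, sm.add_constant(X)).fit()
print(fit.rsquared, fit.rsquared_adj)          # R-squared and adjusted R-squared
print(durbin_watson(fit.resid))                # ~2 -> no first-order autocorrelation
print(fit.fvalue, fit.f_pvalue)                # F test of H0: R-squared = 0
```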
In our stepwise multiple linear regression analysis, we find a non-significant intercept but highly
significant vehicle theft coefficient, which we can interpret as: for every 1-unit increase in vehicle
thefts per 100,000 inhabitants, we will see .014 additional murders per 100,000.
If we force all variables into the multiple linear regression, we find that only burglary and motor
vehicle theft are significant predictors. We can also see that motor vehicle theft has a higher
impact than burglary by comparing the standardized coefficients (beta = .507 versus beta =
.333).
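One common way to obtain standardized coefficients (betas) for such comparisons is to z-score the response and every predictor, refit, and read the slopes; the data in this sketch are invented.

```python
# A sketch of computing standardized coefficients (betas):
# z-score y and each X, refit, and the slopes are the betas.
import numpy as np
import statsmodels.api as sm

def zscore(a):
    return (a - a.mean(axis=0)) / a.std(axis=0)

rng = np.random.default_rng(5)
X = rng.normal(size=(50, 2)) * [10.0, 200.0]   # predictors on very different scales
y = 1 + 0.03 * X[:, 0] + 0.004 * X[:, 1] + rng.normal(0, 1, 50)

betas = sm.OLS(zscore(y), sm.add_constant(zscore(X))).fit().params[1:]
print(betas)   # comparable effect sizes, like beta = .507 vs .333 above
```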
Table 18. Empirical Analysis on the Indicator’s Influence of Spiritual Programs
towards Spiritual Development
[Table 18 columns: Variables; Unstandardized Coefficients (B, Std. Error); Standardized Coefficients (Beta); t; Sig. The indicator rows are not recoverable from the slide.]
Shown in Table 18 is the empirical analysis of the influence of the spiritual programs on spiritual development. Using Multiple Linear Regression Analysis, the model was a good fit (F = 13.369, p = 0.000). This means that the regression model results in significantly better prediction of spiritual development than the mean value. Further, around 27.1% of the variability in spiritual development can be explained by the spiritual programs.
The indicators Beliefs about the Church, Beliefs about my Life, The Practice of Prayer, and The Practice of Fellowship significantly predict the spiritual development of the students in San Pedro College.
Regression Analysis Using Dummy Variables
ANOVA vs. Regression
Simple Logistic Regression
It predicts the probability that an observation falls into one of two categories of a dichotomous dependent variable based on one independent variable that can be either continuous or categorical.
Predictor Variable: 1
◦ Level of Measurement: Nominal+
◦ Number of Levels: 2+
◦ Number of Groups: 1
Criterion Variable: 1
◦ Level of Measurement: Nominal
◦ Number of Levels: 2
Measurement Occasions: 1
Sample Size: n = 100 + 50i, where i is the number of predictors
Research Design: Quantitative Research, Correlational Research, Predictive Causation Research
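A minimal sketch of a simple logistic regression in statsmodels, on invented data with one continuous predictor and a dichotomous outcome; the later sketches reuse this fitted `result` object.

```python
# A sketch of simple logistic regression with statsmodels.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(6)
x = rng.normal(size=200)
p_true = 1 / (1 + np.exp(-(-0.5 + 1.2 * x)))    # true P(Y=1 | x), invented
y = rng.binomial(1, p_true)

result = sm.Logit(y, sm.add_constant(x)).fit()
print(result.params)                             # coefficients in log-odds units
print(result.predict(sm.add_constant(x))[:5])    # predicted probabilities
```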
This is the chi-square statistic and its significance level; it plays the same role as the F test in multiple linear regression. The significance level is the probability of obtaining this chi-square statistic (65.588) if there is in fact no effect of the independent variables, taken together, on the dependent variable.
Cox & Snell R Square and Nagelkerke R Square – These are pseudo R-squares. Logistic regression does not have an
equivalent to the R-squared that is found in OLS regression; however, many people have tried to come up with
one. There are a wide variety of pseudo-R-square statistics (these are only two of them). Because this statistic does
not mean what R-squared means in OLS regression (the proportion of variance explained by the predictors), we
suggest interpreting this statistic with great caution.
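For concreteness, here is a sketch of computing these two pseudo R-squares from the log-likelihoods of a fitted statsmodels Logit model (the `result` object from the earlier sketch), using the standard Cox & Snell and Nagelkerke definitions.

```python
# Cox & Snell and Nagelkerke pseudo R-squares from a Logit fit.
import numpy as np

def pseudo_r2(result):
    """Pseudo R-squares from full and null (intercept-only) log-likelihoods."""
    n = result.nobs
    cox_snell = 1 - np.exp((2 / n) * (result.llnull - result.llf))
    nagelkerke = cox_snell / (1 - np.exp((2 / n) * result.llnull))  # max = 1
    return cox_snell, nagelkerke

print(pseudo_r2(result))   # "result" is the Logit fit from the sketch above
```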
Predicted – These are the predicted values of the dependent variable based on the full logistic regression model. This
table shows how many cases are correctly predicted (132 cases are observed to be 0 and are correctly predicted to be 0;
27 cases are observed to be 1 and are correctly predicted to be 1), and how many cases are not correctly predicted (15
cases are observed to be 0 but are predicted to be 1; 26 cases are observed to be 1 but are predicted to be 0).
Overall Percentage – This gives the overall percent of cases that are correctly predicted by the model (in this case, the
full model that we specified). As you can see, this percentage has increased from 73.5 for the null model to 79.5 for
the full model.
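This classification table can be reproduced with statsmodels' pred_table(), again using the `result` fit from the earlier sketch; the 0.5 cutoff is the conventional default.

```python
# A sketch of the classification table: observed vs predicted at a 0.5 cutoff.
table = result.pred_table(threshold=0.5)   # rows = observed, cols = predicted
correct = table[0, 0] + table[1, 1]        # correctly predicted 0s and 1s
print(table)
print(100 * correct / table.sum())         # overall percentage correctly predicted
```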
These are the values for the logistic regression equation for predicting the dependent variable from the independent variables. They are in log-odds units. Similar to OLS regression, the prediction equation is
log(p/(1-p)) = b0 + b1*x1 + b2*x2 + b3*x3 + b4*x4
where p is the probability of being in honours composition. Expressed in terms of the variables used in this example, the logistic regression equation is
log(p/(1-p)) = –9.561 + 0.098*read + 0.066*science + 0.058*ses(1) – 1.013*ses(2)
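A sketch of converting this fitted equation from log-odds to a predicted probability for one hypothetical student; the input values (read = 60, science = 55, ses(1) = 1, ses(2) = 0) are made up for illustration.

```python
# Invert log(p / (1 - p)) to get a predicted probability.
import numpy as np

log_odds = -9.561 + 0.098 * 60 + 0.066 * 55 + 0.058 * 1 - 1.013 * 0
p = 1 / (1 + np.exp(-log_odds))   # logistic (inverse logit) function
print(p)                          # predicted P(honours composition)
```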