Adstat Final Exam Reviewer
MODULE 4A - Testing the difference between two means using the z-test

Formula for the test statistic z:
z = [(x̄1 − x̄2) − (μ1 − μ2)] / √(σ1²/n1 + σ2²/n2)
The hypothesized difference (μ1 − μ2) is equal to zero if the two population means are conjectured to be equal.
Traditional Method
1. State the null and alternative
hypotheses. Identify the claim. It is
necessary to identify group 1 and
group 2.
2. Test for normality if sample sizes are
less than 30. If satisfied, compute the
test statistic, 𝑧.
3. Find the critical value.
4. Make the decision.
5. Summarize the results.
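A minimal sketch of these five steps, assuming made-up summary statistics (the means, standard deviations, and sample sizes below are not from the reviewer):

```python
# Two-sample z-test sketch with assumed summary data.
from math import sqrt
from scipy.stats import norm

x1, x2 = 87.4, 85.1   # sample means for group 1 and group 2 (assumed)
s1, s2 = 5.3, 4.9     # standard deviations (assumed)
n1, n2 = 50, 60       # both samples >= 30, so no normality test is needed

z = ((x1 - x2) - 0) / sqrt(s1**2 / n1 + s2**2 / n2)  # hypothesized diff = 0
crit = norm.ppf(1 - 0.05 / 2)                        # two-tailed, alpha = 0.05
print(f"z = {z:.3f}, critical values = ±{crit:.3f}")
# Decision: reject Ho if |z| > crit, then summarize the results.
```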
Characteristics of a Chi-square
Distribution:
1. The chi-square distribution is a family
of curves based on the degrees of
freedom.
2. The chi-square distributions are
positively skewed.
3. All chi-square values are greater than
or equal to zero.
4. The total area under each chi-square
distribution is equal to one.
General Assumptions of a Chi-square
Distribution:
1. The sample was chosen using a
random sampling method.
2. The variables being analyzed are
categorical (nominal or ordinal).
3. The observations are independent of one another.
4. The expected frequency for each category is at least 5.
The Chi-square Distribution can be used to
• find a confidence interval for a variance or standard deviation
• test a hypothesis about a single variance or standard deviation
• test concerning frequency distributions
• test the goodness of fit
• test for independence of two categorical variables
• test the homogeneity of proportions
• test the normality of the variable

Chi-square Goodness-of-Fit Test //if the data follow a pattern
The chi-square goodness-of-fit test is used if we would like to see whether the distribution of data follows a specific pattern.
For example:
• You would like to see whether the values obtained from an actual observation on the monthly dividend in stocks differ considerably from the expected value.
• You may want to investigate whether the fluctuation in interest rates on Sundays is higher than on the rest of the days in a week.
Chi-square Test: //if there is a difference
A chi-square test (or chi-squared test), denoted by χ², is a statistical hypothesis test
• used to investigate whether distributions of categorical variables (at the nominal or ordinal levels of measurement) significantly differ from one another.
• commonly used to compare observed data (actual values) with the data we would expect to obtain (expected values) according to a specific hypothesis.
• used to test information about the proportion or percentage of people or things who fit into a category.

Chi-square Test of Independence //if they are related to each other
• can be used to test the independence of two variables
• is used when we would like to see:
• whether or not two random variables take their values independently.
• whether the value of one relates with another.
• whether one variable is associated with another.
• this test of hypothesis uses the chi-square distribution and the contingency table.
For example:
• based on the distribution of data, you want to see whether the success of an individual in his chosen career is independent of or related to his academic performance in college.
Here, the two variables involved are
the success of an individual in his
chosen career and his academic
performance in college.
• you may want to see whether the life in years of laptops is independent of brand. Here, the two variables involved are the life in years of laptops and the brand of laptops.
• A study which involves determining if job satisfaction can be associated with income. The two variables are job satisfaction and income.

Chi-square Test for Homogeneity of Proportions
• can also be used to test the homogeneity of proportions.
• this is used to determine whether the proportions for a variable are equal when several samples are selected from different populations.
• this also uses the chi-square distribution and the contingency table.
For Example:
• You would like to see if the proportions of each group of students who play online gaming are equal based on their program of affiliation, say the proportions of accountancy students, engineering students, and architecture students who play online gaming.
• You may want to see if the proportions of employees who are into the stock market are equal based on the nature of their profession (IT, Medicine, Accounting, Engineering).

Two main types of Chi-square Tests to be discussed here are:
• Goodness-of-fit tests, which focus on one categorical variable.
• Tests of independence, which focus on the relationship between two categorical variables. Thus, the contingency table (or cross-tabulation table) will be used to present the data values.

To illustrate the use of the chi-square test:
If, according to Mendel's laws, you expect 10 of 20 offspring to be male and the actual observed number was 8 males, then you might want to know about the "goodness-of-fit" between the observed and expected data.
Were the deviations (differences between observed and expected values) the result of chance, or were they due to other factors? How much deviation can occur before we conclude that something other than chance is at work, causing the observed to differ from the expected value?
The chi-square test is always testing what scientists call the null hypothesis, which states that there is no significant difference between the expected and observed result.

Test for Goodness-of-Fit
The chi-square goodness-of-fit test is used to test the claim that an observed frequency distribution fits some given expected frequency distribution.
Assumptions of Chi-square Goodness-of-Fit Test:
1. The data are obtained from a random sample.
2. The expected frequency for each category must be 5 or more.
• If the observed frequencies are close to the corresponding expected frequencies, the χ²-value will be small, indicating a good fit.
• If the observed frequencies differ considerably from the expected frequencies, the χ²-value will be large and the fit is poor.
• A good fit leads to the non-rejection of Ho, whereas a poor fit leads to its rejection.

To calculate the expected frequencies, there are two rules to follow:
1. If all the expected frequencies are equal, the expected frequency E can be calculated by using E = n/k, where n is the total number of observations and k is the number of categories.
2. If all the expected frequencies are not equal, then the expected frequency E can be calculated by E = n·p, where n is the total number of observations and p is the probability for that category (or p is the hypothesized proportion from the null hypothesis).

Example #1 (life span of laptop batteries in three categories):
Are these differences significant (which means there is a difference in the life span of the batteries for each category), or will it be due to chance only? Thus, the two opposing statements are necessary before computing the test value: the null and alternative hypotheses. Here, the null hypothesis indicates that there is no difference or change among the categories.
Ho : There is no difference in the life span of laptop batteries among the three categories.
H1 : There is a difference in the life span of laptop batteries among the three categories.
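The observed battery counts were on a slide that is not reproduced here, so the figures below are assumed; the sketch only illustrates Rule 1 (equal expected frequencies, E = n/k):

```python
# Goodness-of-fit sketch for the battery example with assumed counts.
from scipy.stats import chisquare

observed = [40, 32, 48]        # assumed life-span counts per category
n, k = sum(observed), len(observed)
expected = [n / k] * k         # Rule 1: E = n/k = 120/3 = 40 per category

chi2, p = chisquare(observed, f_exp=expected)
print(f"chi-square = {chi2:.3f}, p-value = {p:.4f}")
# Reject Ho at alpha = 0.05 if p < 0.05 (equivalently, if chi2 exceeds
# the critical value with df = k - 1 = 2).
```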
Example #2
A financial analyst wants to determine
whether investors have any preference on
Example #3
An article shows statistics of orders made online on a particular product with different online stores within the city. The data are based on the last six months of the previous year, as follows: July 17%, August 11%, September 8%, October 14%, November 27%, and December 23%. The CECT online store manager wants to compare the orders made with his store with the data revealed by the article. The manager listed the number of orders in his store for the same product stated in the article. The table below shows the data collected by the manager for the last six months of the previous year.

At the 0.01 level of significance, can we support the claim that the proportions of orders with the CECT online store are the same as those of the rest of the online stores within the city?

Months       Number of Orders made with CECT store
July         27
August       17
September    22
October      45
November     30
December     59

*Note that this problem involves only one categorical variable, months covered in a year, so we use the goodness-of-fit test.

Module 5b – Test for Independence

Test for Independence (CATEGORICAL DATA)
The chi-squared test procedure can also be used to test the hypothesis of independence of two variables of classification.
A contingency table with r rows and c columns is referred to as an r × c table.
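A sketch of Example #3 using the table above and the article's percentages as the hypothesized proportions (Rule 2, E = n·p):

```python
# Goodness-of-fit test for the CECT store orders (data from the reviewer).
from scipy.stats import chisquare

observed = [27, 17, 22, 45, 30, 59]        # CECT store orders, Jul-Dec
p = [0.17, 0.11, 0.08, 0.14, 0.27, 0.23]   # article's proportions
n = sum(observed)                          # 200 orders in total
expected = [n * pi for pi in p]            # Rule 2: E = n*p for each month

chi2, pval = chisquare(observed, f_exp=expected)
print(f"chi-square = {chi2:.3f}, p-value = {pval:.4f}")
# At the 0.01 level, reject Ho (same proportions as the article) if
# pval < 0.01, with df = k - 1 = 5.
```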
Example #1
An education analyst wishes to see whether the academic achievement a person has completed is related to his or her socioeconomic status. A sample of 88 people is randomly selected. At α = 0.05, can it be concluded that a person's academic achievement is dependent on the person's socioeconomic status?
*Note that this problem involves two categorical variables, the academic achievement a person has completed and his or her socioeconomic status, so we use the independence test.

Example #2
A study was conducted to see if there was a relationship between memory recall and the length of gadget usage per day of children. A sample of 338 grade-level pupils is randomly selected and the results are shown in the table below. At the α = 0.01 level of significance, can it be assumed that memory recall and the length of gadget usage per day of children are dependent?
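The reviewer's contingency tables were on slides that are not reproduced, so the cell counts below are assumed; they are only chosen to total the 338 pupils stated in Example #2:

```python
# Test of independence sketch on an assumed 2x3 contingency table.
from scipy.stats import chi2_contingency

# rows: memory recall (good, poor); columns: gadget usage (<2h, 2-4h, >4h)
table = [[60, 55, 40],
         [35, 68, 80]]          # assumed counts summing to 338

chi2, p, df, expected = chi2_contingency(table)
print(f"chi-square = {chi2:.3f}, df = {df}, p-value = {p:.4f}")
# At alpha = 0.01, conclude the two variables are dependent if p < 0.01.
```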
Ranking of Data
There are many applications in business where data are reported not as values on a continuum but rather on an ordinal scale; thus, assigning ranks to the values is necessary to draw an analysis of the data. Distribution-free methods therefore allow the data analyst to analyze ranks rather than the actual data values, which makes nonparametric tests very appealing and intuitive.

For example, assuming that the nonparametric test is applicable, an HR personnel would like to determine the degree of relationship between the performance ranks obtained by ten trainees during the first and second evaluation periods. A nonparametric test could then be used to determine if there is an agreement between the two rank evaluations.

Thus, since nonparametric tests can be applied to the ordinal scale of data measurement, it is important for the analyst to be efficient in ranking data sets.

One-Sample Runs Test //checks nonrandomness
The one-sample runs test is also called the Wald-Wolfowitz test after its inventors, Abraham Wald (1902-1950), and his student Jacob Wolfowitz.
• The one-sample runs test's purpose is to detect nonrandomness.
• A nonrandom pattern suggests that the observations are not independent.
• Here, we investigate whether each observation in a sequence is independent of its predecessor (or the appearance of one is not dependent on the appearance of another).
Runs Test //checks for randomness
This test is to determine whether a sequence
of binary events (two outcomes involved)
follows a random pattern. A nonrandom
sequence suggests nonindependent
observations.
The hypotheses are:
Ho : Events follow a random pattern.
H1 : Events do not follow a random
pattern.
To test the hypothesis of randomness, we
first count the number of outcomes of each
type:
n1 : number of outcomes in the first
type
n2 : number of outcomes in the
second type
n = total sample size = n1 + n2
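For larger samples, the number of runs R is commonly compared with its expected value under Ho using a normal approximation. A minimal sketch on an assumed binary sequence:

```python
# Runs test sketch: count runs and apply the normal approximation.
from math import sqrt
from scipy.stats import norm

seq = "AABBBABBAAABBABBBBAA"                 # assumed sequence of two outcomes
n1, n2 = seq.count("A"), seq.count("B")
n = n1 + n2
runs = 1 + sum(seq[i] != seq[i - 1] for i in range(1, len(seq)))

mu = 2 * n1 * n2 / n + 1                                   # expected runs
var = 2 * n1 * n2 * (2 * n1 * n2 - n) / (n**2 * (n - 1))   # variance of runs
z = (runs - mu) / sqrt(var)
p = 2 * (1 - norm.cdf(abs(z)))                             # two-tailed p-value
print(f"runs = {runs}, z = {z:.3f}, p-value = {p:.4f}")
# Reject Ho (events follow a random pattern) if p is below the chosen alpha.
```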
One-Way ANOVA
• Also known as Completely Randomized Design or One-Factor ANOVA.
• Experimental units (or subjects) are assigned randomly to treatments or groups.
• Subjects are assumed homogeneous
• Only one factor or independent
variable
• With three or more treatment levels
• Analyzed by one-factor analysis of
variance
• This technique is used to test claims
involving three or more means.
Examples:
- Accident rates in an assembly line for 1st, 2nd and 3rd shifts
- Expected mileage for 5 brands of tires
- Time it takes for 3 groups of students to solve a problem in FARR
*Note that for each example, there is only
one independent variable.
*If two variables are considered, the
technique is referred to as two-way ANOVA.
Appropriate hypotheses:
Ho : μ1 = μ2 = ⋯ = μk (all treatment means are equal)
H1 : Not all of the treatment means are equal (at least one differs).
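A minimal sketch of the third example above (problem-solving times for three groups of students), with assumed times in minutes:

```python
# One-way ANOVA sketch on assumed data for three groups.
from scipy.stats import f_oneway

group1 = [12, 15, 11, 14, 13]
group2 = [16, 18, 17, 15, 19]
group3 = [14, 13, 15, 16, 12]

F, p = f_oneway(group1, group2, group3)
print(f"F = {F:.3f}, p-value = {p:.4f}")
# Reject Ho (equal means) if p < alpha. A significant F only says at
# least one mean differs; a post hoc test would identify which one.
```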
Correlation Analysis
Scatter plots showing various correlation coefficient values.
In Excel, use these functions to get the value of the correlation coefficient:
1. =CORREL(array1, array2)
2. =PEARSON(array1, array2)

Regression Analysis
• The hypothesized relationship may be linear, quadratic or some other form.
• This module focuses on the simple linear model, commonly referred to as a simple regression equation.
Correlation Analysis
Test for significant correlation using
Student’s t:
• The sample correlation coefficient r is an estimate of the population correlation coefficient ρ (Greek letter rho).
• There is no flat rule for a "high" correlation because sample size must be taken into consideration.
• To test the hypothesis Ho : ρ = 0, the test statistic is
t = r√(n − 2) / √(1 − r²), with df = n − 2.
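A minimal sketch of this t-test on assumed paired data; Excel's =CORREL or =PEARSON would return the same r:

```python
# Test for significant correlation using Student's t, on assumed data.
from math import sqrt
from scipy.stats import pearsonr, t

x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [2.1, 2.9, 3.2, 4.8, 5.1, 5.9, 7.2, 7.8]
n = len(x)

r, _ = pearsonr(x, y)                        # sample correlation coefficient
t_stat = r * sqrt(n - 2) / sqrt(1 - r**2)    # test statistic, df = n - 2
p = 2 * (1 - t.cdf(abs(t_stat), df=n - 2))   # two-tailed p-value
print(f"r = {r:.4f}, t = {t_stat:.3f}, p-value = {p:.4f}")
```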
Interpreting an Estimated Regression
Equation
The slope tells us how much, and in what
direction, the dependent or response
variable will change for each one unit
increase in the predictor variable. On the
other hand, the intercept is meaningful only
if the predictor variable would reasonably
have a value equal to zero.
Equation:
Sales = 268 + 7.37 Ads (sales and advertising in millions of pesos)

Interpretation:
Each extra P1 million of advertising will generate P7.37 million of sales on average. The firm would average P268 million of sales with zero advertising. However, the intercept may not be meaningful because Ads = 0 may be outside the range of observed data.

Simple Linear Regression Model
• Only one independent variable, X
• The relationship between X and Y is described by a linear function.
• The changes in Y are related to changes in X.
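A minimal sketch of fitting such an equation; the advertising and sales figures below are assumed and are not the data behind the P268/P7.37 example:

```python
# Simple linear regression sketch on assumed data.
from scipy.stats import linregress

ads = [2, 4, 5, 7, 9, 11, 13, 15]                  # advertising, P millions
sales = [285, 296, 309, 318, 330, 349, 361, 379]   # sales, P millions

fit = linregress(ads, sales)
print(f"Sales = {fit.intercept:.1f} + {fit.slope:.2f} Ads, "
      f"R^2 = {fit.rvalue**2:.3f}")
# The slope is the average change in sales per extra P1 million of ads;
# the intercept is meaningful only if Ads = 0 lies in the observed range.
```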
Multiple Regression
• Multiple regression extends simple
regression to include several
independent variables (called
predictors or explanatory variables).
• It is required when a single-predictor
model is inadequate to describe the
relationship between the response
variable (Y) and its potential predictors (X1, X2, X3, …).
• The interpretation is similar to simple regression since simple regression is a special case of multiple regression.
• Calculations are done by computer.
• Using multiple predictors is more than a matter of "improving the fit". Rather, it is a question of specifying a correct model.
• A low R2 in a simple regression model does not necessarily mean that X and Y are unrelated, but may simply indicate that the model is incorrectly specified.
• Omission of relevant predictors (model misspecification) can cause biased estimates and misleading results.

Fitting the model by least squares yields unbiased, consistent, efficient estimates of the unknown parameters. The estimated regression equation is
Ŷ = b0 + b1X1 + b2X2 + … + bkXk.

The Durbin-Watson, DW, Statistic
▪ The DW statistic is used to test for autocorrelation.
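A minimal sketch of fitting Ŷ = b0 + b1X1 + b2X2 by least squares; the data below are assumed, purely to show the mechanics:

```python
# Multiple regression sketch: OLS via the design matrix, on assumed data.
import numpy as np

x1 = np.array([2, 3, 5, 6, 8, 9, 11, 12])
x2 = np.array([1, 4, 2, 5, 3, 6, 4, 7])
y = np.array([14, 22, 23, 31, 30, 39, 38, 46])

X = np.column_stack([np.ones_like(x1), x1, x2])   # intercept column first
b, *_ = np.linalg.lstsq(X, y, rcond=None)         # OLS estimates b0, b1, b2
print(f"Y-hat = {b[0]:.2f} + {b[1]:.2f}*X1 + {b[2]:.2f}*X2")
```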
Before determining which, if any, of the individual predictors are significant, we perform a global test for overall fit using the F-test.

ADJUSTED R2
• R-squared never decreases (and usually increases) when a new predictor variable X is added to the model.
• This can be a disadvantage when comparing models.
• What is the net effect of adding a new variable?
• We lose a degree of freedom when a new variable is added.
• Did the new X variable add enough explanatory power to offset the loss of one degree of freedom?
The adjusted R2 shows the proportion of variation in Y explained by all X variables, adjusted for the number of X variables used.
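The slide does not show the formula; the standard one is R²adj = 1 − (1 − R²)(n − 1)/(n − k − 1), where n is the sample size and k the number of predictors. As a small helper:

```python
# Adjusted R-squared: penalize R^2 for each degree of freedom used.
def adjusted_r2(r2: float, n: int, k: int) -> float:
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

print(adjusted_r2(r2=0.82, n=40, k=3))   # 0.805: slightly below raw R^2
```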
SIGNIFICANCE OF PREDICTORS
• We are usually interested in testing each estimated coefficient to see whether it is significantly different from zero, that is, if a predictor variable helps explain the variation in Y.
• Use t-tests of individual variable slopes.
• Shows if there is a linear relationship between the variables Y and Xi.
• Hypotheses: Ho : βi = 0 versus H1 : βi ≠ 0.

Detecting MULTICOLLINEARITY
• When the predictor variables are related to each other instead of being independent, we have a condition known as multicollinearity.
• Multicollinearity induces variance inflation and makes the t statistics less reliable.
• Least squares estimation fails only when multicollinearity is perfect, that is, when one predictor is an exact linear function of the others.
Ways of detecting multicollinearity:
• To check whether two predictors are correlated, compute the correlation coefficients. Suspect multicollinearity if two predictors are highly correlated (r ≥ 0.80) or if a correlation coefficient exceeds the multiple R.
• Multicollinearity is present if the variance inflationary factor (VIF) is at least 10. The VIF is provided in regression output in JASP.
• These rules are merely suggestions.
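A minimal sketch of the VIF rule on synthetic data, using the textbook definition VIFj = 1 / (1 − Rj²), where Rj² comes from regressing predictor j on the remaining predictors (JASP reports the same quantity):

```python
# VIF sketch: high VIF flags predictors that are nearly linear
# combinations of the others. Data here are randomly generated.
import numpy as np

def vif(X: np.ndarray, j: int) -> float:
    """VIF for column j of the predictor matrix X (no intercept column)."""
    others = np.delete(X, j, axis=1)
    A = np.column_stack([np.ones(len(X)), others])
    b, *_ = np.linalg.lstsq(A, X[:, j], rcond=None)
    resid = X[:, j] - A @ b
    r2 = 1 - resid.var() / X[:, j].var()
    return 1 / (1 - r2)

X = np.column_stack([np.random.rand(50), np.random.rand(50)])
X = np.column_stack([X, 0.9 * X[:, 0] + 0.1 * np.random.rand(50)])  # collinear
print([round(vif(X, j), 2) for j in range(X.shape[1])])  # suspect VIF >= 10
```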
REGRESSION DIAGNOSTICS
Independence of errors – the error values
(difference between observed and estimated
values) are statistically independent OR non-
autocorrelated. (for time-series data and
panel data)
Normality of error – the error values are
normally distributed for any given value of X
Equal variance or homoskedasticity – the
probability distribution of the errors has
constant variance.

Checking the assumptions by examining the residuals
Residual Analysis for Normality:
1. Examine the Stem-and-Leaf Display of the Residuals.
2. Examine the Box-and-Whisker Plot of the Residuals.
3. Examine the Histogram of the Residuals.
4. Construct a normal probability plot.
5. Construct a Q-Q plot.

Measuring Autocorrelation
• Another way of checking for independence of errors is by testing the significance of the Durbin-Watson Statistic.
• The Durbin-Watson Statistic detects the presence of autocorrelation.
• It is used when data are collected over time to detect the presence of autocorrelation.
• Autocorrelation exists if residuals in one time period are related to residuals in another period.
• The presence of autocorrelation of errors (or residuals) violates the regression assumption that residuals are statistically independent.
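As a companion to the Durbin-Watson bullets above, a minimal sketch with assumed residuals, using the standard definition DW = Σ(et − et−1)² / Σet²; values near 2 suggest independent errors, values near 0 positive autocorrelation, and values near 4 negative autocorrelation:

```python
# Durbin-Watson statistic computed directly from a residual series.
import numpy as np

def durbin_watson(residuals: np.ndarray) -> float:
    diff = np.diff(residuals)                     # e_t - e_(t-1)
    return float(np.sum(diff**2) / np.sum(residuals**2))

e = np.array([1.2, 0.8, 1.1, 0.7, -0.9, -1.3, -0.8, -1.1, 0.9, 1.4])  # assumed
print(f"DW = {durbin_watson(e):.3f}")   # well below 2: positive autocorrelation
```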