What Is Correlation Analysis
Types of Variables
Variables may be discrete or continuous.
Describing Data
It is important to transform a mass of raw data into a meaningful form.
Descriptive Statistics: Frequency distributions and graphical
representations such as the histogram or the frequency polygon.
Numerical Statistics: two important numerical ways to represent data.
Measures of location, often referred to as averages. The purpose of a
measure of location is to pinpoint the center of a set of values. The
five most common measures of location are the arithmetic mean, the
weighted mean, the geometric mean, the median, and the mode.
Measures of dispersion, often called the variation or the spread.
Weighted Mean: $\bar{X}_w = \dfrac{\sum wX}{\sum w}$
Mean Deviation: $MD = \dfrac{\sum |X - \bar{X}|}{n}$
Population Variance: $\sigma^2 = \dfrac{\sum (X - \mu)^2}{N}$
Population SD: $\sigma = \sqrt{\dfrac{\sum (X - \mu)^2}{N}}$
Sample Variance: $s^2 = \dfrac{\sum (X - \bar{X})^2}{n - 1}$
Sample SD: $s = \sqrt{\dfrac{\sum (X - \bar{X})^2}{n - 1}}$
AM of grouped data: $\bar{X} = \dfrac{\sum fM}{n}$ (M = class midpoint, f = class frequency)
SD of grouped data: $s = \sqrt{\dfrac{\sum f(M - \bar{X})^2}{n - 1}}$
Mean of a PD: $\mu = \sum [xP(x)]$
Variance of a PD: $\sigma^2 = \sum [(x - \mu)^2 P(x)]$
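As a quick illustration, a minimal Python sketch of these measures, using a small hypothetical data set (the values and weights are invented for the example; note the n - 1 divisor for sample statistics versus N for population statistics):

```python
import math

data = [12, 15, 11, 18, 14]          # hypothetical sample values
weights = [1, 2, 1, 3, 1]            # hypothetical weights

n = len(data)
mean = sum(data) / n

# Weighted mean: sum(w * x) / sum(w)
weighted_mean = sum(w * x for w, x in zip(weights, data)) / sum(weights)

# Mean deviation: average absolute deviation from the mean
md = sum(abs(x - mean) for x in data) / n

# Population variance/SD divide by N; sample variance/SD divide by n - 1
pop_var = sum((x - mean) ** 2 for x in data) / n
samp_var = sum((x - mean) ** 2 for x in data) / (n - 1)
pop_sd, samp_sd = math.sqrt(pop_var), math.sqrt(samp_var)

print(mean, weighted_mean, md, pop_sd, samp_sd)
```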
The Binomial PD is a special case of a discrete probability distribution
with only two possible outcomes per trial.
Binomial PD: $P(x) = \binom{n}{x}\pi^x(1-\pi)^{n-x}$
Mean of a BPD: $\mu = n\pi$
Variance of a BPD: $\sigma^2 = n\pi(1-\pi)$
Poisson PD: $P(x) = \dfrac{\mu^x e^{-\mu}}{x!}$
Where:
$\mu$ is the mean number of occurrences (successes) in a particular
interval; $e = 2.71828$; $x$ is the number of occurrences (successes); $P(x)$ is
the probability for a specified value of $x$.
Mean of a Poisson PD: $\mu = n\pi$
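Both distributions are available in scipy.stats, which makes a quick numerical check easy; the parameters below (n = 10, π = 0.3, μ = 2) are arbitrary illustrative choices:

```python
from scipy.stats import binom, poisson

n, p = 10, 0.3        # hypothetical number of trials and success probability
mu = 2.0              # hypothetical mean occurrences per interval

# Binomial: P(x) = C(n, x) * p^x * (1-p)^(n-x); mean = n*p, variance = n*p*(1-p)
print(binom.pmf(3, n, p))            # P(X = 3)
print(binom.mean(n, p), binom.var(n, p))

# Poisson: P(x) = mu^x * e^(-mu) / x!; mean and variance both equal mu
print(poisson.pmf(3, mu))            # P(X = 3)
print(poisson.mean(mu), poisson.var(mu))
```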
Normal PD
The normal probability distribution is a continuous PD.
The number of normal distributions is unlimited, each having a
different mean (μ), standard deviation (σ), or both. While it is
possible to provide probability tables for discrete distributions
such as the binomial and the Poisson, providing tables for the
infinite number of normal distributions is impossible. Fortunately,
one member of the family can be used to determine the
probabilities for all normal distributions. It is called the standard
normal distribution, and it is unique because it has a mean of 0
and a standard deviation of 1.
Any normal distribution can be converted into a standard
normal distribution by subtracting the mean from each
observation and dividing the difference by the SD: $z = \dfrac{X - \mu}{\sigma}$.
The results are called z values. They are also referred to as z scores,
the z statistic, the standard normal deviates, the standard normal
values, or just the normal deviate.
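A one-line illustration of the conversion, with hypothetical values for μ, σ, and X:

```python
# Convert an observation to a z value: z = (X - mu) / sigma
mu, sigma = 100.0, 15.0      # hypothetical population mean and SD
x = 130.0                    # hypothetical observation
z = (x - mu) / sigma
print(z)                     # 2.0: the observation is 2 SDs above the mean
```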
The CLT states that for large random samples, the shape of the
sampling distribution of the sample mean is close to a normal
probability distribution. This approximation is more accurate for
large samples than for small samples. This is one of the most
useful conclusions in statistics. We can reason about the
distribution of the sample mean with absolutely no information
about the shape of the population distribution from which the
sample is taken. In other words, the CLT is true for all
distributions.
Sampling Distribution of the Sample Mean: the means of samples of
a specified size vary from sample to sample.
Sampling Distribution of the Sample Mean: A probability
distribution of all possible sample means of a given sample size.
Central Limit Theorem
If all samples of a particular size are selected from any
population, the sampling distribution of the sample mean is
approximately a normal distribution. This approximation
improves with larger samples.
If the population follows a normal distribution, then for any
sample size the sampling distribution of the sample mean will
also be normal. If the population distribution is symmetrical
(but not normal), the normal shape of the distribution of the
sample mean emerges with samples as small as 10. On the other
hand, if you start with a distribution that is skewed or has thick
tails, it may require samples of 30 or more to observe the
normality feature. A sample size of 30 or more is considered
large enough for the CLT to be employed.
The CLT indicates that, regardless of the shape of the
population distribution, the sampling distribution of the sample
mean will move towards the normal probability distribution.
The larger the number of observations in each sample, the
stronger the convergence.
The mean of the sampling distribution equals the population
mean, i.e., $\mu_{\bar{X}} = \mu$, and if the standard deviation of the
population is $\sigma$, the standard error of the sample mean is
$\sigma_{\bar{X}} = \sigma/\sqrt{n}$. When $\sigma$ is unknown, the sample SD $s$ is used as an
estimate of $\sigma$.
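A short simulation illustrates the CLT; the exponential population (a strongly skewed distribution with μ = σ = 1) and the sample size of 30 are arbitrary choices for the sketch:

```python
import numpy as np

rng = np.random.default_rng(42)
pop_mean, n, draws = 1.0, 30, 10_000   # exponential(mean=1), n = 30 per sample

# Draw 10,000 samples of size n from a skewed population and average each
sample_means = rng.exponential(pop_mean, size=(draws, n)).mean(axis=1)

# Mean of the sampling distribution ~ mu; its SD ~ sigma / sqrt(n)
print(sample_means.mean())             # close to 1.0
print(sample_means.std(ddof=1))        # close to 1.0 / sqrt(30) ~ 0.183
```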
Hypothesis:
A hypothesis is a statement about a population. Data are then
used to check the reasonableness of the statement.
X̄ = ∑ X / n = 220 / 10 = 22
Ȳ = ∑ Y / n = 450 /10 = 45
As most of the data points (all except the 8th, Harish's) lie in the 1st
or 3rd quadrant, we may assume a positive relationship, because in
both these quadrants the product (X - X̄)(Y - Ȳ) is positive: (X - X̄)
and (Y - Ȳ) have the same sign, both positive in the 1st quadrant or
both negative in the 3rd quadrant, as observed in the table below.
Calculate the deviations from the means:
Correlation Coefficient: $r = \dfrac{\sum (X - \bar{X})(Y - \bar{Y})}{(n - 1)\, s_X\, s_Y}$
$r = 900 / [(10 - 1)(9.189)(14.337)] = 0.759$
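The same value of r can be reproduced directly from the ten (X, Y) pairs of the example, for instance in Python:

```python
import numpy as np

# Sales calls (X) and servers sold (Y) for the ten representatives
X = np.array([20, 40, 20, 30, 10, 10, 20, 20, 20, 30])
Y = np.array([30, 60, 40, 60, 30, 40, 40, 50, 30, 70])

n = len(X)
sx, sy = X.std(ddof=1), Y.std(ddof=1)          # 9.189 and 14.337

# r = sum((X - Xbar)(Y - Ybar)) / ((n - 1) * sx * sy)
r = ((X - X.mean()) * (Y - Y.mean())).sum() / ((n - 1) * sx * sy)
print(r)                                        # ~0.759
print(np.corrcoef(X, Y)[0, 1])                  # same value
```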
We then perform a hypothesis test.
Hypothesis: A statement about a population parameter
developed for the purpose of testing.
Null Hypothesis: A statement about the value of the
population parameter
Alternate Hypothesis: A statement that is accepted if the
sample data provide sufficient evidence that the Null
Hypothesis is false.
Steps in Hypothesis Testing
1. Establish the null hypothesis (H0) and the alternate
hypothesis (H1),
2. Select the level of significance, that is α,
(Rejecting the null hypothesis when it is in fact true is called a Type I error)
Errors in Making Decisions
Type I Error (H0 rejected when true):
A true null hypothesis is rejected.
The probability of a Type I error is α, called the level of
significance of the test, which is set by the researcher in advance.
Type II Error (failure to reject H0 when it is false, i.e., a false H0
is accepted):
A false null hypothesis is not rejected.
The probability of a Type II error is β.
Using the 0.05 level of significance, the decision rule states that
if the computed t falls in the region between -2.306 and +2.306,
the null hypothesis is not rejected.
Compute the test statistic (here t): $t = \dfrac{r\sqrt{n-2}}{\sqrt{1-r^2}}$, with $n - 2$
degrees of freedom. Here $t = \dfrac{0.759\sqrt{8}}{\sqrt{1 - 0.759^2}} = 3.297$, which falls
outside ±2.306, so the null hypothesis is rejected: the correlation is significant.
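A sketch of this computation in Python, using the r and n from the example (scipy's t distribution stands in for the printed t-table):

```python
import math
from scipy.stats import t as t_dist

r, n = 0.759, 10

# t = r * sqrt(n - 2) / sqrt(1 - r^2), with n - 2 degrees of freedom
t_stat = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)
t_crit = t_dist.ppf(0.975, df=n - 2)     # two-tailed, alpha = 0.05

print(t_stat)                            # ~3.30
print(t_crit)                            # ~2.306
print(abs(t_stat) > t_crit)              # True -> reject H0
```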
Slope of the regression line: $b = r\dfrac{s_Y}{s_X} = 0.759 \times \dfrac{14.337}{9.189} = 1.1842$
where:
r is the correlation coefficient;
$s_Y$ is the SD of Y (the dependent variable);
$s_X$ is the SD of X (the independent variable).
Intercept: $a = \bar{Y} - b\bar{X}$
where:
Ȳ is the mean of Y (the dependent variable);
X̄ is the mean of X (the independent variable).
$a = \bar{Y} - b\bar{X} = 45 - (1.1842)(22) = 18.9476$
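The slope and intercept can be checked with an ordinary least-squares fit, for example with numpy.polyfit, which gives the same b and a for this data:

```python
import numpy as np

X = np.array([20, 40, 20, 30, 10, 10, 20, 20, 20, 30])
Y = np.array([30, 60, 40, 60, 30, 40, 40, 50, 30, 70])

# b = r * (sy / sx); a = Ybar - b * Xbar -- equivalent to least squares
b, a = np.polyfit(X, Y, deg=1)           # degree-1 fit returns [slope, intercept]
print(b)                                 # ~1.1842
print(a)                                 # ~18.9476
print(a + b * 25)                        # predicted Y for 25 calls: ~48.55
```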
Confidence and prediction intervals for Y at a given X:
CI for the mean of Y: $\hat{Y} \pm t\, s_{y \cdot x}\sqrt{\dfrac{1}{n} + \dfrac{(X - \bar{X})^2}{\sum (X - \bar{X})^2}}$
PI for an individual Y: $\hat{Y} \pm t\, s_{y \cdot x}\sqrt{1 + \dfrac{1}{n} + \dfrac{(X - \bar{X})^2}{\sum (X - \bar{X})^2}}$
where:
t is the value of t from the t-table with n - 2 df;
$s_{y \cdot x}$ is the standard error of estimate.
William Gosset (1908) noticed that $\bar{X} \pm z(s)$ was not precisely
correct for small samples. For small samples (n < 30), the
variations around $\bar{X}$ are more than $\pm z(s)$, and we need to
compensate for this. It is observed that the statistic follows
a t-distribution (a flatter distribution than the z-distribution).
But with sample sizes n ≥ 30, the t-values and z-values are almost
equal.
For a sales representative making X = 25 calls, the predicted sales are:
$\hat{Y} = a + bX = 18.9476 + 1.1842 \times 25 = 48.5526$
t for n - 2 = 8 df, 95 percent confidence, is 2.306.
Confidence interval for the mean sales of all representatives making 25 calls:
$48.5526 \pm 2.306 \times 9.901 \times 0.334428 = 48.5526 \pm 7.6356$,
i.e., from 40.917 to 56.188.
Prediction interval for an individual representative:
$48.5526 \pm 2.306 \times 9.901 \times 1.054439 = 48.5526 \pm 24.0746$
Thus the interval is from 24.478 to 72.627 (≈ 24 to 73).
We conclude that the number of servers sold by a particular sales
representative making 25 calls will be between 24 and 73. This
interval is quite large, much larger than the confidence interval
for the mean sales of all representatives who made 25 calls. It is
logical, however, that there should be more variation in the sales
estimate for an individual than for a group.
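A sketch of both intervals at X = 25, plugging in the quantities computed above (Ŷ = 48.5526, s_{y·x} = 9.901, t = 2.306, Σ(X - X̄)² = 760):

```python
import math

n, x, x_bar = 10, 25, 22
sum_sq_dev_x = 760            # sum of (X - Xbar)^2 from the data
y_hat, see, t_val = 48.5526, 9.901, 2.306

# Shared term: 1/n + (x - xbar)^2 / sum((X - Xbar)^2)
h = 1 / n + (x - x_bar) ** 2 / sum_sq_dev_x

ci = t_val * see * math.sqrt(h)          # CI for the MEAN of Y at X = 25
pi = t_val * see * math.sqrt(1 + h)      # PI for an INDIVIDUAL Y at X = 25

print(y_hat - ci, y_hat + ci)            # ~40.92 to ~56.19
print(y_hat - pi, y_hat + pi)            # ~24.48 to ~72.63
```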
More on Coefficient of Determination
Representative   Sales Calls (X)   Sold (Y)   Y - Ȳ   (Y - Ȳ)²
Hari                  20              30        -15       225
Rama                  40              60         15       225
Shivani               20              40         -5        25
Ravi                  30              60         15       225
Gautam                10              30        -15       225
Manish                10              40         -5        25
Pandu                 20              40         -5        25
Harish                20              50          5        25
Venktesh              20              30        -15       225
Binny                 30              70         25       625
Total                220             450          0      1850
Coefficient of Determination, r²
The coefficient of determination is the portion of the
total variation in the dependent variable that is explained
by variation in the independent variable.
It is also called r-squared and is denoted as r².
$r^2 = \dfrac{SSR}{SST} = \dfrac{\text{regression sum of squares}}{\text{total sum of squares}}$
Note: $0 \le r^2 \le 1$
The Relationship among the Coefficient of Correlation, the
Coefficient of Determination, and the Standard Error of
Estimate
The standard error of estimate measures how close the actual
values are to the regression line: $s_{y \cdot x} = \sqrt{\dfrac{\sum (Y - Y')^2}{n - 2}}$. When
the SEE is small, it indicates that the two variables are closely
related. In the calculation of the SEE, the key term is $\sum (Y - Y')^2$:
if the value of this term is small, the SEE will also be small.
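Both r² and the SEE follow directly from the fitted line; a short check using the example's data and the a and b computed earlier:

```python
import numpy as np

X = np.array([20, 40, 20, 30, 10, 10, 20, 20, 20, 30])
Y = np.array([30, 60, 40, 60, 30, 40, 40, 50, 30, 70])
a, b = 18.9476, 1.1842

y_hat = a + b * X
sse = ((Y - y_hat) ** 2).sum()           # sum of squares error
sst = ((Y - Y.mean()) ** 2).sum()        # total sum of squares (1850)

r_squared = 1 - sse / sst                # equivalently SSR / SST; ~0.576
see = np.sqrt(sse / (len(X) - 2))        # standard error of estimate; ~9.901
print(r_squared, see)
```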
Model Diagnosis:
Before using the model (for prediction, etc.) to estimate the
sales units (y), several questions should be answered.
1. Is the model significant?
2. Are the individual variable(s) significant?
3. Is the SD of the model error too large to provide
meaningful results?
4. Is multicollinearity a problem?
5. Have the regression analysis assumptions been satisfied?
1. Is the model significant? Test H0: $\beta_j = 0$ for all j against H1:
at least one $\beta_j \ne 0$, using the F statistic
$F = \dfrac{SSR/k}{SSE/(n-k-1)}$
Where:
SSR = sum of squares regression.
Reject H0 if $F > F_\alpha$, or if the p-value < α; rejecting H0 means
at least one regression coefficient is significant.
Is multicollinearity a problem?
Multicollinearity - a high correlation between the independent
variables such that the two variables contribute redundant
information to the model.
Some obvious problems and indications of severe
multicollinearity:
i) Unexpected/incorrect signs on the coefficients.
ii) A sizeable change in the values of the previously
estimated coefficients when a new variable is added to
the model.
iii) The estimate of the SD of the model error increases
when a variable is added to the model.
iv) Low t-values for significant variables.
One measure of multicollinearity is the Variance Inflation Factor
(VIF): $VIF_j = \dfrac{1}{1 - R_j^2}$, where $R_j^2$ is the coefficient of
determination obtained by regressing the j-th independent variable
on all the other independent variables.
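A minimal sketch of the VIF computation, assuming a hypothetical design matrix with three predictors (the data are simulated so that two columns are nearly collinear):

```python
import numpy as np

def vif(X):
    """VIF_j = 1 / (1 - R_j^2), regressing column j on the other columns."""
    vifs = []
    for j in range(X.shape[1]):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(len(y)), others])   # add intercept
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        r_sq = 1 - (resid @ resid) / ((y - y.mean()) @ (y - y.mean()))
        vifs.append(1 / (1 - r_sq))
    return vifs

rng = np.random.default_rng(0)                 # hypothetical simulated data
x1 = rng.normal(size=50)
x2 = x1 + rng.normal(scale=0.1, size=50)       # nearly collinear with x1
x3 = rng.normal(size=50)
print(vif(np.column_stack([x1, x2, x3])))      # large VIFs flag x1 and x2
```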
Where:
SSR = sum of squares regression = $\sum (\hat{Y} - \bar{Y})^2$
SSE = sum of squares error = $\sum (Y - \hat{Y})^2$
n = sample size
k = number of independent variables
df of regression = number of independent variables, k
df of errors = n - k - 1
total df of MLR = n - 1
If $F > F_\alpha$, reject H0; or if the p-value < α, reject H0.
Coefficient of Determination, R²
The coefficient of determination is the portion of the
total variation in the dependent variable that is explained
by variation in the independent variables.
It is also called R-squared and is denoted as R².
$R^2 = \dfrac{SSR}{SST} = \dfrac{\text{regression sum of squares}}{\text{total sum of squares}}$
Note: $0 \le R^2 \le 1$
The Durbin-Watson test statistic: $d = \dfrac{\sum_{t=2}^{n} (e_t - e_{t-1})^2}{\sum_{t=1}^{n} e_t^2}$
The possible range is 0 ≤ d ≤ 4.
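The statistic is a one-liner given the residuals; the residual vector below is hypothetical:

```python
import numpy as np

def durbin_watson(e):
    """d = sum_{t=2..n} (e_t - e_{t-1})^2 / sum_{t=1..n} e_t^2."""
    return (np.diff(e) ** 2).sum() / (e ** 2).sum()

e = np.array([1.2, -0.8, 0.5, -1.1, 0.9, -0.3, 0.7, -0.6])  # hypothetical residuals
print(durbin_watson(e))   # near 2 -> no autocorrelation; near 0 or 4 -> positive/negative
```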