STATISTICAL TREATMENT OF DATA
Structure
4.1 Introduction
4.2 The Data: Meaning and Types
4.3 Frequency Distributions
4.4 Measures of Central Tendency
4.5 Measures of Dispersion
4.6 Hypothesis Testing and Inferential Statistics
4.7 Choosing a Statistical Test
4.8 Statistical Tests
4.9 Chi-Square Test
4.10 F-Test
4.11 Z-Test
4.12 t-Test
4.13 Correlation Analysis
4.14 Regression Analysis
4.15 Let Us Sum Up
4.16 Keywords
4.17 Bibliography and Selected Readings
4.18 Check Your Progress – Possible Answers
4.1 INTRODUCTION
Statistical tools are the pillars on which data analysis for all types of
developmental programmes stands. Researchers also need some understanding
of statistical analysis to be able to produce a meaningful research report.
With the availability of several user-friendly software packages, performing
statistical analysis has now become a reality even for non-statisticians,
provided they are computer literate and understand the basic principles of
statistical analysis.
ii. Inferential statistics, which consist of statistical methods for making
inferences about a population based on information obtained from a
sample.
This unit aims to make you conversant with the basic statistical tools
applicable in developmental research.
• surveillance systems
• national surveys
• experiments
• health organizations
• private sector
• government agencies, and
• research studies conducted by research and academic institutions, etc.,
which can be used for the benefit of society.
Types of Data
There are two types of data: (i) qualitative data, e.g., occupation, sex,
marital status, religion; and (ii) quantitative data, e.g., age, weight,
height, income. These may be further categorized into the following two types.
Discrete: data that can take only distinct, separate values and can be
divided into categories or groups, such as male and female; such data are
measured, as explained above, on nominal or ordinal scales.
Continuous: data that can take any value, including decimals, are called
continuous data. Continuous data are measured at least on an interval or
ratio scale, as defined above under types of measurement. Measurements on an
ordinal scale are sometimes also treated under this category, even though
they do not fulfil all the conditions of a continuous scale. In the social
sciences it is difficult to measure variables on an interval or ratio scale,
and one often has to depend on measurements taken on an ordinal scale. It is
further emphasized that statistics based on such measurements may provide
'under' or 'over' estimates of the population parameter.
Note
After having gone through the concept of data, answer the following
questions given in Check Your Progress 1.
Check Your Progress 1
Note:
a) Write your answer in about 50 words.
b) Check your answer with possible answers given at the end of the unit
1. What are the different types of data?
…………………………………………………………………………
…………………………………………………………………………
…………………………………………………………………………
…………………………………………………………………………
…………………………………………………………………………
2. What do you understand by the term, non-parametric statistical test?
…………………………………………………………………………
…………………………………………………………………………
…………………………………………………………………………
…………………………………………………………………………
…………………………………………………………………………
4.4.1 Mean
The mean (or arithmetic mean) is also known as the average. It is calculated
by totalling the values of all the observations and dividing by the total
number of observations. Note that the mean can only be calculated for
numerical data.
S.No.    Height (cm)
1        141
2        141
3        143
4        144
5        145
6        146
7        155
A total of seven measurements = 1015 cm. The mean = (1015/ 7), which is
145 cm.
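The calculation above can be sketched in a few lines of Python; the function name `mean` and the list of heights simply restate the example.

```python
# Minimal sketch: the arithmetic mean of the seven height measurements
# from the example above.
heights = [141, 141, 143, 144, 145, 146, 155]

def mean(values):
    """Total all the observations and divide by their count."""
    return sum(values) / len(values)

print(mean(heights))  # 145.0
```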
Note
Note 1: In the case of the grouped frequency data given in table 3.1, the
midpoint of each interval is taken as Xi, and the mean is computed on the
same lines as above using
Mean = Σ fiXi / Σ fi
where fi is the frequency of the i-th interval.
Note 2: However, in the case of open interval data, such as ‘less than 19’
(denoted as <19) or ‘more than 159’ (denoted as > 159), it is not possible to
fix a midpoint, and, therefore, the mean cannot be calculated. As a
consequence, we use the median in place of the mean, which is explained in
the next section.
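Note 1 can be sketched as follows; since table 3.1 is not reproduced here, the class intervals and frequencies below are hypothetical.

```python
# Grouped-data mean (Note 1): each class interval is represented by its
# midpoint X_i, and the mean is sum(f_i * X_i) / sum(f_i).
# The intervals and frequencies are hypothetical (table 3.1 is not shown here).
classes = [((20, 39), 5), ((40, 59), 10), ((60, 79), 15)]

def grouped_mean(classes):
    total_fx = sum(f * (lo + hi) / 2 for (lo, hi), f in classes)  # f_i * midpoint
    total_f = sum(f for _, f in classes)
    return total_fx / total_f

print(round(grouped_mean(classes), 2))  # 56.17
```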
4.4.2 Median
The median is a value that divides a distribution into two equal halves. The
median is useful when the data are on an ordinal scale, or when some
measurements are much bigger or much smaller than the rest. The mean of such
data is pulled toward these extreme values, so the mean is not a good
measure of the centre of the distribution in this case. The median is not
influenced by extreme values. The median value, also called the central or
halfway value (the 50th percentile: 50% of values lie below it and 50%
above it), is obtained in the following way:
Example 3:
We can use the grouped data of the table 3.1 of section 3.3.1, ‘Distribution
of clinics according to number of patients treated for malaria in one month’
for calculation of median, which is given below
Step1: The total of frequency is first divided by 2, i.e., 80/2 (=40). The
cumulative frequency 40 will correspond to the class interval (80-99). This is
called the median interval.
Step2: Apply the formula Median = L + ((N/2 − F) / f) × d
where L is the lower limit of the median interval, F is the cumulative
frequency below it, f is its frequency and d is the interval width.
Step3: Record all values of symbol variables from the table as given below:
Step4: Replace the symbols with the numeric values noted in step3 and
evaluate the formula.
4.4.3 Mode
The mode is the most frequently occurring value in a set of observations. The
mode is not very useful for numerical data that are continuous. It is most
useful for numerical data that have been grouped. The mode is usually used
to find the norm in a population and is calculated when the mean and median
are inappropriate, e.g., the average shoe size of the Indian population, the
standard birth weight, etc. The mode can also be used for categorical data,
whether nominal or ordinal.
We will again use the grouped data given in table 3.1 to calculate the mode.
The steps for computing the mode are:
Step1: The formula for calculating the mode is
Mode = L + ((fm − f1) / (2fm − f1 − f2)) × h
Step2: Find the class interval with the largest frequency. In this case, the
class interval ‘80-99’ has the maximum frequency, equal to 19.
Step3: Record all values of the symbol variables used in the formula:
L = 80 is the lower limit of the modal class, h = 20 is the interval width,
fm = 19 is the frequency of the modal class, f1 = 11 is the frequency of the
class preceding the modal class, and f2 = 10 is the frequency of the class
succeeding the modal class.
Step4: Replace the symbol values with the numeric values as noted in step3
in the formula.
Mode = 80 + ((19 − 11) / (2 × 19 − 11 − 10)) × 20 = 80 + (8/17) × 20 ≈ 89.41
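The same calculation as a short Python sketch, using the values recorded in steps 1-4:

```python
# Mode for grouped data: modal class 80-99 (L = 80, h = 20),
# f_m = 19, f1 = 11 (preceding class), f2 = 10 (succeeding class).
def grouped_mode(L, fm, f1, f2, h):
    return L + (fm - f1) / (2 * fm - f1 - f2) * h

print(round(grouped_mode(L=80, fm=19, f1=11, f2=10, h=20), 2))  # 89.41
```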
4.5.1 Range:
It is the simplest measure of dispersion. This can be represented as the
difference between maximum and minimum values, or simply, as the
maximum and minimum values for all observations.
Although simple to calculate, the range does not tell us anything about the
distribution of the values between the two extreme ones.
For example, if the minimum and maximum weights in one sample are 40 kg and
72 kg, the range is 72 − 40 = 32 kg; another sample with very different
values in between could still have the same range.
4.5.2 Percentiles
A second way of describing the variation or dispersion of a set of
measurements is to divide the distribution into percentiles (100 parts). As a
matter of fact, the concept of percentiles is just an extension of the concept of
the median, which may also be called the 50th percentile. Percentiles are
points that divide all the measurements into 100 equal parts. The 30th
percentile (P30) is the value below which 30% of the measurements lie. The
50th percentile (P50), or the median, is the value below which 50% of the
measurements lie. To determine percentiles, the observations should first be
listed from the lowest to the highest, just as when finding the median.
However, in case of grouped data, percentile can be calculated on similar
lines of calculating the median.
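For ungrouped data, the "list from lowest to highest" procedure can be sketched as below. The nearest-rank convention is one of several; other conventions interpolate and give slightly different answers. The weights are hypothetical.

```python
import math

# Percentile sketch (nearest-rank method): sort the observations and take
# the value at rank ceil(p/100 * n).
def percentile(values, p):
    ordered = sorted(values)
    rank = math.ceil(p / 100 * len(ordered))
    return ordered[max(rank, 1) - 1]

weights = [40, 45, 48, 50, 52, 55, 58, 60, 65, 72]  # hypothetical sample
print(percentile(weights, 50))  # 52 (the median under this convention)
print(percentile(weights, 30))  # 48
```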
sd = √[ (Σ Xi² − (Σ Xi)²/n) / (n − 1) ]
where
sd² = Variance
X̄ = Mean
n = Number of observations
Fortunately, many pocket calculators can do this calculation for us, but it
is still important to understand what it means. In the case of grouped data,
the mid value of each interval may be taken as the observation value and the
above procedure followed.
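The computational formula above can be sketched as follows, reusing the seven heights from the mean example:

```python
import math

# Sample standard deviation via the computational formula:
# sd = sqrt((sum(x_i^2) - (sum(x_i))^2 / n) / (n - 1)).
def sample_sd(values):
    n = len(values)
    sum_x = sum(values)
    sum_x2 = sum(x * x for x in values)
    return math.sqrt((sum_x2 - sum_x ** 2 / n) / (n - 1))

heights = [141, 141, 143, 144, 145, 146, 155]
print(round(sample_sd(heights), 2))  # 4.8
```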
In the above sections you studied about the measures of central tendency and
the measures of dispersion. Now try and answer the questions in Check Your
Progress 2.
Note:
In order to help you choose the right test, a flowchart and matrices will be
presented for different sets of data. We will discuss how significance tests
work. Please keep in mind that independent groups are treated as independent
populations.
ii) How to state the Null (H0) and Alternative (H1) Hypothesis:
In statistical terms the assumption that no real difference exists between
groups in the total study (target) population (or, that no real association
exists between variables) is called the Null Hypothesis (Ho). The
Alternative Hypothesis (H1) is that there exists a difference between
groups or that a real association exists between variables. Examples of
null hypotheses are
• There is no difference in the incidence of measles between
vaccinated and non-vaccinated children.
• There is no difference between the alcohol consumption of males and
females.
• There is no association between families’ income and malnutrition
in their children.
If the result is statistically significant, we reject the Null Hypothesis (Ho)
and accept the Alternative Hypothesis (H1) that there is real difference
between two groups, or a real association between two variables.
Examples of alternative hypotheses (H1) are:
• There is a difference in the incidence of measles between vaccinated
and non-vaccinated children.
• Males drink more alcohol than females.
• There is an association between families’ income and malnutrition in
their children.
Be aware that ‘statistically significant’ does not mean that a difference or
an association is of practical importance. The tiniest and most irrelevant
difference will turn out to be statistically significant if a large enough
sample is taken. On the other hand, a large and important difference may
fail to reach statistical significance if too small a sample is used.
iii) The Concept of Type I and Type II Error
There are four ways in which the conclusion of the test might relate to the
truth: (i) true positive, (ii) true negative, (iii) false positive, and
(iv) false negative. These may be expressed in terms of errors in a
statistical test of significance as follows:
Type I error (α): We reject the null hypothesis when it is true; this is the
false positive error, denoted 'α' (alpha). It is the error of detecting an
effect that is not really there.
In the above example, type I error would mean that the effects of two drugs
were found to be different by statistical analysis, when, in fact, there was no
difference between them.
Type II error (β): We accept the null hypothesis when it is false; this is
the false negative error, denoted 'β' (beta), i.e., the failure to detect a
true effect. In the above example, a type II error would mean that the
effects of the two drugs were not found different by statistical analysis
when, in fact, there was a difference.
                                        Actual Situation
                                        True H0               False H0
Investigator's   Accept null            Correct acceptance    Error (Type II)
Decision         hypothesis
                 Reject null            Error (Type I)        Correct rejection
                 hypothesis
Note: Alpha (α) and beta (β) are Greek letters used to denote the
probabilities of type I and type II error respectively.
We would like to carry out our test, i.e., choose our critical region, so as
to minimize both types of error simultaneously, but this is not possible for
a given fixed sample size. In fact, decreasing one type of error may very
likely increase the other. In practice, we keep the type I error (α) fixed
at a specified value (e.g., 1% or 5%).
Till now you have read about the various concepts of hypothesis testing. Now
try and answer the following questions in Check Your Progress 3.
The following sections will explain how to choose an appropriate statistical test to
determine differences between groups or associations between variables.
Under both of these categories, you also need to decide whether you have
paired or unpaired observations. In paired observations, individual
observations in one data set (e.g., experimental) are matched with individual
observations in another data set (e.g., controls), for example, by taking care
that both participants in the study come from the same location, have the
same age, same sex, same financial background, etc.
For ordinal data the significance test to be used depends on whether only
two groups or more than two groups are being compared. The tests to be used
for comparing two groups are based on ranking of data: Wilcoxon’s two-
sample test, which gives equivalent results to the Mann-Whitney U-test, for
unpaired observations, and Wilcoxon’s signed rank test for paired
observations. We will not deal with these tests in this unit.
For continuous data (in interval, or, ratio scale), as for ordinal data, the
choice of an appropriate significance test depends on whether you are
comparing two groups or more than two groups. The z-test, referred to as the
standard normal variate, and the t-test, also referred to as the Student’s t-test,
are used for numerical data of continuous nature, when comparing the means
of two groups. The chi-square test is used for categorical data, when
comparing proportions of events occurring in two or more groups.
i) χ2 test
ii) Z- test
iii) t-test
iv) F-test
Table 4.5: Utilization of health care centre by women living far from,
and near, the clinic

Distance from health    Used health     Did not use health    Total
care centre             care centre     care centre
Less than 10 km         51 (64%)        29 (36%)              80 (100%)
10 km or more           35 (47%)        40 (53%)              75 (100%)
Total                   86              69                    155
From the table we conclude that there seems to be a difference in the use of
health care centre between those who live close to, and those who live far
from, the centre (64% versus 47%). We now want to know if this observed
difference is statistically significant or not. The chi-square ( χ 2 ) test, used to
test the statistical significance, is given below.
Chi-square:  χ² = Σ (Oi − Ei)² / Ei
where Oi and Ei are the observed and expected frequencies of each cell,
'df' is the degrees of freedom, and Σ directs you to add together the
values of (Oi − Ei)²/Ei for all the 'k' cells of the table.
An equivalent computational form is
χ² = Σ Oi²/Ei − N, where N is the total of the observed frequencies.
Note
Chi (χ) is a Greek letter. The chi-square test can be used to give us the
answer. This test is based on measuring the difference between the observed
frequencies and the expected frequencies if the null hypothesis (i.e., the
hypothesis of no difference) were true. To perform a chi-square test you
need to calculate the chi-square value, compare it with a tabulated
chi-square value, and interpret the result.
1) Calculate the expected frequency (Ei) for each cell. To find the expected
frequency Ei of a cell, you multiply the row total by the column total of
the cell and divide by the grand total:
E = (Row total × Column total) / Grand total.
2) For each cell, subtract the expected frequency E i from the observed
frequency Oi , ( Oi − Ei ) , square the difference of ( Oi − Ei ) and divide it
by the corresponding expected frequency E i . (You may skip this step
and follow the next step 3 as an alternative step).
3) Alternatively, for each cell, simply square the observed frequency Oi and
divide it by the corresponding expected frequency, Ei .
4) Find the sum by adding the values calculated in step (3) for all the cells
and subtract N (total of observed frequency) from the sum.
Using a Chi-square table
The calculated chi-square value has to be compared with a theoretical chi-
square value in order to determine whether the null hypothesis is rejected or
not. Annex 2 contains a table of theoretical chi-square values.
1) First you must decide what significance, or alpha, level you want to use
(α value). We usually take 0.05.
2) Then, the degrees of freedom have to be calculated. With the chi-square
test the number of degrees of freedom is related to the number of cells,
i.e., the number of groups you are comparing. The number of degrees of
freedom is found as: df = (number of rows − 1) × (number of columns − 1).
3) Then the chi-square value belonging to the α-value and the number of
degrees of freedom are located in the table.
Let us now apply the chi-square test to the data given in table 4.5
(utilization of health care centre). This gives the following result:
First the expected frequencies for each cell are computed for calculating the
Chi-square value as follows:
Distance from         Used health care          Did not use health        Total
health care centre    centre                    care centre
Less than 10 km       E1 = 86×80/155 = 44.4     E2 = 69×80/155 = 35.6     80
10 km or more         E3 = 86×75/155 = 41.6     E4 = 69×75/155 = 33.4     75
Total                 86                        69                        155
Note that the expected frequencies refer to the values we would have
expected, given the total numbers of 80 and 75 women in the two groups, if
the null hypothesis (there is no difference between the two groups) were
true.
Now the chi-square value can be calculated:
Chi-square (χ²) = (51 − 44.4)²/44.4 + (29 − 35.6)²/35.6
                + (35 − 41.6)²/41.6 + (40 − 33.4)²/33.4 = 4.55
With the alternative formula, the value of chi-square is the same.
The number of degrees of freedom (d.f.) for a 2×2 table is 1. Use the table
of chi-square values in Annex 2. We decided, beforehand, on a level of
significance of 5% (α-value = 0.05).
As the number of d.f. is 1, we look along that row in the column where p =
0.05. This gives us the tabulated value of 3.84.
Our calculated value of chi-square 4.55 is larger than 3.84. Hence we reject
the null hypothesis.
We can now conclude that the women living within a distance of 10 km from
the clinic utilise antenatal care significantly more often than the women
living more than 10 km away.
Table 4.5 indicates that 64% of the women living within a distance of 10 km
from the clinic used antenatal care during pregnancy, compared to only 47%
of women living 10 km or further away from the nearest clinic. This
difference is statistically significant (chi-square = 4.55; p < 0.05).
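The whole chi-square calculation for this 2×2 table can be sketched in Python. Working with unrounded expected frequencies gives about 4.57 rather than the 4.55 obtained above with expected frequencies rounded to one decimal; the conclusion is the same either way.

```python
# Chi-square for the 2x2 table of health-centre use by distance.
observed = [[51, 29], [35, 40]]

def chi_square(obs):
    row_totals = [sum(r) for r in obs]
    col_totals = [sum(c) for c in zip(*obs)]
    grand = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(obs):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / grand  # expected frequency
            chi2 += (o - e) ** 2 / e
    return chi2

value = chi_square(observed)
print(round(value, 2), value > 3.84)  # exceeds 3.84 (df = 1), so reject H0
```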
Note:
The chi-square test can only be applied if the sample is large enough. The
general rule is that the total sample should be at least 40 and the expected
frequencies in each of the cells should be at least 5. If this is not the case,
Fisher’s exact test should be used. If the table is more than a two-by-two
table, the expected frequency in 1 of 5 cells (not more than 20% of cells)
is allowed to be less than 5. (Please note, it is the expected frequency and
not the observed frequency.)
Unlike the t-test, the chi-square test can also be used to compare more than
two groups. In that case, a table with three or more rows/columns would be
designed, rather than a two-by-two table.
A high chi-square value never means a strong association. It only means that
a difference this large is very unlikely to have arisen by chance alone.
It may be very misleading to pool dissimilar data. Pooling the age groups
masked an important real difference. In other situations pooling the data may
suggest a difference or association that does not exist, or, even a difference
opposite to that which exists. It is, therefore, important to analyze the data for
the different age groups/urban-rural/literate vs illiterate separately.
4.10 F-TEST
The F-test is used either for testing the hypothesis about the equality of two
population variances or the equality of two or more population means. The
ratio of two sample variances is distributed as F.
Interpreting the results:
If the calculated value of F is greater than the tabled value, reject H0, if not,
accept H0.
The two sample variances work out to s1² = 37.848 and s2² = 26.606.
F = 37.848 / 26.606 ≈ 1.42
Tabled value
Ftab = 3.07
The tabled value of F is 3.07. Since Fcal < Ftab, we accept the null hypothesis.
It means that the variation in life expectancy in various regions of Brazil in
1900 and in 1970 is the same.
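A minimal sketch of the variance-ratio test, using the two variances quoted in the example and the tabulated critical value 3.07 given in the text. Note that 37.848/26.606 works out to about 1.42; the conclusion (Fcal < Ftab, accept H0) is unaffected.

```python
# F-test sketch: ratio of two sample variances against a critical value.
def f_ratio(var1, var2):
    # Conventionally the larger variance goes in the numerator.
    return max(var1, var2) / min(var1, var2)

F = f_ratio(37.848, 26.606)
F_tab = 3.07  # tabulated critical value quoted in the text
print(round(F, 2), F < F_tab)  # Fcal < Ftab, so H0 is accepted
```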
4.11 Z-TEST
To test the hypothesis about a population mean or two population means
when the sample size is large (>30) and population variances are known, we
use the Z-test.
For comparing two proportions the statistic is
z = (p1 − p2) / √[ P(1 − P)(1/n1 + 1/n2) ]
where P is the pooled (mean) proportion of success of the two proportions
p1 and p2, and n1 and n2 are the respective sample sizes.
The above calculated value of ‘z’ is compared with the tabulated value of ‘z-
normal variate’ for ‘α = 0.05’, i.e., at 5% level of significance, which is 1.96
and the value of z for ‘α =0.01’, i.e., at 1% level significance is 2.58.
H0: P = 0.60
H1: P ≠ 0.60
As Zcal > Ztab, H0 is rejected. It means that the proportion of employees
favouring the new bonus scheme is not 0.60.
Hypothesis:
H0: µ = µ0
H1 : µ ≠ µ 0
Z = (x̄ − µ0) / (σ/√n)
where x̄ is the sample mean and σ is the standard deviation.
The Table value
The value of z (α) for comparing the calculated test statistics is taken as 1.96
at 5% (α) error; and it is 2.58 at 1% (α) error. Here, the sample size is treated
as large and, therefore, the degree of freedom plays no role, unlike in the t-
test.
Example 9: The table below gives the total income in thousand rupees per
year of 36 persons selected randomly from a particular class of people.
Income (thousand Rs)
6.5 10.5 12.7 13.8 13.2 11.4
5.5 8.0 9.6 9.1 9.0 8.5
4.8 7.3 8.4 8.7 7.3 7.4
5.6 6.8 6.9 6.8 6.1 6.5
4.0 6.4 6.4 8.0 6.6 6.2
4.7 7.4 8.0 8.3 7.6 6.7
On the basis of the sample data, can it be concluded that the mean income of
a person in this class of people is Rs. 10,000 per year?
H0: µ = 10 (thousand Rs);  H1: µ ≠ 10
Since the sample size is 36, we will use a normal test for which the test
statistic is
Z = (x̄ − µ0) / (σ/√n)
x̄ = 280.7/36 = 7.80
s² = 5.14, so σ = 2.27
Z = √36 × (7.80 − 10) / 2.27 = −5.81
|Z| = 5.81
Interpretation: Since Zcal >Ztab, reject H0. It means that the average annual
income is less than ten thousand rupees.
Assumption for use of z statistics: The assumption for using the z statistics
is that the parent population, from where samples have been drawn, should be
normal. The z statistic presumes that the population variances (σ1² and σ2²)
of the parent populations are known and, therefore, the z statistic for
testing the significance of the difference between means is defined as
z = (X̄1 − X̄2) / √(σ1²/n1 + σ2²/n2).
4.12 t-Test
4.12.1 Testing the Significance of Independent Samples from
Two Groups for Continuous Data:
Example 10: It has been observed that in a certain province the proportion of
women, joining the army, is very high. A study is, therefore, conducted to
discover why this is the case. The height of women is supposed to be the
contributory factor; the researcher may want to find out if there is a
difference between the mean height of women in this province who preferred
joining army and of those who opted for other services. The null hypothesis
would be that there is no difference between the mean heights of the two
groups of women. Suppose the following results were found:
Table 4.6: Mean heights of women joining the army and of women opting
for other services
The mean height for each of the two samples was calculated and compared,
using the t-test, to determine whether there was a difference.
Hypothesis:
H0: There is no difference between the mean heights of the women joining the
army and those opting for other services.
H1: There is a difference between the mean heights of the women joining the
army and those opting for other services.
Test Statistic: A t-test would be the appropriate way to determine whether
the observed difference of 2 cm can be considered statistically significant.
t(df) = (X̄1 − X̄2) / [S √(1/n1 + 1/n2)],  degrees of freedom df = n1 + n2 − 2
Pooled variance S² = [(n1 − 1)sd1² + (n2 − 1)sd2²] / (n1 + n2 − 2)
(1) Difference between the means. In the above example the difference is
(156-154 = 2 cm).
(2) Calculate the standard deviation (square root of variance S2) of all
observations pooled together for both the samples.
In case the standard deviations for each of the study groups are given or have
been calculated, then compute the pooled variance of samples (S2) as given:
(3) Calculate the standard error of the difference:
SE = S √(1/n1 + 1/n2) = 2.96 × 0.1895 = 0.5608
(4) Finally, divide the difference between the means by the standard
error of the difference. The value now obtained is called the t-value:
t = 2 / 0.5608 = 3.57
Once the t-value has been calculated, you will have to refer to a t-table, from
which you can determine whether the null hypothesis is rejected or not.
Annex II contains a t-table.
We now compare the absolute value of the t-value calculated in Step 1 (i.e.,
the t-value, ignoring the sign) with the t-value derived from the table in Step
2.
In our example the t-value calculated in step 1 is 3.6, which is larger than the
t-value derived from the table in step 2 (1.98). We, therefore, reject the null
hypothesis and conclude that the observed difference of 2 cm between the
mean heights of women who joined the army and women who opted for other
services is statistically significant.
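The pooled t computation can be sketched as below, using the summary figures from the example (means 156 and 154 cm, pooled S = 2.96). The group sizes of 56 each are assumed for illustration, since the text gives only the resulting standard error (about 0.56).

```python
import math

# Pooled two-sample t: t = (mean1 - mean2) / (S * sqrt(1/n1 + 1/n2)).
def pooled_t(mean1, mean2, s_pooled, n1, n2):
    se = s_pooled * math.sqrt(1 / n1 + 1 / n2)  # standard error of difference
    return (mean1 - mean2) / se

# n1 = n2 = 56 are hypothetical group sizes.
t = pooled_t(156, 154, 2.96, 56, 56)
print(round(t, 2))  # close to the 3.57 computed in the text
```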
Caution: The reader should understand the basic differences for the use of z
and t statistics. The following assumption will clarify the differences in the
use of two statistics.
t(df) = (X̄1 − X̄2) / [S √(1/n1 + 1/n2)],  df = n1 + n2 − 2
Pooled variance S² = [(n1 − 1)sd1² + (n2 − 1)sd2²] / (n1 + n2 − 2)
(The mathematical reason for doing so is very complicated and beyond the
scope of this study material.) Further, the t distribution tolerates
moderate skewness of the samples or populations. The pooled variance from
the two samples is a better estimate of the population variance. It is,
therefore, advantageous to use the t statistic rather than the z statistic
when the population variances are not known.
t(df) = mean of differences / standard error of differences,
i.e.,  t(df) = d̄ / [sd(d)/√n]
where d̄ is the mean of the differences obtained by subtracting the two sets
of paired observations; sd(d) is the standard deviation of those
differences; n is the number of pairs; and the degrees of freedom
(df = n − 1) is the number of paired observations (sample size) minus 1.
The same table of t values is used as for the t-test for unpaired
observations (see Annex II) to interpret the result of the study. The use of
the paired t-test is illustrated on the results of the nutritional survey
referred to in the previous example. The results are given in table 4.7.
The Table Value
The degrees of freedom are the sample size (the number of pairs of
observations) minus 1, which in this case is (20 – 1 = 19). The tabled t-value
at 19 degrees of freedom is 2.09
The Interpretation
If the calculated t-value (ignoring the sign) is larger than the value indicated
in the table, the null hypothesis, stating that there is no difference, is rejected,
and it can be concluded that there is a significant difference in the result of
your study.
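The paired procedure can be sketched as follows. The before/after values are hypothetical, since the data of table 4.7 are not reproduced here.

```python
import math

# Paired t-test sketch: t = mean(d) / (sd(d) / sqrt(n)), df = n - 1.
def paired_t(before, after):
    d = [a - b for a, b in zip(after, before)]     # paired differences
    n = len(d)
    mean_d = sum(d) / n
    var_d = sum((x - mean_d) ** 2 for x in d) / (n - 1)
    return mean_d / (math.sqrt(var_d) / math.sqrt(n)), n - 1

# Hypothetical paired observations on five subjects.
t, df = paired_t([60, 62, 65, 58, 67], [64, 66, 70, 59, 71])
print(round(t, 2), df)  # 5.31, df = 4
```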
Note: Computers are helpful when dealing with large data sets. A variety of
software including Excel and SPSS provides options for various statistical
tests.
(Scatter diagrams, not reproduced here. Panel A: Example of Positive High
Correlation; Panel B: Example of Negative High Correlation; two further
panels show widely scattered points, i.e., low correlation.)
The slopes of both the lines are identical in these two examples, but the
scatter around the line is much greater in the second. Clearly the relationship
between variables y and x is much closer in the first diagram.
r = Σ (Xi − X̄)(Yi − Ȳ) / √[ Σ (Xi − X̄)² × Σ (Yi − Ȳ)² ]
where the sums run over the n pairs of observations (i = 1, …, n).
An equivalent computational form is
r = [ ΣXiYi − (ΣXi)(ΣYi)/n ] / √[ (ΣXi² − (ΣXi)²/n)(ΣYi² − (ΣYi)²/n) ]
It would be more informative to investigate whether the two variables ‘family
income’ and ‘weight of five-year-olds’ are associated.
r = 0.466
t(n−2) = r × √[ (n − 2) / (1 − r²) ]
We compare this value of t to tables of the t distribution with (n - 2) degrees
of freedom, where n is the number of observations.
In our example: n = 20, r = 0.466,
t18 = 0.466 × √(18 / (1 − 0.466²)) = 2.23
Therefore, using α value (chosen p value) of 0.05, the t-table value for 18
degrees of freedom (t18; 0.05) = 2.10. Thus the calculated t-value is more than
the table value; this means that the p-value (for rejecting null hypothesis) is
less than 0.05, and therefore the linear relationship is statistically significant.
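The correlation coefficient and its significance test can be sketched together; the five (x, y) pairs below are hypothetical.

```python
import math

# Pearson correlation r from deviation products, then its significance
# test t = r * sqrt((n - 2) / (1 - r^2)) with n - 2 degrees of freedom.
def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = math.sqrt(sum((a - mx) ** 2 for a in x) *
                    sum((b - my) ** 2 for b in y))
    return num / den

def t_for_r(r, n):
    return r * math.sqrt((n - 2) / (1 - r * r))

x = [1, 2, 3, 4, 5]          # hypothetical data
y = [2, 4, 5, 4, 5]
r = pearson_r(x, y)
print(round(r, 3), round(t_for_r(r, len(x)), 2))  # 0.775 2.12
```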
Suppose there are n data points {yi, xi}, where i = 1, 2, …, n. The goal is to
find the equation of the straight line, y = α + βx, which would provide the
best fit for the data points. Here the best fit is understood in the
least-squares sense: the line minimizes the sum of squared residuals (the
objective function Q) of the linear regression model.
By using calculus, it can be shown that the values of α and β that minimize
the objective function Q are
β = Σ(xi − x̄)(yi − ȳ) / Σ(xi − x̄)²  and  α = ȳ − βx̄.
S.No.    X      Y      x = X−X̄    y = Y−Ȳ    x²      y²      xy
1        95     85     17         8          289     64      136
2        85     95     7          18         49      324     126
3        80     70     2          −7         4       49      −14
4        70     65     −8         −12        64      144     96
5        60     70     −18        −7         324     49      126
Sum      390    385                          730     630     470
Mean     78     77
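The least-squares fit for the five (X, Y) pairs in the worked table can be sketched as below; in deviation form the slope is Σxy/Σx² = 470/730.

```python
# Least-squares fit y = a + b*x:
# b = sum((X - mean_X)(Y - mean_Y)) / sum((X - mean_X)^2), a = mean_Y - b*mean_X.
def least_squares(X, Y):
    n = len(X)
    mx, my = sum(X) / n, sum(Y) / n
    b = (sum((x - mx) * (y - my) for x, y in zip(X, Y)) /
         sum((x - mx) ** 2 for x in X))
    a = my - b * mx
    return a, b

a, b = least_squares([95, 85, 80, 70, 60], [85, 95, 70, 65, 70])
print(round(b, 3), round(a, 2))  # b = 470/730 ≈ 0.644, a ≈ 26.78
```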
4.14.4 Difference between Regression and Correlation

1. Correlation quantifies the degree to which two variables are related: you
simply compute a correlation coefficient (r) that tells you how much one
variable tends to change when the other one does. Regression finds the
best-fit line for a given set of variables.

2. With correlation you don't have to think about cause and effect; you
simply quantify how well two variables relate to each other. With
regression, you do have to think about cause and effect, as the regression
line is determined as the best way to predict Y from X.

3. With correlation, it doesn't matter which of the two variables you call
"X" and which you call "Y": you'll get the same correlation coefficient if
you swap the two. With linear regression, the decision of which variable you
call "X" and which "Y" matters a lot, as you'll get a different best-fit
line if you swap them. The line that best predicts Y from X is not the same
as the line that predicts X from Y.

4. Correlation is almost always used when you measure both variables; it is
rarely appropriate when one variable is something you experimentally
manipulate. With linear regression, the X variable is often something you
experimentally manipulate (time, concentration, ...) and the Y variable is
something you measure.

5. In correlation, our focus is on measuring the strength of the
relationship. In regression analysis, we examine the nature of the
relationship between the dependent and the independent variables.

6. In correlation, all the variables are implicitly taken to be random. In
regression, at our level, we take the dependent variable as random, or
stochastic, and the independent variables as non-random or fixed.
There are also some data reduction techniques that are extensively used in development studies but are beyond the scope of this unit. One such technique, which is very commonly used, is Principal Component Analysis. It is generally applied when the number of explanatory variables is very high. In Principal Component Analysis, the variables are compressed into a smaller number of variables called principal components. Statistical software packages can be used to work out a Principal Component Analysis easily. For example, suppose a researcher has identified 20 variables that influence a farmer's adaptation to climate change. Principal Component Analysis will help group similar variables into components, thus reducing the number of variables; the analysis may identify just three or four principal components into which all these variables can be grouped. The first principal component accounts for the maximum amount of variation among the existing variables, followed by the second, third and fourth. This kind of analysis helps the researcher identify those variables which, if worked upon, would increase the adaptation of the farmers to climate change.
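As a rough sketch of what such software does (hypothetical data; assuming the NumPy library is available), a Principal Component Analysis can be carried out by an eigendecomposition of the covariance matrix of the centred variables:

```python
import numpy as np

# Hypothetical data: five observed variables that are noisy copies of one
# underlying factor, standing in for a set of strongly correlated indicators.
rng = np.random.default_rng(0)
base = rng.normal(size=(50, 1))
X = np.hstack([base + 0.1 * rng.normal(size=(50, 1)) for _ in range(5)])

Xc = X - X.mean(axis=0)                   # centre each variable
cov = np.cov(Xc, rowvar=False)            # 5x5 covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)    # eigenvalues in ascending order
eigvals = eigvals[::-1]                   # sort descending
explained = eigvals / eigvals.sum()       # share of variance per component
```

Because the five variables move together, `explained[0]` is close to 1: the first principal component absorbs almost all of the variation, exactly the compression described above.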
Note:
The results we obtain by subjecting our data to analysis may actually be true, or may be due to chance or sampling variation. In order to rule out chance as an explanation, we use tests of significance. In this unit we have confined our discussion to four tests, i.e., the χ2 test, Z-test, t-test and F-test.
4.16 KEYWORDS
Independent variable: the characteristic being observed or measured which
is hypothesized to influence an event or outcome (dependent variable), and is
not influenced by the event or outcome, but may cause it, or contribute to its
variation.
Median: the median is the value that divides a distribution into two equal halves. The median is useful when some measurements are much bigger or much smaller than the rest, or when the measurements are on an ordinal scale.
Percentiles: percentiles are points that divide all the measurements into 100 equal parts. The 30th percentile (P30) is the value below which 30% of the measurements lie. The 50th percentile (P50), or the median, is the value below which 50% of the measurements lie.
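As an illustration (hypothetical data; the nearest-rank rule shown here is only one of several conventions for computing percentiles):

```python
from math import ceil

data = sorted([12, 7, 3, 9, 15, 21, 5, 11, 8, 14])

def percentile(values, p):
    # Nearest-rank convention: the smallest value with at least p% of the
    # measurements at or below it (values must already be sorted).
    k = ceil(p / 100 * len(values))
    return values[max(k - 1, 0)]

p30 = percentile(data, 30)      # 30% of the measurements lie at or below this
p50 = percentile(data, 50)      # the median under the nearest-rank rule
```

(For an even number of measurements, the usual textbook median averages the two middle values; the nearest-rank rule is a simpler convention that always returns an observed value.)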
4.17 BIBLIOGRAPHY AND SELECTED READINGS
Castle, W.M. and North, P.M. (1995), Statistics in Small Doses, Churchill Livingstone, Edinburgh, UK.
Hicks, C.M. (1999), Research Methods for Clinical Therapists, 3rd Edition, Churchill Livingstone, Robert Stevenson House, 1-3 Baxter's Place, Leith Walk, Edinburgh, UK.
Kidder, L.H. and C.M. Judd (1986), Research Methods In Social Relations,
CBS College Publishing, New York, USA.
Riegelman, R.F. (1981), Studying a Study and Testing a Test, Little Brown
and Company, Boston, MA, USA.
Swinscow, T.D.V. and M.J. Campbell (1998), Statistics at Square One (11th
ed.), British Medical Association, London, UK.
4.18 CHECK YOUR PROGRESS – POSSIBLE
ANSWERS
Check Your Progress 1
Answer 1: There are two types of data: (i) qualitative data, viz., occupation, sex, marital status, religion; and (ii) quantitative data, viz., age, weight, height, income, etc. These may be further categorized into two types, viz., discrete and continuous data.
Check Your Progress 2
Answer 1: The three measures of central tendency are the mean, median, and mode.
Answer 2: Type I error (α): we reject the null hypothesis when it is true. This is a false positive error, or type I error, denoted α (alpha): the error of detecting an effect that is not actually present.
Type II error (β): we accept the null hypothesis when it is false. This is a false negative error, or type II error, denoted β (beta): the failure to detect an effect that is present.
Check Your Progress 3
Answer 1: The chi-square test can only be applied if the sample is large enough: the total sample should be at least 40, and the expected frequency in each of the cells should be at least 5. Unlike the t-test, the chi-square test can also be used to compare more than two groups; in that case, a table with three or more rows/columns is designed rather than a two-by-two table. The chi-square test is always applied to absolute numbers, never to percentage values. A high chi-square value does not by itself mean a strong association; it only means a low probability of obtaining such a value by chance. It may be very misleading to pool dissimilar data.
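A minimal sketch of the computation, using hypothetical absolute counts in a two-by-two table (note that counts, not percentages, enter the formula):

```python
# Hypothetical observed frequencies: two groups x two outcomes.
observed = [[30, 20],
            [15, 35]]

row_tot = [sum(row) for row in observed]
col_tot = [sum(col) for col in zip(*observed)]
grand = sum(row_tot)

chi2 = 0.0
for i, row in enumerate(observed):
    for j, obs in enumerate(row):
        exp = row_tot[i] * col_tot[j] / grand   # expected frequency (should be >= 5)
        chi2 += (obs - exp) ** 2 / exp          # sum of (O - E)^2 / E
```

The resulting `chi2` statistic is then compared against a chi-square table with (rows - 1) x (columns - 1) degrees of freedom, here 1, to judge significance.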
264
Answer 2: When dealing with paired (matched) observations, comparison of sample means is performed using a modified t-test known as the paired t-test. In the paired t-test, the differences between the paired observations (say, post-test minus pre-test, or matched observations of the 2nd group minus the 1st group) are used instead of the observations of two sets of independent samples.
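A minimal sketch with hypothetical pre-test/post-test scores; the statistic is computed from the per-pair differences, not from the two samples separately:

```python
from math import sqrt

pre  = [12, 15, 11, 14, 13, 16, 10, 12]   # hypothetical pre-test scores
post = [14, 18, 13, 15, 16, 17, 13, 14]   # matched post-test scores

d = [b - a for a, b in zip(pre, post)]    # post-test minus pre-test
n = len(d)
mean_d = sum(d) / n
var_d = sum((x - mean_d) ** 2 for x in d) / (n - 1)   # sample variance of differences
t = mean_d / sqrt(var_d / n)              # paired t statistic, df = n - 1
```

The computed `t` is then referred to a t table with n - 1 degrees of freedom (here 7) to decide whether the mean difference is significant.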