
UNIT 4 OVERVIEW OF STATISTICAL TOOLS

Structure

4.1 Introduction
4.2 The Data: Meaning and Types
4.3 Frequency Distributions
4.4 Measures of Central Tendency
4.5 Measures of Dispersion
4.6 Hypothesis Testing and Inferential Statistics
4.7 Choosing a Statistical Test
4.8 Statistical Tests
4.9 Chi-Square Test
4.10 F-Test
4.11 Z-Test
4.12 t-Test
4.13 Correlation Analysis
4.14 Regression Analysis
4.15 Let Us Sum Up
4.16 Keywords
4.17 Bibliography and Selected Readings
4.18 Check Your Progress – Possible Answers

4.1 INTRODUCTION
Statistical tools are the pillars on which data analysis for all types of
developmental programmes stands. Researchers, too, need some understanding of
statistical analysis to be able to produce a meaningful research report. With
the availability of several user-friendly software packages, performing
statistical analysis is now feasible even for non-statisticians, provided they
are computer literate and understand the basic principles of statistical
analysis.

There are two types of statistics:

i. Descriptive statistics, which include techniques for organizing,
summarizing, and presenting data using tables, graphs, or single numbers.

ii. Inferential statistics, which consist of statistical methods for making
inferences about a population based on information obtained from a
sample.

This unit aims to make you conversant with the basic statistical tools
applicable in developmental research.

After studying this unit, you should be able to

• describe various measures of central tendency and dispersion
• discuss the applicability of various tests involved in hypothesis testing
• explain the use of correlation in data analysis
• describe the use of regression in data analysis

4.2 THE DATA: MEANING AND TYPES


Statistics is a science that deals with the collection, organization, analysis,
interpretation, and presentation of information that can be expressed
numerically and/or graphically to help us answer a question of interest. This
helps us, in many ways, to

i. establish the problem and its effect
ii. support or refute a hypothesis
iii. establish association and relationship, etc.

There are various sources of data, for example

• surveillance systems
• national surveys
• experiments
• health organizations
• private sector
• government agencies, and
• research studies conducted by research and academic institutions, etc,
which can be used for the benefit of society.

4.2.1 Data: Meaning and Types


In order to conduct a study, we collect raw data using one of the sampling
techniques/methods. This raw data does not provide any meaningful conclusion
for the programme or for policy-making unless it is presented in some
meaningful format. In order to have meaningful information from a study, we
need to analyze the data using appropriate statistical tests and techniques.
Before applying any statistical test, it is important to identify the type of
data, as the choice of statistical tests depends on the type of data. The
theory of measurement consists of a set of separate or distinct theories, each
concerning a distinct level of measurement. Here, we will discuss four levels
of measurement: Nominal, Ordinal, Interval, and Ratio, and then the statistics
and statistical tests that are permitted with each level.

Types of Data

There are two types of data: (i) qualitative data, viz., occupation, sex,
marital status, religion; and (ii) quantitative data, viz., age, weight,
height, income, etc. These may be further categorized into the following two
types.

 Discrete: data that can be divided into categories or groups, such as male
and female, and can take only discrete values; such data are measured on the
nominal or ordinal scale explained above.
 Continuous: data that can take any value, including decimals, are called
continuous data. Continuous data are, at least, in interval or ratio scale, as
defined above in types of measurement. Measurements in ordinal scale can also
be considered under this category, though they do not fulfil all conditions of
the continuous scale. In social sciences, it is difficult to measure variables
in interval or ratio scale, and one has to depend on measurements taken in the
ordinal scale. It is further emphasized that statistics based on such
measurements may provide ‘under’ or ‘over’ estimates of the population
parameter.

4.2.2 Variable Types

Two types of variables are used in the analysis of data.

 Independent variable: the characteristic being observed or measured which
is hypothesized to influence an event or outcome (the dependent variable). It
is not influenced by the event or outcome, but may cause it or contribute to
its variation.
 Dependent variable: a variable whose value depends on the effect of the
other variables (independent variables) in the relationship being studied. It
is also called a criterion, outcome, or response variable.

4.2.3 Parametric and Non-Parametric Tests


 A parametric statistical test is a test whose model specifies certain
conditions about the parameters of the parent population from which the
sample was drawn. For example, the conditions could be: observations must be
independent; the variable must have been measured in at least an interval
scale; and the observations must have been drawn from a normally distributed
population. Since these conditions are not ordinarily tested, they are assumed
to hold.
 A non-parametric statistical test is a test whose model does not specify
conditions about the parameters of the parent population from which the
sample was drawn. A non-parametric test does not assume any shape
(distribution) of the parent population from which the sample was drawn and,
therefore, is also termed a distribution-free method. However, certain
assumptions are associated with most non-parametric tests, i.e., that the
observations are independent, and that the variable under study has underlying
continuity. But these assumptions are fewer and weaker than those of
parametric tests. Most non-parametric tests are applicable to data in an
ordinal scale, and some are applicable to data in a nominal scale. Therefore,
a statistical technique is called non-parametric if it satisfies at least one
of the following conditions.
i. The data are measured/analyzed using a nominal and/or ordinal scale of
measurement.
ii. Count data represent the number of observations in each category.
iii. The inference does not concern a parameter of the population
distribution.
iv. The probability distribution of the statistic does not depend on specific
information or assumptions about the population from which the sample(s) are
drawn.

4.3 FREQUENCY DISTRIBUTIONS


Consider a situation where a test is administered to a group of 60 students.
The marks obtained by each student can be listed against the names of
students in a register or mark list. The data in this original form are called
ungrouped data. When arranged in ascending or descending order of magnitude,
the data are said to be arranged in an array. Counting the number of times
each variate value occurs, we get the frequency table of ‘number of marks’
and ‘number of students’. A frequency distribution, therefore, specifies the
values which a variate takes and the frequency (the number of times) with
which each value occurs.

Consider an example in which 82 clinics in one district were asked to submit
the number of patients treated for malaria in one month. Of the 82 clinics,
80 responded. The researchers presented both the frequency distribution and
percentages (or relative frequencies).

Table 4.1: Distribution of clinics according to number of patients treated
for malaria in one month

Number of    Number of   Cumulative   Cumulative    Range of cumulative
patients     clinics     frequency    percentage    frequency
0 - 19          5            5           6.25          0 - 5
20 - 39         8           13          16.25          6 - 13
40 - 59        10           23          28.75         14 - 23
60 - 79        11           34          42.50         24 - 34
80 - 99        19           53          66.25         35 - 53
100 - 119      10           63          78.75         54 - 63
120 - 139       9           72          90.00         64 - 72
140 - 159       8           80         100.00         73 - 80
Total          80
Missing value (no response): 2; grand total: 82

The frequency table given above is an improvement over the arrangement of
figures that were just listed or entered in a register, kept in a file, or
listed in an array form for each respondent, as it presents a clear idea of
the data. This type of representation of frequencies is called a grouped
frequency distribution. In the above example, 80 observations were divided
into eight groups, known as classes or class intervals; the boundaries of a
class are its class limits. The difference between the upper and lower limit
is called the width of the class. The cumulative frequency corresponding to a
class is the total number of observations less than or equal to the upper
limit of that class, i.e., it is the total of all frequencies up to and
including that class.
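As an illustration (a sketch, not part of the original unit), the cumulative
columns of such a table can be generated in Python from the class frequencies
alone; the list names below are illustrative:

```python
# A minimal sketch: rebuilding the frequency table of Table 4.1.
# The class frequencies are hard-coded from the example.
counts = [5, 8, 10, 11, 19, 10, 9, 8]              # clinics in each class
labels = [f"{lo}-{lo + 19}" for lo in range(0, 160, 20)]

total = sum(counts)                                 # 80 responding clinics
cum = 0
print(f"{'Patients':>9} {'Clinics':>8} {'Cum.freq':>9} {'Cum.%':>7}")
for label, f in zip(labels, counts):
    cum += f                                        # running cumulative frequency
    print(f"{label:>9} {f:8d} {cum:9d} {100 * cum / total:7.2f}")
```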

Instead of presenting data in frequency tables using absolute numbers, it is
often better to calculate percentages. A percentage is the number of units in
the sample with a certain characteristic, divided by the total number of units
in the sample and multiplied by 100. Percentages may also be called relative
frequencies. Percentages standardize the data, making it easier to compare
them with similar data obtained from another sample of a different size or
origin. Sometimes relative frequencies are expressed as proportions instead of
percentages. A proportion is a numerical expression that compares one part of
the study units to the whole.

Note

One should be cautious when calculating and interpreting percentages where


the total number is small because one unit more or less would make a big
difference in terms of percentages. As a general rule, percentages should not
be used when the total is less than 30. Therefore, it is recommended that the
number of observations, or total cases studied, should always be given
together with the percentage.

After having gone through the concept of data, answer the following
questions given in Check Your Progress 1.

Check Your Progress 1
Note:
a) Write your answer in about 50 words.
b) Check your answer with possible answers given at the end of the unit
1. What are the different types of data?
…………………………………………………………………………
…………………………………………………………………………
…………………………………………………………………………
…………………………………………………………………………
…………………………………………………………………………
2. What do you understand by the term, non-parametric statistical test?
…………………………………………………………………………
…………………………………………………………………………
…………………………………………………………………………
…………………………………………………………………………
…………………………………………………………………………

4.4 MEASURES OF CENTRAL TENDENCY


Frequency distributions and histograms provide useful ways of looking at a
set of observations of a variable, and in many circumstances presenting them
is essential to understanding the pattern or trend in the data. However, if
one wants to summarize a set of observations further, it is often helpful to
use a measure which can be expressed as a single number. First of all, one
would like to have measures of location, or measures of central tendency, of
the distribution. The three measures used for this purpose are the mean,
median, and mode.

4.4.1 Mean
The mean (or arithmetic mean) is also known as the average. It is calculated
by totalling the values of all the observations and dividing by the total
number of observations. Note that the mean can only be calculated for
numerical data.

Example 1: The measurements of the heights in centimetres of seven girls


are given below.

S.No.   Height (cm)
1       141
2       141
3       143
4       144
5       145
6       146
7       155

The total of the seven measurements is 1015 cm. The mean = 1015/7 = 145 cm.

The above is the simplest example. You may be given frequency or grouped
data, in which case it is a bit more difficult to calculate the mean.
Consider the case of frequency data, as given below.

Example 2: Measurements of the heights in centimetres of 15 girls are given


below

S.No.   Height Xi   Frequency fi   fi × Xi
1       141          2              282
2       143          3              429
3       144          4              576
4       145          3              435
5       146          2              292
6       155          1              155
Total                15             2169

Formula for calculating the mean: X̄ = (Σ fi × Xi) / Σ fi = 2169/15 = 144.6
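A quick check of this frequency-weighted mean in Python (an illustrative
sketch using the data of Example 2):

```python
# Frequency-weighted mean of Example 2.
heights = [141, 143, 144, 145, 146, 155]
freqs = [2, 3, 4, 3, 2, 1]

mean = sum(f * x for f, x in zip(freqs, heights)) / sum(freqs)
print(mean)  # 144.6
```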

Note

Note 1: In the case of the grouped frequency data given in Table 4.1, the
midpoint of each interval becomes Xi, and a similar procedure for computing
the mean can be followed as shown above.

Note 2: However, in the case of open interval data, such as ‘less than 19’
(denoted as <19) or ‘more than 159’ (denoted as > 159), it is not possible to
fix a midpoint, and, therefore, the mean cannot be calculated. As a
consequence, we use the median in place of the mean, which is explained in
the next section.

4.4.2 Median

The median is a value that divides a distribution into two equal halves. The
median is useful when the data are in an ordinal scale, or when some
measurements are much bigger or much smaller than the other measurement
values. The mean of such data would be biased toward these extreme values, so
the mean is not a good measure of central tendency in this case. The median
is not influenced by extreme values. The median value, also called the
central or halfway value (the 50th percentile, i.e., 50% of values lie below
the median and 50% above it), is obtained in the following way:

• List the observations in order of magnitude (from the lowest to the
highest value, or vice versa).
• Count the number of observations, n.
• If n is odd, the median is the middle value, i.e., the {(n+1)/2}th value;
if n is even, it is the mean of the two middle values, i.e., of the (n/2)th
value and the next one.

Example 3:

Case 1: The weights of 7 women are:

S.No. Weight of women (kg)


1 40
2 41
3 42
4 43
5 44
6 47
7 72

The median value is the value belonging to observation number (7 + 1)/2,
i.e., the fourth value: 43 kg.

Case 2: If there are 8 observations:

S.No. Weight of women (kg)


1 40
2 41
3 42
4 43
5 44
6 47
7 49
8 72

The median would be 43.5 kg: the average of the (n/2 = 8/2) 4th value, i.e.,
43, and the next value, i.e., 44; the median in this case is (43+44)/2 =
43.5 kg.

Calculation of Median for Grouped Data

We can use the grouped data of Table 4.1 of Section 4.3, ‘Distribution of
clinics according to number of patients treated for malaria in one month’,
for calculation of the median. It is reproduced below.

Table 4.2: Distribution of clinics according to number of patients treated
for malaria in one month

Number of    Number of   Cumulative
patients     clinics     frequency
0 - 19          5            5
20 - 39         8           13
40 - 59        10           23
60 - 79        11           34
80 - 99        19           53
100 - 119      10           63
120 - 139       9           72
140 - 159       8           80
Total          80

Step 1: The total frequency is first divided by 2, i.e., 80/2 = 40. The
cumulative frequency 40 corresponds to the class interval 80-99. This is
called the median interval.

Step 2: The formula is Median = L + [((N/2) − F) × d] / f

Step 3: Record the values of the symbols from the table, as given below:

L (= 80) is the lower limit of the median interval,

F (= 34) is the cumulative frequency of the class preceding the median class,

d (= 20) is the width of the class interval,

f (= 19) is the frequency of the median class.

Step 4: Replace the symbols with the numeric values noted in Step 3 in the
formula.

Therefore, Median = 80 + [(40 − 34) × 20] / 19 = 80 + 6.32 = 86.32 patients.
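The same interpolation can be sketched in Python (an illustrative sketch; the
loop simply locates the median interval before applying the formula):

```python
# Grouped-median formula Median = L + [(N/2 - F) * d] / f, applied to Table 4.2.
lower = [0, 20, 40, 60, 80, 100, 120, 140]   # lower class limits
freq = [5, 8, 10, 11, 19, 10, 9, 8]          # clinics per class

N, d = sum(freq), 20
cum = 0
for L, f in zip(lower, freq):
    if cum + f >= N / 2:       # first class whose cumulative frequency reaches N/2
        median = L + ((N / 2 - cum) * d) / f   # cum plays the role of F
        break
    cum += f
print(round(median, 2))        # 86.32 patients
```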

4.4.3 Mode
The mode is the most frequently occurring value in a set of observations. The
mode is not very useful for numerical data that are continuous. It is most
useful for numerical data that have been grouped. The mode is usually used
to find the norm among populations and is calculated when the calculation of
mean and median is inappropriate, viz., the average shoe size of the Indian
population, standard birth weight, etc. The mode can also be used for
categorical data, whether they are nominal or ordinal.

In Example 1 (height of 7 girls) the mode is 141.

In Example 2 the mode is 144 as 4 persons are with this observation.

Calculation of Mode for Grouped Data

We will again use the grouped data given in Table 4.1 to calculate the mode.
The steps for computing the mode are:

Step 1: The formula for calculating the mode is
Mode = L + [(fm − f1) / (2fm − f1 − f2)] × h

Step 2: Find the class interval with the largest frequency. In this case, the
class interval 80-99 has the maximum frequency, equal to 19.

Step 3: Record the values of the symbols used in the formula.

L (= 80) is the lower limit of the modal class,

fm (= 19) is the modal class frequency (i.e., the frequency of the interval
having the maximum frequency),

f1 (= 11) is the frequency of the class interval preceding the modal class,

f2 (= 10) is the frequency of the class succeeding the modal class, and

h (= 20) is the width of the interval.

Step 4: Replace the symbols with the numeric values noted in Step 3 in the
formula.

The calculated value of the mode is:

Mode = 80 + [(19 − 11) / (2 × 19 − 11 − 10)] × 20 = 80 + 9.41 = 89.41 patients
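A matching Python sketch for the grouped mode (illustrative, and assuming, as
here, that the modal class is neither the first nor the last interval):

```python
# Grouped-mode formula Mode = L + [(fm - f1) / (2*fm - f1 - f2)] * h (Table 4.1).
lower = [0, 20, 40, 60, 80, 100, 120, 140]
freq = [5, 8, 10, 11, 19, 10, 9, 8]
h = 20

i = freq.index(max(freq))                    # modal class 80-99, fm = 19
fm, f1, f2 = freq[i], freq[i - 1], freq[i + 1]
mode = lower[i] + ((fm - f1) / (2 * fm - f1 - f2)) * h
print(round(mode, 2))                        # 89.41 patients
```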

4.4.4 Relationship between Mean, Median, and Mode


For normally distributed data, the mean, median, and mode are the same.
However, in a moderately skewed (non-normal) distribution,
Mode = 3 Median − 2 Mean.
In summary, the mean, the median, and the mode are all measures of central
tendency or measures of location. The mean is most widely used. It contains
more information because the value of each observation is taken into account
in its calculation. However, the mean is strongly affected by values far from
the centre of the distribution, while the median and the mode are not. The
calculation of the mean forms the beginning of more complex statistical
procedures, such as correlation and regression, used to describe and analyze
data. In general, as the skewness increases, the mean and median move away
from the mode. If the mean is less than the median, the data are skewed to
the left; if it is greater than the median, the data are skewed to the right.
The choice of a measure of central tendency, therefore, depends upon the type
and distribution of data.
Nowadays, with the easy availability of scientific calculators and computers
(Excel and other statistical software), the calculation of mean, median, mode,
etc., has become very simple. Figure 4.1 shows a distribution curve in which
the mean, the median, and the mode have different values.

Figure 4.1: Distribution curve showing Mean, Median, and Mode

4.5 MEASURES OF DISPERSION


The mean, median, and mode are measures of the central tendency of a
variable, but they do not provide any information about how much the
measurements vary or are spread. This section describes some common measures
of variation (or variability), often referred to in statistical textbooks as
measures of dispersion. Measures of dispersion give an idea of the extent to
which the values are clustered or spread out; in other words, they give an
idea of the homogeneity or heterogeneity of the data. Two sets of data can
have similar measures of central tendency but different measures of
dispersion. Therefore, measures of central tendency should be reported along
with measures of dispersion. The various measures of dispersion are discussed
below.

4.5.1 Range
It is the simplest measure of dispersion. It can be reported as the
difference between the maximum and minimum values, or simply as the maximum
and minimum values of all observations.

Example 4: If the weights of 7 women were

S.No. Weight of women (kg)


1 40
2 41
3 42
4 43
5 44
6 47
7 72

The range would be 72 – 40 = 32 kg.

Although simple to calculate, the range does not tell us anything about the
distribution of the values between the two extreme ones.

Example 5: If the weights of 7 other women were

S.No. Weight of women (kg)


1 40
2 46
3 50
4 55
5 60
6 65
7 72

The range would also be 72 – 40 = 32 kg, although the values are very much
different from those of the previous example.

4.5.2 Percentiles
A second way of describing the variation or dispersion of a set of
measurements is to divide the distribution into percentiles (100 parts). As a
matter of fact, the concept of percentiles is just an extension of the
concept of the median, which may also be called the 50th percentile.
Percentiles are points that divide all the measurements into 100 equal parts.
The 30th percentile (P30) is the value below which 30% of the measurements
lie. The 50th percentile (P50), or the median, is the value below which 50%
of the measurements lie. To determine percentiles, the observations should
first be listed from the lowest to the highest, just as when finding the
median. In the case of grouped data, a percentile can be calculated on lines
similar to the calculation of the median.
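For grouped data, the median formula generalises directly to any percentile,
Pk = L + [((kN/100) − F) × d] / f. A small Python sketch (illustrative, using
the clinic data of Table 4.2):

```python
# Grouped-data percentile, generalising the grouped-median formula.
lower = [0, 20, 40, 60, 80, 100, 120, 140]
freq = [5, 8, 10, 11, 19, 10, 9, 8]
N, d = sum(freq), 20

def percentile(k):
    target, cum = k * N / 100, 0
    for L, f in zip(lower, freq):
        if cum + f >= target:              # class containing the k-th percentile
            return L + (target - cum) * d / f
        cum += f

print(round(percentile(50), 2))   # 86.32, the median
print(round(percentile(30), 2))   # 61.82, the 30th percentile
```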

The concept of percentiles is used by nutritionists to develop standard growth


charts for specific countries from a representative sample of children whose
weight and height are measured according to their age in months.

4.5.3 Mean Deviation

It is the average of the absolute deviations from the arithmetic mean:

Mean deviation = Σ |Xi − X̄| / n

where | | denotes the modulus, i.e., all differences are taken ‘as positive’
or ‘in absolute value’.

4.5.4 Standard Deviation

The standard deviation (sd) is always reported with the mean. It denotes
(approximately) the extent to which values vary from the mean: if the
standard deviation is 10, values typically lie about 10 units above or below
the mean, and almost all values lie within three standard deviations (30
units) of it. Mathematically, it is the square root of the variance
(sd = √Variance). All the deviation values (Xi − X̄) in a data set are
calculated by subtracting the mean X̄ of the data from each observation Xi.
The variance is the sum of squares of all the deviation scores divided by
(n − 1). Mathematically this is written as follows.

Variance = s² = Σ (Xi − X̄)² / (n − 1); the shortcut formula is:

s² = [Σ Xi² − (Σ Xi)² / n] / (n − 1)

sd = √Variance

X̄ = mean

n = number of observations

Large values of the variance and standard deviation represent higher
variability in the data, and vice versa. If the value of the variance is
zero, there is no variability in the data. To obtain the standard deviation
of a set of measurements one has to carry out the following steps:

i) Calculate the mean of all the measurements.
ii) Calculate the difference between each individual measurement and the
mean.
iii) Square all these differences.
iv) Take the sum of all squared differences.
v) Divide this sum by the number of measurements minus one.
vi) Finally take the square root of the value obtained (in order to get back
to the same unit of measurement).

Example 6: Suppose you need to calculate the standard deviation of 2, 4, 6,


8, 10 and 12.

1. We first calculate the mean: the mean value is 7.
2. Next, we calculate the distance of each measurement from the mean,
ignoring the sign (2−7 = −5; 4−7 = −3; 6−7 = −1; 8−7 = 1; 10−7 = 3;
12−7 = 5), and then square the results to remove the negative signs
(25, 9, 1, 1, 9, 25).
3. The sum of these squared differences (25+9+1+1+9+25) is 70; divide this
by (n − 1 = 5): the variance = 70/5 = 14.
4. Finally, we take the square root to obtain the standard deviation (sd):
√14 = 3.74.

Fortunately many pocket calculators can do this calculation for us, but it is
still important to understand what it means. In the case of grouped data, the
mid value of the interval may be taken as observation value and the above
procedure can be followed.
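The six steps can also be sketched in Python for Example 6 (an illustrative
sketch; the shortcut line uses the computational formula given above):

```python
# Standard deviation of Example 6, step by step, plus the shortcut formula.
import math

data = [2, 4, 6, 8, 10, 12]
n = len(data)
mean = sum(data) / n                                   # step (i): 7.0

squared_diffs = [(x - mean) ** 2 for x in data]        # steps (ii) and (iii)
variance = sum(squared_diffs) / (n - 1)                # steps (iv) and (v): 14.0
sd = math.sqrt(variance)                               # step (vi)

# Shortcut: s^2 = [sum(x^2) - (sum(x))^2 / n] / (n - 1)
shortcut = (sum(x * x for x in data) - sum(data) ** 2 / n) / (n - 1)
print(round(sd, 2), shortcut)                          # 3.74 14.0
```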

In the above sections you studied about the measures of central tendency and
the measures of dispersion. Now try and answer the questions in Check Your
Progress 2.

Check Your Progress 2

Note:

a) Write your answer in about 50 words.


b) Check your answer with possible answers given at the end of the unit
1. What are the different measures of central tendency?
……………………………………………………………………………
……………………………………………………………………………
……………………………………………………………………………
……………………………………………………………………………
……………………………………………………………………………
2. What are the different measures of dispersion?
……………………………………………………………………………
……………………………………………………………………………
……………………………………………………………………………
4.6 HYPOTHESIS TESTING AND INFERENTIAL STATISTICS
4.6.1 Understanding True Difference
The analysis and interpretation of the results of our study must be related
to the objectives of the study. It is important to tabulate the data in
univariate and/or bivariate or multivariate tables appropriate to the
research objectives. We may find some interesting results. For example, in a
study on nutrition, we find that 30% of the women included in the sample are
anaemic as compared to only 20% of the men. How should we interpret this
result?

• The observed difference of 10% might be a true difference, which also
exists in the total population from which the sample was drawn.
• The difference might also be due to chance: in reality there is no
difference between men and women, but the sample of men just happened to
differ from the sample of women. One can also say that the observed
difference is due to sampling variation.
• A third possibility is that the observed difference of 10% is due to
defects in the study design (also referred to as bias). For example, we only
used male interviewers, or omitted a pre-test, so we did not discover that
anaemia is a very important topic for women, which requires a female
investigator.

If we feel confident that an observed difference between two groups cannot be
explained by bias, we would like to find out whether this difference can be
considered a true difference. We can only conclude that this is the case if
we can rule out chance (sampling variation) as an explanation. We accomplish
this by applying a test of significance. A test of significance estimates the
likelihood that the observed result (e.g., a difference between two groups)
is due to chance. In other words, a significance test is used to find out
whether a study result observed in a sample can be considered a result that
indeed exists in the study population from which the sample was drawn.

4.6.2 Tests of Significance


Different sets of data require different tests of significance. Throughout this
module, two major sets of data will be distinguished.

• Two (or more) groups, which will be compared to detect differences.


(e.g., men and women, compared to detect differences in anemia.)
• Two (or more) variables, which will be measured in order to detect if
there is an association between them. (e.g., between anemia and income.)

In order to help you choose the right test, a flowchart and matrices will be
presented for different sets of data. We will discuss how significance tests
work. Please keep in mind that independent groups are treated as independent
populations.

i) How Tests of Significance Work


The reasoning behind significance tests is the same, no matter whether a
researcher is comparing two groups for differences or whether s/he is
measuring two variables to detect possible associations.
We will first concentrate on the comparison of groups.
• Suppose you observed a difference between two groups in your
sample.
• You want to know whether this observed difference between the two
groups represents a real difference in the study population from
which the sample was drawn, or whether it just occurred by chance
(due to sampling variation).
• To find out, you determine how likely this difference could have
occurred by chance, if in the population no difference exists between
the two groups.
In any study that is looking for differences between groups or
associations between variables, the likelihood or probability (p) of
observing a certain result by chance has to be calculated by statistical
tests. The probability of observing a result by chance is usually
expressed as a p value.
In the anaemia study, the calculated p value, determining whether the
observed difference between men and women in their anaemic
proportion was due to chance, was 0.009. The chosen significance level
was 0.01 (1%), the ‘p = 0.009’ value is less than 0.01. We can therefore
be more than 99% sure that women are more anaemic than men in the
selected study population. We then say that this result is statistically
significant at the 0.01 level. If the p value had been higher than 0.01
(e.g., 0.03), the result would not have been statistically significant at the
0.01 level. Note that the same reasoning applies for associations between
variables.
Suppose we have categorised anaemia into four groups: no anaemia, moderate,
severe, and excessive. Income groups have been categorised as low, moderate,
and high. We observe a trend that anaemia is more severe or excessive in
lower income groups. The probability that this association between anaemia
and income occurs by chance now has to be calculated. The calculated ‘p’
value is 0.07. We had chosen a significance level of 0.05 (5%). As our p
value is higher than 0.05, this result is not statistically significant at
the 0.05 level, and we cannot be 95% sure that the association between
anaemia and income is a real one.

ii) How to state the Null (Ho) and Alternative (H1) Hypothesis:
In statistical terms the assumption that no real difference exists between
groups in the total study (target) population (or, that no real association
exists between variables) is called the Null Hypothesis (Ho). The
Alternative Hypothesis (H1) is that there exists a difference between
groups or that a real association exists between variables. Examples of
null hypotheses are
• There is no difference in the incidence of measles between
vaccinated and non-vaccinated children.
• There is no difference between the alcohol consumption of males and
females.
• There is no association between families’ income and malnutrition
in their children.
If the result is statistically significant, we reject the Null Hypothesis (Ho)
and accept the Alternative Hypothesis (H1) that there is real difference
between two groups, or a real association between two variables.
Examples of alternative hypotheses (H1) are:
• There is a difference in the incidence of measles between vaccinated
and non-vaccinated children.
• Males drink more alcohol than females.
• There is an association between families’ income and malnutrition in
their children.
Be aware that ‘statistically significant’ does not mean that a difference or
an association is of practical importance. The tiniest and most irrelevant
difference will turn out to be statistically significant if a large enough
sample is taken. On the other hand, a large and important difference may
fail to reach statistical significance if too small a sample is used.
iii) The Concept of Type I and Type II Error
There are four ways in which the conclusion of a test might relate to the
truth: (i) true positive, (ii) true negative, (iii) false positive, and (iv)
false negative. These may be expressed in terms of error in a statistical
test of significance as follows. Suppose, for example, we are testing whether
the effects of two drugs differ.
Type I error (α): We reject the null hypothesis when it is true; this is a
false positive error, or type I error ‘α’ (called alpha). It is the error of
detecting an effect that is not there.
In the drug example, a type I error would mean that the effects of the two
drugs were found to be different by statistical analysis when, in fact, there
was no difference between them.
Type II error (β): We accept the null hypothesis when it is false; this is a
false negative error, or type II error ‘β’ (called beta), and can be stated
as a failure to detect a true effect. In the drug example, a type II error
would mean that the effects of the two drugs were not found to be different
by statistical analysis when, in fact, there was a difference.

The definitions can be summarized as given below.

                                       Actual Situation
                                  True Ho               False Ho
Investigator's   Accept Null      Correct acceptance    Error (Type II)
Decision         hypothesis
                 Reject Null      Error (Type I)        Correct rejection
                 hypothesis

Note: Alpha (α) and beta (β) are the Greek letters and are used to denote
probabilities for type I error and type II error respectively.

We would like to carry out our test, i.e., choose our critical region, so as
to minimize both types of error simultaneously, but this is not possible for
a given fixed sample size. In fact, decreasing one type of error may very
likely increase the other type. In practice, we keep the type I error (α)
fixed at a specified value (i.e., at 1% or 5%).

4.7 CHOOSING A STATISTICAL TEST


4.7.1 Paired and Unpaired Tests
Two measurements or values of a variable in the same subject before and
after an intervention are called paired data. Measurements or values of a
variable in subjects who are not related to each other (e.g., in experimental
cases vs. control group) are called unpaired data. A few examples of paired
and unpaired data are given below:

Table 4.3: Illustration of paired and unpaired data

Sr Paired Data Unpaired Data


No
1. Blood pressure before and after Blood pressure in drug injected
exercise and control group
2. Bed occupancy in a hospital in Gene frequency in cancer patients
winter vs. summer vs. control group
3. Blood sugar before and after Blood sugar in diabetic and non –
lactose injection diabetic patients

4.7.2 Choosing a Statistical Test


Once a study has been completed and the data have been tabulated into a
spreadsheet, we need to decide which statistical test should be performed for
the different variables. The choice of test depends on the following factors:
(1) whether the variable is categorical or continuous; (2) whether the data
of the variable in the parent population are assumed to be normally
distributed (parametric) or not (non-parametric tests cannot assume that the
data of the parent population have some known shape/distribution); (3)
whether the data are paired or unpaired. Table 4.4 shows the choice of
various statistical tests based on these considerations.

Table 4.4: Choosing a statistical test as per data requirement

1. Type of hypothesis: compares two sets of paired observations on a single
   sample (related samples).
   Parametric test: one sample (paired) t-test.
   Equivalent non-parametric test: Wilcoxon sign rank test.
   Example: to compare income levels before and after providing skill-based
   training.

2. Type of hypothesis: compares two independent samples drawn from the same
   group of individuals (e.g., a group of students).
   Parametric test: two sample t-test (independent samples).
   Equivalent non-parametric test: Mann-Whitney U test.
   Example: to compare heights of boys and girls of the same class.

3. Type of hypothesis: comparison of three or more groups of continuous data.
   Parametric test: one way analysis of variance (F test) using total sum of
   squares.
   Equivalent non-parametric test: Kruskal-Wallis analysis of variance by
   rank.
   Example: to compare the effect on three groups given Counselling only,
   Counselling + Meditation, and Counselling + Meditation + Exercise.

4. Type of hypothesis: tests the null hypothesis that the proportions of
   discontinuous/categorical variables estimated from two or more independent
   samples are the same.
   Parametric test: McNemar's test.
   Equivalent non-parametric test: Chi-square test.
   Example: to assess whether students are more likely to pass if their
   family's earnings are higher.

5. Type of hypothesis: to assess the strength of linear association between
   two continuous variables.
   Parametric test: Pearson's (product moment) correlation coefficient.
   Equivalent non-parametric test: Spearman's rank correlation coefficient.
   Example: to assess whether education is related to income in the
   community.

6. Type of hypothesis: to describe the numerical relation between two
   quantitative variables, allowing one value to be predicted from the other.
   Parametric test: regression analysis by least squares method.
   Equivalent non-parametric test: no direct equivalent.
   Example: to evaluate how blood pressure increases with age.

Till now you have read about the various concepts of hypothesis testing. Now
try and answer the following questions in Check Your Progress 3.

Check Your Progress 3

Note:

a) Write your answer in about 50 words.


b) Check your answer with possible answers given at the end of the unit
1. What do you understand by a null and an alternative hypothesis?
……………………………………………………………………………
……………………………………………………………………………
……………………………………………………………………………
……………………………………………………………………………
……………………………………………………………………………
2. What are type I and type II errors?
……………………………………………………………………………
……………………………………………………………………………
……………………………………………………………………………
……………………………………………………………………………
……………………………………………………………………………

4.8 STATISTICAL TESTS


Depending on the aim of your study and the type of data collected, you have
to choose appropriate tests of significance. Before applying any statistical
test, state the null hypothesis in relation to the data to which the test is being
applied. This will enable you to interpret the results of the test. The following

sections will explain how you will choose an appropriate statistical test to
determine differences between groups or associations between variables.

4.8.1 Determining Significant Difference between Groups


There are several issues one should consider while deciding which test to use
to determine differences between groups. It is necessary to determine whether
the data are categorical (nominal and ordinal) or continuous (numerical).

Under both of these categories, you also need to decide whether you have
paired or unpaired observations. In paired observations, individual
observations in one data set (e.g., experimental) are matched with individual
observations in another data set (e.g., controls), for example, by taking care
that both participants in the study come from the same location, have the
same age, same sex, same financial background, etc.

For Categorical data, or Nominal data (paired or unpaired) the


significance test to be used depends on whether the sample is small or large.
There is no clear guide to what should be considered ‘small’ or ‘large’.
However, in the case of unpaired observations it is better to use Fisher’s
exact test rather than the Chi-square test if the total sample is less than 40, or
if any cell of the table, which must be constructed, has an expected number of
less than 5. If you have categorical data, the chi-square test is used to find
out whether observed differences between proportions of events in two or
more groups may be considered statistically significant.

For ordinal data the significance test to be used depends on whether only
two groups or more than two groups are being compared. The tests to be used
for comparing two groups are based on ranking of data: Wilcoxon’s two-
sample test, which gives equivalent results to the Mann-Whitney U-test, for
unpaired observations, and Wilcoxon’s signed rank test for paired
observations. We will not deal with these tests in this unit.

For continuous data (in interval, or, ratio scale), as for ordinal data, the
choice of an appropriate significance test depends on whether you are
comparing two groups or more than two groups. The z-test, referred to as the
standard normal variate, and the t-test, also referred to as the Student’s t-test,
are used for numerical data of continuous nature, when comparing the means
of two groups. The chi-square test is used for categorical data, when
comparing proportions of events occurring in two or more groups.

Steps to Follow for testing a Hypothesis

Step 1: Formulate the null hypothesis
Step 2: Formulate the alternative hypothesis
Step 3: Choose the level of significance of the test
Step 4: Choose an appropriate test statistic
Step 5: Compute the observed value of the test statistic
Step 6: Compare the calculated value of the statistic with the tabled value
Step 7: Accept or reject the null hypothesis

Here we will confine our discussion to four types of tests:

i) χ² (chi-square) test
ii) Z-test
iii) t-test
iv) F-test

4.9 CHI-SQUARE TEST


Suppose that in a cross-sectional study of the factors affecting the
utilisation of health care centres, you found that 64% of the women who lived
within 10 kilometres of a centre came for health check-ups, compared to only
47% of those who lived more than 10 kilometres away. This suggests that the
health care centre is used more often by women who live close to it. The
complete results are presented in Table 4.5.

Table 4.5: Utilization of health care centre by women living far from,
and near the clinic

Distance from health Used health Did not Use health Total
care centre care centre care centre
Less than 10km 51 (64%) 29 (36%) 80 (100%)
10km or more 35 (47%) 40 (53%) 75 (100%)
Total 86 69 155

From the table we conclude that there seems to be a difference in the use of
health care centre between those who live close to, and those who live far
from, the centre (64% versus 47%). We now want to know if this observed
difference is statistically significant or not. The chi-square (χ²) test,
used to test the statistical significance, is given below.

The chi-square statistic is

χ²(df) = Σ (Oi − Ei)² / Ei

where Σ (the summation sign, also called sigma) directs you to add together
the value of (Oi − Ei)²/Ei for all the ‘k’ cells of the table, Oi and Ei are
the observed and expected frequencies of each cell, and ‘df’ is the degrees
of freedom.

The alternative simplified formula for calculating chi-square is:

χ² = Σ (Oi² / Ei) − N, where N is the total of the observed frequencies.

These are the steps to follow in carrying out the test.

Hypothesis:

• Ho: There is no difference in the utilization of the health care centre
between women living within 10 km and women living 10 km or more away.
• H1: There is a difference in the utilization of the health care centre
between women living within 10 km and women living 10 km or more away.

Test Statistic: chi-square (χ²) test

Note

Chi (χ) is a Greek letter. The chi-square test can be used to give us the
answer. This test is based on measuring the difference between the observed
frequencies and the expected frequencies that would occur if the null
hypothesis (i.e., the hypothesis of no difference) were true. To perform a
chi-square test you need to calculate the chi-square value, use a chi-square
table, and interpret the result.

Calculating the chi-square (χ²) value

1) Calculate the expected frequency (Ei) for each cell. To find the expected
frequency Ei of a cell, multiply the row total by the column total of the
cell and divide by the grand total:
E = (Row total × Column total) / Grand total.
2) For each cell, subtract the expected frequency Ei from the observed
frequency Oi, square the difference (Oi − Ei), and divide it by the
corresponding expected frequency Ei. (You may skip this step and follow step
3 as an alternative.)
3) Alternatively, for each cell, simply square the observed frequency Oi and
divide it by the corresponding expected frequency Ei.
4) Find the sum of the values calculated in step (3) for all the cells and
subtract N (the total of the observed frequencies) from the sum.

For a 2×2 table the formula is:

χ² = (O1 − E1)²/E1 + (O2 − E2)²/E2 + (O3 − E3)²/E3 + (O4 − E4)²/E4

The alternative formula is:

χ² = (O1²/E1 + O2²/E2 + O3²/E3 + O4²/E4) − N

Using a chi-square table

The calculated chi-square value has to be compared with a theoretical
chi-square value in order to determine whether the null hypothesis is
rejected or not. Annex 2 contains a table of theoretical chi-square values.

1) First you must decide what significance, or alpha, level you want to use
(α value). We usually take 0.05.
2) Then, the degrees of freedom have to be calculated. With the chi-square
test the number of degrees of freedom is related to the number of cells,
i.e., the number of groups you are comparing. The number of degrees of
freedom is found as:

degrees of freedom = (Number of rows − 1) × (Number of columns − 1)
= (r − 1) × (c − 1)

For a simple 2 × 2 table, the number of degrees of freedom is 1:
d.f. = (2−1) × (2−1) = 1

3) Then the chi-square value belonging to the α value and the number of
degrees of freedom is located in the table.
Interpreting the results

♦ If the calculated chi-square value is equal to or larger than the chi-square


value from the table then we reject the null hypothesis and conclude that
there is a statistically significant difference between the groups.
♦ If the calculated chi-square value is smaller than the chi-square value
from the table, then we accept the null hypothesis and conclude that the
observed difference is not statistically significant (do not differ
significantly).

Let us now apply the chi-square test to the data given in Table 4.5
(utilization of the health care centre). This gives the following result:

Calculating the chi-square (χ²) value

First the expected frequencies for each cell are computed as follows:

Distance from health   Used health care       Did not use health     Total
care centre            centre                 care centre
Less than 10 km        E1 = 86×80/155 = 44.4  E2 = 69×80/155 = 35.6   80
10 km or more          E3 = 86×75/155 = 41.6  E4 = 69×75/155 = 33.4   75
Total                  86                     69                     155

Note that the expected frequencies refer to the values we would have
expected, given the totals of 80 and 75 women in the two groups, if the null
hypothesis (there is no difference between the two groups) were true. Now the
chi-square value can be calculated:

χ² = (51 − 44.4)²/44.4 + (29 − 35.6)²/35.6 + (35 − 41.6)²/41.6
   + (40 − 33.4)²/33.4
   = 0.98 + 1.22 + 1.05 + 1.30 = 4.55

With the alternative formula, the value of chi-square is the same:

χ² = (51²/44.4 + 29²/35.6 + 35²/41.6 + 40²/33.4) − 155 = 4.56

Using a Chi-square table

The number of degrees of freedom (d.f.) for 2×2 table is 1. Use the table of
chi-square values in Annex 1. We decided, beforehand, on a level of
significance of 5% (α-value = 0.05).

As the number of d.f. is 1, we look along that row in the column where p =
0.05. This gives us the tabulated value of 3.84.

Interpreting the result

Our calculated value of chi-square, 4.55, is larger than 3.84. Hence we
reject the null hypothesis.

We can now conclude that the women living within a distance of 10 km from
the centre utilise the health care centre significantly more often than the
women living more than 10 km away.

It is important to present your data clearly and to carefully formulate any
conclusions based on statistical tests in the final report of your study.

Table 4.5 indicates that 64% of the women living within a distance of 10 km
from the centre used the health care centre, compared to only 47% of women
living 10 km or further away from the nearest centre. This difference is
statistically significant (chi-square = 4.55; p < 0.05).
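For readers who compute by machine, the 2×2 calculation can be sketched in
pure Python (scipy.stats.chi2_contingency(observed, correction=False) would
give the same uncorrected value):

```python
# Hand-rolled 2x2 chi-square computation for Table 4.5.
observed = [[51, 29],     # less than 10 km: used / did not use
            [35, 40]]     # 10 km or more:   used / did not use

row_tot = [sum(r) for r in observed]           # 80, 75
col_tot = [sum(c) for c in zip(*observed)]     # 86, 69
N = sum(row_tot)                               # 155

chi2 = 0.0
for i in range(2):
    for j in range(2):
        E = row_tot[i] * col_tot[j] / N        # expected frequency of the cell
        chi2 += (observed[i][j] - E) ** 2 / E

# 4.57 with unrounded E values (the text gets 4.55 after rounding the E's);
# either way it exceeds 3.84, so H0 is rejected at the 0.05 level with 1 d.f.
print(round(chi2, 2))
```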

Note:

The chi-square test can only be applied if the sample is large enough. The
general rule is that the total sample should be at least 40 and the expected
frequencies in each of the cells should be at least 5. If this is not the
case, Fisher's exact test should be used. If the table is larger than
two-by-two, the expected frequency in 1 of 5 cells (not more than 20% of
cells) is allowed to be less than 5. (Please note, this refers to the
expected frequency and not the observed frequency.)

Unlike the t-test, the chi-square test can also be used to compare more than
two groups. In that case, a table with three or more rows/columns would be
designed, rather than a two-by-two table.

The chi-square test is always applied to absolute numbers, never to
percentage values.

A high chi-square value never, by itself, means a strong association. It only
means that the observed result is very unlikely to have arisen by chance.

It may be very misleading to pool dissimilar data. Pooling age groups, for
example, may mask an important real difference. In other situations pooling
the data may suggest a difference or association that does not exist, or even
a difference opposite to that which exists. It is, therefore, important to
analyze the data separately for different age groups, urban vs rural,
literate vs illiterate, etc.

4.10 F-TEST
The F-test is used for testing the hypothesis of equality of two population
variances, or of the equality of two or more population means. The ratio of
two sample variances is distributed as F.

i) Test of Equality of Two Population Variances

Let there be two normal populations N(µ1, σ1²) and N(µ2, σ2²), where µ1 and
µ2 are the population means and σ1² and σ2² are the population variances.
We use the F test to test the hypothesis
H0: σ1² = σ2²
H1: σ1² ≠ σ2²
Let an independent sample of size n1 be selected from population N(µ1, σ1²),
and one of size n2 from population N(µ2, σ2²).
The test statistic is:
F(k1, k2) = S1² / S2²
where k1 = (n1 − 1) and k2 = (n2 − 1) are the degrees of freedom.
As a norm, the larger variance is taken in the numerator and the degrees of
freedom corresponding to it are denoted k1.

Interpreting the results:

If the calculated value of F is greater than the tabled value, reject H0; if
not, accept H0.

Example 7: Life expectancy in 9 regions of Brazil in 1900, and in 11


regions of Brazil in 1970 was as given in the table below:

Regions Life Expectancy (Years)


1900 1970
1 42.7 54.2
2 43.7 50.4
3 34.0 44.2
4 39.2 49.7
5 46.1 55.4
6 48.7 57.0
7 49.4 58.2
8 45.9 56.6
9 55.3 61.9
10 57.5
11 53.4

We want to confirm whether the variation in life expectancy across regions in
1900 and in 1970 is the same or not. Let the populations in 1900 and 1970 be
N(µ1, σ1²) and N(µ2, σ2²) respectively.

We use the F test to test the hypothesis

H0: σ1² = σ2²

H1: σ1² ≠ σ2²

Calculating the test statistic

First we calculate S1² and S2²:

S1² = (1/8) [Σ x1i² − (Σ x1i)²/9] = 37.848

S2² = (1/10) [Σ x2i² − (Σ x2i)²/11] = 23.607

F = 37.848 / 23.607 = 1.603

Tabled value

Ftab=3.07

Interpreting the results

The tabled value of F is 3.07. Since Fcal < Ftab, we accept the null hypothesis.
It means that the variation in life expectancy in various regions of Brazil in
1900 and in 1970 is the same.
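A Python sketch of this variance-ratio test (illustrative), recomputing both
sample variances from the life-expectancy data of Example 7:

```python
# Variance-ratio F test for the Brazil life-expectancy data.
life_1900 = [42.7, 43.7, 34.0, 39.2, 46.1, 48.7, 49.4, 45.9, 55.3]
life_1970 = [54.2, 50.4, 44.2, 49.7, 55.4, 57.0, 58.2, 56.6, 61.9, 57.5, 53.4]

def sample_variance(x):
    m = sum(x) / len(x)
    return sum((v - m) ** 2 for v in x) / (len(x) - 1)

s1, s2 = sample_variance(life_1900), sample_variance(life_1970)
F = max(s1, s2) / min(s1, s2)   # larger variance in the numerator
print(round(F, 3))              # 1.603 < Ftab = 3.07, so H0 is accepted
```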

4.11 Z-TEST
To test a hypothesis about a population mean, or about two population means,
when the sample size is large (> 30) and the population variances are known,
we use the Z-test.

4.11.1 Testing the Significance of the Difference between Two Proportions
(Two Large Samples)
Let there be two large samples drawn from one population, or from two
populations having the same variance, the populations being normally
distributed. The proportions p1 and p2 are the two sample proportions of an
event (each sample > 30), which are compared for their difference. We may
apply the formula:

z = (p1 − p2) / √[P × Q × (1/n1 + 1/n2)]

where P = (n1 × p1 + n2 × p2) / (n1 + n2) and Q = 1 − P.

Here P is the pooled mean proportion of success of the two proportions p1 and
p2, and n1 and n2 are the respective sample sizes.

The calculated value of z is compared with the tabulated value of the
z normal variate, which is 1.96 for α = 0.05 (the 5% level of significance)
and 2.58 for α = 0.01 (the 1% level of significance).

Example 8: To test the conjecture of the management that 60% of employees
prefer the new bonus scheme, a sample of 150 employees was drawn and their
opinion was taken on whether they favoured it or not. Only 55 employees out
of 150 favoured the new bonus scheme.

Thus, we test the hypothesis H0: P = 0.60

H1: P ≠ 0.60

Calculating the test statistic:

Since p = 55/150 = 0.367, for this one-sample case

Z = (0.367 − 0.60) / √(0.60 × 0.40 / 150) = −5.83

| Z | = 5.83

The table value

At α = 0.01, Ztab = 2.58.

Interpreting the results

As Zcal > Ztab, H0 is rejected. It means that the proportion of employees who
favour the new bonus scheme is significantly different from, and in fact well
below, 60 per cent.
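A Python sketch of this one-sample proportion test (illustrative; the numbers
are those of Example 8):

```python
# One-sample proportion z test for the bonus-scheme survey.
import math

n, favoured, P0 = 150, 55, 0.60
p = favoured / n                                  # 0.367
z = (p - P0) / math.sqrt(P0 * (1 - P0) / n)
print(round(z, 2))   # -5.83; |z| > 2.58, so H0 is rejected at the 1% level
```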

4.11.2 Testing the Significance of the Difference between Means of Two Large
Samples (Continuous Data)
In the case of large samples where population variances are known, we test
hypotheses about population means using the z-test. For a single sample mean:

Hypothesis:
H0: µ = µ0
H1: µ ≠ µ0

Calculating the test statistic:

Z = (x̄ − µ0) / (σ/√n)

where x̄ is the sample mean and σ is the standard deviation.

The table value
The value of z(α) for comparing the calculated test statistic is taken as
1.96 at the 5% (α) error level, and 2.58 at the 1% (α) level. Here the sample
size is treated as large and, therefore, the degrees of freedom play no role,
unlike in the t-test.
Example 9: The table below gives the total income in thousand rupees per
year of 36 persons selected randomly from a particular class of people.
Income (thousand Rs)
6.5 10.5 12.7 13.8 13.2 11.4
5.5 8.0 9.6 9.1 9.0 8.5
4.8 7.3 8.4 8.7 7.3 7.4
5.6 6.8 6.9 6.8 6.1 6.5
4.0 6.4 6.4 8.0 6.6 6.2
4.7 7.4 8.0 8.3 7.6 6.7
On the basis of the sample data, can it be concluded that the mean income of
a person in this class of people is Rs. 10,000 per year?

We have to test the hypothesis H0: µ = 10 (thousand rupees)

H1: µ ≠ 10

Calculating the test statistic:

Since the sample size is 36, we use the normal test, for which the test
statistic is

Z = (x̄ − µ0) / (σ/√n)

Now we compute x̄ and σ:

x̄ = 280.7/36 = 7.80

σ² = (1/35) [2368.75 − (280.7)²/36] = 5.14

σ = 2.27

Z = √36 × (7.80 − 10) / 2.27 = −5.81

| Z | = 5.81

The table value: Ztab = 1.96

Interpretation: Since Zcal > Ztab, reject H0. It means that the average
annual income differs significantly from ten thousand rupees; in fact, it is
less.
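The same computation sketched in Python (illustrative), working from the raw
income data rather than the rounded intermediate values, which is why the
result differs from −5.81 in the second decimal:

```python
# z test for a mean, recomputed from the raw income data of Example 9
# (values in thousand rupees).
import math

income = [6.5, 10.5, 12.7, 13.8, 13.2, 11.4,
          5.5, 8.0, 9.6, 9.1, 9.0, 8.5,
          4.8, 7.3, 8.4, 8.7, 7.3, 7.4,
          5.6, 6.8, 6.9, 6.8, 6.1, 6.5,
          4.0, 6.4, 6.4, 8.0, 6.6, 6.2,
          4.7, 7.4, 8.0, 8.3, 7.6, 6.7]

n = len(income)                                        # 36
mean = sum(income) / n                                 # 7.80
var = sum((x - mean) ** 2 for x in income) / (n - 1)   # 5.14
z = (mean - 10) / math.sqrt(var / n)

# -5.83 with unrounded intermediates (the text gets -5.81 after rounding);
# |z| > 1.96, so H0 is rejected.
print(round(z, 2))
```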

Assumption for use of the z statistic: The assumption for using the z
statistic is that the parent population, from which the samples have been
drawn, should be normal. The z statistic presumes that the population
variances (σ1² and σ2²) of the parent populations are known and, therefore,
the z statistic for testing the significance of the difference between two
means is defined as

z = (X̄1 − X̄2) / √(σ1²/n1 + σ2²/n2)

Since the population variances are presumed to be known, the samples drawn
from the populations should also be normally distributed. It is, therefore,
pertinent to verify whether the samples satisfy the criterion of normality or
not. If the samples drawn from the population are sufficiently large, then
one can estimate the population variances from the samples themselves. As in
practice the population variances are not known, the researcher has to
estimate them from the samples (sd1² and sd2²) and use the z statistic

z = (X̄1 − X̄2) / √(sd1²/n1 + sd2²/n2)

For the z statistic, we refer to the tables of the normal area curve, where
degrees of freedom do not play any role. The computation of the z statistic
is almost the same as for the t statistic, except for the standard error
(SE), which is given as √(sd1²/n1 + sd2²/n2).

4.12 t-TEST
4.12.1 Testing the Significance of Independent Samples from Two Groups for
Continuous Data
Example 10: It has been observed that in a certain province the proportion
of women joining the army is very high. A study is, therefore, conducted to
discover why this is the case. The height of women is supposed to be a
contributory factor; the researcher may want to find out if there is a
difference between the mean height of women in this province who preferred
joining the army and of those who opted for other services. The null
hypothesis would be that there is no difference between the mean heights of
the two groups of women. Suppose the following results were found:

Table 4.6: Mean heights of women joining the army and of women opting for other services

Type of Service            Sample size   Mean height in cm   Standard deviation
Joined army (Group 1)      n1 = 60       X̄1 = 156            sd1 = 3.1
Other services (Group 2)   n2 = 52       X̄2 = 154            sd2 = 2.8

The mean height for each of the two samples was calculated and compared, using the t-test, to determine whether there was a difference.

These are the steps to follow in carrying out the test.

Hypothesis:

Ho: There is no significant difference between the heights of the women joining the army and those joining other services.

H1: There is a difference between the heights of the women joining the army and those joining other services.

Test Statistic: A t-test would be the appropriate way to determine whether the observed difference of 2 cm can be considered statistically significant.

To calculate the t-value, you need to complete the following tasks:

t(df) = (X̄1 - X̄2) / [S √(1/n1 + 1/n2)],   degrees of freedom df = n1 + n2 - 2

Pooled variance S² = [(n1 - 1)sd1² + (n2 - 1)sd2²] / (n1 + n2 - 2)

(1) Difference between the means. In the above example the difference is (156 - 154) = 2 cm.

(2) Calculate the standard deviation (square root of the pooled variance S²) of all observations pooled together for both the samples.

In case the standard deviations for each of the study groups are given or have been calculated, compute the pooled variance (S²) as follows:

S² = (59 × 9.61 + 51 × 7.84) / (59 + 51) = 966.83 / 110 = 8.80

Standard deviation (of pooled sample) S = 2.96

(3) Calculate the standard error of the difference between the two means:

SE = S √(1/n1 + 1/n2) = 2.96 × 0.1895 = 0.5608

(4) Finally, divide the difference between the means by the standard error of the difference. The value now obtained is called the t-value:

t = 2 / 0.5608 = 3.57

The Table value

Once the t-value has been calculated, you will have to refer to a t-table, from which you can determine whether the null hypothesis is rejected or not. Annex II contains a t-table.

1) First, decide which significance level (α value) you want to use. Remember that the chosen significance level (α value) is an expression of the likelihood of finding a difference by chance when there is no real difference. Usually we choose a significance level of 0.05.

2) Second, determine the number of degrees of freedom for the test being performed. Degrees of freedom are a measure derived from the sample size, which has to be taken into account when performing a t-test. The bigger the sample size (and the degrees of freedom), the smaller the difference needed in order to reject the null hypothesis. For Student's t-test the number of degrees of freedom is calculated as the sum of the two sample sizes minus 2. Thus, for Example 10, comparing the heights of women opting for the army and other services:

The number of degrees of freedom is: d.f. = 60 + 52 - 2 = 110

3) Third, the t-value belonging to the α value (the significance level we chose) and the degrees of freedom is located in the table.

♦ In our example we look up the t-value belonging to α = 0.05 and d.f. = 120 (nearest to 110 in the table) and we find it is 1.98.
Interpreting the result

♦ If the calculated t-value is equal to or larger than the value derived from the table, the p-value is smaller than the chosen α value (indicated at the top of the column). We then reject the null hypothesis and conclude that there is a statistically significant difference between the two means.

♦ If the calculated t-value is smaller than the value derived from the table, the p-value is larger than the α level we chose. We then accept the null hypothesis and conclude that the observed difference is not statistically significant.

We now compare the absolute value of the t-value calculated in Step 1 (i.e., the t-value, ignoring the sign) with the t-value derived from the table in Step 2.

In our example the t-value calculated in Step 1 is 3.57, which is larger than the t-value derived from the table in Step 2 (1.98). We, therefore, reject the null hypothesis and conclude that the observed difference of 2 cm between the mean heights of women who joined the army and women who opted for other services is statistically significant. The same computation can be done in a line or two of code, as sketched below.
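Where SciPy is available, the same pooled-variance t-test can be reproduced directly from the summary figures of Table 4.6; a sketch using scipy.stats.ttest_ind_from_stats, which performs the equal-variance test described above:

```python
from scipy import stats

# Summary figures from Table 4.6 (equal_var=True gives the pooled-variance test)
t, p = stats.ttest_ind_from_stats(mean1=156, std1=3.1, nobs1=60,
                                  mean2=154, std2=2.8, nobs2=52,
                                  equal_var=True)
print(round(t, 2), round(p, 4))   # t about 3.56 (3.57 after hand rounding); p < 0.05
```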

Caution: The reader should understand the basic differences between the use of z and t statistics. The following assumptions will clarify the differences in the use of the two statistics.

Assumption for use of t statistics: In t statistics, we draw two samples from one population with unknown variance, or draw two samples from two populations whose variances are presumed to be equal, but unknown. The crux of the problem is that the population variance(s) of the parent population(s) is/are not known; therefore, we estimate the variance (σ²) of the population from the samples (S²) themselves. The data of the two samples are pooled together to estimate the common variance of the two populations, as given below:

t(df) = (X̄1 - X̄2) / [S √(1/n1 + 1/n2)],   Pooled variance S² = [(n1 - 1)sd1² + (n2 - 1)sd2²] / (n1 + n2 - 2)

(The mathematical reason for doing so is complicated and beyond the scope of this study material.) Further, the t distribution tolerates moderate skewness in the samples or populations, and the pooled variance from the two samples is a better estimate of the population variance. It is, therefore, advantageous to use t statistics rather than z statistics when the population variances are not known.

4.12.2 Paired t-Test

This section describes the most commonly used tests for paired observations pertaining to numerical and to nominal data. The concept of pairing or matching subjects is illustrated by the examples that follow.

Example 11: A researcher wanted to find out whether a class of students taught with audio-visual aids (AV) receives, on average, better grades than one taught without audio-visual aids; in other words, whether there is any gain in achievement due to the two teaching methods. To minimize the effect of confounding variables such as social status and previous knowledge of the subjects, each student in the AV class was paired with another in the non-AV class of similar social status and knowledge level. Paired or matched observations are used when researchers want to ensure, through their study design, that the relationship between the two variables they are interested in is not confounded by another variable. They, therefore, have to sample their cases and controls in such a way that these are similar.
Example 12: During a nutritional survey, a quality control exercise was


carried out to check the agreement between two observers in measuring the
children’s weight. In this instance we have paired observations as we have a
set of two observations on the same child.

When dealing with paired (matched) observations, comparison of sample means is performed by using a modified t-test known as the paired t-test. In the paired t-test, differences between the paired observations (say, post-test minus pre-test, or matched observations of the 2nd group minus the 1st group) are used instead of the observations of two sets of independent samples. The paired t-test calculates the value of t as

t(df) = (mean of differences) / (standard error of differences) = d̄ / [sd(d)/√n]

where d̄ is the mean of the values obtained by subtracting the two sets of paired observations; sd(d) is the standard deviation of those differences; n is the number of paired observations; and the degrees of freedom (df = n - 1) is the number of paired observations (sample size) minus 1.

The same table of t-values is used as for the t-test for unpaired observations (see Annex II) to interpret the result of the study. The use of the paired t-test is illustrated with the results of the nutritional survey referred to in the previous example. The results are given in Table 4.7.

Table 4.7: Results of quality control exercise during a nutritional survey

[Table not reproduced: 20 pairs of weight readings by observers A and B; the differences sum to 21.1, with a standard deviation of 1.77.]
The null hypothesis in this study is

Ho: The mean difference of measurements between observers A and B would be zero

H1: The mean difference of measurements between observers A and B would not be zero

Calculating the test statistic

The paired t-test (significance test) is calculated as follows:

i) Calculate the mean difference of the measurements between A and B in the sample. This is the sum of the differences divided by the number of measurements:

Mean difference d̄ = 21.1 / 20 = 1.05

ii) Calculate the standard deviation of the differences and the standard error:

Standard deviation sd(d) = 1.77, and the standard error SE = 1.77 / √20 = 0.40

iii) The value of t is the mean difference divided by the standard error:

t = 1.05 / 0.40 = 2.62

The Table Value

The degrees of freedom are the sample size (the number of pairs of observations) minus 1, which in this case is (20 - 1) = 19. The tabled t-value at 19 degrees of freedom is 2.09.

The Interpretation

tcal > ttab

Since the calculated t-value (ignoring the sign) is larger than the value indicated in the table, the null hypothesis, stating that there is no difference, is rejected, and it can be concluded that there is a significant difference between the measurements of the two observers.

Note: Computers are helpful when dealing with large data sets. A variety of software, including Excel and SPSS, provides options for various statistical tests; a brief sketch in Python is given below.
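As an illustration, a paired t-test can be run in Python with SciPy. A sketch: the eight pairs below are hypothetical stand-ins, since the individual readings of Table 4.7 are not reproduced here:

```python
from scipy import stats

# hypothetical paired weight readings (kg) by observers A and B
a = [10.2, 9.8, 11.5, 12.0, 10.9, 9.4, 11.1, 10.5]
b = [ 9.1, 8.8, 10.3, 11.2, 10.0, 8.5, 10.1,  9.4]

t, p = stats.ttest_rel(a, b)      # paired t-test on the differences a - b
print(round(t, 2), round(p, 4))   # reject H0 if p < 0.05
```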

Check Your Progress 4

Note:

a) Write your answer in about 50 words.


b) Check your answer with possible answers given at the end of the unit
1. What are the important considerations that one should keep in mind
while applying the chi-square test?
……………………………………………………………………………
……………………………………………………………………………
……………………………………………………………………………
……………………………………………………………………………
……………………………………………………………………………
2. What is a paired t-test and when is it used?
……………………………………………………………………………
……………………………………………………………………………
……………………………………………………………………………
……………………………………………………………………………
……………………………………………………………………………

4.13 CORRELATION ANALYSIS


When exploring associations between variables we have to distinguish between nominal, ordinal, and continuous data. In this unit, we will examine associations between continuous data where a linear relationship is suspected. For associations between ordinal data, Spearman's rank correlation coefficient or Kendall's tau can be calculated and tested for significance.

4.13.1 Understanding Correlation

Correlation is the relationship between two sets of continuous data: for example, the relationship between height and body weight. Correlation statistics are used to determine the extent to which two variables are related, and the relationship can be expressed by a measure called the 'coefficient of correlation'. The correlation coefficient may be positive or negative and varies from -1 to +1. Positive correlation means that the values of the two variables increase and decrease together; for example, height and weight correlate positively. Negative correlation means that as the value of one variable increases, the value of the other decreases (an inverse relationship); for example, literacy and the number of children in a family may correlate negatively.

The strength of a correlation is determined by the absolute value of the correlation coefficient; the closer the value to 1, the stronger the correlation. For example, a correlation of -0.9 indicates an inverse relationship between two variables and shows a stronger relationship than a correlation of +0.2 or -0.5. Correlation between two variables is shown by a scatter plot (Fig. 4.2).

Correlation analysis is important because it can be used to predict the values of one variable on the basis of the values of the other. A correlation does not mean causation, but it also does not mean the absence of causation; if two variables exhibit strong correlation, one of the variables may cause the other. Correlation data are, therefore, not sufficient evidence for causation. Consider the scatter diagrams in Fig. 4.2:

Figure 4.2: Scatter Diagram showing relation between two variables

[Scatter plots not reproduced: A – positive high correlation; B – negative high correlation; C – positive low correlation; D – negative low correlation; E – no correlation (no relationship between y and x). Each panel plots y against x.]

The slopes of the fitted lines are identical in the high- and low-correlation examples, but the scatter around the line is much greater in the latter. Clearly the relationship between y and x is much closer where the correlation is high.

If we are interested only in measuring the association between the two variables, then Pearson's correlation coefficient (r) gives us an estimate of the strength of the linear association between two numerical variables. Pearson's correlation coefficient can be calculated using a calculator or on a computer using various analytical software. Note that where the relationship is curvilinear, the value of r may be close to zero even though the variables are strongly related. The correlation coefficient has the following properties:

1. For any data set, r lies between -1 and +1.

2. If r = +1 or -1, the linear relationship is perfect, that is, all the points lie exactly on a straight line. If most of the points lie close to the line, the relationship is very strong and r is near 1. If r = +1, variable y increases as x increases (i.e., the line slopes upwards; see Diagram A). If r = -1, variable y decreases as x increases (i.e., the line slopes downwards; see Diagram B).

3. If r lies between 0 and +1, the line slopes upwards, but the points are scattered about the line (see Diagram C). The same is true of negative values of r, between 0 and -1, but in this case the line slopes downwards (see Diagram D).

4. If r = 0, there is very low linear relationship between y and x. This may mean that there is no relationship at all between the two variables, i.e., knowing x tells us nothing about the value of y (see Diagram E).

4.13.2 Calculation of the Pearson's Correlation Coefficient

In a nutrition study in a large rural district, a sample of 20 children 5 years of age was weighed and their family incomes estimated. The results were as follows:

Table 4.8: Weights and family incomes of 20 children 5 years of age

Serial   Family Income   Weight   Serial   Family Income   Weight
Number   ($ per year)    (kg)     Number   ($ per year)    (kg)
1.       130             15.5     11.      225             18.1
2.       200             19.8     12.      95              17.4
3.       345             21.5     13.      130             17.9
4.       245             18.8     14.      330             17.0
5.       155             12.8     15.      295             18.7
6.       300             18.8     16.      170             16.0
7.       360             18.1     17.      250             18.2
8.       105             18.7     18.      355             16.4
9.       80              13.1     19.      220             15.4
10.      275             20.1     20.      175             17.6

The formula for calculating the correlation coefficient is as follows:

r = Σ(Xi - X̄)(Yi - Ȳ) / √[ Σ(Xi - X̄)² Σ(Yi - Ȳ)² ]

or, in its computational form,

r = [ ΣXiYi - (ΣXi)(ΣYi)/n ] / √[ (ΣXi² - (ΣXi)²/n)(ΣYi² - (ΣYi)²/n) ]

It would be informative to investigate whether the two variables 'family income' and 'weight of five-year-olds' are associated. In our example of weight and family income, with ΣXi = 4440, ΣYi = 349.9, ΣXiYi = 79416.5, ΣXi² = 1141550, ΣYi² = 6210.77 and n = 20, this gives:

r = (79416.5 - 4440 × 349.9/20) / √[(1141550 - 19713600/20)(6210.77 - 122430/20)]

r = 0.466

which is positive (indicating an upward-sloping line), but well away from 1 (indicating that there is plenty of scatter around the line). A short computation in code is sketched below.
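A sketch of the same calculation in Python with SciPy, using the twenty income-weight pairs of Table 4.8:

```python
from scipy import stats

income = [130, 200, 345, 245, 155, 300, 360, 105,  80, 275,
          225,  95, 130, 330, 295, 170, 250, 355, 220, 175]
weight = [15.5, 19.8, 21.5, 18.8, 12.8, 18.8, 18.1, 18.7, 13.1, 20.1,
          18.1, 17.4, 17.9, 17.0, 18.7, 16.0, 18.2, 16.4, 15.4, 17.6]

r, p = stats.pearsonr(income, weight)   # r about 0.47, as computed by hand above
print(round(r, 3), round(p, 4))
```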

4.13.3 The Significance of the Correlation Coefficient

The value of r was calculated from a sample of just 20 children. The result is, therefore, subject to sampling error, and is unlikely to be equal to the true value of r, which we would obtain if we measured all 5-year-old children in this district. The question arises as to whether there really is any relationship at all between weight and income. Perhaps, in the entire population of 5-year-old children, the scatter diagram would look like Diagram E above (no relationship between y and x), and the positive relationship in our sample occurred by chance. To assess whether this is the case, we do a significance test on r. The null hypothesis is that in the whole population there is no linear relationship between y and x. To do the test we calculate

t(n-2) = r × √[(n - 2)/(1 - r²)]

We compare this value of t with the tables of the t distribution with (n - 2) degrees of freedom, where n is the number of observations.

In our example: n = 20, r = 0.466, t(18) = 0.466 × √[18/(1 - 0.466²)] = 2.23

Therefore, using an α value (chosen p-value) of 0.05, the t-table value for 18 degrees of freedom (t18; 0.05) = 2.10. The calculated t-value is more than the table value; this means that the p-value (for rejecting the null hypothesis) is less than 0.05, and therefore the linear relationship is statistically significant. This test, too, is easily checked in code, as sketched below.
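The significance test on r can likewise be checked in code; a sketch, assuming SciPy for the critical value:

```python
from math import sqrt
from scipy import stats

r, n = 0.466, 20
t = r * sqrt((n - 2) / (1 - r ** 2))      # about 2.23
t_crit = stats.t.ppf(0.975, df=n - 2)     # two-sided 5% cut-off, about 2.10
print(round(t, 2), round(t_crit, 2), t > t_crit)   # True: reject H0
```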

4.14 REGRESSION ANALYSIS

A simple linear regression fits a straight line through a set of n points in such a way that it makes the sum of squared residuals of the model (that is, the vertical distances between the points of the data set and the fitted line) as small as possible. In simple terms, simple linear regression involves finding the equation of a line such that, if the observations are plotted in a scatter diagram, most of the observed points lie close to the line and the sum of squared distances between the line and the observed points is a minimum.

4.14.1 Fitting the Regression Line

Suppose there are n data points {yi, xi}, where i = 1, 2, …, n. The goal is to find the equation of the straight line

ŷ = α + βx

where α is the intercept of the regression line and β is its slope, which provides the best fit for the data points. Here the best fit is understood in the least-squares sense: the line that minimizes the sum of squared residuals Q of the linear regression model.

By using calculus, it can be shown that the values of α and β that minimize the objective function Q are

β = Σ(xi - x̄)(yi - ȳ) / Σ(xi - x̄)² = rxy (sy / sx),   α = ȳ - β x̄

where rxy is the sample correlation coefficient between x and y, sx is the standard deviation of x, and sy is correspondingly the standard deviation of y. A horizontal bar over a variable denotes the sample average of that variable: for example, x̄ = (1/n) Σ xi.

Sometimes people consider a simple linear regression model without the intercept term: y = βx. In such a case the OLS estimator for β is given by

β = Σ xiyi / Σ xi²

In simple regression analysis, there is one quantitative dependent variable and one independent variable. In multiple regression analysis, there is one quantitative dependent variable and two or more independent variables. For example, one may derive a formula to predict blood pressure (dependent variable) from age, diet, exercise level, genetic background, etc. (independent variables). Linear regression finds the best-fit line (line of regression) that predicts the dependent variable from the independent variables. Linear regression is applied to data where the dependent variable is continuous, while the independent variables may be continuous and/or categorical. It is beyond the scope of this study material to give more detailed information about regression analysis.

4.14.2 How to Find the Regression Equation

In the table below, the xi column shows scores on the aptitude test and the yi column shows statistics grades. The last two rows show the sums and mean scores that we will use to conduct the regression analysis.

Student   xi    yi    (xi - x̄)   (yi - ȳ)   (xi - x̄)²   (yi - ȳ)²   (xi - x̄)(yi - ȳ)
1         95    85    17         8          289         64          136
2         85    95    7          18         49          324         126
3         80    70    2          -7         4           49          -14
4         70    65    -8         -12        64          144         96
5         60    70    -18        -7         324         49          126
Sum       390   385                         730         630         470
Mean      78    77

The regression equation is a linear equation of the form ŷ = α + βx. To conduct a regression analysis, we need to solve for α and β. The computations are shown below.

β = Σ[(xi - x̄)(yi - ȳ)] / Σ[(xi - x̄)²] = 470/730 = 0.644

α = ȳ - β x̄ = 77 - (0.644)(78) = 26.768

Therefore, the regression equation is: ŷ = 26.768 + 0.644x.

4.14.3 How to Use the Regression Equation

Once you have the regression equation, using it is easy. Choose a value for the independent variable (x), perform the computation, and you have an estimated value (ŷ) for the dependent variable. In our example, the independent variable is the student's score on the aptitude test and the dependent variable is the student's statistics grade. If a student scored 80 on the aptitude test, the estimated statistics grade would be

ŷ = 26.768 + 0.644x = 26.768 + 0.644 × 80 = 26.768 + 51.52 = 78.288

The whole fit-and-predict computation is sketched in code below.
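A minimal sketch of the whole fit-and-predict computation in plain Python, using the five aptitude-grade pairs above:

```python
x = [95, 85, 80, 70, 60]   # aptitude scores
y = [85, 95, 70, 65, 70]   # statistics grades

n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n               # 78 and 77
beta = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
        / sum((xi - x_bar) ** 2 for xi in x))       # 470/730, about 0.644
alpha = y_bar - beta * x_bar                        # about 26.768
y_hat = alpha + beta * 80                           # predicted grade, about 78.29
print(round(alpha, 3), round(beta, 3), round(y_hat, 2))
```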

4.14.4 Difference between Regression and Correlation

1. Correlation quantifies the degree to which two variables are related: you simply compute a correlation coefficient (r) that tells you how much one variable tends to change when the other one does. Regression, by contrast, finds the best-fit line for a given set of variables.

2. With correlation you don't have to think about cause and effect; you simply quantify how well two variables relate to each other. With regression, you do have to think about cause and effect, as the regression line is determined as the best way to predict Y from X.

3. With correlation, it doesn't matter which of the two variables you call "X" and which you call "Y"; you'll get the same correlation coefficient if you swap the two. With linear regression, the decision of which variable you call "X" and which you call "Y" matters a lot, as you'll get a different best-fit line if you swap them. The line that best predicts Y from X is not the same as the line that predicts X from Y.

4. Correlation is almost always used when you measure both variables; it is rarely appropriate when one variable is something you experimentally manipulate. With linear regression, the X variable is often something you experimentally manipulate (time, concentration, ...) and the Y variable is something you measure.

5. In correlation, the focus is on measuring the strength of the relationship, and all the variables are implicitly taken to be random. In regression analysis, we examine the nature of the relationship between the dependent and independent variables; at our level, the dependent variable is taken as random (stochastic) and the independent variables as non-random or fixed.

Use of Data Reduction Methods in Development Research

There are also data reduction techniques that are extensively used in development studies but are not within the scope of this unit. One such technique, Principal Component Analysis, is very commonly used, generally when the number of explanatory variables is very high. In Principal Component Analysis, the variables are compressed into a smaller number of variables called principal components. Statistical software can be used to work out a Principal Component Analysis easily. For example, suppose a researcher has identified 20 variables that influence a farmer's adaptation to climate change. Principal component analysis will help in grouping similar variables into components, thus reducing the number of variables: the analysis may identify just three or four principal components into which all these variables can be grouped. The first principal component will account for the maximum amount of variation in the existing variables, followed by the second, third, and fourth. This kind of analysis helps the researcher to identify those variables which, if worked upon, would increase the adaptation of the farmers to climate change. A brief sketch is given below.
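A minimal sketch with scikit-learn, assuming that library is available; the data are randomly generated stand-ins for real survey variables:

```python
import numpy as np
from sklearn.decomposition import PCA

# hypothetical data: 100 farmers scored on 20 adaptation-related variables
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))

pca = PCA(n_components=4)              # compress 20 variables into 4 components
scores = pca.fit_transform(X)          # component scores for each farmer
print(pca.explained_variance_ratio_)   # share of total variation per component
```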

Check Your Progress 5

Note:

a) Write your answer in about 50 words.


b) Check your answer with possible answers given at the end of the unit
1. Differentiate between correlation and regression.
……………………………………………………………………………
……………………………………………………………………………
……………………………………………………………………………
……………………………………………………………………………
……………………………………………………………………………

4.15 LET US SUM UP

Statistics is a science that deals with the collection, organization, analysis, interpretation, and presentation of data. The information or data collected may be classified as qualitative or quantitative; it may also be classified as discrete or continuous. A frequency distribution is an improved way of presenting data. For better and more concise presentation of the information contained in a data set, the data are subjected to various calculations. If one wants to further summarize a set of observations, it is often helpful to use a measure which can be expressed as a single number, like the measures of location or measures of central tendency of the distribution. The three measures used for this purpose are the mean, median, and mode. Measures of dispersion, on the other hand, give an idea of the extent to which the values are clustered or spread out; in other words, they give an idea of the homogeneity or heterogeneity of the data. Two sets of data can have similar measures of central tendency but different measures of dispersion. Therefore, measures of central tendency should be reported along with measures of dispersion. The measures of dispersion include the range, percentiles, mean deviation, and standard deviation.

The results we obtain by subjecting our data to analysis may be actually true or may be due to chance or sampling variation. In order to rule out chance as an explanation, we use tests of significance. In this unit we have confined our discussion to four tests, i.e., the χ² test, Z-test, t-test and F-test.

Correlation is the relationship between two sets of continuous data, for example the relationship between height and body weight. Correlation statistics are used to determine the extent to which two variables are related, expressed by a measure called the coefficient of correlation. Regression, on the other hand, deals with the cause-and-effect relation between two sets of data. Simple linear regression fits a straight line through a set of n points in such a way that it makes the sum of squared residuals of the model (that is, the vertical distances between the points of the data set and the fitted line) as small as possible. The regression line thus obtained helps us to predict the value of the dependent variable for a given value of the independent variable.

Annex I: Table of chi-square values

Annex II: Table of t values

4.16 KEYWORDS
Independent variable: the characteristic being observed or measured which
is hypothesized to influence an event or outcome (dependent variable), and is
not influenced by the event or outcome, but may cause it, or contribute to its
variation.

Dependent variable: a variable whose value is dependent on the effect of


other variables (independent variables) in the relationship being studied.
Mean: the mean (or arithmetic mean) is also known as the average. It is calculated by totalling the results of all the observations and dividing by the total number of observations.

Median: the median is the value that divides a distribution into two equal halves. The median is useful when the measurements are on an ordinal scale, or when some measurements are much bigger or much smaller than the rest.

Mode: the mode is the most frequently occurring value in a set of


observations. The mode is not very useful for numerical data that are
continuous. It is most useful for numerical data that have been grouped. The
mode is usually used to find the norm among populations.

Range: this can be represented as the difference between maximum and


minimum value or, simply, as maximum and minimum values.

Percentiles: percentiles are points that divide all the measurements into 100 equal parts. The 30th percentile (P30) is the value below which 30% of the measurements lie. The 50th percentile (P50), or the median, is the value below which 50% of the measurements lie.

Mean Deviation: this is the average of the absolute deviations from the arithmetic mean.

Standard Deviation: this denotes (approximately) the extent of variation of


values from the mean.

Parametric statistical test is a test whose model specifies certain conditions


about the parameters of the parent population from which the sample was
drawn.

Non-parametric statistical test is a test whose model does not specify


conditions about the parameters of the parent population from which sample
was drawn.

Normal Distribution: the normal distribution is symmetrical around the mean. The mean, median, and mode assume the same value if the observations (data) follow a normal distribution.

Sampling Variation: any value of a variable obtained from a randomly selected sample (e.g., a sample mean) cannot be assumed to equal the true value in the population. This variation between sample values is called sampling variation.

Test of Significance: a test of significance estimates the likelihood that an observed study result (e.g., a difference between two groups) is due to chance rather than a real effect.

4.17 BIBLIOGRAPHY AND SELECTED READINGS

Altman, D.G. (1991), Practical Statistics for Medical Research, Chapman and Hall, London.

Barker, D.J.P. (1982), Practical Epidemiology (3rd ed.), Churchill Livingstone, Edinburgh, UK.

Bradford, H.A. (1984), A Short Textbook of Medical Statistics (11th ed.), Hodder and Stoughton, London, UK.

Castle, W.M. and P.M. North (1995), Statistics in Small Doses, Churchill Livingstone, Edinburgh, UK.

Fletcher, R.H., S.W. Fletcher and E.H. Wagner (1996), Clinical Epidemiology: The Essentials, Lippincott Williams and Wilkins, Baltimore, Maryland, USA.

Glaser, A.N. (2000), High-yield Biostatistics, Lippincott Williams and Wilkins, Philadelphia, USA.

Greenhalgh, T. (1998), How to Read a Paper: The Basics of Evidence Based Medicine, BMJ Publishing Group, London, UK.

Hicks, C.M. (1999), Research Methods for Clinical Therapists (3rd ed.), Churchill Livingstone, Edinburgh, UK.

Kelsey, J.L., W.D. Thompson and A.S. Evans (1986), Methods in


Observational Epidemiology, Oxford University Press, Oxford, UK.

Kidder, L.H. and C.M. Judd (1986), Research Methods in Social Relations, CBS College Publishing, New York, USA.

Kleinbaum, D.G., L.L. Kupper and H. Morgenstern (1982), Epidemiologic Research - Principles and Quantitative Methods, Van Nostrand Reinhold, New York, USA.

Riegelman, R.F. (1981), Studying a Study and Testing a Test, Little Brown
and Company, Boston, MA, USA.

Schlesselman, J.J. (1982), Case-Control Studies - Design, Conduct, Analysis,


Oxford University Press, Oxford, UK.

Siegel, S. (1956), Nonparametric Statistics for the Behavioral Sciences,


McGraw-Hill Book Company.

Swinscow, T.D.V. and M.J. Campbell (1998), Statistics at Square One (11th
ed.), British Medical Association, London, UK.

4.18 CHECK YOUR PROGRESS – POSSIBLE
ANSWERS
Check Your Progress 1

Answer 1: There are two types of data: (i) qualitative data, viz., occupation, sex, marital status, religion; and (ii) quantitative data, viz., age, weight, height, income, etc. These may be further categorized into two types, viz., discrete and continuous data.

Answer 2: A non-parametric statistical test is a test whose model does not


specify conditions about the parameters of the parent population from which
sample was drawn.

Check Your Progress 2

Answer 1: The three measures of central tendency are the mean, median, and
mode.

Answer 2: The measures of dispersion are range, percentiles, mean deviation


and standard deviation.

Check Your Progress 3

Answer 1: In statistical terms, the assumption that no real difference exists


between groups in the total study (target) population (or that no real
association exists between variables) is called the Null Hypothesis (Ho). The
Alternative Hypothesis (H1) is that there exists a difference between groups
or that a real association exists between variables.

Answer 2: Type I error (α): We reject the null hypothesis when it is true; a false positive error, or type I error α (called alpha). It is the error of detecting an effect that is not really present.

Type II error (β): We accept the null hypothesis when it is false; a false negative error. Simply, type II error β (called beta) is the failure to detect a true effect.

Check Your Progress 4

Answer 1: The chi-square test can only be applied if the sample is large enough: the total sample should be at least 40 and the expected frequencies in each of the cells should be at least 5. Unlike the t-test, the chi-square test can also be used to compare more than two groups; in that case, a table with three or more rows/columns would be designed, rather than a two-by-two table. The chi-square test is always applied to absolute numbers, not to percentage values. A high chi-square value does not mean a strong association; it only means a low probability of obtaining such a value by chance. It may be very misleading to pool dissimilar data.
Answer 2: When dealing with paired (matched) observations, comparison of sample means is performed by using a modified t-test known as the paired t-test. In the paired t-test, differences between the paired observations (say, post-test minus pre-test, or matched observations of the 2nd group minus the 1st group) are used instead of the observations of two sets of independent samples.

Check Your Progress 5

Answer 1: Correlation statistics are used to determine the extent to which two variables are related, expressed by a measure called the coefficient of correlation. The correlation coefficient may be positive or negative and may vary from -1 to +1. In simple regression analysis, there is one quantitative dependent variable and one independent variable. In multiple regression analysis, there is one quantitative dependent variable and two or more independent variables.
