EPSC 123

The document provides an overview of statistics in education, detailing the two main types: descriptive and inferential statistics, along with essential concepts such as population, sample, parameters, and variables. It explains measurement theory, scales of measurement, and the organization of data through various methods including statistical tables and graphical representations. Additionally, it covers the properties of class intervals, grouping errors, and the advantages of graphical data representation.

Uploaded by

1960dembe
EPSC 123: STATISTICS IN EDUCATION

Statistics is the science of planning, collecting, organizing, interpreting, describing and
analyzing data. It also refers to indices which are derived from data through statistical
procedures. There are two main types of statistics:-

a) Descriptive Statistics
These are indices which describe a sample. They are the statistical characteristics of a
sample. They include measures of central tendency, measures of dispersion, distributions and
relationships. Descriptive statistics provide summaries of raw data, making it meaningful and
easy to understand.

b) Inferential Statistics
These are the statistics used to make inferences about a given phenomenon in a population.
Inferences are made about the population based on the statistics of the sample. The purpose
of inferential statistics is to test hypotheses and to enable generalization of sample findings to
the population.

Basic Concepts in Statistics


1) Population and Sample
Population refers to an entire group of individuals, events or objects having a common
observable or measurable characteristic. It is the aggregate of all that conforms to a given
specification. E.g. All students at Egerton University.

A sample is a sub-group or sub-set of the population which is selected in the study and actually
measured or observed. Each member of a sample is called a subject, respondent or interviewee.
e.g. 100 students from Egerton University.

2) Parameters and Statistics


A parameter is a characteristic that is measurable and can assume different values in the
population. It is a characteristic of the entire population, e.g. Kenya’s per capita income.

A statistic is a characteristic that is measurable in a sample. It is a sample characteristic derived
through statistical procedures and upon which inferences are made to the population. Statisticians
use different notations for parameters and statistics.

MEASURE               SAMPLE [Statistics]    POPULATION [Parameters]
Size                  n                      N
Mean                  X̄, Ȳ                   µx, µy
Standard deviation    S or S.D               σ
Variance              S² or Var.             σ²

3) Variables and Constants


A variable is a measurable characteristic that assumes different values among subjects. It is a
logical way of expressing a particular attribute in a sample, e.g. height, weight etc.

A constant is a characteristic that assumes one value which does not vary in any way, e.g. π or
the acceleration due to gravity [about 9.8 m/s²].

There are two main types of variables:-


a) Discrete variables: - which take only a finite number of values in whole units e.g. number
of eggs, cars etc.
b) Continuous variables: - which can take an infinite number of values with any degree of
subdivision of units e.g. temperature, weight, length, height etc.

THEORY OF MEASUREMENT
Measurement is the act of assigning numerical values to observations. Measurement requires
skill and it is done according to certain rules. Physical measurement is easy but abstract measures
are taken by use of standard tests. Observation is the act of carefully and objectively recording
an item of information according to specified procedure or rule. Observation of the same
characteristic is done following same procedure.

Scales of Measurement
Scales of measurement indicate the degree to which numbers assigned to observations
correspond to the characteristics of the variable under study. It is the indication of the amount of
observation. There are four scales or levels of measurement.

1) Nominal Scale
This is the lowest scale of measurement which merely groups subjects into categories or
labels. These categories are such that a subject can only belong to one category and not more
than one. Nominal scale is simplistic and observations can only be statistically analyzed if
they are converted to meaningful figures. E.g. an individual is either old or young. A, is old
while B, is young.

2) Ordinal Scale
This scale provides a ranked order of observations. In ordinal scale, numerals are used to
represent relative position or order among the values of variables. In ordinal scale, the terms
equal, not equal, less than, and greater than are used. In ordinal scale, the degree of how
greater or how less is not indicated e.g. A, is older than B. N.B the degree of difference is not
indicated.

3) Interval Scale
In interval scale, numerals assigned to each measure are ranked in order and the intervals
between numerals are equal. The intervals between the points on the scale become
meaningful in terms of qualitative and quantitative intervals. It gives how much less or
greater an observation is in relation to others. E.g. A, is 20 yrs older than B.
4) Ratio Scale
This is the highest level of measurement. It has the characteristics of the other scales and in
addition it has an absolute zero. In a ratio scale, zero means the absence of the attribute in question.
Its main advantage is that it enables ratio comparisons between measurements, e.g. A is 40 yrs old and
B is 20 yrs old, thus A is twice as old as B.

Properties of the Levels of Measurement
Property               Nominal   Ordinal   Interval   Ratio
Discrete Classes       √         √         √          √
Order among Classes              √         √          √
Equal Intervals                            √          √
Absolute Zero                                         √

Rounding off
Measurements in education and psychology cannot be as exact as in physical science. They are
thus approximated by rounding off the final answer to two decimal places. In rounding off, look
at the digit after the one to be retained. If it is less than 5, drop it, if it is 5 and above, raise the
figure to be retained by 1.
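
The rounding rule above can be sketched in Python as follows. This is only an illustration and is not part of the original notes; the function name `round_half_up` is an assumption. The `decimal` module is used because Python's built-in `round` sends halves to the nearest even digit, which differs from the rule described here.

```python
from decimal import Decimal, ROUND_HALF_UP


def round_half_up(value, places=2):
    """Round to `places` decimals; a following digit of 5 or more raises the retained digit."""
    quantum = Decimal(10) ** -places          # Decimal("0.01") when places == 2
    # Convert through str() so the decimal digits are taken as written.
    return Decimal(str(value)).quantize(quantum, rounding=ROUND_HALF_UP)


print(round_half_up(44.045))  # 44.05
print(round_half_up(7.164))   # 7.16
```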

ORGANISATION OF DATA
Raw data which is collected during a study is disorganized and has no meaning. In order to
understand and interpret it, it should be organized in the following ways:-
• Frequency Distributions
• Graphic Representations
• Proportions
• Statistical Tables
• Rank Orders etc.

A. Statistical Tables
Data is arranged into properly selected rows and columns. The various rows and columns are
well titled. From these tables, statistics derived include percentages, variance, and S.D etc.
Construction of the tables follows these rules:-
• The title should be simple and should appear at the top of the table.
• Rows and columns should be arranged logically to facilitate comparison.
• Headings and subheadings of rows and columns should be brief.
• Units used must be indicated e.g. m, kg, s etc.
• If data is secondary, the source should be indicated at the end of the table.
• In tabulating long columns of figures, spaces should be left after every 5 or 10 rows.
• Large numbers should be grouped in threes.
• Vertical lines between columns are not allowed and horizontal lines should only be at the
top and at the bottom of the table.

Table 1 A Summary of Scores of CAT and Exam for 2001, 2002 and 2003 B.Ed and Aged
Groups
Year     B. Ed              Aged
         CAT     EXAM       CAT     EXAM
2001     66      70         70      68
2002     64      66         68      66
2003     65      66         71      68

B. Rank Order
The raw data is arranged either in ascending order or descending order. Usually the merit of
individual scores is used in ranking. All rules for table construction apply.

Table 2 A Group Summary of 5 Students Arranged In Order of Increasing Age in Years.


Student Number Age
1 14
2 15
3 17
4 18
5 24

C. Frequency Distribution
Frequency of a score refers to the number of times a score is repeated in a given series.
Frequencies are important because scores in most data sets are not unique i.e. the same score can
occur more than once. A frequency distribution is a listing in order of magnitude of each score together with
the number of times the score occurs [its frequency]. E.g. the intelligence scores of 14 students
were obtained as 130, 115, 120, 115, 105, 110, 110, 100, 95, 95, 100, 90, 85 and 75. A frequency
distribution for this data would be as follows:-

Table 3 A Frequency Distribution of IQ Scores for Students.


IQ Scores Frequency
130 1
120 1
115 2
110 2
105 1
100 2
95 2
90 1
85 1
75 1
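
As an illustration, the frequency distribution of the 14 IQ scores can be tallied with a short Python sketch. The function name `frequency_distribution` is an assumption made for this example.

```python
from collections import Counter

iq_scores = [130, 115, 120, 115, 105, 110, 110, 100, 95, 95, 100, 90, 85, 75]


def frequency_distribution(scores):
    """Return (score, frequency) pairs listed from the highest score to the lowest."""
    counts = Counter(scores)                  # tally how often each score occurs
    return sorted(counts.items(), reverse=True)


for score, frequency in frequency_distribution(iq_scores):
    print(score, frequency)
```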

When the data is very large a grouped frequency distribution is used, where data is grouped into
arbitrarily chosen classes or groups. The following systematic steps are used in the construction
of a grouped frequency distribution:-

(i) Have the raw data set.
(ii) Determine the range of the scores i.e. subtract the lowest score from the highest score.
(iii) Determine the class interval. Two methods can be used :-

a) Decide the number of classes you need, then divide the range by the number of
classes you need; or
b) Decide on the class interval first, write out the classes of the distribution, and then
determine the frequencies.

Example: Group the following scores of a test for students into a frequency table.
68 53 62 61 58 58 57 56 54
54 46 52 52 51 51 50 49 48
47 43 46 45 45 45 45 44 43
43 63 42 42 41 41 41 40 39
39 38 38 37 37 37 36 35 33
32 30 26 26 21
• The range = 68 - 21 = 47
• Choose 10 classes, then the class interval is 47/10 = 4.7
• Round this up and choose a class interval of 5 for the data set.
The grouped frequency table is as follows:-

Table 4 Grouped Frequency Distribution for a Students’ Test


i        f     Exact limits    Mid point    C.F    C.P.F
65-69    1     64.5-69.5       67           50     100%
60-64    3     59.5-64.5       62           49     98%
55-59    4     54.5-59.5       57           46     92%
50-54    8     49.5-54.5       52           42     84%
45-49    9     44.5-49.5       47           34     68%
40-44    10    39.5-44.5       42           25     50%
35-39    9     34.5-39.5       37           15     30%
30-34    3     29.5-34.5       32           6      12%
25-29    2     24.5-29.5       27           3      6%
20-24    1     19.5-24.5       22           1      2%
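
The grouping steps can be checked with a small Python sketch. It is an illustration under stated assumptions: scores are whole numbers, the lowest class starts at 20, and the function name `grouped_frequency` is hypothetical.

```python
scores = [
    68, 53, 62, 61, 58, 58, 57, 56, 54,
    54, 46, 52, 52, 51, 51, 50, 49, 48,
    47, 43, 46, 45, 45, 45, 45, 44, 43,
    43, 63, 42, 42, 41, 41, 41, 40, 39,
    39, 38, 38, 37, 37, 37, 36, 35, 33,
    32, 30, 26, 26, 21,
]


def grouped_frequency(scores, lowest_limit, width):
    """Return rows (lower, upper, f, c.f, c.p.f) from the lowest class upward."""
    n = len(scores)
    rows = []
    cumulative = 0
    lower = lowest_limit
    while lower <= max(scores):
        upper = lower + width - 1             # classes such as 20-24 hold whole-number scores
        f = sum(1 for s in scores if lower <= s <= upper)
        cumulative += f
        rows.append((lower, upper, f, cumulative, 100 * cumulative / n))
        lower += width
    return rows


# Print with the highest class at the top, as recommended in the notes.
for lower, upper, f, cf, cpf in reversed(grouped_frequency(scores, 20, 5)):
    print(f"{lower}-{upper}  f={f}  c.f={cf}  c.p.f={cpf:.0f}%")
```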

Properties of Class Intervals


(i) Class intervals should be mutually exclusive i.e. a score should not belong to more than
one class.
(ii) All classes should be of the same size.
(iii) Class intervals should be continuous throughout the distribution.

(iv) Class intervals with the highest score should be at the top and the one with the lowest
score should be at the bottom. This is not mandatory but is recommended because
ranking is logically from top to bottom.

Grouping Error
The mid point of a class is assumed to be the score of every individual in the class. The size of
the resulting error for a given score depends on how far that score lies from the midpoint. This error
cannot be eliminated; it is an inherent characteristic of grouped data.

N.B. The larger the interval the greater the error.

The C.F. and C.P.F columns are important because:-


(i) They indicate the proportion of cases lying above or below a certain score.
(ii) They indicate the relative position of an individual score in relation to the others in the
distribution.

D. Graphical Representation of Data


A graphical representation is a geometrical image of a set of data. It is a mathematical means of
enabling one to understand a statistical problem in visual ways.

Advantages of Graphics in Statistics


(i) Data can be represented in a more attractive way.
(ii) Graphics provide a more lasting effect on the brain.
(iii) Comparative analysis and interpretation can be easily and effectively made.
(iv) Various statistics like median, mode, quartiles, and correlation coefficients can be easily
computed.
(v) It is easy to estimate, evaluate, and interpret the characteristics of a set of data.
(vi) Graphical representations are economical and effective.
(vii) It helps in forecasting as it indicates the trends of the data.

Graphics Used for Ungrouped Data
In drawing graphics, the following rules are observed:-
(i) Graphics are referred to as figures.
(ii) The title and number of the graph should be at the bottom of the graph.
(iii) A graph has two axes. The horizontal axis is called the abscissa or x-axis, while the
vertical axis is called the ordinate or y-axis.
(iv) In most cases the frequency is indicated on the ordinate while the values of the variable
are indicated on the abscissa.
(v) The ordinate should be three quarters as long as the abscissa [3/4 rule].
(vi) If a scale does not start from zero then it is important to break the abscissa.

(i) Bar Graphs


A bar graph is used to represent discrete data, which is either ordinal or nominal. The data is
represented by horizontal or vertical bars. The length of each bar is proportional to the amount
of the variable it represents in the set of data. Bar width is not governed by any rules and it is
conventional to leave spaces between bars. The space width should be about half of the bar width
but that is not mandatory.

Example
The following is a distribution of the number of students from various provinces attending
Egerton University.

Province No. of students
Coast 570
Central 956
Eastern 350
North Eastern 105
Rift valley 868
Nyanza 750
Western 971
Total 4,570

(ii) Pie Chart
This is a circular graphic in which data is represented in sectors of a circle. The areas of the
sectors should be proportional to the amount of variable, and the angular difference of each sector
is calculated as:-

$$X^\circ = \frac{f}{N} \times 360^\circ$$

Where, f – Frequency or amount of variable
N – Total number of subjects i.e. sample size
X – Angular difference [the angle of the sector]
The amount of variable is also given as a percentage.

Example
A pie chart for the above data would be as follows:-
Percentage for coast = 570/4570 X 100
= 12.5%
Angular difference = 570/4570 X 360
= 44.9º
Province % Angle
Coast 12.5% 44.9º
Central 20.9% 75.3º
Eastern 7.7% 27.6º
N. eastern 2.3% 8.3º
R. valley 19% 68.4º
Nyanza 16.4% 59.1º
Western 21.2% 76.5º
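
The percentages and angles in the table can be reproduced with a short Python sketch. It is only an illustration; the function name `pie_sectors` is an assumption.

```python
students = {
    "Coast": 570, "Central": 956, "Eastern": 350, "North Eastern": 105,
    "Rift Valley": 868, "Nyanza": 750, "Western": 971,
}


def pie_sectors(frequencies):
    """Map each category to (percentage, angle in degrees), both rounded to one decimal."""
    total = sum(frequencies.values())         # N, the total number of subjects
    return {
        name: (round(100 * f / total, 1), round(360 * f / total, 1))
        for name, f in frequencies.items()
    }


for province, (percent, angle) in pie_sectors(students).items():
    print(province, percent, angle)
```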

Graphics for Grouped Data
Graphics used to represent grouped data include histograms, frequency polygon, cumulative
frequency curve and cumulative percentage frequency ogive.

(i) Histogram
A histogram, which is also called a column graph, is a bar graph constructed from a grouped
frequency distribution. A histogram is different from a bar graph in that:-

• No spaces are left between bars.
• Exact class limits are used as boundaries of the bars.
• It is customary to take two extra intervals, one below and one above the given
distribution.

Example
The following data represents scores of 35 students in a test :- 69, 64, 63, 61, 59, 57, 56, 55, 54,
54, 53, 52, 52, 52, 51, 49, 49, 49, 48, 48, 47, 47, 46, 45, 44, 44, 44, 44, 43, 43, 43, 42, 42, 41, 40.

To draw a histogram, a grouped frequency distribution is first drawn as follows:-

Class F Class Limits Midpoints
65-69 1 64.5-69.5 67
60-64 3 59.5-64.5 62
55-59 4 54.5-59.5 57
50-54 7 49.5-54.5 52
45-49 9 44.5-49.5 47
40-44 11 39.5-44.5 42

(ii) Frequency Polygon
This is a line joining points which are plotted using the midpoints of classes against the
frequencies. Unlike in the bar graph, points of the graph are joined using straight lines. One
extra class with zero frequency is added at either extreme to close the polygon.

N.B. A frequency polygon can be drawn by joining the midpoints of the bars in a histogram.

Example
From the data above a frequency polygon is drawn as follows:-

(iii) Cumulative Frequency Curve
This is a line in which the exact class limits are plotted against the cumulative frequencies of the
classes. It is normally a sigmoid curve. The plotted points are joined using a smooth continuous
line.

Example

In the above example

Class F U.C.L C.F


65-69 1 69.5 35
60-64 3 64.5 34
55-59 4 59.5 31
50-54 7 54.5 27
45-49 9 49.5 20
40-44 11 44.5 11

(iv) Cumulative Percentage Frequency Curve [Ogive]
In this graph, the upper class limits are plotted against the cumulative percentage frequencies.
Then the points are connected using a continuous smooth line. It has a shape similar to the
cumulative frequency curve. It is useful in a quick determination of certain statistics i.e.
quartiles, deciles, percentiles and the median.

Example
For the above data, the c.p.f curve is determined as follows:-

Class f u.c.l c.f c.p.f


65-69 1 69.5 35 100%
60-64 3 64.5 34 97.1%
55-59 4 59.5 31 88.6%
50-54 7 54.5 27 77.1%
45-49 9 49.5 20 57.1%
40-44 11 44.5 11 31.4%
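
The c.f and c.p.f columns can be built from the frequencies with a short Python sketch. It is an illustration only; the function name `cumulative_table` is an assumption, and the classes are listed from the lowest upward because accumulation starts at the bottom of the table.

```python
from itertools import accumulate

classes = ["40-44", "45-49", "50-54", "55-59", "60-64", "65-69"]
frequencies = [11, 9, 7, 4, 3, 1]


def cumulative_table(freqs):
    """Return (c.f list, c.p.f list) for frequencies ordered from the lowest class upward."""
    total = sum(freqs)
    cf = list(accumulate(freqs))              # running total of the frequencies
    cpf = [round(100 * c / total, 1) for c in cf]
    return cf, cpf


cf, cpf = cumulative_table(frequencies)
for name, c, p in zip(classes, cf, cpf):
    print(name, c, f"{p}%")
```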

MEASURES OF CENTRAL TENDENCY
Measures of central tendency or centrality measures are values or characteristics that describe the
average of a distribution. They are values around which other items of a distribution congregate.
They condense a huge set of numerical data into single numerical values which are
representative of the entire distribution.

The Mean
This is the average which is obtained by dividing the sum total of the scores in a distribution by
the number of scores i.e. sample size or population size. Means are derived for interval and ratio
data and they help to give the general characteristic or trend of a sample or population.

Properties of the Mean


(i) The mean is appropriate for interval and ratio data.
(ii) The mean may be interpreted as a point of balance where equal weight is placed at each
measurement point.
(iii) The mean is a property of all observations in a distribution.
(iv) If all observations of a distribution are added, subtracted, multiplied or divided by a
constant, then the mean is also added, subtracted, multiplied or divided by the same
constant.
(v) The aggregate sum of all the deviations of a given set of observations from the mean is
zero.
Merits of the Mean
(i) It is rigidly defined. Its definition is clear, unambiguous and unique.
(ii) It is easy to calculate and comprehend.
(iii) It is based on all observations.
(iv) It is affected least by fluctuations of sampling i.e. it is a stable average.
Demerits of the Mean
(i) It is very much affected by the extreme observations [outliers].
(ii) It cannot be determined by simple inspection, nor can it be located graphically.
(iii) It cannot be used when dealing with qualitative characteristics which cannot be
quantified. e.g. beauty, honesty, etc.

The mean is calculated as follows:-

$$\bar{X} = \frac{\sum X}{N}$$

Where, X̄ – Mean
∑X – Summation of the scores X
N – Sample or population size

Example
The mean of the scores 7, 7, 6, 5, 4, 4, 4, 3 is calculated as:-

X̄ = (7+7+6+5+4+4+4+3)/8
= 40/8
= 5
For grouped data the mean is calculated as:-

$$\bar{X} = \frac{\sum fx}{N} = \frac{\sum fx}{\sum f}$$

Where, f – Frequency
x – Midpoints of classes

E.g. determine the mean of the data represented in the table below.
CLASS f x f[x] c.f
65-69 1 67 67 50
60-64 3 62 186 49
55-59 4 57 228 46
50-54 7 52 364 42
45-49 9 47 423 35
40-44 11 42 462 26
35-39 8 37 296 15
30-34 4 32 128 7
25-29 2 27 54 3
20-24 1 22 22 1

∑f = 50 ∑fx =2230

X = 2230/50
= 44.60
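
The grouped mean can be verified with a minimal Python sketch. The function name `grouped_mean` is an assumption; the midpoints and frequencies are those of the table above.

```python
midpoints = [67, 62, 57, 52, 47, 42, 37, 32, 27, 22]
frequencies = [1, 3, 4, 7, 9, 11, 8, 4, 2, 1]


def grouped_mean(midpoints, frequencies):
    """Mean of grouped data: sum of f*x divided by sum of f."""
    total_fx = sum(f * x for x, f in zip(midpoints, frequencies))
    return total_fx / sum(frequencies)


print(grouped_mean(midpoints, frequencies))  # 44.6
```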

Weighted Mean
This is obtained when the relative importance of all items in a distribution is not the same.
A weight proportional to its relative importance is given to each value.

$$\bar{X}_w = \frac{\sum w_i X_i}{\sum w_i}$$

The overall mean for two sets of data with sample sizes n1 and n2 and means
X̄1 and X̄2 is:

$$\bar{X} = \frac{n_1 \bar{X}_1 + n_2 \bar{X}_2}{n_1 + n_2}$$

Each separate sample mean receives weight proportional to the respective sample size.

Example
The mean yearly income for 200 civil servants in Nakuru is 21,600/=, whereas the mean yearly
income for 400 municipal workers is 19,500/=. What is the overall average salary for these
public servants?

X̄ = (200 × 21,600 + 400 × 19,500)/600
= 20,200/=
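
The same calculation as a short Python sketch, with the sample sizes used as weights. The function name `weighted_mean` is an assumption made for this illustration.

```python
def weighted_mean(values, weights):
    """Sum of weight*value divided by the sum of the weights."""
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)


# Mean yearly incomes of the two groups, weighted by the group sizes.
print(weighted_mean([21600, 19500], [200, 400]))  # 20200.0
```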

The Mode
This is the most frequently occurring score in a distribution. It is the score with the highest
frequency in a distribution, also known as the typical value. In grouped data, the mode is the
midpoint of the class with the highest frequency. This class is known as the modal class. E.g. in
the above example, the modal class is 40-44, while the mode is 42.

Properties of the Mode


(i) A distribution may have more than one mode. E.g. scores 3, 4, 4, 4, 5, 6, 7, 7, 7, 8, 10, 11
have two modes 4 and 7. Such a distribution is said to be ‘bimodal’.
(ii) It is possible for a set of scores or measurements not to have a mode when all scores in a
group have the same frequency. In this case, the mode is not helpful in the distribution.
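
Both properties can be illustrated with a small Python sketch. The function name `modes` is an assumption; returning an empty list when every score is equally frequent is one possible convention for "no mode", chosen here for illustration.

```python
from collections import Counter


def modes(scores):
    """Return every score with the highest frequency, or [] when all frequencies are equal."""
    counts = Counter(scores)
    highest = max(counts.values())
    if len(counts) > 1 and all(c == highest for c in counts.values()):
        return []                             # every score equally frequent: no mode
    return sorted(s for s, c in counts.items() if c == highest)


print(modes([3, 4, 4, 4, 5, 6, 7, 7, 7, 8, 10, 11]))  # [4, 7]
print(modes([1, 2, 3]))                                # []
```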

Merits of the Mode


(i) It is easily understood and calculated.
(ii) It is not affected at all by extreme observations.
(iii) It can be conveniently obtained in case of open ended classes.

Demerits of the Mode


(i) It is affected greatly by the fluctuations of sampling.
(ii) It is not based on all observations in a distribution.

The Median
The median is the 50th percentile in a group of scores. It is the point below and above which 50%
of the scores fall. It divides ranked scores into two equal parts such that it exceeds and is
exceeded by the same number of observations. The determination of the median is dependent on
the type of data. For ungrouped data, the median is obtained as follows:-
• Arrange the data in order of magnitude.
• Determine the value at the exact centre in the order.
Example
The data below represents scores in a CAT
9, 9, 6, 4, 7, 6, 8, 2, 9, 10, 3
In order of magnitude
2, 3, 4, 6, 6, 7, 8, 9, 9, 9, 10.
The median is 7.

N.B. If the number of scores is even, the median is halfway between the two scores of the mid pair. In this
case, determine the mid pair and average them e.g. (6+7)/2 = 6.5
If similar scores surround the median e.g. 7, 7, 7, determine the exact lower and exact upper
limits of 7 i.e. 6.5 and 7.5. Divide the interval by 3 and add the value of 2 portions to the lower limit
e.g. 6.5 + 0.67 = 7.17; this is the median.

For grouped data, determine the median class, then obtain the median as follows:-
$$\text{Median} = L + \left( \frac{N/2 - F}{f} \right) i$$

Where, L – Exact lower limit of the median class
F – Total frequencies below the median class
f – Frequency of the median class
i – Class interval
N – Total frequencies/sample size/population size

Example
In the grouped distribution above there are 50 scores, hence the median lies between the 25th and 26th scores.
This falls in the class interval 40-44. Hence;
L = 39.5, F = 15, f = 11, i = 5, N = 50
Median = 39.5 + {(50/2 - 15)/11} × 5
= 39.5 + {10/11} × 5
= 44.05
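
Both median calculations can be checked with a short Python sketch. The function name `grouped_median` and its parameter names are assumptions; `statistics.median` from the standard library handles the ungrouped case.

```python
from statistics import median


def grouped_median(lower_limit, cum_below, class_freq, width, n):
    """Median = L + ((N/2 - F) / f) * i for a grouped frequency distribution."""
    return lower_limit + ((n / 2 - cum_below) / class_freq) * width


# Ungrouped CAT scores: the middle value of the ordered data.
print(median([9, 9, 6, 4, 7, 6, 8, 2, 9, 10, 3]))      # 7

# Grouped example: L = 39.5, F = 15, f = 11, i = 5, N = 50.
print(round(grouped_median(39.5, 15, 11, 5, 50), 2))   # 44.05
```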

Merits of the Median


(i) It is rigidly defined.
(ii) It is easy to understand and compute.
(iii) It is not affected at all by the extreme observations since it is a positional average.
(iv) It can be calculated when dealing with a distribution having open ended classes.
(v) It can sometimes be located by simple inspection and can be determined graphically.

Demerits of the Median
(i) It cannot be obtained exactly for an even number of observations.
(ii) It is not a function of all observations in a distribution.
(iii) It is affected by fluctuations of sampling and thus is a less stable average than the mean.

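Mathematical relationship between the 3 averages
For a moderately skewed distribution, the mode [Mo], the median [Md] and the mean [M] are
approximately related by:

Mo = 3Md – 2M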

MEASURES OF DISPERSION [VARIABILITY]

These are measures which provide information on variability and give the amount of
scatter of scores from the centre of a distribution.

The Range
This is the simplest variability measure and it is the difference between the maximum and the
minimum observations in a distribution. It indicates the number of values over which the
distribution spans. It is denoted r.

E.g. scores 7, 7, 2, 5, 6, 5, 4, 3, 4, 3
Range = 7 - 2
r = 5

Merits of the Range


(i) It is the simplest though crude measure of dispersion.
(ii) It is rigidly defined.
(iii) It is easy to comprehend.
(iv) It is easy to compute.

Demerits of the Range


(i) It is not a function of all observations.
(ii) It is affected by fluctuations of sampling.

The Standard Deviation
It is the positive square root of the arithmetic mean of the squares of the deviations of the scores
from their arithmetic mean.

For ungrouped data:

$$\sigma_x = \sqrt{\frac{1}{n} \sum (x - \bar{x})^2}$$

For grouped data:

$$\sigma_x = \sqrt{\frac{1}{n} \sum f (x_i - \bar{x})^2}$$

N.B
i) The S.D is always calculated about the mean.
ii) The S.D value depends on the numerical values of the deviations.
iii) The S.D will be greater if the scores are widely spread away from the mean. I.e. the
distribution is heterogeneous.
iv) The S.D will be small if the distribution is homogeneous.

Merits of the S.D


(i) It is rigidly defined.
(ii) It is based on all observations.
(iii) It is less affected by fluctuations of sampling.

Demerits of the S.D


(i) The general nature of extracting the square root is not readily comprehensible to non-
mathematicians.

Example
Obtain the S.D for the following scores.
80, 45, 55, 56, 58, 60, 65, 68, 70, 65, 75, 85, 82, 86, 50, 48, 60, 62, 64, 70.

X X - X̄ (X - X̄)²
86 20.8 432.64
85 19.8 392.04
82 16.8 282.24
80 14.8 219.04
75 9.8 96.04
70 4.8 23.04
70 4.8 23.04
68 2.8 7.84
65 -0.2 0.04
65 -0.2 0.04
64 -1.2 1.44
62 -3.2 10.24
60 -5.2 27.04
60 -5.2 27.04
58 -7.2 51.84
56 -9.2 84.64
55 -10.2 104.04
50 -15.2 231.04
48 -17.2 295.84
45 -20.2 408.04

∑X = 1304    ∑(X - X̄) = 0    ∑(X - X̄)² = 2717.2

X̄ = 1304/20
= 65.20

σx = √ {1/n ∑(X - X̄)²}
= √ (1/20 × 2717.2) = √ 135.86 = 11.66
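
The working above can be checked with a minimal Python sketch. The function name `population_sd` is an assumption; it divides by n, as the formula in these notes does. The standard library function `statistics.pstdev` computes the same quantity and can serve as a cross-check.

```python
from math import sqrt

scores = [80, 45, 55, 56, 58, 60, 65, 68, 70, 65,
          75, 85, 82, 86, 50, 48, 60, 62, 64, 70]


def population_sd(scores):
    """Square root of the mean of the squared deviations from the mean (divisor n)."""
    n = len(scores)
    mean = sum(scores) / n
    sum_sq_dev = sum((x - mean) ** 2 for x in scores)
    return sqrt(sum_sq_dev / n)


print(sum(scores))                      # 1304
print(round(population_sd(scores), 2))  # 11.66
```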

Example [grouped data from the example under the mean, where X̄ = 44.60]

CLASS    x     f     (x - x̄)    (x - x̄)²    f(x - x̄)²
65-69    67    1     22.4       501.76      501.76
60-64    62    3     17.4       302.76      908.28
55-59    57    4     12.4       153.76      615.04
50-54    52    7     7.4        54.76       383.32
45-49    47    9     2.4        5.76        51.84
40-44    42    11    -2.6       6.76        74.36
35-39    37    8     -7.6       57.76       462.08
30-34    32    4     -12.6      158.76      635.04
25-29    27    2     -17.6      309.76      619.52
20-24    22    1     -22.6      510.76      510.76

∑f(x - x̄)² = 4762

σx = √ (4762/50)
= √ 95.24
= 9.76
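
The grouped calculation can be sketched in Python in the same way. The function name `grouped_sd` is an assumption; class midpoints stand in for the individual scores, so the result carries the grouping error discussed earlier.

```python
from math import sqrt

midpoints = [67, 62, 57, 52, 47, 42, 37, 32, 27, 22]
frequencies = [1, 3, 4, 7, 9, 11, 8, 4, 2, 1]


def grouped_sd(midpoints, frequencies):
    """S.D of grouped data: sqrt of (sum of f*(x - mean)^2) / n, using class midpoints for x."""
    n = sum(frequencies)
    mean = sum(f * x for x, f in zip(midpoints, frequencies)) / n
    sum_f_sq_dev = sum(f * (x - mean) ** 2 for x, f in zip(midpoints, frequencies))
    return sqrt(sum_f_sq_dev / n)


print(round(grouped_sd(midpoints, frequencies), 2))  # 9.76
```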

Variance
This is the average of squared deviations of scores from the arithmetic mean of the distribution.
It summarises, in squared units, how far the scores deviate from the mean on average.

$$\sigma_x^2 = \frac{1}{n} \sum (x - \bar{x})^2$$

Or for grouped data

$$\sigma_x^2 = \frac{1}{n} \sum f (x - \bar{x})^2$$

NOTE
The variance is the square of the standard deviation and conversely the S.D is the square root of
the variance.

Example
From the above two examples, the variance can be obtained directly from the S.D as follows:-

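σx² = 11.66² ≈ 135.86
And:-
σx² = 9.76² ≈ 95.24
[The small differences from the exact squares arise because the S.D values were rounded to two
decimal places.]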

NOTE
If a constant is added or subtracted from each observation in the series, its variance remains the
same.

CORRELATION
A correlation coefficient is a measure of the strength of the linear relationship between variables.
In most cases, two variables are related in such a way that there is an input variable, while the
other is a response variable. E.g. intelligence and academic performance are related in that
intelligence influences performance. Hence intelligence is the input variable while performance
is the response variable. Methods of regression and correlation analysis enable the determination
of the relationship between variables. The correlation coefficient is a value within the range – 1
and +1.

The correlation coefficient is positive if an increase in one variable [input] is accompanied by an increase in
the other variable [response], or a decrease in the input variable is accompanied by a decrease in the response
variable. In this case a positive linear correlation is said to exist between the two variables.

The correlation coefficient is negative if an increase in the input is accompanied by a decrease in the
response variable and vice versa. Thus a negative linear correlation is said to exist between the
two variables.

The correlation coefficient is zero if there exists no linear relationship between the two variables.

There are two main ways of determining correlation i.e. two types of correlation.
(i) Rank order correlation.
(ii) Linear correlation [product moment correlation].

Spearman’s Rank Order Correlation Coefficient
This is determined based on the ranks of scores in both distributions. This coefficient is denoted
as ρ, and is calculated as follows.

$$\rho = 1 - \frac{6 \sum d^2}{N (N^2 - 1)}$$

Where d – Differences of ranks in the distributions
N – Number of effective pairs or sample size
Example
Determine the nature of the linear relationship in the performance of 10 individuals in two tests
as follows:-
Test 1 - 78, 90, 25, 30, 56, 70, 69, 45, 75, and 60.
Test 2 - 79, 88, 50, 45, 60, 65, 75, 50, 70, and 59.
Procedure
• List the corresponding scores in both distributions and rank each distribution separately.
• Obtain the differences between the corresponding ranks.
• Square the differences and obtain the sum of the squares.
T1 R1 T2 R2 d (R1-R2) d²
78 2 79 2 0 0
90 1 88 1 0 0
25 10 50 8 2 4
30 9 45 10 -1 1
56 7 60 6 1 1
70 4 65 5 -1 1
69 5 75 3 2 4
45 8 50 8 0 0
75 3 70 4 -1 1
60 6 59 7 -1 1

∑d² = 13
ρ = 1 – (6 × 13)/(10 × (100 - 1))
= 1 – 78/990
= 0.92 Very high positive correlation.
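
The formula can be applied to the ranks in the table with a short Python sketch. The function name `spearman_rho` is an assumption. The ranks are entered exactly as listed in the table, including the tied scores of 50 in Test 2, which the table ranks as 8 and 8.

```python
def spearman_rho(ranks_1, ranks_2):
    """rho = 1 - 6 * sum(d^2) / (N * (N^2 - 1)), where d is the difference of paired ranks."""
    n = len(ranks_1)
    sum_d_squared = sum((a - b) ** 2 for a, b in zip(ranks_1, ranks_2))
    return 1 - 6 * sum_d_squared / (n * (n ** 2 - 1))


r1 = [2, 1, 10, 9, 7, 4, 5, 8, 3, 6]   # ranks on Test 1, as in the table
r2 = [2, 1, 8, 10, 6, 5, 3, 8, 4, 7]   # ranks on Test 2, as in the table

print(round(spearman_rho(r1, r2), 2))  # 0.92
```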

Pearson’s Product Moment Correlation Coefficient


This is denoted r and was developed by Karl Pearson. It is determined as follows:-

$$r_{xy} = \frac{\sum xy}{N \sigma_x \sigma_y}$$

Where x – Deviation from the mean of the first distribution
y – Deviation from the mean of the second distribution
σx – S.D of distribution 1
σy – S.D of distribution 2
N – Number of paired scores
∑xy – Sum of the products of x and y.
Example above

Test 1 x Test 2 y xy
78 18.2 79 14.9 271.18
90 30.2 88 23.9 721.78
25 -34.8 50 -14.1 490.68
30 -29.8 45 -19.1 569.18
56 -3.8 60 -4.1 15.58
70 10.2 65 0.9 9.18
69 9.2 75 10.9 100.28
45 -14.8 50 -14.1 208.68
75 15.2 70 5.9 89.68
60 0.2 59 -5.1 -1.02
598 641

∑Test 1 = 598    ∑Test 2 = 641    ∑xy = 2475.2
Mean of Test 1 = 59.8    Mean of Test 2 = 64.1
σx = 19.99
σy = 13.24

rxy = ∑xy/(N σx σy)
= 2475.2/(10 × 19.99 × 13.24)
= 0.94
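
The product moment calculation can be reproduced with a minimal Python sketch. The function name `pearson_r` is an assumption; the standard deviations use the divisor N, as in the formula above.

```python
from math import sqrt

test_1 = [78, 90, 25, 30, 56, 70, 69, 45, 75, 60]
test_2 = [79, 88, 50, 45, 60, 65, 75, 50, 70, 59]


def pearson_r(xs, ys):
    """r = sum(xy) / (N * sd_x * sd_y), with x and y taken as deviations from their means."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    dx = [v - mean_x for v in xs]             # deviations in the first distribution
    dy = [v - mean_y for v in ys]             # deviations in the second distribution
    sum_xy = sum(a * b for a, b in zip(dx, dy))
    sd_x = sqrt(sum(a * a for a in dx) / n)
    sd_y = sqrt(sum(b * b for b in dy) / n)
    return sum_xy / (n * sd_x * sd_y)


print(round(pearson_r(test_1, test_2), 2))  # 0.94
```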

TRANSFORMED SCORES

Percentiles and Percentile Ranks


A Percentile Rank of a score is a single number that gives the percentage of cases in the specific
reference group, scoring at or below a particular score. Eg. if a score of 45 corresponds to a
percentile rank of 85, this means that 85% of the scores in the class are equal or lower than 45,
while 15% of the scores are higher than 45.

A Percentile [centile] is the score at or below which a given percentage of the scores lie. E.g. in
the above case, 45 is the 85th percentile.

Notation
PR – Percentile rank; PRx is the percentile rank of the score x.
Pn – nth percentile
e.g. P10 – Tenth percentile

Example
Compute the corresponding percentile rank of the score 41 in the distribution below.

i f c.f
48-50 1 85
45-47 3 84
42-44 4 81
39-41 6 77
36-38 7 71
33-35 9 64
30-32 14 55
27-29 8 41
24-26 10 33
21-23 8 23
18-20 4 15
15-17 3 11
12-14 3 8
9-11 5 5

Direct Method
(i) Locate the class interval in which the raw score falls [the critical class].
(ii) Note the frequencies needed:
a) The c.f of the class immediately below the critical class.
b) The frequency [f] of the critical class.
(iii) Find the exact limits of the critical class. Subtract the exact lower limit of the critical
class [i.e. the exact upper limit of the immediately preceding class interval] from the desired score.
(iv) Work out the number of scores in the critical class that lie at or below the required score:

[f (c.c) / c.i] × number of units in step (iii)

(v) Add the obtained value to the c.f of the class below the critical class.
(vi) Divide the obtained value by the sample size and multiply by 100.

• Critical class is 39-41
• c.f below critical class 71
• c.f above critical class 81
• c.f of critical class 77
• Limits of critical class 38.5-41.5
• 41 – 38.5 = 2.5
• 6/3 × 2.5 = 5
• 71 + 5 = 76
• PR = 76/85 × 100
= 89.41%
This means that 89.41% of the scores lie at or below the score 41.

Indirect Method
$$PR_x = \frac{100}{N} \left( F + \frac{x - L}{i} f \right)$$

Where, F – c.f below the critical class
x – Score for which PR is derived
L – Exact lower limit of the critical class
i – Class interval
f – Frequency of the critical class
N – Total number of scores
Thus, PR41 = 100/85 × (71 + [2.5/3] × 6)
= 89.41%
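
The percentile rank formula can be written as a short Python sketch. The function name `percentile_rank` and its parameter names are assumptions made for this illustration.

```python
def percentile_rank(score, lower_limit, cum_below, class_freq, width, n):
    """PR = (100 / N) * (F + ((x - L) / i) * f) for a score in a grouped distribution."""
    return 100 / n * (cum_below + (score - lower_limit) / width * class_freq)


# Score 41 in the class 39-41: L = 38.5, F = 71, f = 6, i = 3, N = 85.
print(round(percentile_rank(41, 38.5, 71, 6, 3, 85), 2))  # 89.41
```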

Percentiles
A percentile is the reverse of a percentile rank. It is a specific score corresponding to a certain
percentage of scores. A percentile is obtained as:-

\[ P_n = L + \left(\frac{nN/100 - F}{f}\right) i \]

N – Σf, the total number of scores
L – Exact lower limit of percentile class
i – Class interval
n – nth percentile
f – Frequency of percentile class
F – Total frequency below percentile class

Example
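Find the raw score at or below which 25% of the scores lie in the above distribution.
25% of 85 scores is 21.25, so the percentile class is 21-23 (c.f 23), with L = 20.5, F = 15 and f = 8.
\[ P_{25} = 20.5 + \left(\frac{25 \times 85/100 - 15}{8}\right) \times 3 = 22.84 \]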
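
The percentile formula can be sketched in Python in the same way. The function name `grouped_percentile` and the tuple layout are hypothetical, and whole-number scores are assumed so that the exact lower limit is 0.5 below the stated lower limit.

```python
# (lower limit, upper limit, frequency), listed from the lowest class upwards.
CLASSES = [(9, 11, 5), (12, 14, 3), (15, 17, 3), (18, 20, 4), (21, 23, 8),
           (24, 26, 10), (27, 29, 8), (30, 32, 14), (33, 35, 9), (36, 38, 7),
           (39, 41, 6), (42, 44, 4), (45, 47, 3), (48, 50, 1)]


def grouped_percentile(n_th, classes):
    """P_n = L + ((nN/100 - F) / f) * i for a grouped distribution."""
    total = sum(f for _, _, f in classes)   # N
    target = n_th * total / 100             # nN/100, the rank to reach
    cum_below = 0                           # F
    for lower, upper, f in classes:
        if f > 0 and cum_below + f >= target:   # percentile class found
            exact_lower = lower - 0.5           # L
            width = upper - lower + 1           # i
            return exact_lower + (target - cum_below) / f * width
        cum_below += f
    raise ValueError("percentile must lie between 0 and 100")


p25 = grouped_percentile(25, CLASSES)
print(round(p25, 2))  # 22.84
```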

Relationship between Quartiles, Deciles, and Percentiles


Percentiles                  Deciles                      Quartiles
Divide the distribution      Divide the distribution      Divide the distribution
into 100 equal parts.        into 10 equal parts.         into 4 equal parts.
Median is 50th percentile    Median is 5th decile         Median is 2nd quartile
75th percentile              -                            3rd quartile
50th percentile              5th decile                   2nd quartile
25th percentile              -                            1st quartile

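The correspondence between quartiles, deciles and percentiles can be illustrated on raw scores with Python's standard `statistics` module. The 11 scores are hypothetical, and `statistics.quantiles` interpolates between raw scores rather than using the grouped-data formula, so this is a sketch of the relationship only.

```python
from statistics import median, quantiles

# Hypothetical raw scores, chosen only to illustrate the relationship.
scores = [12, 15, 18, 21, 24, 27, 30, 33, 36, 39, 42]

quartiles = quantiles(scores, n=4)       # 3 cut points: Q1, Q2, Q3
deciles = quantiles(scores, n=10)        # 9 cut points: D1 ... D9
percentiles = quantiles(scores, n=100)   # 99 cut points: P1 ... P99

# Median = 2nd quartile = 5th decile = 50th percentile.
print(median(scores), quartiles[1], deciles[4], percentiles[49])
# 1st quartile = 25th percentile, 3rd quartile = 75th percentile.
print(quartiles[0], percentiles[24])
print(quartiles[2], percentiles[74])
```
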
THE NORMAL DISTRIBUTION
The Normal Curve is a smooth unimodal curve that is perfectly symmetrical. It has 68.3% of the
area under the curve within one standard deviation of the mean. Many human characteristics are
approximately normally distributed, with the majority of values clustering around the mean of
the said characteristic(s).

Properties of the Normal Distribution


(i) The normal distribution is symmetrical about the mean.

The area to the left of the mean equals the area to the right of it, thus
P(X < μ) = P(X > μ) = 0.5, and f(x) is maximum when x = μ.

μ = Mo = Md, i.e. the three measures of location (mean, mode and median) coincide. A change
in μ translates the normal curve along the x-axis, while a change in σ affects the scale and
hence the shape of the curve. Thus two normal functions with different μ but the same σ have
the same shape but sit at different locations on the x-axis. On the other hand, two normal
functions with the same μ but different σ are both centered at x = μ, but the one with the
larger SD is flatter and more spread out.

(ii) The Area under the normal curve = 1.
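
These properties can be checked numerically with `NormalDist` from Python's standard `statistics` module. The particular means and standard deviations below (50, 60, 5, 10) are arbitrary values picked for illustration.

```python
from statistics import NormalDist

z = NormalDist(mu=0, sigma=1)

# Area within one standard deviation of the mean (about 68.3%).
within_one_sd = z.cdf(1) - z.cdf(-1)
print(round(within_one_sd, 3))  # 0.683

# Symmetry about the mean: half of the area lies below it.
print(z.cdf(0))  # 0.5

# Same mean, different SD: the larger SD gives a lower, flatter peak.
narrow = NormalDist(mu=50, sigma=5)
wide = NormalDist(mu=50, sigma=10)
print(narrow.pdf(50) > wide.pdf(50))  # True

# Different mean, same SD: same shape, shifted along the x-axis.
shifted = NormalDist(mu=60, sigma=5)
print(abs(narrow.pdf(50) - shifted.pdf(60)) < 1e-12)  # True
```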

For statistical and inferential purposes, any normal distribution can be converted so that it
has μ = 0 and σ = 1. This process is called standardization and the resulting distribution is
called the standard normal distribution. It is used to describe and compare any empirical
distribution that is approximately normal. Scores in an ordinary normal distribution can be
converted to scores in the standard normal distribution, otherwise known as Z-scores.

Z – Scores
These indicate how many standard deviations the raw score is from the mean, and the direction
of the raw score from the mean, i.e. + or –.

Conversion of Raw Score to Z-Score

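\[ Z = \frac{X - \bar{X}}{\sigma} \]
where X is the raw score, X̄ is the mean and σ is the standard deviation of the distribution.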
Example
What is the Z-score of a score of 80 in a distribution whose mean is 85 and S.D is 10?
\[ Z_{80} = \frac{80 - 85}{10} = -0.50 \]
This means that in a normal distribution, the score 80 falls 0.5 standard deviations to the
left [negative side] of the mean.
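
The conversion can be sketched in Python as follows. The function name `z_score` and the list of five raw scores are hypothetical; the list was chosen so that its mean is 85 and its standard deviation is 10, matching the example.

```python
from statistics import mean, pstdev


def z_score(raw, mean_value, sd):
    """Number of standard deviations that `raw` lies from the mean."""
    return (raw - mean_value) / sd


z_80 = z_score(80, 85, 10)
print(z_80)  # -0.5

# Standardizing a whole (hypothetical) set of scores with mean 85 and SD 10.
raw_scores = [70, 80, 85, 90, 100]
m, sd = mean(raw_scores), pstdev(raw_scores)
z_scores = [z_score(x, m, sd) for x in raw_scores]
print(z_scores)  # [-1.5, -0.5, 0.0, 0.5, 1.5]
```

After standardization the Z-scores themselves have a mean of 0 and a standard deviation of 1.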

THE STUDENT'S t-DISTRIBUTION
There are certain circumstances in which the normal distribution is not suitable. When
making an inference about a population mean while the standard deviation of the population is
not known, the normal distribution is no longer suitable for deriving critical scores. This is
because the sample standard deviation must be used in place of the population value, and the
resulting standardized sample mean no longer follows the normal distribution, particularly for
small samples. In this case the Student's t-distribution is used. The t-distribution looks a lot
like the z-distribution in that it is a smooth, unimodal, symmetrical curve. The difference is
that the t-distribution is flatter than the z-distribution, with more area in the tails, and how
much flatter depends on the sample size. For small samples the distribution is flatter than for
larger samples, and when the sample becomes very large (≥ 120), the t-distribution becomes
practically indistinguishable from the normal distribution.
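
The flattening can be seen by comparing the height of the curves at their centre. The sketch below uses the standard density formula of the t-distribution, which is not given in these notes, and the degrees of freedom 5, 30 and 120 are chosen only for illustration; `t_pdf` is a hypothetical helper name.

```python
import math
from statistics import NormalDist


def t_pdf(t, df):
    """Density of Student's t-distribution with `df` degrees of freedom."""
    log_coef = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_coef - (df + 1) / 2 * math.log1p(t * t / df))


normal_peak = NormalDist().pdf(0)

# Peak height rises towards the normal curve as the degrees of freedom grow.
for df in (5, 30, 120):
    print(df, round(t_pdf(0, df), 4))
print("normal", round(normal_peak, 4))

# A flatter peak goes with heavier tails: more density far from the centre.
print(t_pdf(3, 5) > NormalDist().pdf(3))  # True
```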

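T-Scores
These are standard scores rescaled to a mean of 50 and a standard deviation of 10, which avoids
negative values. Despite the similar name, they are a linear transformation of Z-scores and are
not values taken from the Student's t-distribution.
T-score = 10 (Z-score) + 50
E.g. in the above data
T80 = 10(-0.5) + 50
= 45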
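
As a small sketch, the conversion from a Z-score to this 50-centred scale is a one-line function in Python. The name `t_score` is hypothetical.

```python
def t_score(z):
    """Rescale a Z-score to a scale with mean 50 and standard deviation 10."""
    return 10 * z + 50


print(t_score(-0.5))  # 45.0
print(t_score(0))     # 50
```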

Advantages of Standard Scores


(i) They enable comparison of scores of different distributions.
(ii) They provide a quick way of comparing individuals and subjects.

Disadvantages of Standard Scores


(i) They are difficult to comprehend and explain.
(ii) The negative sign in Z scores sometimes causes anxiety.

THE CHI-SQUARE DISTRIBUTION


This is the sampling distribution of the chi-square statistic over an infinite number of random
samples of the same size drawn from populations in which the two variables are independent of
each other. A chi-square distribution has a long right tail, reflecting the fact that it is
possible, though highly improbable, to select random samples that yield a very high chi-square
value even when the variables are independent. From the sampling distribution of chi-square we
can determine the probability that the difference between the observed frequencies and the
expected frequencies is due to random variation when sampling from populations in which the
two variables are independent.
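
The comparison of observed and expected values can be sketched in Python using the standard Pearson statistic, the sum of (O − E)²/E over all cells, which these notes do not state explicitly. The function name `chi_square_statistic` and the 2 x 2 table of counts are hypothetical.

```python
def chi_square_statistic(observed):
    """Pearson chi-square statistic and degrees of freedom for a two-way table."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(observed):
        for j, obs in enumerate(row):
            # Count expected in this cell if the two variables are independent.
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (obs - expected) ** 2 / expected
    df = (len(observed) - 1) * (len(observed[0]) - 1)
    return chi2, df


# Hypothetical 2 x 2 table of observed counts.
observed = [[20, 30],
            [30, 20]]
chi2, df = chi_square_statistic(observed)
print(chi2, df)  # 4.0 1
```

The statistic is then compared with the chi-square distribution for the given degrees of freedom to judge how probable such a difference is under independence.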

ANALYSIS OF VARIANCE (ANOVA)


This is a tool of statistical analysis which is used:-
(i) For comparison of means
(ii) For comparison of variances
(iii) When the input variable is qualitative or quantitative, and the response variable is
quantitative.
(iv) When dealing with several factors or one factor each at several levels.

The question that ANOVA answers is whether the means of the experimental treatments differ.
The test statistic derived from the ANOVA test is compared against the F-distribution, also
known as the Fisher distribution.
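
A minimal Python sketch of the one-way case follows, assuming the usual F ratio of the between-groups mean square to the within-groups mean square, which is not spelled out in these notes. The function name `one_way_anova_f` and the three small treatment groups are hypothetical.

```python
from statistics import mean


def one_way_anova_f(groups):
    """F ratio = between-groups mean square / within-groups mean square."""
    all_scores = [x for g in groups for x in g]
    grand_mean = mean(all_scores)
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    df_between = len(groups) - 1
    df_within = len(all_scores) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)


# Hypothetical scores under three experimental treatments.
groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
f_ratio = one_way_anova_f(groups)
print(round(f_ratio, 2))  # 13.0
```

A large F ratio, judged against the F-distribution with (2, 6) degrees of freedom here, suggests that the treatment means differ.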

HYPOTHESIS TESTING
Hypothesis Testing is the application of standard procedures to determine whether or not a
hypothesis holds as stated. The testing of any hypothesis hinges on sample characteristics,
i.e. measures obtained from the sample, and the aim is either to reject or fail to reject the
null hypothesis. For any stated hypothesis an opposite can be obtained, so that each case has
two hypotheses: a null hypothesis and a corresponding alternative hypothesis.

If a null hypothesis is rejected as a result of testing, then the alternative hypothesis is
accepted. On the other hand, if a null hypothesis is not rejected, it is retained; this does not
prove it true, but shows only that the sample gives insufficient evidence against it.
Depending on the type of data [measures] and the relationship being tested, the testing of a
hypothesis can utilize either a parametric or a non-parametric test.

(i) Parametric tests are those that utilize probability distributions in the testing of
hypotheses. These include the Z-test, t-test and chi-square test. They are applicable to
quantitative data and qualitative nominal data. The distributions used include the normal
distribution, t-distribution and chi-square distribution.
(ii) Non-parametric tests are distribution-free tests and are used mainly for ordinal data. They
include the sign test, Wilcoxon test, etc.
(iii) Analysis of variance [ANOVA] is used to test for interaction and independence of the
various variables that generate various means or variances. ANOVA is interpreted using the
Fisher or F-distribution.
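
As an illustration of the reject / fail to reject decision, the sketch below runs a two-sided one-sample Z-test in Python. The function name `one_sample_z_test`, the 0.05 significance level and all the numbers (sample mean 52, hypothesized mean 50, known population SD 10, sample size 100) are hypothetical.

```python
import math
from statistics import NormalDist


def one_sample_z_test(sample_mean, mu0, sigma, n, alpha=0.05):
    """Two-sided Z-test of H0: population mean = mu0, with known sigma."""
    z = (sample_mean - mu0) / (sigma / math.sqrt(n))
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    decision = "reject H0" if p_value < alpha else "fail to reject H0"
    return z, p_value, decision


z, p, decision = one_sample_z_test(sample_mean=52, mu0=50, sigma=10, n=100)
print(round(z, 2), round(p, 4), decision)  # 2.0 0.0455 reject H0
```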
