EPS NOTES

MACHAKOS UNIVERSITY

CENTER OF OPEN, DISTANCE AND e-LEARNING

IN COLLABORATION WITH

SCHOOL OF EDUCATION

DEPARTMENT OF PSY/SNE

EPS400: EDUCATIONAL TESTING AND MEASUREMENT

WRITTEN BY:

ROSEMARY MULE

Copyright © Machakos University, 2021


All Rights Reserved

MAY, 2021
LECTURE ONE

INTRODUCTION TO MEASUREMENT AND STATISTICS

1.1 Introduction

The lesson focuses on the meaning of statistics and its uses in the context of education.
The relationship of statistics to measurement and evaluation is brought out by analyzing key terms,
namely data, variable, descriptive and inferential statistics, and the scales (levels) of measurement,
which largely dictate the statistical method to be used.

The lecture covers:


1. Lecture objectives
2. Definition of terms used in this unit
3. Types of statistics
4. Scales of measurement
5. Learning Activities
6. Summary
7. Suggestions for further reading

1.2 Lecture Objectives

By the end of this lecture you should be able to:


1. Define the following terms: statistics, descriptive statistics, inferential statistics, data,
variable, continuous variable, discrete variable, measurement, evaluation
2. Explain the importance of basic knowledge of statistics to educators
3. Show the relationship between statistics and measurement
4. Name the four levels of measurement
5. Describe the properties of the 4 levels of measurement
6. Associate any data collected with the relevant level of measurement.
1.3 Definition of terms used in this unit

➢ Statistics
Statistics is concerned with scientific methods of collecting, organizing, summarizing,
presenting and analyzing data in order to draw valid conclusions and make reasonable decisions on
the basis of the analysis.
The data compiled and analyzed could be student marks, student enrolment in a school, sales of
a business, passes and failures in an examination, etc.
➢ Rationale for doing statistics
Statistics is needed for research, for the extension of knowledge and for solving problems.
It is also needed for interpreting masses of data, e.g. student marks, the number of passes and
failures in examinations, the sales of a business, or the profits made by different companies in the
same industry.

There are two types of statistics:

a) Descriptive Statistics
This type of statistics deals with mass data obtained from a population. The numerical data are
interpreted by organizing and summarizing (compiling) them in a way that they can be understood
and communicated without generalizing beyond the group under consideration.
b) Inferential Statistics
In this type of statistics, conclusions about a population are made using a representative, random,
finite sample. Reasonable decisions are thus made with incomplete information: a small finite
group (usually a random sample) is studied and the findings are then inferred or generalized to the
population. In inferential statistics, therefore, one uses a (random) sample to study the population.
➢ Measurement
This is the process of assigning numbers to individuals or objects in a systematic way as a
means of representing properties of the individuals (or objects).
Measuring some quantities is easier than measuring others. Measuring time, length or weight, for
example, poses no problem since we have objective physical instruments to carry out the
measurement, and interpreting these measurements poses no problem either. The situation is
completely different when we wish to measure non-physical variables. Psychological variables like
attitude, unlike physical variables, are not tangible. Usually these are variables in the individual's
mind and they need to be manifested to be measured, which is why we have to use indirect ways,
such as a test, to measure them.
➢ Assessment
This is the process of assigning numerical values to objects and events and comparing the qualities
and quantities so assigned between two or more objects of assessment.
Assessment can be formal or informal.
➢ Evaluation
This is the process of collecting quantitative or qualitative information, analyzing the
information and presenting it in a form that facilitates decision making. Evaluation is judgmental.
Evaluation answers questions such as how good or how well. In an education programme,
evaluation is used to promote learners to the next level of education or class, and the cut-off mark
must be specified in advance. To be able to evaluate, measurement and assessment must first be done.
➢ Test
This is a standardized situation that provides an individual with a score. Students are tested with
the same questions and in the same way.
➢ Examination
An examination consists of several tests that measure different properties of the individual so as to
facilitate decision making. An examination takes longer than a test.

➢ Statistic
This is a derived numerical value that describes some property of data, e.g. averages, rates, etc.

➢ Statistical Methods
These are ways or means of processing data to extract their full significance, e.g. calculation of
the mean, mode and median.
➢ Variable
This refers to any single property or characteristic that different individuals can
possess in different quantities. A variable that assumes only one value is called a constant (i.e. a
variable with only one value in its domain). A variable which can theoretically assume any value
between any two given values is called a continuous variable, while a variable that can only take
particular values (e.g. whole numbers), or that cannot take at least one value between two given
points, is called a discrete variable. Generally, measurement gives rise to continuous data while
counting or enumeration gives rise to discrete data.

1.4 Scales (Levels) of Measurement

Measurement takes place at basically 4 different levels or scales


Nominal scale
Ordinal Scale
Interval Scale
Ratio scale
Each level of measurement specifies how numbers (data) that are assigned to individuals or objects
relate to the property being measured.
Nominal Scale
This type of scale involves the use of a number in place of a property or attribute that is measured, e.g.
if the variable measured is gender, we may assign female 1 and male 2.
The values 1 and 2 merely indicate that the categories are different; no merit is attached to either value.
Other examples that may be assigned nominal values are ethnicity, religion, school house
membership, etc. Nominal values are also referred to as categorical variables. The numbers used
represent categories and the relationship between the categories is that they are different. In the
nominal scale, counting is possible, e.g. the number of males.

Ordinal Scale
This scale distinguishes the individuals or objects and also gives the relative position of individuals
with respect to some property or attribute, but it does not indicate the distance between positions. It
is characterized by ordered categories. The concepts of greater than and less than, as well as
counting, are applicable. Data in this scale may be assigned, e.g., 3 for good, 2 for fair and 1 for poor
in a grading system, or ranked as head teacher > deputy head teacher > senior teacher.

Interval Scale
This scale provides equal intervals from an arbitrary origin. The distance (difference) between any
two numbers on this scale is of a known value. This scale orders individuals, objects or events
according to the amount of the attribute or property they possess and also establishes equal intervals
between the units of measure, e.g. given two scores 45 and 40, 45 is better than 40 and the student
who scored 40 missed five more items than the one who scored 45. In the interval scale of
measurement, counting is possible and the use of > or < is also possible, i.e. the scale has order and
it can be stated meaningfully by how much two values differ.

Ratio Scale
The ratio scale is the highest level of measurement; it provides a true zero point as well as equal
intervals. Meaningful ratios can be formed between any two given values on the scale.
A metre rule used to measure length in cm is a ratio scale, for the origin on the scale is an
absolute zero corresponding to no length at all. That is, lengths measured in, say, cm
provide ratio-scale data. Ratio scales are found primarily in physical variables.
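
The four scales build on one another, and the following short Python sketch may help in associating data with a scale. It is only an illustration: the function name `scale_of` and its three yes/no parameters are assumptions introduced here, while the example variables are the ones mentioned in the text above.

```python
def scale_of(has_order, has_equal_intervals, has_true_zero):
    """Return the highest scale whose properties the data satisfy.

    Every scale distinguishes categories and allows counting; each
    higher scale adds one more property, as described in the text.
    """
    if has_order and has_equal_intervals and has_true_zero:
        return "ratio"
    if has_order and has_equal_intervals:
        return "interval"
    if has_order:
        return "ordinal"
    return "nominal"


# Examples taken from the lecture text.
examples = {
    "gender coded 1 and 2": scale_of(False, False, False),
    "grades: good, fair, poor": scale_of(True, False, False),
    "test scores such as 45 and 40": scale_of(True, True, False),
    "length in cm": scale_of(True, True, True),
}

for variable, scale in examples.items():
    print(f"{variable}: {scale}")
```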
1.5 Further Activity

a) Visit the world wide web and read introduction to educational statistics, tests and
measurement.

1.6 Self-Test Questions

a) Explain why social scientists (e.g. educators, psychologists) need to have at least a
rudimentary knowledge of statistics?
b) Distinguish between descriptive statistics and inferential statistics.
c) State the 4 major levels of measurement (scale of measurement) and discuss their
characteristics.
d) Define the following terms:
o Variable
o Continuous variable
o Discrete variable.
o Measurement
1.7 Summary

a) Statistics is a scientific method of collecting, organizing, summarizing, presenting
(compiling) and analyzing data.
b) There are two types of statistics namely, descriptive and inferential statistics.
c) In descriptive statistics generalizations are restricted to the group under consideration
while in inferential statistics a representative sample from the population is used to study
the population, thus making inferences about the population using the (representative)
sample.
d) A variable is any particular property (characteristic, trait or attribute) of which different
individuals (objects) will possess in different quantities.
e) Discrete variables are measured in units, which by definition, cannot be subdivided any
further. On the other hand, a continuous variable is one that can take on unlimited or
potentially unlimited number of values between any two given values.
f) There are basically 4 levels of measurement: nominal, ordinal, interval and ratio.
g) Most measurements in social sciences (education and psychology included) are possible
at nominal, ordinal and interval. Very few important variables in these fields lend
themselves to ratio level of measurement.
1.8 Suggestions for Further Reading

Ingule, F. & Gatumu, H. (1996) Essentials of Educational Statistics.
Nairobi: E.A. Educational Publishers.

Glass, G.V. & Stanley, J.C. (1970) Statistical Methods in Education and Psychology.
New Jersey: Prentice-Hall.

Johnson, R.R. (1980) Elementary Statistics. Mass.: Duxbury Press.

Sahre, B. Kumar (2005) Statistics in Psychology and Education.
Ludhiana: Kalyani Publishers.

Smith, G.M. (1970) A Simplified Guide to Statistics for Psychology and Education.
New York: Holt, Rinehart and Winston.
LECTURE TWO

TABULATION AND GRAPHICAL PRESENTATION OF DATA

Introduction
The topic focuses on the use of tables and graphs to describe distributions of data (marks) for groups
of students (or subjects or individuals). It covers the tabulation of data and the presentation of
distributions of data using graphs.

The lecture covers:


1. Lecture Objectives
2. Tabulation of grouped and ungrouped data
3. Graphical presentation of data
4. Forms of frequency distributions
5. Kurtosis
6. Learning activities
7. Summary
8. Suggestions for further reading

Objectives

By the end of this topic the learner should be able to:


a) Organize data for a group (sample) into:
i) Ungrouped frequency distribution
ii) Grouped frequency distribution
b) Determine cumulative frequencies for distributions of data
c) Present distributions of data using histograms and frequency polygons
d) Draw cumulative frequency curves (ogives)
e) Distinguish among:
i) Positively skewed distribution
ii) Negatively skewed distribution
iii) Normal distribution

Tabulation and graphical presentation of data

Graphs and tables are used to describe distributions of marks (data) for groups of students.
The quantitative data of marks scored by students are raw data. Before raw data can be understood
and interpreted, they are organized and summarized. Some of the commonly used procedures to
organize and summarize raw data include frequency distributions (ungrouped and grouped data),
histograms, frequency polygons and ogives (cumulative frequency curves).

Ungrouped Frequency Distributions

A frequency distribution is a tabulation of scores of a group of individuals or objects to show the
number of times each score occurs. The first step in the preparation of a frequency table is to arrange
the scores in order (either ascending or descending). The variable x is used to represent the
scores and the number of times a score occurs (frequency) is represented by f. The total number
of scores, i.e. the sum of all the frequencies, is denoted by n or ∑f.

Preparation of ungrouped frequency distribution table

Prepare a frequency distribution table for ungrouped data using the raw scores given as follows:

11 9 5 16 16 16 4 9 5 7
4 10 4 4 15 15 5 5 11 18
8 16 12 11 17 3 3 5 3 7
2 11 6 4 18 1 9 2 2 15
5 10 9 8 7 7 2 5 13 1

Frequency distribution table for ungrouped data

Score (X)    Tally        Frequency (f)
1            ||           2
2            ||||         4
3            |||          3
4            |||||        5
5            ||||| ||     7
6            |            1
7            ||||         4
8            ||           2
9            ||||         4
10           ||           2
11           ||||         4
12           |            1
13           |            1
15           |||          3
16           ||||         4
17           |            1
18           ||           2
Total frequency           50
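
The tallying above can be cross-checked with a few lines of Python. This is a minimal sketch using the standard-library `collections.Counter`; the variable names are illustrative, and the raw scores are copied from the exercise above.

```python
from collections import Counter

# Raw scores from the exercise above (50 scores).
raw_scores = [
    11, 9, 5, 16, 16, 16, 4, 9, 5, 7,
    4, 10, 4, 4, 15, 15, 5, 5, 11, 18,
    8, 16, 12, 11, 17, 3, 3, 5, 3, 7,
    2, 11, 6, 4, 18, 1, 9, 2, 2, 15,
    5, 10, 9, 8, 7, 7, 2, 5, 13, 1,
]

# Count how many times each score occurs.
frequency = Counter(raw_scores)

print("Score  Tally     Frequency")
for score in sorted(frequency):
    tally = "|" * frequency[score]
    print(f"{score:<6} {tally:<9} {frequency[score]}")

total = sum(frequency.values())
print(f"Total frequency: {total}")
```

Scores that never occur (here, 14) simply do not appear in the output, just as they are left out of the table.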

Grouped frequency distributions

When there is a wide range of data, the data may be condensed by setting up intervals which
contain a range of possible data. When data are grouped to form intervals of data called “class
intervals” the resulting frequency distribution is known as a grouped frequency distribution.

Steps for making a grouped frequency distribution table


i. Determine the range (Highest Score-Lowest score)
ii. Determine size of class interval
iii. Determine the last and first class intervals.

Examples:

1. The raw data given is a record of scores for students in statistics continuous assessment
test;
38, 68, 39, 55, 60, 61, 56, 49, 51, 35, 58, 48, 58, 47
65, 50, 52, 39, 53, 43, 42, 51, 62, 47, 55, 58, 54, 52
46, 65, 45, 55, 46, 42, 52, 34, 59, 53, 48, 48, 60, 50

Prepare a grouped frequency distribution table (inclusive) using a class interval of 5 units

The class intervals must include the highest and lowest score and each class should start with a
multiple of the class size e.g. if class interval is 5, using the raw data above the lowest class
interval will be 30-34 and highest class interval will be 65-69

Grouped frequency distribution table (inclusive) using the scores above

Class Intervals    Tally          Frequency    CF
65-69              |||            3            3
60-64              ||||           4            7
55-59              ||||| |||      8            15
50-54              ||||| |||||    10           25
45-49              ||||| ||||     9            34
40-44              |||            3            37
35-39              ||||           4            41
30-34              |              1            42
NOTE
i. Class intervals must be mutually exclusive (each score falls in one and only one class
interval).
ii. There should be enough groups to include all observations.
iii. Class intervals must be of the same size.

The class intervals are 30-34, 35-39, ... The lower class boundaries are 30, 35, ... while the upper
class boundaries are 34, 39, ... The lower class limits (real limits) of the two lowest class intervals
are 29.5 and 34.5, and their upper class limits are 34.5 and 39.5.
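
The steps for building the grouped table can be sketched in Python as follows. This is an illustration rather than a prescribed procedure: the variable names are assumptions, the lowest class is started at a multiple of the class size as described above, and the cumulative frequency (CF) is accumulated from the highest class downwards, as in the table above.

```python
# Statistics continuous assessment test scores from Example 1 (42 scores).
scores = [
    38, 68, 39, 55, 60, 61, 56, 49, 51, 35, 58, 48, 58, 47,
    65, 50, 52, 39, 53, 43, 42, 51, 62, 47, 55, 58, 54, 52,
    46, 65, 45, 55, 46, 42, 52, 34, 59, 53, 48, 48, 60, 50,
]

class_size = 5

# Start the lowest class at a multiple of the class size (30 for these data).
start = (min(scores) // class_size) * class_size

# Build the inclusive class intervals and count the scores in each.
classes = []
lower = start
while lower <= max(scores):
    upper = lower + class_size - 1
    freq = sum(lower <= s <= upper for s in scores)
    classes.append((lower, upper, freq))
    lower += class_size

# Accumulate frequencies from the highest class downwards.
rows = []
running = 0
for lower, upper, freq in reversed(classes):
    running += freq
    rows.append((lower, upper, freq, running))

print("Class    Frequency  CF")
for lower, upper, freq, cf in rows:
    print(f"{lower}-{upper}    {freq:<9}  {cf}")
```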

2. Prepare a grouped frequency distribution table (inclusive) for the raw data in the table below
using a class interval of three units.

11 9 5 16 16 16 4 9 5 7
4 10 4 4 15 15 5 5 11 18
8 16 12 11 17 3 3 5 3 7
2 11 6 4 18 1 9 2 2 15
5 10 9 8 7 7 2 5 13 1

The grouped table for the above data

Scores    Tally                Frequency
1-3       ||||| ||||           9
4-6       ||||| ||||| |||      13
7-9       ||||| |||||          10
10-12     ||||| ||             7
13-15     ||||                 4
16-18     ||||| ||             7
Total                          50

GRAPHICAL PRESENTATION OF SCORES (DATA)

The Ordinary frequency distribution table does not give a very clear picture of the real situation
of the scores and is therefore supplemented with graphical representation of the same data.
Frequency distributions are presented graphically using histograms, frequency polygons and
Cumulative frequency curves ( Ogives)
Histogram
A histogram is a series of continuous (joined) columns or bars, each having as its base one class
interval and its height the number of cases or frequency in that class. The external boundaries of
a histogram are formed by two perpendicular lines i.e. the horizontal (x-axis) and vertical (y-
axis). To construct the histogram, lower and upper class limits (real limits) for each class interval
are used on the horizontal axis

Example
Using the data on statistics test, draw a histogram

[Figure: Histogram for grouped data; horizontal axis: scores, vertical axis: frequency]

Frequency Polygon

This is a line graph in which the horizontal axis contains the mid-points of the class intervals,
while the vertical axis contains the frequencies. Each frequency is plotted against corresponding
mid-point of the class interval. Allow one class interval below the first class interval and one
class interval after the last class interval. This is because the polygon must be closed (i.e. the
ends must touch the x-axis).
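
The points to plot for a closed frequency polygon can be worked out as in the sketch below. It assumes the grouped statistics test data used earlier in this lecture (class size 5); the variable names are illustrative only.

```python
# (lower, upper, frequency) for each class of the statistics test data.
classes = [
    (30, 34, 1), (35, 39, 4), (40, 44, 3), (45, 49, 9),
    (50, 54, 10), (55, 59, 8), (60, 64, 4), (65, 69, 3),
]
class_size = 5

# Each frequency is plotted against the mid-point (class-mark) of its class.
points = [((lower + upper) / 2, freq) for lower, upper, freq in classes]

# Close the polygon: add one class below and one class above, each with frequency 0.
first_mid = points[0][0]
last_mid = points[-1][0]
closed_points = [(first_mid - class_size, 0)] + points + [(last_mid + class_size, 0)]

for mid_point, freq in closed_points:
    print(f"mid-point {mid_point}: frequency {freq}")
```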

[Figure: Frequency polygon for statistics CAT; horizontal axis: class intervals for scores (real limits from 24.5-29.5 to 69.5-74.5), vertical axis: frequency]

Class-marks used for frequency polygons fall at the middle of the class interval. Also note the
frequency polygon has to be closed. This is done by considering an extra class-mark, i.e. the
next class-marks on the two ends of the class intervals as illustrated. The frequency of each of
these class intervals is 0.
Cumulative frequency curve (Ogive)
Cumulative frequency refers to scores attained up to and including those in the class interval.
Cumulative frequency curve is a line graph constructed by plotting the cumulative frequencies
against upper class limits of the class intervals and joining the points by free-hand.
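
As a sketch of how the points of an ogive are obtained, the Python lines below accumulate the frequencies of the statistics test data from the lowest class upwards and pair each cumulative frequency with the upper real limit of its class. The variable names are illustrative assumptions.

```python
# (lower, upper, frequency) for each class, lowest class first.
classes = [
    (30, 34, 1), (35, 39, 4), (40, 44, 3), (45, 49, 9),
    (50, 54, 10), (55, 59, 8), (60, 64, 4), (65, 69, 3),
]

ogive_points = []
running = 0
for lower, upper, freq in classes:
    running += freq                       # scores up to and including this class
    upper_real_limit = upper + 0.5        # e.g. 34 becomes 34.5
    ogive_points.append((upper_real_limit, running))

for limit, cumulative in ogive_points:
    print(f"{limit}: {cumulative}")
```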

[Figure: Cumulative frequency curve for the statistics continuous assessment data; horizontal axis: class intervals (x), vertical axis: cumulative frequency]

Forms of frequency distributions

1. Normal Curve
The normal curve is a bell-shaped symmetrical curve with the peak of the distribution at the
center and the tails of the distribution continually approaching but never touching the
horizontal axis (asymptotic). This kind of curve is a mathematical concept which is not realized
by any real data but plays an important role in statistical inferences.

2. Skewed Distributions
A distribution is said to be skewed if it has no symmetry, i.e. it is asymmetrical. In skewed
distributions, the scores ‘trail off’ in one direction. Skewed distributions can either be
described as being positively skewed or negatively skewed. A distribution is said to be
negatively skewed if scores are relatively frequent in number towards the right hand end of the
scale. A distribution is said to be positively skewed if scores are frequent in number towards
the left hand side of the scale. See the illustrations below:

[Figure: Normal curve distribution; horizontal axis: scores, vertical axis: frequency]

[Figure: Positively skewed distribution; horizontal axis: scores, vertical axis: frequency]

Kurtosis
Kurtosis refers to the flatness or peakedness of a distribution in relation to the normal curve. If one
distribution is more peaked than the normal curve, it is described as Leptokurtic. If a distribution
is less peaked than the normal curve, it is said to be Platykurtic. The normal distribution in relation
to the Leptokurtic and Platykurtic is described as Mesokurtic

[Figure: Forms of kurtosis, showing leptokurtic, mesokurtic and platykurtic curves; horizontal axis: scores, vertical axis: frequency]

Summary

1. Before raw data can be understood and interpreted, it is usually necessary to organize and
summarize them in some meaningful way.
2. Procedures used to organize and summarize data include frequency distributions,
histograms, frequency polygons and ogives.
3. A frequency distribution is a tabulation of scores (or other attributes) of a group of
individuals to show the number of times each score occurs.
4. When dealing with a large number of scores, we use a grouped frequency distribution in
which scores are grouped to form intervals of scores called 'class intervals'.
5. In a cumulative frequency distribution, we indicate the number of scores that are less than or
greater than a given value.
6. A graph is a very effective method of representing data.
7. Three common methods of representing a distribution graphically are histogram, the
frequency polygon and the smooth curve (ogive).
8. The histogram is a series of columns or bars each having as its base one class interval and
its height the number of cases or frequency in that class.
9. Frequency polygons are similar to histograms but instead of columns, the midpoints at
the appropriate frequency of each class interval are joined by straight lines. The straight
lines are extended down to the horizontal axis (X-axis) one class above and one class below to
create a many-sided figure (polygon).
10. An ogive is a cumulative frequency curve and is constructed by plotting the cumulative
frequencies against the actual (real) upper limits of the class intervals.
11. Various forms of frequency distributions exist and these include the normal distribution,
negatively skewed distributions and positively skewed distributions.

Further reading

Ingule, F. & Gatumu, H. (1996) Essentials of Educational Statistics.
Nairobi: E.A. Educational Publishers.
Glass, G.V. & Stanley, J.C. (1970) Statistical Methods in Education and Psychology.
New Jersey: Prentice-Hall.
Johnson, R.R. (1980) Elementary Statistics. Mass.: Duxbury Press.
Smith, G.M. (1970) A Simplified Guide to Statistics for Psychology and Education.
New York: Holt, Rinehart and Winston.

Self-Test Assessment

1. The following were the scores obtained by a Form II class in a mathematics test:
49 63 59 44 49 51 62 37 30 49 45 52 50 42 54 32 57
41 42 56 44 46 63 44 40 50 46 53 48 37 46 53 68 36
40 56 37 66 43 40 43 51 59 42 52 46 57
(a) Make an ungrouped frequency distribution table for this data. The table should
show both tally marks and frequencies. The total frequency (N) = 50.
(b) Make a grouped frequency distribution that should have both tally marks and
frequencies for each class interval. Use class size (i) = 5 and start with 30-34 as
the lowest class interval. Indicate the class-mark and actual limits for each class
interval. Indicate also the above as well as below cumulative frequencies.
(c) Plot (on graph paper) a histogram and frequency polygon for this data. Note
that the frequency polygon should be closed by extending the lines to the X-axis as
emphasized in the text.
(d) Comment on the distribution of the scores (i.e. is their distribution close to
normal or are they skewed positively or negatively?).
2. Using the same data, repeat 1 (b) and (c) but now using class size i = 4 and starting with
30-33 as the lowest class.
3. Select 30 of these scores randomly (one of the best ways of selecting them randomly is to
write each score on a small piece of paper. Put all these 50 folded pieces of paper in a box
and pick 30 after mixing all thoroughly). Using these 30 scores repeat no. 1 (a), (b) and
(c).
How does the distribution of the scores compare with the original distribution, i.e. the
distribution of the 50 scores?

4. Distinguish between the following terms, using illustrations if appropriate:
positively skewed and negatively skewed distributions.

LECTURE THREE
MEASURES OF CENTRAL TENDENCY

Introduction
There is need to have concise ways of presenting (summarizing) information (data) rather than
by means of graphs or tables. Single numbers (indexes), such as mean, mode and median show
the general level of performance. These indexes are referred to as measures of central tendency
(commonly called averages).
The lecture covers;
i. Meaning of measures of central tendency
ii. Mode, median and arithmetic mean and their calculations
iii. Interpretation of forms of frequency distributions using mode, median and arithmetic
mean
iv. Summary and exercises for self test

Objectives
By the end of this lecture topic the learner should be able to:
1. Describe the three measures of central tendency (mode, median, mean)
2. Compute:
i) Mode for ungrouped and grouped data.
ii) Median for ungrouped and grouped data
iii) Mean for ungrouped and grouped data.
3. Describe unimodal, bimodal and multimodal distributions
4. Describe positively skewed, negatively skewed and normal distributions using the three measures
of central tendency.
5. Discuss the properties of the mean (e.g. what happens to the mean when a constant is added to
all the scores in the distribution).

Measures of Central Tendency


Graphs of frequency distributions of given variables reveal two things: firstly, the values of the variable
tend to cluster around a central value; and secondly, the values spread around that central value in a
specific way. Describing the central points around which values in a distribution spread is what is called
a measure of central tendency. Measures of central tendency give some idea of the average or
representative scores in distributions. Three measures of central tendency that we shall consider are;
the mode, median and mean.
Mode
The mode is the most frequently occurring score or value in a distribution. It is the score or
value with the highest frequency in a distribution. e.g, using these score values; 1, 3, 4, 4, 6, 8,
9, score 4 occurs two times while other scores occur only once and therefore the Mode = 4.
Such a distribution with only one score value as the mode is said to be unimodal.
When two adjacent scores share the highest frequency, e.g. 1, 3, 4, 4, 6, 6, 8, 9, the mode is the
mean of the two scores: Mode = (4 + 6)/2 = 5. If the modes are non-adjacent then each mode value
is stated separately. If there are two non-adjacent modes, the distribution is said to be bimodal,
e.g. 1, 3, 3, 4, 6, 8, 8, 9, where the modal score values are 3 and 8. A distribution that has several
non-adjacent modes is referred to as a multimodal distribution and the modes are stated separately.
For a grouped distribution, the mid-point of the class with the highest frequency is the mode.
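
The rules for the mode of ungrouped data can be sketched in Python as below. This is one possible reading of the convention described above, not a standard library routine: the function name `modes` is an assumption, and "adjacent" is taken to mean neighbouring values among the distinct scores in the distribution.

```python
from collections import Counter


def modes(scores):
    """Return the mode(s) of ungrouped scores.

    Adjacent modal values are averaged into a single mode;
    non-adjacent modal values are reported separately.
    """
    counts = Counter(scores)
    highest = max(counts.values())

    result = []
    run = []  # current run of adjacent modal values
    for value in sorted(counts):
        if counts[value] == highest:
            run.append(value)
        elif run:
            result.append(sum(run) / len(run))
            run = []
    if run:
        result.append(sum(run) / len(run))
    return result


print(modes([1, 3, 4, 4, 6, 8, 9]))     # unimodal example from the text
print(modes([1, 3, 4, 4, 6, 6, 8, 9]))  # adjacent modes 4 and 6
print(modes([1, 3, 3, 4, 6, 8, 8, 9]))  # bimodal example from the text
```

One result means the distribution is unimodal, two results mean it is bimodal, and more than two mean it is multimodal.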
Median
This is a score in a distribution such that the number of scores above it is equal to the number of
scores below it. It is the centrally placed score in a distribution. Example, considering the scores;
5, 6, 9, 10, 11, Median = 9. When the number of scores is even, median is the mean of the two
scores between which the median lies e.g. 5, 7, 8, 9, 12, 13. Median lies between scores 8 and 9
and therefore median will be the average of these two scores

\[
\text{Median} = \frac{8 + 9}{2} = 8.5
\]
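
The two ungrouped examples above can be checked with the `median` function from Python's standard `statistics` module, which sorts the scores and, for an even number of scores, averages the two middle ones.

```python
from statistics import median

odd_scores = [5, 6, 9, 10, 11]        # odd number of scores
even_scores = [5, 7, 8, 9, 12, 13]    # even number of scores

print(median(odd_scores))   # the middle score
print(median(even_scores))  # the mean of the two middle scores, 8 and 9
```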

Median for grouped data

Scores frequency (f) C.F


30-34 1 1
35-39 4 5
40-44 3 8
45-49 9 17
50-54 10 27
55-59 8 35
60-64 4 39
65-69 3 42

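\[
\text{Median} = L + \frac{i}{f}\left(\frac{N}{2} - C\right)
\]

where:

L = lower class limit (real limit) of the median group
i = size of the class interval
C = Cf of the class below the median group
f = frequency of the median group
N/2 = total number of scores in the distribution divided by 2, i.e. total frequency divided by 2

For the table above, N = 42, so N/2 = 21 and the median group is 50-54 (L = 49.5, i = 5, f = 10, C = 17):

\[
\text{Median} = 49.5 + \frac{5}{10}\left(\frac{42}{2} - 17\right) = 49.5 + 2 = 51.5
\]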

Note
The median is a position average i.e. it is determined by placing the scores in rank order and establishing
the middle point. Thus the median is a position average that divides the distribution into two equal
halves such that one half is below it and the other half above it.
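
The grouped-data calculation can be sketched in Python as follows. The function name `grouped_median` is an assumption made for this illustration; it takes the classes in ascending order, locates the median group and then applies Median = L + (i/f)(N/2 - C), where L is the real lower limit of the median group.

```python
def grouped_median(classes):
    """Median of a grouped distribution.

    classes: list of (lower, upper, frequency) tuples in ascending order,
    using inclusive score limits such as (30, 34, 1).
    """
    total = sum(freq for _, _, freq in classes)   # N
    half = total / 2                              # N / 2
    below = 0                                     # C: cumulative frequency below the group
    for lower, upper, freq in classes:
        if below + freq >= half:
            size = upper - lower + 1              # i: class size
            real_lower = lower - 0.5              # L: real lower limit
            return real_lower + (size / freq) * (half - below)
        below += freq
    raise ValueError("the distribution has no scores")


statistics_test = [
    (30, 34, 1), (35, 39, 4), (40, 44, 3), (45, 49, 9),
    (50, 54, 10), (55, 59, 8), (60, 64, 4), (65, 69, 3),
]

result = grouped_median(statistics_test)
print(result)
```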

Arithmetic Mean
This is the sum of all scores in a distribution divided by the total number of the scores.
It is the best known and a reliable measure of central tendency. Thus the mean is simply found by
adding all the scores in a distribution and dividing by the total number of scores (N). It is denoted
by \(\bar{X}\), pronounced "X bar".

The formula is:

\[
\bar{X} = \frac{\sum_{i=1}^{N} X_i}{N} = \frac{\text{sum of the scores}}{\text{total number of scores}}
\]

where \(\bar{X}\) is the mean,

\(X_i\) is the raw score for each individual, i.e. the ith person's score,

N is the number of scores, and

∑ is the summation sign indicating we are summing from the first score to the last
score, i.e. all the X-scores in the distribution are added.

Example

Determine the mean of 3, 3, 4, 5, 6, 6, 8, 9 and 10.

The sum of these is 3 + 3 + 4 + 5 + 6 + 6 + 8 + 9 + 10 = 54

N = 9

Therefore, the mean is 54/9 = 6.


For grouped frequency distribution, the mean is obtained as follows:

Each class-mark or mid-point (x_i) is multiplied by its corresponding frequency (f_i). The products are
then summed up and divided by the total frequency to give the mean.

This can be summarized as follows:

\[
\bar{X} = \frac{\sum_{i=1}^{n} f_i x_i}{\sum_{i=1}^{n} f_i} = \frac{\sum_{i=1}^{n} f_i x_i}{N}
\]

where x_i refers to the class-marks or mid-points, f_i to the corresponding frequencies, n to the
number of class intervals and N to the total frequency. The calculation of the mean of a grouped
frequency distribution is illustrated below:

Class interval    Frequency (f_i)    Class-mark (x_i)    f_i x_i
65-69             3                  67                  201
60-64             4                  62                  248
55-59             8                  57                  456
50-54             10                 52                  520
45-49             9                  47                  423
40-44             3                  42                  126
35-39             4                  37                  148
30-34             1                  32                  32
Total             42                                     2154

\[
\bar{X} = \frac{\sum_{i=1}^{n} f_i x_i}{\sum_{i=1}^{n} f_i} = \frac{2154}{42} \approx 51.3
\]
Properties of the mean

1. One important property of the mean is that it is the point in a distribution of scores such
that the summed deviations of scores from it (the mean) are equal to zero. What do we mean
by deviation? Deviation is the difference between a score and the mean, \(X_i - \bar{X}\), and it can be
either positive or negative. In any distribution the sum of deviations about the mean is always
equal to zero, i.e.

\[
\sum_{i=1}^{N} (X_i - \mu) = 0
\]

where μ is the population mean for X-scores and the population has N subjects, and

\[
\sum_{i=1}^{n} (X_i - \bar{X}) = 0
\]

where \(\bar{X}\) is the sample mean and the sample is of size n.

For illustration, let us consider the following scores. Suppose our scores are 3, 3, 4, 5, 6, 6, 8, 9 and 10
(note this can be considered as a population or a sample without any change of the results). The
mean will be 6 and the deviation scores will be 3-6, 3-6, 4-6, 5-6, 6-6, 6-6, 8-6, 9-6 and 10-6, in
general \(X_i - \bar{X}\). These deviations will be respectively -3, -3, -2, -1, 0, 0, 2, 3 and 4 (note their
sum is zero). Thus, the mean may be considered as the exact balance point in a distribution.

2. If we add a constant, say C, to every score in the distribution, the resulting scores will have a
mean \(\bar{X}_{X+C}\) equal to the original mean \(\bar{X}_X\) plus the constant C. If we subtract a constant
instead, the resulting scores will have a mean equal to the original mean minus the constant. Note
that subtracting a constant C is the same as adding -C. Hence the first formula is adequate,
since it includes the second formula, i.e.

\[
\bar{X}_{X+C} = \bar{X}_X + C
\]

\[
\bar{X}_{X-C} = \bar{X}_X - C
\]

Mean, Mode and Median of Frequency Distributions Compared


Normal Curve Distribution
In the normal curve distribution, the mean, median and mode are all equal. The three measures of
central tendency are all located exactly at the center of a normal distribution curve as shown in the
figure below:
[Figure: Normal curve, with the mean, mode and median located at the centre]

Characteristics of the Normal Curve
A bell-shaped symmetrical curve with the peak of the distribution in the centre
Slopes on either side equal to each other
50% of the area on the left and 50% on the right
The mean of the distribution lies in the centre
The mean, mode and median are equal
The area under the curve is one
The curve approaches the horizontal axis but never touches it

Positively skewed distribution

In a positively skewed distribution, the mean is greater than the median and the median is greater
than the mode. For our example, this may mean that most students obtained low marks while there
were extremely few students who got high marks, a situation normally found when a test is too hard.
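The positions of these measures of central tendency on a positively skewed curve are shown below:
[Figure: positively skewed curve, with the mode at the peak, the median to its right and the mean furthest into the right-hand tail]

Mean > Median > Mode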


Negatively skewed distribution

In a negatively skewed distribution, the mode is greater than the median and the median is greater than
the mean. This illustrates a situation where many students have obtained high marks while very few
students have got low marks. This may occur if the test was too easy for most students.

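[Figure: negatively skewed curve, with the mode at the peak, the median to its left and the mean furthest into the left-hand tail]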

Mean < Median < Mode


Summary
The following statements summarize the major points in this unit:
1. The mean, median and mode are measures of central tendency. They give an idea of the average
or typical score in a distribution.
2. The mode is the most frequent score value in a distribution. It is the score with highest frequency
in a distribution.
3. The median is a point in a distribution of scores such that 50 percent of scores are located above
it and the other 50 percent below it.
4. The mean is found by adding all the scores in a distribution and dividing by the total number of
scores.
5. One important property of the mean is that it is the point in a distribution of scores such that the
summed deviations of the scores from it (the mean) is zero.
6. The sum of squared deviations from the mean is less than the sum of squared deviations from any
other point.
7. The mean is generally preferred by statisticians as the measure of central tendency, but the
median is quite often easier to compute, and is therefore preferred by classroom teachers in
a considerable number of cases.
8. For distributions that are fairly normal, it matters little which measure of central tendency is used.

Further Reading
Ingule, F. & Gatumu, H. (1996) Essentials of Educational Statistics. Nairobi: E.A. Educational Publishers.

Glass, G.V. & Stanley, J.C. (1970) Statistical Methods in Education and Psychology. New Jersey: Prentice-Hall.

Johnson, R.R. (1980) Elementary Statistics. Mass.: Duxbury Press.

Smith, G.M. (1970) A Simplified Guide to Statistics for Psychology and Education. New York: Holt, Rinehart and Winston.


Self Test
1. Use the data from unit 2, reproduced below:
49 63 59 44 49 51 62 37 30 49 45 52 50 42 54 32 57

41 42 56 44 46 63 44 40 50 46 53 48 37 46 53 68 36

40 56 37 66 43 40 43 51 59 42 52 46 57

a) Compute the mean, median and mode for the ungrouped data (the ungrouped data was
obtained in the earlier exercise).
b) Compute the mean, median, modal interval and mode for grouped data also found earlier.
c) In terms of the magnitude of mean, median and mode, comment on the distribution of these
scores.

2. Compute the mean, median, modal interval and mode for the grouped data but now using the size
of the class interval as 3, and start with 30-32 as the lowest class interval as done earlier.
LECTURE 5

MEASURES OF VARIABILITY

Introduction
The measures considered in this lecture unit are range, quartile deviation, mean deviation, variance
and standard deviation. Range is the simplest measure of variability (or dispersion) while standard
deviation is the most reliable measure of variability.

Objectives
By the end of the lecture unit, the trainee should be able to:
i. Compute range, quartile deviation, mean deviation, variance and standard deviation for
grouped and ungrouped data using computational and definitional formulae.
ii. Explain the properties of standard deviation (s.d.) (e.g. when a constant is added to all
the scores of the distribution)
iii. Interpret computed values of range, mean deviation, quartile deviation and standard
deviation.

Measures of variability (Dispersion)


Variability refers to the spread of the values that a variable takes in a distribution.
Measures of variability give information about the differences between scores, that is, about how
widely the scores in a distribution are spread. A measure of
central tendency is not enough to describe a distribution because it is also important to know how
the scores in a distribution are spread. Distributions may have the same mean but may differ in
the extent of variation of the scores around that measure of central tendency. To describe any
distribution of scores fully, we need three important elements, i.e.
Measure of central tendency
Measure of variability
Shape of the distribution
Types of measures of variability
Measures of variability (or spread or dispersion) are: range, quartile deviation, mean deviation,
variance and standard deviation.
Range
The range is the difference between the highest score and the lowest score in a distribution,
e.g. for the scores 3, 3, 4, 5, 6, 9, 11:
Range = 11 - 3 = 8

For grouped data,


Range=mid-point of highest interval - mid-point of lowest interval
Range is the simplest measure of variability but it is not a stable measure of variability. It is only
a quick reference to the spread of scores in a distribution.
Quartile deviation (Q.D)
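It is half the difference between the 75th percentile (Q3) and the 25th percentile (Q1) of a distribution. It is also called the semi-interquartile range.
Consider the set of scores
3, 3, 4, 5, 6, 6, 8, 9, 10
The median is 6, so the lower half is 3, 3, 4, 5 and the upper half is 6, 8, 9, 10.
Q1 = Median of lower half = $\frac{3+4}{2} = 3.5$
Q3 = Median of upper half = $\frac{8+9}{2} = 8.5$
Q.D = $\frac{Q_3 - Q_1}{2} = \frac{8.5 - 3.5}{2} = \frac{5}{2} = 2.5$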
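The quartile deviation for ungrouped scores can be sketched in Python as follows. This is a minimal illustration that assumes the median-of-halves rule used in the example (the middle score is left out of both halves when the number of scores is odd) and at least two scores; the function name `quartile_deviation` is hypothetical.

```python
from statistics import median

def quartile_deviation(scores):
    """Semi-interquartile range using the median-of-halves rule."""
    s = sorted(scores)
    half = len(s) // 2
    lower = s[:half]    # scores below the median position
    upper = s[-half:]   # scores above the median position
    q1 = median(lower)
    q3 = median(upper)
    return q1, q3, (q3 - q1) / 2

q1, q3, qd = quartile_deviation([3, 3, 4, 5, 6, 6, 8, 9, 10])
print(q1, q3, qd)  # 3.5 8.5 2.5
```

Other conventions for locating quartiles exist and can give slightly different values for small data sets.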

Quartile deviation for grouped data


Scores    Frequency (f)    C.F
30-34     1                1
35-39     4                5
40-44     3                8
45-49     9                17
50-54     10               27
55-59     8                35
60-64     4                39
65-69     3                42
n = ∑f = 42

Q1 = $\left(\frac{N}{4}\right)$th score (i.e. the score which is a quarter of the way up)
Q3 = $\left(\frac{3N}{4}\right)$th score (i.e. the score which is three quarters of the way up)
Q.D = $\frac{Q_3 - Q_1}{2}$
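The notes give only the positions of Q1 and Q3 (the N/4th and 3N/4th scores). One common way to turn a position into a score value for grouped data is linear interpolation within the class interval, Q = L + ((position - cf below) / f) x i, where L is the lower real limit and i the interval size. The sketch below assumes that interpolation rule, which is not stated in this section; the function name `grouped_quantile` is hypothetical.

```python
# (lower limit, upper limit, frequency) for each class interval in the table
classes = [(30, 34, 1), (35, 39, 4), (40, 44, 3), (45, 49, 9),
           (50, 54, 10), (55, 59, 8), (60, 64, 4), (65, 69, 3)]

def grouped_quantile(classes, fraction):
    """Score lying `fraction` of the way up a grouped distribution (assumed interpolation rule)."""
    n = sum(f for _, _, f in classes)
    position = fraction * n           # e.g. N/4 for Q1
    cf_below = 0                      # cumulative frequency below the current interval
    for lower, upper, f in classes:
        if cf_below + f >= position:
            real_lower = lower - 0.5  # lower real limit of the interval
            width = upper - lower + 1 # size of the class interval
            return real_lower + (position - cf_below) / f * width
        cf_below += f
    raise ValueError("fraction must lie between 0 and 1")

q1 = grouped_quantile(classes, 0.25)
q3 = grouped_quantile(classes, 0.75)
qd = (q3 - q1) / 2
print(round(q1, 2), round(q3, 2), round(qd, 2))  # 45.89 57.31 5.71
```

Under this assumption Q1 falls in the 45-49 interval and Q3 in the 55-59 interval.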

NB: Range and Q.D do not take into consideration each individual score, hence they are not stable
measures of variability.
Mean Deviation ( M. D)
Mean deviation is a stable measure of variability because it considers the spread of each
individual score from the most reliable measure of central tendency, i.e. the mean.

Mean deviation = (sum of all absolute deviations) / (total number of scores)

M.D = $\frac{\sum_{i=1}^{n} |X_i - \bar{X}|}{n}$

Example
Calculate the M.D of the following scores
3, 3, 4, 5, 6, 6, 8, 9, 10

Score (X)    Score - mean, $d = X_i - \bar{X}$ (X - 6)    $|d|$
3            -3                                           3
3            -3                                           3
4            -2                                           2
5            -1                                           1
6            0                                            0
6            0                                            0
8            2                                            2
9            3                                            3
10           4                                            4

Mean = $\frac{3+3+4+5+6+6+8+9+10}{9} = \frac{54}{9} = 6$

Mean deviation = $\frac{\sum_{i=1}^{n} |X_i - \bar{X}|}{n}$

M.D = $\frac{3+3+2+1+0+0+2+3+4}{9} = \frac{18}{9} = 2$

A large value of mean deviation indicates a greater spread in the values of the distribution while
a small value indicates that the values are close together.
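The mean deviation computation can be sketched in a few lines of Python. The function name `mean_deviation` is hypothetical and the data are the nine scores from the example.

```python
def mean_deviation(scores):
    """Average absolute deviation of the scores from their mean."""
    n = len(scores)
    m = sum(scores) / n
    return sum(abs(x - m) for x in scores) / n

md = mean_deviation([3, 3, 4, 5, 6, 6, 8, 9, 10])
print(md)  # 2.0
```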

Standard Deviation (s.d)

The standard deviation is a measure of variability that characterizes a distribution of scores or
any other variable. It indicates how the scores (or the values of any other variable) are spread. Standard
deviation is the most reliable measure of variability. Standard deviation is denoted by S and
variance is denoted by $S^2$.
Formula for S

S = $\sqrt{\frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n}}$, i.e. S = $\sqrt{\text{variance}}$

Example
Use the scores 3, 4, 5, 5, 6, 8, 9 and 10 to compute the variance and the standard deviation.

Mean, $\bar{X} = \frac{3+4+5+5+6+8+9+10}{8} = \frac{50}{8} = 6.25$

$X_i$    $X_i - \bar{X}$    $(X_i - \bar{X})^2$
3        -3.25              10.5625
4        -2.25              5.0625
5        -1.25              1.5625
5        -1.25              1.5625
6        -0.25              0.0625
8        1.75               3.0625
9        2.75               7.5625
10       3.75               14.0625
Sum                         43.5000

Variance, $S_x^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n} = \frac{43.5}{8} = 5.4375$

Standard deviation, $S_x = \sqrt{5.4375} = 2.33$
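The definitional formula can be followed step by step in Python. This is a minimal sketch that divides by n, as in the formula above; the function name `variance_and_sd` is hypothetical.

```python
from math import sqrt

def variance_and_sd(scores):
    """Mean, sum of squared deviations, variance (n divisor) and standard deviation."""
    n = len(scores)
    m = sum(scores) / n
    ss = sum((x - m) ** 2 for x in scores)  # sum of squared deviations
    var = ss / n
    return m, ss, var, sqrt(var)

m, ss, var, sd = variance_and_sd([3, 4, 5, 5, 6, 8, 9, 10])
print(m, ss, var, round(sd, 2))  # 6.25 43.5 5.4375 2.33
```

Python's standard library functions `statistics.pvariance` and `statistics.pstdev` use the same n divisor.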

Standard deviation indicates how the scores (or the values of any other variable) are spread. The bigger the
magnitude of the standard deviation, the bigger is the spread of the scores. The smaller the
magnitude of the standard deviation, the smaller is the spread of the scores.

Properties of standard deviation


1. Adding a constant to, or subtracting a constant from, each of the scores does not affect the standard
deviation.
2. The sum of the squares of the deviations of all scores from their arithmetic mean is a
minimum.
3. If every score of the distribution is multiplied by a constant, the new variance is the original variance
multiplied by the constant squared; hence the new standard deviation is the original s.d. multiplied by the (absolute value of the) constant.
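Properties 1 and 3 can be verified numerically. The sketch below uses the eight scores from the earlier example and a hypothetical constant c = 4; `pstdev` and `pvariance` from Python's standard library divide by n, as the formula in these notes does.

```python
from statistics import pstdev, pvariance

scores = [3, 4, 5, 5, 6, 8, 9, 10]
c = 4  # hypothetical constant chosen for illustration

sd = pstdev(scores)
var = pvariance(scores)

# Property 1: adding a constant leaves the standard deviation unchanged.
sd_shifted = pstdev([x + c for x in scores])

# Property 3: multiplying by a constant multiplies the variance by c squared
# and the standard deviation by c.
var_scaled = pvariance([x * c for x in scores])
sd_scaled = pstdev([x * c for x in scores])

print(round(sd, 2), round(sd_shifted, 2), round(sd_scaled, 2))
print(var, var_scaled)
```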

Summary
The following statements summarize the major points of this lecture topic:
1. The range, variance and standard deviation are measures of variability.
They give an indication of the spread of scores in a distribution.
2. The range is defined as the difference between the highest and the lowest scores in a
distribution.
3. Variance is obtained by dividing the sum of squared deviations by the total number of
observations in the distribution.
4. The standard deviation is the square root of variance.
5. The bigger the standard deviation, the bigger the spread of scores and the more
heterogeneous the group is on which the scores are based.
6. Adding a constant to every score in the distribution has no effect on variance or standard
deviation.
7. When every score in a distribution is multiplied by a constant, the new variance is the original
variance times the constant squared.

Self Test

1. Given the scores 3, 4, 4, 5, 6, 6, 7, 8 and 10; compute


(i) Mean, range, mean deviation, median, variance ($S_X^2$), and standard deviation ($S_X$).
(ii) Add 6 to each score, and recalculate the mean, variance and standard deviation.
(iii) Subtract 5 from each score, and recalculate the mean, variance and standard
deviation.
(iv) Multiply each score by 4, and recalculate the mean, variance and standard
deviation.
2. In case of 1 (ii) to 1 (iv), discuss what happens to:
(i) The whole distribution
(ii) The mean
(iii) The variability (i.e. variance and standard deviation) in relation to the constant
applied.

3.
Class interval    Frequency
65-69             3
60-64             4
55-59             8
50-54             10
45-49             12
40-44             6
35-39             5
30-34             2
N = 50

Using the above data and changing the scale of the class-mark by appropriate manipulation (i.e.
using assumed mean method).
(a) Compute:
(i) The mean
(ii) The variance and standard deviation.

(b) Repeat using the above data, but omitting the top class interval and the bottom class
interval, (N = 45).
(c) Double the frequency of the data and determine how the median and mean,
variance and standard deviation are affected.
(d) Double only the last frequency of the data and determine how the median, mean,
variance and standard deviation are affected.

4. The following scores were obtained by 30 Form I students in a Kiswahili test:
46 31 18 39 40 38
37 19 15 26 14 37
24 41 18 19 21 25
31 10 20 21 32 46
20 30 32 27 31 37
a) Use the above scores to prepare a grouped frequency distribution
using a class interval size 5 and starting with 10-14 as your lowest
class interval.
b) Basing on your grouping in a) above, prepare a complete frequency
distribution table for grouped data having the following columns:
i) Class interval
ii) Tally marks
iii) Frequency
iv) Real (or exact) class limits
v) Classmark (midpoints)
vi) Cumulative frequencies below (less than)
vii) Cumulative frequencies above (more than)
c) For the grouped data, calculate the following:
i) Mode ii) Median iii) Mean
d) Determine the range for the grouped data.
e) Calculate mean deviation for the grouped data.
f) Compute the variance and standard deviation for the grouped data.
g) i) Comment on the performance in this Kiswahili test
using above information.
ii) Describe fully the shape of the distribution, basing your
answer on part (c).

Further Reading

Ingule, F. & Gatumu, H. (1996) Essentials of Educational Statistics. Nairobi: E.A. Educational Publishers.

Glass, G.V. & Stanley, J.C. (1970) Statistical Methods in Education and Psychology. New Jersey: Prentice-Hall.

Johnson, R.R. (1980) Elementary Statistics. Mass.: Duxbury Press.

Smith, G.M. (1970) A Simplified Guide to Statistics for Psychology and Education. New York: Holt, Rinehart and Winston.
JULY, 2021
LECTURE 8: INTERPRETATION OF SCORES
Introduction
In norm-referenced tests (NRT), where we consider the relative position of a score, the
transformation of scores into percentile ranks, standard scores (z-values), standardized scores or
normalized scores is justifiable. Such manipulation (transformation) is not justifiable in
criterion-referenced tests (CRT). A score in a CRT is meaningful in itself, because it conveys information
about how well a candidate has mastered what is being tested.
Objectives

By the end of the unit, the learner should be able to:


i. Define percentile rank, percentile point, standard scores (z-value), standardized scores and
normalized scores.
ii. Compute z-values, percentile ranks given normalized z-values
iii. Convert z-scores into standardized scores or normalized z-values to normalized scores

Interpretation of Scores
A test mark or score is the sum of the marks for the correctly answered questions. Such a score is called
the raw score of an individual in a particular test. Raw scores may not be comparable across tests,
e.g. a raw score of 70 on Test I and a raw score of 50 on Test II. It is not easy to assess relative
performance on the two tests if the distributions for the two tests have quite different shapes.
A raw score is also hard to interpret on its own, because it does not show how an
examinee's performance compares with that of other examinees. Raw scores are transformed to interpretable scores using percentiles,
standard and standardized scores, and normalized scores.
Percentile Ranks
The percentile rank (or percentile score) is defined as the percentage of scores which fall at or below
a given score. For example, if the percentile rank for a score is 85, it means that 85 per cent of the scores
in the total distribution fall at or below the score. The formula used for computing the percentile rank
is:

Percentile rank = $\frac{\left(cf_b + \frac{f_w}{2}\right) \times 100}{N}$

Where

$cf_b$ = cumulative frequency (below) for the interval immediately below the interval containing the score of interest.

$f_w$ = the number of scores within the interval containing the score of interest. This is equal to the number of examinees obtaining the score of interest.

N = the total number of subjects in the distribution, i.e. the total number of examinees.

Let us consider the score distribution in the table below. For each score value, the frequency (number)
of examinees obtaining the score appears in the second column. The third column contains the
cumulative frequency for each score value, which is the number of examinees who have a score less than
or equal to that score value; the $cf_b$ for a score is therefore the cumulative frequency of the next lower score value. Let us use this observed test-score distribution to estimate the percentile
rank for the score value 5 (that is, the percentile rank for an individual who gets 5).

Score value ($X_i$)    Frequency ($f_w$)    Cumulative frequency (cf)    Percentile rank
11                     1                    10                           95
10                     1                    9                            85
8                      1                    8                            75
7                      2                    7                            60
6                      1                    5                            45
5                      2                    4                            30
4                      1                    2                            15
3                      1                    1                            5

Percentile rank for score value 5 = $\frac{\left(cf_b + \frac{f_w}{2}\right) \times 100}{N} = \frac{\left(2 + \frac{2}{2}\right) \times 100}{10} = 30$
The percentile ranks for the other scores in the distribution are provided in the fourth column. With this
formula, the percentile rank of a score is always greater than zero and less than 100 (i.e. 0 < percentile
rank < 100).
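The percentile rank formula can be applied to every score value at once. The following Python sketch rebuilds the ten scores from the frequency column of the table; the function name `percentile_ranks` is hypothetical.

```python
from collections import Counter

def percentile_ranks(scores):
    """Percentile rank of each distinct score: 100 * (cf_b + f_w / 2) / N."""
    freq = Counter(scores)
    n = len(scores)
    ranks = {}
    cf_below = 0  # cumulative frequency below the current score value
    for value in sorted(freq):
        f_w = freq[value]
        ranks[value] = 100 * (cf_below + f_w / 2) / n
        cf_below += f_w
    return ranks

# The ten scores implied by the frequency column of the table
scores = [3, 4, 5, 5, 6, 7, 7, 8, 10, 11]
pr = percentile_ranks(scores)
print(pr[5])   # 30.0
print(pr[11])  # 95.0
```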

Standard Scores

A standard score indicates the relative position of a score in a distribution in terms of the
number of standard deviations from the mean. To get a standard score corresponding to any
raw score, the mean of the raw scores is subtracted from the raw score, and the result is
divided by the standard deviation of the distribution (or the raw scores).
Standard scores are also called z scores or z values. Therefore:

$z = \frac{x - \mu}{\sigma}$

or

$z = \frac{x - \bar{x}}{s}$

where μ or $\bar{x}$ is the mean of the distribution while s or σ is the standard deviation of the
distribution under consideration, as observed earlier.

Converting raw scores to standard scores

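Individual    Score ($x_i$)    $(x_i - \bar{x})$    z score
A             3                -7                   -1.11
B             6                -4                   -0.63
C             7                -3                   -0.47
D             9                -1                   -0.16
E             15               5                    0.79
F             20               10                   1.58
Sum           60               0.00                 0.00
Mean          10               0.00                 0.00
s.d.          6.32             6.32                 1.00

Note: the s.d. of 6.32 in this table is obtained with n - 1 in the denominator, i.e. $\sqrt{200/5}$. Dividing by n, as in the variance formula of Lecture 5, would give $\sqrt{200/6} = 5.77$.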

Converting scores to standard scores using the formula above automatically puts the transformed scores
(standard scores) into a new scale with a mean of zero (0) and a standard deviation (s.d.) of one (1). Each
transformed score indicates how many standard deviations the raw score lies from the mean. For instance, a
standard score 1.5 (z = 1.5) indicates that the corresponding raw score lies 1.5 standard deviations above
the mean. If the standard score is equal to –1.5 (i.e. z = -1.5), the corresponding raw score lies 1.5 standard
deviations below the mean.
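The conversion table above can be reproduced with a short Python sketch. It assumes the sample standard deviation (n - 1 divisor, `statistics.stdev`), because that is the convention that gives the tabulated value of 6.32 for these six scores.

```python
from statistics import mean, stdev

scores = {"A": 3, "B": 6, "C": 7, "D": 9, "E": 15, "F": 20}
values = list(scores.values())

m = mean(values)   # 10
s = stdev(values)  # sample s.d. (n - 1 divisor), about 6.32

# z = (x - mean) / s.d. for each individual
z = {name: (x - m) / s for name, x in scores.items()}

for name, value in z.items():
    print(name, round(value, 2))
```

The z scores sum to zero, and their standard deviation (computed with the same divisor) is one.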

Converting raw scores to standard scores (z scores) has no effect on the shape of the distribution. If the
original distribution of the raw scores is skewed to the right, the distribution of corresponding standard
scores will also be skewed to the right. If the original raw-score distribution is normal, the distribution of
the corresponding standard scores will also be normal.

Note that percentile ranks are ordinal measures while standard scores are interval measures.

Standard scores may not be easily interpreted by ordinary people who have no knowledge of
what the mean and standard deviation are. Standard scores may be positive or negative.
Standardized Scores
Standardized scores are linear transformations of raw scores, but unlike the standard scores, they are
always expressed as whole numbers and are non-negative. Any set of standard scores can be transformed
to an arbitrary mean, $\mu_s$, and standard deviation, $\sigma_s$, by applying the formula:

$Y = \mu_s + \sigma_s z$

Where z is the standard score and Y is the standardized score. An example of standardized scores is the
T score (commonly referred to as ‘linear T scores’). The T score has a mean of 50 and standard deviation
of 10. The formula for a T score is:

T = 50 + 10z

To obtain a T score, the z score is multiplied by 10 and 50 is added to the product. If we start with a raw
score, to obtain the equivalent T score; calculate the value of z for the raw score, multiply the z value by
10 and then add 50. These operations are summarized in the formula:

$T = \frac{(x_i - \bar{x})}{s} \times 10 + 50$

or $T = \frac{(x_i - \mu)}{\sigma} \times 10 + 50$

T scores are always whole numbers, and if the value obtained is not a whole number it has to be rounded
to the nearest whole number. T scores are also non-negative and usually greater than zero.
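The T score formula can be sketched in Python as follows. The function name `t_score` is hypothetical, and the mean of 10 and standard deviation of 6.32 are taken from the conversion table in the Standard Scores section.

```python
def t_score(x, mean, sd):
    """Linear T score: T = 50 + 10z, rounded to the nearest whole number."""
    z = (x - mean) / sd
    return round(50 + 10 * z)

print(t_score(3, 10, 6.32))   # 39
print(t_score(10, 10, 6.32))  # 50
print(t_score(20, 10, 6.32))  # 66
```

Note that Python's built-in `round` sends exact halves to the nearest even integer, which may differ from the rounding-up convention used by hand.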

It can be noted from the formula above that standardized scores are linear transformations of standard
scores, since the mean and standard deviation of the standardized scores (50 and 10 respectively for
linear T scores) are the constants we apply to obtain standardized scores. Since standard scores are a
linear transformation of raw scores, it follows that standardized scores are also a linear transformation of
raw scores.
Normalization

A raw-score distribution or a distribution obtained from a linear transformation rarely has an exact
statistical meaning. Raw scores or their linear transformation distributions are changed so as to obtain a
normal distribution of scores by performing normalization. All normalized scores have a normal
distribution. On a normalized distribution, every score has a concise statistical meaning as a result. The
percentage of individuals above and below each score is known exactly on a scale with a known mean and
standard deviation unit of measurement.

Normalization involves forcing the distribution of transformed scores to be as close as possible to a normal
distribution by smoothing out, stretching, or condensing irregularities and departures from normality in
the raw-score distribution.

Normalized T scores (as distinct from the linear T scores described earlier) are an example of normalized scores. Normalized scores are whole numbers.
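One common way of carrying out normalization is to convert each percentile rank into the z value that cuts off the same percentage of a normal distribution, and then to express that z value as a T score. The sketch below assumes this percentile-based approach, which is not spelled out in this section; the function name `normalized_t` is hypothetical, and `NormalDist` is part of Python's standard library (version 3.8 or later).

```python
from statistics import NormalDist

def normalized_t(percentile_rank):
    """Normalized T score for a percentile rank strictly between 0 and 100."""
    # z value below which the given percentage of a normal distribution falls
    z = NormalDist().inv_cdf(percentile_rank / 100)
    return round(50 + 10 * z)

for pr in (5, 30, 50, 95):
    print(pr, normalized_t(pr))
```

A percentile rank of 50 corresponds to z = 0 and therefore to a normalized T score of 50.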
SUMMARY
The principal ideas, conclusions, implications presented in this unit are summarized in the
following statements:
1. Raw scores can be interpreted in a meaningful manner after conversion into
transformed scores in norm-referenced tests.
2. Common forms of expressing transformed scores are percentiles, standard and
standardized scores and normalized scores
3. The percentile is defined as the percentage of scores falling at or below a given score.
The primary advantages of percentile are that they are straightforward to calculate
and that they are easy to interpret.
4. To get a standard score corresponding to any raw score, the mean of the raw scores is
subtracted from the raw score and the result is divided by the standard deviation of the
distribution. A disadvantage of standard scores is that they are often negative and
expressed as decimals.
5. Standardized scores are linear transformations of standard scores and are always
expressed as whole numbers and are non-negative. Linear T scores are examples of
standardized scores and they (linear T scores) have a mean of 50 and standard
deviation of 10.
6. The transformation to normalized scores involves forcing the distribution of
transformed scores to be as close as possible to a normal distribution by smoothing
out irregularities and departures from normality in the raw-score distribution.
REFERENCES

Allen, M.J. & Yen, W.M. (1976) Introduction to Measurement Theory. Belmont, California: Wadsworth Inc.

Brown, F.G. (1970) Principles of Educational and Psychological Testing. 2nd ed. New York: Holt, Rinehart & Winston.

Ingule, F. & Gatumu, H. (1996) Essentials of Educational Statistics. Nairobi: E.A. Educational Publishers.

Mehrens, W.A. & Lehmann, I.J. (1978) Measurement and Evaluation in Education and Psychology. New York: Holt, Rinehart and Winston.

Self Test

The following is data for 3 students on three tests. Along with these test scores, the mean ($\bar{X}_i$)
and standard deviation ($S_i$) for the scores are given:

Test         Student 1        Student 2        Student 3        Mean                s.d.
Biology      $X_{11}$ = 35    $X_{12}$ = 22    $X_{13}$ = 20    $\bar{X}_1$ = 30    $S_1$ = 4
Chemistry    $X_{21}$ = 23    $X_{22}$ = 14    $X_{23}$ = 28    $\bar{X}_2$ = 21    $S_2$ = 3
Physics      $X_{31}$ = 39    $X_{32}$ = 24    $X_{33}$ = 20    $\bar{X}_3$ = 28    $S_3$ = 5

(i) By converting X13 in Biology and X33 in physics into z scores, find out whether student
3 had done better in physics or biology? What assumptions are you making here for
the comparison to be justified?
(ii) What is the mean z score for student 1 on all the three tests?
(iii) What is the mean linear T score for student 2 on all the three tests?
1. Discuss why raw scores have to be converted to standard scores and normalized scores?
2. Distinguish between standardized scores and normalized scores. When are standardized and
normalized equal?
3. (i) (a) Define the term percentile rank.
(b) Define term percentile point
(ii) Given the following distribution:

Score ($X_i$)        1    2    3    4    5     6     7     8     9
Frequency ($f_i$)    1    3    5    9    12    22    23    16    9

Determine the percentile rank corresponding to each score. Transform to T scores.

4. Given that a raw score distribution on a given test is normal with mean 48 and variance 4,
complete the table below with the relevant values of z scores, percentile ranks and
T scores.

Raw score    z score    Percentile rank    T score
52
50
48
44
JUNE, 2021
LECTURE 6
MEASURES OF RELATIONSHIP
Introduction
The relationship or association between two variables is an important concept in research and
other studies. If the relationship is known and is strong enough, it allows one variable to be
predicted from the other.

Objectives
By the end of the lecture, the learner should be able to:
1. Explain two methods of studying a relationship, one with stringent requirements
(assumptions) and the other with less stringent requirements.
2. Compute, given two sets of data for a group, the two indexes (measures) of relationship i.e.
Pearson product moment correlation coefficient and Spearman rank order correlation
coefficient.
3. Interpret the computed value of the relationship
4. Draw a scatter diagram (also called scatter-plot or scatter-gram) and describe relationship it
portrays in simple terms.
5. Give the properties of the indices e.g. what happens to the relationship index when the scores
are linearly transformed.

Measures of Relationship
Measures of relationship show the relationship between two variables and the strength of the
relationship. To show the relationship between two variables, the following methods are used
(i) Scatter diagram or scatter graph
(ii) Correlation coefficient

Scatter Diagram
A scatter diagram is a graph of data points that shows the relationship between two variables.
Example (Case I)
The table below shows the scores in Maths and Physics for 6 Form II sections.
Maths scores (x)      42    54    66    78    100    120
Physics scores (y)    81    45    55    42    97     77
Plot a scatter diagram for the data.

[Figure: Scatter diagram showing relationship between maths and science scores for Case I; x-axis: Maths scores (0 to 140), y-axis: Science scores (0 to 100)]

Case II
Maths scores (x)      42    54    66    78    100    120
Science scores (y)    81    88    93    99    109    125

[Figure: Scatter diagram showing relationship between maths and science scores for Case II; x-axis: Maths scores (0 to 140), y-axis: Science scores (0 to 140)]

Case I
The scatter diagram shows that there is no systematic relationship between the two variables.
Case II
The scatter diagram shows a clear pattern in the points. The pattern suggests a highly positive
relationship, i.e. as Maths scores increase, there is a corresponding increase in science scores.
Note: A scatter diagram does not provide a precise measure of relationship.
The following methods provide a precise measure of relationship: covariance, Pearson product
moment correlation coefficient and Spearman rank correlation coefficient.
Covariance
This is a measure of relationship and it shows the degree of relationship between two variables
by use of a simple averaging procedure,
i.e. Covariance $(x, y) = \frac{\sum (x - \bar{x})(y - \bar{y})}{n - 1}$

e.g. Calculate the covariance (x, y)

x    y     $x - \bar{x}$    $y - \bar{y}$    $(x - \bar{x})(y - \bar{y})$
7    40    -1               -10              10
8    50    0                0                0
9    60    1                10               10
Sum                                          20

$\bar{x} = \frac{7+8+9}{3} = 8$ and $\bar{y} = \frac{40+50+60}{3} = \frac{150}{3} = 50$

Cov $(x, y) = \frac{20}{3-1} = \frac{20}{2} = 10$
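The covariance calculation can be sketched in Python. The function name `covariance` is hypothetical; it divides by n - 1, as in the formula above.

```python
def covariance(xs, ys):
    """Covariance of paired scores, with n - 1 in the denominator."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    products = [(x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)]
    return sum(products) / (n - 1)

cov = covariance([7, 8, 9], [40, 50, 60])
print(cov)  # 10.0
```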

PEARSON PRODUCT MOMENT CORRELATION COEFFICIENT


The correlation coefficient shows the strength of a relationship between two variables and the
nature (direction) of the relationship, i.e. either
a very strong relationship, positive or negative,
or a strong relationship, positive or negative,
or a moderate relationship, positive or negative,
or a weak relationship, positive or negative.
Also, $-1 \le r_{xy} \le +1$

$r_{xy} = \frac{N\sum xy - (\sum x)(\sum y)}{\sqrt{[N\sum x^2 - (\sum x)^2][N\sum y^2 - (\sum y)^2]}}$
Example:
Calculate the product moment correlation coefficient between x and y
x         y         xy          $x^2$            $y^2$
50        45        2250        2500             2025
49        50        2450        2401             2500
30        25        750         900              625
11        10        110         121              100
10        15        150         100              225
∑x=150    ∑y=145    ∑xy=5710    ∑$x^2$=6022    ∑$y^2$=5475

$r_{xy} = \frac{n\sum xy - (\sum x)(\sum y)}{\sqrt{[n\sum x^2 - (\sum x)^2][n\sum y^2 - (\sum y)^2]}}$

$= \frac{5 \times 5710 - 150 \times 145}{\sqrt{[5 \times 6022 - (150)^2][5 \times 5475 - (145)^2]}}$

$= \frac{28550 - 21750}{\sqrt{[30110 - 22500][27375 - 21025]}}$

$= \frac{6800}{\sqrt{7610} \times \sqrt{6350}}$

$= \frac{6800}{87.235 \times 79.687}$

$= \frac{6800}{6951.51} = 0.978$

Interpretation
There is a very strong positive relationship between the two variables x and y.
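The computational formula for the Pearson coefficient translates directly into Python. This is a minimal sketch using the five pairs of scores from the example; the function name `pearson_r` is hypothetical.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson product moment correlation using the raw-score formula."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator

r = pearson_r([50, 49, 30, 11, 10], [45, 50, 25, 10, 15])
print(round(r, 3))  # 0.978
```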
Spearman Rank Order Correlation Coefficient
Denoted by $r_1$ because it is an approximation of $r_{xy}$

$r_1 = 1 - \frac{6\sum D^2}{N(N^2 - 1)}$

Where
D = difference in ranks for each pair of scores
N = number of pairs of scores
$r_1$ is based on the ranks of the scores and not on the scores themselves

x     Rank of x    y     Rank of y    D (difference in ranks)    $D^2$
50    5            45    4            1                          1
49    4            50    5            -1                         1
30    3            25    3            0                          0
11    2            10    1            1                          1
10    1            15    2            -1                         1
$\sum D^2$ = 4

$r_1 = 1 - \frac{6\sum D^2}{N(N^2 - 1)} = 1 - \frac{6 \times 4}{5 \times 24} = 1 - \frac{24}{120} = 1 - \frac{1}{5} = 1 - 0.2 = 0.8$
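The rank-difference formula can be sketched in Python as follows. The helper names `ranks` and `spearman` are hypothetical, and the ranking helper assumes there are no tied scores, as in this example.

```python
def ranks(values):
    """Rank from the smallest score (rank 1) upward; assumes no tied values."""
    order = sorted(values)
    return [order.index(v) + 1 for v in values]

def spearman(xs, ys):
    """Spearman rank order correlation: 1 - 6 * sum(D^2) / (N * (N^2 - 1))."""
    n = len(xs)
    d_squared = sum((rx - ry) ** 2 for rx, ry in zip(ranks(xs), ranks(ys)))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))

x = [50, 49, 30, 11, 10]
y = [45, 50, 25, 10, 15]
print(ranks(x))  # [5, 4, 3, 2, 1]
print(ranks(y))  # [4, 5, 3, 1, 2]
rho = spearman(x, y)
print(round(rho, 2))  # 0.8
```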

Assumptions underlying Pearson Product moment Correlation Coeff (rxy)

• The relationship between x and y is linear
• The two distributions have similar shapes
• The points in the scatter diagram are uniformly spread, i.e. homoscedastic
• The data is based on an interval scale of measurement
Assumptions of Spearman rank order correlation coefficient
• The measures of the two variables are ordinal
Interpretation

Spearman's coefficient, ρ (rho), written $r_1$ above, is interpreted in the same way as rxy. The value of rho can never be less than -1 nor greater
than +1. It equals +1 only if each person has exactly the same rank on both X and Y. It is -1 if there
are no ties and the order is completely reversed for the two variables, such that the first on one variable is the last on the
other, and so forth.

Note

1. Although the Spearman correlation coefficient formula is simpler and does not look much like
the computational formula we used for Pearson correlation coefficient, it is algebraically
equivalent to the Pearson when it is used with ranked data instead of the interval data.
2. Tie places are easily handled by assigning the mean value of ranks to each of the tie holders.
3. If a very large number of ties occur, however you would probably be wise to reconsider the use
of Spearman (rho) coefficient, other non-parametric methods such as Kendall’s tau or chi-square
may be more appropriate.
4. Ranking can be done from the smallest or largest and so forth as long as you stick to the
convention you use to the end.
5. If there are no ties in the data, the Spearman coefficient is simply what one obtains by replacing the
observations by their ranks and then computing the Pearson product moment correlation
coefficient of the ranks.
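Notes 2 and 5 can be illustrated together. The sketch below assigns tied scores the mean of the ranks they occupy, and checks that the Pearson coefficient computed on ranks reproduces the value of 0.8 obtained earlier for the untied example. The helper names and the small tied data set are hypothetical.

```python
from math import sqrt

def average_ranks(values):
    """Rank from the smallest (1) upward; tied values share the mean of their ranks."""
    order = sorted(values)
    result = []
    for v in values:
        first = order.index(v) + 1  # lowest rank occupied by this value
        count = order.count(v)      # number of scores tied at this value
        result.append(first + (count - 1) / 2)
    return result

def pearson_r(xs, ys):
    """Pearson coefficient from deviations about the means."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# No ties: Pearson on ranks equals the Spearman value of 0.8 found earlier.
x = [50, 49, 30, 11, 10]
y = [45, 50, 25, 10, 15]
rho_no_ties = pearson_r(average_ranks(x), average_ranks(y))
print(round(rho_no_ties, 2))  # 0.8

# Ties: the two scores of 12 occupy ranks 2 and 3, so each receives 2.5.
tied = average_ranks([12, 15, 12, 9])  # hypothetical scores
print(tied)  # [2.5, 4.0, 2.5, 1.0]
```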
Summary
1. When two measures are related, the term correlation is used to describe this fact.
2. Correlation has two distinctions: correlation that merely describes presence or absence of
relationship and correlation, which shows the degree of magnitude of relationship.
3. A study of correlation to determine presence or absence of relation can be done through logical
examination of data and examination of scatter diagrams. Methods used to provide indices of
the magnitude of relationship include covariance, Pearson product-moment correlation
coefficient and Spearman rank-order correlation coefficient.
4. The measure of correlation assumes only values between –1 and +1.
5. If the larger values (scores) of X tend to be paired with larger values (scores) of Y, and hence the
smaller values (scores) of X and Y tend to be paired together, then the measure of correlation
should be positive and close to +1. If the tendency is strong, then we would speak of a positive
correlation between X and Y.
6. If the large values of X tend to be paired with the smaller values of Y, and vice versa, then the
measure of correlation should be negative and close to –1. If the tendency is strong, then we
say that X and Y are negatively correlated.
7. If the values of X seem to be randomly paired with the values of Y, the measure of correlation
should be fairly close to zero. We then say that X and Y are uncorrelated, i.e. they have no
(or zero) correlation.
8. Adding a constant to every score, or multiplying every score by a constant, in the two distributions has no effect on the size
of the correlation coefficient.
9. In order to use rxy, the relationship between the two variables should be linear, the two
distributions must be similar, the variance of the two distributions should be identical
(homoscedastic) and data should be based on interval scale of measurement.
10. When measurement is based on ordinal data, the Spearman rank order correlation coefficient, ρ
(rho), should be used. The Spearman rank order correlation coefficient can be interpreted in the
same way as rxy.
Further Reading

Conover, W.J. (1980) Practical Non parametric Statistics. New York: John Wiley & Sons Inc.

Ingule, F. & Gatumu, H. (1996) Essentials of Educational Statistics. Nairobi: E.A. Educational Publishers.

Glass, G.V. & Stanley, J.C. (1970) Statistical Methods in Education and Psychology. New Jersey: Prentice-Hall.

Johnson, R.R. (1980) Elementary Statistics. Mass.: Duxbury Press.

Smith, G.M. (1970) A Simplified Guide to Statistics for Psychology and Education. New York: Holt, Rinehart and Winston.

Siegel, S. (1956) Non parametric Statistics for Behavioral Sciences. New York: McGraw-Hill Inc.

Self Test

1. The following scores were obtained when a group of 11 students were tested on two tests, test
A and test B.

Examinee    Test A (X)    Test B (Y)
1           2             2
2           2             3
3           4             4
4           5             4
5           3             5
6           6             5
7           4             6
8           5             6
9           6             7
10          8             8
11          7             9

(a) Plot a scatter diagram for the above data (use graph paper).
(b) Compute the Pearson product moment correlation coefficient, rxy between tests A
and B for this group of 11 examinees.
(c) Interpret your computed value of rxy.
(d) State the assumption underlying this correlation analysis.
(e) Compute the Spearman correlation coefficient, ρ (rho), for the above data.
(f) What are the major differences between these two measures of relationship (i.e.
between Pearson and Spearman correlation coefficients)?
2. Suppose the following were scores of a small class in two tests, test A and test B. Test A is taken
as variable X while test B is taken as variable Y.

Student    Test A (X)    Test B (Y)
John       5             4
Mary       6             6
Peter      5             5
Ali        3             2
Juma       2             3
James      3             4

(a) Plot a scatter diagram for the above test scores.


(b) Compute the Pearson product-moment correlation coefficient rxy between test A and test
B for this class. Interpret the value of rxy.
(c) Compute the Spearman rank order correlation coefficient, ρ (rho), for this class in the two
tests.
(d) Which one of the two correlation coefficients would you prefer for this data? Give
reasons for your choice.
3. (a) Determine if there is a logical relationship between X and Y using the data
in the table below.
X 3 4 7 9 11 15 20
Y 5 4 6 4 10 5 9

(b) By means of a scattergram, say the kind of relationship between X and Y in the
above data.
MACHAKOS UNIVERSITY
CENTER OF OPEN, DISTANCE AND e-LEARNING

IN COLLABORATION WITH
SCHOOL OF EDUCATION

DEPARTMENT OF PSY/SNE

EPS400: EDUCATIONAL TESTING AND MEASUREMENT

WRITTEN BY:
ROSEMARY MULE
Copyright © Machakos University, 2021
All Rights Reserved

JULY, 2021
LECTURE 11:
QUALITY OF A TEST
Introduction
The two important qualities of a test (measurement or instrument) are reliability and validity.
We consider reliability first, even though validity is the more important of the two.
Objectives
By the end of the lecture, the learner should be able to:

1. Define reliability
2. Describe the 3 methods commonly used for estimating reliability.
3. Explain factors that may influence (increase or decrease) reliability coefficient of a test (or
instrument).
4. Define validity
5. Differentiate between reliability and validity
6. State the 3 kinds of validity.
7. Distinguish among all the 3 kinds of validity namely content validity, criterion-related validity
and construct validity.

Quality of a test

There are certain qualities that every measurement device (test or questionnaire) should possess.
The measurement (or test) should be:
Reliable
Valid and
Scored accurately and objectively

Reliability of a Test
Reliability refers to the accuracy of the measurement (scores) provided by a test. It is the degree
of consistency between two measures of the same kind (test). A test must therefore measure
consistently if it is going to be reliable, i.e. an individual should obtain approximately the same
mark on another administration of the same test. The degree of consistency of a test is referred to
as the reliability coefficient of the test and is calculated using the Pearson product moment
correlation coefficient (r).

Methods of Estimating Reliability


1. Test-retest method
This is testing the same examinees twice with the same test at different times or on different
days, then correlating the scores from the two administrations of the test.
2. Parallel form method
This is administering two parallel tests to the same group of individuals on the same day and
correlating the scores to obtain the coefficient of reliability.
Tests are parallel if the corresponding items on the tests have equal content, means
and standard deviations.
3. Split half method
This method involves using one test and dividing it into two parts, then correlating the
scores of the two halves. The correlation only estimates the reliability of one half, which is
denoted $r_{\frac12\frac12}$. The reliability of the whole test is obtained by correcting it with
the following formula:

$$r_{xx} = \frac{2\,r_{\frac12\frac12}}{1 + r_{\frac12\frac12}}$$

where $r_{xx}$ = reliability of the whole test and $r_{\frac12\frac12}$ = reliability of the half test.

e.g. if $r_{\frac12\frac12} = 0.8$, then the reliability of the whole test is

$$r_{xx} = \frac{2 \times 0.8}{1 + 0.8} = \frac{1.6}{1.8} \approx 0.889$$

The test can be split using the odd/even method, or into upper and lower halves in terms of item
numbers.
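The split-half procedure can be sketched in a few lines of code. This is a minimal illustration under stated assumptions: the response matrix is invented, and the function names pearson_r and spearman_brown are chosen here for convenience. The correction is written in the general Spearman-Brown form for a test lengthened k times, which reduces to the split-half correction when k = 2.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson product-moment correlation of two equal-length score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def spearman_brown(r_half, k=2):
    """Estimated reliability of a test k times as long (k=2 for split halves)."""
    return k * r_half / (1 + (k - 1) * r_half)


# Hypothetical item scores (1 = correct, 0 = wrong):
# one row per examinee, one column per item.
responses = [
    [1, 1, 1, 1, 1, 1],
    [1, 1, 1, 0, 1, 0],
    [1, 0, 1, 1, 0, 0],
    [0, 1, 0, 0, 1, 0],
    [0, 0, 0, 0, 0, 0],
]

# Odd/even split: items 1, 3, 5 form one half and items 2, 4, 6 the other.
odd_totals = [sum(row[0::2]) for row in responses]
even_totals = [sum(row[1::2]) for row in responses]

r_half = pearson_r(odd_totals, even_totals)   # reliability of a half test
r_full = spearman_brown(r_half)               # corrected to full length

print(round(r_half, 3), round(r_full, 3))
print(round(spearman_brown(0.8), 3))          # the worked example: 1.6 / 1.8
```

Because the corrected value is always larger than a positive half-test correlation below 1, the uncorrected correlation underestimates the reliability of the full-length test.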
Reliability may be influenced by the following factors:
1. Test Length
Adding more items, provided that they are equally reliable, will increase the reliability of the
test. Thus an unreliable test can often be made more useful by increasing the number of items.
With all other things being constant, a test with more items is more reliable than a test that
has few items. The test should neither be too short nor unreasonably long.

2. Speed/Power test
A test is considered a pure speed test if everyone who reaches an item gets it right but no one
has time to finish answering all the items, while a test is considered a power test if everyone has
time to try all the items. A power test yields a better estimate of reliability than a speed test.
3. Group Homogeneity
Other things being equal, the more heterogeneous the group, the higher the reliability
coefficient.
4. Difficulty of the items
A test so easy that almost everyone gets all items correct, or so hard that almost everyone
gets all the items wrong, yields low reliability.
5. Objectivity
The more subjectively a measure is scored, the lower the reliability of the measure. Thus
objective tests are more reliable than subjective tests, all other things being equal.
Validity
Validity of a test refers to the ability of the test to measure what it purports to measure. It refers
to the relevance of a test in testing what it is supposed to test.
When testing validity, the reliability of the test should first be known; it is then established
whether the test measures what it was constructed to measure.
Types of Test Validity
The four types of validity commonly used are:
Face validity
Content validity
Criterion related validity
Construct validity
1. Face Validity
This basically means that a test appears valid on the face of it. The test appears to be testing the
trait it is constructed to test.

2. Content Validity
Content validity refers to how adequately a test covers a specific field of study or content in the
relevant domain.
It is the extent to which the items in the test are representative of the total population of items,
e.g. the topics from which the test items should be sampled.
3. Criterion Related Validity
Criterion-related validity refers to an empirical study of the relationship between the scores of a
test and an external criterion variable or measure. It is used when the scores of a test can be
related to a criterion measure. The criterion refers to some behavior that the scores of a test are
used to predict, e.g. KCSE grades are a predictor variable while job effectiveness or performance in
university are criterion variables.
There are two types of criterion-related validity:
• Concurrent validity
• Predictive validity
These two differ only in regard to time. In concurrent validity, the criterion data are gathered at
the same time as the test scores, while in predictive validity the criterion data are gathered at a
later date, so that the test scores are used to predict future behavior. For example, if KCSE results
are used to predict performance in the first-year university examinations, then KCSE constitutes the
test and the university exams provide the criterion. In concurrent validity, the purpose is to
determine whether a test can be substituted for another test.

4. Construct Validity
A test's construct validity is the degree to which the test measures the theoretical construct or
trait that it was designed to measure. A construct refers to a factor or trait and may be any domain
of knowledge, e.g. verbal ability or mathematical ability. Any skill or ability can be
regarded as a construct. These skills or abilities cannot be measured directly. To measure
them we need to define them and then test them.
Summary
Reliability has to do with consistency. Unless a test measures consistently, it cannot be reliable.
A test is reliable if its observed scores are highly correlated with its true scores
We explored three commonly used methods for estimating reliability coefficient:

1. Equivalent form:
Reliability coefficient is the correlation between observed scores on two equivalent tests
(also called parallel or alternate).
2. Test-retest:
This is testing the same examinees twice with the same test at different times or on different days,
then correlating the scores from the two administrations of the test.

3. Internal consistency or split half and correcting for full test using Spearman-Brown
prophecy formula.

4. Validity of a test refers to the ability of the test to measure what it purports to measure.

5. The four types of validity commonly used are:

Face validity
Content validity
Criterion related validity
Construct validity

Further reading:
Allen M.J. & Yen W.M. (1976) Introduction to Measurement Theory.

Belmont California: Wadsworth Inc.

Brown, F.G. (1970) Principles of educational and psychological testing. 2nd ed.

New York: Holt, Rinehart & Winston.

Mehrens, W.A. & Lehmann, I.J. (1978) Measurement and Evaluation in Education and

Psychology. New York: Holt, Rinehart and Winston.

Self test
1. a) Name three test properties that influence test reliability
b) Discuss how these properties can be manipulated to increase test reliability

2. Describe the three commonly used methods of estimating reliability coefficient.

3. Distinguish between:

i. Predictive validity and construct validity


ii. Face validity and concurrent validity
iii. Construct and concurrent validities
4. What do you understand by the following terms?
i. Validity
ii. Reliability
5. Discuss the similarities and differences of validity and reliability as they pertain to a test.


LECTURE 12: ITEM ANALYSIS
Introduction

Statistical analysis of test items is known as item analysis. It deals with the difficulty and the
discrimination index of each item.

Objectives

By the end of the lecture, the learner should be able to:

1. Define:
a) Item analysis
b) Difficulty index of an item
c) Discrimination index of an item.
2. Determine difficulty index and discrimination index of a test item
3. State the desirable limits for
a) Difficulty index
b) Discrimination index

Item Analysis
A test is only as good as the items it contains. Thus, when constructing a test, we must be
concerned with the quality of the items. When evaluating the quality of the items, various
criteria are used:
1. An item should measure the knowledge or skill it is designed to measure. This is the
validity or soundness of an item.
2. We should also be concerned about the quality of expression. Items (or questions)
must be clearly written, grammatical and at the appropriate reading level. Thus you
should check whether all serious learners understand the questions (without being given
unfair hints).
3. The statistical characteristics (analysis) of the item, which are the topic of
discussion here. The other two have already been discussed.
Statistical analysis of test items is what we refer to as item analysis. Item analysis helps an
examiner to:
i. Judge the quality of the item. Thus the examiner can identify good or poor items.
ii. Identify the knowledge and skills examinees have and have not mastered. If an item in
a classroom test is answered incorrectly by a majority of the students, this
tells the teacher that something is wrong. Unfortunately, without further
investigation, it does not tell her/him what went wrong. The item may have been
misleading or poorly constructed, the material may have been so difficult that students
were not able to learn it, or the instruction may have been incomplete. Only
further analysis would tell which is the most likely or the true explanation.
Specifically, most item analyses are concerned with three aspects of an item:
a) Difficulty of the item, which is simply the proportion of examinees who
answer the item correctly; it is referred to as the difficulty index.
b) Discrimination power of the item, also called the discrimination index, which is concerned
with whether the item differentiates between people with varying degrees of
knowledge or ability.
c) Content validity as well as the effect of distracters.
Note: If the difficulty index is low the validity may still be acceptable, but if the discrimination
index of an item is negative, the item may not be measuring what it is supposed to measure (not
valid). It may be measuring something else, but not the ability intended.
Item analysis is important for building an item (or question) bank. It is unreasonable for teachers
to have to write new items (questions) every time they prepare a test. Over time, they should build
up a test file of the better items for reuse. Item analysis helps here because it shows which items
are good and which are bad; bad items (questions) have to be discarded or improved.
Note that item analysis is best done on multiple-choice items.
Example:
Consider a group of 40 examinees and suppose they responded as below to this item:

               A     B*    C     D     E     Omit
Upper group    0     20    0     0     0     0
Lower group    3     8     4     2     3     0

The asterisk indicates the correct answer (or key), so B is the key. The other options are
distracters (the incorrect options).

For each item, compute the percentage or proportion of examinees who get the item correct. This is
what is called the item difficulty index. The item difficulty index can be expressed as a decimal,
fraction or percentage, so its range is 0-1 (for a decimal or fraction) or 0-100% (for a
percentage). The item difficulty index is denoted by p.

p = number answering correctly / number of test takers

= 28/40 = 0.7 for our case above.

An item with a difficulty index of 0.3 is more difficult than an item with a difficulty index of 0.8. Why?
The index is quite useful in item analysis. If p for an item is very close to 0 or 1, the item generally should
be altered or discarded, because it is not giving any information about differences among examinees’ trait
levels or abilities.

If p = 0 the item is very difficult (nobody got it right)

If p = 1 the item is very easy (everybody got it right).

Acceptable p is 0.3-0.7 for it maximizes the information the test provides about differences among the
examinees. Note: You can have simple items early in the test for motivational reasons.
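As a small illustration, the difficulty index for the example item can be computed directly from the response counts. The dictionary layout and the function names difficulty_index and within_acceptable_range are assumptions made for this sketch; the 0.3-0.7 limits are the ones quoted above.

```python
# Response counts for the example item; option "B" is the key.
upper = {"A": 0, "B": 20, "C": 0, "D": 0, "E": 0, "Omit": 0}
lower = {"A": 3, "B": 8, "C": 4, "D": 2, "E": 3, "Omit": 0}
key = "B"


def difficulty_index(upper, lower, key):
    """Proportion of all examinees who chose the keyed (correct) option."""
    total = sum(upper.values()) + sum(lower.values())
    return (upper[key] + lower[key]) / total


def within_acceptable_range(p, low=0.3, high=0.7):
    """True if p lies inside the acceptable limits quoted in the text."""
    return low <= p <= high


p = difficulty_index(upper, lower, key)
print(p)                            # 28 / 40
print(within_acceptable_range(p))
```

A higher p means more examinees answered correctly, so an item with p = 0.3 is harder than one with p = 0.8.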

Examiner should not forget the purpose of the test. A test used to select graduate students for a university
that admits about 10% of the applicants should contain extremely difficult items. A test used to select
children for a remedial education program should contain very easy items. By this time you should have
realized even objective (multiple-choice) tests can be used for any level of education (even for Ph.D.
programs).

Item discrimination index

Item discrimination index is obtained by subtracting the number of students in the lower group who
answered the item correctly from the number in the upper group who got the item right, and dividing this
by the number of students in either group. That is, half of the total number of students when we divide
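the group into upper and lower halves. In our example:

$$\text{Discrimination index} = \frac{R_U - R_L}{\frac{1}{2}T} = \frac{20 - 8}{\frac{1}{2} \times 40} = \frac{12}{20} = 0.6$$

where $R_U$ and $R_L$ are the numbers of students in the upper and lower groups who answered the
item correctly, and $T$ is the total number of students.

This value is usually expressed as a decimal and can range from –1.00 to +1.00.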

If it has a positive value, the item has positive discrimination. This means that a larger
proportion of the more knowledgeable students than of the weaker students got the item right. If the
value is zero, the item has zero discrimination. This can occur:

a) Because the item is too easy or too hard


b) Because it is ambiguous.
If more of the weaker students than of the better students get the item right, the item has negative
discrimination. For a classroom test, an item discrimination index of 0.20 and above is good. Note
that the discrimination index belongs to an item (or a question), not to the test. You are analyzing
one item at a time, i.e. looking at each item of the test and determining its quality in terms of
how the examinees have performed on it.
Thus the two indexes (difficulty and discrimination) are group dependent and consequently should be
used bearing that in mind.

In general, discrimination index of 0.40 is regarded as satisfactory. However, one should not automatically
conclude that because an item has a low discrimination index, it is a poor item and should be discarded.
Items with low or negative discrimination indices should be identified for more careful examination.
Those with low, but positive, discrimination indices should be kept (especially for mastery tests). As long
as an item discriminates in a positive fashion, it is making some contribution to valid measurement of the
students’ competencies. And as long as we need some easy items to instill proper motivation in the
examinees, such items are valuable.
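The discrimination index for the example item (upper group 20 correct, lower group 8 correct, 20 students per group) can be computed in the same way. This is a sketch only: the function name discrimination_index and the dictionary layout are assumptions, and the code assumes the upper and lower groups are the same size, as in the example.

```python
# Response counts for the example item; option "B" is the key.
upper = {"A": 0, "B": 20, "C": 0, "D": 0, "E": 0, "Omit": 0}
lower = {"A": 3, "B": 8, "C": 4, "D": 2, "E": 3, "Omit": 0}
key = "B"


def discrimination_index(upper, lower, key):
    """(R_U - R_L) divided by the number of students in one group."""
    group_size = sum(upper.values())
    if group_size != sum(lower.values()):
        raise ValueError("upper and lower groups must be the same size")
    return (upper[key] - lower[key]) / group_size


d = discrimination_index(upper, lower, key)
print(d)  # (20 - 8) / 20

# Distracters chosen by nobody in either group attract no examinees at all.
unused = [opt for opt in upper
          if opt not in (key, "Omit") and upper[opt] + lower[opt] == 0]
print(unused)
```

Here every distracter was chosen by at least one examinee in the lower group, so the list of unused distracters is empty.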
Further reading
Allen M.J. & Yen W.M. (1976) Introduction to Measurement Theory.

Belmont California: Wadsworth Inc.

Brown, F.G. (1970) Principles of educational and psychological testing. 2nd ed.

New York: Holt, Rinehart & Winston.

Mehrens, W.A. & Lehmann, I.J. (1978) Measurement and Evaluation in Education and

Psychology. New York: Holt, Rinehart and Winston.

Exercise:

1. What do you understand by the following terms?


a. Item analysis
b. Validity
c. Reliability
Discuss their similarities and differences as they pertain to a test.

2. Explain the following as they relate to item analysis


a. Item difficulty
b. Item discrimination
c. Distracters or distractors
3. i) What is item analysis?

ii) What properties are desirable for item difficulty indices and item discrimination
indices?
iii) Generally, why is an item difficulty index of 0.01 or 0.99 undesirable?
iv) Why is a negative or zero item discrimination index undesirable?

4. Use the information below on an analysis of four test items to answer the questions which
follow:

Item 1         A*    B     C     D     OMIT   TOTAL
Upper group    16    2     1     1     0      20
Lower group    6     6     5     3     0      20

Item 2         A     B     C*    D     OMIT   TOTAL
Upper group    2     4     10    4     0      20
Lower group    4     5     6     5     0      20

Item 3         A     B*    C     D     OMIT   TOTAL
Upper group    2     16    2     0     0      20
Lower group    7     8     5     0     0      20

Item 4         A     B     C     D*    OMIT   TOTAL
Upper group    2     1     1     16    0      20
Lower group    1     1     0     18    0      20

Keys are symbolized by * for every item.

(i) Calculate the difficulty index of items 1, 2, 3 and 4.


(ii) Calculate the discrimination index of items 1,2, 3 and 4.
(iii) Explain what your calculations tell you about the quality and effectiveness of
each of the four items.
(iv) Examine all the distractors in each of the four items and state which are
functioning well and which are not and give a reason for each answer.
Tests – Uses of tests

Introduction

A test is a device for measuring psychological variables. The measuring device as we have
found has to be of great reliability as well as being of great validity.

Objectives

At the end of the unit the learner should be able to:


1. Define test
2. State the uses of tests
3. Distinguish between measurement and evaluation.

We shall talk about tests before looking at validity. Psychological variables or characteristics are
best measured by psychological tests.

Definition of ‘test’:

A test is a systematic procedure for measuring a sample of behaviour (psychological variable).


“Systematic procedure” indicates that a test is constructed, administered, and scored (or marked)
according to prescribed rules, which must be followed strictly.
Test items are systematically chosen to fit the test specifications, the same or equivalent
items are administered to all persons (examinees), and the directions and time limits are the same
for all persons taking the test. The use of predetermined rules (a marking scheme) for
evaluating (scoring) responses assures agreement between different persons who might score
(mark) the test. Consistency, or reliability, is thereby ensured.
Using standard procedures ensures comparability among the examinees and ensures there is
uniformity in all aspects you can think of. The test should not favour any individuals or any
group of individuals unfairly. A test should not have any kind of bias.
A second important term in the definition is behaviour. In the strictest sense, a test measures
only test-taking behaviour; that is, the responses a person (examinee) makes to the test items.
Here we are talking about psychological variables and as we know these cannot be measured
directly, rather we infer the characteristics (trait) from his or her responses to the given test
items. We have to measure their manifestations, since they are not tangible.
If the behaviour exhibited (manifested) on the test adequately mirrors the construct (trait) being
measured, the test will provide useful information. Here we are talking about validity of the test
i.e. the test measuring what it is supposed to measure. If the test does not adequately reflect the
underlying characteristic, inferences made from test scores will be in error for validity is
important.
A test contains only a sample of all possible items. No test is so comprehensive that it includes
every possible item that might be developed to measure the behaviour domain (or population or
universe), e.g. a driver’s test will not test how you drive at night, on a slippery wet road or
in heavy rain. Thus any particular test is better thought of as a sample of all
possible items.
Because a test contains only a sample of all possible items, two problems arise.
1. We must ensure that the questions or items represented on the test are a representative
sample of all-possible questions or items. [Validity]
2. Would an examinee get the same score if he were given a different set of sampled items
from the same domain? [Reliability].
A test is a measuring instrument. Thus we need to define measurement (measure).
Measurement is assigning of numbers to individuals in a systematic way as a means of
representing the properties of the individuals such that those with more of the property you are
measuring will score more, those with less will score less.

Difference between measurement and evaluation:

Measurement answers the question, how much? That is, measurement provides a description of
a person’s (examinee’s) performance, it does not provide judgment; that is, it says nothing about
the worth or value of the performance. If we put value or worth or judgment on it then we are
evaluating. We are going beyond description. We are attempting to answer the question how
good? This is evaluation. A mark or score like 40 out of 50 is measurement. If we say it is B+
then this is an evaluation, since judgment has been made on the value of the mark or score in
terms of how good. That is, objective description here is a measurement, while subjective
judgment of quality is an evaluation.

Uses of tests:

Explicit uses of tests:


1. Selection:
Selection is done in academic settings, in business and industry, and in other sectors
offering jobs where there are more qualified applicants than opportunities. That is, in the
selection situation there are more applicants than can be accepted (or employed or hired), and a
decision has to be made on whom to accept. The role of the test is to identify the most promising
applicants (or candidates or examinees), i.e. those with the greatest probability of success. In the
simplest case, the decision is either accept or reject. In Kenya, for example, university entrance
uses clear cut-off points, which are strictly adhered to. Once they are laid down, a candidate who
is left out cannot appeal, however fitting the reason may be.
The implication here is that we seem to have very little interest in those who are rejected (or left
out). Socio-economic status, poor health, poor facilities and background or other adverse factors
may contribute to a person being left out. Many such factors are assumed to be uniform for all. In
other words, the assumption is that nobody is favoured, which we know is not true.
2. In Placement:
There are several individuals and several alternative courses of action e.g. in universities there
are several departments and each has its requirements. In general each person is to be assigned
to a program using certain criteria.
3. Diagnosis:
Diagnosis involves comparing an individual’s performance in several areas in order to determine
relative strengths and weaknesses. Generally, diagnostic procedures are instituted when an
individual is having difficulty in some area. Once the areas of disability are identified, a program
of remediation can be undertaken. For instance, if a child has a problem in reading or in doing word
problems in mathematics, you may give a test consisting of phonetics, word meaning
(vocabulary), sentence meaning, paragraph meaning and reading rate, so as to identify which
particular weaknesses or strengths of the child need appropriate action.
4. Hypothesis testing:
In psychological research, tests are often used for hypothesis testing. And what is a hypothesis?
This is dealt with here briefly (for details refer elsewhere). In brief, a hypothesis is a speculative
statement, or an educated guess, which you may wish to establish whether to accept or reject.
For instance, we can manipulate our subjects in a certain way (varying may be the degree of
manipulation) and then we try to find the effect of the manipulation. We give a test to find out
the effect of the manipulation. This is a type of experimental study or design. In correlational
study (design) we have cases of natural manipulation. In a correlation study we may look at the
performance at a certain time or under certain conditions. We study what has taken place and
then we make inferences. Using various methods like keeping other variables constant or
eliminating them analytically or otherwise, we are able to study the effect of the variable
manipulated.
Tests can also be used for hypothesis building. We may find a difference in performance in 80’s
and 90’s and then we go on to hypothesize what could be the reason (may be a drop in
socioeconomic status, 8-4-4 educational system or combination of these and others).
Psychologists or educators (even lay-people) use tests to make a lot of deductions or build
hypotheses. For example, Muthoni got a very good division I in the O-level examination but
failed to obtain university entrance after A-level. Why? Muthoni went to do science for A-level
because of parental pressure. Her father wanted her to be a doctor but she did not have much
interest in sciences (Biology and the like). Muthoni could have done very well if she took Arts
(Humanities). Or Muthoni may have lost her father just before the exams and this traumatized
her too much beyond recovery and this indeed may have contributed to her poor performance in
the A-level exams.
5. Another use of tests is in evaluation:
Formative evaluation and summative evaluation:
A teacher can use a test to find not only the weak students but also his or her own weaknesses,
topics not well understood, etc. Thus classroom examinations and tests are usually used to evaluate
the instructional method or the teacher.
All of these uses involve some decision. In selection, the decision is whether to accept or reject
an applicant. In placement, where does the candidate fit best in terms of ability and skills is
considered. In diagnosis the remedial treatment, which is to be used after finding out the
weakness is the concern. In hypothesis testing usually using statistics, you need to establish
(reject or accept) the hypothesis. In hypothesis building, from evidence obtained, you build a
hypothesis. In evaluation, you decide what grade to give to a student or how effective the
procedure is. Effectiveness has to do with summative evaluation, that is, evaluation done at the
end, while evaluation done at the beginning (e.g. to check entry behaviour) is formative evaluation.
We know how seriously we take tests. We belong to a culture which overrates exams. You get
a lot of respect if you are an A student or have a division I, a first class or a Ph.D. If you do
badly in your tests, you seldom get a chance to say why you obtained a low score. Research on
tests shows that ‘ability’ is important to doing well in a test but accounts for less than 50%. Other
factors also count, e.g. difficulty of items, quality of instructions, personality variables,
socioeconomic status and linguistic variables.

Further reading:

Allen M.J. & Yen W.M. (1976) Introduction to Measurement Theory.


Belmont California: Wadsworth Inc.
Brown, F.G. (1970) Principles of educational and psychological testing. 2nd ed.
New York: Holt, Rinehart & Winston.

Mehrens, W.A. & Lehmann, I.J. (1978) Measurement and Evaluation in Education and
Psychology. New York: Holt, Rinehart and Winston.

Exercise:
1. Discuss five uses of tests
2. Explain why evaluation is important in a program.
3. Distinguish among test, measurement and examination.
Classification of tests

Introduction

There are several ways in which we can classify tests. The purpose of classifying is to group
together tests that by and large have similar properties, although each test is unique in its own
right.

Objectives

At the end of the unit the learner is expected to:


1. Distinguish between:
i) Essay and objective tests
ii) Oral and written tests
iii) Maximal performance and typical performance tests
iv) Norm-referenced and criterion-referenced tests

2. Distinguish among achievement tests, ability tests and aptitude tests


3. State and describe:
a. 3 major types of essay tests
b. 4 types of objective tests.

Classification of tests:

There are a variety of ways in which tests can be classified.


i. One type of classification is based upon the item format used: essay (subjective)
versus objective.
ii. Another classification is based upon the type of stimulus material used to present the
problems to the students (examinees): verbal versus nonverbal.
iii. A classification by purpose and here we have several categories:
a. Maximal performance versus typical performance
b. Formative versus summative evaluation
c. Norm referenced tests (NRT) versus criterion-referenced tests (CRT).

Classification by item format:

The two major categories here are as seen above:


1. Essay and 2. Objective

Essay type:
Essay questions are subdivided into three major types
1. Extended (or Discussion) response
2. Restricted response
3. Oral.

Extended response:
This is also referred to as the discussion type. Here the question is very much open-ended
(unstructured). No restriction is given. Most university questions in many departments are of this
type. Example:
a. Discuss what a system is
b. Discuss what a scientific method (approach) is
c. Discuss the Information Processing Model of (Memory).

Restricted response:
Here the student (examinee) is more limited in the form and scope of his answer because he is told
specifically the context that his answer is to take.
Example:
a. Give three advantages and three disadvantages of essay tests
b. Give three advantages and three disadvantages of multiple-choice items
c. Distinguish between classical conditioning and operant conditioning
d. Distinguish among aptitude tests, achievement tests and ability tests
e. Distinguish among memory, learning and insight.
We can refer to these as short-answer essay tests.

Oral examination:
This is also called a viva, viva voce or defence. It is usually held after the candidate has
written a dissertation or thesis for an advanced degree (a master's or doctorate). Essentially, it
is meant to find out how well the candidate has linked theory and practice in the solution to his
or her problem and, very importantly, to see whether he or she is indeed the one who wrote that
thesis or dissertation.

Objective type:
Objective-type items can be subdivided into four major types:
1. Short-answer
i. Single word, symbol, formula
ii. Multiple words or phrase
2. True-false (right-wrong, Yes-No)- dichotomous case
3. Multiple choice
4. Matching.
Variations of the multiple-choice format:
There are four frequent forms:
i. One correct answer
ii. Best answer
iii. Analogy type
iv. Reverse type
Others are substitution, incomplete (blank to fill) etc.

Maximal performance tests:

Test takers (examinees) attempt to make the highest possible score. The goal is to measure the
upper limits of an examinee's abilities. Classroom tests are in this category and are examples of
achievement tests. Others in this category of maximal performance tests are aptitude tests and
ability tests. Note that these are not mutually exclusive: a particular test may serve more than
one of these purposes.

Typical performance tests (or measures):


These assess a person's typical reaction or behaviour. Here the concern is not maximal performance
but how the person usually reacts or behaves, such as how much he or she likes particular courses.
Other measures in this category are those of attitude, interest and personality. They are best
assessed mainly by questionnaires.
For maximal performance tests we have basically 3 categories:
1. Achievement tests
2. Ability tests
3. Aptitude tests
Achievement tests:
These are designed to measure the knowledge and skills developed in a relatively circumscribed
area (domain) (Brown, 1976). This area may be as narrow as one day's class assignment (e.g.
computing the median, variance etc.) or as broad as several years' study (as in KCSE
examinations). In every case, however, we are attempting to measure what a person knows or can do
at a particular point in time: his or her best performance in a test that examines what has been
learned as a result of a particular course, an experience or a series of experiences.

Aptitude tests:
An aptitude test indicates your potentialities, that is, what you are capable of doing given your
formal or informal experiences. Thus aptitude tests indicate the probability that certain other
behaviours will be acquired or learned. We consider a test to be an aptitude test if:
i. It measures the results of general and incidental learning experiences.
ii. Its frame of reference is toward the future.
This is in contrast to an achievement test, which measures learning from relatively specific
experiences and focuses on past learning. An aptitude test predicts what can be learned in the
future: it measures the ability to acquire certain behaviours or skills given appropriate
opportunity.
Ability tests:
These indicate the power to perform a task. Ability tests measure present status. In this category
we have performance tests (or practical tests) in tasks such as driving, tuning an engine, playing
the piano and swimming.

Norm-referenced versus criterion-referenced tests:

The distinction between the two lies in how test scores are interpreted. In norm-referenced tests,
an individual's scores are interpreted by comparing them to those of other people in some
comparison (peer or norm) group. In criterion-referenced tests, the concern is mastery of the
content regardless of the performance of the other examinees. In norm-referenced tests, scores
usually need to be transformed for comparability or meaningful interpretation, but in
criterion-referenced tests, scores directly indicate proficiency, mastery or competency.
Thus a score in a criterion-referenced test is a meaningful number representing the level of
mastery, while in a norm-referenced test a (raw) score may not carry much meaning unless it is
converted to a standard scale to show its relative position compared to other scores. In other
words, transformations such as conversion to percentile ranks, standardization and normalization
are justifiable for norm-referenced tests. A score in a criterion-referenced test is already
meaningful and should not be subjected to such transformation.
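The two interpretations can be illustrated with a short Python sketch. It is only an illustration under stated assumptions: the norm-group scores, the 80% mastery cut-off and the function names are hypothetical and do not come from these notes. The percentile rank here uses one common convention, counting half of the tied scores as falling below the raw score.

```python
from statistics import mean, pstdev


def z_score(raw, norm_scores):
    """Norm-referenced: distance of a raw score from the group mean, in SD units."""
    return (raw - mean(norm_scores)) / pstdev(norm_scores)


def percentile_rank(raw, norm_scores):
    """Norm-referenced: percentage of the group below raw, with ties counted as half."""
    below = sum(s < raw for s in norm_scores)
    equal = sum(s == raw for s in norm_scores)
    return 100.0 * (below + 0.5 * equal) / len(norm_scores)


def has_mastered(raw, max_score, cutoff=0.8):
    """Criterion-referenced: compare the score with a fixed standard only."""
    return raw / max_score >= cutoff


# Hypothetical norm group and raw score, for illustration only.
norm_group = [35, 40, 45, 50, 55, 60, 65]
raw = 60

print(z_score(raw, norm_group))          # position relative to the group
print(percentile_rank(raw, norm_group))  # position relative to the group
print(has_mastered(raw, max_score=100))  # depends on the cut-off, not on the group
```

Note that the first two functions need the scores of the other examinees, while the third needs only the examinee's own score and the chosen standard.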
The difference between criterion-referenced tests and norm-referenced tests is largely theoretical.
The number 1 student (or candidate) can be seen as the criterion or standard, having realized the
perfect score or mastered the content perfectly. If all had mastered equally, they would all be
number 1, but we know this is not the case in practice. In other words, norm-referenced tests and
criterion-referenced tests differ more in theory than they do in practice.
From another perspective, we should realize that when we rank candidates or find a candidate's
position, we do so according to their performance (mastery). In both cases we are talking about
mastery, though from a different angle. As long as we are concerned about mastery, both types of
tests (criterion-referenced and norm-referenced) have so much in common that they are hardly
different, and they end up doing the same thing. Practically they are the same, but theoretically
or philosophically they are different. One does not talk of transforming scores in
criterion-referenced tests, only in norm-referenced tests, because in criterion-referenced tests a
score is already meaningful.

Further reading:

Allen, M.J. & Yen, W.M. (1976) Introduction to Measurement Theory.
Belmont, California: Wadsworth Inc.

Brown, F.G. (1976) Principles of educational and psychological testing. 2nd ed.
New York: Holt, Rinehart & Winston.

Mehrens, W.A. & Lehmann, I.J. (1978) Measurement and Evaluation in Education and
Psychology. New York: Holt, Rinehart and Winston.
Exercises
1. Describe
i. Typical performance tests
ii. Maximal performance tests.
2. Distinguish among achievement tests, ability tests and aptitude tests
3. Distinguish between norm referenced test and criterion referenced test.
