Scoring and Interpretation of Test Scores
INTRODUCTION
This module has set out to show what a test is, why testing is important, how tests are
constructed and what precautions are taken to ensure the validity of tests. This unit rounds
off the module by explaining how tests are scored and interpreted. To get the most out of this
unit, keep the other units by your side and cross-check the aspects relevant to this unit that
were discussed in the previous units.
OBJECTIVES
By the end of this unit, you should be able to:
a. score and interpret tests in general and continuous assessment in particular;
b. analyse test items;
c. compute some measures of central tendency and variability; and
d. compute the Z-score and the percentile.
SCORING OF TESTS
This section introduces you to the pattern of scoring tests, whether they are continuous
assessment tests or other forms of tests. The following guidelines are suggested for scoring tests:
i. Remember that multiple-choice tests are difficult to design and difficult to
administer, especially in a large class, but easy to score. In some cases, they are
scored by machines. Multiple-choice tests are easy to score because each item
usually has one correct answer, which must be accepted across the board.
ii. Essay or subjective types of tests are relatively easy to set and administer, especially in a
large class. They are, however, difficult to mark or assess. The reason is that
essay questions require a lot of writing of sentences and paragraphs, and the examiner
must read all of these.
iii. Whether objective or subjective, all tests must have marking schemes.
Marking schemes are the guide for marking any test. They consist of the points,
demands and issues that must be raised before the candidate can be said to have
responded satisfactorily to the test. Marking schemes should be drawn up before the
test is taken, not after. All marking schemes should carry mark allocations.
They should also indicate scoring points and how the scores are totalled to
give the total score for the question or the test.
iv. Scoring or marking on impression is dangerous. Some students are very good at
impressing examiners with flowery language without real academic substance. If you
mark on impression, you may be carried away by the language and not the relevant
Measurement and Evaluation in Education (PDE 105)
facts. Again, mood may change impression; your impression can be changed by joy,
sadness, tiredness, time of the day and so on. That is why you must always insist on
a comprehensive marking scheme.
v. Scoring can be done question by question or all questions at a time. The best way is
to score or mark one question across the board for all students. Sometimes, this may
not be feasible, or may be tedious, especially in a large class.
vi. Scores can be interpreted as grades: A, B, C, D, E and F. They may be interpreted
in terms of percentages: 10%, 20%, 50%, etc. Scores may be presented in a
comparative way in terms of position, from 1st, 2nd and 3rd down to the last.
Scores can also be coded in what is called a BAND. In a band system, certain criteria are
used to determine those who will be in the Excellent, Very Good and other categories.
Examples of band systems are the one used by the International English Language
Testing System (IELTS) and the one used by the Test of English as a Foreign
Language (TOEFL).
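Example: Find the corrected scores of two candidates, A and B, who both scored 35 in an
objective test of 50 items, if A attempted 38 questions while B attempted all the questions.
Candidate A got 38 − 35 = 3 items wrong and candidate B got 50 − 35 = 15 items wrong.
Deducting one quarter of a mark for each wrong answer:
$$S_A = 35 - \frac{3}{4} = 34.25 \approx 34 \quad \text{and} \quad S_B = 35 - \frac{15}{4} = 31.25 \approx 31$$
Note that under rights-only scoring, each of the students gets 35 out of 50.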
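The corrected scores in this example can be checked with a short sketch. It assumes the common correction-for-guessing rule S = R − W/(k − 1) with k = 5 options per item, which is what makes each wrong answer cost a quarter of a mark. The function name and the default value of k are illustrative assumptions, not something stated in this unit.

```python
def corrected_score(right, attempted, options=5):
    """Correction for guessing: S = R - W / (k - 1).

    right     -- number of items answered correctly (R)
    attempted -- number of items attempted, so W = attempted - right
    options   -- options per item (k); 5 is an assumption that
                 matches the quarter-mark penalty in the example
    """
    wrong = attempted - right
    return right - wrong / (options - 1)


score_a = corrected_score(right=35, attempted=38)  # 3 wrong answers
score_b = corrected_score(right=35, attempted=50)  # 15 wrong answers
print(score_a, score_b)  # 34.25 31.25
```

Candidate B loses more marks than candidate A because B answered more items wrongly, even though both have the same rights-only score.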
Item Analysis
Item analysis helps to decide whether a test is good or poor in two ways:
i. It gives information about the difficulty level of a question.
ii. It indicates how well each question discriminates (shows the difference) between the
bright and the dull students.
In essence, item analysis is used for reviewing and refining a test.
Difficulty Level
By difficulty level we mean the proportion of candidates that got a particular item right in any
given test. For example, if in a class of 45 students, 30 of the students got a question
correct, then the difficulty level is 67% or 0.67. The proportion ranges from 0 to 1
or 0 to 100%.
An item with an index of 0 is too difficult, since everybody missed it, while one with an index
of 1 is too easy, since everybody got it right. Items with an index of about 0.5 are usually
suitable for inclusion in a test.
Though items with indices of 0 and 1 may not really contribute to an achievement test,
they help the teacher to determine how well the students are doing in that particular
area of the content being tested. Hence, such items could be included. However, the mean
difficulty level of the whole test should be 0.5 or 50%.
Usually, the formula for item difficulty is
$$p = \frac{n}{N} \times 100$$
where
p = item difficulty
n = the number of students who got the item correct.
N = the number of students involved in the test.
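As a quick check of the difficulty formula, the sketch below recomputes the example of 30 correct answers in a class of 45. The function name is an illustrative choice.

```python
def item_difficulty(num_correct, num_tested):
    """Item difficulty as a percentage: p = (n / N) * 100."""
    if num_tested <= 0:
        raise ValueError("num_tested must be positive")
    return num_correct / num_tested * 100


p = item_difficulty(num_correct=30, num_tested=45)
print(round(p))  # 67
```

An item nobody answers correctly gives 0, and an item everybody answers correctly gives 100.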
However, in the classroom setting, it is better to use the number of students in the upper
one-third of the class that got the item right (U) and the number of students in the lower
one-third that got it right (L).
Item Discrimination
The discrimination index shows how well a test item discriminates between the bright and the
dull students. A test with many poor questions will give a false impression of the learning
situation. Usually, a discrimination index of 0.4 and above is acceptable. Items which
discriminate negatively are bad. This may be because of wrong keys, vagueness or extreme
difficulty. The formula for the discrimination index is:
$$\frac{U - L}{\frac{1}{2}N} \quad \text{or} \quad \frac{U - L}{0.5N}$$
Where
U = the number of students that got it right in upper group.
L = the number of students that got it right in the lower group.
N = the number of students usually involved in the item analysis.
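The discrimination formula can be sketched as follows. The figures are hypothetical, and N is read here as the combined size of the upper and lower groups, so that 0.5N is the size of one group. That reading is an assumption drawn from the formula, not something this unit spells out.

```python
def discrimination_index(upper_correct, lower_correct, group_total):
    """Discrimination index: (U - L) / (0.5 * N).

    group_total (N) is taken to be the number of students in the
    upper and lower groups combined (an assumption).
    """
    if group_total <= 0:
        raise ValueError("group_total must be positive")
    return (upper_correct - lower_correct) / (0.5 * group_total)


# Hypothetical item: 15 students in each group, so N = 30.
good_item = discrimination_index(upper_correct=12, lower_correct=6, group_total=30)
bad_item = discrimination_index(upper_correct=4, lower_correct=9, group_total=30)
print(good_item)           # 0.4
print(round(bad_item, 2))  # -0.33
```

The first item just reaches the 0.4 level described as acceptable. The second discriminates negatively, which suggests its key or wording should be checked.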
In summary, to carry out item analysis:
Mode
The mode is the most frequent or popular score in the population. This is usually evident
when frequency tables are drawn up. It is not used as frequently as the median and the mean
in the classroom because it can fall anywhere along the distribution of scores (top, middle or
bottom) and a distribution may have more than one mode.
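The point that a distribution may have more than one mode can be illustrated with a small sketch. It uses the thirteen scores from the Range example in this unit, together with `statistics.multimode` from the Python standard library (available from Python 3.8).

```python
from collections import Counter
import statistics

# Scores taken from the Range example in this unit.
scores = [7, 2, 5, 4, 6, 3, 1, 2, 4, 7, 9, 8, 10]

# A frequency table makes the most popular scores evident.
frequency_table = Counter(scores)

# multimode returns every score that shares the highest frequency,
# in the order in which each was first seen.
modes = statistics.multimode(scores)
print(modes)                 # [7, 2, 4]
print(frequency_table[7])    # 2
```

Here three different scores each occur twice, so the mode alone says little about where the class as a whole stands.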
Median
This is the middle score after all the scores have been arranged in order of magnitude, i.e.
50% of the scores are on either side of it. The median is very good where there are deviant or
extreme scores in a distribution. However, it does not take the relative size of all the scores
into consideration. Also, it is of limited use in further statistical computations.
The Mean
This is the average of all the scores and it is obtained by adding the scores together and
dividing the sum by the number of scores.
$$M \text{ or } \bar{X} = \frac{\text{Sum of all scores}}{\text{Number of scores}}$$
Though the mean is influenced by deviant scores, it is very important in that it takes into
cognizance the relative size of each score in the distribution and it is also useful for other
statistical calculations.
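The different behaviour of the mean and the median in the presence of a deviant score can be seen in a small sketch. Both score lists are hypothetical and chosen only for illustration.

```python
import statistics

# Hypothetical class scores, and the same class with one extreme score.
typical = [50, 52, 54, 56, 58]
with_extreme = [50, 52, 54, 56, 98]

mean_typical = statistics.mean(typical)
median_typical = statistics.median(typical)
mean_extreme = statistics.mean(with_extreme)
median_extreme = statistics.median(with_extreme)

print(mean_typical, median_typical)  # 54 54
print(mean_extreme, median_extreme)  # 62 54
```

Replacing one score with an extreme value pulls the mean from 54 up to 62, while the median stays at 54.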
ACTIVITY
The mean score is the same as the average score, i.e. the sum of all scores divided by the
number of scores. This is the most common statistical measure used in our classrooms.
If in a class of 9, the scores are 29, 85, 78, 73, 40, 35, 20, 10 and 5. Find the mean.
MEASURES OF VARIABILITY
Measures of variability indicate the spread of the scores. The usual measures of variability
are the Range, the Quartile Deviation and the Standard Deviation. Their computations are
illustrated below.
Range
The range is usually taken as the difference between the highest and the lowest scores in a
distribution. It is completely dependent on the extreme scores and may give a wrong
picture of the variability in the distribution. It is the simplest measure of variability.
Example: 7, 2, 5, 4, 6, 3, 1, 2, 4, 7, 9, 8, 10. Lowest score = 1, Highest = 10.
Range = 10 − 1 = 9
Quartile Deviation
Quartiles are points on the distribution which divide it into four equal parts (“quarters”);
thus, we have the 1st, 2nd and 3rd quartiles.
The inter-quartile range is the difference between Q3 and Q1, i.e. Q3 − Q1. It is often
preferred to the range because it cuts off the extreme scores.
The quartile deviation, also known as the semi-inter-quartile range, is half of the
inter-quartile range, that is, half the difference between the upper quartile (Q3) and the
lower quartile (Q1) of the set of scores:
$$\text{Quartile deviation} = \frac{Q_3 - Q_1}{2}$$
where Q3 = P75 = the point in the distribution below which lie 75% of the scores, and
Q1 = P25 = the point in the distribution below which lie 25% of the scores.
In cases where there are many deviant scores, the quartile deviation is the best measure of
variability.
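A minimal sketch of the range and the quartile deviation, using the thirteen scores from the Range example. Textbooks and software locate Q1 and Q3 in slightly different ways. The `method="inclusive"` option of `statistics.quantiles` used here is one such convention and is an assumption, so hand calculations may differ a little.

```python
import statistics

# Scores from the Range example in this unit.
scores = [7, 2, 5, 4, 6, 3, 1, 2, 4, 7, 9, 8, 10]


def quartile_deviation(data):
    """Half the difference between Q3 and Q1 (semi-inter-quartile range)."""
    q1, _q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return (q3 - q1) / 2


score_range = max(scores) - min(scores)
qd = quartile_deviation(scores)
print(score_range)  # 9
print(qd)           # 2.0
```

With this convention Q1 = 3 and Q3 = 7, so the inter-quartile range is 4 and the quartile deviation is 2.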
Standard Deviation
This is the square root of the mean of the squared deviations. The mean of the squared
deviations is called the variance (S²). The deviation is the difference between each score and
the mean.
$$SD = \sqrt{\frac{\sum x^2}{N}}$$
where
x = X − X̄ = deviation of each score from the mean
N = number of scores.
The SD is the most reliable of all measures of variability and lends itself to use in other
statistical calculations.
Deviation is the difference between each score (X) and the mean (m). To calculate the
standard deviation:
(i) find the mean (m);
(ii) find each deviation (X − m) and square it;
(iii) sum the squares and divide by the number of scores (N);
(iv) find the positive square root.
Take m = 54.

Student   Marks obtained (X)   Deviation (X − m)   Squared deviation (X − m)² = x²
A         68                    14                 196
B         58                     4                  16
C         47                    -7                  49
D         45                    -9                  81
E         54                     0                   0
F         50                    -4                  16
G         62                     8                  64
H         59                     5                  25
I         48                    -6                  36
J         52                    -2                   4
Total                                              487

N = 10
$$SD = \sqrt{\frac{\sum x^2}{N}} = \sqrt{\frac{\sum (X - m)^2}{N}} = \sqrt{\frac{487}{10}} \approx 6.98$$
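The four steps can be sketched in code using the ten marks from the table. The table takes the mean as 54, whereas the exact mean of these marks is 54.3, so the sketch computes the standard deviation both ways. The function name is illustrative.

```python
import math
import statistics

# Marks of students A to J from the table.
marks = [68, 58, 47, 45, 54, 50, 62, 59, 48, 52]


def standard_deviation(scores, mean=None):
    """Square root of the mean of the squared deviations."""
    if mean is None:
        mean = sum(scores) / len(scores)               # step (i)
    squared = [(x - mean) ** 2 for x in scores]        # step (ii)
    variance = sum(squared) / len(scores)              # step (iii)
    return math.sqrt(variance)                         # step (iv)


sd_rounded_mean = standard_deviation(marks, mean=54)   # mean as used in the table
sd_exact_mean = standard_deviation(marks)              # exact mean, 54.3
print(round(sd_rounded_mean, 2))  # 6.98
print(round(sd_exact_mean, 2))    # 6.97
```

The second figure agrees with `statistics.pstdev(marks)`, the standard library function for the standard deviation of a whole group of scores.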
Activity
Find the mean and standard deviation for the following marks.
20, 45, 39, 40, 42, 48, 30, 46 and 41.
DERIVED SCORES
In practice, we report on our students after examinations by adding together their scores in
the various subjects and thereafter calculating the average or percentage as the case may be.
This does not give a fair and reliable assessment. Instead of using raw scores, it is better to
use “derived scores”. A derived score expresses each raw score in relation to the other raw
scores on the test. The commonly used ones in the classroom are the Z-score, the T-score and
percentiles. The computation of each of these will be demonstrated.
T-Score
This is another derived score often used in conjunction with the Z-score. It is defined by the
equation:
T = 50 + 10Z
where Z is the standard score.
It is used in the same way as the Z-score, except that the negative signs are eliminated in
T-scores.
Consider the maximum scores obtained in English and Mathematics in the table above. We
cannot easily tell which of the subjects was more demanding or in which the examiner
was more generous. Hence, for justice and fair play, it is advisable to convert the scores in
the two subjects into common scores (standard scores) before they are ranked. Z-scores and
T-scores are often used.
The Z-score is given by
$$Z\text{-score} = \frac{\text{Raw score} - \text{Mean}}{\text{Standard deviation}} = \frac{X - M}{SD}$$
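The Z-score and T-score formulas can be sketched as below. The raw scores, mean and standard deviation are hypothetical values chosen for easy arithmetic, and the function names are illustrative.

```python
def z_score(raw, mean, sd):
    """Z = (X - M) / SD."""
    if sd <= 0:
        raise ValueError("sd must be positive")
    return (raw - mean) / sd


def t_score(z):
    """T = 50 + 10Z."""
    return 50 + 10 * z


# Hypothetical test with mean 50 and standard deviation 10.
z_high = z_score(raw=60, mean=50, sd=10)
z_low = z_score(raw=35, mean=50, sd=10)
print(z_high, t_score(z_high))  # 1.0 60.0
print(z_low, t_score(z_low))    # -1.5 35.0
```

A score below the mean has a negative Z-score but still a positive T-score, which is why T-scores are often easier to report.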
Activity
Calculate the Z- and T-scores for students A, B, C and D in the table above.
Percentile
This expresses a given score in terms of the percentage of scores below it. For example, in a
class of 30, Ibrahim scored 60 and there are 24 pupils scoring below him. The percentage of
scores below 60 is therefore:
$$\frac{24}{30} \times \frac{100}{1} = 80\%$$
Ibrahim therefore has a percentile of 80, written P80. This means Ibrahim surpassed 80% of
the class, while the remaining 20% (Ibrahim included) scored 60 or above. The formula for
the percentile rank is given by:
$$PR = \frac{100}{N} \times \left(b + \frac{F}{2}\right)$$
where
PR = Percentile rank of a given score
b = Number of scores below the score
F = Frequency of the score
N = Number of all scores in the test.
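The percentile rank formula can be sketched using Ibrahim's case. The unit does not say how many pupils scored exactly 60, so F = 1 (Ibrahim alone) is an assumption made for illustration.

```python
def percentile_rank(below, frequency, total):
    """PR = (100 / N) * (b + F / 2)."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100 / total * (below + frequency / 2)


# Ibrahim: 24 scores below his, class of 30, and (assumed) only he scored 60.
pr = percentile_rank(below=24, frequency=1, total=30)
print(round(pr, 2))  # 81.67
```

The result is slightly higher than the simple figure of 80% because the formula credits half of the scores tied at 60 as lying below that score.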
CREDIT UNITS
Courses are often weighted according to their credit units in the course credit system. Credit
units of courses often range from 1 to 4. These are calculated according to the number of
contact hours as follows:
1 credit unit = 15 hours of teaching
2 credit units = 15 x 2 or 30 hours
3 credit units = 15 x 3 or 45 hours
4 credit units = 15 x 4 or 60 hours
The number of hours spent on practicals is usually taken into consideration in calculating
credit loads.
$$\text{GPA} = \frac{\text{Total WGP}}{\text{Total credit units registered}}$$
(The scores and their letter grading may vary from programme to programme or Institution to
Institution)
For example, a score of 65 marks has a GP of 4 and a Weighted Grade Point (WGP) of
4 x 3 = 12 if the mark was scored in a 3-unit course. If there are five such courses with
credit units 4, 3, 2, 2 and 1 respectively, the Grade Point Average is the sum of the five
Weighted Grade Points divided by the total number of credit units, i.e.
(4 + 3 + 2 + 2 + 1) = 12.
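The GPA calculation can be sketched as follows. The first list follows the example of five courses each earning a GP of 4. The second list uses hypothetical grade points to show the effect of the credit weights. The function name and data layout are illustrative.

```python
def grade_point_average(courses):
    """GPA = total WGP / total credit units.

    courses is a list of (grade_point, credit_units) pairs;
    the WGP of a course is grade_point * credit_units.
    """
    total_wgp = sum(gp * units for gp, units in courses)
    total_units = sum(units for _gp, units in courses)
    return total_wgp / total_units


# Five courses with a GP of 4 each and credit units 4, 3, 2, 2 and 1.
same_grade = [(4, 4), (4, 3), (4, 2), (4, 2), (4, 1)]
# Hypothetical mix of grade points over the same credit units.
mixed_grades = [(5, 4), (4, 3), (3, 2), (2, 2), (1, 1)]

print(grade_point_average(same_grade))              # 4.0
print(round(grade_point_average(mixed_grades), 2))  # 3.58
```

In the second case the simple average of the grade points is 3.0, but the GPA is higher because the best grades were earned in the courses carrying the most credit units.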
ACTIVITY
Below is a sample of an examination transcript for a student
a. Determine for each course the
(i) GP and
(ii) WGP.
b. Find the GPA.
NOTE:
$$\text{GPA} = \frac{\text{Total WGP}}{\text{Total credit units taken}}$$
SUMMARY
• In this unit, we have discussed the basic principles guiding scoring of tests and test
interpretations.
• The use of frequency distribution, mean, mode and median in interpreting test scores
was also explained.
• The methods by which test results can be interpreted to be meaningful for classroom
practices were also vividly illustrated.
ASSIGNMENT
1. State the various types of tests and explain what each measures.
2. Pick a topic of your choice and prepare a blue-print table for 25 objective items.
3. Explain why:
(a) we use percentiles to describe a student’s performance; and
(b) we use Z-scores to describe scores in a distribution.
4. Give four factors each that can affect the reliability and validity of a test.
5. Use the criteria and basic principles for constructing continuous assessment tests
discussed in this unit to develop a 1 hour continuous assessment test in your subject
area. By citing specific examples from the test you have constructed, show how you
have used the testing concepts learnt to construct the test. You should bring out from
your test at least ten testing concepts used in the construction of the test.