Assessment in Learning 1 Table of Specifications
Table of Specifications
The table of specifications (TOS) is a tool used to ensure that a test or assessment measures the content and thinking skills that it intends to measure. Thus, when used appropriately, it can provide content and construct (i.e., response process) validity evidence. A TOS may be used for large-scale test construction, classroom-level assessments by teachers, and psychometric scale development. It is a foundational tool in designing tests or measures for research and educational purposes. The primary purpose of a TOS is to ensure alignment between the items or elements of an assessment and the content, skills, or constructs that the assessment intends to measure.
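For illustration, one common way to build a simple one-way TOS is to allocate items to each topic in proportion to instructional time. A minimal sketch; the topics, hours, and 50-item total are hypothetical illustration values:

```python
# A minimal sketch of a one-way table of specifications: items allocated to
# each topic in proportion to instructional hours. The topics, hours, and the
# 50-item total are hypothetical illustration values, not from the source.
topics_hours = {"Fractions": 6, "Decimals": 4, "Ratio and Proportion": 4, "Percent": 6}
total_items = 50
total_hours = sum(topics_hours.values())  # 20 hours in this example

for topic, hours in topics_hours.items():
    n_items = round(total_items * hours / total_hours)
    print(f"{topic}: {hours} hours -> {n_items} items")
```

Allocating items this way keeps the test's emphasis aligned with the emphasis of instruction, which is the alignment purpose described above.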
TEST CONSTRUCTION
Validity – the extent to which the test measures what it intends to measure
Reliability – the consistency with which a test measures what it is supposed to measure
Usability – the test can be administered with ease, clarity and uniformity
Scorability – easy to score
Interpretability – test results can be properly interpreted and serve as a major basis for making sound educational decisions
Economical – the test can be reused without compromising the validity and reliability
VALIDITY
1. Construct validity: Does the test measure the concept that it’s intended to measure?
2. Content validity: Is the test fully representative of what it aims to measure?
3. Face validity: Does the content of the test appear to be suitable to its aims?
4. Criterion validity: Do the results correspond to a different test of the same thing?
RELIABILITY
There are four main types of reliability: test-retest, interrater, parallel forms, and internal consistency. Each can be estimated by comparing different sets of results produced by the same method.
Parallel forms – different versions of a test which are designed to be equivalent
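For illustration, a parallel-forms reliability estimate is commonly computed as the correlation between the same students' scores on the two equivalent forms. A minimal sketch with made-up scores for ten students (statistics.correlation requires Python 3.10+):

```python
# A minimal sketch of a parallel-forms reliability estimate: the Pearson
# correlation between scores of the same ten students on two equivalent
# forms. The score lists are made-up illustration data.
from statistics import correlation  # Python 3.10+

form_a = [38, 42, 35, 45, 40, 33, 47, 41, 36, 44]
form_b = [36, 44, 34, 46, 41, 35, 45, 40, 38, 43]

print(f"reliability estimate r = {correlation(form_a, form_b):.2f}")
```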
MEASURES OF CENTRAL TENDENCY
A measure of central tendency (also referred to as a measure of centre or central location) is a summary measure that attempts to describe a whole set of data with a single value that represents the middle or centre of its distribution.
There are three main measures of central tendency: the mode, the median and the mean. Each of these
measures describes a different indication of the typical or central value in the distribution.
Consider this dataset showing the retirement age of 11 people, in whole years:
54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60
This table shows a simple frequency distribution of the retirement age data.
Age Frequency
54 3
55 1
56 1
57 2
58 2
60 2
The most commonly occurring value is 54, therefore the mode of this distribution is 54 years.
The mode has an advantage over the median and the mean as it can be found for both numerical and
categorical (non-numerical) data.
There are some limitations to using the mode. In some distributions, the mode may not reflect the centre
of the distribution very well. When the distribution of retirement age is ordered from lowest to highest
value, it is easy to see that the centre of the distribution is 57 years, but the mode is lower, at 54 years.
54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60
It is also possible for there to be more than one mode in the same distribution of data (bi-modal or multi-modal). The presence of more than one mode limits the mode's usefulness in describing the centre or typical value of the distribution, because no single value can be identified to describe the centre.
In some cases, particularly where the data are continuous, the distribution may have no mode at all (i.e.
if all values are different).
In cases such as these, it may be better to consider using the median or mean, or to group the data into appropriate intervals and find the modal class.
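The frequency count and the mode of the retirement-age data can be verified with Python's standard library; multimode also exposes bi- or multi-modal distributions:

```python
# Frequency distribution and mode of the retirement-age data above.
from collections import Counter
from statistics import multimode

ages = [54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60]

print(Counter(ages))    # Counter({54: 3, 57: 2, 58: 2, 60: 2, 55: 1, 56: 1})
print(multimode(ages))  # [54], i.e., a single mode of 54 years
```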
The median is the middle value in a distribution when the values are arranged in ascending or descending
order.
The median divides the distribution in half (there are 50% of observations on either side of the median
value). In a distribution with an odd number of observations, the median value is the middle value.
Looking at the retirement age distribution (which has 11 observations), the median is the middle value,
which is 57 years:
54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60
When the distribution has an even number of observations, the median value is the mean of the two
middle values. In the following distribution, the two middle values are 56 and 57, therefore the median
equals 56.5 years:
52, 54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60
The median is less affected by outliers and skewed data than the mean, and is usually the preferred
measure of central tendency when the distribution is not symmetrical.
The median cannot be identified for categorical nominal data, as it cannot be logically ordered.
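Both the odd- and even-sized examples above can be checked directly:

```python
# Medians of the 11-value and 12-value retirement-age distributions above.
from statistics import median

odd = [54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60]
even = [52] + odd  # adding a 12th value gives an even number of observations

print(median(odd))   # 57, the single middle value
print(median(even))  # 56.5, the mean of the two middle values (56 and 57)
```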
The mean is the sum of the value of each observation in a dataset divided by the number of
observations. This is also known as the arithmetic average.
54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60
The mean is calculated by adding together all the values (54+54+54+55+56+57+57+58+58+60+60 = 623)
and dividing by the number of observations (11) which equals 56.6 years.
The mean can be used for both continuous and discrete numeric data.
The mean cannot be calculated for categorical data, as the values cannot be summed.
As the mean includes every value in the distribution, it is influenced by outliers and skewed distributions.
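A quick check of the mean, together with how a single extreme value (a made-up 95 here) pulls the mean while barely moving the median:

```python
# Mean of the retirement-age data, and the effect of one extreme value.
from statistics import mean, median

ages = [54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60]
print(round(mean(ages), 1))          # 56.6 (623 / 11)

with_outlier = ages + [95]           # one hypothetical extreme retirement age
print(round(mean(with_outlier), 1))  # 59.8: the mean shifts noticeably
print(median(with_outlier))          # 57.0: the median barely moves
```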
1. Multiple-Choice Items
Many users regard the multiple-choice item as the most flexible and probably the most effective of the objective item types. A standard multiple-choice test item consists of two basic parts: a stem and a list of alternatives. The stem may be in the form of either a question or an incomplete statement, and the list of alternatives contains one correct or best alternative (the answer) and a number of incorrect or inferior alternatives (distractors).
The purpose of the distractors is to appear as plausible solutions to the problem for those students who
have not achieved the objective being measured by the test item. Conversely, the distractors must
appear as implausible solutions for those students who have achieved the objective. Only the answer
should appear reasonable to these students.
Multiple-choice items are flexible in measuring all levels of cognitive skills. They permit a wide sampling of content and objectives, provide highly reliable test scores, reduce the guessing factor compared with true-false items, and can be machine-scored quickly and accurately. On the other hand, multiple-choice items are difficult and time-consuming to construct, and they depend on the student's reading skill and the instructor's writing ability. The simplicity of writing low-level knowledge items leads instructors to neglect writing items that test higher-level thinking. These questions may also encourage guessing (but less than true-false).
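The reduced guessing factor is easy to quantify: a blind guess succeeds half the time on a true-false item but only a quarter of the time on a four-option multiple-choice item. A small illustration for a hypothetical 20-item test:

```python
# Expected scores from pure blind guessing on a hypothetical 20-item test.
n_items = 20
formats = {"true-false": 1 / 2, "four-option multiple-choice": 1 / 4}

for name, p_correct in formats.items():
    print(f"{name}: expected chance score = {n_items * p_correct:.0f}/{n_items}")
# true-false: 10/20; four-option multiple-choice: 5/20
```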
1. Design each item to measure an important learning outcome. Present a single, clearly formulated problem in the stem of the item; put the alternatives at the end of the question, not in the middle; and put as much of the wording as possible in the stem.
a. 1953
b. 1954 (correct)
c. 1955
d. 1956
a. 1930s
b. 1940s
c. 1950s (correct)
d. 1970s
2. All options should be homogeneous and reasonable, punctuation should be consistent, and all options should be grammatically consistent with the stem.
a. a musical instrument
b. a gardening tool
c. a metalworking tool
3. Reduce the length of the alternatives by moving as many words as possible to the stem. The
justification is that additional words in the alternatives have to be read four or five times.
a. average (correct)
b. midpoint
4. Construct the stem so that it conveys a complete thought, and avoid negatively worded items (e.g., “Which of the following is not…?”), verbatim textbook wording, and unnecessary words.
b. comparing teachers
5. Do not make the correct answer stand out because of its phrasing or length. Avoid overusing always
and never in the alternatives and overusing all of the above and none of the above. When all of the
above is used, students can eliminate it simply by knowing that one answer is false. Alternatively, they
will know to select it if any two answers are true.
Poor: A narrow strip of land bordered on both sides by water is called an _____.
a. isthmus (correct)
b. peninsula
c. bayou
d. continent
(Note: Do you see why a would be the best guess given the phrasing?)
Better: A narrow strip of land bordered on both sides by water is called a (n) _____.
2. True-False Items
The true-false item typically presents a declarative statement that the student must mark as either true or false. Instructors generally use true-false items to measure the recall of factual knowledge such as names, events, dates, definitions, etc. However, this format has the potential to measure higher levels of cognitive ability, such as comprehension of significant ideas and their application in solving problems.
They are relatively easy to write and can be answered quickly by students. Students can answer 50 true-
false items in the time it takes to answer 30 multiple-choice items. They provide the widest sampling of
content per unit of time.
Again, the problem of guessing is a major weakness. Students have a fifty percent chance of correctly answering an item without any knowledge of the content. Items are often ambiguous because of the difficulty of writing statements that are unequivocally true or false.
Poor: A good instructional objective will identify a performance standard. (True/False)
(Note: The correct answer here is technically false. However, the statement is debatable: while a performance standard is a feature of some “good” objectives, it is not necessary for an objective to be good.)
2. Convey only one thought or idea in a true/false statement and avoid verbal clues (specific
determiners like “always”) that indicate the answer.
Poor: Bloom’s cognitive taxonomy of objectives includes six levels of objectives, the lowest being
knowledge. (True/False)
3. Completion Items
Completion items provide a wide sampling of content and minimize guessing compared with multiple-choice and true-false items. However, they can rarely be written to measure more than simple recall of information; they are more time-consuming to score than other objective types; and they are difficult to write so that there is only one correct answer and no irrelevant clues.
1. Start with a direct question, then switch to an incomplete statement, and place the blank at the end of the statement.
2. Leave only one blank, and it should relate to the main point of the statement. Provide two blanks only if the answer consists of two consecutive words.
Better: The sine is the ratio of the ___ to the ___. (opposite side, hypotenuse)
3. Make the blanks uniform in length and avoid giving irrelevant clues to the correct answer.
Poor: The first president of the United States was _____. (Two words)
(Note: The desired answer is George Washington, but students may write “from Virginia”, “a general”,
and other creative expressions.)
Better: Give the first and last name of the first president of the United States: _____.
4. Matching Items
A matching exercise typically consists of a list of questions or problems to be answered along with a list of responses. The examinee is required to make an association between each question and a response. A large amount of material can be condensed to fit in less space, and students have substantially fewer chances of guessing correct associations than on multiple-choice and true-false tests. However, matching tests cannot effectively test higher-order intellectual skills.
1. Use homogeneous material in each list of a matching exercise. Mixing events and dates with events and names of persons, for example, makes the exercise two separate sets of questions and gives students a better chance to guess the correct response.
2. Put the problems or the stems (typically longer than the responses) in a numbered column at the
left and the response choices in a lettered column at the right. Always include more responses than
questions. If the lists are the same length, the last choice may be determined by elimination rather than
knowledge.
3. Arrange the list of responses in alphabetical or numerical order if possible in order to save reading
time. All the response choices must be likely, but make sure that there is only one correct choice for
each stem or numbered question.
Poor:
Column A                Column B
                        e) Columbus
                        f) Madam Curie
(Note: Here, column B consists of explorers, scientists, and politicians; even for students in higher classes, the list would be very heterogeneous.)
Good:
Column A        Column B
Eye             Digestion
Tongue          Hearing
Stomach         Breathing
Lung            Smelling
Ear             Seeing
                Tasting
                Chewing
5. Short-Answer Items
Short-answer questions should be restrictive enough to evaluate whether the correct answer is given. Allow a small amount of answer space to discourage the shotgun approach. These tests can cover a large amount of content within a given time period. Again, these test items are limited to testing lower-level cognitive objectives, such as the recall of facts, and scoring may not be as straightforward as anticipated.
Poor: What is the area of a rectangle whose length is 6m and breadth 75 cm?
Better: What is the area of a rectangle whose length is 6m and breadth 75 cm? Express your answer in
sq. m.
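(Note: 75 cm = 0.75 m, so the expected answer is 6 m × 0.75 m = 4.5 sq. m. Without the unit restriction, an answer such as 45,000 sq. cm would be equally defensible.)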
6. Essay Items
“A test item which requires a response composed by the examinee, usually in the form of one or more sentences, of a nature that no single response or pattern of responses can be listed as correct, and the accuracy and quality of which can be judged subjectively only by one skilled or informed in the subject.”
The difference between short-answer and essay questions is more than just the length of the response required. In essay questions there is more emphasis on the organization and integration of the material, such as when marshaling arguments to support a point of view or method. Essay questions can be used to measure the attainment of a variety of objectives. Stecklein (1955) listed 14 types of abilities that can be measured by essay items, among them:
4. Explanations of meanings
5. Summarizing of information in a designated area
6. Analysis
7. Knowledge of relationships
All of these involve the higher-level skills mentioned in Bloom’s Taxonomy, so essay questions provide an effective way of assessing complex learning outcomes.
Essay questions are especially useful when a paper-and-pencil test must assess students’ ability to make judgments that are well thought through and justifiable. Essay questions require students to demonstrate their reasoning and thinking skills, which gives teachers the opportunity to detect problems students may have with their reasoning processes. When educators detect problems in students’ thinking, they can help them overcome those problems.
1. Ask questions that are relatively specific and focused and which will elicit relatively brief responses.
Poor: Describe the role of instructional objectives in education. Discuss Bloom’s contribution to the
evaluation of instruction.
Better: Describe and differentiate between behavioral (Mager) and cognitive (Gronlund) objectives with
regard to their (1) format and (2) relative advantages and disadvantages for specifying instructional
intentions.
2. If you are using many essay questions in a test, ensure reasonable coverage of the course objectives.
Follow the test specifications in writing prompts. Questions should cover the subject areas as well as the
complexity of behaviors cited in the test blueprint. Pitch the questions at the students’ level.
Poor: What are the major advantages and limitations of essay questions?
Better: Given their advantages and limitations, should an essay question be used to assess students’
abilities to create a solution to a problem? In answering this question, provide brief explanations of the
major advantages and limitations of essay questions. Clearly state whether you think an essay question
should be used and explain the reasoning for your judgment.
The Poor example assesses recall of factual knowledge, whereas the Better example requires more of students: not only must they recall facts, they must also make an evaluative judgment and explain the reasoning behind it. The Better example thus demands more complex thinking than the Poor one.
3. Formulate questions that present a clear task to perform, indicate the point value for each question, provide ample time for answering, and use words that themselves give directions, e.g., define, illustrate, outline, select, classify, summarize, etc.
Better: Discuss the analytical method of teaching Mathematics, giving its characteristics, merits, demerits, and practicability. Give illustrations.
The construction of clear, unambiguous essay questions that call forth the desired responses is a much
more difficult task than is commonly presumed.
As we noted earlier, one of the major limitations of the essay test is the subjectivity of the scoring. That
is, the feeling of the scorers is likely to enter into the judgments they make concerning the quality of the
answers. This may be a personal bias toward the writer of the essay, toward certain areas of content or
styles of writing, or toward shortcomings in such extraneous areas as legibility, spelling, and grammar.
These biases, of course, distort the results of a measure of achievement and tend to lower their
reliability.
The following rules are designed to minimize the subjectivity of the scoring and to provide as uniform a
standard of scoring from one student to another as possible. These rules will be most effective, of
course, when the questions have been carefully prepared in accordance with the rules for construction.
Evaluate answers to essay questions in terms of the learning outcomes being measured. The essay test,
like the objective test, is used to obtain evidence concerning the extent to which clearly defined learning
outcomes have been achieved. Thus, the desired student performance specified in these outcomes
should serve as a guide both for constructing the questions and for evaluating the answers.
If a question is designed to measure “The Ability to Explain Cause-Effect Relations,” for example, the
answer should be evaluated in terms of how adequately the student explains the particular cause-effect
relations presented in the question.
All other factors, such as interesting but extraneous factual information, style of writing, and errors in
spelling and grammar, should be ignored (to the extent possible) during the evaluation. In some cases,
separate scores may be given for spelling or writing ability, but these should not be allowed to
contaminate the scores that represent the degree of achievement of the intended learning outcomes.
Score restricted-response answers by the point method, using a model answer as a guide. Scoring with
the aid of a previously prepared scoring key is possible with the restricted-response item because of the
limitations placed on the answer. The procedure involves writing a model answer to each question and
determining the number of points to be assigned to it and to the parts within it. The distribution of points within an answer, of course, takes into account all scoreable units indicated in the learning outcomes being measured. For example, points may be assigned to the relevance of the examples used
and to the organization of the answer, as well as to the content of the answer, if these are legitimate
aspects of the learning outcome. As indicated earlier, it is usually desirable to make clear to the student
at the time of testing the bases on which each answer will be judged (content, organization, and so on).
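As an illustration of the point method, a model answer can be represented as a checklist of scoreable units with maximum points; the criteria and weights below are hypothetical:

```python
# A minimal sketch of point-method scoring for a restricted-response essay
# item. The scoreable units and their point values are hypothetical.
model_answer = {
    "states the cause correctly": 3,
    "explains the resulting effect": 3,
    "gives a relevant example": 2,
    "organizes the answer logically": 2,
}

def score_response(points_earned):
    """Sum earned points, never exceeding the maximum for each unit."""
    return sum(min(points_earned.get(unit, 0), maximum)
               for unit, maximum in model_answer.items())

# One student's answer, judged unit by unit:
earned = {"states the cause correctly": 3, "explains the resulting effect": 2,
          "gives a relevant example": 2}
print(score_response(earned), "out of", sum(model_answer.values()))  # 7 out of 10
```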
Grade extended-response answers by the rating method, using defined criteria as a guide. Extended-
response items allow so much freedom in answering that the preparation of a model answer is
frequently impossible. Thus, the test maker usually grades each answer by judging its quality in terms of
a previously determined set of criteria, rather than scoring it point by point with a scoring key. The
criteria for judging the quality of an answer are determined by the nature of the question and thus by
the learning outcomes being measured.
Evaluate all of the students’ answers to one question before proceeding to the next question. Scoring or
grading essay tests question by question, rather than student by student, makes it possible to maintain a
uniform standard for judging the answers to each question. This procedure also helps offset the halo
effect in grading. When all of the answers on one paper are read together, the grader’s impression of
the paper as a whole is apt to influence the grades he assigns to the individual answers. Grading
question by question, of course, prevents the formation of this overall impression of a student’s paper.
It is more appropriate to judge each answer on its own merits when it is read and compared with other answers to the same question than when it is read and compared with other answers by the same student.
Evaluate answers to essay questions without knowing the identity of the writer. This is another attempt
to control personal bias during scoring. Answers to essay questions should be evaluated in terms of what
is written, not in terms of what is known about the writers from other contacts with them. The best way
to prevent our prior knowledge from biasing our judgment is to evaluate each answer without knowing
the identity of the writer. This can be done by having the students write their names on the back of the
paper or by using code numbers in place of names.
Whenever possible, have two or more persons grade each answer. The best way to check on the
reliability of the scoring of essay answers is to obtain two or more independent judgments. Although
this may not be a feasible practice for routine classroom testing, it might be done periodically with a
fellow teacher (one who is equally competent in the area). Obtaining two or more independent ratings
becomes especially vital where the results are to be used for important and irreversible decisions, such
as in the selection of students for further training or for special awards.
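When two independent graders' marks are available, their agreement can be summarized with a chance-corrected statistic such as Cohen's kappa (one common choice, not prescribed by this text). A minimal sketch with made-up letter grades:

```python
# A minimal sketch of Cohen's kappa for two graders' letter grades.
from collections import Counter

def cohens_kappa(rater1, rater2):
    n = len(rater1)
    observed = sum(a == b for a, b in zip(rater1, rater2)) / n   # raw agreement
    c1, c2 = Counter(rater1), Counter(rater2)
    expected = sum(c1[g] * c2[g] for g in c1) / (n * n)          # chance agreement
    return (observed - expected) / (1 - expected)

grader_1 = ["A", "B", "B", "C", "A", "B", "C", "C", "A", "B"]
grader_2 = ["A", "B", "C", "C", "A", "B", "C", "B", "A", "B"]
print(f"kappa = {cohens_kappa(grader_1, grader_2):.2f}")  # about 0.70 here
```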
Be on the alert for bluffing. Some students who do not know the answer may write a well-organized, coherent essay that nevertheless contains material irrelevant to the question. Decide how to treat irrelevant or inaccurate information contained in students’ answers. We should not give credit for irrelevant material: it is not fair to the other students, who may also have preferred to write on another topic but instead wrote on the required question.
Write comments on the students’ answers. Teacher comments make essay tests a good learning
experience for students. They also serve to refresh your memory of your evaluation should the student
question the grade.