EPSC 311 UPDATED NOTES
COURSE OUTLINE
Prerequisite: None
Course purpose
The course aims at enabling the learner to study the different types of tests that can be used to
monitor learners’ progress and thereby make informed decisions.
Expected Learning Outcomes
At the end of the course the students should be able to:
(i) Explain the relationship between measurement, evaluation, testing and assessment.
Course Content
Relationship between measurement, evaluation, testing and examination. Principles of
evaluation. Historical development of measuring instruments. Scales of measurement.
Characteristics of a good test. Types of tests and test formats. Test construction, planning and administration.
Week 5: CAT 1
Week 6: Types of tests and test formats. Analyse features of a good test; discuss factors that contribute to a poor test. Activities: group work, discussion, assignment and presentations.
Week 10: CAT 2
Week 11: Quantitative and qualitative methods applied in the selection of test items. Describe quantitative and qualitative methods applied in the selection of test items. Activities: in-class discussions.
Week 13: End of semester examination.
Week 14: End of semester examination.
Mode of delivery
This course shall be delivered through discussions and presentations, lectures, and internet
research.
Ayot, H. O., & Patel, M. M. (1992). Instructional methods (general methods). Nairobi: Educational Research and Publication.
Bali, S. K., Ingule, F. O., & Rono, P. K. (1989). Psychology of education (part two): Tests and measurements. Nairobi: Nairobi University Press.
Chase, C. I. (1999). Contemporary assessment for educators. New York: Longman.
Gronlund, N. E., & Linn, R. L. (1995). Measurement and evaluation in teaching. Englewood Cliffs, NJ.
Violet, K. N., Marcella, M., & Josphine, M. K. The Evaluation Dilemma in Kenya Education System.
Welch, C. J. (2006). Item and prompt development in performance testing. In S. M. Downing & T. M. Haladyna (Eds.), Handbook of test development. Mahwah, NJ: Lawrence Erlbaum.
Websites
http://www.sfsu.edu/~testing/MCTEST/testconstruction.html
Professional Ethics
1. Punctuality is fundamental and students are required to be in class before the designated
time for the lecture.
3. Let us refrain from signing the attendance register on behalf of colleagues who are not
present.
4. Plagiarism is a serious academic offense and is highly discouraged. Plagiarized work shall
NOT be accepted. Notwithstanding the above, collaboration in course work is certainly
encouraged as this promotes team spirit and group synergy, provided originality is preserved.
5. Assignments should be handed in on or before the date they are due. Students can hand in
their assignments through their class representative. Assignments handed in late shall NOT
be accepted.
TOPIC 1: RELATIONSHIP BETWEEN TEST, MEASUREMENT AND
EVALUATION
EXAMPLES:
(i) A teacher measures Wambua’s height to be 128 cm. She evaluates his height
when she says that he is short.
(ii) A teacher measures Atieno’s achievement in Geography to be 62%. He
evaluates her achievement in Geography when he says that Atieno’s
performance is satisfactory.
(iii) Mutuma measures the size of his classroom and finds that it is 4.5m x
3.5m x 2.5m. He evaluates the classroom dimensions when he reports that the
classroom is too small to be used for 50 students.
From the meanings and definitions above, it is obvious that the terms test, measurement and
evaluation are interrelated. Tests are specific instruments for measurement.
Administration of a test is a process of measurement; without a test, measurement is not
possible. Measurement is a technique necessary for evaluation. It represents the status of
certain attributes or properties and is a terminal process. Measurement describes a
situation; evaluation judges its worth or value.
Measurement is a technique of evaluation, and tests are tools of measurement. The terms test,
measurement and evaluation are clearly distinct but related. Teachers obtain measures
from tests in order to make fair evaluations about specific traits or characteristics of the
students. An evaluation often involves one or more tests, and in turn a test is involved in
one or more measurements.
One of the primary measurement tools in education is the assessment. Teachers gather
information by giving tests, conducting interviews and monitoring behavior. The assessment
should be carefully prepared and administered to ensure its reliability and validity. In other
words, an assessment must provide consistent results and it must measure what it claims to
measure.
Measurement determines the degree to which an individual possesses a defined characteristic. It
involves first defining the characteristic to be measured, and then selecting the instrument with
which it is measured. Test scores range from objective to subjective.
A test is objective when two or more people score the same test and assign similar scores. Tests
that are most objective are those that have a defined scoring system and are administered by
trained testers.
A subjective test lacks a standardized scoring system, which introduces a source of
measurement error. We use objective measurements whenever possible because they are more
reliable than subjective measurements. (Barrow and Rosemary, 1979)
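As a rough, hypothetical illustration of what "two or more people scoring the same test and assigning similar scores" can look like in practice, the short sketch below (invented marks) compares two raters with a simple agreement check:

```python
# Hypothetical marks two raters gave to the same five essays.
rater_a = [12, 15, 9, 14, 11]
rater_b = [11, 15, 10, 14, 12]

# Simple objectivity check: exact-agreement rate and mean absolute difference.
agreements = sum(a == b for a, b in zip(rater_a, rater_b))
print(f"exact agreement: {agreements}/{len(rater_a)}")
print("mean absolute difference:",
      sum(abs(a - b) for a, b in zip(rater_a, rater_b)) / len(rater_a))
```

The closer the raters' marks, the more objective the scoring; a formal index (inter-rater reliability) is discussed later in these notes.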
Evaluation is a dynamic decision-making process focusing on changes that have been made.
This process involves:
(i)Collecting suitable data (measurement)
(ii)Judging the value of these data according to some standard; and
(iii) Making decisions based on the data.
The function of evaluation is to facilitate rational decisions.
For the teacher, this can be to facilitate student learning; for the exercise specialist, this could
mean helping someone establish scientifically sound weight reduction goals.
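As a minimal sketch of these three steps (the 60% pass mark, the scores and the decisions below are illustrative assumptions, not prescribed by the notes):

```python
# Hypothetical illustration of the three evaluation steps:
# (i) collect data (measurement), (ii) judge it against a standard,
# (iii) make a decision based on that judgment.

PASS_MARK = 60  # assumed standard; any cut score could be used


def evaluate(student: str, score: float) -> str:
    """Judge a measured score against the standard and state a decision."""
    judgement = "satisfactory" if score >= PASS_MARK else "unsatisfactory"
    decision = "proceed to next unit" if score >= PASS_MARK else "plan remedial work"
    return f"{student}: measured {score}%, judged {judgement}, decision: {decision}"


# Step (i): measurement - scores obtained from a test (hypothetical data)
scores = {"Atieno": 62, "Wambua": 48}

# Steps (ii) and (iii): judgment and decision
for name, mark in scores.items():
    print(evaluate(name, mark))
```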
According to educator and author, Graham Nuthall (2017), in his book The Hidden Lives of
Learners, "In most of the classrooms we have studied, each student already knows about 40-50%
of what the teacher is teaching." The goal of data-driven instruction is to avoid teaching students
what they already know and teach what they do not know in a way the students will best respond
to.
For the same reason, educators and administrators understand that assessing students and
evaluating the results must be ongoing and frequent. Scheduled assessments are important to the
process, but teachers must also be prepared to re-assess students, even if informally, when they
sense students are either bored with the daily lesson or frustrated by material they are not
prepared for. Using the measurements of these intermittent formative assessments, teachers can
fine-tune instruction to meet the needs of their students on a daily and weekly basis.
Why is data-driven instruction so effective?
Accurately measuring student progress with reliable assessments and then evaluating the
information to make instruction more efficient, effective and interesting is what data-driven
instruction is all about. Educators who are willing to make thoughtful and intentional changes in
instruction based on more than the next chapter in the textbook find higher student engagement
and more highly motivated students.
In fact, when students are included in the evaluation process, they are more likely to be self-
motivated. Students who see the results of their work only on the quarterly or semester report
card have little opportunity to act on that feedback along the way.
When students are informed about the results of more frequent formative assessments and can
see how they have improved or where they need to improve, they more easily see the value of
investing time and energy in their daily lessons and projects.
Teachers teach content then test students. This cycle of teaching and testing is familiar to anyone
who has been a student. Tests seek to see what students have learned. However, there can be
other more complicated reasons as to why schools use tests.
At the school level, educators create tests to measure their students' understanding of specific
content or the effective application of critical thinking skills. Such tests are used to evaluate
student learning, skill level growth and academic achievements at the end of an instructional
period, such as the end of a term, unit, course, semester, program or school year.
Summative Tests
At the national level, standardized tests are an additional form of summative assessment, e.g.,
KCPE, KCSE and other KNEC exams.
Disadvantages
1. Critics claim that tests demand time that could be used for instruction and innovation.
2. Schools are under pressure to "teach to the test," a practice that could limit the curricula.
3. Students with special needs or varied learning environments may be at a disadvantage when
they take standardized tests. Compare day schools and national schools in Kenya
4. Summative standardized tests create a lot of anxiety due to the associated value for future
careers, secondary schools, college or universities admission.
The obvious point of classroom testing is to assess what students have learned after the
completion of a lesson or unit. When the classroom tests are tied to well-written lesson
objectives, a teacher can analyze the results to see whether the majority of students did well or
need more work. This information may help the teacher create small groups or to use
differentiated instructional strategies.
Educators can also use tests as teaching tools, especially if a student did not understand the
questions or directions. Teachers may also use tests when they are discussing student progress at
team meetings, e.g., PTA meetings.
Another use of tests at the school level is to determine student strengths and weaknesses. One
effective example of this is when teachers use pretests at the beginning of units to find out what
students already know and figure out where to focus the lesson (Entry behavior).
Testing also measures the effectiveness of the teaching and learning process and the curriculum
through summative evaluation.
National examinations can also be compared against the best international
examination standards and practices.
INTRODUCTION
In the last unit, you read through important definitions in measurement and evaluation, and you
saw the types of evaluation and the purposes of evaluation. In this unit we shall
move another step and look at the historical development of testing and evaluation.
This may help you to appreciate the course more, and also to appreciate the early contributors to
the field.
OBJECTIVES
2200 B.C.: Chinese emperor examined his officials every third year to determine
their fitness for office.
1862 A.D.: Wilhelm Wundt uses a calibrated pendulum to measure the “speed of
thought”.
1869: Scientific study of individual differences begins with the publication of
Francis Galton’s Classification of Men According to Their Natural Gifts.
1879: Wundt establishes the first psychological laboratory in Leipzig, Germany.
1884: Galton administers the first test battery to thousands of citizens at the
International Health Exhibit.
1888: J.M. Cattell opens a testing laboratory at the University of Pennsylvania.
1890: Cattell uses the term "mental test" in announcing the agenda for his
Galtonian test battery.
1901: Clark Wissler discovers that Cattellian “brass instruments” tests have no
correlation with college grades.
In statistics, there are four data measurement scales: nominal, ordinal, interval and ratio. These
are simply ways to sub-categorize different types of data.
Nominal Scale
Nominal scales are used for labeling variables, without any quantitative value. “Nominal”
scales could simply be called “labels.” A common example is gender (male/female): the
categories are mutually exclusive (no overlap) and none of them have any numerical significance. A
good way to remember all of this is that “nominal” sounds a lot like “name”, and nominal scales
are kind of like “names” or labels.
Note: a sub-type of nominal scale with only two categories (e.g. male/female) is called
“dichotomous.”
Other sub-types of nominal data are “nominal with order” (like “cold, warm, hot, very hot”) and
nominal without order (like “male/female”).
Ordinal Scale
With ordinal scales, the order of the values is what’s important and significant, but the
differences between the values are not really known. For example, on a four-point satisfaction
scale we know that 4 is better than 3 or 2, but we don’t know, and cannot quantify, how much
better it is. Is the difference between “OK” and “Unhappy” the same as the
difference between “Very Happy” and “Happy”? We can’t say.
Ordinal scales are typically measures of non-numeric concepts like satisfaction, happiness,
discomfort, etc.
The best way to determine central tendency on a set of ordinal data is to use the mode or median.
Interval Scale
Interval scales are numeric scales in which we know both the order and the exact differences
between the values. The classic example of an interval scale is Celsius temperature because the
difference between each value is the same. For example, the difference between 60 and 50
degrees is a measurable 10 degrees, as is the difference between 80 and 70 degrees.
Interval scales are nice because the realm of statistical analysis on these data sets opens up. For
example, central tendency can be measured by mode, median, or mean; standard deviation can
also be calculated.
Like the others, you can remember the key points of an “interval scale” pretty easily. “Interval”
itself means “space in between,” which is the important thing to remember–interval scales not
only tell us about order, but also about the value between each item.
Here’s the problem with interval scales: they don’t have a “true zero.” For example, there is no
such thing as “no temperature,” at least not with Celsius. In the case of interval scales, zero
doesn’t mean the absence of value, but is actually another number used on the scale, like 0
degrees Celsius. Negative numbers also have meaning. Without a true zero, it is impossible to
compute ratios: 20 degrees Celsius is not “twice as hot” as 10 degrees Celsius. With interval
data, we can add and subtract, but cannot multiply or divide.
Ratio Scale
Ratio scales tell us about the order, they tell us the exact value between units, and they also have
an absolute zero–which allows for a wide range of both descriptive and inferential statistics to
be applied. Good examples of ratio variables include height, weight, and duration.
Ratio scales provide a wealth of possibilities when it comes to statistical analysis. These
variables can be meaningfully added, subtracted, multiplied, divided (ratios). Central
tendency can be measured by mode, median, or mean; measures of dispersion, such as standard
deviation and coefficient of variation can also be calculated from ratio scales.
Summary
In summary, nominal variables are used to “name,” or label a series of values. Ordinal scales
provide good information about the order of choices, such as in a customer satisfaction survey.
Interval scales give us the order of values + the ability to quantify the difference between each
one. Finally, Ratio scales give us the ultimate–order, interval values, plus the ability to
calculate ratios since a “true zero” can be defined.
Summary of data types and scale measures
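As a small illustrative sketch (hypothetical data, Python standard library only), the following shows which measures of central tendency and which arithmetic operations are meaningful on each of the four scales:

```python
from statistics import mode, median, mean

# Hypothetical data illustrating the four scales of measurement.
nominal = ["male", "female", "female", "male", "female"]   # labels only
ordinal = [1, 2, 2, 3, 4]          # e.g. 1 = Unhappy ... 4 = Very Happy (order only)
interval = [10.0, 20.0, 20.0, 30.0]   # degrees Celsius (no true zero)
ratio = [1.55, 1.60, 1.60, 1.72]      # height in metres (true zero exists)

# Nominal: only the mode is meaningful.
print("nominal mode:", mode(nominal))

# Ordinal: mode and median are meaningful; the mean is not, because the
# distance between adjacent categories is unknown.
print("ordinal mode/median:", mode(ordinal), median(ordinal))

# Interval: mode, median and mean are all meaningful, and differences are valid...
print("interval mean:", mean(interval))
print("interval difference:", interval[3] - interval[0], "degrees")
# ...but ratios are not: 20 C is not "twice as hot" as 10 C,
# because 0 C is not the absence of temperature.

# Ratio: all of the above plus meaningful ratios.
print("ratio of tallest to shortest:", ratio[3] / ratio[0])
```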
A. Types of Testing
There are four types of testing in schools today — diagnostic, formative, benchmark, and
summative. What purpose does each serve? How should parents use them and interpret the
feedback from them?
1. Diagnostic Testing
This testing is used to “diagnose” what a student knows and does not know. Diagnostic testing
typically happens at the start of a new phase of education, like when students will start learning a
new unit. The test covers topics students will be taught in the upcoming lessons.
Teachers use diagnostic testing information to guide what and how they teach. For example, they
will plan to spend more time on the skills that students struggled with most on the diagnostic
test. If students did particularly well on a given section, on the other hand, they may cover that
content more quickly in class. Students are not expected to have mastered all the information in a
diagnostic test.
Diagnostic testing can be a helpful tool for parents. The feedback children receive on these tests
helps the parent to know what kind of content they will be focusing on in class and to anticipate
which skills or areas they may have trouble with.
2. Formative Testing
This type of testing is used to gauge student learning during the lesson. It is used throughout a
lecture and designed to give students the opportunity to demonstrate that they have understood
the material. This informal, low-stakes testing happens in an ongoing manner, and student
performance on formative testing tends to get better as a lesson progresses.
Schools normally do not send home reports on formative testing, but it is an important part of
teaching and learning. If you help your children with their homework, you are likely using a
version of formative testing as you work together.
3. Benchmark Testing
This testing is used to check whether students have mastered a unit of content. Benchmark
testing is given during or after a classroom focuses on a section of material, and covers either
part or all of the content that has been taught up to that time. The assessments are designed to let
teachers know whether students have understood the material that’s been covered.
Unlike diagnostic testing, students are expected to have mastered material on benchmark tests,
since they cover what the children have been focusing on in the classroom. Teachers and parents
will often receive feedback about how the children have grasped each skill assessed on a
benchmark test. This feedback is very important since it gives insight into exactly which
concepts the student did not master.
4. Summative Testing
This testing is used as a checkpoint at the end of the year or course to assess how much content
students learned overall. This type of testing is similar to benchmark testing, but instead of only
covering one unit, it cumulatively covers everything students have been spending time on
throughout the year.
These tests are given using the same process to all students in a classroom, school, or state, so
that everyone has an equal opportunity to demonstrate what they know and what they can do.
Students are expected to demonstrate their ability to perform at a level prescribed as the
proficiency standard for the test.
Since summative tests cover the full range of concepts for a given grade level, they are not able
to assess any one concept deeply. So, the feedback is not nearly as rich or constructive as
feedback from a diagnostic or formative test. Instead, these tests serve as a final check that
students learned what was expected of them in a given unit.
We need a balance of the four different types of testing in order to get a holistic view of our
children’s academic performance. Each type of test differs according to its purpose, timing, skill
coverage, and expectations of students.
Though each type offers important feedback, the real value is in putting all that data together:
Using a diagnostic test, you can gauge what a student already knows and what
she will need to learn in the upcoming unit.
Formative tests help teachers and parents monitor the progress a student is
making on a daily basis.
A benchmark test can be used as an early indicator of whether students have met
the lesson’s goals, allowing parents and teachers to reteach concepts that the
student may be struggling with.
Ideally, when heading into the summative testing, teachers and parents should already know the
extent to which a student has learned the material. The summative testing provides that final
confirmation.
B. TYPES OF TESTS
There are two main categories of tests: SUBJECTIVE AND OBJECTIVE TESTS
Objective test: this is a test consisting of factual questions requiring extremely short answers
that can be quickly and unambiguously scored by anyone with an answer key. They are tests that
call for short answer which may consist of one word, a phrase or a sentence.
Subjective test: this is a type of test that is evaluated by giving opinion. They are more
challenging and expensive to prepare, administer and evaluate correctly, though they can be
more valid.
1) True/False Items
Description: In this format, statements are presented as either true or false, and test-takers must
indicate the correctness of each statement. They are easy to prepare, can be marked objectively
and cover a wide range of topics
Purpose: Useful for assessing basic knowledge and the ability to differentiate between true and
false information.
Advantages
i. can test a large body of material
ii. they are easy to score
Disadvantages
i. Difficult to construct questions that are definitely or unequivocally true or false.
ii. They are prone to guessing
2) Matching Items
Description: Test-takers are presented with two columns, one containing items and the other
containing corresponding answers or options. They must match items from one column with the
correct counterparts in the other.
Purpose: Useful for assessing knowledge of relationships, associations, or definitions.
Advantages
a. Measures primarily associations and relationships as well as sequence of events.
b. Can be used to measure questions beginning with who, when, where and what
c. Relatively easy to construct
d. They are easy to score
Disadvantages
a. Difficult to construct effective questions that measure higher order thinking
3) Completion Items
In this format, learners are required to supply the words or figures which have been left out. They may
be presented in the form of questions or phrases in which a learner is required to respond with a
word or several statements.
Advantages
• Relatively easy to construct.
• Can cover a wide range of content.
• Reduces guessing.
Disadvantages
-Primarily used for lower levels of thinking.
-Prone to ambiguity.
-Must be constructed carefully so as not to provide too many clues to the correct answer.
-Scoring is dependent on the judgment of the evaluator.
5. Standardized Tests
Description: These are carefully designed and norm-referenced assessments administered under
standardized conditions to ensure fairness and comparability. Examples include SAT, ACT,
GRE, and IQ tests.
Purpose: Commonly used for educational admissions, employment selection, and large-scale
assessment purposes.
6. Fill-in-the-Blank Tests:
Description: Test-takers are presented with sentences or questions with one or more blanks, and
they must provide the missing words or phrases.
Purpose: Typically used to assess factual knowledge and understanding of concepts.
7. Short-Response Tests
Description: Similar to essay tests, but with shorter responses. Test-takers are asked to provide
concise answers or explanations to questions or prompts.
Purpose: Suitable for assessing knowledge, comprehension, and the ability to provide brief
explanations.
8. Performance Tests
Description: These tests require test-takers to perform specific tasks or demonstrate skills.
Examples include driving tests, laboratory experiments, or hands-on assessments.
Purpose: Effective for evaluating practical skills and abilities in real-world scenarios.
9. Oral Tests
10. Portfolio Assessments
Description: This format involves collecting and reviewing a selection of a person's work or
artifacts over time. It can include essays, projects, artwork, or other evidence of learning and
achievement.
Purpose: Often used to assess a person's overall progress, development, and skills in a holistic
manner.
11. Formative and Summative Assessments
12. Diagnostic Tests
Limitations of essay/subjective tests
i. They are insufficient for measuring knowledge of factual material because they call for
extensive detail in a selected content area at a time.
ii. Scoring is difficult and unreliable.
Examples of Subjective/Essay Test Items
Extended response item
Imagine that you and a friend had a visit in a national park. Write a story about an adventure that
you and your friend had in the national park.
Marble
Gneiss
Slate
Kenya
True-false item
Adolescents face an identity crisis period (true/false)
Before discussing test construction, it is important to describe the qualities of a good test and
the factors that affect a test.
Qualities of a Good Test/exam
Validity - A good test should measure what it is supposed to measure i.e. it should measure
specific objective(s) of the test set. A test that is set in a language that is not understandable is
invalid.
Reliability - A good test should yield the same results on a re-test on the same group of learners
under similar conditions.
Practicality /Usability- A test is said to be practical or usable if it can be readily used by the
teacher in everyday classroom conditions.
Cost - A test that costs too much to produce, or whose marking scheme is hard to prepare, is
rendered practically useless.
Factors to consider when constructing a Test
Specification of objectives - The vocabulary used should elicit the kind of responses
required from the candidates.
Content -The examiner should ensure that questions set cover all topics taught/covered in class.
Emphasized content areas - Some content areas/topics should be given more emphasis than
others, depending on the time spent covering them and the total number of questions usually set
from such topics.
Ability level of students - Questions set should be able to differentiate between bright, average
and weak pupils.
Specification for types of domains to be measured- Questions set should include cognitive,
affective and psychomotor domains.
Specification of the cognitive domain to be measured
These include the levels of Bloom’s taxonomy, as indicated in Topic 3.
TABLE OF SPECIFICATION AND THE IMPORTANCE OF TABLE OF
SPECIFICATION
A Table of Specification (TOS), also known as a Test Blueprint, is a document or matrix that
outlines the content, skills, and cognitive levels that a test or assessment will measure. It
provides a clear and organized plan for test development, ensuring that the assessment aligns
with the intended learning objectives and curriculum.
Components of a Blueprint:
1. Content Areas: A blueprint specifies the content areas or topics that will be covered in the
assessment. It provides a clear outline of what knowledge or skills are being assessed.
2. Cognitive Levels: Blueprints define the cognitive levels or thinking skills that the assessment
will target. These levels are often based on educational taxonomies like Bloom's Taxonomy and
may include categories such as knowledge, comprehension, application, analysis, synthesis, and
evaluation.
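A minimal sketch of how these two components combine into a blueprint matrix follows; the topics, cognitive levels and item counts are hypothetical, chosen only to show the layout and the row/column totals that make the weighting explicit:

```python
# Hypothetical Table of Specification (test blueprint): rows are content
# areas, columns are cognitive levels, cells are numbers of items.
blueprint = {
    "Scales of measurement": {"Knowledge": 3, "Comprehension": 2, "Application": 1},
    "Types of tests":        {"Knowledge": 2, "Comprehension": 3, "Application": 2},
    "Test construction":     {"Knowledge": 1, "Comprehension": 2, "Application": 4},
}

levels = ["Knowledge", "Comprehension", "Application"]

# Print the matrix with row and column totals so the weighting is visible.
print(f"{'Content area':<24}" + "".join(f"{lvl:>15}" for lvl in levels) + f"{'Total':>8}")
for topic, cells in blueprint.items():
    row_total = sum(cells.values())
    print(f"{topic:<24}" + "".join(f"{cells[lvl]:>15}" for lvl in levels) + f"{row_total:>8}")

column_totals = [sum(cells[lvl] for cells in blueprint.values()) for lvl in levels]
print(f"{'Total':<24}" + "".join(f"{t:>15}" for t in column_totals)
      + f"{sum(column_totals):>8}")
```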
Applications of a Blueprint:
1. Assessment Design: Blueprints serve as a foundational document for designing assessments that
are aligned with specific learning objectives and curricular goals. They guide test developers in
creating questions or tasks that accurately measure the intended content and cognitive focus.
2. Content Coverage: Blueprints ensure that the assessment covers a balanced representation of
the content areas and cognitive levels outlined in the curriculum. This prevents overemphasis on
certain topics and provides a comprehensive evaluation of student knowledge and skills.
3. Fairness: By specifying the content areas and cognitive levels in advance, blueprints help ensure
that the assessment is fair to all test-takers. No single topic or skill should be disproportionately
emphasized, which can reduce bias and promote equity.
4. Efficiency: Blueprints contribute to the efficient development of assessments by providing a
clear roadmap for item writers and test constructors. This helps streamline the item-writing
process and ensures that the assessment aligns with the intended focus.
5. Alignment: Blueprints facilitate alignment between the assessment and the curriculum or
learning standards. They ensure that the test measures what is taught and intended to be learned.
6. Validity and Reliability: Following a blueprint can enhance the validity of the assessment by
aligning it with the intended objectives. Additionally, a well-constructed blueprint can contribute
to the reliability of the assessment by ensuring consistency in content coverage and cognitive
focus.
7. Transparency: Blueprints promote transparency in assessment practices by clearly
communicating the assessment's design and intent to stakeholders, including students, teachers,
and administrators.
8. Feedback and Improvement: After the assessment is administered, blueprints can be used to
analyze the results in relation to the planned content and cognitive distribution. This information
can inform decisions about the assessment's quality and areas for improvement.
In summary, a blueprint is a structured plan that guides the design and development of
assessments. Its applications include aligning assessments with learning objectives, promoting
fairness, enhancing efficiency, and ensuring that assessments accurately measure the intended
content and cognitive levels. Blueprints play a crucial role in educational and standardized
testing, helping to create assessments that are valid, reliable, and aligned with educational goals.
Tests should be constructed and administered in such a way that the scores (marks) reflect the
ability they are supposed to measure.
The type of test to be constructed depends on the nature of the ability it’s meant to measure and
purpose of the test.
Certain types of educational tests can only be constructed by teams of suitably qualified and
equipped researchers.
The process of test construction is long and painstaking for it involves creating large batteries of
test questions in the particular area to be examined followed by extensive trials in order to assess
their effectiveness.
The prompt for a subjective item poses a question, presents a problem, or prescribes a task. It
sets forth a set of circumstances to provide a common context for framing the response.
Action verbs direct the examinee to focus on the desired behavior, for instance: solve, interpret,
compare and contrast, discuss or explain. Appropriate directions indicate the expected length and
format of the response, allowable resources or equipment, time limits, and the features of the
response that count in scoring.
Scoring response
During subjective scoring, at least four types of rater error may occur: the rater (i) becomes more
lenient or severe over time, or scores erratically due to fatigue or distraction; (ii) has knowledge
or beliefs about an examinee that influence perception of the response; (iii) is influenced by the
examinee's good or poor performance on previous items; or (iv) is influenced by the strength or
weakness of a preceding examinee's response.
Under extended response items we can take the essay test as an example and look at how it is
constructed:
- Essay items require learners to write or type the answer in a number of paragraphs. The
learners use their own words and organize the information or material as they see it fit.
- In writing essay test, clear and unambiguous language should be used. Words such as ‘how’,
‘why’, ‘contrast’, ‘describe’ and discuss are useful. The questions should clearly define the scope
of the answer required.
- The time provided for the learner to respond to the questions should be sufficient for the
amount of writing required for a satisfactory response. The validity of questions can be enhanced
by ensuring that the questions correspond closely to the goals or objective being tested.
- An indication of the length of the answer required should be given.
B. Test Planning:
C. Test Administration:
Training: Ensure that test administrators are adequately trained in test administration
procedures, including maintaining test security and following standardized instructions.
Preparation: Set up the test environment before test-takers arrive. Check that all necessary
materials are available, including test booklets, answer sheets, and any required equipment.
Instructions: Clearly and concisely provide instructions to test-takers regarding the format of
the test, time limits, and any special procedures.
Monitoring: Supervise the test administration to prevent misconduct, such as cheating or
unauthorized assistance.
Accommodations: Provide accommodations, as needed, for individuals with disabilities or
special needs, ensuring equal access to the assessment.
Time Management: Keep track of the time during the test to signal when specific sections or
tasks need to be completed.
Handling Issues: Address any issues or disruptions during the test administration promptly and
professionally.
Collection of Test Materials: Collect all test materials, including completed test booklets and
answer sheets, and ensure they are securely stored.
Quantitative Methods
Quantitative methods provide a systematic and data-driven approach to the selection of test
items, ensuring that assessments are fair, valid, and reliable. These methods help assessment
developers create tests that accurately measure the intended constructs and provide meaningful
information for decision-making in various fields, including education and psychology.
1. Item Difficulty:
Application: Analyzing item difficulty can help ensure that the test includes items with
an appropriate level of challenge.
2. Item Discrimination:
Application: Items with high discrimination values are typically retained, as they
effectively separate individuals with different levels of the construct being measured.
3. Item-Total Correlation:
Application: Items with low item-total correlations may need revision or removal, as they
may not be contributing effectively to the measurement of the intended construct.
4. Item Analysis Statistics:
Application: Item analysis statistics help identify problematic items, allowing for data-
driven decisions in item selection and revision (a short numerical sketch follows this list).
5. Item Response Theory (IRT): IRT is a sophisticated quantitative approach used to model
the relationship between an individual's ability and their responses to test items. IRT models
provide valuable information about item difficulty and discrimination and are commonly
used in the development of standardized tests.
6. Factor Analysis: Factor analysis is employed to identify underlying factors or dimensions in
a set of test items. This technique helps ensure that test items are measuring the intended
constructs and can reveal if there is redundancy or overlap among items.
7. Differential Item Functioning (DIF) Analysis: DIF analysis examines whether test items
function differently for different subgroups of test takers (e.g., males vs. females, different
ethnic groups). Detecting DIF is important to ensure that test items are unbiased and do not
favor one group over another.
8. Item Banking: Item banking involves storing test items in a database and categorizing them
based on their characteristics (e.g., difficulty level, content area). Quantitative methods are
used to manage and organize item banks effectively, making it easier to select items for
specific tests.
9. Computerized Adaptive Testing (CAT): CAT uses quantitative algorithms to select test
items based on a test taker's responses to previous items. This adaptive approach tailors the
test to an individual's ability level, allowing for more precise and efficient assessment.
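The sketch below pulls together, on an invented 0/1 score matrix, the classical item statistics listed above (item difficulty, a simple discrimination index, item-total correlation), the two-parameter logistic curve used in IRT, and a deliberately naive version of the adaptive item pick that CAT relies on. All data and parameter values are illustrative assumptions:

```python
import math
from statistics import correlation  # Python 3.10+

# Hypothetical 0/1 responses: rows = examinees, columns = items.
responses = [
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 1, 1],
    [1, 1, 1, 1],
    [0, 0, 0, 0],
    [1, 0, 0, 0],
]
n_items = len(responses[0])
totals = [sum(row) for row in responses]

# 1. Item difficulty (p-value): proportion of examinees answering correctly.
difficulty = [sum(row[i] for row in responses) / len(responses) for i in range(n_items)]

# 2. Discrimination index: p(upper group) minus p(lower group) on total score.
ranked = sorted(responses, key=sum)
half = len(ranked) // 2
lower, upper = ranked[:half], ranked[-half:]
discrimination = [
    sum(r[i] for r in upper) / half - sum(r[i] for r in lower) / half
    for i in range(n_items)
]

# 3. Item-total correlation (Pearson r between item score and total score).
item_total = [correlation([row[i] for row in responses], totals) for i in range(n_items)]

for i in range(n_items):
    print(f"item {i + 1}: difficulty={difficulty[i]:.2f} "
          f"discrimination={discrimination[i]:.2f} item-total r={item_total[i]:.2f}")

# 5. IRT: two-parameter logistic model, P(correct | ability theta).
def p_correct(theta: float, a: float, b: float) -> float:
    """a = discrimination, b = difficulty (hypothetical values below)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

print("P(correct) for theta=0.5, a=1.2, b=0.0:", round(p_correct(0.5, 1.2, 0.0), 3))

# 9. CAT in miniature: pick the unused item whose difficulty is closest to the
# current ability estimate (real CAT engines use item information functions).
def next_item(theta: float, bank: dict, used: set) -> str:
    return min((k for k in bank if k not in used), key=lambda k: abs(bank[k] - theta))

bank = {"Q1": -1.0, "Q2": 0.0, "Q3": 1.0}   # hypothetical item difficulties (b values)
print("next item for theta=0.4:", next_item(0.4, bank, used={"Q2"}))
```

Real item analysis works on much larger samples and usually corrects the item-total correlation by removing the item from the total, but the quantities computed here are the ones the list above refers to.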
1. Expert Reviews:
Application: Expert reviews can identify items that may need refinement or revision
based on the reviewers' expertise in the subject area.
2. Cognitive Interviews:
3. Focus Groups and Pilot Testing:
Description: Focus groups and pilot testing involve gathering feedback from a sample of
test-takers on the overall test, including individual items. This qualitative input can reveal
any difficulties or concerns that arise during the testing process.
Application: Feedback from focus groups and pilot testing can inform revisions to test
items and administration procedures.
4. Item Review Committees:
Application: Item review committees help identify and address issues related to bias,
fairness, and alignment with testing goals.
5. Content Analysis:
Application: Content analysis helps verify that test items align with the content and
objectives of the assessment.
NOTE
Quantitative and qualitative methods can complement each other in the item selection
process. Quantitative analyses provide numerical insights into item performance, while
qualitative methods offer deeper insights into the cognitive processes and perceptions of test-
takers and experts. The combination of both approaches helps ensure the quality and
effectiveness of test items in assessing the desired constructs accurately.
Test validation is a crucial process in the development and use of assessments, whether they are
educational exams, psychological tests, or any other type of evaluation. Validation refers to the
process of gathering evidence to support the appropriateness, meaningfulness, and effectiveness
of a test in measuring the intended construct or trait. Two key aspects of test validation are
validity and reliability.
1. Test Validity
Validity refers to the extent to which a test measures what it is intended to measure. It is the most
critical aspect of test quality because it determines whether the test results are accurate and
meaningful in assessing the specific trait or construct of interest.
There are several types of validity evidence that can be collected during the validation
process:
i) Content Validity: Content validity assesses whether the items on a test adequately represent
the domain or content that the test is supposed to measure. This is typically assessed through
expert judgment and content analysis to ensure the test items cover the relevant material
comprehensively.
ii) Criterion-Related Validity: This examines how well test scores relate to an external
criterion measure of the same trait. Concurrent and predictive validity (below) are its two
common forms.
iii) Concurrent Validity: This involves comparing test scores to external criteria at the same
time.
iv) Predictive Validity: This assesses the ability of test scores to predict future outcomes; it is
usually expressed as a correlation between test scores and the criterion (a small numerical
sketch follows this list of validity types).
v) Construct Validity: Construct validity examines whether the test accurately measures an
underlying theoretical construct or trait. It involves a more abstract and theoretical evaluation
of the test's properties and often relies on the use of multiple methods and converging
evidence.
Factors Affecting Construct Validity:
Theoretical Framework: The test should be grounded in a sound theoretical framework that
defines and characterizes the construct of interest.
Convergent and Discriminant Validity: Construct validity can be influenced by the extent to
which the test correlates with other tests measuring similar or different constructs, respectively.
vi) Face Validity: Face validity refers to the superficial appearance of a test and whether it
appears to measure what it claims to measure. It is not a strong form of validity and may not
necessarily indicate a valid test.
Factors Affecting Face Validity:
Item Wording: The wording and presentation of test items can influence how the test is
perceived. Ambiguous or poorly worded items can reduce face validity.
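Concurrent and predictive validity are typically reported as a validity coefficient, i.e. a correlation between test scores and a criterion measure. A minimal sketch with invented scores:

```python
from statistics import correlation  # Python 3.10+

# Hypothetical data: entrance-test scores and a later criterion
# (end-of-year average) for the same students.
test_scores = [45, 55, 60, 62, 70, 78, 85]
criterion = [40, 52, 58, 65, 68, 80, 83]

# Predictive (or concurrent) validity coefficient: Pearson r between
# the test and the criterion measure; values closer to 1.0 are stronger evidence.
validity_coefficient = correlation(test_scores, criterion)
print(f"validity coefficient r = {validity_coefficient:.2f}")
```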
Establishing validity involves conducting research and gathering evidence to support the test's
claims. The evidence can come from various sources, including expert judgments, statistical
analyses, and empirical studies. Collecting validity evidence is an ongoing process, and it helps
ensure that the test remains relevant and accurate for its intended purpose.
Formative Validity: When applied to outcomes assessment, this is used to assess how well a
measure is able to provide information to help improve the program under study. Example:
when designing a rubric for history, one could assess students' knowledge across the discipline.
Test validation is of paramount importance in the field of assessment and testing for several
reasons:
1. Ensures Accuracy and Fairness: Validation ensures that a test accurately measures what it is
intended to measure. It helps confirm that the test's scores are a valid representation of the
construct or trait being assessed. This accuracy is essential to make fair and informed decisions
based on test results.
2. Reduces Bias and Discrimination: Proper validation helps identify and minimize biases in test
items or procedures that could unfairly disadvantage certain groups of test-takers based on
factors like gender, ethnicity, or socioeconomic status. It promotes fairness and equity in
assessment.
3. Enhances Test Utility: Valid tests provide meaningful and useful information. They are more
likely to meet their intended purposes, whether in education, clinical practice, research, or other
domains. Valid assessments are valuable tools for decision-making.
4. Supports Informed Decision-Making: Valid test results are essential for making informed
decisions about individuals, such as educational placement, job selection, or clinical diagnosis.
Without validation, decisions may be arbitrary or unreliable.
In summary, test validation is essential because it underpins the reliability, fairness, and
usefulness of assessments. It ensures that test results are accurate and appropriate for their
intended purposes, safeguarding individuals' rights and promoting informed decision-making in
various fields.
Reliability refers to the consistency and stability of test scores when the same test is administered
multiple times or by different raters. In other words, a reliable test should produce similar results
when given to the same individuals under consistent conditions.
Test-Retest Reliability: This assesses the consistency of scores when the same test is
administered to the same group of individuals on two different occasions. It measures the
stability of scores over time.
Internal Consistency Reliability: Internal consistency examines the degree to which the items
within a test are measuring the same underlying construct. Common measures of internal
consistency include Cronbach's alpha and split-half reliability (a small numerical sketch follows
this list of reliability types).
Inter-Rater Reliability: Inter-rater reliability is relevant when multiple raters or observers are
involved in scoring assessments. It assesses the degree of agreement among raters.
Parallel Forms Reliability: Parallel forms reliability assesses the consistency of scores when
two equivalent forms of a test are administered to the same group of individuals. It measures the
consistency of scores across different versions of a test.
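A small sketch (invented scores) of the two internal-consistency indices named above, Cronbach's alpha and split-half reliability with the Spearman-Brown step-up:

```python
from statistics import variance, correlation  # Python 3.10+

# Hypothetical scores: rows = examinees, columns = 4 items on one test.
scores = [
    [4, 5, 4, 5],
    [3, 3, 4, 3],
    [5, 5, 5, 4],
    [2, 3, 2, 3],
    [4, 4, 3, 4],
]
k = len(scores[0])                      # number of items
totals = [sum(row) for row in scores]   # total score per examinee

# Cronbach's alpha = k/(k-1) * (1 - sum of item variances / variance of totals)
item_variances = [variance([row[i] for row in scores]) for i in range(k)]
alpha = (k / (k - 1)) * (1 - sum(item_variances) / variance(totals))
print(f"Cronbach's alpha = {alpha:.2f}")

# Split-half reliability: correlate odd-item and even-item half scores,
# then step up with the Spearman-Brown prophecy formula.
odd_half = [row[0] + row[2] for row in scores]
even_half = [row[1] + row[3] for row in scores]
r_half = correlation(odd_half, even_half)
split_half = (2 * r_half) / (1 + r_half)
print(f"split-half (Spearman-Brown) = {split_half:.2f}")
```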
Reliability is crucial because if a test is not reliable, it cannot be valid. In other words, a test that
produces inconsistent results cannot accurately measure the trait or construct it is intended to
assess. Therefore, ensuring the reliability of a test is a fundamental step in the validation process.
In summary, test validation involves establishing the validity and reliability of a test to ensure
that it accurately measures the intended construct or trait. Validity assesses whether the test
measures what it claims to measure, while reliability assesses the consistency and stability of test
scores.
Test interpretation is the process of analyzing and making sense of the results obtained from an
assessment or test. It involves translating raw scores or data into meaningful and useful
information. Effective test interpretation is essential for deriving insights, making informed
decisions, and drawing valid conclusions based on test outcomes. Here's a description of key
aspects of test interpretation:
Reporting test results is a crucial step in the assessment process, whether in education,
psychology, healthcare, or other fields. How test results are reported can significantly impact the
understanding and utility of the assessment.
Here are several ways of reporting test results, along with reasons for using each approach:
1. Numerical Scores:
Reason: Numerical scores provide a quantifiable and objective representation of an
individual's performance on a test. They are easy to compare and analyze statistically.
Use Cases: Numerical scores are commonly used in standardized assessments, such as
educational exams or psychological tests. They allow for straightforward comparisons
across individuals or groups.
2. Percentiles:
Reason: Percentiles compare an individual's performance to a reference group. They
offer a clear understanding of where an individual's score falls in relation to others.
Use Cases: Percentiles are valuable in educational and clinical assessments, as they help
educators, clinicians, and individuals interpret how their performance compares to a
larger population.
3. Standard Scores (Z-Scores or T-Scores):
Reason: Standard scores transform raw scores into a standardized scale with known
properties (e.g., z-scores with a mean of 0 and a standard deviation of 1, or T-scores with a
mean of 50 and a standard deviation of 10). This allows for easier comparison and interpretation.
Use Cases: Standard scores are often used in educational and psychological assessments,
providing a common metric for various tests.
4. Grade Equivalents:
Reason: Grade equivalents express a test score in terms of the grade level at which it is
typical. This can be useful for educators and parents in understanding a student's
performance relative to grade-level expectations.
Use Cases: Grade equivalents are frequently used in educational assessments,
particularly for young students.
5. Qualitative Descriptors: