
Reviewer – Test Measurement Midterms

● Reliability refers to the consistency of scores obtained by the same person when re-examined with the same test on different occasions, or with two different sets of equivalent items; it is a PREREQUISITE TO VALIDITY
● A greater number of items increases reliability because more of the domain will be covered
● Reliability is important in clinical judgment, because we don't diagnose with only one test
● The goal of reliability is to minimize error as much as possible

Estimation of Reliability

1. Test-Retest Reliability
• Test-retest reliability involves giving a single sample of persons the same assessment at two points in time. The scores obtained at each of these time points can then be correlated with one another, thereby providing an estimate of the reliability of the scores.
• There is no "one size fits all" answer to the question of how long the gap between administrations should be to give an accurate estimate of stability. It needs to be long enough to minimize memory effects, but not so long that the trait being measured changes.
• An acceptable time lag for use in test-retest will vary depending upon the nature of the trait being assessed and the extent to which it is temporally stable.
● Carry-over effect – the first take influences the second take of a test
● Practice effect – a type of carry-over effect; the score is higher due to experience in testing
● Test sophistication – familiarity with a particular test or type of test, which might affect the score attained
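To make the computation concrete, here is a minimal Python sketch (not from the reviewer; the scores are made up) that estimates test-retest reliability as the Pearson r between two administrations:

```python
# Minimal sketch: test-retest reliability as the Pearson correlation
# between scores from two administrations of the same test.
# The scores below are hypothetical illustration data.
from statistics import correlation  # Python 3.10+

time1 = [12, 15, 9, 20, 17, 11, 14]   # first administration
time2 = [13, 14, 10, 19, 18, 10, 15]  # same testtakers, retest

r_tt = correlation(time1, time2)      # Pearson r
print(f"test-retest reliability: r = {r_tt:.2f}")
```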
2. Parallel-Forms and Alternate-Forms Reliability
• Coefficient of equivalence – the degree of the relationship between various forms of a test
• Parallel forms – the observed scores must have the same means and variances
• Alternate forms – the tests are merely different versions (without the "sameness" of observed scores)
• Item sampling – a source of error
Obtaining estimates of alternate-forms reliability and parallel-forms reliability is similar in two ways to obtaining an estimate of test-retest reliability:
(1) Two test administrations with the same group are required
(2) Test scores may be affected by factors such as motivation, fatigue, or intervening events such as practice, learning, etc.
• Testtakers may do better or worse on a specific form of the test not as a function of their true ability but simply because of the particular items that were selected for inclusion in the test
• Developing alternate forms of tests can be time-consuming and expensive.
• Once an alternate or parallel form of a test has been developed, it is advantageous to the test user in several ways (it minimizes the effect of memory for the content of a previously administered form of the test).
3. Split-Half Reliability Estimates
• Split-half reliability is obtained by correlating two pairs of scores obtained from equivalent halves of a single test administered once.
• It is a useful measure of reliability when it is impractical or undesirable to assess reliability with two tests or to administer a test twice.
Step 1. Divide the test into equivalent halves.
Step 2. Calculate a Pearson r between scores on the two halves of the test.
Step 3. Adjust the half-test reliability using the Spearman-Brown formula.
Step 1 can be done in two ways:
(1) One way to split a test is to randomly assign items to one or the other half of the test.
(2) Another way is to assign odd-numbered items to one half of the test and even-numbered items to the other half.
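A minimal Python sketch of the three steps, using the odd-even split and made-up 1/0 item responses; the Spearman-Brown form used here, r_SB = 2r / (1 + r), is the standard double-length adjustment:

```python
# Minimal sketch: split-half reliability with an odd-even split and the
# Spearman-Brown adjustment. Rows = testtakers, columns = items (1/0);
# the response matrix is hypothetical.
from statistics import correlation  # Python 3.10+

responses = [
    [1, 1, 0, 1, 1, 0, 1, 1],
    [0, 1, 0, 0, 1, 0, 1, 0],
    [1, 1, 1, 1, 1, 1, 1, 1],
    [0, 0, 0, 1, 0, 0, 1, 0],
    [1, 0, 1, 1, 0, 1, 1, 1],
]

# Step 1: odd-numbered items in one half, even-numbered in the other.
odd_scores  = [sum(row[0::2]) for row in responses]
even_scores = [sum(row[1::2]) for row in responses]

# Step 2: Pearson r between the two half-test scores.
r_half = correlation(odd_scores, even_scores)

# Step 3: Spearman-Brown adjustment to full test length.
r_full = 2 * r_half / (1 + r_half)
print(f"half-test r = {r_half:.2f}, adjusted full-test r = {r_full:.2f}")
```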
4. Inter-Item Consistency – the correlation among all items on a scale
• A measure of inter-item consistency is calculated from a single administration of a single form of a test
• An index of inter-item consistency is useful in assessing the homogeneity of the test
• The more homogeneous a test is, the more inter-item consistency it has
• Homogeneous test – items measure a single trait
• Heterogeneous test – items measure more than one trait
Although a homogeneous test is desirable because it so readily lends itself to clear interpretation, it is often an insufficient tool for measuring multifaceted psychological variables such as intelligence or personality.
• KR-20 (Kuder-Richardson) – used to estimate the inter-item consistency of tests with dichotomous items of different levels of difficulty
• KR-21 (Kuder-Richardson) – used to estimate the inter-item consistency of tests with dichotomous items of the same level of difficulty
• Coefficient alpha – developed by Cronbach, for tests with non-dichotomous items

Reliability coefficient – describes the consistency of scores across contexts (different times, items, and raters)
Acceptable values: 0.70–0.90 for basic research; 0.90–0.95 for clinical settings
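As an illustration (not from the reviewer), a short Python sketch of KR-20 for dichotomous items; coefficient alpha has the same shape, with each item's variance in place of p*q:

```python
# Minimal sketch: KR-20 = (k/(k-1)) * (1 - sum(p*q) / var_total),
# where p is the proportion passing each item, q = 1 - p, and var_total
# is the variance of total scores. Data are hypothetical 1/0 responses.
responses = [
    [1, 1, 0, 1, 1, 0],
    [0, 1, 0, 0, 1, 0],
    [1, 1, 1, 1, 1, 1],
    [0, 0, 0, 1, 0, 0],
    [1, 0, 1, 1, 0, 1],
]
n, k = len(responses), len(responses[0])

totals = [sum(row) for row in responses]
mean_t = sum(totals) / n
var_total = sum((t - mean_t) ** 2 for t in totals) / n  # population variance

sum_pq = 0.0
for j in range(k):
    p = sum(row[j] for row in responses) / n  # proportion passing item j
    sum_pq += p * (1 - p)

kr20 = (k / (k - 1)) * (1 - sum_pq / var_total)
print(f"KR-20 = {kr20:.2f}")
```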

CONSIDERATIONS CONCERNING THE PURPOSE AND USE OF A RELIABILITY COEFFICIENT
• Homogeneity vs heterogeneity
The more homogeneous a test is, the higher its internal consistency; a heterogeneous test tends to yield lower internal-consistency estimates.
• Dynamic vs static characteristics
Dynamic – a trait, state, or ability presumed to be ever-changing as a function of situational and cognitive experiences
Static – either the test-retest or the alternate-forms method would be appropriate
• Restriction or inflation of range
The manual for such a test should contain not one reliability value but reliability values for testtakers at each grade level.
• Speed test vs power test
Speed test – a uniformly low level of difficulty but with a short time limit
Power test – items so difficult that no testtaker is able to obtain a perfect score even when the time limit is long enough
• Criterion-referenced tests
Scores on criterion-referenced tests tend to be interpreted in pass-fail terms.

More examples of homogeneous tests
1. Vocabulary Test: all the items in the test assess the participant's knowledge of words and their meanings.
2. Mathematics Proficiency Test: mathematical problems that assess a person's understanding of mathematical concepts and their ability to solve equations, perform calculations, and apply mathematical principles.
3. Reaction Time Test: measures the speed at which an individual responds to stimuli. Participants might be presented with visual or auditory cues and are required to react as quickly as possible.
4. Physical Fitness Test (Push-Ups): in a physical fitness test that assesses push-up ability, all the items would involve performing push-ups. The test measures an individual's upper-body strength and endurance.
5. Memory Recall Test: assesses an individual's ability to remember and recall information.

More examples of heterogeneous tests
1. Intelligence Test: traditional intelligence tests often include a variety of subtests that assess different cognitive abilities, such as verbal reasoning, mathematical ability, spatial reasoning, and memory.
2. Personality Inventory: some personality inventories assess multiple personality traits within a single test. For example, the Big Five personality traits (Openness, Conscientiousness, Extraversion, Agreeableness, Neuroticism) are often measured together in one assessment.
3. Job Skills Assessment (Multitasking): in job-related assessments, multitasking tests might evaluate an individual's ability to manage multiple tasks simultaneously. The test could include items where participants need to handle incoming emails and phone calls and prioritize tasks all at once.
4. Psychopathology Assessment: in psychological assessments, measures of psychopathology might include a range of symptoms related to different mental health disorders. The assessment would provide insights into various psychological conditions.

5. Inter-Scorer Reliability (also known as scorer, judge, observer, or inter-rater reliability)
• the degree of consistency between 2 or more scorers with regard to a particular measure
• Cohen's kappa – used with 2 raters
• Fleiss' kappa – used with more than 2 raters
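A minimal Python sketch of Cohen's kappa for two raters (illustrative only; the ratings are made up). Kappa corrects the observed agreement p_o for the agreement p_e expected by chance from each rater's marginal proportions:

```python
# Minimal sketch: Cohen's kappa = (p_o - p_e) / (1 - p_e) for two raters
# classifying the same cases; the ratings below are hypothetical.
from collections import Counter

rater1 = ["yes", "yes", "no", "yes", "no", "no", "yes", "no"]
rater2 = ["yes", "no",  "no", "yes", "no", "yes", "yes", "no"]
n = len(rater1)

p_o = sum(a == b for a, b in zip(rater1, rater2)) / n  # observed agreement

# chance agreement from each rater's marginal category proportions
c1, c2 = Counter(rater1), Counter(rater2)
p_e = sum((c1[cat] / n) * (c2[cat] / n) for cat in set(rater1) | set(rater2))

kappa = (p_o - p_e) / (1 - p_e)
print(f"Cohen's kappa = {kappa:.2f}")  # 1 = perfect agreement, 0 = chance
```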
VALIDITY
• Internal validity – norm/local group
• External validity – other groups

The Trinitarian model of validity
1. Scrutinizing the test content
2. Relating scores obtained to other test scores or other measures
3. Executing a comprehensive analysis of
a. how scores on the test relate to other test scores
b. how scores on the test can be understood within some theoretical framework for understanding the construct

TYPES OF VALIDITY
1. Face validity – what a test appears to measure to the person being tested, as opposed to what the test actually measures (a judgment of how relevant the test items appear to be)
2. Content validity – a judgment of how adequately a test samples behavior representative of the universe of behavior that the test was designed to sample
- covers as many domains as possible; involves a process of validation with experts
3. Criterion-related validity – manifested by concurrent validity (test scores correlate highly with established tests measuring the same or a similar construct)
Characteristics of a criterion:
● Relevant – applicable to the matter at hand
● Valid – for the purpose for which it is being used
● Uncontaminated
4. Construct validity ("umbrella" validity) – the test accurately measures the construct of interest.
Evidence of construct validity:
1. Homogeneity – how uniform a test is in measuring a single concept (e.g., an academic achievement test with mathematics, spelling, and reading comprehension subtests)
- Pearson r is used to correlate average subtest scores with the average total test score
2. Evidence of changes with age (the passage of time) – if a test score purports to be a measure of a construct that could be expected to change over time, then the test score, too, should show the same progressive changes with age to be considered a valid measure of the construct
3. Evidence of pretest-posttest changes – test scores change as a result of some experience between a pretest and a posttest
4. Evidence from distinct groups (contrasted groups) – if a test is a valid measure of a particular construct, then test scores from groups of people who would be presumed to differ with respect to that construct should show correspondingly different test scores
5. Convergent evidence – seen if scores on the test undergoing construct validation tend to correlate highly, in the predicted direction, with scores on older, more established, and already validated tests designed to measure the same (or a similar) construct
6. Discriminant evidence – seen when the validity coefficient shows little correlation between test scores and/or other variables with which scores on the test being construct-validated should not theoretically be correlated

ASSUMPTIONS ABOUT TESTING
1. Psychological traits and states exist
• Trait – "any distinguishable, relatively enduring way in which one individual varies from another"
• States – also distinguish one person from another but are relatively less enduring
• Psychological traits include those that relate to intelligence, specific intellectual abilities, cognitive style, adjustment, interests, attitudes, sexual orientation and preferences, psychopathology, personality in general, and specific personality traits
• A psychological trait exists only as a construct – an informed, scientific concept developed or constructed to describe or explain behavior
2. Psychological traits and states can be quantified and measured
• Test developers and researchers, much like people in general, have many different ways of looking at and defining the same phenomenon. Consider, for example, the different ways a term such as "aggressive" is used.
• The test score is presumed to represent the strength of the targeted ability, trait, or state and is frequently based on cumulative scoring: the higher the score on the test, the higher the testtaker is on the ability, trait, or other characteristic that the test purports to measure.
3. Test-related behavior predicts non-test-related behavior
• The objective of the test is to provide some indication of other aspects of the examinee's behavior.
• The tasks in some tests mimic the actual behaviors that the test user is attempting to understand. By their nature, however, such tests yield only a sample of the behavior that can be expected to be emitted under non-test conditions.
4. Tests and other measurement techniques have strengths and weaknesses
• Competent test users understand a great deal about the tests they use. They understand, among other things, how a test was developed, the circumstances under which it is appropriate to administer the test, how the test should be administered and to whom, and how the test results should be interpreted.
5. Various sources of error are part of the assessment process
• Error refers to a long-standing assumption that factors other than what a test attempts to measure will influence performance on the test.
• Test scores are always subject to questions about the degree to which the measurement process includes error.
6. Testing and assessment can be conducted in a fair and unbiased manner
• One source of fairness-related problems is the test user who attempts to use a particular test with people whose background and experience are different from the background and experience of the people for whom the test was intended.
• For example, a vocabulary test that includes words or concepts more familiar to individuals from a particular cultural or socioeconomic background puts others at a disadvantage.
7. Testing and assessment benefit society
• Considering the many critical decisions that are based on testing and assessment procedures, we can readily appreciate the need for tests, especially good tests that are reliable and valid.

STAGES IN TEST DEVELOPMENT

Test Conceptualization
Test conceptualization is a foundational step in test development. It involves defining the construct to be measured, guiding the creation of test items, ensuring content validity, and informing scoring and data analysis. A well-conceptualized test is more likely to provide accurate and meaningful assessments of the targeted construct.

1. What is the test designed to measure?
The answer is closely linked to how the test developer defines the construct being measured and how that definition is the same as or different from the definitions used by other tests purporting to measure the same construct.
2. What is the objective of the test?
In what way or ways is the objective of this test the same as or different from other tests with similar goals? What real-world behaviors would be anticipated to correlate with testtaker responses?
3. Is there a need for this test?
Are there any other tests purporting to measure the same thing? In what ways will the new test be better than or different from existing ones? Will there be more compelling evidence for its reliability or validity?
4. Who will use this test?
Clinicians? Educators? Others? For what purpose or purposes would this test be used?
5. Who will take this test?
Who is this test for? Who needs to take it? Who would find it desirable to take it? For what age range of testtakers is the test designed? What reading level is required of a testtaker? What cultural factors might affect testtaker response?
6. How will the test be administered?
Individually or in groups? Is it amenable to both group and individual administration? What differences will exist between individual and group administrations of this test? Will the test be designed for or amenable to computer administration? How might differences between versions of the test be reflected in test scores?
7. What is the ideal format of the test?
Should it be true-false, essay, multiple-choice, or in some other format? Why is the format selected for this test the best format?
8. Who benefits from an administration of this test?
What would the testtaker learn, or how might the testtaker benefit, from an administration of this test? What would the test user learn, or how might the test user benefit? What social benefit, if any, derives from an administration of this test?
9. Is there any potential for harm as the result of an administration of this test?
What safeguards are built into the recommended testing procedure to prevent any sort of harm to any of the parties involved in the use of this test?

Test Construction
Scaling – the setting of rules for assigning numbers in measurement. In testing, the purpose of scaling is to convert raw data, which might be in the form of test scores, ratings, or responses, into a standardized format that allows for meaningful comparisons and interpretations.

Properties of scales
1. Magnitude – the "moreness" of a scale; a higher number means "more" (the ability to know whether one score is greater than, equal to, or less than another)
2. Equal intervals – the difference between adjacent scale points is the same everywhere on the scale (e.g., the difference between 0.1 and 0.2 has the same meaning as the difference between 0.2 and 0.3)
3. Absolute zero – the scale has a zero point that represents the absence of the property being measured (e.g., no feelings of guilt, no obsessive thoughts)

Types of scales
(1) Age-based scale – used if the testtaker's test performance as a function of age is of critical interest
(2) Grade-based scale – used if the testtaker's test performance as a function of grade is of critical interest
(3) Stanine scale – used if all raw scores on the test are to be transformed into scores that can range from 1 to 9 (a sketch follows this list)
(4) Rating scale – a grouping of words, statements, or symbols on which judgments of the strength of a particular trait, attitude, or emotion are indicated by the testtaker. It can be used to record judgments of oneself, others, experiences, or objects, and it can take several forms.
(5) Likert scale – used extensively in psychology, usually to scale attitudes; Likert scales are relatively easy to construct
(6) Summative scale – used when the final test score is obtained by summing the ratings across all the items
(7) The many faces of rating scales – rating scales that use smiley faces; these have been used in social-psychological research with young children and adults with limited language skills
(8) Comparative scaling – an ordinal or rank-order scale that can also be referred to as a non-metric scale. Respondents evaluate two or more objects at one time, and the objects are directly compared with one another as part of the measuring process. For example, testtakers might be asked to sort cards from the most justifiable to the least justifiable actions.
(9) Method of paired comparisons – testtakers are presented with pairs of stimuli (two photographs, two objects, two statements), which they are asked to compare. They must select one of the stimuli according to some rule.
(10) Guttman scale – also uses ordinal-level measures. Items range sequentially from weaker to stronger expressions of the attitude, belief, or feeling being measured. A feature of Guttman scales is that all respondents who agree with the stronger statements of the attitude will also agree with the milder statements.
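As a small illustration of one of these transformations (not from the reviewer), a Python sketch of a stanine conversion; it assumes the common convention that stanines rescale z-scores to a mean of 5 and a standard deviation of 2 before rounding and clipping to the 1–9 range:

```python
# Minimal sketch: converting raw scores to stanines (1-9). Assumes the
# common mean-5, SD-2 convention; the raw scores are hypothetical.
from statistics import mean, pstdev

raw = [12, 15, 9, 20, 17, 11, 14, 18, 10, 16]
m, sd = mean(raw), pstdev(raw)

stanines = [min(9, max(1, round(5 + 2 * (x - m) / sd))) for x in raw]
print(list(zip(raw, stanines)))
```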
Writing Items
In the grand scheme of test construction, considerations related to the actual writing of the test's items go hand in hand with scaling considerations.
Item pool – the reservoir or well from which items will or will not be drawn for the final version of the test
Item format – includes variables such as the form, plan, structure, arrangement, and layout of individual test items

Scoring Items
Many different test scoring models have been devised. Perhaps the most commonly used model, owing in part to its simplicity and logic, is the cumulative model.

Test Tryout
The test tryout should be executed under conditions as identical as possible to the conditions under which the standardized test will be administered; everything, from the instructions and the time limits allotted for completing the test to the atmosphere at the test site, should be as similar as possible.

Item Analysis
A good test item is one that is answered correctly by high scorers on the test as a whole. An item that is answered incorrectly by high scorers on the test as a whole is probably not a good item.
Among the tools test developers might employ to analyze and select items are:
(1) an index of the item's difficulty
(2) an index of the item's reliability
(3) an index of the item's validity
(4) an index of item discrimination

The Item-Discrimination Index


•Measures of item discrimination indicate how adequately
an item separates or discriminates between high scorers
and low scorers on an entire test.
• In this context, a multiple-choice item on an achievement test is a good item if most of the high scorers answer correctly and most of the low scorers answer incorrectly.
STEPS:
• Rank the students according to TOTAL SCORE.
• Select the top 27 percent and the lowest 27 percent.
• For EACH ITEM, calculate the percentage of students in the upper and lower groups answering correctly.
• SOLVE: (Upper Group Percent Correct) – (Lower Group Percent Correct)
The maximum item discrimination is 100%. This would occur if all those in the upper group answered correctly and all those in the lower group answered incorrectly.
Zero discrimination occurs when equal numbers in both groups answer correctly.
Negative discrimination, which is highly undesirable, occurs when more students in the lower group than in the upper group answer correctly.
0%–24%: usually unacceptable – might be improved by revision
25%–39%: good item
40%–100%: excellent item
FOR EXAMPLE: 80% of the upper group and 30% of the lower group answered item #1 correctly. The item discrimination is 50%, which by the scale above makes #1 an excellent item.
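A minimal Python sketch of these steps (illustrative; assumes a 1/0 response matrix like the earlier ones):

```python
# Minimal sketch: item-discrimination index via the upper/lower 27%
# groups. responses: one row of 1/0 answers per student; item: column index.
def discrimination_index(responses, item):
    ranked = sorted(responses, key=sum, reverse=True)  # rank by TOTAL SCORE
    cut = max(1, round(0.27 * len(ranked)))            # size of each 27% group
    upper, lower = ranked[:cut], ranked[-cut:]
    pct_upper = 100 * sum(row[item] for row in upper) / cut
    pct_lower = 100 * sum(row[item] for row in lower) / cut
    return pct_upper - pct_lower                       # e.g., 80 - 30 = 50

demo = [
    [1, 1, 1, 1], [1, 1, 0, 1], [1, 0, 1, 0],
    [0, 1, 0, 0], [0, 0, 0, 1], [0, 0, 0, 0],
]
print(discrimination_index(demo, 0))  # discrimination of the first item
```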
The Item-Difficulty Index
Suppose every examinee answered item 1 of the test correctly. Can we say that item 1 is a good item? What if no one answered item 1 correctly? In either case, item 1 is not a good item. If everyone gets the item right, the item is too easy; if everyone gets the item wrong, the item is too difficult.
Just as the test as a whole is designed to provide an index of degree of knowledge about (for example) history, so each individual item on the test should be passed (scored as correct) or failed (scored as incorrect) on the basis of testtakers' differential knowledge of history.
In a true-false item, the probability of guessing correctly on the basis of chance alone is 1/2, or .50. Therefore, the optimal item difficulty is halfway between .50 and 1.00, or .75. In general, the midpoint representing the optimal item difficulty is obtained by summing the chance success proportion and 1.00 and then dividing the sum by 2:
(.50 + 1.00) / 2 = .75
For a five-option multiple-choice item, the probability of guessing correctly on any one item on the basis of chance alone is 1/5, or .20. The optimal item difficulty is therefore .60:
(.20 + 1.00) / 2 = .60
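The same arithmetic as a tiny Python sketch (illustrative):

```python
# Minimal sketch: item difficulty = proportion answering correctly;
# optimal difficulty = midpoint between chance success and 1.00.
def item_difficulty(responses, item):
    return sum(row[item] for row in responses) / len(responses)

def optimal_difficulty(chance_success):
    return (chance_success + 1.00) / 2

print(optimal_difficulty(0.50))   # true-false item -> 0.75
print(optimal_difficulty(1 / 5))  # five-option multiple choice -> 0.60
```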
Test Revision
Test developers may find that they must balance various strengths and weaknesses across items. For example, if many otherwise good items tend to be somewhat easy, the test developer may purposefully include some more difficult items even if they have other problems.
As revision proceeds, the advantage of writing a large item pool becomes more and more apparent. Poor items can be eliminated in favor of those that were shown on the test tryout to be good items.
