
ESTABLISH TEST

VALIDITY AND
RELIABILITY
QUALITIES OF A GOOD TEST:

1.VALIDITY
2.RELIABILITY
WHAT IS TEST VALIDITY?
A measure is valid when it
measures what it is supposed
to measure.
• If the quarterly exam is valid, then its contents should directly measure the
objectives of the curriculum.

• If a scale that measures personality is composed of five factors, then the
items within each of the five factors should be highly correlated.

• If an entrance exam is valid, it should predict students’ grades after the
first semester.
TYPES OF VALIDITY
1. CONTENT VALIDITY
2. FACE VALIDITY
3. PREDICTIVE VALIDITY
4. CONSTRUCT VALIDITY
5. CONCURRENT VALIDITY
6. CONVERGENT VALIDITY
7. DIVERGENT VALIDITY
CONTENT VALIDITY
When the items represent the domain being
measured.

Procedure:
The items are compared with the objectives of
the program. The items need to measure
directly the objectives (for achievement) or
definition (for scales). A reviewer conducts the
checking.
FACE VALIDITY
When the test is presented well, free from
errors, and administered well.

Procedure:
The test items and layout are reviewed and
tried out on a small group of respondents. A
manual for administration can be made as a
guide for the test administrator.
PREDICTIVE
VALIDITY
A measure should predict a
future criterion. An example is
an entrance exam predicting the
grades of the students after the
first semester.
CONSTRUCT VALIDITY
The components or factors of the test should
contain items that are strongly correlated.

• A good test must have strong construct


validity, meaning that it accurately measures
the underlying theoretical construct it is
intended to measure. This ensures that the
test is capturing the concept it claims to
assess.
CONCURRENT VALIDITY
When two or more measures are present for
each examinee that measure the same
characteristics.

Procedure:
To determine concurrent validity, you have to
measure the correlation of results from an
existing test and a new test and demonstrate
that the two give similar results.
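
The correlation step can be sketched in a few lines of Python. This is a minimal illustration, not part of the original slides: the score lists are hypothetical, and scipy is assumed to be available.

```python
# Minimal sketch of checking concurrent validity: correlate scores from an
# existing, established test with scores from a new test taken by the same
# examinees. The score lists below are hypothetical.
from scipy.stats import pearsonr

existing_test = [85, 78, 92, 70, 88, 75, 90, 82, 68, 80]  # established measure
new_test      = [82, 75, 95, 68, 85, 73, 91, 80, 70, 78]  # new measure

r, p_value = pearsonr(existing_test, new_test)
print(f"r = {r:.2f}, p = {p_value:.3f}")
# A strong, significant positive correlation suggests the two tests measure
# the same characteristic, supporting concurrent validity.
```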
CONVERGENT VALIDITY
When the components or factors of a test are
hypothesized to have a positive correlation.
• For a test to be considered of high quality, it
should demonstrate convergent validity by
showing strong correlations with other
measures that assess the same or related
constructs. This helps establish the test's
credibility and accuracy.
DIVERGENT VALIDITY
When the components or factors of a test
are hypothesized to have a negative
correlation.
• Divergent validity is crucial for a good test, as it confirms that
the test is distinct from measures of
unrelated constructs. A high-quality test
should show low correlations with measures
of constructs that are theoretically
unrelated.
Group activity: Analyze the given cases of each type of validity. Then answer the
questions that follow each case.

1. Content validity

A coordinator in science is checking the science test paper for grade 4. She asked the
grade 4 science teacher to submit the table of specifications containing the objectives of
the lesson and corresponding items. The coordinator checked whether each item is
aligned with the objectives.

• How are the objectives used when creating test items?


• How is content validity determined when given the objectives and the items in a test?
• What should be presented in a test table of specifications when determining content
validity?
• Who checks the content validity of items?
2. Face validity

The assistant principal browsed the test paper made by the math
teacher. She checked if the contents of the items are about
mathematics. She examined if instructions are clear. She browsed
through the items if the grammar is correct and if the vocabulary is
within the students’ level of understanding.

What can be done in order to ensure that the assessment appears


to be effective?
What practices are done in conducting face validity?
Why is face validity the weakest form of validity?
3. Predictive validity

The school admissions office developed an examination. The officials wanted to
determine if the results of the entrance examination are accurate in identifying good
students. They took the grades of the students accepted for the first quarter. They
correlated the entrance exam results and the first-quarter grades. They found significant
and positive correlations between the entrance examination scores and grades. The
entrance examination results predicted the grades of students after the first quarter. Thus,
the test has predictive validity.

• Why are two measures needed in predictive validity?


• What is the assumed connection between these two measures?
• How can we determine if a measure has predictive validity?
• What statistical analysis is done to determine predictive validity?
• How are the test results of predictive validity interpreted?
4. Concurrent validity

A school guidance counselor administered a math achievement test to


grade 6 students. She also has a copy of the students’ grades in math.
She wanted to verify if the math grades of the students are measuring
the same competencies as the math achievement test. The school
counselor correlated the math achievement scores and math grades to
determine if they are measuring the same competencies.

• What needs to be available when conducting concurrent validity?


• At least how many tests are needed for conducting concurrent
validity?
• What statistical analysis can be used to establish concurrent
validity?
• How are the results of a correlation coefficient interpreted for
concurrent validity?
5. Construct validity

A science test was made by a grade 10 teacher composed of four domains:


matter, living things, force and motion, and earth and space. There are 10
items under each domain. The teacher wanted to determine if the ten items
made under each domain really belonged to that domain. The teacher
consulted an expert in test measurement. They conducted a procedure called
factor analysis. Factor analysis is a statistical procedure done to determine if
the items written load under the domain to which they belong.

What type of test requires construct validity?
What should the test have in order to verify its constructs?
What are constructs and factors in a test?
How are these factors verified if they are appropriate for the test?
What results come out in construct validity?
How are the results of construct validity interpreted?
Construct validity continues…

The construct validity of a measure is reported in journal articles. The following
are guide questions used when searching for the construct validity of a measure
from reports:
• What was the purpose of the construct validity?
• What type of test was used?
• What are the dimensions or factors that were studied using construct validity?
• What procedures were used to establish the construct validity?
• What statistics were used for the construct validity?
• What were the results of the test’s construct validity?
6. Convergent validity

A math teacher developed a test to be administered at the end of the school
year, which measures number sense, patterns and algebra, measurement,
geometry, and statistics. The math teacher assumes that students’ competency
in number sense improves their capacity to learn patterns and algebra and the
other concepts. After administering the test, the scores were separated for each
area, and the five domains were intercorrelated using Pearson r. The positive
correlation between number sense and patterns and algebra indicates that when
number sense scores increase, patterns and algebra scores also increase. This
suggests that students’ learning of number sense scaffolds their patterns and
algebra competencies.

• What should the test have in order to conduct convergent validity?
• What is done with the domains in the test for convergent validity?
• What analysis is used to determine convergent validity?
• How are the results of convergent validity interpreted?
7. Divergent validity

An English teacher taught a metacognitive awareness strategy for comprehending a
paragraph to grade 11 students. She wanted to determine if the performance
of her students in reading comprehension would reflect well in the reading
comprehension test. She administered the same reading comprehension test
to another class which was not taught the metacognitive awareness strategy.
She compared the results using the t-test for independent samples and found
that the class that was taught the metacognitive awareness strategy
performed significantly better than the other group. The test has divergent
validity.

• What conditions are needed to conduct divergent validity?
• What assumption is being proved in divergent validity?
• What statistical analysis can be used to establish divergent validity?
• How are the results of divergent validity interpreted?
WHAT IS TEST
RELIABILITY?
Test reliability is the consistency of
responses to a measure under the
following conditions:
1. When retested on the same
person;
2. When retested on the same
measure;
3. Similarity of responses across
items that measure the same
characteristic.
There are different factors that affect the reliability of a measure.
The reliability of the measure can be high or low, depending on the
following factors:

1. The number of items in a test - The more items a test has, the
higher the likelihood of reliability. The probability of obtaining
consistent scores is high because of the large pool of items.
2. Individual differences of participants - Every participant
possesses characteristics that affect their performance in a test,
such as fatigue, concentration, innate ability, perseverance, and
motivation. These individual factors change over time and affect the
consistency of the answers in a test.
3. External environment - The external environment may include
room temperature, noise level, depth of instruction, exposure to
materials, and quality of instruction, which could affect changes in
the responses of examinees in a test.
What are the different ways to
establish test reliability?
There are different ways in
determining the reliability of a test.
The specific kind of reliability will
depend on the
(1) variable you are measuring,
(2) type of test, and
(3) versions of the test.
Methods of Testing Reliability

1. Test-retest
2. Parallel Forms
3. Internal Consistency
4. Inter-rater Reliability
1. Test-retest
The test is administered at one time to a group of examinees
and then administered again at another time to the
“same group” of examinees.

This type of analysis is of value when we measure “traits”


or characteristics that do not change over time. Tests that
measure some constantly changing characteristic are not
appropriate for test-retest evaluation.
• The time interval is not more than 6
months between the first and second
administration of tests that measure stable
characteristics.
Coefficient of stability- the estimate of test-retest
reliability. Obtained using Pearson-r.

Correlation between administrations 1 and 2 should


be .70 or higher.
CORRELATION COEFFICIENT CORRELATION STRENGTH
.70 to 1.00 VERY STRONG
.50 to .70 STRONG
.30 to .50 MODERATE
0 to .30 WEAK
0 NONE
Pearson’s r describes the linear relationship
between two quantitative variables.
Pearson’s r in terms of statistical analysis:

Formula:
r = [nΣXY − (ΣX)(ΣY)] / √{[nΣX² − (ΣX)²][nΣY² − (ΣY)²]}

n – number of respondents
X – scores on the 1st test
Y – scores on the 2nd test
XY – products of the 1st and 2nd test scores

Respondent (N)   X    Y    X²     Y²     XY
1                50   51   2500   2601   2550
2                43   42   1849   1764   1806
3                48   48   2304   2304   2304
4                45   44   2025   1936   1980
5                40   41   1600   1681   1640
6                47   47   2209   2209   2209
7                52   51   2704   2601   2652
8                39   38   1521   1444   1482
9                44   43   1936   1849   1892
10               43   41   1849   1681   1763
TOTAL            ΣX = 451   ΣY = 446   ΣX² = 20497   ΣY² = 20070   ΣXY = 20278
r = [10(20278) − (451)(446)] / √{[10(20497) − 451²][10(20070) − 446²]} ≈ 0.98

INTERPRETATION:
.70 to 1.00   VERY STRONG
.50 to .70    STRONG
.30 to .50    MODERATE
0 to .30      WEAK
0             NONE

With r ≈ 0.98, the correlation between the two administrations is very strong, so the test-retest reliability is high.
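
The same computation can be checked with a short Python sketch (standard library only); the X and Y lists are the ten test-retest scores from the table above.

```python
# Sketch of the raw-score Pearson r computation for the test-retest data above
# (X = first administration, Y = second administration).
from math import sqrt

X = [50, 43, 48, 45, 40, 47, 52, 39, 44, 43]
Y = [51, 42, 48, 44, 41, 47, 51, 38, 43, 41]
n = len(X)

sum_x, sum_y = sum(X), sum(Y)                              # 451, 446
sum_x2, sum_y2 = sum(x*x for x in X), sum(y*y for y in Y)  # 20497, 20070
sum_xy = sum(x*y for x, y in zip(X, Y))                    # 20278

r = (n*sum_xy - sum_x*sum_y) / sqrt((n*sum_x2 - sum_x**2) * (n*sum_y2 - sum_y**2))
print(round(r, 2))  # about 0.98 -> very strong, so test-retest reliability is high
```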
Exercise:
Complete the table and find the reliability of the two tests.

X    Y    X²   Y²   XY
10   20
9    15
6    12
10   18
12   19
4    8
5    7
7    10
16   17
8    13

FIND:
ΣX:
ΣY:
ΣX²:
ΣY²:
ΣXY:

Interpretation:
0.70 - 1.00   very strong
0.50 - 0.70   strong
0.30 - 0.50   moderate
0 - 0.30      weak
0             none
2. Parallel Forms
• Compares two equivalent forms of a test that
measure the same attribute.

• Coefficient of Equivalence- the correlation


between the scores obtained on the two forms
represents the reliability coefficient of the test.

• Two ways to administer: Immediate or Delayed


Parallel Forms method
• Same no. of items
• Same format
• Same coverage
• Same difficulty
• Same instructions, time limits,
illustrative examples
Statistical measure:
Pearson’s r
3. Internal Consistency
• Used when tests are administered once.
• This model of reliability measures the internal
consistency of the test which is the degree to which
each test item measures the same construct.
• It is the intercorrelations among the items.
• If all items on a test measure the same construct,
then it has a good internal consistency.
A.Split-Half
• In split-half reliability, a test is given and divided
into halves that are scored separately.

• The results of one half of the test are then


compared with the results of the other.

• The two halves of the test can be created in a


variety of ways: odd-even, random.
• Each examinee will have two scores
coming from the same test. The scores on
each set should be close or consistent.

• Split-half is applicable when the test has a


large number of items.
Statistical measures:
Pearson’s r
Spearman-Brown Formula
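
A minimal Python sketch of the split-half procedure follows; the item scores are hypothetical, and scipy is assumed to be available for the correlation.

```python
# Split-half sketch: correlate odd-numbered and even-numbered half-test scores,
# then step the correlation up to full-test length with Spearman-Brown.
from scipy.stats import pearsonr

# rows = examinees, columns = item scores (1 = correct, 0 = wrong); made-up data
items = [
    [1, 1, 0, 1, 1, 0, 1, 1],
    [1, 0, 0, 1, 0, 0, 1, 0],
    [1, 1, 1, 1, 1, 1, 1, 1],
    [0, 0, 1, 0, 0, 1, 0, 0],
    [1, 1, 0, 1, 1, 1, 1, 0],
]

odd_half  = [sum(row[0::2]) for row in items]   # items 1, 3, 5, 7
even_half = [sum(row[1::2]) for row in items]   # items 2, 4, 6, 8

r_half, _ = pearsonr(odd_half, even_half)
r_full = (2 * r_half) / (1 + r_half)            # Spearman-Brown correction
print(f"half-test r = {r_half:.2f}, full-test reliability = {r_full:.2f}")
```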
B. Cronbach’s Alpha
• Used in tests with no right or wrong
answers.

• Average of all split-halves.

• Disadvantage: affected by the number


of items.
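
A minimal sketch of the Cronbach's alpha computation, using hypothetical Likert-type ratings and assuming numpy is available:

```python
# Cronbach's alpha sketch:
# alpha = k/(k-1) * (1 - sum of item variances / variance of total scores)
import numpy as np

# rows = respondents, columns = items on a Likert-type scale (hypothetical)
scores = np.array([
    [4, 5, 4, 4],
    [3, 3, 2, 3],
    [5, 5, 4, 5],
    [2, 2, 3, 2],
    [4, 4, 4, 5],
])

k = scores.shape[1]
item_variances = scores.var(axis=0, ddof=1)      # variance of each item
total_variance = scores.sum(axis=1).var(ddof=1)  # variance of total scores
alpha = (k / (k - 1)) * (1 - item_variances.sum() / total_variance)
print(round(alpha, 2))
```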
C. KR20

• Kuder-Richardson 20

• The formula for calculating the


reliability of a test in which the items
are dichotomous, scored 0 or 1
(usually for right or wrong).
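
A minimal sketch of the KR-20 computation on hypothetical right/wrong (1/0) responses, assuming numpy is available:

```python
# KR-20 sketch for dichotomous items:
# KR20 = k/(k-1) * (1 - sum(p*q) / variance of total scores)
import numpy as np

# rows = examinees, columns = items scored 1 (right) or 0 (wrong); made-up data
responses = np.array([
    [1, 1, 1, 0, 1],
    [1, 0, 1, 0, 0],
    [1, 1, 1, 1, 1],
    [0, 0, 1, 0, 0],
    [1, 1, 0, 1, 1],
])

k = responses.shape[1]
p = responses.mean(axis=0)                 # proportion answering each item right
q = 1 - p
total_variance = responses.sum(axis=1).var(ddof=1)
kr20 = (k / (k - 1)) * (1 - (p * q).sum() / total_variance)
print(round(kr20, 2))
```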
4. Inter-rater Reliability
• The degree of agreement or consistency
between two or more scorers (or judges or
raters) with regard to a particular measure.

• Sufficient training is needed.

• Inter-scorer reliability, judge reliability, observer


reliability, and inter-rater reliability.
Kendall's W (also known as Kendall's coefficient of concordance)

• A method used for assessing the agreement among raters.

• Values range from 0 (no agreement) to 1 (perfect agreement).
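
A minimal sketch of Kendall's W for hypothetical ratings from three raters, assuming numpy and scipy are available; ties are ignored for simplicity:

```python
# Kendall's W sketch: W = 12S / (m^2 (n^3 - n)), where S is the sum of squared
# deviations of the rank sums. Ratings below are hypothetical.
import numpy as np
from scipy.stats import rankdata

# rows = raters, columns = examinees (e.g., essay scores from 3 raters)
ratings = np.array([
    [90, 85, 70, 60, 75],
    [88, 80, 72, 65, 70],
    [92, 84, 68, 62, 74],
])

m, n = ratings.shape                               # m raters, n subjects
ranks = np.apply_along_axis(rankdata, 1, ratings)  # rank each rater's scores
rank_sums = ranks.sum(axis=0)
S = ((rank_sums - rank_sums.mean()) ** 2).sum()
W = 12 * S / (m**2 * (n**3 - n))                   # no tie correction applied
print(round(W, 2))                                 # 0 = no agreement, 1 = perfect
```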


METHOD: No. of Forms / No. of Sessions / Type of Reliability Measure / Statistical Measure

Test-Retest: 1 form / 2 sessions / Measure of Stability / Pearson r
Parallel Forms-Immediate: 2 forms / 1 session / Measure of Equivalence / Pearson r
Parallel Forms-Delayed: 2 forms / 2 sessions / Measure of Equivalence / Pearson r
Split-Half: 1 form / 1 session / Measure of Internal Consistency / Pearson r and Spearman-Brown Formula
Cronbach’s Alpha: 1 form / 1 session / Measure of Internal Consistency / Cronbach’s Alpha
Kuder-Richardson: 1 form / 1 session / Measure of Internal Consistency / Kuder-Richardson Formula 20
Inter-rater: 1 form / 1 session / Measure of consistency and agreement between two or more raters / Kendall's W
Improving
the Test
Items
A. Item Analysis
B. Difficulty Index
C. Discrimination Index
D. Table of Non-plausibility of
Distracters
Item Analysis

Item analysis is a process which examines


student responses to individual test items
(questions) to assess the quality of those items
and of the test as a whole.
Item Difficulty
• Refers to the percentage of learners who
answered an item correctly.

• The larger the percentage of getting an


item right, the easier the item. The higher
the difficulty index, the easier the item is
understood (Wood, 1960)
Interpretation:
DIFFICULTY INDEX REMARK

0.91 and above Very Easy

0.76 to 0.90 Easy

0.26 to 0.75 Moderate or average

0.11 to 0.25 Difficult

Below 0.10 Very Difficult


Discrimination Index
Refers to the power of the item to
discriminate between students who
scored high and those who scored low
on the overall test.
Interpretation:
DISCRIMINATION INDEX REMARK

0.40 and above Very good item


0.30 to 0.39 Good Item
0.20 to 0.29 Reasonably good item
0.10 to 0.19 Marginal item
Below 0.10 Poor item
Types of Discrimination Index

POSITIVE DISCRIMINATION
-happens when more students in the upper group got the item
correctly than those students in the lower group.

NEGATIVE DISCRIMINATION
-happens when more students in the lower group got the item correctly
than those students in the upper group.

ZERO DISCRIMINATION
-happens when the number of students in the upper and lower groups
who answered the item correctly is equal.
Table of Non-plausibility of Distracters

is a tool used in educational assessment to


evaluate the effectiveness of multiple-choice test
items.

• The table lists the distractors (incorrect answer


choices) for each test item along with a rating of
their plausibility.
Item distractor
A good item distractor should be:
1. A plausible (reasonable) option, but not the correct one.
2. Chosen mostly by low-performing students (i.e., it shows
negative discrimination).

• An effective distractor draws students away from the
correct answer.
• An effective distractor will attract more low-performing
students than high-performing students, as illustrated in the
tally sketch below.
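
A hypothetical tally sketch in Python (not from the original slides) shows how option choices can be compared across the upper and lower groups when judging distractor plausibility:

```python
# Distractor tally for one multiple-choice item (correct answer assumed "B"):
# count choices separately for the upper and lower groups. An effective
# distractor draws more lower-group than upper-group students; a distractor
# chosen by no one is implausible and should be revised. Choices are made up.
from collections import Counter

upper_choices = ["B", "B", "B", "A", "B", "B", "B", "C", "B", "B"]
lower_choices = ["A", "B", "C", "A", "D", "B", "A", "C", "D", "A"]

upper_tally = Counter(upper_choices)
lower_tally = Counter(lower_choices)
for option in ["A", "B", "C", "D"]:
    print(option, "upper:", upper_tally.get(option, 0),
          "lower:", lower_tally.get(option, 0))
```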
STEPS IN ITEM ANALYSIS:

1. Score the test.
2. Arrange the test papers from highest to lowest score.
3. Separate the top 27% and the bottom 27% of the papers.
Ex. The following scores are arranged from highest to lowest.
No. of students: 50   No. of test items: 50
UPPER GROUP: Upper 27% = 13.5, or 14 students   LOWER GROUP: Lower 27% = 13.5, or 14 students

1. 50     11. 43    21. 35    31. 29    41. 20
2. 50     12. 42    22. 34    32. 27    42. 15
3. 49     13. 41    23. 34    33. 27    43. 15
4. 48     14. 40    24. 32    34. 27    44. 15
5. 48     15. 38    25. 32    35. 25    45. 14
6. 46     16. 38    26. 30    36. 25    46. 12
7. 45     17. 37    27. 30    37. 23    47. 12
8. 44     18. 35    28. 29    38. 22    48. 11
9. 43     19. 35    29. 29    39. 20    49. 10
10. 43    20. 35    30. 29    40. 20    50. 6
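
A short Python sketch of steps 2 and 3, using the 50 scores listed above:

```python
# Sort the scores from highest to lowest, then take the top 27% and the
# bottom 27% (27% of 50 = 13.5, rounded up to 14 students per group).
import math

scores = [50, 50, 49, 48, 48, 46, 45, 44, 43, 43, 43, 42, 41, 40, 38, 38, 37,
          35, 35, 35, 35, 34, 34, 32, 32, 30, 30, 29, 29, 29, 29, 27, 27, 27,
          25, 25, 23, 22, 20, 20, 20, 15, 15, 15, 14, 12, 12, 11, 10, 6]

scores.sort(reverse=True)                   # step 2: highest to lowest
group_size = math.ceil(0.27 * len(scores))  # 13.5 -> 14
upper_group = scores[:group_size]
lower_group = scores[-group_size:]
print(len(upper_group), len(lower_group))   # 14 14
```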


4. Prepare a tally sheet.

Item Number   Upper Group=14   Lower Group=14
1             14               14
2             13               11
3             14               11
4             14               7
5. Convert the frequencies to proportions.
Formula: P = no. of correct answers / n
Example: P = 14/14, or P = 1

Item Number   Upper Group=14 (F, P)   Lower Group=14 (F, P)
1             14, 1                   14, 1
2             13, 0.93                11, 0.79
3             14, 1                   11, 0.79
4             14, 1                   7, 0.5
6. Compute the Difficulty Index
Formula:
Item Difficulty = (pH + pL) / 2
where pH and pL are the proportions of the upper and lower groups who answered the item correctly.

DIFFICULTY INDEX     REMARK
0.91 and above       Very Easy
0.76 to 0.90         Easy
0.26 to 0.75         Moderate or average
0.11 to 0.25         Difficult
Below 0.10           Very Difficult
Item Difficulty
• Refers to the percentage of learners who
answered an item correctly.

• The larger the percentage of getting an


item right, the easier the item. The higher
the difficulty index, the easier the item is
understood (Wood, 1960)
Item Difficulty = (pH + pL) / 2

Item Number   Upper Group=14 (F, P)   Lower Group=14 (F, P)   Difficulty Index
1             14, 1                   14, 1                   1
2             13, 0.93                11, 0.79                0.86
3             14, 1                   11, 0.79                0.90
DIFFICULTY INDEX REMARK
0.91 above Very Easy
0.76 to 0.90 Easy
0.26 to 0.75 Moderate or average
0.11 to 0.25 Difficult
0.10 below Very Difficult
Item No.   Upper Group=14 (F, P)   Lower Group=14 (F, P)   Difficulty Index   REMARKS
1          14, 1                   14, 1                   1                  VERY EASY
2          13, 0.93                11, 0.79                0.86               EASY
3          14, 1                   11, 0.79                0.90               EASY
7. Compute the Discrimination Index

Formula:
Item Discrimination = pH – pL
DISCRIMINATION REMARK
INDEX
0.40 and above Very good item
0.30 to 0.39 Good Item
0.20 to 0.29 Reasonably good item
0.10 to 0.19 Marginal item
Below 0.10 Poor item
Discrimination Index
Refers to the power of the item to
discriminate between students who
scored high and those who scored low
on the overall test.
Formula:
Item Discrimination = pH – pL
Item Number   Upper Group=14 (F, P)   Lower Group=14 (F, P)   Discrimination Index
1             14, 1                   14, 1                   0
2             13, 0.93                11, 0.79                0.14
3             14, 1                   11, 0.79                0.21
4             14, 1                   7, 0.5                  0.5
DISCRIMINATION INDEX REMARK

0.41 and above High


0.20 to 0.40 Moderate
Below 0.19 Low

Item No.   Upper Group=14 (F, P)   Lower Group=14 (F, P)   Discrimination Index   REMARKS
1          14, 1                   14, 1                   0                      LOW
2          13, 0.93                11, 0.79                0.14                   LOW
3          14, 1                   11, 0.79                0.21                   MODERATE
4          14, 1                   7, 0.5                  0.5                    HIGH
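
The difficulty and discrimination computations for the four tally-sheet items can be sketched in Python as follows (the counts are the ones shown above; rounding may differ slightly from the slide values):

```python
# Steps 5-7 in code: convert the upper/lower-group counts to proportions, then
# compute the difficulty index (pH + pL) / 2 and the discrimination index pH - pL.
group_size = 14
upper_correct = [14, 13, 14, 14]   # items 1-4, upper group
lower_correct = [14, 11, 11, 7]    # items 1-4, lower group

for item, (u, l) in enumerate(zip(upper_correct, lower_correct), start=1):
    p_high = u / group_size
    p_low = l / group_size
    difficulty = (p_high + p_low) / 2
    discrimination = p_high - p_low
    print(f"Item {item}: difficulty = {difficulty:.2f}, "
          f"discrimination = {discrimination:.2f}")
```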
Negative
Discrimination

This happens when more students in


the lower group than the upper group
select the right answer to an item.
• Very easy or very difficult items do not
discriminate well.

• A poorly written item will have little


ability to discriminate.
8. Decide whether to retain, revise, or reject
the item.

• When an item meets the general


acceptability guidelines for difficulty and
discrimination, you should keep the item.
DIFFICULTY INDEX / DISCRIMINATION INDEX / DECISION

RETAIN
• Difficulty: Moderate or average (0.26 to 0.75)
• Discrimination: High (0.41 and above) or Moderate (0.20 to 0.40)
Items with a difficulty index within 0.26 to 0.75 and a discrimination index of 0.20 and above are to be retained.

REVISE
• Difficulty: Moderate or average (0.26 to 0.75) with Discrimination: Low (below 0.19), OR
• Difficulty: Very easy (0.91 and above), Easy (0.76 to 0.90), Difficult (0.11 to 0.25), or Very difficult (below 0.10) with Discrimination: High (0.41 and above) or Moderate (0.20 to 0.40)
Items with a difficulty index within 0.26 to 0.75 but a discrimination index of 0.19 and below, OR with a discrimination index of 0.20 and above but a difficulty index not within 0.26 to 0.75, should be revised.

REJECT
• Difficulty: Very easy (0.91 and above), Easy (0.76 to 0.90), Difficult (0.11 to 0.25), or Very difficult (below 0.10)
• Discrimination: Low (below 0.19)
Items with a difficulty index not within 0.26 to 0.75 and a discrimination index of 0.19 and below should be rejected.
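
A small Python sketch of this decision rule, assuming the 0.26 to 0.75 difficulty band and the 0.20 discrimination cutoff from the table:

```python
# Retain when both indices are acceptable, revise when exactly one is,
# reject when neither is.
def decide(difficulty: float, discrimination: float) -> str:
    difficulty_ok = 0.26 <= difficulty <= 0.75
    discrimination_ok = discrimination >= 0.20
    if difficulty_ok and discrimination_ok:
        return "RETAIN"
    if difficulty_ok or discrimination_ok:
        return "REVISE"
    return "REJECT"

print(decide(0.86, 0.14))  # REJECT (easy and low discrimination)
print(decide(0.90, 0.21))  # REVISE (easy but acceptable discrimination)
print(decide(0.50, 0.45))  # RETAIN
```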
Item No.   Upper Group=14 (F, P)   Lower Group=14 (F, P)   Difficulty Index   REMARKS     Discrimination Index   REMARKS    DECISION
1          14, 1                   14, 1                   1                  VERY EASY   0                      LOW        REJECT
2          13, 0.93                11, 0.79                0.86               EASY        0.14                   LOW        REJECT
3          14, 1                   11, 0.79                0.90               EASY        0.21                   MODERATE   REVISE


REMINDERS:

• Difficulty levels should never be negative – if you get


a negative result, recalculate

• Item discriminations can be negative – that is, more


students in the low-scoring group can get the item
correct than those in the top-scoring group.

• The general advice is to reject, revise, or rewrite entirely
any item with a negative discrimination index.
