Psyass Midterm Notes (2)

The document outlines the processes and tools involved in psychological testing and assessment, highlighting the differences between testing and assessment. It details various types of tests, including achievement, aptitude, and intelligence tests, as well as the methods of gathering psychological data such as interviews and behavioral observations. Additionally, it discusses the rights of test takers and the roles of evaluators in ensuring ethical practices in psychological assessments.

PSYCHOLOGICAL ASSESSMENT

PSYCHOLOGICAL TESTING AND ASSESSMENT

We have established that testing and assessment are not completely different from each other, but rather related to some degree.

Psychological Testing – only an aspect of the whole assessment process.
❖ The process of measuring psychology-related variables by means of devices or procedures designed to obtain a sample of behavior.
❖ Uses standardized psychological tests.
❖ Measures behaviour in order to convert it into numerical data.

Psychological Assessment – the utilization of various methods to come up with a psychological profile.
❖ The gathering of psychology-related data for the purpose of making a psychological evaluation, accomplished through the use of tools such as tests, interviews, case studies, behavioral observation, and specially designed apparatuses and procedures.
❖ Answers a referral question.

Both involve gathering data.

Testing vs. Assessment

Objective
➢ Testing: To obtain some gauge, usually numerical in nature, with regard to an ability or attribute.
➢ Assessment: To answer a referral question, solve a problem, or arrive at a decision through the use of tools of evaluation.

Process
➢ Testing: May be individual or group in nature. After test administration, the tester will typically add up "the number of correct answers or the number of certain types of responses… with little if any regard for the how or mechanics of such content" (Maloney & Ward, 1976, p. 39).
➢ Assessment: Typically individualized. In contrast to testing, assessment more typically focuses on how an individual processes rather than simply the results of that processing.

Role of Evaluator
➢ Testing: The tester is not key to the process; practically speaking, one tester may be substituted for another without appreciably affecting the evaluation.
➢ Assessment: The assessor is key to the process of selecting tests and/or other tools of evaluation, as well as in drawing conclusions from the entire evaluation.

Skill of Evaluator
➢ Testing: Typically requires technician-like skills in administering and scoring a test, as well as in interpreting a test result.
➢ Assessment: Typically requires an educated selection of tools of evaluation, skill in evaluation, and thoughtful organization and integration of data.

Outcome
➢ Testing: Yields a test score or series of test scores.
➢ Assessment: Entails a logical problem-solving approach that brings to bear many sources of data designed to shed light on a referral question.

Types of Test
Most of the time, test users categorize tests based on domain.

A. Ability Tests – measure skills in terms of speed, accuracy, or both.
❖ Two Elements of Ability: Speed and Accuracy
o Achievement Tests – measure a person's previous learning in a specific area. Ex. spelling tests, timed arithmetic tests, map quizzes.
o Aptitude Tests – measure a person's potential for learning or acquiring a specific skill, or the individual's ability to perform in a new job or situation.
o Intelligence Tests – measure a person's general inclination to solve problems, adapt to changing circumstances, and think abstractly.

B. Performance Tests – measure usual or habitual thoughts, feelings, and behaviour; they indicate how test takers think and act on a daily basis. They measure constructs such as personality characteristics, attitudes, or patterns of behaviour. There are no right or wrong answers.
❖ Personality Tests
Objective – items are either correct or incorrect: true–false, fill-in-the-blank, matching, multiple-choice questions.
Non-objective – no single keyed answer, e.g., essay questions.
Projective – measure personality by showing images and asking open-ended questions.
Subjective – essays, portfolios, capstone projects, oral presentations.

Psychological tests are evaluated at two distinct points and in two different ways:
1. When considered for use, their technical qualities should be of primary concern.
2. When the use is identified, the skill of the user and the way tests are used should be the primary concern.

Varieties of Assessment
➢ Therapeutic Assessment – done for therapeutic purposes.
➢ Educational Assessment – uses tests and other tools to evaluate abilities and skills relevant to success or failure in a school context. Ex. intelligence tests, achievement tests, and reading comprehension tests.
➢ Retrospective Assessment – the use of evaluative tools to draw conclusions about psychological aspects of a person as they existed at some point in time prior to the assessment.
➢ Remote Assessment – the use of tools of psychological evaluation to gather data and draw conclusions about a subject who is not in physical proximity to the person or people conducting the evaluation.
➢ Ecological Momentary Assessment (EMA) – "in the moment" evaluation of specific problems and related cognitive and behavioral variables at the very time and place they occur.

The Assessment Process – aimed at finding out the problem:
1. Referral question
2. Formulating goals and objectives
3. Choosing applicable tools
4. Findings and conclusions
5. Referral/intervention

Forms of Assessment
1. Collaborative Psychological Assessment – assessor and assessee work as partners from initial contact through final feedback.
2. Therapeutic Psychological Assessment – therapeutic self-discovery and new understandings are encouraged throughout the assessment process.
3. Dynamic Psychological Assessment – follows the model (a) evaluation, (b) intervention, (c) evaluation. It provides a means for evaluating how the assessee processes or benefits from some type of intervention during the course of evaluation.

Tools of Psychological Assessment
1. Tests
2. Interview
3. Portfolio
4. Case History Data
5. Behavior Observations
6. Role Play Tests
7. Computers

1. Tests
Psychological Test – a device or procedure designed to measure variables related to psychology (intelligence, personality, aptitude, interests, attitudes, or values).
➢ Overt Behavior – an individual's observable activity.
➢ Covert Behavior – behavior that cannot be directly observed.

Psychological tests differ in various ways: format, administration procedures, scoring and interpretation procedures, and psychometric soundness/technical quality.
o Format – the form, plan, structure, arrangement, and layout of items, as well as related considerations such as time limits.
o Administration Procedures – whether the tests are administered individually or in groups.
o Scoring and Interpretation Procedures – the process of assigning evaluative codes (scores) to performance on a test, interview, or some other sample of behavior.
❖ Cut Score – cutoff score. A reference point, usually numerical, derived by judgment and used to divide a set of data into two or more classifications.
o Psychometric Soundness/Technical Quality – a test's reliability, validity, and utility.
❖ Utility – the usefulness or practical value that a test or other tool of assessment has for a particular purpose.

American Counseling Association's (1995) Levels of Competence in Testing and Assessment
o Level A – instruments that can be adequately administered, scored, and interpreted solely by using the manual. Ex. achievement tests or the SDS (Self-Directed Search).
o Level B – the test user must have a master's-level degree in psychology or guidance and counselling, or the equivalent, with relevant training in assessment. Ex. MBTI, Suicidal Ideation Questionnaire, Strong Interest Inventory.
o Level C – instruments and aids that require substantial knowledge about the construct being measured and about the specific instrument used. Often the publisher will require a PhD qualification in psychology or guidance and counselling, with specific coursework related to the instrument. Ex. WISC-IV, SB, Rorschach, RISB.

2. Interview – gathering information through direct communication involving reciprocal exchange.
Interviews can be used in several settings:
➢ School Setting – to determine appropriate interventions and placements.
➢ Human Resources – to make informed hiring, promotion, and termination decisions (selection process).
➢ Panel Interview – more than one interviewer participates in the assessment of personnel.
❖ Motivational Interviewing – a therapeutic dialogue that combines person-centred listening skills, such as openness and empathy, with the use of cognition-altering techniques designed to positively affect motivation and effect therapeutic change.

3. Portfolio – also called work samples. A collection of outputs made by an individual that signifies his or her skills and abilities.

4. Case History Data – records, transcripts, and other accounts in written, pictorial, or other form that preserve archival information, official and informal accounts, and other data and items relevant to an assessee.
❖ Case Study (or Case History) – a report or illustrative account concerning a person or an event, compiled on the basis of case history data.

Case history data also play an important role in different assessment settings:
➢ Clinical – sheds light on an individual's past and current adjustment, as well as on events and circumstances that may have contributed to any changes in adjustment.
➢ Neuropsychology – provides information about neuropsychological functioning prior to the occurrence of a trauma or other event that results in a deficit.
➢ Educational Setting – gives insight into current academic and behavioral standing, and can be used for future class placements.

5. Behavioral Observation – monitoring the actions of others or oneself by visual or electronic means, while recording quantitative or qualitative information regarding those actions.
➢ Naturalistic Observation – the observer observes the subject in a natural setting, from a distance.
➢ Participant Observation – the observer joins the subject in its environment, while at the same time taking note of the behavior.

Types of Behavior Observation
Anecdotal Recording – involves recording and interpreting a narrative of behaviour during an observation period, using an antecedent–behavior–consequence (ABC) format for interpreting behavior.
❖ Only observable behaviour should be recorded.
❖ For example, if the observer saw that a student punched another student, he should record "he punched his classmate" rather than "he is mad."

Interval Recording Methods – use intervals during which the behavior is observed to occur.
❖ Three basic variations on interval recording: partial-interval recording, whole-interval recording, and momentary time sampling.
➢ Partial-Interval Recording – the observer sets his own interval size within which the behaviour is observed; 30 seconds is a common choice.
❖ The observer watches the student or client for the presence of the target behavior.
➢ Whole-Interval Recording – similar to partial-interval recording in all aspects but one: in partial-interval recording the behaviour is recorded as having occurred if it was observed at any point during the interval, whereas in whole-interval recording it must be observed throughout the entire interval.
➢ Momentary Time Sampling – the intervals are quite short.
❖ The difference is that the behaviour is recorded as present only if it occurs at the end of each interval.
❖ The advantages of time sampling are its relative ease and the fact that, between sampling points, the observer can do other tasks. However, its greatest disadvantage is its inability to record the true occurrence of the behaviour.

6. Role Play Tests – a tool of assessment wherein assessees are directed to act as if they were in a particular situation.

7. Computers as Tools
➢ Computer-Assisted Psychological Assessment (CAPA) – offers convenience and economy of time in administering, scoring, and interpreting tests.
➢ Computer Adaptive Testing – the computer's ability to tailor the test to the test taker's ability or test-taking patterns.

Parties in Assessment
1. The Test Developer
❖ The one who creates tests.
❖ Test developers conceive, prepare, and develop tests. They also find a way to disseminate their tests, by publishing them either commercially or through professional publications such as books or periodicals.
2. The Test User – selects a specific test off the shelf and uses it for some purpose. Test users may also participate in other roles, e.g., as examiners or scorers.
3. The Test Taker – anyone who is the subject of an assessment.
❖ Psychological Autopsy – the reconstruction of a deceased individual's psychological profile on the basis of archival records, artifacts, and interviews previously conducted with the deceased assessee.

Applications of Assessment
1. Educational Setting
2. Geriatric Setting – assessment for the aged
3. Counseling Setting
4. Clinical Setting
5. Business and Military Setting

Current Uses of Psychological Assessment: decision making, psychological research, and understanding the self.

Assessment of People with Disabilities
o Alternate Assessment – an evaluative or diagnostic procedure or process that varies from the usual, customary, or standardized way a measurement is derived, either by virtue of some special accommodation made to the assessee or by means of alternative methods.
o Accommodation – the adaptation of a test, procedure, or situation, or the substitution of one test for another, to make the assessment more suitable for an assessee with exceptional needs.
❖ Ex. translating a test into Braille and administering it in that form.

Rights of the Testtakers
1. Right of Informed Consent
❖ The right to know why they are being evaluated, how the test data will be used, and what information will be released to whom.
❖ Consent may be obtained from a parent or legal representative.
❖ Consent must be in written form, stating:
✓ the general purpose of the testing
✓ the specific reason it is being undertaken
✓ the general type of instruments to be administered

2. Right To Be Informed of Test Findings
❖ Reporting of findings should be realistic.
❖ It should also be in a language that can be easily understood by the testtaker.
❖ Testtakers also have the right to be oriented to the different recommendations arising from their test results.

3. Right of Privacy and Confidentiality
➢ Private Right – recognizes the freedom of the individual to pick and choose for himself the time, circumstances, and particularly the extent to which he wishes to share with, or withhold from, others his attitudes, beliefs, behaviors, and opinions.
➢ Privileged Information – information protected by law from being disclosed in a legal proceeding. It protects clients from disclosure in judicial proceedings; the privilege belongs to the client, not the psychologist.
❖ "Protective privilege ends where the public peril begins."
❖ Landmark case: Tarasoff v. Regents of the University of California.
❖ This established the clinician's duty to warn of potential violence, or even to prevent the spread of AIDS infection from an HIV-positive patient.
➢ Confidentiality – concerns matters of communication outside the courtroom.

4. Right To the Least Stigmatizing Label – the standard advises that the least stigmatizing labels should always be assigned when reporting test results.
❖ Jo-Ann's mother, Carmel Iverson, brought a libel (defamation) suit against Frandsen on behalf of her daughter. Mrs. Iverson lost the lawsuit; the court ruled in part that the psychological evaluation "was a professional report made by a public servant in good faith, representing his best judgment."

STATISTICS AND ASSESSMENT

Statistics – the science of collecting, analyzing, presenting, and interpreting data.

Statistics deals with the COPAI of data:
Collection
Organization
Presentation
Analysis
Interpretation

Within statistics in psychological assessment are the concepts of measurement and scaling.

Measurement – the act of assigning numbers or symbols to characteristics of things; the numbers represent the magnitude of the attribute being measured.
Scaling – an extension of measurement. Scaling involves creating a continuum on which the measurements of objects are located. Ex.:
0–1.00    Low
1.01–2.00  Average
2.01–3.00  High

Two Categories of a Scale:
➢ Continuous Scale – interval/ratio; can be divided, and is prone to error. It is theoretically possible to divide any of the values of the scale.
➢ Discrete Scale – nominal/ordinal; used to measure discrete variables. Ex. rural and urban, female and male.

Measurement always involves error.

o Raw Score – a straightforward, unmodified accounting of performance that is usually numerical.
o Frequency Distribution – used to organize the collected data in tabular or graphic form.

Types of Frequency Distribution:
➢ Ungrouped/Simple Frequency Distribution – shows the frequency of each separate data value rather than of groups of data values. Ex. 5, 10, 15
➢ Grouped Frequency Distribution – the data are arranged and separated into groups called class intervals, and the frequency of data belonging to each class interval is noted in a frequency distribution table. The grouped frequency table shows the distribution of frequencies across class intervals. Ex. 0–5, 6–10
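The two tabulation styles can be sketched in Python; the scores below are invented for illustration:

```python
from collections import Counter

scores = [3, 7, 2, 9, 5, 6, 1, 8, 4, 10, 7, 5]

# Ungrouped/simple: frequency of each separate data value
ungrouped = Counter(scores)

# Grouped: class intervals 0-5 and 6-10, as in the example above
grouped = Counter("0-5" if s <= 5 else "6-10" for s in scores)

print(ungrouped[7], grouped["0-5"], grouped["6-10"])  # 2 6 6
```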

Errors – the collective influence of all the factors on a test score or measurement beyond those specifically measured by the test or measurement.

A psychology professional often deals with two types of data:
➢ Quantitative Data – numerical data, like age, weight, grades, etc.
➢ Qualitative Data – categorical data, like name, section, religion, subjects, etc.

Data can be used in various ways, so there is a need to create a levelling in order to identify how to treat the data. Remember that, as counsellors, the data's level of measurement dictates the appropriate statistical tool to use in order to arrive at a meaningful interpretation.

Frequency Distribution Graphs
➢ Bar Graphs – represent data using rectangular bars of uniform width, with equal spacing between the bars.
➢ Histograms – a graphical representation of data using rectangular bars of different heights; in a histogram there is no space between the bars.
➢ Pie Chart – visually displays data in a circular chart, divided into sectors that each show a particular part of the data out of the whole.
➢ Frequency Polygon – drawn by joining the midpoints of the bars in a histogram.
Stanley Smith Stevens – developed the four scales of measurement (nominal, ordinal, interval, ratio) in 1946.

Scales/Levels of Measurement
➢ Nominal – non-numeric group labels that do not reflect quantitative information. The simplest form of measurement; involves classification based on one or more distinguishing characteristics.
➢ Ordinal – reflects order, ranking, and hierarchy. No absolute zero point.
➢ Interval – no absolute zero point, but equal intervals between numbers; each unit on the scale is exactly equal to any other unit on the scale.
➢ Ratio – has an absolute zero point.

Valid operations                                        Nominal  Ordinal  Interval  Ratio
Frequency distributions                                 Yes      Yes      Yes       Yes
Median, percentile ranges                               No       Yes      Yes       Yes
Addition, subtraction, mean, standard deviation         No       No       Yes       Yes
Multiplication, division, ratios, coeff. of variation   No       No       No        Yes

Review question: What is the most frequently used scale of measurement in Psychology?

Describing Data
o Distribution – a set of test scores arrayed for recording or study.

Measures of Central Tendency
1. Mean
❖ The most commonly used measure of the center of the data.
❖ Also referred to as the "arithmetic average."
❖ Can be influenced by extreme scores.

To find the mean, add all the observations, then divide the sum by the total number of observations.

Sample mean: x̄ = ∑x / n        Population mean: μ = ∑x / N
where ∑x is the sum of all data values, N is the number of data items in the population, and n is the number of data items in the sample.

Ex. The grades of Student A in 5 subjects are 78, 88, 89, 90, and 95. What is her average?
x̄ = ∑x/n = (78 + 88 + 89 + 90 + 95)/5 = 440/5 = 88

2. Median – the middle value in a distribution when the values are arranged in ascending order.
❖ The median divides the distribution in half (50% of the observations lie on either side of the median value).

Ex. Find the median of the scores of the sophomore students in Chemistry:
10, 12, 14, 16, 23, 24, 25, 33, 34, 35, 41, 45, 50
The middle (7th) of the 13 ordered scores is 25, so the median is 25.

Ex. The ages of the patients of Hospital X in the pediatric ward are:
2, 5, 5, 6, 8, 9, 9, 10
(6 + 8)/2 = 14/2 = 7 (the median is 7)
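The two worked examples can be checked with the standard library:

```python
from statistics import mean, median

# Grades of Student A: (78 + 88 + 89 + 90 + 95) / 5 = 440 / 5 = 88
grades = [78, 88, 89, 90, 95]
print(mean(grades))   # 88

# Pediatric-ward ages: even count, so the median averages the
# two middle values: (6 + 8) / 2 = 7
ages = [2, 5, 5, 6, 8, 9, 9, 10]
print(median(ages))   # 7.0
```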
3. Mode – the observation that appears the most number of times in a distribution. A distribution can be unimodal, bimodal, polymodal, or have no mode.

Ex. What is the mode of the test scores of the new students in a Statistics test?
12, 13, 12, 11, 10, 20, 24, 25, 10, 22, 20, 13, 16, 18, 20, 20, 20, 20
Mode: 20

Measures of Variability
o Variability – an indication of how scores in a distribution are scattered or dispersed.
o Range – equal to the difference between the highest and the lowest scores.
❖ The range is based entirely on the values of the lowest and highest scores; one extreme score (if it happens to be the lowest or the highest) can radically alter the value of the range.
❖ When its value is based on extreme scores in a distribution, the resulting description of variation may be understated or overstated.
➢ Interquartile Range – a measure of variability equal to the difference between Q3 and Q1.
➢ Semi-interquartile Range – equal to the interquartile range divided by 2.
* Q1 and Q3 should be equally distant from the median for the distribution to be symmetric; otherwise the distribution is skewed.
* A quartile is a specific point; a quarter is an interval.

o Standard Deviation – a measure of variability equal to the square root of the average squared deviation about the mean; that is, the square root of the variance.
o Variance – equal to the arithmetic mean of the squares of the differences between the scores in a distribution and their mean (the squared deviations from the mean).

s² = (∑x²/n) − x̄²

Ex. Solve for the standard deviation of 85, 100, 90, 95, 80.
Mean = (85 + 100 + 90 + 95 + 80)/5 = 450/5 = 90
(85 − 90)² = 25
(100 − 90)² = 100
(90 − 90)² = 0
(95 − 90)² = 25
(80 − 90)² = 100
Variance = 250/5 = 50; SD = √50 ≈ 7.07

Ex. Solve for the standard deviation of 6, 7, 8, 9, and 10.
Mean = (6 + 7 + 8 + 9 + 10)/5 = 40/5 = 8
Squared deviations: 4, 1, 0, 1, 4
Variance = 10/5 = 2; SD = √2 ≈ 1.41

Non-Normal Distribution
o Skewness – a distribution's lack of symmetry. A distribution has a negative skew when relatively few of the scores fall at the low end of the distribution. Positively skewed examination results may indicate that the test was too difficult; more items that were easier would have been desirable in order to better discriminate at the lower end of the distribution of test scores. Negatively skewed examination results may indicate that the test was too easy.
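Both standard-deviation examples above can be reproduced with a short sketch (population formulas, dividing by n as in the notes):

```python
from math import sqrt

def variance_sd(scores):
    """Return the population variance and standard deviation:
    the mean squared deviation about the mean, and its square root."""
    n = len(scores)
    mean = sum(scores) / n
    var = sum((x - mean) ** 2 for x in scores) / n
    return var, sqrt(var)

print(variance_sd([85, 100, 90, 95, 80]))  # (50.0, 7.07...)
print(variance_sd([6, 7, 8, 9, 10]))       # (2.0, 1.41...)
```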
o Average Deviation (AD) – describes the amount of variability in a distribution.

Formula: AD = ∑|x − x̄| / n

Ex. Solve for the average deviation of 85, 100, 90, 95, 80.
Mean = (85 + 100 + 90 + 95 + 80)/5 = 450/5 = 90
|85 − 90| = 5
|100 − 90| = 10
|90 − 90| = 0
|95 − 90| = 5
|80 − 90| = 10
AD = 30/5 = 6

How to solve:
✓ Get the arithmetic mean (add all the scores and divide by n).
✓ Calculate the deviation from the mean for each value in the data set.
✓ Calculate the sum of all the absolute deviations.
✓ Divide by n to get the average deviation.

o Kurtosis – the flatness or peakedness of a distribution.
o The Normal Curve – a bell-shaped, smooth, mathematically defined curve that is highest at its center.
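A minimal sketch of the average-deviation steps listed above:

```python
def average_deviation(scores):
    """Mean of the absolute deviations about the arithmetic mean."""
    mean = sum(scores) / len(scores)               # step 1: arithmetic mean
    deviations = [abs(x - mean) for x in scores]   # step 2: deviations from the mean
    return sum(deviations) / len(scores)           # steps 3-4: sum, then average

print(average_deviation([85, 100, 90, 95, 80]))  # 6.0
```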
Areas under the normal curve:
±1 SD – 68.26% (+1 = 34.13, −1 = 34.13)
±2 SD – 95.44% (+2 = 47.72, −2 = 47.72)
±3 SD – 99.72% (+3 = 49.86, −3 = 49.86)
±4 SD – 99.98% (+4 = 49.99, −4 = 49.99)

The Descriptive Use of the Normal Curve (Gaussian Curve)
o Normalization – a transformation of scores so that they assume the same meaning and can be compared relative to one another.

The Inferential Use of the Normal Curve
o Estimating a population parameter – a valid and convenient estimate of a characteristic of the population from a sample statistic.
o Testing hypotheses about differences – the basis for rejecting or not rejecting the null hypothesis formulated.
o Inferences – deductions.

Estimating a Population Parameter
o Standard Error (SE) – rests on the premise that a sample statistic can be used to estimate the population parameter.
o Given the sample mean, we can estimate the possible location of the population mean by using the formula for the Standard Error of the Mean (SEM):

SEM = s / √n
where s is the standard deviation and n is the number of cases.

Ex. You want to know the average height of all Filipinos. You have acquired sample data from 50 Filipinos and found an average height of 64 inches with a standard deviation of 4 inches.
SEM = 4/√50 = 4/7.07 = 0.566, or 0.57

The SEM acts like a standard deviation that sets the confidence interval. Given that your SEM is 0.566, the interval at ±1 SEM around the sample mean of 64 inches is 64 ± 0.566, or 63.43–64.57 inches. In other words, we are 68.26% sure that the mean height of Filipinos is between 63.43 and 64.57 inches. At wider intervals:
95.44% sure (62.86–65.14 inches)
99.72% sure (62.29–65.71 inches)
99.98% sure (61.72–66.28 inches)

Ex. You want to know the mean age of all public school teachers in the Philippines. You have initially gathered data from 55 teachers, producing a mean of 36 years old with a standard deviation of 2.2. Solve for the SEM and provide the confidence interval at 95%.
SEM = 2.2/√55 = 2.2/7.42 = 0.30
95.44% (±2 SEM): 36 ± 0.60 = 35.4–36.6 years

Ex. Your study is about the mean academic performance of all Grade 11 students in Region 6. With initial data gathering, you have a mean of 84.8 and a standard deviation of 3 from 93 students. Compute the SEM and provide the 99% confidence interval.
SEM = 3/√93 = 3/9.64 = 0.31
99.72% (±3 SEM): 84.8 ± 0.93 = 83.87–85.73

STANDARD SCORES

Standard Score – a raw score that has been converted from one scale to another scale, where the latter scale has some arbitrarily set mean and standard deviation.

Why convert raw scores?
o Easier interpretation.
o The position of a testtaker's performance relative to other testtakers is readily apparent.

In converting raw scores to standard scores, two different transformations can take place:
➢ Linear Transformation – one that retains a direct numerical relationship to the original raw score.
➢ Non-Linear Transformation – required when the data under consideration are not normally distributed, yet comparisons with normal distributions need to be made.

Z-Score – results from the conversion of a raw score into a number indicating how many standard deviation units the raw score is below or above the mean of the distribution.

Z = (X − x̄) / s
where X is the raw score from the test, x̄ is the mean score of the class, and s is the standard deviation.

Ex. Mary took a test in English. She scored 12 on a 20-item test. The mean score of her class is 15, with a standard deviation of 1.5. How well did Mary perform on the test?
Z = (12 − 15)/1.5 = −3/1.5 = −2
T = (z-score)(SD) + Mean = (−2)(10) + 50 = −20 + 50 = 30

Remember that standard scores have a fixed mean and standard deviation as their basis.
A Z-score has a mean of 0 and an SD of 1. From a z-score we can transform to another standard score, such as the T-score.
A T-score has a mean of 50 and an SD of 10.

For a raw score one standard deviation above the mean (z = 1):
T = (z-score)(SD) + Mean = (1)(10) + 50 = 60
A = (z-score)(SD) + Mean = (1)(100) + 500 = 600
IQ = (z-score)(SD) + Mean = (1)(15) + 100 = 115
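The SEM and standard-score conversions above can be sketched as:

```python
from math import sqrt

def sem(s, n):
    """Standard error of the mean: s / sqrt(n)."""
    return s / sqrt(n)

def z_score(raw, mean, sd):
    return (raw - mean) / sd

def rescale(z, new_mean, new_sd):
    """Convert a z-score to any standard scale (T, A, IQ, ...)."""
    return z * new_sd + new_mean

# Height example: s = 4, n = 50
print(round(sem(4, 50), 2))   # 0.57

# Mary's English test: raw 12, class mean 15, SD 1.5
z = z_score(12, 15, 1.5)      # -2.0
print(rescale(z, 50, 10))     # T-score: 30.0
print(rescale(z, 100, 15))    # IQ scale (SD 15): 70.0
```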

❖ Coefficient of Determination – tells the proportion of the total


variation in scores on Y is attributed to X (r^2)
STANDAR SCORE MEAN SD
Z-score 0 1 For example the correlation coefficient between self-esteem and resilience

T-score 50 10 is 0.92, the coefficient of determination is 0.85.

A-score 500 100


r^2 = .92^2 = .85 x 100 = 85%
IQ 100 16
That means, 85 percent of what happened to the resilience of the person
Stanine 5 2
is associated with his self-esteem.
Sten 5.5 2

Coefficient of Alternation – measure of non-association between two


Ex. Mr. Tan, a physics teacher, administered a test to his students. CJ, variables. Calculated as 1-r^2
one of his students, scored 26 in that 30-item quiz; the mean score of the
class is 28 with a standard deviation of 4. What is the equivalent T-score of If 85% of what happened to the resilience of the person is associate with
CJ, and what can Mr. Tan infer out of it? his/her self-esteem, then the remaining 15% percent speaks of its non-
26 − 28 −2 association.
Z= = = −0.5
4 4
1-r^2 = 1 - .85 + .15 x 100 = 15%
T = (z-score) (SD) + Mean = (-0.5) (10) + 50 = 45
CJ's test score is below the mean, which means CJ performed lower than the general performance of the class.

An examinee got a raw score of 80 in a job screening test that measures ability. The mean score of all examinees was 96 with a standard deviation of 7. What will be the examinee's DIQ (SD = 16) score in the test?

Z = (80 − 96) / 7 = −16 / 7 = −2.29
DIQ = (z-score)(SD) + Mean = (−2.29)(16) + 100 = 63.36

CORRELATION AND REGRESSION

Correlation – an expression of the degree and direction of correspondence between two variables.

Correlation Coefficient – a mathematical index that describes the direction and magnitude of a relationship.

The correlation between two variables can be positive or negative.

There are several tools that aid us in measuring correlation:
   Pearson r – statistical tool of choice when the variables are continuous (the Pearson correlation coefficient or Pearson product-moment coefficient of correlation; Karl Pearson).
   Spearman's Rho – statistical tool of choice when the respondents are few and scores are viewed at the ordinal level of measurement (the rank-order correlation coefficient or rank-difference correlation coefficient; Charles Spearman).

Sometimes we also correlate variables which appear to be dichotomous, that is, having two levels only.

Regression – the analysis of relationships among variables for the purpose of understanding how one variable may predict another variable.

Several measures of Prediction (Regression)
   Logistic Regression – measure of regression to use when the predictor is a continuous variable, while the criterion is a nominal variable.
      ❖ For example, the predictor is interview score, while the criterion is a school program.
   Multinomial Regression – measure of regression to use when the predictors (2 or more) are continuous variables, while the criterion is a nominal variable.
      ❖ For example, the predictors are interview score and IQ test scores, while the criterion is a school program.
   Simple Linear Regression – measure of regression to use when the predictor is a continuous variable, while the criterion is also a continuous variable.
      ❖ For example, the predictor is interview score, and the criterion is GPA.
   Multiple Linear Regression – measure of regression to use when the predictors are continuous variables, while the criterion is also a continuous variable.
      ❖ For example, the predictors are interview and IQ scores, and the criterion is GPA.

RELIABILITY AND VALIDITY OF TESTS

Comparing standard scores (the higher score in each pair):
   Z = 3 > T = 30
   IQ = 85 < T = 60
   T = 75 > IQ = 130
   Z = −1.5 < IQ = 85
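The standard-score conversions used in the examples above (z, T, and DIQ) can be sketched in Python. The numbers follow the examinee example; rounding z to two decimals first, as the notes do, reproduces 63.36.

```python
# Standard-score conversion sketch: z = (raw - mean) / sd,
# T = 50 + 10z, and DIQ = 100 + SD*z (SD = 15 or 16 depending on the test).

def z_score(raw, mean, sd):
    return (raw - mean) / sd

def t_score(z):
    return 50 + 10 * z

def diq_score(z, sd=16):
    return 100 + sd * z

# Examinee example: raw = 80, mean = 96, sd = 7, DIQ scale with SD = 16.
z = round(z_score(80, 96, 7), 2)     # -2.29
print(z, round(diq_score(z), 2))     # -2.29 63.36

# The comparison exercises put both scores on a common scale before comparing:
print(t_score(3))                    # z = 3 is equivalent to T = 80, so z = 3 > T = 30
```

Converting both members of each pair to the same scale (here, T or z) is what makes the comparisons above unambiguous.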
Assumptions of Testing and Assessment:
1. Psychological Traits and States Exist
2. Psychological Traits and States Can Be Quantified and Measured
3. Test-Related Behavior Predicts Non-Test-Related Behavior
4. Tests and Other Measurement Techniques Have Strengths and Weaknesses
5. Various Sources of Error Are Part of the Assessment Process
6. Testing and Assessment Can Be Conducted in a Fair and Unbiased Manner
7. Testing and Assessment Benefit Society
The KR-20 formula is used to solve for the reliability coefficient of a test which contains dichotomous items.

Cronbach's Alpha is used to solve for the reliability coefficient of a test which contains non-dichotomous items.

[Figure: four targets contrasting "precise but inaccurate" with "accurate but imprecise"; inaccuracy = ERROR, imprecision = ERROR]

Systematic Error – errors in the instrument itself.
Random Error – errors caused by unknown and unpredictable changes.

In order for a test to be considered "good", there are technical qualities that it should possess. The first two important technical qualities are validity and reliability.

Validity – the accuracy of a particular instrument/test.
   ❖ For example: a valid weighing scale should record 1 kg when 1 kg of rice is put on it.

Aside from reliability and validity, there are also other considerations in assessing the soundness of a test.

Validity – the degree to which all the accumulated evidence supports the intended interpretation of test scores for the proposed purpose.
   ❖ Simply put, validity means the ability of a particular instrument to measure what it intends to measure.

In the trinitarian view, there are only three major types of validity: construct, content, and criterion-related.

However, there is another type of validity which has attracted controversy over the years.
o Face validity – assumed to assess only the appearance of the test. It concerns more the format, readability, and other factors that meet the eye.

A good test is one that contains adequate norms.

Norms – the test performance data of a particular group of testtakers that are designed for use as a reference when evaluating or interpreting individual test scores.
o When a test uses norms for interpretation, that test is said to be a norm-referenced test.
o The people from whom the normative data came are called the normative sample.
o The process of deriving norms is called norming.
o The administration of the test to establish the normative data is called standardization.

Reliability
Sources of error have to be noted when dealing with reliability estimates:
   Time Sampling Error – errors associated with the time of administering the test.
   Item Sampling Error – errors associated with the revision or creation of items.

1. Test-Retest Reliability – administration of the same test twice to the same group of respondents.
   ❖ The error associated with test-retest reliability is time sampling error.
2. Parallel Forms/Alternate Forms Reliability – two test administrations to the same group of respondents.
   ❖ The errors associated with parallel/alternate forms reliability are time sampling error and item sampling error.
3. Split-Half Reliability – one test, cut into two, administered to the same group of testtakers.
   ❖ The error associated with split-half reliability is item sampling error.
4. Inter-Item Consistency – refers to the degree of correlation among all the items on a scale; a single administration of a single form of a test.
   ❖ The issue of test homogeneity or heterogeneity should be carefully considered.

Construct Validity – a judgment about the appropriateness of inferences drawn from test scores regarding an individual.

Construct – an informed, scientific idea developed or hypothesized to describe or explain behavior.

Evidence that a test is construct valid:
1. Evidence of homogeneity
2. Evidence of changes with age
3. Evidence of pretest-posttest changes
4. Evidence from distinct groups
5. Convergent evidence
6. Discriminant evidence

Content Validity – mostly used in the fields of education, research, and psychology.

Construct Underrepresentation – describes the failure to capture important components of a construct.

If construct validity relies on correlational techniques, content validity relies on the expertise of people in the field. The use of Good and Scates and the CVR are the most popular methods of assessing content validity.

Content Validity Ratio (CVR) – a method of gauging agreement among raters or judges regarding how essential a particular item is. This was proposed by C. H. Lawshe.
   ❖ The experts shall rate each item as essential, useful but not essential, or not essential.

CVR = (nₑ − N/2) / (N/2)

Where: nₑ equals the number of SMEs rating an item as "essential" and N equals the total number of SMEs providing ratings.

A person who seeks a driver's license had performed well in both the Driver's Aptitude Test and the Manual Driving Test.
o Concurrent Validity is an index of the degree to which a test score is related to some criterion measure obtained at the same time.
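Lawshe's CVR formula above can be sketched as a small helper; the panel size and ratings below are hypothetical.

```python
# Content Validity Ratio sketch: CVR = (n_e - N/2) / (N/2),
# where n_e = number of SMEs rating the item "essential" and N = total SMEs.

def cvr(n_essential, n_total):
    half = n_total / 2
    return (n_essential - half) / half

# Hypothetical panel of 10 experts rating one item:
print(cvr(4, 10))   # -0.2 -> below Lawshe's minimum for 10 panelists; likely rejected
print(cvr(9, 10))   # 0.8  -> above the minimum; likely retained
```

A CVR of 0 means exactly half the panel rated the item essential; negative values mean fewer than half did.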
Number of Panelists   Minimum Value
5                     .99
6                     .99
7                     .99
8                     .75
9                     .78
10                    .62
11                    .59
12                    .56
13                    .54
14                    .51
15                    .49
20                    .42
25                    .37
30                    .33
35                    .31
40                    .29
Source: Lawshe (1975)

CVR = (4 − 10/2) / (10/2) = −0.2 → The item should be rejected
CVR = (9 − 10/2) / (10/2) = 0.80 → The item should be retained

Table of Specification

UTILITY OF TESTS

Utility of Tests – refers to the usefulness or practical value of testing to improve efficiency.

Factors Affecting a Test's Utility
   Psychometric Soundness
   Costs
   Benefits

Psychometric Soundness of Tests
o An index of utility can tell us something about the practical value of the information derived from scores on the test.
o Test scores are said to have utility if their use in a particular situation helps us to make better decisions—better, that is, in the sense of being more cost-effective.

Costs of Tests
o These are economic, financial, or budget-related factors that must be taken into account when conducting tests.
o Cost refers to disadvantages, losses, or expenses in both economic and noneconomic terms.

When a test is to be conducted, one should allocate funds for the purchase of the following:
   The particular test
   A supply of blank test protocols – answer sheets
   Computerized test processing, scoring, and interpretation from the test publisher or some independent service

Benefits – refers to profits, gains, or advantages in both economic and noneconomic terms.
o Economic – financial returns in cents, dollars (Pesos in the Philippines)
o Noneconomic
   - An increase in the quality of workers' performance.
   - An increase in the quantity of workers' performance.
   - A decrease in the time needed to train workers.
   - A reduction in the number of accidents.
   - A reduction in worker turnover.

Criterion-Related Validity – a judgment of how adequately a test score can be used to infer an individual's most probable standing on some measure of interest – or a criterion.

Two types of criterion-related validity are concurrent validity and predictive validity.

A particular test has good predictive validity if it has a good forecasting function.
   The OLSAT has good predictive ability in predicting college academic performance.
   Interventions may have been given before measuring the criterion.
o Predictive Validity is an index of the degree to which a test score predicts some criterion measure.

A test has good concurrent validity if scores from a new test agree with the scores from a well-established test.
   The score of a client with depression on a newly made test, the Bedoria Depression Scale, is the same as his score on the Beck Depression Inventory.
The philosophy behind the Taylor-Russell tables is that a test will be useful
to an organization if:
The test is valid
The organization can be selective in its hiring because it has
more applicants than openings
There are plenty of current employees who are not performing
well, thus there is room for improvement.

So how do we go about it?
1. Find out the criterion validity coefficient of the test
2. Obtain the selection ratio
3. Obtain the base rate
4. Consult the Taylor-Russell table
Selection Ratio – the percentage of applicants an organization hires.

selection ratio = number of hires / number of applicants

The lower the selection ratio, the greater the potential usefulness of the test.
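With hypothetical hiring figures, the selection ratio (and the base rate used alongside it in the Taylor-Russell tables) is just a proportion:

```python
# Selection ratio = hires / applicants; base rate = successful / current employees.
# The figures below are hypothetical.

def selection_ratio(hires, applicants):
    return hires / applicants

def base_rate(successful, employees):
    return successful / employees

print(selection_ratio(10, 100))  # 0.1 -> a low ratio; the test is potentially more useful
print(base_rate(30, 60))         # 0.5
```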

Base Rate – the percentage of current employees who are considered successful.

Utility Analyses – a family of techniques that entail a cost–benefit analysis designed to yield information relevant to a decision about the usefulness and/or practical value of a tool of assessment.
o "Does the benefit outweigh the cost?"

In the most general sense, a utility analysis may be undertaken for the purpose of evaluating whether the benefits of using a test (or training program or intervention) outweigh the costs. If undertaken to evaluate a test, the utility analysis will help make decisions regarding whether:
One test is preferable to another test for use for a specific
purpose;
One tool of assessment (such as a test) is preferable to another
tool of assessment (such as behavioral observation) for a
specific purpose;
The addition of one or more tests (or other tools of assessment)
to one or more tests (or other tools of assessment) that are
already in use is preferable for a specific purpose;
No testing or assessment is preferable to any testing or
assessment.

To determine how useful a test would be in any given situation, several formulas and tables have been designed.
Taylor-Russell Tables
Naylor-Shine Tables
Expectancy Charts
Lawshe Tables
Utility Formula

Taylor-Russell Tables
o Provide an estimate of the percentage of total new hires who will be successful employees if a test is adopted (organizational success).
o A series of tables based on the selection ratio, base rate, and test validity that yield information about the percentage of future employees who will be successful if a particular test is used.
Determining Proportion of Correct Decisions
o A utility method that compares the percentage of times a selection decision was accurate with the percentage of successful employees.

Lawshe Tables
o Tables that use the base rate, test validity, and applicant percentile on a test to determine the probability of future success for that applicant.

Three pieces of information are needed:
   Validity coefficient
   Base rate
   Applicant's test score (more specifically, did the person score in the top 20%, the next 20%, the middle 20%, the next lowest 20%, or the bottom 20%)

o Quadrant I – represents employees who scored poorly on the test but performed well on the job.
o Quadrant II – represents employees who scored well on the test and were successful on the job.
o Quadrant III – represents employees who scored high on the test, yet did poorly on the job.
o Quadrant IV – represents employees who scored low on the test and did poorly on the job.

If a test is a good predictor of performance, there should be more points in quadrants II and IV because the points in the other two quadrants represent "predictive failures". That is, in quadrants I and III no correspondence is seen between test scores and job performance.
To estimate the test's effectiveness, the number of points in each quadrant is totalled, and the following formula is used:

   (Points in quadrants II and IV) / (Total points in all quadrants)

The resulting number represents the percentage of time that we expect to be accurate in making a selection decision in the future. To determine whether this is an improvement, we use the following formula:

   (Points in quadrants I and II) / (Total points in all quadrants)

There are 5 data points in quadrant I, 10 in quadrant II, 4 in quadrant III, and 11 in quadrant IV. The percentage of time we expect to be accurate in the future would be:

   (II + IV) / (I + II + III + IV) = (10 + 11) / (5 + 10 + 4 + 11) = 21 / 30 = 0.70

To compare this figure with the test we were previously using to select employees, we compute the satisfactory performance baseline:

   (I + II) / (I + II + III + IV) = (5 + 10) / (5 + 10 + 4 + 11) = 15 / 30 = 0.50

Using the new test would result in a 40% increase in selection accuracy [.70 − .50 = .20, divided by .50 = .40] over the selection methods previously used.

Naylor-Shine Tables
o Entails obtaining the difference between the means of the selected and unselected groups to derive an index of what the test (or some other tool of assessment) is adding to already established procedures.
o Determines the increase in average score on some criterion measure.

Utility Formula: Brogden-Cronbach-Gleser Utility Formula
o Measures the productivity gains or the estimated increase in work output; estimates the benefit of using a particular selection method.
o A method of ascertaining the extent to which an organization will benefit from the use of a particular selection system.

   utility gain = (N)(T)(r_xy)(SD_y)(z̄_m) − (N)(C)

To use this formula, five items of information must be known:
   Number of employees hired per year (N) – this number is easy to determine: it is simply the number of employees who are hired for a given position in a year.
   Average tenure (T) – the average amount of time that employees in the position tend to stay with the company. The number is computed by using information from company records to identify the time that each employee in that position stayed with the company. The number of years of tenure for each employee is then summed and divided by the total number of employees.
   Test validity (r_xy) – this figure is the criterion validity coefficient that was obtained through either a validity study or validity generalization.
   Standard deviation of performance in dollars (SD_y) – for many years, this number was difficult to compute. Research has shown, however, that for jobs in which performance is normally distributed, a good estimate of the difference in performance between an average and a good worker (one standard deviation away in performance) is 40% of the employee's annual salary (Hunter & Schmidt, 1982). The 40% rule yields results similar to more complicated methods and is preferred by managers (Hazer & Highhouse, 1997). To obtain this, the total salaries of current employees in the position in question should be averaged.
   Mean standardized predictor score of selected applicants (z̄_m) – this number is obtained in one of two ways. The first method is to obtain the average score on the selection test for both the applicants who are hired and the applicants who are not hired. The average test score of the nonhired applicants is subtracted from the average test score of the hired applicants. This difference is divided by the standard deviation of all the test scores.

Considerations
   Pool of applicants
   Complexity of the job
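The two computations described above — the quadrant-based proportion of correct decisions and the Brogden-Cronbach-Gleser utility gain — can be sketched as follows. The quadrant counts come from the worked example; the utility-gain inputs are hypothetical.

```python
# Proportion of correct decisions from quadrant counts (I = 5, II = 10,
# III = 4, IV = 11 in the example), and the Brogden-Cronbach-Gleser gain.

def accuracy(q1, q2, q3, q4):
    return (q2 + q4) / (q1 + q2 + q3 + q4)   # correct decisions: quadrants II and IV

def baseline(q1, q2, q3, q4):
    return (q1 + q2) / (q1 + q2 + q3 + q4)   # currently successful: quadrants I and II

def utility_gain(N, T, r_xy, SD_y, z_m, C):
    # C is taken here as the testing cost per hire, matching the notes' (N)(C) term.
    return N * T * r_xy * SD_y * z_m - N * C

acc, base = accuracy(5, 10, 4, 11), baseline(5, 10, 4, 11)
print(acc, base, round((acc - base) / base, 2))   # 0.7 0.5 0.4 (a 40% improvement)

# Hypothetical selection system: 10 hires/year staying 2 years, validity .40,
# SD_y = 40% of a 30,000 salary, mean predictor z of 1.0, testing cost 100 per hire.
print(utility_gain(N=10, T=2, r_xy=0.40, SD_y=0.4 * 30_000, z_m=1.0, C=100))  # 95000.0
```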
Cut Scores – a reference point derived as a result of a judgment and used to divide a set of data into classifications.

Types of Cut Scores
   Relative Cut Score – a reference point in a distribution that classifies the set of data based on norm-related considerations (norm-related score).
   Fixed Cut Score – a reference point in the distribution that classifies the set of data based on judgments concerning a minimum level of proficiency required to be included (absolute cut score).
   Multiple Cut Scores – two or more cut scores with reference to one predictor for categorizing testtakers.

Methods for Setting Cut Scores
Angoff Method
o Based on the presence or absence of a particular trait, attribute, or ability.
o Provides an estimate of how testtakers with minimal competence would answer the items correctly.
o The judgments of the experts are averaged to yield the cut scores of the test.

Known Groups Method
o Entails collection of data on the predictor of interest from groups known to possess a trait, attribute, or ability.
o Involves the problem of determining where to set the cutoff due to the composition of the groups.

IRT-Based Method
o Based on testtakers' performance across all items on the test.
o Some of the total number of items must be scored correct.
o Each item is associated with a particular level of difficulty.
o In order to pass, the testtaker must answer items that are deemed to be above some minimum level as determined by experts.
o Item Mapping is used in licensing examinations.
o The Bookmark Method is used in academic applications.

Discriminant Method
o A family of statistical techniques used to shed light on the relationship between certain variables and two naturally occurring groups.

Determining the Fairness of a Test
o Measurement Bias – group differences in test scores that are unrelated to the construct being measured.
o Predictive Bias – a situation in which the predicted level of job success falsely favors one group over another.
   ➢ Single-Group Validity – characteristic of a test that significantly predicts a criterion for one class of people but not for another.
   ➢ Differential Validity – characteristic of a test that significantly predicts a criterion for two groups, such as both minorities and nonminorities, but predicts significantly better for one of the two groups.

Test Development – the process of creating a test.

Test Conceptualization
The challenge can always be found in the beginning. To guide us in conceptualizing the test that we are making, we should pay attention to these questions:
✓ What is the test designed to measure?
✓ What is the objective of the test?
✓ Is there a need for this test?
✓ Who will use and take this test?
✓ What will the test cover?
✓ What is the ideal format for the test?

Search for Content Domain
o Grounded theory (ask people informally and in general)
o Pattern analysis (key informants)
   - Interview
   - Focus group discussions
o Pattern analysis (review of related literature)
   - Theses and dissertations in academic libraries
   - Popular sources (broadsheets, social media, etc.)
   - Scientific literature (refereed and abstracted journals)

Defining the Test Universe, Audience, and Purpose
o Defining the test universe
   - prepare a working definition of the construct
   - locate studies that explain the construct
   - locate current measures of the construct
o Defining the target audience
   - make a list of characteristics of persons who will take the test—particularly those characteristics that will affect how test takers will respond to the test questions (e.g., reading level, disabilities, honesty, language)
o Defining the purpose
   - includes not only what the test will measure, but also how scores will be used
   - e.g., will scores be used to compare test takers (normative approach) or to indicate achievement (criterion approach)?
   - e.g., will scores be used to test a theory or to provide information about an individual?

Developing a Test Plan
o A test plan includes a definition of the construct, the content to be measured (test domain), the format for the questions, and how the test will be administered and scored.

Defining the Construct
o Define the construct after reviewing literature about the construct and any available measures.
o Operationalize it in terms of observable and measurable behaviors.
o Provide boundaries for the test domain (what should and shouldn't be included).
o Specify the approximate number of items needed.

In writing items in a questionnaire, the theoretical foundation is very important. Items in a test that you develop must always be anchored on a theory. The theory has to be backed up with related literature to enhance its relevance in the present time. A comprehensive RRL will also show the application of the theory in various fields of discipline, which will support its use in the present.

Test Construction – the stage where you need to write the items.

Scaling – the process of setting rules for assigning numbers in measurement. Stated another way, scaling is the process by which a measuring device is designed and calibrated and by which numbers (or other indices)—scale values—are assigned to different amounts of the trait, attribute, or characteristic being measured.

1. Rating Scale – a grouping of words, statements, or symbols on which judgments of the strength of a particular trait, attitude, or emotion are indicated by the testtaker.

Likert Scales
o A series of statements that express either a favourable or unfavourable attitude toward the concept under study.
o The respondent is asked the level of agreement or disagreement with each statement.
o Each respondent is given a numerical score to reflect how favourable or unfavourable his or her attitude is toward each statement.
o The scores are then totalled to measure the respondent's attitude.

5-point scales
   Satisfaction: 1. Very dissatisfied, 2. Dissatisfied, 3. Neither, 4. Satisfied, 5. Very satisfied
   Likelihood: 1. Very unlikely, 2. Unlikely, 3. Neutral, 4. Likely, 5. Very likely
   Level of concern: 1. Very unconcerned, 2. Unconcerned, 3. Neutral, 4. Concerned, 5. Very concerned
   Agreement: 1. Strongly disagree, 2. Disagree, 3. Neither agree nor disagree, 4. Agree, 5. Strongly agree
   Frequency: 1. Never, 2. Rarely, 3. Sometimes, 4. Often, 5. Always
   Awareness: 1. Very unaware, 2. Unaware, 3. Neither aware nor unaware, 4. Aware, 5. Very aware
   Familiarity: 1. Very unfamiliar, 2. Unfamiliar, 3. Somewhat familiar, 4. Familiar, 5. Very familiar
   Quality: 1. Very poor, 2. Poor, 3. Acceptable, 4. Good, 5. Very good
   Importance: 1. Very unimportant, 2. Unimportant, 3. Neutral, 4. Important, 5. Very important

2. Methods of Paired Comparisons – testtakers are presented with pairs of stimuli (two photographs, two objects, two statements), which they are asked to compare. They must select one of the stimuli according to some rule; for example, the rule that they agree more with one statement than the other, or the rule that they find one stimulus more appealing than the other.

   Select the behaviour that you think would be more justified:
   a. cheating on taxes if one has a chance
   b. accepting a bribe in the course of one's duties

3. Q-sort
o Here, stimuli such as printed cards, drawings, photographs, or other objects are typically presented to testtakers for evaluation.
o One method of sorting, comparative scaling, entails judgments of a stimulus in comparison with every other stimulus on the scale.
o Another scaling system that relies on sorting is categorical scaling. Stimuli are placed into one of two or more alternative categories that differ quantitatively with respect to some continuum.

4. Guttman Scale – items on it range sequentially from weaker to stronger expressions of the attitude, belief, or feeling being measured. A feature of Guttman scales is that all respondents who agree with the stronger statements of the attitude will also agree with the milder statements.

   Do you agree or disagree with each of the following:
   a. All people should have the right to decide whether they wish to end their lives.
   b. People who are terminally ill and in pain should have the option to have a doctor assist them in ending their lives.
   c. People should have the option to sign away the use of artificial life-support equipment before they become seriously ill.
   d. People have the right to a comfortable life.

Tips in Writing Items:
1. Avoid double-barreled statements. Example: "I am sad and anxious."
2. Always state items in the first-person point of view to make them more personal. Example: "I always feel excited about new toys."
3. Watch your language.
4. Keep your statements specific.
5. Always include negatively worded items to test the consistency of responses.
6. Include synonymous items.
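As a sketch of the Likert scoring idea above — summing item responses, with negatively worded items reverse-scored as tip 5 implies — assuming a hypothetical 5-point questionnaire:

```python
# Hypothetical 5-point Likert scoring sketch. reverse_keyed holds the indices
# of negatively worded items, which are scored as (points + 1 - response).

def score_likert(responses, reverse_keyed, points=5):
    total = 0
    for i, r in enumerate(responses):
        total += (points + 1 - r) if i in reverse_keyed else r
    return total

# Responses to 4 items; item at index 2 is negatively worded.
print(score_likert([4, 5, 2, 3], reverse_keyed={2}))  # 4 + 5 + (6-2) + 3 = 16
```

Reverse-keying keeps a high total consistently meaning a more favourable attitude, which is what makes the summed score interpretable.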
o In constructing test items, we should also take into
consideration the format of the item you will be utilizing.
Semantic Differential Scale – a survey or questionnaire rating scale that
o Item format could selected-response or constructed response.
asks people to rate a product, company, brand, or any ‘entity’ within the
➢ Selected Response format requires testtakers to select
frames of a multi-point rating option.
a response from a set of alternative responses.
Examples: Multiple Choice, True or False, Matching Essay – a test item that requires the testtaker to respond to a question by
Type writing a composition, typically one that demonstrates recall of facts,
➢ Constructed Response format requires testtakers to understanding, analysis, and/or interpretation.
supply or to create the correct answer, not merely to
Writings Items for Computer Administration
select it.
Two advantages of using digital media:
Examples: Essay, Short Answer Discussion, and
Item Bank – ability to store items
Identification
item Branching – ability to individualize
o The use of selected-response format can mostly be seen in
Ceiling Effects – diminished utility of an assessment tool for distinguishing
ability tests, where there are right or wrong answers. However,
testtakers at the high end of the ability, trait, or other attribute being
personality tests are also using this response format with a
measured.
different way of interpreting them.
o The three types of selected-response formats are multiple
Scoring Items
choice, matching, and true/false.
Cumulative Model – th4e higher the score on the test, the
higher the testtaker is on the ability, trait, or other characteristic
Multiple-Choice Format
that the test purports to measure.
A multiple-choice format has three elements:
Class Scoring (Category Scoring) – testtaker responses earn
1. A stem
credit toward placement in a particular class or category with
2. A correct alternative
their testtakers whose pattern of responses is presumably
3. Several distractors
similar in some way.
A psychological test, an interview, and a case study are: ----- Stem Ipsative Scoring – comparing a testtaker’s score on one scale
a. Psychological assessment tools ----- Correct Alternative within a test to another scale within that same test.
b. Standardized behavioral samples ----- Several Distractors
Test Tryout – the act of trying out items on people who are similar in
c. Reliable assessment instruments ----- Several Distractors
critical respects to the people for whom the test was designed.
d. Theory-linked measures ----- Several Distractors

The most critical question is on how many people on whom the test should
A good multiple-choice item has this following characteristics:
be tried out?
a. Has one correct alternative
b. Has grammatically parallel alternatives
The informal rule of thumb says that there should be no less than 5
c. Has alternatives of similar length
subjects, and a maximum of 10 subjects for each item on a test.
d. Has alternatives that fit grammatically with the stem
e. Includes as much of the item as possible in the stem to avoid The more number of participants there is, the lesser the chance that the
necessary repetition test will be psychometrically unsound.
f. Avoids ridiculous distractors
The attempt to try out constructed test items should replicate the ideal conditions under which the standardized test will be administered.

Matching Type
o An item format that consists of premises and responses.
o Premises are on the left column, responses on the right column. Both of them should be homogeneous.
o The matching type is very susceptible to guessing. In order to avoid this, the number of responses should be more than the number of premises.

True/False Format
o Also called the binary-choice item, for it includes only 2 choices.
o A good true/false item contains only a single idea, is not excessively long, and is not subject to debate.

Completion Item
o Also called the short-answer item, or popularly known as identification.
o A good completion item requires the examinee to provide a word or phrase that completes a sentence.
o A good completion item should be worded in a way that the correct answer is specific.

   The standard deviation is generally considered the most useful measure of ________________. (Variability)
   Or
   What descriptive statistic is generally considered the most useful measure of variability? (Standard Deviation)

What is a good item?
A good item is an item that passed the 4-way test:
   Test for validity
   Test for reliability
   Test for item difficulty
   Test for item discrimination

Item Analysis – a general term for a set of methods used to evaluate test items. Generally, this refers to the assessment of Item Difficulty and Item Discrimination.

Item Difficulty – defined as the proportion of people who get a particular item correct. For example, if 84% of the students taking a particular test get item 1 correct, then the item difficulty index for that item is 0.84.

How do we know then if that item is a good item? We make use of the optimal item difficulty index.

Formula for getting the optimal item difficulty index:
   (1 + chance of getting it correct) / 2

Supposing you constructed a multiple-choice item with 4 alternatives: (1 + 0.25) / 2 = 0.625
In a true/false item: (1 + 0.5) / 2 = 0.75
In a multiple-choice item with 5 alternatives: (1 + 0.2) / 2 = 0.60
In a multiple-choice item with 3 alternatives: (1 + 0.33) / 2 = 0.67

If an item discriminates negatively or has zero discrimination, it is to be rejected whatever the difficulty value.

A floor effect is when most of your subjects score near the bottom. There is very little variance because the floor of your test is too high.
A ceiling effect is the opposite: all of your subjects score near the top. There is very little variance because the ceiling of your test is too low.
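The optimal-difficulty computation above is a one-liner, with chance equal to 1 divided by the number of alternatives:

```python
# Optimal item difficulty = (1 + chance of guessing correctly) / 2,
# where chance = 1 / number of alternatives.

def optimal_difficulty(n_alternatives):
    return (1 + 1 / n_alternatives) / 2

print(optimal_difficulty(4))  # 0.625
print(optimal_difficulty(2))  # 0.75 (true/false)
print(optimal_difficulty(5))  # 0.6
```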

D.V           Item Evaluation
0.20 – 0.30   Most Difficult
0.30 – 0.40   Difficult
0.40 – 0.60   Moderately Difficult
0.60 – 0.70   Easy
0.70 – 0.80   Most Easy

According to experts, the ideal item difficulty index that an item should possess is from 0.3 up to the optimal item difficulty index. Anything lower or higher than that, and the item needs to be discarded or revised. However, this rule is not absolute.

General Guidelines for the Difficulty Index
o A low difficulty value index means that the item is a highly difficult one.
   - Ex: D.V = 0.20 » only 20% answered that item correctly, so the item is too difficult.
o A high difficulty value index means that the item is an easy one.
   - Ex: D.V = 0.80 » 80% answered that item correctly, so the item is too easy.

Item Discrimination is the ability of an item on the basis of which discrimination is made between superiors and inferiors (high and low scorers).

An item could have no discriminating power, positive discriminating power, or negative discriminating power.
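The difficulty bands in the table above can be expressed as a small classifier. The shared boundary values (e.g., 0.30) are assigned to the lower band here by assumption, since the table does not say which band owns them.

```python
# Difficulty-value bands from the table above; boundary ownership is an assumption.

def classify_difficulty(dv):
    if dv <= 0.30:
        return "Most Difficult"
    elif dv <= 0.40:
        return "Difficult"
    elif dv <= 0.60:
        return "Moderately Difficult"
    elif dv <= 0.70:
        return "Easy"
    return "Most Easy"

print(classify_difficulty(0.84))  # Most Easy -> likely too easy; revise or discard
print(classify_difficulty(0.50))  # Moderately Difficult
```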
Steps in Conducting Item Discrimination
   Step 1: Arrange the scores in descending order.
   Step 2: Take the 27% highest scorers (U) and the 27% lowest scorers (L).
   Step 3: Identify how many of the highest scorers and lowest scorers got each item correct.

   Zero Discrimination or No Discrimination – equal numbers of the highest and lowest scorers got the item correct (U − L = 0).
   Positive Discrimination – more of the highest scorers got the item correct.
   Negative Discrimination – more of the lowest scorers got the item correct.

Item   U    L    U−L    n    d [(U−L)/n]
1      20   16   4      32   .13
2      30   10   20     32   .63
3      32   0    32     32   1
4      20   20   0      32   0
5      0    32   −32    32   −1

   Items 1, 2, and 3 have positive (good) discriminating power.
   Item 4 does not have discriminating power.
   Item 5 discriminates negatively and should be rejected or revised.

Relationship between difficulty value and discriminating power:
o Both (D.V & D.I) are complementary, not contradictory, to each other.
o Both should be considered in selecting good items.
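The discrimination index in the table above is d = (U − L) / n; a sketch using the tabled counts:

```python
# Item discrimination sketch: U = number correct in the highest-scoring group,
# L = number correct in the lowest-scoring group, n = size of one group
# (each group is 27% of testtakers in the notes' method; here n = 32).

def discrimination_index(u_correct, l_correct, n):
    return (u_correct - l_correct) / n

table = {1: (20, 16), 2: (30, 10), 3: (32, 0), 4: (20, 20), 5: (0, 32)}
for item, (u, l) in table.items():
    print(item, discrimination_index(u, l, 32))
# 0.125, 0.625, 1.0, 0.0, -1.0 (reported in the notes as .13, .63, 1, 0, -1)
```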