
PSYCHOLOGICAL

(TESTING) ASSESSMENT
A review…
Why are you here?
•House Rules…
Before exam!
During exam!
After exam!
Topic Outline
• Basic Definitions…
• Basic Concepts/Assumptions in Psychological Testing and Assessment
• History of Psychological Testing (History of Psychometrics)
• Basic Psychological Statistics
• Basic Principles and Tools of (Psychological) Measurement
• Norms
• Reliability & Validity
• Test Development
• Psychological Tests and Applications of Tests
• Ethics in Psychological Testing
Psychological Assessment
• Psychological assessment is a process of testing that uses a combination of
techniques to help arrive at some hypotheses about a person and their behavior,
personality and capabilities.
• According to Philippine law, RA 10029 (the Philippine Psychology Act of 2009),
psychological assessment is the "gathering and integration of
psychology-related data for the purpose of making a psychological
evaluation, accomplished through a variety of tools, including
individual tests, projective tests, clinical interview and other
psychological assessment tools, for the purpose of assessing diverse
psychological functions including cognitive abilities, aptitudes,
personality characteristics, attitudes, values, interests, emotions
and motivations, among others, in support of psychological
counseling, psychotherapy and other psychological interventions."
• In Psychological Testing in the Philippines: Practice, Directions and
Perspectives (2013), Munarriz and Cervera draw a clear distinction among
the terms psychometrics, psychological testing, and psychological
assessment:
• Psychometrics refers to the theory, technique
and development of psychological measurements;
• Psychological testing, to the application of tests;
and
• Psychological assessment, to the use of tests,
among other tools, to carry out a psychological evaluation.
Who/What is a Psychometrician?
Psychological Testing
and Assessment
Basic Concepts and Their Differences
Psychological Testing
✓ an objective and standardized measure of a sample of behavior;
✓ the process of measuring psychology-related variables by means of devices
or procedures designed to obtain a sample of behavior;
✓ the process of administering, scoring and interpreting Ψ tests.
• There are a number of core principles that form the foundation for
psychological assessment:
• Tests are samples of behavior.
• Tests do not directly reveal traits or capacities, but may allow inferences to be
made about the person being examined.
• Tests should have adequate reliability and validity.
• Test scores and other test performances may be adversely affected by temporary
states of fatigue, anxiety, or stress; by disturbances in temperament or
personality; or by brain damage.
• Test results should be interpreted in light of the person’s cultural background,
primary language, and any handicaps.
• Test results are dependent on the person’s cooperation and motivation.
• Tests purporting to measure the same ability may produce different scores for
that ability.
• Test results should be interpreted in relation to other behavioral data and to case
history information, never in isolation.
Psychological Assessment
✓ The gathering and integration of psychology-related data for the purpose
of making a Ψ evaluation, accomplished through the use of tools such as
tests, interviews, case studies, behavioral observation & other designed
apparatus and measurement procedure;
✓ Extends beyond obtaining a number;
✓ Focuses on how the individual processes information rather than on the
results of that processing;
✓ Less emphasis on the measurement of strength of traits; and
✓ More emphasis on the understanding of problems in their social context
• Psychological assessment is a powerful tool, but its
effectiveness depends upon the skill and knowledge of the
person administering and interpreting the test. When used
wisely and in a cautious manner, psychological assessment
can help a person learn more about themselves and gain
valuable insights. When used inappropriately, psychological
testing can mislead a person who is making an important
life decision or decision about treatment, possibly causing
harm.
BASIC
ASSUMPTIONS
In Psychological Testing and
Assessment
Assumption 1: Psychological Traits and States Exist

• A trait is a relatively enduring (i.e., long-lasting) characteristic on
which people differ.
• A state is a less enduring or more transient characteristic on which
people differ.
• Traits and states are actually social constructions, but
they are real in the sense that they are useful for
classifying and organizing the world, they can be
used to understand and predict behavior, and they
refer to something in the world that we can measure.
Individual Differences
✓ The cornerstone of psychological measurement - that there are real,
relatively stable differences between people.
• This means that people differ in measurable ways in their behavior and
that the differences persist over a sufficiently long time.
• Researchers are interested in assigning individuals numbers
that will reflect their differences.
• Psychological tests are designed to measure specific
attributes, not the whole person.
• These differences may be large or small.
Assumption 2: Psychological Traits and States Can Be
Quantified and Measured
• The second assumption is that the defined traits and states can be reliably
quantified and measured.
• Deciding how to measure some aspect is nearly as difficult as deciding what to
measure. Cohen & Swerdlik (2002) use the term “aggressive” to illustrate this point. An
“aggressive salesperson,” “aggressive killer,” “aggressive dancer,” and “aggressive
football player” all utilize “aggressiveness” in a different manner (Cohen & Swerdlik,
2002, p. 14). So how does one go about measuring such a phenomenon? Very carefully.
• For nominal scales, the number is used as a marker.
• For the other scales, the numbers become more and more quantitative as you move
from ordinal scales (shows ranking only) to interval scales (shows amount, but lacks a
true zero point) to ratio scales.
• Most traits and states measured in education are taken to be at the interval level of
measurement.
Assumption 4: Tests and other measurement techniques
have strengths and weaknesses
• All tests and measurement devices have strengths and weaknesses.
• The ability of tests to measure constructs effectively must be researched
and discussed before such tests are useful to psychology.
• An argument for collaboration with other test professionals is that they
share different expertise with different psychological assessment tools.
• Understanding the limitations of a test is “emphasized repeatedly in the
codes of ethics of associations of assessment professionals” (Cohen &
Swerdlik, 2002, p. 19).
Assumption 5: Various sources of error are part of the
assessment process
• Error is simply a part of the assessment process.
• Cohen and Swerdlik (2002) write that "potential sources of error are legion" (p. 18).
• Humans are not perfect and cannot act so; anything produced and evaluated by a
human is imperfect. The realization that every diagnosis is based only on an
approximation of an evaluation, not on a "true" evaluation, is enough to make one cringe.
• Factors other than what a test attempts to measure, will, to some extent, influence an individual’s
performance on a test (Cohen & Swerdlik, 2002).
• In addition, test administrators and developers are also sources of error. Because there are many
sources of bias and error in psychological testing, the practicing psychologist must be well versed
in the statistical analysis of individual tests, open to collaboration with other knowledgeable
professionals, reflective and consistent in testing practices, and treat all test takers in an equal
fashion.
• Analysis of results should always be critically scrutinized so that conclusions based upon
interpretations from test results reflect a high level of validity and usefulness.
Assumption 6: Testing and assessment can be
conducted in a fair and unbiased manner
• Tests must be conducted in a fair and unbiased manner.
• This premise is so important that Cohen and Swerdlik (2002) remark, “If
we had to pick the one of these 12 assumptions that is more
controversial than the remaining 11, this one is it” (p. 20).
• Tests that are not administered or interpreted in as "fair" and "unbiased" a
manner as possible are worthless to the assessor, psychologist, and client.
• The fact that psychological measuring tools are relied upon to measure
important aspects of human psychology demands that the utmost care
and attention be given to their fairness.
Assumption 7: Testing and assessment
benefit society
• Testing is useful.
• If the testing and assessment process was not beneficial to a variety of agencies of society it would
not be a multi-billion dollar industry. It serves to categorize, prioritize, criminalize, organize, and
many other –ize’s. Tests are useful to measure progress and change in the professional world as
well as education.
• Evaluations are necessary for society to function on a highly complex level.
Requirements for medical doctors and other licensed health professionals stem
from the premise that these people are "highly qualified and competent" in their
knowledge and abilities. Without tests and measurements there would be no
accountability for the qualifications of individuals in positions that demand it.
• Without testing, hiring the right person for a given position would rest on
nothing more than one's own notions.
• Educational difficulties among students would go unidentified, because there
would be no instruments for diagnosis.
✓ Tests are there to help make decisions.
HISTORY OF
PSYCHOLOGICAL TESTING
(HISTORY OF PSYCHOMETRICS)
History of Psychometrics
• Chinese influence
• Individual Differences: Darwin and Galton
• Experimental Psychologists
• The study of mental deficiency
• Intelligence Testers
• Personality Testers

History of Psychometrics: Chinese influence
• 2000 B.C.E.
• Scattered evidence of civil service testing in China
• Every 3rd year in China, oral examinations were given to
help determine work evaluations and promotion decisions.
• 206 B.C.E. to 220 C.E.
• Han Dynasty in China develops test batteries
• two or more tests used in conjunction.
• Test topics include civil law, military affairs, agriculture,
revenue, geography
History of Psychometrics: Chinese influence
• 1368 C.E. to 1644 C.E.
• Ming Dynasty in China develops multistage testing
• Local tests lead to provincial capital tests; capital tests lead
to national capital tests
• Only those who passed the national tests were eligible for
public office
• 1832
• English East India Company copies the Chinese system to select
employees for overseas duty.
History of Psychometrics: Chinese influence

• 1855
• British Government adopts the English East India Company
selection examinations.
• French and German governments follow shortly.

• 1883
• United States establishes the American Civil Service Commission
• Developed and administered competitive examinations
for government service jobs.
History of Psychometrics: Individual
Differences, Darwin and Galton
• Individual differences - despite our similarities,
no two humans are exactly alike.
• Charles Darwin
• some of these individual differences are more
“adaptive” than others
• these individual differences, over time, lead to
more complex, intelligent organisms

History of Psychometrics: Individual
Differences, Darwin and Galton
• Sir Francis Galton
• “Applied Darwinist”: some people possessed
characteristics that made them “more fit” than others.
• Wrote Hereditary Genius (1869)
• Sets up an anthropometric (measurement of the human
individual) laboratory at the International Exposition of
1884
• For 3 pence, visitors could be measured with:
• The Galton Bar - visual discrimination of length
• The Galton Whistle (aka "dog whistle") - determining the
highest audible pitch
History of Psychometrics: Individual
Differences, Darwin and Galton
• Galton’s Anthropometric Lab

Individual Differences:
Darwin and Galton

• Galton also noted that persons with mental retardation tend to have a
diminished ability to discriminate among heat, cold and pain.
• Other advances of Galton's:
• Considered by some the founder of psychometrics
• Pioneered rating scales and questionnaires
• First to document the individuality of fingerprints
• First to apply statistics in the measurement of humans
History of Psychometrics:
Galton’s Famous Students
• Karl Pearson
• Does the name Pearson sound familiar?
• student of Galton
• extended Galton’s early work with statistical
regression
• James McKeen Cattell
• first to use the term “mental test”
• U.S. dissertation on individual differences in reaction time,
based upon Galton's work
History of Psychometrics:
Early Experimental Psychologists
• Early 19th century scientists, generally interested in
identifying common aspects, rather than individual
differences.
• Differences between individuals were considered a
source of error which rendered human measurement
inexact.
• Sounds a lot like things from your past (e.g. ANOVA)
and your coming future
History of Psychometrics:
Early Experimental Psychologists
• Johann Friedrich Herbart - mathematical models of the mind; founder of
pedagogy as an academic discipline; went against Kant
• Ernst Heinrich Weber - sensory thresholds; just noticeable
difference (JND)
• Gustav Theodor Fechner - mathematics of sensory thresholds of
experience; founder of psychophysics; considered one of the founders of
experimental psychology; the Weber-Fechner Law was the first to relate
sensation and stimulus; considered by some the founder of psychometrics
History of Psychometrics:
Early Experimental Psychologists
• Fechner influenced many prominent
psychologists (e.g. Wundt, Freud)
• Wilhelm Wundt – considered one of the
founders of psychology; first to set up a psych
laboratory
• Edward Titchener – succeeded Wundt;
brought Structuralism to America; His brain is
still on display in the psychology department
at Cornell

History of Psychometrics:
Early Experimental Psychologists
• Guy Montrose Whipple – student of Titchener; pioneer of human ability
testing; conducted seminars that changed the field of psychological
testing; the APA issued its first set of standards for professional
psychological testing because of his criticisms

• Louis Leon Thurstone – large contributor to factor analysis; attended
Whipple's seminars; his approach to measurement was termed the law of
comparative judgment

• Factor analysis is a method of finding the minimum number of dimensions
(characteristics or attributes, called factors) needed to account for a
large number of variables (e.g. active, talkative, outgoing – Extroversion).
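To make the idea concrete, here is a minimal sketch (not from the original slides) of factor analysis on simulated questionnaire data. The item names, loadings, and use of scikit-learn's FactorAnalysis are all illustrative assumptions:

```python
# Hypothetical illustration: six observed items driven by two latent factors.
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)
n = 500
extraversion = rng.normal(size=n)   # latent factor 1 (never observed directly)
anxiety = rng.normal(size=n)        # latent factor 2

# Each observed item = loading * latent factor + noise
items = np.column_stack([
    0.9 * extraversion + 0.3 * rng.normal(size=n),  # "active"
    0.8 * extraversion + 0.3 * rng.normal(size=n),  # "talkative"
    0.7 * extraversion + 0.3 * rng.normal(size=n),  # "outgoing"
    0.9 * anxiety      + 0.3 * rng.normal(size=n),  # "worried"
    0.8 * anxiety      + 0.3 * rng.normal(size=n),  # "tense"
    0.7 * anxiety      + 0.3 * rng.normal(size=n),  # "nervous"
])

fa = FactorAnalysis(n_components=2, random_state=0).fit(items)
# Two rows of loadings: each row groups the three items that belong together,
# recovering the two dimensions from six variables.
print(np.round(fa.components_, 2))
```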
History of Psychometrics:
Interest in Mental Deficiency
• 1805 – Jean-Étienne Esquirol, French physician
• Favorite student of Philippe Pinel (founder of psychiatry)
• Wrote a manuscript on "mental retardation"
• Differentiated between insanity and mental retardation:
• insanity involves a period of normal intellectual functioning
• Recognized many degrees of mental retardation, from
normality to "low-grade idiocy"
• Attempted to develop a system to classify people into these
many degrees, but found that the individual's use of
language provided the most dependable continuum
History of Psychometrics:
Interest in Mental Deficiency
• 1840s - Edouard Seguin, French Physician
• Pioneer in training mentally-retarded persons.
• Rejected the notion that mental retardation was incurable
• 1837: opens first school devoted to teaching MR
children.
• 1848: emigrates to USA, wide acceptance of theories
• 1866: experiments with physiological training of MR
• sense-training / muscle-training still used today
• leads to nonverbal tests of intelligence (Seguin
Form Board)
History of Psychometrics:
Intelligence Testing
• Alfred Binet
• 50 years after Esquirol & Seguin -- 1905
• French Society for the Psychological Study of
the Child urged French ministers to develop
special classes for children who failed to
respond to normal schooling.
• Ministers required a way to identify the
children

History of Psychometrics:
Intelligence Testing
• Alfred Binet
• First Intelligence Test: Binet-Simon Scale of 1905
• 30 items of increasing difficulty
• Standardized administration
• Same instructions & format for ALL children
• Standardization sample
• Created norms by which the performance of one child can be
compared with that of other children.
History of Psychometrics:
Intelligence Testing
• Alfred Binet
• Standardization Sample
• 50 Normal children aged 3-11yrs
• “Some” mentally retarded children and adults

• 1908 Binet-Simon Scale
• More items (greater reliability)
• Better standardization sample (300 normal children)
• Introduction of Mental Age
History of Psychometrics:
Intelligence Testing
• Alfred Binet’s legacy
• 1911 Binet-Simon, minor revision
• Binet dies
• 1912 Kuhlmann-Binet revision
• Extends testing downward to 3 months of age
• 1916: Lewis Madison Terman and Stanford colleagues revise
Binet's test for use in the United States
• More psychometrically sound
• Introduction of the term IQ
• IQ = (Mental Age / Chronological Age) × 100
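For example, a child with a mental age of 10 and a chronological age of 8 would obtain an IQ of (10 / 8) × 100 = 125.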
History of Psychometrics:
Intelligence Testing
• World War I - Robert Yerkes
• Need for large-scale group administered ability tests by the
army
• Army commissions Yerkes, then head of the American
Psychological Association, to develop two structured tests
of human abilities
• Army Alpha - required reading ability
• Army Beta - did not require reading ability

• Testing “frenzy” hits between World War I and the 1930s.

History of Psychometrics:
Intelligence Testing
• Testing Frenzy of the 1930s
• The 1937 revision of the Stanford-Binet includes over 3,000
individuals in its standardization sample
• 1939 Wechsler-Bellevue Intelligence Scale
• David Wechsler
• Subscales were "adopted" from the Army scales
• Produces several scores of intellectual ability rather than
Binet's single score (e.g. Verbal, Performance, Full Scale)
• Evolves into the Wechsler series of intelligence tests (e.g.
WAIS, WISC, etc.)
History of Psychometrics:
Personality Testing
• Rise – 1920s, Fall – 1930s, Slow Rise – 1940s
• Intended to measure personality traits
• Trait: relatively enduring dispositions (tendencies to act,
think or feel in a certain manner in any given circumstance)
that distinguish one individual from another
• NOT temporary states

History of Psychometrics:
Personality Testing
• First Rise and Fall: Structured Tests
• Woodworth Personal Data Sheet
• First objective personality test meant to aid in psychiatric
interviews
• Developed during World War I
• Designed to screen out soldiers unfit for duty
• Mistakenly assumed that a subject's response could be
taken at face value
History of Psychometrics:
Personality Testing
• Sample Woodworth items (answered Yes or No):
1. I wet the bed.
2. I drink a quart of whiskey each day.
3. I am afraid of closed spaces.
4. I believe I am being followed.
5. People are out to get me.
6. Sometimes I see or hear things that other people do not hear or see.
History of Psychometrics:
Personality Testing
• Slow Rise: Projective Tests
• Hermann Rorschach's inkblot test (1921)
• Initially regarded with great suspicion; the first serious
study appeared in 1932.
• Symmetric colored and black-and-white inkblots.
History of Psychometrics:
Personality Testing

• Rorschach inkblot example

History of Psychometrics:
Personality Testing
• Thematic Apperception Test
• Henry Murray and Christiana Morgan (1935)
• “Ambiguous” pictures though considerably more
structured than the Rorschach
• Subjects are shown the pictures and asked to write a
story including:
• what has led up to the event shown
• what is happening at the moment
• what the characters are feeling and thinking, and
• what the outcome of the story will be.
History of Psychometrics:
Personality Testing
Thematic Apperception Test Example

History of Psychometrics:
Personality Testing
• Second coming of the Structured Test
• Early 1940s – structured tests were being developed with better
psychometric properties.
• Minnesota Multiphasic Personality Inventory (MMPI; 1943)
• Tests like the Woodworth made too many assumptions
• The meaning of a test response could only be determined
by empirical research
• Most widely used (MMPI-2, MMPI-A)
History of Psychometrics:
Personality Testing

• Sixteen Personality Factor Questionnaire (16PF)
• Raymond B. Cattell (early 1940s)
• Based on factor analysis – a method for finding the
minimum number of dimensions (factors) for explaining
the largest number of variables
BASIC PRINCIPLES
AND TOOLS OF
MEASUREMENT
SCALES OF MEASUREMENT
Measurement - the act of assigning numbers or
symbols to characteristics of things, people or
events according to rules.
Scale - a set of numbers or other symbols whose
properties model empirical properties of the
objects to which the numbers are assigned.
Measurement
• The process of assigning numbers to objects in such a way that
specific properties of the objects are faithfully represented by
specific properties of the numbers.
• Psychological tests do not attempt to measure the total person,
but only a specific set of attributes.
• Measurement is used to capture some “construct”
- For example, if research is needed on the construct of
“depression”, it is likely that some systematic measurement tool
will be needed to assess depression.
Types of Measurement Scales
1. Nominal
2. Ordinal
3. Interval
4. Ratio
Types of Measurement Scales
Nominal Scales –
✓ there must be distinct classes, but these classes have no quantitative
properties; therefore, no comparison can be made in terms of one category
being higher than another;
✓ involve CLASSIFICATION or CATEGORIZATION based on one or more
distinguishing characteristics;
✓ categories are mutually exclusive and exhaustive;
✓ values cannot be meaningfully added, subtracted, ranked or averaged.
For example - there are two classes for the variable gender --
males and females. There are no quantitative properties for
this variable or these classes and, therefore, gender is a
nominal variable.

Other Examples:
country of origin; disorders listed in DSM-5; Yes/No responses;
animal vs. non-animal; married vs. single
Nominal Scale
• Sometimes numbers are used to designate category
membership
Example:
Country of Origin
1 = United States 3 = Canada
2 = Mexico 4 = Other
• However, in this case, it is important to keep in mind
that the numbers do not have intrinsic meaning
Ordinal Scales
• There are distinct classes, but these classes have a
natural ordering or ranking. The differences can be
ordered on the basis of magnitude (the property of
moreness); there is no absolute zero point.

Example:
Final position of horses in a thoroughbred race.
The horses finish first, second, third, fourth, and so
on. The difference between first and second is not
necessarily equivalent to the difference between
second and third, or between third and fourth.
Ordinal Scales
• Does not assume that the intervals between
numbers are equal.

Example:
finishing place in a race (first place, second place, and so on) –
the time gaps between successive finishers need not be equal.

Interval Scales
• It is possible to compare differences in magnitude, but importantly the
zero point does not have a natural meaning. It captures the properties
of nominal and ordinal scales -- used by most psychological tests.
• Designates an equal-interval ordering - the distance between, for
example, a 1 and a 2 is the same as the distance between a 4 and a 5; it
is possible to average a set of measurements and obtain a meaningful
result.
• Example: Celsius temperature is an interval variable. It is meaningful to
say that 25 degrees Celsius is 3 degrees hotter than 22 degrees Celsius,
and that 17 degrees Celsius is the same amount hotter (3 degrees) than
14 degrees Celsius.
• Notice, however, that 0 degrees Celsius does not have a natural
meaning. That is, 0 degrees Celsius does not mean the absence of heat!
Ratio Scales
• Captures the properties of the other types of scales, but also
contains a true zero, which represents the absence of the
quality being measured; all mathematical operations can be
performed.

• For example:
• Heartbeats per minute has a very natural zero point: zero means
no heartbeats. Weight (in grams) is also a ratio variable; again,
the zero value is meaningful, as zero grams means the absence of
weight. The same holds for the number of intimate relationships a
person has had: 0 quite literally means none, and a person who has
had 4 relationships has had twice as many as someone who has had 2.
• Each of these scales has different properties (i.e., difference,
magnitude, equal intervals, or a true zero point) and allows for
different interpretations.

• The scales are listed in hierarchical order: nominal scales have the
fewest measurement properties, and ratio scales have the most, including
the properties of all the scales beneath them in the hierarchy.

• The goal is to be able to identify the type of measurement scale, and
to understand the proper use and interpretation of that scale.
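As an illustration (not part of the original slides), here is a minimal Python sketch with hypothetical data showing which summary statistics are meaningful at each level:

```python
# Hypothetical data at each level of measurement.
import statistics

nominal = ["US", "Mexico", "Canada", "Mexico"]   # labels only
ordinal = [1, 2, 3, 4]                            # finishing places: order, not distance
interval = [25.0, 22.0, 17.0, 14.0]               # degrees Celsius: equal intervals, no true zero
ratio = [4, 2, 0, 8]                              # counts: a true zero means "none"

print(statistics.mode(nominal))     # the mode is the only meaningful "average" here
print(statistics.median(ordinal))   # the median respects order without assuming equal spacing
print(statistics.mean(interval))    # means become meaningful once intervals are equal
print(ratio[3] / ratio[1])          # 4.0 -> "four times as many" requires a true zero
```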
Measures of Location or
Central Tendency
Percentiles

◼ A percentile provides information about how the data are spread over
the interval from the smallest value to the largest value.

◼ Admission test scores for colleges and universities are frequently
reported in terms of percentiles.
Quartiles
◼ Quartiles are specific percentiles:
◼ First Quartile = 25th Percentile
◼ Second Quartile = 50th Percentile = Median
◼ Third Quartile = 75th Percentile
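For instance, the quartiles of a set of hypothetical admission-test scores can be computed directly (a sketch, not from the slides):

```python
# Quartiles are specific percentiles: the 25th, 50th and 75th.
import numpy as np

scores = np.array([52, 61, 64, 70, 73, 75, 78, 81, 86, 94])  # hypothetical scores
q1, median, q3 = np.percentile(scores, [25, 50, 75])
print(q1, median, q3)  # first quartile, second quartile (= median), third quartile
```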
Measures of Variability
◼ It is often desirable to consider measures of variability
(dispersion, spread), as well as measures of location.
◼ For example, in choosing supplier A or supplier B we
might consider not only the average delivery time for
each, but also the variability in delivery time for each.
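A minimal sketch of that supplier example with hypothetical delivery times: both suppliers have the same mean, but very different variability.

```python
import statistics

supplier_a = [10, 10, 11, 9, 10]   # delivery times in days (hypothetical)
supplier_b = [5, 15, 10, 2, 18]

print(statistics.mean(supplier_a), statistics.mean(supplier_b))    # both 10.0
print(statistics.stdev(supplier_a), statistics.stdev(supplier_b))  # A is far more consistent
```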
Norms
Norms
Norm-Referenced Test: one of the most useful ways of describing a person's
performance on a test is to compare his/her test score to the test scores
of some other person or group of people.

• Norms are average scores computed for a large representative sample of
the population.

• The arithmetic average (mean) is used to judge whether a score on the
scale is above or below the average relative to the population of
interest.

• A representative sample is required to ensure meaningful comparisons
are made.
Norms (cont.)
• No single population can be regarded as the normative group.

• A representative sample is required to ensure meaningful comparisons
are made.

• When norms are collected from the test performance of groups of people,
these reference groups are labeled normative or standardized samples.
Norms (cont.)
• The sample selected as the normative group depends on the particular
research question.

• It is necessary that the normative sample be representative of the
examinee and of the research question to be answered, in order for
meaningful comparisons to be made.

• For example: tests measuring attitudes towards federalism whose norm
groups consist only of students in the province of Quebec might be very
useful for regional interpretation in Quebec; however, their
generalizability to other parts of the country (Yukon, Toronto, Ontario)
would be suspect.
Sample Groups
Although the three terms below are used interchangeably, they
are different.

Standardized Sample - the group of individuals on whom the test is
standardized in terms of scoring procedures, administration procedures,
and the development of the test's norms (e.g., the sample described in
the technical manual).

Normative Sample - can refer to any group from which norms are gathered,
including norms collected after the test is published.

Reference Group - any group of people against which test scores are
compared.
Types of Norms
Norms can be Developed:
• Locally
• Regionally
• Nationally
Normative Data Can be Expressed By:
• Percentile Ranks
• Age Norms
• Grade Norms
Local Norms
• Test users may wish to evaluate scores on the basis of reference groups
drawn from a specific geographic or institutional setting.

For example:
Norms can be created for the employees of a particular company or the
students of a certain university.

• Regional and national norms examine much broader groups.
Subgroup Norms
• When large samples are gathered to represent broadly defined
populations, norms can be reported in aggregate or can be separated
into subgroup norms.

• Provided that subgroups are of sufficient size and fairly
representative of their categories, they can be formed in terms of:

- Age
- Sex
- Occupation
- Education Level

Or any other variable that may have a significant impact on test scores
or yield comparisons of interest.
Percentile Ranks
• The most common form of norms and is the simplest method of
presenting test data for comparative purposes.

• The percentile rank represents the percentage of the norm group that
earned a raw score less than or equal to the score of that particular
individual.

For example, a score at the 50th percentile indicates that the individual
did as well as or better on the test than 50% of the norm group.

• When a test score is compared to several different norm groups,
percentile ranks may change.

For example, a percentile rank on a mathematical reasoning test may be
lower when the score is compared against mathematics students than
against music students.
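Under the "less than or equal to" definition above, a percentile rank can be computed against any norm group; here is a sketch with hypothetical scores (scipy's percentileofscore implements exactly this rule):

```python
from scipy import stats

norm_group = [45, 50, 55, 60, 62, 65, 70, 75, 80, 90]  # hypothetical norm-group scores
# kind="weak" counts scores less than or equal to the examinee's raw score
print(stats.percentileofscore(norm_group, 65, kind="weak"))  # 60.0 -> 60th percentile
```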
Age Norms
• Method of describing scores in terms of the average or typical age
of the respondents achieving a specific test score.

• Age norms can be developed for any characteristic that changes
systematically with age.

• In establishing age norms, we need to obtain a representative sample at
each of several ages and measure the particular age-related
characteristic in each of these samples.

• It is important to remember that there is considerable variability
within the same age, which means that some children at one age will
perform similarly to children at other ages.
Grade Norms
• Most commonly used in school settings.
• Similar to age norms, except the baseline is grade level rather than
age.

• It is important to remember that there is considerable variability
among individuals within a grade, which means that some children in one
grade will perform similarly to, or below, children in other grades.

• One needs to be extremely careful when interpreting grade norms not to
fall into the trap of saying that, just because a child obtains a
certain grade-equivalent on a particular test, he/she is at the same
grade level in all areas.
Evaluating Suitability of a Normative Sample

• How large is the normative sample?
• When was the sample gathered?
• Where was the sample gathered?
• How were individuals identified and selected?
• What was the composition of the normative sample?
- age, sex, ethnicity, education level, socioeconomic status
Caution When Interpreting Norms
• Norms may not be based on samples that adequately represent the type of
population to which the examinee's scores are compared.

• Normative data can become outdated very quickly.

• Consider the size of the sample taken.
Setting Standards/Cutoffs
• Rather than finding out how you stand compared to others, it might
be useful to compare your performance on a test to some external
standard.

For example - if most people in class get an F on a test and you get a D,
your performance in comparison to the normative group is good. However,
your score is still not good in absolute terms.

Criterion-Referenced Tests - assess your performance against some set of
standards (e.g., school tests, the Olympics).
Normal Distribution Curve
• Many human variables fall on a normal or close to normal curve including IQ,
height, weight, lifespan, and shoe size.

• Theoretically, the normal curve is bell shaped with the highest point at its
center. The curve is perfectly symmetrical, with no skewness (i.e., where
symmetry is absent). If you fold it in half at the mean, both sides are exactly
the same.

•In theory, the distribution of the normal curve ranges from negative infinity
to positive infinity.
Skewness
• Skewness is the nature and extent to which symmetry is
absent.
Positive skewness - when relatively few of the scores fall at the high
end of the distribution.
For example - positively skewed examination results may indicate that a
test was too difficult.

Negative skewness - when relatively few of the scores fall at the low end
of the distribution.
For example - negatively skewed examination results may indicate that a
test was too easy.
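A sketch with hypothetical exam results: a hard test piles scores at the low end (positive skew), an easy test at the high end (negative skew).

```python
from scipy.stats import skew

too_hard = [10, 12, 15, 15, 18, 20, 25, 40, 70, 95]  # few high scores -> positive skew
too_easy = [30, 60, 75, 82, 85, 85, 88, 90, 93, 95]  # few low scores  -> negative skew
print(skew(too_hard))  # > 0
print(skew(too_easy))  # < 0
```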
RELIABILITY
RELIABILITY
• The consistency of measurements
• The more error, the less reliable the measure
• Reliable measures provide consistent
measurement from occasion to occasion

A RELIABLE TEST
Produces similar scores across various
conditions and situations, including different
evaluators and testing environments.
Reliability
• Reliability refers to a measure’s ability to capture an
individual’s true score, i.e. to distinguish accurately
one person from another
• While a reliable measure will be consistent,
consistency can actually be seen as a by-product of
reliability, and in a case where we had perfect
consistency (everyone scores the same and gets the
same score repeatedly), reliability coefficients could
not be calculated
How do we account for an individual who
does not get exactly the same test score
every time he or she takes the test?
1. Test-taker’s temporary psychological or
physical state
2. Environmental factors
3. Test form
4. Multiple raters
TYPES OF RELIABILITY
1. Test-retest
2. Split half (internal consistency)
3. Inter-rater
4. Alternate forms
TEST-RETEST RELIABILITY
• Suggests that subjects tend to obtain the same score when tested at
different times.
• Correlation between time 1 & time 2 (Product-moment
correlation (r))
• Test-retest reliability refers to the consistency of participant’s
responses over time (usually a few weeks, why?)
• Assumes the characteristic being measured is stable over time—
not expected to change between test and retest.
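In practice, test-retest reliability is just the product-moment correlation between the two administrations; here is a sketch with hypothetical scores:

```python
from scipy.stats import pearsonr

time_1 = [12, 15, 9, 20, 17, 11, 14, 18]   # hypothetical scores, first administration
time_2 = [13, 14, 10, 19, 18, 10, 15, 17]  # same examinees a few weeks later
r, _ = pearsonr(time_1, time_2)
print(round(r, 2))  # high r -> scores are stable across occasions
```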
Split-Half Reliability
•Sometimes referred to as internal
consistency
•Indicates that subjects’ scores on
some trials consistently match their
scores on other trials.
Internal Consistency Reliability
• Relevant for measures that consist of more than 1 item
(e.g., total scores on scales, or when several behavioral
observations are used to obtain a single score).
• Internal consistency refers to inter-item reliability, and
assesses the degree of consistency among the items in a
scale, or the different observations used to derive a
score.
• Want to be sure that all the items (or observations) are
measuring the same construct.
Internal Consistency
• We can get a sort of average correlation among items to
assess the reliability of some measure (Cronbach’s alpha)
• Correlations amongst multiple items in a factor
• As one would most likely intuitively assume, having more
measures of something is better than few
• It is the case that having more items which correlate with
one another will increase the test’s reliability
• Is a multi-item scale measuring a single concept?
• Are items in scale consistent with one another?
Estimates of Internal Consistency
• Item-total score consistency
• Split-half reliability: randomly divide items into 2
subsets and examine the consistency in total scores
across the 2 subsets (any drawbacks?)
• Cronbach’s Alpha: conceptually, it is the average
consistency across all possible split-half reliabilities
• Cronbach’s Alpha can be directly computed from data
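A minimal sketch of that computation, using the standard formula α = k/(k−1) · (1 − Σ item variances / variance of total scores); the rating matrix below is hypothetical.

```python
import numpy as np

def cronbach_alpha(data):
    """data: respondents x items matrix of scores."""
    data = np.asarray(data, dtype=float)
    k = data.shape[1]
    item_variances = data.var(axis=0, ddof=1)       # variance of each item
    total_variance = data.sum(axis=1).var(ddof=1)   # variance of total scores
    return k / (k - 1) * (1 - item_variances.sum() / total_variance)

ratings = [  # five respondents answering four items on a 1-5 scale
    [4, 5, 4, 4],
    [2, 3, 2, 3],
    [5, 5, 4, 5],
    [1, 2, 1, 2],
    [3, 3, 3, 4],
]
print(round(cronbach_alpha(ratings), 2))  # close to 1 -> items hang together
```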
INTER-RATER RELIABILITY
Involves having two raters independently
observe and record specified behaviors,
such as hitting, crying, yelling, and getting
out of the seat, during the same time period.

TARGET BEHAVIOR
A specific behavior the observer is looking to
record
• If a measurement involves behavioral
ratings by an observer/rater, we would
expect consistency among raters for a
reliable measure
• Best to use at least 2 independent raters,
‘blind’ to the ratings of other observers
• Precise operational definitions and well-
trained observers improve inter-rater
reliability
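Inter-rater agreement is often summarized with Cohen's kappa, which corrects raw agreement for chance; here is a sketch with hypothetical behavior codes from two independent raters:

```python
from sklearn.metrics import cohen_kappa_score

rater_1 = ["hit", "cry", "yell", "none", "hit", "none", "cry", "none"]
rater_2 = ["hit", "cry", "none", "none", "hit", "none", "cry", "yell"]
# kappa = 1.0 means perfect agreement; 0 means chance-level agreement
print(round(cohen_kappa_score(rater_1, rater_2), 2))
```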
ALTERNATE FORMS RELIABILITY
• Also known as equivalent forms reliability or
parallel forms reliability
• Obtained by administering two equivalent
tests to the same group of examinees
• Items are matched for difficulty on each
test
• It is necessary that the time frame between
giving the two forms be as short as possible
RELIABILITY COEFFICIENTS
•The statistic for expressing reliability.
•Expresses the degree of consistency in
the measurement of test scores.
•Denoted by the letter r with two
identical subscripts (rxx).
FACTORS AFFECTING RELIABILITY

1. Test length
2. Test-retest interval
3. Variability of scores
4. Guessing
5. Variation within the test situation
Reliability Interpretation
<.6 = not reliable
.6 = OK
.7 = reasonably reliable
.8 = good, strong reliability
.9 = excellent, very reliable
>.9 = potentially overly reliable or redundant measurement – this is
subjective, and whether a scale is overly reliable also depends on the
nature of what is being measured
Estimating Reliability
✓Reliability can range from 0 to 1.0.
✓When a reliability coefficient equals 0, the
scores reflect nothing but measurement error.
✓Rule of Thumb: measures with reliability coefficients of .70 or greater
have acceptable reliability
How Many Items per Factor?
• More items -> greater reliability
(The more items, the more ‘rounded’ the
measure)
• Min. = 3
• Max. = unlimited
RELIABILITY
Reproducibility of a measurement
VALIDITY
VALIDITY
The extent to which an instrument
actually measures what it purports to
measure.
Validity
• Validity refers to the question of whether
our measurements are actually hitting on
the construct we think they are.
• While we can obtain specific statistics for
reliability (even different types), validity is
more of a global assessment based on the
evidence available.
• Denotes the extent to which an instrument is
measuring what it is supposed to measure.
• We can have reliable measurements that are invalid
• Classic example: a scale that is consistent and able to distinguish
one person from the next, but is actually off by 5 pounds.
Face Validity
• Prima facie extent to which an item is judged to reflect target
construct
• Refers to the extent to which a measure ‘appears’ to measure
what it is supposed to measure
• Not statistical—involves the judgment of the researcher (and the
participants)
• A measure has face validity—’if people think it does’
• Just because a measure has face validity does not ensure that it is
a valid measure (and measures lacking face validity can be valid).
Validity Criteria in Psychological Testing
1. Content validity
2. Criterion validity
• Concurrent
• Predictive
3. Construct-related validity
• Convergent
• Discriminant
Content Validity
1. Whether the individual items of a test represent what you actually want
to assess.
2. Items represent the kinds of material (or content areas) they are supposed
to represent
• Are the questions worthwhile, in the sense that they cover all domains
of a given construct?
• E.g. job satisfaction = salary, relationship w/ boss, relationship w/
coworkers etc.
3. Systematic examination of the extent to which test content covers a
representative sample of the domain to be measured – e.g. sources,
• existing literature
• expert panels
• qualitative interviews / focus groups with target sample
CRITERION VALIDITY
• A method for assessing the validity of an instrument by comparing its
scores with another criterion already known to be a measure of the same
trait or skill.

• Refers to the extent to which a measure distinguishes participants on
the basis of a particular behavioral criterion.

• The degree to which the measure correlates with various outcomes
• Does some new personality measure correlate with the Big 5?
• Examples:
• The Scholastic Aptitude Test (SAT) is valid to the
extent that it distinguishes between students that
do well in college versus those that do not
• A valid measure of marital conflict should correlate
with behavioral observations (e.g., number of fights)
• A valid measure of depressive symptoms should
distinguish between subjects in treatment for
depression and those who are not in treatment
✓ Expressed as a correlation between the test in question and the
criterion measure. The correlation coefficient is referred to as a
VALIDITY COEFFICIENT.
Criterion-related Validity
(Concurrent & Predictive)
1. Concurrent Validity
✓The extent to which a procedure correlates
with the current behavior of subjects
✓Correlation between the measure and other
recognised measures of the target construct.
The measure and criterion are assessed at the
same time.
✓Criterion is in the present
• Measure of ADHD and current scholastic
behavioral problems
2. Predictive Validity
✓Extent to which a measure predicts something that
it theoretically should be able to predict
✓The extent to which a procedure allows accurate
predictions about a subject’s future behavior
✓A measure’s ability to distinguish participants on a
relevant behavioral criterion at some point in the
future
✓Criterion in the future
✓SAT and college GPA
SAT Example
• High school seniors who score high on the SAT are better
prepared for college than low scorers (concurrent validity)
• Probably of greater interest to college admissions
administrators, SAT scores predict academic performance
four years later (predictive validity)
CONSTRUCT VALIDITY
✓The extent to which a test measures a
theoretical construct or attribute.
✓How much is it an actual measure of the
construct of interest?
Construct Validity
• Most scientific investigations involve hypothetical
constructs—entities that cannot be directly
observed but are inferred from empirical evidence
(e.g., intelligence)
• Construct validity is assessed by studying the
relationships between the measure of a construct
and scores on measures of other constructs
• We assess construct validity by seeing whether a
particular measure relates as it should to other
measures
CONSTRUCT
• Abstract concepts such as intelligence, self-concept, motivation,
aggression and creativity that can be assessed by means of some type of
instrument.
A test’s construct validity
is often assessed by its
convergent and
discriminant validity.
Convergent validity
• Extent to which a measure correlates with
measures with which it theoretically should
be associated.
• Correlates well with other measures of the
construct
Ex.: Depression scale correlates well with
other depression scales
Discriminant validity
• Extent to which a measure does not correlate with measures with which
it theoretically should not be associated.
• Is distinguished from related but distinct constructs
Ex.: A depression scale should be distinguishable from a (related but
distinct) stress scale
FACTORS AFFECTING VALIDITY
1. Test-related factors
2. The criterion to which you compare
your instrument may not be well
enough established
3. Intervening events during test
taking
4. Reliability of the instrument
Estimating Validity
• Like reliability, validity is not absolute.
• Validity is the degree to which variability
(individual differences) in participant’s
scores on a particular measure, reflect
individual differences in the characteristic
or construct we want to measure
Self-Esteem Example
• Scores on a measure of self-esteem should be positively related to
measures of confidence and optimism (convergent)

• But negatively related to measures of insecurity and anxiety
(discriminant)
Estimating the Validity of a Measure
• A good measure must not only be reliable, but also valid
• A valid measure measures what it is intended to measure
• Validity is not a property of a measure, but an indication of the
extent to which an assessment measures a particular construct
in a particular context—thus a measure may be valid for one
purpose but not another
• A measure cannot be valid unless it is reliable, but a reliable
measure may not be valid
Convergent and Discriminant Validity

• To have construct validity, a measure should both:
• Correlate with other measures that it should be related to
(convergent validity)
• And not correlate with measures that it should not correlate with
(discriminant validity)
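Here is a sketch of what that pattern looks like in data (all values hypothetical): per the self-esteem example above, a self-esteem scale should correlate positively with confidence (convergent) and negatively with anxiety (discriminant pattern).

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
true_esteem = rng.normal(size=300)           # the latent construct
df = pd.DataFrame({
    "self_esteem": true_esteem + 0.4 * rng.normal(size=300),
    "confidence":  true_esteem + 0.6 * rng.normal(size=300),   # related construct
    "anxiety":    -true_esteem + 0.6 * rng.normal(size=300),   # opposite construct
})
# Strong positive and strong negative correlations in the expected
# directions are evidence for construct validity.
print(df.corr().round(2))
```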
Summary
• Reliability and Validity are key concerns in psychological testing.
• Part of the problem in psychology is the lack of reliable measures of
the things we are interested in.
• Assuming that they are valid to begin with, we must always press for
more reliable measures if we are to progress scientifically.
• This means letting go of supposed 'standards' when they are no longer
as useful and looking for ways to improve current ones.
Reliability vs. Validity
Error and Bias
• Beyond concerns of validity and reliability, any professional who
uses educational, psychological, or other types of tests must be
aware of any sources of error and bias that might exist.
• Since “bias is a factor inherent within a test that systematically
prevents accurate, impartial measurement” (Cohen & Swerdlik,
2001, p. 179), it is essential to eliminate as much potential and
actual bias as possible before the results of the test can be utilized
effectively.
• The Joint Committee on Standards for Educational and
Psychological Testing (1999) has written standards of practice with
reference to eliminating possible error and bias in testing.
The committee identified several different areas that need to be carefully
examined by administrators. These areas cover bias both inside and
outside of a test and test situation:
• Construct-irrelevant components—those items that may raise or lower scores for different groups of
examinees.
• Content related—those items, especially in educational testing, which concern how well a given test
covers the domain and whether that domain is appropriate; how clearly the questions and instructions
are written; and the type of response required from the test taker, i.e. essay, short answer, bubble, etc.
• Testwiseness—those issues relating to the familiarity with skills to take a test, answer questions in a
timely manner, and ability to guess well on questions the test taker does not know.
• Equitable treatment of test takers—those issues that are concerned with fair treatment of all test takers
by the administrator.
• Testing environment—those aspects related to the physical test environment, such as temperature,
comfort level, noise, etc.
• Perceived relationship between test taker and administrator—aspects related to how test administrator
and test taker relate during the evaluation period.
• State of test taker—emotional, physical, mental condition of the test taker.
TEST DEVELOPMENT
• Test Conceptualization
• Test Construction
• Test Tryout
• Item Analysis
• Test Revision
TEST CONCEPTUALIZATION
SOME PRELIMINARY
QUESTIONS
• What is the test designed to measure?
• Who will use this test?
• Who benefits from an administration of
this test?
• What is the ideal form of the test?
• Is there a need for this test?
• What is the objective of this test?
Norm-Referenced vs. Criterion-Referenced
• Norm-referenced: a good item is one that high scorers get right.
• Criterion-referenced: high scores in themselves do not count and are
irrelevant; what matters is mastery and accuracy.
Pilot Work
• Refers to the preliminary research surrounding the creation of a
prototype of the test.
• Used to evaluate whether test items should be included in the final
instrument.
• The test developer attempts to determine how best to measure a
targeted construct.
TEST CONSTRUCTION
Scaling

• Defined as the process of setting rules for assigning numbers in
measurement.
• A measuring device is designed and calibrated.
• Scale values are assigned to different amounts of the trait, attribute
or characteristic being measured.
TYPES OF SCALING
RATING SCALE

• Can be defined as a grouping of words, statements or symbols on which
judgments of the strength of a particular trait, attitude, or emotion
are indicated.
• Can be used to record judgments of oneself, others, experiences or
objects, and may take several forms.
Comparative Scaling

• Entails judgment of a stimulus in comparison with every other stimulus
on the scale.
• Might feature 30 items or more.

Categorical Scaling

• Stimuli are placed into one of two or more alternative categories that
differ quantitatively.
• Might also feature 30 items or more.


I believe I would like the work of a lighthouse keeper.

TRUE FALSE (circle one)

PLEASE RATE THE EMPLOYEE ON ABILITY TO COOPERATE AND GET ALONG WITH FELLOW
EMPLOYEES:

EXCELLENT____/____/____/____/____/____/____/____/UNSATISFACTORY
SCALING METHODS
LIKERT SCALE

• Used extensively, usually to scale attitudes.
• Relatively easy to construct.
• Consists of 5 alternative responses, usually on an agree/disagree or
approve/disapprove basis (sometimes 7).
Driving without a seatbelt is:

1    2    3    4    5    6    7    8    9    10
Never Justified                         Always Justified

Driving without a seatbelt is:

Never       Rarely      Sometimes   Usually     Always
Justified   Justified   Justified   Justified   Justified
Method of Paired Comparisons

• Test takers are presented with pairs of stimuli which they are asked
to compare.
• They must select the choice they agree with or find more appealing.
Select the behavior that you think would be more justified:

a. Driving without a license

b. Watching an R-18 movie while aged 17 or below
Guttman Scale

• Yields ordinal-level measures.
• Items on it range from weaker to stronger expressions of the attitude
or trait.
• All respondents who agree with the stronger statements of the attitude
will also agree with the milder statements.
Do you agree or disagree with each of
the following :
• All people should have the right to decide whether they wish to end
their lives.
• People who are terminally ill and in pain should have the option to
have a doctor assist them in ending their lives.
• People should have the option to sign away the use of artificial
life-support equipment before they become seriously ill.
• People have the right to a comfortable life.
WRITING ITEMS
• WHAT RANGE OF CONTENT SHOULD THE ITEMS COVER?

• WHICH OF THE MANY DIFFERENT TYPES OF ITEM FORMATS SHOULD BE EMPLOYED?

• HOW MANY ITEMS SHOULD BE WRITTEN?

What is an "Item Pool"?

• It is the reservoir or well from which items will be drawn or
discarded for the final version of the test.
ITEM FORMAT

• SELECTED-RESPONSE: requires test takers to select a response from a
set of alternative responses.
• CONSTRUCTED-RESPONSE: requires test takers to supply or to create the
correct answer, not merely to select it.
SELECTED-RESPONSE ITEM
MULTIPLE CHOICE

Has 3 elements :
• a stem,
• a correct alternative or option, and
• several incorrect alternatives or options variously
referred to as distractors or foils.
A good multiple-choice item in an
achievement test :
a. Has one correct alternative
b. Has grammatically parallel alternatives
c. Has alternatives of similar strength
d. Has alternatives that fit grammatically with the stem
e. Includes as much of the item as possible in the stem to avoid unnecessary
repetition
f. Avoids ridiculous distractors
g. Is not excessively long
h. All of the above
i. None of the above
Matching Item
• The test taker is presented with two columns, premises on
the left and responses on the right.

• The wording of premises should be fairly short and to the point. No
more than a dozen premises should be included.

• Premises and responses must be homogeneous.
For example…
1. Britney Spears a. Hero
2. Backstreet Boys b. Drop it Like Its Hot
3. N’Sync c. Beautiful
4. Jennifer Lopez d. If You Wanna Be My Lover
5. Christina Aguilera e. One Time
6. Boyzone f. Get Down
7. Mariah Carey g. Pretty Boy
8. M2M h. Pyramid
9. Snoop Dogg i. Oops! I Did It Again
10. Destiny’s Child j. Lets Get Loud
11. Spice Girls k. Its Raining Men
l. This I Promise You
TRUE / FALSE ITEM

• Usually takes the form of a sentence that requires the test


taker to indicate whether the statement is or is not a FACT.

• Contains a single idea, is not excessively long.

• Easier to write than multiple choice items and can be written


relatively quickly.
FOR EXAMPLE…
• Having a love life destroys studies

True False

• Drinking too much alcohol results in infertility

True False

• Men prefer girls who are shorter than them

True False
CONSTRUCTED-
RESPONSE ITEM
Completion item
• Requires the examinee to provide a word or phrase
that completes a sentence.
• Should be worded so that the correct answer is
specific.
• For example:
• The Culinary Arts focuses mainly on ________.
• A “mariposa” is a type of ________.
SHORT-answer item

• An item to which the test taker can respond quickly and succinctly.
• There are no hard and fast rules for how short an answer must be to
be considered a short answer.
• For example:
• Who is the newly elected President of the Philippines?
• What are the two parts of the human nervous system?
ESSAY
• Answerable by a paragraph or two
• Requires the test taker to respond by writing a composition,
typically one that demonstrates recall of facts, understanding,
analysis or interpretation.
• Useful when the test developer wants the examinee to demonstrate
a depth of knowledge about a single topic.
• For example:
• Make a stand about the implementation of dichotomies in our society.
• Compare and contrast the life of the Filipinos before and during the
Spanish Era. Cite specific examples.
PSYCHOLOGICAL
MEASUREMENT/TESTS
Types of Tests
• Individual Tests vs. Group Tests
• Individual tests: test administrator gives a test to a single
person
• e.g. WAIS-III, MMPI-2
• Group tests: single examiner gives a test to a group of
people
• e.g. Scholastic Aptitude Test (SAT), Graduate Record
Examinations (GRE)

Types of Tests
• Ability Tests
• Achievement Tests
• evaluates what an individual has learned
• measures prior activity
• Aptitude Tests
• evaluates what an individual is capable of learning
• measures capacity or future potential
• Intelligence Tests
• Measures a person’s general potential to solve problems,
adapt to novel situations and profit from experience
Types of Tests
• Personality Tests: Objective & Projective
• Objective Personality Tests
• present specific stimuli and ask for specific responses
(e.g. true/false questions) .
• Projective Personality Tests
• present more ambiguous stimuli and ask for less specific
responses (e.g. inkblots, drawings, photographs,
Rorschach, TAT)

APPLICATIONS OF TESTS
Application of Psychological Measurement
➢ Educational Testing
➢Personnel Testing
➢ Clinical Testing
Educational Testing
• Intelligence tests and achievement tests are used from an early age in
the U.S. and Canada. From kindergarten on, tests are used for placement
and advancement.
• Educational institutions have to make admissions and advancement
decisions regarding students (e.g., SAT, GRE, subject placement tests).
• Used to assess students for special education programs, and in
diagnosing learning difficulties.
• Guidance counselors use instruments for advising students.
• Used to evaluate school curricula.
Personnel Testing

• Following WW I, business began taking an active interest in testing
job applicants. Most government jobs require some civil service
examination.
• Tests are used to assess: training needs, workers' performance in
training, success in training programs, management development,
leadership training, and selection.
• For example, at the Lally School of Management, the Myers-Briggs Type
Indicator is used extensively to assess managerial potential. Type
testing is used in the hope of matching the right person with the job
they are most suited for.
Clinical Testing
• Tests of psychological adjustment, and tests which can classify and/or diagnose patients, are used extensively.
• Psychologists generally use a number of objective and projective personality tests.
• Neuropsychological tests, which examine basic mental functions, also fall into this category. Perceptual tests are used in detecting and diagnosing brain damage.
Testing Activities of Psychologists
Clinical Psychologists - e.g., Assessment of Intelligence, Assessment of Psychopathology
Counseling Psychologists - e.g., Career Interest Inventories, Skill Assessment
School Psychologists - e.g., Assessment of Academic Progress, Readiness for School, Social Adjustment
I/O Psychologists - e.g., Managerial Potential, Training Needs, Leadership Potential
Neuropsychologists - e.g., Assessment of Brain Damage, Neurological Impairments
Forensic Psychologists - the intersection between law and psychology; needed for legal determinations, e.g., Assessment of Risk, Competency to Stand Trial, Child Custody
Information About Tests
The Mental Measurements Yearbook - a guide to all currently available psychological tests.
The MMY uses content classifications to describe tests:
1. Achievement 2. Behavior Assessment
3. Developmental 4. Education
5. English & Language 6. Fine Arts
7. Foreign Languages 8. Intelligence and Aptitude
9. Mathematics 10. Neuropsychological
11. Personality 12. Reading
13. Science 14. Sensory-Motor
15. Social Studies 16. Speech and Hearing
17. Vocations
Ethics In Psychological Testing
• Given the widespread use of tests, there is considerable potential for abuse.
• A good deal of attention has therefore been devoted to the development and enforcement of professional and legal standards.
• The American Psychological Association (APA) has taken a leading role in the development of professional standards for testing.
American Psychological Association Ethical Guidelines:
➢ The investigator has the responsibility to make a careful evaluation of the study's ethical acceptability.
➢ The investigator is obliged to observe stringent safeguards to protect the rights of human participants.
➢ The researcher must evaluate whether participants are considered "subjects at risk" or "subjects at minimal risk" (no appreciable risk of physical or mental harm).
➢ The principal investigator always retains the responsibility for ensuring ethical practice in research. That is, the principal researcher is responsible for the ethical practices of collaborators, assistants, employees, etc. (all of whom are also responsible for their own ethical behavior).
➢ Except in minimal-risk research, the investigator establishes a clear and fair agreement with participants that clarifies the obligations and responsibilities of each, explains all aspects of the research that may influence the subject's decision to participate, and explains all other aspects that the participants inquire about.
APA Ethical Guidelines
➢ In research involving concealment or deception, the researcher considers the special responsibilities involved.
➢ The individual's freedom to decline, and freedom to withdraw, is respected. The researcher is responsible for protecting participants from physical and mental discomfort, harm, and danger that may arise from research procedures. If there are risks, the participants must be made aware of this fact.
➢ After the data are collected, the investigator provides participants with information about the nature of the study and attempts to remove any misconceptions that may have arisen.
➢ The investigator has the responsibility to detect and remove any undesirable consequences to the participant that may occur due to the research.
➢ The information obtained from the participant should be treated confidentially unless otherwise agreed upon with the participant.
Informed Consent
• Participants must be fully informed as to the purpose and nature of the research that they are going to be involved in.
• Participants must be fully informed about the procedures used in the research study.
• After getting this information, the participants must provide consent for their participation.
• Participants must be informed about their right to confidentiality and their right to withdraw without penalty.
Debriefing
Post-administration debriefing should:
- Restate the purpose of testing.
- Explain how the results will be used (usually, emphasize that the interest is in the group findings).
- Reiterate that findings will be treated confidentially.
- Answer all of the respondent's questions fully.
- Thank the examinee!
Participant Feedback
• In clinical research, or research with interpretive instruments, there may be a need to provide more in-depth feedback about an individual's responses (e.g., research on Emotional Intelligence).
• In such cases, first and foremost, it is critical that this kind of detailed feedback be given by a qualified individual.
At Least 4 parties are involved in Professional Test Use:
(1) Testing professionals: the test developer and publisher
(2) Testing professionals: the individuals who administer the testing procedure
(3) The user: the organization or practice that will eventually use the information to make certain decisions
(4) The test taker
DEVELOPING/SELECTING APPROPRIATE TESTS
➢ Define what each test measures and what the test should be used for.
➢ Describe the population(s) for which the test is appropriate.
➢ Accurately represent the characteristics, usefulness, and limitations of tests for their intended purposes.
➢ Describe the process of test development.
➢ Provide evidence that the test meets its intended purpose(s).
➢ Provide either representative samples or complete copies of test questions, directions, answer sheets, manuals, and score reports to qualified users.
DEVELOPING/SELECTING APPROPRIATE TESTS
➢ Indicate the nature of the evidence obtained concerning the appropriateness of each test for groups of different racial, ethnic, or linguistic backgrounds who are likely to be tested.
➢ Describe the population(s) represented by any norms or comparison group(s), the dates the data were gathered, and the process used to select the samples of test takers.
➢ When feasible, make appropriately modified forms of tests or administration procedures available for test takers with handicapping conditions. Warn test users of potential problems in using standard norms with modified tests or administration procedures that result in non-comparable scores.
DEVELOPING/SELECTING APPROPRIATE TESTS
➢ When a test is optional, provide test takers or their parents/guardians with information to help them judge whether the test should be taken, or if an available alternative to the test should be used.
➢ Provide test takers the information they need to be familiar with the coverage of the test, the types of question formats, the directions, and appropriate test-taking strategies. Strive to make such information equally available to all test takers.
➢ Provide test takers or their parents/guardians with information about rights test takers may have to obtain copies of tests and completed answer sheets, retake tests, have tests rescored, or cancel scores.
➢ Tell test takers or their parents/guardians how long scores will be kept on file and indicate to whom and under what circumstances test scores will or will not be released.
Responsibility of The Tester
1. Have competence in test administration, interpretation and feedback.
2. Have an understanding of basic psychometrics and scoring procedures, be competent in interpretation, and apply scientific knowledge and professional judgment to the results.
3. Take responsibility for the selection, administration, scoring, analysis, interpretation and communication of test results.
4. Be familiar with the context of use: the situation, purpose, and setting in which a test is used.
Responsibility of The Tester
5. Have knowledge of legal and ethical issues related to test use.
6. Have awareness of ethnic or cultural variables that could influence the results.
7. Have the ability to determine language proficiency.
8. Have knowledge of important racial, ethnic, or cultural variables relevant for individuals or groups to whom tests are administered.
Issues to Address with the Testee
1. Informed consent - assuring confidentiality, freedom to withdraw, and the purpose of the assessment. What kinds of attributes are being measured?
2. Who is the client? An individual, a group, an employer?
3. What happens with the results, and who has access to them?
4. Where will the data be stored, how, and for how long?
5. In what time frame are the results to be considered valid?
6. Who will be the payer, and how much will it cost?
7. Where will the assessment take place?
8. Are the facilities appropriate and conducive for testing?
9. Will there be follow-up assessments or feedback?
Factors Not Under the Examiner’s Control
1. How fatigued a test taker is.
2. Motivation level of the test taker.
3. Physical discomfort.
4. Test anxiety.
Ethnic and Cultural Variables
• Knowledge of attitudes of various racial, ethnic, or cultural
groups toward testing.
• Ability to determine language proficiency.
• Ability to determine the potential effects of different test
settings on different racial, ethnic, or cultural groups.
• Knowledge of specific biases that have been demonstrated for
particular tests for individuals or groups of individuals from
particular racial, ethnic, or cultural minority groups.
Test Fairness
• People with different values often disagree over the fairness of some testing practices.
• Factors that affect testing fairness:
• 1. Obstacles that prevent people from performing well
• 2. A test may provide an unfair advantage to some people
• 3. Some tests are not valid, or are used in the wrong situations
• 4. Some tests are used for purposes that are inherently objectionable
Test Use & Test Fairness
• A test is most likely to be seen as unfair when:
• 1. It is the sole basis for the decision.
• 2. The consequences of doing poorly on the test are harsh.
• Ways to reduce concerns over test unfairness:
• 1. Use multiple assessment procedures.
• 2. Use more intensive screening procedures for those likely to be treated unfairly by a given test.
Types of Decisions
• Two distinctions are very useful for classifying decisions:
• 1. Individual or Institutional
• 2. Comparative or Absolute
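In the testing literature, a comparative decision ranks examinees against one another (e.g., admit the top k scorers), while an absolute decision compares each score with a fixed standard (e.g., pass everyone at or above a cutoff). A minimal sketch of the two decision rules, with invented names, scores, quota, and cutoff:

```python
# Hypothetical scores; the quota (k) and cutoff below are invented examples.
scores = {"Ana": 82, "Ben": 74, "Carla": 91, "Dino": 65}

# Comparative decision: examinees are compared with one another; take the top k.
k = 2
admitted = sorted(scores, key=scores.get, reverse=True)[:k]
print("Comparative (top-2):", admitted)  # ['Carla', 'Ana']

# Absolute decision: each score is compared with a fixed standard, not with others.
cutoff = 75
passed = [name for name, score in scores.items() if score >= cutoff]
print("Absolute (cutoff 75):", passed)  # ['Ana', 'Carla']
```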
THE END!

THANK YOU!!!