Validity and Reliability
Table of Contents
1 Validity and Reliability
2 Types of Validity
3 External Validity
3.1 Population Validity
3.2 Ecological Validity
4 Internal Validity
5 Test Validity
5.1 Criterion Validity
5.1.1 Concurrent Validity
5.1.2 Predictive Validity
6 Content Validity
7 Construct Validity
7.1 Convergent and Discriminant Validity
8 Face Validity
9 Definition of Reliability
10 Test–Retest Reliability
10.1 Reproducibility
10.2 Replication Study
11 Interrater Reliability
12 Internal Consistency Reliability
13 Instrument Reliability
Copyright Notice
Copyright © [Link] 2014. All rights reserved, including the right of reproduction in
whole or in part in any form. No parts of this book may be reproduced in any form without
written permission of the copyright owner.
Notice of Liability
The author(s) and publisher both used their best efforts in preparing this book and the
instructions contained herein. However, the author(s) and the publisher make no warranties of
any kind, either expressed or implied, with regard to the information contained in this book,
and especially disclaim, without limitation, any implied warranties of merchantability and
fitness for any particular purpose.
In no event shall the author(s) or the publisher be responsible or liable for any loss of profits or
other commercial or personal damages, including but not limited to special, incidental,
consequential, or any other damages, in connection with or arising out of furnishing,
performance or use of this book.
Trademarks
Throughout this book, trademarks may be used. Rather than put a trademark symbol in every
occurrence of a trademarked name, we state that we are using the names in an editorial
fashion only and to the benefit of the trademark owner with no intention of infringement of the
trademarks. Copyrights on the individual photographic, trademark, and clip art images reproduced in this book are retained by their respective owners.
Information
Published by [Link].
1 Validity and Reliability
The principles of validity and reliability are fundamental cornerstones of the scientific
method.
Together, they are at the core of what is accepted as scientific proof, by scientist and
philosopher alike.
By following a few basic principles, researchers can ensure that any experimental design will stand up to rigorous questioning and skepticism.
What is Reliability?
The idea behind reliability is that any significant results must be more than a one-off finding
and be inherently repeatable.
Other researchers must be able to perform exactly the same experiment, under the same
conditions and generate the same results. This will reinforce the findings and ensure that the
wider scientific community will accept the hypothesis.
Without this replication of statistically significant results, the experiment and research have not
fulfilled all of the requirements of testability.
For example, if you are performing a time-critical experiment, you will be using some type of
stopwatch. Generally, it is reasonable to assume that the instruments are reliable and will
keep true and accurate time. However, diligent scientists take measurements many times, to
minimize the chances of malfunction and maintain validity and reliability.
At the other extreme, any experiment that uses human judgment is always going to come
under question.
For example, if observers rate certain aspects, as in Bandura's Bobo Doll Experiment, then
the reliability of the test is compromised. Human judgment can vary wildly between observers,
and the same individual may rate things differently depending upon time of day and current
mood.
This means that such experiments are more difficult to repeat and are inherently less reliable.
Reliability is a necessary ingredient for determining the overall validity of a scientific
experiment and enhancing the strength of the results.
Debate between social and pure scientists, concerning reliability, is robust and ongoing.
What is Validity?
Validity encompasses the entire experimental concept and establishes whether the results
obtained meet all of the requirements of the scientific research method.
For example, there must have been randomization of the sample groups and appropriate care
and diligence shown in the allocation of controls.
Internal validity dictates how an experimental design is structured and encompasses all of the
steps of the scientific research method.
Even if your results are great, sloppy and inconsistent design will compromise your integrity in
the eyes of the scientific community. Internal validity and reliability are at the core of any
experimental design.
External validity is the process of examining the results and questioning whether there are any
other possible causal relationships.
Control groups and randomization will lessen external validity problems but no method can be
completely successful. This is why the statistical proofs of a hypothesis are called significant, rather than absolute truth.
Any scientific research design only puts forward a possible cause for the studied effect.
There is always the chance that another unknown factor contributed to the results and
findings. This extraneous causal relationship may become more apparent, as techniques are
refined and honed.
Conclusion
If you have designed your experiment with validity and reliability in mind, then the scientific community is more likely to accept your findings.
Eliminating other potential causal relationships, by using controls and duplicate samples, is
the best way to ensure that your results stand up to rigorous questioning.
How to cite this article:
Martyn Shuttleworth (Oct 20, 2008). Validity and Reliability. Retrieved from [Link]:
[Link]
2 Types of Validity
Here is an overview of the main types of validity used in the scientific method:
External Validity
External validity is about generalization: To what extent can an effect in research be generalized to populations, settings, treatment variables, and measurement variables?
External validity is usually split into two distinct types, population validity and ecological validity, and both are essential elements in judging the strength of an experimental design.
Internal Validity
Internal validity is a measure which ensures that a researcher's experiment design closely
follows the principle of cause and effect.
Test Validity
Test validity is an indicator of how much meaning can be placed upon a set of test results.
Criterion Validity
Concurrent validity measures the test against a benchmark test and high correlation
indicates that the test has strong criterion validity.
Predictive validity is a measure of how well a test predicts abilities. It involves testing a
group of subjects for a certain construct and then comparing them with results obtained
at some point in the future.
Content Validity
Content validity is the estimate of how much a measure represents every single element of a
construct.
Construct Validity
Construct validity defines how well a test or experiment measures up to its claims. A test
designed to measure depression must only measure that particular construct, not closely
related ideals such as anxiety or stress.
Convergent validity tests that constructs that are expected to be related are, in fact,
related.
Discriminant validity (also referred to as divergent validity) tests that constructs that should have no relationship do, in fact, have no relationship.
Face Validity
Face validity is a measure of how representative a research project is 'at face value,' and whether it appears to be a good project.
3 External Validity
External validity is one of the most difficult of the validity types to achieve, and is at the
foundation of every good experimental design.
Many scientific disciplines, especially the social sciences, face a long battle to prove that their
findings represent the wider population in real world situations.
The main criterion of external validity is the process of generalization, and whether results
obtained from a small sample group, often in laboratory surroundings, can be extended to
make predictions about the entire population.
The reality is that if a research program has poor external validity, the results will not be taken
seriously, so any research design must justify sampling and selection methods.
External validity is usually split into two distinct types, population validity and ecological validity, and both are essential elements in judging the strength of an experimental design.
Psychology and External Validity
External validity often causes a little friction between clinical psychologists and research
psychologists.
Clinical psychologists often believe that research psychologists spend all of their time in
laboratories, testing mice and humans in conditions that bear little resemblance to the outside
world. They claim that the data produced has no external validity, and does not take into
account the sheer complexity and individuality of the human mind.
Before we are flamed by irate research psychologists, the truth lies somewhere between the
two extremes! Research psychologists find out trends and generate sweeping generalizations
that predict the behavior of groups. Clinical psychologists end up picking up the pieces, and
study the individuals who lie outside the predictions, hence the animosity.
In most cases, research psychology has very high population validity, because researchers meticulously select groups at random and use large sample sizes, allowing meaningful statistical analysis.
However, the artificial nature of research psychology means that ecological validity is usually
low.
Clinical psychologists, on the other hand, often use focused case studies, which cause
minimum disruption to the subject and have strong ecological validity. However, the small
sample sizes mean that the population validity is often low.
For example, a research design, which involves sending out survey questionnaires to
students picked at random, displays more external validity than one where the questionnaires
are given to friends. This use of randomization improves external validity.
Once you have a representative sample, high internal validity involves randomly assigning
subjects to groups, rather than using pre-determined selection factors.
With the student example, randomly assigning the students into test groups, rather than
picking pre-determined groups based upon degree type, gender, or age strengthens the
internal validity.
3.1 Population Validity
Population validity is a type of external validity which describes how well the sample
used can be extrapolated to a population as a whole.
It evaluates whether the sample population represents the entire population, and also whether
the sampling method is acceptable.
For example, an educational study that looked at a single school could not be generalized to
cover children at every US school.
On the other hand, a federally mandated study that tested every pupil of a certain age group would have exceptionally strong population validity.
Due to time and cost constraints, most studies lie somewhere between these two extremes, and researchers pay close attention to their sampling techniques.
Experienced scientists ensure that their sample groups are as representative as possible,
striving to use random selection rather than convenience sampling.
Martyn Shuttleworth (Sep 16, 2009). Population Validity. Retrieved from [Link]:
[Link]
3.2 Ecological Validity
Ecological validity is a type of external validity which looks at the testing environment
and determines how much it influences behavior.
In the school test example, if the pupils are used to regular testing, then the ecological validity
is high because the testing process is unlikely to affect behavior.
On the other hand, taking each child out of class and testing them individually, in an isolated
room, will dramatically lower ecological validity. The child may be nervous and ill at ease, and is unlikely to perform in the same way as they would in a classroom.
Generalization becomes difficult, as the experiment does not resemble the real world situation.
Martyn Shuttleworth (Mar 19, 2009). Ecological Validity. Retrieved from [Link]:
[Link]
4 Internal Validity
Internal validity is a measure of how well an experimental design follows the principle of cause and effect. Looking at some extreme examples, a physics experiment into the effect of heat on the conductivity of a metal has high internal validity.
The researcher can eliminate almost all of the potential confounding variables and set up
strong controls to isolate other factors.
At the other end of the scale, a study into the correlation between income level and the
likelihood of smoking has a far lower internal validity.
A researcher may find that there is a link between low-income groups and smoking, but
cannot be certain that one causes the other.
Social status, profession, ethnicity, education, parental smoking, and exposure to targeted
advertising are all variables that may have an effect. They are difficult to eliminate, and social
research can be a statistical minefield for the unwary.
Internal Validity vs Construct Validity
For physical scientists, construct validity is rarely needed but, for social sciences and
psychology, construct validity is the very foundation of research.
Even more important is understanding the difference between construct validity and internal
validity, which can be a very fine distinction.
The subtle differences between the two are not always clear, but it is important to be able to
distinguish between the two, especially if you wish to be involved in the social sciences,
psychology and medicine.
Internal validity only shows that you have evidence to suggest that a program or study had
some effect on the observations and results.
Construct validity determines whether the program measured the intended attribute.
Internal validity says nothing about whether the results were what you expected, or whether
generalization is possible.
For example, imagine that some researchers wanted to investigate the effects of a computer
program against traditional classroom methods for teaching Greek.
The results showed that children using the computer program learned far more quickly, and
improved their grades significantly.
However, further investigation showed that the results were not due to the program itself, but
due to the Hawthorne Effect: the children using the computer program felt that they had been
singled out for special attention. As a result, they tried a little harder, instead of staring out of
the window.
This experiment still showed high internal validity, because the research manipulation had an
effect.
However, the study had low construct validity, because the cause was not correctly labeled.
The experiment ultimately measured the effects of increased attention, rather than the
intended merits of the computer program.
However, there are a number of tools that help a researcher to oversee internal validity and
establish causality.
Temporal Precedence
Temporal precedence is the single most important tool for determining the strength of a cause
and effect relationship. This is the process of establishing that the cause did indeed happen
before the effect, providing a solution to the chicken and egg problem.
To establish internal validity through temporal precedence, a researcher must establish which variable came first.
One example could be an ecology study, establishing whether an increase in the population of
lemmings in a fjord in Norway is followed by an increase in the number of predators.
Lemmings show a very predictable population cycle, which steadily rises and falls over a 3 to 5 year period. Population estimates show that the number of lemmings rises due to an increase
in the abundance of food.
This trend is followed, a couple of months later, by an increase in the number of predators, as
more of their young survive. This seems to be a pretty clear example of temporal precedence;
the availability of food for the lemmings dictates numbers. In turn, this dictates the population
of predators.
STOP!
Not so fast!
In fact, the predator/prey relationship is much more complex than this. Ecosystems rarely
contain simple linear relationships, and food availability is only one controlling factor.
Turning the whole thing around, an increase in the number of predators may also control the
lemming population. The predators may be so successful that the lemming population
plummets and the predators starve, having limited their own food supply.
What if predators turn to an alternative food supply when the number of lemmings is low?
Lemmings, like many rodents, show lower breeding success during times of high population.
This really is a tough call, and the only answer is to study previous research. Internal validity
is possibly the single most important reason for conducting a strong and thorough literature
review.
Even with this, it is often difficult to show that cause happens before effect, a fact that
behavioral biologists and ecologists know only too well.
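One way to probe temporal precedence in data like this is to compare correlations at different time lags: if prey counts correlate more strongly with predator counts some months later than with simultaneous counts, that is suggestive (though never conclusive) evidence that prey abundance leads. A minimal sketch in Python, using invented monthly counts purely for illustration:

    import numpy as np

    # Hypothetical monthly counts; a real study would use field estimates.
    lemmings = np.array([120, 180, 260, 400, 520, 480, 350, 210, 130, 90, 80, 110])
    predators = np.array([10, 11, 13, 18, 26, 34, 40, 38, 30, 22, 15, 12])

    def lagged_corr(x, y, lag):
        """Correlate x with y shifted 'lag' steps into the future."""
        if lag == 0:
            return np.corrcoef(x, y)[0, 1]
        return np.corrcoef(x[:-lag], y[lag:])[0, 1]

    for lag in range(4):
        print(f"lag {lag} months: r = {lagged_corr(lemmings, predators, lag):.2f}")

    # If r peaks at a positive lag, predator numbers trail prey numbers,
    # consistent with (but not proof of) temporal precedence.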
By contrast, the physics experiment is fairly easy - I heat the metal and the conductivity increases or decreases - providing a simpler view of cause and effect and high internal validity.
For example, in the study of Greek learning, the results showed that the group with the
computer package performed better than those without.
More of the program equals more of the outcome.
Less of the program equals less of the outcome.
This seems pretty obvious, but you have to remember the basic rule of internal validity: covariation of cause and effect alone cannot establish what causes the effect, or whether it is due to the expected manipulated variable or to a confounding variable.
As with the lemming example, there could be many other plausible explanations for the
apparent causal link between prey and predator.
Researchers often refer to any such confounding variable as the 'Missing Variable,' an
unknown factor that may underpin the apparent relationship.
The problem is, as the name suggests, that the variable is missing, and trying to find it is
almost impossible. The only way to nullify it is through strong experimental design, eliminating
confounding variables and ensuring that they cannot have any influence.
Randomization, control groups and repeat experiments are the best way
to eliminate these variables and maintain high validity.
In the lemming example, researchers use a whole series of experiments, measuring predation
rates, alternative food sources and lemming breeding rates, attempting to establish a baseline.
In the experiment where researchers compared a computer program for teaching Greek
against traditional methods, there are a number of threats to internal validity.
The group with computers feels special, so they try harder - the Hawthorne Effect.
The group without computers becomes jealous, and tries harder to prove that they
should have been given the chance to use the shiny new technology.
Alternatively, the group without computers is demoralized and their performance suffers.
Parents of the children in the computerless group feel that their children are missing out,
and complain that all children should be given the opportunity.
The children talk outside school and compare notes, muddying the water.
The teachers feel sorry for the children without the program and attempt to compensate,
helping the children more than normal.
We are not trying to depress you with these complications, only to illustrate how complex internal validity can be.
In fact, perfect internal validity is an unattainable ideal, but any research design must strive
towards that perfection.
For those of you wondering whether you picked the right course, don't worry. Designing
experiments with good internal validity is a matter of experience, and becomes much easier
over time.
For the scientists who think that social sciences are soft - think again!
5 Test Validity
Test validity is an indicator of how much meaning can be placed upon a set of test
results. In psychological and educational testing, where the importance and accuracy
of tests is paramount, test validity is crucial.
Test validity incorporates a number of different validity types, including criterion validity,
content validity and construct validity. If a research project scores highly in these areas, then
the overall test validity is high.
Criterion Validity
Criterion validity establishes whether the test matches a certain set of abilities.
Concurrent validity measures the test against a benchmark test, and high correlation
indicates that the test has strong criterion validity.
Predictive validity is a measure of how well a test predicts abilities, such as measuring
whether a good grade point average at high school leads to good results at university.
Content Validity
Content validity establishes how well a test compares to the real world. For example, a school
test of ability should reflect what is actually taught in the classroom.
Construct Validity
Construct validity is a measure of how well a test measures up to its claims. A test designed
to measure depression must only measure that particular construct, not closely related ideals
such as anxiety or stress.
In many cases, researchers do not subdivide test validity, and see it as a single construct that
requires an accumulation of evidence to support it.
Messick, in 1975, proposed that proving the validity of a test is futile, especially when it is
impossible to prove that a test measures a specific construct. Constructs are so abstract that
they are impossible to define, and so proving test validity by the traditional means is ultimately
flawed.
Messick believed that a researcher should gather enough evidence to defend his work, and
proposed six aspects that would permit this. He argued that this evidence could not justify the
validity of a test, but only the validity of the test in a specific situation. He stated that this
defense of a test's validity should be an ongoing process, and that any test needed to be
constantly probed and questioned.
Finally, he was the first psychometric researcher to propose that the social and ethical implications of a test were an inherent part of the process, a huge paradigm shift from the
accepted practices. Considering that educational tests can have a long-lasting effect on an
individual, then this is a very important implication, whatever your view on the competing
theories behind test validity.
This new approach does have some basis; for many years, IQ tests were regarded as
practically infallible.
However, they have been used in situations vastly different from the original intention, and
they are not a great indicator of intelligence, only of problem solving ability and logic.
Messick's methods certainly appear to predict these problems more satisfactorily than the
traditional approach.
Both methods have their own strengths and weaknesses, so it comes down to personal
choice and what your supervisor prefers. As long as you have a strong and well-planned test
design, then the test validity will follow.
Martyn Shuttleworth (Sep 19, 2009). Test Validity. Retrieved from [Link]:
[Link]
5.1 Criterion Validity
To measure the criterion validity of a test, researchers must calibrate it against a known
standard or against itself.
Comparing the test with an established measure is known as concurrent validity; testing it
over a period of time is known as predictive validity.
It is not necessary to use both of these methods, and one is regarded as sufficient if the
experimental design is strong.
One of the simplest ways to assess criterion related validity is to compare it to a known
standard.
A new intelligence test, for example, could be statistically analyzed against a standard IQ test;
if there is a high correlation between the two data sets, then the criterion validity is high. This
is a good example of concurrent validity, but this type of analysis can be much more subtle.
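As a rough sketch of that analysis, the correlation between a new test and the established benchmark can be computed directly. The scores below are invented, and the use of Python with scipy is simply one convenient choice:

    from scipy.stats import pearsonr

    # Hypothetical scores for ten subjects on each instrument.
    standard_iq = [95, 102, 110, 88, 120, 131, 99, 105, 93, 115]
    new_test = [97, 100, 113, 85, 118, 128, 96, 108, 95, 117]

    r, p_value = pearsonr(standard_iq, new_test)
    print(f"r = {r:.2f}, p = {p_value:.4f}")

    # A correlation close to 1 between the new test and the benchmark
    # suggests strong concurrent validity; a weak one suggests a redesign.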
Imagine, for example, that a group of pollsters designs a test intended to measure political ideology. With this test, they hope to predict how people are likely to vote. To assess the criterion validity of the test, they run a pilot study, selecting only members of left wing and right wing political parties.
If the test has high concurrent validity, the members of the leftist party should receive scores
that reflect their left leaning ideology. Likewise, members of the right wing party should
receive scores indicating that they lie to the right.
If this does not happen, then the test is flawed and needs a redesign. If it does work, then the
researchers can assume that their test has a firm basis, and the criterion related validity is
high.
Most pollsters would not leave it there and in a few months, when the votes from the election
were counted, they would ask the subjects how they actually voted.
This predictive validity allows them to double check their test, with a high correlation again
indicating that they have developed a solid test of political ideology.
Criterion validity matters outside academia, too: insurance companies have to measure a construct called 'overall health,' made up of lifestyle
factors, socio-economic background, age, genetic predispositions and a whole range of other
factors.
Maintaining high criterion related validity is difficult, with all of these factors, but getting it
wrong can bankrupt the business.
A famous failure of criterion validity was Coca-Cola's launch of New Coke in 1985. Diligently, the company researched whether people liked the new flavor, performing taste tests and giving out questionnaires. People loved the new flavor, so Coca-Cola rushed New Coke into production, where it was a titanic flop.
The mistake that Coke made was that they forgot about criterion validity, and omitted one
important question from the survey.
People were not asked if they preferred the new flavor to the old, a failure to establish
concurrent validity.
The Old Coke, known to be popular, was the perfect benchmark, but it was never used. A
simple blind taste test, asking people which flavor they preferred out of the two, would have
saved Coca-Cola millions of dollars.
Ultimately, the predictive validity was also poor, because their good results did not correlate
with the poor sales. By then, it was too late!
Martyn Shuttleworth (Jan 12, 2009). Criterion Validity. Retrieved from [Link]:
[Link]
5.1.1 Concurrent Validity
Concurrent validity measures a new test against an established benchmark taken at around the same time. The tests are for the same, or very closely related, constructs, and allow a researcher to validate new methods against a tried and tested stalwart.
For example, IQ, Emotional Quotient, and most school grading systems are good examples of
established tests that are regarded as having high validity. One common way of looking at
concurrent validity is as measuring a new test or procedure against a gold-standard
benchmark.
For example, testing a group of students for intelligence, with an IQ test, and then performing
the new intelligence test a couple of days later would be perfectly acceptable.
If the test takes place a considerable amount of time after the initial test, then it is regarded as
predictive validity. Both concurrent and predictive validity are subdivisions of criterion validity
and the timescale is the only real difference.
For example, imagine that researchers devise a new test of mathematical aptitude for schoolchildren. They then compare the results with the test scores already held by the school, a recognized and reliable judge of mathematical ability.
Cross referencing the scores for each student allows the researchers to check if there is a
correlation, evaluate the accuracy of their test, and decide whether it measures what it is
supposed to. The key element is that the two methods were compared at about the same time.
If the researchers had measured the mathematical aptitude, implemented a new educational
program, and then retested the students after six months, this would be predictive validity.
The main weakness of concurrent validity is that the benchmark test may carry its own flaws. For example, IQ tests are often criticized, because they are often used beyond the scope of the original intention and are not the strongest indicator of all-round intelligence. Any new
intelligence test that showed strong concurrent validity with IQ tests would, presumably,
contain the same inherent weaknesses.
Despite this weakness, concurrent validity is a stalwart of education and employment testing,
where it can be a good guide for new testing procedures. Ideally, researchers initially test
concurrent validity and then follow up with a predictive validity based experiment, to give a
strong foundation to their findings.
5.1.2 Predictive Validity
Predictive validity involves testing a group of subjects for a certain construct, and then
comparing them with results obtained at some point in the future.
Most educational and employment tests are used to predict future performance, so predictive
validity is regarded as essential in these fields.
For example, universities often select students based upon high school grade point averages. In this process, the basic assumption is that a high-school pupil with a high grade point average will achieve high grades at university.
Quite literally, there have been hundreds of studies testing the predictive validity of this
approach. To achieve this, a researcher takes the grades achieved after the first year of
studies, and compares them with the high school grade point averages.
A high correlation indicates that the selection procedure worked well; a low correlation signifies that there is something wrong with the approach.
Most studies show that there is a strong correlation between the two, and the predictive validity
of the method is high, although not perfect.
Intuitively, this seems logical; previously excellent students may well struggle with
homesickness or decide to spend the first year drinking beer.
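A sketch of how such a check might look, using invented grade data: fit a simple linear relationship between high school grade point average and first-year university average, and inspect the correlation.

    import numpy as np

    # Hypothetical data: high school GPA vs. first-year university average.
    hs_gpa = np.array([2.1, 2.5, 2.8, 3.0, 3.2, 3.5, 3.7, 3.9, 4.0])
    uni_avg = np.array([58, 62, 60, 70, 68, 75, 72, 82, 85])

    r = np.corrcoef(hs_gpa, uni_avg)[0, 1]
    slope, intercept = np.polyfit(hs_gpa, uni_avg, 1)

    print(f"predictive correlation: r = {r:.2f}")
    print(f"predicted first-year average for a GPA of 3.4: {slope * 3.4 + intercept:.1f}")

    # A strong positive r supports the predictive validity of GPA-based
    # selection, although, as noted above, it will never be perfect.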
Predictive validity is regarded as a very strong measure of statistical validity, but it does
contain a few weaknesses that statisticians and researchers need to take into consideration.
Predictive validity does not test all of the available data, and individuals who are not selected
cannot, by definition, go on to produce a score on that particular criterion.
In the university selection example, this approach does not test the students who failed to
attend university, due to low grades, personal preference or financial concerns. This leaves a
hole in the data, and the predictive validity relies upon this incomplete data set, so the
researchers must always make some assumptions.
If the students with the highest grade point averages score higher after their first year at
university, and the students who just scraped in get the lowest, researchers assume that non-attendees would score lower still. This downwards extrapolation might be incorrect, but
predictive validity has to incorporate such assumptions.
Despite this weakness, predictive validity is still regarded as an extremely powerful measure
of statistical accuracy.
In many fields of research, it is regarded as the most important measure of quality, and
researchers constantly seek ways to maintain high predictive validity.
Martyn Shuttleworth (Sep 23, 2009). Predictive Validity. Retrieved from [Link]:
[Link]
6 Content Validity
Content validity, sometimes called logical or rational validity, is the estimate of how
much a measure represents every single element of a construct.
For example, an educational test with strong content validity will represent the subjects
actually taught to students, rather than asking unrelated questions.
Content validity is qualitative in nature, and asks whether a specific element enhances or
detracts from a test or research program.
Face validity requires a personal judgment, such as asking participants whether they thought
that a test was well constructed and useful. Content validity arrives at the same answers, but
uses an approach based in statistics, ensuring that it is regarded as a strong type of validity.
For surveys and tests, each question is given to a panel of expert analysts, and they rate it.
They give their opinion about whether the question is essential, useful or irrelevant to
measuring the construct under study.
Their results are statistically analyzed and the test modified to improve the rational validity.
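One common formalization of this panel procedure (it is not named in the text above, so treat it as an illustrative assumption) is Lawshe's content validity ratio, CVR = (n_e - N/2) / (N/2), where n_e is the number of experts rating an item 'essential' and N is the panel size. A minimal sketch with invented ratings:

    # Lawshe's content validity ratio: CVR = (n_e - N/2) / (N/2),
    # where n_e = number of panelists rating the item "essential"
    # and N = total number of panelists.
    def content_validity_ratio(ratings):
        n = len(ratings)
        n_essential = sum(1 for r in ratings if r == "essential")
        return (n_essential - n / 2) / (n / 2)

    # Hypothetical ratings from a panel of five experts for two questions.
    question_1 = ["essential", "essential", "essential", "useful", "essential"]
    question_2 = ["useful", "irrelevant", "essential", "useful", "irrelevant"]

    for name, ratings in [("Q1", question_1), ("Q2", question_2)]:
        print(f"{name}: CVR = {content_validity_ratio(ratings):+.2f}")

    # CVR ranges from -1 to +1; items with low or negative CVR are
    # candidates for revision or removal.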
For example, imagine a school wants to hire a new science teacher, and a panel of governors begins to look
through the various candidates. They draw up a shortlist and then set a test, picking the
candidate with the best score. Sadly, he proves to be an extremely poor science teacher.
After looking at the test, the education board begins to see where they went wrong. The vast
majority of the questions were about physics so, of course, the school found the most talented
physics teacher.
However, this particular job expected the science teacher to teach biology, chemistry and
psychology. The content validity of the test was poor and did not fully represent the construct of
'being a good science teacher.'
Suitably embarrassed, the school redesigned the test and submitted it to a panel of
educational experts. After asking the candidates to sit the revised test, the school found
another teacher, and she proved to be an excellent and well-rounded science teacher. This
test had a much higher rational validity and fully represented every element of the construct.
7 Construct Validity
Construct validity defines how well a test or experiment measures up to its claims. It refers to whether the operational definition of a variable actually reflects the true theoretical meaning of a concept.
The simple way of thinking about it is as a test of generalization, like external validity, but it
assesses whether the variable that you are testing for is addressed by the experiment.
Construct validity is a device used almost exclusively in social sciences, psychology and
education.
For example, you might design a study to test whether an educational program increases artistic ability amongst pre-school children. Construct validity is a measure of whether your research actually measures artistic ability, a slightly abstract label.
Some specific examples could be language proficiency, artistic ability or level of displayed
aggression, as with the Bobo Doll Experiment. These concepts are abstract and theoretical,
but have been observed in practice.
An example could be a doctor testing the effectiveness of painkillers on chronic back sufferers.
Every day, he asks the test subjects to rate their pain level on a scale of one to ten - pain
exists, we all know that, but it has to be measured subjectively.
In this case, construct validity would test whether the doctor actually was measuring pain and
not numbness, discomfort, anxiety or any other factor.
Therefore, with the construct properly defined, we can look at construct validity, a measure of how well the test measures the construct. It is a tool that allows researchers to perform a systematic analysis of how well designed their research is.
Construct validity is valuable in social sciences, where there is a lot of subjectivity to concepts.
Often, there is no accepted unit of measurement for constructs and even fairly well known
ones, such as IQ, are open to debate.
Many researchers pre-test their constructs with pilot studies before committing to the main research program. These pilot studies establish the strength of their research and allow them to make any adjustments.
Using an educational example, such a pre-test might involve a differential groups study,
where researchers obtain test results for two different groups, one with the construct and one
without.
The other option is an intervention study, where a group with low scores in the construct is tested, taught the construct, and then re-measured. If there is a significant difference pre- and post-test, usually analyzed with simple statistical tests, then this supports good construct validity.
There were attempts, after the Second World War, to devise statistical methods to test construct validity, but they were so long and complicated that they proved to be unworkable. Establishing good
construct validity is a matter of experience and judgment, building up as much supporting
evidence as possible.
A whole battery of statistical tools and coefficients are used to prove strong construct validity,
and researchers continue until they feel that they have found the balance between proving
validity and practicality.
There are many recognized threats to construct validity; here are some of the main candidates:
Hypothesis Guessing
This threat is when the subject guesses the intent of the test and consciously, or
subconsciously, alters their behavior.
It does not matter whether they guess the hypothesis correctly, only that their behavior
changes.
Evaluation Apprehension
This particular threat is based upon the tendency of humans to act differently when under
pressure. Individual testing is notorious for bringing on an adrenalin rush, and this can
improve or hinder performance.
Researcher Expectancy
Researchers are only human and may give cues that influence the behavior of the subject.
Humans give cues through body language, and subconsciously smiling when the subject
gives a correct answer, or frowning at an undesirable response, all have an effect.
This effect can lower construct validity by clouding the effect of the actual research variable.
To reduce this effect, interaction should be kept to a minimum, and assistants should be
unaware of the overall aims of the project.
See also:
Double Blind Experiment
Research Bias
Inexact Definitions
Construct validity is all about semantics and labeling. Defining a construct in too broad or too narrow terms can invalidate the entire experiment.
For example, a researcher might try to use job satisfaction to define overall happiness. This is
too narrow, as somebody may love their job but have an unhappy life outside the workplace.
Equally, using general happiness to measure happiness at work is too broad. Many people
enjoy life but still hate their work!
Mislabeling is another common definition error: stating that you intend to measure depression,
when you actually measure anxiety, compromises the research.
The best way to avoid this particular threat is with good planning and seeking advice before
you start your research program.
Construct Confounding
This threat to construct validity occurs when other constructs mask the effects of the
measured construct.
For example, self-esteem is affected by self-confidence and self-worth. The effect of these
constructs needs to be incorporated into the research.
Interaction of Different Treatments
This particular threat is where more than one treatment influences the final outcome. For example, imagine a researcher testing a program intended to help subjects stop smoking. Sadly, the researcher then finds that some of the subjects also used nicotine patches and gum, or electronic cigarettes. The construct validity is now too low for the results to have any meaning. Only good planning and monitoring of the subjects can prevent this.
Unreliable Scores
This threat arises when a test fails to produce consistent scores across the groups or settings it is used in. For example, an educational researcher devises an intelligence test that provides excellent results in the UK, and shows high construct validity. However, when the test is used upon immigrant children, with English as a second language, the scores are lower, and the test no longer measures the intended construct reliably.
Mono-Operation Bias
This threat involves the independent variable, and is a situation where a single manipulation is
used to influence a construct.
For example, a researcher may want to find out whether an anti-depression drug works. They
divide patients into two groups, one given the drug and a control given a placebo.
The problem with this is that a single manipulation is limited (subject to random sampling error, for example), and a solid design would use multiple groups given different doses.
The other option is to conduct a pre-study that calculates the optimum dose, an equally
acceptable way to preserve construct validity.
Mono-Method Bias
This threat to construct validity involves the dependent variable, and occurs when only a
single method of measurement is used.
For example, in an experiment to measure self-esteem, the researcher uses a single method
to determine the level of that construct, but then discovers that it actually measures self-confidence.
Don't Panic
These are just a few of the threats to construct validity, and most experts agree that there are
at least 24 different types. These are the main ones, and good experimental design, as well
as seeking feedback from experts during the planning stage, will see you avoid them.
For the ‘hard' scientists, who think that social and behavioral science students have an easy
time, you could not be more wrong!
7.1 Convergent and Discriminant Validity
Convergent and discriminant validity are the two subtypes of construct validity. If a research program is shown to possess both of these types of validity, it can also be regarded as having excellent construct validity.
In many areas of research, mainly the social sciences, psychology, education and medicine,
researchers need to analyze non-quantitative and abstract concepts, such as level of pain,
anxiety or educational achievement.
A researcher needs to define exactly what trait they are measuring if they are to maintain
good construct validity.
Constructs very rarely exist independently, because the human brain is not a simple machine
and is made up of an interlinked web of emotions, reasoning and senses. Any research program must untangle these complex interactions and establish that it is only testing the desired construct.
This is practically impossible to prove beyond doubt, so researchers gather enough evidence
to defend their findings from criticism.
The basic difference between convergent and discriminant validity is that convergent validity
tests whether constructs that should be related, are related. Discriminant validity tests whether
believed unrelated constructs are, in fact, unrelated.
Imagine that a researcher wants to measure self-esteem, but she also knows that four related constructs - self-worth, confidence, social skills and self-appraisal - overlap with it. The ultimate goal is to make an attempt to isolate self-esteem.
In this example, convergent validity would test that the four other constructs are, in fact,
related to self-esteem in the study. The researcher would also check that self-worth and
confidence, and social skills and self-appraisal, are also related.
Discriminant validity would ensure that, in the study, the non-overlapping factors do not overlap. For example, self-esteem and intelligence should not relate (too much) in most research projects.
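In practice, this check often boils down to inspecting a correlation matrix: the constructs expected to overlap with self-esteem should correlate substantially with it, while a theoretically unrelated construct such as intelligence should not. A minimal sketch with fabricated standardized scores:

    import numpy as np

    # Fabricated standardized scores for six subjects on each construct.
    scores = {
        "self_esteem": np.array([0.8, 0.2, -0.5, 1.1, -1.0, 0.4]),
        "self_worth": np.array([0.7, 0.3, -0.4, 1.0, -0.9, 0.3]),
        "social_skills": np.array([0.5, 0.1, -0.6, 0.9, -0.8, 0.6]),
        "intelligence": np.array([-0.2, 1.3, 0.4, -0.7, 0.6, -0.9]),
    }

    esteem = scores["self_esteem"]
    for name, values in scores.items():
        if name != "self_esteem":
            r = np.corrcoef(esteem, values)[0, 1]
            print(f"self_esteem vs {name}: r = {r:+.2f}")

    # Convergent validity: high r with self_worth and social_skills.
    # Discriminant validity: low r with intelligence.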
As you can see, separating and isolating constructs is difficult, and it is one of the factors that
makes social science extremely difficult.
Social science rarely produces research that gives a yes or no answer, and the process of
gathering knowledge is slow and steady, building on top of what is already known.
How to cite this article:
Martyn Shuttleworth (Aug 21, 2009). Convergent and Discriminant Validity. Retrieved from
[Link]: [Link]
8 Face Validity
Face validity is a measure of how representative a research project is 'at face value,' and whether it appears to be a good project. It is built upon the principle of reading through the plans and assessing the viability of the research, with little objective measurement.
This 'common sense' approach often saves a lot of time, resources and stress.
Face validity is closely related to content validity; the difference is that content validity is carefully evaluated, whereas face validity is a more general measure, and the subjects often have input.
For example, after a group of students sat a test, you could ask for feedback, specifically whether they thought that the test was a good one. This enables refinements for the next research project and adds another dimension to establishing validity.
Face validity is classed as 'weak evidence' supporting construct validity, but that does not
mean that it is incorrect, only that caution is necessary.
For example, imagine a research paper about Global Warming. A layperson could read
through it and think that it was a solid experiment, highlighting the processes behind Global
Warming.
On the other hand, a distinguished climatology professor could read through it and find the
paper, and the reasoning behind the techniques, to be very poor.
This example shows the importance of face validity as a useful filter for eliminating shoddy research from the field of science, through peer review.
If Face Validity is so Weak, Why is it Used?
Especially in the social and educational sciences, it is very difficult to measure the content
validity of a research program.
Often, there are so many interlinked factors that it is practically impossible to account for them
all. Many researchers send their plans to a group of leading experts in the field, asking them if
they think that it is a good and representative program.
This face validity should be good enough to withstand scrutiny and helps a researcher to find
potential flaws before they waste a lot of time and money.
In the social sciences, it is very difficult to apply the scientific method, so experience and
judgment are valued assets.
Before any physical scientists think that this has nothing to do with their more quantifiable
approach, face validity is something that pretty much every scientist uses.
Every time you conduct a literature review, and sift through past research papers, you apply
the principle of face validity.
Although you might look at who wrote the paper, where the journal was from and who funded
it, ultimately, you ask 'Does this paper do what it sets out to?'
Martyn Shuttleworth (Mar 21, 2009). Face Validity. Retrieved from [Link]:
[Link]
9 Definition of Reliability
The definition of reliability, as given in 'The Free Dictionary', is "Yielding the same or compatible results in different clinical experiments or statistical trials".
In normal language, we use the word reliable to mean that something is dependable and that
it will give the same outcome every time. We might talk of a football player as reliable,
meaning that he gives a good performance game after game.
In science, the idea is the same, but the term needs a much narrower and unequivocal definition.
If you use three replicate samples for each manipulation, and one generates completely
different results from the others, then there may be something wrong with the experiment.
1. For many experiments, results follow a ‘normal distribution' and there is always a chance
that your sample group produces results lying at one of the extremes. Using multiple
sample groups will smooth out these extremes and generate a more accurate spread of
results.
2. If your results continue to be wildly different, then there is likely to be something very
wrong with your design; it is unreliable.
A good example of a failure to apply the definition of reliability correctly is provided by the cold fusion case of 1989. Fleischmann and Pons announced to the world that they had managed to generate nuclear fusion at normal temperatures, without the huge and expensive tori used in most fusion research.
This announcement shook the world, but researchers in many other institutions across the
world attempted to replicate the experiment, with no success. Whether the researchers lied, or
genuinely made a mistake is unclear, but their results were clearly unreliable.
Ecologists and social scientists, on the other hand, understand fully that achieving exactly the
same results is an exercise in futility. Research in these disciplines incorporates random
factors and natural fluctuations and, whilst any experimental design must attempt to eliminate
confounding variables and natural variations, there will always be some disparities.
The key to performing a good experiment is to make sure that your results are as reliable as is
possible; if anybody repeats the experiment, powerful statistical tests will be able to compare
the results and the scientist can make a solid estimate of statistical reliability.
A researcher devises a new test that measures IQ more quickly than the standard IQ test:
If the new test delivers scores for a candidate of 87, 65, 143 and 102, then the test is not
reliable or valid, and it is fatally flawed.
If the test consistently delivers a score of 100 when checked, but the candidate's real IQ is 120, then the test is reliable, but not valid.
If the researcher's test delivers a consistent score of 118, then that is pretty close, and
the test can be considered both valid and reliable.
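The distinction in the list above can be made concrete: reliability is about the spread of repeated scores, while validity is about how close their centre lies to the true value. A small sketch, assuming for illustration that the candidate's true IQ of 120 is known:

    import numpy as np

    true_iq = 120  # assumed known, purely for the sake of the example

    cases = {
        "neither reliable nor valid": np.array([87, 65, 143, 102]),
        "reliable, but not valid": np.array([100, 99, 101, 100]),
        "reliable and valid": np.array([118, 119, 117, 118]),
    }

    for label, scores in cases.items():
        spread = scores.std()                # low spread -> reliable
        bias = abs(scores.mean() - true_iq)  # small bias -> valid
        print(f"{label}: spread = {spread:.1f}, bias = {bias:.1f}")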
Reliability is an essential component of validity but, on its own, is not a sufficient measure of
validity. A test can be reliable but not valid, whereas a test cannot be valid yet unreliable.
Reliability, in simple terms, describes the repeatability and consistency of a test. Validity
defines the strength of the final results and whether they can be regarded as accurately
describing the real world.
For a researcher's results to be reliable, another researcher must be able to perform exactly the same experiment on another group of people and generate results with the same statistical
significance. If repeat experiments fail, then there may be something wrong with the original
research.
The Test-Retest Method
The test-retest method is the simplest method for testing reliability, and involves testing the same subjects at a later date, ensuring that there is a correlation between the results. An educational test retaken after a month should yield the same results as the original.
The difficulty with this method is that it assumes that nothing has changed in that time period.
Staying with education, if you administer exactly the same test, the student may perform much better the second time, because they remember the questions and have had time to think about them.
How many times have you left an exam and, after a couple of hours, thought; “How could I
have been so stupid - I knew the answer to that one!” Of course, next time, you will get that
question right, meaning that the test is unreliable.
For this reason, if you have to retake an exam, you will be faced with different questions and
may be marked a little more strictly to take into account that you had extra time to revise. This
is not the complete picture, because the two exams will need to be compared, to ensure that
they produce the same results. This shows the importance of reliability in our lives and also
highlights the fact that there is no easy way to test it.
Internal Consistency
The internal consistency test compares two different versions of the same instrument, to
ensure that there is a correlation and that they measure the same thing.
For example, sticking with exams, imagine that an examining board wants to test that its new
mathematics exam is reliable, and selects a group of test students. For each section of the
exam, such as calculus, geometry, algebra and trigonometry, they actually ask two questions,
designed to measure the aptitude of the student in that particular area.
If there is a high internal consistency, and the results for the two sets of questions are similar,
then the new test is likely to be reliable. Whereas the test-retest method involves two separate administrations of the same instrument, internal consistency measures two different versions at the same time.
A horribly complicated statistical formula, called Cronbach's Alpha, tests the reliability and compares the various pairs of questions; luckily, computer programs take care of that and spit out a single number, telling you exactly how reliable the test is!
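For k items, Cronbach's Alpha is alpha = k/(k-1) x (1 - sum of item variances / variance of total scores). A minimal sketch of that calculation, with invented exam data:

    import numpy as np

    def cronbach_alpha(item_scores):
        """item_scores: 2-D array, rows = students, columns = questions."""
        items = np.asarray(item_scores, dtype=float)
        k = items.shape[1]
        item_variances = items.var(axis=0, ddof=1).sum()
        total_variance = items.sum(axis=1).var(ddof=1)
        return (k / (k - 1)) * (1 - item_variances / total_variance)

    # Hypothetical scores of five students on four paired exam questions.
    scores = [
        [8, 7, 9, 8],
        [5, 6, 5, 6],
        [9, 9, 8, 9],
        [4, 5, 4, 4],
        [7, 6, 7, 7],
    ]
    print(f"Cronbach's alpha = {cronbach_alpha(scores):.2f}")

    # Values above roughly 0.7 are conventionally taken to indicate
    # acceptable internal consistency.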
For this reason, extensive research programs always involve a number of pre-tests, ensuring
that all of the instruments used are consistent. Even physical scientists perform instrumental
pretests, ensuring that all of their measuring equipment is calibrated against established
standards.
Martyn Shuttleworth (Aug 22, 2009). Definition of Reliability. Retrieved from [Link]:
[Link]
10 Test–Retest Reliability
The test-retest reliability method is one of the simplest ways of testing the stability and
reliability of an instrument over time.
For example, if a group of students takes a test, you would expect them to show very similar
results if they take the same test a few months later. This definition relies upon there being no
confounding factor during the intervening time interval.
Instruments such as IQ tests and surveys are prime candidates for test-retest methodology,
because there is little chance of people experiencing a sudden jump in IQ or suddenly
changing their opinions.
On the other hand, educational tests are often not suitable, because students will learn much
more information over the intervening period and show better results in the second test.
If, on the other hand, the test and retest are taken at the beginning and at the end of the
semester, it can be assumed that the intervening lessons will have improved the ability of the
students. Thus, test-retest reliability will be compromised and other methods, such as split
testing, are better.
Even if a test-retest reliability process is applied with no sign of intervening factors, there will
always be some degree of error. There is a strong chance that subjects will remember some
of the questions from the previous test and perform better.
Some subjects might just have had a bad day the first time around or they may not have taken
the test seriously. For these reasons, students facing retakes of exams can expect to face
different questions and a slightly tougher standard of marking to compensate.
Even in surveys, it is quite conceivable that there may be a big change in opinion. People may
have been asked about their favourite type of bread. In the intervening period, if a bread
company mounts a long and expansive advertising campaign, this is likely to influence opinion
in favour of that brand. This will jeopardise the test-retest reliability, and so the analysis must be handled with caution.
A perfect correlation of 1.0 is impossible in practice, and most researchers accept a lower level, either 0.7, 0.8 or 0.9, depending upon the particular field of research.
However, this cannot remove confounding factors completely, and a researcher must
anticipate and address these during the research design to maintain test-retest reliability.
To dampen down the chances of a few subjects skewing the results, for whatever reason, the
test for correlation is much more accurate with large subject groups, drowning out the
extremes and providing a more accurate result.
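In essence, a test-retest check is a correlation between two administrations, compared against the field's accepted threshold. A brief sketch with made-up scores and an assumed threshold of 0.8:

    from scipy.stats import pearsonr

    # Hypothetical scores from the same ten subjects, months apart.
    first_admin = [72, 85, 64, 90, 78, 69, 88, 75, 81, 67]
    second_admin = [70, 88, 66, 87, 80, 71, 90, 74, 79, 65]

    r, _ = pearsonr(first_admin, second_admin)
    threshold = 0.8  # field-dependent; 0.7 to 0.9 is typical, as noted above

    verdict = "acceptable" if r >= threshold else "questionable"
    print(f"test-retest r = {r:.2f} -> {verdict}")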
10.1 Reproducibility
The basic principle is that, for any research program, an independent researcher should be
able to replicate the experiment, under the same conditions, and achieve the same results.
This gives a good guide to whether there were any inherent flaws within the experiment and
ensures that the researcher paid due diligence to the process of experimental design.
A replication study ensures that the researcher constructs a valid and reliable methodology
and analysis.
If a type of measuring device has a design flaw, then it is likely that this artefact will be
apparent in all models.
An astronomer measuring the spectrum of a star notes down the instruments and methodology used, and an independent researcher should be able to achieve exactly the same results. Even in biochemistry, where naturally variable living organisms are used, good research shows remarkably little variation.
However, the social sciences, ecology and environmental science are a much more difficult case. Organisms can show a huge amount of variation, making it difficult to replicate research exactly; instead, researchers attempt to make each experiment as reproducible as possible, ensuring that they can defend their position.
In addition, these sciences have to make much more use of statistics to dampen down
experimental noise caused by physiological and psychological differences between the
subjects.
This is one of the reasons why most social sciences accept a 95% probability level, which is a
contrast to the 99% confidence required by most physical sciences.
In any study, there is a far smaller chance of finding confounding evidence if the claims are
narrowly defined than if they are sweeping generalizations.
For example, a psychologist who found that aggression in children under the age of five
increased if they watched violent TV could generalize that all children under five would
display the same behaviour.
Extending this to all children means that the experiment is prone to replication issues: a
researcher finding that aggression did not increase in nine-year-old children would invalidate
the entire premise by questioning its reproducibility.
The Framingham Heart Study, a long-running study following three generations of residents of
Framingham, Massachusetts, for cardiac issues, has been going on for over 60 years, and
nobody is seriously expected to replicate it. Instead, results from other studies around the
world are used to build up a database of statistical evidence supporting the findings.
The rise of the Intelligent Design movement has seen evolutionary science come under attack,
because creationists claim that evolution is not reproducible and, therefore, not valid. This
has opened up an intense debate about the role of replication studies because, for example, a
geologist cannot very well recreate conditions found on the primordial earth and observe
rocks metamorphosing.
However, creationists misunderstand the idea of reproducibility when they assume that it
applies to an entire theory. It does not; replicating research applies only to a specific
experiment or observation.
Imagine, for example, that I survey an outcrop in the field and record the orientation of the
rock strata. A more talented geologist than I later travels to exactly the same place and points
out that the rocks there are deformed and twisted 180 degrees, so my observations were the
wrong way around. My field study was nevertheless reproducible, in that another researcher
could come and try to replicate my observations.
Looking at the process from the other angle, imagine that an astronomer discovers a planet
circling around a distant star. Nobody is suggesting that he builds a gaseous cloud and waits
a few billion years for matter to accrete and an identical solar system to form, because that
would be absurd.
Performing a replication study would involve other astronomers observing the star to try to find
the planets, showing that there really are planets and that the original astronomer had no
equipment malfunction.
Creationism
When Arthur Evans discovered Knossos, on Crete, and proposed that there was an ancient,
advanced Minoan civilization, nobody suggested that he should recreate such a civilization
and see if they built an identical city. Absurd as it may seem, this is the type of assumption
that proponents of Creationism make.
Looking at this process in reverse, if a team of builders builds an exact replica of Knossos, it
does not prove that such a civilization existed, although it would be a useful exercise in
looking at some of the techniques used by ancient builders, allowing archaeologists to refine
their ideas. To suggest otherwise really is a deliberate misunderstanding and warping of the
scientific method.
Ultimately, if Creationists use the argument that evolution is wrong because it is not
reproducible, then they destroy their own argument. If evolutionary processes cannot be
subjected to replicable research, neither can Intelligent Design, so their argument founders on
its own presumptions. Surely, proponents of ID need to recreate the six days of Genesis
before their ideas can be accepted by science!
10.2 Replication Study
A replication study involves repeating a study using the same methods but with
different subjects and experimenters.
The researchers will apply the existing theory to new situations in order to determine
generalizability to different subjects, age groups, races, locations, cultures or any such
variables.
Suppose you are part of a healthcare team facing a problem, for instance regarding the use
and efficacy of a certain painkiller in patients before surgery. You search the literature for the
same problem and identify an article addressing exactly this problem.
The question then arises: how can you be sure that the results of the study in hand are
applicable and transferable to your clinical setting? You therefore decide to prepare and
implement a replication study. By deliberately repeating the previous research procedures in
your own clinical setting, you can strengthen the evidence for the previous findings and
correct their limitations; the overall results may support the previous study, or you may find
completely different results.
How do you decide whether a replication study can be carried out? The following criteria have
been proposed for replicating an original study:
- The original research question is important and can contribute to the body of knowledge
supporting the discipline.
- The existing literature and policies relating to the topic support its relevance.
- The replication study, if carried out, has the potential to empirically support the results of
the original study, either by clarifying issues raised by the original study or by extending its
generalizability.
- The team of researchers has expertise in the subject area, and has access to adequate
information about the original study, so that it can design and execute a replication.
- Any extension or modification of the original study can be based on current knowledge in
the same field.
- Lastly, it is possible to replicate the original study with the same rigour.
Field conditions offer researchers opportunities that are not open to investigations in
laboratory settings.
Laboratory investigators commonly have only a small number of potential participants for their
research trials. In applied settings such as schools, classrooms and hospitals, however, large
numbers of participants are often readily available.
It is therefore possible in field settings to repeat or replicate a piece of research on a large
scale, and more than once.
11 Interrater Reliability
For any research program that requires qualitative rating by different researchers, it is
important to establish a good level of interrater reliability, also known as interobserver
reliability.
This ensures that the generated results meet the accepted criteria defining reliability, by
quantitatively defining the degree of agreement between two or more observers.
For example, any sport judged by humans, such as Olympic ice skating or a dog show, relies
upon observers maintaining a great degree of consistency between themselves. If even one of
the judges is erratic in their scoring, this can jeopardize the entire system and deny a
participant their rightful prize.
Outside the world of sport and hobbies, inter-rater reliability has some far more important
connotations and can directly influence your life.
Examiners marking school and university exams are assessed on a regular basis, to ensure
that they all adhere to the same standards. This is the most important example of
interobserver reliability - it would be extremely unfair to fail an exam because the observer
was having a bad day.
For most examination boards, appeals are usually rare, showing that the interrater reliability
process is fairly robust.
Imagine, for example, that three of us needed to estimate the size of a flock of wading birds,
such as dunlin. Obviously, you cannot count thousands of birds individually; apart from the
huge numbers, they constantly move, leaving and rejoining the group. Using experience, we
each estimated the numbers independently and then compared our estimates.
If one person estimated 1000 dunlin, one 4000 and the other 12000, then there was
something wrong with our estimation and it was highly unreliable.
If, however, we independently came up with figures of 4000, 5000 and 6000, then that was
accurate enough for our purposes, and we knew that we could use the average with a good
degree of confidence.
One good example is Bandura's Bobo Doll experiment, which used a scale to rate the levels
of displayed aggression in young children. Apart from extensive pre-testing, the observers
constantly compared and calibrated their ratings, adjusting their scales to ensure that they
were as similar as possible.
Experience is also a great teacher; researchers who have worked together for a long time will
be fully aware of each other's strengths, and will be surprisingly similar in their observations.
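As a rough sketch of how such agreement can be quantified, the example below computes
Cohen's kappa, one common statistic among several, which compares the agreement two
raters actually achieve with the agreement expected by chance. The ratings are invented:

    # Interrater reliability via Cohen's kappa for two raters assigning
    # categorical ratings (e.g. "low", "mid" or "high" aggression).
    from collections import Counter

    def cohens_kappa(rater_a, rater_b):
        n = len(rater_a)
        # Proportion of cases where the raters actually agree.
        observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
        # Agreement expected by chance, from each rater's own frequencies.
        freq_a, freq_b = Counter(rater_a), Counter(rater_b)
        expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / n ** 2
        return (observed - expected) / (1 - expected)

    a = ["low", "mid", "high", "mid", "low", "high", "mid", "low"]
    b = ["low", "mid", "mid",  "mid", "low", "high", "high", "low"]
    print(f"kappa = {cohens_kappa(a, b):.2f}")

A kappa near zero means the raters agree no more often than chance would predict; values
approaching one indicate strong interrater reliability.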
Martyn Shuttleworth (Aug 16, 2009). Interrater Reliability. Retrieved from [Link]:
[Link]
12 Internal Consistency Reliability
Internal consistency reliability defines the consistency of the results delivered in a test,
ensuring that the various items measuring the different constructs deliver consistent
scores.
For example, an English test is divided into vocabulary, spelling, punctuation and grammar.
The internal consistency reliability test provides a measure that each of these particular
aptitudes is measured correctly and reliably.
One way of testing this is by using a test-retest method, where the same test is administered
some time after the initial test and the results compared.
However, this creates some problems and so many researchers prefer to measure internal
consistency by including two versions of the same instrument within the same test. Our
example of the English test might include two very similar questions about comma use, two
about spelling and so on.
The basic principle is that the student should give the same answer to both - if they do not
know how to use commas, they will get both questions wrong. A few nifty statistical
manipulations will give the internal consistency reliability and allow the researcher to evaluate
the reliability of the test.
There are three main techniques for measuring the internal consistency reliability, depending
upon the degree, complexity and scope of the test.
They all check that the results and constructs measured by a test are consistent, and the
exact type used is dictated by subject, size of the data set and resources.
Split-Halves Test
The split halves test for internal consistency reliability is the easiest type, and involves dividing
a test into two halves.
For example, a questionnaire to measure extroversion could be divided into odd and even
questions. The results from both halves are statistically analysed, and if there is weak
correlation between the two, then there is a reliability problem with the test.
The split-halves test gives a measurement of between zero and one, with one meaning a
perfect correlation.
The division of the questions into the two sets must be random. Split-halves testing was a
popular way to measure reliability because of its simplicity and speed.
However, in an age where computers can take over the laborious number crunching,
scientists tend to use much more powerful tests.
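A minimal sketch of the calculation follows; the item scores are invented, and the final step
applies the Spearman-Brown correction, which the text above does not mention but which is
standard practice because each half is only half the length of the full test:

    # Split-half reliability: sum each subject's odd items and even items,
    # correlate the two half-scores, then step the correlation up to
    # full-test length with the Spearman-Brown formula.
    from scipy.stats import pearsonr

    # Rows = subjects, columns = items scored 1 (correct) or 0 (wrong).
    items = [
        [1, 0, 1, 1, 0, 1, 1, 1],
        [0, 0, 1, 0, 1, 0, 0, 1],
        [1, 1, 1, 1, 1, 0, 1, 1],
        [0, 1, 0, 0, 0, 1, 0, 0],
        [1, 1, 1, 0, 1, 1, 1, 0],
    ]

    odd_half  = [sum(row[0::2]) for row in items]   # items 1, 3, 5, 7
    even_half = [sum(row[1::2]) for row in items]   # items 2, 4, 6, 8

    r_half, _ = pearsonr(odd_half, even_half)
    r_full = 2 * r_half / (1 + r_half)              # Spearman-Brown step-up
    print(f"half-test r = {r_half:.2f}, full-test reliability = {r_full:.2f}")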
Kuder-Richardson Test
The Kuder-Richardson test for internal consistency reliability is a more advanced, and slightly
more complex, version of the split halves test.
In this version, the test works out the average correlation for all the possible split half
combinations in a test. The Kuder-Richardson test also generates a correlation of between
zero and one, with a more accurate result than the split halves test. The weakness of this
approach, as with split-halves, is that the answer for each question must be a simple right or
wrong answer, zero or one.
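In practice, the Kuder-Richardson result is computed directly from a formula known as KR-20
rather than by crunching every possible split. A minimal sketch, with invented right/wrong
scores and the population variance of total scores:

    # KR-20 for dichotomous (right/wrong) items:
    #   KR20 = k/(k-1) * (1 - sum(p*q) / variance of total scores)
    # where p is the proportion answering an item correctly and q = 1 - p.
    from statistics import pvariance

    def kr20(items):
        n = len(items)                            # number of subjects
        k = len(items[0])                         # number of questions
        totals = [sum(row) for row in items]      # each subject's total score
        pq = 0.0
        for j in range(k):
            p = sum(row[j] for row in items) / n  # proportion correct on item j
            pq += p * (1 - p)
        return (k / (k - 1)) * (1 - pq / pvariance(totals))

    scores = [
        [1, 0, 1, 1, 0, 1],
        [0, 0, 1, 0, 1, 0],
        [1, 1, 1, 1, 1, 1],
        [0, 1, 0, 0, 0, 1],
    ]
    print(f"KR-20 = {kr20(scores):.2f}")          # prints KR-20 = 0.60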
Cronbach's Alpha
Cronbach's Alpha extends this approach to tests where answers are not simply right or wrong;
for example, a series of questions might ask the subjects to rate their response between one
and five. Cronbach's Alpha gives a score of between zero and one, with 0.7 generally
accepted as a sign of acceptable reliability.
The test also takes into account both the length of the test and the number of potential
responses. A 40-question test with possible ratings of one to five is seen as having more
accuracy than a ten-question test with three possible levels of response.
Of course, even with Cronbach's clever methodology, which makes calculation much simpler
than crunching through every possible permutation, this is still a test best left to computers
and statistics spreadsheet programmes.
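Still, the formula itself is short. A minimal sketch with invented ratings on a one-to-five scale:

    # Cronbach's Alpha for items with a range of responses:
    #   alpha = k/(k-1) * (1 - sum of item variances / variance of totals)
    from statistics import pvariance

    def cronbach_alpha(items):
        k = len(items[0])                         # number of items
        totals = [sum(row) for row in items]      # per-subject total scores
        item_vars = [pvariance([row[j] for row in items]) for j in range(k)]
        return (k / (k - 1)) * (1 - sum(item_vars) / pvariance(totals))

    # Rows = subjects, columns = items rated on a 1-5 scale.
    ratings = [
        [4, 5, 4, 3],
        [2, 3, 2, 2],
        [5, 5, 4, 5],
        [3, 2, 3, 3],
        [4, 4, 5, 4],
    ]
    print(f"alpha = {cronbach_alpha(ratings):.2f}")   # prints alpha = 0.91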
Summary
Internal consistency reliability is a measure of how well a test addresses different constructs
and delivers reliable scores. The test-retest method involves administering the same test,
after a period of time, and comparing the results.
By contrast, measuring the internal consistency reliability involves measuring two different
versions of the same item within the same test.
Martyn Shuttleworth (Apr 26, 2009). Internal Consistency Reliability. Retrieved from
[Link]: [Link]
13 Instrument Reliability
Instrument reliability is a way of ensuring that any instrument used for measuring
experimental variables gives the same results every time.
In the physical sciences, the term is self-explanatory, and it is a matter of making sure that
every piece of hardware, from a mass spectrometer to a set of weighing scales, is properly
calibrated.
Instruments in Research
Some highly accurate balances can give false results if they are not placed upon a completely
level surface, so calibration is the best way to avoid such errors.
Political opinion polls, on the other hand, are notorious for producing inaccurate results and
delivering a near unworkable margin of error.
In the physical sciences, it is possible to isolate a measuring instrument from external factors,
such as environmental conditions and temporal factors. In the social sciences, this is much
more difficult, so any instrument must be tested to establish a reasonable range of reliability.
Test of Stability
Any test of instrument reliability must test how stable the test is over time, ensuring that the
same test performed upon the same individual gives exactly the same results.
The test-retest method is one way of ensuring that any instrument is stable over time.
Of course, there is no such thing as perfection; there will always be some disparity and
potential for regression, so statistical methods are used to determine whether the stability of
the instrument is within acceptable limits.
Test of Equivalence
Testing equivalence involves ensuring that a test administered to two people, or similar tests
administered at the same time, gives similar results.
Split-testing is one way of ensuring this, especially in tests or observations where the results
are expected to change over time. In a school exam, for example, the same test upon the
same subjects will generally result in better results the second time around, so testing stability
is not practical.
Checking that two researchers observe similar results also falls within the remit of the test of
equivalence.
Test of Internal Consistency
Internal consistency checks that every part of an instrument contributes to the construct being
measured. For example, a test of IQ should measure IQ only, and every single question must
also contribute. One way of checking this is with variations upon the split-half test, where the
test is divided into two sections that are checked against each other. Odd-even reliability is a
similar method used to check internal consistency.
Physical sciences often use tests of internal consistency, and this is why sports drugs testers
take two samples, each measured independently by different laboratories, to ensure that
experimental or human error did not skew or influence the results.
Martyn Shuttleworth (Apr 16, 2009). Instrument Reliability. Retrieved from [Link]:
[Link]