Review of Educational Research
Winter 1975, Vol. 45, No. 1, pp. 43-57
Scales and Statistics
Paul Leslie Gardner
Monash University,
Victoria, Australia
Over a quarter of a century has elapsed since
the publication of Stevens' (1946) influential paper on the rela-
tionship between scales of measurement and appropriate statis-
tics; nevertheless, the debate is still continuing. The central issue
in the debate is the nature of the relationship between ordinal and
interval scales, and parametric and nonparametric statistics. A
related and equally important issue is the extent to which scales
can be unerringly classified. The distinction between ordinal and
interval scales is not a black-and-white distinction: many kinds of
summated scales occupy a grey region somewhere in between.
Stevens' and Siegel's Position
Stevens (1946) distinguishes four scales of measurement: nomi-
nal, ordinal, interval, and ratio. It is the distinction between
ordinal and interval scales that is the most frequent source of
discussion in the psychometric literature. An ordinal scale is one
in which the objects are empirically ordered in the same way as
the set of numbers assigned to those objects. Measurements of
ordinal strength, therefore, place the set of objects being mea-
sured in rank order. No meaning can be attached to the size of the
interval between the measurements. No meaning can be attached
to the shape of the frequency distribution of the set of measure-
ments. Measurements of interval strength, on the other hand, do
The author gratefully acknowledges the helpful comments by the RER referees
and by Professor H. J. Walberg on an earlier version of this paper.
43
REVIEW OF EDUCATIONAL RESEARCH Vol. 45, No. 1
more than rank the set of objects. Interval scales 1 possess equality
of units over different parts of the scale; relative sizes of the in-
tervals between pairs of measurements may be interpreted mean-
ingfully; meaning can be attached to the shape of the particular
frequency distribution. 2
Associated with each type of scale is a unique set of permissible
mathematical transformations that leaves the scale invariant.
For example, linear transformations of interval scales are per-
missible, since the interval properties of such a scale would be
undisturbed. Likewise, nonlinear transformations of ordinal
scales are also permissible, but nonlinear transformations of
interval scales are not. In general, an ordinal scale can be
transformed by any monotonic increasing function, an interval
scale by a linear transformation.
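The distinction can be made concrete in a short sketch (the scores are invented for illustration): a monotonic but nonlinear transformation preserves rank order while distorting relative interval sizes, whereas a linear transformation a*x + b (a > 0) preserves both.

```python
import math

scores = [3.0, 5.0, 9.0]   # hypothetical interval-strength measures

# Monotonic (but nonlinear) transformation: rank order survives...
mono = [math.sqrt(x) for x in scores]
assert sorted(mono) == mono

# ...but the relative sizes of the intervals do not.
orig_ratio = (scores[1] - scores[0]) / (scores[2] - scores[1])   # 2/4 = 0.5
mono_ratio = (mono[1] - mono[0]) / (mono[2] - mono[1])
assert abs(orig_ratio - mono_ratio) > 1e-6

# Linear transformation a*x + b (a > 0): interval ratios are invariant.
lin = [2.0 * x + 7.0 for x in scores]
lin_ratio = (lin[1] - lin[0]) / (lin[2] - lin[1])
assert abs(orig_ratio - lin_ratio) < 1e-12
```

This is the operational sense in which interval information is "undisturbed" by a linear transformation but destroyed by a merely monotone one.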
The point of making the distinction between ordinal and inter-
val scales, in Stevens' view, is that different statistical procedures
are appropriate in dealing with the two types of scale. Stevens
regards a particular statistical procedure as appropriate only if
the statistic being computed remains invariant under transfor-
mations that leave the scale of measurement invariant. Anderson
(1961) provides a useful clarification of this idea:
It means that if a statistic is computed from a set of scale
values and this statistic is then transformed, the identical
result will be obtained as when the separate scale values
are transformed and the statistic is then computed from
these transformed scale values. (p. 309)

[1] In the present discussion this term is synonymous with "equal-interval
scales." However, unequal-interval scales can also exist. For example, in a
logarithmic scale to base 10, each successive interval is ten times as large as
the preceding one. Unequal-interval scales can always be transformed into
equal-interval scales by an appropriate mathematical function.

[2] The fact that interval scales permit meaningful assertions to be made about
the shapes of distributions, whereas ordinal scales do not, can be illustrated by
contrasting two physical measurements, mineral hardness and people's height.
The hardness of various minerals can be quantitatively described by comparing
them with a standard reference set of ten minerals which have been assigned
numbers from 1 (talc) to 10 (diamond). These ten constitute the Mohs scale of
hardness, a purely ordinal scale. Suppose that we now collected samples of a
hundred different minerals found in a particular region, measured their hard-
nesses, and stated that the measures were normally distributed about some cen-
tral value. As a statement about the properties of the numbers, such an assertion is
perfectly acceptable. However, the apparently similar statement that the
hardnesses are normally distributed would be a meaningless assertion, since the
shape of the distribution depends upon the choice of reference minerals, and the
numbers assigned to them. A different reference set, and a different set of ordinal
numbers, might well have resulted in an entirely differently shaped distribution.
In contrast, consider a set of measurements of the heights of a random sample of
adult males. Again, we might note that the measures are normally distributed, but
this time we can go further and meaningfully assert that the heights are normally
distributed. In other words, the shape of the distribution is not merely a
description of a set of numbers, but a description of the underlying physical
concept as well. This transition from meaningful assertions about numbers to
meaningful assertions about concepts is possible because of the nonarbitrary
nature of the intervals in an interval scale. In an ordinal scale, the intervals are
arbitrary.
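Anderson's invariance criterion lends itself to a quick sketch (the numbers are invented): the mean commutes with linear transformations, which makes it "appropriate" for interval scales, but not with merely monotone ones, whereas the median commutes with any monotone increasing transformation.

```python
import statistics

data = [2.0, 3.0, 7.0, 10.0, 50.0]

def linear(x):       # permissible transformation for interval scales
    return 3.0 * x + 4.0

def monotone(x):     # permissible transformation for ordinal scales
    return x ** 3

# Transforming the mean equals taking the mean of transformed values
# for a linear transformation...
assert abs(linear(statistics.mean(data))
           - statistics.mean([linear(x) for x in data])) < 1e-9

# ...but not for a merely monotone one.
assert abs(monotone(statistics.mean(data))
           - statistics.mean([monotone(x) for x in data])) > 1.0

# The median commutes with any monotone increasing transformation.
assert monotone(statistics.median(data)) == statistics.median(
    [monotone(x) for x in data])
```

On Anderson's criterion, then, the mean is invariant over the transformations an interval scale permits, while the median is invariant over the wider class an ordinal scale permits.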
This argument leads directly to a classification of statistical
procedures into two categories: parametric and nonparametric.
Parametric statistics (e.g., t-tests and F-tests) are so called
because they require the estimation of at least one parameter (i.e.,
population value). The derivation of these statistics assumes that
the samples being compared are drawn from a population that is
normally distributed. The frequency distributions of the samples
are assumed to be meaningful, and the measures are assumed to
have interval strength. Nonparametric statistics (e.g., analysis of
variance by ranks) do not require the estimation of population
values; no assumptions are made about the equivalence of units
along the scale, or about the shape of the distribution of scores in
the population. The absence of any such assumptions leads to two
alternative names for nonparametric statistics: distribution-free
methods [3] and ranking tests.
The association of interval scales with parametric statistics has
generated the central thesis of a book by Siegel (1956), who
explicitly proscribes the use of parametric statistics for ordinal
scale data. If Siegel is correct, parametric procedures, such as
t-tests, product-moment correlations, factor analysis and analysis
of variance, simply have no place in the statistical armory of any
researcher whose data possess less than interval strength. This
would effectively prevent most researchers in the behavioral
sciences from using parametric statistics.
This discussion raises two questions:
1. Given a particular measuring instrument, how does one
decide upon its strength?
2. Having decided upon its strength, what kinds of statistical
procedure are appropriate in dealing with the data?
These two questions belong to two quite distinct domains of
enquiry. The first belongs to the psychometric domain: an answer
to this question involves empirical matters related to the substan-
tive variable being measured and to the properties of the scale
used to measure that variable. The second question belongs to the
statistical domain: to answer it, a researcher needs (a) to know the
assumptions underlying the statistical procedure he wishes to
employ; (b) to evaluate the importance of these assumptions (i.e.,
the extent to which he can violate them and still use the
procedure); and (c) to analyse whether the assumptions have been
met in his particular research. These two questions are sometimes
confused in discussions about scales and statistics, but it should be
emphasized that they are quite distinct: the question of whether a
particular method of statistical inference is appropriate when
applied to a particular set of numbers is not dependent on what
the numbers actually mean. This point will be elaborated later.

[3] The terms "distribution-free" and "nonparametric" are not exactly synony-
mous. For example, the median test, a distribution-free statistic, uses an estimate
of the population median, a parameter.
The first question is normally answered by appealing to either
or both of two criteria: the mode of construction of the instrument,
or the distribution of scores it yields.
1. The mode of construction of the instrument
a. If a test is composed of items of similar difficulty and if the
items all display moderate positive correlations with one
another, then it could be argued that the measures possess
interval strength.
b. If a test is constructed by psychophysical scaling methods
(Stevens, 1951), then, it is argued, the measures possess
interval strength.
2. The distribution of scores yielded by the instrument
When this criterion is adopted, the argument runs as
follows. Many biophysical properties (e.g., height, weight,
reaction time), known to have ratio strength (and hence, a
fortiori, interval strength), result in a normal distribution
of measures when a large sample of the population is
measured. If, the argument continues, a scale yields a
normal distribution, then its strength must be interval.
Logically, this argument is quite indefensible. It is a classic
illustration of the fallacy of affirming the consequent. ("All cats
have four legs; my cocker spaniel has four legs, therefore it is a
cat.") Nevertheless, the normal distribution argument has been
maintained as a convenient fiction, mainly by psychometricians
who wished to give nodding recognition to Stevens' and Siegel's
position and still use parametric statistics.
Thus, an achievement test constructed out of moderately difficult
and positively discriminating items would be regarded as interval
under criterion 1a; a Thurstone-type differential scale measuring
attitudes would appeal to criterion 1b; IQ tests frequently appeal
to criterion 2. Scales such as these, it is argued,
are interval (or as close to interval as it is ever possible to get in
psychometrics). For such scales, Question 2 is easily answered:
parametric statistics are appropriate.
If one accepts Stevens' and Siegel's position, scales not meeting
these criteria would be classified as ordinal, and nonparametric
procedures should be adopted. For example, rating scales and
ranking procedures would be considered ordinal.
Later Debate
During the last decades, however, an extensive literature has
developed around the issue of scale strength and appropriate
statistics, and the answers to Questions 1 and 2 are no longer so
easily given. The counter-argument to Stevens' and Siegel's
position turns around two issues:
1. whether the distinction between ordinal and interval scales
can be easily made; and, more importantly,
2. whether the requirements assumed to be necessary for
parametric statistics to be used appropriately are, in fact, necessary.
On the first issue, Abelson and Tukey (1959; also cited in Tufte,
1970) argue that
the typical state of knowledge short of metric information
is not rank-order information; ordinarily, one possesses
something more than rank-order information. For example,
one may know that X1, X2, and X3 are ordered and in
addition that X2 is closer to X1 than it is to X3. (Tufte,
1970, pp. 407-408)
Gaito (1960) points out that the same data may be considered to
have the properties of two or more scales, depending on the
context in which it is considered. Responses to a single test item
(right/wrong) are purely of nominal strength, yet combining many
items yields a total score of at least ordinal strength. Gaito argues
that
This is similar to the situation of approximating the
binomial distribution (a distribution having nominal, or
possibly ordinal, scale properties) by the normal distribution
(a distribution having at least ordinal, possibly interval,
scale properties) when n increases and p is close to 0.5.
Another example is the relation between the sign test and
the binomial distribution test. For a given set of data these
two methods will give the same result; yet Siegel lists the
former as of an ordinal nature, the latter as nominal.
(pp. 277-278)
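Gaito's sign-test point can be checked directly: the sign test on paired data is arithmetically the binomial test with p = 0.5 applied to the signs of the differences, so the same numbers yield the same p-value whichever scale label is attached. A minimal sketch with invented paired scores:

```python
from math import comb

def binom_two_sided_p(k, n, p=0.5):
    """Two-sided binomial p-value: sum of outcomes no more likely than k."""
    probs = [comb(n, i) * p**i * (1 - p)**(n - i) for i in range(n + 1)]
    return sum(q for q in probs if q <= probs[k] + 1e-12)

before = [10, 12, 9, 14, 11, 13, 10, 15]    # hypothetical paired scores
after  = [12, 14, 10, 15, 13, 16, 11, 18]

diffs = [a - b for a, b in zip(after, before)]
n_pos = sum(d > 0 for d in diffs)           # number of positive differences
n = sum(d != 0 for d in diffs)              # ties dropped, as in the sign test

p_value = binom_two_sided_p(n_pos, n)       # identical to the sign-test p-value
```

With all eight differences positive, both labels lead to the same computation: 2 x (1/2)^8, or about .008.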
In short, the distinction between types of scales (and, by implication,
appropriate statistics) is by no means clear-cut.
The second issue, whether the requirements assumed to be
necessary for the proper use of parametric statistics are in fact
necessary, is the more important, and the more complex, issue.
Siegel (1956) lists five assumptions that underlie the use of
parametric tests such as the t and F tests for making comparisons
between sets of experimental data:
1. The observations must be independent. That is, the
selection of any one case from the population for inclu-
sion in the sample must not bias the chances of any
other case for inclusion, and the score which is as-
signed to any one case must not bias the score which
is assigned to any other case.
2. The observations must be drawn from normally distrib-
uted populations.
3. These populations must have the same variance (or, in
special cases, they must have a known ratio of
variances).
4. The variables involved must have been measured in at
least an interval scale, so that it is possible to use the
operations of arithmetic (adding, dividing, finding
means, etc.) on the scores. . . .
5. The means of these normal and homoscedastic popula-
tions must be linear combinations of effects due to col-
umns and/or rows. That is, the effects must be additive.
(p. 19)
The fifth assumption refers only to factorial anova designs: it is
not a requirement of the t-test or one-way anova.
Gaito (1959) analyzes each of these assumptions in detail.
Assumption 1 is necessary; it is equally a requirement of non-
parametric techniques. Assumption 5 is not considered necessary
by some writers, and there has been much confusion concerning
the interpretation of this assumption. Discussion of the validity of
Assumptions 2 and 3 involves the concept of robustness, which, as
Labovitz (1967) reminds us
is the ability of a statistical test to maintain its logically
deduced conclusion when one or more assumptions have
been violated. (p. 156)
The robustness of parametric tests (i.e., the validity of Assumptions
2 and 3) has been the subject of many empirical studies. In a
review of such studies, Box and Anderson (Note 1; cited in Gaito,
1959) state that analysis of variance and t-tests are
highly robust to both non-normality and inequality of
variances. Therefore the practical experimenter may use
these tests of significance with relatively little worry
concerning the failure of these assumptions to hold exactly
in experimental situations. The rather striking exception
to this rule is the sensitivity of the analysis of variance to
variance heterogeneity when groups are of unequal size. (Gaito,
1959, p. 117)
On the same issue, McNemar (1969) cites a study by Boneau (1960),
who computed a thousand values of t for the difference between
independent means, under different combinations of sample sizes,
shapes of population distributions, and equality or inequality of
population variances. Boneau studied the proportions of the ts
reaching the .05 and .01 levels of significance under these varying
conditions. Unequal variances alone did not seriously affect the
proportion of ts reaching significance; however, unequal var-
iances coupled with widely differing sample sizes severely af-
fected the proportion of significant ts. In most cases, changes in
the shape of the population distribution hardly affected the t-test:
negligible effects if the population was rectangular, slight effects
if it was severely skewed. Only when one sample was drawn from a
normal or rectangular distribution and the other from a severely
skewed (J-shaped) distribution was the t-test strongly affected.
Labovitz (1967) cites this study and several others (Bartlett, 1935;
Boneau, 1962; Gayen, 1949; Srivastava, 1958) and concludes that
(especially with large samples) the population distribution
can be markedly skewed or platykurtic with negligible
effects on the t values. (p. 157)
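A Boneau-style robustness check can be sketched as a small simulation (the population, sample sizes, and trial count here are illustrative, not Boneau's): draw many pairs of samples from a decidedly non-normal (rectangular) population with equal means, compute t for each pair, and count how often the nominal .05 criterion is exceeded.

```python
import math
import random

random.seed(1)

def two_sample_t(x, y):
    """Pooled-variance two-sample t statistic."""
    nx, ny = len(x), len(y)
    mx, my = sum(x) / nx, sum(y) / ny
    vx = sum((v - mx) ** 2 for v in x) / (nx - 1)
    vy = sum((v - my) ** 2 for v in y) / (ny - 1)
    sp2 = ((nx - 1) * vx + (ny - 1) * vy) / (nx + ny - 2)
    return (mx - my) / math.sqrt(sp2 * (1 / nx + 1 / ny))

crit = 2.024        # approximate two-tailed .05 critical value, df = 38
trials = 2000
rejections = 0
for _ in range(trials):
    x = [random.uniform(0, 1) for _ in range(20)]   # rectangular population
    y = [random.uniform(0, 1) for _ in range(20)]   # same mean: H0 is true
    if abs(two_sample_t(x, y)) > crit:
        rejections += 1

rate = rejections / trials
```

With a rectangular parent population the empirical rejection rate stays close to the nominal .05, which is the sense in which the t-test is robust to this violation of normality.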
Labovitz also cites several references (Norton, 1952, reported in
Lindquist, 1953, pp.78-90; Box, 1953; Cochran, 1947; Cochran &
Cox, 1950, pp.28-32; Fisher, 1922; Godard & Lindquist, 1940)
indicating the robustness of the F-distribution. More recently,
Gaito (1972) has shown that unequal sample size, when coupled
with unequal variances, has the greatest effect upon probability
levels when only two groups are being compared; for larger
numbers of groups, the effect is less marked.
The evidence clearly indicates that under most conditions,
parametric statistics are highly robust. However, care needs to be
taken not to interpret this generalization too liberally and go to
the extreme of assuming that parametric techniques are robust
under all conditions. Labovitz (1967) warns that
A word of caution is necessary on robustness tests. Al-
though it frequently turns out that the violation of one
assumption does not appreciably alter the test, the viola-
tion of two or more assumptions frequently does have a
marked effect. (p. 158)
Assumption 4, the measurement requirement, is not an element
of the parametric statistical model (and Siegel recognizes this),
but it is, according to Siegel, a central assumption underlying the
use of the model. It is this assumption that leads him to argue that
nonparametric statistics must be used for ordinal data. As Gaito
(1959) points out, this assumption appears to be "the most serious
obstacle to the use of parametric techniques inasmuch as most
psychological research deals with sub-interval type data" (p. 118).
The assumption is, of course, only a serious obstacle if it is, in
fact, valid, and there is some doubt about whether Siegel is
actually correct. The arguments against the validity of the
assumption are of two kinds. The first kind of argument is based
upon logical analysis of the mathematical assumptions underly-
ing the various statistical procedures; the second appeals to
empirical evidence on the effects of performing various types of
transformation upon the validity of statistical inferences.
In a paper illustrating the first kind of argument, Lord (1953)
points out that a statistical test can hardly be cognizant of the
empirical meaning of the numbers with which it deals. Anderson
(1961) takes up Lord's argument and asserts categorically that
Siegel is wrong: the t-test, for example, simply informs us of the
probability that the difference between the means of two groups of
numbers is real rather than due to chance; the validity of a
statistical inference, Anderson argues, cannot depend on the type
of measuring scale used. The same point is taken up by Gaito
(1960), who asserts that the requirements of an interval scale
cannot be found if one examines the mathematical assumptions
underlying the analysis of variance procedure. Kempthorne (1955)
argues that whether a scale is interval or ordinal makes little
difference:
[T]he level of significance of the analysis of variance test
for differences between treatments is little affected by the
choice of a scale of measurement. (p. 958)
McNemar (1969) concurs:
Nowhere in the derivations purporting to show that vari-
ous ratios will have sampling distributions which follow
either the F or the t or the normal distribution does one
find any reference to a requirement of equal units. (p. 431)
The whole argument is neatly summarized by Baker, Hardyck
and Petrinovich (1966):
[S]tatistics apply to numbers rather than things .. . the
formal properties of measurement scales, as such, should
have no influence on the choice of statistics. (p. 292)
The second kind of argument is founded upon empirical studies
of the effects of transformations of data upon the probability
levels associated with the values of various statistics. The basis of
this argument is that, if one alters the metric properties of scales
and still reaches the same conclusions regardless of whether the
data have been transformed or not, then, for the purposes of
statistical inference, it cannot matter very much what the scale
properties are. The argument is illustrated by the work of
Labovitz (1967) and Baker, Hardyck, and Petrinovich (1966).
Labovitz invented some hypothetical experimental data of ordi-
nal strength, transformed the data according to seven different
arbitrary scoring systems, applied parametric statistics (point-
biserial correlation, followed by a t-test) and found that the
conclusions drawn were hardly affected. Baker, Hardyck, and
Petrinovich imposed monotonic transformations on a set of data,
and explored the effect of these distortions of the original data
upon the t distribution. Extreme departures from linear trans-
formations affected the values of t dramatically; for smaller
departures the effects were not large, and the authors conclude
that "strong statistics such as the t-test are more than adequate
to cope with weak measurements" (p. 308).
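The Baker, Hardyck, and Petrinovich exercise can be imitated in a few lines (the two groups of "ordinal scores" are invented): a mild monotonic distortion (square root) barely moves t, while an extreme one (exponential) moves it substantially.

```python
import math

def two_sample_t(x, y):
    """Pooled-variance two-sample t statistic."""
    nx, ny = len(x), len(y)
    mx, my = sum(x) / nx, sum(y) / ny
    vx = sum((v - mx) ** 2 for v in x) / (nx - 1)
    vy = sum((v - my) ** 2 for v in y) / (ny - 1)
    sp2 = ((nx - 1) * vx + (ny - 1) * vy) / (nx + ny - 2)
    return (mx - my) / math.sqrt(sp2 * (1 / nx + 1 / ny))

group_a = [4, 5, 5, 6, 7, 8, 8, 9]        # hypothetical ordinal scores
group_b = [2, 3, 3, 4, 5, 5, 6, 7]

t_raw  = two_sample_t(group_a, group_b)
t_sqrt = two_sample_t([math.sqrt(v) for v in group_a],
                      [math.sqrt(v) for v in group_b])   # mild distortion
t_exp  = two_sample_t([math.exp(v) for v in group_a],
                      [math.exp(v) for v in group_b])    # extreme distortion
```

On these invented data the raw and square-root t values are nearly identical, while the exponential transformation pulls t well below the raw value, mirroring the authors' finding that only extreme departures from linearity matter much.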
In the light of this discussion, Assumption 4 begins to lose some
of its force. In practice, because of the robustness of parametric
techniques, treating ordinal data as if they were interval would be
unlikely to lead to improper conclusions. If there is good reason to
suspect that there is serious inequality of units along the ordinal
scale, the original data can be transformed. (Transformations are
usually carried out when the distribution of scores deviates
markedly from a normal distribution.) Several transformational
techniques are available. [4] For example, extremely skewed dis-
tributions can be converted using a square root or a logarithmic
transformation. Both of these transformations retain all the
information present in the original data so that subsequent
reconversion of the transformed data into the original metric is
possible. A common alternative procedure is to normalize the
data. This requires the assumption that the underlying distribu-
tion in the population is, in fact, normal; this procedure does not
permit recovery of the information present in the original data.
The use of averaged ranks provides another normalizing proce-
dure. As Gaito (1959) points out,
when the rankings of each of a number of Ss are averaged,
these average ranks would appear to more closely approx-
imate the normal curve. (p. 118)
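A sketch of an invertible transformation (the data are invented): a base-10 log pulls in a long right tail, and exponentiating recovers the original metric exactly, so no information is lost.

```python
import math
import statistics

raw = [1, 2, 2, 3, 4, 8, 20, 55]        # strongly right-skewed scores

logged = [math.log10(v) for v in raw]   # more symmetric working scale
recovered = [10 ** v for v in logged]   # reconversion to the original metric

# The transformation is invertible: the original data come back exactly
# (to floating-point precision).
assert all(abs(r - v) < 1e-9 for r, v in zip(recovered, raw))

# On the raw scale the mean is dragged far above the median by the tail;
# on the log scale the two are much closer.
raw_gap = statistics.mean(raw) - statistics.median(raw)
log_gap = statistics.mean(logged) - statistics.median(logged)
```

Normalizing procedures, by contrast, discard the original spacing of the scores, which is why the text notes that they do not permit recovery of the original data.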
Another transformational procedure has been developed by Abel-
son and Tukey (1959). This procedure involves applying certain
rules to assign sets of numbers to data which are regarded as
stronger than ordinal. Which rule is most appropriate depends on
the assumptions one is prepared to make about the relative sizes
of the differences between successive scores in different parts of
the distribution.
[4] Apart from altering the size of units along a scale and converting skewed
distributions to a more symmetrical form, transformations may have two other
effects: they may increase the sensitivity of a statistical test, and they may
remove interaction effects.
The Present Position
Considering all the arguments together, it seems clear that the
parametric/nonparametric issue is not as critical an issue as it was
thought to be two decades ago. Heerman and Braskamp (1970)
consider the arguments on both sides of this issue and conclude:
In summary, the main arguments advanced for the use of
non-parametric methods in place of parametric methods
are not very compelling. Most investigators seem to agree
that scale type is irrelevant to the choice of a statistical
tool, and even though the use of parametric methods
requires more assumptions than non-parametric methods,
failure to meet these assumptions does not appear to have
serious consequences in most instances. (p. 37)
McNemar (1969) agrees: for him, the crucial question is
whether or not the F, t and z tests can, in view of their
dependence on means and variances, be safely used when
the scale of measurement is, as is the rule in psychology,
somewhere between the ordinal and interval scales. The
question boils down to this: Will Fs, ts and zs follow their
respective theoretical sampling distributions when the
underlying scores are not on an interval scale? The answer
to this is a firm yes provided the score distributions do not
markedly depart from the normal form. (p. 431)
Significantly, Stevens has changed his original position on this
issue. In his earlier work, he clearly associates the ordinal/
interval dichotomy with the nonparametric/parametric
dichotomy. In his later work, he is less proscriptive and more
pragmatic about the use of parametric statistics for ordinal scale
data. Five years after his 1946 paper, Stevens (1951) writes that
most of the scales widely and effectively used by
psychologists are ordinal scales. In the strictest propriety,
the ordinary statistics involving means and standard
deviations ought not to be used with these scales, for these
statistics imply a knowledge of something more than the
relative rank ordering of data. On the other hand, for this
'illegal' statisticizing there can be invoked a kind of
pragmatic sanction: in numerous instances it leads to
fruitful results. (p. 26)
In a much later paper (1968), he is now prepared to ask questions
such as
What is the degree of risk entailed when use is made of
statistics that may be inappropriate in the strict sense
that they fail the test of invariance under permissible scale
transformation? (p. 851)
This new position leads him to the conclusion that
The question is thereby made to turn, not on whether the
measurement scale determines the choice of a statistical
procedure, but on how and to what degree an inappro-
priate statistic may lead to a deviant conclusion. . . . By
spelling out the costs, we may convert the issue from
seeming proscription to a calculated risk. (p. 852)
If one accepts this new position, the first of the two questions
posed earlier remains unchanged, but the second has to be
modified:
1. Given a particular measuring instrument, how does one
decide upon its strength?
2. What advantages and risks are there in using parametric
statistics?
Although Question 1 is the same as before, the criteria for
dealing with it have to be modified somewhat, if one adopts the
position of Abelson and Tukey (1959) and Gaito (1960), namely that
the distinction between scale types is not clear-cut. In between the
black (those scales approximating well to the requirements of
interval strength) and the white (rating scales and ranks, which
are clearly ordinal), there lies a grey region, occupied by a large
number of instruments which can be conveniently labelled as
summated scales. Examples of such scales include
1. objective achievement tests in which the items display a
limited range of difficulties; the total score is the sum of a set of
nominal measures (right/wrong);
2. tests containing several short essays, in which the total score
is the sum of a set of essentially ordinal rankings;
3. summated-rating attitude scales, in which the total score is
the sum of a set of ordinal weightings; in a well-constructed scale,
the items will display a limited range of popularity (the attitudinal
equivalent of cognitive difficulty) and will all be differentiating
(discriminating).
The summated scale category obviously includes a large propor-
tion of all the instruments used in educational and psychological
research. The category occupies an intermediate position on the
ordinal/interval continuum. On the one hand, there is no claim
that equal increments in observed score along the scale represent
equal increments in the underlying latent variable being mea-
sured; on the other hand, the mode of construction in each case
suggests that the deviations from interval properties will not be
extreme. The physical analogy is with a rubber ruler that is
compressed and dilated along its length, but not beyond its elastic
limits. In view of the robustness of the commonly employed
parametric statistics, it is unlikely that such distortions will make
normal test theory inapplicable. If this argument is accepted,
there is little risk of drawing unwarranted conclusions when
parametric procedures are applied to the data obtained from
summated scales.
An important question which remains to be answered is, does it
matter whether parametric or nonparametric statistics are
employed? The answer to this question depends greatly upon the
researcher's purpose. For example, consider an experiment
(Gatta, 1973) illustrating the common treatment by levels design.
The researcher was interested in the effects of two grading
procedures upon the attitudes of students of varying achievement
level, and he wished to investigate the interaction effects between
achievement level and grading procedure. This virtually demands
the use of parametric statistics: as Gaito (1959) points out, the few
nonparametric tests for interaction that are available are compu-
tationally laborious and, more importantly, very much less power-
ful. Because of the relatively poor ability of nonparametric
techniques to discriminate amongst different treatment combina-
tions, McNemar (1969) recommends the use of parametric
techniques for experimental data:
In general, distribution-free methods, when applied for
comparative purposes to data which are normal or nearly
normal, are not so sensitive (that is, as powerful for
avoiding type II errors) as the appropriate z, t or F
technique. Consequently, in using a non-parametric
method as a short cut, we are throwing away dollars in
order to save pennies. (pp. 431-432)
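McNemar's power point can be illustrated with a small simulation (the design and numbers are purely illustrative, and a normal-approximation Mann-Whitney test stands in for exact rank tables): with normal data and a modest true mean shift, the t-type test tends to reject H0 at least as often as the rank test.

```python
import math
import random

random.seed(7)

def t_stat(x, y):
    """Pooled-variance two-sample t statistic."""
    nx, ny = len(x), len(y)
    mx, my = sum(x) / nx, sum(y) / ny
    vx = sum((v - mx) ** 2 for v in x) / (nx - 1)
    vy = sum((v - my) ** 2 for v in y) / (ny - 1)
    sp2 = ((nx - 1) * vx + (ny - 1) * vy) / (nx + ny - 2)
    return (mx - my) / math.sqrt(sp2 * (1 / nx + 1 / ny))

def mann_whitney_z(x, y):
    """Normal approximation to the Mann-Whitney U test (no tie correction)."""
    nx, ny = len(x), len(y)
    u = sum(1 for a in x for b in y if a > b)
    mu = nx * ny / 2
    sd = math.sqrt(nx * ny * (nx + ny + 1) / 12)
    return (u - mu) / sd

trials, n = 1000, 15
t_wins = u_wins = 0
for _ in range(trials):
    x = [random.gauss(0.8, 1) for _ in range(n)]   # true shift of 0.8 sd
    y = [random.gauss(0.0, 1) for _ in range(n)]
    if abs(t_stat(x, y)) > 2.048:                  # approx .05 critical value, df = 28
        t_wins += 1
    if abs(mann_whitney_z(x, y)) > 1.96:
        u_wins += 1
```

The two empirical power figures are close under normality (the rank test gives up only a little efficiency), which is consistent with McNemar's "dollars and pennies" framing: the loss is small per experiment but systematic.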
The use of parametric methods in handling data from factorial
experiments is, therefore, defended on pragmatic grounds.
Gaito and Firth (1973) contribute a further argument to the
discussion; they point out that analysis of variance designs permit
estimation of the magnitude of the various treatment effects
involved in the experiment. They argue that such estimates
could provide an important function, that of reducing the
incidence of trivial "significant" results which are em-
phasized as important in the literature. Thus, in recent
years, there has been a reaction against exclusive concern
with tests of H0, favoring the use of estimates of the
statistical sources of variation. (p. 151)
Finally, Sawrey (1958) argues that
non-parametric techniques most often test slightly dif-
ferent hypotheses than do parametric tests, as, for exam-
ple, a median difference rather than a mean difference. If
the distributions are skewed, these become definitely dif-
ferent hypotheses. (p. 176)
Suppose that teacher behaviour X linearly affects pupil attitude
Y, so that all pupils with a particular teacher increase (or
decrease) in attitude during a school year; there would then be no
difference in the conclusions drawn from a parametric or a
nonparametric test. However, if behaviour X affected the Y scores
of only a particular nonrandom group of students in the class (e.g.,
those with initially extremely positive attitudes) but had no effect
upon the rest, then the shape of the distribution of Y scores would
change. A test of the significance of means would detect this
change, but a median test would not. Such changes in the shapes
of distributions are very real possibilities in educational research,
and it would seem undesirable to miss the opportunity of detecting
them. A neat illustration of this point is provided by Wick and
Yager (1966), who studied high school students' attitudes to
science at various grade levels. They found that, during the school
year, the mean score consistently declined while the median
remained stable.
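The logic of the Wick and Yager pattern is easy to reproduce with invented numbers: if a change in scores touches only pupils above the median, the mean moves while the median stays put. A minimal sketch (all scores hypothetical, not taken from their study):

```python
from statistics import mean, median

# Hypothetical attitude scores for eleven pupils (all values invented).
before = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]

# Suppose only the four initially most positive pupils decline, each by
# 2 points; the rest of the class is unaffected.
after = [y - 2 if y > 7 else y for y in before]

print(mean(before), median(before))  # 6.0 6
print(mean(after), median(after))    # mean falls below 6.0; median is still 6
```

A test of means registers this shift in the shape of the distribution; a median test, by construction, cannot.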
Summary
The arguments presented in this review can be summarized in a
few sentences:
1. Questions about the interpretation of measurements and
questions about the validity of statistical procedures lie in distinct
domains. In other words, the question of whether a given set of
data is ordinal or interval is important if we wish to attach mean-
ing to the numbers in the set; it is not important if we wish to esti-
mate the probability that two sets of numbers have been drawn
from a single population.
2. The distinction between ordinal and interval scales is not
sharp. Many summated scales yield scores that, although not
strictly of interval strength, are only mildly distorted versions of
an interval scale.
3. Some of the arguments underlying the assertion that
parametric procedures require measurements of interval strength
appear to be of doubtful validity.
4. Parametric procedures are, in any case, robust and yield
valid conclusions even when mildly distorted data are fed into
them. Furthermore, if the distortions are severe, various trans-
formation techniques can be applied to the data.
5. For some kinds of research design, parametric procedures
retain a number of important experimental benefits: they possess
greater sensitivity, the ability to detect interaction effects in
factorial experiments, and the ability to yield estimates of the
magnitudes of treatment effects.
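The transformation techniques mentioned in point 4 can be illustrated with a small sketch. One common device for severely skewed data is the logarithm, which pulls a long right tail toward symmetry (all numbers below are invented for illustration):

```python
import math
from statistics import mean, median

# Hypothetical positively skewed scores (all values invented).
raw = [1, 1, 2, 2, 3, 4, 6, 9, 15, 40]

# Log-transforming compresses the long right tail, so the mean and
# median move much closer together.
logged = [math.log(x) for x in raw]

print(mean(raw), median(raw))        # mean well above median: marked skew
print(mean(logged), median(logged))  # far closer together after the transform
```

With the distortion reduced in this way, the usual parametric machinery can be applied to the transformed scores.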
Reference Note
1. Box, G. E. P., and Anderson, S. L. Robust tests for variances and effect of
non-normality and variance heterogeneity on standard tests. Technical
Report No. 7. Ordnance Project No. TB2-001(832), Department of Army
Project No. 599-01-004. (no date).
References
Abelson, R. P., and Tukey, J. W. Efficient conversion of non-metric into metric
information. Proceedings of the Social Statistics Section of the American
Statistical Association. Washington, 1959, 226-230. Reprinted in E. R. Tufte
(Ed.), The quantitative analysis of social problems. Reading, Massachusetts:
Addison-Wesley, 1970.
Anderson, N. H. Scales and statistics: Parametric and non-parametric. Psycholog-
ical Bulletin, 1961, 58, 305-316.
Baker, B. O., Hardyck, C. D., and Petrinovich, L. F. Weak measurements vs. strong
statistics: An empirical critique of S. S. Stevens' prescriptions on statistics.
Educational and Psychological Measurement, 1966, 26, 291-309.
Bartlett, M. S. The effect of non-normality on the t-distribution. Proceedings of the
Cambridge Philosophical Society, 1935, 31, 223-231.
Boneau, C. A. The effects of violations of assumptions underlying the t-test.
Psychological Bulletin, 1960, 57, 49-64.
Boneau, C. A. A comparison of the power of the U and t tests. Psychological Review,
1962, 69, 246-256.
Box, G. E. P. Non-normality and tests on variances. Biometrika, 1953, 40, 318-335.
Cochran, W. G. Some consequences when the assumptions for the analysis of
variance are not satisfied. Biometrics, 1947, 3, 22-38.
Cochran, W. G., and Cox, G. M. Experimental designs. New York: Wiley, 1950.
Fisher, R. A. On the mathematical foundations of theoretical statistics.
Philosophical Transactions of the Royal Society of London, Series A, 1922, 222,
309-368.
Gaito, J. Non-parametric methods in psychological research. Psychological Re-
ports, 1959, 5, 115-125.
Gaito, J. Scale classification and statistics. Psychological Review, 1960, 67, 277-278.
Gaito, J. An index of estimation to ascertain the effect of unequal n on ANOVA F
tests. American Psychologist, 1972, 27, 1081-1082.
Gaito, J., and Firth, J. Procedures for estimating magnitude of effects. The Journal
of Psychology, 1973, 83, 151-161.
Gatta, L. A. An analysis of the pass-fail grading system as compared to the
conventional grading system in high school chemistry. Journal of Research in
Science Teaching, 1973, 10, 3-12.
Gayen, A. K. The distribution of Student's t in random samples of any size drawn
from non-normal universes. Biometrika, 1949, 36, 353-369.
Godard, R. H., and Lindquist, E. F. An empirical study of the effect of heterogene-
ous within-groups variance upon certain F-tests of significance in analysis of
variance. Psychometrika, 1940, 5, 263-274.
Heerman, E. F., and Braskamp, L. A. Readings in statistics for the behavioural
sciences. Englewood Cliffs, N.J.: Prentice-Hall, 1970.
Kempthorne, O. The randomization theory of experimental inference. Journal of
the American Statistical Association, 1955, 50, 946-967.
Labovitz, S. Some observations on measurement and statistics. Social Forces,
1967, 46, 151-160.
Lindquist, E. F. Design and analysis of experiments in psychology and education.
New York: Houghton Mifflin, 1953.
Lord, F. M. On the statistical treatment of football numbers. American
Psychologist, 1953, 8, 750-751.
McNemar, Q. Psychological statistics (4th ed.). New York: Wiley, 1969.
Norton, D. W. An empirical investigation of some effects of non-normality and
heterogeneity on the F-distribution. Unpublished doctoral dissertation, State
University of Iowa, 1952.
Sawrey, W. L. A distinction between exact and approximate non-parametric
methods. Psychometrika, 1958, 23, 171-177.
Siegel, S. Non-parametric statistics for the behavioural sciences. New York:
McGraw-Hill, 1956.
Srivastava, A. B. L. Effects of non-normality of the power function of t-test.
Biometrika, 1958, 45, 421-429.
Stevens, S. S. On the theory of scales of measurement. Science, 1946, 103, 677-680.
Stevens, S. S. Mathematics, measurement and psychophysics. In S. S. Stevens
(Ed.), Handbook of experimental psychology. New York: Wiley, 1951.
Stevens, S. S. Measurement, statistics and the schemapiric view. Science, 1968, 161,
849-856.
Tufte, E. R. The quantitative analysis of social problems. Reading, Mass.:
Addison-Wesley, 1970.
Wick, J. W., and Yager, R. E. Some aspects of the student's attitude in science
courses. School Science and Mathematics, 1966, 66, 269-273.
AUTHOR
PAUL LESLIE GARDNER Address: Faculty of Education, Monash University,
Clayton, Victoria, Australia, 3168. Title: Senior Lecturer in Education. De-
grees: B.Sc., M.Ed., Melbourne; Ph.D., Monash. Specialization: Attitude
measurement, science education.