The Seductive Beauty of Latent Variable Models
ARTICLE INFO
Keywords: Latent variables; Reliability; Validity; Massively Missing Completely at Random (MMCAR); Scale construction; Factor analysis; Item analysis; Open source
ABSTRACT
Seduced by their mathematical beauty, psychologists have been using latent variable models for more than a century. Whether discussing a general factor of cognitive ability, personality, or psychopathology there has been an unfortunate tendency to reify hierarchical structures without examining the utility of alternative models. To some of us, the use of latent variables was an unfortunate mistake. By emphasizing internal consistency rather than validity, parsimony of fit rather than function, the use of latent variables has led psychological measurement and theory down a beautifully seductive garden path rather than focusing on the real problem of actually being useful. I will address some of these alternatives and suggest that it is time to think more critically of the use of latent variable models in our theorizing and applications.
☆ Based upon the Distinguished Contribution Lecture to the International Society for the Study of Individual Differences, July 2023.
E-mail address: [email protected].
https://doi.org/10.1016/j.paid.2024.112552
W. Revelle Personality and Individual Differences 221 (2024) 112552
Table 1
Self report and peer report from the SAPA-project. Correlations reported by Zola et al. (2021). Reliabilities on the main diagonal. Raw correlations below the diagonal.
Correlations corrected for reliability above the diagonal. Upper left quadrant reflects SAPA Personality Inventory scores (Condon, 2018) for 158,631 participants,
mean n/item = 18,180. Other quadrants reflect 908 peer rated participants. Values > 0.4 are highlighted in bold. Data from the zola dataset in the psychTools package.
                     Self report                          Peer ratings
Variable             Agrbl  Cnscn  Nrtcs  Extrv  Opnnn    Agrbl  Cnscn  Stblt  Extrv  IntlO
Self report
Agreeableness         0.87   0.32  −0.14   0.28   0.09     0.75   0.21   0.18   0.34   0.22
Conscientiousness     0.28   0.87  −0.20   0.13   0.06     0.16   0.78   0.22   0.42   0.13
Neuroticism          −0.12  −0.18   0.90  −0.28  −0.10    −0.01  −0.16  −0.78  −0.40  −0.25
Extraversion          0.25   0.12  −0.25   0.90   0.14     0.01  −0.01   0.07   0.71   0.14
Openness              0.08   0.05  −0.09   0.13   0.86    −0.14  −0.06   0.10   0.17   0.49
Peer ratings
Agreeableness         0.47   0.10  −0.01   0.00  −0.09     0.45   0.36   0.47   0.15   0.44
Conscientiousness     0.15   0.55  −0.12  −0.01  −0.04     0.18   0.58   0.42   0.41   0.47
Stability             0.13   0.16  −0.58   0.05   0.07     0.25   0.25   0.60   0.38   0.52
Extraversion          0.23   0.28  −0.27   0.49   0.11     0.07   0.23   0.22   0.52   0.32
IntellectOpenness     0.14   0.08  −0.15   0.09   0.30     0.19   0.24   0.27   0.15   0.44
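The correction for reliability reported above the diagonal of Table 1 can be checked by hand. The following is a minimal sketch, assuming the classical correction for attenuation (the observed correlation divided by the square root of the product of the two reliabilities). The function name `correct_for_attenuation` is illustrative and is not taken from the psych package; the inputs are the self and peer Agreeableness values printed in Table 1.

```python
import math

def correct_for_attenuation(r_xy, rel_x, rel_y):
    """Classical disattenuation: divide the observed correlation by the
    square root of the product of the two reliabilities."""
    return r_xy / math.sqrt(rel_x * rel_y)

# Values read from Table 1 (self report vs. peer rated Agreeableness).
raw_r = 0.47        # raw correlation, below the diagonal
alpha_self = 0.87   # reliability of the self report scale
alpha_peer = 0.45   # reliability of the peer rating scale

corrected = correct_for_attenuation(raw_r, alpha_self, alpha_peer)
print(round(corrected, 2))  # 0.75, matching the entry above the diagonal
```

Because the peer scales are short and have modest reliabilities, the corrected values are considerably larger than the raw ones.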
To which I will suggest that boiling an egg is sometimes more practically important than spending years studying chemistry.
2.1. The multi-trait-multi-method matrix
The third paper in this series emphasizing constructs was by Donald Campbell and Donald Fiske (Campbell & Fiske, 1959), who elaborated on the nomological network and introduced the concept of the Multi-Trait-Multi-Method Matrix (MTMM). They emphasized that it is the pattern of correlations with measures of the same construct measured in the same way (reliability) as well as different ways (convergent validity) as contrasted to measures of different constructs (divergent validity). They were specifically not interested in testing the utility of their measures so much as the convergence of multiple measures of the same construct as indications of validity.
An early example of a MTMM correlation matrix was the set of correlations between self ratings, self report test scores, and peer ratings on 5 dimensions taken from the Guilford (1940) inventory of factors reported by Carroll (1952). As would be hoped, higher convergence was found for traits across methods than for different traits within method. A similar approach to assess the validity of scales was proposed by McCrae et al. (2011), who reported the long term stability of NEO facets, as well as the agreement of self rated facet scores with peer and spouse ratings on those same facets. Although they do not report the discriminative validity, presumably they thought of these correlations as the diagonal values of a MTMM and thus as convergent mono-trait-hetero-method validities.
A more recent example of a Multi-Trait-Multi-Method Matrix considers the results of a validation study of traits measured by self report as well as by peer ratings (Zola et al., 2021). From an online sample using Massively Missing Completely at Random sampling of items (roughly 100–200 items per subject from a pool of almost 700 items) data were collected from 158,631 anonymous volunteer participants on items from the SAPA Personality Inventory (spi-135) (Condon, 2018). Correlations were found using the Noah's Ark procedure (pairwise complete). In addition, all participants were asked if they would nominate peers to supply ratings on their personality. Peer ratings were thus collected on 1554 individual participants who rated 921 of the original participants on a short form of 30 items measuring 8 constructs. Table 1 shows the correlations between five trait measures (α reliabilities on the diagonal). The upper left quadrant of the table shows the correlations of the self report scales, the lower right quadrant the peer ratings. Except for the diagonal elements, these are all multi-trait-mono-method correlations. The lower left quadrant shows the raw correlations of the multi-trait-hetero-method correlations. The values above the diagonal reflect correlations corrected for attenuation. The two minor diagonals reflect the mono-trait-hetero-method validities.
2.2. Test theory
With the emphasis upon constructs, much of the work in test theory became how to design tests to maximize internal consistency measures of reliability. In contrast to the earlier work by Gulliksen (1950) and Nunnally (1978), which emphasized validity, much of the past 60 years has emphasized reliability and internal structure and equated validity with factorial validity. For a discussion of the move towards construct validity and away from simple prediction, see Slaney (2017).
Developments in test theory emphasized unidimensional constructs to be measured with “the New Psychometrics” of Item Response Theory (Embretson, 1996; Embretson & Hershberger, 1999; Reise, 1999) and considered validity in terms of Structural Equation Models (Bollen, 1989; Jöreskog, 1978; Wiley, 1973). IRT is based upon the concept of a latent variable causing the manifest responses to items; SEM is regression with latent variables (observed variables corrected for measurement error). These new approaches have enshrined latent variables without considering the consequences.
Although originally requiring knowing how to code and having familiarity with matrix algebra, IRT and SEM procedures have become easier to use without necessarily understanding when and why to use or not use various methods. “One side of the problem is that psychologists have a tendency to endow obsolete techniques with obscure interpretations. The other side is that psychometricians insufficiently communicate their advances to psychologists, and when they do they meet with limited success” (Borsboom, 2006, p. 428). The critiques are written in matrix notation in journals such as Multivariate Behavioral Research and Psychometrika and seem to most non-experts as debating the number of angels who can dance on the head of a pin.
Our users are taught to push buttons on menu driven programs and to report the statistics that are seen as necessary. They are not taught to think about what these various measures mean in their endless search for construct validity. For “construct validity functions as a black hole from which nothing can escape: Once a question gets labeled as a problem of construct validity, its difficulty is considered superhuman and its solution beyond a mortal's ken.” (Borsboom, 2006, p. 431).
3. Prediction versus theory
Although classic texts on measurement (e.g., Gulliksen, 1950; Nunnally, 1978) devote entire chapters to issues of validity, more recently there has been less emphasis upon the practical problem of prediction and more on the beauty of equations specifying latent variables. As Hogan (2009) put it, “Mainstream psychometrics concerns measuring entities (i.e., determining ‘true scores’). But applied assessment has a job to do, and that is to predict outcomes.”
Although criticizing construct validity, Borsboom and Mellenbergh (2004) add an even stronger criticism of criterion validity:
Fig. 1. α and validity as a function of the number of items and the average correlation showing the tradeoff between internal consistency and predictive validity;
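The two families of curves summarized in Fig. 1 can be sketched numerically. This is an illustration under stated assumptions, not the code used to draw the figure: internal consistency is computed with the standardized form of α for k items with average inter-item correlation `r_bar`, and validity with the composite formula of Eq. (11). The function names and the item validity of 0.2 are hypothetical choices made for the example.

```python
import math

def standardized_alpha(k, r_bar):
    """Standardized alpha for k items with average inter-item correlation r_bar."""
    return k * r_bar / (1 + (k - 1) * r_bar)

def composite_validity(k, r_bar, r_y):
    """Eq. (11): correlation of a k item unit weighted sum with a criterion,
    for items with average validity r_y and average intercorrelation r_bar."""
    return k * r_y / math.sqrt(k + k * (k - 1) * r_bar)

r_y = 0.2  # hypothetical average item validity
for r_bar in (0.05, 0.2, 0.4):
    for k in (2, 5, 10, 20):
        a = standardized_alpha(k, r_bar)
        v = composite_validity(k, r_bar, r_y)
        print(f"r_bar={r_bar:.2f} k={k:2d} alpha={a:.2f} validity={v:.2f}")
```

For a fixed item validity, both quantities rise with k, but the rows with the lowest `r_bar` show the lowest α and the highest validity, which is the tradeoff the figure is meant to convey.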
“the idea of construct validity was introduced to get rid of the atheoretical, empiricist idea of criterion validity, which is a respectable undertaking because criterion validity was truly one of the most serious mistakes ever made in the theory of psychological measurement. The idea that validity consists in the correlation between a test and a criterion has obstructed a great deal of understanding and continues to do so.” (p. 1065)
They go on to say
“Therefore, not just criterion validity but any correlational conception of validity is hopeless. The double-headed arrows of correlation should be replaced by the single-headed arrows of causation, and these arrows must run from the attribute to the measurements”.
“Validity is a property of tests: A valid test can convey the effect of variation in the attribute one intends to measure. This means that the relation between test scores and attributes is not correlational but causal.” (p. 1067)
3.1. In defense of predictive validity
In striking contrast to these critiques of predictive validity is the success of several groups of researchers concerned with vocational interests (Dawis, 1992; Donnay, 1997; Holland, 1959; Strong Jr., 1927), psychopathology (Hathaway & McKinley, 1943), or the analysis of “folk concepts” of social interaction (Gough, 1965). Strong Jr. (1927) championed the predictive power of scales formed from items that distinguished members of a particular occupation from “People In General”. This completely empirical procedure was adapted by the developers of the MMPI (Hathaway & McKinley, 1943) and the CPI (Gough, 1965). Harrison Gough was interested in predicting such varying criteria of socialization ranging from those seen as “best citizens” to incarcerated felons (Gough, 1965). Whether using the California Psychological Inventory (Gough, 1957) or an Adjective Check List (Gough, 1960), the goal was not a clean factor structure so much as scales that worked.
Perhaps more well known to readers of this journal or members of ISSID is the success of the Hogan Personality Inventory (Hogan & Hogan, 1995). These tests are validated by their success in predicting real world outcomes.
4. Aggregation should be purposeful
We have known since Spearman that test reliability goes up with test length (Fig. 1 left hand panel), as does validity (Fig. 1 right hand panel). This leads us to form progressively longer scales in a hope that irrelevant variance will diminish as a source of test variance.
The classic example of the effects of aggregation is seen with the most used statistic in psychology, “coefficient α” (Cronbach, 1951) (Eq. (5)). This measure is also known as KR-20 (Kuder & Richardson, 1937) or λ3 (Guttman, 1945). Part of the appeal of α/λ3 is that it can be found from the item variances and total test variance and is available in commercial software (Sijtsma, 2009a). Although this was convenient in the period of the desk calculator, this is no longer important and so-called model based estimates can be found from the covariances (Eqs. (9), (10)). For fixed average correlation, α/λ3 increases with the number of items.
Aggregation can also increase validity by combining k items with average validity $r_y$:
\[ r_{yk} = \frac{k r_y}{\sigma_x} = \frac{k r_y}{\sqrt{k + k(k-1) r}} \tag{11} \]
where $\sigma_x$ is the standard deviation of the sum of the k standardized items and $r$ is their average intercorrelation.
But there is an interesting contrast between Eqs. (5) and (11): “What one selects when optimizing predictive utility are items that are mutually uncorrelated but highly correlated with the criterion. This is not what one expects or desires in measurement. Note that this does not preclude that tests constructed in this manner may be highly useful for prediction. It does imply that optimizing measurement properties and optimizing predictive properties are not convergent lines of test construction.” (Borsboom & Mellenbergh, 2004, p. 1067). That is, there is a tradeoff between internal consistency and validity. This tradeoff may be seen when comparing the left hand panel of Fig. 1 with the right hand panel. For while both internal consistency and validity increase with the number of items, the highest validity is found for those items that lead to the lowest internal consistency.
The power of aggregation is that composite scales can include important variance and reduce the contribution of extraneous error. However, aggregation to maximize internal consistency (Eq. (5)) will tend to minimize variance that is not random and not common with other items. My colleagues and I refer to such aggregation as spear-fishing: developing sharp, pointed instruments with high internal consistency (Garner, in press; Revelle & Garner, 2023). The alternative
Fig. 2. 10 items from Athenstaedt (2003) show a clear two factor structure representing 5 items reflecting feminine activities and five representing masculine
activities. Although the first and second sets of five items are clearly independent, both sets correlated with gender.
Table 2
Correlations of item composites corrected for item overlap. α reliabilities on the diagonal (in italics). The F and M scales show high correlations within and low correlations between the two sets of scales; for example, the five item F scale correlates 0.06 with the five item M scale. The data are from Athenstaedt (2003) and are available in the Athenstaedt dataset in the psychTools package. The bottom two lines report the correlations with gender and the ωh measure of general factor saturation. See Fig. 3 for the tradeoff between validity and internal consistency.
Variable    F2    F3    F4    F5    M2    M3    M4    M5   MF2   MF4   MF6   MF8  MF10 gender
F2        0.72
F3        0.75  0.79
F4        0.77  0.80  0.82
F5        0.77  0.81  0.84  0.85
M2        0.12  0.15  0.16  0.14  0.79
M3        0.09  0.12  0.13  0.10  0.75  0.76
M4        0.09  0.12  0.13  0.10  0.77  0.78  0.81
M5        0.06  0.09  0.10  0.06  0.79  0.80  0.81  0.82
MF2       0.36  0.46  0.48  0.48  0.38  0.41  0.45  0.46  0.11
MF4       0.48  0.55  0.58  0.57  0.52  0.51  0.53  0.53  0.46  0.59
MF6       0.52  0.56  0.58  0.58  0.55  0.54  0.56  0.56  0.56  0.66  0.69
MF8       0.54  0.58  0.60  0.59  0.58  0.57  0.58  0.57  0.61  0.71  0.73  0.75
MF10      0.54  0.59  0.61  0.60  0.59  0.57  0.58  0.57  0.63  0.73  0.75  0.77  0.77
gender    0.52  0.57  0.58  0.56  0.54  0.55  0.54  0.52  0.67  0.71  0.75  0.74  0.74   1.00
ωh        0.72  0.79  0.69  0.71  0.79  0.77  0.70  0.69  0.11  0.13  0.23  0.24  0.15
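The pattern in Table 2, where mixing items from two unrelated sets lowers internal consistency but raises validity, follows directly from the algebra of composites. The sketch below is not a reanalysis of the Athenstaedt data. It assumes an idealized correlation matrix with two sets of five items, a within-set correlation of 0.5, a between-set correlation of 0, and an item validity of 0.4; all of these values and all function names are hypothetical.

```python
import math

def build_corr(n_per_set, r_within, r_between):
    """Correlation matrix for two item sets with constant within-set
    and between-set correlations."""
    k = 2 * n_per_set
    return [[1.0 if i == j
             else (r_within if (i < n_per_set) == (j < n_per_set) else r_between)
             for j in range(k)] for i in range(k)]

def alpha_from_corr(R, items):
    """Standardized alpha of the unit weighted sum of the selected items."""
    k = len(items)
    total = sum(R[i][j] for i in items for j in items)  # variance of the sum
    return (k / (k - 1)) * (1 - k / total)

def validity_from_corr(R, r_y, items):
    """Correlation of the unit weighted sum with the criterion."""
    total = sum(R[i][j] for i in items for j in items)
    return sum(r_y[i] for i in items) / math.sqrt(total)

R = build_corr(n_per_set=5, r_within=0.5, r_between=0.0)
r_y = [0.4] * 10            # hypothetical validity of every item

one_set = list(range(5))    # five items from one set (a "spear")
both_sets = list(range(10)) # five items from each set (a "net")

alpha_spear = alpha_from_corr(R, one_set)
valid_spear = validity_from_corr(R, r_y, one_set)
alpha_net = alpha_from_corr(R, both_sets)
valid_net = validity_from_corr(R, r_y, both_sets)

print(round(alpha_spear, 2), round(valid_spear, 2))  # 0.83 0.52
print(round(alpha_net, 2), round(valid_net, 2))      # 0.74 0.73
```

With these made-up inputs the homogeneous five item scale has the higher α, while the heterogeneous ten item composite has the higher validity, which is qualitatively the contrast between the F5 and MF10 rows of Table 2.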
approach is to use a net: diffuse scales that include multiple items with criterion validity, even if not highly associated with each other. As we suggest, you can catch more fish with a net than a spear.
Consider the correlations of 10 items from Athenstaedt (2003) that are discussed by Eagly and Revelle (2022) (Fig. 2). These items are included in the Athenstaedt data set in the psychTools package (Revelle, 2023b) for the R statistical system (R Core Team, 2023). The analyses and graphics were done using the psych package (Revelle, 2023a) in R. Using the inter-ocular trauma test for the number of factors, these 10 items clearly represent 2 independent factors. Although the sets of items are basically orthogonal, they all correlate with gender. We can find composite scales of these items by combining the first 2, 3, 4 or 5 from each factor (F2 … F5, M2 … M5) or composite scales of 1, 2, 3, 4, 5 from each set (MF2, MF4, MF6, MF8, MF10) (Table 2). Just M or just F scales are very internally consistent (ωh = 0.72…0.85) and reasonably valid (rgender = 0.52…0.58). But the composite (MF) scales are much less internally consistent (ωh = 0.11…0.23, α = 0.11…0.77) and more valid (rgender = 0.67…0.75).
It is interesting to compare the two indicators of internal consistency. The conventional measure for the 10 item MF scales, α, is by conventional criteria (Nunnally, 1978) “acceptable” with a value of 0.77. That is to say, we would expect such a 10 item scale to correlate 0.77 with a parallel measure. But from the point of view of whether these scales measure one thing, they clearly do not. The ωh value of 0.15 suggests that just 15 % of the variance is due to one latent factor.
That is, from a traditional measurement point of view, the MF scales are clearly inadequate for they do not represent one construct. Just 11 to 15 % of their variance is common to the scale. But their predictive validity is far superior to that of the “better” scales that are purer measures of a single construct. As Eagly and Revelle (2022) said “the patterning of psychological gender/sex differences can be difficult to discern in narrowly defined attributes but emerges more strongly in general trends. It follows that neither similarity nor difference prevails but instead a more complex intertwining of these two types of findings”. This tradeoff between validity and internal consistency is seen in Fig. 3 which plots the validity correlations against the ωh measures of general factor saturation.
We have previously reported similar findings (Eagly & Revelle, 2022) using a data set from Gruber et al. (2020) which also show the power of aggregation and the benefit of aggregating independent
5.1. Ability
Fig. 4. Hierarchical analysis of 16 ability items from the ICAR (panel A) and 19 size measures from the United States Airforce (panel B). Data sets in the psychTools package
are ability and USAF respectively. Measures of internal consistency: ωh = 0.66, 0.53, α = 0.83, 0.90, ωt = 0.86, 0.95 for ability and size respectively.
and Germany, the ICAR project now has 17 item types and a database of
several thousand items (Dworak et al., 2021; Revelle et al., 2020). These
items show the traditional hierarchical structure of ability items (Fig. 4
panel A).
This hierarchical structure is remarkably similar to that of 19 measures of physical size taken from the United States Airforce which also show a higher level factor structure (Fig. 4 panel B). This factor, best summarized as physical size, cannot be said to be a cause of arm length or chest diameter, for size is a formative sum of the component measurements.
5.2. Temperament
Table 3
Various estimates of internal structure for 5 “Big Few” and 27 lower level scales from the spi dataset. For a list of the items and scoring keys for these scales, see the help
page for the spi dataset in the psychTools package. Calculations done using the reliability function in the psych package. The first three columns are the traditional
measures of internal consistency, the next three represent three measures of unidimensionality, the next two are results of split half analyses and represent the best and
worst split half reliabilities. The final three columns report the mean and median inter-item correlations and the number of items per scale.
Variable                    ωh     α    ωt   Uni     τ    ρp  Max split  Min split  Mean r  Median r  N items
Agree                     0.55  0.87  0.89  0.69  0.80  0.86       0.91       0.66    0.32      0.25       14
Consc                     0.58  0.86  0.88  0.75  0.84  0.90       0.91       0.70    0.30      0.27       14
Neuro                     0.61  0.90  0.92  0.84  0.90  0.94       0.94       0.75    0.40      0.36       14
Extra                     0.66  0.89  0.91  0.82  0.89  0.92       0.94       0.77    0.38      0.34       14
Open                      0.47  0.84  0.86  0.68  0.77  0.88       0.89       0.62    0.27      0.22       14
Compassion                0.80  0.88  0.89  0.99  0.99  1.00       0.87       0.82    0.59      0.58        5
Trust                     0.80  0.87  0.89  0.99  0.99  1.00       0.87       0.81    0.58      0.58        5
Honesty                   0.71  0.81  0.84  0.96  0.97  0.99       0.83       0.70    0.46      0.46        5
Conservatism              0.56  0.78  0.85  0.82  0.90  0.91       0.84       0.61    0.41      0.35        5
Authoritarianism          0.63  0.81  0.86  0.89  0.93  0.95       0.85       0.63    0.46      0.46        5
EasyGoingness             0.45  0.68  0.76  0.90  0.92  0.98       0.73       0.58    0.29      0.29        5
Perfectionism             0.34  0.70  0.74  0.82  0.83  0.99       0.72       0.53    0.31      0.33        5
Order                     0.62  0.81  0.85  0.92  0.94  0.99       0.83       0.66    0.46      0.42        5
Industry                  0.72  0.84  0.86  0.99  0.99  1.00       0.84       0.76    0.52      0.50        5
Impulsivity               0.72  0.87  0.90  0.98  0.98  1.00       0.87       0.80    0.58      0.58        5
SelfControl               0.49  0.76  0.83  0.90  0.94  0.96       0.80       0.60    0.39      0.36        5
EmotionalStability        0.65  0.85  0.89  0.98  0.98  1.00       0.84       0.76    0.52      0.50        5
Anxiety                   0.83  0.90  0.91  0.99  0.99  1.00       0.89       0.83    0.64      0.62        5
Irritability              0.78  0.89  0.91  0.98  0.99  0.99       0.89       0.79    0.61      0.60        5
WellBeing                 0.80  0.90  0.92  0.99  0.99  1.00       0.90       0.81    0.63      0.63        5
EmotionalExpressiveness   0.73  0.80  0.83  0.92  0.93  0.99       0.83       0.68    0.45      0.43        5
Sociability               0.66  0.85  0.89  0.97  0.98  0.99       0.85       0.75    0.53      0.50        5
Adaptability              0.62  0.80  0.84  0.92  0.93  0.99       0.82       0.68    0.44      0.42        5
Charisma                  0.67  0.82  0.86  0.94  0.96  0.98       0.84       0.72    0.47      0.43        5
Humor                     0.68  0.78  0.82  0.91  0.92  0.99       0.81       0.64    0.42      0.40        5
AttentionSeeking          0.80  0.88  0.90  0.92  0.93  0.99       0.89       0.77    0.58      0.67        5
SensationSeeking          0.77  0.86  0.89  0.97  0.98  0.99       0.87       0.77    0.55      0.54        5
Conformity                0.67  0.82  0.87  0.89  0.93  0.96       0.85       0.67    0.47      0.47        5
Introspection             0.56  0.78  0.84  0.92  0.93  0.99       0.81       0.68    0.41      0.41        5
ArtAppreciation           0.68  0.80  0.83  0.89  0.90  0.99       0.81       0.65    0.44      0.46        5
Creativity                0.70  0.85  0.86  0.97  0.97  1.00       0.85       0.77    0.52      0.53        5
Intellect                 0.81  0.86  0.87  0.99  0.99  1.00       0.84       0.78    0.54      0.52        5
Table 4
Descriptive statistics for the eight criteria used in the examples from the spi dataset. The trimmed mean represents the mean with the top and bottom 10 % removed.
The Mad is the median absolute difference from the median. For a discussion of the estimates of skewness and kurtosis see the help pages for describe in the psych
package.
Variable Vars n Mean SD Median Trmmd Mad Min Max Range Skew Krtss SE
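The two robust statistics named in the caption of Table 4 are simple to compute. The sketch below is an illustration in plain Python, not the code of the describe function: it assumes that a 10 % trimmed mean drops the lowest and highest 10 % of the sorted values before averaging, and it returns the unscaled median absolute deviation. R's `mad` function multiplies this quantity by 1.4826 by default, and whether the values in Table 4 carry that scaling is not stated here. The data are made up.

```python
import statistics

def trimmed_mean(values, trim=0.1):
    """Mean after dropping the lowest and highest `trim` fraction of values."""
    ordered = sorted(values)
    cut = int(len(ordered) * trim)
    kept = ordered[cut:len(ordered) - cut]
    return statistics.mean(kept)

def median_abs_deviation(values):
    """Median of the absolute deviations from the median (unscaled)."""
    med = statistics.median(values)
    return statistics.median([abs(v - med) for v in values])

scores = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]  # hypothetical data with one outlier

print(statistics.mean(scores))        # 14.5, pulled up by the outlier
print(trimmed_mean(scores))           # 5.5
print(median_abs_deviation(scores))   # 2.5
```

Both statistics are barely affected by the single extreme score, which is why they are reported alongside the mean and standard deviation.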
In a practical sense, the question about the utility of theory versus prediction has been answered by the success of companies that develop instruments to predict employee success by using proprietary instruments. Rather than adopt factorially pure instruments with high construct validity, these companies emphasize scales that discriminate successful from unsuccessful workers. Criteria of interest include absenteeism, theft, malicious behaviors and general dishonesty or lack of integrity (Hogan et al., 1996; Hogan & Sherman, 2020). Predictive validity is shown for truck drivers, service dispatchers, or machine operators. The success of this approach may be seen by the number of companies that use these proprietary instruments. Their instruments are broadly theory relevant, e.g., socioanalytic theory suggests that we should study the interpersonal challenges of getting along, getting ahead and finding meaning in life (Gottlieb et al., 2021; Hogan, 1982; Hogan & Blickle, 2018) and emphasize predictive rather than factorial validity. Combining multiple dimensions is better than any single dimension. Thus Hogan et al. (1994) in their review of personality and leadership effectiveness cite literature that surgency, emotional stability, and conscientiousness predict better leadership performance.
The debate about scale construction procedures between those favoring latent variable models, those favoring theory driven models, and those using criterion oriented scales was addressed by Hase and Goldberg (1967) who reached the conclusion that all of these procedures were about equally effective when predicting a variety of criteria. In a monumental followup which also addressed basic scale construction principles, Goldberg (1972) came to somewhat different conclusions,
Table 5
Standardized β weights for 5 and 27 predictors of 8 criteria. Also shown are the multiple R values for the derivation sample (N = 2000) and cross validation sample (N
= 2000). Although values r > 0.075 have Bonferroni adjusted probabilities of < 0.01, I highlight (in bold) those β > 0.1. Calculations done with the lmCor and
crossValidation functions in the psych package.
Variable p1edu p2edu ER wllns smoke exer edctn helth
[Figure 6 image. Two panels: “Manhattan Plot of health” and “Manhattan Plot of exer”. The vertical axis gives the item correlations with the criterion (labeled “Correlations with health” in the health panel, ticks from 0.0 to 0.3). The horizontal axis groups the items by scale: the five higher level scales (Agree, Consc, Neuro, Extra, Open) followed by the 27 lower level scales (Compassion through Intellect, in the order listed in Table 3).]
Fig. 6. Manhattan plots organize individual item validities by the 5 higher order factors (Agree … Open) and the 27 lower order factors. The data are the derivation sample from the spi, N = 2000. The dashed line represents the Bonferroni adjusted level of significance at the p < 0.01 level.
Table 6
20 spi items that best predict exercise. The last two columns identify items that are markers (if they are) of the five higher order factors and then the 27 lower level
factors. The item numbers correspond to those from Condon (2019). The item validities are the means of 10 folds. Estimates of internal consistency: ωh = 0.62, α =
0.88, ωt = 0.90, u = 0.69, rexercise = 0.33.
Variable Mean r Item B5 L27
Table 7
20 spi items that best predict health. The last two columns identify items that are markers (if they are) of the five higher order factors and then the 27 lower level
factors. The item validities are the means of 10 folds. Estimates of internal consistency: ωh = 0.64, α = 0.90, ωt = 0.92, u = 0.37, rhealth = 0.43.
Variable Mean r Item B5 L27
showing how factorially based scales worked better on easy to predict criteria, but that criterion oriented techniques were better with harder to predict criteria. Hase and Goldberg (1967) and Goldberg (1972) examined 468 unique items taken from the CPI to predict 13 different criteria for a total sample of just 152 subjects. Being firm believers in the need to cross validate their results, the derivation and cross validation samples had just 76 participants. Using much larger samples, my colleagues and I have found that empirical item level and lower level factor scales dominate high level factor based prediction (Revelle et al., 2021). Here I elaborate on those findings.
6.1. Examples of prediction at the scale level
At a more micro level, I have already used the example of predicting gender from various stereotypical gender items (Table 2, Fig. 3) to show that increasing internal consistency does not necessarily lead to increases in validity. In fact, there is a well known (but forgotten) tradeoff between the two. I now consider a more complicated example which uses dimensions that are commonly seen in personality research and examine predicting a set of 8 criteria using three levels of analysis (Fig. 5).
For reproducibility of my results, I use data from the spi dataset in the psychTools package and include the relevant R code in Appendix A. The spi dataset was collected as part of the SAPA project discussed earlier and includes 135 items from Condon (2018). These 135 were carefully curated from a larger set of 696 items which in turn were taken from the more than 2000 items in the International Personality Item Pool (Goldberg et al., 2006). Of these 135 items, 70 may be formed into 5 higher level composites representing the Big Few, while all 135 items can be scored for 27 different lower level item composites. Conventional estimates of internal consistency (ωh, α, ωt) as well as various measures of unidimensional structure (Revelle & Condon, 2023) are shown in Table 3. As expected (Widaman & Revelle, 2023a, 2023b) scale scores found by unit weighting of the keyed items match factor score estimates with all correlations > 0.97 (Table 4).
Because of the well known need to cross validate any empirical finding (Cureton, 1950), all analyses were done on a randomly chosen 50 % of the data and then the resulting β weights were applied to the other 50 % of the sample. With the sample sizes I am using (derivation N = 2000, cross validation N = 2000), the amount of shrinkage in the cross validation samples was minimal (compare the multiple R values for the derivation and cross validation samples in Table 5).
For each of these eight criteria, Fig. 5 shows the cross validated multiple correlations for scales representing the Big Few, the “little 27”, as well as scales formed from finding the best cross validated items using the bestScales function. Although all the β values for the 5 and 27 predictors on the 8 criteria are shown in Table 5, for conciseness, I just discuss self ratings of wellness and reported exercise. The three largest β weights suggest that Exercise is done more by people who are high on conscientiousness and emotional stability and who are more extraverted. These same three factor based scales predict self ratings of health, but with a bigger effect for emotional stability and an overall larger R. When examining these relationships in more detail, by looking at the lower level factor/scales, we see that Exercise is associated with not being easy going, but being sociable and seeking stimulation. Health is also associated with not being easy going, but is particularly associated with well being, low anxiety, self control and sensation seeking.
6.2. Prediction at the item level
In addition to using higher level and lower level factors/scales, it is also possible to use the items themselves. A graphical demonstration of how subsets of items from each of these higher level or lower level factors relate to the criteria is shown as a pair of “Manhattan” plots (Fig. 6). These two plots show the zero order correlations for each item in each scale with the criteria. Thus, although Neuroticism correlates −0.27 with health, we can see that this is due to about seven of the 14 items in the scale, and the high correlation of well being with health reflects the high correlations of all of the items in that short scale.
A more detailed pattern for exercise and health is found by looking at the items that are most descriptive. A simple “machine learning” algorithm, implemented in the bestScales function, identifies those items which are most related to a criterion in each of 10 “folds” of the data. K-fold cross validation splits the data into k folds, and treats N*(k-1)/k participants as the derivation sample and N/k as the cross validation sample. Pooled cross validation coefficients are then used to choose the “best” items. We have compared bestScales to more conventional techniques such as LASSO regression and find that it performs about as well (Elleman et al., 2020). The advantage of bestScales is that it is completely transparent and produces a list of the best items for any criterion. Given that SAPA data normally has a high degree of missingness (by design) and that it works on both raw data as well as covariance matrices, we have found bestScales to be particularly useful.
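The logic of choosing items by k-fold cross validation can be illustrated with a short simulation. This is a simplified stand-in written in plain Python, not the bestScales algorithm: it assumes complete data, picks the `n_best` items with the largest absolute validity in each derivation sample, forms a unit weighted score, and correlates that score with the criterion in the held out fold. The simulated data, the function names and the parameter values are all hypothetical.

```python
import math
import random
from collections import Counter

def pearson(x, y):
    """Pearson correlation of two equal length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def k_fold_item_selection(items, criterion, k=5, n_best=4, seed=1):
    """For each fold: select items on the other k-1 folds, then cross validate
    the unit weighted composite on the held out fold."""
    n, n_items = len(criterion), len(items[0])
    order = list(range(n))
    random.Random(seed).shuffle(order)
    folds = [order[f::k] for f in range(k)]
    results = []
    for holdout in folds:
        held = set(holdout)
        derivation = [p for p in order if p not in held]
        y_d = [criterion[p] for p in derivation]
        validities = [pearson([items[p][j] for p in derivation], y_d)
                      for j in range(n_items)]
        best = sorted(range(n_items), key=lambda j: abs(validities[j]),
                      reverse=True)[:n_best]
        sign = {j: (1 if validities[j] >= 0 else -1) for j in best}
        scores = [sum(sign[j] * items[p][j] for j in best) for p in holdout]
        r_cv = pearson(scores, [criterion[p] for p in holdout])
        results.append((sorted(best), len(derivation), len(holdout), r_cv))
    return results

# Simulated data: 200 people, 12 items; only the first 4 items predict the criterion.
rng = random.Random(42)
items = [[rng.gauss(0, 1) for _ in range(12)] for _ in range(200)]
criterion = [sum(row[:4]) + 2 * rng.gauss(0, 1) for row in items]

results = k_fold_item_selection(items, criterion)
mean_cv_r = sum(r for _, _, _, r in results) / len(results)
times_chosen = Counter(j for best, _, _, _ in results for j in best)
print(round(mean_cv_r, 2), times_chosen.most_common(4))
```

With N = 200 and k = 5, each derivation sample has N*(k-1)/k = 160 participants and each cross validation sample has N/k = 40. Tallying how often each item is chosen across folds is one simple way to arrive at a final list of items.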
Based upon the zero order correlations, we see that Extraverts exercise more (r = 0.13) or that the linear regression of Extraversion + Conscientiousness combines the need for stimulation with the belief that exercise is healthy (R = 0.22). Or we can use lower level constructs that suggest people with a high sense of well being, who are not easygoing and are high in industriousness, exercise more (R = 0.33). Finally, we can find (and cross validate) the items that actually predict exercising (R = 0.33) (Table 6) or health (R = 0.43) (Table 7). All of these are reasonable levels of understanding and prediction. It is important to point out that the multiple regressions done with the little 27 were based upon 135 items (5 items per scale), whereas the bestScales results were based upon just the 20 items most related to each criterion.

7. Discussion and conclusions

The tension between theory and prediction has been with us for many years. Empirically based scale construction using items to predict outcomes is not a new idea (e.g., Hathaway & McKinley, 1943; Stewart et al., 2022; Strong Jr., 1927, 1947), although it seems to have been forgotten by those who prefer constructs and latent variables. The elegance of the arguments for construct validity (Cronbach & Meehl, 1955; Loevinger, 1957) and the sheer pleasure of successfully doing a factor analysis or structural equation model have seduced us from the path towards predicting outcomes.

With the advent of very large data bases and the recognition of the need for cross validation, the empirical approach has become popular in other fields. For knowing how to add (find sum scores) is, after all, the basic principle of polygenic risk scores (PRS) used in Genome Wide Association Studies (GWAS) or of risk scores for medical outcomes. GWAS identifies the single SNPs correlated with outcomes as diverse as height or years of education, which are then summed to produce a single score (the PRS). The effectiveness of a PRS is evaluated by its correlation with the criterion variable. While the effect of each SNP is trivial (but reliable given the sample sizes used), the combined scores have much larger effects. Thus Lee and his colleagues formed a PRS for years of education that could explain 11% of the variance (Lee et al., 2018) from the composite score of 1271 unrelated SNPs. The same logic of simply combining unrelated predictors, without GWAS, is seen in the Environmental Risk Scores for psychosis (Vassos et al., 2020) or the Environment Wide Association Studies to quantify general health risks of environmental pollutants (Park et al., 2014). All of these studies are using SNPs (or environmental exposures) as items in formative measures of risk. They do not posit a latent variable causing the SNPs.

Although most users of SEM think of the items as reflective indicators of latent variables, the alternative is to recognize that many of our latent variables are just formative sums of independent items. I am not denying the power of aggregation to form better measures; I am just suggesting that our measures need to be recognized for what they are: sums of independent items which do not necessarily, and frequently do not, have anything in common. That is, to think of a scale as more than a simple sum and to reify it as some latent variable is to mislead ourselves. With a finite number of items, factor score estimates are not latent variables; they are merely weighted sum scores. Focusing on measures of internal consistency at the cost of focusing on predictive validity is a mistake.

An alternative to the simple factor model of scale construction was proposed by McCrae (2014) in his distinction between scales as the intersection of items versus the union of items. Reconceptualizing our scales as formed from the union of multiple items that carry unique information makes problems in Differential Item Functioning (DIF) and factorial invariance less challenging than thinking of homogeneous scales all meant to measure one latent construct. Consider the case of sex differences in depression. Items measuring depression (e.g., “In the past week I have felt downhearted or blue” or “In the past week I felt hopeless about the future”) have roughly equal endorsement characteristics for males and females. But the item “In the past week I have cried easily or felt like crying” has a much higher threshold for men than for women (Schaeffer, 1988; Steinberg & Thissen, 2006), indicating a much higher level of depression for men who endorse the item. Similarly, lack of factorial invariance across cultures is not a reason to reject a scale, but is a reason to more carefully investigate the pattern of item differences across these cultures. Discussions of DIF in terms of relative versus absolute measurement help clarify the need to examine the meaning of items before leaping to conclusions about factor invariance at the scale level (Borsboom et al., 2002).

7.1. Conclusions

In the preceding pages I have taken the somewhat radical position that our emphasis upon latent variables and construct validity as an attempt to understand the structure of personality has come at the cost of showing that personality is actually useful. Although it is much easier (and more enjoyable) to talk about theories of Extraversion and Neuroticism (Eysenck, 1967) or Impulsivity and Anxiety (Gray, 1981, 1987), to use these higher level dimensions in predicting real outcomes is difficult. For to predict specific outcomes it is better to resort to short, non-homogeneous tests made up of the specific items that actually work. Such scales are formative measures that do not reflect some underlying latent cause, but are merely the observed sums of observed variables. We should stop believing in the Easter Bunny.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgment

I would like to thank David Condon, David Funder, Kayla Garner, Lew Goldberg, Robert Hogan, René Mõttus, and Daniel Ozer for their comments and suggestions.
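The aggregation argument made in the Discussion for polygenic risk scores (many trivially small, unrelated predictors summed into one useful score) can be made concrete with a little arithmetic. The sketch below is an illustration under stated assumptions, not an analysis from Lee et al. (2018): it assumes k standardized, mutually uncorrelated items that each correlate r with the criterion, in which case the unit-weighted sum has covariance k*r with the criterion and variance k, so its validity is r*sqrt(k). The function names and the simulation settings are hypothetical, and the equal-effects assumption is a deliberate simplification.

```python
import math
import numpy as np

def composite_validity(item_r, k):
    """Validity of a unit-weighted sum of k uncorrelated items, each with validity item_r."""
    return item_r * math.sqrt(k)

def implied_item_validity(r_squared, k):
    """Per-item validity implied if k uncorrelated, equally valid items explain r_squared."""
    return math.sqrt(r_squared / k)

# Figures quoted in the text: 1271 unrelated SNPs jointly explaining 11% of the variance.
# Under the equal-effects assumption each SNP would correlate about 0.0093 with the criterion.
r_item = implied_item_validity(0.11, 1271)
print(round(r_item, 4))

# Simulation check with assumed values: 100 independent items, each with validity 0.05
rng = np.random.default_rng(1)
n, k, b = 20000, 100, 0.05
x = rng.normal(size=(n, k))
# The residual variance is chosen so that the criterion has unit variance
y = b * x.sum(axis=1) + rng.normal(scale=math.sqrt(1 - k * b ** 2), size=n)
sim_r = np.corrcoef(x.sum(axis=1), y)[0, 1]
print(round(composite_validity(b, k), 2), round(sim_r, 2))
```

Nothing in this calculation requires the items to share a common cause: the sum predicts well precisely because the items are unrelated to one another, so each adds unique information.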
References

Anni, K., Vainik, U., & Mõttus, R. (2023). Personality profiles of 263 occupations. psyarxiv/ajvg2. https://osf.io/preprints/psyarxiv/ajvg2.
Armstrong, P. I., Smith, T. J., Donnay, D. A., & Rounds, J. (2004). The Strong ring: A basic interest model of occupational structure. Journal of Counseling Psychology, 51, 299–313. https://doi.org/10.1037/0022-0167.51.3.299
Athenstaedt, U. (2003). On the content and structure of the gender role self-concept: Including gender-stereotypical behaviors in addition to traits. Psychology of Women Quarterly, 27, 309–318. https://doi.org/10.1111/1471-6402.00111
Bainbridge, T. F., Ludeke, S. G., & Smillie, L. D. (2022). Evaluating the big five as an organizing framework for commonly used psychological trait scales. Journal of Personality and Social Psychology, 122, 749–777. https://doi.org/10.1037/pspp0000395
Bartholomew, D., Deary, I., & Lawn, M. (2009). A new lease of life for Thomson’s bonds model of intelligence. Psychological Review, 116, 567–579. https://doi.org/10.1037/a0016262
Bollen, K. A. (1989). Structural equations with latent variables. New York: Wiley.
Bollen, K. A. (2002). Latent variables in psychology and the social sciences. Annual Review of Psychology, 53, 605–634. https://doi.org/10.1146/annurev.psych.53.100901.135239
Borg, I. (2018). A note on the positive manifold hypothesis. Personality and Individual Differences, 134, 13–15. https://doi.org/10.1016/j.paid.2018.05.041
Borsboom, D. (2006). The attack of the psychometricians. Psychometrika, 71, 425–440. https://doi.org/10.1007/s11336-006-1447-6
Borsboom, D., & Mellenbergh, G. J. (2004). Why psychometrics is not pathological. Theory & Psychology, 14, 105–120. https://doi.org/10.1177/0959354304040200
Borsboom, D., Mellenbergh, G. J., & Heerden, J. V. (2002). Different kinds of DIF: A distinction between absolute and relative forms of measurement invariance and bias. Applied Psychological Measurement, 26, 433–450. https://doi.org/10.1177/014662102237798
Bravais, A. (1844). Analyse mathématique sur les probabilités des erreurs de situation d’un point. Memoires Presentees a l’Academie Royale des Sciences de l’Institut de France.
Campbell, D. T., & Fiske, D. W. (1959). Convergent and discriminant validation by the multitrait-multimethod matrix. Psychological Bulletin, 56, 81–105. https://doi.org/10.1037/h0046016
Carroll, J. B. (1952). Ratings on traits measured by a factored personality inventory. The Journal of Abnormal and Social Psychology, 47, 626.
Cattell, R. B., & Stice, G. (1957). Handbook for the Sixteen Personality Factor Questionnaire. Champaign, Ill: Institute for Ability and Personality Testing.
Comrey, A. L. (2008). The Comrey Personality Scales. In G. J. Boyle, G. Matthews, & D. H. Saklofske (Eds.), Vol. II. Sage handbook of personality theory and testing: Personality measurement and assessment (pp. 113–134). London: Sage.
Condon, D. M. (2018). The SAPA Personality Inventory: An empirically-derived, hierarchically-organized self-report personality assessment model. PsyArXiv. https://doi.org/10.31234/osf.io/sc4p9
Condon, D. M. (2019). Database of Individual Differences Survey Tools. Harvard Dataverse. https://doi.org/10.7910/DVN/T1NQ4V
Condon, D. M. (2022). RetestReliability = f(Stability,Memory,Personality) + ε. Presented at symposium in honor of Sarah Dubrow.
Condon, D. M. (2023). In osf.o/da59z (Ed.), Big five replicability. ARP.
Condon, D. M., & Revelle, W. (2014). The International Cognitive Ability Resource: Development and initial validation of a public-domain measure. Intelligence, 43, 52–64. https://doi.org/10.1016/j.intell.2014.01.004
Costa, P. T., & McCrae, R. R. (1992). Four ways five factors are basic. Personality and Individual Differences, 13, 653–665. https://doi.org/10.1016/0191-8869(92)90236-I
Cronbach, L. J. (1951). Coefficient alpha and the internal structure of tests. Psychometrika, 16, 297–334. https://doi.org/10.1007/BF02310555
Cronbach, L. J., & Meehl, P. E. (1955). Construct validity in psychological tests. Psychological Bulletin, 52, 281–302. https://doi.org/10.1037/h0040957
Cudeck, R., & MacCallum, R. C. (2007). Factor analysis at 100: Historical developments and future directions. Mahwah, N.J: Lawrence Erlbaum Associates.
Cureton, E. E. (1950). Validity, reliability, and baloney. Educational and Psychological Measurement, 10, 94–96. https://doi.org/10.1177/001316445001000107
Cutler, A., & Condon, D. M. (2023). Deep lexical hypothesis: Identifying personality structure in natural language. Journal of Personality and Social Psychology, 125, 173–197. https://doi.org/10.1037/pspp0000443
Dawis, R. V. (1992). The individual differences tradition in counseling psychology. Journal of Counseling Psychology, 39, 7–19. https://doi.org/10.1037/0022-0167.39.1.7
Deary, I. J. (2001). Intelligence: A very short introduction. OUP Oxford.
Deary, I. J. (2009). Introduction to the special issue on cognitive epidemiology. Intelligence, 37, 517–519. https://doi.org/10.1016/j.intell.2009.05.001
Digman, J. M. (1990). Personality structure: Emergence of the five-factor model. Annual Review of Psychology, 41, 417–440. https://doi.org/10.1146/annurev.ps.41.020190.002221
Digman, J. M. (1997). Higher-order factors of the big five. Journal of Personality and Social Psychology, 73, 1246–1256. https://doi.org/10.1037/0022-3514.73.6.1246
Donnay, D., Morris, M., Schaubhut, N., & Thompson, R. (2005). Strong interest inventory manual (rev. ed.). Palo Alto: Consulting Psychologists Press, Inc.
Donnay, D. A. (1997). E.K. Strong’s legacy and beyond: 70 years of the Strong interest inventory. Career Development Quarterly, 46, 2–22. https://doi.org/10.1002/j.2161-0045.1997.tb00688.x
Donnay, D. A., & Borgen, F. H. (1996). Validity, structure, and content of the 1994 strong interest inventory. Journal of Counseling Psychology, 43, 275–291 (doi: 0022-0167/96).
Dworak, E. M., Revelle, W., Doebler, P., & Condon, D. M. (2021). Using the International Cognitive Ability Resource as an open source tool to explore individual differences in cognitive ability. Personality and Individual Differences, 169. https://doi.org/10.1016/j.paid.2020.109906
Eagly, A. H., & Revelle, W. (2022). Understanding the magnitude of psychological differences between women and men requires seeing the forest and the trees. Perspectives on Psychological Science, 17, 1339–1358. https://doi.org/10.1177/17456916211046006
Elleman, L. G., McDougald, S., Revelle, W., & Condon, D. (2020). That takes the BISCUIT: A comparative study of predictive accuracy and parsimony of four statistical learning techniques in personality data, with data missingness conditions. European Journal of Psychological Assessment, 36, 948–958. https://doi.org/10.1027/1015-5759/a000590
Embretson, S. E. (1996). The new rules of measurement. Psychological Assessment, 8, 341–349. https://doi.org/10.1037/1040-3590.8.4.341
Embretson, S. E., & Hershberger, S. L. (1999). The new rules of measurement: What every psychologist and educator should know. Mahwah, N.J: L. Erlbaum Associates.
Eysenck, H. J. (1944). Types of personality: A factorial study of seven hundred neurotics. The British Journal of Psychiatry, 90, 851–861. https://doi.org/10.1192/bjp.90.381.851
Eysenck, H. J. (1952). The scientific study of personality. London: Routledge & K. Paul.
Eysenck, H. J. (1953). Uses and abuses of psychology. London, Baltimore: Penguin Books.
Eysenck, H. J. (1964). Sense and nonsense in psychology. Baltimore: Penguin Books.
Eysenck, H. J. (1965). Fact and fiction in psychology. Baltimore: Penguin Books.
Eysenck, H. J. (1967). The biological basis of personality. Springfield: Thomas.
Eysenck, H. J. (1990). Biological dimensions of personality. In L. A. Pervin (Ed.), Handbook of personality: Theory and research (pp. 244–276). New York, NY: Guilford Press.
Eysenck, H. J., & Eysenck, M. W. (1985). Personality and individual differences: A natural science approach. New York: Plenum.
Eysenck, H. J., & Eysenck, S. B. G. (1964). Eysenck Personality Inventory. San Diego, California: Educational and Industrial Testing Service.
Eysenck, H. J., & Himmelweit, H. T. (1947). Dimensions of personality; a record of research carried out in collaboration with H.T. Himmelweit [and others]. London: Routledge & Kegan Paul.
Flynn, J. R. (1984). The mean IQ of Americans: Massive gains 1932 to 1978. Psychological Bulletin, 95, 29–51. https://doi.org/10.1037/0033-2909.95.1.29
Flynn, J. R. (1987). Massive IQ gains in 14 nations: What IQ tests really measure. Psychological Bulletin, 101, 171–191. https://doi.org/10.1037/0033-2909.101.2.171
Forbes, M. K., Sunderland, M., Rapee, R. M., Batterham, P. J., Calear, A. L., Carragher, N., Ruggero, C., Zimmerman, M., Baillie, A. J., Lynch, S. J., Mewton, L., Slade, T., & Krueger, R. F. (2021). A detailed hierarchical model of psychopathology: From individual symptoms up to the general factor of psychopathology. Clinical Psychological Science, 9, 139–168. https://doi.org/10.1177/2167702620954799
Galton, F. (1888). Co-relations and their measurement. Proceedings of the Royal Society. London Series, 45, 135–145.
Garner, K. M. (2024). The forgotten trade-off between internal consistency and validity (abstract). In Multivariate behavioral research (in press).
Goldberg, L. R. (1972). Parameters of personality inventory construction and utilization: A comparison of prediction strategies and tactics. In Multivariate behavioral research monographs, no. 72-2, 7.
Goldberg, L. R. (1990). An alternative “description of personality”: The big-five factor structure. Journal of Personality and Social Psychology, 59, 1216–1229. https://doi.org/10.1037/0022-3514.59.6.1216
Goldberg, L. R., Johnson, J. A., Eber, H. W., Hogan, R., Ashton, M. C., Cloninger, C. R., & Gough, H. G. (2006). The international personality item pool and the future of public-domain personality measures. Journal of Research in Personality, 40, 84–96. https://doi.org/10.1016/j.jrp.2005.08.007
Gottlieb, T., Furnham, A., & Klewe, J. B. (2021). Personality in the light of identity, reputation and role taking: A review of socioanalytic theory. Psychology, 12, 2020–2041. https://doi.org/10.4236/psych.2021.1212123
Gough, H. G. (1957). Manual for the California Psychological Inventory.
Gough, H. G. (1960). The adjective check list as a personality assessment research technique. Psychological Reports, 6, 107–122. https://doi.org/10.2466/pr0.1960.6.1.107
Gough, H. G. (1965). Conceptual analysis of psychological test scores and other diagnostic variables. Journal of Abnormal Psychology, 70, 294–302. https://doi.org/10.1037/h0022397
Gray, J. A. (1981). A critique of Eysenck’s theory of personality. In H. J. Eysenck (Ed.), A model for personality (pp. 246–277). Berlin: Springer.
Gray, J. A. (1987). Perspectives on anxiety and impulsivity: A commentary. Journal of Research in Personality, 21, 493–509. https://doi.org/10.1016/0092-6566(87)90036-5
Gruber, F. M., Distlberger, E., Scherndl, T., Ortner, T. M., & Pletzer, B. (2020). Psychometric properties of the multifaceted gender-related attributes survey (GERAS). European Journal of Psychological Assessment, 36, 612–623. https://doi.org/10.1027/1015-5759/a000528
Guilford, J. P. (1940). Inventory of factors STDCR. Beverly Hills, Calif: Sheridan Supply Co.
Guilford, J. P. (1954). Psychometric methods (2nd ed.). New York: McGraw-Hill.
Guilford, J. P. (1956). The structure of intellect. Psychological Bulletin, 53, 267–293. https://doi.org/10.1037/h0040755
Guilford, J. P. (1975). Factors and factors of personality. Psychological Bulletin, 82, 802–814. https://doi.org/10.1037/h0077101
Gulliksen, H. (1950). Theory of mental tests. John Wiley & Sons, Inc.
Guttman, L. (1945). A basis for analyzing test-retest reliability. Psychometrika, 10, 255–282. https://doi.org/10.1007/BF02288892
Hase, H. D., & Goldberg, L. R. (1967). Comparative validity of different strategies of constructing personality inventory scales. Psychological Bulletin, 67, 231–248. https://doi.org/10.1037/h0024421
Hathaway, S., & McKinley, J. (1943). Manual for administering and scoring the MMPI.
Henry, S., Thielmann, I., Booth, T., & Mõttus, R. (2022). Test-retest reliability of the HEXACO-100 – and the value of multiple measurements for assessing reliability. PLoS One, 17, 1–14. https://doi.org/10.1371/journal.pone.0262465
Herrnstein, R. J., & Murray, C. (2010). The Bell Curve: Intelligence and class structure in American life. Simon and Schuster.
Hogan, R. (1982). A socioanalytic theory of personality. In Nebraska Symposium on Motivation (pp. 55–89). University of Nebraska Press.
Hogan, R. (2009). John Holland. URL: https://www.hoganassessments.com/blog/john-holland/.
Hogan, R., & Blickle, G. (2018). Socioanalytic theory: Basic concepts, supporting evidences, and practical implications. In V. Zeigler-Hill & T. K. Shackelford (Eds.), The SAGE handbook of personality and individual differences: The science of personality and individual differences (Vol. 1, pp. 110–129). Sage reference. https://doi.org/10.4135/9781526451163.n5
Hogan, R., Curphy, G. J., & Hogan, J. (1994). What we know about leadership: Effectiveness and personality. American Psychologist, 49, 493–504. https://doi.org/10.1037/0003-066X.49.6.493
Hogan, R., & Hogan, J. (1995). The Hogan personality inventory manual (2nd ed.). Tulsa, OK: Hogan Assessment Systems.
Hogan, R., Hogan, J., & Roberts, B. W. (1996). Personality measurement and employment decisions: Questions and answers. American Psychologist, 51, 469–477. https://doi.org/10.1037/0003-066X.51.5.469
Hogan, R., & Sherman, R. A. (2020). Personality theory and the nature of human nature. Personality and Individual Differences, 152, Article 109561. https://doi.org/10.1016/j.paid.2019.109561
Holland, J. L. (1959). A theory of vocational choice. Journal of Counseling Psychology, 6, 35–45. https://doi.org/10.1037/h0040767
Holland, J. L. (1996). Exploring careers with a typology: What we have learned and some new directions. American Psychologist, 51, 397–406. https://doi.org/10.1037/0003-066X.51.4.397
Howell, R. D., Breivik, E., & Wilcox, J. B. (2007). Reconsidering formative measurement. Psychological Methods, 12, 205–218. https://doi.org/10.1037/1082-989X.12.2.205
Jensen, A. R. (1969). How much can we boost IQ and scholastic achievement. Harvard Educational Review, 39, 1–123. https://doi.org/10.17763/haer.39.1.l3u15956627424k7
Jensen, A. R. (1998). The g factor: The science of mental ability. Westport, CT: Praeger.
Jensen, A. R., & Weng, L. J. (1994). What is a good g? Intelligence, 18, 231–258. https://doi.org/10.1016/0160-2896(94)90029-9
Johnson, W., Brett, C. E., & Deary, I. J. (2010). The pivotal role of education in the association between ability and social class attainment: A look across three generations. Intelligence, 38, 55–65. https://doi.org/10.1016/j.intell.2009.11.008
Jonas, K. G., & Markon, K. E. (2016). A descriptivist approach to trait conceptualization and inference. Psychological Review, 123, 90–96. https://doi.org/10.1037/a0039542
Jöreskog, K. G. (1978). Structural analysis of covariance and correlation matrices. Psychometrika, 43, 443–477. https://doi.org/10.1007/BF02293808
Kovacs, K., & Conway, A. R. (2019). A unified cognitive/differential approach to human intelligence: Implications for IQ testing. Journal of Applied Research in Memory and Cognition, 8, 255–272. https://doi.org/10.1016/j.jarmac.2019.05.003
Kovacs, K., & Conway, A. R. A. (2016). Process overlap theory: A unified account of the general factor of intelligence. Psychological Inquiry, 27, 151–177. https://doi.org/10.1080/1047840X.2016.1153946
Krueger, R. F., & Markon, K. E. (2006a). Reinterpreting comorbidity: A model-based approach to understanding and classifying psychopathology. Annual Review of Clinical Psychology, 2, 111–133. https://doi.org/10.1146/annurev.clinpsy.2.022305.095213
Krueger, R. F., & Markon, K. E. (2006b). Understanding psychopathology: Melding behavior genetics, personality, and quantitative psychology to develop an empirically based model. Current Directions in Psychological Science, 15, 113–117. https://doi.org/10.1111/j.0963-7214.2006.0041
Kuder, G., & Richardson, M. (1937). The theory of the estimation of test reliability. Psychometrika, 2, 151–160. https://doi.org/10.1007/BF02288391
Lee, J. J., Wedow, R., Okbay, A., et al. (2018). Gene discovery and polygenic prediction from a genome-wide association study of educational attainment in 1.1 million individuals. Nature Genetics, 50, 1112–1121. https://doi.org/10.1038/s41588-018-0147-3
Loevinger, J. (1957). Objective tests as instruments of psychological theory. Psychological Reports Monograph Supplement, 9(3), 635–694. https://doi.org/10.2466/pr0.1957.3.3.635
Markon, K. E., Krueger, R. F., & Watson, D. (2005). Delineating the structure of normal and abnormal personality: An integrative hierarchical approach. Journal of Personality and Social Psychology, 88, 139–157. https://doi.org/10.1037/0022-3514.88.1.139
Marschak, J. (1954). Probability in the social sciences. In P. Lazarsfeld (Ed.), Mathematical thinking in the social sciences (pp. 166–215). Free Press.
McCrae, R. R. (2014). A more nuanced view of reliability: Specificity in the trait hierarchy. Personality and Social Psychology Review, 19, 97–112. https://doi.org/10.1177/1088868314541857
McCrae, R. R., Kurtz, J. E., Yamagata, S., & Terracciano, A. (2011). Internal consistency, retest reliability, and their implications for personality scale validity. Personality and Social Psychology Review, 15, 28–50. https://doi.org/10.1177/1088868310366253
McDonald, R. P. (1999). Test theory: A unified treatment. Mahwah, N.J: L. Erlbaum Associates.
Mõttus, R., Wood, D., Condon, D. M., Back, M. D., Baumert, A., Costantini, G., Epskamp, S., Greiff, S., Johnson, W., Lukaszewski, A., Murray, A., Revelle, W., Wright, A. G., Yarkoni, T., Ziegler, M., & Zimmermann, J. (2020). Descriptive, predictive and explanatory personality research: Different goals, different approaches, but a shared need to move beyond the big few traits. European Journal of Personality, 34, 1175–1201. https://doi.org/10.1002/per.2311
Musek, J. (2007). A general factor of personality: Evidence for the big one in the five-factor model. Journal of Research in Personality, 41, 1213–1233. https://doi.org/10.1016/j.jrp.2007.02.003
Nunnally, J. C. (1978). Psychometric theory (2nd ed.). New York: McGraw-Hill.
Park, S. K., Tao, Y., Meeker, J. D., Harlow, S. D., & Mukherjee, B. (2014). Environmental risk score as a new tool to examine multi-pollutants in epidemiologic research: An example from the NHANES study using serum lipid levels. PLoS One, 9, Article e98632. https://doi.org/10.1371/journal.pone.0098632
Peabody, D. (1967). Trait inferences: Evaluative and descriptive aspects. Journal of Personality and Social Psychology, 7, 1–18. https://doi.org/10.1037/h0025230
Pearson, K. (1896). Mathematical contributions to the theory of evolution. III. Regression, heredity, and panmixia. Philosophical Transactions of the Royal Society of London. Series A, 187, 254–318. https://doi.org/10.1098/rsta.1896.0007
Plato (n.d.). The Republic: The complete and unabridged Benjamin Jowett translation (1892) (3rd ed.). Oxford: Oxford University Press.
R Core Team. (2023). R: A language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing. URL: https://www.R-project.org/.
Reise, S. (1999). Personality measurement issues viewed through the eyes of IRT. In S. E. Embretson, & S. L. Hershberger (Eds.), The new rules of measurement: What every psychologist and educator should know (pp. 219–241). Mahwah, N.J: Lawrence Erlbaum Associates.
Revelle, W. (1983). Factors are fictions, and other comments on individuality theory. Journal of Personality, 51, 707–714. https://doi.org/10.1111/1467-6494.ep7380795
Revelle, W. (1989). Personality theory is alive and well and living in Europe. Contemporary Psychology: APA Review of Books, 34, 235–236. https://doi.org/10.1037/027760
Revelle, W. (2023a). psych: Procedures for psychological, psychometric, and personality research (2.3.9 ed.). Evanston: Northwestern University. https://CRAN.r-project.org/package=psych (R package version 2.3.9).
Revelle, W. (2023b). psychTools: Tools to accompany the psych package for psychological research. Evanston: Northwestern University (R package version 2.3.9).
Revelle, W., & Condon, D. (2023). Using unidim rather than omega in estimating unidimensionality (submitted).
Revelle, W., Dworak, E. M., & Condon, D. M. (2020). Cognitive ability in everyday life: The utility of open source measures. Current Directions in Psychological Science, 29, 358–363. https://doi.org/10.1177/0963721420922178
Revelle, W., Dworak, E. M., & Condon, D. M. (2021). Exploring the persome: The power of the item in understanding personality structure. Personality and Individual Differences, 169. https://doi.org/10.1016/j.paid.2020.109905
Revelle, W., & Elleman, L. G. (2016). Factors are still fictions [peer commentary on “towards more rigorous personality trait–outcome research,” by R. Mõttus]. European Journal of Personality, 30, 324–325.
Revelle, W., & Garner, K. M. (2023). Measurement: Reliability, construct validation, and scale construction. In H. T. Reis, T. West, & C. M. Judd (Eds.), Handbook of research methods in social and personality psychology (in press).
Revelle, W., & Wilt, J. (2013). The general factor of personality: A general critique. Journal of Research in Personality, 47, 493–504. https://doi.org/10.1016/j.jrp.2013.04.012
Revelle, W., Wilt, J., & Condon, D. (2011). Individual differences and differential psychology: A brief history and prospect. In T. Chamorro-Premuzic, A. Furnham, & S. von Stumm (Eds.), Handbook of individual differences (pp. 3–38). Oxford: Wiley-Blackwell.
Roberts, B. W., Kuncel, N. R., Shiner, R., Caspi, A., & Goldberg, L. R. (2007). The power of personality: The comparative validity of personality traits, socioeconomic status, and cognitive ability for predicting important life outcomes. Perspectives on Psychological Science, 2, 313–345. https://doi.org/10.1111/j.1745-6916.2007.000
Royce, J. R. (1983). Personality integration: A synthesis of the parts and wholes of individuality theory. Journal of Personality, 51, 683–706. https://doi.org/10.1111/j.1467-6494.1983.tb00874.x
Schaeffer, N. C. (1988). An application of item response theory to the measurement of depression. Sociological Methodology, 18, 271–307. URL: http://www.jstor.org/stable/271051.
Sijtsma, K. (2009a). On the use, the misuse, and the very limited usefulness of Cronbach’s alpha. Psychometrika, 74, 107–120. https://doi.org/10.1007/s11336-008-9101-0
Sijtsma, K. (2009b). Reliability beyond theory and into practice. Psychometrika, 74, 169–173. https://doi.org/10.1007/s11336-008-9103-y
Slaney, K. (2017). Historical precursors and early testing theory. London: Palgrave Macmillan UK. https://doi.org/10.1057/978-1-137-38523-9_2
Spearman, C. (1904a). “General Intelligence,” objectively determined and measured. American Journal of Psychology, 15, 201–292. https://doi.org/10.2307/1412107
Spearman, C. (1904b). The proof and measurement of association between two things. The American Journal of Psychology, 15, 72–101. https://doi.org/10.2307/1412159
Steinberg, L., & Thissen, D. (2006). Using effect sizes for research reporting: Examples using item response theory to analyze differential item functioning. Psychological Methods, 11, 402–415. https://doi.org/10.1037/1082-989X.11.4.402
Stewart, R. D., Mõttus, R., Seeboth, A., Soto, C. J., & Johnson, W. (2022). The finer details? The predictability of life outcomes from big five domains, facets, and nuances. Journal of Personality, 90, 167–182. https://doi.org/10.1111/jopy.12660
Strong, E. K., Jr. (1927). Vocational interest test. Educational Record, 8, 107–121.
Strong, E. K., Jr. (1947). Vocational interests of men and women. Stanford University Press.
Su, R., Tay, L., Liao, H. Y., Zhang, Q., & Rounds, J. (2019). Toward a dimensional model of vocational interests. Journal of Applied Psychology, 104, 690. https://doi.org/10.1037/apl0000373
Thomson, G. H. (1916). A hierarchy without a general factor. British Journal of Psychology, 8, 271–281.
Thomson, G. H. (1935). The definition and measurement of “g” (general intelligence). Journal of Educational Psychology, 26, 241–262. https://doi.org/10.1037/h0059873
Thurstone, L. L. (1934). The vectors of mind. Psychological Review, 41, 1. https://doi.org/10.1037/h0075959
Thurstone, L. L. (1935). The vectors of mind: Multiple-factor analysis for the isolation of primary traits. Chicago: Univ. of Chicago Press.
Vassos, E., Sham, P., Kempton, M., Trotta, A., Stilo, S. A., Gayer-Anderson, C., … Morgan, C. (2020). The Maudsley environmental risk score for psychosis. Psychological Medicine, 50, 2213–2220.
Watts, A. L., Greene, A. L., Bonifay, W., & Fried, E. I. (2023). A critical evaluation of the p-factor literature. PsyArXiv 7yrnp/. https://doi.org/10.31234/osf.io/7yrnp
Webb, E. (1915). Character and intelligence: An attempt at an exact study of character. The British Journal of Psychology, Monograph Supplements I.
Widaman, K. F., & Revelle, W. (2023a). Thinking thrice about sum scores, and then some more about measurement and analysis. Behavior Research Methods, 55, 788–806. https://doi.org/10.3758/s13428-022-01849-w
Widaman, K. F., & Revelle, W. (2023b). Thinking about sum scores yet again, maybe the last time, we don’t know, oh no…: A comment on McNeish (2023). Educational and Psychological Measurement, 0, Article 00131644231205310. https://doi.org/10.1177/00131644231205310
Wiley, D. E. (1973). The identification problem for structural equation models with unmeasured variables. In A. S. Goldberger, & O. D. Duncan (Eds.), Structural equation models in the social sciences (pp. 69–83). New York: Seminar Press.
Yarkoni, T., & Westfall, J. (2017). Choosing prediction over explanation in psychology: Lessons from machine learning. Perspectives on Psychological Science, 12, 1100–1122. https://doi.org/10.1177/1745691617693393
Zola, A., Condon, D. M., & Revelle, W. (2021). The convergence of self and informant reports in a large online sample. Collabra: Psychology, 7. https://doi.org/10.1525/collabra.25983