British Journal of Mathematical and Statistical Psychology (2016), 69, 344–351
© 2016 The British Psychological Society
www.wileyonlinelibrary.com
A short note on the maximal point-biserial
correlation under non-normality
Ying Cheng* and Haiyan Liu
Department of Psychology, University of Notre Dame, Indiana, USA
The aim of this paper is to derive the maximal point-biserial correlation under non-
normality. Several widely used non-normal distributions are considered, namely the
uniform distribution, t-distribution, exponential distribution, and a mixture of two normal
distributions. Results show that the maximal point-biserial correlation, depending on the
non-normal continuous variable underlying the binary manifest variable, may not be a
function of p (the probability that the dichotomous variable takes the value 1), can be
symmetric or non-symmetric around p = .5, and may still lie in the range from −1.0 to 1.0.
Therefore researchers should exercise caution when they interpret their sample point-
biserial correlation coefficients based on popular beliefs that the maximal point-biserial
correlation is always smaller than 1, and that the size of the correlation is always further
restricted as p deviates from .5.
1. Introduction
The population product-moment correlation between a continuous (X) and dichotomous
(Y) variable, denoted by $\rho_{pb}(X, Y)$, is also known as the point-biserial correlation:
$$\rho_{pb}(X, Y) = \frac{E[(X - \mu_X)(Y - \mu_Y)]}{\sigma_X \sigma_Y}.$$
Given that $\mu_Y = P(Y = 1) = p$ and $\sigma_Y = \sqrt{p(1 - p)}$, we get
$$\rho_{pb}(X, Y) = \frac{E[(X - \mu_X)(Y - p)]}{\sigma_X \sqrt{p(1 - p)}}.$$
Here $\rho_{pb}(X, Y)$ is the point-biserial correlation at the population level between X and
Y, whose sample counterpart is denoted by $r_{pb}(X, Y)$.
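As a minimal illustration, and not part of the original derivation, the sample counterpart $r_{pb}$ can be computed as an ordinary Pearson correlation with Y coded 0/1. The function name `point_biserial` and the toy data are hypothetical, and population (divide-by-n) moments are assumed throughout.

```python
import math
from statistics import fmean, pstdev

def point_biserial(x, y):
    """Sample point-biserial correlation: Pearson r with y coded 0/1."""
    mx = fmean(x)
    p = fmean(y)  # proportion of cases with y = 1
    cov = fmean([(xi - mx) * (yi - p) for xi, yi in zip(x, y)])
    return cov / (pstdev(x) * math.sqrt(p * (1 - p)))

# Toy data (hypothetical): the three largest x values have y = 1.
x = [1, 2, 3, 4, 5, 6]
y = [0, 0, 0, 1, 1, 1]
print(round(point_biserial(x, y), 4))
```

Even with a perfect split of the ordered x values, the coefficient in this toy example stays below 1, which is the range restriction discussed in the rest of the paper.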
When the binary variable Y can be assumed to come from dichotomizing a normally
distributed continuous variable $Y^*$, the product-moment correlation between X and $Y^*$ is
called the biserial correlation. Suppose that $P(Y = 1) = P(Y^* > \tau_p) = p$ and
$P(Y = 0) = P(Y^* \le \tau_p) = 1 - p$. Then $\tau_p$ is the dichotomization threshold on $Y^*$. If $Y^*$
follows the standard normal distribution, $\tau_p$ is equal to $\Phi^{-1}(1 - p)$, where $\Phi(\cdot)$ represents
the standard normal cumulative distribution function (c.d.f.). Point-biserial and biserial
correlations play an important role in psychometric theory, for example in item analysis
(Crocker & Algina, 2006, p. 317).
*Correspondence should be addressed to Ying Cheng, Department of Psychology, University of Notre Dame, 118
Haggar Hall, Notre Dame, IN 46556, USA (email: [email protected]).
DOI:10.1111/bmsp.12075
It is well known that the range of the point-biserial correlation may be constrained.
Lord and Novick (1968, p. 340) noted that the ‘point-biserial is never as much as four-fifths
of the biserial’ correlation, because
$$\rho_{pb}(X, Y) = \rho_b(X, Y)\,\frac{\phi(\tau_p)}{\sqrt{p(1 - p)}},$$
where $\rho_b(X, Y)$ is the corresponding biserial correlation, and $\phi(\tau_p)/\sqrt{p(1 - p)} < .8$.
Here $\phi(\cdot)$ is the standard normal probability density function. Lord and Novick (1968)
further noted that since $|\rho_b| \le 1$, $\rho_{pb}$ has to be smaller than .8.
Gradstein (1986) analytically derived the maximal $\rho_{pb}(X, Y)$ when X is normally
distributed. The maximum is exactly $\phi(\tau_p)/\sqrt{p(1 - p)}$, a function that depends only on p.
As p departs further from .5, $\phi(\tau_p)/\sqrt{p(1 - p)}$ decreases. Therefore, researchers
have cautioned against interpreting the size of the point-biserial correlation based on the
usual range of the product-moment correlation, that is, [−1, 1] (Allen & Yen, 1979, p. 39;
Gradstein, 1986).
On the other hand, a simple Google search pulled up many web pages that claimed
point-biserial correlation values range from −1.0 to +1.0
(e.g., http://www.statisticssolutions.com/point-biserial-correlation/).
Interested users of the point-biserial correlation
may be confused by this apparently glaring discrepancy and inconsistency. One possible
reason is violation of the normality assumption which was used in the derivation by
Gradstein (1986). While the assumption of normality is traditionally very common in
psychological research, there has been increasing awareness of and attention to non-
normal data in recent years. For example, researchers have investigated the effect of non-
normality on factor analysis and structural equation modelling (Curran, West, & Finch,
1996; Jennrich & Satorra, 2014; Yanagihara, Tonda, & Matsumoto, 2005; Yuan, Bentler, &
Zhang, 2005; Yuan, Marshall, & Bentler, 2002), reliability coefficient estimation (Sheng &
Sheng, 2012; Yuan & Bentler, 2002), test equating (Zu & Yuan, 2012), mixture item
response theory modelling (Sen, Cohen, & Kim, 2016), as well as missing-data analysis
(Tong, Zhang, & Yuan, 2014; Yuan, 2009; Yuan & Bentler, 2010).
The aim of this paper is to derive the range of $\rho_{pb}$ under several widely used non-normal
distributions. Interestingly, the issue of non-normality has been well recognized in the
context of statistical testing, for example in power analysis and sample size planning
(Bonett & Wright, 2000; Fowler, 1987) and corrections for statistical tests of correlation
under non-normality (Yuan & Bentler, 2000), but we are not aware of any analytical
treatment of the range of the point-biserial correlation under non-normality. In addition,
the non-normality discussed in the context of statistical testing of $\rho_{pb}$ usually focuses on
the underlying distribution of the binary variable Y instead of the distribution of the
continuous variable X. In the derivation and discussion below it is not essential what
distribution underlies Y.
2. Maximal $\rho_{pb}(X, Y)$ when X is normally distributed
Gradstein (1986) derived the maximal $\rho_{pb}(X, Y)$ when X is normally distributed. Here we
provide details of the derivation, which will facilitate our derivation of the range of $\rho_{pb}$
under non-normal distributions. Let us again consider the continuous variable
$X \sim N(\mu, \sigma^2)$, and the dichotomous variable Y, which is coded as 0 or 1, with
$P(Y = 1) = p$. So $E[Y] = p$ and $\mathrm{Var}[Y] = p(1 - p)$. Without loss of generality, we can
consider the correlation between Z and Y, where Z is a linear transformation of X by
$Z = (X - \mu)/\sigma$. Clearly $Z \sim N(0, 1)$, so that $\mu_Z = 0$ and $\sigma_Z = 1$. Then
$$\rho_{pb}(X, Y) = \rho_{pb}(Z, Y) = \frac{E[(Z - \mu_Z)(Y - p)]}{\sigma_Z \sqrt{p(1 - p)}} = \frac{E[ZY]}{\sqrt{p(1 - p)}}.$$
For any given p, the maximum of $\rho_{pb}(X, Y)$ is reached when $E[ZY]$ is maximized. Since
Y is either 1 or 0, ZY is either Z or 0. Therefore $E[ZY]$ is maximized when the top 100p% of
the Z distribution happens to have Y = 1, and
$$\max\{E[ZY]\} = p\,E[Z \mid Z > \tau_p].$$
Note that the conditional expectation on the right-hand side of the equation is the mean of
a left-truncated standard normal distribution.
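The first moment of a truncated normal distribution with mean $\mu$ and variance $\sigma^2$ is
$$\mu + \sigma\,\frac{\phi\left(\frac{a - \mu}{\sigma}\right) - \phi\left(\frac{b - \mu}{\sigma}\right)}{\Phi\left(\frac{b - \mu}{\sigma}\right) - \Phi\left(\frac{a - \mu}{\sigma}\right)},$$
where a and b are the left and right truncation points, respectively, and a < b.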
Therefore, the mean of a left-truncated standard normal distribution as above can be
obtained by plugging in $a = \tau_p$, $b = +\infty$, $\mu = 0$, and $\sigma = 1$:
$$E[Z \mid Z > \tau_p] = \frac{\phi(\tau_p)}{p},$$
and consequently
$$\max\{E[ZY]\} = \phi(\tau_p).$$
The range of $\rho_{pb}(X, Y)$ is thus $\left[-\phi(\tau_p)/\sqrt{p(1 - p)},\ \phi(\tau_p)/\sqrt{p(1 - p)}\right]$ when X is
normally distributed. The maximum of $\phi(\tau_p)/\sqrt{p(1 - p)}$ is .798, which is reached when
p = .5. This is the familiar result discussed in classic texts such as Lord and Novick (1968)
and Allen and Yen (1979).
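The bound for the normal case can be checked numerically. The following is a minimal sketch using only the Python standard library; the function name `max_rho_normal` is hypothetical and the grid of p values is an arbitrary choice for illustration.

```python
import math
from statistics import NormalDist

def max_rho_normal(p):
    """Maximal rho_pb when X is normal: phi(tau_p) / sqrt(p (1 - p))."""
    std_normal = NormalDist()              # standard normal, mean 0 and sd 1
    tau_p = std_normal.inv_cdf(1 - p)      # threshold with P(Z > tau_p) = p
    return std_normal.pdf(tau_p) / math.sqrt(p * (1 - p))

for p in (0.1, 0.3, 0.5, 0.7, 0.9):
    print(f"p = {p:.1f}  max rho_pb = {max_rho_normal(p):.3f}")
```

At p = .5 the threshold is 0, so the function returns $\phi(0)/.5 \approx .798$, and the printed values fall off symmetrically on either side.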
3. Maximal $\rho_{pb}(X, Y)$ when X is non-normally distributed
The derivation above clearly shows where the normality assumption comes into play in
obtaining the range of $\rho_{pb}(X, Y)$. If X is non-normally distributed, the range will change.
Here we consider four non-normal distributions: uniform, Student’s t, exponential, and a
mixture of two normal distributions. These distributions are also considered in other
studies of point-biserial correlation in the presence of non-normality (e.g., Fowler, 1987).
3.1. X follows the uniform distribution, a symmetric distribution within an interval
Suppose $X \sim U(\alpha, \beta)$, which is a uniform distribution with a lower bound of $\alpha$ and an
upper bound of $\beta$. The correlation between X and Y is the same as the correlation between
$Z_U$ and Y, where $Z_U = (X - \alpha)/(\beta - \alpha) \sim U(0, 1)$. $Z_U$ has mean .5 and variance 1/12.
Following the same logic as above, we get
$$\rho_{pb}(X, Y) = \rho_{pb}(Z_U, Y) = \frac{E[(Z_U - .5)(Y - p)]}{\sqrt{p(1 - p)/12}} = \frac{\sqrt{12}\,\left(E[Z_U Y] - \frac{p}{2}\right)}{\sqrt{p(1 - p)}}.$$
The maximum of $\rho_{pb}(X, Y)$ is reached when $E[Z_U Y]$ is maximized, or when the top
100p% of the $Z_U$ distribution happens to have Y = 1:
$$\max\{E[Z_U Y]\} = p\,E[Z_U \mid Z_U > (1 - p)] = p\left(1 - \frac{p}{2}\right).$$
Therefore
$$\max\{\rho_{pb}(X, Y)\} = \sqrt{3p(1 - p)},$$
or the range of $\rho_{pb}(X, Y)$ is $\left[-\sqrt{3p(1 - p)},\ \sqrt{3p(1 - p)}\right]$. Clearly this range is only
dependent on p, is symmetric around p = .5, and $\max\sqrt{3p(1 - p)} = \sqrt{3/4} \approx .87$. The
maximum is reached when p = .5.
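The closed form for the uniform case can be compared with a direct calculation. The sketch below is illustrative only: it replaces random draws with an evenly spaced grid on (0, 1), assigns Y = 1 to the largest 100p% of grid points, and computes the correlation. The function names `max_rho_uniform` and `grid_max_rho` and the grid size are assumptions made for this example.

```python
import math

def max_rho_uniform(p):
    """Closed-form maximum sqrt(3 p (1 - p)) for X ~ U(alpha, beta)."""
    return math.sqrt(3 * p * (1 - p))

def grid_max_rho(p, n=100_000):
    """Correlation when the top 100p% of an evenly spaced U(0, 1) grid has Y = 1."""
    z = [(i + 0.5) / n for i in range(n)]   # cell midpoints stand in for U(0, 1)
    k = round(n * p)                        # number of largest values with Y = 1
    y = [0] * (n - k) + [1] * k
    mz, py = sum(z) / n, k / n
    cov = sum((zi - mz) * (yi - py) for zi, yi in zip(z, y)) / n
    var_z = sum((zi - mz) ** 2 for zi in z) / n
    return cov / math.sqrt(var_z * py * (1 - py))

for p in (0.2, 0.5, 0.9):
    print(p, round(max_rho_uniform(p), 4), round(grid_max_rho(p), 4))
```

A grid is used instead of simulation so that the comparison is deterministic; the two columns should agree closely for each p.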
3.2. X follows Student’s t-distribution, a symmetric, fat-tailed distribution
Suppose $Z \sim t(\nu)$, which is a Student’s t-distribution with $\nu$ degrees of freedom. Here we
only consider $\nu > 2$, where a finite variance, $\nu/(\nu - 2)$, exists. Kim (2008) showed that the
first moment of such a t-distribution truncated with a lower bound of a and an upper
bound of b is
$$G_\nu\left[A(\nu)^{-(\nu - 1)/2} - B(\nu)^{-(\nu - 1)/2}\right],$$
for $\nu > 1$, where
$$G_\nu = \frac{\Gamma\left(\frac{\nu - 1}{2}\right)\nu^{\nu/2}}{2\,[F_\nu(b) - F_\nu(a)]\,\Gamma\left(\frac{\nu}{2}\right)\Gamma\left(\frac{1}{2}\right)},$$
$$A(\nu) = \nu + a^2,$$
$$B(\nu) = \nu + b^2,$$
and $F_\nu(\cdot)$ is the c.d.f. of the standard Student’s t-distribution with $\nu$ degrees of freedom,
and $\Gamma(\cdot)$ represents the gamma function.
It is also straightforward to get
$$\rho_{pb}(Z, Y) = \frac{E[ZY]}{\sqrt{\frac{\nu}{\nu - 2}\,p(1 - p)}},$$
and
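$$\max\{E[ZY]\} = p\,E[Z \mid Z > \tau_{p\nu}].$$
For our purposes, we need the mean of a t-distribution left truncated at $\tau_{p\nu}$, where
$\tau_{p\nu} = F_\nu^{-1}(1 - p)$. So $a = \tau_{p\nu}$, $b = +\infty$, $A(\nu) = \nu + \tau_{p\nu}^2$, $B(\nu) = +\infty$, $F_\nu(b) = 1$, and
$F_\nu(a) = 1 - p$. The mean of the left-truncated t-distribution of interest is therefore
$$E[Z \mid Z > \tau_{p\nu}] = \frac{\Gamma\left(\frac{\nu - 1}{2}\right)\nu^{\nu/2}}{2p\,\Gamma\left(\frac{\nu}{2}\right)\Gamma\left(\frac{1}{2}\right)}\left(\nu + \tau_{p\nu}^2\right)^{-(\nu - 1)/2}.$$
Therefore for a given p and a given $\nu$, we have
$$\max\{\rho_{pb}(Z, Y)\} = \frac{\Gamma\left(\frac{\nu - 1}{2}\right)\nu^{\nu/2}\left(\nu + \tau_{p\nu}^2\right)^{-(\nu - 1)/2}}{2\sqrt{\frac{\nu}{\nu - 2}}\,\Gamma\left(\frac{\nu}{2}\right)\Gamma\left(\frac{1}{2}\right)\sqrt{p(1 - p)}}.$$
The same result holds for $X \sim t(\mu, \sigma^2; \nu)$, the generalized t-distribution. For a given $\nu$, the
range of $\rho_{pb}(X, Y)$ again depends only on p, and is
$$\left[-\frac{\Gamma\left(\frac{\nu - 1}{2}\right)\nu^{\nu/2}\left(\nu + \tau_{p\nu}^2\right)^{-(\nu - 1)/2}}{2\sqrt{\frac{\nu}{\nu - 2}}\,\Gamma\left(\frac{\nu}{2}\right)\Gamma\left(\frac{1}{2}\right)\sqrt{p(1 - p)}},\ \frac{\Gamma\left(\frac{\nu - 1}{2}\right)\nu^{\nu/2}\left(\nu + \tau_{p\nu}^2\right)^{-(\nu - 1)/2}}{2\sqrt{\frac{\nu}{\nu - 2}}\,\Gamma\left(\frac{\nu}{2}\right)\Gamma\left(\frac{1}{2}\right)\sqrt{p(1 - p)}}\right],$$
which is symmetric around p = .5.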
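To make the t-distribution bound concrete, the sketch below evaluates the closed form in log space with the Python standard library. Because the standard library has no t quantile function, the threshold is obtained here by numerically integrating the density (Simpson's rule) and inverting the c.d.f. by bisection; this numerical route, the helper names (`t_pdf`, `t_cdf`, `t_quantile`, `max_rho_t`), the step counts, and the search bracket are all assumptions of this illustration rather than part of the derivation.

```python
import math

def t_pdf(z, nu):
    """Density of the standard Student's t-distribution with nu degrees of freedom."""
    log_c = (math.lgamma((nu + 1) / 2) - math.lgamma(nu / 2)
             - 0.5 * math.log(nu * math.pi))
    return math.exp(log_c - (nu + 1) / 2 * math.log1p(z * z / nu))

def t_cdf(z, nu, n=2000):
    """c.d.f. via composite Simpson's rule on [0, |z|], using symmetry about 0."""
    if z == 0:
        return 0.5
    h = abs(z) / n                          # n must be even for Simpson's rule
    total = t_pdf(0.0, nu) + t_pdf(abs(z), nu)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * t_pdf(i * h, nu)
    return 0.5 + math.copysign(total * h / 3, z)

def t_quantile(q, nu, lo=-100.0, hi=100.0):
    """Invert the c.d.f. by bisection (assumes the quantile lies in [lo, hi])."""
    for _ in range(60):
        mid = (lo + hi) / 2
        if t_cdf(mid, nu) < q:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def max_rho_t(p, nu):
    """Closed-form maximal rho_pb for X ~ t(nu), nu > 2, evaluated in log space."""
    tau = t_quantile(1 - p, nu)
    log_num = (math.lgamma((nu - 1) / 2) + nu / 2 * math.log(nu)
               - (nu - 1) / 2 * math.log(nu + tau * tau))
    log_den = (math.log(2) + 0.5 * math.log(nu / (nu - 2)) + math.lgamma(nu / 2)
               + math.lgamma(0.5) + 0.5 * math.log(p * (1 - p)))
    return math.exp(log_num - log_den)

for nu in (3, 10):
    print(nu, [round(max_rho_t(p, nu), 3) for p in (0.1, 0.3, 0.5)])
```

Working with `math.lgamma` avoids overflow in $\Gamma(\cdot)$ and $\nu^{\nu/2}$ for large $\nu$. At p = .5 the threshold is 0, and for $\nu = 3$ the expression reduces to $2/\pi$.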
3.3. X follows the exponential distribution, a skewed distribution
Suppose $X \in \mathbb{R}^+$ (where $\mathbb{R}^+$ is the set of positive real numbers) follows the exponential
distribution with rate parameter $\lambda$, or $X \sim \mathcal{E}(\lambda)$, $\lambda > 0$. X has mean $1/\lambda$ and variance $1/\lambda^2$.
We get
$$\rho_{pb}(X, Y) = \frac{E[(X - \mu_X)(Y - p)]}{\sigma_X \sqrt{p(1 - p)}} = \frac{\lambda E[XY] - p}{\sqrt{p(1 - p)}}.$$
For a given p, the maximum of $\rho_{pb}(X, Y)$ is reached when $E[XY]$ is maximized, or
when the top 100p% of the exponential distribution happens to have Y = 1. Then
$$\max\{E[XY]\} = p\,E[X \mid X > \tau_{p\mathcal{E}}],$$
where $\tau_{p\mathcal{E}} = F_{\mathcal{E}}^{-1}(1 - p)$, and $F_{\mathcal{E}}(\cdot)$ is the c.d.f. of the exponential distribution:
$$\tau_{p\mathcal{E}} = -\frac{\ln(p)}{\lambda}.$$
For an exponential distribution left truncated at a, the mean is
[Figure 1: plot of max $\rho_{pb}$ (vertical axis, 0.0 to 1.0) against p (horizontal axis, 0 to 1), with curves for N(μ, σ²), U(α, β), Exp(λ), t(10), and t(3).]
Figure 1. The maximum point-biserial correlation as a function of p when a continuous
variable follows a normal, uniform, Student’s t (degrees of freedom 3 or 10), or exponential
distribution.
$$\frac{1}{1 - F_{\mathcal{E}}(a)}\left(\frac{1}{\lambda} + a\right)e^{-a\lambda}.$$
When $a = \tau_{p\mathcal{E}} = -\ln(p)/\lambda$, the mean is $(1 - \ln(p))/\lambda$. Therefore
$$\max\{E[XY]\} = \frac{p\,(1 - \ln(p))}{\lambda}$$
and
$$\max\{\rho_{pb}(X, Y)\} = -\ln(p)\sqrt{\frac{p}{1 - p}}.$$
This function is not symmetric around p = .5. It is maximized when $\ln(p) = 2(p - 1)$.
The maximum is achieved when $p \approx .2$, and the maximum is around .8, which can be
obtained numerically using the Newton–Raphson procedure.
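The Newton–Raphson step mentioned above can be sketched as follows. This is one possible implementation, not necessarily the one used by the authors: it solves $g(p) = \ln(p) - 2(p - 1) = 0$ with $g'(p) = 1/p - 2$. The function names, the starting value of .2, and the tolerance are assumptions; note that p = 1 is also a root of g, so the starting value should be chosen away from it.

```python
import math

def max_rho_exponential(p):
    """Maximal rho_pb when X is exponential: -ln(p) * sqrt(p / (1 - p))."""
    return -math.log(p) * math.sqrt(p / (1 - p))

def argmax_p(p0=0.2, tol=1e-12, max_iter=50):
    """Newton-Raphson on g(p) = ln(p) - 2 (p - 1), with g'(p) = 1/p - 2."""
    p = p0
    for _ in range(max_iter):
        step = (math.log(p) - 2 * (p - 1)) / (1 / p - 2)
        p -= step
        if abs(step) < tol:
            break
    return p

p_star = argmax_p()
print(round(p_star, 4), round(max_rho_exponential(p_star), 4))
```

The root found from this starting value is close to .2, and the corresponding maximum is close to .8, in line with the values reported in the text. The result does not depend on the rate parameter $\lambda$, which cancels in the expression for the maximum.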
3.4. X follows a mixture of two normal distributions
Suppose X follows a mixture distribution of two normal distributions with equal
variance,
$$X \mid (Y = j) \sim N(\mu_j, \sigma^2),$$
where j = 0, 1. Denote the standardized difference of means by $\vartheta = (\mu_1 - \mu_0)/\sigma$. In this
case, following Tate (1954),
$$\rho_{pb}(X, Y) = \vartheta\sqrt{\frac{p(1 - p)}{1 + p(1 - p)\vartheta^2}}.$$
It is straightforward to see that $|\rho_{pb}(X, Y)|$ increases with $\vartheta^2$. When
$\vartheta^2 \to \infty$, $|\rho_{pb}(X, Y)| \to 1$. The range of $\rho_{pb}(X, Y)$ is therefore [−1, 1] for any p. It is worth
noting that the result can be generalized to any mixture distribution for X. When the
overlap between the two mixture distributions goes to 0 (e.g., when $\vartheta^2 \to \infty$), $|\rho_{pb}(X, Y)|$
approaches 1. As a reviewer pointed out, this becomes obvious when one thinks of the
scatterplot of (X, Y) when X | Y = 0 and X | Y = 1 have nearly no overlap and both have
small variances.
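A short numerical illustration of this limit, assuming only the formula above, is given below. The function name `rho_mixture` and the chosen values of p and $\vartheta$ are hypothetical.

```python
import math

def rho_mixture(p, theta):
    """rho_pb for an equal-variance two-normal mixture with standardized mean gap theta."""
    return theta * math.sqrt(p * (1 - p) / (1 + p * (1 - p) * theta ** 2))

# |rho_pb| grows towards 1 as the standardized mean difference grows, for any p.
for p in (0.1, 0.5):
    print(p, [round(rho_mixture(p, theta), 4) for theta in (0.5, 2, 10, 100)])
```

The sign of the result follows the sign of $\vartheta$, so large negative mean differences push the correlation towards −1 in the same way.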
4. Conclusion and discussion
This paper derives the range of $\rho_{pb}(X, Y)$ under several non-normal distributions:
uniform, Student’s t, exponential, and a mixture of two normal distributions. The maximal
$\rho_{pb}(X, Y)$ under these distributions is plotted in Figure 1 (except for the mixture normal,
for which the maximal $\rho_{pb}(X, Y)$ is always 1) against that under the normal distribution.
For the t-distribution, we plotted two curves, one for $\nu = 3$, and the other for $\nu = 10$. As
the degrees of freedom increase, the maximal $\rho_{pb}(X, Y)$ of a t-distribution approaches that
of the normal distribution.
Altogether the results in this paper show that the maximum of the population
point-biserial correlation, in spite of popular belief, may not be a function of p (as in the case of
the mixture of two normal distributions), can be symmetric or non-symmetric around
p = .5, and may still range from −1.0 to 1.0 (as in the case of the mixture of two normal
distributions). Therefore researchers should exercise caution when they interpret their
sample point-biserial correlation coefficients based on the popular belief that
point-biserial correlation always has a restricted range, and that the restriction is always more
severe when p deviates from .5.
It is also transparent from this paper that analytical derivation of maximal point-biserial
correlation relies on the moments of truncated continuous distributions. It may be
straightforward to derive the maximal point-biserial correlations under other non-normal
distributions, given the moments of those truncated distributions. While the maximum of
$\rho_{pb}(X, Y)$ is 1 when X follows a mixture distribution, an interesting question to pursue in a
future study is what the maximal $\rho_{pb}(X, Y)$ is when X is unimodal, and whether it can be
higher than $\sqrt{3/4}$.
References
Allen, M. J., & Yen, W. M. (1979). Introduction to measurement theory. Monterey, CA: Brooks/
Cole.
Bonett, D. G., & Wright, T. A. (2000). Sample size requirements for estimating Pearson, Kendall and
Spearman correlations. Psychometrika, 65, 23–28. doi:10.1007/BF02294183
Crocker, L., & Algina, J. (2006). Introduction to classical and modern test theory. Mason, OH:
Cengage Learning.
Curran, P. J., West, S. G., & Finch, J. F. (1996). The robustness of test statistics to nonnormality
and specification error in confirmatory factor analysis. Psychological Methods, 1, 16–29.
doi:10.1037/1082-989X.1.1.16
Fowler, R. L. (1987). Power and robustness in product-moment correlation. Applied Psychological
Measurement, 11, 419–428. doi:10.1177/014662168701100407
Gradstein, M. (1986). Maximal correlation between normal and dichotomous variables. Journal of
Educational Statistics, 11, 259–261. doi:10.3102/10769986011004259
Jennrich, B., & Satorra, A. (2014). The nonsingularity of Gamma in covariance structure analysis of
nonnormal data. Psychometrika, 79, 51–59. doi:10.1007/s11336-013-9353-1
Kim, H.-J. (2008). Moments of truncated Student-t distribution. Journal of the Korean Statistical
Society, 37, 81–87. doi:10.1016/j.jkss.2007.06.001
Lord, F. M., & Novick, M. R. (1968). Statistical theories of mental test scores. Reading, MA:
Addison-Wesley.
Sen, S., Cohen, A. S., & Kim, S.-H. (2016). The impact of non-normality on extraction of spurious
latent classes in mixture IRT models. Applied Psychological Measurement, Advance online
publication. doi:10.1177/0146621615605080
Sheng, Y., & Sheng, Z. (2012). Is coefficient alpha robust to non-normal data? Frontiers in
Psychology, 3, 34. doi:10.3389/fpsyg.2012.00034
Tate, R. F. (1954). Correlation between a discrete and a continuous variable: Point-biserial
correlation. Annals of Mathematical Statistics, 25, 603–607. doi:10.1214/aoms/1177728730
Tong, X., Zhang, Z., & Yuan, K.-H. (2014). Evaluation of test statistics for robust structural
equation modeling with nonnormal missing data. Structural Equation Modeling, 21, 553–565.
doi:10.1080/10705511.2014.919820
Yanagihara, H., Tonda, T., & Matsumoto, C. (2005). The effects of nonnormality on asymptotic
distributions of some likelihood ratio criteria for testing covariance structures under normal
assumption. Journal of Multivariate Analysis, 96, 237–264. doi:10.1016/j.jmva.2004.10.014
Yuan, K.-H. (2009). Normal distribution based pseudo ML for missing data: With applications to
mean and covariance structure analysis. Journal of Multivariate Analysis, 100, 1900–1918.
doi:10.1016/j.jmva.2009.05.001
Yuan, K.-H., & Bentler, P. M. (2000). Inferences on correlation coefficients in some classes of
nonnormal distributions. Journal of Multivariate Analysis, 72, 230–248.
doi:10.1016/j.jmva.1999.1858
Yuan, K.-H., & Bentler, P. M. (2002). On robustness of the normal-theory based asymptotic
distributions of three reliability coefficient estimates. Psychometrika, 67, 251–259.
doi:10.1007/BF02294845
Yuan, K.-H., & Bentler, P. M. (2010). Consistency of normal distribution based pseudo maximum
likelihood estimates when data are missing at random. American Statistician, 64, 263–267.
doi:10.1198/tast.2010.09203
Yuan, K.-H., Bentler, P. M., & Zhang, W. (2005). The effect of skewness and kurtosis on mean and
covariance structure analysis: The univariate case and its multivariate implication. Sociological
Methods & Research, 34, 240–258. doi:10.1177/0049124105280200
Yuan, K.-H., Marshall, L. L., & Bentler, P. M. (2002). A unified approach to exploratory factor analysis
with missing data, nonnormal data, and in the presence of outliers. Psychometrika, 67, 95–122.
doi:10.1007/BF02294711
Zu, J., & Yuan, K.-H. (2012). Standard error of linear observed-score equating for the NEAT design
with nonnormally distributed data. Journal of Educational Measurement, 49, 190–213.
doi:10.1111/j.1745-3984.2012.00168.x
Received 23 January 2016; revised version received 17 May 2016