
Student, 1908.

T-studen - Probable Error of a Correlation Coefficient

Downloaded from http://biomet.oxfordjournals.org/ at National Dong Hwa University Library on March 30, 2014


PROBABLE ERROR OF A CORRELATION
COEFFICIENT.
BY STUDENT.

At the discussion of Mr R. H. Hooker's recent paper "The correlation of the
weather and crops" (Journ. Royal Stat. Soc. 1907) Dr Shaw made an enquiry
as to the significance of correlation coefficients derived from small numbers
of cases.
His question was answered by Messrs Yule and Hooker and Professor Edgeworth,
all of whom considered that Mr Hooker was probably safe in taking .50 as his
limit of significance for a sample of 21. They did not, however, answer Dr Shaw's
question in any more general way. Now Mr Hooker is not the only statistician
who is forced to work with very small samples, and until Dr Shaw's question has
been properly answered the results of such investigations lack the criterion which
would enable us to make full use of them. The present paper, which is an account
of some sampling experiments, has two objects: (1) to throw some light by empirical
methods on the problem itself, (2) to endeavour to interest mathematicians who
have both time and ability to solve it.
Before proceeding further, it may be as well to state the problem which occurs
in practice, for it is often confused with other allied questions.
A random sample has been obtained from an indefinitely large* population
and r† calculated between two variable characters of the individuals composing the
sample. We require the probability that R for the population from which the sample
is drawn shall lie between any given limits.
It is clear that in order to solve this problem we must know two things: (1) the
distribution of values of r derived from samples of a population which has a given
R, and (2) the a priori probability that R for the population lies between any given
limits. Now (2) can hardly ever be known, so that some arbitrary assumption
must in general be made; when we know (1) it will be time enough to discuss
what will be the best assumption to make, but meanwhile I may suggest two
more or less obvious distributions. The first is that any value is equally likely
between +1 and -1, and the second that the probability that x is the value is
proportional to $1 - x^2$: this I think is more in accordance with ordinary
experience: the distribution of a priori probability would then be expressed by the
equation $y = \frac{3}{4}(1 - x^2)$.

* Note that the indefinitely large population need not actually exist. In Mr Hooker's case his
sample was 21 years of farming under modern conditions in England, and included all the years about
which information was obtainable. Probably it could not actually have been made much larger
without loss of homogeneity, due to the mixing with farming under conditions not modern; but one
can imagine the population indefinitely increased and the 21 years to be a sample from this.
† Throughout the rest of this paper "r" is written for the correlation coefficient of a sample and R
for the correlation coefficient of a population.

But whatever assumption be made, it will be necessary to know (1), so that
the solution really turns on the distribution of r for samples drawn from the same
population. Now this has been determined for large samples with as much accuracy
as is required, for Pearson and Filon (Phil. Trans. Vol. 191 A, p. 229 et seq.) showed
that the standard deviation is $\frac{1 - r^2}{\sqrt{n}}$ and of course for large samples
the distribution is sure to be practically normal unless r is very close to unity. But their
method involves approximations which are not legitimate when the sample is small.
Besides this the distribution is not then normal, so that even if we had the standard
deviation a great deal would still remain unknown.
In order to throw some light on this question I took a correlation table*
containing 3000 cases of stature and length of left middle finger of criminals,
and proceeded to draw samples of four from this population†. This gave me
750 values of r for a population whose real correlation was .66. By taking the
statures of one sample with the middle finger lengths of the next sample I was
enabled to get 750 values of r for a population whose real correlation was zero.
Next I combined each of the samples of four with the tenth sample before it and
with the tenth sample after it, thus obtaining two sets of 750‡ values from samples
of 8, with real correlation .66 and zero.
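The sampling experiment just described can be imitated on a computer. What follows is only a minimal sketch and not Student's procedure: it assumes a bivariate normal population in place of the table of 3000 criminals, draws independent samples rather than reusing cards, and the function names, the seed and the use of Python's standard `random` module are all illustrative choices.

```python
import math
import random


def sample_r(n, rho, rng):
    """Draw n pairs from an assumed bivariate normal population with
    correlation rho and return the sample correlation coefficient r."""
    xs, ys = [], []
    for _ in range(n):
        u = rng.gauss(0.0, 1.0)
        v = rng.gauss(0.0, 1.0)
        xs.append(u)
        # y is built from u and an independent v so that corr(x, y) = rho
        ys.append(rho * u + math.sqrt(1.0 - rho * rho) * v)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / math.sqrt(sxx * syy)


def experiment(n, rho, trials=750, seed=1908):
    """Return (mean, standard deviation) of r over `trials` samples of size n."""
    rng = random.Random(seed)
    rs = [sample_r(n, rho, rng) for _ in range(trials)]
    mean = sum(rs) / trials
    sd = math.sqrt(sum((r - mean) ** 2 for r in rs) / trials)
    return mean, sd


if __name__ == "__main__":
    for n in (4, 8, 30):
        for rho in (0.0, 0.66):
            mean, sd = experiment(n, rho)
            print(f"n={n:2d}  R={rho:.2f}  mean r={mean:.3f}  S.D.={sd:.3f}")
```

Because the population here is continuous, the grouping effects discussed in the paper (the lump at zero, the indeterminate cases) do not arise in this sketch.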
Besides this empirical work it is possible to calculate a priori the distribution
for samples of two as follows.
For clearly the only values possible are +1 and -1, since two points must
always lie on the regression line which joins them§.
Next consider the correlation between the difference between the values of one
character in two successive individuals, and the difference between the values of
the other character in the same individuals. It is well known to be the same as
that between the values themselves, if the individuals be in random order.

* Biometrika, Vol. I. p. 219. W. R. Macdonell.
† Biometrika, Vol. VI. p. 13. Student.
‡ Not strictly independent, but practically sufficiently nearly so. This method was adopted in order
to save arithmetic.
§ There are of course indeterminate cases when the values are the same for one character, but they
become rarer as we decrease the unit of grouping until with an infinitesimal unit of grouping the
statement in the text is true.
Also, if an indefinitely large number of such differences be taken, it is clear
that the means of the distributions will have the value zero. Hence, if the
correlation be determined from a fourfold division through zero we can apply
Mr Sheppard's* result that if A and B be the numbers in the large and the
small divisions of the table respectively $\cos\frac{\pi B}{A + B} = R$, where R is the
correlation of the original system.
But if a pair of individuals whose difference falls in either of the small divisions
be considered to be a random sample of 2, their r will be found to be -1, while
that of a pair whose difference falls in one of the large divisions is +1. Hence
the distribution of r for samples of 2 is AN at +1, and BN at -1, where A + B = 1,
and $B = \frac{\cos^{-1} R}{\pi}$.

When R = 0, there is of course even division, half the values being +1, and half
-1; when R = .66, $B = \frac{\cos^{-1} .66}{\pi} = .271$, therefore A = .729, and the mean is at
.729 - .271 = .458. The S.D. = $\sqrt{1 - (.458)^2} = .889$. It is noteworthy that the mean
value is considerably less than R.
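The arithmetic for samples of 2 is easily checked. The sketch below takes the relation $B = \cos^{-1} R / \pi$ from the text and works out A, the mean and the standard deviation; the function name is an illustrative assumption.

```python
import math


def two_sample_distribution(R):
    """Return (A, B, mean, sd) of r for samples of 2 from a population
    with correlation R, using B = arccos(R) / pi as in the text."""
    B = math.acos(R) / math.pi         # proportion of pairs giving r = -1
    A = 1.0 - B                        # proportion of pairs giving r = +1
    mean = A - B                       # identical to (2 / pi) * asin(R)
    sd = math.sqrt(1.0 - mean ** 2)    # r is always +1 or -1, so E[r^2] = 1
    return A, B, mean, sd


if __name__ == "__main__":
    A, B, mean, sd = two_sample_distribution(0.66)
    print(f"A={A:.3f}  B={B:.3f}  mean={mean:.3f}  S.D.={sd:.3f}")
```

Carrying more decimals than the text gives a mean of about .459 rather than .458, the difference arising only from the rounding of A and B to three places.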
I have dealt with the cases of samples of 2 at some length, because it is possible
that this limiting value of the distribution with its mean of $\frac{2}{\pi}\sin^{-1} R$ and its second
moment coefficient of $1 - \left(\frac{2}{\pi}\sin^{-1} R\right)^2$ may furnish a clue to the distribution when
n is greater than 2.
Besides these series, I have another shorter one of 100 values of r from samples
of 30, when the real value is .66. The distributions of the various trials are given
in the table.
Several peculiarities will be noticed which are due to the effects of grouping,
particularly in the samples of 4. Firstly, there is a lump at zero; with such small
numbers zero is not an uncommon value of the product moment and then, whatever
the values of the standard deviations, r = 0.
Next there are five indeterminate cases in each of the distributions for samples
of 4. These are due to the whole sample falling in the same group for one variable.
In such a case, both the Standard Deviation and the product moment vanish and
r is indeterminate.
Lastly, with such small samples one cannot use Sheppard's corrections for the
Standard Deviations, as r often becomes greater than unity. So I did not use
the corrections except in the case of the samples of 30, yet on the whole the values
of the Standard Deviations are no doubt too large. This does not much affect the
values of r in the neighbourhood of zero, but there is a tendency for larger values
to come too low, so that there is a deficiency of cases towards 1 and -1. This
introduces an error into the Standard Deviation of all the series to some extent,
but of course the mean is unaltered when there is no correlation. The series for
samples of 4 are affected more than those from samples of 8, as the mean Standard
Deviation of samples of 4 is the smaller, so that the unit of grouping is
comparatively larger.

* Phil. Trans. A, Vol. CXCII. p. 141.

Distribution of Correlation Coefficients.

[The scale of r heading the columns, from -1 to +1, is not recoverable in this copy; the frequencies are given row by row as extracted and are incomplete.]

No correlation, samples of 4*:
8 17 16 23 13 83 19 17 23 184264 16 22 22 19 194 834 15 30 18 14 23 22 224 24 134 17 23 24 244 174 174 9 88* 134 9

No correlation, samples of 8:
1 - - 1 7 10 10 16 14 16 18 24 18 87 36 33 43 45 264 264 34 424 274 34 234 364 334 884 19 24 82 15 134 9 34 7 6 3 - - -

Correlation .66, samples of 4†:
3 6 9 8 7 3 1 4 10 6 5 3 6 16 4 7 9 11 6 14 11 204 18 30 234 27 334 46*1 394 41 804 64 91 69

Correlation .66, samples of 8:
- - - - - - - - - 2 - 2 - 3 - 3 6 4 3 4 6 7 7 6 11 17 80 17 22 34 46* 444 61 69 664 68* 80* 60 66 334 4

Distribution with samples of 30 (scale not recoverable):
1 - - 1 - 1 - 1 6 4 6 9 ? 10 15 12 7 10 5 1

* There are five indeterminate cases so that the total is 745, while there are 750 in the other two distributions.
† The moment coefficients of this distribution were actually calculated from a different grouping as below (scale not recoverable):
6 4 2 2 6 4 3 4 7 9 7 6 1 8 6 7 7 6 4 18 8 5 11 8 12 124 7* 17 21 24 20 35 31 36 60 34 58J 76 77 981
The moment coefficients of the five distributions were determined, and the
following values found*:

                           Mean     S.D.     μ₂        μ₃          μ₄         β₁       β₂
Samples of 4  (r = 0)       -      .5512    .3038      -         .1768       -       1.918
Samples of 8  (r = 0)       -      .3731    .1392      -         .0454       -       2.336
Samples of 4  (r = .66)    .5609   .4680    .2190    -.1570      .2152      2.240    4.489
Samples of 8  (r = .66)    .6139   .2684    .07202   -.02634     .02714     1.857    5.232
Samples of 30 (r = .66)    .661    .1001    .01003   -.000882    .000461    .7713    4.580

Considering first the "no correlation" distributions I attempted to fit a Pearson
curve to the first of them. As might be expected, the range proved limited and
as symmetry had been assumed in calculating the moments, a Type II curve
resulted, of the form $y = y_0\left(1 - \frac{x^2}{a^2}\right)^{m}$, the range of which, $2a$, is 2.074.
Now the real range is clearly 2, and only a very small alteration in $\beta_2$ is
required to make the value of the index zero. Consequently the equation
$y = y_0(1 - x^2)^0$ was suggested. This means an even distribution of r between 1
and -1, with S.D. = .5774 ± .010 vice .5512 actual, $\mu_2$ = .3333 ± .0116 vice .3038,
$\mu_4$ = .2000 ± .016 vice .1768 and $\beta_2$ = 1.800 ± .12 vice 1.918, all values as close as
could perhaps be expected considering that the grouping must make both
$\mu_2$ and $\mu_4$ too low.
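The fitting of a symmetrical Type II curve from the second moment and $\beta_2$ may be sketched as follows. The relations used, $m = \frac{5\beta_2 - 9}{2(3 - \beta_2)}$ and $a^2 = \frac{2\mu_2\beta_2}{3 - \beta_2}$ for the curve $y = y_0(1 - x^2/a^2)^m$, are not written out in the paper; they are assumed here as the usual moment relations for that curve, and the function name is illustrative.

```python
import math


def fit_type_ii(mu2, beta2):
    """Fit y = y0 * (1 - x^2 / a^2)^m by moments.

    Assumes the standard relations for a symmetrical Pearson Type II curve:
        m   = (5*beta2 - 9) / (2*(3 - beta2))
        a^2 = 2*mu2*beta2 / (3 - beta2)
    Returns (m, range), where range = 2a.
    """
    if beta2 >= 3.0:
        raise ValueError("a limited-range symmetrical curve needs beta2 < 3")
    m = (5.0 * beta2 - 9.0) / (2.0 * (3.0 - beta2))
    a_squared = 2.0 * mu2 * beta2 / (3.0 - beta2)
    return m, 2.0 * math.sqrt(a_squared)


if __name__ == "__main__":
    # Observed moments for the two "no correlation" series, from the table.
    for label, mu2, beta2 in (("samples of 4", 0.3038, 1.918),
                              ("samples of 8", 0.1392, 2.336)):
        m, rng = fit_type_ii(mu2, beta2)
        print(f"{label}: index m = {m:.3f}, range = {rng:.3f}")
```

With the moments of an even distribution on (-1, +1), namely $\mu_2 = 1/3$ and $\beta_2 = 1.8$, the same relations give index zero and range 2, which is the "very small alteration" referred to in the text.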
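Working from $y = y_0(1 - x^2)^0$ for samples of 4 I guessed the formula
$y = y_0(1 - x^2)^{\frac{n-4}{2}}$ and proceeded to calculate the moments.
By using the transformation $x = \sin\theta$ we get $y = y_0\cos^{n-4}\theta$,
$dx = \cos\theta\,d\theta$,

$$2\int_0^1 y\,dx = 2y_0\int_0^{\pi/2}\cos^{n-3}\theta\,d\theta,$$

$$2\int_0^1 x^2 y\,dx = 2y_0\int_0^{\pi/2}\cos^{n-3}\theta\,d\theta - 2y_0\int_0^{\pi/2}\cos^{n-1}\theta\,d\theta,$$

and so on.
Whence
$$\mu_2 = \frac{1}{n-1},\qquad \mu_4 = \frac{3}{(n-1)(n+1)},\qquad \beta_2 = 3 - \frac{6}{n+1}.$$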

* In the cases of no correlation the moments were taken about zero, the known centroid of the
distribution.
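Putting n = 8 we get the equation $y = y_0(1 - x^2)^2$ and
$\mu_2 = \frac{1}{7} = .1429 \pm .0050$ instead of actual .1392,
$\mu_4 = \frac{1}{21} = .0476 \pm .0038$ instead of actual .0454,
$\sigma = .3780 \pm .0066$ instead of actual .3731,
$\beta_2 = 3 - \frac{6}{9} = 2.333 \pm .012$ instead of actual 2.336.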
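The moments of the guessed curve follow from the integrals of powers of the cosine, and can be checked numerically. The sketch below uses the reduction $I_k = \frac{k-1}{k} I_{k-2}$ for $I_k = \int_0^{\pi/2}\cos^k\theta\,d\theta$; the function names are illustrative assumptions, and the routine is meant only for n of 3 or more.

```python
import math


def cos_power_integral(k):
    """Integral of cos^k(theta) from 0 to pi/2, by the reduction
    I_k = (k - 1) / k * I_{k-2}, with I_0 = pi/2 and I_1 = 1."""
    if k < 0:
        raise ValueError("k must be non-negative")
    value = math.pi / 2 if k % 2 == 0 else 1.0
    start = 2 if k % 2 == 0 else 3
    for j in range(start, k + 1, 2):
        value *= (j - 1) / j
    return value


def moments(n):
    """Return (mu2, mu4, beta2) for y = y0 * (1 - x^2)^((n - 4) / 2).

    With x = sin(theta), y dx is proportional to cos^(n-3)(theta), and
    x^2 = 1 - cos^2(theta), so every moment is a sum of cosine integrals.
    """
    i0 = cos_power_integral(n - 3)
    i2 = cos_power_integral(n - 1)
    i4 = cos_power_integral(n + 1)
    mu2 = (i0 - i2) / i0
    mu4 = (i0 - 2.0 * i2 + i4) / i0
    return mu2, mu4, mu4 / mu2 ** 2


if __name__ == "__main__":
    for n in (4, 8, 21):
        mu2, mu4, beta2 = moments(n)
        print(f"n={n:2d}  mu2={mu2:.4f}  mu4={mu4:.4f}  "
              f"sigma={math.sqrt(mu2):.4f}  beta2={beta2:.3f}")
```

For n = 8 this gives .1429, .0476, .3780 and 2.333, agreeing with the values quoted above.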
The equation calculated from the actual moments is of the form
$y = y_0\left(1 - \frac{x^2}{a^2}\right)^{m}$, whence
the calculated range is 1.98, whereas it is known to be 2.
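The following tables compare the actual distributions with those calculated from
the equations.

Distribution of r from samples of 4 compared with the equation $y = \frac{745}{2}(1 - x^2)^0$.

Compartment   Limits of r          Actual   Calculated   Difference
 1            -1.000 to -.825       64        65           -1
 2            -.825 to -.675        45½       56           -10½
 3            -.675 to -.525        55½       56           -½
 4            -.525 to -.375        67        56           +11
 5            -.375 to -.225        59        56           +3
 6            -.225 to -.075        62        56           +6
 7            -.075 to +.075        63        56           +7
 8            +.075 to +.225        58        56           +2
 9            +.225 to +.375        60        56           +4
10            +.375 to +.525        64        56           +8
11            +.525 to +.675        51½       56           -4½
12            +.675 to +.825        41½       56           -14½
13            +.825 to +1.000       54        65           -11

From this we get $\chi^2 = 13.30$, P = .34. It will however be noticed that the grouping
has caused all the middle compartments to contain more than the calculated, as
pointed out above.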
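The value of $\chi^2$ for samples of 4 can be checked as below. This is a sketch under stated assumptions: the thirteen actual frequencies are taken as 64, 45½, 55½, 67, 59, 62, 63, 58, 60, 64, 51½, 41½, 54 (three of them inferred from the calculated and difference rows of the table), the calculated frequencies as 65 in the end compartments and 56 elsewhere, and P is computed with 12 degrees of freedom, one less than the number of compartments. The function names are illustrative.

```python
import math

# Frequencies as read from the comparison table for samples of 4
# (entries 1, 2 and 12 are inferred from calculated + difference).
actual = [64, 45.5, 55.5, 67, 59, 62, 63, 58, 60, 64, 51.5, 41.5, 54]
calculated = [65] + [56] * 11 + [65]


def chi_square(observed, expected):
    """Sum of (observed - expected)^2 / expected over the compartments."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


def chi_square_tail_even_df(x, df):
    """Upper tail probability of chi-square for an even number of degrees
    of freedom: exp(-x/2) * sum_{k < df/2} (x/2)^k / k!."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form needs a positive even df")
    half = x / 2.0
    term, total = 1.0, 1.0
    for k in range(1, df // 2):
        term *= half / k
        total += term
    return math.exp(-half) * total


if __name__ == "__main__":
    x2 = chi_square(actual, calculated)
    p = chi_square_tail_even_df(x2, df=len(actual) - 1)
    print(f"chi-square = {x2:.2f}, P = {p:.2f}")
```

The small difference between the P found this way and the .34 of the text is of the order to be expected from reading P out of a printed table.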
Distribution of r from samples of 8 compared with the equation
$y = \frac{750 \times 15}{16}(1 - x^2)^2$.

Compartment   Limits of r          Actual   Calculated   Difference
 1            -1.000 to -.825        2         4½          -2½
 2            -.825 to -.675        27        20½          +6½
 3            -.675 to -.525        44        43           +1
 4            -.525 to -.375        60        67           -7
 5            -.375 to -.225        96        87           +9
 6            -.225 to -.075       114½      100½          +14
 7            -.075 to +.075       103       105           -2
 8            +.075 to +.225        85       100½          -15½
 9            +.225 to +.375        98½       87           +11½
10            +.375 to +.525        65        67           -2
11            +.525 to +.675        37½       43           -5½
12            +.675 to +.825        14½       20½          -6
13            +.825 to +1.000        3         4½          -1½

whence $\chi^2 = 13.94$, P = .30.
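The calculated frequencies for samples of 8 follow from integrating the equation over each compartment. The sketch below assumes thirteen compartments with limits at ±.075, ±.225, ..., ±.825 and ±1; these limits are inferred from the calculated frequencies rather than read from the table heading, and the function names are illustrative.

```python
def antiderivative(x):
    """Integral of (1 - x^2)^2 = 1 - 2x^2 + x^4, taken from 0 to x."""
    return x - 2.0 * x ** 3 / 3.0 + x ** 5 / 5.0


def compartment_limits():
    """Assumed limits: -1, -.825, -.675, ..., +.825, +1 (13 compartments)."""
    inner = [-0.825 + 0.15 * i for i in range(12)]
    return [-1.0] + inner + [1.0]


def expected_frequencies(total=750):
    """Frequency in each compartment under y = total * (15/16) * (1 - x^2)^2.

    The factor 15/16 makes the whole area from -1 to +1 equal to `total`,
    since the integral of (1 - x^2)^2 over that range is 16/15.
    """
    limits = compartment_limits()
    scale = total * 15.0 / 16.0
    return [scale * (antiderivative(b) - antiderivative(a))
            for a, b in zip(limits[:-1], limits[1:])]


if __name__ == "__main__":
    for number, freq in enumerate(expected_frequencies(), start=1):
        print(f"compartment {number:2d}: {freq:6.2f}")
```

Rounded to the nearest half these agree with the calculated column of the table.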


In this case the grouping has had less influence and the largest contributions
to $\chi^2$ (in the second, sixth, eighth, and twelfth compartments) are due to
differences of opposite sign on opposite sides, and may therefore be supposed to be
entirely due to random sampling.
My equation then fits the two series of empirical results about as well as could
be expected. I will now show that it is in accordance with the two theoretical
cases n "large" and n = 2, for $\sigma = \frac{1}{\sqrt{n-1}}$ which approximates sufficiently closely to
Pearson and Filon's $\frac{1 - r^2}{\sqrt{n}}$ when r = 0 and n is large. Also when n is large $\beta_2$
becomes 3 and the distribution is normal.
And if n = 2, the equation becomes $y = y_0(1 - x^2)^{-1}$ where
$$y_0 = \frac{N}{2\int_0^1 (1 - x^2)^{-1}\,dx}.$$
Put $x = \sin\theta$. Then $dx = \cos\theta\,d\theta$,
$$y_0 = \frac{N}{2\int_0^{\pi/2}\sec\theta\,d\theta} = \frac{N}{\infty} = 0,$$
i.e. there is no frequency except where $(1 - x^2)^{-1}$ is infinite, all the frequency is
equally divided between x = 1 and x = -1 which we know to be actually the case.
Consequently I believe that the equation $y = y_0(1 - x^2)^{\frac{n-4}{2}}$ probably represents the
theoretical distribution of r when samples of n are drawn from a normally
distributed population with no correlation. Even if it does not do so, I am sure that it
will give a close approximation to it.
Let us consider Mr Hooker's limit of .50 in the light of this equation. For
21 cases the equation becomes $y = y_0(1 - x^2)^{\frac{17}{2}}$ and the proportion of the area lying
beyond x = ±.50 will be
$$\frac{\int_{\sin^{-1} .50}^{\pi/2}\cos^{18}\theta\,d\theta}{\int_0^{\pi/2}\cos^{18}\theta\,d\theta}.$$
I find this to be .02099, or we may expect to find one case in 50 occurring
outside the limits ±.50 when there is no correlation and the sample numbers 21.
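The proportion of area beyond ±.50 can be found by direct numerical integration. The sketch below uses Simpson's rule on the two cosine integrals; the rule, the number of steps and the function names are illustrative choices and not the method by which the figure in the text was obtained.

```python
import math


def simpson(f, a, b, steps=2000):
    """Composite Simpson's rule on [a, b]; `steps` must be even."""
    if steps % 2 != 0:
        raise ValueError("steps must be even")
    h = (b - a) / steps
    total = f(a) + f(b)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * f(a + i * h)
    return total * h / 3.0


def tail_proportion(n, limit):
    """Proportion of the area of y = y0 * (1 - x^2)^((n - 4) / 2) lying
    beyond x = +limit and x = -limit (the curve being symmetrical).

    With x = sin(theta), y dx is proportional to cos^(n-3)(theta) d(theta).
    """
    k = n - 3

    def integrand(theta):
        return math.cos(theta) ** k

    outer = simpson(integrand, math.asin(limit), math.pi / 2)
    whole = simpson(integrand, 0.0, math.pi / 2)
    return outer / whole


if __name__ == "__main__":
    p = tail_proportion(21, 0.50)
    print(f"proportion beyond +/-.50 for n = 21: {p:.4f}")
```

The same routine gives the corresponding proportion for any other limit or sample size, so that a limit of significance may be chosen for a sample of given size.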

* If a Pearson curve be fitted to the distribution whose moment coefficients are $\mu_2 = 1 = \mu_4$ and
$\mu_3 = 0$ we have $\beta_2 = 1$, $\beta_1 = 0$, hence the curve must be of Type II, and the equation is given by
$y = y_0(1 - x^2)^{-1}$, agreeing with the general formula.

When however there is correlation, I cannot suggest an equation which will
accord with the facts, but as I have spent a good deal of time over the problem
I will point out some of the necessities of the case.
(1) With small samples the value certainly lies nearer to zero than the real
value of R, e.g.
samples of 2: mean at $\frac{2}{\pi}\sin^{-1} R$,
samples of 4 (real value .66) .561* ± .011,
samples of 8 (real value .66) .614† ± .0065.
But with samples of 30 (real value .66) mean at .6609 ± .0067 shows that the mean
value approaches the real value comparatively rapidly.
(2) The standard deviation is larger than accords with the formula $\frac{1 - r^2}{\sqrt{n-1}}$
even if we give r the mean value of r for samples of the size taken, e.g. for
samples of 2, $\sqrt{1 - \left(\frac{2}{\pi}\sin^{-1} R\right)^2}$ is greater than $1 - \left(\frac{2}{\pi}\sin^{-1} R\right)^2$.
For samples of 4, calculated‡ .3957 ± .0069; actual .4680,
for samples of 8, calculated .2355 ± .0041; actual .2684.
But samples of 30, calculated .1046 ± .0018, actual .1001, again show that with
samples as large as 30 the ordinary formula is justified.
(3) When there was no correlation the range found by fitting a Pearson curve
to the distribution was accurately 2 in the theoretical case of samples of two, and
well within the probable error for empirical distributions of samples of 4 and 8.
But when we have correlation this process does not give the range closely for the
empirical distribution (samples of 4 give 2.137, samples of 8 2.699, samples of
30 infinity) and the range calculated from samples of 2, which is ...
(where $\mu_2 = 1 - \left(\frac{2}{\pi}\sin^{-1} R\right)^2$), is always less than 2 except in the case where $\mu_2$ is 1,
i.e. when there is no correlation.
Hence the distribution probably cannot be represented by any of Prof. Pearson's
types of frequency curve unless R = 0.
(4) The distribution is skew with a tail towards zero.
* The value must be slightly larger than this (perhaps even by .05) as Sheppard's corrections were
not used.
† Again higher, but not by more than .03.
‡ $\frac{1 - r^2}{\sqrt{n-1}}$ where r is taken as the mean value for the size of the sample. If we took the real
value R, the difference would be even greater.
(5) To sum up: If $y = \phi(x, R, n)$ be the equation, it must satisfy the following
requirements. If R = 1, 1 is the only value of x which gives the value of y other
than zero. If n = 2, ±1 are the only values of x to do so. If R = 0 the equation
probably reduces to $y = y_0(1 - x^2)^{\frac{n-4}{2}}$.
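As an arithmetical check on the calculated standard deviations quoted under (2), the sketch below evaluates $\frac{1 - r^2}{\sqrt{n-1}}$ at the mean values of r given under (1). That the denominator is $\sqrt{n-1}$ is an inference from the quoted figures, which it reproduces; the function name and the layout of the data are illustrative.

```python
import math


def formula_sd(mean_r, n):
    """Standard deviation of r by the formula (1 - r^2) / sqrt(n - 1),
    with r put equal to the mean value found for samples of size n.
    The denominator sqrt(n - 1) is assumed; it reproduces the quoted values."""
    return (1.0 - mean_r ** 2) / math.sqrt(n - 1)


# size of sample: (mean r from (1), S.D. actually found), real value .66
observed = {4: (0.561, 0.4680), 8: (0.614, 0.2684), 30: (0.6609, 0.1001)}

if __name__ == "__main__":
    for n, (mean_r, actual_sd) in observed.items():
        print(f"samples of {n:2d}: calculated {formula_sd(mean_r, n):.4f}, "
              f"actual {actual_sd:.4f}")
```

The calculated values fall short of the actual ones for samples of 4 and 8, and come into agreement only for samples of 30, as stated in the text.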

Conclusions.
It has been shown that when there is no correlation between two normally
distributed variables $y = y_0(1 - x^2)^{\frac{n-4}{2}}$ gives fairly closely the distribution of r found
from samples of n.
Next, the general problem has been stated and three distributions of r have
been given which show the sort of variation which occurs. I hope they may serve
as illustrations for the successful solver of the problem.
