Student, 1908.
R, and (2) the a priori probability that R for the population lies between any given
limits. Now (2) can hardly ever be known, so that some arbitrary assumption
must in general be made; when we know (1) it will be time enough to discuss
what will be the best assumption to make, but meanwhile I may suggest two
more or less obvious distributions. The first is that any value is equally likely
between +1 and −1, and the second that the probability that x is the value is
proportional to $1 - x^2$: this I think is more in accordance with ordinary experience:
the distribution of a priori probability would then be expressed by the
equation $y = \tfrac{3}{4}(1 - x^2)$.
Downloaded from http://biomet.oxfordjournals.org/ at National Dong Hwa University Library on March 30, 2014
But whatever assumption be made, it will be necessary to know (1), so that
the solution really turns on the distribution of r for samples drawn from the same
population. Now this has been determined for large samples with as much accuracy
as is required, for Pearson and Filon (Phil. Trans. Vol. 191 A, p. 229 et seq.) showed
that the standard deviation is $\dfrac{1 - r^2}{\sqrt{n}}$, and of course for large samples the distribution
is sure to be practically normal unless r is very close to unity. But their method
involves approximations which are not legitimate when the sample is small.
Besides this the distribution is not then normal, so that even if we had the standard
deviation a great deal would still remain unknown.
In order to throw some light on this question I took a correlation table*
containing 3000 cases of stature and length of left middle finger of criminals,
and proceeded to draw samples of four from this population†. This gave me
750 values of r for a population whose real correlation was .66. By taking the
statures of one sample with the middle finger lengths of the next sample I was
enabled to get 750 values of r for a population whose real correlation was zero.
Next I combined each of the samples of four with the tenth sample before it and
with the tenth sample after it, thus obtaining two sets of 750‡ values from samples
of 8, with real correlation .66 and zero.
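The sampling experiment can be imitated by simulation. The sketch below is not the procedure described above: it assumes a bivariate normal population with correlation .66 produced by a pseudo-random generator, in place of the 3000 grouped measurements, and the names `pearson_r` and `simulate_r` are hypothetical helpers introduced only for illustration.

```python
import math
import random

def pearson_r(xs, ys):
    """Product-moment correlation; None when a standard deviation vanishes."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    if sxx == 0 or syy == 0:
        return None  # r is indeterminate
    return sxy / math.sqrt(sxx * syy)

def simulate_r(rho, n, trials, seed=1908):
    """Draw `trials` samples of size n from an ASSUMED bivariate normal population."""
    rng = random.Random(seed)
    values = []
    for _ in range(trials):
        xs, ys = [], []
        for _ in range(n):
            u = rng.gauss(0, 1)
            v = rng.gauss(0, 1)
            xs.append(u)
            # y has correlation rho with x by construction
            ys.append(rho * u + math.sqrt(1 - rho * rho) * v)
        values.append(pearson_r(xs, ys))
    return values

rs4 = simulate_r(0.66, 4, 750)   # 750 samples of 4
rs8 = simulate_r(0.66, 8, 750)   # 750 samples of 8
mean4 = sum(rs4) / len(rs4)
mean8 = sum(rs8) / len(rs8)
print(round(mean4, 3), round(mean8, 3))
```

Because the simulated measurements are continuous and not grouped, the effects of grouping discussed later (the lump at zero and the indeterminate cases) would not be expected to appear in such a run.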
Besides this empirical work it is possible to calculate a priori the distribution
for samples of two as follows.
For clearly the only values possible are +1 and −1, since two points must
always lie on the regression line which joins them§.
Next consider the correlation between the difference between the values of one
character in two successive individuals, and the difference between the values of
the other character in the same individuals. It is well known to be the same as
that between the values themselves, if the individuals be in random order.
that of a pair whose difference falls in one of the large divisions is +1. Hence
the distribution of r for samples of 2 is AN at +1 and BN at −1, where A + B = 1
and $B = \dfrac{\cos^{-1} R}{\pi}$.
When R = 0, there is of course even division, half the values being +1 and half
−1; when R = .66, $B = \dfrac{\cos^{-1} .66}{\pi} = .271$, therefore A = .729, and the mean is at
.729 − .271 = .458. The S.D. = $\sqrt{1 - (.458)^2} = .889$. It is noteworthy that the mean
value is considerably less than R.
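The arithmetic for samples of 2 can be checked with a few lines of code. This is only a sketch: it assumes the relation B = cos⁻¹R / π, which is what the figures .271 and .729 imply, and the function name `two_sample_r` is hypothetical.

```python
import math

def two_sample_r(R):
    """Shares at +1 and -1, mean and S.D. of r for samples of 2 (assumed relation)."""
    B = math.acos(R) / math.pi   # share of samples giving r = -1
    A = 1.0 - B                  # share of samples giving r = +1
    mean = A - B                 # values are +1 and -1 only
    sd = math.sqrt(1.0 - mean ** 2)
    return A, B, mean, sd

A, B, mean, sd = two_sample_r(0.66)
print(round(A, 3), round(B, 3), round(mean, 3), round(sd, 3))
```

Working without intermediate rounding, the mean and S.D. may differ from .458 and .889 in the third decimal place, since the figures in the text are formed from the rounded values .729 and .271.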
I have dealt with the case of samples of 2 at some length, because it is possible
that this limiting value of the distribution, with its mean of $\dfrac{2}{\pi}\sin^{-1} R$ and its second
moment coefficient of $1 - \left(\dfrac{2}{\pi}\sin^{-1} R\right)^2$, may furnish a clue to the distribution when
n is greater than 2.
Besides these series, I have another shorter one of 100 values of r from samples
of 30, when the real value is .66. The distributions of the various trials are given
in the table.
Several peculiarities will be noticed which are due to the effects of grouping,
particularly in the samples of 4. Firstly, there is a lump at zero; with such small
numbers zero is not an uncommon value of the product moment and then, whatever
the values of the standard deviations, r = 0.
Next there are five indeterminate cases in each of the distributions for samples
of 4. These are due to the whole sample falling in the same group for one variable.
In such a case, both the Standard Deviation and the product moment vanish and
r is indeterminate.
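Both peculiarities are easy to reproduce on small grouped samples. The sketch below uses made-up group values chosen only for illustration; `grouped_r` is a hypothetical helper, not a routine from the paper.

```python
import math

def grouped_r(xs, ys):
    """Correlation of a small sample; None when either standard deviation is zero."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    if sxx == 0 or syy == 0:
        return None  # S.D. and product moment both vanish: r is indeterminate
    return sxy / math.sqrt(sxx * syy)

# Illustrative samples of 4 (invented group numbers, not data from the table)
same_group = grouped_r([3, 3, 3, 3], [1, 2, 2, 4])   # whole sample in one group for x
zero_moment = grouped_r([1, 2, 2, 3], [2, 1, 3, 2])  # product moment is exactly zero
print(same_group, zero_moment)
```

With coarse grouping and only four observations, exact ties of this kind are far more likely than they would be with continuous measurements.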
Lastly, with such small samples one cannot use Sheppard's corrections for the
Standard Deviations, as r often becomes greater than unity. So I did not use
the corrections except in the case of the samples of 30, yet on the whole the values
of the Standard Deviations are no doubt too large. This does not much affect the
values of r in the neighbourhood of zero, but there is a tendency for larger values
* Phil. Trans. A. Vol. cxcii. p. 141.
Distribution of Correlation Coefficients.
[Scale of r: class headings not recoverable.]
No correlation, samples of 4*: 1 8 17 16 23 13 83 19 17 23 184264 16 22 22 19 194 834 15 30 18 14 23 22 224 24 134 17 23 24 244 174 174 9 88* 134 9
  H 194 18*
No correlation, samples of 8: 1 - - 1 7 10 10 16 14 16 18 24 18 87 36 33 43 45 264 264 34 424 274 34 234 364 334 884 19 24 82 15 134 9 34 . 7 6 3 - - -
Correlation of .66, samples of 4†: 3 6 9 8 7 3 1 4 10 6 5 3 6 16 4 7 9 11 6 14 11 204 18 30 234 27 334 46*1 394 41 804 64 91 69
  3 * «i 2 S 8 34
Correlation of .66, samples of 8: - - - - - - - - - 2 - 2 - 3 - 3 6 4 3 4 6 7 7 6 11 17 80 17 22 34 46* 444 61 69 664 68* 80* 60 66 334 4
* There are five indeterminate cases so that the total is 745, while there are 750 in the other two distributions.
† The moment coefficients of this distribution were actually calculated from a different grouping as below:
[Scale of r for the regrouped distribution: class headings not recoverable.]
6 4 2 2 6 4 3 4 7 9 7 6 1 8 6 7 7 6 4 18 8 5 11 8 12 124 7* 17 21 24 20 35 31 36 60 34 58J 76 77 981
300 Probable Error of a Correlation Coefficient
to come too low, so that there is a deficiency of cases towards 1 and — 1. This
introduces an error into the Standard Deviation of all the series to some extent,
but of course the mean is unaltered when there is no correlation. The series for
samples of 4 are affected more than those from samples of 8, as the mean Standard
Deviation of samples of 4 is the smaller, so that the unit of grouping is compara-
tively larger.
The moment coefficients of the five distributions were determined, and the
following values found*:—
Mean ft Pt
* In the cases of no correlation the moments were taken about zero, the known centroid of the
distribution.
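Putting n = 8 we get the equation $y = y_0 (1 - x^2)^2$ and
$\mu_2 = \tfrac{1}{7} = .1429 \pm .0050$ instead of actual .1392,
$\mu_4 = \tfrac{1}{21} = .0476 \pm .0038$ instead of actual .0454,
$\sigma = .3780 \pm .0066$ instead of actual .3731,
$\beta_2 = 3 - \tfrac{2}{3} = 2.333 \pm .012$ instead of actual 2.336.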
The equation calculated from the actual moments is $y = y_0\left(1 - \dfrac{x^2}{\ldots}\right)^{\ldots}$, whence
the calculated range is 1.98, whereas it is known to be 2.
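The theoretical values quoted for n = 8 can be verified by integrating the curve numerically. The following is a minimal sketch using a midpoint rule; the function name `moments` and the number of steps are arbitrary choices made here, and the probable errors (the ± figures) are not computed.

```python
def moments(n, steps=200_000):
    """Midpoint-rule moments of the curve y proportional to (1 - x^2)^((n - 4) / 2)."""
    p = (n - 4) / 2
    h = 2.0 / steps
    m0 = m2 = m4 = 0.0
    for i in range(steps):
        x = -1.0 + (i + 0.5) * h       # midpoint of the i-th strip on [-1, 1]
        w = (1.0 - x * x) ** p
        m0 += w
        m2 += w * x * x
        m4 += w * x ** 4
    mu2 = m2 / m0
    mu4 = m4 / m0
    sigma = mu2 ** 0.5
    beta2 = mu4 / mu2 ** 2
    return mu2, mu4, sigma, beta2

mu2, mu4, sigma, beta2 = moments(8)
print(round(mu2, 4), round(mu4, 4), round(sigma, 4), round(beta2, 3))
```

The results should agree with 1/7, 1/21, .3780 and 2.333 to the number of figures given in the text.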
The following tables compare the actual distributions with those calculated from
the equations.
Distribution of r from samples of 4 compared with the equation
[Equation and table of actual and calculated frequencies not recoverable.]
From this we get $\chi^2 = 13.30$, $P = .34$. It will however be noticed that the grouping
has caused all the middle compartments to contain more than the calculated, as
pointed out above.
Distribution of r from samples of 8 compared with the equation
$$y = \frac{750 \times 15}{16}\,(1 - x^2)^2$$
[Scale of r: class headings not recoverable.]
Actual ... 2 27 44 60 96 103 85 984 65 374 HI 3
Calculated 804 43 67 87 1004 105 1004 87 67 43 20|
Pearson and Filon's $\dfrac{1}{\sqrt{n}}$ when r = 0 and n is large. Also when n is large $\beta_2$
becomes 3 and the distribution is normal.
And if n = 2, the equation becomes $y = y_0 (1 - x^2)^{-1}$, where
$$y_0 = \frac{N}{2\int_0^1 (1 - x^2)^{-1}\,dx}.$$
Put $x = \sin\theta$. Then $dx = \cos\theta\,d\theta$,
$$y_0 = \frac{N}{2\int_0^{\pi/2} \sec\theta\,d\theta} = \frac{N}{\infty} = 0,$$
i.e. there is no frequency except where $(1 - x^2)^{-1}$ is infinite; all the frequency is
equally divided between x = 1 and x = −1, which we know to be actually the case.
Consequently I believe that the equation $y = y_0 (1 - x^2)^{\frac{n-4}{2}}$ probably represents the
theoretical distribution of r when samples of n are drawn from a normally distributed
population with no correlation. Even if it does not do so, I am sure that it
will give a close approximation to it.
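For use in calculation the constant y₀ has to be fixed so that the whole area equals the number of samples N. The sketch below relies on the standard Beta-integral result that the area under (1 − x²)^p between −1 and +1 is √π Γ(p + 1) / Γ(p + 3/2); that closed form, and the function names `y0` and `density`, are not taken from the paper and are introduced here only as an illustration.

```python
import math

def y0(n, N=1.0):
    """Constant making y0 * (1 - x^2)^((n - 4) / 2) enclose total area N."""
    if n < 3:
        # For n = 2 the integral diverges and the constant is zero, as shown above.
        raise ValueError("the constant is finite and positive only for n >= 3")
    p = (n - 4) / 2
    return N * math.gamma(p + 1.5) / (math.sqrt(math.pi) * math.gamma(p + 1.0))

def density(x, n, N=1.0):
    """Ordinate of the proposed curve at x for samples of n, scaled to N samples."""
    return y0(n, N) * (1.0 - x * x) ** ((n - 4) / 2)

print(y0(8, 750))   # constant for 750 samples of 8
print(y0(4))        # samples of 4 give a flat curve on [-1, 1]
```

For N = 750 and n = 8 the constant agrees with the value 750 × 15/16 used in the comparison for samples of 8.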
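Let us consider Mr Hooker's limit of .50 in the light of this equation. For
21 cases the equation becomes $y = y_0 (1 - x^2)^{\frac{17}{2}}$ and the proportion of the area lying
beyond x = ±.50 will be
$$\frac{\int_{\sin^{-1} .50}^{\pi/2} \cos^{18}\theta\,d\theta}{\int_0^{\pi/2} \cos^{18}\theta\,d\theta}.$$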
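The proportion can be evaluated numerically. The sketch below assumes the substitution x = sin θ, under which the curve for samples of n becomes a power n − 3 of cos θ, and uses a simple midpoint rule; `tail_proportion` is a hypothetical name and the step count is an arbitrary choice.

```python
import math

def tail_proportion(n, limit, steps=100_000):
    """Share of the area of (1 - x^2)^((n - 4) / 2) lying beyond x = +/- limit."""
    k = n - 3  # power of cos(theta) after putting x = sin(theta)

    def integral(a, b):
        h = (b - a) / steps
        return h * sum(math.cos(a + (i + 0.5) * h) ** k for i in range(steps))

    # By symmetry the ratio of the one-sided integrals equals the two-sided share.
    return integral(math.asin(limit), math.pi / 2) / integral(0.0, math.pi / 2)

p = tail_proportion(21, 0.50)
print(round(p, 4))
```

On these assumptions the proportion should come out at about .02, that is, roughly one case in fifty falling outside ±.50 when there is no correlation and the sample numbers 21.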
* If a Pearson curve be fitted to the distribution whose moment coefficients are $\mu_2 = 1 = \mu_4$ and
$\mu_3 = 0$ we have $\beta_2 = 1$, $\beta_1 = 0$, hence the curve must be of Type II. and the equation is given by
samples of 8 (real value .66) .614† ± .0065.
But with samples of 30 (real value .66) mean at .6609 ± .0067 shows that the mean
value approaches the real value comparatively rapidly.
(2) The standard deviation is larger than accords with the formula $\dfrac{1 - r^2}{\sqrt{n}}$
even if we give r the mean value of r for samples of the size taken, e.g. for
samples of 2,
(where $\mu_2 = 1 - \left(\dfrac{2}{\pi}\sin^{-1} R\right)^2$) is always less than 2 except in the case where $\mu_2$ is 1,
i.e. when there is no correlation.
Hence the distribution probably cannot be represented by any of Prof. Pearson's
types of frequency curve unless R = 0.
(4) The distribution is skew with a tail towards zero.
* The value must be slightly larger than this (perhaps even by -OS) as Sheppard's corrections were
not used.
† Again higher, but not by more than .03.
‡ $\dfrac{1 - r^2}{\sqrt{n}}$, where r is taken as the mean value for the size of the sample. If we took the real
value R, the difference would be even greater.
(5) To sum up: If $y = \phi(x, R, n)$ be the equation, it must satisfy the following
requirements. If R = 1, 1 is the only value of x which gives a value of y other
than zero. If n = 2, ±1 are the only values of x to do so. If R = 0 the equation
probably reduces to $y = y_0 (1 - x^2)^{\frac{n-4}{2}}$.
Conclusions.
It has been shown that when there is no correlation between two normally
distributed variables $y = y_0 (1 - x^2)^{\frac{n-4}{2}}$ gives fairly closely the distribution of r found
from samples of n.
Next, the general problem has been stated and three distributions of r have
been given which show the sort of variation which occurs. I hope they may serve
as illustrations for the successful solver of the problem.