InfluentialPoints.com
Biology, images, analysis, design...
"It has long been an axiom of mine that the little things are infinitely the most important" (Sherlock Holmes)

 

 

Pearson's correlation coefficient

Worked example 1

Our first worked example uses data from Gereda et al. (2006), who related allergen sensitization to house dust endotoxin. We looked at some of the analysis of this work in units 1 and 8. Data are presented below:

[Figure: mcorr2.gif]

House dust endotoxin in relation to allergen sensitization

No.   Log concn house    Log % interferon-γ
      dust endotoxin     CD4 cells
 1    2.242              0.246
 2    2.770              0.117
 3    2.636              0.223
 4    2.725              0.293
 5    2.661              0.316
 6    2.483              0.375
 7    2.554              0.410
 8    2.839              0.375
 9    2.915              0.422
10    3.068              0.328
11    3.068              0.352
12    3.004              0.480
13    2.686              0.516
14    3.042              0.598
15    2.763              0.609
16    2.719              0.656
17    3.386              0.792
18    3.271              1.430

Check of assumptions

  1. The data formed part of an observational study, so it is difficult to assess whether the observations can be regarded as independent. For example, were any of the children living in the same household? We will, however, accept them as being independent in the absence of any evidence to the contrary.
  2. Both X and Y are measurement variables.
  3. The plot of log Y on log X suggests a weak linear relationship, although there was one (possible) outlier that had a much higher level of sensitization than would be expected.
  4. The q-q plots indicated that the distribution of log X was approximately normal, but that of log Y was less so - mainly because of the influence of the outlier.
  5. There are insufficient data to assess whether the variance of each variable was constant across the range of the other.
We will accept that assumptions for a parametric test are (more or less) met, although we should run the analysis with and without the outlier (influential point?) to check on how robust the result is.
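The robustness check suggested above can be sketched in R. This is only a minimal sketch: the data are transcribed from the table for this example, and we assume that the possible outlier is observation 18 (the point with by far the largest Y value), which the text does not state explicitly.

```r
# Data transcribed from the table for worked example 1
x <- c(2.242, 2.770, 2.636, 2.725, 2.661, 2.483, 2.554, 2.839, 2.915,
       3.068, 3.068, 3.004, 2.686, 3.042, 2.763, 2.719, 3.386, 3.271)  # log endotoxin concn
y <- c(0.246, 0.117, 0.223, 0.293, 0.316, 0.375, 0.410, 0.375, 0.422,
       0.328, 0.352, 0.480, 0.516, 0.598, 0.609, 0.656, 0.792, 1.430)  # log % IFN-gamma CD4 cells

# Assumption: the possible outlier is the observation with the largest y
outlier <- which.max(y)

# Pearson correlation with and without that observation
r_all     <- cor(x, y)
r_trimmed <- cor(x[-outlier], y[-outlier])

cat("r, all observations:      ", round(r_all, 3), "\n")
cat("r, without observation", outlier, ":", round(r_trimmed, 3), "\n")

# Optional q-q plot of y, drawn only in an interactive session
if (interactive()) {
  qqnorm(y)
  qqline(y)
}
```

If the two coefficients differ appreciably, the conclusion depends heavily on that single point and should be reported with that caveat.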

Calculation of Pearson correlation coefficient

\[
r = \frac{24.93460 - (50.832 \times 8.538)/18}{\sqrt{[144.9090 - 2583.892/18]\,[5.4962 - 72.8974/18]}} = 0.587
\]
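The same arithmetic can be reproduced in R from the raw data. The sketch below builds the corrected sums of squares and products by hand and then checks the result against R's built-in cor(); the variable names (sxy, sxx, syy) are our own labels, not taken from the original analysis.

```r
# Data transcribed from the table for worked example 1
x <- c(2.242, 2.770, 2.636, 2.725, 2.661, 2.483, 2.554, 2.839, 2.915,
       3.068, 3.068, 3.004, 2.686, 3.042, 2.763, 2.719, 3.386, 3.271)
y <- c(0.246, 0.117, 0.223, 0.293, 0.316, 0.375, 0.410, 0.375, 0.422,
       0.328, 0.352, 0.480, 0.516, 0.598, 0.609, 0.656, 0.792, 1.430)
n <- length(x)

# Corrected sum of products and corrected sums of squares
sxy <- sum(x * y) - sum(x) * sum(y) / n
sxx <- sum(x^2) - sum(x)^2 / n
syy <- sum(y^2) - sum(y)^2 / n

# Pearson correlation coefficient
r <- sxy / sqrt(sxx * syy)

# The hand calculation should agree with the built-in function
stopifnot(isTRUE(all.equal(r, cor(x, y))))
print(round(r, 3))
```

The printed value should agree with the coefficient of 0.587 obtained above.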

Test of significance

\[
t = \frac{0.587\sqrt{16}}{\sqrt{1 - 0.587^{2}}} = 2.90
\]

This gives a two-tailed P-value of 0.01044. We conclude there is a significant linear correlation between log % interferon-γ CD4 cells and log concentration house dust endotoxin.
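A minimal R sketch of this test, starting from the rounded coefficient (r = 0.587) and sample size (n = 18) given above. Because it uses the rounded r, the P-value may differ from the quoted one in the last decimal place or two.

```r
r <- 0.587   # rounded correlation coefficient from the calculation above
n <- 18      # number of observations
df <- n - 2  # degrees of freedom for the t-test of r

# t-statistic for testing the null hypothesis of zero correlation
t_stat <- r * sqrt(df) / sqrt(1 - r^2)

# Two-tailed P-value from the t-distribution
p_two <- 2 * pt(-abs(t_stat), df)

cat("t =", round(t_stat, 2), " df =", df, " two-tailed P =", signif(p_two, 3), "\n")
```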

Confidence interval

The observed correlation coefficient is transformed to a Fisher's z-value:
\[
z = \frac{1}{2}\ln\left(\frac{1.587}{0.413}\right) = 0.6731
\]

The 95% confidence interval for this z-value is 0.67308 ± 1.96/√(n−3), which is 0.67308 ± 0.5061. This gives the (transformed) interval as 0.16698 to 1.17918, which detransformed is 0.1654 to 0.8272.
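The interval can be checked with a few lines of R. This sketch again starts from the rounded r = 0.587 and n = 18, and uses atanh() and tanh(), which are equivalent to Fisher's transformation and its inverse.

```r
r <- 0.587   # rounded correlation coefficient
n <- 18      # number of observations

# Fisher's z-transformation: atanh(r) equals 0.5 * log((1 + r) / (1 - r))
z <- atanh(r)

# Standard error of z and the 95% limits on the transformed scale
se_z <- 1 / sqrt(n - 3)
z_ci <- z + c(-1, 1) * qnorm(0.975) * se_z

# Back-transform the limits to the correlation scale
r_ci <- tanh(z_ci)

cat("z =", round(z, 4), "\n")
cat("95% CI for r:", round(r_ci[1], 4), "to", round(r_ci[2], 4), "\n")
```

Note that the back-transformed interval is not symmetrical about r, which is expected for a coefficient bounded at ±1.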

 

Worked example 2

Our second worked example uses data from Luiselli (2006) who tested hypotheses on the ecological patterns of rarity using snake communities worldwide. We first considered this work in . Data are presented below:

[Figure 2: mcorr2.gif]

Snake data

No. sp.   Rank    % rare   Rank
    8     11.0    12.5     19.0
   24     29.0    20.8     27.0
   17     22.0    11.8     17.5
   18     25.5    16.7     23.0
   16     19.0    18.7     25.0
   12     14.5     8.33    14.0
   15     18.0    20       26.0
   13     16.0    23.1     29.0
    5      4.5     0        6.0
    3      1.0     0        6.0
   12     14.5     0        6.0
   20     27.0    10       15.5
   28     31.0    25       32.0
    6      8.0     0        6.0
    6      8.0     0        6.0
   34     33.0    14.7     21.5
    6      8.0     0        6.0
   17     22.0    23.5     30.0
   68     35.0    14.7     21.5
    7     10.0    14.2     20.0
    4      2.0     0        6.0
   33     32.0     3       12.0
   23     28.0    17.4     24.0
    5      4.5     0        6.0
    9     12.0     0        6.0
   46     34.0    23.9     31.0
   14     17.0    21.4     28.0
   17     22.0     5.9     13.0
   18     25.5    38.9     35.0
   17     22.0    11.8     17.5
    5      4.5     0        6.0
   10     13.0    10       15.5
    5      4.5     0        6.0
   27     30.0    33.3     34.0
   17     22.0    29.4     33.0

Check of assumptions

  1. The issue of independence of observations was not considered by the author. This was an unfortunate oversight, since there would appear to be a real danger of spatial autocorrelation and pseudoreplication if some of the study areas are adjacent and/or overlapping. This would result in the number of degrees of freedom being overestimated, which would make the statistical tests too liberal. We cannot test for this because there is insufficient information given in the paper.
  2. Both X and Y are measurement variables.
  3. The plot of Y on X could indicate a weak linear relationship - but there is so much scatter that this is difficult to assess.
  4. The q-q plots indicated that neither the distribution of X nor Y was (remotely) normal; inspection indicates that this cannot be remedied by a transformation (many zeros in the percentage of rare species).
  5. The variance of Y appears to increase with X, and vice versa.
We conclude that a parametric test of the correlation coefficient is not appropriate. We could instead use a non-parametric correlation coefficient (as done by the author and by us in the More Information page on Nonparametric correlation and regression), or alternatively test the Pearson correlation coefficient using a randomization test. We will carry out a randomization test using R, and compare the result with that obtained using the standard parametric test.
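One way such a randomization test might be set up in R is sketched below. This is not necessarily the procedure used for the result reported here: the seed, the number of permutations (n_perm) and the way the two-tailed P-value is formed (the proportion of permuted |r| values at least as large as the observed one) are all our own assumptions. The data are transcribed from the snake table, with X as the number of species and Y as the percentage of rare species.

```r
# Data transcribed from the snake table
x <- c(8, 24, 17, 18, 16, 12, 15, 13, 5, 3, 12, 20, 28, 6, 6, 34, 6, 17,
       68, 7, 4, 33, 23, 5, 9, 46, 14, 17, 18, 17, 5, 10, 5, 27, 17)   # number of species
y <- c(12.5, 20.8, 11.8, 16.7, 18.7, 8.33, 20, 23.1, 0, 0, 0, 10, 25, 0,
       0, 14.7, 0, 23.5, 14.7, 14.2, 0, 3, 17.4, 0, 0, 23.9, 21.4, 5.9,
       38.9, 11.8, 0, 10, 0, 33.3, 29.4)                               # % rare species

set.seed(1)       # arbitrary seed, used only to make the sketch reproducible
n_perm <- 9999    # arbitrary number of random permutations (an assumption)

# Observed Pearson correlation coefficient
r_obs <- cor(x, y)

# Null distribution: shuffle y relative to x and recompute r each time
r_perm <- replicate(n_perm, cor(x, sample(y)))

# Two-tailed P-value, counting the observed arrangement as one permutation
p_perm <- (sum(abs(r_perm) >= abs(r_obs)) + 1) / (n_perm + 1)

cat("observed r =", round(r_obs, 5), " randomization P =", signif(p_perm, 3), "\n")
```

Shuffling Y breaks any association with X while leaving both marginal distributions unchanged, so the test does not rely on normality. It does still assume that the observations are exchangeable, so it offers no protection against the lack of independence noted in point 1.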

Our observed correlation coefficient is obtained as before:

Calculation of Pearson correlation coefficient

\[
r = \frac{9396.36 - (250982.5/35)}{\sqrt{[15757 - 342225/35]\,[9372.219 - 184066.7/35]}} = 0.44875
\]

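Using R, we got:
> # observed correlation coefficient
> (C=cov(x,y) / sd(x) / sd(y))
[1] 0.4487518

> # two-tailed P-value from the randomization test
> # (the resampling step that sets P is not shown here)
> 2*(.5-abs(P-.5))
[1] 0.01

> # conventional t-test of C
> df=length(x)-2
> P=pt(C*sqrt(df/(1-C^2)),df)
> 2*(.5-abs(P-.5))
[1] 0.006852728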
We conclude there is a significant linear relationship between the percentage of rare species and species richness - although we should emphasize that a linear relationship still seems improbable, and a non-parametric test implying only a monotonically increasing relationship would be greatly preferable.
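
For comparison, the non-parametric alternative mentioned above can be sketched in R using Spearman's rank correlation. This is a minimal sketch with the data transcribed from the snake table. Using cor.test() with exact = FALSE is our own choice, made because the many tied values rule out an exact P-value.

```r
# Data transcribed from the snake table
x <- c(8, 24, 17, 18, 16, 12, 15, 13, 5, 3, 12, 20, 28, 6, 6, 34, 6, 17,
       68, 7, 4, 33, 23, 5, 9, 46, 14, 17, 18, 17, 5, 10, 5, 27, 17)   # number of species
y <- c(12.5, 20.8, 11.8, 16.7, 18.7, 8.33, 20, 23.1, 0, 0, 0, 10, 25, 0,
       0, 14.7, 0, 23.5, 14.7, 14.2, 0, 3, 17.4, 0, 0, 23.9, 21.4, 5.9,
       38.9, 11.8, 0, 10, 0, 33.3, 29.4)                               # % rare species

# Ranks, with tied values given their average rank (as in the table)
rank_x <- rank(x)
rank_y <- rank(y)

# Spearman's coefficient is the Pearson correlation of the ranks
rho <- cor(rank_x, rank_y)

# Asymptotic test of rho; exact = FALSE because of the tied values
sp_test <- cor.test(x, y, method = "spearman", exact = FALSE)

cat("Spearman rho =", round(rho, 4), " P =", signif(sp_test$p.value, 3), "\n")
```

Because only the ranks are used, this test assesses a monotonic rather than a strictly linear relationship.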