 InfluentialPoints.com
"It has long been an axiom of mine that the little things are infinitely the most important" (Sherlock Holmes)

# Pearson's correlation coefficient

#### Worked example 1

Our first worked example uses data from Gereda et al. (2006), who related allergen sensitization to house dust endotoxin. We looked at some of the analysis of this work in units 1 and 8. The data are presented below:

House dust endotoxin in relation to allergen sensitization

| No. | Log concn house dust endotoxin | Log % interferon-γ CD4 cells |
|---|---|---|
| 1 | 2.242 | 0.246 |
| 2 | 2.770 | 0.117 |
| 3 | 2.636 | 0.223 |
| 4 | 2.725 | 0.293 |
| 5 | 2.661 | 0.316 |
| 6 | 2.483 | 0.375 |
| 7 | 2.554 | 0.410 |
| 8 | 2.839 | 0.375 |
| 9 | 2.915 | 0.422 |
| 10 | 3.068 | 0.328 |
| 11 | 3.068 | 0.352 |
| 12 | 3.004 | 0.480 |
| 13 | 2.686 | 0.516 |
| 14 | 3.042 | 0.598 |
| 15 | 2.763 | 0.609 |
| 16 | 2.719 | 0.656 |
| 17 | 3.386 | 0.792 |
| 18 | 3.271 | 1.430 |

#### Check of assumptions

1. The data formed part of an observational study, so it is difficult to assess whether the observations can be regarded as independent. For example, were any of the children living in the same household? We will, however, accept them as independent in the absence of any evidence to the contrary.
2. Both X and Y are measurement variables.
3. The plot of log Y on log X suggests a weak linear relationship, although there was one (possible) outlier that had a much higher level of sensitization than would be expected.
4. The q-q plots indicated that the distribution of log X was approximately normal, but that of log Y was less so, mainly because of the influence of the outlier.
5. There are insufficient data to assess whether the variance of each variable was constant across the range of the other.
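Checks 3 and 4 in the list above are graphical, and could be reproduced along the following lines. This is only a minimal sketch in R: it assumes the data are typed in from the table above, and the vector names `endotoxin` and `ifn` are ours rather than the original authors'.

```r
# Data transcribed from the table above (variable names are illustrative)
endotoxin <- c(2.242, 2.770, 2.636, 2.725, 2.661, 2.483, 2.554, 2.839, 2.915,
               3.068, 3.068, 3.004, 2.686, 3.042, 2.763, 2.719, 3.386, 3.271)
ifn <- c(0.246, 0.117, 0.223, 0.293, 0.316, 0.375, 0.410, 0.375, 0.422,
         0.328, 0.352, 0.480, 0.516, 0.598, 0.609, 0.656, 0.792, 1.430)

op <- par(mfrow = c(1, 3))   # three panels side by side

# Check 3: scatterplot to judge linearity and to spot possible outliers
plot(endotoxin, ifn,
     xlab = "Log concn house dust endotoxin",
     ylab = "Log % interferon-gamma CD4 cells")

# Check 4: normal q-q plot for each variable, with a reference line
qq_x <- qqnorm(endotoxin, main = "Q-Q plot: log endotoxin")
qqline(endotoxin)
qq_y <- qqnorm(ifn, main = "Q-Q plot: log % interferon-gamma")
qqline(ifn)

par(op)                      # restore the previous plotting settings
```

The point furthest from the reference line in the last panel is the observation with the largest Y value (no. 18 in the table).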

We will accept that assumptions for a parametric test are (more or less) met, although we should run the analysis with and without the outlier (influential point?) to check how robust the result is.
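One way to carry out the suggested robustness check is to compute the coefficient twice, once on all 18 observations and once with the suspect point left out. The sketch below is a suggestion of ours, not part of the original analysis; it assumes the outlier is simply the observation with the largest Y value, and uses R's built-in `cor()` and `cor.test()`.

```r
# Data transcribed from the table above (variable names are illustrative)
endotoxin <- c(2.242, 2.770, 2.636, 2.725, 2.661, 2.483, 2.554, 2.839, 2.915,
               3.068, 3.068, 3.004, 2.686, 3.042, 2.763, 2.719, 3.386, 3.271)
ifn <- c(0.246, 0.117, 0.223, 0.293, 0.316, 0.375, 0.410, 0.375, 0.422,
         0.328, 0.352, 0.480, 0.516, 0.598, 0.609, 0.656, 0.792, 1.430)

# Assumption: the possible outlier is the point with the largest Y value
outlier <- which.max(ifn)

# Pearson's r with all observations, then with the outlier removed
r_all  <- cor(endotoxin, ifn)
r_trim <- cor(endotoxin[-outlier], ifn[-outlier])

# Significance test for the reduced data set
test_trim <- cor.test(endotoxin[-outlier], ifn[-outlier])

c(all = r_all, without_outlier = r_trim)
test_trim$p.value
```

If the two coefficients and their P-values lead to the same conclusion, the result can be regarded as reasonably robust to that single point.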

#### Calculation of Pearson correlation coefficient

$$
r = \frac{24.93460 - (50.832 \times 8.538 / 18)}{\sqrt{[144.9090 - 2583.892/18]\,[5.4962 - 72.8974/18]}} = 0.587
$$
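The arithmetic above can be checked in R. The sketch below assumes the data are entered from the table (the names `x`, `y`, `sxy`, `sxx` and `syy` are ours) and compares the hand calculation from sums with the built-in `cor()` function.

```r
# x = log concn house dust endotoxin, y = log % interferon-gamma CD4 cells
x <- c(2.242, 2.770, 2.636, 2.725, 2.661, 2.483, 2.554, 2.839, 2.915,
       3.068, 3.068, 3.004, 2.686, 3.042, 2.763, 2.719, 3.386, 3.271)
y <- c(0.246, 0.117, 0.223, 0.293, 0.316, 0.375, 0.410, 0.375, 0.422,
       0.328, 0.352, 0.480, 0.516, 0.598, 0.609, 0.656, 0.792, 1.430)
n <- length(x)

# Corrected sum of products and corrected sums of squares
sxy <- sum(x * y) - sum(x) * sum(y) / n
sxx <- sum(x^2) - sum(x)^2 / n
syy <- sum(y^2) - sum(y)^2 / n

# Pearson's r from the sums, and from the built-in function
r_manual  <- sxy / sqrt(sxx * syy)
r_builtin <- cor(x, y)

round(c(manual = r_manual, builtin = r_builtin), 3)   # both 0.587
```

The intermediate sums (for example `sum(x)` = 50.832 and `sum(y)` = 8.538) match the figures in the formula, which is a useful check that the data were entered correctly.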

#### Test of significance

$$
t = \frac{0.587\sqrt{16}}{\sqrt{1 - 0.587^2}} = 2.90
$$

This gives a two-tailed P-value of 0.01044. We conclude there is a significant linear correlation between log % interferon-γ CD4 cells and log concentration house dust endotoxin.
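The test statistic and P-value can be reproduced as follows. This is a sketch under the same assumptions as before (data typed in from the table, variable names ours); `pt()` gives the cumulative t distribution, and `cor.test()` is shown only as a cross-check.

```r
x <- c(2.242, 2.770, 2.636, 2.725, 2.661, 2.483, 2.554, 2.839, 2.915,
       3.068, 3.068, 3.004, 2.686, 3.042, 2.763, 2.719, 3.386, 3.271)
y <- c(0.246, 0.117, 0.223, 0.293, 0.316, 0.375, 0.410, 0.375, 0.422,
       0.328, 0.352, 0.480, 0.516, 0.598, 0.609, 0.656, 0.792, 1.430)

r  <- cor(x, y)
df <- length(x) - 2                      # n - 2 = 16 degrees of freedom

# t statistic for H0: population correlation is zero
t_stat <- r * sqrt(df) / sqrt(1 - r^2)

# Two-tailed P-value from the t distribution
p_two <- 2 * pt(-abs(t_stat), df)

# Cross-check with the built-in test (Pearson is the default method)
ct <- cor.test(x, y)

round(t_stat, 2)    # 2.9
signif(p_two, 3)    # about 0.0104
```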

#### Confidence interval

The observed correlation coefficient is transformed to a Fisher's z-value:

$$
z = \frac{1}{2}\ln\frac{1.587}{0.413} = 0.6731
$$

The 95% confidence interval for this z-value is 0.67308 ± 1.96/√(n − 3), which is 0.67308 ± 0.5061. This gives the (transformed) interval as 0.16698 to 1.17918, which detransformed is 0.1654 to 0.8272.
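The transformation and back-transformation can be sketched in R using `atanh()` and `tanh()`, which are the Fisher z-transformation and its inverse. The sketch starts from the rounded values in the text (r = 0.587, n = 18); the names `ci_z` and `ci_r` are ours.

```r
r <- 0.587    # observed correlation coefficient (rounded, from the text)
n <- 18       # number of observations

# Fisher's z: atanh(r) is identical to 0.5 * log((1 + r) / (1 - r))
z <- atanh(r)

# Standard error of z and the 95% interval on the transformed scale
se_z <- 1 / sqrt(n - 3)
ci_z <- z + c(-1, 1) * qnorm(0.975) * se_z

# Back-transform the limits to the correlation scale
ci_r <- tanh(ci_z)

round(z, 4)       # 0.6731
round(ci_r, 3)    # 0.165 0.827
```

Because `qnorm(0.975)` is 1.959964 rather than 1.96, the limits differ from those in the text only in the fourth or fifth decimal place.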

#### Worked example 2

Our second worked example uses data from Luiselli (2006) who tested hypotheses on the ecological patterns of rarity using snake communities worldwide. We first considered this work in . Data are presented below:

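Snake data

| No. sp. | Rank | % rare | Rank |
|---|---|---|---|
| 8 | 11.0 | 12.5 | 19.0 |
| 24 | 29.0 | 20.8 | 27.0 |
| 17 | 22.0 | 11.8 | 17.5 |
| 18 | 25.5 | 16.7 | 23.0 |
| 16 | 19.0 | 18.7 | 25.0 |
| 12 | 14.5 | 8.33 | 14.0 |
| 15 | 18.0 | 20 | 26.0 |
| 13 | 16.0 | 23.1 | 29.0 |
| 5 | 4.5 | 0 | 6.0 |
| 3 | 1.0 | 0 | 6.0 |
| 12 | 14.5 | 0 | 6.0 |
| 20 | 27.0 | 10 | 15.5 |
| 28 | 31.0 | 25 | 32.0 |
| 6 | 8.0 | 0 | 6.0 |
| 6 | 8.0 | 0 | 6.0 |
| 34 | 33.0 | 14.7 | 21.5 |
| 6 | 8.0 | 0 | 6.0 |
| 17 | 22.0 | 23.5 | 30.0 |
| 68 | 35.0 | 14.7 | 21.5 |
| 7 | 10.0 | 14.2 | 20.0 |
| 4 | 2.0 | 0 | 6.0 |
| 33 | 32.0 | 3 | 12.0 |
| 23 | 28.0 | 17.4 | 24.0 |
| 5 | 4.5 | 0 | 6.0 |
| 9 | 12.0 | 0 | 6.0 |
| 46 | 34.0 | 23.9 | 31.0 |
| 14 | 17.0 | 21.4 | 28.0 |
| 17 | 22.0 | 5.9 | 13.0 |
| 18 | 25.5 | 38.9 | 35.0 |
| 17 | 22.0 | 11.8 | 17.5 |
| 5 | 4.5 | 0 | 6.0 |
| 10 | 13.0 | 10 | 15.5 |
| 5 | 4.5 | 0 | 6.0 |
| 27 | 30.0 | 33.3 | 34.0 |
| 17 | 22.0 | 29.4 | 33.0 |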

#### Check of assumptions

1. The issue of independence of observations was not considered by the author. This was an unfortunate oversight, since there would appear to be a real danger of spatial autocorrelation and pseudoreplication if some of the study areas are adjacent and/or overlapping. This would result in the number of degrees of freedom being overestimated, which would make the statistical tests too liberal. We cannot test for this because there is insufficient information given in the paper.
2. Both X and Y are measurement variables.
3. The plot of Y on X could indicate a weak linear relationship - but there is so much scatter that this is difficult to assess.
4. The q-q plots indicated that neither the distribution of X nor Y was (remotely) normal; inspection indicates that this cannot be remedied by a transformation (many zeros in X).
5. The variance of Y appears to increase with X, and vice versa.

We conclude that a parametric test of the correlation coefficient is not appropriate. We could instead use a non-parametric correlation coefficient (as done by the author and by us in the More Information page on Nonparametric correlation and regression), or alternatively test the Pearson correlation coefficient using a randomization test. We will carry out a randomization test using R, and compare the result with that obtained using the standard parametric test.

Our observed correlation coefficient is obtained as before:

#### Calculation of Pearson correlation coefficient

$$
r = \frac{9396.36 - (250982.5/35)}{\sqrt{[15757 - 342225/35]\,[9372.219 - 184066.7/35]}} = 0.44875
$$

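Using R we got:

    > # observed correlation coefficient
    > (C=cov(x,y) / sd(x) / sd(y))
    0.4487518
    > 2*(.5-abs(P-.5))
    0.01
    > # conventional t-test of C
    > df=length(x)-2
    > P=pt(C*sqrt(df/(1-C^2)),df)
    > 2*(.5-abs(P-.5))
    0.006852728

The first two-tailed P-value (0.01) is the randomization test result; the commands that generated its value of P are not shown in this listing. The second (0.006852728) is from the conventional t-test.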
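The listing above reports a randomization P-value but not the resampling step that produced it. The following is a minimal sketch of one common way to run such a test, not a reconstruction of the original commands. The number of permutations (`n_perm`), the seed, and the use of absolute values for a two-tailed test are all our assumptions, and the vector names `species` and `pct_rare` are ours.

```r
# Data transcribed from the snake table above (variable names are illustrative)
species  <- c(8, 24, 17, 18, 16, 12, 15, 13, 5, 3, 12, 20, 28, 6, 6, 34, 6, 17,
              68, 7, 4, 33, 23, 5, 9, 46, 14, 17, 18, 17, 5, 10, 5, 27, 17)
pct_rare <- c(12.5, 20.8, 11.8, 16.7, 18.7, 8.33, 20, 23.1, 0, 0, 0, 10, 25,
              0, 0, 14.7, 0, 23.5, 14.7, 14.2, 0, 3, 17.4, 0, 0, 23.9, 21.4,
              5.9, 38.9, 11.8, 0, 10, 0, 33.3, 29.4)

set.seed(1)       # arbitrary seed, only so the run is repeatable
n_perm <- 9999    # assumed number of random permutations

# Observed coefficient
r_obs <- cor(species, pct_rare)

# Null distribution: shuffle one variable to break any X-Y association
r_perm <- replicate(n_perm, cor(species, sample(pct_rare)))

# Two-tailed P: share of coefficients at least as extreme as the observed
# one, counting the observed value as one member of the distribution
p_rand <- (sum(abs(r_perm) >= abs(r_obs)) + 1) / (n_perm + 1)

round(r_obs, 5)   # 0.44875
p_rand
```

The exact P-value varies a little from run to run and with the number of permutations, so small differences from the value in the listing above are to be expected.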

We conclude there is a significant linear relationship between the percentage of rare species and species richness - although we should emphasize that a linear relationship still seems improbable, and a non-parametric test implying only a monotonically increasing relationship would be greatly preferable.
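For comparison, the non-parametric alternative mentioned above can be sketched as follows. Spearman's rank correlation is simply Pearson's coefficient computed on the mid-ranks given in the table. This block is our illustration rather than the author's analysis; `exact = FALSE` is set because the data contain many ties, for which an exact P-value cannot be computed.

```r
# Data transcribed from the snake table above (variable names are illustrative)
species  <- c(8, 24, 17, 18, 16, 12, 15, 13, 5, 3, 12, 20, 28, 6, 6, 34, 6, 17,
              68, 7, 4, 33, 23, 5, 9, 46, 14, 17, 18, 17, 5, 10, 5, 27, 17)
pct_rare <- c(12.5, 20.8, 11.8, 16.7, 18.7, 8.33, 20, 23.1, 0, 0, 0, 10, 25,
              0, 0, 14.7, 0, 23.5, 14.7, 14.2, 0, 3, 17.4, 0, 0, 23.9, 21.4,
              5.9, 38.9, 11.8, 0, 10, 0, 33.3, 29.4)

# Spearman's coefficient directly ...
rho <- cor(species, pct_rare, method = "spearman")

# ... and as Pearson's r on mid-ranks (ties receive the average rank)
rho_ranks <- cor(rank(species), rank(pct_rare))

# Approximate test of significance, allowing for ties
sp_test <- cor.test(species, pct_rare, method = "spearman", exact = FALSE)

c(rho = rho, rho_from_ranks = rho_ranks)
sp_test$p.value
```

The values returned by `rank()` should agree with the two Rank columns in the table, which gives a quick check on the data entry.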