"It has long been an axiom of mine that the little things are infinitely the most important" (Sherlock Holmes)




The Pearson product-moment correlation coefficient (population parameter ρ, sample statistic r) is a measure of strength and direction of the linear association between two variables. In other words it assesses to what extent the two variables covary.

Although Pearson (1895) developed the mathematical formula that is still most commonly used today, the theory behind the coefficient was developed by Galton (1885), who published the first bivariate scatterplot. It has been suggested that the popular name for the index should therefore be the Galton-Pearson correlation coefficient. Today, the correlation coefficient, together with regression, constitutes the most frequently used statistical methodology in observational studies in many disciplines.

For correlation there is no distinction between Y and X in terms of which is an explanatory variable and which a response variable. The coefficient is obtained by dividing the sample covariance between the two variables by the product of their sample standard deviations:

Algebraically speaking -

\[
r \;=\; \frac{\sum (X - \bar{X})(Y - \bar{Y})}{\sqrt{\sum (X - \bar{X})^2 \, \sum (Y - \bar{Y})^2}}
  \;=\; \frac{\sum XY - (\sum X)(\sum Y)/n}{\sqrt{\sum X^2 - (\sum X)^2/n} \; \sqrt{\sum Y^2 - (\sum Y)^2/n}}
\]
  • r is the Pearson product moment correlation coefficient,
  • X and Y are the individual observations of the two variables,
  • X̄ and Ȳ are the arithmetic means of the two sets of observations,
  • n is the number of bivariate observations

This result is exactly equivalent to dividing the sum of the products of the standardized observations (confusingly called 'scores') of the two measures by the degrees of freedom (n − 1), as explained in the core text.
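The equivalence of the two routes can be checked numerically. The sketch below is illustrative only: the function names and the data are made up for this example, and it assumes the observations are standardized with the sample standard deviation (divisor n − 1).

```python
import math

def pearson_r(x, y):
    """Pearson r from sums of squared deviations and cross-products."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def pearson_r_from_scores(x, y):
    """Same coefficient: sum of products of standardized scores / (n - 1)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sample standard deviations (divisor n - 1).
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x) / (n - 1))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y) / (n - 1))
    products = (((a - mean_x) / sd_x) * ((b - mean_y) / sd_y)
                for a, b in zip(x, y))
    return sum(products) / (n - 1)

# Hypothetical data, for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
print(pearson_r(x, y), pearson_r_from_scores(x, y))
```

Both functions should print the same value, apart from floating-point rounding.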

{Fig. 1}
The value of the correlation coefficient can vary from +1 (perfect positive correlation) through 0 (no correlation) to -1 (perfect negative correlation). The absolute value of the coefficient is often taken as a measure of the closeness of the relationship between X and Y.

However, this approach has to be viewed with caution because the value of the coefficient depends not only on the strength of the relation between the two variables, but also on the sample size. Moreover, it is not a linear measure and has to exceed about 0.7 before the relationship is readily apparent.

A more appropriate measure is the coefficient of determination (r²), which quantifies the proportion of the variance of Y that can be accounted for by a linear regression of Y on X.

A significant correlation between two variables is never a sufficient condition to establish causality. The hypothesis of a causal relationship can only be supported if various criteria are met. These include the strength and consistency of the association, a demonstration that cause precedes effect, a dose response relationship, a plausible biological mechanism, and experimental evidence that one variable really does affect the other.


Testing the significance of the correlation coefficient

  1. Against the null hypothesis that ρ = 0

    The conventional parametric test of the correlation coefficient can be carried out by comparing the observed statistic with tabulated values of correlation coefficients (for example Table A11 in Snedecor & Cochran (1989)). Alternatively r can be studentized by dividing it by its standard error. Note this gives exactly the same formula as for the studentized regression slope.

    Algebraically speaking -

    \[
    t \;=\; \frac{b_{Y \cdot X}}{s_b} \;=\; \frac{r \sqrt{n - 2}}{\sqrt{1 - r^2}}
    \]
    • t is the estimated t-statistic; under the null hypothesis it is a random quantile of the t-distribution with (n - 2) degrees of freedom,
    • b_{Y·X} is the regression coefficient of Y on X,
    • s_b is the standard error of the regression coefficient,
    • r is the Pearson product moment correlation coefficient,
    • n is the number of bivariate observations.
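As a minimal sketch, the studentized coefficient can be computed as follows. The function name `correlation_t` is hypothetical; it returns only the statistic and its degrees of freedom, which would then be referred to a t table or a statistics package for the P-value.

```python
import math

def correlation_t(r, n):
    """Return (t, df) for testing rho = 0, given sample r and sample size n."""
    if n < 3:
        raise ValueError("need at least 3 bivariate observations")
    if not -1.0 < r < 1.0:
        raise ValueError("r must lie strictly between -1 and 1")
    t = r * math.sqrt(n - 2) / math.sqrt(1.0 - r * r)
    return t, n - 2

# Hypothetical example: r = 0.6 from n = 27 pairs gives t close to 3.75 on 25 df.
t_stat, df = correlation_t(0.6, 27)
print(t_stat, df)
```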

    The parametric test of the correlation coefficient is only valid if the assumption of bivariate normality is met. But even if the distributions are far from normal, the coefficient still characterizes the degree of dependence. It is just that you cannot apply (standard) significance tests to it.

    If you wish to use the correlation coefficient on non-normally distributed data, you should use a permutation (randomization) test to test significance. The available x and y-values are randomly paired, and for each set of pairings a correlation coefficient is calculated. The P-value is then the proportion of these randomized coefficients that are at least as extreme as the observed one. Note that Pearson's coefficient is not robust to outliers, and using the randomization test does not alter this fact. Nor is it appropriate for non-linear relationships. In such situations a non-parametric rank-based correlation coefficient may be more appropriate.
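The randomization procedure can be sketched as below. This is one possible implementation rather than a prescribed one: the function names, the two-sided definition of "extreme", the number of permutations, and the convention of counting the observed pairing among the permutations are all assumptions made for the example, as are the data.

```python
import math
import random

def pearson_r(x, y):
    """Pearson r from sums of squared deviations and cross-products."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def permutation_test(x, y, n_perm=9999, seed=0):
    """Two-sided randomization test of r; returns (observed r, P-value)."""
    rng = random.Random(seed)      # seeded so the result is reproducible
    observed = pearson_r(x, y)
    shuffled = list(y)             # copy, so the caller's data are untouched
    extreme = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)      # random re-pairing of the y-values with x
        # Small tolerance guards against floating-point ties.
        if abs(pearson_r(x, shuffled)) >= abs(observed) - 1e-12:
            extreme += 1
    # The observed pairing is counted as one of the possible arrangements.
    return observed, (extreme + 1) / (n_perm + 1)

# Hypothetical data, for illustration only.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [2.3, 2.9, 4.1, 4.8, 6.2, 6.1, 7.9, 8.4, 9.7, 10.3]
r_obs, p_value = permutation_test(x, y, n_perm=1999)
print(r_obs, p_value)
```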

  2. Against the null hypothesis that ρ is a value other than 0

    When ρ is not zero, the sampling distribution of r is skewed, so the above methods are inappropriate. Instead the Fisher z transformation should be used.

    Algebraically speaking -

    \[
    z \;=\; \frac{1}{2} \ln\!\left(\frac{1 + r}{1 - r}\right) \;=\; \operatorname{arctanh}(r)
    \]
    • z is distributed approximately normally with a standard error of 1/√(n − 3)
    • r is the Pearson product moment correlation coefficient,
    • n is the number of bivariate observations.

    The observed value of r and the value of ρ under the null hypothesis are both transformed into z-values. The difference between the two z-values is divided by their standard error (1/√(n − 3)) and tested as a standard normal deviate (Z).
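A minimal sketch of this test, using only the Python standard library, is given below. The function name `fisher_z_test` and the choice of a two-sided P-value are assumptions made for the example.

```python
import math
from statistics import NormalDist

def fisher_z_test(r, n, rho0):
    """Test H0: rho = rho0 via Fisher's z; returns (Z, two-sided P-value)."""
    if n < 4:
        raise ValueError("need at least 4 bivariate observations")
    z_obs = math.atanh(r)        # Fisher z of the observed coefficient
    z_null = math.atanh(rho0)    # Fisher z of the hypothesized value
    se = 1.0 / math.sqrt(n - 3)
    z_stat = (z_obs - z_null) / se
    p_value = 2.0 * (1.0 - NormalDist().cdf(abs(z_stat)))
    return z_stat, p_value

# Hypothetical example: is r = 0.8 from n = 28 pairs consistent with rho = 0.5?
z_stat, p_value = fisher_z_test(0.8, 28, 0.5)
print(z_stat, p_value)
```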

  3. Attaching a confidence interval to r

    Again the Fisher z transformation is used. The observed correlation coefficient is transformed, and the required confidence interval is attached to the transformed value using the normal approximation. The upper and lower limits are then back transformed using:

    Algebraically speaking -

    \[
    \text{Detransformed } r \;=\; \frac{\exp(2z) - 1}{\exp(2z) + 1} \;=\; \tanh(z)
    \]
    • z is Fisher's z,
    • tanh is the hyperbolic tangent.

    N.B. Some authorities divide Fisher's z by its standard error (1/√(n − 3)) to produce a z-score that follows a standard normal distribution. The confidence interval is then worked out for the z-score, before being detransformed back to the original correlation scale.
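The interval calculation can be sketched as follows. The function name `fisher_ci` and the default 95% level are assumptions made for the example; the limits are computed on the z scale and then back transformed with tanh.

```python
import math
from statistics import NormalDist

def fisher_ci(r, n, level=0.95):
    """Approximate confidence interval for rho via Fisher's z."""
    if n < 4:
        raise ValueError("need at least 4 bivariate observations")
    z = math.atanh(r)
    se = 1.0 / math.sqrt(n - 3)
    # Standard normal critical value, e.g. about 1.96 for a 95% interval.
    crit = NormalDist().inv_cdf(0.5 + level / 2.0)
    return math.tanh(z - crit * se), math.tanh(z + crit * se)

# Hypothetical example: r = 0.6 from n = 30 pairs.
lower, upper = fisher_ci(0.6, 30)
print(lower, upper)
```

Note that the interval is not symmetrical about r unless r is zero, which reflects the skewness of the sampling distribution.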


Weighted correlation coefficient

Sometimes observations have differing degrees of importance that can be described with a weight w. This may occur if different numbers of repeated observations are made on each individual. A weighted correlation coefficient can be estimated using the mean values for each individual (X̄ᵢ, Ȳᵢ) in the formula below:

Algebraically speaking -

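\[
r_w \;=\; \frac{\sum w_i \bar{X}_i \bar{Y}_i - (\sum w_i \bar{X}_i)(\sum w_i \bar{Y}_i) / \sum w_i}
               {\sqrt{\sum w_i \bar{X}_i^2 - (\sum w_i \bar{X}_i)^2 / \sum w_i} \; \sqrt{\sum w_i \bar{Y}_i^2 - (\sum w_i \bar{Y}_i)^2 / \sum w_i}}
\]
  • r_w is the weighted Pearson product moment correlation coefficient,
  • w_i is the weight attached to individual i,
  • X̄ᵢ and Ȳᵢ are the means of the repeated observations on each individual (or the individual observations if the aim is to add weight to certain observations),
  • n is the number of bivariate observations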
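A minimal sketch of the weighted coefficient is shown below. It assumes that each weighted sum of squares is corrected by dividing by the sum of the weights, in line with the numerator; the function name and data are hypothetical.

```python
import math

def weighted_pearson_r(x, y, w):
    """Weighted Pearson r; x and y are per-individual means, w the weights."""
    if not (len(x) == len(y) == len(w)):
        raise ValueError("x, y and w must have the same length")
    sw = sum(w)
    swx = sum(wi * xi for wi, xi in zip(w, x))
    swy = sum(wi * yi for wi, yi in zip(w, y))
    swxy = sum(wi * xi * yi for wi, xi, yi in zip(w, x, y))
    swxx = sum(wi * xi * xi for wi, xi in zip(w, x))
    swyy = sum(wi * yi * yi for wi, yi in zip(w, y))
    numerator = swxy - swx * swy / sw
    denominator = (math.sqrt(swxx - swx ** 2 / sw)
                   * math.sqrt(swyy - swy ** 2 / sw))
    return numerator / denominator

# Hypothetical per-individual means and numbers of repeated observations.
x_means = [1.0, 2.0, 3.0, 4.0]
y_means = [1.0, 3.0, 2.0, 5.0]
weights = [2, 1, 1, 1]
print(weighted_pearson_r(x_means, y_means, weights))
```

With integer weights the result is the same as the unweighted coefficient calculated after repeating each pair w times, and equal weights reproduce the ordinary coefficient.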


Correcting for bias due to measurement error

If measurement error is present for one or both of the variables, conventional estimates of the Pearson product-moment correlation coefficient suffer from attenuation - in other words they are biased towards zero. If repeated observations on an individual are made, the magnitude of this bias decreases as the number of measurements used to calculate the mean value of the variable for each individual increases. The lower the repeatability of the measurement (as measured by the intraclass correlation coefficient), the higher the degree of bias.

Algebraically speaking -

\[
r_{corr} \;=\; r \, \sqrt{\left(1 + \frac{1 - ICC_X}{n_X \, ICC_X}\right) \left(1 + \frac{1 - ICC_Y}{n_Y \, ICC_Y}\right)}
\]
  • r is the uncorrected correlation coefficient calculated using the sample mean values of each variable for each individual;
  • ICCX and ICCY are intraclass correlation coefficients for X and Y;
  • nX and nY are the per-individual sample sizes for X and Y.
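The sketch below illustrates a correction of this kind. It assumes the classical disattenuation form, in which r is divided by the square root of the product of the reliabilities of the two per-individual means, with each reliability obtained from the intraclass correlation by the Spearman-Brown formula. That form is an assumption of this example, so check it against the formula in use before relying on it. The function name is hypothetical.

```python
import math

def disattenuate(r, icc_x, n_x, icc_y, n_y):
    """Correct r for attenuation (assumed Spearman / Spearman-Brown form)."""
    for icc in (icc_x, icc_y):
        if not 0.0 < icc <= 1.0:
            raise ValueError("intraclass correlations must be in (0, 1]")
    # 1 / reliability of a mean of n measurements = 1 + (1 - ICC) / (n * ICC).
    factor_x = 1.0 + (1.0 - icc_x) / (n_x * icc_x)
    factor_y = 1.0 + (1.0 - icc_y) / (n_y * icc_y)
    return r * math.sqrt(factor_x * factor_y)

# Hypothetical example: moderate repeatability, 3 measurements per individual.
print(disattenuate(0.40, 0.6, 3, 0.7, 3))
```

Under this form the correction vanishes when both intraclass correlations equal 1, and shrinks as the number of measurements per individual increases.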


When not to use correlation analysis

There are two common situations where Pearson's correlation coefficient is misused:

  1. Assessment of measurement error

    The same variable is measured twice on a number of individuals, and a correlation coefficient - known as the test-retest reliability of the measurement method - is then calculated between the repeated measurements. This is a poor and potentially misleading measure of measurement error for two reasons.

    1. The value of the correlation coefficient will depend on the variability between the subjects. The higher their variability, the higher will be the r value.
    2. Secondly, the order of readings should not be important in this sort of study - but if one changes the order for some of the subjects, the value of the correlation coefficient will change. This latter problem is overcome by using the intraclass correlation coefficient, which estimates the average correlation among all possible orderings of pairs.

  2. Comparison of methods studies

    The same variable is measured twice using different methods. A high correlation coefficient is then taken to indicate good agreement between the two methods. This is incorrect because you are not interested just in whether they are correlated - you are interested in whether the readings are the same or not. Remember the Pearson correlation coefficient is scale free, so if one reading is always precisely double the other reading, you will get a correlation coefficient of 1 indicating perfect correlation. Yet the readings are quite different! For many situations the Bland-Altman method is the most appropriate analytical method for comparison of methods studies, although some advocate the use of errors-in-variables regression.
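The doubling example can be checked numerically. In the sketch below the readings are made up, and the agreement summary (mean difference with limits of agreement at ± 1.96 standard deviations of the differences) is a bare-bones illustration of the Bland-Altman idea rather than a full analysis.

```python
import math

def pearson_r(x, y):
    """Pearson r from sums of squared deviations and cross-products."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical readings: method B always reads exactly double method A.
method_a = [10.0, 12.0, 15.0, 18.0, 20.0]
method_b = [2.0 * value for value in method_a]

r = pearson_r(method_a, method_b)

# Agreement is judged from the differences, not from the correlation.
diffs = [b - a for a, b in zip(method_a, method_b)]
mean_diff = sum(diffs) / len(diffs)
sd_diff = math.sqrt(sum((d - mean_diff) ** 2 for d in diffs) / (len(diffs) - 1))
limits = (mean_diff - 1.96 * sd_diff, mean_diff + 1.96 * sd_diff)

print(r)          # perfect correlation, apart from rounding
print(mean_diff)  # yet the methods differ by 15 units on average
print(limits)
```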




Assumptions

  • Pairs of observations are independent. In other words, each observation of X should be independent of other observations of X and each observation of Y should be independent of other observations of Y. If observations are serially correlated, either spatially or temporally, the significance test of the correlation will be misleading.
  • Both variables are measurement variables - in other words at the interval/ratio scale. It is true that the correlation coefficient is often used where one or other (or both) scores are on an ordinal scale, especially in the case of visual analogue scales. But for such data it is better to use non-parametric correlation coefficients. Note that the equivalent correlation coefficient for dichotomous variables is the phi coefficient.
  • Measurements on Y are linearly related to X. The correlation coefficient is not appropriate for curvilinear relationships, and may fail to detect any relationship at all for 'humped' relationships.

    The above assumptions must be met whether the significance is tested by randomization or by parametric methods. Parametric tests of significance make two further assumptions:

  • X and Y have a bivariate normal distribution.
  • Measurements on Y have similar variance across all levels of X and vice versa.
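The linearity assumption is easily illustrated. In the sketch below (hypothetical data, made-up function name) Y is completely determined by X through a symmetrical 'humped' curve, yet the correlation coefficient is zero.

```python
import math

def pearson_r(x, y):
    """Pearson r from sums of squared deviations and cross-products."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Y depends exactly on X, but the relationship is a symmetrical hump.
x = [-3, -2, -1, 0, 1, 2, 3]
y = [9 - value ** 2 for value in x]

print(pearson_r(x, y))  # zero: no linear association at all
```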
