Non-parametric correlation and regression
Spearman rank-order correlation coefficient
Unlike Pearson's correlation coefficient, Spearman's correlation coefficient only requires that each variable be measured on at least the ordinal scale. It also makes no distributional assumptions, so it can be used for measurement variables where the assumption of bivariate normality does not hold. The data may consist of numeric observations to which ranks are applied, or of non-numeric observations that can only be ranked. In the case of ties in either the X or Y value, an average rank is assigned.
Computationally, Spearman's correlation coefficient is simply Pearson's correlation coefficient applied to the ranks of the observations. The value of the coefficient can range from -1 (perfect negative correlation) through 0 (no association between the rankings) to +1 (perfect positive correlation). Since it is a measure of the linearity of the ranked observations, it provides a test of a monotonic trend in the original data. Note, however, that it cannot be used to detect a non-monotonic trend, for example where Y initially increases with X but then decreases at higher values. Hence relationships should always be plotted before calculating the coefficient.
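The description above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not a reference implementation; the function names (average_ranks, pearson, spearman) and the example data are hypothetical.

```python
from math import sqrt

def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the block while the next sorted value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1  # positions are 0-based, ranks are 1-based
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks

def pearson(a, b):
    """Pearson's product-moment correlation of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    sab = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    saa = sum((u - mean_a) ** 2 for u in a)
    sbb = sum((v - mean_b) ** 2 for v in b)
    return sab / sqrt(saa * sbb)

def spearman(x, y):
    """Spearman's coefficient: Pearson's coefficient applied to the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical data for illustration only.
x = [1, 2, 3, 4, 5]
y_monotonic = [1, 8, 27, 64, 125]  # increasing but not linear
y_hat = [1, 4, 9, 4, 1]            # rises then falls (non-monotonic)

print(spearman(x, y_monotonic))  # perfect monotonic increase
print(spearman(x, y_hat))        # hat-shaped trend is not detected
```

The second call illustrates the warning in the text: a clear hat-shaped relationship yields a coefficient of zero, which is why the data should be plotted first.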
The Spearman correlation coefficient (ρ) (for which we use rs for the statistic) is given by:
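One standard way of writing this, consistent with the description above of Pearson's coefficient applied to ranks, is sketched below. The notation R(X_i) and R(Y_i) for the ranks (average ranks where there are ties) is assumed here for illustration.

```latex
r_s = \frac{\sum_{i=1}^{n}\bigl(R(X_i)-\bar{R}\bigr)\bigl(R(Y_i)-\bar{R}\bigr)}
           {\sqrt{\sum_{i=1}^{n}\bigl(R(X_i)-\bar{R}\bigr)^{2}\,
                  \sum_{i=1}^{n}\bigl(R(Y_i)-\bar{R}\bigr)^{2}}},
\qquad \bar{R} = \frac{n+1}{2}
```

The mean rank is (n + 1)/2 for both variables, whether or not average ranks have been assigned to ties.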
If there are no ties, this simplifies to:
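The usual no-ties form is shown below, where d_i (a symbol assumed here) is the difference between the X rank and the Y rank of the i-th observation.

```latex
r_s = 1 - \frac{6\sum_{i=1}^{n} d_i^{2}}{n\,(n^{2}-1)},
\qquad d_i = R(X_i) - R(Y_i)
```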
Spearman's correlation coefficient is not especially sensitive to ties, and if there are only a small number, the simpler formula can be used. However, ties do bias the value of the statistic upwards so for borderline values it is safer to use the longer formula.
For small samples (n ≤ 10) the significance of rs can be tested using a permutation test, or by comparing the value obtained with that in published tables (see, for example, Table A10 in Conover).
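A permutation test for a small sample can be sketched as follows. This is an illustrative sketch only: it assumes no ties, enumerates every permutation of the Y ranks (feasible only for small n), and computes a two-sided P-value. The function names and the data are hypothetical.

```python
from itertools import permutations

def ranks_no_ties(values):
    """Ranks 1 to n; this sketch assumes there are no tied values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        ranks[idx] = rank
    return ranks

def spearman_no_ties(rx, ry):
    """Spearman's coefficient from two rank sequences without ties."""
    n = len(rx)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))

def exact_permutation_p(x, y):
    """Return (r_s, two-sided P) by enumerating all n! orderings of the Y ranks."""
    rx, ry = ranks_no_ties(x), ranks_no_ties(y)
    observed = spearman_no_ties(rx, ry)
    count = total = 0
    for perm in permutations(ry):
        total += 1
        # Count orderings at least as extreme as the observed one
        # (small tolerance guards against floating-point rounding).
        if abs(spearman_no_ties(rx, perm)) >= abs(observed) - 1e-12:
            count += 1
    return observed, count / total

# Hypothetical data: six density estimates taken at successive times.
times = [1, 2, 3, 4, 5, 6]
density = [2.1, 3.4, 3.9, 5.0, 7.2, 9.8]
r_s, p_value = exact_permutation_p(times, density)
print(r_s, p_value)
```

With n = 10 the loop visits about 3.6 million permutations, so for anything larger a random sample of permutations (or the large-sample approximation) is the more practical choice.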
For larger samples (n > 10) we can studentize the statistic by dividing it by its standard error:
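The usual form of this studentized statistic, assumed here to be the one intended, is shown below. Under the null hypothesis of no association it is referred to the t-distribution with n − 2 degrees of freedom.

```latex
t = \frac{r_s}{\sqrt{\dfrac{1-r_s^{2}}{n-2}}}
  = r_s\,\sqrt{\frac{n-2}{1-r_s^{2}}}
```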
Important! It has unfortunately become common practice in some disciplines to calculate a non-parametric correlation coefficient with its associated P-value, but then plot a best fit least squares line to the data. This is very bad practice and is highly misleading. The P-value is not applicable to a linear fit of the (untransformed) Y against X, but to a linear fit of rank (Y) against rank (X).
Use as test for trend
The Spearman correlation coefficient can be used as a test for trend. In other words, if one has a set of estimates of (say) population density of an organism over time, one can assess whether numbers are declining or increasing, or whether there is no significant change over time. Measurements are simply paired with the time at which they were taken. The test for trend based on the Spearman correlation coefficient is generally considered more powerful than the Cox and Stuart test for trend.
Kendall rank-order correlation coefficient
The Kendall rank correlation coefficient is another measure of association between two variables measured at least on the ordinal scale. As with the Spearman rank-order correlation coefficient, the value of the coefficient can range from -1 (perfect negative correlation) through 0 (no association between the rankings) to +1 (perfect positive correlation). Since it measures how consistently the two rankings agree in order, it also provides a test of a monotonic trend in the original data.
The coefficient is computed in a similar way to the Wilcoxon-Mann-Whitney statistic. It is based on the principle that, if there is an association between the ranks of X and the ranks of Y, then when the X ranks are arranged in ascending order the Y ranks should show an increasing trend if the association is positive, and a decreasing trend if it is negative. Starting from the first Y rank, we therefore assess whether the difference between it and each subsequent Y rank is positive (a concordant pair) or negative (a discordant pair). We then do the same for the second Y rank, and so on until all observations are covered.
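The counting procedure just described can be sketched as follows. This is a minimal illustration that assumes no ties in either variable; the function name count_pairs and the data are hypothetical.

```python
def count_pairs(x, y):
    """Count concordant and discordant pairs; assumes no ties in x or y."""
    # Arrange the observations so that X is in ascending order.
    ys = [y_val for _, y_val in sorted(zip(x, y))]
    concordant = discordant = 0
    for i in range(len(ys)):
        # Compare each Y value with every subsequent Y value.
        for j in range(i + 1, len(ys)):
            if ys[j] > ys[i]:
                concordant += 1
            elif ys[j] < ys[i]:
                discordant += 1
    return concordant, discordant

# Hypothetical ranks for illustration only.
x_ranks = [1, 2, 3, 4, 5]
y_ranks = [2, 1, 4, 3, 5]
n_concordant, n_discordant = count_pairs(x_ranks, y_ranks)
print(n_concordant, n_discordant)
```

For these five observations there are 10 pairs in total, of which 8 are concordant and 2 discordant.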
Kendall's correlation coefficient (τ) (for which we use rk for the statistic) is then given by:
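The standard form of the coefficient when there are no ties is shown below. The symbols N_c and N_d, for the numbers of concordant and discordant pairs, are assumed here for illustration.

```latex
r_k = \frac{N_c - N_d}{\tfrac{1}{2}\,n\,(n-1)}
```

The denominator is the total number of pairs, so the coefficient is the difference between the proportions of concordant and discordant pairs.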
Kendall's coefficient calculated as above can only attain the extreme values of +1 and −1 if there are no ties. If there are ties in the data, an alternative formulation should be used. The most popular approach is to calculate what is known as the Gamma coefficient, which we will denote by rg.
In this situation if two Y ranks are equal and the two corresponding X ranks are not equal, the pair should be counted as ½ concordant and ½ discordant, and the totals adjusted accordingly. If two X ranks are equal, no comparison is made. The coefficient is then calculated using the following formula:
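The tie-handling rules above can be sketched in Python as follows. The sketch assumes the coefficient takes the form (Nc − Nd) / (Nc + Nd), with Nc and Nd the adjusted concordant and discordant totals; that form, the function name gamma_with_ties and the data are assumptions made for illustration.

```python
def gamma_with_ties(x, y):
    """Gamma-type coefficient using the half-counting rule for tied Y values.

    Assumed form: (Nc - Nd) / (Nc + Nd). Raises ZeroDivisionError if every
    pair is tied in X, since no comparison can then be made.
    """
    concordant = discordant = 0.0
    n = len(x)
    for i in range(n):
        for j in range(i + 1, n):
            if x[i] == x[j]:
                continue  # tied X values: no comparison is made
            if y[i] == y[j]:
                # Tied Y values with distinct X values: half and half.
                concordant += 0.5
                discordant += 0.5
            elif (x[i] < x[j]) == (y[i] < y[j]):
                concordant += 1  # X and Y change in the same direction
            else:
                discordant += 1  # X and Y change in opposite directions
    return (concordant - discordant) / (concordant + discordant)

# Hypothetical data with one tie in X and one tie in Y.
x = [1, 2, 2, 3, 4]
y = [1, 2, 3, 3, 5]
r_g = gamma_with_ties(x, y)
print(r_g)
```

Here one pair is skipped because of the X tie and one pair is split between the two totals, giving Nc = 8.5 and Nd = 0.5.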
Another way of dealing with ties is to do the following:
Assumptions of rank order correlation coefficients
Unlike the Pearson product moment correlation coefficient, no distributional assumptions are made by the rank order coefficients. However, they do assume the following:
Both variables are measured on at least the ordinal scale. Hence individual observations can be ranked into two ordered series.
The relationship between X and Y is monotonic. The coefficients should not be used for U-shaped or hat-shaped relationships between X and Y. Monotonicity can be checked by simple inspection of the X-Y scatterplot, or by plotting the rank of each Y observation against the rank of each X observation. Some texts claim there is no point in making this plot, but in fact it provides the most sensitive way to assess whether the relationship between X and Y really is monotonic.