InfluentialPoints.com
Biology, images, analysis, design...
"It has long been an axiom of mine that the little things are infinitely the most important" (Sherlock Holmes)

# Non-parametric correlation and regression


### Spearman rank-order correlation coefficient

Unlike Pearson's correlation coefficient, Spearman's correlation coefficient only requires that each variable be measured on at least an ordinal scale. It also makes no distributional assumptions, so it can be used for measurement variables where the assumption of bivariate normality does not hold. The data may consist of numeric observations to which ranks are applied, or of non-numeric observations that can only be ranked. Where there are ties in either the X or the Y values, each tied observation is assigned the average of the ranks it would otherwise occupy.

Computationally, Spearman's correlation coefficient is simply Pearson's correlation coefficient applied to the ranks of the observations. The value of the coefficient can range from −1 (perfect negative correlation) through 0 (no association between rankings) to +1 (perfect positive correlation). Since it is a measure of the linearity of the ranked observations, it provides a test of a monotonic trend in the original data. Note, however, that it cannot be used to detect a non-monotonic trend, for example where Y initially increases with X but then decreases at higher values. Hence relationships should always be plotted before calculating the coefficient.

The Spearman correlation coefficient (ρ) (for which we use rs for the statistic) is given by:

#### Algebraically speaking -

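$$r_s = \frac{\sum R(X)R(Y) - n\left(\frac{n+1}{2}\right)^2}{\sqrt{\left[\sum R(X)^2 - n\left(\frac{n+1}{2}\right)^2\right]\left[\sum R(Y)^2 - n\left(\frac{n+1}{2}\right)^2\right]}}$$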
where
• rs is the Spearman correlation coefficient,
• R(X) and R(Y) are the ranks of the individual observations of the two variables,
• n is the number of bivariate observations
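
The formula above can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the function names `average_ranks` and `spearman_rs` and the example data are invented here, and ties are given average ranks as described earlier.

```python
import math

def average_ranks(values):
    """Rank values 1..n, giving tied values the mean of the ranks they span."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the current group of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # sorted positions i..j hold ranks i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def spearman_rs(x, y):
    """Spearman's rs from the general formula (valid with or without ties)."""
    n = len(x)
    rx, ry = average_ranks(x), average_ranks(y)
    c = n * ((n + 1) / 2) ** 2  # correction term n((n + 1)/2)^2
    numerator = sum(a * b for a, b in zip(rx, ry)) - c
    denominator = math.sqrt(
        (sum(a * a for a in rx) - c) * (sum(b * b for b in ry) - c)
    )
    return numerator / denominator

# Hypothetical data, invented for illustration only.
x = [10, 20, 30, 40, 50]
y = [2, 1, 4, 3, 5]
print(spearman_rs(x, y))  # 0.8 up to floating-point rounding
```

Note that the denominator is zero if all observations of either variable are tied, in which case the coefficient is undefined.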

If there are no ties, this simplifies to:

#### Algebraically speaking -

$$r_s = 1 - \frac{6 \sum \left[R(X) - R(Y)\right]^2}{n(n^2 - 1)}$$
where
• rs is the Spearman correlation coefficient,
• R(X) and R(Y) are the ranks of the individual observations of the two variables,
• n is the number of bivariate observations
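
As a rough sketch, the simplified formula can be applied directly to two lists of ranks. The function name `spearman_no_ties` and the example ranks are invented for illustration, and the code assumes the ranks contain no ties.

```python
def spearman_no_ties(rank_x, rank_y):
    """Spearman's rs from the simplified formula; assumes no tied ranks."""
    n = len(rank_x)
    # Sum of squared differences between paired ranks.
    sum_d2 = sum((rx - ry) ** 2 for rx, ry in zip(rank_x, rank_y))
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))

# Hypothetical ranks: the differences are -1, 1, -1, 1, 0, so the sum of
# squared differences is 4 and rs = 1 - 24/120.
print(spearman_no_ties([1, 2, 3, 4, 5], [2, 1, 4, 3, 5]))  # 0.8
```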

Spearman's correlation coefficient is not especially sensitive to ties, and if there are only a small number, the simpler formula can be used. However, ties do bias the value of the statistic upwards so for borderline values it is safer to use the longer formula.

#### Testing significance

For small samples (n ≤ 10) the significance of rs can be tested using a permutation test, or by comparing the value obtained with published tables of critical values (see for example Table A10 in Conover (1999), or Table P in Siegel (1956)). The calculated statistic is significant if its absolute value exceeds the tabulated critical value.
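
One way such a permutation test might be sketched is shown below. It is an illustration under stated assumptions, not a definitive procedure: it assumes untied ranks, a two-sided test, and a sample small enough to enumerate all n! orderings. The function names are invented here.

```python
from itertools import permutations

def spearman_from_ranks(rank_x, rank_y):
    """Spearman's rs from the simplified (no-ties) formula."""
    n = len(rank_x)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    return 1 - 6 * sum_d2 / (n * (n * n - 1))

def permutation_p_value(rank_x, rank_y):
    """Two-sided exact P-value: the proportion of all orderings of the Y ranks
    whose |rs| is at least as large as the observed |rs|."""
    observed = abs(spearman_from_ranks(rank_x, rank_y))
    count = total = 0
    for perm in permutations(rank_y):
        total += 1
        # Small tolerance guards against floating-point rounding.
        if abs(spearman_from_ranks(rank_x, perm)) >= observed - 1e-12:
            count += 1
    return count / total

# Hypothetical ranks in perfect agreement (n = 4): only the identical and the
# fully reversed orderings are as extreme, so P = 2/24.
print(permutation_p_value([1, 2, 3, 4], [1, 2, 3, 4]))
```

With n = 10 there are already about 3.6 million orderings, so full enumeration in pure Python is slow; a random sample of permutations is a common alternative.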

For larger samples (n > 10) we can studentize the statistic by dividing it by its standard error:

#### Algebraically speaking -

$$t = r_s \sqrt{\frac{n - 2}{1 - r_s^2}}$$
where
• t is Student's t statistic; under the null hypothesis of independence it approximately follows the t-distribution with (n − 2) degrees of freedom,
• rs is the Spearman correlation coefficient,
• n is the number of bivariate observations
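
The studentized statistic is simple to compute. The sketch below returns t together with its degrees of freedom; the function name `spearman_t` and the example values are invented for illustration. Obtaining a P-value would additionally need a t-distribution function from a statistics library, which is not shown here.

```python
import math

def spearman_t(rs, n):
    """Return (t, degrees of freedom) for a Spearman coefficient rs from n pairs."""
    if n <= 2:
        raise ValueError("need more than two bivariate observations")
    df = n - 2
    if abs(rs) == 1:
        # The formula divides by zero for a perfect correlation.
        return math.copysign(math.inf, rs), df
    return rs * math.sqrt(df / (1 - rs ** 2)), df

# Hypothetical values: rs = 0.8 from n = 12 pairs.
t, df = spearman_t(0.8, 12)
print(round(t, 4), df)  # 4.2164 10
```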

#### Important!

It has unfortunately become common practice in some disciplines to calculate a non-parametric correlation coefficient with its associated P-value, but then plot a best fit least squares line to the data. This is very bad practice and is highly misleading. The P-value is not applicable to a linear fit of the (untransformed) Y against X, but to a linear fit of rank (Y) against rank (X).

#### Use as test for trend

The Spearman correlation coefficient can be used as a test for trend. In other words, if one has a set of estimates of (say) population density of an organism over time, one can assess whether numbers are declining or increasing, or whether there is no significant change over time. Measurements are simply paired with the time at which they were taken. The test for trend based on the Spearman correlation coefficient is generally considered more powerful than the Cox and Stuart test for trend.

### Kendall rank-order correlation coefficient

The Kendall rank correlation coefficient is another measure of association between two variables measured on at least an ordinal scale. As with the Spearman rank-order correlation coefficient, the value of the coefficient can range from −1 (perfect negative correlation) through 0 (no association between rankings) to +1 (perfect positive correlation). Since it measures how far the two sets of ranks agree in their ordering, it likewise provides a test of a monotonic trend in the original data.

The coefficient is computed in a similar way to the Wilcoxon-Mann-Whitney statistic. It is based on the principle that, when the X ranks are arranged in ascending order, the Y ranks should show an increasing trend if the association is positive and a decreasing trend if it is negative. Starting from the first Y rank, we compare it with each subsequent Y rank and record whether the difference is positive (a concordant pair) or negative (a discordant pair). We then do the same for the second Y rank, and so on until all observations are covered.

Kendall's correlation coefficient (τ) (for which we use rk for the statistic) is then given by:

#### Algebraically speaking -

$$r_k = \frac{n_c - n_d}{0.5\,n(n - 1)}$$
where
• nc is the number of concordant pairs,
• nd is the number of discordant pairs,
• rk is the Kendall rank correlation coefficient,
• n is the number of bivariate observations
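
The counting of concordant and discordant pairs can be sketched as follows. The function names and data are invented for illustration. Tied pairs are simply skipped, so this sketch matches the formula above only when there are no ties.

```python
from itertools import combinations

def kendall_counts(x, y):
    """Count concordant and discordant pairs over all n(n - 1)/2 pairs."""
    nc = nd = 0
    for (x1, y1), (x2, y2) in combinations(list(zip(x, y)), 2):
        sign = (x2 - x1) * (y2 - y1)
        if sign > 0:    # X and Y move in the same direction
            nc += 1
        elif sign < 0:  # X and Y move in opposite directions
            nd += 1
    return nc, nd

def kendall_rk(x, y):
    """Kendall's coefficient for untied data."""
    n = len(x)
    nc, nd = kendall_counts(x, y)
    return (nc - nd) / (0.5 * n * (n - 1))

# Hypothetical data: 8 concordant and 2 discordant pairs out of 10.
x = [1, 2, 3, 4, 5]
y = [2, 1, 4, 3, 5]
print(kendall_counts(x, y), kendall_rk(x, y))  # (8, 2) 0.6
```

Comparing every pair directly is equivalent to sorting by X and then comparing each Y rank with those that follow it, as described in the text.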

Kendall's coefficient calculated as above can only attain its limits of +1 and −1 if there are no ties. If there are ties in the data, an alternative formulation should be used. The most popular approach is to calculate what is known as the Gamma coefficient, which we will denote by rg.

In this situation if two Y ranks are equal and the two corresponding X ranks are not equal, the pair should be counted as ½ concordant and ½ discordant, and the totals adjusted accordingly. If two X ranks are equal, no comparison is made. The coefficient is then calculated using the following formula:

#### Algebraically speaking -

$$r_g = \frac{n_c - n_d}{n_c + n_d}$$
where
• nc is the number of concordant pairs,
• nd is the number of discordant pairs,
• rg is the Kendall rank correlation coefficient adjusted for ties, otherwise known as the Gamma coefficient,
• n is the number of bivariate observations
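
A minimal sketch of this calculation, following the counting rule described above (pairs tied on X are skipped; pairs tied only on Y count as half concordant and half discordant), might look like this. The function name `gamma_coefficient` and the data are invented for illustration.

```python
from itertools import combinations

def gamma_coefficient(x, y):
    """Gamma coefficient using the tie-counting rule described in the text."""
    nc = nd = 0.0
    for (x1, y1), (x2, y2) in combinations(list(zip(x, y)), 2):
        if x1 == x2:
            continue          # tied on X: no comparison is made
        if y1 == y2:
            nc += 0.5         # tied on Y only: half concordant,
            nd += 0.5         # half discordant
        elif (x2 - x1) * (y2 - y1) > 0:
            nc += 1
        else:
            nd += 1
    return (nc - nd) / (nc + nd)

# Hypothetical data with one tie on Y: nc = 5.5 and nd = 0.5.
print(gamma_coefficient([1, 2, 3, 4], [1, 2, 2, 3]))  # 5/6, about 0.833
```

The function raises a division-by-zero error if every pair is tied on X, since no comparisons can then be made.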

Another way of dealing with ties is to do the following:

#### Algebraically speaking -

$$r_{Kc} = \frac{n_c - n_d}{\sqrt{0.5\,n(n - 1) - T_X}\;\sqrt{0.5\,n(n - 1) - T_Y}}$$
where
• nc is the number of concordant pairs,
• nd is the number of discordant pairs,
• rKc is the Kendall rank correlation coefficient corrected for ties (often called Kendall's tau-b),
• n is the number of bivariate observations,
• TX = 0.5 ΣtX (tX − 1), tX being the number of tied observations in each group of ties on the X variable,
• TY = 0.5 ΣtY (tY − 1), tY being the number of tied observations in each group of ties on the Y variable.
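
The tie-corrected formula above can be sketched as below. The function names and the data are invented for illustration. Only strictly concordant and strictly discordant pairs are counted, and the tie terms TX and TY are computed from the size of each group of tied values.

```python
import math
from collections import Counter
from itertools import combinations

def tie_term(values):
    """T = 0.5 * sum of t(t - 1) over groups of t tied observations."""
    return 0.5 * sum(t * (t - 1) for t in Counter(values).values())

def kendall_tie_corrected(x, y):
    """Kendall's coefficient with the denominator corrected for ties."""
    n = len(x)
    nc_minus_nd = 0
    for (x1, y1), (x2, y2) in combinations(list(zip(x, y)), 2):
        sign = (x2 - x1) * (y2 - y1)
        # +1 for a concordant pair, -1 for a discordant pair, 0 for a tie.
        nc_minus_nd += (sign > 0) - (sign < 0)
    pairs = 0.5 * n * (n - 1)
    return nc_minus_nd / (
        math.sqrt(pairs - tie_term(x)) * math.sqrt(pairs - tie_term(y))
    )

# Hypothetical data: nc - nd = 5, TX = 0, TY = 1, so the result is 5/sqrt(6 * 5).
print(kendall_tie_corrected([1, 2, 3, 4], [1, 2, 2, 3]))  # about 0.913
```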

### Assumptions of rank order correlation coefficients

Unlike the Pearson product moment correlation coefficient, no distributional assumptions are made by the rank order coefficients. However, they do assume the following:

• Pairs of observations are independent.
• Variables are measured at least on an ordinal (rank order) scale.
Hence individual observations can be ranked into two ordered series.
• The relationship between the two variables is monotonic - in other words, continually increasing or decreasing.
The coefficients should not be used for U-shaped or hat-shaped relationships between X and Y. Monotonicity can be checked by simple inspection of the X-Y scatterplot, or by plotting the rank of each Y observation against the rank of each X observation. Some texts claim there is no point in making this plot, but in fact it provides the most sensitive way to assess whether the relationship between X and Y really is monotonic.
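
To illustrate why the monotonicity assumption matters, the sketch below computes Spearman's coefficient as Pearson's correlation of the ranks for an invented U-shaped data set and for an invented monotonic one. The function names and data are hypothetical, chosen only to make the point. The (rank X, rank Y) pairs it builds are also the values one would plot to check monotonicity.

```python
import math

def average_ranks(values):
    """Average ranks: a value's rank is the mean of the 1-based positions
    it occupies in the sorted list."""
    ordered = sorted(values)
    return [ordered.index(v) + (ordered.count(v) + 1) / 2 for v in values]

def pearson(a, b):
    """Pearson's product moment correlation coefficient."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    sab = sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b))
    saa = sum((p - mean_a) ** 2 for p in a)
    sbb = sum((q - mean_b) ** 2 for q in b)
    return sab / math.sqrt(saa * sbb)

def spearman_via_ranks(x, y):
    """Spearman's rs computed as Pearson's r on the ranked observations."""
    return pearson(average_ranks(x), average_ranks(y))

x = [-3, -2, -1, 0, 1, 2, 3]
u_shaped = [v ** 2 for v in x]   # Y falls and then rises with X
monotonic = [v ** 3 for v in x]  # Y rises throughout

print(spearman_via_ranks(x, u_shaped))   # 0.0: a strong relationship goes undetected
print(spearman_via_ranks(x, monotonic))  # 1.0
```

In the U-shaped case Y is completely determined by X, yet the coefficient is zero, which is why the scatterplot should be inspected before the coefficient is interpreted.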