Biology, images, analysis, design...
"It has long been an axiom of mine that the little things are infinitely the most important" (Sherlock Holmes)



Runs tests: Use & misuse

(one-sample runs test, Wald-Wolfowitz test, test of randomness, comparing distributions, trends)

Statistics courses, especially for biologists, assume that formulae equal understanding and teach how to do statistics, but largely ignore what those procedures assume and how their results mislead when those assumptions are unreasonable. The resulting misuse is, shall we say, predictable...

Use and Misuse

The one-sample runs test assesses whether a sequence of observations on a dichotomous (or binary) variable can be considered random. The same test can be applied to the two-sample situation, in which case it is known as the Wald-Wolfowitz test. There it functions as an overall test of difference between two independent samples: the two samples are pooled and ranked, and runs of sample membership are counted. In other words, the alternative hypothesis is that the distributions of the groups differ in some way - whether in location, dispersion, skew or kurtosis. The runs test and the Wald-Wolfowitz test are now rarely found in the medical literature, perhaps reflecting the awareness that their use is seldom justified. The tests are, however, still found in the ecological literature, especially for preliminary analysis of spatial and temporal data. Another use is to assess trend in the residuals of nonlinear regression.
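
As a rough illustration of the one-sample test described above, the following Python sketch counts runs in a binary sequence and compares the count with its expectation under randomness, using the large-sample normal approximation. The function names and the example sequence are hypothetical, chosen only for illustration, and the approximation is not appropriate for small samples.

```python
import math

def count_runs(seq):
    """Number of runs: maximal blocks of identical consecutive values."""
    if not seq:
        return 0
    return 1 + sum(1 for a, b in zip(seq, seq[1:]) if a != b)

def runs_test_normal(seq):
    """One-sample runs test using the normal approximation.

    Returns (runs, expected_runs, z, two_sided_p).
    Assumes seq holds exactly two distinct values.
    """
    values = sorted(set(seq))
    if len(values) != 2:
        raise ValueError("sequence must contain exactly two categories")
    n1 = sum(1 for x in seq if x == values[0])
    n2 = len(seq) - n1
    n = n1 + n2
    runs = count_runs(seq)
    # Mean and variance of the number of runs under the null of randomness
    mean = 2 * n1 * n2 / n + 1
    var = 2 * n1 * n2 * (2 * n1 * n2 - n) / (n ** 2 * (n - 1))
    z = (runs - mean) / math.sqrt(var)
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail area
    return runs, mean, z, p

# Hypothetical, strongly clustered sequence (illustration only)
runs, mean, z, p = runs_test_normal("AAAAABBBBBAAAAABBBBB")
print(runs, mean, round(z, 2), round(p, 4))
```

Too few runs (negative z) suggests clustering or trend; too many runs (positive z) suggests alternation.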

One misuse of the test stems from its lack of power. We found one example where the runs test was used to assess whether shedding of bacteria by cattle is random or clustered over time. This was bound to be unproductive given a sample size of only 12. Other examples were found where sample sizes were adequate, but the comparison would have had even more power if the Kolmogorov-Smirnov test had been used. Some authors used the normal approximation even for small samples, or used the test on data with large numbers of ties. Both of these will give misleading results. Exact or Monte Carlo solutions should be used for small samples, and the test should not be used at all on data with large numbers of ties.
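
For small samples the exact null distribution of the number of runs can be enumerated directly. The sketch below does this with standard combinatorial formulae. The function names are hypothetical, and the split of 12 observations into 6 of each category is an assumption made purely for illustration, not data from the study mentioned above.

```python
from math import comb

def runs_pmf(n1, n2):
    """Exact null distribution of the number of runs, as {r: probability}.

    n1 and n2 are the counts of the two categories (both at least 1).
    """
    total = comb(n1 + n2, n1)  # number of distinct arrangements
    pmf = {}
    for r in range(2, n1 + n2 + 1):
        k = r // 2
        if r % 2 == 0:
            # Even r: k runs of each category, starting with either one
            ways = 2 * comb(n1 - 1, k - 1) * comb(n2 - 1, k - 1)
        else:
            # Odd r: one category has k + 1 runs, the other has k
            ways = (comb(n1 - 1, k) * comb(n2 - 1, k - 1)
                    + comb(n1 - 1, k - 1) * comb(n2 - 1, k))
        if ways:
            pmf[r] = ways / total
    return pmf

def lower_tail_p(n1, n2, r_obs):
    """P(R <= r_obs): one-sided exact p-value for clustering (too few runs)."""
    return sum(p for r, p in runs_pmf(n1, n2).items() if r <= r_obs)

# Assumed split of 12 observations into 6 and 6 (illustration only)
for r in (2, 3, 4):
    print(r, round(lower_tail_p(6, 6, r), 4))
```

With this assumed 6/6 split, only sequences with 3 or fewer runs reach one-sided significance at the 5% level, which gives some feel for how little power the test has at such a sample size.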

The other major misuse of the test was to accept a significant result of the Wald-Wolfowitz test as demonstrating that means (or medians) differ. Unfortunately the test cannot do that - it can only indicate that the distributions differ in some way. We found a well-known example of this where the test (wrongly) appeared to show that left-handed people do not live as long as right-handed people. The moral of the story is that if you wish to show a difference between means or medians, then use a test which will demonstrate this, such as the median test, the Wilcoxon-Mann-Whitney test (if distributions are of similar shape) or a t-test (if distributions approach normal).
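
The point that a significant Wald-Wolfowitz result need not reflect a difference in medians can be sketched in Python. The two samples below are invented for illustration: they share the same median but differ in spread. The helper names are hypothetical, and the p-value comes from a simple Monte Carlo shuffle of the sample labels rather than from any particular package.

```python
import random
from statistics import median

def pooled_labels(x, y):
    """Sample labels in order of the pooled, sorted values (assumes no ties)."""
    pooled = sorted([(v, "x") for v in x] + [(v, "y") for v in y])
    return [label for _, label in pooled]

def count_runs(labels):
    """Number of maximal blocks of identical consecutive labels."""
    return 1 + sum(1 for a, b in zip(labels, labels[1:]) if a != b)

def monte_carlo_p(labels, n_sim=5000, seed=0):
    """One-sided Monte Carlo p-value: share of shuffles with as few runs or fewer."""
    rng = random.Random(seed)
    observed = count_runs(labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_sim):
        rng.shuffle(shuffled)
        if count_runs(shuffled) <= observed:
            hits += 1
    return (hits + 1) / (n_sim + 1)

# Invented data: identical medians, very different dispersion
wide = [-9, -8, -7, -6, -5, 5, 6, 7, 8, 9]
narrow = [-2.0, -1.5, -1.0, -0.5, -0.2, 0.2, 0.5, 1.0, 1.5, 2.0]

labels = pooled_labels(wide, narrow)
print(median(wide), median(narrow))               # medians are equal
print(count_runs(labels), monte_carlo_p(labels))  # few runs, small p-value
```

Here the test rejects because the narrow sample sits entirely inside the wide one, not because the centres differ.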

There is perhaps some justification for using runs test(s) as an initial (global) test to detect trends, with subsequent tests only applied if the initial runs test is significant. We found two ecological examples, one in relation to detecting spatial clustering and the other considering cyclic fluctuations over time. Similarly, the runs test can be used to check for trend in the residuals of nonlinear regression, but it cannot on its own provide a test of goodness of fit. The danger with the runs test for both these applications is that there are some non-random patterns that it will fail to identify.
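
One way to see that some non-random patterns go undetected is to feed the test a perfectly regular sequence made of pairs. In the Python sketch below (hypothetical function name, invented sequences, normal approximation used only to give a sense of scale), the strictly paired sequence produces almost exactly the number of runs expected under randomness.

```python
import math

def runs_z(seq):
    """Return (runs, z) for a two-category sequence, via the normal approximation."""
    n1 = sum(1 for x in seq if x == seq[0])
    n2 = len(seq) - n1
    n = n1 + n2
    runs = 1 + sum(1 for a, b in zip(seq, seq[1:]) if a != b)
    mean = 2 * n1 * n2 / n + 1
    var = 2 * n1 * n2 * (2 * n1 * n2 - n) / (n ** 2 * (n - 1))
    return runs, (runs - mean) / math.sqrt(var)

paired = "AABB" * 5            # completely regular: every run has length two
blocked = "AAAAABBBBB" * 2     # long blocks, for contrast

print(runs_z(paired))   # z close to zero despite the obvious pattern
print(runs_z(blocked))  # z strongly negative
```

The paired sequence is clearly not random, yet its run count is close to the null expectation, so the test gives no signal.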


What the statisticians say

Sprent (1993, 1998) covers both the one-sample runs test and the two-sample Wald-Wolfowitz test. Zar (1998) covers only the one-sample test in Chapter 25, noting that use of the Wilcoxon-Mann-Whitney test is preferable to the Wald-Wolfowitz test in the two-sample situation. Sokal & Rohlf (1995) cover the one-sample runs test with a good explanation of its varied usage. Gibbons & Chakraborti (1992) give a detailed treatment of runs tests including exact (permutation) tests, and tests based on the length of the longest run. Siegel (1956) is an older text but is still useful for nonparametric tests.

Mogull (1994) reports that the test is incapable of signaling departures from randomness with run lengths of two. Moore & Wallis (1943) look at runs tests for carrying out significance tests on time series data, whilst Huitema (1996) notes that the test has the wrong type I error rate if used to evaluate the independence of errors in time-series regression models. Mood (1940) reviews much of the underlying theory on the distribution of runs, whilst Wald & Wolfowitz (1940) propose the two independent sample runs test.

Wikipedia describes the main features of the runs test (Wald-Wolfowitz runs test and runs test are treated as synonyms).