"It has long been an axiom of mine that the little things are infinitely the most important"
Kolmogorov-Smirnov and related tests: Use & misuse
(one and two sample tests, normality, estimated parameters, Lilliefors test, discrete distributions)
Statistics courses, especially for biologists, assume formulae = understanding and teach how to do statistics, but largely ignore what those procedures assume, and how their results mislead when those assumptions are unreasonable. The resulting misuse is, shall we say, predictable...
Use and Misuse
These tests provide a means of comparing distributions, whether two sample distributions or a sample distribution with a theoretical distribution. The distributions are compared in their cumulative form as empirical distribution functions. The test statistic developed by Kolmogorov and Smirnov to compare distributions is simply the maximum vertical distance between the two functions. Kolmogorov-Smirnov tests have the advantages that (a) for a continuous, fully specified distribution, the distribution of the statistic does not depend on the cumulative distribution function being tested and (b) the test is exact. They have the disadvantage that they are more sensitive to deviations near the centre of the distribution than at the tails.
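The 'maximum vertical distance' can be made concrete with a short sketch. The function name `ks_statistic` and the sample values are invented for illustration, and only the Python standard library is assumed. Because the empirical distribution function jumps at each observation, the distance has to be checked on both sides of every step.

```python
from statistics import NormalDist


def ks_statistic(sample, cdf):
    """Maximum vertical distance between the ECDF of `sample` and `cdf`."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs, start=1):
        f = cdf(x)
        # The ECDF jumps from (i - 1)/n to i/n at x, so compare the
        # theoretical CDF with both the top and the bottom of the step.
        d = max(d, i / n - f, f - (i - 1) / n)
    return d


# Hypothetical data, tested against a fully specified N(0, 1).
sample = [-1.2, -0.4, 0.1, 0.3, 0.9, 1.7]
d_normal = ks_statistic(sample, NormalDist(0.0, 1.0).cdf)
print(round(d_normal, 4))
```

The mean and standard deviation of the reference distribution are fixed in advance here, not estimated from the sample, which is the situation the standard Kolmogorov-Smirnov tables assume.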
Both the one- and two-sample Kolmogorov-Smirnov and related tests are widely used in all disciplines. Unfortunately, the one-sample Kolmogorov-Smirnov test is commonly misused to test normality when the parameters of the normal distribution are estimated from the sample rather than specified a priori. The result is that the test is far too conservative, and distributions that are clearly not normal are wrongly classified as normal. This practice is perhaps reinforced by a sometimes unconcealed desire to demonstrate normality so that subsequent parametric tests can be carried out. The situation is not helped by various software packages being unclear about which test is being used. The correct test for normality when the parameters of the normal distribution are estimated from the sample is the Lilliefors test.
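A rough way to see why estimated parameters matter is to simulate the null distribution of the statistic when the mean and standard deviation are fitted to each sample. The sketch below is a Monte Carlo approximation in the spirit of the Lilliefors test, not a reproduction of published tables. All function names, the seeds and the number of simulations are arbitrary choices made here, and 0.294 is the commonly tabulated 5% critical value for n = 20 with a fully specified distribution.

```python
import random
from statistics import NormalDist, mean, stdev


def ks_statistic(sample, cdf):
    """Maximum vertical distance between the ECDF of `sample` and `cdf`."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs, start=1):
        f = cdf(x)
        d = max(d, i / n - f, f - (i - 1) / n)
    return d


def ks_fitted_normal(sample):
    """KS distance from a normal whose mean and SD come from the sample."""
    fitted = NormalDist(mean(sample), stdev(sample))
    return ks_statistic(sample, fitted.cdf)


def simulate_null(n, n_sim, rng):
    """Sorted simulated values of the fitted-normal statistic under normality.

    Simulating from N(0, 1) is enough: the statistic is unchanged by
    shifting or rescaling the data once the parameters are estimated.
    """
    return sorted(
        ks_fitted_normal([rng.gauss(0.0, 1.0) for _ in range(n)])
        for _ in range(n_sim)
    )


def monte_carlo_p(sample, n_sim=2000, seed=1):
    """Simulated P-value for normality with estimated parameters."""
    d_obs = ks_fitted_normal(sample)
    null = simulate_null(len(sample), n_sim, random.Random(seed))
    exceed = sum(d >= d_obs for d in null)
    return d_obs, (exceed + 1) / (n_sim + 1)


rng = random.Random(42)
null_d = simulate_null(n=20, n_sim=2000, rng=rng)
crit_estimated = null_d[int(0.95 * len(null_d))]  # approximate 95th percentile
CRIT_SPECIFIED = 0.294  # tabulated 5% value, n = 20, fully specified null

print(round(crit_estimated, 3), CRIT_SPECIFIED)

# Hypothetical skewed sample: exponential data tested for normality.
skewed = [rng.expovariate(1.0) for _ in range(20)]
d_obs, p_value = monte_carlo_p(skewed)
print(round(d_obs, 3), round(p_value, 3))
```

The simulated critical value comes out well below the tabulated one, so comparing a fitted-normal statistic with the standard table rejects far less often than the nominal 5%.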
When it comes to goodness of fit to discrete distributions, the test can be adapted to give the correct P-value, and various packages provide software to test goodness of fit to the Poisson distribution and the Zipf distribution. However, there is no Lilliefors equivalent for these distributions, so again parameters cannot be estimated from the sample. A second major problem arises from testing discrete variables against continuous distribution functions. We give a well-known example where a Kolmogorov-Smirnov test of final digits of P-values (a discrete variable) suggested that they deviated from the expected (continuous) uniform distribution. The test, however, gave the wrong P-value because, with many ties, the test is far too liberal. A more basic error that we find with all goodness of fit tests is misinterpretation of a small P-value as indicating a 'good fit'. In fact, of course, it means the opposite, but researchers are so imbued with the need for significance that they forget that, with goodness of fit tests, a significant result means a deviation from the 'null' distribution.
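The effect of ties can be illustrated with a deliberately artificial case. The sketch assumes that digits 0 to 9 are compared with a continuous uniform distribution on [0, 10); this choice is made here for illustration and is not necessarily how the cited example was analysed. It also uses the large-sample 5% critical value 1.358/sqrt(n), which is only an approximation. All names are hypothetical.

```python
import math


def ks_statistic(sample, cdf):
    """Maximum vertical distance between the ECDF of `sample` and `cdf`."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs, start=1):
        f = cdf(x)
        d = max(d, i / n - f, f - (i - 1) / n)
    return d


def uniform_cdf(x, upper=10.0):
    """CDF of a continuous uniform distribution on [0, upper)."""
    return min(max(x / upper, 0.0), 1.0)


def approx_critical_5pct(n):
    """Large-sample 5% critical value for a fully specified continuous null."""
    return 1.358 / math.sqrt(n)


def perfectly_uniform_digits(repeats):
    """Each digit 0-9 exactly `repeats` times: as uniform as digits can be."""
    return [digit for digit in range(10) for _ in range(repeats)]


results = {}
for repeats in (10, 100):
    digits = perfectly_uniform_digits(repeats)
    d = ks_statistic(digits, uniform_cdf)
    crit = approx_critical_5pct(len(digits))
    results[len(digits)] = (d, crit)
    print(len(digits), round(d, 3), round(crit, 3), d > crit)
```

Each digit occurs equally often, yet the statistic stays at 0.1 whatever the sample size, because every tied block produces a step of height 0.1 in the empirical distribution function. The critical value shrinks as n grows, so with enough observations a perfectly uniform set of digits is declared significantly non-uniform.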
With the two-sample test, the usual question is: what does one want to compare? A Kolmogorov-Smirnov test compares the overall distributions rather than specifically locations or dispersions. By and large we have found the test is used correctly in this respect. But there is the same problem as with the one-sample test over the interpretation of non-significant P-values. In some cases authors seem to think that they have proved the null hypothesis, and that two distributions are therefore 'the same'. This may appear rather pedantic, but it is important. The Kolmogorov-Smirnov test has rather little power to detect differences between distributions, and for small sample sizes the two distributions would need to be completely different for this test to show a significant difference.
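The point about small samples can be checked directly by enumerating every way of splitting the pooled observations into two groups. The sketch below computes the two-sample statistic and an exact permutation P-value. The function names and data are invented for illustration, and full enumeration is only practical for very small samples.

```python
from itertools import combinations


def ks_two_sample(a, b):
    """Maximum vertical distance between the ECDFs of samples a and b."""
    a_sorted, b_sorted = sorted(a), sorted(b)
    d = 0.0
    # Both ECDFs are step functions, so the maximum occurs at a data point.
    for x in a_sorted + b_sorted:
        fa = sum(v <= x for v in a_sorted) / len(a_sorted)
        fb = sum(v <= x for v in b_sorted) / len(b_sorted)
        d = max(d, abs(fa - fb))
    return d


def exact_permutation_p(a, b):
    """Exact P-value: share of relabellings with a statistic >= observed."""
    pooled = list(a) + list(b)
    d_obs = ks_two_sample(a, b)
    indices = range(len(pooled))
    hits = total = 0
    for chosen in combinations(indices, len(a)):
        chosen_set = set(chosen)
        group_a = [pooled[i] for i in chosen]
        group_b = [pooled[i] for i in indices if i not in chosen_set]
        total += 1
        # Small tolerance guards against floating-point ties.
        if ks_two_sample(group_a, group_b) >= d_obs - 1e-12:
            hits += 1
    return d_obs, hits / total


# Hypothetical samples with no overlap at all.
d3, p3 = exact_permutation_p([1, 2, 3], [10, 11, 12])
d4, p4 = exact_permutation_p([1, 2, 3, 4], [10, 11, 12, 13])
print(d3, p3)
print(d4, round(p4, 4))
```

With three observations per group, even completely separated samples give a P-value of 2/20 = 0.1, so the test cannot reach significance at the 5% level. With four per group the smallest attainable P-value is 2/70, and only complete separation achieves it.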
What the statisticians say
Conover (1999) devotes a full chapter to statistics of the Kolmogorov-Smirnov type with full details on estimation of confidence intervals, but recent developments on improving the power of the test are not covered. Sokal & Rohlf (1995) give an up-to-date account of the Kolmogorov-Smirnov tests, including the recent two-stage δ-adjustment. Sprent (1998) covers both the one- and two-sample tests in Chapter 6. Siegel (1956) introduces the Kolmogorov-Smirnov tests, but does not of course consider the (later) tests by Lilliefors and Anderson-Darling.
Khamis et al. (1992, 2000) propose a modification of the test which improves its power for small to moderate sample sizes. Harter et al. (1984) show that you can allow for differences between observed and expected frequencies before and after each step of the cumulative distribution by subtracting 0.5 from each observed frequency. Lilliefors (1967) showed that the Kolmogorov-Smirnov one-sample test is too conservative if expected frequencies are calculated using parameters estimated from the sample: commonly tabulated (and software) values are only valid for a fully defined distribution. Anderson & Darling (1952) proposed the Anderson-Darling test and Stephens (1974) modified it for use when the distribution is not completely specified.