"It has long been an axiom of mine that the little things are infinitely the most important" (Sherlock Holmes)


The Wilcoxon-Mann-Whitney U-test: Use & misuse

(versus t-test, similarity of distributions, reported measure of location, small samples, tied data)

Statistics courses, especially those for biologists, assume that formulae equal understanding. They teach how to do statistics, but largely ignore what those procedures assume and how their results mislead when those assumptions are unreasonable. The resulting misuse is, shall we say, predictable...

Use and Misuse

The Wilcoxon-Mann-Whitney test is widely used in all disciplines, probably nearly as much as the ubiquitous t-test. Despite its lower power, it is often favoured over the t-test because of the misconception that no assumptions have to be met for the test to be valid. In fact, the basic assumptions of the two tests (namely that both samples are random samples and are mutually independent) are identical. Not surprisingly, therefore, we find similar misuse as with the t-test concerning these aspects. But the biggest problems come where the assumptions do differ from those of the t-test - namely in the distribution of the data.

If the test is to be used to compare arithmetic means, the two distributions must be both symmetrical and identical apart from location. If the test is to be used to compare medians, the two distributions must be identical apart from location. Yet we give a number of examples where the distributions were clearly different, and a significant result was still assumed to indicate a difference in either means or medians. In such situations the test is still valid as a test for dominance of one distribution over another - but few researchers seem to be aware that that is what is being tested. If it is actually medians and/or distributions that are being compared, then reporting the mean and standard error is clearly inappropriate. Although most medical researchers are now aware of this, the practice is still widespread in other disciplines.
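To make the idea of dominance concrete, here is a minimal Python sketch (ours, not taken from any of the cited texts). It computes the U statistic by direct counting of pairs and rescales it to an estimate of P(A > B) + 0.5 P(A = B). The function names and the two tiny data sets are invented purely for illustration.

```python
from statistics import median


def mann_whitney_u(a, b):
    """U statistic for sample a: number of pairs (x, y) with x > y.

    Tied pairs (x == y) contribute one half each.
    """
    u = 0.0
    for x in a:
        for y in b:
            if x > y:
                u += 1.0
            elif x == y:
                u += 0.5
    return u


def dominance_probability(a, b):
    """Estimate of P(A > B) + 0.5 * P(A = B), i.e. U / (n_a * n_b)."""
    return mann_whitney_u(a, b) / (len(a) * len(b))


# Invented data: same median, clearly different shapes.
sample_a = [1, 2, 5, 6, 7]
sample_b = [3, 4, 5, 20, 30]

print(median(sample_a), median(sample_b))         # 5 5
print(dominance_probability(sample_a, sample_b))  # 0.34
```

In this made-up example both samples have a median of 5, yet the dominance estimate is 0.34 rather than 0.5. The statistic is responding to which sample tends to give the larger value, not to the medians as such, which is why a significant result cannot safely be read as a difference in medians when the shapes differ.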

But there is one special situation where both the arithmetic mean and the median/distribution are of interest - namely where the total is of importance. This is because only the arithmetic mean is directly related to the total. We give two examples where this might be the case - costs of care and duration of disturbance to endangered mammals. In this situation the Wilcoxon-Mann-Whitney test may be appropriate to compare distributions, but only a randomization test can adequately compare the arithmetic means.
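A randomization test on the difference between arithmetic means can be sketched in a few lines of Python. This is only an outline under stated assumptions: it uses Monte Carlo reshuffling rather than full enumeration, a two-sided comparison on the absolute difference in means, and the common "add one" adjustment to the P-value. The function name, the default number of reshuffles and the cost figures are all hypothetical.

```python
import random
from statistics import fmean


def randomization_test_means(a, b, n_resamples=9999, seed=0):
    """Two-sided Monte Carlo randomization test for a difference in means.

    Observations are repeatedly reallocated at random to two groups of the
    original sizes, and the absolute difference in means is recomputed.
    """
    rng = random.Random(seed)      # fixed seed so the sketch is repeatable
    pooled = list(a) + list(b)
    n_a = len(a)
    observed = abs(fmean(a) - fmean(b))

    hits = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)
        diff = abs(fmean(pooled[:n_a]) - fmean(pooled[n_a:]))
        # Small tolerance guards against floating-point rounding.
        if diff >= observed - 1e-12:
            hits += 1

    # The observed allocation counts as one of the possible allocations.
    return (hits + 1) / (n_resamples + 1)


# Hypothetical costs of care for two treatments (right-skewed, as costs often are).
costs_a = [120, 135, 150, 160, 180, 210, 260, 400, 950, 2300]
costs_b = [110, 125, 130, 145, 150, 170, 190, 220, 300, 480]
print(randomization_test_means(costs_a, costs_b))
```

Note that the reference distribution is built under the assumption that observations are exchangeable between groups, so the quality of the random allocation or sampling still matters here just as it does for the rank test.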

Other misuses relate to the problems of small samples and tied data. There is an exact test for small samples, but this is only valid if there are few or no ties within or between groups. The test is sometimes applied to heavily tied data, which makes the test too liberal in reporting differences. We also find examples where use of the normal approximation is borderline for the sample sizes used. A confidence interval is sometimes attached to the median difference, but this is rarely done except in medical research. This is a pity, because estimation of the magnitude of the treatment effect should be a primary component of any statistical analysis.
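For readers who want to attach an estimate and interval to the shift between two samples, the following Python sketch computes the Hodges-Lehmann estimate (the median of all pairwise differences) together with an approximate confidence interval. The interval uses the usual large-sample normal approximation to pick which ordered differences form the limits, so it is an assumption-laden illustration: it presumes the two distributions differ only in location, and it is not a substitute for exact tables when samples are small. The function name and the data are ours.

```python
from math import floor, sqrt
from statistics import NormalDist, median


def hodges_lehmann(a, b, confidence=0.95):
    """Hodges-Lehmann shift estimate (a minus b) with an approximate interval.

    Returns (estimate, lower, upper). The limits are order statistics of the
    pairwise differences, chosen with a large-sample normal approximation.
    """
    m, n = len(a), len(b)
    diffs = sorted(x - y for x in a for y in b)   # all m * n differences
    estimate = median(diffs)

    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    # Rank (1-based) of the lower limit; rounded down to be conservative.
    c = floor(m * n / 2 - z * sqrt(m * n * (m + n + 1) / 12))
    if c < 1:
        raise ValueError("samples too small for the normal approximation")

    lower = diffs[c - 1]          # c-th smallest difference
    upper = diffs[m * n - c]      # c-th largest difference
    return estimate, lower, upper


# Invented data: group_a is group_b shifted upwards by exactly 5 units.
group_b = [3, 7, 8, 12, 15, 19, 22, 30]
group_a = [y + 5 for y in group_b]
print(hodges_lehmann(group_a, group_b))
```

Because the second sample here is an exact copy of the first shifted by 5, the estimate recovers that shift; with real data the width of the interval is usually the more informative quantity.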

We give a few examples of another test, the median test, although it is now rarely used. This is a pity because it is less susceptible to differences in distributions, and hence more readily interpretable in terms of differences between medians. Surprisingly, the few examples we have included make the rather obvious error of reporting arithmetic means and standard errors. This can be wildly misleading if distributions are skewed - as the name suggests, the median test compares ... medians!
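The mechanics of the median test are simple enough to sketch directly. The Python outline below pools the two samples, counts how many observations in each group lie above the pooled median, and applies an ordinary chi-square test (one degree of freedom, no continuity correction) to the resulting 2 x 2 table. Those choices, the convention of counting values equal to the pooled median as "not above", and the function name are assumptions of this sketch; other conventions exist. The group medians are returned as well, since those are the summary the test is about.

```python
from math import erfc, sqrt
from statistics import median


def median_test(a, b):
    """Two-sample median test via a 2 x 2 chi-square (no continuity correction).

    Returns (chi2, p_value, median_a, median_b).
    """
    grand_median = median(list(a) + list(b))

    # 2 x 2 table: rows = above / not above the pooled median, columns = groups.
    a_above = sum(1 for x in a if x > grand_median)
    b_above = sum(1 for y in b if y > grand_median)
    a_below = len(a) - a_above
    b_below = len(b) - b_above

    n_total = len(a) + len(b)
    row_above = a_above + b_above
    row_below = a_below + b_below
    if row_above == 0 or row_below == 0:
        raise ValueError("all values fall on one side of the pooled median")

    chi2 = (n_total * (a_above * b_below - b_above * a_below) ** 2
            / (row_above * row_below * len(a) * len(b)))
    # Upper tail of chi-square with 1 d.f.: P(Z**2 > chi2) = erfc(sqrt(chi2 / 2)).
    p_value = erfc(sqrt(chi2 / 2))
    return chi2, p_value, median(a), median(b)


# Invented, completely separated samples.
print(median_test([1, 2, 3, 4, 5, 6, 7, 8], [9, 10, 11, 12, 13, 14, 15, 16]))
```

With small samples the chi-square approximation is itself doubtful, and an exact test on the same table would be the more defensible choice.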

What the statisticians say

Conover (1999) covers the Wilcoxon-Mann-Whitney test under the name Mann-Whitney test, although he only gives details on the (Wilcoxon) sum-of-ranks statistic. Tabulated values of W for nA, nB up to 20 are given. Sprent (1998) provides a comprehensive treatment of rank tests of location for two independent samples in Chapter 4. Hollander & Wolfe (1973) and Siegel (1956) both cover the Wilcoxon-Mann-Whitney test in their texts on nonparametric statistics.

Okeh (2009) reviews the application of the Wilcoxon-Mann-Whitney U-test in medical research studies. Zimmerman (2003) warns that the large-sample Wilcoxon-Mann-Whitney test can be strongly influenced by unequal variances of treatment groups even when sample sizes are equal. Hart (2001) notes that the Wilcoxon-Mann-Whitney test is a test of both location and shape - not, as most researchers consider it, a test of difference between medians. Freidlin & Gastwirth (2000) advocate the retirement of the median test from general use, to be replaced by the Wilcoxon-Mann-Whitney and related tests. Potvin & Roff (1993) propose more general use of non-parametric tests in ecological research, but Johnson (1995) and Smith (1995) take issue with this point of view.

Wilcoxon (1945) first proposed the test for equal sample sizes, and then Mann & Whitney (1947) extended the test to cover different sample sizes. Hodges & Lehmann (1963) discuss the properties of the Hodges-Lehmann estimator of median difference.

Wikipedia (2008) provides a comprehensive account of the Wilcoxon-Mann-Whitney test with a useful section on its relation to other tests; the median test and the Hodges-Lehmann estimator are also covered. Various universities give tables of the Wilcoxon rank-sum statistic online.