Why bother with randomization?

Under both the null and alternative hypotheses, parametric ANOVA assumes the errors within groups are randomly and independently selected from identically distributed normal (and therefore infinite) parent populations; it also performs best when group sizes do not vary. The Kruskal-Wallis (KW) test, whilst more or less equivalent to ANOVA upon rank-transformed data, only removes the normality assumption - and at a price.

Firstly, its inferences are conditional upon the observed (null) error distribution (of ranks). Secondly, whilst the KW test does not assume errors are normal, it does assume the data are continuous (untied), that errors are independently distributed, and that the observed differences in their distribution within groups have arisen by simple chance.

Furthermore, because testing a mean rank is equivalent to testing a median, the KW test employs a different measure of location from parametric ANOVA - but it is only a test of the difference between medians where error variances are homogeneous. Like the Wilcoxon-Mann-Whitney (WMW) test from which it was derived, the KW test is really a test of dominance - a fact that some people find hard to interpret. And, like parametric ANOVA, if the variances of ranks within groups are very unequal, the KW test (although conditional) is sensitive to more than just differences in location - particularly when group sizes are unequal.

An alternative and increasingly popular way to relax some parametric ANOVA assumptions is to use simulation models to estimate the distribution of F under H0, rather than relying on a mathematically tractable (but arbitrary) distribution function. An additional advantage of simulation models is that they enable you to use statistics such as trimmed means. Whatever your statistic of choice, a simulation model is used to generate repeated sets of values under the null hypothesis, each of which is subjected to ANOVA - and the resulting distribution of F-statistics (or their trimmed-mean analogues) is used instead of the parametric F-distribution. Of these simulation models, the most popular is the permutation test.

  1. Because it was written for clarity and generality, the R code used in these examples is comparatively slow.
  2. All these examples obtain the F-statistic using a straightforward one-way fixed effects analysis of variance.
  3. Because the most appropriate simulation depends upon both the study design and type of data, there are many possible simulation models - even for a simple 1-way ANOVA. So we only describe a small subset here.



ANOVA by simple permutation

How to do it

First, calculate the ANOVA table and F-statistic in the usual way. To estimate the distribution of F under Hnil, the observations are pooled and randomly assigned (without replacement) to K groups of the predefined sample sizes (n1, n2, ... nK). An ANOVA is then performed and the F-value recorded. To estimate how F varies due to random assignment alone, this process is repeated a sufficient number of times (perhaps 5000). For a conventional 1-tailed test, the P-value is the proportion of those F-values that equal or exceed the observed F-value.
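The procedure above can be sketched as follows. The page's own examples use R (not reproduced here); this Python/NumPy version is an illustrative sketch, and the function names and the demonstration data are our own, not the page's.

```python
import numpy as np

rng = np.random.default_rng(1)

def f_statistic(groups):
    """One-way fixed-effects ANOVA F-statistic for a list of 1-D arrays."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = np.concatenate(groups).mean()
    ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
    ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

def permutation_anova(groups, n_perm=5000):
    """Simple permutation test: pool the observations, reassign them
    without replacement to groups of the original sizes, and record F."""
    split_points = np.cumsum([len(g) for g in groups])[:-1]
    pooled = np.concatenate(groups)
    f_obs = f_statistic(groups)
    f_perm = np.empty(n_perm)
    for i in range(n_perm):
        f_perm[i] = f_statistic(np.split(rng.permutation(pooled), split_points))
    conventional = np.mean(f_perm >= f_obs)           # P(F* >= F)
    mid = np.mean(f_perm > f_obs) + 0.5 * np.mean(f_perm == f_obs)
    return f_obs, conventional, mid
```

Both the conventional and mid P-values are returned, since heavily tied data can make the simulated F-values noticeably discrete; with 5000 replicates the returned P-values will vary slightly from run to run.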

Worked Example 1:

Our first worked example uses the same data from Cobo et al. (1998) that we used for the KW test.

This gave us a mid P-value of 0.0021, and a conventional P-value of 0.0022, compared with P=0.00895 from a KW test. You will of course get a slightly different P-value each time you run the test - although with 5000 replicates the variation will not be very great unless P is small. Recall the KW test uses a large-sample chi-squared approximation, and assumes the data lack ties (these data are tied). A log-transformation stabilizes the variances reasonably well. Applying a permutation test to the ln-transformed data gave a mid P-value of 0.0023, and a conventional P-value of 0.0024.


Worked example 2

Our second worked example uses the same data from Johnston et al. (2001) that we used for the one-way ANOVA.

This gave us a mid P-value of 0.0015, and a conventional P-value of 0.0016, whereas a proportion P=0.001329 of the parametric F-distribution exceeded the observed value of F. Applying the same 3 tests to the untransformed data gave P=0.0009, 0.001 and 0.001283 respectively.

Assumptions and properties

Although permutation tests are commonly described as distribution-free, they are only slightly more so than the KW test, in that permutation tests do not assume the data are continuous (un-tied). ANOVA by permutation still assumes errors are identically distributed and independently assigned. However, unlike the KW test, transforming the data can influence the results of ANOVA F-tests using permutation. In addition, because observations are pooled assuming Hnil is true, no allowance is made for treatment effects. Therefore, when group means differ (under HA), the effect of treatment upon group location will be incorporated into that nil model - increasing the spread of the simulated F-distribution and inflating the resulting P-value. Conversely, when groups have similar means but their error distributions differ in other respects (such as variance or skew), this assumption can bias your inference.

Permutation models do not assume data represent an infinite normal population but are conditional upon the entirely finite set of values you have observed. In other words, ANOVA by permutation ignores what happens if you were to repeat your study and observe different values. Again, whilst ANOVA by permutation does not assume data are continuous, small sets of heavily-tied data will cause F to be noticeably discrete - making conventional inference conservative compared to mid-P.



ANOVA by simple bootstrap

How to do it

Again we use the data from Johnston et al. (2001). Proceed exactly as for a simple permutation test, but sample with replacement.
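The one change from the permutation version is sampling the pooled observations with replacement - in the page's R code this corresponds to setting replace=TRUE in sample(). A minimal Python/NumPy sketch, with illustrative function names of our own:

```python
import numpy as np

rng = np.random.default_rng(1)

def f_stat(groups):
    """One-way fixed-effects ANOVA F-statistic (NumPy only)."""
    k, n = len(groups), sum(map(len, groups))
    grand = np.concatenate(groups).mean()
    ssb = sum(len(g) * (g.mean() - grand) ** 2 for g in groups)
    ssw = sum(((g - g.mean()) ** 2).sum() for g in groups)
    return (ssb / (k - 1)) / (ssw / (n - k))

def bootstrap_anova(groups, n_boot=5000):
    """Simple bootstrap ANOVA: resample the pooled data WITH replacement,
    then split each resample into groups of the original sizes."""
    split_points = np.cumsum([len(g) for g in groups])[:-1]
    pooled = np.concatenate(groups)
    f_obs = f_stat(groups)
    f_boot = np.empty(n_boot)
    for i in range(n_boot):
        resample = rng.choice(pooled, pooled.size, replace=True)
        f_boot[i] = f_stat(np.split(resample, split_points))
    conventional = np.mean(f_boot >= f_obs)
    mid = np.mean(f_boot > f_obs) + 0.5 * np.mean(f_boot == f_obs)
    return conventional, mid
```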

Worked example 3

Again this gave us a mid P-value of 0.0015, and a conventional P-value of 0.0016.

Assumptions and properties

ANOVA by simple bootstrap uses almost the same method and assumptions as ANOVA by simple permutation, but does differ in three respects:

  • For small data sets, or heavily tied values, the bootstrap distribution of F tends to be smoother - making conventional inference less conservative.
  • Simple bootstrap ANOVA, by resampling, estimates the distribution of the F-statistic under Hnil - that is, when samples are taken from an infinite population that has the same distribution as the pooled data.
  • Therefore in principle this test is not conditional or, to be more realistic, it is less conditional than a permutation test.

These points aside, permutation and bootstrap distributions of F are not always as different as you might expect - which is why our examples yielded such similar P-values.

A practical problem with bootstrapping is that, because the distribution of the pooled observations is unavoidably discrete, it may not provide a very good model of the population from which the observations were drawn. This model population is determined by your observed values, so analyses of small studies are vulnerable to aberrant values. Also, the range of resampled values will seldom be as great as that of the parent population - which, like a permutation test, restricts the range of possible P-values when testing small tied groups. Whilst jittering can reduce both of these problems, it has yet to come into common use.
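Jittering is straightforward to implement: each bootstrapped value receives a small random perturbation, smoothing the discrete model population and breaking ties. In the sketch below, the choice of a normal jitter scaled to a fraction of the pooled standard deviation is our illustrative assumption, not a recommendation from the page:

```python
import numpy as np

rng = np.random.default_rng(1)

def jittered_resample(pooled, size, scale_frac=0.1):
    """Bootstrap resample plus normal jitter (a 'smoothed' bootstrap).

    scale_frac sets the jitter s.d. as a fraction of the pooled s.d. -
    an illustrative choice; a suitable jittering distribution should be
    chosen for the data at hand.
    """
    draw = rng.choice(pooled, size, replace=True)
    return draw + rng.normal(0.0, scale_frac * pooled.std(ddof=1), size)
```

Each jittered resample would then be split into groups and analysed exactly as in the simple bootstrap.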

An alternative 'semi-parametric' method is to assume observations represent a known (infinite) frequency distribution, estimate its parameters from your data, then sample that distribution. Whilst semi-parametric bootstrapping can be useful for estimators whose properties are poorly described, the estimates tend to be biased, and your choice of distribution may be criticized as arbitrary. Jittered bootstrapping avoids such estimates and assumptions - and selecting a suitable jittering distribution is generally easier and less controversial.



ANOVA simulation models using residuals

How to do it

Calculate the ANOVA table and F-statistic as usual, and find the difference between each observation and its group mean. Assuming the errors are identically distributed, these differences are pooled, and the pool is either randomly sampled without replacement (permuted) or sampled with replacement (bootstrapped). The observed value of F is then compared with its estimated distribution under H0. Below we again analyze the data from Johnston et al. (2001), using a permutation test on the residuals. To make it into a bootstrap test, change replace=FALSE to replace=TRUE.
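The steps above can be sketched as follows, mirroring the R code's replace=FALSE / replace=TRUE switch with a replace argument (this Python/NumPy version and its function names are our own illustration, not the page's code):

```python
import numpy as np

rng = np.random.default_rng(1)

def f_stat(groups):
    """One-way fixed-effects ANOVA F-statistic (NumPy only)."""
    k, n = len(groups), sum(map(len, groups))
    grand = np.concatenate(groups).mean()
    ssb = sum(len(g) * (g.mean() - grand) ** 2 for g in groups)
    ssw = sum(((g - g.mean()) ** 2).sum() for g in groups)
    return (ssb / (k - 1)) / (ssw / (n - k))

def residual_anova(groups, n_sim=5000, replace=False):
    """Permute (replace=False) or bootstrap (replace=True) the pooled
    residuals from each group mean, so H0 is true by construction."""
    split_points = np.cumsum([len(g) for g in groups])[:-1]
    residuals = np.concatenate([g - g.mean() for g in groups])
    f_obs = f_stat(groups)
    f_sim = np.empty(n_sim)
    for i in range(n_sim):
        draw = (rng.choice(residuals, residuals.size, replace=True)
                if replace else rng.permutation(residuals))
        f_sim[i] = f_stat(np.split(draw, split_points))
    conventional = np.mean(f_sim >= f_obs)
    mid = np.mean(f_sim > f_obs) + 0.5 * np.mean(f_sim == f_obs)
    return conventional, mid
```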


Permuting the errors gave us a mid P-value of 0.0007, and a conventional P-value of 0.0008, whereas applying a permutation test to the ln-transformed observations gave a mid P-value of 0.0023, and a conventional P-value of 0.0024.

Assumptions and properties

One attraction of permuting or bootstrapping deviations from group means, rather than the observations themselves, is that this enables you to ensure the null hypothesis is true, rather than merely assuming it is so - and hence reduces P-values where group means are observed to differ. An important disadvantage is that, by shifting distributions in this way, highly implausible error structures are sometimes created. Another problem is that, whilst lack of power is worst when testing small samples, this is also where bootstrapping is most vulnerable to data artefacts.


On the other hand, when errors cannot be assumed to be identically distributed, simulation models can be devised that, instead of merging the errors, keep each group of errors separate - and sample each group with replacement. In doing so, notice that you are assuming the null hypothesis refers to group locations per se, and that a simple additive model is plausible. Then again, because this type of bootstrap uses a number of (small) model populations, rather than a single combined one, such an approach only works well when applied to large groups.
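A group-wise version of this idea can be sketched as below: each group's centred errors are resampled with replacement, separately, so no identical-errors assumption is needed (again a Python/NumPy illustration of our own, not the page's R code):

```python
import numpy as np

rng = np.random.default_rng(1)

def f_stat(groups):
    """One-way fixed-effects ANOVA F-statistic (NumPy only)."""
    k, n = len(groups), sum(map(len, groups))
    grand = np.concatenate(groups).mean()
    ssb = sum(len(g) * (g.mean() - grand) ** 2 for g in groups)
    ssw = sum(((g - g.mean()) ** 2).sum() for g in groups)
    return (ssb / (k - 1)) / (ssw / (n - k))

def groupwise_bootstrap_anova(groups, n_boot=5000):
    """Keep each group's (centred) errors separate and bootstrap each
    group independently - errors need not be identically distributed."""
    errors = [g - g.mean() for g in groups]   # H0 true by construction
    f_obs = f_stat(groups)
    f_boot = np.empty(n_boot)
    for i in range(n_boot):
        sim = [rng.choice(e, e.size, replace=True) for e in errors]
        f_boot[i] = f_stat(sim)
    return np.mean(f_boot >= f_obs)
```

Because each of these small model populations must stand in for its own parent distribution, the sketch inherits the large-group caveat noted above.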


In conclusion, whilst simulation models offer considerable potential, this is seldom realized outside of statistical journals. Among biologists their application is dictated more by precedent and ease of application than by appropriateness to study design and data. One reason for this is that non-statisticians are understandably reluctant to employ analyses whose reasoning and properties are novel and little understood.