|"It has long been an axiom of mine that the little things are infinitely the most important" |
Permutation tests

On this page:
- What is a permutation (randomization) test?
- Principles & properties of a permutation test
- Why not use an ordinary 'parametric' test?
- 3 steps to consider: defining the selection process; describing the statistic's distribution; comparing observed & expected
- How much power does this test have?
- A parametric model
- Parametric versus Exact
- Properties & Assumptions
- How to do it
- How many resample-statistics are needed?
What is a permutation (randomization) test?
In principle, this test is identical to any other null hypothesis significance test, aside from 2 important points:
For this last reason, permutation tests could be described as 'non-parametric' - but, more often, it is because their errors are not assumed to represent a normally-distributed population.
Permutation tests are also said to be 'exact'. One, not very good, reason for this is that (in principle) it is possible to calculate the exact probability of obtaining your test statistic's observed value, and of every more deviant value. (We consider the difference between 'exact' and 'approximate' tests in the section 'Parametric versus Exact', below.)
If that rather academic description leaves you struggling, a simple example may be more revealing. Or, if you are only interested in the properties and assumptions of randomization tests, you may prefer to skip ahead to that section.
Principles and properties of a permutation test

(Consider this very finite population)
The farmer selects 5 of her calves and divides them into two groups. One group contains 2, the other group has 3 animals. The first group she protects using impregnated ear tags. To the other group she applies an oil-based formulation of the same insecticide, an artificial pyrethroid. The farmer pens each calf separately, and estimates the number of stable-flies in each pen by hanging sticky flypaper above the animal. After 3 days she removes her flypapers and counts the number of stable-flies on each.
Here are her results (plus our notes).
With a little arithmetic, she produces these statistics.
From which she concludes, "that pour-on might be a bit messy, but it is more than twice as good as the ear tags - I got half as many flies round those calves".
The farmer is rather pleased with her small experiment - especially since these treatments were free samples. She feels such a clear difference is pretty strong evidence the pour-on gets rid of more of the flies. But, before she invests any money, she would like to be sure her results are not just some sort of statistical fluke.
When you have thought, read on.
Why not use an ordinary 'parametric' test?
There are obviously a number of problems with this farmer's experiment.
For example -
In other words, it is hard to argue her observations are random samples of some larger population, or populations, whose parameters you can therefore estimate. Nor, from the available evidence, can we reasonably say whether her results represent a normal, lognormal, or some more exotic probability distribution. Given which it seems reasonable to accept that, because we can infer nothing useful about any 'wider' population, we must exclude it from our analysis.
Allowing for all of these problems, the most critical is her small sample size. Laying that aside for the moment, the remaining problems could be viewed as technical. Therefore let us see whether we can get around them.
A Permutation, or Randomization Test
But, in order to speak more generally, let us use θ to denote any statistic of interest.
Let us begin by considering our null hypothesis.
We can express this another way.
If her treatments were equally effective, or ineffective, it would not matter which flypaper was monitoring which treatment. In other words, on average at least, she might as well re-label her treatments at random - provided that each flypaper ended up with a label.
For example -
Nevertheless, if our null hypothesis is correct, and there really was no difference between the effect of these treatments, any randomly selected observations must provide an equally valid description of the observations they are drawn from. In which case, any sample statistic summarizing those observations should be equally representative.
If this null hypothesis really is true then, provided the sizes of her treatment groups remain the same, any variation amongst these observed ('sample') statistics must be entirely due to chance. Only if the results our farmer observed are very unusual might we conclude they are unlikely to arise from this model, and consider rejecting our null hypothesis - that her treatments produced the same result.
In other words, we have confined the population of observations to the 5 that our farmer has actually made. By definition therefore, we are not trying to estimate parameters of some larger population of observations - or of their statistics. Nevertheless, we would be wise to explicitly state what else this model implies.
Given the way our farmer has labelled her results, we are assuming no animal was assigned to both groups - in other words, no animal received both an ear tag and pour-on. Similarly, we assume the farmer used just one flypaper for each calf, and that she recorded the result from each flypaper once and only once.
When stated in these terms, we cannot realistically argue these treatments were assigned independently of one another. The probability of a calf being allocated to a group varied according to how the other calves had already been assigned. For example, if by chance the farmer randomly assigned the first three animals to the 'pour-on' group, the remaining two animals could only be assigned to the 'tags' group. Nor, once an animal had been selected from the 'un-assigned' animals, was it replaced or reassigned.
We must therefore treat our observations in the same way. In other words we should select each observation from our model population without replacement. The table below shows every possible way in which 5 observations could be divided into two groups, comprising 2 and 3 observations. All the possible outcomes of our experiment are described as the sample space. It turns out there are just 10 alternative ways of dividing her 5 observations into two groups (of 2 'tagged' & 3 'pour-on'). Provided the farmer assigned her treatments at random, each of these events is equally probable, even if the labels are assigned after her experiment is done.
To make things clearer, we have shaded observations assigned to tagged animals green. Observations assigned to pour-on treated animals are shaded turquoise. We have ranked these results according to their test statistic.
In order to work out the probability this model would produce any given result, including our farmer's observed result, we need the probability of observing each of the 10 possible results listed above. Each of those 10 results has 12 possible orders in which the treatments might be assigned (2! × 3! = 12), so each of them is equally likely to occur by chance - giving 120 equally likely permutations in all.
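Under this null model the whole analysis amounts to listing every way of relabelling the five observations. A minimal sketch in Python (the fly counts here are hypothetical stand-ins, since the farmer's table is not reproduced above):

```python
from itertools import combinations

# Hypothetical fly counts for the 5 calves - stand-ins for the
# farmer's actual table, used purely for illustration.
counts = [22, 27, 8, 11, 9]
labels = range(len(counts))

# Every way of choosing which 2 of the 5 calves get ear tags;
# the remaining 3 receive the pour-on.
splits = list(combinations(labels, 2))
print(len(splits))  # C(5,2) = 10 equally likely assignments

for tagged in splits:
    poured = [i for i in labels if i not in tagged]
    mean_tag = sum(counts[i] for i in tagged) / 2
    mean_pour = sum(counts[i] for i in poured) / 3
    # the test statistic: ratio of mean catches, tagged / pour-on
    print(tagged, round(mean_tag / mean_pour, 2))
```

Each of the 10 splits printed above is one row of the sample-space table.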
In other words, provided we restrict our inference to just her 5 observations, these 10 outcomes make up the entire sample space - and each has a probability of 1/10.
Therefore, when analysing a permutation test, it is hard to justify fitting the lognormal distribution - shown in grey on the accompanying graph.
Now we have some idea of how these results might be expected to vary, let us compare the 'exact' distribution predicted by our 'null' model, with the result our farmer actually observed.
Only two of the 10 possible test statistics (or 24 of the 120 permutations) are as extreme as our farmer's result (2.45 times as many flies round tagged calves). No test statistic was more extreme than her result.
A conventional 1-tailed test (which assumes a smooth distribution of θ) requires less than 5 percent of this population of statistics to be greater than or equal to the result she obtained. In this case 24 of the 120 possible permutations are at least as extreme as the one she achieved, so there is a 20% probability (P=0.2) of finding the pour-on was this much more effective if in fact there were no difference.
If, however, the farmer only intended to find which treatment was best, we should consider both tails of the distribution. Simply doubling the 1-tailed P-value gives us a 2-tailed value of P=0.4.
Given this distribution is discrete, you could justifiably argue a mid-P-value is more appropriate.
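All three P-values follow directly from the counts quoted above; a quick check, with the mid-P convention (ties at the observed value counted at half weight) included:

```python
# Counts taken from the text: 120 equally likely permutations,
# 24 of which give a statistic as extreme as the observed one,
# and none more extreme.
n_total = 120
n_as_extreme = 24       # permutations tied with the observed statistic
n_more_extreme = 0      # permutations exceeding it

p_one_tailed = (n_more_extreme + n_as_extreme) / n_total
p_two_tailed = 2 * p_one_tailed
# mid-P counts only half the probability of the observed (tied) value
p_mid = (n_more_extreme + n_as_extreme / 2) / n_total

print(p_one_tailed, p_two_tailed, p_mid)  # 0.2 0.4 0.1
```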
How much power does this test have?
Around this point you, like our farmer, may be wondering just how big a difference we would need to obtain before we could conclude that one of her treatments had a significant effect.
Consider this more extreme example:
Here the tags are 'obviously' very effective indeed. Instead of a 2.45-fold difference, we have observed nearly a thousand-fold reduction. In this instance, of the ten possible samples from these observations, just one produces such an extreme result. Given that the sample space is still only 10 equally likely outcomes, the smallest possible 1-tailed P-value is 0.1.
Given such a large treatment effect, you may find this a rather puzzling result.
One way of explaining this paradox would be to say 'the problem is that a permutation test is nonparametric'. Unfortunately, whilst this explanation might silence the credulous, it does nothing to increase your understanding.
If we define power as the probability of rejecting the null hypothesis when it is false, the problem becomes more obvious. Very simply, our permutation test assumes the null hypothesis is true - and, because our inference applies to just these 5 observations, we have defined its sample space accordingly. The net result is that, even if the true difference between treated and untreated animals were utterly colossal, given a total of just 5 observations a 2-tailed permutation test cannot show the treatment effect is significant (at P<0.05).
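This floor on the attainable P-value is easy to compute: when the observed split is the unique most extreme one, the smallest 1-tailed P is one divided by the number of possible splits. A sketch (the function name is ours):

```python
from math import comb

# Smallest possible 1-tailed P-value for a permutation test that
# divides n1 + n2 observations into groups of n1 and n2; attained
# when the observed split is the single most extreme outcome.
def min_one_tailed_p(n1, n2):
    return 1 / comb(n1 + n2, n1)

print(min_one_tailed_p(2, 3))  # 0.1, so the 2-tailed minimum is 0.2
print(min_one_tailed_p(5, 5))  # under 0.004 with 10 observations
```

So with groups of 2 and 3 no result, however extreme, can reach P<0.05 two-tailed; a few more animals per group would change that.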
From this you might conclude that permutation tests must lack power. In fact the converse is true.
Many 'non-parametric' tests are not as powerful as their parametric equivalent because the rank transformation loses information. Randomization tests (and similar exact tests) are generally considered to be very powerful because they are able to make use of all the information in your observations.
If you are used to parametric tests you may find these facts hard to accept.
A popular gambit, in this situation, is to forget about the finer points of statistical etiquette - and simply use a different test. Since we wish to compare their statistical models, let us do just that.
A parametric model
In essence, a parametric model is constructed and tested as follows - albeit not always in this order.
Given that the statistic of interest is the ratio between means, and that these insect catches are liable to be skewed, we log-transformed our farmer's 5 Stomoxys catches. Assuming the null hypothesis is true, and differences between means of transformed data are approximately normal, their studentized difference might be expected to obey the t-distribution.
Unfortunately, a (2-tailed) test of the equal-variance t-statistic was non-significant, at P=0.17 - and if we assume our samples represent populations with unequal variances, P=0.66.
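For comparison, the two t-statistics themselves can be computed directly. The counts below are hypothetical stand-ins for the farmer's data, so they will not reproduce the P-values quoted:

```python
import math
from statistics import mean, variance

# Hypothetical stand-ins for the farmer's Stomoxys counts,
# log-transformed as described in the text.
tags = [math.log(x) for x in [22, 27]]
pour = [math.log(x) for x in [8, 11, 9]]

n1, n2 = len(tags), len(pour)
m1, m2 = mean(tags), mean(pour)
v1, v2 = variance(tags), variance(pour)

# Equal-variance (pooled) t-statistic
sp2 = ((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)
t_pooled = (m1 - m2) / math.sqrt(sp2 * (1 / n1 + 1 / n2))

# Unequal-variance (Welch) t-statistic
t_welch = (m1 - m2) / math.sqrt(v1 / n1 + v2 / n2)

print(round(t_pooled, 3), round(t_welch, 3))
```

Converting either statistic to a P-value requires the t-distribution (with n1+n2−2 degrees of freedom for the pooled form) - which is exactly the parametric assumption the permutation test avoids.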
- None of which is of much solace to our farmer.
We next applied the same transformation and tests to our second, rather more extreme, set of results from above. This time the equal-variance t-statistic tested highly significant, at a 2-tailed value of P=0.00056 - and the unequal-variance statistic slightly less so, at P=0.0025.
Given these log-transformed observations yielded an F-ratio of 1.041, you might conclude the equal-variance t-statistic was quite defensible - and that a ratio of 0.0009 was very unlikely to occur by chance.
Since (on this occasion) the parametric and exact models have yielded such divergent inferences, let us see why their differing assumptions result in this discrepancy.
Parametric versus Exact
Once we cut away the mathematics, there are three fundamental differences between our (exact) permutation test model and that of an ordinary (parametric) t-test.
Given their assumptions are so different, it is scarcely surprising our two models are producing quite different conclusions. Nevertheless, where the assumptions of both tests are not too outrageously violated, they result in very similar inferences. Otherwise, given the points above, we can make the following observations.
In principle, because the exact distribution represents all the information available, provided your model is reasonable, that distribution is the most powerful measure of your observed statistic.
Although you require less information to estimate the parameters of a known distribution, than to describe an unknown one, you still need a clear idea which known distribution (if any) your results represent.
Resampling data can mimic the effect of intrinsic variation (within a given set of values) upon their summary statistic's value. Lacking any better information about how observations 'should' be distributed, resampling allows you to make the best use of the information at hand. Resampling methods do this by assuming that the variation of your sample statistic results from random selection among the available observations, as represented within your sample. But when those values cannot reasonably be interchanged, resampling is, at very best, an approximation - indeed, if some assignments are more likely than others, random resampling can be a highly misleading approximation.
Resampling allows you considerable scope in choosing which sample statistic to test. Although resampling techniques are computationally intensive, they have minimal assumptions and minimal mathematics. As a result, they are much easier to understand, and allow you to concentrate upon what you are testing, rather than how to do it.
Surprisingly, whilst resampling methods sometimes make assumptions about how your statistic is distributed, they do not make any assumptions about the distribution of whatever population your sample might, or might not represent. To avoid making such assumptions, you must have enough information.
Here we describe two methods of resampling observations: bootstrapping and randomization (permutation). Which of these is most appropriate to your data depends upon how your observations were made.
Because of the number of available permutations, or 'sample space', available from even a moderate sample, a complete evaluation is generally not possible. The precision of a resampling test, or confidence limit, depends upon:
If you have a very small number of observations it may be worth calculating the entire sample space by working through every possible combination of observations. Unfortunately, even for quite moderate samples this is impractical because the number of permutations becomes excessive.
For 2 samples, containing n1 and n2 observations, the potential sample space comprises (n1+n2)!/(n1!n2!) equally likely combinations.
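This binomial coefficient grows very quickly, which is why complete enumeration soon becomes impractical. A quick check of the figures used on this page:

```python
from math import comb

# Size of the sample space: the number of distinct ways of dividing
# n1 + n2 observations into two groups of n1 and n2.
def sample_space(n1, n2):
    return comb(n1 + n2, n1)

print(sample_space(2, 3))    # the farmer's design: 10
print(sample_space(5, 5))    # 10 items split equally: 252
print(sample_space(15, 15))  # 30 items: over 155 million
```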
To do this you need a computer, and either:
Whichever of these you use, the underlying logic is fairly similar.
How many resample-statistics are needed?
Some authors suggest magic numbers of randomizations for arithmetic convenience, so you do not have to interpolate to obtain the Neyman-Pearson P=0.05. That point aside, the critical issue is - given you are only sampling the null distribution of test-statistics - how accurately do you need to estimate the probability distribution of that statistic? Ideally, of course, you would prefer complete accuracy. But to achieve that, you would need to calculate every possible tail-end value, how many ways each could occur, and how many ways every possible outcome might occur. Since that is usually only possible for the most extreme values, you need to know how many randomizations are required to provide a reasonably accurate P-value.
Notice also that, unlike a parametric test, it is generally unreasonable to assume your probability function is 'known', and estimate its parameters from your randomization distribution. Other considerations aside, moderately sized populations can yield some rather peculiar distributions.
Monte Carlo models assume that, by the process of random selection/allocation, the distribution of your Monte Carlo statistics approximates their exact population distribution - on average, at least. In other words, if your exact model has a probability P of yielding a result less than some number, X, then that is the probability of your Monte Carlo model producing it (assuming both models have the same assumptions) - and in this sense, a Monte Carlo test is also 'exact'.
For example, since there are a mere 252 ways in which, ignoring order, 10 unique items can be divided into two equal groups, we were able to work them all out. The graphs below allow you to compare the entire exact frequency distribution of two statistics with the same distributions as approximated by Monte Carlo simulation - where observations were randomly assigned (without replacement).
As you might expect, the more randomizations you perform, the better an approximation to the population frequency distribution you can achieve.
There are some obvious constraints to resampling a set of observations to estimate how a statistic might vary. For example, Monte Carlo estimates of your statistic tend to underestimate the range (max.−min.) of the population being sampled. Similarly, the smaller the proportion of the population that falls within any given class, the more variable will be your Monte Carlo estimate of that proportion - and the more Monte Carlo statistics that are calculated, the more reliable that estimate is likely to be. As a result, the tails of a statistic's distribution are the most difficult to estimate.
In plain English, the smaller the P-value you are interested in, the more randomizations you need for a reliable result.
For example, consider the situation if we classified our farmer's 15
All of which is very encouraging, except that we can expect the estimated P-value to vary each time we perform this test, and (as yet) we have not indicated how great this variation might be - in other words, how many randomizations you need to perform to obtain a reliable P-value.
One obvious answer to this problem is to repeatedly test her observed result, say 500 times, and see how much the resulting P-values vary. In order to discover how this variation is related to the number of randomizations, we can repeat that procedure for several different numbers of randomizations.
The second graph illustrates the fact that randomization test P-values vary binomially. Moderate P-values approach a (bounded) normal distribution, but very small P-values can be highly skewed.
For general purposes therefore, most people use between 1000 and 10,000 randomizations.
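A minimal Monte Carlo version of the farmer's test might look like the sketch below (the counts are hypothetical stand-ins, and for simplicity the statistic is the difference between group means rather than their ratio):

```python
import random

# Monte Carlo permutation test sketch: difference between group means.
# 'n_rand' random relabellings approximate the exact null distribution.
def perm_test(a, b, n_rand=10_000, seed=1):
    rng = random.Random(seed)
    pooled = a + b
    observed = sum(a) / len(a) - sum(b) / len(b)
    hits = 0
    for _ in range(n_rand):
        rng.shuffle(pooled)                 # relabel without replacement
        stat = (sum(pooled[:len(a)]) / len(a)
                - sum(pooled[len(a):]) / len(b))
        if stat >= observed:
            hits += 1
    return hits / n_rand  # estimated 1-tailed P-value

# Hypothetical counts: 2 'tagged' calves versus 3 'pour-on' calves.
p = perm_test([22, 27], [8, 11, 9])
print(p)
```

With only C(5,2)=10 possible splits, complete enumeration would of course be preferable here; the Monte Carlo machinery earns its keep when the sample space is too large to list.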
Let us begin by reminding ourselves that, unless you compute exact P-values using the probability of obtaining all possible tail-end values, your (permutation test) P-value is only an estimate of that exact value. To distinguish between these two P-values, let us say p is the proportion of all possible test results whose ranks are more extreme than your observed result, and p̂ is your estimate of p.
If your samples are large, these estimates of p are approximately normally distributed about p.
Given which, about 95% of the P-values you estimate will lie in this range: p ± 1.96 √[p(1−p)/N], where N is the number of randomizations.
A little arithmetic reveals that, to estimate a P-value near 0.05 to within ±0.01, you need roughly 1800 randomizations - and proportionately more for smaller P-values.
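That arithmetic can be packaged as a small helper (the function name, and the use of the 1.96 normal quantile for a 95% interval, are our choices):

```python
import math

# How many randomizations N are needed so that about 95% of Monte
# Carlo P-value estimates fall within +/- 'margin' of the true value p?
# Based on the binomial standard error sqrt(p(1-p)/N).
def n_randomizations(p, margin, z=1.96):
    return math.ceil(z * z * p * (1 - p) / margin ** 2)

print(n_randomizations(0.05, 0.01))   # near P=0.05: roughly 1800
print(n_randomizations(0.01, 0.002))  # smaller P needs far more
```

This is why the smaller the P-value you care about, the more randomizations you need.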
For 2-tailed tests, assuming the resample-statistic is symmetrically distributed, each tail is assumed to have half that probability; otherwise you have to estimate the deviation in each tail separately. (For example, if you are testing a difference between means of +312 mg, then to test the left-hand tail of an asymmetrical distribution you would presumably see what proportion were less than or equal to −312 mg. In practice, such comparisons are not always meaningful.)