In order to assess the divergence of your observed statistic (for generality, let us call that statistic θ), permutation tests assume all your observations represent a single (nil) population, or that their residuals represent a single null population.
Having pooled these values, the variation of θ under H0 is estimated by
- randomly reassigning values (sampling without replacement) to treatment groups of fixed size, then recalculating θ,
- and repeating this sufficiently many times (perhaps 5000) - as sketched in the code below.
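Here is a minimal sketch of that resampling loop in Python. The function name and data layout are our own invention, and θ here is taken to be the difference between two group means - any other statistic could be substituted:

```python
import numpy as np

rng = np.random.default_rng(0)

def permutation_p(x, y, n_perm=5000):
    """One-sided permutation P: how often does randomly reallocating the
    pooled values to two groups of the original sizes (sampling without
    replacement) give a difference between means at least as large as
    the one observed?"""
    observed = x.mean() - y.mean()
    pooled = np.concatenate([x, y])
    n = len(x)
    count = 0
    for _ in range(n_perm):
        shuffled = rng.permutation(pooled)   # one random reassignment
        if shuffled[:n].mean() - shuffled[n:].mean() >= observed:
            count += 1
    return count / n_perm
```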
Given that procedure, it may not be immediately obvious how you might use the permutation test model to attach confidence limits to θ.
If you were not thinking very clearly you might assume that all you need to do is to combine your observations into a single (nil) population then, by repeatedly permuting those values and calculating θ, estimate the 95% range of θ.
Whilst outwardly attractive, the range so calculated is not a confidence interval of your observed value of θ. It only provides an estimate of θ's 95% theoretical range on the condition that the null hypothesis is true. Nor would that range tell you anything useful about θ under the alternative hypothesis - because it simply ignores that possibility. Worse still, large treatment effects would be treated as random error, with predictably misleading consequences...
Under certain circumstances it is, however, possible to obtain a P-value function via test-inversion - by applying permutation tests to 'shifted' data.
To clarify how this may work, let us consider a simple example and two everyday summary statistics.
A larger experiment
You may recall the farmer's pour-on/tags comparison we subjected to a permutation test in Unit 5.
Trying to analyse the result of too few observations is a frustrating affair. As we shall see, however, useful numbers of observations bring their own problems. For example, let us assume that, encouraged by her initial 'suggestive' result, our farmer decides to apply her treatments to a slightly larger experimental population.
For the sake of argument, let us assume she selected 15 calves for her second experiment. Following our advice, she randomly divides them into three groups of 5 animals.
- One group she treats with a 'Pour-on' insecticide formulation.
- One group she applies impregnated ear Tags to.
- One group she leaves Untreated.
Each animal is then randomly assigned to a calf pen, above which a flypaper is hung; after 2 weeks the flypapers are removed.
Having, somewhat laboriously, identified and counted her catches, our farmer obtains the following results.
Once again, it looks very much as if the pour-on was most effective at reducing the number of Stomoxys, although there remains the question of how often we would expect such a result to arise if this conclusion is incorrect.
If we accept that both insecticide treatments are effective, the question is which is the more effective. Assuming that the statistic of interest is the ratio of catches, or the log of that ratio, a permutation test of the difference between log means, d, found that 11.41% of randomizations gave a difference at least as great - whereas testing the ratio of catches, r, yielded a (very similar) one-sided mid-P value of 12.02%.
Let us estimate confidence intervals for r and d.
Although we test r & d assuming all their observations were part of the same population, the reason we measured them in the first place was to estimate their true values, μr & μd.
Since r & d are (approximately) equivalent statistics, we are assuming that our farmer's pour-on killed or repelled proportionally more Stomoxys than her ear tags. In other words, if we ignore other sources of variation, we assume the only effect of her pour-on is to reduce the catch around tagged animals by r times - or to reduce the log (tag) catch by d. In which case, we can easily remove this difference, either by dividing all the tag catches by r, or by subtracting d from the individual log (tag) catches.
The means of these modified results would therefore have a difference of 0 and a ratio of 1.
Given that r & d are our best available estimates of μr & μd, to construct confidence intervals by test-inversion we need to modify our results so as to impose a known difference, or a known ratio, between our two samples. A simple way of obtaining a known difference between means is to add the same amount, D, to all the observations in one of our samples.
To obtain a known ratio we multiply the observations in one sample by some constant, R.
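For concreteness, here is a minimal sketch of both modifications in Python. The catch vectors are hypothetical stand-ins, not the farmer's actual counts, and we take r to be the ratio of mean catches and d the difference between mean log catches:

```python
import numpy as np

# Hypothetical catches (illustrative values only - not the farmer's data)
pour_on = np.array([3.0, 5.0, 2.0, 4.0, 6.0])
tag     = np.array([8.0, 12.0, 7.0, 10.0, 9.0])

r = tag.mean() / pour_on.mean()                   # observed ratio of mean catches
d = np.log(tag).mean() - np.log(pour_on).mean()   # observed difference between mean log catches

tag_scaled      = tag / r            # ratio of means is now exactly 1
log_tag_shifted = np.log(tag) - d    # difference between mean log catches is now 0

# Multiplying by any other R, or subtracting any other D, imposes
# whatever ratio or difference we choose to test.
```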
Notice that, although these modified results become our model's parameters, they are not estimates of the population parameters - we are merely playing a game of 'what if?' If these model parameters are unrealistic, the fault is entirely our own. Whatever the true μr & μd actually are, some of the results of testing modified tag catches are easy enough to predict.
- Where R = r or D = d, the resulting ratio between modified 'tag' and unmodified 'pour-on' catches will be 1, which we would expect to be a very typical member of its population of randomization statistics, with a one-sided P of around 0.5.
- Where R ≠ r or D ≠ d, the difference between log (modified) tag and log pour-on catches of Stomoxys will deviate from zero, and the greater this deviation, the smaller the P value we would obtain from a permutation test.
However, if we find less than a (2-tailed α =) 5% probability of obtaining our modified results from their pooled population, then R or D lie outside the 95% confidence interval of r or d - otherwise they lie within it.
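Putting these pieces together, here is a minimal sketch of the test-inversion scan for d in Python. The catch data are again hypothetical, and the two-sided mid-P, the ±1.5 trial range, and the grid of 21 values of D are illustrative choices of ours, not prescriptions:

```python
import numpy as np

rng = np.random.default_rng(1)

def perm_p_two_sided(x, y, n_perm=5000):
    """Two-sided mid-P permutation test of the difference between means."""
    observed = abs(x.mean() - y.mean())
    pooled = np.concatenate([x, y])
    n = len(x)
    greater = equal = 0
    for _ in range(n_perm):
        shuffled = rng.permutation(pooled)
        stat = abs(shuffled[:n].mean() - shuffled[n:].mean())
        if stat > observed:
            greater += 1
        elif stat == observed:
            equal += 1
    return (greater + 0.5 * equal) / n_perm

# Hypothetical log catches (illustrative values only)
log_tag  = np.log(np.array([8.0, 12.0, 7.0, 10.0, 9.0]))
log_pour = np.log(np.array([3.0, 5.0, 2.0, 4.0, 6.0]))

d = log_tag.mean() - log_pour.mean()

# Scan 21 trial values of D around d; D belongs to the 95% interval
# wherever the permutation test cannot reject at alpha = 0.05.
accepted = [D for D in np.linspace(d - 1.5, d + 1.5, 21)
            if perm_p_two_sided(log_tag - D, log_pour) >= 0.05]

print(f"Approximate 95% interval for d: {min(accepted):.2f} to {max(accepted):.2f}")
```

A finer grid of D values, or interpolation of the resulting P-value function, would of course locate the limits more precisely.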
The graph set below shows the result of testing 21 possible values of R & D, and our estimated 95% confidence intervals.
{Fig. 4}
Because we are modifying the samples as well as their combined population, we are actually estimating confidence limits about d = 0 - as indicated by the red + above. In other words, we are assuming that the width of our confidence limits is unrelated to their location - and therefore that the variance of our statistic does not depend upon its true location. This would not be the case if we were estimating confidence limits for the difference between these means, rather than the difference between log means, as we have done here.
Notice also that, unlike a parametric test, although all of these tests assume the null hypothesis holds, the population of statistics we are comparing our observed result with is not assumed to have a mean of either zero or one. Instead, by modifying our data, we are setting our own parametric value (in this case D or R). Nor does any of this imply we are sampling an infinite population. Our experimental population remained just 10 observations - even though we were estimating how closely our observed result was likely to resemble its true value, simply by varying the parameter of interest in our samples and observing how often so divergent a result would occur by chance - all else being equal.
Remember, the only population of observations these permutation tests refer to is the very finite collection we have actually observed. Any extrapolation to a wider population is non-statistical, and requires you to make due allowance for the various biases in selecting your experimental subjects.
