What is a permutation (randomization) test?

In principle, this test is identical to any other null hypothesis significance test, aside from 2 important points:

  1. The test statistic's null distribution is obtained by repeatedly calculating the test statistic following random reassignment of values to treatment groups (or to sample groups).
    • Under the nil hypothesis the values so randomized are the original observations - this is its most popular form.
    • Under the null (or nil) hypothesis the original data are modified to remove any treatment effect prior to random reassignment - but randomizing residuals can produce implausible outcomes.
    • The permutations (random assignments or randomizations) are equivalent to sampling these values without replacement.
  2. Since only a finite number of values are permuted, and a finite number of test statistics are calculated, the resulting null distribution is unavoidably discrete.
    • P-values are the relative rank of the test statistic's observed value within that null distribution.

For this last reason, permutation tests could be described as 'non-parametric', but more often it is because errors are not assumed to represent a normally-distributed population. Again, because their null distribution depends upon your observations, rather than their 'parent' population, these tests are 'conditional' upon your observed values. In other words, the null distribution used by a permutation test is only a minuscule subset of the null distribution you would obtain if you calculated those statistics from repeated random samples of that parent population.

Permutation tests are also said to be 'exact'. One, not very good, reason for this is that (in principle) it is possible to calculate the exact probability of obtaining your test statistic's observed value, and of every more deviant value. (We consider the difference between 'exact' and 'approximate' tests in Unit 4.) A better justification for using a permutation test is when its workings match your experimental design. Where that is untrue, a permutation test is sometimes justified as being simpler, 'better-behaved', or more robust than alternatives such as bootstrapping (where values are resampled with replacement) - but it would be misleading to describe such a permutation test as 'exact'.

If that rather academic description leaves you struggling, a simple example may be more revealing. Or, if you are only interested in the properties and assumptions of randomization tests, or how to do them, see below.

 

Principles and properties of a permutation test

(Consider this very finite population)

Imagine the following trial has been performed by a farmer who wants to compare two insecticide treatments to reduce 'stable fly' (Stomoxys) problems in her calf-rearing pens.

The farmer selects 5 of her calves and divides them into two groups. One group contains 2, the other group has 3 animals. The first group she protects using impregnated ear tags. To the other group she applies an oil-based formulation of the same insecticide, an artificial pyrethroid. The farmer pens each calf separately, and estimates the number of stable-flies in each pen by hanging sticky flypaper above the animal. After 3 days she removes her flypapers and counts the number of stable-flies on each.

 

Here are her results (plus our notes).

Flypaper number                  1       2       3         4         5
Number of stable-flies
caught ( Y )                    30      50       7        12        30
Treatment                      tags    tags   pour-on   pour-on   pour-on
Total ( ΣY )                        80 (tags)           49 (pour-on)

With a little arithmetic, she produces these statistics.

Treatment     Total caught    Number of calves treated    Average flies per calf
tags               80                    2                       40      ( T )
pour-on            49                    3                       16.33   ( P )
Ratio (tags : pour-on)    ⇒    2.45    ( T / P )

 

From which she concludes, "that pour-on might be a bit messy, but it is more than twice as good as the ear tags - I got half as many flies round those calves".

  The farmer is rather pleased with her small experiment - especially since these treatments were free samples. She feels such a clear difference is pretty strong evidence the pour-on gets rid of more of the flies. But, before she invests any money, she would like to be sure her results are not just some sort of statistical fluke.

    What do you think ?

 

When you have thought, read on.

 

 

Why not use an ordinary 'parametric' test?

There are obviously a number of problems with this farmer's experiment.

For example -

  • She has few observations, and uneven sample sizes.
  • Any variation between calves, or their pens, will be included in her results. A crossover design might have been better but, unfortunately, the pour-on formulation is oily, so cannot easily be removed (from calves or pens).
  • She has no untreated 'control' animals. Nor has she kept the two treatment groups isolated from one another, so some calves may have benefited from both treatments.
  • It is most unlikely she selected these calves at random from the available population. She either bred them from her own stock, or bought the best calves available for the price she could afford.
  • Her test statistic (the ratio of the average catches, T / P) is unlikely to be normally distributed. Stomoxys catches are usually skewed, so are ratios. And with so few data, not only are you unlikely to show these observations differ significantly from normal, you will also find it difficult (if not impossible) to defend any normalising transformation you choose.

    Between them, these objections appear to invalidate the more conventional parametric models.

In other words, it is hard to argue her observations are random samples of some larger population, or populations, whose parameters you can therefore estimate. Nor, from the available evidence, can we reasonably say whether her results represent a normal, lognormal, or some more exotic probability distribution. Given which it seems reasonable to accept that, because we can infer nothing useful about any 'wider' population, we must exclude it from our analysis.

Allowing for all of these problems,

  1. Can we give this farmer any useful advice?
  2. Can we defend our conclusions?

 

Realistically, the most critical problem is her small sample size. Laying that aside for the moment, the other problems could be viewed as technical. Therefore let us see whether we can get around them.

 

A Permutation, or Randomization Test

There are three steps we need to consider.

  1.  We need a statistic to test
      In this particular case our farmer is interested in the ratio of her two sample means, T / P.
      But in order to speak more generally, let us use θ to denote any statistic of interest.
  2.  We need to decide how θ might be expected to vary.
      This depends upon our underlying biological and statistical model.
  3.  We need to compare the statistic she observed with the expected distribution - predicted by our model.
      How we do this depends upon the hypothesis under test.

 

  •  Defining the model population

    Let us begin by considering our null hypothesis.

    From what the farmer has told us, she used a 'completely randomized' experimental design. Our null hypothesis is that any difference she observed was merely due to chance.

    We can express this another way.

    Imagine for a moment, what would happen if all the labels had fallen off our farmer's flypapers.

     

    If her treatments were equally effective, or ineffective, it would not matter which flypaper was monitoring which treatment. In other words, on average at least, she might as well re-label her treatments at random - provided that each flypaper ended up with a label.

    Of course it is unlikely that a group of observations, selected (or assigned) at random, will give the same results the farmer achieved.

    For example -
    Random assignment of results to treatment group

    Group number                          1                2
    Number of stable-flies caught         12, 30           7, 30, 50
    Total                                 42               87
    Mean                                  Ȳ1 = 21          Ȳ2 = 29
    Ratio ( Ȳ1 / Ȳ2 )    ⇒    0.72

    Nevertheless, if our null hypothesis is correct, and there really was no difference between the effect of these treatments, any randomly selected observations must provide an equally valid description of the observations they are drawn from. In which case, any sample statistic summarizing those observations should be equally valid.

    If this null hypothesis really is true, provided the sizes of her treatment groups remain the same, any variation amongst these observed ('sample') statistics must be entirely due to chance. Only if the results our farmer has observed are very unusual might we conclude they are unlikely to arise from this group, and consider rejecting our null hypothesis - that her treatments produced the same result.

    Assuming the farmer allocated the treatments at random, we can estimate the distribution of her sample statistic under the null hypothesis. Using this distribution we can work out the likelihood of finding her result, without assuming her observations represent (or approximate) any hypothetical population with a 'known' distribution. We simply assume that her treatments were assigned at random.
     

    In other words, we have confined the population of observations to the 5 that our farmer has actually made. By definition therefore, we are not trying to estimate parameters of some larger population of observations - or of their statistics. Nevertheless, we would be wise to explicitly state what else this model implies.  

  •  Defining the selection process

    Given the way our farmer has labelled her results, we are assuming no animal was assigned to both groups. - In other words, no animal received both an ear tag and pour-on. Similarly, we assume the farmer used just one flypaper for each calf, and that she recorded the result from each flypaper once and only once.

    When stated in these terms, we cannot realistically argue these treatments were assigned independently of one another. The probability of a calf being allocated to a group varied according to how the other calves had already been assigned. For example, if by chance the farmer randomly assigned the first three animals to the 'pour-on' group, the remaining two animals could only be assigned to the 'tags' group. - Nor, once an animal had been selected from the 'un-assigned' animals, was it replaced or reassigned.

    We must therefore treat our observations in the same way. In other words we should select each observation from our model population without replacement. The table below shows every possible way in which 5 observations could be divided into two groups, comprising 2 and 3 observations. All the possible outcomes of our experiment are described as the sample space. It turns out there are just 10 alternative ways of dividing her 5 observations into two groups (of 2 'tagged' & 3 'pour-on'). Provided the farmer assigned her treatments at random, each of these events is equally probable, even if the results are assigned after her experiment is done.

    To make things clearer, in the table below we have listed the observations assigned to tagged animals separately from those assigned to pour-on treated animals, and we have ranked these results according to their test statistic.

 
All possible ways of assigning each observation a treatment - ignoring the order of assignment

Result #    Catches assigned    Catches assigned    Total     Total       Ratio of the
            to tags             to pour-on          (tags)    (pour-on)   average catches
 1  ⇒       7, 12               30, 30, 50            19        110          0.26
 2  ⇒       7, 30               12, 30, 50            37         92          0.60
 3  ⇒       7, 30               12, 30, 50            37         92          0.60
 4  ⇒       12, 30              7, 30, 50             42         87          0.72
 5  ⇒       12, 30              7, 30, 50             42         87          0.72
 6  ⇒       7, 50               12, 30, 30            57         72          1.19
 7  ⇒       30, 30              7, 12, 50             60         69          1.30
 8  ⇒       12, 50              7, 30, 30             62         67          1.39
 9  ⇒       30, 50              7, 12, 30             80         49          2.45
10  ⇒       30, 50              7, 12, 30             80         49          2.45

(Results 2 & 3, 4 & 5, and 9 & 10 are duplicates because two of the flypapers - numbers 1 and 5 - each caught 30 flies.)

 

  •  Describing the statistic's distribution

    In order to work out the probability this model would produce any given result, including our farmer's observed result, we need to work out the probability of observing each of the 10 possible results listed above. Each of the 10 possible results can arise from the same number of assignment orders (2! × 3! = 12), so each is equally likely to occur by chance. Simple arithmetic tells us there are 5! = 120 different orders (or permutations) in which our two treatments can be assigned to these 5 observations - this population of statistics comprises all the points in the sample space.

    In other words, provided we restrict our inference to just her 5 observations, these (10 × 12 =) 120 possible results comprise the entire population of T / P values that is possible - if this model is correct. The scatterplot below shows how the relative (sequential) rank of each possible result is related to its location. Given that some results are identical, to ease comparison, identical results share the same color coding in that plot.
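    For readers who like to check such things by machine, here is a minimal sketch in R (a language suggested towards the end of this page) that enumerates the 10 distinct assignments above; the catches are taken from the tables above.

        # enumerate every distinct way of assigning 2 of the 5 catches to 'tags'
        catches <- c(30, 50, 7, 12, 30)               # flypapers 1 to 5
        tag.pairs <- combn(5, 2)                      # the 10 possible pairs of flypapers
        ratios <- apply(tag.pairs, 2, function(i)
            mean(catches[i]) / mean(catches[-i]))     # T / P for each assignment
        sort(round(ratios, 2))
        # 0.26 0.60 0.60 0.72 0.72 1.19 1.30 1.39 2.45 2.45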

{Fig. 1 - The relative (sequential) rank of each of the 120 possible results plotted against its value, with a fitted lognormal distribution shown in grey}

    This population distribution differs from the distributions assumed by parametric models, such as the z-test (or the t-test, described in Unit 8), in two important ways.

    1. This population of statistics is discrete instead of smooth.
    2. Its range is strictly bounded (because 0.26 ≤ [T / P] ≤ 2.45), unlike a normal distribution (where −∞ < θ < +∞) or a lognormal distribution (where 0 < θ < +∞).

    Therefore, when analysing a permutation test, it is hard to justify fitting the lognormal distribution - shown in grey on the graph above. As we shall see, the properties of the permutation distribution can affect what we can realistically infer.

     

  •  Comparing observed and expected

    Now we have some idea of how these results might be expected to vary, let us compare the 'exact' distribution predicted by our 'null' model, with the result our farmer actually observed.

    Only two of the 10 possible test statistics (or 24 of the 120 permutations) are as extreme as our farmer's result (2.45 times as many flies round tagged calves). No test statistic was more extreme than her result.

    A conventional 1-tailed test (which assumes a smooth distribution of θ) requires less than 5 percent of this population of statistics to be greater than or equal to the result she obtained. In this case 24 of the 120 possible results are as large as the one she achieved, so there is a 20% probability (P=0.2) of the pour-on appearing at least this much more effective if, in fact, there were no difference between the treatments.

    If however the farmer only intended to find out which treatment was best, we should consider both tails of the distribution. Simply doubling the 1-tailed P-value gives us a P-value of 0.4.

     

    Given this distribution is discrete, you could justifiably argue a mid-P-value is more appropriate - one which counts only half of the results that exactly equal hers, giving (0 + 24/2)/120 = 0.1. Equivalently, when expressed as a mean rank, 90% of the population had a rank less than the statistic she observed. Even so, a 1-tailed P-value of 0.1 might, most generously, be described as 'suggestive' rather than 'significant'.

    Either way, in plain English, it is very possible her results were a statistical fluke.
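    Continuing the R sketch above, the same 10 ratios yield these P-values directly (rounding guards against floating-point comparison problems):

        r   <- round(ratios, 4)
        obs <- round(mean(catches[1:2]) / mean(catches[-(1:2)]), 4)   # her observed ratio, 2.449
        mean(r >= obs)                                  # 1-tailed P:      2/10 = 0.2
        2 * mean(r >= obs)                              # 2-tailed P:      0.4
        mean(r > obs) + 0.5 * mean(r == obs)            # 1-tailed mid-P:  0.1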

 

How much power does this test have?

Around this point you, like our farmer, may be wondering just how big a difference we would need to obtain before we could conclude that one of her treatments did have a significant effect.

Consider this more extreme example:

Treatment                        Tags      Pour-on
Number of stable-flies caught    2, 1      1358, 2481, 960
Total                            3         4799
Ratio (tags / pour-on)    ⇒    0.0009

Here the tags are 'obviously' very effective indeed. Instead of a 2.45 times difference, we have observed nearly a one thousand-fold reduction. In this instance, of the ten possible samples from these observations, just one produces such an extreme result. Given that the sample space is still (5!=) 120, and there are 12 permutations which can produce each result, by a conventional 1-tailed test this result is not significant because P = 12/120 = 0.1 or 10%. Irritatingly, the 1-tailed mid P value of 6/120, or 0.05, just fails to be significant at the 5% level.
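The same R sketch, applied to this second set of catches, reproduces those figures (this time we count the lower tail, because a small ratio favours the tags):

    catches2 <- c(2, 1, 1358, 2481, 960)            # tags = first two values
    ratios2  <- apply(combn(5, 2), 2, function(i)
        mean(catches2[i]) / mean(catches2[-i]))
    obs2 <- mean(catches2[1:2]) / mean(catches2[-(1:2)])        # 0.0009
    mean(ratios2 <= obs2)                                       # 1-tailed P:     1/10 = 0.1
    mean(ratios2 < obs2) + 0.5 * mean(ratios2 == obs2)          # 1-tailed mid-P: 0.05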

 

Given such a large treatment effect, you may find this a rather puzzling result.

One way of explaining this paradox would be to say "the problem is that a permutation test is nonparametric". Unfortunately, whilst this explanation might silence the credulous, it does nothing to increase your understanding.

If we define power as the probability of rejecting the null hypothesis when it is false, the problem becomes more obvious. Very simply, our permutation test assumes the null hypothesis is true - and, because our inference applies to just these 5 observations, we have defined its sample space accordingly. With 5 observations split 2 and 3, the smallest possible 1-tailed P-value is 12/120 = 0.1, so the smallest 2-tailed P-value is 0.2. The net result is that, even if the true difference between treated and untreated animals was utterly colossal, a 2-tailed permutation test of just 5 observations cannot show this treatment effect is significant (at P<0.05).

 

From this you might conclude that permutation tests must lack power. In fact the converse is true.

Many 'non-parametric' tests are not as powerful as their parametric equivalents because the rank transformation loses information. Randomization tests (and similar exact tests) are generally considered to be very powerful because they are able to make use of all the information in your observations.

  • Part of the reason this permutation test is running into difficulties is because our few observations do not permit the sample space to be subdivided into small enough bits.
  • It is also running into difficulty because, by combining observations under the nil hypothesis, that model ALSO includes any treatment variation.
      One way of increasing the permutation test's power is to remove any observed treatment effect BEFORE pooling the 'residual' values under H0 (see the sketch below).
Nevertheless, where the assumptions of the parametric and permutation models are compatible, they can yield similar inferences.
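Purely as an illustration of that idea (our own sketch, not a prescription), the observed treatment effect can be removed by centring each group on the grand mean before permuting; to avoid dividing by means near zero we use the difference between means, rather than their ratio, as the test statistic.

    tags   <- c(30, 50)
    pouron <- c(7, 12, 30)
    obs.diff <- mean(tags) - mean(pouron)                  # the observed effect, +23.67
    # remove each group's own effect, so the pooled values carry no treatment variation
    adj <- c(tags - mean(tags), pouron - mean(pouron)) + mean(c(tags, pouron))
    diffs <- replicate(10000, {
        i <- sample(5, 2)                                  # which 2 values get 'tags'
        mean(adj[i]) - mean(adj[-i])
    })
    mean(abs(diffs) >= abs(obs.diff))                      # 2-tailed Monte Carlo P-value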

 

If you are used to parametric tests you may find these facts hard to accept.

A popular gambit, in this situation, is to forget about the finer points of statistical etiquette - and simply use a different test. Since we wish to compare their statistical models, let us do just that.

 

A parametric model

In essence, a parametric model is constructed and tested as follows - albeit not always in this order.

     
  1. Summarise your observations using some sort of statistic.
    For sake of argument, let us call this sample statistic 'theta', or θ, for short. Generally, but not always, we are only interested in this statistic as an estimate of some parameter 'Theta', or Θ.

  2. Decide what value of Θ you would either expect, or wish to compare θ with.
    This may be defined by your null hypothesis or alternate hypothesis but, implicitly or explicitly, this step cannot be avoided. For a confidence interval, your best estimate of Θ is θ.

  3. Given the above, decide how θ would be expected to vary about Θ.
    In order to do this you nearly always have to assume that either your observations, or treatments, were selected at random - or that random selection could have operated. If this assumption is violated the rest of your model is invalid, and its predictions are liable to be biased or simply meaningless.

  4. Compare your observed θ with the frequency distribution you have estimated from steps 2 and 3.
    Depending upon how this is done, you may end up with a probability or a relative probability. But, in either case, you assess how far θ deviates from Θ - the greater the deviation, the less likely it is to have arisen by chance alone.

  5. For a P-value plot, or to attach confidence limits to θ, steps 1 to 4 may be evaluated iteratively - or, depending upon the assumptions that are made, they may be integrated into a single short-cut formula.
    For example, if θ is assumed to be normal then, when studentized, θ' may be assumed to follow a t-distribution, and tested as such.
     

 

Let us therefore apply the standard textbook approach to this problem, and use a t-test.

Given that the statistic of interest is the ratio between means, and that these insect catches are liable to be skewed, we log-transformed our farmer's 5 Stomoxys catches. Assuming the null hypothesis is true, and differences between means of transformed data are approximately normal, their studentized difference might be expected to obey the t-distribution.

    If you are uncertain about that line of reasoning, we introduce the t-distribution in Unit 6, but Unit 8 explores the reasoning and behaviour of 't-statistics' in some depth.

Unfortunately, a (2-tailed) test of the equal-variance t-statistic was non-significant, at P=0.17 - and if we assume our samples represent unequal population variances, P=0.66
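If you wish to check the equal-variance result, a one-line sketch in R reproduces it from the log-transformed catches given above:

    t.test(log(c(30, 50)), log(c(7, 12, 30)), var.equal = TRUE)
    # t ≈ 1.8 on 3 df, giving a 2-tailed P of about 0.17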

- None of which is of much solace to our farmer.

 

We next applied the same transformation and tests to our second, rather more extreme, set of results from above. This time the equal variance t-statistic tested highly significant, at a 2-tailed value of P=0.00056 - and the unequal variance statistic slightly less so, at P=0.0025

Given that these log transformed observations yielded an F-ratio of 1.041, you might conclude the equal variance t-statistic was quite defensible - and that a ratio of 0.0009 was very unlikely to occur by chance.

Since (on this occasion) the parametric and exact models have yielded such divergent inferences, let us see why their differing assumptions result in this discrepancy.

 

Parametric versus Exact

Once we cut away the mathematics, there are three fundamental differences between our (exact) permutation test model and that of an ordinary (parametric) t-test.

  1. The size of the population
    • Our permutation test assumes the population is both finite and tangible.
    • The t-test assumed our results represent an infinite, albeit theoretical, population.
  2. How θ can vary.
    • If we are sampling a finite population, our statistic must have a discrete finite (unknown) distribution.
    • A parametric test assumes θ has a continuous known (in this case, normal) distribution, whose variance we estimate from the two sample variances.
  3. How the models behave if the null hypothesis is false.
    • As it stands, our permutation test model makes no allowance for what happens if the null hypothesis is false.
    • The t-statistic performs satisfactorily under the alternate hypothesis - enabling you to estimate confidence limits to your estimate.

Given their assumptions are so different, it is scarcely surprising our two models are producing quite different conclusions. Nevertheless, where the assumptions of both tests are not too outrageously violated, they result in very similar inferences. Otherwise, given the points above, we can make the following observations.

  1. By confining our model to a finite (less theoretical) population, in terms of that population, our inference is strong. A parametric test, which treats these observations as a random sample of an infinite population, is not only very weak - but is unduly conservative.

    If we accept we are sampling a finite population of results, we ought to apply a finite population correction to our t-test's variances - so reducing our t-test's P-value. For two, equally-sized groups of observations, this would approximately halve the sample variances.

  2. Because the t-test assumes θ has a continuous and unbounded (normal) distribution, it also assumes that θ can have very large deviations from Θ - and, as a result, that its sample space can be partitioned into correspondingly minute portions.

    If however, we accept this is an approximation to a discrete distribution, these very small P-values are unrealistic - or, in plain English, they are merely an artefact of our unreasonable assumptions.

  3. A permutation test of observations assumes all your observations represent the same population (under Hnil). So conventional estimates of power, or confidence limits for our observed statistic (which assume HA), appear to be a contradiction in terms.

    Sometimes, but not always, this problem can be addressed by removing the observed treatment effect prior to performing the test.

 

In principle, because the exact distribution represents all the information available, provided your model is reasonable, that distribution is the most powerful measure of your observed statistic.

Although you require less information to estimate the parameters of a known distribution, than to describe an unknown one, you still need a clear idea which known distribution (if any) your results represent.

If you are not certain what distribution your statistic represents, your inference is based upon supposition not information.

 

 

Resampling data can mimic the effect of intrinsic variation (within a given set of values) upon their summary statistic's value. Lacking any better information about how observations 'should' be distributed, resampling allows you to make the best use of the information at hand. Resampling methods do this by assuming that the variation of your sample statistic results from random selection among the available observations as represented within your sample. But when those values cannot reasonably be interchanged resampling is, AT VERY BEST, an approximation - indeed, if some assignments are more likely than others, random resampling can be a highly misleading approximation.

Resampling allows you considerable scope in choosing which sample statistic to test. Although resampling techniques are computationally intensive, they have minimal assumptions, and minimal mathematics. As a result, they are much easier to understand, and allow you to concentrate upon what you are testing, rather than how to do it.

All resampling methods assume:
  1. Treatments were randomly allocated, or observations were randomly selected.
      In practice it is much easier to randomly allocate treatments than to randomly select observations. Therefore the former assumption is considerably easier to satisfy.
  2. Your sample is sufficiently large that resampling can adequately estimate how your statistic is distributed.
      Resampling techniques obtain all their information on how your statistics are distributed from the observations within your sample. For unbiased results you usually require at least 30 observations, and preferably more than 100.

Surprisingly, whilst resampling methods sometimes make assumptions about how your statistic is distributed, they do not make any assumptions about the distribution of whatever population your sample might, or might not represent. To avoid making such assumptions, you must have enough information.

Here we describe two methods of resampling observations, randomization and bootstrapping. Which of these is most appropriate to your data depends upon how your observations were made.

  • Randomization, or resampling without replacement, assumes you are sampling an appreciable proportion of a finite population, so that each observation has a marked influence on the outcome of the next. Randomization is particularly useful in analyzing experimental results.
  • Bootstrapping, or resampling with replacement, assumes that the result of one observation does not affect the results of those you make subsequently. In other words it assumes your sampled population is infinite, or effectively so. For example, bootstrapping would be appropriate for surveys, where only a small proportion of the population is sampled.
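In R, for instance, the distinction is simply whether values are drawn with or without replacement (the values here are the farmer's catches, purely for illustration):

    y <- c(30, 50, 7, 12, 30)
    sample(y)                    # randomization: re-order the values without replacement
    sample(y, replace = TRUE)    # bootstrapping: resample the values with replacement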

Because of the number of permutations (the 'sample space') available from even a moderate sample, a complete evaluation is generally not possible. The precision of a resampling test, or confidence limit, depends upon:

  1. The number of times the test statistic is estimated - not the proportion of the sample space that is evaluated.
  2. How deviant your result is. Statistics near the edge of a distribution require proportionally more resample statistics.
  3. The choice of an appropriate test statistic. A good test statistic is sensitive to your alternate hypothesis, but insensitive to things not of interest under your null hypothesis. (For example, the difference between means is a poor test statistic because it is also sensitive to unequal variances. The unequal-variance t-statistic may be preferable.)
  4. The distribution of your test statistic. One-tailed tests and parametric confidence limits make no assumptions, but two-tailed tests assume your test statistic is distributed symmetrically. Non-parametric confidence limits assume your test statistic is distributed normally, or can be transformed such that it is.

 

 

If you find our example (above) too long-winded and messy, you may find the following instructions more useful.

  1. Select the best statistic to test your null hypothesis.
      Aside from their sensitivity to your alternate hypothesis, and insensitivity to parameters not of interest, when calculating large numbers of statistics you need to minimise your calculations. A number of short-cuts are available where resample-statistics summarise the same number of observations. For example, totals are distributed in the same fashion as means, and root sums of squares are distributed in the same fashion as standard errors. Differences and ratios are quicker if you calculate them from totals, since the number of observations in each group remains constant.
  2. Calculate your sample statistic.
  3. Combine and correct your observations so they match your null-hypothesis.
      Most comparisons assume there is no difference between samples, but comparisons against some expected difference are sometimes of interest. If you wish to test for an expected difference, this has to be applied BEFORE you combine your samples.
  4. Randomly select the same number of observations as the sample under test. (Replace observations only after selecting the entire group.)
      Randomization requires a different algorithm from bootstrapping for this. [See below.] A good, fast random number generator is essential. Many packages and programming languages have pseudo-random number generators. But it is worth making a few careful checks to ensure their values arise with equal frequency when divided into many intervals. A cumulative scatterplot of rank against value is a good quick test (it should be linear).
  5. Calculate their resample-statistic.
  6. Repeat steps 3 to 5 sufficient times.
      How many resample-statistics you need varies according to what you are testing, how you are testing it, and how much precision you require.
  7. Compare your sample statistic with the resulting distribution of resample-statistics.
    • Remember to LOOK at their distribution. Although 1-tailed tests do not assume any particular distribution, 2-tailed tests assume the distribution is symmetrical. But non-parametric confidence limits assume resample-statistics are normally distributed, or can be transformed such that they are. Remember, you can transform either your observations or their statistics.
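As an illustration only (the data, names, and choice of statistic below are ours, not a prescription), the steps above might look like this in R for a two-sample comparison:

    # Steps 1-2: choose a statistic - here the difference between group means -
    #            and calculate it for the observed samples
    sample1 <- c(30, 50)                     # hypothetical data: substitute your own
    sample2 <- c(7, 12, 30)
    n1  <- length(sample1)
    obs <- mean(sample1) - mean(sample2)

    # Step 3: combine the observations under the null hypothesis of no difference
    pooled <- c(sample1, sample2)

    # Steps 4-6: repeatedly reassign the pooled values at random, without
    #            replacement, and recalculate the resample-statistic
    R <- 10000
    resamples <- replicate(R, {
        shuffled <- sample(pooled)           # one random permutation of the pooled values
        mean(shuffled[1:n1]) - mean(shuffled[-(1:n1)])
    })

    # Step 7: compare the observed statistic with the resample distribution
    hist(resamples)                          # LOOK at the distribution first
    mean(abs(resamples) >= abs(obs))         # a 2-tailed Monte Carlo P-value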

If you have a very small number of observations it may be worth calculating the entire sample space by working through every possible combination of observations. Unfortunately, even for quite moderate samples this is impractical because the number of permutations becomes excessive.

For 2 samples, containing n1 and n2 observations, there are (n1 + n2)! / (n1! × n2!) distinct ways of dividing the pooled observations between the two groups. Two samples of 10 observations allow 184,756 possible assignments (or 20!, roughly 2.4 × 10^18, ordered permutations), and the numbers grow explosively with sample size. In practice therefore, it is simpler and easier to randomly sample their sample space.

To do this you need a computer, and either:

  • A package that supports resampling.
      A few do, but they can be infuriating to use because they consider this an 'advanced' technique, and may expect you to have a degree in mathematical statistics!
  • Or a compiler specifically for resampling, and its instruction manual.
      Such as 'Resampling Stats' from The Resampling Project, College of Business and Management, University of Maryland, College Park, MD 20742, 703-522-2713 USA.
  • Or a fast spreadsheet, that allows you to re-order (sort) columns.
      You can permute or select observations using a column of random numbers, but you may find this a somewhat tedious operation.
  • Or a programming language, such as R, and the ability to write simple programmes.
      This is probably the fastest and most flexible method.

Whichever of these you use, the underlying logic is fairly similar.

For resampling without replacement (randomization) under the nil hypothesis:
  1. Combine your observations into a single column or array.
      Enter sample 1 in cells 1 to n1, sample 2 in cells n1+1 to n1+n2, etc. Then always calculate their resample statistics from these groups of cells.
  2. Permute your observations:
    • For spreadsheets:
      Generate a column of random numbers, and resort your observations in that order.
    • For programs:
      Taking each cell of your array in turn (from the first to the last but one), swap its contents with a cell selected at random from the current position to the end of the array - the 'Fisher-Yates' shuffle. (Swapping each cell with any randomly chosen cell, rather than one at or beyond the current position, gives a slightly biased shuffle.)
        You can select the address of the cell to swap with by multiplying a random number between 0 and 1 by the number of cells remaining (from the current position to the end), truncating to an integer, and adding the result to the current position. Most languages round DOWN when truncating floating point values to integers, and some permit a cell address of zero, so check your arithmetic carefully.
  3. Calculate and record your resample statistic from the cell addresses that originally held those samples.
      If you entered sample 1 in cells 1 to n1, calculate their resample statistic from cells 1 to n1. Calculate sample 2 from cells n1+1 to n1+n2, etc.
  4. Repeat steps 1-3 to obtain sufficient resample statistics (see below).
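If your language lacks a built-in shuffle (in R, sample() already provides one), step 2 can be written out explicitly; a sketch of the unbiased swap procedure described above:

    shuffle <- function(x) {
        N <- length(x)
        for (i in seq_len(N - 1)) {
            # choose a cell at random from position i to N (floor() rounds DOWN)
            j <- i + floor(runif(1) * (N - i + 1))
            tmp <- x[i];  x[i] <- x[j];  x[j] <- tmp     # swap the two cells
        }
        x
    }
    shuffle(c(30, 50, 7, 12, 30))    # one random permutation of the pooled catches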

 

How many resample-statistics are needed?

Some authors suggest magic numbers of randomizations for arithmetic convenience, so you do not have to interpolate to obtain the Neyman-Pearson P=0.05. That point aside, the critical issue is - given you are only sampling the null distribution of test-statistics - how accurately do you need to estimate the probability distribution of that statistic? Ideally, of course, you would prefer to have complete accuracy. But to achieve that, you would need to calculate every possible tail end value, how many ways each could occur, and how many ways every possible outcome might occur. Since that is usually only possible for the most extreme possible values, you need to know how many randomizations are required to provide a reasonably accurate P−value.

Notice also that, unlike a parametric test, it is generally unreasonable to assume your probability function is 'known', and estimate its parameters from your randomization distribution. Other considerations aside, moderately sized populations can yield some rather peculiar distributions.

Monte Carlo models assume that, by the process of random selection/allocation, the distribution of your Monte Carlo statistics approximates their exact population distribution - on average, at least. In other words, if your exact model has a probability P of yielding a result less than some number, X, then that is the probability of your Monte Carlo model producing it (assuming both models have the same assumptions) - and in this sense, a Monte Carlo test is also 'exact'.

For example, since there are a mere 252 ways in which, ignoring order, 10 unique items can be divided into two equal groups, we were able to work them all out. The graph set below allows you to compare this entire exact frequency distribution of two statistics, with the same distributions approximated by the same statistics obtained by Monte Carlo simulation - where observations were randomly assigned (without replacement).
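In R, the 252 exact assignments and a Monte Carlo approximation to them can be generated along these lines (the 10 values below are arbitrary, purely for illustration):

    x <- c(3, 5, 7, 8, 12, 15, 18, 21, 25, 30)       # any 10 unique observations
    exact  <- apply(combn(10, 5), 2, function(i) mean(x[i]) - mean(x[-i]))
    length(exact)                                    # 252 possible assignments
    approx <- replicate(1000, { s <- sample(x); mean(s[1:5]) - mean(s[6:10]) })
    # plotting 'exact' and 'approx' side by side shows how closely the
    # Monte Carlo sample approximates the exact distribution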

{Fig. 2 - The exact frequency distributions of two statistics, over all 252 possible assignments, compared with the same distributions approximated by Monte Carlo simulation}

As you might expect, the more randomizations you perform, the better an approximation to the population frequency distribution you can achieve. Nevertheless, because you are only sampling the sample space, this process introduces a certain amount of Monte Carlo variation. Provided the number of simulations is sufficiently large, this is generally negligible - compared to the other sources of error. With modern computing abilities, this is fairly easy to achieve - nevertheless, some purists are deeply upset by Monte Carlo error.

There are some obvious constraints to resampling a set of observations to estimate how a statistic might vary. For example, Monte Carlo estimates of your statistic tend to underestimate the range (max.−min.) of the population being sampled. Similarly, the smaller the proportion of the population that falls within any given class, the more variable will be your Monte Carlo estimate of that proportion - and the more Monte Carlo statistics that are calculated, the more reliable that estimate is likely to be. As a result, the tails of a statistic's distribution are the most difficult to estimate.

In plain English, the smaller the P-value you are interested in, the more randomizations you need for a reliable result.

 

For example, consider the situation if we classified our farmer's 15 results as 'treated' (using either tags or pour-on) or 'control' (using neither). After 4999 random assignments, a permutation test of the unequal variance t-statistic, gave a 1-sided mid-P-value of 0.0041 - equivalent to a 2-sided significance of P<0.01

All of which is very encouraging, except for the fact that we can expect the estimated P-value to vary each time we perform this test, and (as yet) we have not indicated how great this variation might be - in other words, how many randomizations you need to do to obtain a reliable P-value.

One obvious answer to this problem is to repeatedly test her observed result, say 500 times, and see how much the resulting exact P-values vary. In order to discover how this variation is related to the number of randomizations ( R ), and the statistic under test ( θ ), we tested 4 values of θ using R=100, 1000, and 10,000 randomizations. Since the importance of an error in estimating P is directly related to its true P-value, the graph set below shows our observed P-values using a logarithmic scale - and their variation is expressed as a coefficient of variation, rather than the standard deviation of P ( sp ).

{Fig. 3 - Observed P-values (on a logarithmic scale) and their coefficients of variation, for 4 values of θ, using R = 100, 1000 and 10,000 randomizations}

The second graph illustrates the fact that randomization test P-values vary binomially. Moderate P-values approach a (bounded) normal distribution, but very small P-values can be highly skewed.

  • If we assume errors are normal, and our observed 2-tailed P-value of 0.0082 is correct, using 4999 randomizations we would expect our permutation test to yield a P-value between 0.0077 and 0.0087 on 95% of occasions. From which we can argue that this amount of Monte Carlo variation does not materially affect our inference.

  • If however we only performed 99 randomizations, on 70% of occasions we would expect our observed result to be the most extreme within the randomization distribution. Since the lowest 2-tailed mid-P-value you can obtain from a distribution of 100 points is (2 × 0.5/100 =) 0.01, this noticeably affects your inference unless you merely wish to know whether P<0.05

For general purposes therefore, most people use between 1000 and 10,000 randomizations.

 

For large samples

Let us begin by reminding ourselves that, unless you compute exact P-values using the probability of obtaining all possible tail-end values, your (permutation test) P-value is only an estimate of that exact value. To distinguish between these two P-values, let us say p is the proportion of all possible test results whose ranks are more extreme than your observed result, and p̂ is your estimate of p.

If your samples are large, these estimates (p̂) are distributed approximately normally around their mean (p) with a variance of p(1−p)/R - where R is the number of randomly-obtained statistics, not the number of values in your combined sample. (All of which means, in plain English, that p̂ is assumed to have a binomial distribution.)

Given which, about 95% of the P-values you estimate will lie in the range p ± 1.96√{p(1−p)/R}, and 99% of the P-values you estimate will lie in the range p ± 2.58√{p(1−p)/R}.

A little arithmetic reveals:
For p = 0.05 and R = 1,000 that 99% of p̂ lie between 0.032 and 0.068
For p = 0.01 and R = 5,000 that 99% of p̂ lie between 0.0064 and 0.0136
For p = 0.001 and R = 10,000 that 99% of p̂ lie between 0.00018 and 0.0018
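These ranges (and any others you may need) can be reproduced with a couple of lines of R, assuming p̂ is approximately normal:

    mc.range <- function(p, R, z = 2.58)        # z = 2.58 for 99%, 1.96 for 95%
        p + c(-1, 1) * z * sqrt(p * (1 - p) / R)
    mc.range(0.05,  1000)      # 0.032   to 0.068
    mc.range(0.01,  5000)      # 0.0064  to 0.0136
    mc.range(0.001, 10000)     # 0.00018 to 0.0018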

Generally, it is simplest to calculate between five and twenty thousand statistics for all tests - say 10,000 for a 1-tailed test and 20,000 for a 2-tailed test - except for standard non-parametric confidence limits, for which 100 to 200 may suffice.

For 2-tailed tests, assuming the resample-statistic is symmetrically distributed, each tail is assumed to have half that probability. Otherwise you have to estimate the deviation in each tail separately. (For example, if you are testing a difference between means of +312 mg, then to test the left hand tail of an asymmetrical distribution you would presumably see what proportion were less than or equal to −312 mg. In practice, such comparisons are not always meaningful.)