What is a permutation (randomization) test?
In principle, this test is identical to any other null hypothesis significance test, apart from the following important points:
 The test statistic's null distribution is obtained by repeatedly calculating the test statistic following random reassignment of values to treatment groups (or to sample groups).
 Under the nil hypothesis the values so randomized are the original observations - this is its most popular form.
 Under the null (or nil) hypothesis the original data are modified to remove any treatment effect prior to random reassignment - but randomizing residuals can produce implausible outcomes.
 The permutations (random assignments or randomizations) are equivalent to sampling these values without replacement.
 Since only a finite number of values are permuted, and a finite number of test statistics are calculated, the resulting null distribution is unavoidably discrete.
 P-values are the relative rank of the test statistic's observed value within that null distribution.
For this last reason, permutation tests could be described as 'nonparametric', but more often it is because errors are not assumed to represent a normally-distributed population. Again, because their null distribution depends upon your observations, rather than their 'parent' population, these tests are 'conditional' upon your observed values. In other words the null distribution used by a permutation test is only a minuscule subset of the null distribution you would obtain if you calculated those statistics from repeated random samples of that parent population.
Permutation tests are also said to be 'exact'. One, not very good, reason for this is because (in principle) it is possible to calculate the exact probability of obtaining your test statistic's observed value, and of every more deviant value. (We consider the difference between 'exact' and 'approximate' tests in Unit 4.) A better justification for using a permutation test is when its workings match your experimental design. Where that is untrue, a permutation test is sometimes justified as being simpler, 'better-behaved', or more robust, than alternatives such as bootstrapping (where values are resampled with replacement) - but it would be misleading to describe such a permutation test as 'exact'.
If that rather academic description leaves you struggling, a simple example may be more revealing. Or, if you are only interested in the properties and assumptions of randomization tests, or how to do them, see below.
Principles and properties of a permutation test
(Consider this very finite population)
Imagine the following trial has been performed by a farmer who wants to compare two insecticide treatments to reduce 'stable fly' (Stomoxys) problems in her calfrearing pens.
The farmer selects 5 of her calves and divides them into two groups. One group contains 2 animals, the other 3. The first group she protects using impregnated ear tags. To the other group she applies an oil-based formulation of the same insecticide, an artificial pyrethroid. The farmer pens each calf separately, and estimates the number of stable-flies in each pen by hanging sticky flypaper above the animal. After 3 days she removes her flypapers and counts the number of stable-flies on each.
Here are her results (plus our notes).
Flypaper number                        1      2      3        4        5
Number of stable-flies caught (Y)     30     50      7       12       30
Treatment                            tags   tags  pour-on  pour-on  pour-on
Total (ΣY)                               80               49
With a little arithmetic, she produces these statistics: a mean catch of 40 flies per tagged calf, against 16.3 per pour-on calf - a ratio of 2.45.
From which she concludes, "that pour-on might be a bit messy, but it is more than twice as good as the ear tags - I got half as many flies round those calves".
The farmer is rather pleased with her small experiment - especially since these treatments were free samples. She feels such a clear difference is pretty strong evidence the pour-on gets rid of more of the flies. But, before she invests any money, she would like to be sure her results are not just some sort of statistical fluke.
When you have thought, read on.
Why not use an ordinary 'parametric' test?
There are obviously a number of problems with this farmer's experiment.
For example 
In other words, it is hard to argue her observations are random samples of some larger population, or populations, whose parameters you can therefore estimate. Nor, from the available evidence, can we reasonably say whether her results represent a normal, lognormal, or some more exotic probability distribution. Given which it seems reasonable to accept that, because we can infer nothing useful about any 'wider' population, we must exclude it from our analysis.
Allowing for all of these problems,
 Can we give this farmer any useful advice?
 Can we defend our conclusions?
Realistically, the most critical problem is her small sample size. Laying that aside for the moment, the other problems could be viewed as technical. Therefore let us see whether we can get around them.
A Permutation, or Randomization Test
There are three steps we need to consider.
 We need a statistic to test
In this particular case our farmer is interested in the ratio of her two sample means, Ȳ_{T} / Ȳ_{P}
But in order to speak more generally, let us use θ to denote any statistic of interest.
 We need to decide how θ might be expected to vary.
This depends upon our underlying biological and statistical model.
 We need to compare the statistic she observed with the expected distribution - predicted by our model.
How we do this depends upon the hypothesis under test.
Defining the model population
Let us begin by considering our null hypothesis.
From what the farmer has told us, she used a 'completely randomized' experimental design. Our null hypothesis is that any difference she observed was merely due to chance.
We can express this another way.
Imagine for a moment, what would happen if all the labels had fallen off our farmer's flypapers.
If her treatments were equally effective, or ineffective, it would not matter which flypaper was monitoring which treatment. In other words, on average at least, she might as well relabel her treatments at random - provided that each flypaper ended up with a label.
Of course it is unlikely that a group of observations, selected (or assigned) at random, will give the same results the farmer achieved.
Nevertheless, if our null hypothesis is correct, and there really was no difference between the effect of these treatments, any randomly selected observations must provide an equally valid description of the observations they are drawn from. In which case, any sample statistic summarizing those observations should be equally valid.
If this null hypothesis really is true, provided the sizes of her treatment groups remain the same, any variation amongst these observed ('sample') statistics must be entirely due to chance. Only if the results our farmer has observed are very unusual, might we conclude they are unlikely to arise from this group, and consider rejecting our null hypothesis - that her treatments produced the same result.
Assuming the farmer allocated the treatments at random, we can estimate the distribution of her sample statistic under the null hypothesis. Using this distribution we can work out the likelihood of finding her result, without assuming her observations represent (or approximate) any hypothetical population with a 'known' distribution. We simply assume that her treatments were assigned at random. 
In other words, we have confined the population of observations to the 5 that our farmer has actually made. By definition therefore, we are not trying to estimate parameters of some larger population of observations - or of their statistics. Nevertheless, we would be wise to explicitly state what else this model implies.
Defining the selection process
Given the way our farmer has labelled her results, we are assuming no animal was assigned to both groups. In other words, no animal received both an ear tag and pour-on. Similarly, we assume the farmer used just one flypaper for each calf, and that she recorded the result from each flypaper once and only once.
When stated in these terms, we cannot realistically argue these treatments were assigned independently of one another. The probability of a calf being allocated to a group varied according to how the other calves had already been assigned. For example, if by chance the farmer randomly assigned the first three animals to the 'pour-on' group, the remaining two animals could only be assigned to the 'tags' group. Nor, once an animal had been selected from the 'unassigned' animals, was it replaced or reassigned.
We must therefore treat our observations in the same way. In other words we should select each observation from our model population without replacement. The table below shows every possible way in which 5 observations could be divided into two groups, comprising 2 and 3 observations. All the possible outcomes of our experiment are described as the sample space. It turns out there are just 10 alternative ways of dividing her 5 observations into two groups (of 2 'tagged' & 3 'pour-on'). Provided the farmer assigned her treatments at random, each of these events is equally probable, even if the results are assigned after her experiment is done.
To make things clearer, we have shaded observations assigned to tagged animals green. Observations assigned to pour-on treated animals are shaded turquoise. We have ranked these results according to their test statistic.
{Fig. 1}
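The enumeration above is easy to reproduce. The sketch below (in Python; any language would do) lists all ten ways of relabelling two of the farmer's five flypapers as 'tags', and computes the ratio of means for each. Note that, because two papers caught 30 flies, two of the ten divisions tie with her observed ratio of 2.45.

```python
from itertools import combinations

counts = [30, 50, 7, 12, 30]     # flies caught on flypapers 1 to 5
n_tags = 2                       # flypapers 1 & 2 monitored the tagged calves

def ratio(tags, pouron):
    # test statistic: mean catch (tags) / mean catch (pour-on)
    return (sum(tags) / len(tags)) / (sum(pouron) / len(pouron))

observed = ratio(counts[:n_tags], counts[n_tags:])

# every way of relabelling 2 of the 5 flypapers as 'tags'
null_dist = []
for idx in combinations(range(len(counts)), n_tags):
    tags = [counts[i] for i in idx]
    pouron = [counts[i] for i in range(len(counts)) if i not in idx]
    null_dist.append(ratio(tags, pouron))

# one-tailed P-value: proportion of relabellings at least as extreme
p_1tail = sum(r >= observed for r in null_dist) / len(null_dist)

print(len(null_dist))                                      # 10
print(round(min(null_dist), 2), round(max(null_dist), 2))  # 0.26 2.45
print(round(observed, 2), p_1tail)                         # 2.45 0.2
```

Since each of the 10 divisions is equally likely under random assignment, and 2 of them yield a ratio at least as large as 2.45, the one-tailed P-value is 2/10 = 0.2.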

This population distribution differs from the distributions assumed by parametric models, such as the z-test (or t-test, described in Unit 8), in two important ways.
 This population of statistics is discrete instead of smooth.
 Its range is strictly bounded (because 0.26 ≤ Ȳ_{T} / Ȳ_{P} ≤ 2.45) compared to a normal distribution (where −∞ < θ < +∞) or a lognormal distribution (where 0 < θ < +∞).
Therefore, when analysing a permutation test, it is hard to justify fitting the lognormal distribution - shown in grey on the graph above. As we shall see, the properties of the permutation distribution can affect what we can realistically infer.
Comparing observed and expected
Now we have some idea of how these results might be expected to vary, let us compare the 'exact' distribution predicted by our 'null' model, with the result our farmer actually observed.
Only two of the 10 possible test statistics (or 24 of the 120 permutations) are as extreme as our farmer's result (2.45 times as many flies round tagged calves). No test statistic was more extreme than her result.
A conventional 1-tailed test (which assumes a smooth distribution of θ) requires less than 5 percent of this population of statistics to be greater than or equal to the result she obtained. In this case 24 of the 120 possible results are the same as the one she achieved, so there is a 20% probability (P=0.2) of finding the pour-on was more effective if in fact there were no difference.
If however, the farmer only intended to find which treatment was best, we should consider both tails of the distribution. Simply doubling the 1-tailed P-value gives us a P-value of 0.4.
Given this distribution is discrete, you could justifiably argue a mid-P-value is more appropriate. When expressed as a mean rank, 90% of the population had a rank less than the statistic she observed. Even so, a 1-tailed mid-P-value of 0.1 might, most generously, be described as 'suggestive' rather than 'significant'.
Either way, in plain English, it is very possible her results were a statistical fluke.
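For a discrete null distribution, the mid-P-value counts values strictly more extreme than the observed statistic in full, and ties at half weight. A minimal sketch (the `mid_p` helper is ours; the ten ratios are those enumerated above, rounded to two places):

```python
def mid_p(null_dist, observed):
    # ties contribute half their probability; strictly greater values, all of it
    greater = sum(r > observed for r in null_dist)
    ties = sum(r == observed for r in null_dist)
    return (greater + 0.5 * ties) / len(null_dist)

# the ten equally-likely ratios of means from the farmer's five counts
ratios = [0.26, 0.60, 0.60, 0.72, 0.72, 1.19, 1.30, 1.39, 2.45, 2.45]

print(mid_p(ratios, 2.45))   # 0.1 - no larger values, and two ties at half weight
```

With no statistic exceeding 2.45 and two equalling it, the mid-P-value is (0 + 0.5 × 2)/10 = 0.1, as quoted above.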
How much power does this test have?
Around this point you, like our farmer, may be wondering just how big a difference we would need to obtain for us to conclude that one of her treatments had a significant effect.
Consider this more extreme example:
Treatment                            Tags    Pour-on
Number of stable-flies caught           2       1358
                                        1       2481
                                                 960
Total                                   3       4799
Ratio (tags / pour-on)  ⇒  0.0009
Here the tags are 'obviously' very effective indeed. Instead of a 2.45 times difference, we have observed nearly a one thousand-fold reduction. In this instance, of the ten possible samples from these observations, just one produces such an extreme result. Given that the sample space is still (5! =) 120 permutations, and there are 12 permutations which can produce each result, by a conventional 1-tailed test this result is not significant because P = 12/120 = 0.1, or 10%. Irritatingly, the 1-tailed mid-P-value of 6/120, or 0.05, just fails to be significant at the 5% level.
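You can confirm this floor directly: however extreme the counts, with five observations split 2 and 3 the smallest attainable one-tailed P-value is 1/10. A sketch with these more extreme counts:

```python
from itertools import combinations

counts = [2, 1, 1358, 2481, 960]   # tags: first two; pour-on: last three
obs = (3 / 2) / (4799 / 3)         # observed ratio of means, about 0.0009

ratios = []
for idx in combinations(range(5), 2):
    tags = [counts[i] for i in idx]
    rest = [counts[i] for i in range(5) if i not in idx]
    ratios.append((sum(tags) / 2) / (sum(rest) / 3))

# even this most extreme division can score no better than 1 division in 10
p_low = sum(r <= obs for r in ratios) / len(ratios)
print(p_low)   # 0.1
```

Only the observed division itself is as small as 0.0009, so the conventional one-tailed P-value cannot fall below 0.1, no matter how large the treatment effect.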
Given such a large treatment effect, you may find this a rather puzzling result.
One way of explaining this paradox would be to say 'the problem is a permutation test is nonparametric'. Unfortunately, whilst this explanation might silence the credulous, it does nothing to increase your understanding.
If we define power as the probability of discarding the null hypothesis when it is false, the problem becomes more obvious. Very simply, our permutation test assumes the null hypothesis is true - and, because our inference applies to just these 5 observations, we have defined its sample space accordingly. The net result is, even if the true difference between treated and untreated animals was utterly colossal, given a total of just 5 observations a 2-tailed permutation test cannot show this treatment effect is significant (at P<0.05).
From this you might conclude that permutation tests must lack power. In fact the converse is true.
Many 'nonparametric' tests are not as powerful as their parametric equivalent because the rank transformation loses information. Randomization tests (and similar exact tests) are generally considered to be very powerful because they are able to make use of all the information in your observations.
 Part of the reason this permutation test is running into difficulties is because our few observations do not permit the sample space to be subdivided into small enough bits.
 It is also running into difficulty because, by combining observations under the nil hypothesis, that model ALSO includes any treatment variation.
One way of increasing the permutation test's power is to remove any observed treatment effect BEFORE pooling the 'residual' values under H_{0}.
Nevertheless, where the assumptions of the parametric and permutation models are compatible, they can yield similar inferences.
If you are used to parametric tests you may find these facts hard to accept.
A popular gambit, in this situation, is to forget about the finer points of statistical etiquette - and simply use a different test. Since we wish to compare their statistical models, let us do just that.
A parametric model
In essence, a parametric model is constructed and tested as follows - albeit not always in this order.
 Summarise your observations using some sort of statistic.
For sake of argument, let us call this sample statistic 'theta', or θ, for short. Generally, but not always, we are only interested in this statistic as an estimate of some parameter 'Theta', or Θ.

 Decide what value of Θ you would either expect, or wish to compare θ with.
This may be defined by your null hypothesis or alternate hypothesis, but implicitly or explicitly, this step cannot be avoided. For a confidence interval, your best estimate of Θ is θ.
 Given the above, you decide how θ would be expected to vary about Θ.
In order to do this you nearly always have to assume that either your observations, or treatments, were selected at random  or that random selection could have operated. If this assumption is violated the rest of your model is invalid, and its predictions are liable to be biased or simply meaningless.

 Compare your observed θ with the frequency distribution you have estimated from steps 2 and 3.
Depending upon how this is done, you may end up with a probability or a relative probability. But, in either case, you assess how far θ deviates from Θ. The greater the deviation, the less likely it is predicted to have arisen by chance alone.
 For a P-value plot, or to attach confidence limits to θ, steps 1 to 4 may be evaluated iteratively - or, depending upon the assumptions that are made, they may be integrated into a single 'shortcut' formula.
For example, if θ is assumed to be normal then, when studentized, θ′ may be assumed to follow a t-distribution, and tested as such.

Let us therefore apply the standard 'textbook' approach to this problem, and use a t-test.
Given that the statistic of interest is the ratio between means, and that these insect catches are liable to be skewed, we log-transformed our farmer's 5 Stomoxys catches. Assuming the null hypothesis is true, and differences between means of transformed data are approximately normal, their studentized difference might be expected to obey the t-distribution.
If you are uncertain about that line of reasoning, we introduce the t-distribution in Unit 6, but Unit 8 explores the reasoning and behaviour of 't-statistics' in some depth.
Unfortunately, a (2-tailed) test of the equal-variance t-statistic was non-significant, at P=0.17 - and if we assume our samples represent unequal population variances, P=0.66.
 None of which is of much solace to our farmer.
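Readers who want to check that P=0.17 can do so without statistical tables: for 3 degrees of freedom, Student's t-distribution has a closed-form CDF. The sketch below log-transforms the farmer's counts, computes the pooled (equal-variance) t-statistic, and recovers the 2-tailed P-value (the helper names are ours; the df=3 formula is standard):

```python
import math

tags = [30, 50]
pouron = [7, 12, 30]

def t_equal_var(a, b):
    # pooled-variance (equal-variance) two-sample t-statistic, and its df
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    ssa = sum((x - ma) ** 2 for x in a)
    ssb = sum((x - mb) ** 2 for x in b)
    df = len(a) + len(b) - 2
    sp2 = (ssa + ssb) / df                       # pooled variance
    se = math.sqrt(sp2 * (1 / len(a) + 1 / len(b)))
    return (ma - mb) / se, df

def p2_df3(t):
    # exact 2-tailed P for Student's t with 3 degrees of freedom:
    # F(t) = 1/2 + (1/pi) * [u/(1+u^2) + arctan(u)], where u = t/sqrt(3)
    u = abs(t) / math.sqrt(3)
    cdf = 0.5 + (u / (1 + u * u) + math.atan(u)) / math.pi
    return 2 * (1 - cdf)

la = [math.log(x) for x in tags]
lb = [math.log(x) for x in pouron]
t, df = t_equal_var(la, lb)
print(round(t, 2), df, round(p2_df3(t), 2))   # t ≈ 1.8 on 3 df, P ≈ 0.17
```

The base of the logarithm does not matter here, since rescaling the data rescales the difference of means and its standard error equally.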
We next applied the same transformation and tests to our second, rather more extreme, set of results from above. This time the equal-variance t-statistic tested highly significant, at a 2-tailed value of P=0.00056 - and the unequal-variance statistic slightly less so, at P=0.0025.
Given these log-transformed observations yielded an F-ratio of 1.041, you might conclude the equal-variance t-statistic was quite defensible - and that a ratio of 0.0009 was very unlikely to occur by chance.
Since (on this occasion) the parametric and exact models have yielded such divergent inferences, let us see why their differing assumptions result in this discrepancy.
Parametric versus Exact
Once we cut away the mathematics, there are three fundamental differences between our (exact) permutation test model and that of an ordinary (parametric) t-test.
 The size of the population
 Our permutation test assumes the population is both finite and tangible.
 The t-test assumed our results represent an infinite, albeit theoretical, population.

 How θ can vary.
 If we are sampling a finite population, our statistic must have a discrete, finite (unknown) distribution.
 A parametric test assumes θ has a continuous, known (in this case, normal) distribution, whose variance we estimate from the two sample variances.

 How the models behave if the null hypothesis is false.
 As it stands, our permutation test model makes no allowance for what happens if the null hypothesis is false.
 The t-statistic performs satisfactorily under the alternate hypothesis - enabling you to estimate confidence limits for your estimate.

Given their assumptions are so different, it is scarcely surprising our two models are producing quite different conclusions. Nevertheless, where the assumptions of both tests are not too outrageously violated, they result in very similar inferences. Otherwise, given the points above, we can make the following observations.
 By confining our model to a finite (less theoretical) population, in terms of that population, our inference is strong. A parametric test, which treats these observations as a random sample of an infinite population, is not only very weak - but is unduly conservative.
If we accept we are sampling a finite population of results, we ought to apply a finite population correction to our t-test's variances - so reducing our t-test's P-value. For two equally-sized groups of observations, this would approximately halve the sample variances.

 Because the t-test assumes θ has a continuous and unbounded (normal) distribution, it also assumes that θ can have very large deviations from Θ - and, as a result, that its sample space can be partitioned into correspondingly minute portions.
If however, we accept this is an approximation to a discrete distribution, these very small P-values are unrealistic - or, in plain English, they are merely an artefact of our unreasonable assumptions.

 A permutation test of observations assumes all your observations represent the same population (under H_{nil}). So conventional estimates of power, or confidence limits for our observed statistic (which assume H_{A}), appear to be a contradiction in terms.
Sometimes, but not always, this problem can be addressed by removing the observed treatment effect prior to performing the test.

In principle, because the exact distribution represents all the information available, provided your model is reasonable, that distribution is the most powerful measure of your observed statistic.
Although you require less information to estimate the parameters of a known distribution than to describe an unknown one, you still need a clear idea of which known distribution (if any) your results represent.
If you are not certain what distribution your statistic represents, your inference is based upon supposition not information.

Resampling data can mimic the effect of intrinsic variation (within a given set of values) upon their summary statistic's value. Lacking any better information about how observations 'should' be distributed, resampling allows you to make the best use of the information at hand. Resampling methods do this by assuming that the variation of your sample statistic results from random selection among the available observations as represented within your sample. But when those values cannot reasonably be interchanged, resampling is, AT VERY BEST, an approximation - indeed, if some assignments are more likely than others, random resampling can be a highly misleading approximation.
Resampling allows you considerable scope in choosing which sample statistic to test. Although resampling techniques are computationally intensive, they have minimal assumptions, and minimal mathematics. As a result, they are much easier to understand, and allow you to concentrate upon what you are testing, rather than how to do it.
All resampling methods assume:
 Treatments were randomly allocated, or observations were randomly selected.
In practice it is much easier to randomly allocate treatments than to randomly select observations. Therefore the former assumption is considerably stronger.
 Your sample is sufficiently large that resampling can adequately estimate how your statistic is distributed.
Resampling techniques obtain all their information on how your statistics are distributed from the observations within your sample. For unbiased results you usually require at least 30 observations, and preferably more than 100.

Surprisingly, whilst resampling methods sometimes make assumptions about how your statistic is distributed, they do not make any assumptions about the distribution of whatever population your sample might, or might not represent. To avoid making such assumptions, you must have enough information.
Here we describe two methods of resampling observations, bootstrapping and randomization. Which of these is most appropriate to your data depends upon how your observations were made.
 Randomization, or resampling without replacement, assumes you are sampling an appreciable proportion of a finite population, so that each observation has a marked influence on the outcome of the next. Randomization is particularly useful in analyzing experimental results.
Bootstrapping, or resampling with replacement, assumes that the result of one observation does not affect the results of those you make subsequently. In other words it assumes your sampled population is infinite, or effectively so. For example, bootstrapping would be appropriate for surveys, where only a small proportion of the population was sampled.

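The two selection schemes differ only in whether a value is returned to the pool before the next draw. A minimal sketch, using the farmer's counts (the seed is arbitrary, for repeatability):

```python
import random

rng = random.Random(1)                       # seeded for repeatability
counts = [30, 50, 7, 12, 30]

# randomization: draw without replacement - a permutation of the originals
permuted = rng.sample(counts, k=len(counts))

# bootstrapping: draw with replacement - values may repeat or be omitted
bootstrapped = [rng.choice(counts) for _ in counts]

print(sorted(permuted) == sorted(counts))    # True: same values, new order
```

A permuted set always contains exactly the original values; a bootstrapped set usually does not, which is why the two methods suit different sampling models.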
Because of the number of available permutations, or 'sample space', available from even a moderate sample, a complete evaluation is generally not possible. The precision of a resampling test, or confidence limit, depends upon:
 The number of times the test statistic is estimated - not the proportion of the sample space that is evaluated.
 How deviant your result is. Statistics near the edge of a distribution require proportionally more resample statistics.
 The choice of an appropriate test statistic. A good test statistic is sensitive to your alternate hypothesis, but insensitive to things not of interest under your null hypothesis. (For example, the difference between means is a poor test statistic because it is also sensitive to unequal variances. The t-statistic for unequal variances may be preferable.)
 The distribution of your test statistic. One-tailed tests and parametric confidence limits make no assumptions, but two-tailed tests assume your test statistic is distributed symmetrically. Nonparametric confidence limits assume your test statistic is distributed normally, or can be transformed such that it is.
If you find our example (above) too long-winded and messy, you may find the following instructions more useful.
 Select the best statistic to test your null hypothesis.
Aside from their sensitivity to your alternate hypothesis, and insensitivity to parameters not of interest, when calculating large numbers of statistics you need to minimise your calculations. A number of shortcuts are available where resample statistics summarise the same number of observations. For example, totals are distributed in the same fashion as means, and root sums of squares are distributed as standard errors. Differences and ratios are quicker to calculate from totals, because the numbers of observations remain constant.
 Calculate your sample statistic.
 Combine and correct your observations so they match your null hypothesis.
Most comparisons assume there is no difference between samples, but parametric comparisons are sometimes of interest. If you wish to test for an expected difference, this has to be applied BEFORE you combine your samples.
 Randomly select the same number of observations as the sample under test. (Replace observations only after selecting the entire group.)
Randomization requires a different algorithm from bootstrapping for this. [See below.] A good, fast random number generator is essential. Many packages and programming languages have pseudo-random number generators, but it is worth making a few careful checks to ensure their values arise with equal frequency when divided into many intervals. A cumulative scatterplot of rank against value is a good quick test (it should be linear).
 Calculate their resamplestatistic.
 Repeat steps 3 to 5 sufficient times.
How many resample statistics you need varies according to what you are testing, how you are testing it, and how much precision you require.
 Compare your sample statistic with the resulting distribution of resamplestatistics.
 Remember to LOOK at their distribution. Although 1-tailed tests do not assume any particular distribution, 2-tailed tests assume the distribution is symmetrical. But nonparametric confidence limits assume resample statistics are normally distributed, or can be transformed such that they are. Remember, you can transform either your observations or their statistics.
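The generator check recommended in step 4 above is easy to automate: divide the unit interval into equal bins and confirm each receives close to its expected share of draws. A rough sketch (the bin count and tolerance are arbitrary choices of ours):

```python
import random

rng = random.Random(0)                       # seeded for repeatability
n_draws, n_bins = 100_000, 10
bins = [0] * n_bins
for _ in range(n_draws):
    bins[int(rng.random() * n_bins)] += 1    # values in [0, 1) map to bins 0..9

expected = n_draws / n_bins                  # 10,000 draws per bin
sd = (n_draws * (1 / n_bins) * (1 - 1 / n_bins)) ** 0.5   # binomial sd, about 95

# each bin count should sit within a few standard deviations of expectation
print(all(abs(b - expected) < 5 * sd for b in bins))
```

A sound generator should pass comfortably; a generator whose bins deviate by many standard deviations is not safe to use for resampling.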
If you have a very small number of observations it may be worth calculating the entire sample space by working through every possible combination of observations. Unfortunately, even for quite moderate samples this is impractical because the number of permutations becomes excessive.
For 2 samples, containing n_{1} and n_{2} observations, the potential sample space is (n_{1} + n_{2})! / (n_{1}! × n_{2}!) distinct divisions. Two samples of 10 observations allow 184,756 possible divisions, and two samples of 15 over 155 million. In practice therefore, it is simpler and easier to randomly sample their sample space.
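The size of this sample space is a binomial coefficient, and can be checked directly:

```python
from math import comb, factorial

def n_divisions(n1, n2):
    # distinct ways of dividing n1 + n2 observations into groups of n1 and n2
    return factorial(n1 + n2) // (factorial(n1) * factorial(n2))

print(n_divisions(2, 3))      # 10, as in the farmer's experiment
print(n_divisions(10, 10))    # 184756
print(n_divisions(15, 15))    # 155117520

# equivalently, the binomial coefficient C(n1 + n2, n1)
assert n_divisions(10, 10) == comb(20, 10)
```

The count roughly squares each time both samples double, which is why complete enumeration is soon impractical.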
To do this you need a computer, and either:
 A package that supports resampling.
A few do, but they can be infuriating to use because they consider this an 'advanced' technique, and may expect you to have a degree in mathematical statistics!
 Or a compiler specifically for resampling, and its instruction manual.
Such as 'Resampling Stats' from The Resampling Project, College of Business and Management, University of Maryland, College Park, MD 20742, 7035222713 USA.
 Or a fast spreadsheet, that allows you to reorder (sort) columns.
You can permute or select observations using a column of random numbers, but you may find this a somewhat tedious operation.
 Or a programming language, such as R, and the ability to write simple programmes.
This is probably the fastest and most flexible method.
Whichever of these you use, the underlying logic is fairly similar.
For resampling without replacement (randomization) under the nil hypothesis:
 Combine your observations into a single column or array.
Enter sample 1 in cells 1 to n1, sample 2 in cells n1+1 to n1+n2, etc. Then always calculate their resample statistics from these groups of cells.
 Permute your observations:
 For spreadsheets:
Generate a column of random numbers, and resort your observations in that order.
 For programs:
Taking each cell of your array in turn (say cell i, of N cells), swap its contents with a cell selected at random from cells i to N. You can select the address of the cell to swap with by multiplying a random number between 0 and 1 by the number of cells not yet passed (N − i + 1), truncating it to an integer, and adding i. (Swapping each cell with any of the N cells instead produces a subtly biased shuffle.) Note that some languages permit you to have a cell address of zero, and most round DOWN when truncating floating point values to integers.
 Calculate and record your resample statistic from the cell addresses that originally held those samples.
If you entered sample 1 in cells 1 to n1, calculate their resample statistic from cells 1 to n1. Calculate sample 2 from cells n1+1 to n1+n2, etc.
 Repeat steps 1 to 3 to obtain sufficient resample statistics (see below).

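The numbered steps above can be collected into a single routine. A minimal sketch (the function and statistic names are ours; Python's random.shuffle performs an unbiased Fisher-Yates permutation, avoiding the shuffle bias warned of in step 2):

```python
import random

def randomization_test(sample1, sample2, statistic, n_rand=10_000, seed=42):
    """One-tailed Monte Carlo permutation test: P(resample stat >= observed)."""
    rng = random.Random(seed)
    pool = list(sample1) + list(sample2)     # step 1: combine into one array
    n1 = len(sample1)
    observed = statistic(sample1, sample2)
    extreme = 0
    for _ in range(n_rand):
        rng.shuffle(pool)                    # step 2: permute (Fisher-Yates)
        resample = statistic(pool[:n1], pool[n1:])   # step 3: same cell ranges
        if resample >= observed:
            extreme += 1
    return extreme / n_rand                  # step 4: repeat, then compare

# the farmer's experiment: ratio of group means as the test statistic
mean_ratio = lambda a, b: (sum(a) / len(a)) / (sum(b) / len(b))
p = randomization_test([30, 50], [7, 12, 30], mean_ratio)
print(round(p, 2))   # close to the exact value of 0.2 found by full enumeration
```

With only ten distinct divisions the Monte Carlo estimate hovers near the exact P of 0.2; for larger samples, where enumeration is impractical, the same routine applies unchanged.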
How many resamplestatistics are needed?
Some authors suggest magic numbers of randomizations for arithmetic convenience, so you do not have to interpolate to obtain the Neyman-Pearson P=0.05. That point aside, the critical issue is - given you are only sampling the null distribution of test statistics - how accurately do you need to estimate the probability distribution of that statistic? Ideally, of course, you would prefer to have complete accuracy. But to achieve that, you would need to calculate every possible tail-end value, how many ways each could occur, and how many ways every possible outcome might occur. Since that is usually only possible for the most extreme possible values, you need to know how many randomizations are required to provide a reasonably accurate P-value.
Notice also that, unlike a parametric test, it is generally unreasonable to assume your probability function is 'known', and estimate its parameters from your randomization distribution. Other considerations aside, moderately sized populations can yield some rather peculiar distributions.
Monte Carlo models assume that, by the process of random selection/allocation, the distribution of your Monte Carlo statistics approximates their exact population distribution - on average, at least. In other words, if your exact model has a probability P of yielding a result less than some number, X, then that is the probability of your Monte Carlo model producing it (assuming both models have the same assumptions) - and in this sense, a Monte Carlo test is also 'exact'.
For example, since there are a mere 252 ways in which, ignoring order, 10 unique items can be divided into two equal groups, we were able to work them all out. The graph set below allows you to compare this entire exact frequency distribution of two statistics with the same distributions approximated by Monte Carlo simulation - where observations were randomly assigned (without replacement).
{Fig. 2}
As you might expect, the more randomizations you perform, the better an approximation to the population frequency distribution you can achieve. Nevertheless, because you are only sampling the sample space, this process introduces a certain amount of Monte Carlo variation. Provided the number of simulations is sufficiently large, this is generally negligible - compared to the other sources of error.
With modern computing abilities this is fairly easy to achieve; nevertheless, some purists are deeply upset by Monte Carlo error.
There are some obvious constraints to resampling a set of observations to estimate how a statistic might vary. For example, Monte Carlo estimates of your statistic tend to underestimate the range (max.−min.) of the population being sampled. Similarly, the smaller the proportion of the population that falls within any given class, the more variable your Monte Carlo estimate of that proportion will be; and the more Monte Carlo statistics you calculate, the more reliable that estimate is likely to be. As a result, the tails of a statistic's distribution are the most difficult to estimate.
In plain English, the smaller the P-value you are interested in, the more randomizations you need for a reliable result.
For example, consider the situation if we classified our farmer's 15 results as 'treated' (using either tags or pour-on) or 'control' (using neither). After 4999 random assignments, a permutation test of the unequal variance t-statistic gave a 1-sided mid-P-value of 0.0041, equivalent to a 2-sided significance of P<0.01.
All of which is very encouraging, except for the fact that we can expect the estimated P-value to vary each time we perform this test, and (as yet) we have not indicated how great this variation might be. In other words, we have not said how many randomizations you need to obtain a reliable P-value.
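A minimal sketch of the test described above: a Monte Carlo permutation test of the unequal-variance (Welch) t-statistic, reporting a 1-sided mid-P-value. The farmer's 15 observations are not reproduced here, so `treated` and `control` are hypothetical placeholders, and the mid-P formula shown is one common convention among several.

```python
# Sketch: Monte Carlo permutation test of the unequal-variance (Welch)
# t-statistic, 1-sided mid-P. All data below are hypothetical.
import random
from statistics import mean, variance

def welch_t(a, b):
    # unequal-variance t-statistic
    return (mean(a) - mean(b)) / (variance(a) / len(a) + variance(b) / len(b)) ** 0.5

def permutation_mid_p(treated, control, R=4999, seed=1):
    rng = random.Random(seed)
    pooled = list(treated) + list(control)
    n = len(treated)
    t_obs = welch_t(treated, control)
    greater = equal = 0
    for _ in range(R):
        resample = rng.sample(pooled, len(pooled))  # reassignment without replacement
        t = welch_t(resample[:n], resample[n:])
        if t > t_obs:
            greater += 1
        elif t == t_obs:
            equal += 1
    # mid-P: all strictly more extreme statistics, plus half of those equal
    # to the observed value (counting the observed result itself as one of them)
    return (greater + 0.5 * (equal + 1)) / (R + 1)

treated = [12.1, 14.3, 11.8, 15.2, 13.9, 14.8, 12.7, 13.3, 15.0, 14.1]  # hypothetical
control = [11.2, 10.8, 12.0, 11.5, 10.9]                                # hypothetical
p1 = permutation_mid_p(treated, control)
```

Note the dividing by R + 1 rather than R: the observed arrangement is treated as one member of the randomization distribution.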
One obvious answer to this problem is to repeatedly test her observed result, say 500 times, and see how much the resulting P-values vary. In order to discover how this variation is related to the number of randomizations (R) and the statistic under test (θ), we tested 4 values of θ using R = 100, 1000, and 10,000 randomizations. Since the importance of an error in estimating P is directly related to its true P-value, the graph set below shows our observed P-values on a logarithmic scale, and their variation is expressed as a coefficient of variation rather than the standard deviation of P (s_{p}).
{Fig. 3}
The second graph illustrates the fact that randomization test P-values vary binomially. Moderate P-values approach a (bounded) normal distribution, but very small P-values can be highly skewed.
 If we assume errors are normal, and our observed 2-tailed P-value of 0.0082 is correct, using 4999 randomizations we would expect our permutation test to yield a P-value between 0.0077 and 0.0087 on 95% of occasions. From which we can argue that this amount of Monte Carlo variation does not materially affect our inference.
 If however we only performed 99 randomizations, on roughly 70% of occasions we would expect our observed result to be the most extreme value within the randomization distribution. Since the lowest 2-tailed P-value you can obtain from a distribution of 100 points is a mid-P-value of (2×0.5/100 =) 0.01, this noticeably affects your inference unless you merely wish to know whether P<0.05.
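The second bullet can be checked directly, assuming the number of 'hits' in R randomizations is binomial; the 0.0041 below is the 1-sided mid-P-value quoted earlier.

```python
# Sketch: if the true 1-sided P is 0.0041, the chance that none of 99 random
# statistics is as extreme as the observed value is (1 - p)**99.
p_true = 0.0041
R = 99
prob_most_extreme = (1 - p_true) ** R   # about 0.67, i.e. roughly 70% of occasions

# The smallest 2-tailed mid-P obtainable from 100 points
# (99 randomizations plus the observed result):
smallest_mid_p = 2 * 0.5 / 100          # 0.01
```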
For general purposes therefore, most people use between 1000 and 10,000 randomizations.
For large samples
Let us begin by reminding ourselves that, unless you compute exact P-values using the probability of obtaining all possible tail-end values, your (permutation test) P-value is only an estimate of that exact value. To distinguish between these two P-values, let us say p is the proportion of all possible test results whose ranks are more extreme than your observed result, and p̂ is your estimate of p.
If your samples are large, these estimates (p̂) are distributed approximately normally around their mean (p) with a variance of p(1−p)/R, where R is the number of randomly-obtained statistics, not the number of values in your combined sample. (All of which means, in plain English, that p̂ is assumed to have a binomial distribution.)
Given which, about 95% of the P-values you estimate will lie in the range p ± 1.96√{p(1−p)/R}, and 99% of the P-values you estimate will lie in the range p ± 2.58√{p(1−p)/R}.
A little arithmetic reveals:
Where p = 0.05, and R = 1,000, 99% of p̂ lie between 0.032 and 0.068
Where p = 0.01, and R = 5,000, 99% of p̂ lie between 0.0064 and 0.0136
Where p = 0.001, and R = 10,000, 99% of p̂ lie between 0.00018 and 0.0018
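These ranges follow directly from the normal approximation above, taking z = 2.58 for 99%:

```python
# Sketch: 99% range of the estimated P-value, p ± 2.58 * sqrt(p(1-p)/R)
from math import sqrt

def p_hat_range_99(p, R, z=2.58):
    half_width = z * sqrt(p * (1 - p) / R)
    return (p - half_width, p + half_width)

lo1, hi1 = p_hat_range_99(0.05, 1000)     # about (0.032, 0.068)
lo2, hi2 = p_hat_range_99(0.01, 5000)     # about (0.0064, 0.0136)
lo3, hi3 = p_hat_range_99(0.001, 10000)   # about (0.00018, 0.0018)
```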
Generally, it is simplest to calculate between five and twenty thousand statistics for all tests
 say 10,000 for a 1-tailed test, 20,000 for a 2-tailed test
 except for standard non-parametric confidence limits
 for which 100 to 200 may suffice.

For 2-tailed tests, assuming the resample statistic is symmetrically distributed, each tail is assumed to contain half that probability. Otherwise you have to estimate the deviation in each tail separately. (For example, if you are testing a difference between means of +312 mg, then to test the left-hand tail of an asymmetrical distribution you would presumably see what proportion were less than or equal to −312 mg. In practice, such comparisons are not always meaningful.)
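The two approaches just described can be sketched as follows. Here `diffs` stands for a randomization distribution of mean differences and `obs` for the observed difference (e.g. +312 mg); both names are illustrative, not from the text.

```python
# Sketch: two ways of getting a 2-tailed P-value from a randomization
# distribution `diffs`, given an observed difference `obs`.
def two_tailed_symmetric(diffs, obs):
    # assume the distribution is symmetric: double one tail's proportion
    return 2 * sum(d >= abs(obs) for d in diffs) / len(diffs)

def two_tailed_by_tails(diffs, obs):
    # estimate each tail separately, e.g. >= +312 mg and <= -312 mg
    upper = sum(d >= abs(obs) for d in diffs) / len(diffs)
    lower = sum(d <= -abs(obs) for d in diffs) / len(diffs)
    return upper + lower
```

For a symmetric distribution the two give the same answer; for an asymmetric one they can differ noticeably, which is exactly why the choice matters.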