 
"The property of being bootstrappable might well be added to those of efficiency, robustness and ease of computation, as a fundamentally desirable property for statistical procedures in general."
Peter Hall in Brown, B.M., Hall, P., & Young, G.A. (2001). The smoothed median and the bootstrap. Biometrika, 88(2), 519-534.
 "An approximate answer to the right problem is worth a good deal more than an exact answer to an approximate problem."
Tukey, quoted in Meier, P. (1975). Statistics and medical experimentation. Biometrics, 31, 511-529.

Bootstrapping (sampling with replacement) is an increasingly popular way of obtaining confidence intervals for otherwise intractable statistics, and is sometimes used to estimate parameters and for significance tests. Unlike jackknifing, bootstrap models are justified on the grounds that they are assumed to mimic the process by which your sample was obtained.
Bootstrap confidence intervals are useful where:
- The statistic's distribution is unknown;
- or the formula for estimating the statistic's distribution is too approximate;
- or the formula's predictions need to be checked.
For many commonly-used statistics there is, quite simply, no analytical formula to estimate their standard error, or to attach confidence limits.

Using (analytical) parametric methods it is difficult to estimate confidence intervals for growth rates, hazard functions, genetic distances, heritability indices, niche overlap indices, the Gini coefficient (of inequality of plant sizes), species richness indices (such as the Jaccard index of similarity of species composition), population estimates using mark-release-recapture (MRR), dose response estimates, population density from line transects, catch per unit effort (e.g. in fisheries), and time interval estimators (such as tumour onset following exposure or post treatment). Unfortunately, jackknifing can run into serious difficulties for non-normal estimators such as variance ratios, correlation coefficients, and X-squared error probabilities.
Note:
- For generality, let us use Θ̂ to represent your observed value of any one of these estimates, where Θ is the value being estimated.
- To avoid becoming tied up in terminological knots, we shall describe the reasoning in terms of a statistic calculated from a single sample of n observations. This does not imply the underlying reasoning only applies to single samples, merely that the explanation is much easier from that perspective.
Bootstrapping is particularly useful where the statistic's distribution depends upon the population being sampled, or where the statistic's bias varies with sample size, but there are a few statistics for which bootstrapping cannot estimate confidence intervals.
Although there has been a great deal of theoretical and practical research on bootstrapping, among biologists the most popular interval estimator (although the method used is seldom specified) is Efron's simple, percentile, nonparametric ('backwards') bootstrap.
For a simple percentile bootstrap interval:
- Your observed estimate, Θ̂, is assumed to be the best available estimate of the population parameter, Θ.
- The distribution of the n observations from which Θ̂ was calculated is assumed to be the best estimate of their population's distribution, and is therefore used as a model of that population.
- The sampling process is simulated by taking n observations from that sample, at random with replacement, and a bootstrap estimate ( Θ̂^{*} or, if you prefer, Θ̂^{B} ) is calculated in exactly the same way as the original observed value, Θ̂.
- Given a sufficient number of bootstrapped replicates, the most deviant 100α% of these bootstrap estimates provide an estimate of the corresponding theoretical interval. That 'simple percentile interval' is assumed to estimate the 100(1−α)% confidence interval about Θ̂.
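The recipe above can be sketched in a few lines of Python. This is a minimal illustration rather than production code: the function name and example data are invented for the purpose, and the rank convention used for the interval's endpoints is only one of several in use.

```python
import random
import statistics

def percentile_bootstrap_ci(sample, estimator, B=2000, alpha=0.05, seed=1):
    """Simple (Efron) percentile bootstrap interval: resample the data
    with replacement B times, apply the estimator to each resample, and
    read the interval off the ordered bootstrap estimates."""
    rng = random.Random(seed)
    n = len(sample)
    boot = sorted(estimator([rng.choice(sample) for _ in range(n)])
                  for _ in range(B))
    return boot[int(B * alpha / 2)], boot[int(B * (1 - alpha / 2)) - 1]

# A 95% interval for the median of a small, rather skewed sample.
data = [1.2, 1.9, 2.1, 2.4, 2.8, 3.1, 3.9, 4.6, 6.8, 9.5]
lo, hi = percentile_bootstrap_ci(data, statistics.median)
```

Note that, because every bootstrap estimate is calculated from resampled observations, the interval can never extend beyond the most extreme values in the original sample.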
It may surprise you to learn that, although bootstrapping was devised by Efron in 1979, it is still somewhat controversial. One reason for this is that, because the amount of computation is comparatively large, bootstrap estimators are much more recent than the usual 'textbook' statistics (and jackknifing, for that matter). Fortunately, although some statisticians complained that bootstrapping was just a way of getting a 'free lunch', it is now generally accepted that bootstrapping is perfectly valid upon theoretical grounds, provided its assumptions are reasonable. Unfortunately, an awful lot of biologists either believe that (being nonparametric) bootstrap estimators do not make any assumptions, or simply do not bother to verify the assumptions are met.
So, before we go much further, we would be wise to understand the limitations inherent to bootstrapping in general, and simple nonparametric confidence limits in particular. To expose the assumptions, and some of their implications, let us consider four simple questions.
- Why should resampling samples enable me to estimate how my statistic varies?
- How many resample-statistics do I need to calculate?
- How well can bootstrapping estimate bias and standard error?
- Are bootstrap estimates assumed to be t-distributed?

 Why should resampling samples enable me to estimate how my statistic varies
(Bootstrap estimates of Θ̂)
For bootstrapping to estimate how your sample statistic varies, the first and most important assumption is that your sample of observations ( Y_{1} ) reflects the properties of the 'parent' population ( Y_{0} ) from which those observations were drawn. This assumption is frequently violated because observations were selected non-randomly. Bootstrapping cannot overcome selection bias! Therefore simple random bootstrap sampling is also inappropriate for serially-correlated, clustered, multilevel, or otherwise associated observations. More complex models have been developed for a few of these designs.
Provided you can assume that your sample provides a reasonable representation of the population it was drawn from, it is similarly reasonable to use one as a model of the other. Each bootstrap sample ( Y_{2} ) is selected in the same way as your original sample ( Y_{1} ).
In which case resampling your sample is justified as being a simulation of your original sampling process, and, assuming Y_{0} is infinite, Y_{2} is selected from Y_{1} with replacement.
A permutation test, in contrast, makes no assumptions regarding how your observations were obtained, and, because your inference is confined to the result of assigning those values, they are selected without replacement.
More formally, given that the observed cumulative distribution of your sample is an estimate of the cumulative distribution of its population, the cumulative distribution of bootstrap estimators calculated from that sample can be expected to reflect the (true, population, or parametric) cumulative distribution of your estimator.
Random sampling and complex models aside, the most obvious limitation of this approach is the size of your original sample.
- The larger your sample is, the more reliably and more completely it can be expected to reflect the composition of its population.
- Perversely, if the central limit theorem is effective, the larger your sample the simpler its distribution, and the less important these details are.
- Small samples, on the other hand, suffer from a variety of problems.
- Very obviously, they tend to be unreliable. So, regardless of how sophisticated your analysis might be, their results will carry less weight.
- The distribution of any statistic is liable to be more complex, much harder to predict analytically, and much more directly related to the particular population you have sampled.
- Small samples can only reflect the grossest aspects of their population: in other words its location and, less reliably, its variance; rather less reliably still, its skew; and, only if it was very pronounced (e.g. bimodal), you might get some indication of kurtosis.
Although the cumulative distribution of your sample may be roughly similar to its population's, on average at least, this is almost never true of the frequency distribution. Even if your population has a continuous normal distribution function, your sample will always have a discrete distribution; resampling a sample therefore produces a number of irregular, randomly positioned steps in the bootstrap statistic's cumulative distribution. This spurious 'fine structure' not only limits the detail in your model, it is also a source of annoying artefacts and 'crazy' results, all of which limits the power (and reliability) of your inference. This loss of detail, and tendency to artefacts, bedevils a number of the more sophisticated bootstrapping techniques, especially where they are applied to 'residuals', in other words to the deviations of observations from some model's predictions.
Whereas bootstrapping is most useful for moderately large samples, even in theory there are definite limits upon what can be expected, if for no other reason than that the most extreme results in any population tend to be the rarest, and hence the least liable to be represented in a sample. One consequence of this is that the amount of variation tends to be underestimated, and extreme quantiles are both variable and biased. Correcting the standard error is relatively simple if your statistic is a mean (or behaves as if it is), but for less tractable estimators the approximation error may be reduced by calculating a studentized-bootstrap estimator (also known as a bootstrap-t statistic), for which the standard error of your statistic is estimated by a second stage of resampling, in other words by repeatedly resampling each bootstrap sample.
The notation for bootstrap estimators varies somewhat.
A bootstrap statistic may be distinguished from a sample statistic, Θ̂, in a number of ways: for example by writing Θ̂^{*} or Θ̂^{B} or Θ̂_{B} or Θ̂_{(B)} or Θ_{(B)} or even Θ̃, which is misleading, because a tilde (~) is used to signify approximately unbiased estimators.
So, to avoid confusion we shall use Θ̂^{*}. Since we need to refer to a number of bootstrap statistics (B of them), this allows us to refer to a particular statistic (the ith) as Θ̂^{*}_{i}.
Given that Θ̂ is a plug-in estimator of Θ (in other words, they are both calculated using the same formula) these quantities could also be expressed using function notation.
In other words, where Y_{1} is a random sample of Y_{0}, we could write the parameter as Θ[Y_{0}], the sample estimator as Θ[Y_{1}], and the bootstrap estimator as Θ[Y_{2}].
Pursuing this line of reasoning, it follows that Θ[Y_{3}], or Θ̂^{**}, would be an estimate calculated from a second-stage sample, a bootstrap sample of a bootstrap sample.
In a nonparametric bootstrap Θ̂^{**}_{ij} would be an estimate calculated from the jth sample, of the ith sample, of your original sample.

"The assumption made in a bootstrap is basically that the sorts of errors made in inferring from a sampled sample to the sample (errors that we can see) are similar to the sorts of errors made in inferring from the sample to the population (errors that we can't see)." J.E.H. Shaw
An ex-algebraist who lost his ideals, his associates, and finally his identity
 How many bootstrap-statistics are needed
Working out the result of every possible sample of n observations entails rather a lot of computation, because there are ( B = ) n^{n} possible ways of selecting n items, with replacement, from a set of n distinct items.
By comparison, there are a mere n! ways of selecting n items without replacement. So you may wish to compare this discussion with the one for permutation tests.
In either case, if you ignore the order in which they are selected, there are rather fewer possible combinations. However, even if you just work out every possible combination, for them to be of any use you also need to work out the probability of obtaining each combination. This can pose a fairly knotty computational problem in the 'simple' situation where every different combination of observations yields a different value of Θ^{*} and when no two of your observations have the same value. Even today the entire, 'exact' distribution of bootstrapped estimates is seldom calculated for samples containing more than 10 observations, and, given the problems with small samples described above, these are largely of theoretical interest.
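For n = 10 these counts are easy to check. The number of distinct unordered combinations (multisets) is C(2n−1, n), the standard 'stars and bars' result:

```python
from math import comb, factorial

n = 10
ordered_with_replacement = n ** n           # all ordered bootstrap samples
ordered_without_replacement = factorial(n)  # permutations of the sample
distinct_combinations = comb(2 * n - 1, n)  # unordered multisets: C(2n-1, n)

print(ordered_with_replacement)    # 10000000000
print(ordered_without_replacement) # 3628800
print(distinct_combinations)       # 92378
```

So even after ignoring order, a sample of only 10 observations yields 92378 distinct combinations, each needing its own probability.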
In practice there is little to be gained from calculating the entire bootstrap distribution. By obtaining ( B ) bootstrap estimators, Monte Carlo bootstrapping randomly samples the exact distribution. From which it follows that, the larger a 'sample' of bootstrap statistics you calculate, the better an estimate of their exact distribution you should obtain. One objection to this reasoning is that, the more times you resample your data, the more likely you are to calculate the same result more than once. To estimate this probability, you need to know the probability of achieving the most likely bootstrap statistic.
The results of such mathematics are instructive. As you might expect, if you resample a set of observations, the most likely combination of observations is identical to the sample being resampled. However, if your sample is of an untied continuous variable, and you obtain ( B = ) 2000 bootstrap samples each of 20 observations, the probability that two or more of these samples contain the same combination of observations is less than 0.05, or one in 20. For samples above that size, the probability converges quite rapidly towards zero.
In principle therefore, for all practical purposes, duplication of effort is not considered a problem. A further consequence is that the distribution of linear additive statistics (such as the mean) is often regarded as being continuous, or smooth. At the same time, because you are resampling a discrete distribution of observations, this assumption is unreasonable for small samples; nor does it imply the distribution of your bootstrap estimators is unbounded, and this limits the width of the estimator's tail-end distribution. Indeed, these bounds are all too easy to calculate: take n copies of the smallest, and n copies of the largest, value in your sample, then apply your estimator to them. In addition, if your sample is heavily tied (such as binary data), or your estimating function truncates or pools values (such as where it calculates a median or maximum), the distribution of bootstrap statistics will be a discrete, unsmooth function of your sample, which upsets this simple model.
One way around this problem is to smooth the distribution of Θ^{*} by jittering your resampled values, for example using a Gaussian smoothing function, one side-effect of which is to make a complete enumeration of outcomes impossible, for any sample size, no matter how small.
These points aside, the number of bootstraps you need to perform depends upon whether you are trying to estimate moments or tails. In principle at least, you would expect means and standard deviations to require fewer bootstrap samples than skews or tail-end proportions. For a simple two-tailed 95% percentile confidence interval, a minimum of ( B = ) 1000 estimates is generally recommended, but 5000 is more common. However, to avoid the need for interpolation, many people prefer to use B = 999, or 9999, resamples.
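To see why B = 999 avoids interpolation, suppose (as one common convention, an assumption here; others exist) that the lower endpoint is the (B+1)α/2-th ordered bootstrap estimate:

```python
B, alpha = 999, 0.05
lower_rank = (B + 1) * alpha / 2        # 25.0: exactly the 25th ordered estimate
upper_rank = (B + 1) * (1 - alpha / 2)  # 975.0: exactly the 975th ordered estimate

# With B = 1000 the rank is fractional, so interpolation would be needed.
frac_rank = (1000 + 1) * alpha / 2      # 25.025
```

With B = 999 both endpoints fall exactly on ordered estimates; with a round B = 1000 they fall between two of them.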
 Bootstrap estimates of bias and standard error
To the extent that your sample represents its population, you would expect that bootstrap statistics ought to reflect the bias and standard error of estimates calculated from random samples of that entire population.
 Bootstrap Bias
Like jackknife statistics, bootstrap estimators are not assumed to be unbiased estimators of the population parameter. Instead it is assumed that, if the sample statistic ( Θ̂ ) provides a biased estimate of its parameter ( Θ ), the bootstrap statistic ( Θ̂^{*} ) provides a similarly biased estimate of the sample statistic.
If your sample statistic is unbiased, and has no inherent a priori value, the best estimate of its parameter is usually its observed value, Θ̂. If its bootstrap estimators are distributed symmetrically, their mean is the best estimate, but, for skewed distributions, their median may be a better measure. For simplicity, let us say the bias ( b ) is the difference between your observed statistic and its parameter (b = Θ̂ − Θ).
Therefore:
- If the estimator has a bias of b, then Θ̂ = Θ + b.
- If the bootstrap estimator has a bias, b^{*}, then Θ̂^{*} = Θ̂ + b^{*} and Θ̂^{*} = Θ + b + b^{*}.
- As a first approximation, assuming b = b^{*}, then Θ = Θ̂ − b^{*}, and Θ = 2Θ̂ − Θ̂^{*}.
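These relations are easy to try out. The sketch below (function names invented) uses the plug-in variance, whose downward bias is known analytically, so the bootstrap bias estimate can be checked against expectation:

```python
import random

def bootstrap_bias(sample, estimator, B=4000, seed=1):
    """Estimate bias as b* = mean(theta*) - theta_hat, and return it with
    the first-order bias-corrected estimate, 2*theta_hat - mean(theta*)."""
    rng = random.Random(seed)
    n = len(sample)
    theta_hat = estimator(sample)
    boot = [estimator([rng.choice(sample) for _ in range(n)]) for _ in range(B)]
    b_star = sum(boot) / B - theta_hat
    return b_star, theta_hat - b_star

# Plug-in (population-formula) variance: biased low by a factor (n-1)/n.
def plugin_var(y):
    m = sum(y) / len(y)
    return sum((v - m) ** 2 for v in y) / len(y)

data = [2.1, 3.4, 1.8, 5.0, 4.2, 3.3, 2.7, 4.8]
b_star, corrected = bootstrap_bias(data, plugin_var)
```

Here b^{*} comes out negative (the bootstrap variances underestimate the sample's plug-in variance, just as that underestimates the population's), so the corrected estimate 2Θ̂ − Θ̂^{*} is pulled upwards, much as the usual n/(n−1) correction would do.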
{Fig. 1}
Notice however, that if you are trying to correct confidence limits, this estimate of bias can only be applied directly when your statistic is distributed symmetrically. In addition, whilst the standard deviation of bootstrap estimates generally provides a rather better estimate than one calculated by jackknifing, bootstrapping is not always such a reliable estimate of bias. One reason for these problems is the fact that the random resampling process introduces its own quota of variation, because the (combined set of) observations in your B bootstrap samples are unlikely to have the same distribution as the sample you are resampling. A simple way of improving the reliability of these estimates is to perform what is known as 'balanced' bootstrapping.
Balanced bootstrapping is performed in two steps.
- B copies of your entire sample are combined to form a finite model population.
- B resamples are taken from this model population (without replacement), each of which provides one bootstrap statistic.
However, whilst this procedure improves the reliability of some parameter estimates, it is less effective in improving confidence intervals.
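The two steps above amount to shuffling a pool of B concatenated copies of the sample and cutting it into B resamples of size n; a minimal sketch (names invented):

```python
import random

def balanced_bootstrap(sample, estimator, B=1000, seed=3):
    """Balanced bootstrap: pool B copies of the sample, shuffle the pool,
    then cut it into B resamples of size n, so that every observation
    appears exactly B times across the whole set of resamples."""
    rng = random.Random(seed)
    n = len(sample)
    pool = list(sample) * B   # step 1: the finite model population
    rng.shuffle(pool)         # step 2: resampling without replacement
    return [estimator(pool[i * n:(i + 1) * n]) for i in range(B)]

data = [2.0, 3.5, 1.0, 4.5, 2.5, 3.0]
boot_means = balanced_bootstrap(data, lambda y: sum(y) / len(y))
```

A quick check of the balance property: because every observation occurs exactly B times in the pool, the mean of the bootstrap means equals the sample mean exactly, with no Monte Carlo error.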
 Bootstrap Standard Error
As a first approximation, the standard deviation of your bootstrap statistics ( Θ̂^{*} ) provides a valid estimate of the standard error of your sample statistic ( Θ̂ ). Indeed, one advantage of estimating the standard error this way is that you need comparatively few resamples for the Monte Carlo variation to be negligible.
A disadvantage of this method is that, unless your sample comprises the entire population of observations, the standard deviation of Θ̂^{*} is liable to underestimate the dispersion of Θ̂.
For example we took 10000 samples, each of (n=) 10 observations, from a standard normal population (μ = 0, σ = 1). We calculated the means of these 10000 samples and, using the population variance formula, we found their standard deviation. As you might expect, the observed standard deviation (0.31673) of these 10000 means was very similar to the standard error we would predict using the standard analytical formula ( ^{σ}/_{√n} = ^{1}/_{√10} = 0.31623).
To obtain a bootstrap estimate of this standard deviation we resampled each of these 10000 samples (with replacement). In other words we took 100 bootstrap samples, of 10 observations, from each sample of 10 observations  and calculated their means. Then, using the population variance formula as a plugin estimator, we calculated the standard deviation of each set of 100 bootstrap means  giving us 10000 standard deviations in all.
As expected, both the average (0.29045) and the median (0.28620) standard deviation of these bootstrapped means were rather lower than the standard deviation of our sample means (0.31673).
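The experiment is easy to repeat on a smaller scale (here 1000 samples with 100 resamples each, rather than 10000, to keep the run short; the numbers will therefore differ slightly from those quoted above):

```python
import math
import random

rng = random.Random(42)
n, n_samples, B = 10, 1000, 100

def plugin_sd(y):
    """Population-formula (plug-in) standard deviation."""
    m = sum(y) / len(y)
    return math.sqrt(sum((v - m) ** 2 for v in y) / len(y))

# Draw the samples from a standard normal population.
samples = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(n_samples)]

# Standard deviation of the sample means themselves (close to 1/sqrt(10)).
means = [sum(s) / n for s in samples]
sd_of_means = plugin_sd(means)

# Bootstrap estimate of that standard error, one per sample.
boot_sds = []
for s in samples:
    boot_means = [sum(rng.choice(s) for _ in range(n)) / n for _ in range(B)]
    boot_sds.append(plugin_sd(boot_means))
mean_boot_sd = sum(boot_sds) / n_samples  # typically below sd_of_means
```

On a run of this size the average bootstrap standard deviation again falls noticeably short of the observed standard deviation of the sample means.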


For a few statistics, such as the sample mean (Σ^{Y}/_{n}), it is possible to estimate the standard error of Θ̂ from your sample; very often it is not. However, whilst the simple bootstrap standard error provides a first-order estimate of your estimate's dispersion, bootstrapping therefore tends to produce confidence limits that are slightly narrower than they ought to be. This constraint is no different for nonparametric confidence intervals, even when obtained by test-inversion.
To reduce approximation error, second-stage bootstrap resampling may be used to calculate the distribution, or standard error, of t-statistics and other 'studentized' estimators.
"Bootstrap approximations are so good that if we use bootstrap estimates of erronious theoretical points, we commit noticeable errors." Hall (1988)
Theoretical comparison of bootstrap confidence intervals. The Annals of Statistics, 16.3 927963 
 Bootstrap t statistics
Given that statistics are commonly named after their test-distributions, you might reasonably assume that a 't-statistic' is a statistic that is t-distributed.
Bootstrapping makes no assumptions as to how your observations are distributed and, allowing for the limitations noted above, bootstrap estimates are assumed to be distributed in the same way as the estimates themselves. Where your statistic has an asymptotic normal distribution, the bootstrap statistic is assumed to do the same thing, at least as a first approximation. Given which, 1.96 × the standard error (estimated as above) enables you to attach 95% confidence limits to a number of (otherwise rather awkward) statistics.
The problem with this attractively simple approach is that, because sampling loses information, you must expect bootstrapping to underestimate the standard deviation of your sample statistic ( Θ̂ ), yielding confidence limits with a nominal coverage error of O[n^{−½}]. In other words, all else being equal, when n is large this error is of the order of ^{1}/_{√n}. As a result, even though Θ̂^{*} may be normal, resampling introduces an error equivalent to using the normal approximation for a t-distributed estimate.
For bootstrapping to exceed the accuracy of the normal approximation you have to use pivotal estimates (pivots), in other words statistics which have the same distribution regardless of their population parameters. For example the studentized sample mean, Θ̂' = [Θ̂ − μ] / ŝ_{Θ̂}, is asymptotically t-distributed, where the standard error of Θ̂ ( ŝ_{Θ̂} ) is estimated from the standard deviation of the observations in each sample, s_{Y}. (Remember, you cannot estimate the standard error of most statistics using s_{Y} / √n, or an equivalent analytical formula!)
Assuming your statistic is unbiased, there are three problems with this approach.
- You need to estimate the standard error for each bootstrap estimate.
  Studentizing your estimates using a constant standard error (such as the standard deviation of your bootstrap estimates) cannot mimic the way in which the estimated standard error would vary when calculated from samples of a real population. So, if Θ̂ was asymptotically normal, your studentized bootstrap statistics would also be normal, rather than t-distributed.
- You need to know how many degrees of freedom the t distribution should have.
  Simply because you sample B estimates of a bootstrap distribution does not mean you can assume those estimates are t-distributed with B − 1 degrees of freedom.
- Although not all pivotal estimates are asymptotically normal, pivotalness is an asymptotic property.
  The nominal coverage error you will usually see quoted assumes your studentized estimates are pivotal, symmetrical, homoscedastic, smooth, etcetera.
Provided you can reasonably assume your resampling process mimics the original sampling process, you can simulate the variation in standard error by resampling each of your B bootstrap resamples, and use the standard deviation of each set of second-stage bootstrap statistics ( Θ̂^{**} ) as an estimate of the standard error.
For example, using our earlier notation, where Θ̂ is unbiased, the ith bootstrap estimate could be studentized as Θ̂_{i}^{*}' = [Θ̂_{i}^{*} − Θ̂] / ŝ_{i}^{*}, where ŝ_{i}^{*} is the standard deviation of Θ̂^{**}_{i}, your bootstrap samples of the ith bootstrap sample of your original sample.
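Putting the pieces together, a bootstrap-t interval for a sample mean might be sketched as follows. This is a hypothetical illustration: B, M, and the rank convention are arbitrary choices, and for a real mean you would normally use the analytic standard error rather than a second stage of resampling.

```python
import math
import random

def bootstrap_t_interval(sample, B=499, M=25, alpha=0.05, seed=7):
    """Bootstrap-t (studentized) interval for a sample mean, with the
    standard error of each first-stage estimate obtained by a second
    stage of resampling."""
    rng = random.Random(seed)
    n = len(sample)

    def mean(y):
        return sum(y) / len(y)

    def sd(y):
        m = mean(y)
        return math.sqrt(sum((v - m) ** 2 for v in y) / len(y))

    theta_hat = mean(sample)
    thetas, ts = [], []
    for _ in range(B):
        first = [rng.choice(sample) for _ in range(n)]  # Y2: first stage
        theta_i = mean(first)
        # Y3: resample the resample to estimate theta_i's standard error.
        second = [mean([rng.choice(first) for _ in range(n)])
                  for _ in range(M)]
        se_i = sd(second)
        if se_i > 0:  # guard against a degenerate (constant) resample
            thetas.append(theta_i)
            ts.append((theta_i - theta_hat) / se_i)

    ts.sort()
    se_hat = sd(thetas)  # first-stage estimate of theta_hat's standard error
    lo_t = ts[int(len(ts) * alpha / 2)]
    hi_t = ts[int(len(ts) * (1 - alpha / 2)) - 1]
    return theta_hat - hi_t * se_hat, theta_hat - lo_t * se_hat

data = [3.1, 4.7, 2.2, 5.9, 3.8, 4.4, 2.9, 6.3, 3.5, 4.1]
lo, hi = bootstrap_t_interval(data)
```

Notice the computational load: B × M second-stage resamples in all, which is the 'radical increase' referred to below.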
How you go about obtaining these secondstage estimates (or the degrees of freedom for t) depends upon whether you are using a nonparametric or parametric bootstrap model.
In principle, for smoothly-distributed estimators that obey the central limit theorem, bootstrap t statistics enable coverage error to be reduced from O[^{1}/_{√n}] to O[^{1}/_{n}], and, using more complex corrections, still further (at least in theory). Notice also that although pivotal statistics do not have to be asymptotically normal, asymmetry and/or unsmoothness upset most methods of estimating two-sided confidence limits.
In practice, bootstrap pivotal statistics have three other disadvantages.
- They radically increase the computational load (usually by 50-fold at least).
- Because each stage of resampling simplifies and distorts the cumulative distribution function, the coverage error can seldom be reduced below O[^{1}/_{n²}].
- Bootstrap-t intervals can turn out to be infinitely long, or short, or simply impossible, such as confidence limits for a proportion that are < 0, or > 1.
The net result is that, although there are quite a number of more-or-less sophisticated bootstrap methods available, because Efron's simple percentile confidence limits are easy to compute and interpret, and, when used intelligently, relatively robust, they are surprisingly popular.
