"The property of being bootstrappable might well be added to those of efficiency, robustness and ease of computation, as a fundamentally desirable property for statistical procedures in general."
Peter Hall in Brown, B.M., Hall, P., & Young, G.A. (2001). The smoothed median and the bootstrap. Biometrika, 88(2), 519-534.
"An approximate answer to the right problem is worth a good deal more than an exact answer to an approximate problem."
Tukey quoted in Meier, P. (1975). Statistics and medical experimentation. Biometrics, 31, 511-529.
Bootstrapping (sampling with replacement) is an increasingly popular way of obtaining confidence intervals for otherwise intractable statistics - and is sometimes used to estimate parameters and for significance tests. Unlike jackknifing, the justification for bootstrap models is that they are assumed to mimic the process by which your sample was obtained.
Bootstrap confidence intervals are useful where:
- The statistic's distribution is unknown;
- Or the formula for estimating the statistic's distribution is too approximate;
- Or the formula's predictions need to be checked.
For many commonly-used statistics there is, quite simply, no analytical formula to estimate their standard error - or to attach confidence limits.
Using (analytical) parametric methods it is difficult to estimate confidence intervals for growth rates, hazard functions, genetic distances, heritability indices, niche overlap indices, the Gini coefficient (of inequality of plant sizes), species richness indices (such as the Jaccard index of similarity of species composition), population estimates using mark-release-recapture (MRR), dose response estimates, population density from line transects, catch per unit effort (e.g. in fisheries), and time interval estimators (such as tumour onset following exposure or post treatment). Unfortunately, jackknifing can run into serious difficulties for non-normal estimators such as variance ratios, correlation coefficients, and X-squared error probabilities.
- For generality, let us use Θ̂ to represent your observed value of any one of these estimates, where Θ is the value being estimated.
- To avoid becoming tied up in terminological knots, we shall describe the reasoning in terms of a statistic calculated from a single sample of n observations. This does not imply the underlying reasoning only applies to single samples, merely that the explanation is much easier from that perspective.
Bootstrapping is particularly useful where the statistic's distribution depends upon the population being sampled, or where the statistic's bias varies with sample size - but there are a few statistics for which bootstrapping cannot estimate confidence intervals.
Although there has been a great deal of theoretical and practical research on bootstrapping, the most popular interval estimator among biologists (although the method used is seldom specified) is Efron's simple, percentile, non-parametric (backwards) bootstrap.
For a simple percentile bootstrap interval:
- Your observed estimate, Θ̂, is assumed to be the best available estimate of the population parameter, Θ.
- The distribution of the n observations that Θ̂ was calculated from is assumed to be the best estimate of their population's distribution - and is therefore used as a model of that population.
- The sampling process is simulated by taking n observations from that sample, at random with replacement, and a bootstrap estimate ( Θ̂* or, if you prefer, Θ̂_B ) is calculated identically to its original observed value, Θ̂.
- Given a sufficient number of bootstrapped replicates, the limits that cut off the most deviant 100α% of these bootstrap estimates (100α/2% in each tail) provide an estimate of the corresponding theoretical interval. That 'simple percentile interval' is assumed to estimate the 100(1−α)% confidence interval about Θ̂.
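The steps listed above can be sketched in a few lines of Python. This is only a minimal illustration, assuming NumPy is available and that your observations are a simple random sample. The function name, its arguments, and the lognormal test data are all hypothetical choices of ours.

```python
import numpy as np

def percentile_bootstrap_ci(sample, statistic, n_boot=2000, alpha=0.05, seed=0):
    """Simple percentile bootstrap interval (hypothetical helper, for illustration)."""
    rng = np.random.default_rng(seed)
    sample = np.asarray(sample)
    n = sample.size
    # Simulate the sampling process: n observations, at random, with replacement.
    boot_stats = np.array([
        statistic(rng.choice(sample, size=n, replace=True))
        for _ in range(n_boot)
    ])
    # Cut off the most deviant 100*alpha% of estimates, half in each tail.
    lower, upper = np.quantile(boot_stats, [alpha / 2, 1 - alpha / 2])
    return lower, upper

# Illustrative data only: 30 values from a skewed (lognormal) population.
y = np.random.default_rng(1).lognormal(size=30)
lo, hi = percentile_bootstrap_ci(y, np.median)
print(f"median = {np.median(y):.3f}, 95% percentile interval = ({lo:.3f}, {hi:.3f})")
```

Any function of a single sample can be passed as `statistic`, which is what makes the method attractive for estimators that have no standard error formula.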
It may surprise you to learn that, although bootstrapping was devised by Efron in 1979, it is still somewhat controversial. One reason for this is that, because the amount of computation is comparatively large, bootstrap estimators are much more recent than the usual 'textbook' statistics (and jackknifing, for that matter). Fortunately, although some statisticians complained that bootstrapping was just a way of getting a 'free lunch', it is now generally accepted that bootstrapping is perfectly valid upon theoretical grounds - provided its assumptions are reasonable. Unfortunately, an awful lot of biologists either believe that (being nonparametric) bootstrap estimators do not make any assumptions, or simply do not bother to verify the assumptions are met.
So, before we go much further, we would be wise to understand the limitations inherent to bootstrapping in general, and simple nonparametric confidence limits in particular. To expose the assumptions, and some of their implications, let us consider four simple questions.
- Why should re-sampling samples enable me to estimate how my statistic varies?
- How many resample-statistics do I need to calculate?
- How well can bootstrapping estimate bias and standard error?
- Are bootstrap estimates assumed to be t-distributed?
- Why re-sampling samples enables me to estimate how my statistic varies
(Bootstrap estimates of )
For bootstrapping to estimate how your sample statistic varies, the first and most important assumption is that your sample of observations ( Y1 ) reflects the properties of the 'parent' population ( Y0 ) from which those observations were drawn. This assumption is frequently violated because observations were selected non-randomly. Bootstrapping cannot overcome selection bias! Similarly, because resampling has to mimic the way your observations were obtained, simple random bootstrap sampling is inappropriate for serially-correlated, clustered, multilevel, or otherwise associated observations. More complex models have been developed for a few of these designs.
Provided you can assume that your sample provides a reasonable representation of the population it was drawn from, it is similarly reasonable to use one as a model of the other. Each bootstrap sample ( Y2 ) is selected in the same way as your original sample ( Y1 ).
In which case resampling your sample is justified as being a simulation of your original sampling process - and, assuming Y0 is infinite, Y2 is selected from Y1 with replacement.
A permutation test, in contrast, makes no assumptions regarding how your observations were obtained - and, because your inference is confined to the result of assigning those values, they are selected without replacement.
More formally, given that the observed cumulative distribution of your sample is an estimate of the cumulative distribution of its population, the cumulative distribution of bootstrap estimators calculated from that sample can be expected to reflect the (true, population, or parametric) cumulative distribution of your estimator.
Random sampling and complex models aside, the most obvious limitation of this approach is the size of your original sample.
- The larger your sample is the more reliably and more completely it can be expected to reflect the composition of its population.
- Perversely, if the central limit theorem is effective, the larger your sample the simpler its distribution - and the less important these details are.
- Small samples, on the other hand, suffer from a variety of problems.
- Very obviously, they tend to be unreliable. So, regardless of how sophisticated your analysis might be, their results will carry less weight.
- The distribution of any statistic is liable to be more complex, much harder to predict analytically, and much more directly related to the particular population you have sampled.
- Small samples can only reflect the grossest aspects of their population: in other words its location and, less reliably, its variance - and, rather less reliably, its skew - and, if it was very pronounced (e.g. bimodal), you might get some indication of kurtosis.
Although the cumulative distribution of your sample may be roughly similar to its population - on average at least, this is almost never true of the frequency distribution. Even if your population has a continuous normal distribution function, your sample will always have a discrete distribution - resampling a sample therefore produces a number of irregular, randomly positioned, steps in the bootstrap statistic's cumulative distribution. This spurious 'fine structure' not only limits the detail in your model, it is also a source of annoying artefacts and 'crazy' results - all of which limits the power (and reliability) of your inference. This loss of detail, and tendency to artefacts, bedevils a number of the more sophisticated bootstrapping techniques, especially where they are applied to 'residuals' - in other words to the deviations of observations from some model's predictions.
Whereas bootstrapping is most useful for moderately large samples, even in theory, there are definite limits upon what can be expected - if for no other reason than the fact that the most extreme results in any population tend to be rarest, and hence least liable to be represented in a sample. One consequence of this is the amount of variation tends to be underestimated, and extreme quantiles are both variable and biased. Correcting the standard error is relatively simple if your statistic is a mean (or behaves as if it is), but for less tractable estimators the approximation error may be reduced by calculating a studentized-bootstrap estimator (also known as a bootstrap-t statistic), for which the standard error of your statistic is estimated by a second stage of resampling - in other words, by repeatedly resampling each bootstrap sample.
The notation for bootstrap estimators varies somewhat.
A bootstrap statistic may be distinguished from a sample statistic, Θ̂, in a number of ways. For example by writing Θ̂* or Θ̂^B or Θ̂_B or Θ̂(B) or Θ(B) or even Θ̃ - which is misleading, because a tilde (~) is used to signify approximately unbiased estimators.
So, to avoid confusion we shall use Θ̂*. Since we need to refer to a number of bootstrap statistics (B of them), this allows us to refer to a particular statistic (the ith) as Θ̂*ᵢ
Given that Θ̂ is a plug-in estimator of Θ (in other words, they are both calculated using the same formula) these quantities could also be expressed using function notation.
In other words, where Y1 is a random sample of Y0, we could write the parameter as Θ[Y0], the sample estimator as Θ[Y1], and the bootstrap estimator as Θ[Y2].
Pursuing this line of reasoning, it follows that Θ[Y3], or Θ̂**, would be an estimate calculated from a second stage sample - a bootstrap sample of a bootstrap sample.
In a nonparametric bootstrap Θ̂**ᵢⱼ would be an estimate calculated from the jth sample, of the ith sample, of your original sample.
"The assumption made in a bootstrap is basically that the sorts of errors made in inferring from a sampled sample to the sample (errors that we can see) are similar to the sorts of errors made in inferring from the sample to the population (errors that we can't see)."
An ex-algebraist who lost his ideals, his associates, and finally his identity
- How many bootstrap-statistics are needed
Working out the result of every possible sample of n observations entails rather a lot of computation - because there are ( B = ) n^n possible ways of selecting n items, with replacement, from a set of n distinct items.
By comparison, there are a mere n! ways of selecting n items without replacement. So you may wish to compare this discussion with the one for permutation tests.
In either case, if you ignore the order in which they are selected, there are rather fewer possible combinations. However even if you just work out every possible combination, for them to be of any use, you also need to work out the probability of obtaining each combination. This can pose a fairly knotty computational problem in the 'simple' situation where every different combination of observations yields a different value of Θ* and when no two of your observations have the same value. Even today the entire, 'exact' distribution of bootstrapped estimates is seldom calculated for samples containing more than 10 observations - and, given the problems with small samples described above, these are largely of theoretical interest.
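To get a feel for how quickly these numbers grow, the counts can be tabulated directly. The sketch below assumes n distinct (untied) observations; the function name is our own. The number of distinct combinations, ignoring order, is the standard count of multisets of size n drawn from n items.

```python
from math import comb, factorial

def resample_counts(n):
    """Counts of possible resamples of n distinct items (illustrative helper)."""
    ordered_with_replacement = n ** n             # every ordered bootstrap sample
    ordered_without_replacement = factorial(n)    # every permutation
    # Unordered bootstrap samples: multisets of size n from n distinct items.
    distinct_combinations = comb(2 * n - 1, n)
    return ordered_with_replacement, ordered_without_replacement, distinct_combinations

for n in (5, 10, 20):
    print(n, resample_counts(n))
```

Even ignoring order, the number of distinct combinations grows far too quickly for exhaustive enumeration to be practical beyond small n.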
In practice there is little to be gained from calculating the entire bootstrap distribution. By obtaining ( B ) bootstrap estimators, Monte Carlo bootstrapping randomly samples the exact distribution. From which it follows that, the larger a 'sample' of bootstrap statistics you calculate, the better an estimate of their exact distribution you should obtain. One objection to this reasoning is that, the more times you resample your data, the more likely you are to calculate the same result more than once. To estimate this probability, you need to know the probability of achieving the most likely bootstrap statistic.
The results of such mathematics are instructive. As you might expect, if you resample a set of observations, the most likely combination of observations is identical to the sample being resampled. However, if your sample is of an untied continuous variable, and you obtain ( B = ) 2000 bootstrap samples each of 20 observations, the probability that two or more of these samples contain the same combination of observations is less than 0.046 (so the probability that none is repeated exceeds 0.954) - or less than one in 20. For samples above that size, the probability converges quite rapidly towards zero.
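One way to check this sort of figure is with a simple upper bound, sketched below under the assumption of n untied observations. The most likely combination (each observation selected exactly once) has probability n!/n^n, and the chance that any two of B resamples coincide cannot exceed the number of pairs multiplied by that probability. The function name is hypothetical, and this is a bound rather than the exact probability.

```python
from math import comb, factorial

def duplicate_probability_bound(n, n_boot):
    """Upper bound on P(at least two of n_boot resamples share a combination)."""
    # Probability of the most likely combination: every observation appears once.
    p_max = factorial(n) / n ** n
    # P(a given pair matches) = sum of squared probabilities <= p_max,
    # so a union bound over all pairs gives:
    return min(1.0, comb(n_boot, 2) * p_max)

bound = duplicate_probability_bound(20, 2000)
print(f"P(duplicates) <= {bound:.4f}, so P(no duplicates) >= {1 - bound:.4f}")
```

For n = 20 and B = 2000 the bound is a little under 0.05, so the chance of no repeated combination is at least about 0.954.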
In principle therefore, for all practical purposes, duplication of effort is not considered a problem. A further consequence is that the distribution of linear additive statistics (such as the mean) is often regarded as being continuous - or smooth. At the same time, because you are re-sampling a discrete distribution of observations, this assumption is unreasonable for small samples. Nor does it imply the distribution of your bootstrap estimators is unbounded - and these bounds limit the width of the estimator's tails. Indeed, the bounds are all too easy to calculate: take n copies of the smallest value in your sample, and n copies of the largest - then apply your estimator to each. In addition, if your sample is heavily tied (such as binary data), or your estimating function truncates or pools values (such as where it calculates a median or maximum), the distribution of bootstrap statistics will be a discrete, unsmooth function of your sample - which upsets this simple model.
One way around this problem is to smooth the distribution of Θ* by jittering your resampled values, for example using a Gaussian smoothing function - one side-effect of which is to make a complete enumeration of outcomes impossible, for any sample size, no matter how small.
These points aside, the number of bootstraps you need to perform depends upon whether you are trying to estimate moments or tails. In principle at least, you would expect means and standard deviations to require fewer bootstrap samples than skews or tail-end proportions. For a simple two-tailed 95% percentile confidence interval, a minimum of ( B = ) 1000 estimates is generally recommended - but 5000 is more common. However, to avoid the need for interpolation, many people prefer to use B = 999 or 9999 resamples.
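The reason for preferring 999 or 9999 can be seen by working out which ordered bootstrap estimates form the limits. The sketch below assumes one common convention, in which the lower limit is the kth smallest estimate and the upper limit the kth largest, where k = (B + 1)α/2. The function name is our own, and other quantile conventions exist.

```python
import numpy as np

def percentile_limits_by_rank(boot_stats, alpha=0.05):
    """Percentile limits taken directly from ordered estimates (illustrative)."""
    ordered = np.sort(np.asarray(boot_stats))
    b = ordered.size
    k = (b + 1) * alpha / 2                  # rank of the lower limit
    if abs(k - round(k)) > 1e-9:
        raise ValueError("(B + 1) * alpha / 2 is not a whole number: interpolation needed")
    k = int(round(k))
    return ordered[k - 1], ordered[b - k]    # kth smallest and kth largest

# With B = 999 and alpha = 0.05, k = 25: the 25th and 975th ordered estimates.
stand_in = np.arange(1, 1000)                # stand-in values 1..999, not real estimates
print(percentile_limits_by_rank(stand_in))
```

With B = 1000 the same rank is 25.025, so the limits would fall between two ordered estimates.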
Bootstrap estimates of bias and standard error
To the extent that your sample represents its population, you would expect that bootstrap statistics ought to reflect the bias and standard error of estimates calculated from random samples of that entire population.
- Bootstrap Bias
Like jackknife statistics, bootstrap estimators are not assumed to be unbiased estimators of the population parameter. Instead it is assumed that, if the sample statistic ( Θ̂ ) provides a biased estimate of its parameter ( Θ ), the bootstrap statistic ( Θ̂* ) provides a similarly biased estimate of the sample statistic.
If your sample statistic is unbiased, and has no inherent a priori value, the best estimate of its parameter is usually its observed value, Θ̂. If its bootstrap estimators are distributed symmetrically, their mean is the best estimate - but, for skewed distributions, their median may be a better measure. For simplicity, let us say the bias ( b ) is the difference between your observed statistic and its parameter (b = Θ̂ − Θ).
- If the estimator has a bias of b, then Θ̂ = Θ + b.
- If the bootstrap estimator has a bias, b̂, then Θ̂* = Θ̂ + b̂ and Θ̂* = Θ + b + b̂.
- As a first approximation, assuming b = b̂ then Θ = Θ̂ − b̂, and Θ = Θ̂* − 2b̂.
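As a rough illustration of this reasoning, the sketch below estimates the bias of the plug-in (population formula) variance, which is known to underestimate the population variance in small samples. The bias is estimated as the mean bootstrap estimate minus the observed estimate, and then subtracted from the observed estimate. The function name and test data are hypothetical, and the mean is used here on the assumption that the bootstrap estimates are not too skewed.

```python
import numpy as np

def bootstrap_bias_correct(sample, statistic, n_boot=2000, seed=0):
    """Return (observed estimate, bootstrap bias estimate, corrected estimate)."""
    rng = np.random.default_rng(seed)
    sample = np.asarray(sample)
    theta_hat = statistic(sample)
    boot = np.array([
        statistic(rng.choice(sample, size=sample.size, replace=True))
        for _ in range(n_boot)
    ])
    bias_hat = boot.mean() - theta_hat     # bootstrap estimate of the bias
    corrected = theta_hat - bias_hat       # equivalently 2 * theta_hat - boot.mean()
    return theta_hat, bias_hat, corrected

# Illustrative data only: 15 standard normal values; np.var uses the population formula.
y = np.random.default_rng(2).normal(size=15)
theta_hat, bias_hat, corrected = bootstrap_bias_correct(y, np.var)
print(theta_hat, bias_hat, corrected)
```

Because the plug-in variance is biased downwards, the estimated bias should come out negative and the corrected estimate should be larger than the observed one.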
Notice however, that if you are trying to correct confidence limits, this estimate of bias can only be applied directly when your statistic is distributed symmetrically. In addition, whilst the standard deviation of bootstrap estimates generally provides a rather better estimate than one calculated by jackknifing, bootstrapping is not always such a reliable estimate of bias. One reason for these problems is the fact that the random resampling process introduces its own quota of variation, because the (combined set of) observations in your B bootstrap samples are unlikely to have the same distribution as the sample you are resampling. A simple way of improving the reliability of these estimates is to perform what is known as 'balanced' bootstrapping.
- Balanced bootstrapping is performed in two steps.
- B copies of your entire sample are taken (with replacement) - and combined to form a finite model population.
- B resamples are taken from this model population (without replacement) - each of which provides you with one bootstrap statistic.
However, whilst this procedure improves the reliability of some parameter estimates, it is less effective in improving confidence intervals.
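The two steps of a balanced bootstrap can be sketched as follows. This is a minimal illustration with a hypothetical function name: the B copies are pooled, shuffled, and dealt out as B resamples of n, so every observation is used exactly B times in total.

```python
import numpy as np

def balanced_bootstrap(sample, statistic, n_boot=1000, seed=0):
    """Balanced bootstrap estimates (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    sample = np.asarray(sample)
    n = sample.size
    pool = np.tile(sample, n_boot)         # step 1: B copies form the model population
    rng.shuffle(pool)                      # random order = sampling without replacement
    resamples = pool.reshape(n_boot, n)    # step 2: deal out B resamples of n
    return np.array([statistic(r) for r in resamples])

y = np.random.default_rng(3).normal(size=12)   # illustrative data only
boot_means = balanced_bootstrap(y, np.mean)
print(y.mean(), boot_means.mean())
```

For the mean, the average of the balanced bootstrap estimates equals the sample mean (apart from rounding), so none of the Monte Carlo variation leaks into the bias estimate.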
- Bootstrap Standard Error
As a first approximation, the standard deviation of your bootstrap statistics ( Θ̂* ) provides a valid estimate of the standard error of your sample statistic ( Θ̂ ). Indeed, one advantage of estimating the standard error this way is that you need comparatively few resamples for the Monte Carlo variation to be negligible.
A disadvantage of this method is that, unless your sample comprises the entire population of observations, the standard deviation of Θ̂* is liable to underestimate the dispersion of Θ̂.
For example we took 10000 samples, each of (n=) 10 observations, from a standard normal population (μ = 0, σ = 1). We calculated the means of these 10000 samples and, using the population variance formula, we found their standard deviation. As you might expect, the observed standard deviation (0.31673) of these 10000 means was very similar to the standard error we would predict using the standard analytical formula (σ/√n = 0.31623).
To obtain a bootstrap estimate of this standard deviation we resampled each of these 10000 samples (with replacement). In other words we took 100 bootstrap samples, of 10 observations, from each sample of 10 observations - and calculated their means. Then, using the population variance formula as a plug-in estimator, we calculated the standard deviation of each set of 100 bootstrap means - giving us 10000 standard deviations in all.
As expected, both the average standard deviation (0.29045) and the median (0.28620) standard deviation of these bootstrapped means were rather lower than the standard deviation of our sample means (0.31673).
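A simulation along the lines just described can be sketched as below. It is our own reconstruction rather than the code used for the figures quoted, so the exact values will differ with the random seed; the function name and defaults are assumptions.

```python
import numpy as np

def se_underestimation(n_samples=10000, n=10, n_boot=100, seed=0):
    """Compare the SD of sample means with bootstrap estimates of it (sketch)."""
    rng = np.random.default_rng(seed)
    samples = rng.standard_normal((n_samples, n))    # mu = 0, sigma = 1
    sd_of_means = samples.mean(axis=1).std()         # population formula (ddof=0)
    boot_sds = np.empty(n_samples)
    for i, y in enumerate(samples):
        idx = rng.integers(0, n, size=(n_boot, n))   # n_boot resamples, with replacement
        boot_sds[i] = y[idx].mean(axis=1).std()      # SD of this sample's bootstrap means
    return sd_of_means, boot_sds.mean(), np.median(boot_sds)

sd_of_means, mean_boot_sd, median_boot_sd = se_underestimation()
print(sd_of_means, mean_boot_sd, median_boot_sd)
```

The bootstrap standard deviations should come out noticeably below the standard deviation of the sample means, which itself should be close to σ/√n.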
For a few statistics, such as the sample mean (ΣY/n), it is possible to estimate the standard error of Θ̂ from your sample - very often it is not. However, whilst the simple bootstrap standard error provides a first order estimate of your estimate's dispersion, it tends to be too small - so bootstrapping tends to produce confidence limits that are slightly narrower than they ought to be. This constraint is no different for nonparametric confidence intervals - even when obtained by test-inversion.
To reduce approximation error, second-stage bootstrap resampling may be used to calculate the distribution, or standard error, of t−statistics and other 'studentized' estimators.
"Bootstrap approximations are so good that if we use bootstrap estimates of erroneous theoretical points, we commit noticeable errors."
Hall, P. (1988). Theoretical comparison of bootstrap confidence intervals. The Annals of Statistics, 16(3), 927-963
Bootstrap t statistics
Given that statistics are commonly named after their test-distributions, you might reasonably assume that a 't-statistic' is a statistic that is t-distributed.
Bootstrapping makes no assumptions as to how your observations are distributed and, allowing for the limitations noted above, bootstrap estimates are assumed to be distributed in the same way as the estimates themselves. Where your statistic has an asymptotic normal distribution, the bootstrap statistic is assumed to do the same thing - at least as a first approximation. Given which, 1.96 × the standard error (estimated as above) enables you to attach 95% confidence limits to a number of (otherwise rather awkward) statistics.
The problem with this attractively simple approach is that, because sampling loses information, you must expect bootstrapping to underestimate the standard deviation of your sample statistic ( Θ̂ ) - yielding confidence limits with a nominal coverage error of O[n^(−½)]. In other words, all else being equal, when n is large this error is of the order of 1/√n. As a result, even though Θ̂* may be normal, resampling introduces an error equivalent to using the normal approximation for a t-distributed estimate.
For bootstrapping to exceed the accuracy of the normal approximation you have to use pivotal estimates (pivots) - in other words statistics which have the same distribution, regardless of their population parameters. For example the studentized sample mean, Ȳ' = [Ȳ − μ] / s[Ȳ], is asymptotically t-distributed where the standard error of Ȳ ( s[Ȳ] ) is estimated from the standard deviation of the observations in each sample, sY. (Remember, you cannot estimate the standard error of most statistics using sY / √n - or an equivalent analytical formula!)
Assuming your statistic is unbiased, there are three problems with this approach.
- You need to estimate the standard error for each bootstrap estimate.
Studentizing your estimates using a constant standard error (such as the standard deviation of your bootstrap estimates, s[Θ̂*]) cannot mimic the way in which the estimated standard error would vary when calculated from samples of a real population. So, if Θ̂ was asymptotically normal, your studentized bootstrap statistics would also be normal - rather than t-distributed.
- You need to know how many degrees of freedom the t distribution should have.
Simply because you sample B estimates of a bootstrap distribution does not mean you can assume those estimates are t-distributed with B − 1 degrees of freedom.
- Although not all pivotal estimates are asymptotically normal, pivotalness is an asymptotic property.
The nominal coverage error you will usually see quoted assumes your studentized estimates are pivotal, symmetrical, homoscedastic, smooth, etcetera.
Provided you can reasonably assume your resampling process mimics the original sampling process, you can simulate the variation in standard error by resampling each of your B bootstrap resamples - and use the standard deviation of each set of second-stage bootstrap statistics ( s[Θ̂**ᵢ] ) as an estimate of the standard error.
For example - using our earlier notation, where Θ̂ is unbiased, the ith bootstrap estimate could be studentized as Θ̂*ᵢ' = [Θ̂*ᵢ − Θ̂] / s[Θ̂**ᵢ] - where s[Θ̂**ᵢ] is the standard deviation of Θ̂**ᵢ, the estimates from your bootstrap samples of the ith bootstrap sample of your original sample.
How you go about obtaining these second-stage estimates (or the degrees of freedom for t) depends upon whether you are using a nonparametric or parametric bootstrap model.
In principle, for smoothly-distributed estimators that obey the central limit theorem, bootstrap t statistics enable coverage error to be reduced from O[1/√n] to O[1/n] - and using more complex corrections, still further (at least in theory). Notice also that although pivotal statistics do not have to be asymptotically normal, asymmetry and/or unsmoothness upset most methods of estimating 2-sided confidence limits.
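A nonparametric bootstrap-t interval using second-stage resampling might be sketched as follows. This is only one of several possible arrangements, and the function name and defaults are our own. It assumes the statistic is roughly unbiased, uses 50 second-stage resamples per first-stage resample, and takes the standard deviation of the first-stage estimates as the standard error of the observed statistic. If a resample happens to contain identical values throughout, its second-stage standard deviation is zero and the studentized value is undefined.

```python
import numpy as np

def bootstrap_t_ci(sample, statistic, n_boot=999, n_inner=50, alpha=0.05, seed=0):
    """Studentized (bootstrap-t) interval via nested resampling (sketch)."""
    rng = np.random.default_rng(seed)
    sample = np.asarray(sample)
    n = sample.size
    theta_hat = statistic(sample)
    boot = np.empty(n_boot)
    t_stats = np.empty(n_boot)
    for i in range(n_boot):
        resample = rng.choice(sample, size=n, replace=True)
        boot[i] = statistic(resample)
        # Second stage: resample the resample to estimate its own standard error.
        inner = np.array([
            statistic(rng.choice(resample, size=n, replace=True))
            for _ in range(n_inner)
        ])
        t_stats[i] = (boot[i] - theta_hat) / inner.std(ddof=1)
    se_hat = boot.std(ddof=1)              # standard error of the observed statistic
    t_lo, t_hi = np.quantile(t_stats, [alpha / 2, 1 - alpha / 2])
    # The quantiles swap ends when the studentized statistic is inverted.
    return theta_hat - t_hi * se_hat, theta_hat - t_lo * se_hat

y = np.random.default_rng(4).exponential(size=20)   # illustrative data only
print(bootstrap_t_ci(y, np.mean))
```

Note that the statistic is evaluated n_boot × (n_inner + 1) times, which is where the extra computational load of this approach comes from.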
In practice, bootstrap pivotal statistics have three other disadvantages.
- They radically increase the computational load (usually by 50-fold at least).
- Because each stage of resampling simplifies and distorts the cumulative distribution function, the coverage error can seldom be reduced below O[1/n²].
- Bootstrap-t confidence intervals can turn out to be infinitely long, or short, or simply impossible - such as confidence limits for a proportion that are < 0, or > 1.
The net result is that although there are quite a number of more-or-less sophisticated bootstrap methods available, because Efron's simple percentile confidence limits are easy to compute and interpret - and, when used intelligently, relatively robust - they are surprisingly popular.