"It has long been an axiom of mine that the little things are infinitely the most important" (Sherlock Holmes)



Given the importance of this question, it is not surprising there are a number of ways of answering it - not least because statisticians have devised so many different types of estimator, and almost as many ways of classifying them.

    For example Units 1 to 5 concentrated upon 'point' estimators such as means, differences, variances, and P-values, whereas this Unit concentrates upon 'range' ('interval') estimators, such as confidence limits.

An estimator is simply a method by which you obtain an estimate, bearing in mind that there may be any number of methods available.

    For instance, the arithmetic mean, geometric mean, trimmed mean, and midrange could all be used to provide (point) estimates of a given population's median. Which of these will provide the most reliable estimate depends upon how much information you supply them with, how it was obtained, and what sort of population you are estimating the median of.
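
As a rough illustration of several estimators being applied to one sample, the sketch below computes each of the location estimates named above. The data are made up, and the helper names (trimmed_mean, midrange, location_estimates) are hypothetical, chosen only for this example.

```python
import statistics

def trimmed_mean(values, proportion):
    """Mean after dropping `proportion` of the observations from each end."""
    ordered = sorted(values)
    k = int(proportion * len(ordered))  # number dropped from each tail
    return statistics.fmean(ordered[k:len(ordered) - k])

def midrange(values):
    """Halfway point between the smallest and largest observation."""
    return (min(values) + max(values)) / 2

def location_estimates(values):
    """Apply several point estimators of location to the same sample."""
    return {
        "arithmetic mean": statistics.fmean(values),
        "geometric mean": statistics.geometric_mean(values),
        "20% trimmed mean": trimmed_mean(values, 0.2),
        "midrange": midrange(values),
        "sample median": statistics.median(values),
    }

sample = [1, 2, 4, 8, 16]  # made-up, right-skewed observations
for name, value in location_estimates(sample).items():
    print(f"{name:>17}: {value:.3f}")
```

For a skewed sample like this one the estimates disagree quite markedly, which is why the choice of estimator matters.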

Since confidence intervals are a popular way to indicate the reliability of a point estimate, and interval estimates have problems peculiar to themselves, let us consider them first.


↔   Interval estimates

Interval estimates, whilst sharing many of the problems of point estimates, tend to be assessed rather differently. To understand the reasoning and shortcomings of these methods, we must consider how these intervals, and their estimates, are defined.

In essence, a confidence interval ( Î ) estimates a range ( I ) which encloses a population statistic ( Θ ). The width of I is usually set according to what proportion ( α ) of all estimates ( Θ̂ ) of Θ you wish to exclude from that range (say 5%). Provided Θ̂ is distributed symmetrically, I is located centrally about Θ.

This arrangement has two important properties:

  • We would expect the interval, I, to enclose the most typical (1 − α) estimates of Θ.
  • Conversely, any estimate of Θ outside that range should be rejected by a comparable test (at P<α ).

From which it follows that, when Θ̂ = Θ and Î = I the same ought to be true - even if Θ̂ is distributed asymmetrically. In addition, you could assume that a good estimate of I would perform best. However, owing to the parametric normal domination of statistical models, a quite different 'frequentist' criterion was used.


  • Coverage error

Assuming your confidence intervals are good estimates of I, when these estimated intervals ( Î ) are attached to estimates of Θ, a predetermined proportion ( 1 − α, or 95%) of these intervals are expected to enclose Θ - at least on average. If, as predicted by this model, exactly 1 − α (or 95%) of confidence intervals are found to enclose Θ, this is described as being a perfect coverage.

The most popular measure of the quality of an interval estimator, known as the coverage error, is simply the difference between the observed and expected coverage. Confusingly, for reasons of mathematical convenience, the formulae for this generally assume that Θ̂ is distributed symmetrically, and you are calculating the (equivalent 2-tailed) interval between two equal 1-tailed confidence limits. In other words, coverage error assumes a different definition of confidence limits from the one above.

The problem with this measure is that it wholly ignores the length of confidence intervals, and what happens where Θ̂ is not distributed symmetrically about Θ. Interest in alternate measures of interval estimates and alternate ways of constructing confidence limits is comparatively recent.
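
Coverage error can be estimated by simulation. The sketch below is only an illustration under stated assumptions: it uses ordinary t-intervals for a mean, samples of n = 20 (so the critical value 2.093 is Student's t for 19 degrees of freedom), and two arbitrary populations, one normal and one exponential. The function names are hypothetical.

```python
import math
import random
import statistics

T_CRIT_19DF = 2.093  # two-tailed 5% critical value of Student's t, 19 d.f.

def t_interval(sample, t_crit):
    """Conventional symmetric confidence interval for a mean."""
    n = len(sample)
    mean = statistics.fmean(sample)
    se = statistics.stdev(sample) / math.sqrt(n)
    return mean - t_crit * se, mean + t_crit * se

def coverage(draw, true_mean, n=20, reps=4000, seed=1):
    """Proportion of simulated intervals that enclose the true mean."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        sample = [draw(rng) for _ in range(n)]
        lower, upper = t_interval(sample, T_CRIT_19DF)
        hits += lower <= true_mean <= upper
    return hits / reps

NOMINAL = 0.95
cover_normal = coverage(lambda rng: rng.gauss(0.0, 1.0), true_mean=0.0)
cover_skewed = coverage(lambda rng: rng.expovariate(1.0), true_mean=1.0)
print("normal population: coverage error", round(cover_normal - NOMINAL, 3))
print("skewed population: coverage error", round(cover_skewed - NOMINAL, 3))
```

Notice this only counts how often the intervals enclose the parameter. It says nothing about how long they are, nor which side of the interval the misses fall on.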


←   Point estimates

Until quite recently, the two most common criteria for judging the reliability of a point estimator were:

  1. its accuracy, or amount of bias
  2. its precision, variation, or concentration.

a)   Measures of bias

Statisticians use two measures of bias: mean bias and median bias - of which the first is the more popular. Mean bias is simply the average deviation (of estimate from parameter) you would obtain if you used the same estimator upon a large number (R) of equal-sized random samples from the same population.

For R samples the average estimate, Θ̄ = ΣΘ̂/R, would be subject to sample variation - but if there are an infinite number of estimates this formula is of no use, so the bias is expressed as an expected value. In other words, when R = ∞ the average estimate is described as its expected value, so Θ̄ = E[Θ̂].

The mean bias of any estimator is the expected (mean) difference between the estimates and population parameter, or E[Θ̂ − Θ].

If we use M to indicate the median of R estimates, the median bias is simply M[Θ̂ − Θ].

For estimates that are distributed symmetrically about their mean, E[Θ̂], there is no difference between mean and median bias. So E[Θ̂] = M[Θ̂], provided that R is infinite. For example if Θ̂ had an unbiased t-distribution, its expected value would be zero, and Θ̂ would be equally likely to be negative or positive. Or, if you prefer it, if P is the proportion of random estimates, for a median-unbiased estimator P[Θ̂<Θ] = P[Θ̂>Θ].

'Unbiased additive linear' estimators, such as the mean, have no mean bias but, where their estimates have a skewed distribution, these estimators are median biased. Other estimators, such as the plug-in 'population variance' formula ( Σ[Y − Ȳ]² / n ), tend to have both mean and median bias. Notice however that, whilst this estimator has an expected value of σ²[n−1]/n for finite samples (a bias of −σ²/n), that bias vanishes as n approaches infinity, so this estimator is 'asymptotically' unbiased.
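
Both sorts of bias can be checked by simulation. The following sketch assumes samples of n = 5, a standard normal population for the variance example (so σ² = 1 and the plug-in formula should average about 0.8), and a unit exponential population for the skewed example. These choices, and the function names, are illustrative assumptions.

```python
import random
import statistics

def plug_in_variance(sample):
    """Sum of squared deviations from the sample mean, divided by n."""
    m = statistics.fmean(sample)
    return sum((y - m) ** 2 for y in sample) / len(sample)

def simulate_bias(n=5, reps=20000, seed=2):
    rng = random.Random(seed)
    variances = []
    means_below = 0
    for _ in range(reps):
        normal_sample = [rng.gauss(0.0, 1.0) for _ in range(n)]   # variance 1
        variances.append(plug_in_variance(normal_sample))
        skewed_sample = [rng.expovariate(1.0) for _ in range(n)]  # mean 1
        means_below += statistics.fmean(skewed_sample) < 1.0
    # Mean bias of the plug-in variance: average estimate minus parameter.
    mean_bias = statistics.fmean(variances) - 1.0
    # Median bias of the mean shows up as unequal shares either side of 1.
    share_below = means_below / reps
    return mean_bias, share_below

mean_bias, share_below = simulate_bias()
print("mean bias of plug-in variance:", round(mean_bias, 3))
print("share of skewed-sample means below the true mean:", round(share_below, 3))
```

The sample mean is mean-unbiased in both populations, but for the skewed one rather more than half of its estimates fall below the parameter.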


For some years most statisticians assumed the best estimators had to be unbiased (when applied to finite samples). Fairly recently, however, it has been generally accepted that bias is only a serious problem when its extent is unknown. In contrast, the more variable an estimate is - in effect - the less useful information it contains.


b)   Measures of concentration

  • Efficiency

Whilst an ordinary sample mean is an unbiased estimate of its population mean, this does not imply this plug-in estimator is the best estimator of that parameter. For purposes of inference, the least variable, most efficient estimator, provides the most power to discredit a null hypothesis.

  • If your observations have a symmetric frequency distribution, because the population median and mean are the same, the sample median is also an unbiased estimator of the population mean. However, sample medians have a larger standard error than sample means, and a sample mean is the most efficient estimator of the population mean - provided the errors are normal.
  • If your observations are not normally distributed, although the mean is still an unbiased estimator of its parametric value, it is no longer the most efficient. Depending upon the error distribution a variety of statistics have been devised to provide a less variable estimate of the population mean - bearing in mind their estimates may be biased.

To parametric statisticians at least, the most obvious measure of your estimate's variation is its variance. In other words, if Θ̄ is the average of R estimates ( Θ̂ ) of a parameter, Θ, then you might calculate their variance, Var[Θ̂], as the mean squared difference between those estimates and their mean, Σ[Θ̂ − Θ̄]² / R

Because they allowed most power, the least variable estimates became known as the most efficient, and estimators were compared upon that basis.  Unfortunately, which estimator is most efficient varies according to the frequency distribution of the population being sampled. For example, the arithmetic mean is only the most powerful estimate of the population mean if errors are normal.
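
How the most efficient estimator changes with the population can be seen by simulation. This sketch compares the variance of sample means and sample medians for a normal population and for a longer-tailed (Laplace, or double-exponential) one. The sample size of 25 and the choice of the Laplace distribution are assumptions made for the example.

```python
import random
import statistics

def laplace(rng):
    """Double-exponential deviate, with longer tails than the normal."""
    magnitude = rng.expovariate(1.0)
    return magnitude if rng.random() < 0.5 else -magnitude

def estimator_variances(draw, n=25, reps=4000, seed=3):
    """Variance of the sample mean and sample median over many samples."""
    rng = random.Random(seed)
    means, medians = [], []
    for _ in range(reps):
        sample = [draw(rng) for _ in range(n)]
        means.append(statistics.fmean(sample))
        medians.append(statistics.median(sample))
    return statistics.pvariance(means), statistics.pvariance(medians)

var_mean, var_median = estimator_variances(lambda rng: rng.gauss(0.0, 1.0))
normal_ratio = var_median / var_mean    # above 1: the mean is less variable
var_mean, var_median = estimator_variances(laplace)
laplace_ratio = var_median / var_mean   # below 1: the median is less variable
print("variance(median) / variance(mean), normal :", round(normal_ratio, 2))
print("variance(median) / variance(mean), Laplace:", round(laplace_ratio, 2))
```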


  • Relative efficiency

For unbiased estimators, such as the sample mean, efficiency is a perfectly adequate measure of reliability because the expected value is the same as the parametric value, and E[Θ̂] = Θ. But where Θ̂ is biased, a combined measure of estimator bias and variation is required. Of several measures proposed, the mean squared error (MSE) is the most used. This measure is the average squared difference between your estimates and their parametric value, and is often expressed as an expected error, E{[Θ̂ − Θ]²}. Arithmetically, the mean squared error is also the variance of the estimates plus the square of their bias, Var[Θ̂] + {E[Θ̂ − Θ]}²
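
The decomposition of mean squared error into variance plus squared bias is easily checked numerically. This sketch assumes normal samples of n = 5 with σ² = 1, and compares the plug-in variance formula (divisor n) with the usual unbiased one (divisor n − 1). The function names are hypothetical.

```python
import random
import statistics

def variance_estimates(n=5, reps=10000, seed=4):
    """Plug-in and unbiased variance estimates from the same samples."""
    rng = random.Random(seed)
    plug_in, unbiased = [], []
    for _ in range(reps):
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
        plug_in.append(statistics.pvariance(sample))   # divisor n
        unbiased.append(statistics.variance(sample))   # divisor n - 1
    return plug_in, unbiased

def error_summary(estimates, parameter):
    """Return (bias, variance, mean squared error) of a set of estimates."""
    bias = statistics.fmean(estimates) - parameter
    variance = statistics.pvariance(estimates)
    mse = statistics.fmean([(e - parameter) ** 2 for e in estimates])
    return bias, variance, mse

plug_in, unbiased = variance_estimates()
bias_p, var_p, mse_p = error_summary(plug_in, 1.0)
bias_u, var_u, mse_u = error_summary(unbiased, 1.0)
print("plug-in : bias", round(bias_p, 3), "variance", round(var_p, 3), "MSE", round(mse_p, 3))
print("unbiased: bias", round(bias_u, 3), "variance", round(var_u, 3), "MSE", round(mse_u, 3))
```

Under these assumptions the biased plug-in estimator has the smaller mean squared error, because its lower variance more than makes up for its bias.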

Maximum likelihood estimates often have the smallest variance and are sometimes biased. For example, where the population is normal, the arithmetic mean is the same as the maximum likelihood estimate, and is both unbiased and has the smallest variance - whereas the plug-in estimator of the population variance is biased, but has the minimum variance and is equivalent to the maximum likelihood estimate. Where the population distribution is undefined the bias of the most efficient estimator is unknown, and a maximum likelihood estimate cannot be obtained.

The fact that the relative efficiency of two statistics could be defined as the ratio of their mean squared errors led to the idea of minimum variance estimators and best equivalent estimators. For samples of some types of population, particularly normal ones, it has long been known that some estimators are the most efficient possible - and these were described as sufficient estimators. If you compare a mean with a trimmed mean, such as a median (which is the most heavily trimmed mean), the underlying reason is clear. A sufficient estimator summarizes all the useful information contained within your sample - the more information there is available, the less variable are its estimates. Minimax estimators, on the other hand, minimise the greatest error in estimating the parameter - but, because this is at the expense of their power, minimax estimators are too pessimistic (underpowered) for many applications.


c)   Robustness

Where its assumptions are fully met the arithmetic mean, Ȳ = ΣY/n, is the best possible estimator of its parameter by pretty much any criterion you wish to choose. Unfortunately, this almost never happens in real life. Disconcertingly it has been known for many years that, for any reasonable sample size, even quite small departures from some of these assumptions, such as slight skew or kurtosis or contamination, can make the mean anything but a reliable estimator of location.

    The Gauss-Markov theorem showed any arithmetic mean ( Ȳ ) is unbiased provided errors ( ε ) are not correlated, their variance is unrelated to Ȳ and, in more complicated models (such as regression), the effects are additive ('linear'). Notice that this freedom from bias does not assume normality or strictly independent random selection. Statistics, such as regression slopes, that have these desirable properties are called 'best linear unbiased estimators' (BLUE) - and, because they have the smallest Σε², they are also known as 'ordinary least squares' (OLS) statistics. Many of the most popular statistical methods are OLS estimators, including the mean, analysis of variance, and regression fitting formulae.

    Recall however that accuracy does not imply precision - if your sample is of a long-tailed distribution the precision of means and regression slopes can be reduced to the point of uselessness. Since a single highly-aberrant value can radically affect the value of Ȳ, it is said to have a zero breakdown point, whereas a median (the 50% trimmed mean) has the highest breakdown point (=0.5). Alternate approaches to this problem are to find which location gives the minimum sum of the trimmed squared-errors - or use the median absolute-error.

Similar concerns apply to popular measures of dispersion - such as the population variance, Σ([Y − Ȳ]²)/n. When all its assumptions are met, although biased, this plug-in estimator is the maximum likelihood estimator of the population variance, σ² - and has a 12% higher asymptotic relative efficiency than the mean absolute deviation, Σ|Y − Ȳ|/n. If however you contaminate the population by replacing 0.2% of its observations with ones from an otherwise identical normal population - but with 3 times the standard deviation - the mean absolute deviation is a more efficient estimator than the plug-in formula.
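
The effect of contamination upon these two measures of dispersion can be sketched by simulation. Several assumptions are made here: the comparison uses the standard deviation (the square root of the plug-in variance) so both estimators measure scale in the same units, each estimator's variance is divided by its squared mean to make them comparable, and the contamination rate is set to 5% rather than 0.2% because a simulation this small cannot resolve the tiny difference at 0.2%.

```python
import math
import random
import statistics

def scale_estimates(sample):
    """Plug-in standard deviation and mean absolute deviation of a sample."""
    n = len(sample)
    m = statistics.fmean(sample)
    sd = math.sqrt(sum((y - m) ** 2 for y in sample) / n)
    mean_abs_dev = sum(abs(y - m) for y in sample) / n
    return sd, mean_abs_dev

def standardized_variance(estimates):
    """Variance of a set of estimates relative to their squared mean."""
    return statistics.pvariance(estimates) / statistics.fmean(estimates) ** 2

def relative_efficiency(contamination, n=50, reps=3000, seed=5):
    """Efficiency of the mean absolute deviation relative to the plug-in SD.
    Values above 1 favour the mean absolute deviation."""
    rng = random.Random(seed)
    sds, mads = [], []
    for _ in range(reps):
        # Each observation has probability `contamination` of coming from
        # a normal population with 3 times the standard deviation.
        sample = [rng.gauss(0.0, 3.0 if rng.random() < contamination else 1.0)
                  for _ in range(n)]
        sd, mad = scale_estimates(sample)
        sds.append(sd)
        mads.append(mad)
    return standardized_variance(sds) / standardized_variance(mads)

clean = relative_efficiency(0.0)
contaminated = relative_efficiency(0.05)
print("uncontaminated normal:", round(clean, 2))
print("5% contaminated      :", round(contaminated, 2))
```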

Since this sort of problem is exactly what you would expect from measurement errors and 'outliers', there has been a growing interest in developing more robust estimators. A robust estimator is one which performs well both under ideal circumstances and where its underlying assumptions are not fully met. As per usual, there are a number of conflicting measures of robustness, and quite a few ways of classifying robust estimators.

    For instance:
  • L-estimators are based upon linear combinations of the ordered sample values, for example trimmed means and percentile ranges. A scale-free example of which might be the median divided by the interquartile range, or a 5% trimmed mean divided by the 90 percentile range.
  • R-estimators are based upon ranks, in other words the positions of the ordered values. The most commonly-used R-estimator is the Hodges-Lehmann estimator: the median of all pairwise differences between two samples or, for a single sample, the median of all pairwise averages.
  • M-estimators are maximum likelihood-like estimators - of which maximum likelihood estimators are a subclass. M-estimators try to minimize Σρ(x), where for example, in 'ordinary least squares' estimators ρ(x) = x² - whereas least-absolute deviation estimators, such as the median, use ρ(x) = |x|. Maximum likelihood estimators work by trying to maximize the net likelihood ΠF(x) or, equivalently, minimize Σ−log[F(x)] - where Π indicates their product, and F is a function giving the probability of observing x.
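
A small example of each class of estimator is sketched below, applied to a made-up sample containing one highly aberrant value. The function names are hypothetical, the Hodges-Lehmann estimator is given in its one-sample form (the median of pairwise averages), and the M-estimate is found by a simple ternary search that assumes the summed loss is convex.

```python
import statistics

def trimmed_mean(values, proportion):
    """L-estimator: mean of the ordered values after trimming each tail."""
    ordered = sorted(values)
    k = int(proportion * len(ordered))
    return statistics.fmean(ordered[k:len(ordered) - k])

def hodges_lehmann(values):
    """R-estimator (one-sample form): median of all pairwise averages,
    including each value paired with itself."""
    averages = [(a + b) / 2
                for i, a in enumerate(values)
                for b in values[i:]]
    return statistics.median(averages)

def m_estimate(values, rho, iterations=200):
    """M-estimator of location: the theta minimizing sum(rho(y - theta)).
    Ternary search, which assumes the summed loss is convex in theta."""
    low, high = min(values), max(values)

    def loss(theta):
        return sum(rho(y - theta) for y in values)

    for _ in range(iterations):
        third = (high - low) / 3
        left, right = low + third, high - third
        if loss(left) < loss(right):
            high = right   # the minimum cannot lie above `right`
        else:
            low = left     # the minimum cannot lie below `left`
    return (low + high) / 2

data = [1, 2, 3, 4, 100]  # made-up sample with one aberrant value
print("arithmetic mean      :", statistics.fmean(data))
print("20% trimmed mean     :", trimmed_mean(data, 0.2))
print("Hodges-Lehmann       :", hodges_lehmann(data))
print("M-estimate, rho = x^2:", round(m_estimate(data, lambda x: x * x), 4))
print("M-estimate, rho = |x|:", round(m_estimate(data, abs), 4))
```

With ρ(x) = x² the M-estimate reproduces the arithmetic mean, and with ρ(x) = |x| it reproduces the median, so only the former is dragged towards the aberrant value.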


d)   Other measures

Parametric statistical models make use of the fact that, as sample size increases, quite a few estimators become more regular in their habits - and approach, but do not reach, known distributions. Asymptotic regularity is common where samples approach infinite size, but for some estimators 'large sample approximations' can require extremely large samples indeed. Given the quantity of study which has been invested, a variety of ways have been found to quantify how well behaved these estimators are - at least in principle.

For example, a consistent estimator is an estimator whose bias and variance both approach zero as the sample size approaches infinity. However some estimators (such as the mean), converge more rapidly than others (such as the median). Other estimators converge to their large sample behaviour only very slowly.

Estimators whose relative efficiency is unrelated to the parameter being estimated are described as being regular - and are obviously desirable. Under this definition a sample proportion is not a regular estimator. A few non-regular estimators favour particular values for their estimates, which can be irritating. In contrast superefficient estimators are unbeatable in one situation, but unreliable otherwise.


~   Distribution estimates

In principle at least, the 'distribution function' is a statistic that conveys most information about a population of observations - or about a population of summary statistics.

  • In other words, we are talking about the probability of observing a particular value of Y that is equal to x ( P[Y = x] ), for each value of x (known as the 'probability mass' function) - or, for continuous functions, it is the 'probability density' at point x - or the probability of observing Y within the interval x1 to x2, or ( P[x1 < Y < x2] ).
  • More commonly, among statisticians, the 'distribution function' refers to the cumulative distribution function of your statistic ( P[−∞ < Θ̂ < x], or P[Θ̂ < x] ), for each value of x.
  • But do bear in mind that formulae are only available to describe a very small set of 'theoretical', 'known' distribution functions - which is why they are usually just approximations.

If C[x] is the cumulative distribution function of population (Y0), then Ĉ[x] would be an estimate of that function, based upon a sample of that population (Y1).

  • The empirical cumulative distribution function obviously provides an estimate of C[x], albeit that estimate is biased in various ways - for example it tends to underestimate the population's range, and is always discrete. Yet, in effect, this is the estimate you resample using nonparametric bootstrapping - in order to estimate the theoretical distribution of Θ̂ and construct a simple percentile confidence interval.

  • An alternate approach is to estimate (say) the mean and standard deviation (of Y and/or Θ̂), and use them to 'fit' a normal distribution. In practice many z-tests are done this way - although these are really just approximations to t-tests. Parametric bootstraps and semi-parametric bootstraps also make use of this principle - as do smoothed bootstraps, albeit less arbitrarily.

  • A few statistics, that behave like sums, have distributions which approach normal - when calculated from large (or exceedingly large) samples.

  • When calculated from anything other than large samples, many commonly-used estimators have surprisingly complex distribution functions - even when the observations represent a 'known' population distribution (e.g. normal).

  • When calculated from real data, the exact distribution of any statistic is highly complex and impossible to cope with analytically.
    "The traditional machinery of statistical processes is wholly unsuited to the needs of practical research. Not only does it take a cannon to shoot a sparrow, but it misses the sparrow.

    The elaborate mechanism built on the theory of infinitely large samples is not accurate enough for simple laboratory data. Only by systematically tackling small sample problems on their merits does it seem possible to apply accurate tests to practical data."

    Sir Ronald A. Fisher (1925)
    Preface to Statistical Methods for Research Workers. 1st ed Edinburgh, Oliver and Boyd.

    For example, likelihood statistics are often assumed to be asymptotically normal, so the distribution of their ratios (such as the G-statistics described in Unit 9) is tested against a Chi-square distribution. These limit distributions, valid where n approaches infinity, are frequently used as 'parametric' approximations for quite moderate samples - despite the fact that statistical functions approach their asymptotic behaviour at widely differing rates.

    Only knowing their limit distribution can also make it rather difficult to select an optimal maximum likelihood estimator!
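
The empirical cumulative distribution function, and the simple percentile interval obtained by resampling it, can be sketched as follows. The sample is made up, the statistic (a median) is an arbitrary choice, and the function names are hypothetical. Notice the empirical function is defined here as the proportion of observations less than or equal to x.

```python
import random
import statistics

def ecdf(sample):
    """Return a function giving the proportion of observations <= x."""
    ordered = sorted(sample)
    n = len(ordered)

    def cumulative(x):
        return sum(1 for y in ordered if y <= x) / n

    return cumulative

def percentile_interval(sample, estimator, reps=2000, alpha=0.05, seed=6):
    """Nonparametric bootstrap: resample the sample with replacement
    (in effect, sample from its ECDF) and take percentiles of the estimates."""
    rng = random.Random(seed)
    n = len(sample)
    estimates = sorted(estimator(rng.choices(sample, k=n)) for _ in range(reps))
    lower = estimates[int((alpha / 2) * reps)]
    upper = estimates[int((1 - alpha / 2) * reps) - 1]
    return lower, upper

sample = [2.1, 3.4, 1.9, 5.6, 2.8, 3.1, 4.7, 2.5, 3.9, 3.0]  # made-up data
c_hat = ecdf(sample)
print("estimated P[Y <= 3.0]:", c_hat(3.0))
print("estimated P[Y <= 6.0]:", c_hat(6.0))  # 1.0: nothing beyond the sample's range
lower, upper = percentile_interval(sample, statistics.median)
print("95% percentile interval for the median:", lower, "to", upper)
```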


    ≅   Approximation error

    "When sophistication loses content then the only way of keeping in touch with reality is to be crude and superficial.
    This is what I intend to be."
    Feyerabend, P. 1975

    Using some arbitrary but convenient theoretical frequency distribution as an approximation of the actual distribution of your estimates introduces what is known as an approximation error. Of course, in terms of its effect upon the location of confidence limits and testing point estimators, it enables you to quantify a bias.
      For instance approximation errors arise when a statistic is tested assuming it is normally-distributed, when it is actually t-distributed.

    We noted above that, although the small-sample distributions of many estimators are complex, they often converge asymptotically to 'known' distributions - particularly the normal one. Rescaling-transformations and studentizing can reduce, but do not eliminate, approximation error.
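
The approximation error from treating a t-distributed statistic as normal can be seen by simulation. This sketch assumes normal samples with a true mean of zero, studentizes each sample mean, and counts how often it falls beyond the normal 5% critical values. The sample sizes (5 and 50) and function names are arbitrary choices for illustration.

```python
import math
import random
import statistics

Z_CRIT = statistics.NormalDist().inv_cdf(0.975)  # about 1.96

def studentized_mean(sample):
    """Sample mean divided by its estimated standard error (true mean is 0)."""
    n = len(sample)
    mean = sum(sample) / n
    variance = sum((y - mean) ** 2 for y in sample) / (n - 1)
    return mean / math.sqrt(variance / n)

def rejection_rate(n, reps=10000, seed=7):
    """Share of studentized means lying beyond the normal critical values."""
    rng = random.Random(seed)
    rejected = 0
    for _ in range(reps):
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
        rejected += abs(studentized_mean(sample)) > Z_CRIT
    return rejected / reps

small = rejection_rate(5)
large = rejection_rate(50)
print("n = 5 :", small, "(nominal 0.05)")
print("n = 50:", large, "(nominal 0.05)")
```

For small samples the normal approximation rejects far too often, and the discrepancy shrinks (but does not vanish) as n increases.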


    • Taylor's expansion

    Of the various ways approximation error can be expressed, the absolute difference is perhaps the most common. In order to reduce that error what was needed was a better approximation of the function being approximated. The methods subsequently developed, although highly mathematical, stem from three facts.

    • All of the 'known' distributions are mathematically quite closely related.
    • The more population moments you have good estimates of, the more closely you can approximate one function in terms of another.
    • In general, the higher the moment, the less is its effect upon large-sample estimates. The mean has most effect, followed by variance, then skew, then kurtosis...

    An early application of this is known as Taylor's series - which is the sum of a, theoretically infinite, progression of terms. When truncated to a finite number of terms it is known as a polynomial.

    A polynomial is a sum of ( m + 1 ) terms, each of which is the product of a constant ( ci ) and a power of a variable ( x ).
      The general form, for a single variable, is Y = Σ(ci·x^i) summed from i = 0 to i = m
    This may be expanded to:

    Y = c0·x^0 + c1·x^1 + c2·x^2 + ... + cm·x^m

    In this application, because the left-most terms in the series are generally the largest and most easily estimated, all the smaller 'trailing' terms are usually combined into a 'remainder' term, Ri - where i is the number of (left-hand) terms it excludes - usually between 2 and 4. This remainder term is useful because it indicates how well the leading terms, by themselves, approximate the expression as a whole - assuming the series is convergent (in other words, higher terms are progressively smaller).

    Taylor's expansion is quite useful for estimating moments, or the probability of observing a value ( x ) for well-behaved discrete functions (such as the binomial), provided n is extremely large. Unfortunately, it is not very accurate. One reason for this is that, unless special measures are taken to avoid it, the terms within a polynomial series are not independent. For instance, unless c4 and c2 are extremely different, you would expect c4·x^4 to be related to c2·x^2. As a result, errors in a Taylor's expansion tend to be a function of x.
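
How the remainder of a truncated series depends upon both the number of leading terms and upon x can be shown with a familiar function. The sketch below uses the Taylor series of exp(x) about zero, where each constant is 1/i!. The choice of exp is an assumption made purely for illustration, and the function names are hypothetical.

```python
import math

def taylor_exp(x, terms):
    """Sum of the first `terms` terms of the Taylor series of exp about 0."""
    return sum(x ** i / math.factorial(i) for i in range(terms))

def remainder(x, terms):
    """Absolute error left by discarding everything after `terms` terms."""
    return abs(math.exp(x) - taylor_exp(x, terms))

for x in (0.5, 1.0, 2.0):
    errors = [round(remainder(x, terms), 5) for terms in (2, 3, 4)]
    print("x =", x, "remainders after 2, 3, 4 terms:", errors)
```

For any given x the remainder shrinks as terms are added, but for a fixed number of terms it grows with x.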


    • Edgeworth's expansion

    Another way to quantify approximation error is to use what are known as orthogonal polynomials. An orthogonal polynomial is one whose terms are modified so they are as different as possible from one another, and therefore behave as if they were independent. The inner workings of these expansions are therefore rather complex, and not something you would wish to meet on a dark night.

      The resulting formulae are effectively two polynomials, where the second polynomial is used to calculate the constant (ci) for each term of the first polynomial. These two series of terms are arranged in powers of n^(−½) - where n, as per usual, is the number of observations each estimate is calculated from.

    Nevertheless, if we ignore the mathematical detail, most of the implications and assumptions are understandable enough.

    • In principle at least, the error is unrelated to x and n (the sample size).
    • For studentized statistics (and z-statistics) the first two moments (the mean and variance) and therefore the first two terms of this expansion, tend to simplify or disappear.
    • When the various bits of the formula are collected according to the effects of sample size, portions of the higher moments (the skew, kurtosis, and so-forth) end up in more than one term of the resulting series.
    • For large samples, the effect of each term of the expansion is less than the one to its left - so the polynomial terms are a convergent series, and the remainder term (R) is finite.

    For example, ignoring how each constant is calculated, the formula below assumes that the studentized estimator behaves as a sum (or mean) and has a continuous distribution - and that the moments of that distribution are finite.

    | Fn[x] − ( Φ[x] + c1·n^(−1/2) + c2·n^(−1) + c3·n^(−3/2) + c4·n^(−2) ) | = R4

    • Fn[x] is the cumulative distribution function (CDF) of our estimator - in other words, the proportion of estimates below say, x, ( P[<x] ).
    • Φ[x] is the standard cumulative normal distribution function of x.
    • c1 to ci are constants, but c0·[1/n]^0 is absent because the estimates are assumed to be unbiased.
    • n is the number of observations each estimate is calculated from, and
    • Ri provides a measure of how good the approximation is liable to be.
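
The effect of the first term of such an expansion can be sketched for a case where the exact distribution is known. The example below assumes a standardized (known variance, rather than studentized) mean of n unit exponential observations, whose sum has a gamma distribution, and uses the standard one-term correction −φ[x]·γ·(x² − 1)/(6√n), where γ is the population skewness (2 for the exponential) and φ is the normal density. The function names are hypothetical.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()

def exact_cdf(x, n):
    """P[Z <= x] for Z = (sum of n unit exponentials - n) / sqrt(n).
    The sum is gamma distributed, so for whole n its CDF is a finite sum."""
    s = n + x * math.sqrt(n)
    if s <= 0:
        return 0.0
    # Terms are computed on the log scale to avoid overflow for large n.
    upper_tail = sum(math.exp(-s + k * math.log(s) - math.lgamma(k + 1))
                     for k in range(n))
    return 1.0 - upper_tail

def edgeworth_cdf(x, n, skew=2.0):
    """Normal CDF plus the first (order n^(-1/2)) Edgeworth term."""
    correction = -STD_NORMAL.pdf(x) * skew * (x * x - 1) / (6 * math.sqrt(n))
    return STD_NORMAL.cdf(x) + correction

n = 10
for x in (-2.0, 0.0, 2.0):
    exact = exact_cdf(x, n)
    normal_error = abs(STD_NORMAL.cdf(x) - exact)
    edgeworth_error = abs(edgeworth_cdf(x, n) - exact)
    print("x =", x, "normal error", round(normal_error, 4),
          "one-term Edgeworth error", round(edgeworth_error, 4))
```

Under these assumptions a single correction term removes most of the error due to skew, even with only 10 observations per estimate.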

    Since its reasoning derives from the behaviour of large (or extremely large) samples, a popular way to express the magnitude of this remainder (the estimated approximation error) is to use an asymptotic order term such as the 'big O' order term introduced in Unit 3.


    Although this approach assumes you know how your statistical function is distributed, it does help theoreticians to improve large-sample approximations for moderate samples - and some transformations now make use of it although, to avoid 'over correction', they seldom use more than the first three terms. More immediately, order terms are increasingly being used to indicate both the degree and type of approximation error.

      For example, the error in using a normal approximation for a t-distributed statistic is O[n^(−½)] or O[1/√n].

      Notice however that, because of the way in which the moments get 'smeared out', the first term of the expansion contains most of the effects of skew, whereas the second term allows for most of the effects of kurtosis and the secondary effects of skewness.
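
What an order term of O[1/√n] implies can be checked where the exact distribution is available. This sketch assumes the mean of n unit exponential observations (a skewed population, chosen because the distribution of its sum is known exactly), and measures the error of the normal approximation at the centre of the distribution. If the error is of order 1/√n, quadrupling n should roughly halve it. The function names are hypothetical.

```python
import math

def exact_cdf_at_zero(n):
    """P[mean of n unit exponentials <= 1], from the gamma (Erlang) CDF."""
    s = float(n)
    upper_tail = sum(math.exp(-s + k * math.log(s) - math.lgamma(k + 1))
                     for k in range(n))
    return 1.0 - upper_tail

def normal_error(n):
    """Error of the normal approximation, which gives exactly 0.5 here."""
    return abs(exact_cdf_at_zero(n) - 0.5)

for n in (10, 40, 160, 640):
    error = normal_error(n)
    print("n =", n, "error", round(error, 5),
          "error x sqrt(n)", round(error * math.sqrt(n), 4))
```

The product of the error and √n stays very nearly constant, which is what the order term predicts.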