Given the importance of this question, it is not surprising that there are a number of ways of answering it, not least because statisticians have devised so many different types of estimator, and almost as many ways of classifying them.
For example, Units 1 to 5 concentrated upon 'point' estimators such as means, differences, variances, and P-values, whereas this Unit concentrates upon 'range' ('interval') estimators, such as confidence limits.
An estimator is simply a method by which you obtain an estimate, bearing in mind that there may be any number of methods available.
For instance, the arithmetic mean, geometric mean, trimmed mean, and mid-range could all be used to provide (point) estimates of a given population's median. Which of these will provide the most reliable estimate depends upon how much information you supply them with, how it was obtained, and what sort of population you are estimating the median of.
Since confidence intervals are a popular way to indicate the reliability of a point estimate, and interval estimates have problems peculiar to themselves, let us consider them first.
Interval estimates
Interval estimates, whilst sharing many of the problems of point estimates, tend to be assessed rather differently. To understand the reasoning and shortcomings of these methods, we must consider how these intervals, and their estimates, are defined.
In essence, a confidence interval ( Î ) estimates a range ( I ) which encloses a population statistic ( Θ ). The width of I is usually set according to what proportion ( α ) of all estimates of Θ you wish to exclude from that range (say 5%). Provided the estimator ( Θ̂ ) is distributed symmetrically, I is located centrally about Θ.
This arrangement has two important properties:
• We would expect the interval, I, to enclose the most typical (1 − α) estimates of Θ.
• Conversely, any estimate of Θ outside that range should be rejected by a comparable test (at P < α).
From which it follows that, when Θ̂ = Θ and Î = I, the same ought to be true, even if Θ̂ is distributed asymmetrically. In addition, you might assume that a good estimate of I would perform best. However, owing to the dominance of parametric normal statistical models, a quite different 'frequentist' criterion was used.
Assuming your confidence intervals are good estimates of I, when these estimated intervals ( Î ) are attached to estimates of Θ, a predetermined proportion ( 1 − α, or 95% ) of these intervals are expected to enclose Θ, at least on average. If, as predicted by this model, exactly 1 − α (or 95%) of confidence intervals are found to enclose Θ, this is described as perfect coverage.
The most popular measure of the quality of an interval estimator, known as the coverage error, is simply the difference between the observed and expected coverage. Confusingly, for reasons of mathematical convenience, the formulae for this generally assume that Θ̂ is distributed symmetrically, and that you are calculating the (equivalent 2-tailed) interval between two equal 1-tailed confidence limits. In other words, coverage error assumes a different definition of confidence limits from the one above.
The problem with this measure is that it wholly ignores the length of confidence intervals, and what happens where Θ̂ is not distributed symmetrically about Θ. Interest in alternative measures of interval estimates, and in alternative ways of constructing confidence limits, is comparatively recent.
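As a concrete illustration of coverage, the sketch below draws repeated samples from a known population and counts how often an interval encloses Θ. The population, sample sizes, seed, and the simple 'mean ± 1.96 standard errors' interval are my own choices for illustration, not the Unit's, so the observed coverage is only approximately 95%.

```python
# Estimating the coverage of a nominal 95% confidence interval by simulation.
import random
import statistics

random.seed(1)

def normal_interval(sample, z=1.96):
    """Large-sample normal-approximation interval for the population mean."""
    m = statistics.fmean(sample)
    se = statistics.stdev(sample) / len(sample) ** 0.5
    return m - z * se, m + z * se

THETA = 10.0   # the population mean the intervals should enclose
R = 2000       # number of replicate samples
n = 30         # observations per sample

hits = 0
for _ in range(R):
    sample = [random.gauss(THETA, 2.0) for _ in range(n)]
    lo, hi = normal_interval(sample)
    if lo <= THETA <= hi:
        hits += 1

coverage = hits / R
coverage_error = coverage - 0.95   # observed minus expected coverage
print(round(coverage, 3))
```

Because the interval uses z = 1.96 rather than the appropriate t-multiplier, the observed coverage typically falls slightly short of 95%, which is exactly the kind of discrepancy coverage error is meant to quantify.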
Point estimates
Until quite recently, the two most common criteria for judging the reliability of a point estimator were:
• its accuracy, or amount of bias
• its precision, variation, or concentration.
a) Measures of bias
Statisticians use two measures of bias: mean bias and median bias, of which the first is the more popular. Mean bias is simply the average deviation you would obtain if you used the same estimator upon a large number (R) of identical random samples from the same population.
For R samples the average estimate, Θ̄ = ΣΘ̂/R, would be subject to sample variation. But this formula is of no use if there are an infinite number of estimates, so the bias is expressed as an expected value. In other words, when R = ∞ the average estimate is described as its expected value, so Θ̄ = E[Θ̂].
For estimates that are distributed symmetrically about their mean, E[Θ̂], there is no difference between mean and median bias. So E[Θ̂] = M[Θ̂], provided that R is infinite. For example, if Θ̂ had an unbiased t-distribution, its expected value would be zero, and it would be equally likely to be negative or positive. Or, if you prefer, if P is the proportion of random estimates, for a median-unbiased estimator P[Θ̂ < Θ] = P[Θ̂ > Θ].
'Unbiased additive linear' estimators, such as the mean, have no mean bias but, where their estimates have a skewed distribution, these estimators are median biased. Other estimators, such as the plug-in 'population variance' formula ( Σ[Y − Ȳ]^{2} / n ), tend to have both mean and median bias. Notice, however, that whilst this estimator has an expected value of σ^{2}[n−1]/n, and therefore a bias of −σ^{2}/n for finite samples, as n approaches infinity this estimator is 'asymptotically' unbiased.
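This bias is easy to check by simulation. In the sketch below (the population, sample size, and replicate count are illustrative assumptions), the plug-in formula is applied to repeated samples and the average estimate is compared with the true variance.

```python
# Estimating the mean bias of the plug-in 'population variance' formula.
import random

random.seed(2)

SIGMA2 = 4.0   # true population variance (standard deviation = 2)
n = 10         # observations per sample
R = 20000      # number of replicate samples

def plugin_variance(ys):
    m = sum(ys) / len(ys)
    return sum((y - m) ** 2 for y in ys) / len(ys)   # divides by n, not n - 1

estimates = []
for _ in range(R):
    sample = [random.gauss(0.0, 2.0) for _ in range(n)]
    estimates.append(plugin_variance(sample))

mean_bias = sum(estimates) / R - SIGMA2
print(round(mean_bias, 3))   # theory: mean bias = -SIGMA2 / n = -0.4
```

The simulated mean bias settles near −σ^{2}/n, and shrinks towards zero as n is increased, which is the asymptotic unbiasedness described above.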
For some years most statisticians assumed the best estimators had to be unbiased (when applied to finite samples). Fairly recently, however, it has become generally accepted that bias is only a serious problem when its extent is unknown. The more variable an estimate is, in contrast, the less useful information it effectively contains.
b) Measures of concentration
Whilst an ordinary sample mean is an unbiased estimate of its population mean, this does not imply this plugin estimator is the best estimator of that parameter. For purposes of inference, the least variable, most efficient estimator, provides the most power to discredit a null hypothesis.
• If your observations have a symmetric frequency distribution, because the population median and mean are the same, the sample median is also an unbiased estimator of the population mean. However, sample medians have a larger standard error than sample means, and a sample mean is the most efficient estimator of the population mean, provided the errors are normal.
• If your observations are not normally distributed, although the mean is still an unbiased estimator of its parametric value, it is no longer the most efficient. Depending upon the error distribution, a variety of statistics have been devised to provide a less variable estimate of the population mean, bearing in mind their estimates may be biased.
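A minimal sketch of the first comparison, assuming a normal population (all other choices are mine): it estimates the variances of repeated sample means and sample medians, whose ratio should approach the asymptotic relative efficiency of the median, about 2/π ≈ 0.64.

```python
# Comparing the efficiency of the sample mean and sample median
# as estimators of a normal population's mean.
import random
import statistics

random.seed(3)

n, R = 25, 4000
means, medians = [], []
for _ in range(R):
    sample = [random.gauss(0.0, 1.0) for _ in range(n)]
    means.append(statistics.fmean(sample))
    medians.append(statistics.median(sample))

var_mean = statistics.pvariance(means)
var_median = statistics.pvariance(medians)
relative_efficiency = var_mean / var_median   # below 1: the median is more variable
print(round(relative_efficiency, 2))
```

Repeating the run with a heavy-tailed population instead of random.gauss reverses the verdict, which is the point of the second bullet above.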
To parametric statisticians at least, the most obvious measure of your estimates' variation is their variance. In other words, if Θ̄ is the average of R estimates of a parameter, Θ, then you might calculate their variance ( s^{2} ) as the mean squared difference between those estimates and their mean, Σ[Θ̂ − Θ̄]^{2} / R.
Because they allow the most power, the least variable estimates became known as the most efficient, and estimators were compared upon that basis. Unfortunately, which estimator is most efficient varies according to the frequency distribution of the population being sampled. For example, the arithmetic mean is only the most powerful estimate of the population mean if the errors are normal.
For unbiased estimators, such as the sample mean, efficiency is a perfectly adequate measure of reliability because the expected value is the same as the parametric value, and E[Θ̂] = Θ. But where Θ̂ is biased, a combined measure of estimator bias and variation is required. Of the several measures proposed, the mean squared error (MSE) is the most used. This measure is the average squared difference between your estimates and their parametric value, and is often expressed as an expected error, E{[Θ̂ − Θ]^{2}}. Arithmetically, the mean squared error is also the estimator's variance plus the square of its bias, s^{2} + {E[Θ̂ − Θ]}^{2}.
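The decomposition of mean squared error into variance plus squared bias can be checked numerically. In this sketch (the population and sample size are illustrative), the biased plug-in variance formula supplies the estimates.

```python
# Checking MSE = variance + bias^2 for a biased estimator, by simulation.
import random
import statistics

random.seed(4)

THETA = 1.0      # true variance of a standard normal population
n, R = 8, 20000

ests = []
for _ in range(R):
    ys = [random.gauss(0.0, 1.0) for _ in range(n)]
    m = sum(ys) / n
    ests.append(sum((y - m) ** 2 for y in ys) / n)   # plug-in variance estimate

mse = sum((e - THETA) ** 2 for e in ests) / R        # E{[est - THETA]^2}
var = statistics.pvariance(ests)                     # variability of the estimates
bias = statistics.fmean(ests) - THETA                # mean bias (negative here)
print(round(mse, 4), round(var + bias ** 2, 4))
```

The two printed quantities agree (up to floating-point error), because the decomposition is an algebraic identity, not an approximation.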
Maximum likelihood estimates often have the smallest variance, but are sometimes biased. For example, where the population is normal, the arithmetic mean is the same as the maximum likelihood estimate, and is both unbiased and has the smallest variance, whereas the plug-in estimator of the population variance is biased, but has the minimum variance and is equivalent to the maximum likelihood estimate. Where the population distribution is undefined, the bias of the most efficient estimator is unknown, and a maximum likelihood estimate cannot be obtained.


The fact that the relative efficiency of two statistics could be defined as the ratio of their mean squared errors led to the idea of minimum variance estimators and best equivalent estimators. For samples of some types of population, particularly normal ones, it has long been known that some estimators are the most efficient possible, and these were described as sufficient estimators. If you compare a mean with a trimmed mean, such as a median (which is the most heavily trimmed mean), the underlying reason is clear: a sufficient estimator summarizes all the useful information contained within your sample, and the more information there is available, the less variable are its estimates. Minimax estimators, on the other hand, minimise the greatest error in estimating the parameter; but, because this is at the expense of their power, minimax estimators are too pessimistic (underpowered) for many applications.
c) Robustness
Where its assumptions are fully met, the arithmetic mean, Ȳ = ΣY/n, is the best possible estimator of its parameter by pretty much any criterion you wish to choose. Unfortunately, this almost never happens in real life. Disconcertingly, it has been known for many years that, for any reasonable sample size, even quite small departures from some of these assumptions, such as slight skew, kurtosis, or contamination, can make the mean anything but a reliable estimator of location.
Similar concerns apply to popular measures of dispersion, such as the population variance, Σ[Y − Ȳ]^{2}/n. When all its assumptions are met, although biased, this plug-in estimator is the maximum likelihood estimator of the population variance, σ^{2}, and has a 12% higher asymptotic relative efficiency than the mean absolute deviation, Σ|Y − Ȳ|/n. If, however, you contaminate the population by replacing 0.2% of its observations with ones from an identical normal population, but with 3 times the standard deviation, the mean absolute deviation becomes the more efficient estimator.
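The sketch below reproduces this contamination effect. All names are illustrative, and a heavier contamination rate of 1% is used so a short run shows the reversal clearly (the Unit's 0.2% figure is roughly the asymptotic break-even point). Because the two statistics estimate different scale parameters, their variability is compared relative to their own averages.

```python
# Contamination reverses the relative efficiency of two dispersion estimators.
import random
import statistics

random.seed(5)

def sample(n, contamination):
    # N(0,1) observations, each replaced by an N(0,3) one with small probability
    return [random.gauss(0.0, 3.0 if random.random() < contamination else 1.0)
            for _ in range(n)]

def rel_spread(estimates):
    # coefficient of variation: spread of the estimates relative to their mean
    return statistics.stdev(estimates) / statistics.fmean(estimates)

n, R = 100, 3000
results = {}
for p in (0.0, 0.01):
    sds, mads = [], []
    for _ in range(R):
        ys = sample(n, p)
        m = statistics.fmean(ys)
        sds.append((sum((y - m) ** 2 for y in ys) / n) ** 0.5)  # plug-in s.d.
        mads.append(sum(abs(y - m) for y in ys) / n)            # mean abs. deviation
    results[p] = (rel_spread(sds), rel_spread(mads))
    print(p, round(results[p][0], 4), round(results[p][1], 4))
```

On the clean population the standard-deviation-based estimator is the steadier of the two; with contamination the ordering flips and the mean absolute deviation wins.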


Since these sorts of problems are exactly what you would expect from measurement errors and 'outliers', there has been growing interest in developing more robust estimators: a robust estimator being one which performs well both under ideal circumstances and where its underlying assumptions are not fully met. As usual, there are a number of conflicting measures of robustness, and quite a few ways of classifying robust estimators.
For instance:
• L-estimators are based upon linear combinations of the ordered sample values, for example trimmed means and percentile ranges. A scale-free example might be the median divided by the interquartile range, or a 5% trimmed mean divided by the 90 percentile range.
• R-estimators are based upon ranks, in other words ordered values. The most commonly-used R-estimator is the Hodges-Lehmann estimator: for a single sample, the median of all pairwise (Walsh) averages; for comparing two samples, the median of all pairwise differences.
• M-estimators are maximum likelihood-like estimators, of which maximum likelihood estimators are a subclass. M-estimators try to minimize Σρ(x) where, for example, in 'ordinary least squares' estimators ρ(x) = x^{2}, whereas least-absolute-deviation estimators, such as the mean absolute deviation, use ρ(x) = |x|. Maximum likelihood estimators work by trying to maximize the net likelihood ΠF(x) or, equivalently, minimize Σ−log[F(x)], where Π indicates their product, and F is a function giving the probability of observing x.
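To make the first two families concrete, here are minimal sketches (not library implementations) of a trimmed mean (an L-estimator) and the one-sample Hodges-Lehmann estimator (an R-estimator), applied to a small illustrative data set containing one gross outlier.

```python
# Two robust location estimators compared with the ordinary mean.
import statistics

def trimmed_mean(ys, prop):
    """L-estimator: mean of the ordered values after trimming
    a proportion `prop` from each end."""
    ys = sorted(ys)
    k = int(len(ys) * prop)
    return statistics.fmean(ys[k:len(ys) - k])

def hodges_lehmann(ys):
    """R-estimator (one-sample form): median of all pairwise (Walsh) averages."""
    pairs = [(ys[i] + ys[j]) / 2
             for i in range(len(ys)) for j in range(i, len(ys))]
    return statistics.median(pairs)

data = [9.8, 10.1, 10.0, 9.9, 10.2, 9.7, 10.3, 55.0]   # 55.0 is a gross outlier

print(statistics.fmean(data))          # dragged well above 10 by the outlier
print(trimmed_mean(data, 0.125))       # trims one value from each end
print(hodges_lehmann(data))
```

Both robust estimates stay close to the bulk of the data near 10, whereas the arithmetic mean is pulled far towards the outlier.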
d) Other measures
Parametric statistical models make use of the fact that, as sample size increases, quite a few estimators become more regular in their habits, and approach, but never reach, known distributions. Asymptotic regularity is common where samples approach infinite size, but for some estimators these 'large sample approximations' can require extremely large samples indeed. Given the quantity of study which has been invested, a variety of ways have been found to quantify how well behaved these estimators are, at least in principle.
For example, a consistent estimator is one whose bias and variance both approach zero as the sample size approaches infinity. However, some estimators (such as the mean) converge more rapidly than others (such as the median), and other estimators converge to their large-sample behaviour only very slowly.


Estimators whose relative efficiency is unrelated to the parameter being estimated are described as regular, and are obviously desirable. Under this definition a sample proportion is not a regular estimator, because its variability depends upon the proportion being estimated. A few non-regular estimators favour particular values for their estimates, which can be irritating. In contrast, super-efficient estimators are unbeatable in one situation, but unreliable otherwise.
Distribution estimates
In principle at least, the 'distribution function' is a statistic that conveys most information about a population of observations  or about a population of summary statistics.
• In other words, we are talking about the probability of observing a particular value of Y equal to x ( P[Y = x] ), for each value of x (known as the probability 'mass' function); or, for continuous functions, the 'probability density' at point x; or the probability of observing Θ̂ within the interval x_{1} to x_{2} ( P[x_{1} < Θ̂ < x_{2}] ).
• More commonly, among statisticians, the 'distribution function' refers to the cumulative distribution function of your statistic ( P[−∞ < Θ̂ < x], or P[Θ̂ < x] ), for each value of x.
• But do bear in mind that formulae are only available to describe a very small set of 'theoretical', 'known' distribution functions, which is why they are usually just approximations.
If C[x] is the cumulative distribution function of a population (Y_{0}), then Ĉ[x] would be an estimate of that function, based upon a sample of that population (Y_{1}).
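The usual plug-in estimate of C[x] is the empirical cumulative distribution function: at any point x, simply the proportion of sample values at or below x. A minimal sketch (function names are illustrative):

```python
# The empirical CDF as a plug-in estimate of a distribution function.
from bisect import bisect_right

def ecdf(sample):
    """Return a function estimating C[x] = P[Y <= x] from the sample."""
    ys = sorted(sample)
    n = len(ys)
    def c_hat(x):
        # proportion of sample values at or below x
        return bisect_right(ys, x) / n
    return c_hat

c_hat = ecdf([3, 1, 4, 1, 5, 9, 2, 6])
print(c_hat(1))     # 2 of the 8 values are <= 1
print(c_hat(4.5))   # 5 of the 8 values are <= 4.5
```

Unlike the 'known' theoretical functions below, this estimate is a step function, but it converges on C[x] as the sample grows.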
A few statistics that behave like sums have distributions which approach the normal when calculated from large (or exceedingly large) samples.
When calculated from anything other than large samples, many commonlyused estimators have surprisingly complex distribution functions  even when the observations represent a 'known' population distribution (e.g. normal).
When calculated from real data, the exact distribution of any statistic is highly complex and impossible to cope with analytically.

 
"The traditional machinery of statistical processes is wholly unsuited to the needs of practical research. Not only does it take a cannon to shoot a sparrow, but it misses the sparrow.
The elaborate mechanism built on the theory of infinitely large samples is not accurate enough for simple laboratory data. Only by systematically tackling small sample problems on their merits does it seem possible to apply accurate tests to practical data."
Sir Ronald A. Fisher (1925)
Preface to Statistical Methods for Research Workers. 1st ed Edinburgh, Oliver and Boyd. 

For example, likelihood statistics are often assumed to be asymptotically normal, so the distributions of their ratios (such as the G-statistics described in Unit 9) are tested against a Chi-square distribution.
These limit distributions, valid where n approaches infinity, are frequently used as 'parametric' approximations for quite moderate samples, despite the fact that statistical functions approach their asymptotic behaviour at widely differing rates.
Only knowing their limit distribution can also make it rather difficult to select an optimal maximum likelihood estimator!
Approximation error
Of the various ways approximation error can be expressed, the absolute difference is perhaps the most common. To reduce that error, what was needed was a closer approximation to the function in question. The methods subsequently developed, although highly mathematical, stem from three facts.
• All of the 'known' distributions are mathematically quite closely related.
• The more population moments you have good estimates of, the more closely you can approximate one function in terms of another.
• In general, the higher the moment, the less is its effect upon large-sample estimates. The mean has the most effect, followed by the variance, then skew, then kurtosis...
An early application of this is known as Taylor's series, which is the sum of a (theoretically infinite) progression of terms known as a polynomial.
A polynomial is a sum of ( m + 1 ) terms, each of which is the product of a constant ( c_{i} ) and a power of a variable ( x ).
The general form, for a single variable, is Y = Σ(c_{i}x^{i}), summed from i = 0 to i = m.
This may be expanded to:
Y = c_{0}x^{0} + c_{1}x^{1} + c_{2}x^{2} ... + c_{m}x^{m}

In this application, because the leftmost terms in the series are generally the largest and most easily estimated, all the smaller 'trailing' terms are usually combined into a 'remainder' term, R_{i}, where i is the number of (left-hand) terms it excludes, usually between 2 and 4. This remainder term is useful because it indicates how well the leading terms, by themselves, approximate the expression as a whole, assuming the series is convergent (in other words, that higher terms are progressively smaller).
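As a toy illustration (the function e^x and the truncation points are my choices, not the Unit's), the sketch below shows the remainder left by the leading terms of a convergent Taylor's series shrinking as more terms are retained.

```python
# The remainder of a truncated Taylor's series shrinks as terms are added.
import math

def taylor_exp(x, m):
    """Sum of the first m terms of the Taylor's series for e**x about zero:
    e**x = 1 + x + x**2/2! + x**3/3! + ..."""
    return sum(x ** i / math.factorial(i) for i in range(m))

x = 0.5
remainders = []
for m in (2, 3, 4):
    r = math.exp(x) - taylor_exp(x, m)   # what the leading m terms leave out
    remainders.append(r)
    print(m, round(r, 6))
```

Each extra term leaves a smaller remainder, which is what makes a small remainder term a useful summary of approximation quality.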
Taylor's expansion is quite useful for estimating moments, or the probability of observing a value ( x ) for well-behaved discrete functions (such as the binomial), provided n is extremely large. Unfortunately, it is not very accurate. One reason for this is that, unless special measures are taken to avoid it, the terms within a polynomial series are not independent. For instance, unless c_{4} and c_{2} are extremely different, you would expect c_{4}x^{4} to be related to c_{2}x^{2}. As a result, errors in a Taylor's expansion tend to be a function of x.
Another way to quantify approximation error is to use what are known as orthogonal polynomials. An orthogonal polynomial is one whose terms are modified so they are as different as possible from one another, and therefore behave as if they were independent. The inner workings of these expansions are therefore rather complex, and not something you would wish to meet on a dark night.
The resulting formulae are effectively two polynomials, where the second polynomial is used to calculate the constant ( c_{i} ) for each term of the first polynomial. These two series of terms are arranged in powers of n^{−½}, where n, as usual, is the number of observations each estimate is calculated from.
Nevertheless, if we ignore the mathematical detail, most of the implications and assumptions are understandable enough.
• In principle at least, the error is unrelated to x and n (the sample size).
• For studentized statistics (and z-statistics) the first two moments (the mean and variance), and therefore the first two terms of this expansion, tend to simplify or disappear.
• When the various parts of the formula are collected according to the effects of sample size, portions of the higher moments (the skew, kurtosis, and so forth) end up in more than one term of the resulting series.
• For large samples, the effect of each term of the expansion is less than the one to its left, so the polynomial terms form a convergent series, and the remainder term (R) is finite.
For example, ignoring how each constant is calculated, the formula below assumes that the studentized estimator behaves as a sum (or mean) and has a continuous distribution, and that the moments of that distribution are finite.
| F_{n}[x] − ( Φ[x] + c_{1}n^{−¹/2} + c_{2}n^{−1} + c_{3}n^{−³/2} + c_{4}n^{−2} ) | = R_{4}
Here,
• F_{n}[x] is the cumulative distribution function (CDF) of our estimator, in other words the proportion of estimates below, say, x ( P[Θ̂ < x] ).
• Φ[x] is the standard cumulative normal distribution function of x.
• c_{1} to c_{4} are constants, but c_{0}[^{1}/_{n}]^{0} is absent because the estimates are assumed to be unbiased.
• n is the number of observations each estimate is calculated from, and
• R_{4} provides a measure of how good the approximation is liable to be.
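You can watch the left-hand side of this expression shrink by simulation. The sketch below (population, evaluation point, seed, and sample sizes are all illustrative assumptions) compares the simulated CDF of a studentized mean with Φ[x] for a skewed population, where the leading c_{1}n^{−¹/2} (skew) term dominates the error.

```python
# The normal-approximation error of a studentized mean shrinks with n.
import math
import random

random.seed(6)

def phi(x):
    """Standard cumulative normal distribution function, via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def studentized_mean(ys, theta):
    n = len(ys)
    m = sum(ys) / n
    s = (sum((y - m) ** 2 for y in ys) / (n - 1)) ** 0.5
    return (m - theta) * n ** 0.5 / s

THETA = 1.0   # mean of the (skewed) Exponential(1) population
x = -1.0      # point at which the two CDFs are compared
R = 10000     # replicate samples per sample size

errors = []
for n in (5, 50):
    below = 0
    for _ in range(R):
        ys = [random.expovariate(1.0) for _ in range(n)]
        if studentized_mean(ys, THETA) < x:
            below += 1
    err = below / R - phi(x)   # simulated F_n[x] minus the normal approximation
    errors.append(err)
    print(n, round(err, 3))
```

The absolute error at n = 50 is visibly smaller than at n = 5, consistent with a leading error term of order n^{−¹/2}.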

Since its reasoning derives from the behaviour of large (or extremely large) samples, a popular way to express the magnitude of this remainder (the estimated approximation error) is to use an asymptotic order term, such as the 'big O' order term introduced in Unit 3.
Although this approach assumes you know how your statistical function is distributed, it does help theoreticians to improve large-sample approximations for moderate samples, and some transformations now make use of it, although, to avoid 'over-correction', they seldom use more than the first three terms. More immediately, order terms are increasingly being used to indicate both the degree and type of approximation error.
For example, the error in using a normal approximation for a t-distributed statistic is O[n^{−½}], or O[^{1}/_{√n}].
Notice, however, that because of the way in which the moments get 'smeared out', the first term of the expansion contains most of the effects of skew, whereas the second term allows for most of the effects of kurtosis and the secondary effects of skewness.