Maximum likelihood estimators (MLEs)
If we set the maths aside, the basic reasoning behind maximum likelihood techniques is quite straightforward - provided we go through it in an orderly fashion, and concentrate upon principles rather than calculations.
A likelihood may be defined as the probability that a defined population will give rise to a single observation, or to a group of specific independent observations.
To work out the likelihood of a combination of n independent events occurring, you do not add their probabilities; instead you multiply them (or, equivalently, calculate the log-likelihood by summing their logs).
For example, if the probability of a child being born male is 0.5146, the probability of 3 randomly selected babies all being male is not 0.5146 × 3 but 0.5146³, or about 0.1363.
That is not, however, the probability of the same woman bearing 3 sons in succession - because those events are, most emphatically, not independent.
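The multiplication rule above is easy to check directly. A minimal sketch in Python, using the figure quoted in the text:

```python
import math

# Probability that a randomly selected baby is male (figure from the text).
p_male = 0.5146

# Likelihood of three independent male births: multiply, don't add.
likelihood = p_male ** 3
print(round(likelihood, 4))  # 0.1363

# Equivalent log-likelihood: sum the logs of the individual probabilities,
# then exponentiate to recover the same likelihood.
log_likelihood = 3 * math.log(p_male)
print(math.isclose(math.exp(log_likelihood), likelihood))  # True
```

Note that adding the probabilities instead would give 1.5438 - not even a valid probability.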
The greater your sample size, therefore, the smaller the net likelihood of n specified, independent events. In many situations the likelihood of obtaining a specific value is pretty small to start with, so the likelihood of observing, say, all (n =) 30 specified values is very tiny indeed. Indeed, for a continuous variable, because the number of possible values is theoretically infinite, the likelihood of observing any specific value is effectively zero - and is described using a probability density, rather than a probability mass.
In practice, the likelihood of making any given combination of observations depends upon how their population is distributed (for example, binomially or log-normally), and the accuracy of your measurements.
For example, we could use the normal probability density function to estimate the likelihood of an observation falling into, say, the 105.5 to 106.5 milligram class interval - assuming your measurements were made to the nearest milligram. Where the interval width approaches zero, and errors are normal, the probability density (Z or φ) is used instead of the probability.
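This class-interval idea can be sketched numerically. The mean and standard deviation below are illustrative assumptions (the text does not specify a population), but the comparison - exact interval probability versus density × width - holds for any normal population:

```python
from math import erf, exp, pi, sqrt

# Hypothetical normal population: mean and SD are illustrative assumptions,
# not values from the text.
mu, sigma = 100.0, 5.0

def normal_cdf(x):
    """Cumulative probability of the normal distribution up to x."""
    return 0.5 * (1 + erf((x - mu) / (sigma * sqrt(2))))

def normal_pdf(x):
    """Probability density of the normal distribution at x."""
    return exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * sqrt(2 * pi))

# Exact probability of an observation falling in the 105.5-106.5 mg
# class interval (measurements made to the nearest milligram).
exact = normal_cdf(106.5) - normal_cdf(105.5)

# For a narrow interval, density x width is a close approximation.
approx = normal_pdf(106.0) * 1.0
print(round(exact, 4), round(approx, 4))
```

For a 1 mg interval the two figures agree to about three decimal places, which is why the density can stand in for the probability as the interval shrinks.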
To cope with these vanishingly tiny numbers, likelihoods are routinely calculated as log-likelihoods, which are not only equivalent statistics but are also additive. The other way of coping with vanishingly small likelihoods is as ratios - or, equivalently, as the difference between log-likelihoods. This is very useful when comparing two models, or two hypotheses - such as the likelihood ratio test described in Unit 9.
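A short sketch of that ratio idea, using a hypothetical run of seven independent success/failure observations (the data pattern is illustrative, chosen to match the example developed later in this section):

```python
import math

# Seven independent Bernoulli observations: 1 = success, 0 = failure
# (an illustrative data pattern, three successes out of seven).
data = [1, 1, 1, 0, 0, 0, 0]

def log_likelihood(P):
    # Log-likelihoods are additive: one log term per independent observation.
    return sum(math.log(P) if x else math.log(1 - P) for x in data)

# Compare two candidate values of the population proportion P.
ll_a = log_likelihood(3 / 7)  # the sample proportion
ll_b = log_likelihood(0.5)    # an alternative hypothesis

# The likelihood ratio is the difference between log-likelihoods,
# exponentiated back onto the probability scale.
ratio = math.exp(ll_a - ll_b)
print(round(ratio, 3))
```

Here the ratio comes out slightly above 1, indicating P = 3/7 explains these data marginally better than P = 0.5 - the quantity a likelihood ratio test works with.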
Maximum likelihood statistics are an important and heavily used class of estimators - an 'estimator' being a procedure (or formula) that enables you to estimate the value of a population parameter given a sample from that population. In everyday speech a maximum likelihood statistic also refers to the value which results from that procedure - though it is more properly described as a maximum likelihood estimate.
Put another way, calculating a maximum likelihood statistic involves three things:
a. A sample of observations - and a clear idea of how that sample was obtained from its population (the simplest case being independent random selection).
b. A statistic which estimates a population parameter.
c. Some way to predict how that statistic is distributed when the value of that parameter is known - such as a population distribution function, or a simulation model.
Notice that, in order for a to relate to c, you must know what sort of population your observations represent - or have a plausible reason for assuming so.
Therefore, how you calculate a maximum likelihood statistic depends upon a given sample of values and a given type of population distribution - this latter point may not always be stated explicitly, but nonetheless it is always assumed.
There are two ways of getting a maximum likelihood estimate (MLE):
- Get the (pre-cooked) formula out of a textbook, or from a computer package, and apply it to your data. This is usually, but not always, a simple one-off calculation; some have to be performed iteratively - such as the negative binomial shape parameter MLE.
- 'Fit' the formula to your data, by finding which of a range of values for your parameter yields the highest probability of producing your observed result. Again this may be done by applying a simple formula (such as when fitting an ordinary linear regression), or analytically (which assumes the maximum likelihood is where the slope of the likelihood function equals zero, and that there is only one such point), or graphically - as shown in Fig.7.
In any practical situation we very seldom know everything we need to know about the population being sampled, or its parameters - which is usually why we are sampling it. Given a choice of possible populations, and assuming your sample reflects the composition of the population it was drawn from, the best choice is whichever population has the maximum likelihood of yielding your observed result.
To illustrate how a maximum likelihood statistic might be fitted let us employ an everyday statistic whose properties you should be familiar with.
For example, suppose you have a sample of 7 nvCJD patients. Of these, three patients show a clear positive response to a new but controversial type of chemotherapy; the remaining four do not. Setting aside the problems of estimators, and assuming our observed proportion of successes is merely an estimate of the proportion in the population as a whole, let us compare the likelihood of observing these results over a range of values of that parameter.
Assuming each of our observations represents its population, you might reasonably assume you had a 3 out of 7 chance (P) of each result being a success, and a 4 in 7 chance (1 − P, or Q) that it would be a failure. Since we have no idea of what the true proportion of successes really is, let us make a series of guesses as to what P might be - assuming P could be anywhere from zero to one.
Logically, because the only information we have on its population is our sample, that combination of results has the maximum likelihood of arising when P = p - for other statistics and estimators this relation is not so straightforward! The graph below shows the probability of observing p = 3/7 for each of 50 possible values of P and, as you might expect, the value yielding the highest likelihood was very close to P = f/n = 3/7 - suggesting the maximum likelihood estimate of P, given a binomial model, is not (for instance) f/(n−1) or (f−0.5)/(n−0.5).
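That grid-search fit is simple enough to reproduce. A sketch, assuming the binomial model described in the text, with n = 7 patients and f = 3 successes:

```python
import math

# n = 7 patients, f = 3 observed successes, as in the text's example.
n, f = 7, 3

def binomial_likelihood(P):
    # Probability of observing exactly f successes in n independent trials,
    # given a population proportion P.
    return math.comb(n, f) * P ** f * (1 - P) ** (n - f)

# Evaluate the likelihood over a grid of candidate values of P
# (avoiding the endpoints P = 0 and P = 1, where the likelihood is zero).
grid = [i / 50 for i in range(1, 50)]
best = max(grid, key=binomial_likelihood)
print(best, round(binomial_likelihood(best), 4))
```

On this grid the maximum falls at 0.42, the grid point closest to f/n = 3/7 ≈ 0.4286 - a finer grid would land correspondingly closer.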
Notice that, although this looks like a binomial distribution, the graph above is a likelihood function. Nevertheless, by assuming that these observations are independent and random, we are implicitly assuming they would be binomially distributed. If, however, there is appreciable measurement error, or these results are in some way associated, this model may be highly unrealistic.
In order to calculate a maximum likelihood, you must assume you know how your population is distributed.
A 'maximum likelihood' statistic assumes your data represents a specific defined population - if that is not the case NO statistic can be described thus.
Some common maximum likelihood statistics are:
- The arithmetic mean, which provides the maximum likelihood estimate of the population mean when observations are randomly selected from a normal population.
- The sample proportion, p, which gives the MLE of the population proportion (P) when the number of successes is binomially distributed.
- The line predicted by an ordinary linear regression, when errors about that line represent a normal population whose mean is zero and whose variance is uniform along that line.
Note: When calculated from sufficiently large samples, quite a few maximum likelihood statistics have an approximately normal distribution, and their bias converges to zero - although some converge to this asymptotic behaviour quite slowly.
- For example, when calculated from normally distributed samples, the maximum likelihood estimate of the population variance (dividing by n rather than n − 1) underestimates that variance by a factor of (n − 1)/n - so it must be multiplied by n/(n − 1) to remove the bias.
As a result, maximum likelihood estimators are commonly fitted and tested assuming a normal distribution - even for very small sample sizes.
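The variance bias noted above is easy to see by simulation. A sketch, drawing many small samples from a standard normal population (σ² = 1), so the MLE of the variance should average about (n − 1)/n:

```python
import random

random.seed(1)

# Draw many small samples from a standard normal population (variance = 1)
# and average the maximum likelihood estimate of the variance.
n, trials = 5, 100_000
mle_total = 0.0
for _ in range(trials):
    sample = [random.gauss(0, 1) for _ in range(n)]
    mean = sum(sample) / n
    # MLE of the variance: divide the sum of squares by n, not n - 1.
    mle_total += sum((x - mean) ** 2 for x in sample) / n

avg_mle = mle_total / trials
print(round(avg_mle, 2))  # close to (n - 1) / n = 0.8, not 1.0
```

With n = 5 the average MLE settles near 0.8 rather than the true variance of 1 - exactly the (n − 1)/n bias, which multiplying by n/(n − 1) removes.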