Biology, images, analysis, design...
"It has long been an axiom of mine that the little things are infinitely the most important"
Normal distribution
Normal probability functions
A function is simply a formula which calculates a number from one or more other numbers. In theory at least, there are three ways of relating the value of a normal observation to the probability of obtaining it. Of these, the last is the simplest to calculate and the most familiar, but is the least used. Since, in the case of a normal population, it is also the most awkward to explain, let us deal with the simpler - more useful - methods first.
The cumulative normal distribution function is defined as the proportion, P, of the population X which is less than or equal to a given value, x, where x is any possible value of X between minus and plus infinity. It may also be referred to as the normal distribution function.
Notice that this function does not describe the probability of observing value x, but the probability of observing any value less than or equal to x. As a result, the cumulative normal distribution function is sometimes described as a normal integral function.
Today, most software packages use a cumulative (or integrated) normal function formula, which returns (more or less) the exact probability for any standard normal deviate. Unfortunately, there is no exact formula for converting a standard normal deviate to a cumulative probability. But there are approximations, and a good one is given
This rather horrific formula takes quite a while to do on a calculator. If you decide to make it into a computer program, use 'double precision' variables to avoid rounding errors.
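As a rough illustration, the cumulative function can be sketched in Python. This is not the approximation referred to above. It assumes instead the standard identity relating the cumulative normal function to the error function, which Python's `math.erf` evaluates numerically. The function name `normal_cdf` is chosen here for illustration, and Python floats are already double precision.

```python
import math

def normal_cdf(x, mean=0.0, sd=1.0):
    """Proportion of a normal population expected at or below x."""
    z = (x - mean) / sd                       # standard normal deviate
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

print(normal_cdf(0.0))     # 0.5: half the population lies at or below the mean
print(normal_cdf(1.96))    # approximately 0.975
```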
The inverse normal probability function does the opposite of the cumulative function: it estimates the value of x at or below which a certain proportion P of the population lies.
As with the cumulative normal function, there is no exact formula for converting cumulative probability to standard normal deviates. Below is an approximation from the same source as the one above.
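For readers who simply need the numbers, a minimal sketch using Python's standard library is shown here. It relies on `statistics.NormalDist.inv_cdf`, which uses its own numerical approximation and is not the formula from the source mentioned above.

```python
from statistics import NormalDist

standard = NormalDist(mu=0.0, sigma=1.0)

p = 0.975
z = standard.inv_cdf(p)      # deviate at or below which a proportion p lies
print(round(z, 3))           # 1.96

# Round trip: the cumulative function undoes the inverse function.
print(abs(standard.cdf(z) - p) < 1e-9)
```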
The normal probability density function is the function whose formula you will find in most statistics textbooks.
Where X is normal, because the population is infinite and every value is unique, the probability of obtaining an observation precisely equal to x is vanishingly small. Nevertheless it is possible to calculate the relative probability of obtaining x. Mathematically this probability density, Z, is equivalent to the slope of the cumulative distribution - in other words the rate of increase of P for a given x.
The normal probability density function is often confused with the normal distribution function, or is assumed to provide the probability of observing some value, x. In fact this function, multiplied by the width of a vanishingly small range about x, only approximates the probability of observing a value within that range. As a result, when the same calculation is applied to a finite interval, the result can be very much greater than 1, which a probability, by definition, cannot exceed.
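The two points above (density as the slope of the cumulative distribution, and density not being a probability) can be checked numerically. The sketch below assumes a hypothetical population with a small standard deviation, chosen only so that the density at the mean is clearly greater than 1.

```python
from statistics import NormalDist

narrow = NormalDist(mu=10.0, sigma=0.1)    # hypothetical population, small sd

x = 10.0
density = narrow.pdf(x)                    # about 3.99, so not a probability

# The density is the slope of the cumulative distribution at x.
h = 1e-4
slope = (narrow.cdf(x + h) - narrow.cdf(x - h)) / (2 * h)

# Over a small range about x, probability is roughly density * width.
width = 0.01
exact = narrow.cdf(x + width / 2) - narrow.cdf(x - width / 2)
approx = density * width

print(f"density {density:.3f}, slope {slope:.3f}")
print(f"exact {exact:.5f}, approx {approx:.5f}")
```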
Graphical methods of testing for normality
Fitting a normal distribution
Many packages 'fit' frequency curves to graphs of observations, based upon their mean and standard deviation, and using a mathematical 'normal' function. However, the exact details of how this works depend upon how you express their distribution. Let us deal with the simplest first.
We take the example of the 1881 PCV values that we have used previously. Relative rank of each observation is plotted against its observed value. The smallest relative rank is therefore 1/1881, and the largest is 1881/1881, or 1. Each blue point indicates the proportion of our observations equal to, or less than, X. The green line is the fitted cumulative normal distribution of X, for a population with the same mean and standard deviation as our sample, namely 26.12 and 4.381. The green line is continuous because the computer works out the result of this formula at all points along the X-axis (or a sufficient number to produce a realistic-looking result).
In this particular case, because we assume PCV is a continuous variable that has been rounded to the nearest 1%, using the sequential rank distorts the comparison of these data with the fitted line. The mean rank therefore provides a better measure of the cumulative distribution.
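The difference between sequential and mean rank can be sketched as follows. The helper `relative_ranks` and the five rounded values are hypothetical, and mean rank is assumed here to be the average of the sequential ranks shared by tied observations.

```python
from collections import Counter

def relative_ranks(values, use_mean_rank=True):
    """Return (value, relative rank) pairs, in ascending order of value."""
    ordered = sorted(values)
    n = len(ordered)
    if not use_mean_rank:
        # Sequential rank: ties get different ranks in arbitrary order.
        return [(v, (i + 1) / n) for i, v in enumerate(ordered)]
    counts = Counter(ordered)
    result = []
    below = 0                                  # observations smaller than v
    for v in sorted(counts):
        k = counts[v]
        mean_rank = below + (k + 1) / 2        # average of ranks below+1 .. below+k
        result.extend([(v, mean_rank / n)] * k)
        below += k
    return result

pcv = [24, 25, 25, 25, 26]                     # hypothetical rounded values
print(relative_ranks(pcv, use_mean_rank=False))
print(relative_ranks(pcv, use_mean_rank=True))
```

With sequential ranks the three tied values are spread over three different heights. With mean ranks they share a single height.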
Note that if you plot your observations as rank, rather than relative rank, the computer would have to multiply the expected proportion by the total number of observations (in this case, that would be p × 1881). If you wished to plot the rank as percentiles, it should calculate the probability as a percentage (p × 100).
Fitting a normal distribution to data presented as cumulative class-intervals is done the same way as for a scatterplot (above). In other words, the cumulative normal probability (P), calculated from your observed mean and standard deviation, is multiplied by the number of observations in your sample (N), or by 100 (for percentages).
If you want to predict the proportion of observations in each class-interval (shown in green on the second graph of the set above), you use the cumulative normal distribution to work out the proportion at the class-interval's upper and lower boundaries - and find the difference between them. For example, if its upper boundary was the mean, and the lower was minus infinity, you would expect to find 0.5 - 0 of the observations in that interval.
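The calculation just described can be sketched in Python. The mean, standard deviation and sample size are those quoted earlier for the PCV data. The class boundaries and the helper name `expected_proportion` are hypothetical, chosen only for illustration.

```python
from statistics import NormalDist

# Mean and standard deviation quoted in the text for the PCV sample.
fitted = NormalDist(mu=26.12, sigma=4.381)
n = 1881

def expected_proportion(lower, upper):
    """Proportion of a normal population expected between two class boundaries."""
    return fitted.cdf(upper) - fitted.cdf(lower)

# The text's example: lower boundary minus infinity, upper boundary the mean.
below_mean = expected_proportion(float("-inf"), 26.12)
print(below_mean)                              # 0.5

# Hypothetical class boundaries, for illustration only.
boundaries = [19.5, 22.5, 25.5, 28.5, 31.5]
for lower, upper in zip(boundaries, boundaries[1:]):
    p = expected_proportion(lower, upper)
    print(f"{lower} to {upper}: proportion {p:.3f}, expected count {p * n:.0f}")
```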
Of course, you could also re-plot the predicted proportion of observations in each class as an ordinary histogram. However, fitting a normal distribution to a histogram is a little more complicated.
To fit a normal distribution curve to a histogram of n observations, you need to convert the probability density function to a frequency. To do this you multiply it by n, in which case the area under the curve is equal to n, rather than 1. The peak value should be n/(σ√(2π)) because, at the peak, x is equal to the mean and the exponential term of the density function is therefore 1.
Aside from the information required above, the computer also has to re-scale its predictions to allow for the width of class intervals used. This is because, the narrower the class-interval, the fewer observations it is likely to end up with. For example, see the graph set below.
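A minimal sketch of that re-scaling, assuming the fitted curve's height is taken as n × class width × density. The function name `curve_height` and the two class widths are hypothetical. The mean, standard deviation and n are the PCV values quoted earlier.

```python
import math
from statistics import NormalDist

fitted = NormalDist(mu=26.12, sigma=4.381)     # PCV mean and sd from the text
n = 1881

def curve_height(x, class_width):
    """Expected frequency per class at x: n * class width * density."""
    return n * class_width * fitted.pdf(x)

# With unit-width classes the peak is n / (sigma * sqrt(2 * pi)).
peak_1 = curve_height(26.12, 1.0)
# Classes twice as wide are expected to hold about twice as many observations.
peak_2 = curve_height(26.12, 2.0)

print(f"{peak_1:.1f} {peak_2:.1f}")
```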
The only awkward bit in all of this is that for a histogram, the normal probability 'density' function does not yield a probability as we explained above. Instead, it yields the change in probability at each value of X. This does not cause any major problems provided your class intervals are less than the standard deviation. But if your class intervals are more than about 2.5 times the standard deviation, your relative frequency can exceed 100%, as shown in the second figure above. This causes some confusion because, by definition, the relative frequency of observations cannot exceed 100%.
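The effect of class width can be sketched numerically. The population below is hypothetical, and the comparison is made for a single class centred on the mean: the density-based value is density × class width, while the exact value is the difference in cumulative probability between the class boundaries.

```python
import math
from statistics import NormalDist

sigma = 2.0                                    # hypothetical standard deviation
dist = NormalDist(mu=10.0, sigma=sigma)        # hypothetical population

def density_based(width):
    """Relative frequency of the central class implied by density * class width."""
    return dist.pdf(dist.mean) * width

def exact(width):
    """Proportion actually expected in a class of this width centred on the mean."""
    return dist.cdf(dist.mean + width / 2) - dist.cdf(dist.mean - width / 2)

for multiple in (0.5, 1.0, 2.5, 3.0):
    w = multiple * sigma
    print(f"width {multiple} sd: density-based {density_based(w):.3f}, exact {exact(w):.3f}")

# The density-based value passes 1 (100%) once the width exceeds sigma * sqrt(2 * pi).
threshold = sigma * math.sqrt(2 * math.pi)     # about 2.507 standard deviations
```

The threshold of σ√(2π), roughly 2.5 standard deviations, follows directly from the peak density of 1/(σ√(2π)).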
You may also have problems if you try to fit a normal distribution to a histogram of a discrete variable. Because discrete variables cannot have fractional values, many packages tend to produce very odd results.
Probability-probability (PP) and Quantile-quantile (QQ) plots
Unlike the methods above, which 'fit' a normal distribution to an existing (curved) graph, PP and QQ plots are used to assess normality by plotting 'like against like' - and seeing if the result is a straight line. Although these plots can be calculated for any mathematically 'known' population distribution, they are most commonly used to assess departures from normality.
The relationship between relative rank, quantiles and
Note that we have always plotted observed against expected values. But sometimes expected values are plotted against observed
We will first look at how PP and QQ plots are done for a small
The table below shows the value of each of these n = 10 observations arranged in order of rank (x(r)), their corrected relative rank (crr), their theoretical quantile, and expected P-value. The theoretical quantile was estimated from the inverse normal probability function using R. The expected P-value was estimated from the cumulative normal distribution function, again using R.
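The columns described above can be computed along the following lines. This is a sketch in Python rather than R, with loud assumptions: the ten values are hypothetical, the corrected relative rank is assumed to be (r − 0.5)/n (one common correction, not necessarily the one used here), and the quantiles are taken from a normal distribution with the sample's own mean and standard deviation.

```python
from statistics import NormalDist, mean, stdev

# Hypothetical sample of 10 observations, for illustration only.
sample = [8.1, 12.4, 9.7, 10.2, 11.0, 7.3, 10.9, 9.1, 13.2, 10.4]

ordered = sorted(sample)
n = len(ordered)
fitted = NormalDist(mean(ordered), stdev(ordered))

rows = []
for r, x in enumerate(ordered, start=1):
    crr = (r - 0.5) / n                  # assumed correction to the relative rank
    quantile = fitted.inv_cdf(crr)       # QQ plot: x against this quantile
    expected_p = fitted.cdf(x)           # PP plot: crr against this probability
    rows.append((x, crr, quantile, expected_p))

for x, crr, quantile, expected_p in rows:
    print(f"{x:5.1f}  {crr:4.2f}  {quantile:6.2f}  {expected_p:5.3f}")
```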
For the PP plot, the observed probability (crr) is plotted against the probability expected for a normal distribution
Some software packages offer a range of methods for correcting the relative rank of observations within your sample to provide better estimates of their population equivalents. If you are uncomfortable with these corrections, and do not have access to the appropriate statistical tables, remember that it is possible to obtain rankits by simulation for any distribution for which you have the inverse probability function.
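Obtaining rankits by simulation might look something like the sketch below. It assumes rankits are estimated as the average value of each ranked observation over many simulated samples, each sample being generated by passing uniform random numbers through the inverse probability function. The function name, number of replicates and seed are arbitrary choices made for illustration.

```python
import random
from statistics import NormalDist

def simulated_rankits(n, replicates=5000, seed=1):
    """Estimate the expected value of each ranked observation in a sample of n."""
    rng = random.Random(seed)
    inverse = NormalDist().inv_cdf        # swap in another inverse function as needed
    totals = [0.0] * n
    for _ in range(replicates):
        sample = []
        while len(sample) < n:
            u = rng.random()
            if u > 0.0:                   # inv_cdf is undefined at exactly 0
                sample.append(inverse(u))
        sample.sort()
        for i, value in enumerate(sample):
            totals[i] += value
    return [t / replicates for t in totals]

rankits = simulated_rankits(10)
print([round(r, 2) for r in rankits])
```

More replicates give smoother estimates at the cost of run time. For a non-normal distribution, only the `inverse` function needs to change.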
Interpreting PP & QQ plots
Even when your sample is randomly selected from a normal population, random error can be expected to ensure there is some deviation from the 'perfect fit' line. Indeed various tests compare the observed deviation from that line with the amount you would expect, if your sample represented a normal distribution. To enable you to see what to expect from QQ and PP plots of non-normal data, we have excluded that source of variation in the diagrams below by using rankits.
For the sake of comparison, we used 4 variables, y1, y2, y3, and y4 - these being normal, skewed, platykurtic (flattened), and leptokurtic (peaked). All of them comprised the most typical locations of 90 values from populations having the same mean and standard deviation (μ=10, σ=2). The four histograms, below, show how these values were distributed.