What is a quantile, and does it matter?
Many elementary stats books treat quantiles as of trivial or academic importance, and they are largely ignored by biologists. Perplexingly, while quantiles underlie much statistical exploration and analysis, and are remarkably useful and informative, they can be surprisingly controversial - at any rate ignoring them causes an awful lot of needless problems! To understand quantiles it is best to begin with a very simple, albeit imperfect, definition.
A quantile is a location within a set of ranked numbers, below which a certain proportion, p, of that set lie.
For example, consider these 4 numbers, arranged in ascending order:
|egg weight (mg):||0.1||1.5||3.7||9.4
|p = 0.5||the pth quantile ⇑ (between 1.5 and 3.7)
Under the definition above, if p is 0.5, then half of the observations lie below a value - the 0.5th quantile - commonly known as the median. In this particular set the median lies somewhere between the second and third ranked values (1.5 and 3.7).
Notice that, provided we are dealing with just 4 values, the 0.5th quantile would lie between the second and third ranked values - irrespective of what their values happen to be! In other words, quantiles describe the relative rank of a value within its set.
There are two very different ways of working with quantiles.
- You can calculate the value corresponding to the rank of a predefined pth quantile.
- Conversely, you can calculate p given the rank of a given value within a set of n values.
Although p can be any value from zero to one, elementary textbooks often define quantiles as locations which divide a set of numbers into equal (often predefined) portions.
The pth quantile
The pth quantile is that value which demarcates a given proportion of a set of values.
- The median is a single value, the 0.5th quantile, which divides an ordered set into 2 equal groups.
- A quartile is one of three values which divide an ordered set into 4 equal sized groups.
- A decile is one of nine values which divide an ordered set into 10 equal groups.
- A percentile is one of 99 values which demarcate an ordered set into 100 equal groups.
Once again, consider this set of 4 values:
|egg weight (mg):||0.1||1.5||3.7||9.4
Under our simple definition, above,
- The lower quartile is the value below which p = 1/4 of a set lie - in this case that value is somewhere between 0.1 and 1.5.
- The middle quartile (or median) is the value below which p = 2/4 of a set lie - in this case that value is between 1.5 and 3.7.
- The upper quartile is the value below which p = 3/4 of them lie - in this case that value lies somewhere between 3.7 and 9.4.
Notice that, where predefined quantiles fall between the values of a set, you have to estimate their exact value by interpolation or by rounding. This is because quantiles are defined in terms of the rank of each value in its set.
The pth member of a set (Y) is known as the pth order statistic, or y(p) - or just yp.
For example, consider this set of three values of variable Y: 101, 122, 303. The middle ranking value, 122, is the median, or p = 0.5th quantile of this set - and y0.5 is the median of Y.
The rth quantile
The rth quantile is the rth value of a set of values.
Whilst, in principle, you could divide an ordered set into as many equal groups as you might wish, in practice the maximum number is usually limited to the number of values to be divided up. In other words, if you have n values, you could divide that set into n quantiles.
Even where you have to estimate the value of a quantile by interpolation, a quantile is primarily defined in terms of its rank (r) within an ordered set of n values. To make this measure independent of the set's size, the relative rank r/n is often used instead. Since an 'ordinal number' defines the rank (r) of a value within an ordered set, a quantile (including a predefined one) is also known as an order statistic.
The rth member of a set (Y) is described as the rth order statistic, or y(r)
For example, y(1), the minimum, is the 1st order statistic and the 0th quantile - below which none (p = 0) of its set lie. The median is the middle-ranking, or mean ranking, or most typical, or least deviant, or least extreme value of its set - the maximum and minimum are the most extreme quantiles, or outlying values.
Each number in an ordered set corresponds to a quantile of that set - for which a value of p may be calculated from the value's rank (or relative rank), or vice versa.
Converting a value of p (or a P-value) to a rank (or a relative rank) is very simple if the only quantile you are interested in is the median. Unfortunately, in order to reliably interconvert rank and p-value, we need to refine our definition somewhat. At least, if you are to select the most appropriate formula, you need to understand what they assume and approximate.
Quantiles as a relative rank
If you are dealing with a large set of n, non-identical values, the relative rank of each value (r/n) is approximately the same as its p-value - and vice versa.
In other words, p ≅ r/n
As a result, a quantile is often defined as the value which has the pth relative rank within a 'population' - where the population is assumed to be an infinitely large set of different values. Implicitly this defines p as the proportion of a set whose ranks are less than or equal to the pth quantile.
This definition is important because, very often, you are dealing with a sample of some population. Moreover, if your set of values is a random selection from a very much larger set of values, you would expect the relative rank of a value in your subset to be similar to its relative rank within its superset - or 'population'. Given which, sample quantiles are much used as estimates of their equivalent population quantiles.
Of course, in real data the number of values in a set is finite, and often quite small. In addition, each observation can only have its own, unique, p-value if every observation in the set is different (termed unique, or un-tied).
In practice therefore, this approximation is biased - especially for small sets, and very especially for extreme quantiles. As you might imagine, this has some important consequences. Even if we ignore the difficulties of estimating extreme quantiles of a superset, from a representative subset, one problem introduced by an imperfect definition is easy enough to demonstrate.
For example, suppose you want to find the 0.5th quantile (the median) of a set of n = 3 values:
Under our first definition, only p = 1/3 can be ranked below the middle-ranking value, and 1/3 above it - in which case there cannot be a 0.5th quantile.
If you assume p is the relative rank, p = 2/3 are less than or equal to the median - again there is no 0.5th quantile.
Therefore unless you ignore the middle-ranking value, or chop it in half (so p = [r − 1/2]/n), there is no median.
Similarly, only p = 2/3 of the set are less than the maximum.
Conversely, p = 1/3 of the set have ranks equal to or below the minimum value.
Chopping the offending value in half does not solve the problem, unless you are prepared to accept the minimum is the 1/6th quantile, and the maximum is the 5/6th quantile.
Provided n is large, the errors in using p = [r − 1/2]/n may be small enough to be ignored (and using it reduces the bias in estimating extreme quantiles). For small sets, one way to avoid the difficulties listed above is to ignore the offending rank, and correct your calculations to allow for the fact.
The pth quantile is the location within a set of n (sequentially) ranked numbers, below which pn of ranks lie - and above which n − pn of ranks lie.
To make this definition work properly we have to, in effect, reduce the number of items in the set by one. In other words, instead of defining p = r / n, this definition assumes that p = [r − 1] / [n − 1].
Nevertheless, even for purely descriptive work, this formula is not heavily used - and most quantiles are calculated from their relative rank, or some variant thereof.
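The three rank-to-p conversions discussed above can be compared side by side. What follows is a minimal Python sketch rather than anything from the original text: the function names are hypothetical, and exact fractions are used so that the results for n = 3 can be read off directly.

```python
from fractions import Fraction

# Three ways of converting a rank r (1..n) into a proportion p.
# The function names are illustrative assumptions, not standard terms.

def p_relative_rank(r, n):
    """p = r / n: the proportion of ranks less than or equal to r."""
    return Fraction(r, n)

def p_half_rank(r, n):
    """p = [r - 1/2] / n: the ranked value is 'chopped in half'."""
    return Fraction(2 * r - 1, 2 * n)

def p_rank_minus_one(r, n):
    """p = [r - 1] / [n - 1]: the ranked value itself is ignored (needs n >= 2)."""
    return Fraction(r - 1, n - 1)

n = 3
ranks = range(1, n + 1)
by_relative_rank = [p_relative_rank(r, n) for r in ranks]
by_half_rank = [p_half_rank(r, n) for r in ranks]
by_rank_minus_one = [p_rank_minus_one(r, n) for r in ranks]

for label, ps in [("r/n", by_relative_rank),
                  ("[r-1/2]/n", by_half_rank),
                  ("[r-1]/[n-1]", by_rank_minus_one)]:
    print(label, [str(p) for p in ps])
```

For n = 3, only the last two conversions give p = 1/2 for the middle-ranking value, and only the last also gives p = 0 for the minimum and p = 1 for the maximum.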
How to calculate quantiles
Given the fact that a quantile can be defined in several ways, you may not be surprised to learn there is more than one way of estimating quantiles. Rather than merely listing these ad nauseam we use one method here - and try to bring out some general principles. Of these, the most important is that these calculations are based upon an observation's rank - rather than its value.
The rank and quantile of any observation can be simply and uniquely inter-converted using:
(1) p = [r − 1]/[n − 1]
(2) r = 1 + p[n − 1]
To see how these two formulae work for a variety of quantiles, let us first assume you are only interested in a finite number, n, of values. Again, entirely for the sake of convenience, let us describe this collection of values as "observations of variable Y".
Provided that no two observations are the same, in other words Y is un-tied, each value of Y will have a unique non-arbitrary rank.
For example, we can calculate a p-value for each value within this ordered set of (n =) 6 observations:
|y(r) = ||0.1||1.2||9.3||14||55||116
|r = ||1||2||3||4||5||6
|p = [r − 1] / [6 − 1] = ||0.0||0.2||0.4||0.6||0.8||1.0
Provided each value of Y was different, we would obtain identical p-values for any 6 observations.
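Formulae (1) and (2) are simple enough to check by machine. The sketch below is an illustration only, with hypothetical function names; it converts each rank of the 6 observations to p, and then back again.

```python
def rank_to_p(r, n):
    """Formula (1): p = [r - 1] / [n - 1], for n >= 2."""
    return (r - 1) / (n - 1)

def p_to_rank(p, n):
    """Formula (2): r = 1 + p[n - 1]; the result may be fractional."""
    return 1 + p * (n - 1)

observations = [0.1, 1.2, 9.3, 14, 55, 116]  # already in ascending order
n = len(observations)

# One p-value per rank: these depend only on n, not on the values of Y.
p_values = [rank_to_p(r, n) for r in range(1, n + 1)]

# Converting each p back again recovers the original rank.
ranks_back = [p_to_rank(p, n) for p in p_values]

for y, p, r in zip(observations, p_values, ranks_back):
    print(y, round(p, 3), round(r, 3))
```

Because only the ranks enter the calculation, any other 6 un-tied observations would give the same p-values.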
Using the same 6 observations, we can estimate some quantiles with preassigned p-values: the minimum, lower quartile, median, upper quartile, and maximum.
|p = ||0.0||0.25||0.5||0.75||1.0
|r = 1 + p[6 − 1] = ||1||2.25||3.5||4.75||6
|y(r) = ||0.1||from 1.2 to 9.3||from 9.3 to 14||from 14 to 55||116
Since in this case, three of our quantiles have fractional ranks, we cannot exactly 'detransform' these ranks to values of y(r).
Although there are a variety of ways of rounding to a whole number of ranks, in general, the best way to estimate y(r) is to interpolate - either graphically, or using this formula:
Algebraically speaking -
y(r) = [1 − f] y(i) + [f] y(i+1)
where:
- i is the integer part of r
(in other words the largest whole number ≤ r, so if r = 2.25 then i = 2)
- f is the fractional part of r (in other words equal to r − i)
- So, if f = 0 then y(r) = y(i)
For example, the lower quartile of these 6 observations has a rank of 2.25, or about a quarter of the way between y(2) = 1.2 and y(3) = 9.3. So we estimate its location as [0.75 × 1.2] + [0.25 × 9.3], or 3.225. Whereas their median has a rank of 3.5, midway between y(3) = 9.3 & y(4) = 14, so our best estimate of the median is [0.5 × 9.3] + [0.5 × 14], or 11.65.
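The interpolation just described can be written as a short function. This is a sketch under stated assumptions, not a definitive implementation: the name `quantile` is hypothetical, the input must already be sorted in ascending order, and ranks come from r = 1 + p[n − 1].

```python
import math

def quantile(sorted_values, p):
    """Estimate the pth quantile by linear interpolation between ranks.

    Assumes sorted_values is in ascending order and 0 <= p <= 1.
    """
    n = len(sorted_values)
    if n == 0:
        raise ValueError("need at least one value")
    if not 0 <= p <= 1:
        raise ValueError("p must lie between 0 and 1")
    r = 1 + p * (n - 1)      # possibly fractional rank
    i = math.floor(r)        # integer part of r
    f = r - i                # fractional part of r
    if f == 0:
        return sorted_values[i - 1]          # y(r) = y(i); lists are 0-indexed
    return (1 - f) * sorted_values[i - 1] + f * sorted_values[i]

observations = [0.1, 1.2, 9.3, 14, 55, 116]
lower_quartile = quantile(observations, 0.25)  # about 3.225
median = quantile(observations, 0.5)           # about 11.65
upper_quartile = quantile(observations, 0.75)  # about 44.75
print(lower_quartile, median, upper_quartile)
```

The minimum (p = 0) and maximum (p = 1) fall on whole ranks, so they are returned without any interpolation.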
Below you can see the graphical equivalent of this. Horizontal lines show the predetermined p-values, and their corresponding ranks, vertical lines show the corresponding quantiles. Dashed lines indicate quantiles obtained by linear interpolation.
When one or more values in a set are identical we have a different situation in that, whilst each rank can be associated with a different value of p, more than one of these have the same value. So, although you can obtain quantiles in the same way as above, some of them may have identical values.
For example, the scatterplot below shows how extensive ties affect the minimum, lower quartile, median, upper quartile, and maximum. In this case, while none of these predefined quantiles have fractional nominal ranks, the minimum, lower quartile and median have identical values.
If, for any reason, you are particularly interested in working out a p-value for a particular item of a set, you must bear in mind the fact that its sequential rank amidst identical fellows is inherently arbitrary. So it is hard to defend calculating a p-value using the highest possible rank that item could have.
One solution to this is to use, not the highest sequential rank, but the mean rank. For instance, given that 3 of these five values are identical: 0.1 110.2 110.2 110.2 171.5 instead of arbitrarily allocating the tied values different ranks, (1 2 3 4 5) we can give them all the same mean ranking (1 3 3 3 5).
P-values calculated on the basis of mean rank are known as 'mid-p-values'. 'Conventional p-values' therefore tend to be rather higher than mid-p-values. Thus, in the set above, the conventional p-value (using [r − 0.5]/n) for the value 110.2 would be 3.5/5 (or 0.7) - whereas, using the same conversion, its mid-p-value is 2.5/5 (or 0.5).
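The difference between the two kinds of p-value can be reproduced with a few lines of Python. The sketch below is illustrative only; `tied_ranks` and `p_half_rank` are hypothetical names, and the input is assumed to be sorted in ascending order.

```python
def tied_ranks(sorted_values, how="mean"):
    """Rank an ascending list, giving each group of ties one shared rank.

    how="mean" uses the mean of the group's sequential ranks;
    how="highest" uses the highest of the group's sequential ranks.
    """
    if how not in ("mean", "highest"):
        raise ValueError("how must be 'mean' or 'highest'")
    ranks = []
    n = len(sorted_values)
    start = 0
    while start < n:
        end = start
        # Extend the group while the next value equals the current one.
        while end + 1 < n and sorted_values[end + 1] == sorted_values[start]:
            end += 1
        first_rank, last_rank = start + 1, end + 1
        shared = (first_rank + last_rank) / 2 if how == "mean" else last_rank
        ranks.extend([shared] * (end - start + 1))
        start = end + 1
    return ranks

def p_half_rank(r, n):
    """p = [r - 0.5] / n, the conversion used in the text above."""
    return (r - 0.5) / n

values = [0.1, 110.2, 110.2, 110.2, 171.5]
n = len(values)
mean_r = tied_ranks(values, "mean")         # 1, 3, 3, 3, 5
highest_r = tied_ranks(values, "highest")   # 1, 4, 4, 4, 5

mid_p = p_half_rank(mean_r[1], n)               # 2.5 / 5
conventional_p = p_half_rank(highest_r[1], n)   # 3.5 / 5
print(mid_p, conventional_p)
```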
We explore some practical consequences of this in Units 5 and 6.
As estimates of supersets
All too often the reason you are interested in the quantiles, or p-values, of a sample is in order to estimate the corresponding quantiles, or p-values, of a population. In other words, your sample is assumed to be a (sub)set which represents a defined superset of values - from which it was randomly selected. Without going into too many details, the following points are worth noting:
Although the median of a sample should provide an unbiased estimate of its population median, this is only true on average - assuming that your sample really does represent that superset. However, the median is a rather less reliable (more variable) estimate of its true value than the mean is of its true value - particularly where samples are small, and their observations are heavily tied.
Extreme quantiles, such as the minimum and maximum, are biased estimates of their population values. They are also highly unreliable (variable) estimates. Nevertheless, the error in highly skewed populations is similarly asymmetric - for example the largest possible number of breeding females in a vole colony might well be very much more than was observed, whereas the minimum can only be slightly below 1.
For extremely large samples r / n approaches p = [r − 1] / [n − 1], in other words the relative rank (used by cumulative distribution functions) can be used to estimate p-values. Nevertheless - unless n is infinitely big - the relative rank provides a biased estimate of p, and of the median, minimum and maximum. Using p ≅ [r − 0.5] / n, instead of r / n, removes this bias for the median - but only partly corrects the more extreme p-values.
Uses of quantiles
- To provide summary statistics of location and spread.
Measures of location
If one is dealing with a variable measured on the ordinal rather than the measurement scale, the median provides a more appropriate measure of location of a distribution than the mean. It is also more appropriate for a measurement variable if the distribution of that variable is skewed. We deal with the median in more detail in the More Information page on Measures of location.
Measures of spread
A popular measure of spread, derived from quantiles, is the interquartile range. This comprises the middle, most typical, 50% of observations - enclosed by first and third quartiles. The interquartile range is the best descriptive measure of variability if the distribution of the variable is not symmetrical. Unlike the standard deviation (see Unit 3), which assumes a symmetrical distribution, the interquartile range accurately reflects differing amounts of variability above and below the median. A common way of summarizing a frequency distribution is to give the five-quantile summary (also known as the five number summary), namely the minimum, the lower quartile, the median, the upper quartile, and the maximum.
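As a rough illustration, the five-number summary and interquartile range can be computed from interpolated quantiles. The sketch below assumes ranks are obtained from r = 1 + p[n − 1]; the function names are hypothetical, and the 6 values are the un-tied observations used earlier in this page.

```python
import math

def quantile(sorted_values, p):
    """pth quantile by linear interpolation, using r = 1 + p[n - 1]."""
    n = len(sorted_values)
    r = 1 + p * (n - 1)
    i = math.floor(r)
    f = r - i
    if f == 0:
        return sorted_values[i - 1]
    return (1 - f) * sorted_values[i - 1] + f * sorted_values[i]

def five_number_summary(values):
    """Return (minimum, lower quartile, median, upper quartile, maximum)."""
    ordered = sorted(values)
    return tuple(quantile(ordered, p) for p in (0.0, 0.25, 0.5, 0.75, 1.0))

observations = [0.1, 1.2, 9.3, 14, 55, 116]
summary = five_number_summary(observations)
interquartile_range = summary[3] - summary[1]  # upper minus lower quartile

print(summary)
print(interquartile_range)
```

Here the upper quartile lies much further from the median than the lower quartile does, which the interquartile range alone would not reveal but the five-number summary does.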
An alternate and very useful measure of spread is the reference interval - which comprises the 95% of observations, enclosed by the quantiles that cut off 2.5% at each end of the distribution. In other words it comprises the 95% most typical observations. The interval must initially be defined using a large number of observations. Reference intervals are used extensively in clinical chemistry to define the 'normal' range of concentrations of substances found in body fluids. For example, the reference interval of total serum bilirubin concentration (a useful marker of liver and blood disorders) in healthy adults is 0.2 to 1.0 mg/dl. Reference intervals have also been recommended as descriptive statistics for cost data, on the basis that they are more useful than the full range, or 100% range, (minimum to maximum) - which is totally dependent on the two most extreme observations. Sometimes other ranges are used such as the 90% range which comprises the 90% most typical observations.
- To investigate and compare the shape of frequency distributions
To divide data into equal size groups for further analysis
- Box and whisker plots
Quantiles give some information about the shape of a distribution - in particular whether a distribution is skewed or not. For example if the upper quartile is further from the median than the lower quartile, we can conclude that the distribution is skewed to the right, and vice versa. However, quantiles can only be used to provide such information if the distribution is unimodal.
Information about quantiles is best displayed using a box-and-whisker plot. The quantiles most commonly used in a box-and-whisker plot are:
- The median - usually represented as a small square or a wide horizontal line.
- The inter-quartile range - indicated by a box around the median. The upper and lower sides of the box indicate the upper and lower quartiles.
- The range - either the minimum and maximum; or the reference interval; or the 90% range; or the furthest observations that lie within 1.5 times the interquartile range. In the latter instances, observations outside of that range are shown individually as outliers. A line (or whisker) is drawn from each side of the box to the range values.
Sometimes only the median and interquartile range are displayed without the range. This may be done using either a box (producing a boxplot) or the whiskers (producing a range plot).
We can use quantiles in a different way to display and compare cumulative frequency distributions. For a single distribution, the rank of each observation is converted to a relative rank, or a percentile, and then plotted against the value of its observation. Such plots are known as rank scatterplots or empirical cumulative distribution functions (ECDFs), and are described in the More Information page on Frequency distributions. They have several advantages over histograms - the data do not need to be arbitrarily grouped into classes, and values which demarcate a given proportion of the observations can be readily determined.
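The points of such a plot are easy to generate. The following is a minimal sketch, assuming un-tied observations and using the relative rank r/n as the plotted proportion; `ecdf_points` is a hypothetical name, and the plotting itself is left out.

```python
def ecdf_points(values):
    """Return (value, relative rank) pairs for an un-tied set of values."""
    ordered = sorted(values)
    n = len(ordered)
    # enumerate from 1 so that r is the sequential rank of each value
    return [(y, r / n) for r, y in enumerate(ordered, start=1)]

observations = [9.3, 0.1, 55, 1.2, 116, 14]  # deliberately unsorted
points = ecdf_points(observations)

for value, relative_rank in points:
    print(value, round(relative_rank, 3))
```

Plotting relative rank against value, as a step function or as points, gives the rank scatterplot without any grouping into classes.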
For comparing distributions, a quantile-quantile plot (also known as a QQ plot) is very useful (when plotted from 2 samples it is sometimes specified as an empirical QQ plot). Quantiles corresponding to the same ranks from the two distributions are plotted against each other, along with a 45-degree (x=y) reference line. If distributions are similar in all respects, the points will fall along this line. If distributions are of similar shape, but one is shifted to the right of the other, the points will fall along a line parallel to the reference line. If there is multiplicative difference between the distributions, the points will fall along a straight line at an angle to the reference line. If distributions differ in shape, the points will follow an irregular pattern in relation to the reference line. If sample sizes are the same, this is straightforward; if not, the values of the appropriate quantiles must be interpolated as described under 'How to calculate quantiles'.
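When the two samples differ in size, one workable approach is to take the p-values of the smaller sample's ranks and interpolate the matching quantiles of the larger sample. The sketch below assumes p = [r − 1]/[n − 1]; the function names and both data sets are hypothetical, made up purely for illustration.

```python
import math

def quantile(sorted_values, p):
    """pth quantile by linear interpolation, using r = 1 + p[n - 1]."""
    n = len(sorted_values)
    r = 1 + p * (n - 1)
    i = math.floor(r)
    f = r - i
    if f == 0:
        return sorted_values[i - 1]
    return (1 - f) * sorted_values[i - 1] + f * sorted_values[i]

def qq_pairs(sample_x, sample_y):
    """Pair up quantiles of two samples at the p-values of the smaller one.

    Assumes the smaller sample has at least 2 values.
    """
    xs, ys = sorted(sample_x), sorted(sample_y)
    m = min(len(xs), len(ys))
    p_values = [(r - 1) / (m - 1) for r in range(1, m + 1)]
    return [(quantile(xs, p), quantile(ys, p)) for p in p_values]

# Hypothetical data: y has the same shape as x, but shifted upwards by 10.
x = [1, 2, 3, 4, 5]
y = [11, 11.5, 12, 12.5, 13, 13.5, 14, 14.5, 15]
pairs = qq_pairs(x, y)
differences = [qy - qx for qx, qy in pairs]

print(pairs)
print(differences)
```

Every pair here differs by the same amount, so the plotted points would lie on a line parallel to the x = y reference line, as expected for a pure shift.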
Note: two-sample QQ plots are less popular than single-sample QQ plots, in which observed values are plotted against their typical or expected or 'theoretical' values - as described in Unit 3.
Quartiles and quintiles are commonly used to divide up observations when collapsing a variable from a measurement variable to an ordinal variable. This may be done to explanatory variables for some types of analysis (for example logistic regression).
Another reason for dividing observations into groups by quantiles is to define cohorts that can then be followed over time. For example, in a study of eating habits and obesity a group of young children might be divided by quartiles on the basis of their consumption of fresh vegetables. Consumption of vegetables would then be measured in the same groups over a number of years to see if the groups maintained their differences - known as tracking - or instead converged to a common value.
To identify outliers
An outlier may be loosely defined as an extreme or outlying observation. A genuine outlier is an observation that has arisen due to measurement error, or is not a member of the population in question - for example if a male is inadvertently included in a sample of females. Observations outside a particular range of quantiles are sometimes regarded as outliers, which can be deleted from the data set. A common criterion used for an outlier is whether it lies more than 1.5 × the interquartile range below the lower quartile or above the upper quartile. We look into this in more depth in Unit 2.
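The 1.5 × interquartile range criterion can be sketched as follows. This assumes the common convention of measuring that distance outwards from the lower and upper quartiles, with quartiles interpolated from r = 1 + p[n − 1]; the function names and the data are hypothetical.

```python
import math

def quantile(sorted_values, p):
    """pth quantile by linear interpolation, using r = 1 + p[n - 1]."""
    n = len(sorted_values)
    r = 1 + p * (n - 1)
    i = math.floor(r)
    f = r - i
    if f == 0:
        return sorted_values[i - 1]
    return (1 - f) * sorted_values[i - 1] + f * sorted_values[i]

def flag_outliers(values, k=1.5):
    """Return (low fence, high fence, values lying outside the fences)."""
    ordered = sorted(values)
    q1 = quantile(ordered, 0.25)
    q3 = quantile(ordered, 0.75)
    iqr = q3 - q1
    low_fence = q1 - k * iqr
    high_fence = q3 + k * iqr
    flagged = [y for y in ordered if y < low_fence or y > high_fence]
    return low_fence, high_fence, flagged

# Hypothetical data with one suspiciously large value.
data = [2, 3, 4, 5, 6, 7, 8, 9, 30]
low_fence, high_fence, flagged = flag_outliers(data)
print(low_fence, high_fence, flagged)
```

Flagging a value in this way only marks it as worth checking; whether it is a genuine outlier still depends on how it arose.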
To compare with population quantiles
Quantiles calculated from a sample are often used to estimate their equivalent population quantiles - as where the median of a sample (or 'sample median') is used to estimate the median of the 'population' of values that sample (hopefully) represents. Unit 3 shows how quantile-quantile plots can be used to compare the distribution of a sample with that of a theoretical population.
In more advanced studies, simulation models are used to calculate the same statistic from large numbers of random samples of the same population - this set of sample statistics is treated as if it were a sample of a much larger population of similarly-derived values. Given which the quantiles of that sample of statistics may be used to estimate their population quantiles.
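A toy version of this idea is sketched below. Everything in it is an assumption made for illustration: the population is taken to be exponential with mean 1, the statistic is the sample median, and the sample size, number of samples and random seed are arbitrary choices.

```python
import math
import random
import statistics

def quantile(sorted_values, p):
    """pth quantile by linear interpolation, using r = 1 + p[n - 1]."""
    n = len(sorted_values)
    r = 1 + p * (n - 1)
    i = math.floor(r)
    f = r - i
    if f == 0:
        return sorted_values[i - 1]
    return (1 - f) * sorted_values[i - 1] + f * sorted_values[i]

rng = random.Random(42)   # fixed seed so the sketch is repeatable
n_samples = 2000          # number of simulated random samples
sample_size = 15          # observations per sample

# Calculate the same statistic (the median) from each random sample.
sample_medians = sorted(
    statistics.median([rng.expovariate(1.0) for _ in range(sample_size)])
    for _ in range(n_samples)
)

# Quantiles of the simulated statistics: the central 95% of sample medians.
lower = quantile(sample_medians, 0.025)
upper = quantile(sample_medians, 0.975)
print(lower, upper)
```

The interval from `lower` to `upper` encloses the 95% most typical simulated medians, which is the sense in which the quantiles of a set of statistics are used to estimate those of its parent population.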
Statistical tests also employ quantiles to compare the value of an observed statistic with a theoretical test population of similarly-derived statistics. Unit 5 describes how, if an observed statistic falls outside the 95% most typical values, it can be classed as significantly different from its fellows.
P-values of tests