What is a quantile, and does it matter?
Many elementary statistics books treat quantiles as being of trivial or merely academic importance, and they are largely ignored by biologists. Perplexingly, while quantiles underlie much statistical exploration and analysis, and are remarkably useful and informative, they can be surprisingly controversial; at any rate, ignoring them causes an awful lot of needless problems! To understand quantiles it is best to begin with a very simple, albeit imperfect, definition.
A quantile is a location within a set of ranked numbers, below which a certain proportion, p, of that set lie.

For example, consider these 4 numbers, arranged in ascending order:
egg weight (mg):   0.1   1.5   3.7   9.4
                           ⇑
              the pth quantile, p = 0.5
Under the definition above, if p is 0.5, then half of the observations lie below a value, the 0.5th quantile, commonly known as the median. In this particular set the median lies somewhere between the second and third ranked values (1.5 and 3.7).
Notice that, provided we are dealing with just 4 values, the 0.5th quantile would lie between the second and third ranked values, irrespective of what their values happen to be! In other words, quantiles describe the relative rank of a value within its set.
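This dependence on rank alone, rather than on the values themselves, can be sketched in code. A minimal illustration (the function name is our own, and ranks are taken as 1-based):

```python
def median_rank_interval(n):
    """Return the 1-based rank(s) bracketing the 0.5th quantile of n ordered values.

    For odd n the median is a single rank; for even n it lies between two
    ranks, whatever the values at those ranks happen to be.
    """
    if n % 2 == 1:
        return (n // 2 + 1,)          # single middle rank
    return (n // 2, n // 2 + 1)       # median lies between these two ranks

# For any 4 values the median lies between ranks 2 and 3,
# irrespective of what those values are.
print(median_rank_interval(4))  # (2, 3)
```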
There are two very different ways of working with quantiles.
• You can calculate the value corresponding to the rank of a predefined pth quantile.
• Conversely, you can calculate p given the rank of a given value within a set of n values.

Although p can be any value from zero to one, elementary textbooks often define quantiles as locations which divide a set of numbers into equal (often predefined) portions.
The pth quantile
The pth quantile is that value which demarcates a given proportion of a set of values.
• The median is a single value, the 0.5th quantile, which divides an ordered set into 2 equal groups.
• A quartile is one of three values which divide an ordered set into 4 equal-sized groups.
• A decile is one of nine values which divide an ordered set into 10 equal groups.
• A percentile is one of 99 values which divide an ordered set into 100 equal groups.
Once again, consider this set of 4 values:
egg weight (mg):   0.1   1.5   3.7   9.4
rank (r):            1     2     3     4
Under our simple definition above:
• The lower quartile is the value below which p = 1/4 of a set lie; in this case that value is somewhere between 0.1 and 1.5.
• The middle quartile (or median) is the value below which p = 2/4 of a set lie; in this case that value is between 1.5 and 3.7.
• The upper quartile is the value below which p = 3/4 of them lie; in this case that value lies somewhere between 3.7 and 9.4.
Notice that, where predefined quantiles fall between the values of a set, you have to estimate their exact value by interpolation or by rounding. This is because quantiles are defined in terms of the rank of each value in its set.
The pth member of a set (Y) is known as the pth order statistic, or y_{(p)}, or just y_{p}.

For example, consider this set of three values of variable Y: 101, 122, 303. The middle-ranking value, 122, is the median, or p = 0.5th quantile of this set, and y_{0.5} is the median of Y.
The rth quantile
The rth quantile is the rth value of a set of values.
Whilst, in principle, you could divide an ordered set into as many equal groups as you might wish, in practice the maximum number is usually limited to the number of values to be divided up. In other words, if you have n values, you could divide that set into n quantiles.
Even where you have to estimate the value of a quantile by interpolation, a quantile is primarily defined in terms of its rank (r) within an ordered set of n values. To make this measure independent of the set's size, the relative rank r/n is often used instead. Since an 'ordinal number' defines the rank (r) of a value within an ordered set, a quantile (including a predefined one) is also known as an order statistic.
The rth member of a set (Y) is described as the rth order statistic, or y_{(r)}

For example, y_{(1)}, the minimum, is the first order statistic, and the p = 0 quantile, below which none (p = 0) of its set lie. The median is the middle-ranking, or mean-ranking, or most typical, or least deviant, or least extreme value of its set; the maximum and minimum are the most extreme quantiles, or outlying values.
Each number in an ordered set corresponds to a quantile of that set, for which a value of p may be calculated from the value's rank (or relative rank), or vice versa.

Converting a value of p (or a P-value) to a rank (or a relative rank) is very simple if the only quantile you are interested in is the median. Unfortunately, in order to reliably interconvert rank and p-value, we need to refine our definition somewhat. At the least, if you are to select the most appropriate formula, you need to understand what those formulae assume and approximate.
Quantiles as a relative rank
If you are dealing with a large set of n non-identical values, the relative rank of each value (r/n) is approximately the same as its p-value, and vice versa.
In other words, p ≅ r/n

As a result, a quantile is often defined as the value which has the pth relative rank within a 'population', where the population is assumed to be an infinitely large set of different values. Implicitly this defines p as the proportion of a set whose ranks are less than or equal to that of the pth quantile.
This definition is important because, very often, you are dealing with a sample of some population. Moreover, if your set of values is a random selection from a very much larger set of values, you would expect the relative rank of a value in your subset to be similar to its relative rank within its superset, or 'population'. Given which, sample quantiles are much used as estimates of their equivalent population quantiles.
Of course, in real data the number of values in a set is finite, and often quite small. In addition, each observation can only have its own unique p-value if every observation in the set is different (termed unique, or untied).
In practice therefore, this approximation is biased, especially for small sets, and most especially for extreme quantiles. As you might imagine, this has some important consequences. Even if we ignore the difficulties of estimating extreme quantiles of a superset from a representative subset, one problem introduced by an imperfect definition is easy enough to demonstrate.
For example, if you want to find the 0.5th quantile (the median) of a set of n = 3 observations:
• Under our first definition, only p = 1/3 of the set can be ranked below the middle-ranking value, and 1/3 above it; in which case there cannot be a 0.5th quantile.
• If you assume p is the relative rank, p = 2/3 are less than or equal to the median; again there is no 0.5th quantile.
• Therefore, unless you ignore the middle-ranking value, or chop it in half (so p = [r − 1/2]/n), there is no median.
• Similarly, only p = 2/3 of the set are less than the maximum.
• Conversely, p = 1/3 of the set have ranks equal to or below the minimum value.
• Chopping the offending value in half does not solve the problem, unless you are prepared to accept that the minimum is the 1/6th quantile, and the maximum the 5/6th quantile.
Provided n is large, the errors in using p = [r − 1/2]/n may be small enough to be ignored (and doing so reduces the bias in estimating extreme quantiles). For small sets, one way to avoid the difficulties listed above is to ignore the offending rank, and correct your calculations to allow for that fact.
The pth quantile is the location within a set of n (sequentially) ranked numbers, below which pn of the ranks lie, and above which n − pn of the ranks lie.

To make this definition work properly we have to, in effect, reduce the number of items in the set by one. In other words, instead of defining p = r / n, this definition assumes that p = [r − 1] / [n − 1].
Nevertheless, even for purely descriptive work, this formula is not heavily used, and most quantiles are calculated from their relative rank, or some variant thereof.
How to calculate quantiles
Given that a quantile can be defined in several ways, you may not be surprised to learn there is more than one way of estimating quantiles. Rather than merely listing these ad nauseam, we use one method here, and try to bring out some general principles. Of these, the most important is that the calculations are based upon an observation's rank, rather than its value.
The rank and quantile of any observation can be simply and uniquely interconverted using:
(1) p = [r − 1]/[n − 1]
(2) r = 1 + p[n − 1]
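These two formulae can be sketched directly in Python (a minimal illustration; the function names are our own):

```python
def rank_to_p(r, n):
    """Formula (1): convert a 1-based rank r among n ordered values to p,
    using p = (r - 1) / (n - 1)."""
    return (r - 1) / (n - 1)

def p_to_rank(p, n):
    """Formula (2): convert p back to a (possibly fractional) rank,
    using r = 1 + p * (n - 1)."""
    return 1 + p * (n - 1)

print(rank_to_p(3, 6))     # 0.4
print(p_to_rank(0.25, 6))  # 2.25
```

Notice that the two functions are exact inverses of one another, so ranks and p-values can be uniquely interconverted.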
To see how these two formulae work for a variety of quantiles, let us first assume you are only interested in a finite number of n values. Again, entirely for the sake of convenience, let us describe this collection of values as "observations of variable Y".
Untied sets
Provided that no two observations are the same, in other words Y is untied, each value of Y will have a unique, non-arbitrary rank.
For example, we can calculate a p-value for each value within this ordered set of (n =) 6 observations:

y_{(r)} =              0.1   1.2   9.3    14    55   116
r =                      1     2     3     4     5     6
p = [r − 1]/[6 − 1] =  0.0   0.2   0.4   0.6   0.8   1.0
Provided each value of Y was different, we would obtain identical p-values for any 6 observations.
Using the same 6 observations, we can estimate some quantiles with preassigned p-values: the minimum, lower quartile, median, upper quartile, and maximum.
p =                 0.0         0.25          0.5        0.75    1.0
r = 1 + p[6 − 1] = 1.00         2.25          3.5        4.75   6.00
y_{(r)} =           0.1   1.2 to 9.3    9.3 to 14    14 to 55    116
Since in this case, three of our quantiles have fractional ranks, we cannot exactly 'detransform' these ranks to values of y_{(r)}.
Although there are a variety of ways of rounding to a whole number of ranks, in general, the best way to estimate y_{(r)} is to interpolate  either graphically, or using this formula:
Algebraically speaking:
y_{(r)} = [1 − f] y_{(i)} + [f] y_{(i+1)}
Where
• i is the integer part of r (in other words the largest whole number ≤ r, so if r = 2.25 then i = 2)
• f is the fractional part of r (in other words, f = r − i)
• So, if f = 0 then y_{(r)} = y_{(i)}

For example, the lower quartile of these 6 observations has a rank of 2.25, about a quarter of the way between y_{(2)} = 1.2 and y_{(3)} = 9.3. So we estimate its location as [0.75 × 1.2] + [0.25 × 9.3], or 3.225. Whereas their median has a rank of 3.5, midway between y_{(3)} = 9.3 and y_{(4)} = 14, so our best estimate of the median is [0.5 × 9.3] + [0.5 × 14], or 11.65.
Below you can see the graphical equivalent of this. Horizontal lines show the predetermined p-values and their corresponding ranks; vertical lines show the corresponding quantiles. Dashed lines indicate quantiles obtained by linear interpolation.
{Fig. 1}
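The interpolation formula above can be sketched as follows (a minimal implementation; the function name is our own):

```python
def quantile_by_interpolation(sorted_y, p):
    """Estimate the pth quantile of an ordered list, using r = 1 + p[n − 1]
    and then y_(r) = [1 − f] y_(i) + [f] y_(i+1)."""
    n = len(sorted_y)
    r = 1 + p * (n - 1)
    i = int(r)          # integer part of the rank
    f = r - i           # fractional part of the rank
    if f == 0:
        return sorted_y[i - 1]   # ranks are 1-based, list indices 0-based
    return (1 - f) * sorted_y[i - 1] + f * sorted_y[i]

y = [0.1, 1.2, 9.3, 14, 55, 116]
print(quantile_by_interpolation(y, 0.25))  # lower quartile, approximately 3.225
print(quantile_by_interpolation(y, 0.5))   # median, approximately 11.65
```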
Tied sets
When one or more values in a set are identical we have a different situation in that, whilst each rank can be associated with a different value of p, more than one rank may share the same value of Y. So, although you can obtain quantiles in the same way as above, some of them may have identical values.
For example, the scatterplot below shows how extensive ties affect the minimum, lower quartile, median, upper quartile, and maximum. In this case, while none of these predefined quantiles have fractional nominal ranks, the minimum, lower quartile and median have identical values.
{Fig. 2}
If, for any reason, you are particularly interested in working out a p-value for a particular item of a set, you must bear in mind that its sequential rank amidst identical fellows is inherently arbitrary. So it is hard to defend calculating a p-value using the highest possible rank that item could have.
One solution to this is to use, not the highest sequential rank, but the mean rank. For instance, given that 3 of these five values are identical (0.1, 110.2, 110.2, 110.2, 171.5), instead of arbitrarily allocating the tied values different ranks (1, 2, 3, 4, 5), we can give them all the same mean ranking (1, 3, 3, 3, 5).
P-values calculated on the basis of mean rank are known as 'mid-p-values'. Conventional p-values therefore tend to be rather higher than mid-p-values. Thus, in the set above, the conventional p-value (using [r − 0.5]/n) for the value 110.2 would be 3.5/5, whereas, using the same conversion, its mid-p-value is 2.5/5, or 0.5.
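Mean ranking can be sketched as follows (a minimal, unoptimized illustration; the function name is our own):

```python
def mean_ranks(sorted_y):
    """Give tied values their mean rank rather than arbitrary sequential ranks.

    For 0.1, 110.2, 110.2, 110.2, 171.5 this yields 1, 3, 3, 3, 5
    instead of 1, 2, 3, 4, 5.
    """
    ranks = []
    for v in sorted_y:
        # sequential (1-based) ranks of every observation tied with v
        tied = [i + 1 for i, w in enumerate(sorted_y) if w == v]
        ranks.append(sum(tied) / len(tied))
    return ranks

y = [0.1, 110.2, 110.2, 110.2, 171.5]
r = mean_ranks(y)
print(r)                      # [1.0, 3.0, 3.0, 3.0, 5.0]
print((r[1] - 0.5) / len(y))  # mid-p-value of 110.2 via (r - 0.5)/n: 0.5
```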
We explore some practical consequences of this in Units 5 and 6.
As estimates of supersets
All too often the reason you are interested in the quantiles, or p-values, of a sample is in order to estimate the corresponding quantiles, or p-values, of a population. In other words, your sample is assumed to be a (sub)set which represents a defined superset of values, from which it was randomly selected. Without going into too many details, the following points are worth noting:
• Although the median of a sample should provide an unbiased estimate of its population median, this is only true on average, assuming that your sample really does represent that superset. However, the median is a rather less reliable (more variable) estimate of its true value than the mean is of its true value, particularly where samples are small and their observations are heavily tied.
• Extreme quantiles, such as the minimum and maximum, are biased estimates of their population values. They are also highly unreliable (variable) estimates. Nevertheless, the error in highly skewed populations is similarly asymmetric; for example the largest possible number of breeding females in a vole colony might well be very much more than was observed, whereas the minimum can only be slightly below 1.
• For extremely large samples r/n approaches p = [r − 1]/[n − 1]; in other words the relative rank (used by cumulative distribution functions) can be used to estimate p-values. Nevertheless, unless n is infinitely big, the relative rank provides a biased estimate of p, and of the median, minimum and maximum. Using p ≅ [r − 0.5]/n, instead of r/n, removes this bias for the median, but only partly corrects the more extreme p-values.
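The contrast between the roughly unbiased median and the biased minimum can be illustrated with a small simulation (our own sketch, assuming a uniform(0, 1) population purely for convenience):

```python
import random

random.seed(1)  # make the sketch reproducible
n_reps, n = 10_000, 5
medians, minima = [], []
for _ in range(n_reps):
    s = sorted(random.random() for _ in range(n))
    medians.append(s[2])   # middle rank of 5 = sample median
    minima.append(s[0])    # sample minimum

# The true median of a uniform(0, 1) population is 0.5; its true minimum is 0.
print(sum(medians) / n_reps)  # close to 0.5: roughly unbiased
print(sum(minima) / n_reps)   # close to 1/6, well above 0: a biased estimate
```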
Uses of quantiles
To provide summary statistics of location and spread.
Measures of location
If one is dealing with a variable measured on the ordinal rather than the measurement scale, the median provides a more appropriate measure of location of a distribution than the mean. It is also more appropriate for a measurement variable if the distribution of that variable is skewed. We deal with the median in more detail in the More Information page on Measures of location.
Measures of spread
A popular measure of spread, derived from quantiles, is the interquartile range. This comprises the middle, most typical, 50% of observations, enclosed by the first and third quartiles. The interquartile range is the best descriptive measure of variability if the distribution of the variable is not symmetrical. Unlike the standard deviation (see Unit 3), which assumes a symmetrical distribution, the interquartile range accurately reflects differing amounts of variability above and below the median. A common way of summarizing a frequency distribution is to give the five-quantile summary (also known as the five-number summary), namely the minimum, the lower quartile, the median, the upper quartile, and the maximum.
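A five-number summary, and from it the interquartile range, can be sketched using the rank formula introduced earlier (a minimal illustration; the function name is our own):

```python
def five_number_summary(values):
    """Minimum, lower quartile, median, upper quartile and maximum,
    using r = 1 + p[n − 1] with linear interpolation between ranks."""
    s = sorted(values)
    n = len(s)
    summary = []
    for p in (0.0, 0.25, 0.5, 0.75, 1.0):
        r = 1 + p * (n - 1)
        i, f = int(r), r - int(r)
        summary.append(s[i - 1] if f == 0 else (1 - f) * s[i - 1] + f * s[i])
    return summary

mn, lq, med, uq, mx = five_number_summary([0.1, 1.2, 9.3, 14, 55, 116])
print(mn, med, mx)   # minimum, median, maximum
print(uq - lq)       # interquartile range
```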
An alternative and very useful measure of spread is the reference interval, which comprises the 95% of observations enclosed by the quantiles that cut off 2.5% at each end of the distribution. In other words it comprises the 95% most typical observations. The interval must initially be defined using a large number of observations. Reference intervals are used extensively in clinical chemistry to define the 'normal' range of concentrations of substances found in body fluids. For example, the reference interval of total serum bilirubin concentration (a useful marker of liver and blood disorders) in healthy adults is 0.2 to 1.0 mg/dl. Reference intervals have also been recommended as descriptive statistics for cost data, on the basis that they are more useful than the full range, or 100% range (minimum to maximum), which is totally dependent on the two most extreme observations. Sometimes other ranges are used, such as the 90% range, which comprises the 90% most typical observations.
To investigate and compare the shape of frequency distributions
Box-and-whisker plots
Quantiles give some information about the shape of a distribution, in particular whether a distribution is skewed or not. For example, if the upper quartile is further from the median than the lower quartile, we can conclude that the distribution is skewed to the right, and vice versa. However, quantiles can only be used to provide such information if the distribution is unimodal.
Information about quantiles is best displayed using a box-and-whisker plot. The quantiles most commonly used in a box-and-whisker plot are:
• The median, usually represented as a small square or a wide horizontal line.
• The interquartile range, indicated by a box around the median. The upper and lower sides of the box indicate the upper and lower quartiles.
• The range: either the minimum and maximum; or the reference interval; or the 90% range; or the furthest observations that lie within 1.5 times the interquartile range. In the latter instance, observations outside that range are shown individually as outliers. A line (or whisker) is drawn from each side of the box to the range values.
Sometimes only the median and interquartile range are displayed without the range. This may be done using either a box (producing a boxplot) or the whiskers (producing a range plot).
Quantile scatterplots
We can use quantiles in a different way to display and compare cumulative frequency distributions. For a single distribution, the rank of each observation is converted to a relative rank, or a percentile, and then plotted against the value of its observation. Such plots are known as rank scatterplots or empirical cumulative distribution functions (ECDFs), and are described in the More Information page on Frequency distributions. They have several advantages over histograms: the data do not need to be arbitrarily grouped into classes, and values which demarcate a given proportion of the observations can be readily determined.
Quantilequantile plots
For comparing distributions, a quantile-quantile plot (also known as a Q-Q plot) is very useful (when plotted from two samples it is sometimes specified as an empirical Q-Q plot). Quantiles corresponding to the same ranks from the two distributions are plotted against each other, along with a 45-degree (x = y) reference line. If the distributions are similar in all respects, the points will fall along this line. If the distributions are of similar shape, but one is shifted to the right of the other, the points will fall along a line parallel to the reference line. If there is a multiplicative difference between the distributions, the points will fall along a straight line at an angle to the reference line. If the distributions differ in shape, the points will follow an irregular pattern in relation to the reference line. If the sample sizes are the same, this is straightforward; if not, the values of the appropriate quantiles must be interpolated, as described above.
Note: two-sample Q-Q plots are less popular than single-sample Q-Q plots, in which observed values are plotted against their typical or expected or 'theoretical' values, as described in Unit 3.
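For unequal sample sizes, the quantile pairs for an empirical Q-Q plot can be obtained by interpolating both samples at a common set of p-values. A sketch (the function names, and the default of 25 points, are our own):

```python
def qq_pairs(x, y, n_points=25):
    """Pair up quantiles of two samples for an empirical Q-Q plot.

    Both samples are interpolated at the same p-values, so the
    samples need not be the same size."""
    def quantile(s, p):  # pth quantile via r = 1 + p[n − 1] and interpolation
        r = 1 + p * (len(s) - 1)
        i, f = int(r), r - int(r)
        return s[i - 1] if f == 0 else (1 - f) * s[i - 1] + f * s[i]
    xs, ys = sorted(x), sorted(y)
    ps = [k / (n_points - 1) for k in range(n_points)]
    return [(quantile(xs, p), quantile(ys, p)) for p in ps]

# A purely multiplicative difference: the points fall on the line y = 2x,
# at an angle to the x = y reference line.
print(qq_pairs([1, 2, 3, 4, 5], [2, 4, 6, 8, 10], n_points=5))
```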
To divide data into equal size groups for further analysis
Quartiles and quintiles are commonly used to divide up observations when collapsing a measurement variable to an ordinal variable. This may be done to explanatory variables for some types of analysis (for example logistic regression).
Another reason for dividing observations into groups by quantiles is to define cohorts that can then be followed over time. For example, in a study of eating habits and obesity, a group of young children might be divided by quartiles on the basis of their consumption of fresh vegetables. Consumption of vegetables would then be measured in the same groups over a number of years to see if the groups maintained their differences (known as tracking) or instead converged to a common value.
To identify outliers
An outlier may be loosely defined as an extreme or outlying observation. A genuine outlier is an observation that has arisen due to measurement error, or is not a member of the population in question, for example if a male is inadvertently included in a sample of females. Observations outside a particular range of quantiles are sometimes regarded as outliers, which can be deleted from the data set. A common criterion is whether an observation lies more than 1.5 × the interquartile range beyond the quartiles. We look into this in more depth in Unit 2.
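The 1.5 × interquartile-range criterion can be sketched as follows (our own minimal version, reusing the rank-and-interpolation quantiles from earlier):

```python
def iqr_outliers(values):
    """Flag values more than 1.5 interquartile ranges beyond the quartiles,
    a common rule of thumb for marking outliers on box-and-whisker plots."""
    s = sorted(values)
    n = len(s)
    def quantile(p):  # pth quantile via r = 1 + p[n − 1] and interpolation
        r = 1 + p * (n - 1)
        i, f = int(r), r - int(r)
        return s[i - 1] if f == 0 else (1 - f) * s[i - 1] + f * s[i]
    lq, uq = quantile(0.25), quantile(0.75)
    fence = 1.5 * (uq - lq)
    return [v for v in values if v < lq - fence or v > uq + fence]

print(iqr_outliers([1, 2, 3, 4, 5, 6, 7, 100]))  # [100]
```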
To compare with population quantiles
Quantiles calculated from a sample are often used to estimate their equivalent population quantiles, as when the median of a sample (the 'sample median') is used to estimate the median of the 'population' of values that sample (hopefully) represents. Unit 3 shows how quantile-quantile plots can be used to compare the distribution of a sample with that of a theoretical population.
In more advanced studies, simulation models are used to calculate the same statistic from large numbers of random samples of the same population; this set of sample statistics is treated as if it were a sample of a much larger population of similarly-derived values. Given which, the quantiles of that sample of statistics may be used to estimate their population quantiles.
Statistical tests also employ quantiles to compare the value of an observed statistic with a theoretical test population of similarly-derived statistics. Unit 5 describes how, if an observed statistic falls outside the 95% most typical values, it can be classed as significantly different from its fellows.
Related topics: P-values of tests