"It has long been an axiom of mine that the little things are infinitely the most important"
Quantiles and their display
What is a quantile, and does it matter?
Many elementary stats books treat quantiles as of trivial or academic importance, and they are largely ignored by biologists. Perplexingly, while quantiles underlie much statistical exploration and analysis, and are remarkably useful and informative, they can be surprisingly controversial - at any rate ignoring them causes an awful lot of needless problems! To understand quantiles it is best to begin with a very simple, albeit imperfect, definition.
For example, consider these 4 numbers, arranged in ascending order:
Under the definition above, if p is 0.5, then half of the observations lie below a value - the 0.5th quantile - commonly known as the median. In this particular set the median lies somewhere between the second and third ranked values (1.5 and 3.7).
Notice that, provided we are dealing with just 4 values, the 0.5th quantile would lie between the second and third ranked values - irrespective of what their values happen to be! In other words, quantiles describe the relative rank of a value within its set.
Although p can be any value from zero to one, elementary textbooks often define quantiles as locations which divide a set of numbers into equal (often predefined) portions.
The pth quantile
The pth quantile is that value which demarcates a given proportion of a set of values.
Once again, consider this set of 4 values:
Notice that, where predefined quantiles fall between the values of a set, you have to estimate their exact value by interpolation or by rounding. This is because quantiles are defined in terms of the rank of each value in its set.
For example, consider this set of three values of variable Y: 101, 122, 303. The middle ranking value, 122, is the median, or
The rth quantile
The rth quantile is the rth value of a set of values.
Whilst, in principle, you could divide an ordered set into as many equal groups as you might wish, in practice the maximum number is usually limited to the number of values to be divided up. In other words, if you have n values, you could divide that set into n quantiles.
For example, y(1), the minimum, is the 0th order statistic - below which
Converting a value of p (or a
Quantiles as a relative rank
As a result, a quantile is often defined as the value which has the pth relative rank within a 'population' - where the population is assumed to be an infinitely large set of different values. Implicitly this defines p as the proportion of a set whose ranks are less than or equal to the pth quantile.
Of course, in real data the number of values in a set is finite, and often quite small. In addition, each observation can only have its own, unique, p-value if every observation in the set is different (termed unique, or un-tied).
In practice therefore, this approximation is biased - especially for small sets, and very especially for extreme quantiles. As you might imagine, this has some important consequences. Even if we ignore the difficulties of estimating extreme quantiles of a superset, from a representative subset, one problem introduced by an imperfect definition is easy enough to demonstrate.
For example, if you want to find the 0.5th quantile (the median) of a set of
Provided n is large, the errors in using
To make this definition work properly we have to, in effect, reduce the number of items in the set by one. In other words, instead of defining
Nevertheless, even for purely descriptive work, this formula is not heavily used - and most quantiles are calculated from their relative rank, or some variant thereof.
Given the fact that a quantile can be defined in several
The rank and quantile of any observation can be simply and uniquely inter-converted using:
To see how these two formulae work for a variety of quantiles, let us first assume you are only interested in a finite number, n, of values. Again, entirely for the sake of convenience, let us describe this collection of values as "observations of variable Y".
Provided that no two observations are the same, in other words Y is un-tied, each value of Y will have a unique non-arbitrary rank.
For example, we can calculate a p-value for each value within this ordered set of
Provided each value of Y was different, we would obtain identical p-values for any 6 observations.
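The conversion between ranks and p-values can be sketched in Python. This is a minimal illustration, assuming the conversion p = (r − 1)/(n − 1) and its inverse r = p(n − 1) + 1. That form is inferred from the idea of reducing the number of items by one and from the rank of 2.25 quoted for the lower quartile of 6 observations. The function names and both sets of 6 values are invented for the example.

```python
def rank_to_p(r, n):
    """Relative rank p of the r-th ranked value among n un-tied values.

    Assumed formula: p = (r - 1) / (n - 1), with ranks starting at 1.
    """
    return (r - 1) / (n - 1)


def p_to_rank(p, n):
    """Inverse conversion: the (possibly fractional) rank matching a given p."""
    return p * (n - 1) + 1


def p_values(values):
    """Pair each value of an un-tied set with its p-value, in ascending order."""
    ordered = sorted(values)
    n = len(ordered)
    return [(y, rank_to_p(r, n)) for r, y in enumerate(ordered, start=1)]


# Two hypothetical sets of 6 different observations. The p-values depend
# only on the ranks, so both sets give the same p-values.
set_a = [3.1, 0.4, 7.7, 5.0, 1.2, 9.9]
set_b = [120, 15, 16, 300, 18, 17]
ps_a = [p for _, p in p_values(set_a)]
ps_b = [p for _, p in p_values(set_b)]
print(ps_a)
print(ps_a == ps_b)
```

Under this assumption the minimum always receives p = 0 and the maximum p = 1, whatever the observed values happen to be.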
Using the same 6 observations, we can estimate some quantiles with preassigned p-values: the minimum, lower quartile, median, upper quartile, and maximum.
Since in this case, three of our quantiles have fractional ranks, we cannot exactly 'detransform' these ranks to values
Although there are a variety of ways of rounding to a whole number of ranks, in general, the best way to estimate y(r) is to interpolate - either graphically, or using this formula:
For example, the lower quartile of these 6 observations has a rank of 2.25, or about a quarter of the way between
Below you can see the graphical equivalent of this. Horizontal lines show the predetermined p-values, and their corresponding ranks, vertical lines show the corresponding quantiles. Dashed lines indicate quantiles obtained by linear interpolation.
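The interpolation can also be sketched numerically. This example assumes ordinary linear interpolation between the two neighbouring ranked values, with the fractional rank obtained as r = p(n − 1) + 1 (the form that gives a rank of 2.25 for the lower quartile of 6 observations). The function name and the 6 observations are hypothetical, chosen only to keep the arithmetic easy to follow.

```python
import math


def quantile(values, p):
    """Estimate the p-th quantile by linear interpolation between ranks.

    Assumes the fractional rank is r = p * (n - 1) + 1 (ranks start at 1).
    """
    ordered = sorted(values)
    n = len(ordered)
    r = p * (n - 1) + 1
    lower = math.floor(r)          # rank at or just below r
    fraction = r - lower           # how far r lies towards the next rank
    if fraction == 0:
        return ordered[lower - 1]  # whole-number rank: no interpolation needed
    y_low, y_high = ordered[lower - 1], ordered[lower]
    return y_low + fraction * (y_high - y_low)


# Hypothetical set of 6 un-tied observations.
y = [2.0, 4.0, 8.0, 10.0, 16.0, 20.0]
for label, p in [("minimum", 0.0), ("lower quartile", 0.25), ("median", 0.5),
                 ("upper quartile", 0.75), ("maximum", 1.0)]:
    print(label, quantile(y, p))
```

Here the lower quartile has rank 2.25, so its estimate lies a quarter of the way from the second value (4.0) to the third (8.0), which is 5.0.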
When one or more values in a set are identical the situation is different: whilst each rank can still be associated with a different value of p, more than one rank shares the same value of Y. So, although you can obtain quantiles in the same way as above, some of them may have identical values.
For example, the scatterplot below shows how extensive ties affect the minimum, lower quartile, median, upper quartile, and maximum. In this case, while none of these predefined quantiles have fractional nominal ranks, the minimum, lower quartile and median have identical values.
If, for any reason, you are particularly interested in working out a p-value for a particular item of a set, you must bear in mind the fact that its sequential rank amidst identical fellows is inherently arbitrary. So it is hard to defend calculating a p-value using the highest possible rank that item could have.
One solution to this is to use, not the highest sequential rank, but the mean rank. For instance, given that 3 of these five values are identical: 0.1, 110.2, 110.2, 110.2, 171.5, instead of arbitrarily allocating the tied values different ranks (1, 2, 3, 4, 5), we can give them all the same mean ranking (1, 3, 3, 3, 5).
P-values calculated on the basis of mean rank are known as 'mid-p-values'. 'Conventional p-values' therefore tend to be rather higher than mid-p-values. Thus, in the set above, the conventional p-value (using [r − 0.5]/n, with the highest rank, 4) for the value 110.2 would be 3.5/5 or 0.7 - whereas, using the same conversion with the mean rank, 3, its mid-p-value is 2.5/5 or 0.5.
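A short sketch of the two calculations, using the five values above and the [r − 0.5]/n conversion quoted in the text. The function name is invented for illustration, and the code assumes that the 'conventional' p-value takes the highest sequential rank within a group of tied values.

```python
def tied_p_values(values):
    """Return {value: (conventional_p, mid_p)} using p = (r - 0.5) / n.

    conventional_p uses the highest sequential rank in each tied group;
    mid_p uses the mean rank of that group.
    """
    ordered = sorted(values)
    n = len(ordered)
    result = {}
    for y in set(ordered):
        ranks = [r for r, v in enumerate(ordered, start=1) if v == y]
        highest = max(ranks)
        mean_rank = sum(ranks) / len(ranks)
        result[y] = ((highest - 0.5) / n, (mean_rank - 0.5) / n)
    return result


data = [0.1, 110.2, 110.2, 110.2, 171.5]
conventional, mid = tied_p_values(data)[110.2]
print(conventional, mid)
```

For an un-tied value, such as 0.1 or 171.5, the highest rank and the mean rank coincide, so the two p-values are the same.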
As estimates of supersets
All too often the reason you are interested in the quantiles, or
Uses of quantiles
Measures of location
If one is dealing with a variable measured on the ordinal rather than the measurement scale, the median provides a more appropriate measure of location of a distribution than the mean. It is also more appropriate for a measurement variable if the distribution of that variable is skewed. We deal with the median in more detail in the More Information page on Measures of
A popular measure of spread, derived from quantiles, is the interquartile range. This comprises the middle, most typical, 50% of observations - enclosed by the first and third quartiles. The interquartile range is the best descriptive measure of variability if the distribution of the variable is not symmetrical. Unlike the standard deviation (see
An alternate and very useful measure of spread is the reference interval - which comprises the 95% of observations, enclosed by the quantiles that cut off 2.5% at each end of the distribution. In other words it comprises the 95% most typical observations. The interval must initially be defined using a large number of observations. Reference intervals are used extensively in clinical chemistry to define the 'normal' range of concentrations of substances found in body fluids. For example, the reference interval of total serum bilirubin concentration (a useful marker of liver and blood disorders) in healthy adults is 0.2 to 1.0 mg/dl. Reference intervals have also been recommended as descriptive statistics for cost data, on the basis that they are more useful than the full range, or 100% range, (minimum to maximum) - which is totally dependent on the two most extreme observations. Sometimes other ranges are used such as the 90% range which comprises the 90% most typical observations.
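Both measures of spread can be sketched with Python's standard `statistics.quantiles` function. This is only an illustration: the `method="inclusive"` setting (linear interpolation between ranked values) is one of several possible estimators, the helper names are invented, and the 101 evenly spaced values are artificial data, far fewer and far tidier than a real reference sample.

```python
import statistics


def interquartile_range(values):
    """Return (first quartile, third quartile) of the values."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q1, q3


def reference_interval(values):
    """Return the quantiles cutting off 2.5% at each end (the 95% range)."""
    # n=40 gives cut points every 2.5%; the first and last are the
    # 0.025 and 0.975 quantiles.
    cuts = statistics.quantiles(values, n=40, method="inclusive")
    return cuts[0], cuts[-1]


# Artificial data: the whole numbers 1 to 101.
values = list(range(1, 102))
print(interquartile_range(values))
print(reference_interval(values))
```

A 90% range could be obtained the same way by using `n=20` and again taking the first and last cut points.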
Quartiles and quintiles are commonly used to divide up observations when collapsing a variable from a measurement variable to an ordinal variable. This may be done to explanatory variables for some types of analysis (for example logistic regression).
Another reason for dividing observations into groups by quantiles is to define cohorts that can then be followed over time. For example, in a study of eating habits and obesity a group of young children might be divided by quartiles on the basis of their consumption of fresh vegetables. Consumption of vegetables would then be measured in the same groups over a number of years to see if the groups maintained their differences - known as tracking - or instead converged to a common value.
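Dividing observations into quartile groups might be sketched as follows. The function name and the vegetable-portion figures are hypothetical, and the quartile cut points are estimated with `statistics.quantiles` using `method="inclusive"`, which is just one of several reasonable choices.

```python
import bisect
import statistics


def quartile_groups(values):
    """Assign each value to a quartile group, 1 (lowest) to 4 (highest)."""
    cuts = statistics.quantiles(values, n=4, method="inclusive")
    # bisect_left counts the cut points lying strictly below the value,
    # so a value exactly equal to a cut point falls in the lower group.
    return [bisect.bisect_left(cuts, v) + 1 for v in values]


# Hypothetical weekly portions of fresh vegetables for 8 children.
portions = [3, 12, 7, 20, 1, 15, 9, 5]
groups = quartile_groups(portions)
print(groups)
```

The group labels form an ordinal variable that could be used as an explanatory variable, or to define the cohorts followed in a tracking study.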
An outlier may be loosely defined as an extreme or outlying observation. A genuine outlier is an observation that has arisen due to measurement error, or is not a member of the population in question - for example if a male is inadvertently included in a sample of females. Observations outside a particular range of quantiles are sometimes regarded as outliers, which can be deleted from the data set. A common criterion used for an outlier is whether it lies more than 1.5 × the interquartile range below the first quartile or above the third quartile. We look into this in more depth in
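A minimal sketch of the interquartile-range criterion, assuming it is applied in its usual form: an observation is flagged if it lies more than 1.5 × the interquartile range below the first quartile or above the third quartile. The function name and data are hypothetical, and the quartiles are estimated with `statistics.quantiles` using `method="inclusive"`.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR beyond the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in values if v < low or v > high]


# Hypothetical observations with one extreme value.
data = [10, 12, 13, 14, 15, 16, 18, 19, 45]
flagged = iqr_outliers(data)
print(flagged)
```

The code only flags candidate observations. Whether a flagged value is a genuine outlier that should be deleted remains a matter of judgement about how it arose.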
Quantiles calculated from a sample are often used to estimate their equivalent population quantiles - as when the median of a sample (or 'sample median') is used to estimate the median of the 'population' of values that sample (hopefully) represents.
In more advanced studies, simulation models are used to calculate the same statistic from large numbers of random samples of the same population. This set of sample statistics is treated as if it were a sample from a much larger population of similarly-derived values, so the quantiles of that sample of statistics may be used to estimate their population quantiles.
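The simulation idea can be sketched as follows. Everything here is an illustrative assumption: the artificial population of 1000 values, the sample size of 25, the 2000 simulated samples, the choice of the median as the statistic, and the function name. The quantiles of the simulated medians are again estimated with `statistics.quantiles`.

```python
import random
import statistics


def simulated_medians(population, sample_size, n_samples, seed=0):
    """Draw repeated random samples and return the median of each one."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    return [statistics.median(rng.sample(population, sample_size))
            for _ in range(n_samples)]


# Artificial population: the whole numbers 1 to 1000.
population = list(range(1, 1001))
medians = simulated_medians(population, sample_size=25, n_samples=2000)

# The 0.025 and 0.975 quantiles of the simulated medians enclose the
# central 95% of those sample statistics.
cuts = statistics.quantiles(medians, n=40, method="inclusive")
lower, upper = cuts[0], cuts[-1]
print(lower, upper)
```

The resulting interval describes how much the sample median varies from sample to sample for this population and sample size.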
Statistical tests also employ quantiles to compare the value of an observed statistic with a theoretical test population of similarly-derived statistics.