"It has long been an axiom of mine that the little things are infinitely the most important" |
|
Using R to display distributions (Introducing some spanners)

On this page: Histograms / bargraphs, Univariate scatterplots, Conventional dotplots, Jittered dotplots, Gaussian-smoothed & jittered, Simple rank scatterplots, Cumulative rank scatterplots, P-value plots, Frequency of ties, Q-Q plots, Normal quantile plots, Rankits and deviates thereof
Histograms / bargraphs

Unless you are trying to show that data do not differ 'significantly' from 'normal' (e.g. using Lilliefors test), most people find the best way to explore data is some sort of graph. Yet, whilst there are many ways to graph frequency distributions, very few are in common use. Journalists (for reasons of their own) usually prefer pie-graphs, whereas scientists and high-school students conventionally use histograms (or bar-graphs). Curiously, while statisticians condemn pie-graphs as misleading if not wholly inappropriate, they seldom criticise histograms - at any rate histograms appear in virtually every introductory statistics text, and in many advanced ones.
If you assume R's default settings are liable to be the most reasonable in most circumstances, plotting a histogram is almost childishly simple. But, when inspecting a histogram, do remember that genuinely normal values are smoothly distributed. The following code instructs R to randomly select a large sample of (n=1000000) values from a standard normal population and put ('assign') those values in a variable called 'y', then plot a histogram thereof.
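A minimal sketch of that step in base R might look like this (only the variable name 'y' comes from the description above):

```r
# Draw n = 1000000 values from a standard normal population, assign them
# to 'y', then plot a histogram using R's default breakpoints.
y <- rnorm(1000000)
hist(y)
```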
Histograms perform tolerably well when 'sensibly' applied to very large samples of 'normal' data, but perform very poorly on small samples and/or markedly non-normal data.
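As a hedged illustration of this, the sketch below draws one small sample and plots it with two different sets of class-intervals; the sample size, the breakpoints and the side-by-side layout are our own choices, not the page's:

```r
# One small sample (n = 20) from a standard normal population, shown as two
# histograms with different numbers of class-intervals (hist() treats a
# numeric 'breaks' value as a suggestion). Both the small n and the choice
# of breakpoints change the picture markedly from run to run.
y.small <- rnorm(20)
oldpar <- par(mfrow = c(1, 2))   # two plots side by side
hist(y.small, breaks = 4)
hist(y.small, breaks = 12)
par(oldpar)                      # restore the previous layout
```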
Of course, if a sample distribution's fine-structure is solely due to simple random variation, smoothing this out can give a more realistic picture of the population that sample represents. But that is only true if the smoothing function is appropriate! Which is one reason why histograms can be astonishingly misleading when their breakpoints are poorly (or unluckily) chosen. Let us therefore consider some other ways of graphically displaying how values are distributed which do not require class-intervals. Surprisingly, the rank-based nonparametric viewpoint has much to offer in exploring distributions - even if you merely want to see whether re-scaling (transforming) your data has made its errors roughly normal.
Some of the most useful procedures were devised for sampling distributions, or as extensions of confidence intervals, but are seldom applied to the actual data.

Univariate scatterplots

For a variety of reasons, univariate scatterplots (rugplots) are the simplest way to compare how sets of values are distributed - yet they are surprisingly rarely used, even in elementary stats texts. By convention rugplots are plotted along one or more of a graph's axes, often the x-axis, hence their name (imagine a rug viewed edge-on) - but this need not be so. So, whilst R's rug() function allows you to add a rugplot to an existing plot, the following code takes a sample of n observations from a defined population (Y), and plots them as a simple (vertical) rugplot.
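For instance, a minimal sketch along those lines (assuming, for illustration, a normal population and n = 20; the page's own population and orientation may differ) is:

```r
# Take n observations from an assumed normal population Y and plot them as a
# simple rugplot: vertical dashes along a single axis.
n <- 20
Y <- rnorm(n)
plot(Y, rep(0, n), pch = "|", xlab = "Y", ylab = "", yaxt = "n")
```

To add the same display along the axis of an existing plot you could simply call rug(Y).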
The following code requests R to take three random samples of different 'theoretical' populations, then to plot them as 'rugplots' up the middle of a graph. This time we have used dashes to minimize point overlap.
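A sketch of that idea, with three assumed populations (normal, uniform and exponential) standing in for whatever 'theoretical' populations were used originally:

```r
# Three random samples plotted as parallel rugplots up the middle of one graph;
# horizontal dashes (pch = "-") reduce the overlap of similar values.
y1 <- rnorm(50)    # normal
y2 <- runif(50)    # uniform
y3 <- rexp(50)     # exponential (skewed)
plot(c(0.5, 3.5), range(c(y1, y2, y3)), type = "n",
     xaxt = "n", xlab = "", ylab = "Value")
points(rep(1, 50), y1, pch = "-")
points(rep(2, 50), y2, pch = "-")
points(rep(3, 50), y3, pch = "-")
axis(1, at = 1:3, labels = c("y1", "y2", "y3"))
```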
Conventional dotplots

Dotplots, traditionally drawn with graph paper and pen, used to be a popular way to display distributions of small, heavily-tied sets of values. The R code below assigns some values to a variable (y), then plots a conventional dotplot, with duplicate values arranged evenly above and below.
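A sketch of such a plot, using assumed example values for y (the spacing rule below is one simple way of spreading duplicates evenly above and below the axis):

```r
# Assign some heavily-tied values to y, then plot a conventional dotplot with
# duplicate values spread evenly above and below the axis.
y <- c(1, 2, 2, 3, 3, 3, 4, 4, 4, 4, 5, 5, 6)
offsets <- unlist(lapply(table(y), function(f) seq_len(f) - (f + 1) / 2))
plot(sort(y), offsets, pch = 16, ylim = c(-3, 3),
     xlab = "y", ylab = "", yaxt = "n")
abline(h = 0, lty = 3)
```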
The conventional way to go about this task was to instruct some unfortunate technician to plot the values initially as a rugplot, adding tied values alternately above and below their fellows. Provided each value had an odd number of ties the graph would be symmetrical about the x-axis; otherwise the result was arbitrarily asymmetric - and, for large sets of values, a tedious, untidy, and unsatisfactory affair. Conventional dotplots display tied values one above the other. They are also known as univariate scatterplots, dot histograms or histogram-type dotplots, or (along with jittered dotplots) as density plots. Another form of dot histogram displays tied values by plotting the frequency of ties of each value of y against the value of y, as described below.

Jittered dotplots

One advantage of these very simple (univariate) plots is that even an untrained eye can readily interpret the differing densities of values - until, that is, the points overlap. Therefore, whilst rugplots have an attractive simplicity for very small sets of values, they do not cope well with high densities of similar values, or with 'tied', 'discrete' data. Years before computers were available, a popular way around those constraints was to plot the values as dotplots using ordinary graph paper and, if there were duplicate or very similar values, to add them (more-or-less evenly) either side of those already plotted - so the wider the dotplot, the denser the values around that point. Statisticians have criticised this method, partly because the rules for adding points were not standard, but especially because it has a habit of introducing unwanted, arbitrary, and sometimes misleading patterning. Now that computers are ubiquitous, and we have good pseudo-random-number algorithms, an increasingly popular way to separate similar values without introducing bias is to add uniformly-distributed random variation orthogonally (at right angles) to the observed values. Since adding random variation is known as 'jittering', these are commonly known as jittered dotplots, or jittered scatterplots. The R code for displaying a single sample as a jittered dotplot is gloriously simple. The following code displays the sample obtained above.
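One such gloriously simple route is stripchart()'s built-in jitter method; the amount of jitter used below is an arbitrary choice:

```r
# A jittered dotplot of the sample y assigned above: uniform random noise is
# added at right angles to the observed values to separate ties.
stripchart(y, method = "jitter", jitter = 0.2, pch = 16)
```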
Smoothing distributions

Applying repeated random variation to the observed values themselves, then averaging the result, smooths their distribution. But, if the result is not to be misleading, this smoothing requires you to select a suitably-distributed error distribution, e.g. normal. The following code instructs R to apply Gaussian (normal) smoothing to the values in variable y3, and plot their mean probability density. We have added the original values as a rugplot.
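A sketch of that step, assuming y3 is the skewed (exponential) sample from the earlier sketch; density() uses a Gaussian kernel by default:

```r
# Gaussian (normal) smoothing of the values in y3: plot their estimated
# probability density, then add the original values as a rugplot.
y3 <- rexp(50)
plot(density(y3, kernel = "gaussian"), main = "", xlab = "y3")
rug(y3)
```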
The computational advantage of using a theoretical, mathematically-defined smoothing function is that you do not have to repeatedly jitter all your sample values, then examine how the jittered values are distributed. The advantage to users is that smoothed plots resemble the theoretical plots and histograms in textbooks. Since the form and degree of smoothing are unavoidably arbitrary, every smoothing function risks introducing bias and artefacts - even if that smoothing function is a simple running mean, or class-intervals (as used in histograms). Jittered scatterplots do not introduce bias on average, but when jittering is applied to an individual sample the human eye smooths the distribution to a random, hence uneven, degree. The following code instructs R to produce jittered scatterplots of the 3 samples above.
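Assuming y1, y2 and y3 still hold the three samples from the earlier sketch, one way to show them side by side is:

```r
# Jittered dotplots of the three samples on one set of axes; stripchart()
# accepts a list of samples and jitters each one separately.
stripchart(list(y1 = y1, y2 = y2, y3 = y3),
           method = "jitter", jitter = 0.2, pch = 16, vertical = TRUE)
```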
One important limitation of rugplots, jittered dotplots and their ilk is that they tend to obscure any fine structure within a sample distribution, such as tied values, or patterns within very similar values. Ironically, whilst many nonparametric statistics collapse data to ranks, rank-based methods avoid the problems inherent to class-intervals, and can retain all that fine structure for examination.

Simple rank scatterplots

Arguably the simplest rank-based graphical technique is a scatterplot of rank on value. For instance, the following code instructs R to randomly select (n=) 30 values from a defined population distribution, and show the result as a scatterplot of rank on value.
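A sketch of that plot, assuming a standard normal population; sorting the values and plotting them against 1 to n keeps any ties visible:

```r
# Randomly select n = 30 values from an assumed normal population and show
# them as a scatterplot of rank on value.
n <- 30
y <- rnorm(n)
plot(sort(y), 1:n, xlab = "Value", ylab = "Rank")
```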
Remember, with all the plots on this page, you are unlikely to get precisely (or sometimes even approximately) the same result as us, because the values are selected at random! You can get the same result using these instructions:
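Fixing the random number generator's seed makes a 'random' sample, and hence its plot, repeatable; the seed value used below (1234) is an arbitrary choice, not the page's own:

```r
# Setting the seed first makes the sample, and hence the plot, reproducible.
set.seed(1234)
y <- rnorm(30)
plot(sort(y), 1:30, xlab = "Value", ylab = "Rank")
```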
Notice that, because R's rank() function assumes you want the mean (average, or expected) rank of tied values, the following code would lose some of that valuable information - unless the data lack ties (so every value is unique), which often happens in small samples, even of highly-discrete populations.
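For instance, using the sample y obtained above:

```r
# rank() gives tied values their mean rank, so ties plot on top of one another
# and the tie structure disappears from the scatterplot.
plot(y, rank(y), xlab = "Value", ylab = "Rank")
```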
Notice this plot tends to obscure how ties within the data are distributed.

Cumulative rank scatterplots

If you wish to compare several samples containing unequal numbers of values it helps to standardize the ranks - most simply by converting to relative rank - as in this example:
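A sketch with two samples of unequal size (both populations and sizes are assumptions made for illustration):

```r
# Convert each sample's ranks to relative ranks (rank / n) so the two
# cumulative rank scatterplots share a common 0 to 1 scale.
y1 <- rnorm(30)
y2 <- rnorm(100, mean = 1)
plot(sort(y1), (1:30) / 30, xlim = range(c(y1, y2)),
     xlab = "Value", ylab = "Relative rank")
points(sort(y2), (1:100) / 100, pch = 3)
```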
These plots are also known as empirical cumulative distribution functions (ECDFs) and, to emphasize the fact they are unavoidably discrete, they are often plotted as stepplots. Plotting them as lineplots smooths the distribution to the eye, and makes them easier to compare, but implicitly assumes intermediate values could realistically be observed. If you want to use R's ECDF function, ecdf(), you can plot its result with plot(ecdf(y)). Theoretical statisticians might also point out that an ECDF provides a maximum-likelihood estimate (MLE) of the population's cumulative distribution function (CDF) - and note that many MLEs are biased. In more everyday terms, these plots are cumulative distributions. Unfortunately, owing to the way statistics are taught in schools, the histogram holds powerful sway, and most people find cumulative distributions comparatively hard to interpret.

P-value plots

One reason cumulative distributions are unpopular is that people find it hard to judge their location, dispersion, or skew. A simple way to address these issues is to convert values of p above 0.5 to 1 minus p - in other words, to reflect the upper tail downwards. However, since relative rank is confined to 1/n <= r/n <= 1 it is inherently asymmetric, and p = (r - 0.5)/n is a less biased measure of relative rank. The following code takes 3 samples in the same way as immediately above, then presents them as p-value lineplots - to aid comparison, a vertical line shows each sample median.
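A sketch of such a display, with three assumed samples standing in for the originals:

```r
# P-value lineplots: relative rank is p = (r - 0.5)/n, values of p above 0.5
# are reflected (1 - p), and a dotted vertical line marks each sample median.
pplot <- function(y, ...) {
  n <- length(y)
  p <- (rank(y) - 0.5) / n
  p <- ifelse(p > 0.5, 1 - p, p)
  o <- order(y)
  lines(y[o], p[o], ...)
  abline(v = median(y), lty = 3)
}
y1 <- rnorm(50)
y2 <- rnorm(50, mean = 1)
y3 <- rexp(50)
plot(range(c(y1, y2, y3)), c(0, 0.5), type = "n", xlab = "Value", ylab = "p")
pplot(y1)
pplot(y2, col = "blue")
pplot(y3, col = "red")
```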
One disadvantage of p-value plots is that, since they are seldom used, they confuse the uninitiated - including otherwise-sensible statisticians.
Frequency of ties

One advantage of rank scatterplots is that, being cumulative, they are less affected by fine structure than rank-frequency plots - the larger your sample, the less variable its cumulative distribution. Hence the peak of each p-value plot (the median, where p = 0.5) is a more reliable measure of location than a histogram's mode. The following code instructs R to plot the relative frequency of each value of y1, calculated from its rank. Bars indicate the frequency with which each value is tied, plus 1.
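A sketch of that plot; the rounded sample below is an assumption (rounding guarantees some ties), and the frequency of each value is recovered from its ranks:

```r
# For each value of y1, the number of observations sharing it (ties + 1) is
# its maximum rank minus its minimum rank, plus 1; plot the relative
# frequency of each value as a vertical bar.
y1 <- round(rnorm(100), 1)
f <- rank(y1, ties.method = "max") - rank(y1, ties.method = "min") + 1
plot(y1, f / length(y1), type = "h", xlab = "y1", ylab = "Relative frequency")
```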
Of course, because a normal population contains an infinite number of different values, the probability of selecting two identical normal values by chance approaches zero - unless the values are subject to rounding. Without rounding, the frequency distribution (for example, see above) is polymodal: every mode has the same height (f = 1), and the result is equivalent to a univariate plot or rugplot. Rounded large samples produce histogram-like results - but, if the rounding is uneven, such plots are misleading. Remember, given sufficiently many class-intervals, a histogram will also have up to n modes unless values are tied - in which case the result is equivalent to a bargraph. At the opposite extreme, most people assume straight lines must be relatively easy to appraise. Which may explain why quantile-quantile plots (QQ plots) are a relatively popular way to compare two distributions.

Q-Q plots

The following instructions are a simple and transparent way to compare two samples of equal size:
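For example, with two assumed samples of n = 30 each:

```r
# Sort each sample and plot one set of ordered values against the other - a
# transparent, do-it-yourself two-sample QQ plot.
y1 <- rnorm(30)
y2 <- rnorm(30)
plot(sort(y1), sort(y2),
     xlab = "Ordered values of y1", ylab = "Ordered values of y2")
abline(0, 1, lty = 3)   # line of equality, for reference
```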
A few moments' thought reveals that, if both samples are of the same population, we would expect a QQ plot's paired points to be, on average, identical. Whereas, if the values are selected at random, this will seldom occur - unless the population is extremely small indeed. If you compare two large samples from infinite normal populations, you commonly find the values are very similar around the plot's center, but differ at its extremes. If your samples are of unequal size, R's qqplot() function can use interpolated values from the larger sample. So if y1 has 3000 values and y2 has 3 values, qqplot() only produces 3 points.

Normal quantile plots

Because two-sample QQ plots are comparatively rare, most people assume QQ plots are only used to see whether a set of values deviates from its expected ('theoretical') normal values. This type of plot is more correctly termed a normal quantile plot, for example as follows:
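For instance, using R's qqnorm() function on an assumed normal sample, with qqline() adding a reference line through the quartiles:

```r
# A normal quantile plot: observed values against their expected normal quantiles.
y <- rnorm(30)
qqnorm(y)
qqline(y)
```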
Two further alternatives

If you prefer to use R's normal quantile function, it is called qnorm(). The following code applies R's normal quantile function to the expected values of n = 15 normal observations, which we estimate from (R =) 50000 random samples (each of n = 15 values) from a normal population - otherwise known as ranked normal deviates, or rankits.
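A sketch of that simulation under the assumptions stated above (the exact procedure used originally may differ):

```r
# Estimate rankits by simulation: draw R = 50000 samples of n = 15 standard
# normal values, sort each sample, and average each order statistic; then
# compare the result with the quantiles given by qnorm().
R <- 50000
n <- 15
sims <- matrix(rnorm(R * n), nrow = R)
rankits <- colMeans(t(apply(sims, 1, sort)))   # mean of each order statistic
p <- ((1:n) - 0.5) / n                         # relative ranks
plot(qnorm(p), rankits,
     xlab = "qnorm() quantiles", ylab = "Estimated rankits (simulated)")
abline(0, 1, lty = 3)
```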
Lest you assume theoretical quantiles estimated via simulation (such as rankits) have no advantage over theoretical quantiles obtained from an inverse probability function, let us compare them a little more carefully. The following code asks R to plot the difference between the (estimated) expected values and their theoretical quantiles, against those quantiles (in this case obtained from R's normal quantile function). Plotting the deviations from expected against their observed values is much more sensitive than a simple QQ plot - so it can reveal systematic differences between two otherwise similar distributions.
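Continuing from the simulation sketch above (it assumes 'rankits' and 'n' are still defined):

```r
# Plot the deviation of the simulated rankits from the theoretical quantiles,
# against those theoretical quantiles.
theoretical <- qnorm(((1:n) - 0.5) / n)
plot(theoretical, rankits - theoretical,
     xlab = "Theoretical quantile", ylab = "Rankit - theoretical quantile")
abline(h = 0, lty = 3)
```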
See also

More examples of R code for displaying frequency distributions: drawing a histogram, a frequency polygon, a stem and leaf plot, a jittered dot plot, rank scatterplots, the frequency of each value, the empirical cumulative distribution function (ECDF), a P-value plot, multiple P-value plots, and a smoothed distribution function.

Elementary Statistics (pages of information about basic statistics - for struggling students - and their teachers): Displaying frequency distributions + Means, medians, modes + Types of variables + Variance and Standard Deviation + Standard error of the mean + The Normal distribution + Relationships between variables + Quantiles (median, range, interquartile and 90% range) + Statistics for beginners (using R) + Extras (where values are not normal)