InfluentialPoints.com
Biology, images, analysis, design...
"It has long been an axiom of mine that the little things are infinitely the most important" (Sherlock Holmes)

 

 

 
    Since 'throwing a spanner into the works' has bad connotations, let us begin with the most popular, normal, conventional (if blunt) tool.

Histograms / bargraphs

Unless you are trying to show that data do not 'significantly' differ from 'normal', most people find the best way to explore data is some sort of graph. Yet whilst many graphical techniques exist, very few are in common use. Journalists (for reasons of their own) usually prefer pie-graphs, whereas scientists and high-school children conventionally use histograms (or bar-graphs). Curiously, while statisticians condemn pie-graphs as misleading if not wholly inappropriate, they seldom criticise histograms - at any rate histograms appear in virtually every introductory statistics text, and many advanced ones.

    When asked to examine a distribution most people assume they are merely being asked to look at a histogram, which seldom stirs much enthusiasm - before or after performing a statistical analysis. One justification (noted elsewhere) is that publishers are reluctant to 'waste' page space upon qualitative and basic exploration. A more practical reason is that histograms work well when applied to very large sets of normal values, but are not a good way to examine small sets of values, or especially non-normal data. This is partly because, whilst grouping values into class-intervals smooths their distribution to some extent, that smoothing is wholly arbitrary. When applied to values which are highly skewed, highly polymodal, or highly discrete, the outcome depends wholly upon your choice of breakpoints (even if you are unaware of making that choice).

If you assume R's default settings are liable to be the most reasonable in most circumstances, plotting a histogram is almost childishly simple. But, when inspecting a histogram, do remember that genuinely normal values are smoothly distributed.

The following code instructs R to randomly select a large sample of (n=1000000) values from a standard normal population and put ('assign') those values in a variable called 'y', then plot a histogram thereof.
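A minimal sketch of those instructions might look like this. The fixed seed and the name h are our assumptions, added only so the result is repeatable and can be inspected.

```r
# Draw a large sample from a standard normal population, then plot its histogram
set.seed(1)          # assumption: a fixed seed, so the sketch is repeatable
n = 1000000          # sample size, as stated in the text
y = rnorm(n)         # n values from a normal population with mean 0 and sd 1
h = hist(y)          # histogram using R's default class-intervals
```

The object returned by hist (here called h) holds the breakpoints and counts R chose, should you want to see what the defaults did.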

    Note:

  • because our intention is not to provide a software library, but to illustrate principles and promote thought, we only provide the most minimal R-code here.

  • In the interests of clarity, we annotated our graphs using a simple image editor (MS PCpaint).

  • For those new to R, text to the right of a hashmark is for your information, not R's.

  • R purists may be horrified that we often assign values to variables using = rather than <-

 

Histograms perform tolerably well when 'sensibly' applied to very large samples of 'normal' data, but very poorly when obtained from small samples and/or particularly non-normal data.

    There are two obvious reasons for that:

  1. The choice of class intervals is almost always arbitrary, hence prone to artefacts and bias.

  2. Collapsing data to class-intervals, equally arbitrarily, discards fine-structure information.

Of course, if a sample distribution's fine-structure is solely due to simple random variation, smoothing this out can give a more realistic picture of the population that sample represents. But that is only true if the smoothing function is appropriate! Which is one reason why histograms can be astonishingly misleading when their breakpoints are poorly (or unluckily) chosen.
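To see how much the breakpoints can matter, here is a small sketch that plots one sample three ways. The sample (30 values from an exponential population) and the three sets of breakpoints are hypothetical choices of ours, not taken from the original page.

```r
# One small skewed sample, three different choices of class-interval
set.seed(2)                        # assumption: fixed seed for repeatability
y = rexp(30)                       # hypothetical small, skewed sample
top = ceiling(max(y))              # upper limit that is sure to cover the data
op = par(mfrow = c(1, 3))          # three panels side by side
h1 = hist(y, main = "default breaks")
h2 = hist(y, breaks = seq(0, top, by = 1), main = "width 1, from 0")
h3 = hist(y, breaks = seq(-0.5, top + 0.5, by = 1), main = "width 1, from -0.5")
par(op)                            # restore the previous graphics settings
```

All three panels show the same 30 values; only the class-intervals differ.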

Let us therefore consider some other ways of graphically displaying how values are distributed which do not require class-intervals.

Surprisingly, the rank-based nonparametric viewpoint has much to offer in exploring distributions - even if you merely want to see whether re-scaling (transforming) your data has made its errors roughly normal.

Some of the most useful procedures were devised for sampling distributions, or as extensions of confidence intervals, but are seldom applied to the actual data.

 

Univariate scatterplots

For a variety of reasons, univariate scatterplots (rugplots) are the simplest way to compare how sets of values are distributed - yet they are surprisingly rarely used, even in elementary stats texts. By convention rugplots are plotted along one or more of a graph's axes, often the x-axis, hence its name (imagine a rug viewed edge-on) - but this need not be so.

Therefore, whilst R's rug function allows you to add a rugplot to an existing plot, the following code takes a sample of n observations from a defined population (Y), and plots them as a simple (vertical) rugplot.
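One way this could be sketched is shown here. The ten values in Y are a hypothetical population of our own choosing, and the seed is fixed merely for repeatability.

```r
# Sample n values from a small discrete population and plot them as a vertical rugplot
set.seed(3)                                  # assumption: fixed seed
Y = c(1, 2, 3, 5, 8, 13, 21, 34, 55, 89)     # hypothetical population of ten discrete values
n = 100
y = sample(Y, n, replace = TRUE)             # n observations, selected with replacement
x = rep(1, n)                                # every point shares the same x position
plot(x, y, xaxt = "n", xlab = "")            # a single vertical column of points
```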

  • Note that this graph represents (n=) 100 values - yet only 10 are visible on the plot. This is because y can only take the value of one of the ten discrete values given above. Tied values are not distinguished.

 

The following code requests R to take three random samples of different 'theoretical' populations, then to plot them as 'rugplots' up the middle of a graph. This time we have used dashes to minimize point overlap.
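A sketch along these lines follows. The three populations (normal, uniform, exponential), their parameters, and the sample size are our assumptions for illustration.

```r
# Three samples from different theoretical populations, plotted as vertical rugplots
set.seed(4)                              # assumption: fixed seed
n = 50                                   # hypothetical sample size
y1 = rnorm(n, mean = 10, sd = 2)         # normal population
y2 = runif(n, min = 0, max = 20)         # uniform population
y3 = rexp(n, rate = 0.2)                 # skewed (exponential) population
y = c(y1, y2, y3)
x = rep(1:3, each = n)                   # one column per sample
plot(x, y, pch = "-", xlim = c(0, 4), xaxt = "n", xlab = "")   # dashes reduce overlap
axis(1, at = 1:3, labels = c("normal", "uniform", "exponential"))
```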

  • Since we believe computers should be machines which save us work, we often disregard convention and use rather than

  • Similarly, given R provides 'random' theoretical functions, we thought might be clearer than , and have used rather than

 

Conventional dotplots

Dotplots, traditionally drawn with graphpaper and pen, used to be a popular way to display distributions of small, heavily tied, sets of values.

The R code below assigns some values to a variable (y), then plots a conventional dotplot, with duplicate values arranged evenly above and below.
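As a rough sketch, assuming a small set of hypothetical values, the offsets for tied values can be calculated so that each group of ties is centred on the axis:

```r
# Conventional dotplot: tied values are spread evenly above and below the axis
y = c(3, 5, 5, 6, 6, 6, 7, 7, 7, 7, 8, 8, 8, 9, 9, 11)   # hypothetical, heavily tied values
# Within each group of tied values, number the members 1..k, then centre those numbers on zero
offset = ave(y, y, FUN = function(v) seq_along(v) - (length(v) + 1) / 2)
plot(y, offset, pch = 16, yaxt = "n", ylab = "")
abline(h = 0, lty = 3)                                    # the axis the dots are arranged about
```

If you would rather stack ties upwards from a baseline, R's stripchart function offers this via stripchart(y, method = "stack").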

The conventional way to go about this task was (to instruct some unfortunate technician) to plot values initially as a rugplot, adding tied-values alternately above and below their fellows. Provided each value has an odd number of tied values the graph should be symmetrical about the x-axis, otherwise the result was arbitrarily asymmetric - and for large sets of values, a tedious, untidy, and unsatisfactory affair.

Conventional dotplots display tied values one above the other. They are also known as univariate scatterplots, dot histograms or histogram-type dotplots, or (along with jittered dotplots) as density plots. Another form of dot histogram displays tied values by plotting the frequency of ties of each value of y, on the value of y, as described below. 

 

Jittered dotplots

One advantage of those very simple (univariate) plots is that even an untrained eye can readily interpret the differing densities of values - until, that is, the points overlap. Therefore, whilst rugplots have an attractive simplicity for very small sets of values, they do not cope well with high densities of similar values, or with 'tied', 'discrete' data.

Years before computers were available, a popular way around those constraints was to plot the values as dotplots using ordinary graphpaper and, if there were duplicate or very-similar values, to add them (more-or-less evenly) either side of those already plotted. Given which, the wider the dotplot, the denser the values were around that value. Statisticians have criticised this method, partly because the rules for adding points were not standard, but especially because it has a habit of introducing unwanted, arbitrary, and sometimes misleading patterning.

Now computers are ubiquitous, and we have good pseudo-random-number algorithms, an increasingly popular way to separate similar values, whilst not introducing bias, is to add uniformly-distributed random variation orthogonally (at-right angles, or 90 degrees) to the observed values. Since adding random variation is known as 'jittering', these are commonly known as jittered dotplots, or jittered scatterplots.

The R code for displaying a single sample as a jittered dotplot is gloriously simple. The following code displays the sample obtained above 
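A minimal sketch follows. Since it has to stand alone, it draws its own hypothetical sample rather than re-using one from earlier on this page.

```r
# Jittered dotplot: observed values horizontally, uniform random jitter vertically
set.seed(6)                                               # assumption: fixed seed
Y = c(1, 2, 3, 5, 8, 13, 21, 34, 55, 89)                  # hypothetical discrete population
y = sample(Y, 100, replace = TRUE)                        # stand-in for the sample
j = runif(length(y))                                      # jitter, uniform between 0 and 1
plot(y, j, yaxt = "n", ylab = "")                         # the jitter axis carries no information
```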

  • Note the jittering variation (provided by the runif function) is uniformly, not normally distributed - in other words the jittering value is equally likely to have any value from zero to one (so uniform populations are usually parameterized by a minimum and maximum, rather than a mean and standard deviation).

  • Notice also that, entirely for our own convenience, we plotted the y-variable horizontally rather than vertically.

 

Smoothing distributions

Applying repeated random variation to the observed values themselves, then averaging the result, smooths their distribution. But, if the result is not to be misleading, this smoothing requires you to select a suitable error distribution, e.g. normal.

The following code instructs R to apply Gaussian (normal) smoothing to the values in variable y3, and plot their mean probability density. We have added the original values as a rugplot.
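This could be sketched using R's density function, as below. The values in y3 are a hypothetical stand-in, and the degree of smoothing (bandwidth) is left at R's default, which is itself an arbitrary choice.

```r
# Gaussian kernel smoothing of a sample, with the original values added as a rugplot
set.seed(7)                               # assumption: fixed seed
y3 = rexp(50, rate = 0.2)                 # hypothetical stand-in for y3
d = density(y3, kernel = "gaussian")      # default bandwidth
plot(d, main = "")
rug(y3)                                   # original values along the x-axis
```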

The computational advantage of using a theoretical and mathematically-defined function is you do not have to repeatedly jitter all your sample values, then examine how the jittered values are distributed.

The advantage to users is smoothed plots resemble theoretical plots and histograms in text-books.

Since the form and degree of smoothing is unavoidably arbitrary, every smoothing function risks introducing bias and artefacts - even if that smoothing function is a simple running mean, or class-intervals (as used in histograms). Jittered scatterplots do not introduce bias on average, but when jittering is applied to an individual sample the human eye smooths the distribution to a random hence uneven degree.

The following code instructs R to produce jittered scatterplots of the 3 samples above 
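A self-contained sketch is given here. The three samples are regenerated from hypothetical populations, and the jitter width of 0.3 either side is our choice.

```r
# Jittered scatterplots of three samples, one column per sample
set.seed(8)                                            # assumption: fixed seed
n = 50
y = c(rnorm(n, 10, 2), runif(n, 0, 20), rexp(n, 0.2))  # hypothetical samples
g = rep(1:3, each = n)                                 # which sample each value belongs to
x = g + runif(3 * n, min = -0.3, max = 0.3)            # uniform jitter, at right angles to y
plot(x, y, xaxt = "n", xlab = "")
axis(1, at = 1:3, labels = c("normal", "uniform", "exponential"))
```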

One important limitation of rugplots, jittered dotplots and their ilk, is they tend to obscure any fine structure within a sample distribution, such as tied values, or patterns within very similar values. Ironically, whilst many nonparametric statistics collapse data to ranks, rank-based methods avoid the problems inherent to class-intervals, and can retain all the fine structure for examination.

 

Simple rank scatterplots

Arguably the simplest rank-based graphical technique is a scatterplot of rank on value.

For instance, the following code instructs R to randomly select (n=) 30 values from a defined population distribution, and show the result as a scatterplot of rank on value.
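One way to sketch this, assuming a hypothetical population of ten discrete values, is to sort the sample and use each value's position as its rank:

```r
# Scatterplot of rank on value, where tied values receive successive ranks
set.seed(9)                                  # assumption: fixed seed
Y = c(1, 2, 3, 5, 8, 13, 21, 34, 55, 89)     # hypothetical discrete population
n = 30
y = sort(sample(Y, n, replace = TRUE))       # sample, arranged in ascending order
r = 1:n                                      # rank = position in the sorted sample
plot(y, r, xlab = "value", ylab = "rank")
```

Because tied values are given successive ranks, each group of ties appears as a vertical run of points.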

Remember, with all the plots on this page, you are unlikely to get precisely (or sometimes even approximately) the same result as us, because the values are selected at random!

You can get the same result using these instructions:

  • In the interest of readability, we decided not to reduce to

Notice that, because R's rank function assumes you want the mean (average, or expected) rank of tied values, the following code would lose some of that valuable information - unless the data lacks ties (so every value is unique) - which often happens in small samples, even of highly-discrete populations.
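The effect of mean ranks can be sketched with a few hypothetical tied values, comparing the default behaviour of rank with ties.method = "first":

```r
# Mean ranks collapse each group of ties onto a single point
y = c(2, 3, 3, 3, 5, 8, 8, 13)               # hypothetical values, already sorted
r.mean = rank(y)                             # default: tied values share their mean rank
r.first = rank(y, ties.method = "first")     # tied values receive successive ranks
plot(y, r.mean, xlab = "value", ylab = "rank")
points(y, r.first, pch = 3)                  # crosses show what the mean ranks hide
```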

Notice this plot tends to obscure how ties within the data are distributed.

 

Cumulative rank scatterplots

If you wish to compare several samples containing unequal numbers of values it helps to standardize the ranks - most simply by converting to relative rank - as in this example:

  •  When relative rank is calculated in that way (p = r/n), for any given value, p is the proportion of values in y whose ranks are less than or equal to that value - hence ranking is a cumulative function (re-mapping).
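A sketch of such a comparison follows, using three hypothetical normal samples of unequal size, with relative rank calculated as p = r/n and plotted as stepplots.

```r
# Relative rank (p = r/n) on value, for three samples of unequal size
set.seed(11)                               # assumption: fixed seed
y1 = sort(rnorm(20, 10, 2))                # hypothetical samples
y2 = sort(rnorm(50, 12, 2))
y3 = sort(rnorm(100, 11, 4))
p1 = seq_along(y1) / length(y1)            # rank divided by sample size
p2 = seq_along(y2) / length(y2)
p3 = seq_along(y3) / length(y3)
plot(y3, p3, type = "s", xlim = range(y1, y2, y3), ylim = c(0, 1),
     xlab = "value", ylab = "relative rank")
lines(y2, p2, type = "s", lty = 2)
lines(y1, p1, type = "s", lty = 3)
```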

These plots are also known as empirical cumulative distribution functions (ECDFs), and to emphasize the fact they are unavoidably discrete, they are often plotted as stepplots. Plotting them as lineplots smooths the distribution to the eye, and makes them easier to compare, but implicitly assumes intermediate values could realistically be observed.

If you want to use R's ECDF function, you can plot the results using
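For example, a minimal sketch using R's ecdf function and its plot method might be as follows. The sample is hypothetical, and the two plotting options are merely our preference.

```r
# Empirical cumulative distribution function of a sample, drawn as a stepplot
set.seed(12)                                   # assumption: fixed seed
y = rnorm(25)                                  # hypothetical sample
Fn = ecdf(y)                                   # Fn is itself a function of value
plot(Fn, verticals = TRUE, do.points = FALSE)  # join the steps, omit the dots
```

Since Fn is a function, Fn(0) gives the proportion of values in y that are less than or equal to zero.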

Theoretical statisticians might also point out that an ECDF provides a maximum-likelihood estimate (MLE) of the population's cumulative distribution function (CDF) - and note that many MLEs are biased. In more everyday terms, these plots are cumulative distributions. Unfortunately, owing to the way statistics are taught in schools, the histogram holds powerful sway, and most people find cumulative distributions comparatively hard to interpret.

 

P-value plots

One reason cumulative distributions are unpopular is that people find it hard to judge their location, dispersion, or skew. A simple way to address these issues is to convert values of p above 0.5 to 1 minus p - in other words to reflect the upper tail downwards. However, since 1/n <= r/n <= 1 is inherently asymmetric, p = (r-0.5)/n is a less biased measure of relative rank.

The following code takes 3 samples in the same way as immediately above, then presents them as p-value lineplots - to aid comparison, a vertical line shows each sample median.
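A sketch of the idea is given below. The helper folded.p is a hypothetical name of ours, and the three samples are stand-ins drawn from normal populations.

```r
# p-value plot: relative rank (r - 0.5)/n, with the upper tail reflected downwards
folded.p = function(y) {                       # hypothetical helper, not a built-in
  y = sort(y)
  p = (seq_along(y) - 0.5) / length(y)         # less biased relative rank
  list(y = y, p = pmin(p, 1 - p))              # values of p above 0.5 become 1 - p
}
set.seed(13)                                   # assumption: fixed seed
s = list(rnorm(20, 10, 2), rnorm(50, 12, 2), rnorm(100, 11, 4))
plot(NA, xlim = range(unlist(s)), ylim = c(0, 0.5), xlab = "value", ylab = "p")
for (i in 1:3) {
  f = folded.p(s[[i]])
  lines(f$y, f$p, lty = i)                     # one lineplot per sample
  abline(v = median(s[[i]]), lty = i)          # vertical line at the sample median
}
```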

One disadvantage of p-value plots is, since they are seldom used, they confuse the uninitiated - including otherwise-sensible statisticians.

  • The p-values in these examples employ sample quantiles (not theoretical quantiles) so must not be confused with P-values of test statistics, or P-value plots of nested confidence intervals.

  • Since no-one refers to 'empirical QQ plots', talking about 'empirical p-value plots' seems unlikely to improve matters.

 

Frequency of ties

One advantage of rank scatterplots is that, being cumulative, they are less affected by fine structure than rank-frequency plots - the larger your sample size, the less variable is its cumulative distribution. Hence the peak of each p-value plot (the median is where p=0.5) is a more reliable measure of location than a histogram's mode.

The following code instructs R to plot the relative frequency of each value of y1, calculated from its rank. Bars indicate the frequency each value is tied + 1.
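One way to obtain those frequencies from ranks is sketched here: the difference between a value's highest and lowest possible rank, plus one, is the number of times it occurs. The rounded normal sample is hypothetical.

```r
# Relative frequency of each value, calculated from its ranks
set.seed(14)                                   # assumption: fixed seed
y1 = round(rnorm(100, mean = 10, sd = 2))      # hypothetical sample, rounded so ties occur
n = length(y1)
# Highest rank minus lowest rank of each value, plus 1, is how often that value occurs
f = rank(y1, ties.method = "max") - rank(y1, ties.method = "min") + 1
plot(y1, f / n, type = "h", xlab = "value", ylab = "relative frequency")
```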

Of course unless they are subject to rounding, because a normal population contains an infinite number of different values, the probability of selecting two identical normal values by chance approaches zero. In which case the frequency distribution (for example see above) is polymodal, therefore every mode has the same height (f = 1), and the result is equivalent to a univariate plot or rugplot. Rounded large samples produce histogram-like results - but, if the rounding is uneven, such plots are misleading. Remember, given sufficiently many class-intervals, a histogram will also have up to n modes, unless values are tied - in which case the result is equivalent to a bargraph.

At the opposite extreme, most people assume straight lines must be relatively easy to appraise. Which may explain why quantile-quantile plots (QQ plots) are a relatively popular way to compare two distributions.

 

Q-Q plots

The following instructions are a simple and transparent way to compare two samples of equal size:
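A minimal sketch follows, assuming two hypothetical samples drawn from the same normal population: sort both samples, then plot one on the other.

```r
# Two-sample QQ plot for samples of equal size
set.seed(15)                       # assumption: fixed seed
n = 40
y1 = rnorm(n)                      # hypothetical samples of the same population
y2 = rnorm(n)
q1 = sort(y1)                      # sample quantiles of y1
q2 = sort(y2)                      # sample quantiles of y2
plot(q1, q2)
abline(0, 1)                       # the line of equality
```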

A few moments' thought reveals that if both samples are of the same population we would expect that, on average, the paired values in a QQ plot will be identical. Whereas, if the values are selected at random, this will seldom occur - unless the population is extremely small indeed. If you compare two large samples of infinite normal populations, you commonly find values are very similar around the plot's center, but differ at its extremes.

If your samples are of unequal size, R's qqplot function uses interpolated values from the larger sample. So if y1 has 3000 values and y2 has 3 values, qqplot only produces 3 points.
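This behaviour is easy to check with a short sketch, using two hypothetical normal samples of the sizes mentioned:

```r
# qqplot with unequal sample sizes: the larger sample is interpolated
set.seed(16)                       # assumption: fixed seed
y1 = rnorm(3000)                   # large sample
y2 = rnorm(3)                      # very small sample
q = qqplot(y1, y2)                 # returns the plotted coordinates invisibly
length(q$x)                        # number of points plotted
```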

 

Normal quantile plots

Because two-sample QQ plots are comparatively rare, most people assume QQ plots are only used to see if a set of values deviates from their expected ('theoretical') normal values. This type of plot is more correctly termed a normal quantile plot, for example as follows:
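A sketch of such a plot is shown here, assuming a hypothetical sample and using the relative rank (r-0.5)/n to obtain theoretical quantiles from R's qnorm function.

```r
# Normal quantile plot: observed values on their theoretical normal quantiles
set.seed(17)                             # assumption: fixed seed
y = sort(rnorm(30, mean = 50, sd = 10))  # hypothetical sample, in ascending order
n = length(y)
p = (seq_len(n) - 0.5) / n               # relative rank of each value
q = qnorm(p)                             # standard normal quantiles for those ranks
plot(q, y, xlab = "theoretical quantile", ylab = "observed value")
```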

  • Notice that, in real data, the normal population's mean and standard deviation are seldom known. Unless the values are standardized (e.g. using ), the observed and theoretical quantiles will be linearly related, but unequal.

  • Again, because the theoretical values are normal population quantiles, a relative rank of P=r/n would bias those theoretical values. So, to reduce that bias, we use (r-0.5)/n 
      For instance in a sample of n values the highest possible rank (rmax) equals n, therefore rmax/n = 1. However, if y is randomly selected from a normal distribution, we are unlikely to observe the highest value of y is infinite - unless n is also infinite.
    Other corrections are available to reduce the bias further, nevertheless the extreme points of small samples of normal populations should be expected to slightly deviate from a linear relationship.

 

Two further alternatives

If you prefer to use R's normal quantile function, it is called qqnorm
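For instance, assuming the function meant here is qqnorm, a minimal sketch with a hypothetical sample would be:

```r
# R's built-in normal quantile plot, with a reference line added
set.seed(18)                             # assumption: fixed seed
y = rnorm(30, mean = 50, sd = 10)        # hypothetical sample
q = qqnorm(y)                            # plots y on its theoretical normal quantiles
qqline(y)                                # line through the first and third quartiles
```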

The following code applies R's normal quantile function to the expected values of 5 normal observations, which we estimate from (R=) 50000 random samples (of n=15 values) from a normal population (otherwise known as ranked normal deviates, or rankits).
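A sketch of that simulation follows. We have assumed n = 15 values per sample, as in the parenthesis above; treat n as adjustable. The matrix layout (one sample per column) is our own choice.

```r
# Estimate rankits by simulation, then plot them on their theoretical quantiles
set.seed(19)                             # assumption: fixed seed
R = 50000                                # number of random samples
n = 15                                   # assumed number of values in each sample
m = matrix(rnorm(R * n), nrow = n)       # each column is one sample of n normal values
s = apply(m, 2, sort)                    # sort each sample; the result has n rows, R columns
Y = rowMeans(s)                          # mean of the i-th smallest value = estimated rankit
qqnorm(Y)                                # rankits on R's theoretical normal quantiles
```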

Lest you assume theoretical quantiles estimated via simulation (such as rankits) have no advantage over theoretical quantiles obtained from an inverse probability function, let us compare them a little more carefully.

The following code asks R to plot the difference between the (estimated) expected values on their theoretical quantiles (in this case obtained from R's normal quantile plot function). Plotting the deviations from expected against their observed values is much more sensitive than a simple QQ plot - so can reveal systematic differences in two otherwise similar distributions.
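One way this comparison might be sketched is given here. The rankits are re-estimated so the code stands alone, again assuming n = 15, and the theoretical quantiles are taken from qqnorm without plotting.

```r
# Difference between simulated rankits and theoretical normal quantiles
set.seed(20)                                   # assumption: fixed seed
R = 50000
n = 15                                         # assumed number of values per sample
Y = rowMeans(apply(matrix(rnorm(R * n), nrow = n), 2, sort))   # estimated rankits
q = qqnorm(Y, plot.it = FALSE)$x               # theoretical quantiles, without plotting
d = Y - q                                      # deviation of each rankit from its quantile
plot(q, d, type = "b", xlab = "theoretical quantile", ylab = "rankit - quantile")
abline(h = 0, lty = 3)                         # no difference
```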

  • Notice that the median value is unbiased.

  • In this example Y contained standard rankits, because their values are similar to theoretical standard normal quantiles.

  • For a real sample, because the population parameters are unknown, you would only expect no difference if the data were standardized prior to plotting, for example using:
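A common way to standardize a sample, offered here as an assumption rather than as the original instructions, is to subtract its mean then divide by its standard deviation:

```r
# Standardize a sample to mean 0 and standard deviation 1 before plotting
set.seed(21)                             # assumption: fixed seed
y = rnorm(30, mean = 50, sd = 10)        # hypothetical 'real' sample
z = (y - mean(y)) / sd(y)                # standardized values
qqnorm(z)
abline(0, 1)                             # line of equality with the theoretical quantiles
```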