InfluentialPoints.com

 

 

Displaying Freq. Dists. using R


Drawing a histogram

R can draw a histogram of a variable, y, using these instructions:
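A minimal sketch of such instructions, assuming y is a numeric vector. The values below are made up for illustration and are not this page's data.

```r
# Hypothetical sample standing in for variable y (made-up values)
y <- c(420, 445, 460, 480, 485, 495, 505, 520, 520, 525, 545, 560)

# Draw the histogram; hist() also returns the class-interval details invisibly
h <- hist(y, main = "Histogram of y", xlab = "y", col = "grey")

# Each value falls in exactly one class-interval
sum(h$counts) == length(y)
```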

Gave us:

    Note:
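  • You can suggest the number of class-intervals using the hist function's breaks argument (or some shortened form thereof, such as br). For example hist(y, breaks=2) asks for about 2 (equal width) intervals, whereas setting br=100 asks for about 100, and so on. R treats a single number as a suggestion only, and rounds the breakpoints to 'pretty' values, so the number of intervals drawn may differ. To fix the breakpoints exactly, give breaks a vector of values. Setting intervals of unequal widths is also possible, albeit generally unwise.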
  • Since histograms group values into arbitrary class-intervals, you may find it useful to indicate the actual values of y as 'tick-marks' along one axis - this is known as a 'rug-plot'. The following instruction adds a rugplot to an existing plot.

  • By default, a rugplot is added to the (lower) x-axis.
  • Rugplots can usefully be added to some other types of plot.
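The notes above can be sketched as follows, again assuming a made-up numeric vector y. The comparison of h5 and h20 is only there to show how many intervals R actually uses for a given breaks value.

```r
# Hypothetical sample (made-up values)
y <- c(420, 445, 460, 480, 485, 495, 505, 520, 520, 525, 545, 560)

# Ask for roughly 5 and roughly 20 intervals, without plotting
h5  <- hist(y, breaks = 5,  plot = FALSE)
h20 <- hist(y, breaks = 20, plot = FALSE)
length(h5$counts)    # number of intervals R actually used
length(h20$counts)

# Draw a histogram (br is a shortened form of breaks), then add a rugplot
hist(y, br = 20, main = "Histogram with rugplot", xlab = "y")
rug(y)               # tick marks on the lower x-axis (side = 1 is the default)
```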

 

 

A frequency polygon

R can draw a frequency polygon of one variable, y, using these instructions:
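One way of doing this, sketched under the assumption that the class-intervals are equal in width and that y is a made-up numeric vector: let hist do the counting, then join the class midpoints with lines, adding an empty class at each end so the polygon returns to zero.

```r
# Hypothetical sample (made-up values)
y <- c(420, 445, 460, 480, 485, 495, 505, 520, 520, 525, 545, 560)

h <- hist(y, plot = FALSE)       # count values per class-interval, no plot
w <- diff(h$breaks)[1]           # class width (intervals are equal width)
k <- length(h$mids)

# Midpoints and frequencies, padded with an empty class at each end
x <- c(h$mids[1] - w, h$mids, h$mids[k] + w)
f <- c(0, h$counts, 0)

plot(x, f, type = "l", xlab = "y", ylab = "Frequency")
points(x, f)
```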

Gave us:

    Note:
  • Frequency polygons can be useful in comparing several distributions (say of the values in variables y & z) by plotting them within a single set of graph axes. Of course, if the distributions of y & z differ much you need to set the plot limits so they all fit on it - and, if the numbers of values are very different, it is best to use relative frequencies, as we do below:
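A sketch of such a comparison, assuming two made-up vectors y and z of different lengths. Shared breakpoints (the name br is ours) keep the two polygons comparable, and the plot limits are set so both fit.

```r
# Hypothetical samples of unequal size (made-up values)
y <- c(420, 445, 460, 480, 485, 495, 505, 520, 520, 525, 545, 560)
z <- c(380, 410, 430, 455, 470, 500)

# Shared class boundaries covering both variables
br <- pretty(range(c(y, z)), n = 8)
hy <- hist(y, breaks = br, plot = FALSE)
hz <- hist(z, breaks = br, plot = FALSE)

# Relative frequencies, so the different sample sizes do not matter
ry <- hy$counts / sum(hy$counts)
rz <- hz$counts / sum(hz$counts)

plot(hy$mids, ry, type = "l", xlim = range(br), ylim = c(0, max(ry, rz)),
     xlab = "Value", ylab = "Relative frequency")
lines(hz$mids, rz, lty = 2)
legend("topright", legend = c("y", "z"), lty = 1:2)
```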

 

 

A stem and leaf plot

R can produce a stem and leaf plot of a variable, y, using these instructions:
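A minimal sketch, assuming a made-up numeric vector y, so its output will not match the plot shown on this page.

```r
# Hypothetical sample (made-up values)
y <- c(420, 445, 460, 480, 485, 495, 505, 520, 520, 525, 545, 560)

# Print a stem and leaf plot to the console, using twice the default scale
stem(y, scale = 2)
```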

Gave us:

 The decimal point is 1 digit(s) to the right of the |

  42 | 000
  44 | 50
  46 | 005
  48 | 05055
  50 | 050
  52 | 00000555
  54 | 0555
  56 | 00

    Note:
  • Unlike R's graphics functions, stem only produces an output to the console. The command x=stem(y) would merely assign a NULL value to x.
  • The parameter 'scale' is used to set the scale of the plot. We set scale=2 to scale it the same way as the other plots. If scale were not specified, R would use the default scale=1 which would pool adjacent classes.
  • If you used stem(y/10000) the result would be identical - aside from the message at the top being The decimal point is 4 digit(s) to the left of the |.

 

 

A jittered dot plot

R can produce a jittered dot plot of a variable, y, using these instructions:
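A minimal sketch, assuming a made-up numeric vector y. Drawing y on the vertical axis is our assumption, as is the call to set.seed, which is only there to make the sketch reproducible.

```r
# Hypothetical sample (made-up values)
y <- c(420, 445, 460, 480, 485, 495, 505, 520, 520, 525, 545, 560)

set.seed(1)               # assumption: fixed seed, for a repeatable picture
jy <- runif(length(y))    # one uniform random value in (0, 1) per value of y

# Jittered dot plot: random horizontal position, actual value on the vertical axis
plot(jy, y, xaxt = "n", xlab = "", ylab = "y")
```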

Gave us:

    Note:
  • Variable jy contains the jittered values for plotting against variable y. By default the range of jy is 0 to 1.
  • runif(length(y)) produces the same number of random values as there are values in variable y. These random values (assigned to variable jy) are uniformly distributed, in that it is equally likely you will obtain any value between zero and one - it does not mean those values will be evenly spaced between zero and one.
  • To plot the same data with a logarithmically re-scaled y-axis, include log = 'y' in the plot command - for example, like this:

  • If, however, you use plot(log(y), runif(length(y))), then R will plot the natural log of your y-variable - instead of plotting the rescaled but original values.
  • R has a function called dotchart which plots values along the x-axis, with the y-value set by the order in which data are given. Identical values are plotted one above the other. Since the order is set by the user rather than randomly, we cannot recommend use of this function in place of the jittered dot plot. However, if data are first sorted by rank, it does provide an alternative method of obtaining a rank scatterplot (see below).
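The difference between re-scaling the axis and transforming the values can be sketched as follows. We assume y holds positive made-up values and is drawn on the vertical axis, which is the axis log = 'y' acts upon.

```r
# Hypothetical sample of positive values (made-up)
y <- c(420, 445, 460, 480, 485, 495, 505, 520, 520, 525, 545, 560)

set.seed(1)               # assumption: fixed seed, for a repeatable picture
jy <- runif(length(y))

# Log-scaled axis: tick labels stay in the original units of y
plot(jy, y, log = "y", xaxt = "n", xlab = "", ylab = "y (log scale)")

# Log-transformed values: the axis is now labelled in natural-log units
plot(jy, log(y), xaxt = "n", xlab = "", ylab = "log(y)")
```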

 

 

Rank scatterplots

There are many ways of plotting rank (or relative rank) on value - in other words producing a rank scatterplot. For example:
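One such way, sketched with a made-up numeric vector y. Using ties.method = "first" is our choice, so that tied values receive sequential rather than averaged ranks.

```r
# Hypothetical sample containing one tie (made-up values)
y <- c(460, 420, 525, 495, 520, 445, 520, 480, 505, 560)

# Sequential rank of each value; ties are ranked in order of appearance
r <- rank(y, ties.method = "first")

# Rank scatterplot: rank on value
plot(y, r, xlab = "y", ylab = "Rank")
```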

    Note:
  • You can obtain the same sort of thing by sorting y into ascending order, and plotting the result against that order:

  • Notice this code assumes there are NO non-available (NA) values, because the sort function automatically removes them.
  • If y does have some NA values, the following code would work - although it is somewhat inefficient because y must be sorted twice.

  • A better way is to sort y, and create a new variable (r) - assuming it is OK to reorder y and create r.

  • If you prefer to use relative rank, and y has no NA values, these instructions would work:

  • But these instructions may be better:

  • Or you could use these instructions:
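The variants noted above can be sketched together as follows. This is one possible reading, using made-up values that include some NAs. The names ys, p1, p2 and p3 are ours, and (r - 0.5)/n is just one common choice of corrected relative rank.

```r
# Hypothetical sample with missing values (made-up)
y <- c(460, NA, 420, 525, 495, NA, 445, 520, 480, 505)

ys <- sort(y)            # sort() drops NA values by default
n  <- length(ys)         # number of non-missing values
r  <- seq_along(ys)      # sequential rank, 1 to n

# Rank on value, using the sorted copy so NAs cause no mismatch
plot(ys, r, xlab = "y", ylab = "Rank")

# Three versions of relative rank
p1 <- r / n              # simple relative rank, reaches 1 at the maximum
p2 <- (r - 0.5) / n      # corrected, stays strictly between 0 and 1
p3 <- ppoints(n)         # R's built-in plotting positions

plot(ys, p2, ylim = c(0, 1), xlab = "y", ylab = "Relative rank")
```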

Frequency of each value

R can plot the frequency of each value in variable y, without using class intervals, using these instructions:
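A minimal sketch using the table function, assuming y is a made-up vector of counts (such as eggs per gram). The names f and v are ours.

```r
# Hypothetical discrete sample (made-up counts)
y <- c(0, 0, 0, 1, 1, 2, 2, 2, 2, 3, 5, 5, 8)

f <- table(y)                  # frequency of each distinct value
v <- as.numeric(names(f))      # the distinct values themselves

# Frequency on value, one point per distinct value
plot(v, as.vector(f), xlab = "y", ylab = "Frequency")
```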

Gave us:

  • If you prefer to plot the distribution as a histogram-type line diagram, use these instructions instead:

  • Or you could sort y into ascending order, then find how many of each value there are (using the run-length-encoding function) then plot frequency against value:

  • The rle (run length encoding) function produces two sets of values:
    1. Run lengths - that is the frequency of (neighbouring) identical elements.
    2. Run values - that is the value of each group of identical elements.

  • The results of the rle function are assigned to a (list-type) variable called tmp - as two (hidden) vectors, called 'lengths' and 'values'. The next line instructs R to plot the frequency of each value in variable y against its value. Notice that, because the length and value of each run are held by vectors within tmp, we need to address them as tmp$values and tmp$lengths.

  • If y is a continuous variable whose values are neither rounded nor truncated, the result will nearly always be equivalent to a rugplot. This is often true of small samples of discrete data, such as the distribution of eggs per gram of faeces - as shown in the 'line-plot' and a barplot below.

     

  • Last but not least, if you prefer to plot the distribution as a (traditional) non-jittered dotplot you could use these instructions:
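The alternatives noted above can be sketched as follows, assuming a made-up vector of counts. Using stripchart with method = "stack" for the non-jittered dotplot is our choice of function, not necessarily the one used for this page.

```r
# Hypothetical discrete sample (made-up counts)
y <- c(0, 0, 0, 1, 1, 2, 2, 2, 2, 3, 5, 5, 8)

# Run-length encoding of the sorted values: one run per distinct value
tmp <- rle(sort(y))

# Histogram-type line diagram: a vertical line per distinct value
plot(tmp$values, tmp$lengths, type = "h", xlab = "y", ylab = "Frequency")

# The same frequencies as a barplot
barplot(tmp$lengths, names.arg = tmp$values, xlab = "y", ylab = "Frequency")

# Traditional dotplot: identical values stacked one above the other
stripchart(y, method = "stack", pch = 16, xlab = "y")
```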

 

 

Empirical cumulative distribution function (ECDF)

An ECDF is simply a plot of (sequential) rank on value, shown as a step plot - but often with the values overlaid as points. Since step plots must be plotted in ascending order, the simplest way is to sort the data. For example:
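A minimal sketch of this approach, assuming a made-up numeric vector y. The colour and plotting symbol are arbitrary choices.

```r
# Hypothetical sample (made-up values)
y <- c(460, 420, 525, 495, 520, 445, 520, 480, 505, 560)

ys <- sort(y)            # step plots need the values in ascending order
r  <- seq_along(ys)      # sequential rank of each sorted value

# Step plot of rank on value, with the values overlaid as points
plot(ys, r, type = "s", col = "blue", xlab = "y", ylab = "Rank")
points(ys, r, pch = 16)
```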

Gave us:

 

    Note:
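  • If you prefer to use R's ecdf function, you could simply enter plot(ecdf(y)) - but by default it does not produce the verticals. Adding verticals=TRUE draws them, and col sets the colour, so plot(ecdf(y), verticals=TRUE, col='blue') gives much the same result as our version.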

P-value plot

P-value plots are a useful way to examine and compare frequency distributions. At their simplest they are a (cumulative) lineplot of (p=) relative rank on (y=) value. But since it is harder to assess symmetry given an S shaped plot than given a /\ shaped plot, or an X shaped plot, p is calculated as corrected relative rank and values above the median are plotted against 1-p.

The code below gives a p-value plot for 1 variable (y).
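A minimal sketch of such code, assuming a made-up numeric vector y with no NA values. The correction (rank - 0.5)/n is our assumption, and pf is our name for the folded p-value.

```r
# Hypothetical sample (made-up values)
y <- c(420, 445, 460, 480, 485, 495, 505, 520, 520, 525, 545, 560)

ys <- sort(y)
n  <- length(ys)
p  <- (seq_along(ys) - 0.5) / n   # corrected relative rank (assumed form)

# Fold at the median: values above it are plotted against 1 - p
pf <- pmin(p, 1 - p)

plot(ys, pf, type = "l", xlab = "y", ylab = "p")
abline(v = median(ys), lty = 3)   # optional: mark the median
```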

Gave us:

    Note:
  • The median lies within the graph's apex, but you could make this explicit using abline(v=median(y))
  • If y does not contain any non-available (NA) values, and you do not wish to sort y, this code would work equally well:

  • Or you could simply plot p, then 1-p, but only show the lower half of the graph.
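Both alternatives can be sketched as follows, assuming a made-up vector y with no NA values and the same assumed correction, (rank - 0.5)/n. The first plot is a scatterplot, since unsorted values cannot be joined by a sensible line.

```r
# Hypothetical unsorted sample with no NA values (made-up)
y <- c(460, 420, 525, 495, 520, 445, 520, 480, 505, 560)

# Corrected relative rank, found without sorting y
p <- (rank(y) - 0.5) / length(y)
plot(y, pmin(p, 1 - p), xlab = "y", ylab = "p")

# Alternatively plot p, then 1 - p, but only show the lower half of the graph
o <- order(y)
plot(y[o], p[o], type = "l", ylim = c(0, 0.5), xlab = "y", ylab = "p")
lines(y[o], 1 - p[o])
```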

Multiple P-value plots

P-value plots are a useful way to compare distributions because they are easier to interpret than quantile-quantile plots - and p-value plots enable you to compare more than 2 distributions at once.

For example, we did p-value plots for the two sets of cattle weight data using these instructions:
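A sketch of how two such plots might be overlaid. The values of y and z below are made up and are not the cattle weight data, and pfold is a hypothetical helper of ours, not a built-in R function.

```r
# Hypothetical samples (made-up values, not the cattle weight data)
y <- c(420, 445, 460, 480, 485, 495, 505, 520, 520, 525, 545, 560)
z <- c(380, 410, 430, 455, 470, 500, 515, 530)

# Hypothetical helper: sorted values and their folded p-values
pfold <- function(v) {
  v <- sort(v)
  p <- (seq_along(v) - 0.5) / length(v)   # assumed correction
  list(x = v, p = pmin(p, 1 - p))
}

a <- pfold(y)
b <- pfold(z)

plot(a$x, a$p, type = "l", xlim = range(a$x, b$x), ylim = c(0, 0.5),
     xlab = "Value", ylab = "p")
lines(b$x, b$p, lty = 2)

# Dotted vertical lines at each median
abline(v = median(y), lty = 3)
abline(v = median(z), lty = 3)
```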

Gave us:

 

    Note:
  • You could also plot these as scatterplots, and limit their upper range to p=0.5, but lineplots are often easier to inspect.
  • We highlighted their medians using dotted vertical lines. This makes it easier to judge how symmetrical each distribution is, albeit at the expense of a more cluttered graph.
  • See how readily you can identify their quartiles, 10% & 90% quantiles, or any outlying values.

 

 

Smoothed distribution function

Since the frequency distribution of a sample is unavoidably discrete, it is sometimes useful to smooth it before plotting - provided this can be done without introducing too many arbitrary assumptions. We consider the reasoning behind this in Unit 3. R provides this facility via its density function. For example:
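A minimal sketch, assuming a made-up numeric vector y and leaving the density function's bandwidth and kernel at their defaults.

```r
# Hypothetical sample (made-up values)
y <- c(420, 445, 460, 480, 485, 495, 505, 520, 520, 525, 545, 560)

d <- density(y)      # kernel density estimate, default settings

plot(d, main = "Smoothed distribution of y", xlab = "y")
rug(y)               # show the actual values along the x-axis
```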

Gave us:

 

    Note:
  • For small samples it is often useful to add a rug plot.
  • Formally speaking, by default this density function uses normal (or Gaussian) smoothing.
  • By overplotting, several smoothed distributions can be compared. For example:
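A sketch of such a comparison, assuming two made-up vectors y and z. The plot limits are taken from both estimates so that neither curve is cut off.

```r
# Hypothetical samples (made-up values)
y <- c(420, 445, 460, 480, 485, 495, 505, 520, 520, 525, 545, 560)
z <- c(380, 410, 430, 455, 470, 500, 515, 530)

dy <- density(y)
dz <- density(z)

# Plot the first curve with limits wide enough for both, then overplot the second
plot(dy, xlim = range(dy$x, dz$x), ylim = c(0, max(dy$y, dz$y)),
     main = "Smoothed distributions of y and z", xlab = "Value")
lines(dz, lty = 2)
legend("topright", legend = c("y", "z"), lty = 1:2)
```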