InfluentialPoints.com
Biology, images, analysis, design...
Use/Abuse Principles How To Related
"It has long been an axiom of mine that the little things are infinitely the most important" (Sherlock Holmes)

Search this site

 

 

Normal distribution functions using R

Cumulative normal distribution function

R's pnorm function calculates what proportion of a normally-distributed population (pN) is less than a given value (y).

Gave:

[1] 0.8913985

    Note:
  • Most statistical tables yield the proportion > y. If that is what you require, use 1-pnorm(y) or use pnorm(-y)

  • It is just as easy to find p <= y for more than one value of y.
    For example as follows: y=c(1.234, 3.21, 0.999) pnorm(y)
  • By default y is assumed to be a quantile of the standard normal distribution.

  • Officially pnorm yields the proportion of a normal population less than or equal to any given quantile. In reality, because a normal population is infinite and every value is different, the proportion which equals any one value is 1/infinity, which is virtually zero. So p <= y is identical to p < y

  • For quantiles of normal populations with a non-zero mean, or whose standard deviation are not 1, you need to supply the relevant parameters.
    For example as follows: y=c(1.234, 5.432, 9.999) pnorm(y, mean(y), sd(y))
      In this case, the R function knows which value is which because all 3 were supplied. If this is not so, you must make your arguments explicit.
      For example as follows: pnorm(1.234, sd=0.12)

 

 

Inverse normal distribution function

R's qnorm function calculates which value in a normal population (y) has a given proportion (pN) of values below it. In other words it does the inverse of the cumulative normal function.

Gave:

[1] -1.160120

    Note:
  • Traditional statistical tables only deal with the upper tail of distributions, where pN > 0.5
      If you want to obtain the same results as you would get from a statistical table use qnorm(1-pN) or use -qnorm(pN)

  • If you specify their parameters, you can obtain quantiles of normal distributions where μ≠0 or σ≠1. For example, this would give 3 quantiles, each of a different normal distribution:
    pN=c(0.23, 0.34, 0.95) m=c(1, 3.1, -0.01) s=c(1.2, 1.3, 1.4) qnorm(pN, m, s)

  • When pN=0, or pN=1, then qnorm(pN) gives -Inf or Inf. If the proportion is below 0 or greater than 1, qnorm(pN) cannot be evaluated - so the result is not available (NA).

 

 

Normal density function

R's dnorm function calculates how steep the cumulative normal distribution is at a given value (y).

For example:

Gave:

[1] 0.186315

    Note:
  • This function does not give the probability of obtaining a given value. That probability is effectively zero. Instead, rather loosely, it is a measure of how likely you are to find values similar to the one you supplied - in other words their density.

  • Like statistical tables, it's default is a standard normal distribution.

  • Unlike statistical tables this function can deal with positive or negative values, and allows you to specify the distribution parameters.

 

 

Random normal values

the rnorm function enables you to obtain (n) randomly-selected values (y) from a normal distribution.

For example:

Gave:

[1] -1.5648171 -0.6778267

    Note:

  • These 2 observations were selected at random from a normal population with a mean of 1.2 and a standard deviation of 2.3

  • If the mean and standard deviation are not supplied values are selected from a standard normal population.

  • The instruction rnorm(2, 1.2, 2.3) could, of course, be divided into these three instructions:
    n=2 Pn=runif(n) qnorm(Pn, mean=1.2, sd=2.3)
    Following their execution, the variable Pn is equally likely to have any value between zero and one. If n is made sufficiently large, the resulting values of y can be expected to closely approach normal.

  • The functions pnorm, qnorm, dnorm, and r norm will happily accept and produce variables containing one or more values. If these variables are of unequal length, R will recycle their contents as need be. For example, y=rnorm(1000) yields a vector, y, containing 1000 values randomly selected from a standard normal distribution. Whereas qnorm(0.1, 1:2, 1:5) would yield quantiles from 5 different normal distributions:
    [1] -0.2815516 -0.5631031 -2.8446547 -3.1262063 -5.4077578

  • For help on these 4 functions use the function name preceded by a question mark, for example ?rnorm

 

 

Fitting to a quantile scatterplot

R can fit a normal distribution to a quantile scatterplot using these instructions:

Giving:

 

    Note:
  • y=scan(file='rnorm90.txt') loads a set of 90 random normal values (from the file rnorm90.txt in the default folder for your current session of R) and assigns those values to variable y.

  • plot(y, (rank(y)-0.5)/length(y), ylab='crr', pch=20, col='red', main='Quantile scatterplot of y') instructs R to do a scatterplot of corrected relative (mean) rank (crr = [rank-0.5/n]) on y - assuming y does not contain any missing values. For sake of clarity, it also instructs R to label the y-axis as crr, and to use small red bullet symbols for the points.

    If you need sequential rather than mean rank, use rank(y, ties.method='first'), instead of rank(y)

  • The lines function asks R to overplot the quantile scatterplot with a lineplot approximating the fitted cumulative normal distribution. The sort function ensures the points are plotted in ascending order - unless, of course, you prefer your graph to resemble an insane bird's nest.

    Notice that, rather than assign the sorted values to variable z beforehand, as a separate instruction, we have done this within the plot function. To accomplish this feat of brevity, we must use the more general <- instead of the much more restricted = function.

  • If you want to compare a number of distributions, without having to overplot them, it is possible to split the plot window into a grid of sequentially-used plots. For example par(mfrow=c(2,2)) splits the plotting output into 2 rows and 2 columns - which are used from left to right, top to bottom.

    However, since the par function affects subsequent plots, before changing it's parameters, it is a good idea to save the default settings using something like this: defaults=par(). Then, when you want to restore them, you can use par(defaults).

 

 

Fitting to a cumulative histogram

You can plot a cumulative histogram and add a fitted normal distribution as follows:

Giving:

 

    Note:
  • The lines function overplots an existing plot - so you must already have plotted your cumulative histogram.
      lines(y) would do the same thing as points(y, type='l')

    The <- function enables us to assign the sorted values to variable z within the plot function - rather than using separate instruction beforehand.

    Because this code is plotting frequencies, rather than relative frequencies, we have rescaled our p-values using the number of values in variable y.

  • This way of producing a fitted function will not look very good if y contains very few values.

    A simple way around this problem is to use a sequence of, say 100, values instead of y. In other words by using:

 

 

Fitting to a histogram

Getting R to plot a histogram with a fitted normal distribution is not quite as easy as you might expect - but these instructions work reasonably well:

Giving:

 

    Note:
  • n=length(y) and info=hist(y, plot=FALSE) obtains, n, the number of items in variable y - and, info, the information for plotting a histogram of y.

    The plot=FALSE modifier tells R not to plot this histogram. If you omit this modifier, you may find the top of your fitted curve falls outside the plottable region - and is therefore cropped.

  • x=seq(min(y), max(y), length=100) asks y to create a variable called x, containing a sequence of 100 numbers between the smallest value and the largest value in variable y.

    If variable y contains a sufficient number of values you could use x=sort(y) instead of this instruction - although you may find the line fit is not so good at its extremities.

  • int=info$breaks[2]-info$breaks[1] calculates the width of the histogram's class intervals as the difference between the first two of its breakpoints. In other words, it assumes all the breakpoints are identical - which is the default for the hist function.

  • plot(x, dnorm(x, mean(y), sd(y))*n*int, col='red', type='l', xlab='y', ylab='Frequency', main='Histogram of y') instructs R to plot the fitted line before the histogram - thereby ensuring all the fitted curve between min(y) and max(y) is plotted. If this is not a concern you could plot the histogram first, using info=hist(y), then overplot it using points(x, dnorm(x, mean(y), sd(y))*n*int, col='red', type='l').

    Notice that this plot command is used to specify the headings normally supplied by the hist function. There are two reasons for this.

  • It prevents the plot function from using the variable names as labels.
  • When the hist function is used to overplot it does not supply its usual labels.

  • The add=TRUE modifier instructs the hist function to overplot (add to) a previous plot.

    Sadly, not all graphics functions can be used to overplot like this. For example, using plot(y,x, add=TRUE) merely produces a batch of error messages.

  • For more information on the hist function use ?hist.

    Notice that the hist function allows you to specify a range of parameters, including the number and location of class intervals, and whether to plot frequencies or relative frequencies. But, in order to fit a normal distribution, we must assume all the class intervals are the same.

 

 

Doing a 1-sample QQ plot

R can produce a normal QQ plot using this pair of instructions.

Giving:

 

    Note:
  • qqplot(y) produces the scatterplot, assuming the population is normal - but the expected quantiles represent a standard normal population.

    To ensure they are plotting 'like against like', some people prefer to plot the standardised values - for example using qqplot((y-mean(y))/sd(y))

  • qqline(y) adds the 'perfect fit' 1:1 straight line.

    If you plotted standardised values of y, you could use abline(0,1) instead.

  • If you have a lot of observations, qqplot(y, pch='.', col='red') and qqline(y, col='blue') may give a clearer graph.

Doing a PP plot

To obtain a normal PP plot you could use these commands:

Giving:

 

    Note:
  • rank(y) gives the mean rank of each observation, and length(y) the number of observations. So pobs = (rank(y) - 0.5) / length(y) estimates the proportion of the population that would be less than or equal to each value of y - in other words the observed p-value of each observation.

  • pexp = pnorm(y, mean(y), sd(y)) uses the probability (cumulative) normal distribution function (pnorm) to calculate what proportion of a normal population, with our sample mean and standard deviation, we would expect to be less than or equal to each value of y.

  • plot(pexp, pobs) plots a scatterplot of the observed on the expected p-values.

  • abline(0,1) adds the 'perfect fit' line.

  • If you have a lot of observations, plot(pexp, pobs, pch='.',col='red') and abline(0,1,col='blue') may give a clearer graph.

Plotting observed values on rankits

To obtain a normal rankit plot you could use these commands:

Giving:

 

    Note:
  • n=length(y) obtains the number of values in y.

  • x=matrix(rnorm(n*5000,mean(y),sd(y)), ncol=5000) randomly selects 5000 sets of n values from a normal population with the sample mean and standard deviation and puts them into a matrix of 5000 columns and n rows.

    Given that rnorm(i,mean(y),sd(y)) = (rnorm(i)*sd(y))+mean(y), if you have already standardised y, you could simplify this line to x=matrix(rnorm(n*5000), ncol=5000)

  • x=apply(x,2,sort) sorts each column of n values (n being our sample size).

    The 2 in this instruction tells R to apply the sort function to each column.

  • x=apply(x,1,mean) obtains the mean of each row of 5000 replicates - as specified by the 1 in this instruction.

    If you are calculating rankits for a distribution which is symmetrical about zero, such as the standard normal population, the precision of these rankits can be improved by adding an instruction to combine the values for each tail - such as: for(i in 1:n){ x[i]=sign(x[i]) *0.5* (abs(x[i])+abs(x[n+1-i])) }

    If however, you are calculating rankits for a skewed population (such as a lognormal one) you should not combine the tails - and should use a median, rather than a mean, to find their location of your rankits - for example using: x=apply(x,1,median) Also, given the median is less precise than the mean, you may wish to increase the number of replicates from 5000 to, say, 10000.

  • Now we have our (normal) rankits we can plot them against the values in variable y. But, since the rankits are in ascending order, y=sort(y) ensures variable y is in the same order.

  • plot(x,y,xlab='rankits', ylab='observed values') plots observed on expected values, and labels the x & y axes.

  • abline(0,1) adds the 'perfect fit' straight line.

  • For a normal distribution, rankits are the (mean or median) expected location for each of n observations. With a sample as large as the one above this plot is virtually identical to a QQ plot. As we note in the core text of this Unit, rankits are a better way of assessing small samples.