InfluentialPoints.com
Biology, images, analysis, design...
Use/Abuse Stat.Book Beginners Stats & R
"It has long been an axiom of mine that the little things are infinitely the most important" (Sherlock Holmes)

 

 

Why should anyone use a tool as complex as simulation-modelling to understand simple statistics?

The idea is not as deranged as it might first appear:

  • Monte Carlo simulation models enable you to avoid much of the algebra normally required to explore ordinary statistical models.
  • Constructing simulation models tends to reveal the reasoning and assumptions underlying a statistical analysis.
  • Simulation models enable you to explore the behaviour of statistics where their assumptions are met.
  • They enable you to predict their behaviour when some of those assumptions are not met.
  • They are the only way to predict how some statistics behave - even approximately.

"There is no substitute for understanding what you are doing."

Loren P. Meissner

If you cannot immediately see why these properties are useful to students, ask yourself which of these definitions is more useful.

  1. The standard error of a sample, of sample size n, is the sample's standard deviation divided by the square root of n.
    (E.g. see WolframMathWorld  the web's most extensive mathematics' resource.)
  2. The standard error is the standard deviation of a statistic's sampling distribution - or an estimate thereof.

If you feel the first definition is more appropriate at an 'elementary' or 'practical' level, please bear in mind that, with suitable off-the-shelf software and minimal training personal computers enable virtually anyone to calculate remarkably complex statistics. In which case, the only point in 'understanding statistics' is to interpret the results thereof.

  • Note: When we talk about 'elementary' statistics, we are not assuming students are 5-year-olds, we are simply referring to calculations which statisticians treat as trivial, that involve principles too obvious or too complex to explain at an 'introductory' level.

In practice any useful 'understanding' must enable students to answer questions such as:

  • How reliable or useful is this statistic?
  • What is its underlying reasoning?
  • What is being assumed when using it?
  • How should I react when those assumptions are untrue?

Then ask yourself how enables anyone to answer those questions, even at the most elementary level. The second definition may not enable you to answer them immediately, but it does at least provide a starting point in doing so.

 

Let us consider how Monte Carlo, using can provide such understanding.

"Far better an approximate answer to the right question, which is often vague, than the exact answer to the wrong question, which can always be made precise."

John W. Tukey

To illustrate the power of simulation modelling in understanding statistics it helps if we begin with a familiar statistic (such as the simple arithmetic mean) then build up our simulations from scratch using elementary R code, and see what emerges in the process.

    In doing so, we are not trying to teach elementary statistics to anyone who already understands them, we are considering what can be revealed to those who do not. In other words, what follows is not a neatly-ordered academic discourse, nor any kind of 'lesson-plan', but a practical exploration of what is possible. It does not assume the reader understands maths, or knows anything much about R, but it does assume one is prepared to 'think outside the box'.

 

An unavoidable first step in calculating a mean is to have some numbers to calculate it from. The following code assigns an arbitrary set of values to a variable called then reveals its contents, then selects all of those values, and calculates their mean.

    These are the results R gave us:
    > Y # reveal the contents of Y
    [1]  1.1  2.2  3.3  4.4  5.5  6.6  8.0 10.0 99.9
    > y=Y # select all of Y, in their original order
    > m=mean(y) # find the mean of y
    > m # show the value of that mean
    [1] 15.66667
    
  • Note: comments to the right of a hash mark are for your benefit, not R's.
  • Yes, these instructions to R are not as simple or neat as they could be, but this is an exercise in statistics, not programming.

Assuming nothing has gone wrong with our computer, it does not matter how many times we recalculate that mean, it will not vary. In other words, whilst we cannot assume our computer is 100% accurate, we do expect it to be 100% precise.

To check that assumption is correct, the following code asks R to calculate that mean ten thousand times from the same data, and save each result in a variable called , then summarize the results.

    These are the results R gave us
    > summary(m) # give a 5 point summary
       Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
      15.67   15.67   15.67   15.67   15.67   15.67
    > sd(m) # show the standard deviation of those means
    [1] 0
    
  • Note: Those instructions to R assume we ran the earlier set of instructions.
  • The function gives the 'sample' standard deviation. So would give the standard deviation of Y.

Before going any further, please invest a few moments ensuring you are absolutely certain what those instructions to R were doing.

"There's no sense in being precise when you don't even know what you're talking about."

John von Neumann

If it is of help, there are several ways  to graphically display how these sample means are distributed. But, to avoid becoming bogged down, we leave their interpretation to you.

Bearing in mind our definitions of the standard error of the mean, let us use one of our samples to compare the standard deviation of the 10000 means (obtained above ) with their standard deviation estimated using

    > sd(y)/sqrt(n) # estimate their standard error
    [1] 10.57061
    > sd(m) # give their standard deviation
    [1] 0
    
  • Note: estimates the standard deviation of the sampling distribution of a simple arithmetic mean calculated from n observations - provided certain assumptions are met.
  • Put more simplistically, estimates how means vary, given certain assumptions.

Given all of which it should be blindingly obvious that the standard error formula must be assuming something which is missing from the simulation model above 

 

Understanding sampling distributions is useful

At this point it might be a good idea to consider a highly-important term which is largely ignored in conventional elementary statistics courses: the sampling distribution - that being the distribution of a given statistic. The concept of sampling distributions is central to statistical tests and confidence intervals.

In this case the statistic is the mean of a sample, otherwise known as a sample mean. Clearly the standard error formula must be assuming how sample means can vary - and must assume that variation is predictable.

The most popular model by which variation is predicted requires you assume that variation is entirely random - in other words the variation of those means arises from 'sampling variation'. In which case the obvious recourse is to randomly select the values from which those means are calculated.

  • Note: by 'random' we do not mean that you can just put in any results that you happen to feel like, or that you can select whatever results are most convenient! A random outcome means that one result should not be predictable from the previous ones (no sequences) and, in most cases that, of the available outcomes, each is equally likely (which is what you assume when you toss a coin, or roll a dice).
  • Random selection and allocation form a central, absolutely crucial, part of statistical inference.

Since random selection is a common sort of requirement, R provides a function which does exactly that. Given an appropriate variable from which to select, the function returns a random selection of values thereof.

For instance:

    > y = sample(Y) # randomly sample the values in Y
    > y # show the values selected
    [1]  2.2  3.3  8.0  4.4  6.6  1.1  5.5 10.0 99.9
    > Y # show the original values
    [1]  1.1  2.2  3.3  4.4  5.5  6.6  8.0 10.0 99.9
    
  • Notice that, by default, all the available values are selected without replacement, so this merely provides the contents of Y in random order.

Since the order of values from which a mean is calculated ought not to affect the result, sampling every possible value, where each can only appear once, cannot produce any variation in those means.

  • If that statement was not self-evident, try entering the following code!

      > sd(m) # this is what we obtained
      [1] 0
      > sd(y)/sqrt(n) # estimate their standard error
      [1] 10.57061
      

 

Since random variation assumes you cannot predict successive outcomes, you might argue that selecting without replacement is not really 100% random. So let us see what happens if each value is replaced immediately after it is selected, in other words, we sample with replacement.

  • Notice that sampling a finite set (of 9) values with replacement produces the same result as sampling an infinitely-large set of those same values (with or without replacement).

Now let us estimate how such variation would cause these sample means to vary.

> sd(m) # this is what we obtained
[1] 9.861535
> sd(y)/sqrt(n) # the conventional estimate
[1] 14.11489
  • Notice these sample means show some variation, but not apparently as much as predicted by the conventional estimate. However, there is one (or more) very good reasons why the results are not the same.
  • Since these selection are done randomly, both of these results are liable to vary each time this code is run.
  • If R was very small the variation in would be considerable but, given enough replicates, we would expect a consistent result.
  • However, when the contents of each random sample varies, using only gives the standard error estimated from the very last sample we took. This is unfortunate because we have no idea how 'typical' that particular sample was.

To obtain a fairer comparison, let us do two things:

  1. Record the standard error estimated for each sample, of n (=9) values, using then use the most typical result (their median value) for comparison.
  2. Estimate the standard error using where s is the standard deviation of the entire set of (=90000) values - in other words a large random sample.

    > sd(m) # this is what we obtained
    [1] 9.918693
    > median(se) # the most typical sample se
    [1] 10.53685
    > sd(all_y)/sqrt(n) # the combined sample se
    [1] 9.945413
    

If you re-enter that code several times, it should be obvious that the standard deviation of sample means is much the same as the standard error estimated using when s is the standard deviation of a very large sample (90000 values). However, when the standard error is estimated from a small sample (just 9 values) tends to be rather too large - in other words it is a biased small-sample estimator. We return to this issue below.

Remember, the textbook formula is supposed to be estimating how random selection causes sample means to vary - not the other way round!

  • Below are the result of 10 runs of that model, expressed as 3 rugplots, plus the estimated sampling distribution of these means.

  • Notice the distribution of these sample means is neither smooth nor symmetrical about their median.
  • This lack of symmetry makes the standard deviation of these means a rather poor measure of their variation.
  • The standard deviation and mean are parameters of a 'normal distribution'.
  • Specifying a mean +/- standard error is rather misleading when those means have a very non-normal sampling distribution - especially if that distribution is skewed...

Clearly the textbook standard error formula is not only assuming the values are sampled at random.

 

How realistic are these models?

"What men really want is not knowledge but certainty."

Bertrand Russell

At this point it might be as well to pause, and consider what on earth these models are supposed to be simulating. Trying to understand the properties of sample means and standard deviations is all very well, but it is hard to imagine these simulation models (or statistical models) tell us anything useful about the real world. Mind you, thus far the same can be said of the values we calculated our statistics from - garbage in, garbage out, remember?

A moment's thought reveals that we are dealing with two very different issues:

  1. How we selected the values.
  2. What other values could have been selected.
An excellent reason for randomly selecting values is it enables us to avoid bias.
  • If we ignore any other possible outcome, our conclusions can only apply to the particular results we obtained - which would make surveys and experiments somewhat pointless.
  • If a statistical analysis ignores how the results were obtained, the results can be 100% misleading.

These conclusions are true of any statistical analysis, no matter how 'elementary'. Calculating statistics such as the mean and standard deviation may describe the results at hand, but are otherwise meaningless. Statistics such as standard errors do not merely describe a set of observed results, they enable you to infer something about how they might behave - assuming the samples could be repeated.

  • The problem with our simulation models thus far (and ) is they ignore how the values in variable were obtained, or where from.

So where does the textbook standard error formula assume values are selected from, and how does it assume they are selected?

Given the importance of the normal distribution in conventional elementary statistics courses, you might assume that observations should be normally-distributed. Or more correctly, the values are 'randomly selected from a normally-distributed population'. (If you are new to statistics a population is any defined, fixed set of values, from which samples may be drawn).

Using R, it is easy to simulate such a sample, for instance as follows:

    > qnorm(runif(n))
    [1]  0.5428250 -0.4163659
    
  • The function yields randomly selected values which, by default, are from zero to one, in other words these are random uniform quantiles. These values are rescaled to normal quantiles by the function. (To obtain random normal quantiles directly, use the function.)

Notice that these simple instructions make some crucial, if non-obvious, assumptions:

  1. The sample size is set by us - in other word it is fixed, so cannot vary at random. In real surveys or experiments that assumption is not always true, but most analyses ignore the fact.
  2. The mean (=0) and standard deviation (=1) of that normal population is also fixed, if only by default.
  3. Because normal populations contain an infinite number of slightly different values, provided each value is measured accurately, the chance of selecting the same value twice is effectively zero. So it makes no difference whether each value is replaced immediately after selection.

 

Small sample behaviour

Now for an acid test. Let us:

  • Select a sample of values, then record their mean in variable then record their 'sample standard error' in variable calculated as .
  • Repeat that a lot of times (say ).
  • Compare the standard deviation of those 10000 means with the standard errors we have estimated.

    > summary(se) # summarize the estimates
         Min.   1st Qu.    Median      Mean   3rd Qu.      Max.
    9.466e-06 2.238e-01 4.770e-01 5.639e-01 8.124e-01 3.035e+00
    > sd(m) # give the observed se
    [1] 0.7118297
    
  • A normal-quantile plot reveals that (unlike our previous sample means) these means are effectively normal (thus smoothly and symmetrically distributed) which should not surprise you as the values they were calculated from were also normal.
 

  • Provided our sample means are distributed normally, since the standard deviation specifies how any normal distribution is dispersed, the standard error is a useful way to describe sampling distribution of those means.
  • But does provide an unbiased estimate thereof?
  • Since a standard normal distribution has a standard deviation = 1 the standard error of those means should be

 

Now we can compare the standard error of these means, estimated from each sample, with the standard deviation of those means obtained from samples. In doing so, remember the standard errors estimated from each sample and the standard deviation of their sample means are both estimates of the true standard error (where is infinite). So you may want to rerun that model instructions several times.

Applying the standard error formula to our population parameters tells us we should expect a standard error of 0.7071068, if we took the standard deviation of an infinite number of sample means.

  • Notice that seems to give a closer estimate than . This is scarcely surprising - each sample standard error estimate is based upon just n (=2) observations, whereas the standard deviation of sample means used 20000 observations.
  • More importantly, notice that the textbook formula is giving a biased (low) estimate of the standard error, even though these random samples are of a normally distributed values.

 

Large sample behaviour

The following instructions are identical to those above  except we have fixed the sample size at n=1000 instead of n=2.

    > summary(se) # summarize the estimates
       Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
    0.02887 0.03115 0.03162 0.03162 0.03209 0.03429
    > sd(m) # give the observed se
    [1] 0.03144545
    > 1/sqrt(n) # give the expected se
    [1] 0.03162278
    
  • Notice these sample standard errors are much closer to what they ought to be.
  • We may conclude that the textbook standard error formula assumes we are dealing with extremely large (in fact infinite) samples.

 

Samples of non-normal data

"Half of wisdom is learning what to unlearn."

Larry Niven
The Ringworld Throne (1996)

But does the textbook standard error of the mean formula assume samples are of a normal population? The following instructions are identical to those above  except they sample a standard uniformly distributed population. (Every value we select is equally likely to lie anywhere between zero and one.)

If we do not know the formula for calculating the standard deviation of a uniform population, one simple solution is to take a very large sample thereof, and calculate its standard deviation. We then divide it by root n to get the standard error.

        Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
    0.008647 0.009039 0.009128 0.009128 0.009217 0.009611
    > sd(m) # give the observed se
    [1] 0.009156626
    > # give the expected se
    [1] 0.009131692
    
  • The function yields randomly selected values which, by default, are from zero to one.

Clearly the value obtained by the textbook standard error-of-the-mean formula is just as applicable for (large) samples of a uniform population as it is for a large samples of a normal population.

 

Very non normal data

Uniformly and normally distributed data have some important properties in common:

  • The distribution of values being sampled is a smoothly and symmetrically distributed.
  • It is extremely unlikely a sample will contain two identical values.

Let us consider what happens if we take large samples of data which are noticeably skewed, and where each sample will contain many identical (tied) values. You may recall  that sampling a finite set of values with replacement produces the same result as sampling an infinitely-large set of those values. You may also recall that, when small samples were used, the textbook standard error formula gave a biased estimate of the standard deviation of the sampling distribution of means, and that sampling distribution was anything but normal.

    > summary(se) # summarize the estimates
       Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
     0.7798  0.9196  0.9452  0.9446  0.9700  1.0650
    > sd(m) # give the observed se
    [1] 0.9501475
    > # give the expected se
    > sd(sample(Y, size=1000000, replace=TRUE))/sqrt(n)
    [1] 0.9454088
    

The similarity of these estimates suggests that the textbook standard error formula is applicable for means of samples drawn from any given population, provided those samples are sufficiently large and obtained entirely randomly. In which case provides a quick and easy estimate.

However, as should now be increasingly apparent, there are some important practical difficulties with that nice simple answer. For instance:

  • Real samples are often modest to small.
  • Some populations may cause this simple formula to approach its large-sample behaviour more slowly than others.
  • Standard errors of other statistics may behave very differently from those of simple means.
  • Many real studies do not obtain their data by simple random sampling of a single fixed population of potential results.

Nevertheless, even with the tools we have provided on this page, most students should be able to envisage ways of investigating these issues - with very practical applications.

"Progress is impossible without change, and those who cannot change their minds cannot change anything."

George Bernard Shaw