"It has long been an axiom of mine that the little things are infinitely the most important" |
Using Monte Carlo to learn about statistics

A new & powerful approach to teaching & learning
(Not so much a screwdriver as a whole new toolkit)

On this page:
  Why should anyone use a tool as complex as simulation-modelling to understand simple statistics?
  Let us consider how Monte Carlo, using R, can provide such understanding
  Understanding sampling distributions is useful
  How realistic are these models?
  Small sample behaviour
  Large sample behaviour
  Samples of non-normal data
  Very non-normal data

Why should anyone use a tool as complex as simulation-modelling to understand simple statistics?

The idea is not as deranged as it might first appear:
If you cannot immediately see why these properties are useful to students, ask yourself which of these definitions is more useful.
If you feel the first definition is more appropriate at an 'elementary' or 'practical' level, please bear in mind that, with suitable off-the-shelf software and minimal training, personal computers enable virtually anyone to calculate remarkably complex statistics. In which case, the only point in 'understanding statistics' is to interpret the results thereof.
In practice any useful 'understanding' must enable students to answer questions such as:
Then ask yourself how such an understanding might be achieved.

Let us consider how Monte Carlo, using R, can provide such understanding
To illustrate the power of simulation modelling in understanding statistics, it helps if we begin with a familiar statistic (such as the simple arithmetic mean), then build up our simulations from scratch using elementary R code, and see what emerges in the process.
An unavoidable first step in calculating a mean is to have some numbers to calculate it from. The following code assigns an arbitrary set of values to a variable called Y, then reveals its contents, then selects all of those values and calculates their mean.
These are the results R gave us:
> Y # reveal the contents of Y
Assuming nothing has gone wrong with our computer, it does not matter how many times we recalculate that mean, it will not vary. In other words, whilst we cannot assume our computer is 100% accurate, we do expect it to be 100% precise. To check that assumption is correct, the following code asks R to calculate that mean ten thousand times from the same data, and save each result in a variable called m, then summarize the results.
These are the results R gave us:
> summary(m) # give a 5 point summary
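For comparison, the same check can be sketched outside R. The following is a minimal Python equivalent mirroring those steps, with hypothetical values standing in for Y (the original data are not shown here); any fixed set of numbers behaves the same way.

```python
import statistics

# Hypothetical values standing in for the R variable Y
Y = [4.0, 7.0, 1.0, 9.0, 3.0, 6.0, 2.0, 8.0]

# Recalculate the mean 10000 times from the same, unchanged data
m = [statistics.mean(Y) for _ in range(10000)]

# Every recalculation gives exactly the same answer: zero spread
print(min(m), max(m), statistics.stdev(m))
```

As expected, all ten thousand means are identical, and their standard deviation is exactly zero.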
Before going any further, please invest a few moments ensuring you are absolutely certain what those instructions to R were doing.
If it is of help, there are several ways to graphically display how these sample means are distributed. But, to avoid becoming bogged down, we leave their interpretation to you.
Bearing in mind our definitions of the standard error of the mean, let us use one of our samples to compare the standard deviation of the 10000 means (obtained above) with their standard error, estimated using the textbook formula s/√n.
> sd(y)/sqrt(n) # estimate their standard error
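To see that mismatch numerically, here is a hedged Python sketch (hypothetical values again): the standard deviation of the recalculated means is exactly zero, whereas the textbook formula predicts something decidedly non-zero.

```python
import math
import statistics

y = [4.0, 7.0, 1.0, 9.0, 3.0, 6.0, 2.0, 8.0]  # hypothetical sample
n = len(y)

# Recalculating the mean of fixed data cannot produce any variation...
sd_of_means = statistics.stdev([statistics.mean(y)] * 10000)

# ...yet s/sqrt(n) predicts a non-zero standard error
se_formula = statistics.stdev(y) / math.sqrt(n)

print(sd_of_means)  # 0.0
print(se_formula)   # clearly greater than zero
```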
Given all of which, it should be blindingly obvious that the standard error formula must be assuming something which is missing from the simulation model above.

Understanding sampling distributions is useful

At this point it might be a good idea to consider a highly-important term which is largely ignored in conventional elementary statistics courses: the sampling distribution - that being the distribution of a given statistic. The concept of sampling distributions is central to statistical tests and confidence intervals. In this case the statistic is the mean of a sample, otherwise known as a sample mean. Clearly the standard error formula is attempting to predict how such sample means vary. The most popular model by which that variation is predicted requires you to assume the variation is entirely random - in other words, that the variation of those means arises from 'sampling variation'. In which case the obvious recourse is to randomly select the values from which those means are calculated.
Since random selection is a common sort of requirement, R provides a function, sample(), which does exactly that. Given an appropriate variable from which to select, the function returns a random selection of its values. For instance:
> y = sample(Y) # randomly sample the values in Y
Since the order of values from which a mean is calculated ought not to affect the result, sampling every possible value, where each can only appear once, cannot produce any variation in those means.
> sd(m) # this is what we obtained
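In Python the same permutation experiment might look like this (hypothetical values as before); random.sample with k equal to the full length permutes the values without replacement, much as R's sample(Y) does:

```python
import random
import statistics

Y = [4.0, 7.0, 1.0, 9.0, 3.0, 6.0, 2.0, 8.0]  # hypothetical values

# Permute Y (sampling without replacement) and take the mean, 10000 times
m = [statistics.mean(random.sample(Y, len(Y))) for _ in range(10000)]

# Reordering cannot change a mean, so the means show no variation at all
print(statistics.stdev(m))  # 0.0
```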
Since random variation assumes you cannot predict successive outcomes, you might argue that selecting without replacement is not really 100% random. So let us see what happens if each value is replaced immediately after it is selected, in other words, we sample with replacement.
Now let us estimate how such variation would cause these sample means to vary.
> sd(m) # this is what we obtained
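Sampling with replacement can be sketched the same way; in this hedged Python version, random.choices plays the role of R's sample with replace=TRUE, and the means now vary noticeably:

```python
import random
import statistics

random.seed(1)  # fixed seed, so the run is repeatable
Y = [4.0, 7.0, 1.0, 9.0, 3.0, 6.0, 2.0, 8.0]  # hypothetical values
n = len(Y)

# Sample with replacement and take the mean, 10000 times
m = [statistics.mean(random.choices(Y, k=n)) for _ in range(10000)]

print(statistics.stdev(m))  # clearly non-zero this time
```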
To obtain a fairer comparison, let us do two things:
> sd(m) # this is what we obtained

If you re-enter that code several times, it should be obvious that the standard deviation of the sample means is much the same as the standard error estimated using sd(y)/sqrt(n). Remember, the textbook formula assumes the values are sampled at random.
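That fairer comparison can also be sketched in Python: the standard deviation of many with-replacement means is set alongside the s/sqrt(n) estimate from a single such sample (hypothetical values as before).

```python
import math
import random
import statistics

random.seed(2)
Y = [4.0, 7.0, 1.0, 9.0, 3.0, 6.0, 2.0, 8.0]  # hypothetical values
n = len(Y)

# Standard deviation of 10000 means of with-replacement samples...
m = [statistics.mean(random.choices(Y, k=n)) for _ in range(10000)]
sd_of_means = statistics.stdev(m)

# ...versus the textbook estimate from one such sample
y = random.choices(Y, k=n)
se_formula = statistics.stdev(y) / math.sqrt(n)

print(sd_of_means, se_formula)  # compare the two
```

Because se_formula comes from a single small sample, it will wander from run to run, but it is of the same order as the observed variation of the means.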
Clearly the textbook standard error formula is not only assuming the values are sampled at random.

How realistic are these models?
At this point it might be as well to pause, and consider what on earth these models are supposed to be simulating. Trying to understand the properties of sample means and standard deviations is all very well, but it is hard to imagine these simulation models (or statistical models) tell us anything useful about the real world. Mind you, thus far the same can be said of the values we calculated our statistics from - garbage in, garbage out, remember? A moment's thought reveals that we are dealing with two very different issues:
These conclusions are true of any statistical analysis, no matter how 'elementary'. Statistics such as the mean and standard deviation may describe the results at hand, but they are otherwise meaningless. Statistics such as standard errors do not merely describe a set of observed results: they enable you to infer something about how those results might behave - assuming the samples could be repeated.
So where does the textbook standard error formula assume values are selected from, and how does it assume they are selected? Given the importance of the normal distribution in conventional elementary statistics courses, you might assume that observations should be normally-distributed. Or more correctly, that the values are 'randomly selected from a normally-distributed population'. (If you are new to statistics, a population is any defined, fixed set of values from which samples may be drawn.) Using R, it is easy to simulate such a sample, for instance as follows:
> qnorm(runif(n)) # n random values from a standard normal distribution
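The same inverse-CDF trick is available in Python's standard library: statistics.NormalDist().inv_cdf() plays the role of qnorm(), and random.random() that of runif(). A minimal sketch:

```python
import random
import statistics

random.seed(3)
n = 1000

# Push uniform (0,1) values through the normal quantile function,
# as qnorm(runif(n)) does in R, to get standard-normal random values
nd = statistics.NormalDist(mu=0.0, sigma=1.0)
y = [nd.inv_cdf(random.random()) for _ in range(n)]

print(statistics.mean(y))   # close to 0
print(statistics.stdev(y))  # close to 1
```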
Notice that these simple instructions make some crucial, if non-obvious, assumptions:
Small sample behaviour

Now for an acid test. Let us:
> summary(se) # summarize the estimates
Now we can compare the standard error of these means, estimated from each sample, with the standard deviation of those means obtained from our repeated samples. In doing so, remember the standard errors estimated from each sample and the standard deviation of their sample means are both estimates of the true standard error (which assumes the number of samples is infinite). So you may want to rerun those model instructions several times. Applying the standard error formula to our population parameters tells us we should expect a standard error of 0.7071068 if we took the standard deviation of an infinite number of sample means.
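The n=2 acid test translates directly. This hedged Python sketch draws 10000 two-value normal samples, recording each sample's mean and its s/sqrt(n) estimate:

```python
import math
import random
import statistics

random.seed(4)
nd = statistics.NormalDist()  # standard normal population
n, reps = 2, 10000

means, ses = [], []
for _ in range(reps):
    y = [nd.inv_cdf(random.random()) for _ in range(n)]
    means.append(statistics.mean(y))
    ses.append(statistics.stdev(y) / math.sqrt(n))

# The sd of the means should be near 1/sqrt(2) = 0.7071, but the
# individual s/sqrt(n) estimates scatter wildly when n is only 2
print(statistics.stdev(means))
print(min(ses), max(ses))
```

The standard deviation of the means lands close to the theoretical 0.7071, while individual estimates range from nearly zero to well above one - which is the point of the acid test.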
Large sample behaviour

The following instructions are identical to those above, except we have fixed the sample size at n=1000 instead of n=2.
> summary(se) # summarize the estimates
Samples of non-normal data
But does the textbook standard error of the mean formula assume samples are of a normal population? The following instructions are identical to those above except they sample a standard uniformly distributed population. (Every value we select is equally likely to lie anywhere between zero and one.) If we do not know the formula for calculating the standard deviation of a uniform population, one simple solution is to take a very large sample thereof, and calculate its standard deviation. We then divide it by root n to get the standard error.
Min. 1st Qu. Median Mean 3rd Qu. Max.
Clearly the value obtained by the textbook standard-error-of-the-mean formula is just as applicable for (large) samples of a uniform population as it is for large samples of a normal population.
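A Python sketch of the uniform case: first estimate the population standard deviation from one very large sample (the exact value is 1/sqrt(12), about 0.2887), then compare the sd of many large-sample means against that estimate divided by root n.

```python
import math
import random
import statistics

random.seed(5)

# Estimate the sd of a standard uniform population from one huge sample
big = [random.random() for _ in range(100000)]
pop_sd = statistics.stdev(big)  # should be near 1/sqrt(12) = 0.2887

# Compare the sd of many large-sample means with pop_sd / sqrt(n)
n, reps = 1000, 2000
means = [statistics.fmean(random.random() for _ in range(n))
         for _ in range(reps)]

print(statistics.stdev(means))   # observed variation of the means
print(pop_sd / math.sqrt(n))     # textbook prediction, about 0.0091
```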
Very non-normal data

Uniformly and normally distributed data have some important properties in common:
Let us consider what happens if we take large samples of data which are noticeably skewed, and where each sample will contain many identical (tied) values. You may recall that sampling a finite set of values with replacement produces the same result as sampling an infinitely-large set of those values. You may also recall that, when small samples were used, the textbook standard error formula gave rather erratic estimates.
> summary(se) # summarize the estimates
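As a final hedged sketch, here is a deliberately skewed, heavily tied population (the values are invented for illustration): mostly zeros, some ones, and a few large outliers. Large with-replacement samples still give s/sqrt(n) estimates that track the standard deviation of the sample means.

```python
import math
import random
import statistics

random.seed(6)

# An invented, strongly skewed population with many tied values
pop = [0] * 70 + [1] * 20 + [5] * 8 + [20] * 2

n, reps = 1000, 2000
means, ses = [], []
for _ in range(reps):
    y = random.choices(pop, k=n)  # a large random sample, with replacement
    means.append(statistics.fmean(y))
    ses.append(statistics.stdev(y) / math.sqrt(n))

# Despite the skew and the ties, the two estimates should agree closely
print(statistics.stdev(means))
print(statistics.mean(ses))
```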
The similarity of these estimates suggests that the textbook standard error formula is applicable for means of samples drawn from any given population, provided those samples are sufficiently large and obtained entirely randomly. However, as should now be increasingly apparent, there are some important practical difficulties with that nice simple answer. For instance:
Nevertheless, even with the tools we have provided on this page, most students should be able to envisage ways of investigating these issues - with very practical applications.