InfluentialPoints.com
Biology, images, analysis, design...
"It has long been an axiom of mine that the little things are infinitely the most important" (Sherlock Holmes)


Beginners statistics:

Introduction

On this page: Example, Definition and Use, Tips and Notes, Test yourself, References

R is free, very powerful, and does the boring calculations and graphs for scientists.

Example, with R

Imagine you have been collecting data - for example the colour - on a number of items.

You could just list the colour of each one like this.

greenishyellow   yellowishgreen   blue   puregreen   blue   bluegreen

Alternatively you could summarize the data in various ways.

For instance:

  1. That list has 6 items.
  2. There are 5 different colours.
  3. The most common single colour is blue.
  4. Most of that list are shades of green - or greenish.
  5. If you select an item at random there is a 3/6 chance it will be a shade of green.
  6. The first colour listed is greenishyellow, the last is bluegreen - assuming that order is important. It may be arbitrary, or random.
  7. You could also say those colours range from blue to greenishyellow - but that assumes you rank them in brightness, or in spectral order, rather than in terms of their 'purity', or emotional appeal.

With so little data it is easy to do such summaries. But if you were summarizing very many results, say a million, you may prefer a bit of help. Some of those summaries can be obtained with R:
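A minimal sketch of how some of those summaries might be obtained, assuming the six colours are stored in a character vector. The variable names are our own choice, and treating any name that contains 'green' as greenish is just one possible definition:

```r
# the six observations, in the order they were listed
colours <- c("greenishyellow", "yellowishgreen", "blue",
             "puregreen", "blue", "bluegreen")

n_items <- length(colours)           # how many items are listed
n_kinds <- length(unique(colours))   # how many different colours there are
counts  <- table(colours)            # how often each colour occurs

# the most common colour(s) - there could be ties
commonest <- names(counts)[counts == max(counts)]

# ASSUMPTION: an item is 'greenish' if its name contains "green"
greenish   <- grepl("green", colours)
p_greenish <- mean(greenish)         # proportion of items that are greenish

first_colour <- colours[1]           # only meaningful if the order matters
last_colour  <- colours[n_items]

print(counts)
cat(n_items, "items,", n_kinds, "different colours; most common:",
    commonest, "\n")
```

Note that this crude rule counts 4 of the 6 items as greenish, whereas a stricter rule gives the 3/6 quoted above. How you define a category is itself an assumption.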



Definition and Use

The term 'statistic' can refer to several rather different things.

  1. A statistic summarizes or represents a set of information - most commonly as a single number. The term statistic is used both for the value and for the mathematical 'function' (usually an equation) used to obtain that value.

Many functions are available to summarize information. For example, a salesman could equally truthfully quote the typical cost ('on average...'), the maximum ('up to...') or the minimum ('from just $ 300'). The 'average', 'maximum' and 'minimum' are all statistics.
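As a rough illustration, here is how those three statistics could be calculated in R. The prices are entirely hypothetical (we made them up so that the cheapest item costs $ 300), and the variable names are our own:

```r
# HYPOTHETICAL prices (in $) of the items a salesman has on offer
prices <- c(300, 450, 500, 550, 1200)

average_price <- mean(prices)   # 'on average...'
highest_price <- max(prices)    # 'up to...'
lowest_price  <- min(prices)    # 'from just...'

cat("average:", average_price,
    " maximum:", highest_price,
    " minimum:", lowest_price, "\n")
```

All three numbers are true of the same five prices, yet each gives a rather different impression of what you are likely to pay.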

Note that summary statistics of a sample are often used as estimates for the population at large. For instance, when you are told 'the average man has 1.8 children', that result was found in a sample of men, because it is usually impossible to check every man.

  2. Statistics are also used to describe how sets of results varied, or to infer how they are liable to vary, or to infer how their summary statistic might be expected to vary. Inferential statistics are variously used to indicate how reliable an outcome is, or the probability it occurred by simple chance - given a simple (hopefully plausible) set of assumptions.

  3. When used in the plural, 'statistics' also describes 'the study of the collection, organization, analysis, interpretation and presentation of data'.

Humans, of course, use non-numerical summaries all of the time. For example, when you say 'cats are smaller than dogs' you are probably describing the average situation - however, some people assume you mean every cat is smaller than every dog.

Humans also use non-numerical estimates of probability, using a simple scale, ranging from impossible to certain. Research shows most people divide that scale into surprisingly few levels - seldom more than 7 - and have problems in dealing with very small probabilities.

Beware:
Every statistic, and every summary measure, involves some assumptions. If you do not know what is being assumed, whether those assumptions are met, or how deviations from those assumptions can affect the summary, expect to be misled!


Tips and Notes

Whilst simple numerical measures are a useful way to summarize data:

  • A single statistic, such as an average, is often a simplistic and misleading way to represent the information.
  • Since a picture can provide more information than a thousand words, and because it can be much easier to assess images than numbers, graphs are often a much more powerful way to present and explore data - assuming you and your audience can interpret them.

Since there are innumerable ways to summarize any set of information, and assuming no mistakes are made in making that summary, you should always ask yourself:

  1. Which is the most appropriate way to summarize the information at hand - and who decides what is appropriate, and how impartial is that decision?
  2. What information is being summarized, and how was it obtained - is the information detailed, consistent, plentiful, and does it comprise all the items of interest, or is it assumed to represent a larger set?

Governments and corporations have particular ideas of what summaries are appropriate, and may select their information and summary measures so as to achieve particular outcomes. Hence 'statistics' are commonly seen as 'lies using numbers'.

Nevertheless, since statistics are used for all sorts of important things, and because we all use statistics (consciously or otherwise) it is wise to understand something of their properties - and we do not mean you merely need to know how to calculate them, or to memorize the results of those calculations!

It is easy to get a computer to calculate a statistic; the hard part is knowing whether the result means anything - and how it may be misleading.


Test yourself

Consider this set of children's test-scores:
11, 35, 36, 37, 38, 99, 104, 105, 417
  • there are 9 scores
  • every score is equally common
  • the average (arithmetic mean) score is 98
  • they range from 11 to 417
  • they fall into 2 groups, plus two extreme values

Notice that:

  1. Which, if any, of those summaries is appropriate depends upon how the summary is to be used.
  2. How those few values were obtained is unstated - are they trustworthy, or supposed to be representative?
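If you want to check those summaries for yourself, something like the following should do it. This is only a sketch: the variable names are our own, and we have added the median (the mid-ranking score) for comparison with the mean:

```r
# the nine test-scores listed above
scores <- c(11, 35, 36, 37, 38, 99, 104, 105, 417)

n_scores     <- length(scores)   # how many scores there are
score_counts <- table(scores)    # how often each score occurs
mean_score   <- mean(scores)     # the arithmetic mean
score_range  <- range(scores)    # the smallest and largest score
mid_score    <- median(scores)   # the mid-ranking score

print(score_counts)
cat(n_scores, "scores, mean", mean_score,
    ", median", mid_score,
    ", range", score_range[1], "to", score_range[2], "\n")
```

The mean (98) is more than twice the median (38), largely because of the single score of 417 - so which one is 'typical' depends upon what you want the summary for.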
 

Consider this summary of salaries of a small company:

  • 1 member of staff received $ 32708400 that year
  • 1104 members of staff received $ 400 that year

The company said 'since their average salary was $ 30000 our staff receive far above the industry average of $ 15000 per year'.

Do you think this is an appropriate way to summarize their data?

Would the mid-ranking salary (the median) be a better measure of what their average member of staff gets?

It is easy to work out the numbers with R:
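One way to do so, as a minimal sketch. It assumes the two bullet points above describe every member of staff, and the variable names are our own:

```r
# one salary of $ 32708400, plus 1104 salaries of $ 400
salaries <- c(32708400, rep(400, 1104))

n_staff       <- length(salaries)   # total number of staff
mean_salary   <- mean(salaries)     # the 'average' the company quoted
median_salary <- median(salaries)   # the mid-ranking salary

# what proportion of staff received less than the mean salary?
p_below_mean <- mean(salaries < mean_salary)

cat(n_staff, "staff, mean salary $", mean_salary,
    ", median salary $", median_salary, "\n")
cat("proportion receiving less than the mean:", p_below_mean, "\n")
```

The company's arithmetic is correct, because the mean really is $ 30000. However, all but one of the 1105 staff received $ 400, which is what the median reports.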

Note, on most pages we provide the R-code, and a few comments/notes, and expect you to ask appropriate questions and reach your own conclusions. The object of these pages is to promote thought, not simply to impart information.


Useful references

Wikipedia: Statistic.

Wikipedia: Statistics.
