"It has long been an axiom of mine that the little things are infinitely the most important" (Sherlock Holmes)




Species diversity is an important property of biological communities, and various measures have been devised to summarize it. Most studies summarize the diversity of species within a single well-documented taxonomic group - such as the birds, or mammals, or the butterflies or moths.

Species diversity measures assume each member of a given species is identical, that each species is equally genetically different, and that primary producers, top predators, and parasites are of equal importance. More crucially, for conservationists, the standard diversity indices do not allow for the worldwide abundance of each species. This is bad news for endangered species, albeit useful to some large corporations and governments.


Therefore, however mathematically complex the indices listed below may appear, try to bear in mind that in reality they are very crude estimates of diversity - and tell you little or nothing about the conservation value of a community.

"A great deal of time and expertise has been expended on the compilation of faunal lists for particular habitats,

but the consequent increase in our understanding ... is still meagre."
Southwood & Henderson (2000)
Ecological methods, Blackwell Science


Species richness and rarefaction

The oldest and simplest measure of diversity is species richness - the total number of species (S) within a given community. One problem with this measure is that it is rarely possible to observe every species within a community - in other words to conduct a complete census. This can only be done if you examine a very small area.

In practice, the longer you search, or the bigger the area you examine, the more species you can expect to observe. You are likely to observe the most common species fairly quickly, but in time the number of species discovered per unit effort invested will level off - although you can never be quite certain you have found them all.

One, very practical, implication of this arises when we try to compare survey results - for instance where one survey finds 231 individuals and 20 species, and another survey finds 157 individuals and 12 species. A rarefaction model may be used to estimate how many species the first survey would be expected to find if 157 individuals were randomly selected from the first collection.
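As a rough illustration, the expected species count in a random subsample can be worked out with the hypergeometric rarefaction formula usually attributed to Hurlbert (1971), which the text above does not name. The abundances below are invented so that they total 231 individuals in 20 species; they are not the data from either survey, and the function name is our own.

```python
from math import comb

def rarefied_richness(abundances, m):
    """Expected number of species among m individuals drawn at random,
    without replacement, from the pooled collection."""
    total = sum(abundances)
    if not 0 <= m <= total:
        raise ValueError("subsample size must lie between 0 and the total")
    all_draws = comb(total, m)
    # A species is missed only if all m draws come from the other species.
    return sum(1 - comb(total - n_i, m) / all_draws for n_i in abundances)

# Hypothetical abundances: 20 species, 231 individuals in all.
first_survey = [60, 40, 30, 20, 15, 12, 10, 8, 7, 6,
                5, 4, 3, 3, 2, 2, 1, 1, 1, 1]

# Species the first survey would be expected to yield from only 157 individuals.
expected_at_157 = rarefied_richness(first_survey, 157)
```

The result can then be set against the 12 species found by the second survey, bearing in mind that it inherits every assumption of the rarefaction model.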

An important weakness of the rarefaction model is it assumes that, once you have allowed for their relative abundances, every individual of each species is equally likely to be detected - and that your effort in finding them remains constant. Aside from the human failings inherent to such surveys, species that are aggregated, seasonal, man-shy, trap-shy, nocturnal or camouflaged will unavoidably bias this measure - and change the variability of your estimate from that predicted by the rarefaction model.


These points aside, an additional constraint in estimating the number of species in a community arises because virtually no community is wholly free from immigrants and visitors. This renders attempts to enumerate every species rather futile. The alternative to spending the rest of your life attempting to identify every species is to perform a standardized but thorough survey - and proceed from there.


Species heterogeneity measures

Instead of merely attempting to estimate how many species a community contains, various measures have been devised which allow for both the number of species and the relative abundance of each species. For large animals the most common measure of abundance is the number of individuals observed for each species but, for plants, ground cover may be a better measure - or biomass.

A surprising number of species heterogeneity indices have been developed, of which the simplest is the Berger-Parker dominance index - the proportion of all the organisms recorded which belong to the most common species. However, for good or ill, this index is insensitive to the number of less abundant species in a community. In other words the Berger-Parker dominance index is wasteful of information - and it is seldom used.
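A minimal sketch of the dominance index, using two made-up communities, shows why it is described as wasteful of information: both share the same dominant proportion although one has far more minor species. The function name and the data are ours.

```python
def berger_parker(abundances):
    """Proportion of all recorded individuals belonging to the most common species."""
    total = sum(abundances)
    if total <= 0:
        raise ValueError("need at least one individual")
    return max(abundances) / total

# Hypothetical communities: same dominant species, very different tails.
few_rare = [50, 25, 25]          # 3 species
many_rare = [50] + [5] * 10      # 11 species

d_few = berger_parker(few_rare)
d_many = berger_parker(many_rare)
```

Both communities have 100 individuals of which 50 belong to the commonest species, so the index cannot tell them apart.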

Strangely enough, the ratio between the total number of individuals in the community (N) and the number of individuals in the rarest species (Nmin) is often discussed (as J = N / Nmin) - even though, unless you perform a census, it is fiendishly difficult to determine.


The commonly used indices of species heterogeneity can be divided into two classes, 'parametric' indices which assume species obey some model resulting in a predefined pattern of abundance, and 'nonparametric' indices, which do not. Since the nonparametric indices are easier to describe let us begin there.


♦   Nonparametric indices of species diversity

    Of the nonparametric indices, two are most commonly referred to.

  1. Simpson's index, (1-D): works on the basis that the more diverse a community is, the more likely it is that two randomly selected individuals will belong to different species.

    Assuming a sample of n individuals represents a very much larger population, and contains n_i individuals of species i, then the approximate probability of observing one individual from species i is n_i/n - and the probability of obtaining two individuals from that same species is (n_i/n)^2. Summing these probabilities over all the species you observe, D = Σ[(n_i/n)^2]. At one extreme, if all your sample represent the same species, diversity is minimal and D = 1; at the other extreme, if all s species are equally abundant, D = 1/s.

    In reality such samples are anything but random or unbiased, so D or (1-D) or 1/D is often calculated for the individuals at hand, rather than a population they supposedly represent - and simulation, rather than formulae, is used to estimate how D can be expected to vary.


  2. Shannon-Wiener information function, H': quantifies the amount of uncertainty represented by the proportion of your sample that belongs to each species, and is calculated as H' = -Σ[p_i log2(p_i)]. Because it describes the amount of information in your sample, H' is measured in bits per individual, p_i is the proportion belonging to the ith species - and, if it helps, log2(x) is 3.321928 times log10(x).

    In theory H' can be very large, but for biological communities seldom exceeds 5. For a single-species sample the smallest H' is zero, but if there are S species in a sample of n individuals H' cannot be less than log[n/(n-S)].

    Once again this index assumes the survey is a random and unbiased sample of the community. In the real world this is not the case, which biases H' (usually downwards) and you have to estimate its variability using simulation. One way around this is to confine your estimates to the individuals you have sampled, using Brillouin's index, H = [1/n]log2[n!/Π(n_i!)]. H is expressed as bits per individual, n_i is the number of individuals in the ith species, n is their sum, and Π is the product over all species.

    Where n is large, for field data H and H' yield similar results - although H cannot be calculated where ni is the area covered, or the biomass, of species i.
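The three nonparametric quantities described above can be sketched in a few lines. This is only an illustration on invented counts: the function names are ours, the formulae assume the with-replacement form of D given above, and log-factorials are obtained from lgamma so that large samples do not overflow.

```python
from math import lgamma, log, log2

def simpson_d(counts):
    """D = sum of (n_i/n)^2; report 1 - D or 1/D as preferred."""
    n = sum(counts)
    return sum((c / n) ** 2 for c in counts)

def shannon_h(counts):
    """H' = -sum of p_i * log2(p_i), in bits per individual."""
    n = sum(counts)
    return -sum((c / n) * log2(c / n) for c in counts if c > 0)

def brillouin_h(counts):
    """H = (1/n) * log2(n! / product of n_i!), in bits per individual."""
    n = sum(counts)
    # lgamma(k + 1) equals ln(k!), so the ratio is built on the log scale.
    ln_ratio = lgamma(n + 1) - sum(lgamma(c + 1) for c in counts)
    return ln_ratio / (n * log(2))

# Hypothetical samples of 100 individuals each.
even = [25, 25, 25, 25]      # four equally common species
single = [100]               # one species only

d_even, h_even, b_even = simpson_d(even), shannon_h(even), brillouin_h(even)
```

For the evenly divided sample D is 1/s = 0.25 and H' is log2(4) = 2 bits, while Brillouin's H comes out slightly lower, as expected for a finite collection.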


♦   Parametric indices of species diversity

Underlying the parametric indices three models of species abundance have received particular attention.

  • MacArthur's broken stick model: assumes each species is equally abundant, or potentially so, and any differences you observe are due to chance. Available evidence suggests this model is most applicable to communities of a few taxonomically very similar species, in a homogeneous environment, between which a single overriding survival requirement is more or less equally divided.

  • Geometric model: assumes the most abundant species has k% of available area, or biomass, the next most has k% of what is left, and so on. But random variation results in the log series model of species abundance (described below). The log series model applies to small numbers of species in succession communities, or in very harsh environments. However, although empirically the log series approximates samples of more complex communities, the lognormal model may be more appropriate.

  • Lognormal model: is mathematically midway between the broken stick and log series. It assumes the relative abundance of species is lognormally distributed, although sampling errors render this distribution Poisson lognormal. Biologically the lognormal applies to stable many-species communities, or sets of rapidly-reproducing opportunist species - and appears to be the most commonly-applicable model.
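To make the first two models more concrete, the sketch below computes the expected relative abundance of each species, ranked from most to least common. It uses the usual textbook expressions for the broken stick and the geometric series, neither of which is spelled out above, so treat the formulae, the function names, and the example values of s and k as our assumptions.

```python
def broken_stick(s):
    """Expected share of the i-th most abundant of s species:
    (1/s) * sum of 1/j for j = i..s."""
    return [sum(1 / j for j in range(i, s + 1)) / s for i in range(1, s + 1)]

def geometric_series(s, k):
    """Each species in turn takes a fraction k of whatever remains;
    shares are rescaled so the s species sum to one."""
    raw = [k * (1 - k) ** i for i in range(s)]
    total = sum(raw)
    return [r / total for r in raw]

# Hypothetical community of 5 species, with k = 0.5 for the geometric model.
stick = broken_stick(5)
geom = geometric_series(5, 0.5)
```

Plotting either list against rank on a log scale shows the contrast: the geometric series falls in a straight line, whereas the broken stick is much more even.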

Until quite recently the log series model was most popular, and since it is closely related to the negative binomial let us now consider it in more detail.


The log series coefficient, α

Many surveys of natural biological communities observe most individuals belong to just a few species, and most species are represented by very few individuals. In other words, the distribution of species' abundance appears to be strongly skewed.

For example, the graph below shows the relationship between the number of carabid species and the number of individuals caught in each species. In this study a total of 22112 individuals, representing 49 species, were recorded. Of these individuals 58% were of the most common species, Pterostichus melanarius - compared to nine species represented by single individuals.

{Fig. 1}

Fisher et al. (1943) noted that species abundance data such as these could be approximated by a negative binomial with a k-value approaching zero, from which the zero class was omitted. The zero class corresponded to all the species in the community that had not been recorded. They justified this model on two grounds:

  • Observations of any given species are Poisson distributed.
  • The relative abundance of each species obeys a predefined mathematical relationship, defined by the negative binomial model.


Under the negative binomial model, m is the mean number of individuals observed among all species of that community - large values of k indicate similar numbers of each species, and a homogeneous abundance of species in that community. Where k approaches zero, because we are not interested in species absent from our sample, the negative binomial can be reduced to its mathematical equivalent, a log-series distribution - which also has just two parameters, x and α.

The parameters of this model can be expressed in various ways.

    For instance:
  • In terms of the negative binomial, x = m/[k+m], and α = Γ(k).
  • In terms of the number of individuals (n), and the number of species (s) found by a survey, x = n/[n + α] and α = s/[−loge(1 − x)].
  • Rearranging which, it turns out that s/n = −loge(1 − x)[1 − x]/x, but x has to be obtained iteratively.

Provided the log series model is reasonable, x only depends upon the overall number of individuals per species - not the number of species in the community. Whereas α is a function of both the number of species (s) and the number of individuals per species (n/s) - and is therefore used as an index of species diversity. For most surveys x is extremely close to one. So for species abundance data, such as the carabids, the distribution can be fitted by looking up n/s in tables, then using it as an initial estimate of (1 - x) for the iterative equation above. Or α can be estimated directly from this, maximum likelihood, relationship: s = αloge(1 + [n/α])
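Because the maximum likelihood relationship cannot be rearranged to give α explicitly, it has to be solved numerically. The sketch below uses simple bisection, which is our choice of method rather than anything prescribed above, and applies it to the carabid totals quoted earlier (n = 22112, s = 49). The final line uses the standard log series result that about αx species are expected to be singletons, which is not stated in the text.

```python
from math import log

def fisher_alpha(n, s, tol=1e-12):
    """Solve s = alpha * ln(1 + n/alpha) for alpha by bisection."""
    if not 0 < s < n:
        raise ValueError("need 0 < s < n")

    def species_expected(alpha):
        return alpha * log(1 + n / alpha)

    lo, hi = 1e-9, float(n)
    # The left-hand side rises steadily with alpha, so widen the bracket if needed.
    while species_expected(hi) < s:
        hi *= 2
    while hi - lo > tol * hi:
        mid = (lo + hi) / 2
        if species_expected(mid) < s:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

n, s = 22112, 49                     # carabid survey totals from the text
alpha = fisher_alpha(n, s)           # comes out at roughly 6
x = n / (n + alpha)                  # very close to one, as the text notes
expected_singletons = alpha * x      # log series expectation for species seen once
```

The expected number of singletons can be compared with the nine single-individual species actually recorded, as one informal check on how well the log series describes these data.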


Given the form of the log series distribution, it always assumes most species are represented by single individuals. In reality, since sexually reproducing species seldom survive as a single individual, log series-type distributions may simply result from sampling a minute portion of the biological population. Moreover, even if we could estimate the total number of organisms in a community (N), it seems unlikely this model would provide a useful estimate of how many species it contains (S).


The discrete lognormal model

A number of studies have shown that, where there are sufficient data, surveys can have a skewed, but two tailed species abundance distribution - although the tail comprising the least abundant species is often truncated because they are too few to penetrate the 'veil' of our inefficient sampling.

This has led to suggestions that the 'natural' distribution of species abundance is lognormal. The simplest statistical model of this assumes that samples of any one species are Poisson distributed, but that the species means are lognormally distributed. Since the number of individuals is discrete the Poisson lognormal model is sometimes known as the discrete lognormal - although measures of abundance such as biomass, or area covered, are continuous.

Like the logseries model, the lognormal has two parameters - its location and dispersion. Under the lognormal model, plotting the number of species for which i organisms were observed against log[i] yields an approximately normal distribution - although, because you cannot observe an infinite number of organisms or less than one organism, it is truncated at both ends. The location of this distribution's mode depends upon what proportion of the community was sampled. Where the population is very sparsely sampled the mode lies below the point where species are abundant enough for your sample to observe one individual - in which case the distribution is hard to distinguish from a log series.
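One common way to look at such a distribution is to group species into doubling classes of abundance ('octaves') and count the species in each. The sketch below is a simplified version of that idea: it assigns each abundance to floor(log2) of its value rather than splitting the boundary values between classes, and the sample is invented.

```python
from collections import Counter

def octave_counts(abundances):
    """Number of species per doubling class:
    octave k holds abundances from 2**k to 2**(k + 1) - 1."""
    # For a positive integer n, n.bit_length() - 1 is exactly floor(log2(n)).
    return Counter(n.bit_length() - 1 for n in abundances if n > 0)

# Hypothetical number of individuals recorded for each of 17 species.
sample = [1, 1, 1, 2, 2, 3, 4, 5, 6, 7, 9, 12, 15, 20, 33, 70, 150]
hist = octave_counts(sample)

# Species counts in octave order, ready for plotting against log abundance.
ordered = [hist[k] for k in range(max(hist) + 1)]
```

In this made-up sample the mode falls in the third class (4 to 7 individuals); in a sparsely sampled community it would sit in the first class or be hidden altogether, which is when the lognormal becomes hard to tell from a log series.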

Provided the distribution's mode is sufficiently distinct, and assuming the distribution is symmetrical, the missing (zero truncated) tail can be estimated - enabling you to calculate the total number of species in that community. But this is probably not a very reliable estimate unless your sample contains more than 1000 individuals, and has observed at least 80% of the community's species. Even where these assumptions are met, and a clear mode is visible, fitting a Poisson lognormal to a truncated distribution is not easy, so the ordinary lognormal is often used as an approximation.

A number of studies show that sampling a larger proportion of a community alters the lognormal distribution's location, but not its dispersion. However, because many studies have a very similar dispersion, this parameter does not provide a useful index of species diversity. For these 'canonical' distributions, the geometric mean number of individuals per species is used as a measure of species heterogeneity.


Like the logseries model, this discrete lognormal assumes every individual of a given species is equally likely to be observed when the community is sampled, that observing each individual is an independent event, and the population remains unchanged during sampling. For strongly territorial or aggregated species, or sporadically available ones, or rare species that are destructively sampled, or those that learn to avoid our sampling procedures, we must expect these assumptions to be compromised. In other words, given the grossly unequal and non-random efficiency with which many species are sampled, a number of authors find it hard to accept the assumptions required by parametric models.


Choosing the most appropriate index

Which index is best depends upon the use to which it is to be put. For example, applied ecologists tend to want the least variable and most robust measures, whereas theoretical ecologists may be more interested in the underlying model. Thus, nonparametric methods have been criticized for yielding results that are imprecise and model-dependent. Parametric indices, on the other hand, assume a predefined model, or models, yet it can be impossible to infer from the observed distribution which is the most appropriate model for a particular set of field data.

For the conservationist, whilst some of these indices are more affected by the least abundant species in a community, none of them provide a quantitative measure of their conservation status. In other words, simply because a species is rare in a study area does not mean that it is rare anywhere else. As a result indices of species heterogeneity, by themselves, provide a misleading measure of a community's conservation value - or the impact of an intervention upon it.


At first sight you might assume that a good index of a community's conservation importance might be the abundance of species that are rare worldwide. Unfortunately, by itself, global rarity is not a reliable measure of endangerment because a number of species are uncommon but very widespread - whereas other species are locally common but confined to a very small range.

The classical view of endangered species was that they were being specifically hunted out - for food, fur, ivory, or as 'vermin'. However there are many endangered species which are not especially persecuted, and many species that are persecuted are not endangered. Experience shows that slow-breeding top predators (such as leopards and eagles) are most vulnerable, and fast-breeding opportunists (such as rats and crows) are least.

These points aside, the most endangered species share two important properties:

  1. They depend upon a habitat that is under threat, such as rainforest, coral reefs, wetlands, or shingle.
  2. They are restricted to isolated unique habitats, such as islands and mountain tops.

Indices which attempt to quantify these issues include the rarity index (an additive score of red-listed species), the endemism index (an additive score of highly localized species), and an assortment of combined, weighted indices of species diversity. In reality, of course, none of these measures are of the slightest use where there is no political will to conserve endangered species or the habitats upon which they depend.