Biology, images, analysis, design...
"It has long been an axiom of mine that the little things are infinitely the most important" (Sherlock Holmes)




Longitudinal surveys (monitoring)

  1. Sampling methodology

    The first step is to define the target population and the response variable - whether disease incidence, pest density or population size. If using passive surveillance, ensure that you have recorded all the cases correctly. For active surveillance and population monitoring, use probability sampling if at all possible. For different sampling methods see the More Information page on Sampling methods.

    Sometimes randomly located samples are taken on each occasion. Such an approach may be used if you are monitoring the development of a pest population on a crop. Alternatively, repeated samples may be taken from a number of fixed locations - such as trap positions, or line transects. By restricting sampling to the same positions on each sampling occasion, the catch or number observed will not be differentially affected over time by any 'site effect' - providing any site-dependent bias remains constant over time, which it may not! If the aim is to study the dynamics of a population, the range of habitats sampled should cover the full range of habitats used over the year. Otherwise apparent changes in population may merely reflect changes in its distribution.

    As regards the pattern of sampling over time, the most informative way of analysing monitoring data (time-series analysis) assumes that the sampling interval (the period between sampling occasions) remains constant. Hence as far as possible, sampling should be done daily, weekly, or after a set number of days. The length of the sampling interval required depends upon the rate of change of the characteristic being studied - a rapidly changing population should be sampled more frequently than one which has a slow rate of change.

    It is important to take sufficient samples so that you have adequate power to detect population changes. This necessitates (again) use of probability sampling methods so that meaningful confidence intervals can be attached.


  2. Display of monitoring data

    One needs to think carefully about what will be reported in disease monitoring data. The first data available are the numbers of suspected cases by date of reporting. However, these can be very misleading if there are diagnostic problems. Alternatively one can report the number of confirmed cases by week of confirmation. This is more reliable as regards getting real cases, but can be misleading as it is affected by delays at the laboratories where confirmation is taking place. The best way is to report confirmed cases according to the week in which they were reported. If the period of time between onset and reporting is known or can be estimated, then confirmed cases can be plotted by date of clinical onset of the disease to give what is known as an epidemic curve.

    If your observations are made at equally-spaced intervals, for example every 15 days, it is straightforward to plot the variable being measured against its day number. Monthly readings will not be precisely equally spaced, as the number of days in a month varies. If observations are not equally spaced, ensure that points are plotted on the correct day number, rather than just as an observation number or month. Month can then be superimposed on the day number axis as we have done below.
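    Converting sampling dates to day numbers can be sketched as follows. This is a minimal illustration, not the method used for the figures here: the function name day_numbers and the sampling dates are hypothetical.

```python
from datetime import date


def day_numbers(dates, origin=None):
    """Return the number of days elapsed since `origin` for each date.

    If no origin is given, the first sampling date is used as day 0.
    """
    if origin is None:
        origin = dates[0]
    return [(d - origin).days for d in dates]


# Hypothetical monthly sampling dates: the first of each month, Jan-May 2001.
sampling_dates = [date(2001, month, 1) for month in range(1, 6)]

# Use these values, not 1, 2, 3..., as the x-coordinates of the plot.
print(day_numbers(sampling_dates))  # [0, 31, 59, 90, 120]
```

    Note that the gaps between successive readings are 31, 28, 31 and 30 days, so plotting by observation number would slightly distort the time axis.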

    {Fig. 1}

    Readings are sometimes missed, so there may be several months between readings instead of the usual one. If you do have missing data, for example the two values for May, you should join the points on either side of the gap with a dotted line, as we have done in the second figure above, to indicate that you have much less confidence in what was happening during that period. Of course, you may be missing fluctuations between the regular sampling occasions, especially if the sampling interval is excessively long for the organism you are studying. Time trend plots are also sometimes done using bar diagrams rather than line plots, but this is not recommended.



    When we come to consider multiple time series plots, we may find some quite complex relationships between them. Sometimes they may vary quite independently of each other - other times they may vary together. But sometimes there is a time lag or time delay between change in one variable and change in the other variable. In the example below there appears to be a 1-2 week lag between when host density changes and when percentage parasitism changes.

    {Fig. 3}

    It can be informative to plot this type of data as a trajectory plot, shown in the second figure above. Here percentage parasitism is plotted against host density, and the points are joined in sequence. If percentage parasitism is related to host density at some earlier time, we will get an anticlockwise spiral in the trajectory plot.

    There is another way that lags in systems can be identified - that is by plotting one variable against the other variable, at some specified earlier time period. The first figure below shows percentage parasitism plotted against host density with no time lag - there is no obvious relationship between the two variables.

    {Fig. 4}

    The second and third figures show the effect of adding a progressively greater time lag. There is a very clear relationship between percentage parasitism and host density two weeks earlier. It is frequently necessary to test a number of different time lags when looking at relationships between (say) pest numbers and meteorological factors such as temperature or rainfall.
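    Testing several time lags can be sketched as below. This is only an illustration under stated assumptions: the data are invented so that the response follows the driver exactly two steps later, the function names are hypothetical, and the Pearson correlation is used purely as a descriptive summary (the observations are serially correlated, so no significance test is implied).

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def lagged_correlations(driver, response, max_lag):
    """Correlate response[t] with driver[t - lag] for lag = 0..max_lag."""
    result = {}
    for lag in range(max_lag + 1):
        earlier_driver = driver[:len(driver) - lag]  # drop the last `lag` values
        later_response = response[lag:]              # drop the first `lag` values
        result[lag] = pearson(earlier_driver, later_response)
    return result


# Invented weekly data: parasitism tracks host density two weeks earlier.
host_density = [10, 20, 40, 30, 15, 10, 25, 45, 35, 20]
pct_parasitism = [12, 18, 10, 20, 40, 30, 15, 10, 25, 45]

for lag, r in lagged_correlations(host_density, pct_parasitism, 3).items():
    print(f"lag {lag} weeks: r = {r:.2f}")
```

    Each extra week of lag discards one pair of observations, so long lags on short series rest on very few points.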


  3. Analysis of monitoring data

    Probably the most common model for time series data is a simple linear regression line fitted to a trend over time. If the trend over time is curvilinear, the response variable can be log transformed before fitting the regression line. We looked briefly at regression in the More Information page on Relationships between variables, but we cover it in more depth in Unit 12. As we shall see, there are problems with using regression in this way - unless the fitted line is regarded as purely descriptive. This is because observations made in a time series (or spatial series) are not independent. Two better reasons for not using simple linear regression for time series data are (1) that it often provides a very poor fit to the data, and (2) that it tends to obscure, rather than illuminate, what is going on. Smoothing, for example using a median trace or running means, makes far fewer assumptions and is a better initial approach to time series data.
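    Running means and a median trace can be sketched as follows. This is a minimal illustration, assuming a centred window of odd length with the end points left unsmoothed. The counts are invented and the function name running is hypothetical.

```python
from statistics import mean, median


def running(values, window, func):
    """Apply `func` to a centred moving window of odd length.

    Positions too close to either end for a full window are left as None.
    """
    if window % 2 == 0:
        raise ValueError("window must be odd so that it can be centred")
    half = window // 2
    smoothed = [None] * len(values)
    for i in range(half, len(values) - half):
        smoothed[i] = func(values[i - half:i + half + 1])
    return smoothed


# Invented counts with a single one-off spike (30).
counts = [4, 6, 5, 30, 7, 8, 9]

print(running(counts, 3, mean))    # the spike is spread over three points
print(running(counts, 3, median))  # [None, 5, 6, 7, 8, 8, None]
```

    In this example the running median ignores the isolated spike, whereas the running mean is pulled upwards on either side of it.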

    There are more rigorous methods of analysis for serially correlated observations - namely time series analysis. One version of this is ARIMA modelling. The acronym stands for Auto-Regressive, Integrated, Moving Average. The first step is often to transform the response variable to ensure homogeneity of variances. ARIMA models are then fitted to both the explanatory and response variables to remove the effects of serial correlation. The two series are then cross-correlated to determine whether an association exists. Note however that this approach can only be used with very long runs of data - if, for example, you wish to identify seasonal trends, you will need 10-20 years of data.

    Extrapolation of a multivariate regression relationship is another approach, which is widely used for forecasting pest damage levels to crops. Any forecast based upon only a few years' data is, of course, likely to be less reliable than one based on many years' data. But however many years' data are used, it is still possible that relationships which have held in the past may not hold in future. This will lead to the breakdown of any forecasting system.

    Often a more valid use of past data is to use the level of past variability to identify unusually large increases in the number of cases. If we know the number of cases of a disease has only varied historically between 10 and 100 per year, then the occurrence of 400 cases in a year would certainly be a cause for concern.
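    One very simple way to express this kind of check is sketched below. The rule used (flag a count that exceeds some multiple of the historical maximum) and the default factor of 2 are assumptions made for illustration, not a recommended threshold, and the annual counts are invented.

```python
def unusual_increase(past_counts, new_count, factor=2.0):
    """Return True if new_count exceeds `factor` times the historical maximum.

    `factor` is a hypothetical safety margin chosen for this example only.
    """
    if not past_counts:
        raise ValueError("need at least one past count")
    return new_count > factor * max(past_counts)


# Invented annual case counts that have varied between 10 and 100.
past = [10, 35, 100, 60, 22]

print(unusual_increase(past, 400))  # True
print(unusual_increase(past, 120))  # False
```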



Distribution surveys (mapping)

  1. Sampling methodology

    With passive surveillance you often get 'presence only' data - in other words you know the locations where an individual occurs, but you cannot say it is absent in other locations. With active sampling, where the aim is to map distribution, random sampling is not necessarily the best approach. A systematic grid sample is often preferable in order to provide good coverage.

    For mapping surveys a great deal of additional spatial data are required, including topographical features (contours, rivers, roads etc) and vegetation types. Such data may be obtained from a variety of sources. Map data can be obtained from existing maps, ground survey or remote sensing. Often remote sensing data must be ground-truthed to be useful. For example remote sensing will provide multicoloured maps of the different vegetation types in an area - but the different types identified must then be ground-truthed to link up the particular vegetation types to the particular colours on the maps. Attribute data comprise any georeferenced data describing features of the area.

    Usually several data layers are input to the GIS multilayer data base - for example vegetation, rivers, roads, villages and disease prevalence. The base map data are stored in one of two formats - either raster-based (also called grid-based, or pixel-based), or vector-based. For a raster-based system a map is scanned, whilst for a vector-based system a map's features are traced over using a digitizing tablet. Some GIS systems can analyse both raster and vector data. All other data are then rescaled to fit the base map - most commonly they are remapped to a Universal Transverse Mercator (UTM) grid.


  2. Display of spatial data


    Data are output as different types of maps. Such maps may incorporate contours (such as altitude contours commonly found on maps), shading, and a variety of symbols of different sizes and shapes. The base map can be displayed in two or three dimensions.

      We will take our examples of mapping from work carried out by us, at Nguruman in the Rift Valley of Kenya, on the distribution and abundance of tsetse flies - vectors of trypanosomosis in cattle. Two geographic information systems were used, ARC/INFO, which is a vector-based system, and CRIES, which is a raster-based system. Data could be converted between the two systems, so each system could be used for the tasks best suited to it. Data sources included Kenya Ordnance Survey maps, aerial photographs, satellite imagery, and extensive ground-based field data. Long term data were available on tsetse fly numbers from 22 traps that were placed within the tsetse habitat.

      The first map below shows a three-dimensional representation of the topography of the area. The escarpment edge bordering the Rift Valley lies to the west and is topped by the Losuate Hills. The trap positions marked in pink lie at the base of the escarpment.

      {Fig. 5}

      The second map (above) is rotated round a little and shows the patch of denser vegetation within which the traps are set. That denser vegetation results from a series of streams which descend from the hills to the plains. The third map (above) is further rotated and shows the full range of vegetation types present in the area. This map was obtained from satellite imagery, which was then ground-truthed to match up vegetation types to the different colour signatures. Where there is no source of groundwater, the vegetation is dry grassland and scrub.


    We then have to consider how to display the main response variable in the map - for example disease prevalence or number of insects caught. One option is to use a proportional circle map. The area of the circle is made proportional to the variable of interest. This method was widely used before the advent of GIS, but is also available on GIS systems. Another method is to use bar diagram maps, this time with the height of the bar proportional to the variable of interest.
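    Because it is the area of the circle, not its radius, that should be proportional to the variable, the radius must scale with the square root of the value. A minimal sketch follows, in which the function name and the choice to scale relative to the largest value are assumptions for illustration.

```python
from math import sqrt


def circle_radius(value, max_value, max_radius):
    """Radius for a proportional circle whose AREA is proportional to value.

    The largest value is drawn with radius `max_radius`; since area grows
    with the square of the radius, other radii scale with sqrt(value).
    """
    if value < 0 or max_value <= 0:
        raise ValueError("value must be >= 0 and max_value must be > 0")
    return max_radius * sqrt(value / max_value)


# A catch one quarter the size of the largest gets half the radius,
# and therefore one quarter of the area.
print(circle_radius(25, 100, 10))  # 5.0
```

    Scaling the radius directly by the value would exaggerate differences, since a doubling of the value would then quadruple the area.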

      {Fig. 6}

      In our example the variable of interest is the mean number of female tsetse flies caught per trap per day in the 22 traps distributed through the woodland. In the first map we have collapsed catches to an ordinal variable using class intervals, in this case on a logarithmic scale. This figure probably does not do the method justice since the way catches have been scaled tends to minimize the differences between catches. The second map shows the same information using proportional bar diagram maps. Again a logarithmic scale has been used, but the actual mean catch has been displayed, rather than using class intervals. The result is more informative than the use of proportional circles, although there is a problem where the bars of nearby traps overlap.

      Catches are seen to be highest in the main woodland area (in black) and in the woodland 'corridors' to the north and west. The high catches in the latter areas reflected the main 'invasion' pressures since tsetse numbers were very high to the north (where there was a larger patch of woodland) and west (on top of the escarpment). Catches were much lower in the woodland corridor to the south, and in the open woodland to the east.


    So far the maps used have only indicated the value of the variable of interest at specific locations. But usually the real interest is in how the variable is distributed over the whole area - not just at particular sites. For this one has to interpolate numbers or proportions in areas that fall between actual observations (in other words, where data are absent). This can be done using simple arithmetic or geometric means - or values can be weighted, depending upon values in the other data layers.

    One way to do this would be to draw a choroplethic map. The whole area would be divided into discrete units, and either shading or colour used to represent the mean catch within each unit. The disadvantage of choroplethic maps is that the boundaries are artificial and do not represent the true boundaries between (in this case) high and low densities of tsetse flies.

    A better way to visualise the distribution of tsetse would be to use isoplethic maps. An isoplethic map aims to represent the true boundaries between different values by joining all points of equal value. This produces a set of 'contour lines'. The areas between contour lines can be shaded or coloured to represent different ranges of values.

    {Fig. 7}

      To draw an isoplethic map we need to interpolate catches in areas we have not sampled from the sites for which we do have data. For the tsetse data this was done using the mean of the ten nearest trap points after weighting by the inverse of their distance from the point to be interpolated. In addition, hypothetical 'zero-catch' traps were added around the edge of the study area, in the grassland and scrub habitats that offered insufficient cover for tsetse flies. This effectively pinned the distribution around the edges.
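      The interpolation step described above (a mean of the nearest trap catches weighted by inverse distance) can be sketched as follows. This is an illustration only, not the code used to produce the maps: the trap coordinates and catches are invented, the function name is hypothetical, and returning the observed catch when the point coincides exactly with a trap is an assumption made here to avoid dividing by zero.

```python
from math import hypot


def idw_estimate(point, traps, k=10):
    """Inverse-distance-weighted mean catch at `point`.

    `traps` is a list of (x, y, catch) tuples; the k nearest traps are used.
    """
    if not traps:
        raise ValueError("need at least one trap")
    px, py = point
    # Pair each catch with its distance from the point, nearest first.
    by_distance = sorted(
        ((hypot(x - px, y - py), catch) for x, y, catch in traps),
        key=lambda pair: pair[0],
    )
    nearest = by_distance[:k]
    # Assumption: a point lying exactly on a trap takes that trap's catch.
    if nearest[0][0] == 0:
        return nearest[0][1]
    weights = [1 / distance for distance, _ in nearest]
    weighted_sum = sum(w * catch for w, (_, catch) in zip(weights, nearest))
    return weighted_sum / sum(weights)


# Invented traps on a line, plus a hypothetical 'zero-catch' trap at the edge.
traps = [(0, 0, 10), (2, 0, 30), (4, 0, 0)]

print(idw_estimate((1, 0), traps, k=2))  # midway between the first two: 20.0
print(idw_estimate((3, 0), traps))       # pulled down by the zero-catch trap
```

      Repeating this calculation over a regular grid of points gives the surface from which contour lines can then be drawn.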

      The animated map, given here, shows the change in catches from April 1987 to February 1988 - during the first year of a tsetse control trial, using traps in the southern part of the area. Numbers decline sharply from April till October, but there is some reinvasion from the north and west in November during the short rains. Numbers then continue to decline till February.


  3. Analysis of spatial data

    It is when we come to spatial analysis that GIS really comes into its own. This can itself be divided into several categories:
    1. Visualization
      Visualization of a situation can be an important step to better understanding - especially when multiple factors are operating. By overlaying maps one can, for example, assess disease risk in an area based on multiple attributes. If one knows that altitude, presence of breeding sites and vegetation type are all important risk factors, one can select areas meeting all three criteria. Different weights can also be attached to the criteria based on prior research. This approach of weighting and overlaying maps is known as suitability analysis. Another type of visualization is to use animation, either to draw attention to specific features, or to display change over time.
    2. Exploratory data analysis
      The aim here is to identify unusual space-time clusters (or hotspots) of disease, or pest abundance. Various statistical techniques are then combined with the GIS, in an attempt to assess whether particular clusters are significantly associated with possible risk factors. This approach was used, for example, in an attempt to determine if spatial clusters of childhood leukaemia were located near nuclear facilities in Britain. As we point out below, investigating associations using spatial (or temporal) data is fraught with statistical problems, not least being how we define the area of influence of the risk factor. Nevertheless, such exploratory data analysis can open up possible lines for future research.
    3. Modelling
      Modelling can be used to predict the change in animal and plant populations, and their diseases, both spatially and temporally. A good example of this is work carried out to predict the impact of global warming on the distribution of insect pests - whether agricultural pests, or vectors of disease. Whatever modelling approach is used, it is important to test the robustness of a model. This is commonly done by dividing all occurrence points randomly into training (for model building) and testing (for model evaluation) datasets. Another approach is to test conclusions reached in one geographical area against data from another quite different area.
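      The random division of occurrence points into training and testing datasets can be sketched as below. The function name, the 30% test fraction and the fixed seed are assumptions chosen for illustration, and the occurrence points are invented.

```python
import random


def split_occurrences(points, test_fraction=0.3, seed=0):
    """Randomly divide occurrence points into (training, testing) lists.

    A seeded generator is used so that the same split can be reproduced.
    """
    rng = random.Random(seed)
    shuffled = list(points)  # copy, so the caller's list is left untouched
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Invented occurrence points as (x, y) coordinates.
occurrences = [(i, i) for i in range(10)]

training, testing = split_occurrences(occurrences, test_fraction=0.3, seed=1)
print(len(training), len(testing))  # 7 3
```

      The model is then built using only the training points, and its predictions are compared with the testing points that it has not seen.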