4. The epidemiological approach to investigating disease problems

4.1 Introduction
4.2 Types of epidemiological study
4.3 Sampling techniques in epidemiological studies
4.4 Sample sizes
4.5 Methods for obtaining data in epidemiological studies
4.6 Basic considerations in the design of epidemiological investigations
4.7 The use of existing data
4.8 Monitoring and surveillance

4.1 Introduction

In Chapters 1 and 2 we described the need for an epidemiological approach to the investigation of disease problems. We also implied that such investigations usually have the basic objective of describing and quantifying disease problems and of examining associations between determinants and disease. With these objectives in mind, epidemiological investigations are normally conducted in a series of stages, which can be broadly classified as follows:

1. A diagnostic phase, in which the presence of the disease is confirmed.
2. A descriptive phase, which describes the populations at risk and the distribution of the disease, both in time and space, within these populations. This may then allow a series of hypotheses to be formed about the likely determinants of the disease and the effects of these on the frequency with which the disease occurs in the populations at risk.
3. An investigative phase, which normally involves the implementation of a series of field studies designed to test these hypotheses.
4. An experimental phase, in which experiments are performed under controlled conditions to test these hypotheses in more detail, should the results of phase 3 prove promising.
5. An analytical phase, in which the results produced by the above investigations are analysed. This is often combined with attempts to model the epidemiology of the disease using the information generated. Such a process often enables the epidemiologist to determine whether any vital bits of information about the disease process are missing.
6. An intervention phase, in which appropriate methods for the control of the disease are examined either under experimental conditions or in the field. Interventions in the disease process are effected by manipulating existing determinants or introducing new ones.
7. A decision-making phase, in which a knowledge of the epidemiology of the disease is used to explore the various options available for its control. This often involves the modelling of the effects that these different options are likely to have on the incidence of the disease. These models can be combined with other models that examine the costs of the various control measures and compare them with the benefits, in terms of increased productivity, that these measures are likely to produce. The optimum control strategy can then be selected as a result of the expected decrease in disease incidence in the populations of livestock at risk.
8. A monitoring phase, which takes place during the implementation of the control measures to ensure that these measures are being properly applied, are having the desired effect on reducing disease incidence, and that developments that are likely to jeopardise the success of the control programme are quickly detected.

The following two sections are concerned with describing ways in which epidemiological investigations can be designed and implemented, and the data produced analysed.

4.2 Types of epidemiological study

4.2.1 Prospective studies
4.2.2 Retrospective studies
4.2.3 Cross-sectional studies

There are three main types of epidemiological study:

· Prospective studies, which look forward over a period of time and normally attempt to examine associations between determinants and the frequency of occurrence of a disease by comparing attack rates or incidences of disease in groups of individuals in which the determinant is either present or absent, or its frequency of occurrence varies.

· Retrospective studies, which look backward over a period of time and normally attempt to compare the frequency of occurrence of a determinant in groups of diseased and non-diseased individuals.

· Cross-sectional studies, which attempt to examine and compare estimates of disease prevalence between various populations and subsets of populations at a particular point in time.

Frequently, however, these approaches may be combined in a general study of a disease problem. In such studies, other morbidity and mortality rates may be compared as well as other variables such as weight gain, milk yield etc. depending on the objectives of the particular study.

4.2.1 Prospective studies

There are, essentially, two approaches to a prospective study. The first, which is similar to that used in controlled experiments, can be used when the investigator has control over the distribution of the determinant that is to be studied. The individual animals selected for the study are assigned to groups or cohorts. (For this reason, prospective studies are often called cohort studies). The determinant to be studied is then introduced into one cohort and the other cohort is kept free of the determinant as a control. The two cohorts are observed over a period of time and the frequencies with which disease occurs in them are noted and compared.

Often, however, the investigator has no control over the distribution of the determinant being studied. In such a case he will select the individuals that have been or are exposed to the determinant concerned, while another group of individuals that do not have, or have not been exposed to, that determinant is used as a control. The frequency of occurrence of the disease in the different groups is then observed over a period of time and compared.

In prospective studies, the cohorts being compared should consist, ideally, of animals of the same age, breed and sex and should be drawn from within the same herds or flocks, since there may be many differences in the way that different herds or flocks are kept and managed, which may be expected to have an effect on the frequency of occurrence of the disease being investigated. If such cohorts can be selected, prospective studies can demonstrate accurately the association between determinants and disease, since the cohorts will differ from each other merely in the presence or absence of the particular determinant being studied. This will only be possible if the investigator has control over the distribution of the determinant being selected. Even then, such conditions are often very difficult to fulfil in the field, where the investigator is dependent on the cooperation of livestock owners who may be unwilling to alter their management systems to fit in with the study design. If the investigator has no control over the distribution of the determinant being studied, the study design becomes more complicated and the investigation may have to be repeated to take into account the variations in the many different factors involved.

Prospective studies have the disadvantage that if the incidence of the disease is low, or the difference one wishes to demonstrate between groups is small, the size of the study groups has to be large. (Methods for analysing the results of prospective studies and for estimating the size of cohorts needed are described in Chapter 5). The problem of low disease incidence can sometimes be overcome by artificially challenging the different cohort groups with the disease in question. However, this may not be acceptable under field conditions, since livestock owners take grave exception to having their animals artificially infected! For these reasons, prospective studies are normally performed on diseases of high incidence and where the expected difference in disease frequencies between the groups studied is likely to be large.

4.2.2 Retrospective studies

Retrospective studies are often referred to as case-control studies. In such studies, the normal procedure is to look back through records of cases of a particular disease in a population and note the presence or the absence of the determinant being studied. The case group can then be compared with a group of disease-free individuals in which the frequency of occurrence of the determinant has been determined. Note that in a case-control study one is, in effect, comparing the frequency of occurrence of the determinant in two groups, one diseased (cases) and one not (controls).
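The difference between what is compared in a prospective study and in a case-control study can be sketched with a single two-by-two table. The counts below are hypothetical and the variable names are illustrative only; formal methods of analysis are those referred to in the following chapter.

```python
# Hypothetical counts of animals, classified by determinant and disease status.
exposed_diseased, exposed_healthy = 30, 70        # determinant present
unexposed_diseased, unexposed_healthy = 10, 90    # determinant absent

# Prospective (cohort) view: compare the frequency of DISEASE
# between the group with the determinant and the group without it.
incidence_exposed = exposed_diseased / (exposed_diseased + exposed_healthy)
incidence_unexposed = unexposed_diseased / (unexposed_diseased + unexposed_healthy)

# Retrospective (case-control) view: compare the frequency of the DETERMINANT
# between diseased animals (cases) and non-diseased animals (controls).
determinant_in_cases = exposed_diseased / (exposed_diseased + unexposed_diseased)
determinant_in_controls = exposed_healthy / (exposed_healthy + unexposed_healthy)

print(f"Disease frequency, determinant present: {incidence_exposed:.1%}")
print(f"Disease frequency, determinant absent:  {incidence_unexposed:.1%}")
print(f"Determinant frequency in cases:         {determinant_in_cases:.1%}")
print(f"Determinant frequency in controls:      {determinant_in_controls:.1%}")
```

In a real case-control study the cases and controls are usually drawn separately, so the table would not normally come from one population as it does in this sketch.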

Retrospective studies have various advantages and disadvantages when compared with prospective studies. The principal advantage of retrospective studies is that they make use of data that have already been collected and can, therefore, be performed quickly and cheaply. In addition, because diseased individuals have already been identified, retrospective studies are particularly useful in investigating diseases of low incidence.

The main disadvantage is that the investigator has no control over how the original data were collected, unless he or she collected them. If the data are old, it may not be possible to contact the individuals who had collected them, and thus there is often no way of knowing whether the data are biased or incomplete (see also Section 4.7 on some other disadvantages in using already generated data in epidemiological work).

The second major disadvantage is that although one knows the frequency of occurrence of the determinant in the case group, one does not know its frequency of occurrence in non-diseased individuals from the same population. The latter is normally determined by sampling from a population of non-diseased individuals at the time that the study is being carried out. There is no way of knowing the extent of the similarity between the two different populations from which the case and control groups are taken. Consequently, there is no way of ascertaining the distribution within these populations of undetermined factors which could affect the frequency of the disease. Great caution has to be exercised, therefore, in making inferences about associations between determinants and disease frequencies from retrospective studies.

A third disadvantage is that historical data on cases of disease that are sufficiently accurate to merit further study are hard to come by in veterinary medicine. The opportunities for doing case-control studies are thus rather limited. They are much more common in human medical studies.

In spite of the fact that classic case-control studies are rarely performed in veterinary epidemiology, retrospective data are often used in livestock disease studies. The advantages and disadvantages of using such data are discussed later on in this chapter.

Methods for analysing case-control study data and for calculating the sizes of case and control study groups are described in the following chapter.

4.2.3 Cross-sectional studies

Cross-sectional studies are, in fact, surveys. They take place over a limited time period and, in epidemiological studies, are normally concerned with detecting disease, estimating its prevalence in different populations or in different groups within populations, and with investigating the effect of the presence of different determinants on disease prevalence. They can, of course, be used to provide data on a large number of other variables present in livestock populations. Two types of cross-sectional study are commonly performed.

Censuses

A census in effect means sampling every unit in the population in which one has an interest. If the population is small, this is the most accurate and effective way of conducting a survey. Unfortunately, in most instances the populations studied are large and censuses become difficult and expensive to undertake. A further drawback with censuses in large populations is that, because of the practical constraints of staff and facilities, each individual unit within a population can be allocated only a limited amount of time and effort. Consequently, the amount of data that can be obtained from each unit sampled is limited.

Sample surveys

Sample surveys have the advantage of being cheaper and easier to perform than censuses. Because the population is being sampled, the actual number of units being measured is relatively small, and as a result more time and effort can be devoted to each unit. This enables a considerable amount of data to be collected on each sample unit.

The question is, how closely do the results of the survey correspond to the real situation in the population being sampled? If undertaken properly, sample surveys can generate reliable information at a reasonable cost; if they are performed improperly, the results may be very misleading. This is also true of censuses.

4.3 Sampling techniques in epidemiological studies

4.3.1 Random sampling
4.3.2 Multi-stage sampling
4.3.3 Systematic sampling
4.3.4 Purposive selection
4.3.5 Stratification
4.3.6 Paired samples
4.3.7 Sampling with and without replacement

Epidemiological studies usually involve sampling from livestock populations in some way in order to make inferences about a disease or diseases present in these populations. The units sampled are referred to as sample units. Sample units may be individual animals or they may be the units that contain the animals to be investigated, such as a herd, ranch, farm or village.

The sample fraction is the number of units actually sampled, divided by the total number of units in the population being sampled.

Various methods can be used to sample a population. The more common techniques used in epidemiological studies are described in the following sections.

4.3.1 Random sampling

The rationale behind random sampling is that units are selected independently of each other and, theoretically, every unit in the population being sampled has exactly the same probability of being selected for the sample. It is, in fact, akin to the process of drawing lots. Random sampling removes bias in the selection of the sample and thereby removes one of the main sources of error in epidemiological studies.

The first step in random sampling is to construct a list of all the individual sample units in the population being sampled. This is known as the sample frame. Each unit in the sample frame can then be assigned an identification number, which is normally its position in the sample frame. Random numbers can then be generated by a computer program or read from a table of the output of such a program. (A random number table is given in Appendix 1). As each number is produced, the unit to be sampled can be identified from the sample frame. Random numbers are selected from a random number table by starting anywhere in the table and then reading either horizontally across the rows or vertically down the columns.

Example: Suppose we are interested in detecting the presence of brucellosis in a dairy herd of 349 cows. We decide that, for our purposes, we wish to be 90% sure of detecting the disease and we estimate, although we do not know, that the prevalence of brucellosis in the herd is not likely to be less than 8% (see Section 4.4 on estimating sample sizes). From Table 10 we see that in order to be 90% sure of detecting the disease at this level of prevalence in a herd of 349 cows, we need a random sample of 27 animals. The animals in the herd are not tagged, but the herdsman is able to identify each animal by name. We can, therefore, construct a sample frame of the animals in the herd by listing their names. If, for any reason, two or more animals had the same name, we could further identify them by a number (e.g. Daisy 1, Daisy 2 etc). A similar procedure can sometimes be used to establish the identity of certain unnamed animals in a herd by identifying them as the first calf of Emma, the second calf of Flora etc.
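The sample size in this example is read from Table 10. As a rough cross-check only, the sketch below computes the smallest sample that gives the required probability of including at least one diseased animal, assuming that animals are drawn at random without replacement and that the number of diseased animals is the herd size multiplied by the minimum prevalence, rounded to the nearest whole animal. Both assumptions, and the function name, are ours; the table itself may have been derived differently.

```python
from math import comb

def detection_sample_size(herd_size, min_prevalence, confidence):
    """Smallest random sample likely to contain at least one diseased animal."""
    diseased = max(1, round(herd_size * min_prevalence))   # assumed number diseased
    healthy = herd_size - diseased
    for n in range(1, herd_size + 1):
        # Probability that a sample of n animals contains no diseased animal
        # when sampling without replacement (hypergeometric).
        p_miss = comb(healthy, n) / comb(herd_size, n)
        if 1 - p_miss >= confidence:
            return n
    return herd_size

print(detection_sample_size(herd_size=349, min_prevalence=0.08, confidence=0.90))
```

Under these assumptions the result for the herd of 349 cows is 27 animals, the same figure as that quoted from Table 10.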

To select the animals to be sampled we could simply write the name of each animal in the herd on a piece of paper, place the name cards in a hat and then draw out 27 cards. Alternatively, we could use a random number generator or table to produce a set of three-digit numbers. Rejecting all numbers greater than 349, we continue until we have 27 three-digit numbers. A series of such numbers might for instance read 001, 088, 045, 008, 016, 344 etc. We would then select the first, the eighty-eighth, the forty-fifth, the eighth, the sixteenth, the three-hundred-and-forty-fourth etc animal from the sample frame. Since we now know the names of the animals to be sampled, we can identify them in the herd and include them in the sample. As a simple alternative, we could run the herd through a chute and select the animals as they come through, taking the first, eighth, sixteenth, forty-fifth etc animal for the sample.

Note that if the population to be sampled was between 10 and 99, we would use two-digit numbers to select the sample; if it was between 100 and 999, three-digit numbers would be used; for populations between 1000 and 9999, and between 10 000 and 99 999, four-digit and five-digit numbers, respectively, would be selected. Any number in these categories greater than the size of the population being sampled is rejected. If during the sampling procedure the same unit is selected a second time, the number that led to that selection is also rejected.
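The selection rules described above can be sketched in Python as follows. This is a minimal illustration using the standard `random` module; the function name is hypothetical, and rejecting the number 000 is our assumption, since no unit in the frame carries that number.

```python
import random

def select_random_units(population_size, sample_size, rng=None):
    """Return distinct unit numbers (positions 1..population_size in the sample frame)."""
    if sample_size > population_size:
        raise ValueError("sample cannot be larger than the population")
    rng = rng or random.Random()
    digits = len(str(population_size))        # e.g. 349 -> three-digit numbers
    chosen = []
    while len(chosen) < sample_size:
        number = rng.randrange(10 ** digits)  # 000..999 when digits == 3
        if number == 0 or number > population_size:
            continue                          # no such unit in the frame: reject
        if number in chosen:
            continue                          # unit already selected: reject
        chosen.append(number)
    return chosen

# Herd of 349 cows, sample of 27; the seed only makes the run repeatable.
sample = select_random_units(349, 27, rng=random.Random(1))
print([f"{number:03d}" for number in sample])
```

In practice `random.sample(range(1, 350), 27)` gives an equivalent selection in one call; the loop is written out only to mirror the manual procedure.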

If we were selecting animals from the same herd for the purposes of a prospective study, we could use random numbers to identify them in the sample frame and then assign each animal in turn to the appropriate group. Thus, in the above example, if we wanted to select three groups from the herd, the first cow on the list would be assigned to group 1, the eighty-eighth cow on the list to group 2, the forty-fifth cow on the list to group 3, the eighth cow to group 1, the sixteenth cow to group 2, the three-hundred-and-forty-fourth cow to group 3 and so on. There are many ways of selecting random samples, but the principles are substantially the same as those outlined above.
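Assignment of randomly selected animals to groups in turn might be sketched as below. The function name, the group size of nine and the placeholder animal names are illustrative assumptions.

```python
import random

def assign_to_groups(frame, n_groups, group_size, rng=None):
    """Draw animals at random from the frame and deal them out to groups in turn."""
    rng = rng or random.Random()
    # Random selection without repeats, in random order.
    selected = rng.sample(frame, n_groups * group_size)
    groups = [[] for _ in range(n_groups)]
    for position, animal in enumerate(selected):
        # 1st selected animal -> group 1, 2nd -> group 2, 3rd -> group 3, 4th -> group 1 ...
        groups[position % n_groups].append(animal)
    return groups

frame = [f"cow {i}" for i in range(1, 350)]   # stand-in for the list of names
groups = assign_to_groups(frame, n_groups=3, group_size=9, rng=random.Random(7))
for number, members in enumerate(groups, start=1):
    print(f"group {number}: {members}")
```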

Apart from removing bias in the selection of the sample, random sampling has other advantages, the main one being that we can easily calculate an estimate of the error for the values of a population parameter estimated by a random sample. This is done by the use of a statistic known as the standard error (see Section 4.4). Having calculated the error, we can adjust the size of the sample according to how precise we require our sample estimate to be. It is possible to calculate estimates of errors in other forms of sampling, but the calculations involved are more complex. For this reason, random sampling is normally the method of choice when circumstances permit.

The main disadvantage of random sampling is that it cannot be attempted if the size of the population is not known. In most instances, a sample frame must be constructed before sampling can begin. This sample frame must contain all the sample units in the population, and the sample units must be identifiable by some means or other in the population which is being sampled. Sample frames are notoriously difficult to construct: certain sample units may occur in the frame more than once, thus increasing their chance of selection, or certain sectors of the population to be sampled may be omitted. Moreover, in Africa, where records of individually identifiable animals are seldom available, sample frames of individual animal units can rarely be constructed. For this reason, simple random sampling based on individual animals as sample units is rarely attempted in Africa.

Furthermore, random sampling is impossible where the type of unit being sampled does not permit the population size to be determined beforehand. If, for instance, events such as births or deaths are being sampled, there is simply no way of knowing with absolute precision how many births or deaths there will be in a population over the study period.

4.3.2 Multi-stage sampling

A way round the problem of constructing sample frames of individual animal units is to use a technique known as multi-stage sampling. As the name implies, this involves sampling a population in different stages, with the sample unit being different at each stage. If it is not possible to construct a sample frame of individual animals, then herds, farms or villages in which livestock are kept can be used as units. Lists, particularly of farms or villages, are frequently compiled for administrative purposes by governments, and it is relatively easy to construct a sample frame from such lists. This would be the first stage of the process. The sample units are then selected at random from the sample frame. Once the farm or village units have been selected, it may prove possible to construct a sample frame of the animals within the units and sample these in turn.

Alternatively, all the animals within a village, farm or herd can be sampled. This technique is known as cluster sampling. The herd, farm or village is the sample unit and the animals contained within the sample unit are the cluster. Since one of the main expenses of sampling is often for travel, the advantages of sampling all the animals in the herd, village or farm during one visit are obvious. For this reason, cluster sampling is often the method of choice in epidemiological studies in Africa.

An alternative method of cluster sampling is to define the target population as all the livestock of a particular type within a region demarcated by well defined geographical boundaries. An areal sampling method is then used whereby the region is divided into small units, with all the animals in each unit being defined as a single cluster. The advantage of this procedure is that the investigator knows how many areal units there are in total, since he has defined them, and this in turn enables him to construct a sample frame easily. The disadvantage is that it may be difficult to find all the animals in a given small area, or even to be sure to which areal unit a particular animal belongs.

Cluster sampling has some advantages and disadvantages when compared with simple random sampling. These are discussed in detail in the next chapter but it may be useful to include a brief summary here.

The first advantage of cluster sampling is one of a saving in travel costs. Much less travelling is involved in sampling animals on a cluster basis than if animals are selected at random from a target population. Provided that the complete collection of animals in each cluster is included in the sample, it is not too difficult to calculate an estimate of the variable being investigated and the corresponding standard error. (It is not very difficult even if only a subset is used).

However, since the variation in disease prevalence is likely to be greater between clusters than within clusters, examining animals within clusters will give less information than examining animals from different clusters. This is particularly so in the case of infectious diseases. The more infectious the disease, the more likely it is that in any particular cluster of animals either none or most of the animals will be infected. Because of this, cluster sampling will almost always increase the standard error - sometimes very considerably - and hence the uncertainty involved in the estimation of the particular variable being considered.

One implication of this is that the minimum number of cases required for a reliable estimate of disease prevalence or incidence in the target population as a whole will be several times larger than that required in simple random sampling. The sample size in a cluster sample has to be correspondingly larger, therefore, to produce an estimate of the same reliability. If, as a result, the procedures for measuring a particular variable become time consuming and/or costly, the time and money spent may outweigh the benefits of reduced travel costs and increased administrative convenience gained by cluster sampling.
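The loss of precision under cluster sampling can be illustrated by a small simulation. The population below is entirely hypothetical (100 herds of 50 animals, with infection concentrated in 20 herds), and the spread of repeated estimates is used as a stand-in for the standard error; it is a sketch of the principle, not a method of analysis.

```python
import random
import statistics

rng = random.Random(42)

# Hypothetical population: in 20 herds about 80% of animals are infected,
# the remaining 80 herds are free of the disease.
herds = []
for h in range(100):
    within_herd_prevalence = 0.8 if h < 20 else 0.0
    herds.append([1 if rng.random() < within_herd_prevalence else 0 for _ in range(50)])
animals = [status for herd in herds for status in herd]

def simple_random_estimate(n_animals):
    """Prevalence estimated from animals drawn individually at random."""
    return sum(rng.sample(animals, n_animals)) / n_animals

def cluster_estimate(n_herds):
    """Prevalence estimated from all animals in randomly chosen herds."""
    chosen = rng.sample(herds, n_herds)
    return sum(sum(herd) for herd in chosen) / sum(len(herd) for herd in chosen)

# Both designs examine 500 animals per survey; each survey is repeated 500 times.
srs_estimates = [simple_random_estimate(500) for _ in range(500)]
cluster_estimates = [cluster_estimate(10) for _ in range(500)]

se_srs = statistics.stdev(srs_estimates)
se_cluster = statistics.stdev(cluster_estimates)
print(f"Spread of estimates, simple random sampling: {se_srs:.3f}")
print(f"Spread of estimates, cluster sampling:       {se_cluster:.3f}")
```

With infection this strongly clustered, the spread of the cluster estimates is several times that of the simple random estimates, even though the same number of animals is examined.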

4.3.3 Systematic sampling

Systematic sampling involves sampling a population systematically, i.e. if a 1/n sample is required, every nth unit in that population is sampled. For example, if a 10% (1/10) sample is required, every 10th unit in the population is sampled. If a 5% (1/20) sample is required, every 20th unit in the population is sampled.

The main advantage of systematic sampling is that it is easier to do than random sampling, particularly if the sample frame is large. It also enables sampling a population whose exact size is not known. This is impossible in random sampling. Thus systematic sampling is used to sample such events as births or deaths, whose total number cannot be known before the study begins, or livestock populations at abattoirs or dips where, again, the population size may not be determinable at the outset.

The main disadvantage of systematic sampling is that if the sample units are distributed in the sample frame or in the population periodically, and this periodicity coincides with the sampling interval, the sample estimate may be very misleading. Estimating the standard error is thus more difficult and depends on making the assumption that there is no periodicity in the data.
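A minimal sketch of systematic sampling, and of the periodicity problem, is given below. Choosing the starting point at random within the first interval is our assumption, as are the function name and the artificial data; because the function only walks through the units in order, it also works when the total number of units is not known in advance.

```python
import random
from itertools import islice

def systematic_sample(units, interval, start=None, rng=None):
    """Take every `interval`-th unit from any sequence or stream of units."""
    if start is None:
        rng = rng or random.Random()
        start = rng.randrange(interval)     # random position within the first interval
    return islice(units, start, None, interval)

# A 10% (1/10) sample of 200 animals passing through a dip.
sample = list(systematic_sample(range(1, 201), interval=10, rng=random.Random(3)))
print(sample)

# Periodicity: artificial data in which every 10th animal is diseased
# (true prevalence 10%) and the sampling interval is also 10.
status = [1 if i % 10 == 0 else 0 for i in range(200)]
estimates = []
for start in range(10):
    drawn = list(systematic_sample(status, interval=10, start=start))
    estimates.append(sum(drawn) / len(drawn))
print(estimates)   # one value per possible starting point
```

In the periodic case every possible starting point gives an estimate of either 0% or 100%, never the true 10%.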

4.3.4 Purposive selection

Purposive selection involves the deliberate selection of certain sample units for some reason or other. The reason may often be that they are regarded as being "typical" of the population being sampled. For example, a herd or series of herds may be selected because they are representative of a certain production system. Purposive selection is also used to select particular sample units for a particular purpose e.g. high-risk sentinel herds along a national or geographic boundary or along a stock route.

The main advantage of purposive selection is the relative ease with which sample units can be selected. Its main disadvantage is that sample units are frequently selected not because they are representative of a particular situation but because they are the most convenient to sample. Even if the sample units are selected as being representative of a general population or situation, they often tend to reflect the opinions of the individual selecting them as to what he or she considers to be representative, rather than the actual case. In addition, if the samples are selected on the basis of being typical of the average situation, they only represent those units close to the population mean and tell one little about the variation in the population as a whole.

In spite of these drawbacks, purposive selection may in certain instances be the only method available. If there are difficulties of communication, sample units may have to be selected purposively on the basis of their accessibility. Alternatively, if the measurement procedures are long or complicated, involve some form of damage to an animal or upset local beliefs or prejudices, e.g. when taking blood or biopsies, a sample may have to be purposively selected on the basis of the livestock owner's willingness to cooperate.

4.3.5 Stratification

This involves treating the population to be sampled as a series of defined sub-populations or strata. Suppose, for example, that we wished to sample a population of 4000 goat flocks in order to estimate the prevalence of a particular disease in an area, and that this population consisted of

200 large-sized flocks containing 51 animals or more;

800 medium-sized flocks containing between 20 and 50 animals; and

3000 small-sized flocks containing 19 animals or less.

If we took a 1% random sample of all flocks, we might find that this would give us a sample consisting of, say, 1 large flock, 9 medium-sized flocks and 30 small flocks. Suppose, however, that one of the determinants we were interested in was the influence of flock size on the prevalence of the disease. We would obviously want to know more about the larger flocks than our present system of sampling would tell us. We could, therefore, divide the population to be sampled into strata according to flock size, and sample each stratum in turn.

We could also take larger samples from those strata that we are particularly interested in and smaller from those that we are not. For example, we might decide to take a 5% random sample from the large-flock stratum, a 2% sample from the medium-flock stratum and a 0.5% sample from the small-flock stratum. This might give us 10 large flocks, 16 medium flocks and 15 small flocks. Note that the actual sample size has increased from 40 to 41 only, although if we were cluster sampling more animals would be involved. This technique is known as stratification with a variable sampling fraction, and its usefulness lies in that it allows us to concentrate the facilities at our disposal on those sections of the population that are of particular interest to us.
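The allocation in this example can be reproduced with a short sketch. The stratum sizes and sampling fractions are those given above; the function name is hypothetical, and flock numbers stand in for a real sample frame of flocks within each stratum.

```python
import random

# Stratum name -> (number of flocks in stratum, sampling fraction)
strata = {
    "large flocks": (200, 0.05),
    "medium flocks": (800, 0.02),
    "small flocks": (3000, 0.005),
}

def stratified_sample(strata, rng=None):
    """Draw a simple random sample of flocks separately within each stratum."""
    rng = rng or random.Random()
    sample = {}
    for name, (n_flocks, fraction) in strata.items():
        n_sampled = round(n_flocks * fraction)
        # Flocks are numbered 1..n_flocks within the stratum.
        sample[name] = rng.sample(range(1, n_flocks + 1), n_sampled)
    return sample

sample = stratified_sample(strata, rng=random.Random(5))
for name, flocks in sample.items():
    print(f"{name}: {len(flocks)} flocks sampled")
print("total:", sum(len(flocks) for flocks in sample.values()))
```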

Many different systems of stratification are possible, depending on the purpose of the study being undertaken. Common variables for stratification include area, production system, herd size, age, breed and sex.

4.3.6 Paired samples

Variations in the sample groups due to host and management characteristics can sometimes be overcome by pairing individuals in the different sample groups according to common characteristics (age, breed, sex, system of management, numbers of parturitions, stage of lactation etc) and then analysing the paired samples (see Chapter 5). This technique is useful in that it often greatly increases the precision of the study.

4.3.7 Sampling with and without replacement

There are essentially two different options for selecting clusters. We may select them in such a way that each cluster has an equal probability of being selected, or that some clusters have a higher probability of being selected than others.

If the first option is chosen, the natural method of selection is simple random sampling. If, however, the clusters have different probabilities of being selected, it then becomes rather difficult to devise a sampling method which allows the clusters to be chosen with the intended probability. In addition, the correct method to calculate unbiased estimates of the standard errors of any estimates which include "between-cluster" variability is rather complicated and requires a powerful computer with a special program. If such resources are not available, it will be advisable to select clusters with replacement i.e. choose from the complete set of clusters without discarding any previously selected. This will mean that sometimes the same cluster will appear more than once in the sample, though this will happen rarely if the total number of clusters is large compared to the sample being selected. (The interested reader should consult Chapters 9 and 10 in Cochran (1977) for further details).
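The two options can be contrasted in a few lines using Python's standard `random` module. The village names and herd counts are hypothetical, and making the selection probability proportional to the number of herds is only one possible choice of unequal probabilities, used here as an assumption.

```python
import random

rng = random.Random(11)

# Hypothetical clusters: village name -> number of herds in the village.
clusters = {"village A": 120, "village B": 40, "village C": 300,
            "village D": 75, "village E": 15}
names = list(clusters)

# Equal probability, without replacement: a cluster cannot be chosen twice.
without_replacement = rng.sample(names, 3)

# Unequal probability, with replacement: every draw is made from the complete
# set of clusters, so the same cluster may appear more than once.
weights = [clusters[name] for name in names]
with_replacement = rng.choices(names, weights=weights, k=3)

print("without replacement:", without_replacement)
print("with replacement:   ", with_replacement)
```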

There are many variations and combinations of sampling possible even within one particular study. Detailed descriptions of all the possible permutations involved are beyond the scope of this manual, and the ensuing discussions in this and the next chapter will focus on simple random and cluster sampling.

4.4 Sample sizes

4.4.1 Sample sizes for estimating disease prevalence in large populations
4.4.2 Sample sizes needed to detect the presence of a disease in a population

This section is concerned with estimating sample sizes for cross-sectional studies. The approach used will depend on whether we are measuring a categorical or a numerical variable. Categorical (discrete) variables are probably more frequent in epidemiology, particularly dichotomies, and we shall illustrate the problem of estimating sample size for such variables in the following subsections. Techniques available for estimating sample sizes in cross-sectional studies involving numerical (continuous) variables, and in cohort and case-control studies, are described in Chapter 5.

4.4.1 Sample sizes for estimating disease prevalence in large populations

Suppose that we wish to carry out a survey to investigate the distribution of disease in a large animal population. How big a sample should we aim for? Since the cost of finding and examining each animal (i.e. the unit sampling cost) is likely to be quite high, the total sampling cost, and hence the sample size, will be an important determinant of the total cost of the survey. So how do we decide how many animals we need to examine? The answer to this question largely depends on four subsidiary questions:

- To what degree of accuracy do we require the results?
- What sampling method have we used?
- What is the size of the smallest subgroup in the population for which we require accurate answers?
- What is the actual variability in the population surveyed of the variable we wish to measure?

Clearly the last of these questions will cause the greatest problem, since if we knew the exact answer to this we would have no need to carry out the survey in the first place! Let us now consider these questions one by one.

Suppose that a disease is distributed in a population with a prevalence of P, and that we have decided to estimate P by means of a survey using a particular sampling method. We carry out the survey and obtain an estimated prevalence p. If we repeated the whole survey a second time using the same sampling method and the same sample size, we would get a different estimate p of the prevalence P. If it were possible to go on repeating the survey many times with the same sample size, we would get a whole series of estimates from which we could draw a histogram. This would resemble Figure 7 if n, the sample size, was large.

Figure 7. Distribution of different estimates of disease prevalence in a large-sized sample.

It can be shown that the average of all the estimates p1, p2 etc. will be almost exactly the true prevalence P, and that 68% of the estimates will differ from the true value by less than the quantity called the standard error of the estimated prevalence (SE), where:

SE = √(PQ/n)

P = true prevalence (%),
Q = 100 - P, and
n = size of the sample.

Similarly, 95% of the estimates would differ from the true value by less than twice the standard error, and 99% of the estimates would be within three standard errors of the true value.
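As a minimal sketch, assuming the usual simple-random-sampling formula SE = √(PQ/n) with P and Q in percent, the standard error and the ranges expected to contain 68%, 95% and 99% of repeated estimates can be computed as follows. The function names are ours and purely illustrative.

```python
import math

def standard_error(P, n):
    """Standard error (in percentage points) of an estimated prevalence.

    P is the true prevalence in percent and n is the sample size.
    """
    Q = 100.0 - P
    return math.sqrt(P * Q / n)

def coverage_interval(P, n, multiple):
    """Range P +/- multiple x SE.

    With multiple = 1, 2 or 3 the range is expected to contain about
    68%, 95% or 99% of repeated estimates respectively.
    """
    se = standard_error(P, n)
    return (P - multiple * se, P + multiple * se)

se = standard_error(20, 400)               # 2.0 percentage points
low, high = coverage_interval(20, 400, 2)  # 16.0 to 24.0
print(se, low, high)
```

With a true prevalence of 20% and a sample of 400 animals, about 95% of repeated surveys would therefore be expected to give an estimate between 16% and 24%.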

This suggests a method for stating how precise we would like the results to be. We might, for example, say that we would like to be 95% sure of being within 1% of the correct, true prevalence P(%). This implies that we want twice the standard error to be no greater than 1%, or that the standard error should not be greater than 0.5%. This means that it is always possible to fix a given accuracy level by choosing the sample size so that the standard error of the estimate is controlled.

Requirements for precision can be stated in terms of absolute or relative accuracy. If we talk in terms of absolute accuracy we might say that "we want the error in the prevalence estimate to be no more than 1%" i.e. p = P ± 1%. For example, if the true prevalence is 3%, we will be requiring an estimate that lies in the range of 2 to 4%. If the true prevalence is 20%, we require the estimated value to fall between 19 and 21%.

If we want to state our requirements in terms of relative accuracy, we might say that the estimated value must lie within, say, 10% of the true value. For example, if the true prevalence is 20%, this would mean obtaining an estimate in the range of 18 to 22%, since 2 is 10% of 20. If the true value was 5%, we would be demanding an estimate between 4.5 and 5.5%, since 0.5 is 10% of 5. In principle, there is nothing wrong in stating accuracy requirements in this way, but high relative accuracy will not be possible when true prevalence is low (see Table 9).

Table 8 shows the sample sizes required for estimating prevalences at different levels of absolute accuracy from large populations. Note that no sample size is given unless the standard error is smaller than the true prevalence. The entries have been calculated using the formula:

n = P(100-P)/SE²

If the sample size is a large proportion of the population, say greater than 10%, then it is better to use the more exact formula:

where N is the total size of the population.
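As a rough sketch, the large-population formula and one common finite-population adjustment can be written as a small Python function. The adjustment n = n0/(1 + n0/N) is an assumption on our part (it is the standard correction for simple random sampling without replacement) and may differ in detail from the exact formula intended above. Results are rounded to the nearest whole animal, which is how the entries in Table 8 appear to have been rounded.

```python
def sample_size(P, se, N=None):
    """Sample size needed to hold the standard error of an estimated
    prevalence at `se` (P and se both in percent).

    With N=None the large-population formula n = P(100-P)/SE^2 is used.
    If the population size N is given, an assumed finite-population
    correction n = n0 / (1 + n0/N) is applied.
    """
    n0 = P * (100.0 - P) / se ** 2
    if N is not None:
        n0 = n0 / (1.0 + n0 / N)
    return int(round(n0))

print(sample_size(2, 0.5))             # 784
print(sample_size(8, 0.5))             # 2944
print(sample_size(50, 0.5, N=10000))   # 5000
```

The last line illustrates why the correction matters: without it, a survey of a population of 10 000 animals would appear to require examining every one of them.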

Table 8. Sample size (n) for controlling the standard error (SE) of estimated prevalence for different values of the true prevalence (P) in large populations.

P (%)                        SE (%)
           0.1     0.5     1.0     1.5     2.0     2.5
 0.5      4975       -       -       -       -       -
 1.0      9900     396       -       -       -       -
 1.5     14775     591     148       -       -       -
 2.0     19600     784     196      87       -       -
 2.5     24375     975     244     108      61       -
 3.0     29100    1164     291     129      73      47
 3.5     33775    1351     338     150      84      54
 4.0     38400    1536     384     171      96      61
 4.5     42975    1719     430     191     107      69
 5.0     47500    1900     475     211     119      76
 6.0     56400    2256     564     251     141      90
 7.0     65100    2604     651     289     162     104
 8.0     73600    2944     736     327     184     118
 9.0     81900    3276     819     364     205     131
10.0     90000    3600     900     400     225     144
20.0    160000    6400    1600     711     400     256
30.0    210000    8400    2100     933     525     336
40.0    240000    9600    2400    1067     600     384
50.0    250000   10000    2500    1111     625     400

Example 1: Suppose we wish to be 95% sure that a survey will give an estimated prevalence within 1% of the true value in absolute terms. Two standard errors must then be no greater than 1%, i.e. 2 SE ≤ 1% or SE ≤ 0.5%. Table 8 gives the sample sizes required for different prevalence rates and standard errors. However, since the sample size we are looking for depends on the true prevalence, whose value we do not know (that being the reason for the survey), this does not seem to help much. It will be rare, however, to have absolutely no idea what value of the true prevalence to expect. We will usually be able to make an estimate and say, for example, that "we believe the prevalence is not greater than 8%". If we then choose the sample size on that basis, it might turn out to be much too big, since the correct sample size to measure a prevalence of, say, around 2% to the desired accuracy is 784, while the sample size corresponding to a prevalence of around 8% is 2944. However, there is nothing much we can do about this. Lack of prior knowledge will always result in a need for liberal (i.e. overlarge) sample sizes and hence higher costs.

If we do not have the slightest idea what prevalence to expect, we can use the sample size corresponding to the least favourable case (P = 50%) given in Table 8, though if we are demanding a high degree of accuracy the indicated sample size (10 000) may be unrealistically large.

Example 2: We might suspect that the true prevalence is of the order of 20% and would like to be 99% sure that the estimated prevalence is within 2% of the true value. We can be 99% certain that the true value lies within three standard errors of the estimate. Hence, to fulfil the required conditions we must choose the sample size in such a way that 3 SE ≤ 2%, or SE ≤ 2/3 = 0.7% approximately. From Table 8 we see that for SE = 0.5% and P = 20%, we need a sample of 6400. For SE = 0.7%, it seems, we will need around 4000. (In fact the exact sample size as calculated from the formula n = P(100-P)/SE² is only 3265.)
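The arithmetic behind the two examples can be checked with a short Python sketch. It assumes the rule of thumb used in the text (one, two and three standard errors for 68%, 95% and 99% confidence) and the large-population formula n = P(100-P)/SE²; the function and variable names are illustrative only.

```python
# Multiples of the standard error quoted in the text for each confidence level.
SE_MULTIPLE = {68: 1, 95: 2, 99: 3}

def required_se(margin, confidence):
    """Largest SE (in percent) that keeps the estimate within `margin`
    percentage points of the true value with the given confidence."""
    return margin / SE_MULTIPLE[confidence]

def n_for(P, se):
    """Large-population sample size n = P(100-P)/SE^2 (not rounded)."""
    return P * (100 - P) / se ** 2

# Example 1: 95% sure of being within 1%, prevalence believed to be at most 8%.
se1 = required_se(1, 95)        # 0.5
n1 = n_for(8, se1)              # 2944.0

# Example 2: 99% sure of being within 2%, prevalence around 20%.
se2 = required_se(2, 99)        # 0.666...
n2_rounded_se = n_for(20, 0.7)  # about 3265, the figure quoted in the text
n2_exact_se = n_for(20, se2)    # about 3600 when SE is not rounded up to 0.7

print(n1, round(n2_rounded_se), round(n2_exact_se))
```

Note that rounding the required standard error up from 0.67% to 0.7% lowers the calculated sample size somewhat, so the unrounded figure is the safer one if the stated precision must be met in full.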

Table 9 gives sample sizes required to estimate prevalence in a large population when the desired precision is stated in terms of relative accuracy. In this case the sample sizes are such as to ensure that the standard error will not be greater than the stated percentage of the true prevalence. The entries in the table have been calculated using the formula:

n = 10 000 × (100-P)/(P × R²)

where R is the standard error expressed as a percentage of P.

If the sample size required represents a very high proportion of, or is greater than, the sampled population itself, the more accurate formula

should be used to calculate the sample size. (N is the size of the population being sampled).
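For the large-population case, the relative-accuracy calculation can be sketched in Python by converting the relative requirement into an absolute standard error and reusing n = P(100-P)/SE². This is an illustration under that assumption; the function name is ours and results are rounded to the nearest whole animal.

```python
def sample_size_relative(P, rel_se):
    """Sample size such that the SE is `rel_se` percent of P.

    P is the true prevalence in percent; rel_se is the standard error
    expressed as a percentage of P (e.g. 10 means SE = 0.1 x P).
    """
    se = P * rel_se / 100.0          # absolute SE in percentage points
    return int(round(P * (100.0 - P) / se ** 2))

print(sample_size_relative(0.5, 1))   # 1990000
print(sample_size_relative(10, 1))    # 90000
print(sample_size_relative(20, 5))    # 1600
print(sample_size_relative(50, 10))   # 100
```

The first line shows how quickly the required sample grows when high relative accuracy is demanded for a rare disease.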

Table 9. Sample size (n) to control the standard error (SE) of estimated prevalence relative to the true value of the prevalence.

P (%)       SE as a percentage of P
               1.0        5.0       10.0
 0.5     1 990 000     79 600     19 900
 1.0       990 000     39 600      9 900
 1.5       656 667     26 267      6 567
 2.0       490 000     19 600      4 900
 2.5       390 000     15 600      3 900
 3.0       323 333     12 933      3 233
 3.5       275 714     11 029      2 757
 4.0       240 000      9 600      2 400
 4.5       212 222      8 489      2 122
 5.0       190 000      7 600      1 900
 6.0       156 667      6 267      1 567
 7.0       132 857      5 314      1 329
 8.0       115 000      4 600      1 150
 9.0       101 111      4 044      1 011
10.0        90 000      3 600        900
20.0        40 000      1 600        400
30.0        23 333        933        233
40.0        15 000        600        150
50.0        10 000        400        100

The sample sizes calculated in the two different exercises were obtained assuming that the sample was to be chosen by simple random sampling i.e. that animals were sampled individually. If we use a different sampling method, these sample sizes will no longer be appropriate. For example in cluster sampling, which increases the variability of any estimates made, we should assume that, to be on the safe side, we will need to examine four times as many animals as for a simple random sample.

If we require an accurate estimate of prevalence not only for the complete population but also within well defined subgroups, as in a stratified survey, we need to choose a sufficiently large sample size within each subgroup. Suppose, for instance, that the population is distributed in six regions. Then, in our first example, if we need to estimate a true prevalence of 2% with an SE of 0.5% for each region, we would need a sample size of 784 in each region, assuming that we take simple random samples within the regions.
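The effect of subgroups and of the sampling method on the total number of animals to be examined is simple arithmetic, sketched below. The factor of four for cluster sampling is the safety margin suggested in the text, not a calculated design effect, and the function name is hypothetical.

```python
def total_animals(n_per_group, n_groups=1, cluster_factor=1):
    """Total animals to examine.

    n_per_group    : simple-random-sample size needed within one subgroup
    n_groups       : number of subgroups (strata) needing separate estimates
    cluster_factor : multiplier applied when cluster sampling is used
    """
    return n_per_group * n_groups * cluster_factor

# Six regions, 784 animals per region by simple random sampling.
print(total_animals(784, n_groups=6))                    # 4704
# The same survey if cluster sampling is used within each region.
print(total_animals(784, n_groups=6, cluster_factor=4))  # 18816
```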

4.4.2 Sample sizes needed to detect the presence of a disease in a population

It may sometimes be important to discover whether a disease is at all present in a population. This population may be a single herd or a much larger group in, say, a well defined geographical region. Here the problem is no longer one of having a sample large enough to give a good estimate of true prevalence, but rather of knowing the minimum sample size required to find at least one animal with the disease. This will clearly need a much smaller sample than would be required for an accurate estimation of prevalence. Again the answer will depend on the true, but unknown, value of the prevalence of the disease in the target population. For small populations, e.g. individual herds, the answer will depend on the size of the population (Table 10). For populations of over 10 000, the sample sizes in the last column of the table will be approximately correct.

The values in Table 10 were calculated from the formula:

Probability of detection =
1 - [(N-M)/N] × [(N-M-1)/(N-1)] × ... × [(N-M-n+1)/(N-n+1)] where:

N = size of population,
M = total number of infected animals, and
n = sample size.

Where the indicated prevalence did not correspond to a whole number of animals, the value was rounded up to the next whole number (e.g. 3% of 75 = 2.25 animals; this was rounded up to 3). The sample sizes indicated in Table 10 are appropriate only for simple random sampling and would be much larger if cluster sampling was used. The determination of sample sizes required to estimate continuous variables is discussed in Section 5.3.2.
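The detection formula lends itself to direct computation. The following Python sketch evaluates the probability of detection for simple random sampling without replacement and searches for the smallest sample size that reaches a target probability. The function names are ours, and the rounding-up of the number of infected animals follows the convention described above.

```python
import math

def infected_count(N, prevalence_pct):
    """Number of infected animals, rounded up to a whole animal."""
    return math.ceil(N * prevalence_pct / 100)

def detection_probability(N, M, n):
    """Probability that a sample of n animals, drawn without replacement
    from N animals of which M are infected, contains at least one
    infected animal (requires n <= N)."""
    p_none = 1.0
    for i in range(n):
        p_none *= (N - M - i) / (N - i)
    return 1.0 - p_none

def min_sample_size(N, M, target):
    """Smallest n whose detection probability is at least `target`.
    Returns None if no sample can reach the target (e.g. M = 0)."""
    p_none = 1.0
    for n in range(1, N + 1):
        p_none *= (N - M - (n - 1)) / (N - (n - 1))
        if 1.0 - p_none >= target:
            return n
    return None

print(infected_count(75, 3))            # 3 (2.25 rounded up)
print(min_sample_size(75, 1, 0.95))     # 72
print(min_sample_size(50, 5, 0.90))     # 18
```

Because the calculation is cheap, it can be repeated for the actual herd size and suspected prevalence rather than read from the nearest tabulated value.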

4.5 Methods for obtaining data in epidemiological studies

4.5.1 Interviews and questionnaires
4.5.2 Procedures involving measurements
4.5.3 Errors due to observations and measurements

In epidemiological studies we can obtain data on a particular variable in two main ways. We can actually measure the variable or we can ask individuals concerned with livestock to give an estimate of the variable in the livestock populations with which they are concerned. As in estimating sample size, the approach adopted will largely depend on the purposes of the study. If the objective of the study is to obtain broad estimates of the relative importance of various diseases within a livestock population, the degree of precision need not be great. Consequently, the sample size may be small and the quality of the data generated does not need to be high. If, on the other hand, we are interested in studying the epidemiology of a particular disease in detail, accurate estimates of prevalence or incidence may be needed, the sample size will have to be large, and the data generated must be of high quality.

4.5.1 Interviews and questionnaires

Interviews and questionnaires are frequently used in epidemiological studies and can be a valuable means of generating data. In countries with good postal services, data can be collected cheaply and quickly by circulating questionnaires. Because of literacy and communications difficulties, this approach is of little use when one is soliciting information from traditional livestock owners, but it can be helpful in obtaining information from extension officers, veterinarians and other individuals concerned with traditional livestock production. It should be noted, however, that questionnaires involving a considerable effort in filling in are likely to have a high non-return rate, and the sample size may have to be adjusted accordingly. Furthermore, high non-return rates can introduce substantial bias in the estimates calculated from the returns.

Epidemiological studies often involve visiting the sample units and collecting the relevant data by questioning the owners and/or carrying out the appropriate measurement procedure on the animals concerned. Designing questionnaire formats and interview protocols can be a long and difficult process, particularly where traditional livestock producers are concerned. Remember that questioning a traditional livestock producer about the numbers or performance of his animals is akin to questioning other individuals about their bank accounts! Considerable time and patience are needed to obtain the trust and cooperation of such individuals. Wherever possible, a trusted intermediary should be employed. Nevertheless, as most traditional livestock producers live in close proximity to their animals and normally come from sections of the population with a vast experience of keeping livestock under African conditions, they are obviously an extremely useful and valuable source of information.

Table 10. Sample size as a function of population size, prevalence and minimum probability of detection.

P (%)                       Population size
          50      75     100     300     500    1000    5000  10 000

a) 90% probability of detection

 0.5      50      75     100     271     342     369     439     449
   1      45      68      91     161     184     205     224     227
   2      45      51      69      95     102     108     113     114
   3      34      40      54      67      71      73      76      76
   4      34      40      44      52      54      55      57      57
   5      27      33      37      42      43      44      45      45
   6      27      27      32      35      36      37      38      38
   7      22      24      28      31      31      32      32      32
   8      22      24      25      27      27      28      28      28
   9      18      21      20      22      22      22      22      22
  10      18      18      20      22      22      22      22      22

b) 95% probability of detection

 0.5      50      72     100     286     388     450     564     581
   1      48      72      96     189     225     258     290     294
   2      48      58      78     117     129     138     147     148
   3      39      47      63      84      90      94      98      98
   4      39      47      52      66      69      71      73      74
   5      31      39      45      54      56      57      59      59
   6      31      33      39      45      47      48      49      49
   7      26      29      34      39      40      41      42      42
   8      26      29      31      34      35      36      36      36
   9      22      26      28      31      31      32      32      32
  10      22      23      25      28      28      29      29      29

c) 99% probability of detection

 0.5      50      75     100     297     450     601     840     878
   1      50      75      99     235     300     368     438     448
   2      49      68      90     160     183     204     223     226
   3      48      59      78     119     131     141     149     151
   4      45      59      68      94     101     107     112     113
   5      39      51      59      78      83      86      89      90
   6      39      44      53      66      70      72      74      75
   7      34      39      47      58      60      62      64      64
   8      34      39      43      51      53      54      55      56
   9      29      35      39      45      47      48      49      49
  10      29      32      36      41      42      43      44      44

The success or failure of this type of epidemiological study depends as much on the design of recording forms as it does on the overall survey, the actual field work and the analysis. The latter will be impossible unless the material recorded is intelligible. Much thought should therefore be given to the design of forms, and their efficiency should be tested in pilot trials. The forms should be orderly, with related items grouped together (calf number, date of birth, place of birth), and convenient to use (the form should fit on a clip board). Technical words not likely to be understood by field staff should be avoided, as should any ambiguities in the terms used. The form should have a title and provisions for the identification of both the officer completing the form and the data source. It should also have a reference number which relates to the survey design (e.g. 06/04/93 might indicate the sixth visit to farm 93 in stratum 4). Completed forms should be checked for errors as soon as possible, so that appropriate corrections can be made while the memory of the interviewer is still fresh and the sample unit accessible.

Some additional points to bear in mind in the design of interviews and questionnaires include:

i) Explain the purposes of the interview to the interviewee. People are generally much more cooperative when they know why they are being questioned.

ii) Being normally very polite, livestock owners tend to answer questions with the answer that they think the interviewer wishes to hear, rather than giving the correct answer. Leading questions, which give the interviewee a clue as to the answer expected or desired, should therefore be avoided.

iii) Human memories are short, and there is a tendency to concentrate events into a more limited time period than was actually the case. So if livestock owners are asked about events that occurred in their animals over the last year, they tend to report events that happened over the last 2 or 3 years. This obviously exaggerates data on disease frequencies.

iv) Do not make interviews or questionnaires too long, or else the interviewee will get bored and the quality of his answers will suffer. To avoid this, the most important questions should be asked at the beginning.

v) Questions requiring subjective answers generate data that are extremely difficult to analyse. They should be avoided whenever possible, even though they may give valuable insights.

vi) Long, complicated questions tend to lead to misunderstanding and wrong answers.

4.5.2 Procedures involving measurements

If a high degree of precision is required in the study, the variable being investigated will normally have to be measured in some way. This may involve taking a biological specimen from an animal for a diagnostic test, weighing the animal, measuring milk yield, or measuring climatic variables such as rainfall, temperature etc.

Before measuring begins, it is important to understand exactly what is being measured and what are the advantages and disadvantages of the method used. This applies particularly to diagnostic tests. If the procedure is complicated or involves complex equipment, the person using it must master all its aspects before the survey begins, to ensure that an acceptable level of consistency in the measurements is being obtained. The equipment used during a field investigation should be calibrated and checked for accuracy before the start of each series of measurements and should be regularly maintained.

4.5.3 Errors due to observations and measurements

Earlier in this chapter we discussed statistical techniques available to calculate the size of a sample that would give a population estimate with the precision required if:

- The study is performed exactly as it was originally designed; and
- All the statistical assumptions are fulfilled.

However, this does not take into account errors due to variations between observers and those inherent in the measurement procedures used. These errors may, in fact, be more important than the errors generated by faulty sampling procedures.

Errors due to variations between observers

Many epidemiological studies are conducted with the help of enumerators, usually field services staff, who visit the sample units and carry out the procedures required. If interviews are being conducted by such staff, answers may be received which could be subject to different interpretations by different individuals. To keep errors to a minimum, strict control should be maintained over the interview protocols and the interviewers monitored from time to time.

Variations between different observers may occur when some degree of subjective judgement is involved, as may be the case in the diagnosis of a disease. Criteria need to be established by which a diagnosis is arrived at and adhered to by all those engaged in the study. Such considerations are of particular importance in retrospective studies.

An additional problem frequently encountered is that of bias on the part of the observer. If an individual wishes to prove a particular point he may, quite unintentionally, be biased in recording his observations. This problem can be avoided by the use of a "blind" technique whereby the observer is kept ignorant of the distribution of the determinant in the groups being studied, merely being required to record a set of observations about those groups.

Errors due to measurements

Errors inherent in the procedures by which a variable is being measured are common in epidemiological studies. For example, if two weighing scales are being used in a study, one scale may consistently give a higher reading than the other. Obviously, careful checking and monitoring of such apparatus before and during the study will reduce errors of this kind.

Further errors may occur when diagnostic tests are being used to determine the presence or absence of an infectious agent. The terms used to describe the reliability of diagnostic procedures are:

Repeatability, which is the ability of a diagnostic test to give consistent results.

Accuracy, which is the ability of a test to give a true measure of the variable being tested. Accuracy is normally measured by two criteria:

- Sensitivity, which is the capability of that test to identify an individual as being infected with a disease agent when that individual is truly infected with the disease agent in question. In other words, it gives the proportion of infected individuals in the sample that produce a positive test result.

- Specificity, which is the capability of that test to identify an individual as being uninfected with a disease agent when that individual is truly not infected with the disease agent in question. In other words, it gives the proportion of uninfected individuals in the sample that produce a negative test result.

These two terms are illustrated in Table 11.

Table 11. Estimated and true prevalences of a disease agent illustrating the terms specificity and sensitivity.

                       Number of individuals   Number of individuals   Total
                       infected                not infected
Positive test result   a                       b                       a+b
Negative test result   c                       d                       c+d
Total                  a+c                     b+d                     N

Notes: The estimated prevalence is (a+b)/N; the true prevalence is (a+c)/N.

The sensitivity of the test is a/(a+c) and its specificity is d/(b+d)
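The quantities defined in the notes to Table 11 can be computed directly from the four cell counts. The sketch below uses hypothetical counts chosen only for illustration; the function name and dictionary keys are ours.

```python
def two_by_two_measures(a, b, c, d):
    """Measures derived from a 2 x 2 table of test result against true status.

    a = infected, test positive      b = not infected, test positive
    c = infected, test negative      d = not infected, test negative
    """
    n = a + b + c + d
    return {
        "sensitivity": a / (a + c),
        "specificity": d / (b + d),
        "estimated_prevalence": (a + b) / n,
        "true_prevalence": (a + c) / n,
    }

# Hypothetical counts for 500 animals.
measures = two_by_two_measures(a=45, b=20, c=5, d=430)
for name, value in measures.items():
    print(f"{name}: {value:.3f}")
```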

Example 1: Suppose that we tested a sample of 1000 animals for the presence of a disease agent using a test of 90% sensitivity and 90% specificity. The results of the testing procedure are shown in Table 12.

Table 12 is somewhat artificial in that it gives the column totals, which we are trying to estimate. However, if the disease was distributed through the population in this way and we used a test that was 90% sensitive and 90% specific to estimate the extent of this distribution, we would arrive at an estimated prevalence of 180/1000, which would be an overestimate of the true prevalence of 100/1000. Of the 180 animals that the test identified as positive, 90 were, in fact, not infected with the disease, while of the 820 animals that the test identified as negative, 10 were, in fact, infected with the disease.

Table 12. Results of using a diagnostic test of 90% sensitivity and 90% specificity in a sample of 1000 animals in which the true prevalence of infection is 10%.

                       Number of individuals   Number of individuals   Total
                       infected                not infected
Positive test result   90                      90                      180
Negative test result   10                      810                     820
Total                  100                     900                     1000

Example 2: Suppose we used the same diagnostic test on a similar sample of animals but the true prevalence of the infection in the sample was 1%. The results of this test are given in Table 13.

Table 13. Results of using a diagnostic test of 90% sensitivity and 90% specificity in a sample of 1000 animals in which the true prevalence of infection is 1%.

                       Number of individuals   Number of individuals   Total
                       infected                not infected
Positive test result   9                       99                      108
Negative test result   1                       891                     892
Total                  10                      990                     1000

The true prevalence of the infection in this case is 10/1000 = 1%, while the estimated prevalence of infection is 108/1000 = 10.8%. Of the 108 animals that the test diagnosed as positive, 92% (i.e. 99/108) were, in fact, not infected with the disease agent in question. This leads us to another useful statistic, the diagnosibility of a test, which is the proportion of test-positive individuals that are truly infected with the disease agent.

In our first example the diagnosibility was 90/180 = 50% while in the second it was 9/108 = 8.3%. Note that the diagnosibility of a diagnostic test declines as the prevalence of a disease decreases. This means that sensitivity and specificity errors in diagnostic tests produce relatively much greater errors in prevalence estimates of diseases with low true prevalence than would be the case in diseases of high prevalence.
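The dependence of diagnosibility on prevalence can be explored with a short Python sketch that builds the expected 2 × 2 table for any combination of sample size, true prevalence, sensitivity and specificity. It works with expected (average) counts rather than simulated test results, and the function names are illustrative.

```python
def expected_table(n, prevalence, sensitivity, specificity):
    """Expected cell counts (a, b, c, d) of the 2 x 2 table.

    a = true positives, b = false positives,
    c = false negatives, d = true negatives.
    All rates are given as decimals.
    """
    infected = n * prevalence
    uninfected = n - infected
    a = sensitivity * infected
    c = infected - a
    d = specificity * uninfected
    b = uninfected - d
    return a, b, c, d

def diagnosibility(a, b):
    """Proportion of test-positive individuals that are truly infected."""
    return a / (a + b)

for prevalence in (0.10, 0.01):
    a, b, c, d = expected_table(1000, prevalence, 0.90, 0.90)
    print(prevalence, round(a), round(b), round(c), round(d),
          round(diagnosibility(a, b), 3))
```

Running the loop for true prevalences of 10% and 1% reproduces the cell counts of Tables 12 and 13 and diagnosibilities of 0.5 and about 0.083.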

It is obviously desirable to use a test that is as sensitive and specific as possible, so that the numbers of false positives and false negatives in the sample are reduced. The sensitivity and specificity of a test can be determined by administering the test to a number of animals and then comparing its results with the results obtained from a series of detailed diagnostic investigations on the animals concerned. In order for the results to be valid, however, the animals selected for the evaluation must be representative of the population to which the test is to be applied.

Once the sensitivity and specificity of a test are known, a correction factor can be applied to the prevalence estimate to take into account the sensitivity and specificity of the test:

True prevalence = (estimated prevalence + specificity - 1)/(sensitivity + specificity - 1)

where all values are expressed as decimals.
For our example 2 (Table 13):
True prevalence
= (0.108 + 0.90 - 1)/(0.90 + 0.90 - 1)
= 0.008/0.80 = 0.01 or 1%.
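The correction can be wrapped in a small Python function, as in the sketch below. Two details are our own additions rather than part of the method described above: the check that sensitivity + specificity exceeds 1 (otherwise the correction is undefined or meaningless), and the clamping of the result to the range 0 to 1, since sampling variation can push the corrected value slightly outside it.

```python
def corrected_prevalence(estimated, sensitivity, specificity):
    """Correct a test-based prevalence estimate for imperfect
    sensitivity and specificity. All values are decimals."""
    denominator = sensitivity + specificity - 1.0
    if denominator <= 0:
        # Assumption: a test this poor cannot support a correction.
        raise ValueError("sensitivity + specificity must exceed 1")
    value = (estimated + specificity - 1.0) / denominator
    # Assumption: clamp to a valid proportion.
    return min(1.0, max(0.0, value))

print(round(corrected_prevalence(0.108, 0.90, 0.90), 4))  # 0.01
print(round(corrected_prevalence(0.180, 0.90, 0.90), 4))  # 0.1
```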

Note that although we can now correct the prevalence estimate, we still have no idea which of the individual animals are truly negative, falsely negative, truly positive and falsely positive. This problem can occur when diagnostic tests are being used in a test-and-slaughter policy for controlling a particular disease. Such policies are normally only implemented after a vaccination campaign has reduced the disease to a low prevalence, when the diagnosibility of a test is likely to be low. In addition, vaccination itself often has an adverse effect on test sensitivity and specificity. We can see from our second example that if we slaughtered all the test positives, 92% of the animals being slaughtered would not be actually infected with the disease agent.

While it is relatively easy to make a test more sensitive, often by lowering the criteria by which a test result is deemed positive, this normally results in the test becoming less specific. Tests which are highly specific are often complicated, time consuming and, consequently, expensive. As such they can rarely be employed on a large scale.

A way round this problem is to apply two separate and independent testing procedures. Initially, a screening test of high sensitivity is needed to ensure that as many infected animals as possible are detected. Once the initial screening test has been performed, all positive reactors can be reexamined by a second test of high specificity. Since only the positive reactors have to be examined and not the entire sample, this cuts down the cost of using a highly specific test.

Example: Suppose we were attempting to eradicate a disease of 1% prevalence from a population of 10 000 animals by a process of test and slaughter. If we first use a test of high sensitivity (95%) but low specificity (85%), our initial results would be as illustrated in Table 14.

Table 14. Results of a diagnostic test of 95% sensitivity and 85% specificity used to examine a population of 10 000 animals for the presence of a disease with true prevalence of 1%.

                       Number of individuals   Number of individuals   Total
                       infected                not infected
Positive test result   95                      1 485                   1 580
Negative test result   5                       8 415                   8 420
Total                  100                     9 900                   10 000

We then subject the 1580 test-positive animals to a further test of the same sensitivity but a higher specificity (Table 15).

Table 15. Results of a diagnostic test of 95% sensitivity and 98% specificity applied to the 1580 test-positive animals identified in Table 14.

                       Number of individuals   Number of individuals   Total
                       infected                not infected
Positive test result   90                      30                      120
Negative test result   5                       1 455                   1 460
Total                  95                      1 485                   1 580

This test indicates that we would need to slaughter 120 as opposed to 1580 animals. Admittedly, a few false negatives might have slipped through the testing procedure, but it is hoped that these would be picked up on subsequent testing.
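The two-stage procedure can be sketched in Python by applying each test in turn to the animals that were positive at the previous stage. The sketch assumes, as the text does, that the two tests are independent, and it rounds expected counts to whole animals as in Table 15; the function name is ours.

```python
def apply_test(infected, uninfected, sensitivity, specificity):
    """Expected numbers of (true positives, false positives) when a test
    is applied to a group of infected and uninfected animals.
    Counts are rounded to whole animals."""
    true_positives = round(infected * sensitivity)
    false_positives = round(uninfected * (1 - specificity))
    return true_positives, false_positives

# Stage 1: screening test (95% sensitive, 85% specific) on 10 000 animals,
# of which 100 are infected.
tp1, fp1 = apply_test(100, 9900, 0.95, 0.85)
# Stage 2: confirmatory test (95% sensitive, 98% specific) on the reactors only.
tp2, fp2 = apply_test(tp1, fp1, 0.95, 0.98)

print(tp1 + fp1)   # 1580 animals positive to the screening test
print(tp2 + fp2)   # 120 animals positive to both tests
print(100 - tp2)   # 10 infected animals missed over the two stages
```

The last figure makes the cost of the second test explicit: 5 infected animals are missed at each stage, which is why repeat testing of the population remains necessary.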

4.6 Basic considerations in the design of epidemiological investigations

4.6.1 Objectives and hypotheses

In this chapter we have illustrated some of the many problems that can be encountered in the design and implementation of epidemiological studies, and it may be useful at this point to summarise the basic considerations.

4.6.1 Objectives and hypotheses

A good way to approach the planning of a field study is to take the view that we are, in effect, buying information. We must make sure, therefore, that the study produces the information required at the lowest possible cost. We should also ask ourselves if that information can be obtained from other, cheaper sources. The processes involved in such considerations could be schematised as follows:

Figure

The first step is to write out clearly the objectives of the study and the data that will need to be generated in order to attain them. Throughout the entire planning process, constant reference should be made to these objectives in order to ensure that the procedures being planned are of relevance. If it is found that the resources available may not permit the achievement of the original objectives, the objectives may have to be redefined or additional resources found.

Objectives can often be defined by constructing a hypothesis. An epidemiological hypothesis should:

Specify the population to which it refers, i.e. the population about which one wishes to make inferences and therefore sample from. This is referred to as the target population. Sometimes, for practical reasons, the population actually sampled may be smaller than the target population. In such cases the findings of the study will relate to the sampled population, and care must be exercised in extrapolating inferences from the sampled population to the target population.

Frequently, inferences may be required about different groups within the target population. For example, one may want to estimate not only the overall prevalence of a specific disease, but also the prevalences or incidences of the disease in various groups or subsets of the population. To obtain estimates with the precision required, the samples taken from these groups must be large enough, and this will obviously affect the design of the study.

A further problem may occur when defining the actual units to be sampled within a population. If, for example, the sample unit was a calf, at what age exactly does a calf cease being a calf? Alternatively, suppose the sample unit is a herd. What exactly is meant by the term "herd"? If a livestock owner has only one animal, does that constitute a herd? Obviously, the sample unit must be precisely defined and appropriate procedures designed to take care of borderline cases.

Specify the determinant or determinants being considered. Can such disease determinants as "stress", "climate" and "management" be defined accurately? How are these determinants to be quantified and what measurements would be used in their quantification? What are the advantages and disadvantages of these methods of measurement? How accurate are they?

Specify the disease or diseases being considered. The criteria by which an animal is regarded as suffering from a particular disease must be carefully defined. Will the disease be diagnosed on clinical symptoms alone? If so, what clinical symptoms? Are there likely to be problems with differential diagnoses? Will laboratory confirmation be needed? If so, are there adequate laboratory facilities available? Will they be able to process all the samples submitted? Will diagnostic tests be used? How accurate are these tests? Remember that studies based solely on diagnostic tests may provide data about the rates of infection present in the population being sampled, but they may not indicate whether the infected animals are showing signs of disease or not. Additional data on mortalities and morbidities may have to be generated.

What rates are to be calculated? Remember that incidence and attack rates cannot normally be obtained by a cross-sectional study. If estimates on economic losses due to particular diseases are required, various production parameters may have to be recorded. How are these to be measured? How good and how accurate will these measurements be?

Specify the expected response induced by a determinant on the frequency of occurrence of a disease. In other words, what effect would an increase or decrease in the frequency of occurrence of the determinant have on the frequency of occurrence of the disease? Remember that the determinant must occur prior to the disease. This may be difficult to demonstrate in a retrospective study.

Make biological sense. In epidemiological studies we are interested in exploring relationships between the frequency of occurrence of determinants and the frequency of occurrence of disease. We are particularly interested in determining whether the relationship is a causal one, i.e. whether the frequency of occurrence of the particular variable being studied determines the frequency of occurrence of the disease. We analyse such relationships by the use of statistical tests, which tell us the probability that the observed distributions of the determinant and the disease in the studied populations could have arisen by chance. If there is a high probability that the distributions arose by chance, the result is not significant and the distributions of the variable and the disease are considered to be independent. If that probability is low, the result is significant and the distributions of the variable and the disease are related in some way. Note that a statistically significant result does not necessarily imply a causal relationship.

Example: Suppose that the frequency of occurrence of variable A is determined by the frequency of occurrence of variable B, which also determines the frequency of occurrence of disease D. What is the relationship between variable A and disease D?

Figure: Variable B determines both variable A and disease D.

Note that although this arrangement would produce a statistically significant relationship between variable A and the disease D, the relationship is not a causal one, since altering the frequency of occurrence of variable A would have no effect on the frequency of occurrence of the disease, which is determined by variable B.

Variables that behave in this way are known as confounding variables and can cause serious problems in the analysis of epidemiological data. For this reason, any hypothesis that is made about the possible association of a determinant and a disease should offer a rational biological explanation as to why this association should exist.
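
The arrangement in the example above can be illustrated numerically. The following is a minimal sketch in Python using invented counts (all figures and names are hypothetical and are not taken from any study). It assumes that variable B raises both the frequency of variable A and the risk of disease D, while A itself has no effect on D.

```python
# Hypothetical counts, keyed by (B present, A present):
# value = (number of animals, number of animals with disease D).
counts = {
    (True, True): (400, 160),    # B present, A present: risk of D = 0.40
    (True, False): (100, 40),    # B present, A absent:  risk of D = 0.40
    (False, True): (100, 10),    # B absent,  A present: risk of D = 0.10
    (False, False): (400, 40),   # B absent,  A absent:  risk of D = 0.10
}


def risk(groups):
    """Proportion of diseased animals in the pooled groups."""
    animals = sum(n for n, _ in groups)
    diseased = sum(d for _, d in groups)
    return diseased / animals


def risk_ratio(exposed, unexposed):
    """Risk of disease in animals with A divided by risk in animals without A."""
    return risk(exposed) / risk(unexposed)


# Crude comparison: ignore B and compare animals with and without A.
crude_rr = risk_ratio(
    [counts[(b, True)] for b in (True, False)],
    [counts[(b, False)] for b in (True, False)],
)

# Stratified comparison: compare animals with and without A within each level of B.
stratum_rr = {
    b: risk_ratio([counts[(b, True)]], [counts[(b, False)]])
    for b in (True, False)
}

print(f"Crude risk ratio for A: {crude_rr:.3f}")
for b, rr in stratum_rr.items():
    print(f"Risk ratio for A when B present = {b}: {rr:.3f}")
```

With these assumed counts the crude risk ratio for A is about 2.1, yet within each level of B it is 1.0, which is the pattern expected when altering A would have no effect on D.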

Finally, remember that common events occur commonly and that often the simplest explanation for a disease phenomenon is the right one. Complicated hypotheses should not be tested until the simplest ones have been ruled out. For example, the presence of ticks on supposedly dipped animals is more likely to be due to a failure to dip the animals or to improper dipping procedure, rather than to the appearance of a new strain of acaricide-resistant ticks.

These considerations emphasise the need for careful and detailed planning of an epidemiological study. They also illustrate the need to obtain as comprehensive and detailed knowledge as possible about the subject being investigated and the techniques used in the investigation. The time spent reading relevant literature is therefore usually well spent. Extensive literature searches can often be performed quickly and easily by using modern information-processing techniques.

Do not be afraid to ask advice from experts. Such advice is essential when one is conducting investigations or employing techniques outside one's particular area of expertise. Remember that the time to ask for advice is before the study has begun. Whenever possible, consult a statistician on the statistical design of the study in order to ensure that the data generated will be sufficient and can be analysed in the appropriate way to fulfil the objectives of the study.

4.7 The use of existing data

4.7.1 Advantages and disadvantages
4.7.2 Sources of data

Collecting specific epidemiological data involves a considerable amount of time and effort in both the planning and implementation stages. Because of this, the possibility of using existing data should be explored before generating new ones.

4.7.1 Advantages and disadvantages

The main advantages of using existing data are:

· Data collection is expensive; using existing data is cheaper although not cost free.
· Time is often essential; analysis of existing data sources gives answers more quickly.
· By using data from various sources, it may become possible to monitor the progress of a disease through different populations and to establish linkages between disease events, so that the sources of disease outbreaks can be traced and populations likely to be at risk of the disease identified.
· The use of existing data sources will help strengthen them or induce the need for change.
· Since the original data collection was performed in ignorance of the ongoing study, there may be a reduced chance of bias in favour or against any hypothesis being tested.

The main disadvantages encountered in the use of existing data include:

· Data sets are often incomplete. For example, national reports based on compilations of regional reports are almost invariably incomplete and frequently very late in appearing, as some regions are late in reporting. Parts of data sets may have been lent out and not returned.
· The data may have been collected for other purposes than those of the present study. For example, data collected initially for administrative or accounting purposes are unlikely to help identify the associations between a disease and its determinants.
· Existing data may be inconsistent or of unknown consistency. Observers change and so do recording systems. Changes in administrative procedures or policy may alter the type and method of data collection and complicate analysis. Random errors of counting or in reading instruments may cancel each other out in the long term, but errors are often not random. Scales may be consistently misread due to confusion over units and graduations. Different observers may consistently under- or overestimate livestock numbers, weights and ages and differ in their diagnosis of the same disease condition. Calculations of epidemiological rates are often prejudiced by ignorance of the size of the population at risk and of the time over which events were observed.
· The data may not be relevant. Records for Friesians will not be useful in estimating production losses in zebus. Although data may be readily available from commercial producers, they will not relate to the majority of rural enterprises. Since livestock production is dependent on weather, among other factors, data from a series of years need to be examined to obtain representative estimates of means and scatter. Even if such data are available from apparently similar farming systems, checking is necessary to identify any changes that might have occurred in the provision of services, health control, markets and prices, before taking historical data as being a good estimate of animal health and production at present.
· The method used to collate and analyse the data may not be adequate for epidemiological purposes. If this is the case, the data may have to be obtained in the original form, if still available, and reanalyzed. This may be a time-consuming process. Moreover, it may not be possible to subject the original data to the appropriate analysis.

There are nearly always some serious limitations in the value of existing data for epidemiological purposes. This does not mean that the data may not be useful; if the limitations are understood, the probability of their misinterpretation will be reduced.

4.7.2 Sources of data

In Africa, epidemiological data can be obtained from the following potential sources:

Livestock producers. Little or no recorded data are generated directly by traditional livestock producers. Where livestock development projects, government, parastatal, or commercial farming are operating, records may be kept. Such records can often furnish data on production parameters, births, deaths, purchases and sales, husbandry practices, the frequency of occurrence of specific diseases, particularly those that produce distinct and easily recognizable symptoms, and disease control inputs such as vaccinations, dipping, treatments, diagnostic tests etc.

The quality of such data fluctuates widely. Staff may change, and individual animal records may be lost or destroyed on removal of the animals. Historic records may give no indication of the population at risk. If record cards of different groups of animals (e.g. infertile and milking cows) are kept separately, care should be taken that all available records are, in fact, examined. If data on disease are being collected, it is necessary to know the diagnostic criteria used and who made the diagnosis, so that the likely problem of differential diagnoses can be assessed. When disease recording is attempted by farm staff, there is often a tendency not to record common conditions, such as mastitis, neonatal mortalities and lameness, whereas the incidence of dramatic diseases or sudden death is given undue prominence. Cross-checking with records on veterinary inputs may help to reveal serious discrepancies.

The main disadvantage of the data generated by livestock producers is that they often relate to specific populations of livestock which may be atypical of the general livestock populations of the country in terms of breed, husbandry practices and disease control inputs.

Veterinary offices, treatment and extension centres. The data produced from such sources are likely to be in the form of case books, treatment records, vaccination and drug returns, outbreak reports etc. The main problem with such data lies in relating them to a source population. They are frequently incomplete and may contain significant omissions, particularly with regard to those diseases that are either treated by livestock owners themselves or for which treatment is unavailable. Veterinarians may vary considerably in their diagnostic ability and preferences. As a result, increases or decreases in the occurrence of specific diseases which may be reflected in the records may not, in fact, be due to actual increases or decreases in disease incidence but rather to the replacement of one veterinarian by another, or to a greater efficiency in overcoming operational constraints, or to the provision of additional drugs, equipment and facilities. An increased awareness of a particular disease problem on the part of livestock owners, or more selective diagnosis and treatment, may also lead to an apparent increase in recorded incidence.

Probably the most useful data from such sources are those related to notifiable disease outbreaks, on which detailed reports have to be compiled. If the report forms have been properly designed and the investigative procedures specified, such data may allow the appropriate rates to be calculated. However, owners may be reluctant to report such diseases in their livestock, especially if they know that restrictions are likely to be imposed.

Diagnostic laboratories. The data generated by diagnostic laboratories often provide precise diagnoses of disease conditions but can be highly selective. The relative frequencies with which specific diagnoses are reported often reflect the standard and range of laboratory facilities, and the interests or expertise of the field staff and laboratory workers, rather than the actual situation in the field. Unless the laboratory has a field survey capacity, incidence and prevalence rates cannot be established, since the data on diagnoses obtained cannot be related to a source population. Nevertheless, such data are often useful in highlighting disease problems which are of particular concern to the individuals submitting the specimens. The minimum knowledge that disease x was confirmed in location y at time z provides some basis on which to build.

Research laboratories, institutions and universities. Most of the data generated by these institutions are likely to come from experiments and may be difficult to relate to the situation in the field. Nevertheless, if research is being conducted into a particular disease, the data generated are likely to provide valuable insights into the epidemiology of the disease in question. Such institutions are also good sources of reference and advice.

Slaughter houses and slaughter slabs. The data generated from these sources are normally in the form of findings at meat inspection, and may be recorded in a limited and highly administrative format. Major variations in the sensitivity and specificity of diagnoses may occur between different inspectors. The data only pertain to certain sections of livestock populations, being highly biased since mostly healthy young adults are examined. Significant omissions are common, and relatively rare pathological conditions are not usually differentiated, but the data may provide information on congenital abnormalities and chronic disease conditions which produce distinctive lesions. Slaughter houses and slaughter slabs are frequently used as a starting point for epidemiological investigations since they have facilities for conducting examinations and taking specimens that are not available elsewhere.

Marketing organizations. Data from marketing organizations provide information on sales and offtake and sometimes also on livestock movements. Information on the latter might be used to trace back disease outbreaks to their sources. Unfortunately, this is rarely the case in Africa, since animals are seldom individually identified and therefore their movements cannot be accurately recorded.

Control posts and quarantine stations. Records from these facilities can provide information about livestock movements and outbreaks of notifiable diseases.

Artificial insemination services. Records from AI services may be of assistance in providing some information about fertility. The data are normally collected in the form of non-return rates i.e. the proportions of first, second, third inseminations etc for which no further insemination is requested.

Such rates often give an overestimate of the true reproductive performance in the populations concerned. Many AI services include a facility for the investigation of infertility problems. Data from such a facility can be of interest but are difficult to relate to a source population.
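
The non-return rate described above is a simple proportion. The following minimal sketch in Python shows the calculation for first inseminations; the function name and the figures are hypothetical and serve only as an illustration.

```python
def non_return_rate(inseminations, repeat_requests):
    """Proportion of inseminations for which no further insemination was requested.

    inseminations   -- number of inseminations of a given order (e.g. first services)
    repeat_requests -- number of those followed by a request for another insemination
    """
    if inseminations <= 0:
        raise ValueError("inseminations must be positive")
    if not 0 <= repeat_requests <= inseminations:
        raise ValueError("repeat_requests must lie between 0 and inseminations")
    return (inseminations - repeat_requests) / inseminations


# Hypothetical example: 200 first inseminations, 70 of which were followed
# by a request for a second insemination.
first_service_nrr = non_return_rate(200, 70)
print(f"First-service non-return rate: {first_service_nrr:.0%}")
```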

Insurance companies. Since these companies now offer insurance cover for high-value animals, and may offer limited cover for animals of lower value, they need to calculate and monitor risks, an interest they share with the epidemiologist. Their records may therefore be useful, but only limited data may be available.

The time required to identify and analyse existing records should not be underestimated, while their value needs to be carefully weighed against the cost. A quick but comprehensive survey of such material should indicate whether it will provide the required answers.

4.8 Monitoring and surveillance

4.8.1 Epidemiological surveillance
4.8.2 Epidemiological monitoring

One of the most important activities in veterinary epidemiology is the continuous observation of the behaviour of disease in livestock populations. This is commonly known as monitoring or surveillance. The term surveillance refers to the continuous observation of disease in general in a number of different livestock populations, while monitoring normally refers to the continuous observation of a specific disease in a particular livestock population.

4.8.1 Epidemiological surveillance

Surveillance activities involve the systematic collection of data from a number of different sources. These may include already existing data sources as well as new ones that have been created for specific surveillance purposes. The data are then analysed in order to:

· Provide a means of detecting significant developments in existing disease situations, with particular reference to the introduction of new diseases, changes in the prevalence or incidence of existing diseases, and the detection of causes likely to jeopardise existing disease control activities, such as the introduction of new strains of disease agents, changes in systems of livestock management, changes in the extent and pattern of livestock movements, the importation of livestock and their products, and the introduction of new drugs, treatment regimes etc.
· Trace the course of disease outbreaks with the objective of identifying their sources and the populations of livestock likely to be at risk.
· Provide a comprehensive and readily accessible data base on disease in livestock populations for research and planning purposes.

The prime objective of such activities is, however, to provide up-to-date information to disease control authorities to assist them in formulating policy decisions and in the planning and implementation of disease control programmes. Although a detailed discussion on the design and implementation of surveillance systems is beyond the scope of this manual, it may be useful to review briefly some of the considerations involved.

The success of any surveillance or monitoring system depends largely on the speed and efficiency with which the data gathered can be collated and analysed, so that up-to-date information can be rapidly disseminated to interested parties. As a result of recent advances in data processing techniques, particularly in the field of computing, the development of comprehensive and efficient surveillance and monitoring systems at a reasonable cost is now within the reach of most veterinary services.

The capacity of epidemiological units to employ these modern techniques means that such units may be able to offer data-processing services to institutions and organisations in return for the use of their data. This has removed one of the main constraints on the development of such systems in the past, which was the reluctance of various data-generating sources to make their data available to those responsible for surveillance. Such cooperation depends on a clear identification of the information needs of reporting organisations and fulfilling these rapidly and efficiently.

Modern computerised data processing allows complicated analytical procedures to be carried out on large volumes of data quickly and easily. However, such procedures must be used with a great deal of caution and only on data which justify them. If used on incomplete or inaccurate data whose limitations are not understood, they may produce results which are at best confusing or misleading. For this reason, the analysis of surveillance or monitoring data should be kept simple and the limitations of the information produced should be clearly stated.

A further consideration is that of confidentiality. Any surveillance or monitoring system will contain a certain amount of confidential data. If such data get into the wrong hands and are used indiscriminately without due regard to their probable limitations, serious problems may result. Appropriate safeguards need to be designed, therefore, to ensure that information is distributed to interested parties on a confidential and need-to-know basis.

4.8.2 Epidemiological monitoring

Epidemiological monitoring may include the use of existing routine data sources as well as of specific epidemiological field studies. Monitoring of a specific disease in a population is, in effect, a specialised form of a longitudinal study. The design of any individual monitoring programme will depend largely on the disease or control programme being monitored e.g. monitoring a vaccination programme would require different types of data than monitoring a tick control programme by dipping. The following objectives should be borne in mind in the design of monitoring systems:

· If control measures are being employed, the monitoring programme should provide a means to ascertain whether these measures are being carried out promptly and efficiently as specified in the programme design, and if not, why not.
· The monitoring programme should provide a means to ascertain whether the control measures being applied are having the desired and predicted effect on disease incidence. This normally implies a prompt and comprehensive disease-reporting system. The system should not be passive, but should include a component that is actively concerned with searching out disease outbreaks.
· The monitoring programme should provide a means for a rapid detection of developments which might jeopardise the control programme, or, in instances where no control measures are being implemented, which might warrant the introduction of control activities.

	SE (%)
P (%)	0.1	0.5	1.0	1.5	2.0	2.5
0.5	4975	-	-	-	-	-
1.0	9900	396	-	-	-	-
1.5	14775	591	148	-	-	-
2.0	19600	784	196	87	-	-
2.5	24375	975	244	108	61	-
3.0	29100	1164	291	129	73	47
3.5	33775	1351	338	150	84	54
4.0	38400	1536	384	171	96	61
4.5	42975	1719	430	191	107	69
5.0	47500	1900	475	211	119	76
6.0	56400	2256	564	251	141	90
7.0	65100	2604	651	289	162	104
8.0	73600	2944	736	327	184	118
9.0	81900	3276	819	364	205	131
10.0	90000	3600	900	400	225	144
20.0	160000	6400	1600	711	400	256
30.0	210000	8400	2100	933	525	336
40.0	240000	9600	2400	1067	600	384
50.0	250000	10000	2500	1111	625	400
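
The entries in the table above are consistent with the usual formula for the standard error of a proportion, n = P(100 - P)/SE², with P and SE both expressed in percent and the result rounded to the nearest whole number. That the table was compiled in exactly this way is an assumption, and the function name below is illustrative only. A minimal sketch in Python:

```python
def sample_size_absolute_se(p_percent, se_percent):
    """Sample size needed to estimate a prevalence with a given standard error.

    p_percent  -- expected prevalence P, in percent (0 < P < 100)
    se_percent -- required standard error SE, in percentage points (SE > 0)
    """
    if not 0 < p_percent < 100:
        raise ValueError("p_percent must lie between 0 and 100")
    if se_percent <= 0:
        raise ValueError("se_percent must be positive")
    # Rearranged from SE = sqrt(P * (100 - P) / n).
    return round(p_percent * (100 - p_percent) / se_percent ** 2)


# Spot checks against the table above.
print(sample_size_absolute_se(10.0, 1.0))   # P = 10%, SE = 1.0%
print(sample_size_absolute_se(20.0, 1.5))   # P = 20%, SE = 1.5%
```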

	SE as a percentage of P
P (%)	1.0	5.0	10.0
0.5	1 990 000	79 600	19 900
1.0	990 000	39 600	9 900
1.5	656 667	26 267	6 567
2.0	490 000	19 600	4 900
2.5	390 000	15 600	3 900
3.0	323 333	12 933	3 233
3.5	275 714	11 029	2 757
4.0	240 000	9 600	2 400
4.5	212 222	8 489	2 122
5.0	190 000	7 600	1 900
6.0	156 667	6 267	1 567
7.0	132 857	5 314	1 329
8.0	115 000	4 600	1 150
9.0	101 111	4 044	1 011
10.0	90 000	3 600	900
20.0	40 000	1 600	400
30.0	23 333	933	233
40.0	15 000	600	150
50.0	10 000	400	100
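
When the standard error is specified as a percentage of the prevalence itself, the same standard-error formula for a proportion can be applied after converting the relative figure to percentage points. The sketch below assumes that this is how the table above was derived; the function name is hypothetical.

```python
def sample_size_relative_se(p_percent, rel_se_percent):
    """Sample size for a standard error expressed as a percentage of P.

    p_percent      -- expected prevalence P, in percent (0 < P < 100)
    rel_se_percent -- required standard error as a percentage of P (> 0)
    """
    if not 0 < p_percent < 100:
        raise ValueError("p_percent must lie between 0 and 100")
    if rel_se_percent <= 0:
        raise ValueError("rel_se_percent must be positive")
    # Convert the relative standard error to percentage points.
    se_percent = p_percent * rel_se_percent / 100
    # Rearranged from SE = sqrt(P * (100 - P) / n).
    return round(p_percent * (100 - p_percent) / se_percent ** 2)


# Spot checks against the table above.
print(sample_size_relative_se(50.0, 10.0))  # P = 50%, SE = 10% of P
print(sample_size_relative_se(7.0, 10.0))   # P = 7%,  SE = 10% of P
```

The rarer the disease, the larger the sample needed to achieve the same relative precision, which is why the figures in this table rise steeply as P falls.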

	Population size
P (%)	50	75	100	300	500	1000	5000	10 000
a) 90% probability of detection
0.5	50	75	100	271	342	369	439	449
1	45	68	91	161	184	205	224	227
2	45	51	69	95	102	108	113	114
3	34	40	54	67	71	73	76	76
4	34	40	44	52	54	55	57	57
5	27	33	37	42	43	44	45	45
6	27	27	32	35	36	37	38	38
7	22	24	28	31	31	32	32	32
8	22	24	25	27	27	28	28	28
9	18	21	20	22	22	22	22	22
10	18	18	20	22	22	22	22	22
b) 95% probability of detection
0.5	50	75	100	286	388	450	564	581
1	48	72	96	189	225	258	290	294
2	48	58	78	117	129	138	147	148
3	39	47	63	84	90	94	98	98
4	39	47	52	66	69	71	73	74
5	31	39	45	54	56	57	59	59
6	31	33	39	45	47	48	49	49
7	26	29	34	39	40	41	42	42
8	26	29	31	34	35	36	36	36
9	22	26	28	31	31	32	32	32
10	22	23	25	28	28	29	29	29
c) 99% probability of detection
0.5	50	75	100	297	450	601	840	878
1	50	75	99	235	300	368	438	448
2	49	68	90	160	183	204	223	226
3	48	59	78	119	131	141	149	151
4	45	59	68	94	101	107	112	113
5	39	51	59	78	83	86	89	90
6	39	44	53	66	70	72	74	75
7	34	39	47	58	60	62	64	64
8	34	39	43	51	53	54	55	56
9	29	35	39	45	47	48	49	49
10	29	32	36	41	42	43	44	44
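
For a very large population, the number of animals that must be sampled to include at least one diseased animal with a given probability can be approximated by n = log(1 - c)/log(1 - p), rounded up, where p is the prevalence as a proportion and c is the required probability of detection. The sketch below illustrates this approximation only; it is not necessarily the method used to compile the table above, it ignores population size, and the function name is hypothetical. Its results are therefore comparable only with the right-hand columns of the table.

```python
import math


def detection_sample_size(prevalence_percent, probability):
    """Approximate sample size needed to detect at least one diseased animal.

    Assumes a population large enough that sampling does not noticeably
    change the prevalence among the animals not yet sampled.

    prevalence_percent -- expected prevalence, in percent (0 < P < 100)
    probability        -- required probability of detection (0 < c < 1)
    """
    if not 0 < prevalence_percent < 100:
        raise ValueError("prevalence_percent must lie between 0 and 100")
    if not 0 < probability < 1:
        raise ValueError("probability must lie between 0 and 1")
    p = prevalence_percent / 100
    # Chance that all n sampled animals are healthy is (1 - p) ** n;
    # solve (1 - p) ** n <= 1 - probability for n.
    return math.ceil(math.log(1 - probability) / math.log(1 - p))


for level in (0.90, 0.95, 0.99):
    print(level, detection_sample_size(10, level))
```

For a prevalence of 10% this gives 22, 29 and 44 animals at the 90%, 95% and 99% levels, the same as the tabulated entries for a population of 10 000. For smaller populations the tabulated figures are lower, because sampling a large fraction of a small population increases the chance of including a diseased animal.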

	Number of individuals infected	Number of individuals not infected	Total
Positive test result	a	b	a+b
Negative test result	c	d	c+d
Total	a+c	b+d	N
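
Using the cell labels a, b, c and d from the table above, the standard measures of diagnostic test performance can be computed as shown in the following minimal sketch in Python. The definitions are the conventional ones (sensitivity = a/(a+c), specificity = d/(b+d)); the function name and the example counts are hypothetical.

```python
def diagnostic_summary(a, b, c, d):
    """Summarise a 2 x 2 table of test result against true infection status.

    a -- infected, test positive        b -- not infected, test positive
    c -- infected, test negative        d -- not infected, test negative
    """
    n = a + b + c + d
    if min(a, b, c, d) < 0 or n == 0:
        raise ValueError("counts must be non-negative and not all zero")
    return {
        # Proportion of infected animals that test positive.
        "sensitivity": a / (a + c),
        # Proportion of non-infected animals that test negative.
        "specificity": d / (b + d),
        # Proportion of test-positive animals that are infected.
        "positive_predictive_value": a / (a + b),
        # Proportion of test-negative animals that are not infected.
        "negative_predictive_value": d / (c + d),
        # Prevalence of infection, and prevalence as suggested by the test alone.
        "true_prevalence": (a + c) / n,
        "apparent_prevalence": (a + b) / n,
    }


# Hypothetical example: 1000 animals tested, 100 of them truly infected.
summary = diagnostic_summary(a=90, b=40, c=10, d=860)
for name, value in summary.items():
    print(f"{name}: {value:.3f}")
```

In this invented example the test suggests a prevalence of 13% although only 10% of the animals are infected, which shows how an imperfect test can distort rates estimated from test results alone.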