Abstract
The likelihood that a study will yield statistically significant results depends on the chosen sample size. Surveillance and diagnostic situations that require sample size calculations include certification of disease freedom, estimation of diagnostic accuracy, comparison of diagnostic accuracy, and determining equivalency of test accuracy. Reasons for inadequately sized studies that do not achieve statistical significance include failure to perform sample size calculations, selecting sample size based on convenience, insufficient funding for the study, and inefficient utilization of available funding. Sample sizes are directly dependent on the assumptions used for their calculation. Investigators must first specify the likely values of the parameters that they wish to estimate as their best guess prior to study initiation. They further need to define the desired precision of the estimate and allowable error levels. Type I (alpha) and type II (beta) errors are the errors associated with rejection of the null hypothesis when it is true and the nonrejection of the null hypothesis when it is false (a specific alternative hypothesis is true), respectively. Calculated sample sizes should be increased by the number of animals that are expected to be lost over the course of the study. Free software routines are available to calculate the necessary sample sizes for many surveillance and diagnostic situations. The objectives of the present article are to briefly discuss the statistical theory behind sample size calculations and provide practical tools and instruction for their calculation.
Introduction
Calculation of sample size is important for the design of epidemiologic studies, 18,62 and specifically for surveillance 9 and diagnostic test evaluations. 6,22,32 The probability that a completed study will yield statistically significant results depends on the sample size assumptions and the statistical model used to make the calculations. The statistical methodology employed for sample size calculations should parallel the proposed data analysis to the extent possible. 18 The most frequently chosen sample size routines are based on frequentist statistics, and these have been reviewed previously for other fields. 1,10,11,20,33,35,36,50,54,61 Issues specifically related to diagnostic test validation also have been discussed. 2,28,42,48 Sample size routines related to issues of surveillance, as well as diagnostic test validation using Bayesian methodology, also have been developed. 7,56
Surveillance and diagnostic situations that require sample size calculations include the detection of disease in a population to certify disease freedom, estimation of diagnostic accuracy, comparison of diagnostic accuracy among competing assays, and equivalency testing of assays. The appropriate sample size depends on the study purpose, and no calculations can be made until study objectives have been defined clearly. Sample size calculations are important because they require investigators to clearly define the expected outcome of investigations, encourage development of recruitment goals and a budget, and discourage the implementation of small, inconclusive studies. Common sample size mistakes include not performing any calculations, making unrealistic assumptions, failing to account for potential losses during the study, and failing to investigate sample sizes over a range of assumptions. Reasons for inadequately sized studies that do not achieve statistical significance include failing to perform sample size calculations, selecting sample size based on convenience, failing to secure sufficient funding for the project, and not using available funding efficiently.
There is no single correct sample size “answer” for any given epidemiologic study objective or biologic question. Calculated sizes depend on the assumptions made during their calculation, and such assumptions cannot be known with certainty. Indeed, if the assumptions were known to be true with certainty, then the study being designed would likely not add to the scientific understanding of the problem. Nonetheless, several concepts are important to consider when performing sample size calculations, even though no particular result can be classified as correct or incorrect. A few simple formulas are generally sufficient for most sample size situations encountered in the design of studies to determine disease freedom and evaluate diagnostic tests. The objectives of the present article are to briefly discuss the statistical theory behind sample size calculations and provide practical tools and instruction for their calculation. This review will only discuss issues related to frequentist approaches to sample size calculation and will emphasize conservative methods that result in larger sample sizes.
Definition of type I and type II errors.*
HO = null hypothesis; HA = alternative hypothesis.
Epidemiologic errors
The current presentation of statistical results in the medical literature tends to be a blending of significance testing attributed to the work of Fisher, 21 subsequently discussed by others, 29,55 and hypothesis testing as attributed to Neyman and Pearson. 46,47 The P value in the Fisher significance testing approach is considered a quantitative value documenting the level of evidence for or against the null hypothesis. The P value is formally defined as the probability of observing the current data, or data more extreme, when the null hypothesis is true. The hypothesis testing approach as introduced by Neyman and Pearson was based on rejection or acceptance of null hypotheses using specified P value cutoffs. The hypothesis testing interpretation of statistical results allows for the definition of type I and type II errors as the errors associated with rejection of the null hypothesis when it is indeed true and the acceptance of the null hypothesis when it is false (and a particular alternative hypothesis is true), respectively. 47 The probabilities of making these errors are frequently referred to as alpha (α) and beta (β) for type I and type II errors, respectively. 36,54 Current sample size procedures are derived from the hypothesis testing approach as put forth by Neyman and Pearson; however, current convention is to use the terminology of “failure to reject” rather than acceptance of a null hypothesis.

Sampling distributions presented under the null (HO, black line) and alternative (HA, gray line) hypotheses. Black shaded area corresponds to alpha (type I error), and gray shaded area corresponds to beta (type II error). Sample size calculations solve for the sample size so that the critical value (cv) corresponds to the location where Pr(Z ≤ z) = 1 – α/2 under HO and Pr(Z ≤ z) = β under HA (or Pr(Z ≤ z) = 1 – α for a 1-sided test).
A requirement for sample size calculation is the specification of alpha and beta when considering the testing of a statistical hypothesis (Table 1). Precision-based sample size methods must specify alpha, but beta is not included in the equations and, based on the typical large-sample approximation methods, is consequently assumed to be 50% for the alternative hypothesis that the true value falls outside the limits of the calculated confidence interval. 17 The P value obtained after statistical analysis will equal the prespecified alpha if the assumptions of the sample size calculations are observed exactly in the collected data due to their similar probabilistic definition. However, the meaning of beta is often misunderstood as simply “the probability of accepting the null hypothesis when a true difference exists” based on presentations in tables and figures. 11,33,36,54 The issue is that there are an infinite number of specific alternative hypotheses that could be true if the null hypothesis is false, and many will be less probable than the null itself. Beta can be calculated only after an explicit alternative hypothesis has been specified. The hypothesis that is chosen during sample size calculation is the expected difference between the population values. Alpha and beta correspond to areas under sampling distributions for population means (including proportions) under the null and alternative hypotheses, respectively (Fig. 1). The statistical power of a test is defined as 1 – β or the probability of rejecting the null hypothesis when the alternative hypothesis is true.
Sample size adjustment factors
Sample size calculations are often based on large-sample approximation methods. 24,30,51 The quality of the approximate results depends on the specific sample size situation, and adjustment factors have been developed to improve their approximation to exact distributions. Some of the typical adjustment factors include the finite population, continuity correction, and variance inflation factors.
The finite population correction factor 4,19 is typically considered when the study objective is to estimate a population proportion. Typically, sampling without replacement is performed, and if the sample size is relatively large compared with the total population, then this correction factor should be considered. A typical recommendation is to employ this factor when the sample includes 10% or more of the population. 19 The need for this correction factor is derived from the fact that sampling is hypergeometric (sampling without replacement, as from a deck of cards), whereas sample size formulas are based on binomial (sampling with replacement) theory. The formula 19 for the correction is

n_adj = n / (1 + n/N)

where n is the sample size calculated assuming an infinite population, and N is the size of the population.
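The correction can be sketched in a few lines of Python. This is a minimal illustration, not validated software; the function name and the numerical inputs (657 animals needed under an infinite-population assumption, a population of 1,000) are hypothetical.

```python
import math

def fpc_adjust(n: float, N: int) -> int:
    """Adjust an infinite-population sample size n for a finite
    population of size N (finite population correction)."""
    return math.ceil(n / (1 + n / N))

# Example: 657 animals needed assuming an infinite population,
# but the population holds only 1,000 animals.
print(fpc_adjust(657, 1000))
```

Note that as N grows large relative to n, the adjustment vanishes and the original sample size is returned.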
The continuity correction factor 51 is employed when the study objective is to compare 2 population proportions (including diagnostic sensitivity or specificity). The difference in proportions is approximated by a normal distribution in typical sample size formulas, even though binomial distributions are discrete and normal distributions are continuous. The normal approximation might not always be adequate, and continuity correction should be applied to better approximate the exact distribution (Fig. 2). The formula 25 for continuity correction is

n_c = (n/4) × [1 + √(1 + 4/(n|P1 – P2|))]²

where n is the uncorrected sample size per group, and P1 and P2 are the 2 expected proportions.
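The continuity-corrected sample size described above can be computed as follows. This is a sketch assuming the Fleiss-style correction applied to the unrounded normal-approximation sample size; the function name and inputs are illustrative.

```python
import math

def continuity_corrected_n(n: float, p1: float, p2: float) -> int:
    """Apply the continuity correction to an uncorrected per-group
    sample size n for comparing proportions p1 and p2."""
    delta = abs(p1 - p2)
    n_c = (n / 4) * (1 + math.sqrt(1 + 4 / (n * delta))) ** 2
    return math.ceil(n_c)

# Uncorrected n of about 685.6 per group for proportions 0.90 vs. 0.85:
print(continuity_corrected_n(685.6, 0.90, 0.85))
```

The correction always increases the sample size, with a proportionally larger increase when n or the difference between the proportions is small.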

Cumulative probability function for a binomial distribution (n = 12, p = 0.5; gray shading) overlaid with the corresponding cumulative normal distribution (μ = 6, σ = 1.732), comparing the uncorrected and continuity-corrected normal approximations.
Sample size calculations for estimating proportions typically involve making the assumption of independence among sampling units. Lack of independence that is introduced when a clustered sampling design is employed can be adjusted for by inflating the variance estimate. The design effect (DE), 10,38 or variance inflation factor, is defined as the variance of the sampling design compared with simple random sampling. The formula 10,45,59 for its calculation is

DE = 1 + (m – 1)ρ

where ρ is the intraclass correlation, and m is the sample size within each cluster. When clustered sampling is employed, the sample size estimated by the usual methods assuming independence is multiplied by the DE to account for the expected dependence.
The intraclass correlation is a relative measure of the homogeneity of sampling units within each cluster compared with a randomly selected sampling unit. This correlation is formally defined as the proportion of the total variation among sampling units that can be accounted for by variation among clusters. 38,45 A high correlation indicates more dependence within the data, resulting in a larger DE. The intraclass correlation is generally estimated from pilot data or from estimates available in the literature. If the number of clusters is fixed by design and the cluster sample size is unknown, then it is not possible to simply use the previously mentioned formula for the DE. The sample size per cluster (m) must first be estimated, based on the effective sample size (ESS), which is the sample size estimated assuming independence. It is also necessary to know the number of clusters (k) and the intraclass correlation (ρ). The formula 27 for calculation of the cluster sample size is

m = ESS(1 – ρ) / (k – ESS × ρ)
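The two cluster-sampling formulas above can be sketched together. This is a minimal illustration under assumed inputs (an effective sample size of 100, 30 clusters, and ρ = 0.1 are hypothetical values, not from the article); function names are the author of this sketch's own.

```python
import math

def design_effect(m: float, rho: float) -> float:
    """Variance inflation (design effect) for clusters of size m
    with intraclass correlation rho: DE = 1 + (m - 1) * rho."""
    return 1 + (m - 1) * rho

def cluster_size(ess: float, k: int, rho: float) -> int:
    """Per-cluster sample size when the number of clusters k is fixed.
    ess is the effective sample size computed assuming independence."""
    if k <= ess * rho:
        raise ValueError("too few clusters for this effective sample size")
    return math.ceil(ess * (1 - rho) / (k - ess * rho))

# An effective sample size of 100 spread over 30 clusters with
# rho = 0.1 requires 5 sampling units per cluster.
print(cluster_size(100, 30, 0.1))
print(design_effect(5, 0.1))
```

As a consistency check, multiplying the effective sample size by the design effect for the resulting cluster size should not exceed the total sample k × m.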
Sample size situations
Surveillance or detection of disease
The detection of disease in a population is important for herd certification programs and for documenting freedom from disease after an outbreak. It has implications in regional and international trade of animals and animal products. The first step is to determine the prevalence of disease that is important to detect. A prevalence of disease at this level or greater is considered biologically important. Documenting a zero prevalence of disease is not typically possible because it would require testing the entire population with a perfect assay. The next step is to define the level of confidence for which it is desired to find the disease should it be present in the population at the hypothesized prevalence or higher. Again, 100% confidence is not feasible because it would require sampling all animals and testing with a perfect assay. Alpha is calculated as 1 – confidence. The final step is to determine the statistical model to use for calculations. In small populations, sample size calculations should be based on a hypergeometric distribution (sampling without replacement). In larger populations, it is often assumed that the true hypergeometric distribution can be well approximated by the binomial (sampling with replacement). The sample size formula assuming a binomial model is based on the following relationship: (1 – p)^n = (1 – confidence). The formula 19 after solving for the sample size is

n = ln(α) / ln(1 – p)

where α is 1 – confidence, and p is the prevalence worth detecting. The corresponding formula 19 based on hypergeometric sampling is

n = (1 – α^(1/D)) × (N – (D – 1)/2)

where α is 1 – confidence, N is the population size, and D is the expected number of diseased animals in the population.
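Both detection formulas can be sketched in Python. This is an illustration only (no adjustment for imperfect test sensitivity or specificity, which software such as Survey Toolbox does provide); the example numbers are standard textbook values, not the Texas example.

```python
import math

def detect_n_binomial(prevalence: float, confidence: float) -> int:
    """Animals to sample to detect disease at or above the given
    prevalence (binomial model, perfect test assumed)."""
    alpha = 1 - confidence
    return math.ceil(math.log(alpha) / math.log(1 - prevalence))

def detect_n_hypergeometric(N: int, D: int, confidence: float) -> int:
    """Same objective for a finite population of N animals containing
    D diseased animals (sampling without replacement)."""
    alpha = 1 - confidence
    return math.ceil((1 - alpha ** (1 / D)) * (N - (D - 1) / 2))

# Detecting a 10% prevalence with 95% confidence requires 29 animals
# from a large population, but only 25 from a population of 100.
print(detect_n_binomial(0.10, 0.95))
print(detect_n_hypergeometric(100, 10, 0.95))
```

The hypergeometric result is always at or below the binomial result, illustrating why the binomial model is the conservative choice.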
The necessary sample size for various combinations of prevalence and confidence can be tabulated (Table 2), and software is available that will perform the necessary calculations. Survey Toolbox a can perform these calculations and is available free for download. The software performs calculations based on both binomial and hypergeometric sampling and can also adjust for imperfect sensitivity and specificity of employed tests.
An example of this type of sample size problem is illustrated by the regulatory agency in Texas when it decided to perform active surveillance for bovine tuberculosis (Table 3). There are approximately 7,650 registered pure-bred beef seed stock producers in Texas, and it was decided that a herd-level prevalence of 0.001 (1 in 1,000 herds infected) or greater was important to detect with 95% confidence. Survey Toolbox can be used to solve this sample size problem.
From the menu, choose
A binomial model suggested that the necessary sample size would be 2,994 of the 7,650 beef operations (39%). The interpretation is that assuming that the true prevalence is at least 0.001, then a sample consisting only of noninfected herds would occur 5% of the time or less when the sample size is 2,994 (assuming a perfect test at the herd level). The hypergeometric model might be more appropriate, because sampling would be from a finite population without replacement; using the hypergeometric formula, the sample size is 2,388 herds (31%) of the 7,650 total.
Number needed for study to be confident that the disease will be detected if present at or above a specified prevalence based on hypergeometric sampling and assuming a perfect test.
Based on binomial model.
Estimation of a population proportion
Calculating the sample size necessary to estimate a population proportion is important when an estimate of disease prevalence or diagnostic test validation is desired. The sensitivity and specificity of an assay should be considered population estimates in the same manner as other proportions. The sample size formulas employed for these calculations are typically considered to be precision based because they involve finding confidence intervals of a specified width rather than testing hypotheses. The typical sample size formula 37,58 based on the normal approximation to the binomial is

n = (Z1–α/2)² × P(1 – P) / e²
where P is the expected proportion (e.g., diagnostic sensitivity), e is one half the desired width of the confidence interval, and Z1–α/2 is the standard normal Z value corresponding to a cumulative probability of 1 – α/2. The investigator must specify a best guess for the proportion that is expected to be found after performing the study. The investigator also needs to specify the desired width of the interval around this proportion and the level of confidence. In essence, this procedure will find the sample size that, upon statistical analysis, would result in a confidence interval with the specified probability and limits if the assumed proportion were in fact observed by the study (Fig. 3). The resulting sample size could be adjusted using the finite population correction factor, and if this is performed then the statistical analysis should be similarly adjusted at the end of the study. Sample sizes calculated using formulas should always be rounded up to the nearest whole number.
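The precision-based formula can be sketched as follows, using the standard library's normal distribution for the Z value. The function name is illustrative; the example inputs match the FMDV specificity scenario discussed later (expected proportion 0.99, half-width 0.01, 99% confidence).

```python
import math
from statistics import NormalDist

def estimate_proportion_n(p: float, e: float, confidence: float) -> int:
    """Sample size to estimate a proportion p within +/- e at the
    stated confidence (normal approximation to the binomial)."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return math.ceil(z ** 2 * p * (1 - p) / e ** 2)

# Expected specificity 0.99 estimated +/- 0.01 with 99% confidence:
print(estimate_proportion_n(0.99, 0.01, 0.99))
```

The worst-case assumption P = 0.5 with a ±5% interval at 95% confidence reproduces the familiar survey figure of 385 subjects.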
Sample size situation for the detection of bovine tuberculosis (TB) in beef cattle herds.

The sample size is determined so that the sampling distribution of the hypothesized proportion (P) has an area under the curve between the specified upper (PU) and lower (PL) bounds of the confidence interval equal to the specified probability (gray shaded area); Pr(PL ≤ P ≤ PU) = confidence level.
The sample size methods based on the normal approximation to the binomial might not be adequate when the expected proportion is close to the boundary values of 0 or 1. Exact binomial methods are preferred when the proportion is expected to fall outside the range of 0.2–0.8. 26 The binomial probability function is the basis of exact sample size methods, and it is

Pr(X = x) = [n! / (x!(n – x)!)] × P^x × (1 – P)^(n–x)

where P is the hypothesized proportion, n is the sample size, and x is the number of observed “successes.”
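The binomial probability function is straightforward to compute directly, which is all that exact methods build upon. A minimal sketch, evaluated at the distribution plotted in Figure 2 (n = 12, p = 0.5):

```python
from math import comb

def binom_pmf(x: int, n: int, p: float) -> float:
    """Probability of exactly x successes in n trials,
    each succeeding with probability p."""
    return comb(n, x) * p ** x * (1 - p) ** (n - x)

# Probability of exactly 6 successes in 12 trials with p = 0.5,
# i.e., 924/4096, approximately 0.2256.
print(round(binom_pmf(6, 12, 0.5), 4))
```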
Derivation of a sample size algorithm based on the binomial probability function has been described previously. 26 It is based on the mid-P adjustment 5,41 for the Clopper-Pearson method of exact confidence interval estimation. 14 The investigator specifies PU and PL as the desired limits of the confidence interval around the hypothesized proportion (P) and the desired level of confidence. The calculated sample size could be adjusted using the finite population correction factor if deemed appropriate.
Software is available to calculate the necessary sample size for estimating population proportions. Epi Info b includes software that can perform these calculations 33 and is available free for download. The software performs calculations based on normal approximation methods and will apply the finite population correction factor if the population size is specified. Software to perform calculations based on binomial exact methods (Mid-P Sample Size routine) can be obtained by contacting the author.
An example of this type of sample size problem is the design of a study to estimate the diagnostic specificity of a new assay to screen healthy cattle for Foot-and-mouth disease virus (FMDV; Table 4). The number of cattle necessary to sample could be calculated for an expected specificity of 0.99 and the desire to estimate this specificity ±0.01 with 99% confidence. For this example, it will be assumed that sampling is from a large population, and a simple random sampling design will be employed. Epi Info 6 can be used to make the calculation based on normal approximation methods (newer versions of Epi Info have not retained the presented sample size routines). From the menu, choose
Sample size situation for estimating the specificity of a test to screen cattle for Foot-and-mouth disease virus (FMDV).
Typically, sample size calculations for studies that will perform clustered sampling first calculate the necessary sample size assuming independence or lack of clustering. Calculated sample sizes are then multiplied by the DE to account for the lack of independence. Expert opinion can be used to account for expected correlation of sampling units when prior information concerning the intraclass correlation is not available. A sample size routine incorporating a method to estimate the DE based on expert opinion for a fixed number of clusters has been developed 27 and is available from the author.
Comparison of 2 proportions
Independent proportions. Calculating the sample size necessary to compare 2 population proportions is important when a comparison of the accuracy of diagnostic tests is desired. Sensitivity and specificity are population estimates, and comparison between 2 assays should be based on this sample size situation. The usual sample size formula 13,25,53 based on the normal approximation to the binomial with equal group sizes is

n = [Z1–α/2 × √(2P̄(1 – P̄)) – Zβ × √(P1(1 – P1) + P2(1 – P2))]² / (P1 – P2)²

where P1 and P2 are the expected proportions in each group, and P̄ is the simple average of the expected proportions. Variables Z1–α/2 and Zβ are the standard normal Z values corresponding to the selected alpha (2-sided test) and beta, respectively. Typical presentation of the formula 11,12 uses Zα/2 instead and an addition of the 2 components within the numerator. Solving these 2 formulations gives the same sample size because the numerator is squared. The specific formulation has been included here because alternative hypotheses have been presented in figures as being on the positive side of the null hypothesis, and therefore Zβ should be negative. This is also consistent with the algebraic manipulation to solve for Zβ, as presented in the section related to power calculation. The resulting sample size should be adjusted using the continuity correction factor, and all sample sizes should be rounded up to the nearest whole number. The magnitude of the difference between the 2 proportions has a greater effect on calculated sample sizes than typical values for alpha and beta (Fig. 4). The absolute magnitude of the proportions affects the calculations, with proportions closer to 0.5 resulting in larger sample sizes 11 because the variance of a proportion is greatest at this value. The formula for the standardized difference (SDiff) in proportions 3,36,61 is

SDiff = |P1 – P2| / √(P̄(1 – P̄))
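The comparison formula, with the continuity correction applied to the unrounded result, can be sketched as below. The function name is illustrative, and note the sign convention from the text: Zβ is taken at cumulative probability β and is therefore negative for power above 50%. The test values reproduce the per-group sizes quoted in the equivalency example later in this article (726 and 474).

```python
import math
from statistics import NormalDist

def compare_proportions_n(p1: float, p2: float,
                          alpha: float, beta: float) -> int:
    """Per-group sample size to compare 2 independent proportions
    (2-sided test), continuity correction included."""
    nd = NormalDist()
    z_a = nd.inv_cdf(1 - alpha / 2)   # Z(1 - alpha/2), positive
    z_b = nd.inv_cdf(beta)            # Z(beta), negative when beta < 0.5
    p_bar = (p1 + p2) / 2
    delta = abs(p1 - p2)
    n = ((z_a * math.sqrt(2 * p_bar * (1 - p_bar))
          - z_b * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) / delta) ** 2
    n_c = (n / 4) * (1 + math.sqrt(1 + 4 / (n * delta))) ** 2
    return math.ceil(n_c)

# Proportions 0.90 vs. 0.85 with alpha = 5% (2-sided), beta = 20%:
print(compare_proportions_n(0.90, 0.85, 0.05, 0.20))
```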

Sample size estimates are affected by the standardized difference and the specified alpha (type I error) and beta (type II error).
Software is available to calculate the necessary sample size to compare 2 independent population proportions. Epi Info b can be used to perform these calculations. The calculations are based on normal approximation methods and will apply a continuity correction factor. An example of this type of sample size problem is the design of a study to compare the diagnostic sensitivity of magnetic resonance imaging (MRI) for detection of intervertebral disk disease between chondrodystrophoid and nonchondrodystrophoid breeds of dogs (Table 5). The number of dogs necessary to sample could be calculated for expected sensitivities of 90% and 80% in chondrodystrophoid and nonchondrodystrophoid dogs, respectively. The statistical test could be desired to have an alpha of 5% and beta of 20% to detect this difference in proportions. The ratio of chondrodystrophoid to nonchondrodystrophoid dogs also needs to be specified, and the assumption could be made to have equal group sizes. Epi Info 6 can be used to make the calculation. From the menu, choose
Sample size situation for comparing the sensitivity of magnetic resonance imaging (MRI) for detection of intervertebral disk disease (IVDD) between chondrodystrophoid and nonchondrodystrophoid breeds of dogs.
Sample size calculations for the comparison of proportions when the group sizes are not equal are a simple modification of the presented formula. 60 The formula also can be modified to allow for the estimation of odds ratios and risk ratios. 20,53 All presented formulas correspond to the necessary sample sizes for 2-sided statistical tests. Variable Z1–α/2 is replaced with Z1–α to modify the formula for a 1-sided test.
Dependent proportions. When multiple tests are performed based on specimens collected from the same animal, then the proportions (i.e., sensitivity and specificity) should be considered dependent. There are multiple conditional and unconditional approaches to solving this sample size problem, 15,16,23,39,40,43,44,49,57 and a formula is not presented in this section due to increased complexity and lack of consensus among competing methods. An example of this type of sample size problem is the design of a study to compare the diagnostic specificity of 2 tests for FMDV screening in healthy cattle (Table 6). Serum samples from each animal selected for study will have both tests performed in parallel. The number of cattle necessary to sample could be calculated based on expected specificities of 99% and 95% in test 1 and test 2, respectively. The statistical test could be desired to have an alpha of 1% and beta of 10% to detect this difference in proportions. Software is available to calculate the necessary sample size to compare 2 dependent population proportions. WinPepi c includes software that can perform these calculations and is available free for download. From the main menu, the program
Sample size situation for comparing the specificity of 2 tests for Foot-and-mouth disease virus (FMDV) screening in healthy cattle.
Epi Info 6 could be used to make the calculation if the paired design was ignored. From the menu, choose
Equivalency testing. A study that aims to determine whether a certain test has accuracy equivalent (or noninferior) 44 to that of another, typically well-established test is based on separately comparing sensitivity and specificity between tests. The first step is to consider the sensitivity and specificity of the well-established test and then quantify the level of difference in accuracy that would be allowable while still considering the 2 tests equivalent or the new test not inferior. It is not possible to calculate a sample size to demonstrate zero difference, for the same reason that it is not possible to calculate a sample size to be 100% sure that a given population has no disease (zero prevalence). An example would be to determine equivalency of a new test to a well-established test that has been reported to be 90% sensitive and 95% specific. Further assumptions could be that as long as the new test is at least 85% sensitive and 90% specific, then it would be considered equivalent. The allowable alpha and beta values could be assumed to be 5% (2-sided) and 20%, respectively. However, power values greater than 80% and larger alpha values are sometimes assumed for equivalency studies. 54 Epi Info could be used to calculate the necessary sample size as described previously for 2 independent proportions. If equal group sizes are assumed (for each test), then the necessary sample size is 726 infected animals within each group tested by the 2 tests for the sensitivity comparison and 474 uninfected animals within each group for the specificity comparison. If a paired design were planned, then these numbers would be a reasonably good estimate for the total number of animals necessary for the evaluation. Often for noninferiority testing a 1-sided statistical test will be employed, and therefore the sample size calculation should be adjusted accordingly. Equivalency testing in general requires large sample sizes, and the discussed example is a simplified situation.
Literature related to these studies documents several methods of calculation and varies based on the determination of regions associated with rejection of the null hypothesis of no difference between tests. The simplified example has been presented to give a general idea of how studies should be designed, and interested readers should review the paper by Lu et al. 44
Calculation of power when sample size is fixed
When the sample size is fixed by design, it is good planning to determine the power of a statistical test to identify a biologically important difference. Estimating the power to compare 2 population proportions is important when it is desired to compare the accuracy of diagnostic tests. The usual formula for calculating the power of this comparison, an algebraic manipulation of the previously presented sample size formula assuming equal group sizes, is

Zβ = [Z1–α/2 × √(2P̄(1 – P̄)) – √n × |P1 – P2|] / √(P1(1 – P1) + P2(1 – P2))
A modification of the above formula, 24 including continuity correction, is

Zβ = [Z1–α/2 × √(2P̄(1 – P̄)) – √n × (|P1 – P2| – 1/n)] / √(P1(1 – P1) + P2(1 – P2))

where n is the sample size per group, P1 and P2 are the expected proportions in each group, and P̄ is the simple average of the expected proportions. Variables Z1–α/2 and Zβ are standard normal Z values. Power is determined as 1 – the cumulative probability associated with Zβ as calculated from the formula (Table 7). Typical presentations of these formulas 24 incorporate Zα/2 and addition of the numerator components.
An example would be to compare diagnostic sensitivity between 2 tests when both tests were independently performed on 100 infected animals. Assume that the tests are believed to have sensitivities of 85% and 90%, and a test with an alpha of 5% is desired. Epi Info 6 can be used to calculate the power of the test to compare these 2 proportions. From the menu, choose
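The power calculation for this worked example can be sketched as follows (uncorrected version of the formula; the function name is illustrative). With 100 infected animals per group, the power to distinguish sensitivities of 85% and 90% is quite low, which is the point of checking power before committing to a fixed sample size.

```python
import math
from statistics import NormalDist

def power_two_proportions(n: int, p1: float, p2: float,
                          alpha: float) -> float:
    """Power of a 2-sided comparison of 2 independent proportions
    with n subjects per group (no continuity correction)."""
    nd = NormalDist()
    z_a = nd.inv_cdf(1 - alpha / 2)
    p_bar = (p1 + p2) / 2
    z_b = ((z_a * math.sqrt(2 * p_bar * (1 - p_bar))
            - math.sqrt(n) * abs(p1 - p2))
           / math.sqrt(p1 * (1 - p1) + p2 * (1 - p2)))
    return 1 - nd.cdf(z_b)

# Sensitivities of 0.90 vs. 0.85, 100 animals per group, alpha = 5%:
# power is only about 19%.
print(round(power_two_proportions(100, 0.90, 0.85, 0.05), 3))
```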
Common standard normal Z scores for use in sample size formulas and power estimation.*
Power is found as 1 minus the cumulative probability associated with the Z score calculated from the power formula.
The calculation of power is dependent on the specification of an alternative hypothesis. The sampling distribution of the proportion under the null hypothesis is determined, and the critical value (Pr(Z ≤ z) = 1 – α/2) is located on this distribution. The alternative hypothesis is set as the expected difference in the 2 population proportions, and the sampling distribution of this difference is plotted with the critical value under the null hypothesis. The area under the sampling distribution of the alternative hypothesis to the right of the critical value is the power of the statistical test (Fig. 5). The shapes of these curves depend on the hypothesized proportions and the sample size. There is only a single power value related to each possible alpha and alternative hypothesis (expected difference in proportions).
Conclusions
The calculation of the sample size is very important during the design stage of all epidemiologic studies and should match the proposed statistical analysis to the extent possible. It is important to recognize that there is no single correct sample size, and all calculations are only as good as the employed assumptions. The calculated sample size ensures statistical significance only if the subsequent data collection is perfectly consistent with the assumptions made for the sample size calculation (assuming power was set at 50% or greater). If the null hypothesis is false and the assumed alternative hypothesis is true, then the probability of observing statistical significance will equal the assumed power of the test. The choice of assumptions for calculations is very important because their validity determines the likelihood of observing statistical significance. The traditional choices of 5% alpha and 20% beta can simply be used unless the investigator has specific reasons for other values. The choices of the best guesses or hypothesized values for the proportions that will be estimated by the study are more difficult. Values for these assumptions should be based on available literature or expert opinion. When there is doubt concerning their values, proportions could be assumed to be close to 0.5. A proportion of 0.5 has the maximum variance and therefore results in the largest sample size.

The sampling distribution under the null (black line) and alternative (gray line) hypotheses for the situation when P1 = 0.2 and P2 = 0.4 with equal group sizes. HO is the null hypothesis that the true proportion is 0.3 (simple average of P1 and P2), and HA is the alternative hypothesis that P1 = 0.2 and P2 = 0.4 and is centered at P2. Alternatively, HA could have been centered at P1. The gray shaded area corresponds to the power for the statistical test with alpha of 5% when the sample size per group is 20.
Sample size calculations correspond to the number of animals that are required to complete the study and be available for statistical analysis. They are the minimum sample sizes required to achieve the desired statistical properties. Calculated sample sizes should therefore be increased by the number of animals that are anticipated to be lost during the study. The study design influences the number of animals expected to be lost during implementation. Cross-sectional studies should have minimal losses, but there is always the possibility of mislabeled samples, lost records, and laboratory errors. Sample sizes for cross-sectional studies should be increased 1–5% to account for these potential losses. Prospective studies that cover long time periods could have substantial losses, but these types of study designs are unusual for diagnostic investigations.
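One common convention for this inflation, sketched below, is to divide the calculated size by the expected retention proportion rather than simply adding the expected count of losses; it is slightly more conservative and guarantees the required number of analyzable animals in expectation. The function name and the inputs (657 analyzable samples, 5% anticipated loss) are illustrative.

```python
import math

def inflate_for_losses(n: int, loss_rate: float) -> int:
    """Inflate a calculated sample size so that n animals are still
    expected to remain after the anticipated proportion of losses."""
    return math.ceil(n / (1 - loss_rate))

# A cross-sectional study needing 657 analyzable samples, with an
# anticipated 5% loss from mislabeling and laboratory error:
print(inflate_for_losses(657, 0.05))
```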
Some published recommendations include the post-hoc calculation of power when study results fail to achieve statistical significance. 34 However, there is no statistical basis for this calculation. 31 The power of a 2-sided test with an alpha set to be equal to the observed P value is typically 50%, 34 as presented in Figure 5. Therefore, post-hoc power calculations will typically be less than 50% for observed nonsignificant results. This fact, in conjunction with the one-to-one relationship between P value and power, suggests that little information can be garnered from their calculation. Post-hoc calculations of power could be useful if performed for magnitudes of differences other than what was observed by the study. In general, however, the post-hoc calculation of power is akin to determining the probability that an event will be observed after the event has already occurred (or not).
A primary purpose of sample size calculations is to ensure that the proposed study will be of an appropriate size to find an important difference statistically significant. Therefore, calculations should be performed before the study size is determined. In practice, however, sample size calculations are sometimes performed after the number of animals for study has been set, for reasons that might include cost or availability. Often, the assumptions are simply modified by trial and error until the calculations lead to the predetermined sample size, and these calculations are presented in grant applications or other proposed research plans. Also, studies are sometimes performed without any sample size calculations. Many journals require discussion of sample size calculations, and therefore such calculations are sometimes performed after the fact, with assumptions modified until the appropriate size is found. These are obviously not appropriate uses of sample size calculations. A better approach often would be the calculation of power based on the sample size expected to be used for the study. Though such post-hoc determinations can be inappropriate or misleading, many epidemiologists and statisticians likely have been asked to perform these calculations. Unfortunately, the realities of research do not always align with the ideals of science. It is hoped that the material presented in the present article will demystify sample size calculations and encourage their use during the initial design phase of surveillance and diagnostic evaluations.
Acknowledgements
This manuscript was prepared in part through financial support by the U.S. Department of Agriculture, Cooperative State Research, Education, and Extension Service, National Research Initiative Award 2005–35204–16087. The author would like to thank the anonymous reviewers for helpful suggestions, which resulted in a better overall paper.
