Abstract
It is highly desirable to use experimental methodologies in toxicological pathology that combine statistical power, practicality, and objective reviewability to detect small differences. The different ways of gathering data at the microscope can result in clear differences in power to discriminate small, but real, differences between treated and control rodent groups with nonneoplastic lesions. Six alternative methods of gathering and analysing results are compared. They are referred to as the Measuring, Ordering, Scoring (or Grading), Pair-contrast, Outside-control, and Affected methods. Measuring and Ordering methods are uniformly more powerful than other more common and highly esteemed methods, such as Scoring/Grading. From the practical perspective, Measuring and Ordering can be applied objectively, reviewed objectively, and interpreted to standards that are widely accepted as valid throughout experimental science, e.g., using confidence limits and intervals. They are also intuitively natural extensions of routine toxicological histopathological examinations. Establishing a small difference between control and treated groups is commonly a problem when reporting no-observed-effect levels. Ordering is the recommended method for assessing whether a small difference between treated and control groups is within chance variation or is the result of a true treatment effect, when measurement is impractical.
Introduction
This article compares the relative ability of a variety of methods of gathering data at the microscope to discriminate treatment-related change from spontaneous, background (chance) variation. How the data are gathered at the microscope determines what analysis can be achieved. This article follows a prior published cluster survey of all the UK-based toxicological pathologists (Holland, 2001) that established what their common practice and perceptions were.
The power of a test is the probability of the test detecting the difference specified at a given confidence level, given that it is truly there. The study of statistical “power” generates much confusion, not least because of confounded terminology—particularly with the more common term “sensitivity.” Unfortunately, sensitivity has 2 easily confused but very distinct meanings. The usual meaning is “analytical sensitivity” (or level of detection)—which is a measure of the least detectable amount and will have units such as mg/kg. The second distinct usage is as “diagnostic sensitivity”—which measures the probability of detecting a disease condition if present in an individual patient. It is a unitless probability measure (being a real number between 0 and 1) and is very close in meaning to statistical power (also a real number between 0 and 1). In this article we will be looking at the probability of obtaining true results from whole experiments, so the well-defined statisticians’ term “power” is appropriate. Readers may prefer to think “diagnostic sensitivity” if they dislike my preferred word, “power.” A good nonmathematical overview of the subject is given in the first 2 chapters of Statistical Power Analysis (Murphy, 1998).
In simple terms, in this article the various ways of gathering data at the microscope and then analysing it, which were identified in the initial survey (Holland, 2001), are examined to see which of them can detect a small but real difference (in statistical argot, this is a power analysis). This power analysis has been performed as a numerical simulation, by creating thousands of randomly generated sets of numerical data for a treated and a control group from defined distributions, where the underlying true difference between the 2 groups is known. If morphological change could realistically be measured, then this is the sort of numeric data that we would use.
However, histological reality is too complex for simple measurement, so we use less direct methods (such as scoring lesions) to distinguish toxic effects from background variation. These less direct methods can be applied to numerical information in just the same way that they can be applied to lesions. Hence, how frequently each of these less direct methods can then detect a known difference between groups gives a measure of its power for that difference. Whether the data are lesions or sets of random numbers, powerful methods can detect very small differences between groups reliably, and less powerful methods require the differences between groups to be greater before they can reliably detect them. Powerful techniques are clearly advantageous in experimental methodology, but there are occasions when the costs of increasing power (e.g., by using huge sample sizes) or absence of robustness may make the choice of less powerful methods desirable. The practical implications to toxicological pathology methodology and the further effects that these sorts of choices can have on issues of peer-review and experimental interpretation are then discussed.
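The logic of estimating power by simulation can be sketched in a few lines. The sketch below (Python, with illustrative parameters rather than the Minitab procedures actually used in this work) generates many pairs of control and treated groups with a known shift and records how often a one-tailed t-test on the measured values detects it:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)  # fixed seed so the estimate is repeatable

def estimate_power(delta, n=10, n_sims=2000, alpha=0.05):
    """Fraction of simulated experiments in which a one-tailed two-sample
    t-test detects a true shift of `delta` standard-deviation units."""
    hits = 0
    for _ in range(n_sims):
        control = rng.standard_normal(n)
        treated = rng.standard_normal(n) + delta
        p = stats.ttest_ind(treated, control, alternative="greater").pvalue
        hits += p < alpha
    return hits / n_sims
```

At delta = 0 the detection rate is just the false-positive rate of the test (about 5% at the 95% confidence limit); as delta grows toward 2 standard deviations, detection approaches certainty.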
Methods
Terminology
Six methods are compared. Each involves a method of gathering data at the microscope coupled with an appropriate method of statistical analysis of these data. There are no widely accepted terms for these methods of data gathering. Those terms used in the original survey (Holland, 2001) are generally followed, with modifications noted where cogent criticism has resulted in improvements.
Measuring method—any method involving measuring or counting some aspect of a lesion (called a Parametric method originally in Holland (2001));
Ordering method—any method based on ordering the animals from the most-to-least affected or the reverse (called a Ranking method in Holland (2001));
Score method—any method in which the range of variation is divided into ordinal classes (commonly termed absent, minimal, mild, moderate, marked, and severe), each animal being assigned to 1 of these classes. This is also commonly described as a Grading method (which I treat as a synonym);
Pair-contrast method—involves contrasting individual treated and control animals in pairs and uses the number of pairs in which the treated animal is the more extreme as its test statistic (called the Pairwise method in Holland (2001); such experimental results are tested by the Sign test);
Outside-control method—uses the number of treated animals that exceed the most extreme control as its test statistic;
Affected method—any method in which a range is defined as normal or unaffected. Any individuals outside this range are then counted as affected. The unaffected range can be defined by historic material, expert opinion, or personal preference (but if the concurrent controls are used, then it becomes the Outside-control method).
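Because numerical data can stand in for lesions in these simulations, one Measured data set can be reduced to the raw material of each of the other methods. A brief Python sketch (the grade boundaries and the "normal" limit used here are hypothetical illustrations, not prescriptions):

```python
import numpy as np

def reduce_measured(control, treated, grade_cuts=(-0.5, 0.5), normal_limit=0.5):
    """Collapse one Measured data set (10 control, 10 treated values) into
    the quantity each alternative method would record at the microscope.
    `grade_cuts` and `normal_limit` are hypothetical boundaries."""
    pooled = np.concatenate([control, treated])
    ranks = pooled.argsort().argsort() + 1             # Ordering: ranks 1..20
    rank_sum = int(ranks[len(control):].sum())         # treated group's rank sum
    grades = np.digitize(pooled, grade_cuts)           # Score: 3 ordinal classes
    pairs = int((treated > control).sum())             # Pair-contrast statistic
    outside = int((treated > control.max()).sum())     # Outside-control statistic
    affected = int((treated > normal_limit).sum())     # Affected statistic
    return {"rank_sum": rank_sum, "grades": grades.tolist(),
            "pairs": pairs, "outside": outside, "affected": affected}
```

Note the direction of the reductions: the ranks discard the magnitudes, the grades discard most of the ranks, and the Pair-contrast statistic discards the group structure altogether.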
Note that each method requires the data to be collected from the slides under the microscope in a different way. If initially you do a Pair-contrast method at the microscope, you will have to go back and completely reevaluate the slides if you want to use a test based on Ordering later. However, a Measured data set can clearly be converted to an Ordered data set without reexamination of the slides.
For ease, the 2 groups under comparison will be referred to as “control” and “treated” and the treated group assumed to show more of a hypothetical feature than the control group. By symmetry, the reverse situation gives identical results.
Simulations
The power of each test has been measured by Monte Carlo simulation (MC) of typical preclinical rodent studies. For several methods that are used by pathologists, theoretical analysis to derive their power is not possible. However, it is very easy to repeatedly simulate experimental data from defined distributions in which the difference between the control and treated populations is known and then test which of the methods detect that known difference. This is an MC method; such methods are used in a wide range of intractable analytical problems in mathematics and science, such as unanalysable definite integrals and reaction kinetics in physical chemistry. A very readable account intended for biologists is the short monograph by Moon (1997). His prescriptions have been followed in this work.
In simulating the populations to which the tests can be applied, the following assumptions have been made:
The variable is normally distributed and treatment has no effect on variance, only location. The variable has been standardised, so results are in standard deviation units.
The confidence level is that commonly applied in toxicology—the 1-tailed 95% limit (so the null hypothesis is simple and not composite).
The group sizes are 10—so the simulations are applicable to studies with groups of this size, such as rodent noncarcinogenicity studies. It is difficult to predict how these results apply to studies with larger groups with present/absent responses (tumour-bearers in rodent carcinogenicity studies), or to studies with very small group sizes such as some dog or primate studies.
Given the types of methods used by pathologists as identified by survey (Holland, 2001), the simulations required exact conditions to be specified. Where possible, I chose the conditions applied to be representative of the methods that I have seen commonly applied (by myself and colleagues in their work) or are standard statistical techniques. Where this was not possible (e.g., the Score method), I have favoured exact methods. For each individual method the following conditions have been applied:
Measuring method—the t-test was used. This is the uniformly most powerful test possible for this sort of data (Arnold, 1990).
Ordering method—The Wilcoxon-Mann–Whitney test was applied as prescribed by Siegel and Castellan (1988).
Score method—while the Wilcoxon rank sum test can be adapted to this type of data, the power of the test is suboptimal (even with correction for mid-ranks) (Sprent, 1993). An Exact test was applied to the most common form of this method, a three-part grading (e.g., mild-moderate-marked) in its most powerful form (i.e., with marginal totals for the grades of 7, 6, 7), at the 95% confidence limit. Clearly, as the number of grades increases toward 20 (one grade per animal), the Score method will tend to the power of the Ordering method.
Pair-contrast method—the Sign test was applied as prescribed by Siegel and Castellan (1988). It must be noted that the test statistic can only take integer values, so the critical values in this case are 9 and 10. To exceed the conventional 95% confidence limit, this test was applied at the 98.9% confidence limit. This is a more stringent level of probability against a false positive than was applied to the other tests.
Outside-control method—there are no widely published methods for this method. The critical value was calculated to be >3 treated animals exceeding the most extreme control, by taking the probability density function of the most extreme control (from order statistics theory in Arnold, 1990) and convoluting it against the most extreme treated animal, then the 2nd most extreme treated, and so on until the 95% confidence limit was exceeded (for the 4th this is the 95.7% confidence limit). The numerical integrations were done in Mathcad (v7). (This is not the easiest way of calculating this figure, but was the method that I used.)
Affected method—the data need testing for association, and Fisher’s exact test is appropriate. The test will be most powerful with all marginal totals of 10.
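The critical values quoted above can be checked directly with elementary combinatorics. The sketch below reproduces the Sign test's 98.9% level, the Outside-control critical value (here via an exchangeability argument rather than the order-statistic convolution actually used), and the one-tailed Fisher probability for the Affected method with all marginals of 10:

```python
from math import comb

n = 10  # animals per group

# Pair-contrast (Sign test): under the null, the number of pairs in which
# the treated animal is the more extreme is Binomial(10, 0.5).  Critical
# values 9 and 10 give the 98.9% confidence level quoted in the text.
p_sign = sum(comb(n, k) for k in (9, 10)) / 2**n        # = 11/1024, ~0.0107

# Outside-control: under the null all 20 values are exchangeable, so the
# chance that the top m values are all treated (equivalently, that at least
# m treated animals exceed the most extreme control) is C(10, m) / C(20, m).
p_outside = {m: comb(n, m) / comb(2 * n, m) for m in (3, 4)}
# m = 3 is not significant (~0.105); m = 4 gives ~0.043, the 95.7% limit.

# Affected (Fisher's exact test), all marginal totals 10: one-tailed
# hypergeometric probability of 8 or more of the 10 treated being affected.
p_fisher = sum(comb(n, a) * comb(n, n - a) for a in (8, 9, 10)) / comb(2 * n, n)
```

So under these conditions the Affected method needs at least 8 of the 10 treated animals outside the "normal" range before it exceeds the 95% confidence limit.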
I have deliberately ignored the possible role of transforming the data between gathering it at the microscope and analysing it. Transformations are only possible on the Measured data, which has been optimised under the assumptions of these simulations.
Minitab (v10) was used to produce and analyse all the simulations, and where possible, Minitab procedures were used. All the individual data points used in the simulations were generated (N~(0,1)) from a specified start location in the random number generator (as a single array). They were then assigned to control or treated groups in 9,999 batches of 10 per group for testing (for information, the 9,999 was imposed by a 4-digit limitation in Minitab’s programming language; it is larger than the 1,000 simulations that are common in the literature for MC methods). The difference in location of the means that was being tested was then added to each value in the batches of 10 values simulating a treated group. All 6 methods were then tested on each batch of control and treated animals and the results recorded. The whole process was then repeated with a different difference between the groups’ means (minimum value 0, maximum value 2.9, step size 0.1, all in standard deviation units). Each of the 9,999 batches of 10 controls or 10 treated was tested once only for each difference of means, but it was tested exhaustively for all the differences in means used. The power of a method was estimated by the frequency of successfully identifying a given difference divided by the number of times it was applied (9,999 in this case).
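A scaled-down version of that simulation loop can be written in Python, with scipy's tests standing in for the Minitab procedures and fewer batches than the 9,999 used (three of the six methods are shown, for brevity):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
N, N_SIM = 10, 2000          # group size 10; fewer batches than the original

def power_curve(deltas, alpha=0.05):
    """Detection frequency per method for each difference of means
    (one-tailed, 95% limit; the Sign test uses its critical values 9, 10)."""
    out = {"measure": [], "order": [], "pair": []}
    for delta in deltas:
        hits = dict.fromkeys(out, 0)
        for _ in range(N_SIM):
            c = rng.standard_normal(N)
            t = rng.standard_normal(N) + delta
            hits["measure"] += stats.ttest_ind(t, c, alternative="greater").pvalue < alpha
            hits["order"] += stats.mannwhitneyu(t, c, alternative="greater").pvalue < alpha
            hits["pair"] += int((t > c).sum()) >= 9   # Sign test critical region
        for k in out:
            out[k].append(hits[k] / N_SIM)
    return out
```

Note that, as in the original design, all methods are applied to the same batches of data, so differences between the resulting curves reflect the methods rather than sampling accidents.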
Checks
The whole data set was tested on generation for normality and variance. The theoretical power of the most powerful method (Measuring) was then obtained from nQuery and checked against the 95% confidence interval of the simulated value (using the normal approximation to the binomial distribution). The theoretical power of the least powerful method (Pair-contrast) was obtained as follows:
The difference between a random pair of animals (treated minus control) is distributed:

D ~ N(δ, 2) (in standardised units, each animal having unit variance)

Hence the probability that the treated animal of a pair is the more extreme is:

p = P(D > 0) = Φ(δ/√2)

As the critical values are 9 and 10:

Power = P(X ≥ 9), where X ~ Binomial(10, p); i.e., Power = 10p⁹(1 − p) + p¹⁰
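Both checks are easy to reproduce. The sketch below evaluates the closed-form Pair-contrast power (p = Φ(δ/√2), then the upper Binomial tail at the critical values 9 and 10) and the normal-approximation confidence interval used to compare theory against a simulated estimate:

```python
from math import sqrt
from statistics import NormalDist

def sign_test_power(delta):
    """Closed-form power of the Pair-contrast (Sign) test for a shift of
    `delta` SD units, with 10 pairs and critical values 9 and 10."""
    p = NormalDist().cdf(delta / sqrt(2))
    return 10 * p**9 * (1 - p) + p**10

def simulated_power_ci(hits, n_sims=9999, z=1.96):
    """95% confidence interval for a simulated power estimate, via the
    normal approximation to the binomial distribution."""
    p = hits / n_sims
    half = z * sqrt(p * (1 - p) / n_sims)
    return p - half, p + half
```

At delta = 0 the closed form returns 11/1024 ≈ 0.011, the false-positive rate of a test run at the 98.9% confidence level.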
The power of the Ordering, Score, Outside-control, and Affected methods could not be calculated from first principles.
Results
The results of the simulation are given in Figure 1, a plot of the actual values with linear interpolations between the simulated points. The Measuring method was the most powerful under all conditions tested. The Ordering method was almost as powerful as the Measuring method. The methods resulting in tabulated/contingency table data (Score, Affected, and Outside-control methods) were variably the next most powerful. The Pair-contrast method was a very poor, least-powerful method. However, if the difference between the populations is large (2 or more standard deviations), then all tests will detect the difference reliably.
The differences between the treated and control groups required to achieve the conventional 80% power (obtained by linear interpolation between the simulated points adjacent to the 80% power value) are given in Table 1. It can be seen that a pathologist who uses a Pair-contrast method will need an effect some 70% larger to achieve the same level of detection as a colleague using a Measuring or Ordering method. Establishing no-observed-effect levels (NOELs) routinely requires the distinction of a very small treatment effect from chance background variation. The methodology a pathologist chooses may be more important to determining the resulting NOEL than the material he/she observes.
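The interpolation used to build Table 1 is straightforward; a sketch follows (the delta and power values shown are hypothetical, not those of the actual simulation):

```python
def delta_for_power(deltas, powers, target=0.8):
    """Linearly interpolate between adjacent simulated points to find the
    group difference at which a method reaches the target power."""
    points = list(zip(deltas, powers))
    for (d0, p0), (d1, p1) in zip(points, points[1:]):
        if p0 <= target <= p1:
            return d0 + (target - p0) * (d1 - d0) / (p1 - p0)
    raise ValueError("target power not bracketed by the simulated points")

# Hypothetical power curve sampled at 0.1 SD steps:
d80 = delta_for_power([1.0, 1.1, 1.2], [0.71, 0.78, 0.84], 0.8)
```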
The perception of power obtained by survey (Holland, 2001) is at considerable odds with these mathematical results (Table 1). The Score method is seen as a powerful method by the large majority of pathologists, who use it more than twice as frequently as the next most popular method (the Affected method, which itself has poor power). The reality is that Measuring and Ordering methods are better than the most popular methods actually used at the bench.
In the checks, the Normal distribution plot showed no evidence of significant difference from the standard normal distribution. The power analyses of the Measuring and Pair-contrast methods over the whole range showed them to be within or very close to the 95% confidence intervals of the estimated values. This shows that it is vanishingly improbable that any of these simulations have been affected by an unusual and unfortunate distribution within the treated and control groups, as all methods were tested on the same sets of data.
Discussion
This discussion is intended to raise and inform on issues for debate within the discipline. It is not intended to provide definitive conclusions, as it is accepted that this is a simple mathematical model of the complex processes used by pathologists to reach their results.
The discussion initially covers the limitations of these particular data, how statistical theory addresses these issues, and why these data are widely applicable. Then, we cover the important issues of validity, interpretation, and peer-review.
These data and analyses are limited in scope. It is perfectly reasonable to ask: what would the effect of a concurrent change in location and variance be, or what would the result of using the Kolmogorov–Smirnov test instead of the Wilcoxon–Mann–Whitney test have been? These questions can be addressed, and are being worked on currently as part of a larger fellowship thesis. One issue that this initial simple model raises is not dependent at all on these specific results themselves. It is a more philosophical one: how should toxicological pathology be practised? Is it a clinical discipline in which individual professionals practise as they individually see fit; or alternatively, is it a scientific discipline in which debated and agreed methodology and interpretation standards are a major component of the discipline?
The limitation in scope of these initial simple analyses does bring certain advantages. We know from statistical theory that with the data as drawn under the simple conditions of these simulations, then the most powerful statistical test possible is the t-test (Arnold, 1990) and this is used in the Measuring method in this report. So under these assumptions we have a “gold standard” method—Measuring—to which we can compare others. If the data had been drawn with both the variance and the location changed by treatment, then there is no known uniformly most powerful test (the result of the Behrens–Fisher problem). To illustrate the utility of this simplification, take as an example centrilobular cellular hypertrophy in the rodent liver. No method for measuring this change exists. Should we invest a large amount of effort in developing and using measurement techniques, which will probably involve days/weeks/months at the microscope counting intersects? It seems hardly worth the effort because a simple ordering of the animals from the most-to-least affected will be only slightly less powerful than a perfected Measuring method, even under ideal conditions.
A second major advantage of these simple data is the ability to gain insight into why the methods show such differences in power. The following argument is based on Fisher’s theoretical concept of statistical “sufficiency.” The t-test is the most powerful test possible for the simple null hypothesis because it is based on every data point making a contribution of its individual magnitude to the test statistic (and also because it belongs to the maximum likelihood class of tests). In a test based on ranks, the order of the values is used, but the actual magnitudes are not used. So, a rank-based test uses less information and is less powerful than a t-test. A method producing scores, grades, or ordinal classes (such as affected–unaffected) is a form of Ordering in which extensive amalgamation of adjacent ranks occurs. So, it is based on even less information than a test based on ranks. In the Pair-contrast method, each animal is only compared to a single animal in the opposing group, so the group information is lost and power reduced again (a further reason is that the test statistic can only hold integer values so this test is applied near the 99% confidence limit, which further reduces its power). Tests based on a reduced amount of information cannot be expected to have increased power.
Simply put, if you have the information from a Measuring method, you can throw away some of your information and do an Ordering method (but you will lose a small amount of power). Then you can throw away some of the Order information and amalgamate individual ranks into a small number of classes and get a Score, Outside-control, or Affected method (but you will lose some more power). Finally, you can disregard almost all the information that comes from there being groups at all and do a Pair-contrast method—and achieve very poor power. Power is only one important aspect of method selection; as we will see there are other considerations—the most important of which is, is the test a valid one?
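That chain of discarded information can be made concrete on a single simulated data set (Python; the "normal" cut-off used for the Affected table is hypothetical):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
control = rng.standard_normal(10)
treated = rng.standard_normal(10) + 1.0    # a true shift of 1 SD

# Measuring: every value contributes its magnitude.
p_measure = stats.ttest_ind(treated, control, alternative="greater").pvalue
# Ordering: only the ranks are kept.
p_order = stats.mannwhitneyu(treated, control, alternative="greater").pvalue
# Affected: ranks collapsed to a 2x2 table at a hypothetical cut-off.
cut = 0.5
table = [[int((control <= cut).sum()), int((control > cut).sum())],
         [int((treated <= cut).sum()), int((treated > cut).sum())]]
p_affected = stats.fisher_exact(table, alternative="greater")[1]
# Pair-contrast: each animal compared only to one opposing animal.
k = int((treated > control).sum())
p_pair = stats.binomtest(k, n=10, p=0.5, alternative="greater").pvalue
```

On any single draw the ordering of these p-values can vary; over many repetitions, the average loss of evidence along the chain is what the power curves in Figure 1 summarise.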
Issues of power have been addressed in the statistical literature. The most widely used measure is that of Pitman (1948, widely quoted lecture notes never published, see Sprent, 1993), called “Pitman efficiency” or asymptotic relative efficiency (ARE). The basic idea is to compare 2 tests to answer the question: what sample size does each require to achieve the same power? For mathematical simplicity this is done as the sample sizes tend to infinity (the choice of confidence limit and power then has no effect on the result). While ARE is mathematically elegant and simplifying, in practical terms it is not a very useful measure to experimenters, because our sample sizes of 10 and less are a poor approximation for infinity. What experimenters need is a method of converting relative infinite sample sizes into the more useful measurement of “how big does the difference need to be so that it is detectable by this method as compared to that method.” No way of converting ARE to this useful information exists, hence these simulations. In the literature, the ARE compared to the t-test for the Ordering method (Wilcoxon–Mann–Whitney test) is given as 1.5–0.955 (Conover, 1999), and for the Pair-contrast method (Sign test) as 0.638–0.333 (Conover, 1999) under a wide range of conditions.
The circumstances in which an Ordering method can be more powerful than a t-test (ARE of 1.5) are those in which the variances of the samples are different. This situation contravenes an underlying assumption of the uniformly most powerful property of the t-test, which then loses this valuable property. For many situations in pathology we know with certainty that changes cannot be simply normally distributed. For example, any inflammatory change involves several components (e.g., edema, cellular infiltration, vasodilatation) and at the very minimum must be multidimensional. Technically, the t-test is then invalid. Nonparametric tests make very sparse (even no) assumptions about the populations from which the samples are drawn. This gives them very great “robustness” in statistical argot. In this respect, they are ideally valid, as we usually have no clear idea of the underlying distribution of lesions’ magnitude through our populations.
Pathologists must work within 2 sets of constraints—maximising the information gathered from the slides so that sensitive analysis can be made, and also using methods that are robust, because we have little/no knowledge of the underlying distributions of the lesions that we observe in populations. On this evidence, I strongly recommend using data based on ranks to achieve the best power practicable with good robustness. Failing that, gather score data with as many ordinal classes as possible. At very worst, the loss of power from the theoretic maximum possible power using ranks will be of the order of 5%, and if the populations are nonnormal in the feature you are assessing, you may achieve an ARE of greater than 1. Choosing a poor way of examining your slides (e.g., a Pair-contrast method instead of an Ordering method) may increase the degree of change required for detection over the theoretical minimum more than 10-fold. When the group sizes are much reduced from 10 (to 5 or even 3 animals in a group, as in primate studies), it is clear that this effect becomes much more marked. With 2 groups of 3 animals, the Measuring method can exceed the 95% confidence limit, the Ordering method can equal the 95% confidence limit, and the Score, Pair-contrast, Outside-control, and Affected methods simply cannot achieve a conventionally significant result on any data set. With experiments with groups of this size, data from intermediate groups, historic experimental data, and/or expert opinion may need to be used to refute any null hypothesis.
More important than these issues of power and robustness are those of interpretability, reviewability, and reporting. The almost universal current practice of reporting toxicological lesions is to record scored and graded information and present it as a contingency table (grade prevalence against dose for each sex). This has 2 substantial drawbacks.
The first is that it is impossible to convey where the boundaries between the ordinal classes occur; what is the objective difference between a mild and a moderate? Reviewing my own work after a period is difficult; peer reviewing others’ tables is impossible to do objectively. I have to construct a table with my own grade boundaries to see if it agrees generally with the original pathologist’s opinion. Worse still, as the boundaries between the classes are subjective and ineffable, it is impossible not to be influenced by the slides I have seen most recently as to the grade of the slide I am looking at now (are you truly unaffected by diagnostic drift?). Each table is a unique event in the passage of time that cannot be prescribed, described, or rescored. These problems can be sidestepped by doing a blind-to-treatment ranking of a lesion from most-to-least affected. Blinding of this sort also reduces the problem of observer bias, and seeing the slides in order of severity removes diagnostic drift. It makes the work involved in peer-review much easier as well. A simple pass through the slides from the most extreme to the (say) 3rd most extreme, then the 2nd to the 4th, and so on through all the slides is easy, quick, and exhaustive.
Second, the contingency tables that result from grading are extraordinarily difficult to interpret or analyse. Most scientific disciplines accept exceeding the 95% confidence limit under the null hypothesis as evidence of a real result. In over 20 years I cannot recall ever seeing that applied to histological score data in a regulatory study. This is partly due to the inherent difficulty of analysing contingency tables with sparse data (although programs such as StatXact now exist to give an exact significance level). It is more usual to see just an expert opinion of the table in the text. Expert opinions have frequently proved to be all too fallible. Objective ranked data are analysable by a variety of proven methods with a pencil and paper in a matter of minutes, to standards widely regarded as valid (e.g., the 95% confidence limit) throughout science. Moreover, ranking is a more powerful method. With this paper I would like to stimulate a widespread discussion of methodology in toxicological pathology. This should by now be a mature experimental discipline with an accepted and proven methodology to widely accepted scientific standards. To focus discussion, I put forward these assertions, which I hold to be true:
Toxicological pathology should move forward in its methodology of regular preclinical rodent studies:
To ordering, counting, or measuring substantive lesions.
And hence to giving probability estimates or confidence intervals under the null hypothesis to all substantive findings.
Scoring lesions should be confined to the initial screening of study material for possible effects.
Acknowledgments
This work has been sponsored by my employer, AstraZeneca, in pursuit of a Fellowship of the Royal College of Veterinary Surgeons. I am very grateful to my supervisors Mr. Peter Lee and Dr. John Glaister and also my colleague Dr. Stig Johan Wiklund—especially for their trenchant criticism of the first draft of this article (which reduced its length by some 40%, but spawned another). Finally, I record my thanks to my wife Helen, who proofread this article after putting up with it for many months.
