Abstract
Although interpretation and description of clinical pathology test results for any preclinical safety assessment study should employ a consistent standard approach, companies differ regarding that approach and the appearance of the end product. Some rely heavily on statistical analysis, others do not. Some believe reference intervals are important, most do not. Some prefer severity of effects be described by percentage differences from, or multiples of, baseline or control, others prefer only word modifiers. Some expect a definitive decision for every potential effect, others accept uncertainty. This commentary addresses these differences and underscores the need for flexibility in a “consistent standard approach” because the conditions of every study are unique. This article constitutes an overview of material originally presented at Session 2 of the 2016 Society of Toxicologic Pathology Annual Symposium.
The study is over. All samples have been analyzed, and test results have been tabulated. Summary and individual animal data tables arrive in your mailbox, and your report is due in 2 days. It is tempting to open the tables and start scanning the data, but don’t do it. Begin at the beginning. Review the protocol and amendments to be completely familiar with the study objective and design. Review the study conduct and progression to be aware of all unusual or unexpected circumstances such as dosing holidays, altered dose levels, replacement animals, excessive sample collection for other tests, or errors in sample collection (e.g., collection in group order). Review what is already known about the test article, including prior studies if available. This prework affects how you will view the data and the quality of your interpretation and description. It is necessary to understand the variables (preanalytical and analytical) that affect the data in those tables. Are differences between groups real (i.e., test article related) or the product of variability that simply reflects the design and/or conduct of the study? Nearly every study has something uniquely different about it, so a consistent but flexible approach is important. There are few, if any, hard-and-fast data interpretation rules that apply across all preclinical studies.
Once the prework is done, where do you start with the data tables? It is a personal choice, but mistakes are limited by consistency and attention to detail. Some prefer to start with the summary tables because they provide the most efficient means (no pun intended) of assessing differences between groups. These tables are also where those often annoying asterisks are found if the protocol included statistical analysis of the data. Others prefer to start with individual animal tables. It may take a little longer to identify apparent differences between groups, but individual animal data provide a much clearer picture of interanimal variability, intra-animal variability when serial testing is done, within group variability, overlap of results among groups, and the influence of outliers on group means. Those who begin with individual animal data tables often say they want to have a good feel for possible test article–related effects before they are influenced by the outcome of statistical analyses. In the end, however, both sets of tables need to be reviewed. If you have enough space on your desktop (electronic or otherwise) to place the tables side-by-side, you will kill 2 birds with one stone. Reviewing results of each test in both sets of tables at the same time ensures that you avoid one of the biggest mistakes in data interpretation: failure to carefully evaluate individual animal data. Even when summary data tables appear to show nothing remotely different between control and treated groups, individual animal data should always be reviewed. If nothing else, it fosters an important habit and adds knowledge and experience concerning typical results for animals in studies of similar design.
You’ve now reviewed the data and identified some apparent differences between control and treated groups that may or may not be real or test article related. How do you decide? It’s often not obvious, and you may never be absolutely certain. But the more you know about the study and all the other measured/recorded end points (e.g., clinical observations, body weight, and microscopic findings), the better your decisions will be. The list of factors to consider is long, and it gets longer with experience. Some factors apply to most studies, but others may only apply to the study in question because of its uniqueness. At the risk of oversimplification, a partial list of questions to consider includes the following. How large are the differences? Are they dose dependent? Are they consistent over time or between sexes? Are they due to most of the animals in the treated group or just 1 or 2? When did the differences appear in relation to dosing? Were they present before dosing was initiated in large animal studies? Are they consistent with other clinical pathology findings? Are there correlative findings in other end points (e.g., in-life observations or microscopic findings)? Are the differences statistically significant? What is known about the test article? What was the vehicle? How many animals were tested? How much inter- and intra-animal variation (based on species, age, and study design characteristics such as route of administration, fasting status, number and volume of blood collections, site of blood collection, etc.) is expected for the tests in question? How much analytical variability is expected? Of course, it’s possible for test article–related effects to be very small, independent of dose, or inconsistent over time or between sexes. They may stand alone and not be associated with other findings, and they may not match what was previously observed in other studies. It happens.
But gather all the information you can and make informed decisions based on the weight of the evidence. Will everyone agree with you? Not a chance. But you need to be able to support your decisions, especially about small differences, because they will likely be challenged.
How do statistics fit into this picture of data interpretation? The use and perceived worth of statistical analysis of preclinical clinical pathology data vary across the industry. Some companies include statistical group comparisons for all studies that include a control group and have at least 3 animals/sex/group. Others require at least 10 animals/sex/group before group comparisons are run, and these companies typically do not include group comparisons for dog and monkey studies. Although there is validity to the argument that the low number of animals tested and the availability of baseline data diminish the value of statistics in large animal studies, statistical comparisons only cause an interpretation issue when they are considered to be more than just one piece of the interpretation puzzle. The old saying about statistics (actually from around the year 1900) is as true for clinical pathology data in preclinical studies as it is for any other discipline: statistics should be used the way a drunk uses a lamppost, for support rather than illumination. By the very nature of statistics, everyone should be able to agree that statistically significant differences do not always represent test article–related effects, and many test article–related effects do not reach statistical significance. It does not matter whether the statistical analyses are simple or complex; they are just a tool, and their results must be considered with the totality of the data. If a statistically significant difference is deemed incidental, there are usually multiple reasons for making that assertion (e.g., inconsistency over time or between sexes, no relationship to dose, similarity to differences present before initiation of dosing, no correlation with other findings, etc.). The reason is never only that the difference was really small.
Small test article–related differences are commonly identified, especially in more robust studies (e.g., rodent studies with 15 or more animals/sex/group), and sometimes there is just no value in fighting the asterisk. Maybe that really small difference was present in both sexes or at multiple intervals or with several other differences that were larger and more obviously test article related. Maybe that really small difference was consistent with the pharmacologic activity of the test article or was observed in previous studies. Or maybe the p values for the overall analysis of variance and pairwise comparison to control were both <.0001, and those probabilities are hard to ignore. The inclusion of p values on summary tables, and not just asterisks, is a helpful feature providing added context for data interpretation. Regardless of the type of statistical test applied, a difference with a p value close to .05, whether slightly above or below that benchmark, deserves a little more scrutiny regarding its relationship to test article administration. Maybe there are good reasons to believe that the difference with a p value of only .08 is real and the difference with a p value of .04 is not.
Reference intervals are conspicuously absent from the previously listed factors to consider when deciding whether apparent differences between control and treated groups are test article related. The reason is simple. They have little value in making those decisions in conventional preclinical studies with proper controls. Reference intervals are typically constructed to represent the central 95% of results, bounded by upper and lower limits, from a reference population of animals deemed to be healthy. The reference population is usually defined by relatively simple partitioning criteria: species, strain, sex, and age. However, test results and therefore reference intervals are influenced by many other factors such as animal supplier, husbandry practices, diet, fasting status, sample collection site, sample handling procedures, sample matrix, and laboratory methods. Additional factors that are more study-specific include vehicle choice, number and volume of blood collections for other tests (e.g., toxicokinetics), anesthesia for certain procedures, the presence of telemetry instrumentation, and the placement and maintenance of indwelling catheters for continuous intravenous infusion studies. These many different factors are rarely, if ever, included as partitioning criteria when reference intervals are constructed. It should be obvious that test results from concurrent control animals are far more relevant to data interpretation than reference intervals constructed from a reference population that does not truly represent the treated animals in a given study.
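To make "the central 95% of results" concrete, the following is a minimal sketch of one common way to compute such limits: take the 2.5th and 97.5th percentiles of the reference population. The function names, the percentile method, and the made-up reference population are all assumptions for illustration; they are not taken from any particular laboratory's procedure.

```python
import statistics


def reference_interval(values):
    """Return (lower, upper) limits bounding the central 95% of values.

    Nonparametric sketch: the limits are the 2.5th and 97.5th percentiles.
    """
    # n=40 yields 39 cut points in steps of 2.5%, so the first cut point
    # is the 2.5th percentile and the last is the 97.5th percentile.
    cuts = statistics.quantiles(values, n=40, method="inclusive")
    return cuts[0], cuts[-1]


def within_interval(result, interval):
    """True if a single result falls inside the (lower, upper) limits."""
    lower, upper = interval
    return lower <= result <= upper


# Hypothetical reference population: 100 made-up results, not real data.
reference_population = list(range(1, 101))
interval = reference_interval(reference_population)
print("Reference interval:", interval)
print("Result of 50 within interval:", within_interval(50, interval))
```

Note that nothing in this calculation knows about vehicle, fasting status, collection site, or any of the other study-specific factors listed above, which is exactly why concurrent controls remain the more relevant comparison.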
On the other hand, even I admit that clinical pathology reference intervals have a role in preclinical studies, albeit limited. They can be helpful for assessing the relative health of large animals prior to test article administration, assessing sick animals, and providing perspective concerning normal interanimal variability for different parameters. They can serve as an indirect form of quality control for sample collection and handling procedures, instruments and assays, and even the animals. Finally, if the reference population is appropriate, reference intervals can benefit the evaluation of small investigational studies that include no control animals. However, if the reference population is not a good match for the study, the resulting misinterpretation of an early study could have a deleterious effect on the whole program.
Back to your study. You’ve now identified differences you consider test article–related effects, and it’s time to describe their magnitudes in your report. You have options, and there are three common ones. Use words such as minimal, mild, moderate, and marked. Calculate percentage differences from (or multiples of) concurrent control group means or respective baseline means. Or use a combination of words and calculated percentage differences. The first and last options are acceptable. The second option, by itself, is not. The primary criticism of simply using word modifiers is that one person’s “minimal” is another person’s “mild,” and readers don’t necessarily know the author’s thresholds. That is true. But everyone knows that a minimal effect is very small, a marked effect is very large, and mild and moderate effects are somewhere in between. They also understand that the likelihood of an effect being toxicologically important generally increases with increasing magnitude, and unless explained otherwise because of other findings, effects described as minimal or mild are less likely to be toxicologically important than effects described as moderate or marked. In contrast, not everyone knows that a 15% lower sodium concentration is a marked effect (and clearly important) and a 100% higher total bilirubin concentration is a minimal effect (and likely unimportant). Every parameter is different, and even a seasoned clinical pathologist has to stop and do the math in their head using a typical value (e.g., a 15% lower sodium concentration would equate to a decrease from 150 mmol/L to 127.5 mmol/L) in order to understand the relative magnitude and likely importance of a given percentage difference for most any given parameter. Imagine what someone without the same training and experience takes away from a long list of percentage differences for test article–related clinical pathology effects. The highest numbers will draw the most attention whether warranted or not. 
To provide perspective for all readers, it is imperative that word modifiers be included in any report that lists calculated percentage differences. But understand that if percentage differences are listed in text tables, there will undoubtedly be some readers who look no further than those convenient text tables and miss the perspective presented in the text.
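The arithmetic behind the sodium and bilirubin examples can be sketched as follows. The helper names and the report phrasing are hypothetical, and the bilirubin values (0.1 and 0.2 mg/dL) are assumed typical values chosen only to produce a 100% difference. The word modifier is deliberately an input supplied by the pathologist rather than something computed from the number, because the same percentage means very different things for different parameters.

```python
def percent_difference(treated_mean, control_mean):
    """Percentage difference of a treated mean from the control mean."""
    return (treated_mean - control_mean) / control_mean * 100.0


def fold_of_control(treated_mean, control_mean):
    """Treated mean expressed as a multiple of the control mean."""
    return treated_mean / control_mean


def describe(parameter, modifier, treated_mean, control_mean):
    """Pair a pathologist-chosen word modifier with the calculated difference."""
    pct = percent_difference(treated_mean, control_mean)
    direction = "higher" if pct > 0 else "lower"
    return f"{modifier} {direction} {parameter} ({pct:+.0f}% vs. control mean)"


# Sodium example from the text: 150 mmol/L down to 127.5 mmol/L.
print(describe("sodium concentration", "markedly", 127.5, 150.0))

# Total bilirubin with assumed values of 0.1 and 0.2 mg/dL.
print(describe("total bilirubin concentration", "minimally", 0.2, 0.1))
print("Multiple of control:", fold_of_control(0.2, 0.1))
```

Read without the modifiers, the list would rank bilirubin (+100%) far above sodium (-15%), which is the opposite of their likely importance.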
While you are deciding how to describe the magnitudes of effects in your study, keep in mind that using calculated percentage differences could actually be misleading in some cases. A numerical description (e.g., +15%) is very specific. Is it appropriate to assign a very specific measure to a finding when means are based on only a few animals/sex/group in most dog and monkey studies? Is it appropriate to assign a very specific measure when 1 or 2 outlier values have significantly affected means and calculated differences? Is it appropriate to assign a very specific measure of change from respective baseline means when study-specific factors such as multiple blood collections, continuous infusion by catheterization, or an unusual vehicle have caused changes independent of the test article? If you use calculated percentage differences in these situations, you may find yourself in the uncomfortable position of trying to articulate why those numbers overstate or understate the practical magnitude of an effect without losing the reader.
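The outlier concern can be illustrated with a small sketch. The values below are entirely made up (a hypothetical enzyme activity in a study with 3 animals per group) and the helper name is an assumption. The sketch compares the percentage difference calculated from group means with the one calculated from group medians when a single treated animal has a very high result.

```python
import statistics


def percent_difference(treated, control):
    """Percentage difference of a treated summary value from control."""
    return (treated - control) / control * 100.0


# Hypothetical individual animal results (U/L), 3 animals per group.
control = [40, 45, 50]
treated = [42, 48, 180]  # one outlier animal drives the treated mean

mean_diff = percent_difference(statistics.mean(treated),
                               statistics.mean(control))
median_diff = percent_difference(statistics.median(treated),
                                 statistics.median(control))

print(f"Difference based on group means:   {mean_diff:+.0f}%")
print(f"Difference based on group medians: {median_diff:+.0f}%")
```

With these assumed values, the mean-based figure is +100% while two of the three treated animals are indistinguishable from controls. Neither number alone conveys that picture as well as a description of the individual animal data.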
After you’ve identified the differences you believe are real (i.e., test article related), the next step is typically to decide whether they are bad (i.e., adverse or toxicologically important). While adverseness was not the subject of this presentation, a couple of concepts are worthy of mention. Effects on some clinical pathology parameters can be adverse because the analyte itself is critical to normal health and too much or too little is a problem (e.g., calcium concentration). These effects will nearly always be accompanied by other findings that support the decision to call them adverse. Effects on other clinical pathology analytes can only be markers of an adverse effect because the analyte itself has no effect on normal health (e.g., alanine aminotransferase activity). To be considered a marker of an adverse effect, these latter clinical pathology effects should be anchored by another finding that is the actual adverse effect. Effects on still other clinical pathology analytes (e.g., absolute neutrophil count) can be either adverse or a marker of an adverse effect depending on the direction of change. It makes no sense to say an increase above a certain level in alanine aminotransferase activity or absolute neutrophil count is adverse. They become markers of adverse effects only when they can be attributed to, or associated with, findings considered adverse in another end point, typically histopathology.
It’s now about time to wrap up the initial draft of your report, but you’ve hit a couple of snags. What do you do with the small differences that you are just not sure about? And how much should you say about potential mechanisms for the effects that you have identified? With regard to uncertainty, some have the philosophy that a black and white decision has to be made for every difference. No waffling in the report. You must decide whether the difference is, or is not, test article related. That philosophy is unrealistic and disingenuous. While it is true that every difference either is, or is not, test article related, it is equally true that many preclinical studies lack the power to make definitive conclusions. Let’s not kid ourselves. We sign these reports. If you are not comfortable with making a definitive call for a fence-sitter finding, what is so wrong with simply describing the finding and your reasons for uncertainty? Will that have a negative effect on the program? No more, and maybe less, than if you are forced to make a definitive call that turns out to be wrong. With regard to speculation about potential mechanisms for test article–related effects, some have the philosophy that it is best to say nothing unless you know for sure. That’s understandable, but this is your first draft. If you have a pretty good idea about a potential mechanism, go for it. The worst that can happen is somewhere down the line, someone says, “take that out.” No harm, no foul. But at least you let everyone know what you were thinking, and it’s possible those thoughts will generate discussion and planning going forward that could ultimately benefit the program.
Finally, it’s time to submit your report (which by the way, should coincide with when the anatomic pathology report is due; no sense in submitting an early draft that is incomplete or possibly wrong because you didn’t know what the anatomic pathologist was going to find). Don’t ever expect that the next time you see that report will be for your signature. There will be differences of opinion and differences in writing style. Be ready to support your opinion when challenged but also keep an open mind. As far as writing style and “suggested” revisions, if the grammar is correct and the meaning is unchanged, try not to let it annoy you too much. Your life will be more pleasant. After 30 years and more than 4,000 reports, I think H. G. Wells had it right when he wrote that no passion in the world is equal to the passion to alter someone else’s draft.
Footnotes
Author Contribution
All authors (RH) contributed to conception or design; data acquisition, analysis, or interpretation; drafting the manuscript; and critically revising the manuscript. All authors gave final approval and agreed to be accountable for all aspects of work in ensuring that questions relating to the accuracy or integrity of any part of the work are appropriately investigated and resolved.
