Abstract

Introduction:

Prediction models of most toxicological assays translate continuous data to binary classifications (“positive” or “negative”) using cutoff values. Mostly these cutoffs do not consider data variability. Some OECD test guidelines, however, provide a range close to the cutoff. If a test result is in this range, a repetition of the test is proposed. Yet, these ranges were based on few data and not systematically derived.

Materials and Methods:

In the present study, we determined the borderline ranges from multilaboratory ring trial studies for five nonanimal methods addressing skin sensitization: Direct Peptide Reactivity Assay (OECD TG 442C), KeratinoSens^® and LuSens (OECD TG 442D), and human cell line activation test (OECD TG 442E) as well as the kinetic direct peptide reactivity assay (update of OECD TG 442C).

Results:

We used a uniform statistical approach based on the log median absolute deviation. Implementing the proposed borderline ranges helps to assess certainty of both individual test results and of combinations of multiple data sources in a defined approach (DA) or an integrated approach.

Conclusion:

The OECD guideline on DAs for skin sensitization provides the first regulatory application of borderline ranges to the “2 out of 3” DA.

Introduction

In most toxicological assays, continuous data are generated. For regulatory classification, the continuous data are then dichotomized with a prediction model, which is usually based on a cutoff value. Cutoff values are used to assign a measured test result to one or the other class (and in rare cases more than two classes, e.g., in the in vitro skin corrosion assays according to OECD test guidelines 431 and 435^1,2 or the in vitro eye irritation assays according to OECD test guideline 437³). Due to data variability, however, the reality is not black and white, and where the experimental result is close to the cutoff, the assignment to either class is actually ambiguous.

Generally, variability is introduced by technical/experimental factors, including the technical equipment, staff, and the biological variability of the test system. Another source of variability is the tested material itself, that is, when the material is truly, and reproducibly borderline for a given endpoint (chemical-specific feature). During the last decade, awareness of uncertainty around cutoff values has increased and several groups have described variability, for example, for the local lymph node assay (LLNA), which is the key OECD adopted animal test to determine skin sensitizing properties.^4–8

A first approach to statistically determine borderline ranges for in vivo, in vitro, and in chemico skin sensitization methods has been described by Ref.⁹ The authors have used pooled (across substances and concentrations) standard deviations analyzing experimental data from one individual laboratory. The impact of those ranges on the predictive capacity has then further been analyzed.¹⁰ As a follow-up, borderline ranges were determined by various statistical methods based on historical data from one individual laboratory.¹¹

In that study,¹¹ the borderline range of the direct peptide reactivity assay (DPRA) was additionally determined experimentally by repeated testing of a test substance with a result very close to the cutoff. Among the different statistical methods, the log pooled median absolute deviation (log pooled MAD) provided the borderline range closest to the experimentally derived borderline range.

Hence, log pooled MAD was the statistical tool providing the most realistic borderline range and the most appropriate tool to describe the uncertainty around the cutoff. Here, we apply this statistical method to different datasets using the results of validation ring-trials to determine the borderline range based on data variability from multiple test laboratories for multiple OECD adopted tests. Finally, we demonstrate how these borderline ranges can be applied to the “two out of three” Defined Approach (2o3 DA), which was recently adopted by the OECD.¹²

Data

For the determination of the borderline ranges of the in vitro and in chemico skin sensitization assays, data from the so-called multilaboratory ring trials were used.^13–17 During the validation of new methods, reproducibility within one laboratory and between laboratories is assessed in multilaboratory ring trials and a set of typically 20–30 substances is assessed in multiple runs in multiple laboratories. In the following, a short description of the ring trial study datasets, which were used for the present analysis, is provided. For the sake of brevity, we do not provide detailed method descriptions here, but refer the reader to the cited OECD test guidelines and/or literature. No human trials, animal or other types of experiments were conducted for the purpose of the present study.

Direct peptide reactivity assay

In the DPRA (OECD TG 442C¹⁸), the relative depletion of a model peptide by a single test substance concentration (5 mM in the cysteine-peptide assay) is assessed after a single exposure time (24 hours) in triplicate. Generally, a single test run is sufficient for a prediction (unless values close to the cutoffs are obtained). OECD TG 442C provides two prediction models, one based on the mean peptide depletion (calculated as the mean of the cysteine- and lysine peptide depletions) and a second prediction model based on the Cys-depletion alone, for example, for substances coeluting with the Lys-peptide during the HPLC analysis.¹⁸ For both prediction models, several cutoffs for different “reactivity classes” are provided in the TG, but only one of the cutoffs per prediction model is decisive for the hazard information (6.38% for the mean peptide depletion and 13.89% for the Cys-only depletion).¹⁸

A EURL ECVAM coordinated validation study with three laboratories testing 24 test chemicals in one to three runs¹³ was performed followed by independent peer review by the EURL ECVAM Scientific Advisory Committee. The validation dataset comprised 24 chemicals having one or three test runs in each of the three participating laboratories. In the present analyses, only test chemicals having three test runs in a participating laboratory were considered for the borderline range determination (n = 13–14).

Kinetic direct peptide reactivity assay

In the kinetic direct peptide reactivity assay (kDPRA),¹⁹ five test substance concentrations (0.3125, 0.625, 1.25, 2.5, and 5 mM) are assessed after six exposure times (10, 30, 90, 150, 210, and 1440 minutes) in triplicate each. From the slopes of the reaction kinetics (for linear relationships), the logarithm of the maximal rate constant observed at any exposure time is determined as log k_max. A test substance is predicted as GHS Category 1A sensitizer with a log k_max above −2.0. Nonreactive chemicals or test chemicals resulting in a log k_max below −2.0 are concluded to be GHS Category 1B sensitizers or nonsensitizers (without further subclassification in the kDPRA based on log k_max). A single test run is considered sufficient for a prediction for test substances with linear reaction kinetics.¹⁹

A validation study co-coordinated by Givaudan and BASF, in which seven laboratories tested 24 test chemicals in one or three runs, was conducted with independent peer review by an international peer-review panel.¹⁵ For all 24 chemicals, intralaboratory results with three or four laboratories testing the chemical three times are available. These repeated log k_max determinations were used in the present analysis.

In addition, the raw data of the kDPRA also provide a depletion value at 24 hours with a 5 mM test concentration. These conditions are equivalent to those of the standard DPRA Cys-only endpoint (although a different analytical method is used), and these data were therefore also evaluated against the threshold of 13.89% for the Cys-only depletion in the DPRA. The same intralaboratory dataset as for the log k_max borderline range determination was used.

KeratinoSens

In the KeratinoSens^® (OECD TG 442D²⁰), a test substance is assessed in 11 concentrations at one exposure time (48 hours) in triplicate. The prediction model is based on the ratio of luciferase induction versus solvent control with a cutoff for positivity of 1.5-fold induction. Besides evaluating whether luciferase activity was induced to 1.5-fold or above the vehicle control level, statistical significance is assessed in this assay. If a concordant result is obtained between the first two runs, no additional run is conducted. Only concentrations keeping relative viability at ≥70% are evaluated.²⁰ A Givaudan coordinated validation study was conducted with independent peer review by the EURL ECVAM Scientific Advisory Committee with five laboratories testing 26–28 test chemicals in three runs.¹⁶

In addition to the ring trial data, an experimental confirmation of the borderline range was made by repeatedly testing a specific chemical at a specific concentration, which gives an average value just at the threshold, that is, testing the positive control cinnamic aldehyde at 16 μM in KeratinoSens in a total of 123 experiments with three valid runs in each experiment.

LuSens

In the LuSens (OECD TG 442D²⁰), a test substance is assessed in multiple concentrations at one exposure time (48 hours) in triplicate. Like for the KeratinoSens, the prediction model is based on the ratio of luciferase induction versus solvent control with a cutoff for positivity of 1.5-fold induction. Besides evaluating whether luciferase activity was induced to 1.5-fold or above the vehicle control level, statistical significance is assessed in this assay. If a concordant result is obtained between the first two runs, no additional run is conducted. Only concentrations leading to at least 70% relative viability are evaluated and at least three consecutive concentrations must afford at least 70% relative viability.²⁰ A BASF coordinated validation study was conducted with independent peer review by the EURL ECVAM Scientific Advisory Committee with 5 laboratories testing 20 test chemicals. Three laboratories tested 12 test substances in 3 independent experiments (consisting of at least 2 concordant runs), 3 laboratories tested 8 test substances in a single experiment, and 2 laboratories tested 13 test chemicals in a single experiment (an individual experiment consisting of at least two concordant runs).¹⁷

Human cell line activation test

In the human cell line activation test (h-CLAT) (OECD TG 442E²¹), a test substance is assessed in multiple concentrations at one exposure time (24 hours) in a single replicate. A test substance is predicted to activate dendritic cells in case the expression level (measured as relative fluorescence intensity) of the cell surface markers CD54 and CD86 relative to the concurrent vehicle control surpasses a cutoff of 200% and/or 150%, respectively, in any test substance concentration affording at least 50% relative cell viability in at least two independent experiments. If a concordant result is obtained between the first two runs, no additional run is conducted. A negative result is only acceptable for test substances with a log K_ow below 3.5, provided neither marker is induced above its cutoff and cell viability at the top concentration of the test is reduced by at least 10%.²¹

A EURL ECVAM coordinated validation study and independent peer review by the EURL ECVAM Scientific Advisory Committee with four laboratories testing 24 test chemicals in three runs was conducted.¹⁴

Statistics

For each method, only test substances with at least three test runs were considered for the borderline range determination, with the exception of the LuSens, where the ring trial study design did not ask for three independent test runs but rather two concordant runs comprising an experiment. For cell-based assays, only test results within the limits of cell viability (i.e., for KeratinoSens and LuSens only luciferase induction with cell viabilities of at least 70% and for h-CLAT CD54 and CD86 induction with cell viabilities of at least 50%) were taken into consideration.

For the determination of the log pooled MAD, data were log transformed for all assays but the kDPRA (the output of which is already a log-transformed value).

From the ring trial data, the pooled median absolute deviation (MAD_p) was determined as described in Ref.¹¹ In brief, the borderline range is determined from the MAD_p of a test method's results: $\mathrm{MAD}_{p} = \frac{\sum_{i=1}^{n} \sum_{j=1}^{k_{i}} (r_{i,j} - 1) \cdot \mathrm{MAD}_{i,j}}{\sum_{i=1}^{n} \sum_{j=1}^{k_{i}} (r_{i,j} - 1)}$

with MAD_i,j the median absolute deviation of results for substance i (i = 1, …, n) and concentration j tested per substance i (j = 1, …, k_i) and r_i,j the number of replicates per substance i and concentration j. $\mathrm{MAD}_{i,j} = \frac{\sum_{l=1}^{r_{i,j}} |y_{i,j,l} - \tilde{y}_{i,j}|}{r_{i,j} - 1}$

with l the replicate per substance i and concentration j (l = 1, …, r_i,j), y_i,j,l the test result of substance i, concentration j and replicate l and ỹ_i,j the median of test results for substance i and concentration j.
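The two formulas above can be sketched in Python as follows. This is a minimal illustration rather than the authors' implementation: the function names, the use of the natural logarithm, and the example readouts are assumptions made here for illustration only.

```python
import math
from statistics import median


def mad(values):
    """MAD_{i,j} as defined above: sum of |y - median| divided by (r - 1)."""
    r = len(values)
    if r < 2:
        raise ValueError("at least two replicates are needed")
    centre = median(values)
    return sum(abs(y - centre) for y in values) / (r - 1)


def pooled_mad(groups, log_transform=True):
    """MAD_p: mean of the group MADs, weighted by (r_{i,j} - 1).

    groups: one list of replicate results per substance/concentration.
    log_transform: assumption -- natural log; set to False for data that
    are already on a log scale (e.g., log k_max in the kDPRA).
    """
    numerator = 0.0
    denominator = 0
    for values in groups:
        if log_transform:
            values = [math.log(y) for y in values]
        weight = len(values) - 1
        numerator += weight * mad(values)
        denominator += weight
    return numerator / denominator


# Hypothetical fold-induction readouts, one list per substance/concentration.
example_groups = [[1.4, 1.5, 1.7], [2.9, 3.3, 3.1], [1.1, 1.0, 1.2, 1.1]]
example_mad_p = pooled_mad(example_groups)
```

Because each group MAD is weighted by its own denominator (r_i,j − 1), MAD_p reduces to the total absolute deviation from the group medians divided by the total of (r_i,j − 1) over all groups.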

The borderline range for an individual laboratory m participating in the ring trial (m = 1, … 7) around the prediction model's cutoff is then determined as follows (after exponentiation of the MADs for transformation to the assays' readout scale):
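The expression itself is not reproduced here. A minimal sketch, assuming that the range is placed symmetrically at ± MAD_p around the cutoff on the (natural) log scale and then exponentiated, could look as follows. This assumption is consistent with the tabulated per-laboratory ranges being roughly symmetric on the log scale (e.g., 5.49 × 7.42 ≈ 6.38² for DPRA laboratory 2), but whether any multiplier is applied to MAD_p is not stated in the text; the function name and the example MAD_p values are hypothetical.

```python
import math


def borderline_range(cutoff, mad_p, log_scale=True):
    """Return (lower, upper) bounds around a prediction-model cutoff.

    Assumption: bounds are cutoff +/- MAD_p on the log scale, transformed
    back by exponentiation. For readouts that are already logarithmic
    (e.g., log k_max), use log_scale=False to apply MAD_p directly.
    """
    if log_scale:
        centre = math.log(cutoff)
        return math.exp(centre - mad_p), math.exp(centre + mad_p)
    return cutoff - mad_p, cutoff + mad_p


# Hypothetical MAD_p values, for illustration only.
low, high = borderline_range(1.5, 0.1)                       # fold induction
k_low, k_high = borderline_range(-2.0, 0.07, log_scale=False)  # log k_max
```

Under this assumption the bounds are multiplicatively symmetric around the cutoff, so lower × upper equals the squared cutoff.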

The number of laboratories participating in the ring trials was 3 for the DPRA, 7 for the kDPRA, 5 for the KeratinoSens, 5 for the LuSens, and 4 for the h-CLAT.

After the borderline range had been determined for the individual laboratories, the mean and median of the individual lower and upper ranges were determined for all laboratories. The mean of these values is then proposed as the borderline range to be used, as now implemented, for example, in the OECD DA guideline¹² and the update of OECD TG 442C to include the kDPRA.¹⁹
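As a worked example of this aggregation step, the sketch below takes the per-laboratory KeratinoSens ranges reported in Table 3 and computes the mean and median of the lower and upper bounds. The helper name `aggregate` is introduced here for illustration, and rounding to two decimals is an assumption matching the precision of the tables.

```python
from statistics import mean, median

# Per-laboratory borderline ranges (lower, upper) from Table 3 (KeratinoSens).
keratinosens_labs = [
    (1.37, 1.64),
    (1.33, 1.69),
    (1.33, 1.69),
    (1.35, 1.67),
    (1.37, 1.65),
]


def aggregate(ranges, stat):
    """Apply a summary statistic separately to the lower and upper bounds."""
    lowers = [low for low, _ in ranges]
    uppers = [high for _, high in ranges]
    return round(stat(lowers), 2), round(stat(uppers), 2)


mean_range = aggregate(keratinosens_labs, mean)      # (1.35, 1.67)
median_range = aggregate(keratinosens_labs, median)  # (1.35, 1.67)
```

Both summaries reproduce the 1.35–1.67 range given in the Mean and Median rows of Table 3.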

Results and Discussion

Based on the ring trial results obtained by the individual participating laboratories, the log pooled MADs were determined for each laboratory individually (Tables 1–5). For comparison, the log pooled MADs from historical data from individual laboratories are also provided for the DPRA, KeratinoSens, LuSens, and h-CLAT. Furthermore, for the DPRA as well as the KeratinoSens, the borderline range has also been determined experimentally.

Table 1.

Borderline Range Determination Based on the Log Pooled Median Absolute Deviations in the Direct Peptide Reactivity Assay Ring Trial Data

Data source	Mean peptide depletion (%)			Cysteine-only depletion (%)
	Cutoff			Cutoff
	6.38	22.62	42.47	13.89	23.09	98.24
Ring trial
Lab 1 (n^a = 13)	4.81–8.46	17.05–30.01	32.01–56.34	10.53–18.31	17.51–30.45	74.49–129.56
Lab 2 (n^a = 14)	5.49–7.42	19.46–26.30	36.53–49.38	12.17–15.85	20.23–26.36	86.07–112.13
Lab 3 (n^a = 14)	4.54–9.08	15.28–33.48	28.70–62.85	8.97–21.25	13.38–39.85	56.92–169.55
Mean	4.95–8.32	17.26–29.93	32.41–56.19	10.56–18.47	17.04–32.22	72.49–137.08
Median	4.88–8.39	17.16–29.97	32.21–56.27	10.54–18.39	17.27–31.33	73.49–133.32
BASF SE
Historical data^b (n = 385)	5.29–7.69	n/a	n/a	11.62–16.61	n/a	n/a
Experimental data^b (n = 27)	5.45–7.31	n/a	n/a	11.89–15.90	n/a	n/a

In bold are the values which have been implemented in OECD Guideline No. 497 on the defined approaches on skin sensitization.

^a n, the number of substances of the 24 substances assessed in the ring trial, for which at least three test runs were available.

^b Published in Gabbert et al.¹¹

Direct peptide reactivity assay

Protein binding is a key event and the molecular initiating event in the adverse outcome pathway of skin sensitization.²² However, it has been shown that the classical DPRA result alone is not a sufficient predictor of the skin sensitizing potency.²³ Nevertheless, as the so-called reactivity classes defined in the DPRA guideline are used in some DAs to assess skin sensitizing potency, we also provide here the borderline ranges for the additional cutoffs described in OECD TG 442C.¹⁸ The borderline ranges determined based on the log pooled MADs in the DPRA ring trial are summarized in Table 1. For the three laboratories participating in the ring trial, and considering only substances for which at least three runs were available (i.e., 13–14 per laboratory), the borderline range around 6.38% mean peptide depletion varied between 4.54% and 9.08%, while the mean of all participating laboratories was 4.95%–8.32% and the median was 4.88%–8.39%. The latter two ranges were comparable to the range of 5.29%–7.69% derived in a routine testing laboratory (assessing 385 substances) and the experimentally determined range of 5.45%–7.31%. This experimentally derived range (as previously published in Ref.¹¹) was obtained by repeatedly testing a specific chemical at a specific concentration that gives an average value just at the threshold, that is, EGDMA at 8 μM in the DPRA.¹¹

Based on the ring trial data, we propose to implement the borderline range of 4.95%–8.32% for the mean peptide depletion and 10.56%–18.47% for the Cys-only prediction model when deciding whether a run is conclusive or inconclusive. A flow chart for the DPRA prediction model(s), including the borderline ranges, is available in the OECD guideline on DAs for skin sensitization.¹²

The borderline ranges should eventually also be added to TG 442C. The range currently provided in OECD TG 442C (i.e., 3%–10% mean peptide depletion) should still be used as guidance on whether a single run is sufficient or the experiment should be repeated, which is a different question from deciding whether a result is conclusive. This retest guidance was recently made a mandatory requirement in the revised OECD TG 442C.¹⁹

Kinetic direct peptide reactivity assay

In the ring trial, 24 substances were assessed in 7 laboratories for interlaboratory reproducibility. For intralaboratory reproducibility, each laboratory tested 12 chemicals in 3 runs each, so that every chemical was tested three times in either 3 laboratories (12 chemicals) or 4 laboratories (12 chemicals). As the log k_max can only be calculated for positive chemicals, the MAD can also only be calculated for those. In total, there are 60 datasets for log k_max, each comprising three runs on one chemical in one laboratory. For the Cys-depletion at 24 hours and at 5 mM test chemical concentration, all the data including negative results could be used (i.e., 3 × 12 + 4 × 12 = 84 datasets). The borderline ranges determined based on the log pooled MADs in the kDPRA ring trial are summarized in Table 2 for log k_max and Cys-depletion.

Table 2.

Borderline Range Determination Based on the Log Pooled Median Absolute Deviations in the Kinetic Direct Peptide Reactivity Assay Ring Trial Data

Data source	log k_max (s⁻¹ M⁻¹); cutoff: −2.0	Cys depletion (%) (after 24 hours, test chemical concentration 5 mM); cutoff: 13.89%
Ring trial
Lab 1 (n^a = 7)	−1.90 to −2.10	10.73 to 17.98
Lab 2 (n^a = 8)	−1.98 to −2.02	13.23 to 14.58
Lab 3 (n^a = 8)	−1.90 to −2.10	11.50 to 16.78
Lab 4 (n^a = 8)	−1.96 to −2.04	12.24 to 15.76
Lab 5 (n^a = 10)	−1.97 to −2.03	13.38 to 14.42
Lab 6 (n^a = 11)	−1.85 to −2.15	12.68 to 15.21
Lab 7 (n^a = 8)	−1.94 to −2.06	12.32 to 15.65
Mean	−1.93 to −2.07	12.30 to 15.77
Median	−1.94 to −2.06	12.32 to 15.65

In bold are the values which have been implemented in the updated OECD Test Guideline No. 442C.

^a n, the number of substances of the 24 substances assessed in the ring trial, for which at least three positive test runs were available in a given laboratory. The chemicals were randomized; each laboratory received a different subset of 12 chemicals for intralaboratory testing. Hence, the number of positive chemicals differs between laboratories.

For the seven laboratories participating in the ring trial and only considering substances for which at least three runs were available, the borderline range of the log k_max around the threshold of log k_max of −2.0 varied between −1.85 and −2.15, while the mean and median of all participating laboratories were −1.93 to −2.07 and −1.94 to −2.06, respectively.

For the seven laboratories participating in the ring trial and only considering substances for which at least three runs were available, the borderline range for Cys-depletion after 24 hours at a test chemical concentration of 5 mM around the Cys-only cutoff of 13.89% varied between 10.73% and 17.98%, while the mean and median of all participating laboratories were 12.30%–15.77% and 12.32%–15.65%, respectively.

Based on the ring trial data, we proposed to implement the statistically derived borderline range for log k_max values between −1.93 and −2.07 in OECD TG 442C.¹⁹

KeratinoSens

In the KeratinoSens ring trial, 26 substances were assessed in 3 runs in 5 laboratories, and 2 additional substances were assessed in 3 runs in 4 laboratories. The borderline ranges determined based on the log pooled MADs in the KeratinoSens ring trial are summarized in Table 3.

Table 3.

Borderline Range Determination Based on the Log Pooled Median Absolute Deviations in the KeratinoSens Ring Trial Data

Data source	Luciferase induction; cutoff: 1.5
Ring trial
Lab 1 (n^a = 28)	1.37–1.64
Lab 2 (n^a = 28)	1.33–1.69
Lab 3 (n^a = 28)	1.33–1.69
Lab 4 (n^a = 28)	1.35–1.67
Lab 5 (n^a = 26)	1.37–1.65
Mean	1.35–1.67
Median	1.35–1.67
Givaudan
Experimental data (n = 123)	1.40–1.60

In bold are the values which have been implemented in OECD Guideline No. 497 on the defined approaches on skin sensitization.

^a n, the number of substances of the 28 substances assessed in the ring trial, for which at least three test runs were available.

For the five laboratories participating in the ring trial, the borderline range around 1.5-fold luciferase induction varied between 1.33 and 1.69, while the mean and median of all participating laboratories were both 1.35–1.67.

The variability around the threshold was also assessed experimentally. The positive control cinnamic aldehyde is tested in each run at five test concentrations. At the concentration of 16 μM, the luciferase activity induced by cinnamic aldehyde is at 1.53 ± 0.2 in 623 individual valid runs in one laboratory (Givaudan) and thus just at the threshold. In a total of 123 individual experiments, 3 valid runs (total runs = 369) were available. The experimental borderline range based on pooled MAD from this large dataset on one chemical and one concentration is 1.40–1.60, and thus very comparable to the range of 1.35–1.67 from the ring trial data.

Based on the ring trial data, we propose to implement the statistically derived borderline range for luciferase inductions between 1.35 and 1.67 in OECD TG 442D. A flowchart for the KeratinoSens prediction model, including the borderline range, is available in the OECD guideline on DAs for skin sensitization.¹²

LuSens

In the ring trial, 20 substances were assessed, and of the 20 substances, 8 were tested in 3 independent experiments in 3 laboratories. The borderline ranges determined based on the log pooled MADs in the LuSens ring trial are summarized in Table 4. The ring trial study design of the LuSens was slightly different than for the other assays. In particular, the assay has been conducted to obtain two concordant runs for a prediction, while for the other assays, three independent runs were conducted also in cases where the first two runs were concordant. In the case of the LuSens, therefore, the borderline range was determined for all substances with at least two runs (resulting in concordant predictions).

Table 4.

Borderline Range Determination Based on the Log Pooled Median Absolute Deviations in the LuSens Ring Trial Data

Data source	Luciferase induction; cutoff: 1.5
Ring trial
Lab 1 (n^a = 20)	1.25–1.80
Lab 2 (n^a = 20)	1.30–1.73
Lab 3 (n^a = 20)	1.29–1.74
Lab 4 (n^a = 13)	1.15–1.96
Lab 5 (n^a = 13)	1.33–1.69
Mean^b	1.28–1.76
Median^b	1.29–1.74
BASF SE
Historical data^c (n = 131)	1.40–1.60

In bold are the values which should be used when using the LuSens to address the second key event in the 2 out of 3 approach.

^a n, the number of substances of the 20 substances assessed in the ring trial, for which at least two test runs were available. Laboratories 1 to 3 tested 8 of the 20 in 3 independent experiments (an experiment consisting of at least two concordant runs); laboratories 4 and 5 tested 13 test chemicals in individual experiments only.

^b For determination of the arithmetic mean and median MADs, the data of the laboratories 4 and 5 were excluded as all substances were assessed in single experiments only.

^c Published in Gabbert et al.¹¹

For the three laboratories testing 20 substances in three independent runs (laboratories 1–3), the borderline range around 1.5-fold luciferase induction varied between 1.25 and 1.80, while the mean and median of these three laboratories were 1.28–1.76 and 1.29–1.74, respectively. The two remaining laboratories tested 13 substances in individual runs only and obtained borderline ranges of 1.15–1.96 and 1.33–1.69. The mean borderline range of laboratories 1 to 3 was slightly wider than the historical range in a routine testing laboratory of 1.40–1.60.

In addition, for the LuSens, the log pooled MADs were also determined based on historical data from one individual laboratory as published in Ref.¹¹

Based on the ring trial data, we propose to implement the statistically derived borderline range for luciferase induction values between 1.28 and 1.76 in OECD TG 442D. The LuSens prediction model, including the borderline range, is summarized in Figure 1.

FIG. 1.

LuSens prediction model, including the borderline range. In the flowchart, it is described how to implement the borderline range determined from the LuSens ring trial. This introduces a third possible outcome of the assay: positive, negative, and borderline. The borderline range determined (log pooled median absolute deviation) from the LuSens ring trial is 1.28 to 1.76. Note: Flowcharts for the prediction models of the DPRA, KeratinoSens, and h-CLAT, including their respective borderline ranges, are available in the OECD guideline on DAs for skin sensitization.¹² DA, defined approach; DPRA, direct peptide reactivity assay; h-CLAT, human cell line activation test.

Human cell line activation test

In the h-CLAT ring trial, 24 substances were assessed in 3 runs in 4 laboratories. The borderline ranges determined based on the log pooled MADs in the h-CLAT ring trial are summarized in Table 5.

Table 5.

Borderline Range Determination Based on the Log Pooled Median Absolute Deviations in the Human Cell Line Activation Test Ring Trial Data

Data source	RFI CD54; cutoff 200	RFI CD86; cutoff 150
Ring trial
Lab 1 (n^a = 24)	152–264	125–181
Lab 2 (n^a = 24)	153–261	125–181
Lab 3 (n^a = 24)	161–248	115–196
Lab 4 (n^a = 24)	162–247	125–180
Mean	157–255	122–184
Median	157–255	122–181
BASF SE
Historical data^b (n = 136)	170–235	132–170

In bold are the values which have been implemented in OECD Guideline No. 497 on the defined approaches on skin sensitization.

^a n, the number of substances of the 24 substances assessed in the ring trial, for which at least three test runs were available.

^b Published in Gabbert et al.¹¹

For the four laboratories participating in the ring trial and considering substances for which at least three runs were available (i.e., 24), the borderline ranges around the RFI of 200 for CD54 and the RFI of 150 for CD86 varied between 152 and 264 and between 115 and 196, respectively. The means of all participating laboratories were 157–255 and 122–184 for CD54 and CD86, respectively. Likewise, the medians of all participating laboratories were 157–255 and 122–181 for CD54 and CD86, respectively. The ranges determined from historical data in a routine testing laboratory were slightly narrower with 170–235 and 132–170 for CD54 and CD86, respectively. In this laboratory, the assay design was slightly more stringent than in OECD TG 442E as the laboratory always tested two replicates instead of only one.

In addition, for the h-CLAT, the log pooled MADs were also determined based on historical data from one individual laboratory as published in Ref.¹¹

Based on the ring trial data, we propose to implement the statistically derived borderline ranges of RFI values of 157–255 for CD54 and 122–184 for CD86 in OECD TG 442E. A flowchart of the h-CLAT prediction model, including the borderline ranges, is available in the OECD guideline on DAs for skin sensitization.¹²

Application of the borderline ranges to regulatory tests

The borderline ranges determined above address the experimental variability. Certain methods such as the KeratinoSens, LuSens, or h-CLAT require that at least two independent runs are performed, and if the outcomes are not concordant, a third run is performed. Including the borderline considerations, an individual run can then be negative, positive, or borderline, and the final result is positive, negative, or inconclusive/borderline according to the scheme in Figure 2.
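The three-way call for an individual run can be illustrated with a small Python sketch. It is a simplification under stated assumptions: the function name is hypothetical, values exactly on a range boundary are treated as borderline, and further acceptance criteria of the individual assays (e.g., cell viability limits or statistical significance) are not modeled.

```python
def classify_run(value, cutoff, br_low, br_high):
    """Rate a single readout as 'p' (positive), 'n' (negative) or 'br'.

    Assumptions: br_low <= cutoff <= br_high, and readouts lying exactly
    on a boundary of the borderline range count as borderline.
    """
    if not br_low <= cutoff <= br_high:
        raise ValueError("borderline range must enclose the cutoff")
    if br_low <= value <= br_high:
        return "br"
    return "p" if value > cutoff else "n"


# KeratinoSens cutoff (1.5) and ring-trial mean borderline range (1.35-1.67);
# the readout values themselves are hypothetical.
calls = [classify_run(v, 1.5, 1.35, 1.67) for v in (1.20, 1.58, 2.10)]
```

The same helper works for the kDPRA with the range −2.07 to −1.93 around the log k_max cutoff of −2.0, since only the ordering of the values matters.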

FIG. 2.

Decision logic, including the borderline ranges for multiple runs within a test or multiple tests within the 2o3 DA, implemented in the OECD guideline on DA for skin sensitization testing. br, result in borderline range; n, negative; p, positive. Where x is given, the third run/test does not need to be conducted. Note: If the first two runs/tests are nonconcordant or if at least one of them falls into the borderline range as described here, the third test is needed. Two borderline outcomes or two nonconcordant results combined with a borderline result will lead to a final inconclusive rating. 2o3 DA, “two out of three” defined approach.

We propose considering final results (from one or multiple runs depending on the test) in the borderline range as inconclusive rather than negative or positive. Repeating a test that yielded an inconclusive result, as proposed by several OECD TGs, does not resolve this. A second test may yield a result further away from the cutoff, but this may well be by chance. It would take several repetitions to prove that the initial result within the borderline range (BR) was the outlier. Often, an initial test result in the borderline range will be confirmed to lie in the BR upon repetition (or the repetitions will provide results on both sides of the cutoff). We have to accept that every given method indeed provides three potential outcomes: positive, negative, and inconclusive.

A borderline result can still be useful, for example, in a weight of evidence assessment. In the majority of cases, however, an inconclusive result would lead to the requirement for additional information not obtained from the same testing method. A test method's predictive capacity is improved if inherent uncertainties are taken into account by implementing borderline ranges. If only the conclusive (positive and negative) results rather than all (positive, negative, and inconclusive) results are considered, the predictivity is higher, but the number of substances for which a prediction can be made is lower, because no conclusion can be drawn for substances that yield results in the borderline range. Within the large database,²⁴ only a limited number of results were close to the decision cutoff: 20 of the 199 test substances in the dataset²⁴ were considered borderline based on an initial assessment (with the borderline range based on results from a single laboratory and determined with a different statistical method).¹⁰ Excluding these 20 test substances only moderately changed the predictivity: sensitivity, specificity, and accuracy for the DPRA improved from 76% to 80%, from 72% to 75%, and from 75% to 78%, respectively.

Until the time of writing, borderline ranges had not been determined in validation studies for either in vitro or in vivo methods adopted as OECD TGs. While ranges around a cutoff had been discussed (and included in some TGs), these ranges were not based on statistical analysis of experimental data. As an example, it is unknown (to the test method developers as well as evaluators at the OECD and ECVAM) how the 3%–10% range provided in the current TG 442C was derived (both statistical method and underlying data are unknown). Therefore, we propose to use a “standardized” statistical method (i.e., log pooled MAD) based on results from multiple laboratories in the ring trials for the validation of the respective method. With this approach, the BR can be determined individually for each laboratory, and the mean of at least three participating laboratories provides a proper measure of the variability of test results close to the decision cutoff across different laboratories. Results within this borderline range should then be considered ambiguous and uncertain, and no conclusion regarding the assignment to either final result (positive or negative) can be drawn; the test result is “inconclusive.” Further data (not from this testing method) are needed to decide on the classification of such a chemical. This favors precise methods over methods with higher variability because fewer test substances will yield inconclusive results. It also argues for using several different sources of information on the same adverse outcome pathway. Table 6 summarizes the borderline ranges for all test methods investigated here.
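The trade-off between predictivity and the number of substances with a prediction can be made concrete with a short sketch. The records below are invented toy data, not the dataset cited above, and the record layout and function name are assumptions made for illustration.

```python
def predictivity(records):
    """Sensitivity, specificity and accuracy from (call, reference) labels.

    records: tuples (call, is_borderline, reference) with 'p' / 'n' labels;
    the borderline flag is ignored here and only used for filtering.
    """
    tp = sum(1 for call, _, ref in records if call == "p" and ref == "p")
    fn = sum(1 for call, _, ref in records if call == "n" and ref == "p")
    tn = sum(1 for call, _, ref in records if call == "n" and ref == "n")
    fp = sum(1 for call, _, ref in records if call == "p" and ref == "n")
    sensitivity = tp / (tp + fn) if tp + fn else float("nan")
    specificity = tn / (tn + fp) if tn + fp else float("nan")
    accuracy = (tp + tn) / len(records) if records else float("nan")
    return sensitivity, specificity, accuracy


# Hypothetical toy data: cutoff-based call, borderline flag, reference class.
toy = [
    ("p", False, "p"), ("p", False, "p"), ("n", True, "p"), ("p", False, "p"),
    ("n", False, "n"), ("n", False, "n"), ("p", True, "n"), ("n", False, "n"),
]

all_results = predictivity(toy)
conclusive = [rec for rec in toy if not rec[1]]
conclusive_results = predictivity(conclusive)
coverage = len(conclusive) / len(toy)
```

In this toy set, dropping the two borderline records raises all three metrics from 0.75 to 1.0 while the share of substances with a prediction falls to 0.75, which mirrors the direction (not the magnitude) of the effect described above.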

Table 6.

Proposed Borderline Range Based on the Log Pooled Median Absolute Deviations in the Ring Trials

	Endpoint	Cutoff	Ring trial mean	Ring trial median	BR implemented in OECD guidelines ^a
DPRA (OECD TG 442C)	Mean peptide depletion (%)	6.38	4.95 to 8.32	4.88 to 8.39	4.95 to 8.32^a
		22.62	17.26 to 29.93	17.16 to 29.97	n/a
		42.47	32.41 to 56.19	32.21 to 56.27	n/a
	Cysteine-only depletion (%)	13.89	10.56 to 18.47	10.54 to 18.39	10.56 to 18.47^a
		23.09	17.04 to 32.22	17.27 to 31.33	n/a
		98.24	72.49 to 137.08	73.49 to 133.32	n/a
kDPRA (OECD TG 442C)	log k_max (s⁻¹ M⁻¹)	−2	−1.93 to −2.07	−1.94 to −2.06	−1.93 to −2.07^a
kDPRA (OECD TG 442C)	Cysteine-only depletion (%)	13.89	12.30 to 15.77	12.32 to 15.65	n/a
KeratinoSens^® (OECD TG 442D)	Luciferase induction (fold-change)	1.5	1.35 to 1.67	1.35 to 1.67	1.35 to 1.67^a
LuSens (OECD TG 442D)	Luciferase induction (fold-change)	1.5	1.28 to 1.76	1.29 to 1.74	1.28 to 1.76
h-CLAT (OECD TG 442E)	Relative fluorescence intensity CD54	200	157 to 255	157 to 255	157 to 255
h-CLAT (OECD TG 442E)	Relative fluorescence intensity CD86	150	122 to 184	125 to 181	122 to 184

^a Borderline ranges based on the means of the ring trials have been implemented in the guideline on defined approaches for skin sensitization¹² for DPRA, KeratinoSens^®, and h-CLAT, while the kDPRA borderline ranges have been implemented in the updated OECD TG 442C.¹⁸

BR, borderline range; DPRA, direct peptide reactivity assay; h-CLAT, human cell line activation test.
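To show how such ranges could be applied to a single read-out, the following sketch classifies one value against a cutoff and its borderline range. The values are the ring trial means from Table 6 for the first cutoff of each endpoint. The dictionary keys, the function name, and the choice to treat the range limits as inclusive are assumptions. The full prediction models of the test guidelines involve further criteria (e.g., several runs, cytotoxicity limits) that are not modeled here.

```python
from typing import Dict, Tuple

# (cutoff, lower limit, upper limit), taken from the ring trial means in Table 6.
BORDERLINE_RANGES: Dict[str, Tuple[float, float, float]] = {
    "DPRA mean peptide depletion (%)": (6.38, 4.95, 8.32),
    "DPRA cysteine-only depletion (%)": (13.89, 10.56, 18.47),
    "KeratinoSens luciferase induction": (1.5, 1.35, 1.67),
    "LuSens luciferase induction": (1.5, 1.28, 1.76),
    "h-CLAT RFI CD54": (200.0, 157.0, 255.0),
    "h-CLAT RFI CD86": (150.0, 122.0, 184.0),
}


def classify(endpoint: str, value: float) -> str:
    """Return 'positive', 'negative' or 'borderline' for a single read-out."""
    cutoff, low, high = BORDERLINE_RANGES[endpoint]
    if low <= value <= high:  # inclusive limits are an assumption
        return "borderline"
    return "positive" if value > cutoff else "negative"


# Example read-outs (hypothetical measurements).
examples = [
    classify("KeratinoSens luciferase induction", 1.40),
    classify("KeratinoSens luciferase induction", 2.00),
    classify("DPRA mean peptide depletion (%)", 3.00),
]
```

The three example calls give a borderline, a positive, and a negative result, respectively.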

Application of the borderline ranges to the 2o3 DA

In the OECD guideline on DAs for skin sensitization,¹² the borderline approach (based on the statistically derived borderline range) has for the first time been implemented in a DA within an OECD guideline. A decision tree incorporating the borderline range considerations is used in the 2o3 DA (Fig. 2); it is the same as the one used when rating chemicals with multiple runs. In the classical 2o3 DA,^24,25 chemicals are rated positive if any two test results from OECD TG 442D, DPRA, and h-CLAT are positive. Similarly, two concordant negative results give a negative rating. With the borderline range logic implemented, a chemical is rated positive or negative if two concordant and nonborderline results are obtained. Chemicals are rated inconclusive if at least two results are borderline, or if two nonconcordant results are obtained and the third result is borderline. If such inconclusive results are obtained, further evidence (e.g., from in silico tools or other [nonanimal] methods) or an expert-based assessment is needed, which is then not covered under the Mutual Acceptance of Data by the OECD countries but could still be used for regulatory decisions. Decision-theoretic approaches, such as the Bayesian value-of-information approach discussed in the context of skin sensitization testing in Ref.²⁶, can provide guidance for choosing the optimal follow-up test in a systematic and transparent way. Alternatively, depending on the intended use and the regulatory context, results above the cutoff may still be considered positive. However, this very conservative approach would imply that the part of the borderline range above the cutoff is effectively ignored. Table 7 indicates the predictivity of the 2o3 DA on the OECD reference database against both human and LLNA reference data, with and without the borderline range assessment.
Sensitivity and balanced accuracy are enhanced when borderline chemicals are considered inconclusive according to the scheme in Figure 2. Of the 168 substances in the OECD reference database with LLNA data, 71 substances had at least one borderline result in one of the three assays (13 substances had borderline results in 2 assays and 1 substance in all 3 assays). In total, 34 of the 168 substances were rated inconclusive in the 2o3 DA, with 7 LLNA negatives and 27 LLNA positives. As observed previously, the effect of implementing the borderline range is moderate (4%–6% gain in balanced accuracy), but this reduction in uncertainty may be key for regulatory acceptance.
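The rating rules stated above can be condensed into a short sketch. It assumes that each assay has already been reduced to one of three outcomes ("positive", "negative", "borderline"); the function name is hypothetical, and the sketch does not reproduce the full decision tree of the guideline (e.g., the h-CLAT log K_ow limitation is not modeled).

```python
from collections import Counter
from typing import Sequence

VALID_OUTCOMES = {"positive", "negative", "borderline"}


def two_out_of_three(results: Sequence[str]) -> str:
    """Rate a chemical from up to three individual assay outcomes.

    Two concordant nonborderline results decide the rating; every other
    combination (at least two borderline results, or one positive, one
    negative and one borderline) is inconclusive.
    """
    if not 2 <= len(results) <= 3:
        raise ValueError("expected two or three assay outcomes")
    unknown = set(results) - VALID_OUTCOMES
    if unknown:
        raise ValueError(f"unknown outcome(s): {sorted(unknown)}")
    counts = Counter(results)
    if counts["positive"] >= 2:
        return "positive"
    if counts["negative"] >= 2:
        return "negative"
    return "inconclusive"


# Outcomes for DPRA, an OECD TG 442D assay, and h-CLAT (hypothetical chemicals).
rating_a = two_out_of_three(["positive", "borderline", "positive"])
rating_b = two_out_of_three(["positive", "negative", "borderline"])
rating_c = two_out_of_three(["borderline", "borderline", "negative"])
```

With only two discordant outcomes the function also returns "inconclusive"; in practice, this is the point at which the third assay would be run.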

Table 7.

Predictivity of the “Two Out of Three” Defined Approach for the OECD Database of 168 Chemicals^a When Incorporating the Borderline Principles as Outlined in Figure 1

	Sensitivity (%)	Specificity (%)	Balanced accuracy (%)	n
“2 out of 3” versus human data
Without considering borderline ranges	83	82	83	65
Considering borderline ranges^b	89	88	88	55 (10 inconclusive)
Considering borderline ranges and the log K_ow h-CLAT limitation^c	89	88	88	55 (10 inconclusive)
“2 out of 3” versus LLNA data
Without considering borderline ranges	74	85	80	168
Considering borderline ranges^b	79	85	82	139 (29 inconclusive)
Considering borderline ranges and the log K_ow h-CLAT limitation^c	82	85	84	134 (34 inconclusive)

^a OECD (2021).¹²

^b Borderline ranges were considered as summarized in Table 6.

^c The reported performance metrics also account for the h-CLAT limitation, whereby negative results are rated inconclusive for test materials with log K_ow above 3.5. None of the substances with human data fell within this group. Hence, the results versus human data are identical to those of the preceding analysis.

LLNA, local lymph node assay.

Conclusion

The implementation of borderline ranges in regulatory toxicology provides a more realistic picture than ignoring the limited precision of a method and merely translating its (continuous) read-out data into a black and white decision. Borderline ranges should therefore be implemented in OECD test guidelines. Binary decisions on a toxic potential should be based on various, different data sources to account for individual methods yielding inconclusive results. The 2o3 DA in the OECD guideline on DAs for skin sensitization¹² is the first example in which borderline ranges determined from experimental validation studies are included in the final prediction model, helping to carry the known experimental uncertainty into the final outcome. It has to be kept in mind, however, that we apply a new level of scrutiny here when using Integrated Strategies and Defined Approaches based on nonanimal methods. This level of scrutiny has not yet been applied to traditional toxicological test methods using laboratory animals, despite the fact that the read-outs of animal studies can be variable and limited in their precision due to relatively low animal numbers and complex, often subjective examinations. The implementation of borderline ranges will acknowledge uncertainties and favor the more precise methods.

Footnotes

Acknowledgments

We thank EURL ECVAM for providing us with the ring trial data of the DPRA and h-CLAT. Also, we particularly thank Chantra Eskes for her support in implementing the borderline range concept into the OECD guideline for DA SS.

Author Disclosure Statement

A.N. is an employee of Givaudan SA, and M.M., R.L., and S.N.K. are employees of BASF SE; both companies use the methods described to develop and register substances.

Funding Information

No funding was received for this study.

References

1. OECD. Test No. 431: In Vitro Skin Corrosion: Reconstructed Human Epidermis (RHE) Test Method; OECD Guidelines for the Testing of Chemicals, Section 4, OECD Publishing, Paris, 2019.

2. OECD. Test No. 435: In Vitro Membrane Barrier Test Method for Skin Corrosion; OECD Guidelines for the Testing of Chemicals, Section 4, OECD Publishing, Paris, 2015.

3. OECD. Test No. 437: Bovine Corneal Opacity and Permeability Test Method for Identifying i) Chemicals Inducing Serious Eye Damage and ii) Chemicals Not Requiring Classification for Eye Irritation or Serious Eye Damage; OECD Guidelines for the Testing of Chemicals, Section 4, OECD Publishing, Paris, 2020.

4. Kolle, Basketter, Casati, et al. Performance standards and alternative assays: Practical insights from skin sensitization. Regul Toxicol Pharmacol, 2013:65; 278–285.

5. Dimitrov, Detroyer, Piroird, et al. Accounting for data variability, a key factor in in vivo/in vitro relationships: Application to the skin sensitization potency (in vivo LLNA versus in vitro DPRA) example. J Appl Toxicol, 2016:36; 1568–1578.

6. Hoffmann. LLNA variability: An essential ingredient for a comprehensive assessment of non-animal skin sensitization test methods and strategies. ALTEX, 2015:32; 379–383.

7. Dumont, Barroso, Matys, et al. Analysis of the Local Lymph Node Assay (LLNA) variability for assessing the prediction of skin sensitisation potential and potency of chemicals with non-animal approaches. Toxicol In Vitro, 2016:34; 220–228.

8. Luechtefeld, Maertens, Russo, et al. Analysis of publically available skin sensitization data from REACH registrations 2008–2014. ALTEX, 2016:33; 135–148.

9. Leontaridou, Urbisch, Kolle, et al. The borderline range of toxicological methods: Quantification and implications for evaluating precision. ALTEX, 2017:34; 525–538.

10. Leontaridou, Gabbert, Landsiedel. The impact of precision uncertainty on predictive accuracy metrics of non-animal testing methods. ALTEX, 2019:36; 435–446.

11. Gabbert, Mathea, Kolle, et al. Accounting for precision uncertainty of toxicity testing: Methods to define borderline ranges and implications for hazard assessment of chemicals. Risk Anal. 2020. [Epub ahead of print]; DOI: 10.1111/risa.13648

12. OECD. Guideline No. 497: Defined Approaches on Skin Sensitisation; OECD Guidelines for the Testing of Chemicals, Section 4, OECD Publishing, Paris, 2021.

13. ECVAM. TSAR entry direct peptide reactivity assay (cited September 15, 2020). https://tsar.jrc.ec.europa.eu/test-method/tm2009-06 (last accessed June 22, 2021).

14. ECVAM. TSAR entry human cell line activation test (cited September 15, 2020). https://tsar.jrc.ec.europa.eu/test-method/tm2008-05 (last accessed June 22, 2021).

15. Wareing, Kolle, Birk, et al. The kinetic direct peptide reactivity assay (kDPRA): Intra- and inter-laboratory reproducibility in a seven-laboratory ring trial. ALTEX, 2020:37; 639–651.

16. Natsch, Bauch, Foertsch, et al. The intra- and inter-laboratory reproducibility and predictivity of the KeratinoSens assay to predict skin sensitizers in vitro: Results of a ring-study in five laboratories. Toxicol In Vitro, 2011:25; 733–744.

17. Ramirez, Stein, Aumann, et al. Intra- and inter-laboratory reproducibility and accuracy of the LuSens assay: A reporter gene-cell line to detect keratinocyte activation by skin sensitizers. Toxicol In Vitro, 2016:32; 278–286.

18. OECD. Test No. 442C: In Chemico Skin Sensitisation; OECD Guidelines for the Testing of Chemicals, Section 4, OECD Publishing, Paris, 2020.

19. OECD. Test No. 442C: In Chemico Skin Sensitisation: Assays Addressing the Adverse Outcome Pathway Key Event on Covalent Binding to Proteins; OECD Guidelines for the Testing of Chemicals, Section 4, OECD Publishing, Paris, 2021.

20. OECD. Test No. 442D: In Vitro Skin Sensitisation; OECD Guidelines for the Testing of Chemicals, Section 4, OECD Publishing, Paris, 2018.

21. OECD. Test No. 442E: In Vitro Skin Sensitisation; OECD Guidelines for the Testing of Chemicals, Section 4, OECD Publishing, Paris, 2018.

22. OECD. The Adverse Outcome Pathway for Skin Sensitisation Initiated by Covalent Binding to Proteins; OECD Guidelines for the Testing of Chemicals, Section 4, OECD Publishing, Paris, 2012.

23. Wareing, Urbisch, Kolle, et al. Prediction of skin sensitization potency sub-categories using peptide reactivity data. Toxicol In Vitro, 2017:45(Pt 1); 134–145.

24. Urbisch, Mehling, Guth, et al. Assessing skin sensitization hazard in mice and men using non-animal test methods. Regul Toxicol Pharmacol, 2015:71; 337–351.

25. Bauch, Kolle, Ramirez, et al. Putting the parts together: Combining in vitro methods to test for skin sensitizing potentials. Regul Toxicol Pharmacol, 2012:63; 489–504.

26. Leontaridou, Gabbert, Van Ierland, et al. Evaluation of non-animal methods for assessing skin sensitisation hazard: A Bayesian Value-of-Information analysis. Altern Lab Anim, 2016:44; 255–269.

Assessing Experimental Uncertainty in Defined Approaches: Borderline Ranges for In Chemico and In Vitro Skin Sensitization Methods Determined from Ring Trial Data
