Abstract
Background
Laboratory investigations may be added to existing requests either automatically on the basis of algorithms (reflex testing) or by laboratory professionals (reflective testing). The clinical utility of reflex and reflective testing is not fully established. We studied efficiency (number of tests that need to be added to make a diagnosis) and effectiveness (number of diagnoses) of reflex and reflective testing in selected biochemical scenarios.
Methods
Using fixed rules, we prospectively measured efficiency and effectiveness of reflex and reflective testing in the following scenarios (reflex initiators in parentheses): (1) hypovitaminosis D (hypocalcaemia plus elevated alkaline phosphatase activity); (2) hypomagnesaemia (hypokalaemia or hypocalcaemia); (3) hypothyroidism (high thyroid-stimulating hormone [TSH]); (4) hyperthyroidism (low TSH); (5) haemochromatosis (reflex or reflective addition of iron studies, followed by reflective addition of genetic studies). Separately, using a different data-set, we examined the impact of varying TSH thresholds on outcomes in the biochemical diagnosis of hyper- and hypothyroidism.
Results
In patients aged over 55 y, 25-hydroxy-vitamin D <50 nmol/L could be predicted with ≥90% certainty when albumin-adjusted calcium was ≤2.1 mmol/L plus alkaline phosphatase >150 U/L. Higher numbers of tests were needed to make a diagnosis in other scenarios. In general, more diagnoses were made by reflex testing. Outside the euthyroid TSH range, efficiency of diagnosis of hyper- and hypothyroidism became asymptotic, while effectiveness declined.
Conclusions
Near-maximal efficiency of reflex testing can be achieved, depending on the reflex and diagnostic thresholds applied. Reflective and reflex testing are complementary activities, the clinical utility of which depends on the initiators used.
Introduction
The practice of adding on laboratory tests to existing requests is common in contemporary clinical biochemistry in a number of countries. It is most often done to establish or exclude a diagnosis suggested by the results of the tests requested initially. The majority of tests are added by automated analysers, based on rules (algorithms) established by laboratory professionals; this is defined as ‘reflex testing’. The remainder are added by clinical biochemists, after consideration of a wider range of information than can readily be incorporated into reflex testing algorithms; this is defined as ‘reflective testing’. 1 Such information includes demographic data, clinical information and previous results, including those from other disciplines in laboratory medicine. The term ‘reflective’ captures the fact that it requires clinical judgement and experience. 2 Both reflex and reflective testing became possible with the advent of laboratory information management systems (LIMS) that were sufficiently flexible to permit modification of existing test requests at various stages of the analytical process.
Reflex and reflective testing form part of a wider spectrum of activity that ‘adds value’ to the high-volume processing of laboratory specimens, 3 and certainly both patients 4 and requesting clinical colleagues 5 appear to be broadly comfortable with the concept of tests being added unprompted. Reflex testing in particular is widely used, the major aim being to optimize the use of laboratory tests. 6–8 The need to regulate reflex testing 9 and to involve clinical colleagues in the establishment of reflex algorithms 10 has rightly been emphasized. The extent of reflective testing is less clear, although the principle that it can contribute to the diagnosis of specific conditions has been established. 11,12
In contrast with, for example, patient-specific narrative interpretations, reflex and reflective testing are amenable to quantitative analysis, in the form of the number of tests that need to be added in order to make a diagnosis (NND). Despite this, few studies have examined their clinical utility in this way, 1,13,14 and several key questions remain unanswered. How does their efficiency (NND) compare across different biochemical scenarios? What is the effect of varying the threshold (cut-off) concentrations or activities that initiate reflex testing on efficiency and effectiveness (the number of diagnoses made)? Is it possible to achieve maximal or near-maximal efficiency (NND ∼ 1) with reflex testing, where the diagnosis can be made presumptively and the addition of the test therefore becomes superfluous? In the current study, we have examined reflex and reflective testing across a range of biochemical scenarios and across a range of thresholds, in an attempt to provide answers to some of these questions.
Methods
Reflective testing
Reflective testing by clinical biochemists working in a single department was observed prospectively for one year. No particular arrangements were made for the purposes of this study, and no feedback was provided during the study period. We wished to examine, but not alter, ‘usual’ reflective testing practice. All participating staff were aware of the reflex thresholds in use.
Reflex testing
Reflex algorithms were established by the same group of clinical biochemists whose reflective testing practice was examined. These rules were then applied prospectively over the same period of study. The rules were established after pilot studies in which their functionality was established in the LIMS, and the anticipated numbers of additional tests estimated based on previous data.
Quantitative analysis of reflex and reflective testing
Efficiency
This was calculated as the number of tests that had to be added in order to make the biochemical diagnosis under consideration – the NND.
Effectiveness
This was the number of biochemical diagnoses made during the study period.
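In symbols, the two measures can be written as below. The notation is introduced here purely for illustration and is not the authors': N_added is the number of tests added in a given scenario and N_dx is the number of those added tests that met the diagnostic threshold.

```latex
\[
\mathrm{NND} = \frac{N_{\mathrm{added}}}{N_{\mathrm{dx}}},
\qquad
\text{effectiveness} = N_{\mathrm{dx}}
\]
```

On this definition NND cannot fall below 1, a value of 1 represents maximal efficiency, and NND is undefined when no diagnoses are made.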
Biochemical scenarios
Diagnosis of hypovitaminosis D based on hypocalcaemia and elevated alkaline phosphatase activity;
Diagnosis of hypomagnesaemia based on hypokalaemia or hypocalcaemia;
Diagnosis of hypothyroidism (low free thyroxine) based on high thyroid-stimulating hormone (TSH);
Diagnosis of hyperthyroidism (high free thyroxine) based on low TSH;
Diagnosis of haemochromatosis based on a two-step process: addition of iron studies, followed by reflective addition of genetic studies.
Table 1 summarizes the specific criteria applied in each scenario.
Table 1. Reflex and diagnostic thresholds applied prospectively. TSH, thyroid-stimulating hormone; ALT, alanine aminotransferase.
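As an illustration of how fixed reflex rules of this kind might be encoded, the following minimal sketch applies the initiator thresholds quoted elsewhere in this paper for two of the scenarios (25-hydroxy-vitamin D and magnesium). The `Sample` class, field names and function name are hypothetical, and treating age over 55 y as part of the vitamin D initiator is an assumption; it is not the rule set configured in the LIMS.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Sample:
    """Hypothetical container for the results available on one request."""
    age_years: int
    adjusted_calcium: Optional[float] = None  # mmol/L, albumin-adjusted
    alk_phos: Optional[float] = None          # U/L
    potassium: Optional[float] = None         # mmol/L


def reflex_additions(s: Sample) -> List[str]:
    """Return the tests a fixed-rule reflex algorithm would add to this sample."""
    added = []

    # Hypovitaminosis D: hypocalcaemia plus elevated alkaline phosphatase.
    # Assumption: age > 55 y is treated here as part of the initiator.
    if (
        s.age_years > 55
        and s.adjusted_calcium is not None
        and s.alk_phos is not None
        and s.adjusted_calcium <= 2.1
        and s.alk_phos > 150
    ):
        added.append("25-hydroxy-vitamin D")

    # Hypomagnesaemia: hypokalaemia OR hypocalcaemia.
    low_k = s.potassium is not None and s.potassium < 2.5
    low_ca = s.adjusted_calcium is not None and s.adjusted_calcium < 1.80
    if low_k or low_ca:
        added.append("magnesium")

    return added


print(reflex_additions(Sample(age_years=70, adjusted_calcium=2.0, alk_phos=200)))
print(reflex_additions(Sample(age_years=40, potassium=2.3)))
```

Missing results are represented as `None` so that a rule fires only when every analyte it depends on has actually been measured.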
Additional thyroid studies
We wished to examine the efficiency and effectiveness of reflex addition of free thyroxine across the entire range of TSH measurement. It was not financially feasible to apply this to TSH thresholds within the euthyroid range (effectively measuring free thyroxine on all samples). We therefore examined a separate data-set from a laboratory where free thyroxine and TSH are measured on all samples.
Analytical methods
All biochemical analyses, apart from 25-hydroxy-vitamin D, were performed on a Roche modular system (Roche Diagnostics Limited, Burgess Hill, West Sussex, UK). 25-hydroxy-vitamin D was measured by enzyme immunoassay (OCTEIA®, Immunodiagnostics Ltd, Boldon, Tyne and Wear, UK). The analysis of genetic mutations for hereditary haemochromatosis (HH) (p.Cys282Tyr and p.His63Asp) was performed at the Department of Human Genetics, Ninewells Hospital and Medical School, Dundee, by an amplification refractory mutation system method. 15 Existing evidence indicates that in the UK more than 90% of patients with HH are homozygous for the C282Y mutation of the HFE (High iron Fe) gene and another 4% are compound heterozygotes (combination of a C282Y mutant allele and an H63D mutant allele). 16
Results
Hypovitaminosis D
Reflex serum 25-hydroxy-vitamin D was added to 92 samples which met the initiator criterion; it was <50 nmol/L (the target concentration for replacement) in 81. During the same period, vitamin D was added reflectively in 124 cases, and was <50 nmol/L in 114. The total number of 25-hydroxy-vitamin D measurements during the same period was 1307, of which 801 were <50 nmol/L.
Hypomagnesaemia
Reflex serum magnesium was added to 209 samples where serum potassium was <2.5 mmol/L; it was below the lower reference limit (0.7 mmol/L) in 96. In 109 cases, it was added on account of an albumin-adjusted serum calcium <1.80 mmol/L; it was low in 41. The corresponding data for reflective addition of magnesium were as follows: (i) hypokalaemia: 115 added; low in 48; (ii) hypocalcaemia: 33 added; low in 14. The total number of magnesium measurements during the same period was 51,968, of which 9534 were <0.7 mmol/L.
Hypothyroidism
Reflex addition of 2851 free thyroxine measurements identified 153 cases where it was <11.0 pmol/L. The corresponding figures for reflective addition of free thyroxine were as follows: seven added; none <11.0 pmol/L. The total number of TSH measurements during the same period was 112,200, and the total number of free thyroxine measurements was 19,293, of which 1467 were <11.0 pmol/L.
Hyperthyroidism
Reflex addition of 161 free thyroxine measurements identified 59 cases where it was >22.0 pmol/L. The corresponding figures for reflective addition were: 28 added; six >22.0 pmol/L. During the same period, 1761 free thyroxine results were >22.0 pmol/L.
Table 2 summarizes the efficiency and effectiveness figures for these biochemical scenarios.
Table 2. Efficiency and effectiveness of reflex and reflective testing in four biochemical scenarios. NND, number needed to diagnose. Diagnostic criteria: *25-hydroxy-vitamin D <50 nmol/L; †magnesium <0.70 mmol/L; ‡serum free thyroxine <11.0 pmol/L; §serum free thyroxine >22.0 pmol/L.
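The arithmetic behind these measures can be reproduced from some of the counts reported in the text above. The sketch below is illustrative only: the function name and data layout are hypothetical, and pooling the hypokalaemia and hypocalcaemia initiators into a single magnesium figure is an assumption about how the scenario was summarized.

```python
def nnd(tests_added, diagnoses):
    """Number of added tests per diagnosis; None when no diagnoses were made."""
    if diagnoses == 0:
        return None
    return tests_added / diagnoses


# (tests added, results meeting the diagnostic threshold), taken from the text
counts = {
    ("hypovitaminosis D", "reflex"): (92, 81),
    ("hypovitaminosis D", "reflective"): (124, 114),
    # Assumption: hypokalaemia and hypocalcaemia initiators are pooled
    ("hypomagnesaemia", "reflex"): (209 + 109, 96 + 41),
    ("hypomagnesaemia", "reflective"): (115 + 33, 48 + 14),
    ("hypothyroidism", "reflex"): (2851, 153),
    ("hypothyroidism", "reflective"): (7, 0),
}

for (scenario, mode), (added, dx) in counts.items():
    value = nnd(added, dx)
    shown = "undefined" if value is None else f"{value:.1f}"
    print(f"{scenario:18s} {mode:10s} diagnoses={dx:4d}  NND={shown}")
```

Effectiveness is simply the second number in each pair. Returning `None` rather than raising an error keeps scenarios with no diagnoses, such as reflective addition of free thyroxine for hypothyroidism, in the summary.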
Additional thyroid studies
We retrospectively examined a separate data-set from a laboratory where free thyroxine and TSH are measured on all samples. Efficiency and effectiveness of reflex addition of free thyroxine were plotted across the range of TSH measurements in the diagnosis of hypo- and hyperthyroidism (Figure 1).

Figure 1. Number of diagnoses (cumulative) and numbers needed to diagnose (NND) plotted across the range of thyroid-stimulating hormone (TSH) measurement. ‘Prospective’ relates to data collected applying TSH reflex thresholds prospectively; ‘retrospective’ to data reviewed retrospectively from a second laboratory where free thyroxine is routinely measured on all thyroid requests. See text for further details.
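Curves of this kind could in principle be derived from any data-set in which TSH and free thyroxine are both measured on every sample. The sketch below shows one possible approach for the hypothyroid (high TSH) side only. The function name, the data layout and the toy values are all hypothetical and were invented for illustration; only the free thyroxine diagnostic threshold (<11.0 pmol/L) is taken from the text.

```python
def sweep_high_tsh(pairs, tsh_thresholds, ft4_low=11.0):
    """For each candidate TSH reflex threshold, count the free thyroxine tests
    that would be added (TSH above threshold) and how many of those would meet
    the diagnostic criterion (free thyroxine below ft4_low)."""
    rows = []
    for threshold in tsh_thresholds:
        added = [ft4 for tsh, ft4 in pairs if tsh > threshold]
        diagnoses = sum(1 for ft4 in added if ft4 < ft4_low)
        nnd = len(added) / diagnoses if diagnoses else None
        rows.append((threshold, len(added), diagnoses, nnd))
    return rows


# Toy (TSH, free thyroxine in pmol/L) pairs: NOT study data
toy = [(1.2, 15.0), (3.8, 14.0), (5.5, 13.0), (7.0, 10.5), (12.0, 9.0), (40.0, 5.0)]

for threshold, added, diagnoses, nnd in sweep_high_tsh(toy, [2.0, 5.0, 10.0]):
    print(threshold, added, diagnoses, nnd)
```

With these toy values, raising the threshold lowers the NND while the number of diagnoses falls, which is the trade-off between efficiency and effectiveness examined here. The hyperthyroid side would mirror this with TSH below the threshold and free thyroxine >22.0 pmol/L.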
Haemochromatosis
Of 1128 reflex iron studies added, percentage saturation of transferrin (PSAT) was high (>50% in women, >55% in men) in 64. Other causes of iron overload were identified in three cases. The results of the remaining patients were discussed with their primary care physicians, on the basis of which 35 underwent HFE genotyping. Six were homozygous for C282Y and one compound heterozygous for C282Y/H63D. By comparison, reflective testing resulted in the addition of 35 iron studies, in seven of which PSAT was high. After discussion with primary care physicians, four of these underwent HFE genotyping; one was found to be homozygous for C282Y. NND were calculated for the overall diagnostic process and for each component step (Table 3). The total number of iron studies during the same period was 24,263, in 823 of which PSAT was high.
Table 3. Efficiency and effectiveness of reflex and reflective testing in the diagnosis of HH. PSAT, percentage saturation of transferrin; NND, number needed to diagnose; HFE, high Fe; HH, hereditary haemochromatosis; ALT, alanine aminotransferase. *Genotyping added reflectively after discussion with primary care physician, irrespective of the initial step.
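The stepwise calculation can be sketched from the counts given in the text above. The code is illustrative: the step definitions (iron studies per high PSAT, genotypes per HH diagnosis, iron studies per HH diagnosis) and the decision to count the compound heterozygote as an HH diagnosis are assumptions, and the names used are hypothetical.

```python
def nnd(tests_added, diagnoses):
    """Number of added tests per positive finding; None if there were none."""
    return tests_added / diagnoses if diagnoses else None


# Counts as reported in the text
pathways = {
    # Assumption: C282Y homozygotes (6) plus the compound heterozygote (1)
    "reflex": {"iron_studies": 1128, "high_psat": 64, "genotyped": 35, "hh": 6 + 1},
    "reflective": {"iron_studies": 35, "high_psat": 7, "genotyped": 4, "hh": 1},
}

for name, p in pathways.items():
    step1 = nnd(p["iron_studies"], p["high_psat"])  # iron studies per high PSAT
    step2 = nnd(p["genotyped"], p["hh"])            # genotypes per HH diagnosis
    overall = nnd(p["iron_studies"], p["hh"])       # iron studies per HH diagnosis
    print(f"{name:10s} step 1={step1:.1f}  step 2={step2:.1f}  overall={overall:.1f}")
```

Under these assumptions the overall figures come to roughly 161 added iron studies per HH diagnosis for reflex testing and 35 for reflective testing.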
Discussion
The practice of adding on tests, whether reflexly or reflectively, is widespread yet remains poorly characterized. In the current study, we have examined the clinical utility of this practice.
First, we have shown that it is possible to achieve near-maximal efficiency with reflex and reflective testing, depending on the biochemical scenario. For hypovitaminosis D, reflex addition of 25-hydroxy-vitamin D was associated with NND of about 1.1; thus in patients over 55 y of age, the diagnosis of low vitamin D could be made with at least 90% certainty when serum albumin-adjusted calcium was ≤2.1 mmol/L and serum alkaline phosphatase was >150 U/L. (Reflective addition of 25-hydroxy-vitamin D was also associated with near-maximal efficiency, but may have involved different or additional considerations.) At a time when requests for vitamin D are soaring, 17 and laboratories are actively seeking to contain costs, 18 this degree of certainty may tempt some towards presumptive diagnosis in those cases which meet these reflex thresholds, instead of adding on the test. The specific thresholds applied, both for adding the 25-hydroxy-vitamin D test and for making the diagnosis of hypovitaminosis D, are less important here than the ‘proof of principle’ that near-maximal efficiency can be achieved.
Second, the efficiency of reflex and reflective testing varies according to the biochemical scenario. For example, in this study both reflex and reflective testing as applied to hypovitaminosis D were much more efficient than as applied to the diagnosis of haemochromatosis (specifically the addition of iron studies). The higher NND seen in the latter scenario may reflect the difficulty in extracting those factors that identify patients likely to have iron overload from the complex and prevalent aetiologies of high alanine aminotransferase (ALT) activity. The much lower NND seen with reflective testing in this scenario (35 compared with 161 for reflex testing) is consistent with this explanation because reflective testing allows consideration of a wider range of information than reflex testing, as pointed out earlier, that may help to improve efficiency. However, reflective testing is not always more efficient than reflex testing. For diagnosis of hypomagnesaemia, the NND for both were similar (2.3 for reflex and 2.4 for reflective) and, in the case of hyperthyroidism, reflex testing was more efficient (2.9 compared with 4.7 for reflective testing).
We sought an explanation for this last, counter-intuitive, observation. We plotted the efficiency (NND) of reflex addition of free thyroxine across the range of TSH measurements in the diagnosis of hypo- and hyperthyroidism (Figure 1). As anticipated, the further the TSH result is from the euthyroid range, the more efficient the process (efficiency becomes asymptotic close to the euthyroid range, presumably reflecting the homeostatic feedback loop of the thyroid axis). Within the reflex thresholds (i.e. within the euthyroid range where free thyroxine is not added reflexly), we used data from a second laboratory where free thyroxine and TSH are measured on all thyroid requests, to establish in effect what the free thyroxine would have been had reflex testing existed in this range. As expected, we found that the NND rises steeply within the euthyroid range. Since reflective testing only occurs in this range (because there is no need for it once TSH falls outside the reflex thresholds), we speculate that this may explain the higher NND seen compared with reflex testing; the reduced NND that might otherwise be anticipated with reflective testing was not enough to compensate for other factors (e.g. non-thyroidal illness) that explain the higher NND seen in this range. Clearly, the efficiency of both reflex and reflective testing depends critically on the reflex threshold applied.
Our study has several limitations. First, we studied a small number of biochemical scenarios. Our findings suggest that observations relating to efficiency and effectiveness of reflex and reflective testing are likely to apply only to the scenarios studied and should not be extrapolated. (However, experience, and in some instances existing literature, suggest that reflex and reflective testing is widely applied to the scenarios studied here.) Second, we collected prospective data applying a single reflex threshold in each scenario throughout the period of the study. Our analyses on hypo- and hyperthyroidism clearly show that efficiency and effectiveness of reflex and reflective testing depend critically on the reflex threshold applied. On the other hand, these same analyses also provide a framework for further similar studies in other scenarios. Finally, the use of a retrospective data-set (with free thyroxine and TSH measured on all thyroid requests) was a compromise. As indicated above, we established in effect what the free thyroxine would have been had reflex testing existed in this range, rather than applying ‘real’ reflex thresholds prospectively. However, in the ‘non-euthyroid’ parts of the TSH range, we compared the prospective data collected using ‘real’ reflex thresholds with the retrospective data, and found an almost identical pattern (Figure 1), suggesting that this is a valid approach.
The findings reported here are important for several reasons. First, the observation that near-maximal efficiency can be achieved, depending on the biochemical scenario, and the diagnostic and reflex thresholds used, has implications for laboratory testing strategies. This applies especially to those added tests which are sent away to other laboratories, where staff input required is often disproportionate. Further studies are needed to establish if similar efficiency can be achieved in other scenarios. Second, the finding that efficiency and effectiveness of reflex and reflective testing vary according to biochemical scenario and reflex threshold indicates a real need for systematic studies. Third, the reflective addition of haemochromatosis genotyping triggered by biochemical evidence of iron overload (in turn triggered reflexly by high ALT activity or reflectively) indicates that these activities have ramifications across the disciplines of laboratory medicine. Finally, and fundamentally, our findings indicate that reflex and reflective addition of laboratory tests are complementary strategies. Reflex testing is best suited to those scenarios where high efficiency (low NND) can readily be achieved; the contribution of reflective testing is comparatively greater where more complex factors need to be considered.
DECLARATIONS
