Abstract

Reflex testing refers to the addition of tests by automated analysers, based on algorithms established by laboratory professionals. It is distinguished from reflective testing, which refers to the addition of tests by clinical biochemists and takes into account a more complex range of information than can readily be incorporated into reflex testing rules. 1 Much of the recent literature on the value of adding laboratory tests to existing requests has focused on reflective testing, although more tests are added reflexly. The pros and cons of reflective testing were debated at a recent meeting in Scotland of the Association for Clinical Biochemistry where an overwhelming majority of participants favoured the practice. Both patients 2 and requesting clinical colleagues 3 are comfortable with the concept of tests being added unprompted. Thus there appears to be a consensus that reflective testing is a Good Thing (sic). 4
The question is: how good? Recent studies which have examined the value of adding tests fall by and large into two categories: those which seek to prove the principle that the reflective addition of tests in a particular scenario identifies patients who would otherwise be missed; 5,6 and those which seek to quantify the efficiency of the process, using the number needed to diagnose (NND). 7–9 (NND is the number of tests that must be added in order to make one suspected diagnosis; for example, in the context of hypothyroidism suspected on the basis of an elevated thyroid stimulating hormone [TSH], it is the total number of free thyroxine measurements added, divided by the number of low free thyroxine results. It is inversely related to efficiency: the lower the NND, the more efficient the add-on process.) Although the applicable reflex thresholds have been documented in some of these studies, 8,9 they have received insufficient attention. This matters because the efficiency of both reflex and reflective testing depends critically on the reflex thresholds applied. 9 The purpose of the present article is to explain why in more detail.
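The verbal definition of NND given above can be written compactly. The symbols are introduced here purely for illustration and are not notation taken from the cited studies: N_added is the number of tests added, and N_positive is the number of those added tests whose result supports the suspected diagnosis (for example, a low free thyroxine when hypothyroidism is suspected).

```latex
\[
\mathrm{NND} \;=\; \frac{N_{\text{added}}}{N_{\text{positive}}}
\]
```

On this definition the lowest possible NND is 1 (every added test yields a diagnosis), and the NND is undefined when no added test is positive.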
Let us look in turn at how the choice of reflex thresholds influences efficiency and effectiveness of reflex and reflective testing. Many laboratories offer TSH measurement as their first-line thyroid function test, with free thyroxine then being added by the laboratory ‘when indicated’. Figure 1 plots the efficiency (as NND) and effectiveness of reflex addition of free thyroxine across the range of TSH measurements. (To be entirely accurate, it infers what these outcomes would be from a retrospective analysis of simultaneous free thyroxine and TSH measurements; however, as we have shown previously, 9 this seems to be a valid approach.) The hypothyroid TSH reflex threshold of 4.0 mU/L (highlighted by a dotted vertical line in Figure 1a) is ‘tight’; it sits at the innermost end of the range of thresholds that might sensibly have been chosen. On the left side of this threshold, within the euthyroid TSH range, the NND rises steeply (i.e. efficiency falls), while the number of diagnoses rises more gently (i.e. effectiveness rises); any threshold <4.0 mU/L would add a modest yield of additional diagnoses at a very high relative cost (of thyroxine assays). (The threshold might reasonably have been chosen further out into the hypothyroid range, on the grounds of increasing efficiency, although this would be at the expense of effectiveness; a TSH threshold of 7.0 mU/L is an example of such a compromise.) By contrast, the hyperthyroid TSH threshold (0.1 mU/L) (similarly highlighted in Figure 1b) is ‘loose’; the number of diagnoses (effectiveness) rises steadily on the right side of this threshold, heading into the euthyroid TSH range, while NND rises very gradually until approximately 0.5 mU/L, at which point it rises much more steeply. If the ‘tighter’ threshold of 0.5 mU/L were adopted instead, there would be a gain in effectiveness with only modest loss of efficiency.
Figure 1. Number of diagnoses (cumulative) and numbers needed to diagnose (NND), for (a) hypothyroidism and (b) hyperthyroidism. Dotted vertical lines indicate reflex thresholds. Reflective testing is confined to shaded areas (see text for details). TSH, thyroid stimulating hormone.
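The kind of retrospective analysis behind Figure 1 can be sketched in a few lines, assuming a set of paired TSH and free thyroxine results is available. Everything in the sketch is an assumption made for illustration: the function name, the free thyroxine lower limit of 9.0 pmol/L and the eleven paired results are invented, and the calculation is cumulative over all samples beyond each candidate threshold, which may not match exactly how the published curves were derived.

```python
from typing import List, Optional, Tuple

# (TSH in mU/L, free thyroxine in pmol/L). Invented toy data, NOT study results.
TOY_PAIRS: List[Tuple[float, float]] = [
    (1.2, 14.0), (2.5, 13.0), (3.1, 15.0), (3.8, 8.5), (4.5, 12.0), (5.2, 11.0),
    (6.0, 8.0), (7.5, 7.0), (9.0, 12.5), (12.0, 6.0), (25.0, 4.0),
]

# Hypothetical lower reference limit for free thyroxine (assumption).
FT4_LOWER_LIMIT = 9.0


def sweep_hypothyroid_thresholds(
    pairs: List[Tuple[float, float]],
    thresholds: List[float],
    ft4_lower_limit: float,
) -> List[Tuple[float, int, int, Optional[float]]]:
    """For each candidate TSH threshold, return
    (threshold, tests added, diagnoses, NND).

    A reflex rule at threshold t is taken to add free thyroxine to every
    sample with TSH > t; a 'diagnosis' is a low free thyroxine result.
    """
    outcomes = []
    for t in thresholds:
        added = [ft4 for tsh, ft4 in pairs if tsh > t]
        diagnoses = sum(1 for ft4 in added if ft4 < ft4_lower_limit)
        # NND is undefined when no added test is positive.
        nnd = len(added) / diagnoses if diagnoses else None
        outcomes.append((t, len(added), diagnoses, nnd))
    return outcomes


results = sweep_hypothyroid_thresholds(TOY_PAIRS, [7.0, 4.0, 2.0], FT4_LOWER_LIMIT)
for threshold, added, diagnoses, nnd in results:
    print(threshold, added, diagnoses, nnd)
```

In this toy data set, moving the threshold inwards from 7.0 to 2.0 mU/L raises the number of diagnoses from 3 to 5 while the NND climbs from about 1.3 to 2.0, the same qualitative trade-off between effectiveness and efficiency that the text describes.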
So much for the impact of reflex thresholds on the efficiency and effectiveness of reflex testing itself. It is instructive to examine the same outcomes for reflective testing inside each TSH reflex threshold in our study. 9 Inside the hypothyroid TSH threshold, not a single free thyroxine measurement was added reflectively during the study period of one year. Why was this? Examination of efficiency and effectiveness of reflex addition of free thyroxine outside the threshold provides a clue. The NND of 19 indicates a very inefficient process, but most cases of biochemical hypothyroidism (high TSH and low free thyroxine) were picked up, i.e. it was relatively effective. In this context, it would be reasonable for authorizing biochemists to assume that, since most cases had already been picked up by reflex testing and since the likelihood of making a diagnosis was low (NND likely to be very high), there was little or no point in reflectively adding free thyroxine. Towards the hyperthyroid end of the TSH range, they made a different calculation, because reflective testing was observed, albeit sparingly. Reflective addition of free thyroxine inside (to the right of) the hyperthyroid reflex threshold indicates that biochemists must have thought that there was a realistic prospect of securing a diagnosis (acceptable effectiveness), and that the process would not be completely wasteful (acceptable efficiency). The figures for reflective testing during the study period largely bear these assumptions out: 28 free thyroxine measurements were added to samples with TSH concentrations >0.1 mU/L, of which six were high (NND 4.7). The finding that reflective testing for hyperthyroidism was less efficient than reflex testing (NND 2.9) was counterintuitive, given that reflective testing permits more complex information to be taken into account. The likely explanation, that reflective testing is confined to more ‘inefficient’ parts of Figure 1, reinforces the critical impact of reflex thresholds on efficiency and effectiveness of reflective testing as well as reflex testing.
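The arithmetic behind the reflective-testing NND quoted above can be checked directly. The snippet is a minimal sketch; the function name is hypothetical, and the only inputs are the 28 added measurements and six high results reported in the text.

```python
def number_needed_to_diagnose(tests_added: int, positive_results: int) -> float:
    """Tests added divided by the number of added tests that were positive."""
    if positive_results <= 0:
        raise ValueError("NND is undefined when no added test is positive")
    return tests_added / positive_results


# Reflective testing for hyperthyroidism: 28 free thyroxine tests added, 6 high.
reflective_nnd = number_needed_to_diagnose(28, 6)
print(round(reflective_nnd, 1))  # 4.7
```

The result, 4.7, is higher than the NND of 2.9 reported for reflex testing, which is the comparison the text describes as counterintuitive.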
Failure to acknowledge this key determinant of efficiency and effectiveness of reflective testing is an ‘elephant in the room’; unless reflex thresholds are known, it is impossible to know in which part of the curve tests are being added. Quantifying efficiency and effectiveness is largely meaningless without this information. An additional point we have previously made 9 is that reflective and reflex testing are complementary activities. In fact, there is no overlap at all between reflex and reflective testing. The latter is unnecessary when the addition of the test has already been triggered reflexly, and is confined by the reflex thresholds to the parts of the range where efficiency and incremental effectiveness are low (shaded parts of Figure 1a and b). This means that comparisons of reflex and reflective testing are never ‘like with like’, and that the more complex information potentially considered by reflective testing may not compensate for the fact that it occurs in the ‘inefficient’ part of the range.
How should clinical biochemists decide on reflex thresholds? Broadly, they can choose between ‘tight’ thresholds (like the hypothyroid TSH threshold above), or ‘loose’ thresholds (like the hyperthyroid threshold). ‘Tight’ thresholds pick up most or all of the diagnoses but at the expense of efficiency (because NND rises close to reference limits); reflective testing is largely unnecessary, since reflex rules ‘do all the work’. ‘Loose’ thresholds favour efficiency at the cost of effectiveness, relying on reflective testing by vigilant biochemists to pick up some of the missed diagnoses inside the threshold. The relative importance of efficiency and effectiveness will vary according to the diagnostic scenario. For the biochemical diagnosis of thyroid disease, efficiency may be prioritized over effectiveness, given the large volume of thyroid function tests and the fact that the diagnoses missed as the ‘trade-off’ for greater efficiency are unlikely to have life-threatening consequences (and that some will in any case be picked up by reflective testing). A different calculation might be made for the diagnosis of hypomagnesaemia triggered by hypokalaemia or hypocalcaemia; here, effectiveness is more important and efficiency less, given the greater likelihood of serious consequences of missed diagnoses and the comparatively small volume of triggering results.
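One way to make this trade-off explicit, offered only as an illustrative sketch and not as a rule proposed in the cited work, is to fix the highest NND a laboratory is prepared to accept for a given scenario and then choose the candidate threshold that yields the most diagnoses within that ceiling. The names `ThresholdOutcome`, `pick_threshold` and `max_nnd` are hypothetical, and the candidate outcomes are invented toy values, not study data.

```python
from typing import List, NamedTuple, Optional


class ThresholdOutcome(NamedTuple):
    threshold: float   # candidate reflex threshold for the triggering analyte
    diagnoses: int     # effectiveness: diagnoses made at this threshold
    nnd: float         # efficiency: tests added per diagnosis


def pick_threshold(
    outcomes: List[ThresholdOutcome], max_nnd: float
) -> Optional[ThresholdOutcome]:
    """Return the most effective candidate whose NND does not exceed max_nnd,
    or None if no candidate is efficient enough."""
    acceptable = [o for o in outcomes if o.nnd <= max_nnd]
    if not acceptable:
        return None
    return max(acceptable, key=lambda o: o.diagnoses)


# Invented toy values for three candidate thresholds (assumption).
candidates = [
    ThresholdOutcome(threshold=7.0, diagnoses=3, nnd=1.33),
    ThresholdOutcome(threshold=4.0, diagnoses=4, nnd=1.75),
    ThresholdOutcome(threshold=2.0, diagnoses=5, nnd=2.00),
]

efficiency_first = pick_threshold(candidates, max_nnd=1.8)     # 'looser' choice
effectiveness_first = pick_threshold(candidates, max_nnd=5.0)  # 'tighter' choice
print(efficiency_first.threshold, effectiveness_first.threshold)
```

A low ceiling on NND corresponds to a scenario such as thyroid disease, where efficiency may be prioritized; a high ceiling corresponds to a scenario such as hypomagnesaemia, where missed diagnoses matter more than the number of extra assays.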
Although this framework provides a logical basis for choosing reflex thresholds, calculation of efficiency and effectiveness requires the triggering and triggered analytes to be measured over the entire range of the triggering analyte (or at least across the reflex threshold at each end of the range). There are several possible approaches to this problem. First, it may be possible to access shared data-sets which contain these data (as we did in order to construct Figure 1). Second, it may, depending on the diagnostic scenario, be considered financially justifiable to measure triggering and triggered analyte ‘up front’ with a view to making subsequent savings from the choice of a more efficient, logical reflex threshold (it is important to remember that only the triggered analyte represents extra expenditure – the triggering analyte is measured across the range anyway). Third, for individual laboratories, a retrospective trawl of the laboratory information management system may yield relevant data from reflective testing, although this is likely to be comparatively sparse. Simply to make judgements based on experience alone (which may have been – probably was – the case for the historical departmental TSH thresholds shown in Figure 1) will probably lead to the choice of broadly sensible thresholds, but will not permit the ‘fine-tuning’ yielded by plotting efficiency and effectiveness across the range of triggering analyte as shown above. The very fact that one of these thresholds was ‘tight’ and the other relatively ‘loose’ bears this out.
Finally, the discussions above have centred exclusively around the diagnosis of thyroid disease. The thyroid axis is a classic negative feedback loop. Not all diagnostic scenarios are based around such loops, and we cannot be sure that the efficiency and effectiveness curves will be the same for each diagnostic scenario. If evidence-based laboratory practice 10 is to extend to the choice of reflex thresholds, analogous data must be generated for individual diagnostic scenarios.
DECLARATIONS
