Abstract
New facts have recently enhanced interest in the topic of reference intervals. In particular, the International Organization for Standardization standard 15189, requesting that ‘biological reference intervals shall be periodically reviewed’, and the directive of the European Union on in vitro diagnostic medical devices, asking manufacturers to provide detailed information on reference intervals, have renewed interest in the subject. This review presents an update on the topic, discussing the theoretical aspects and the most critical issues. The basic approach to the definition of reference intervals proposed in the original International Federation of Clinical Chemistry documents still remains valid. The use of data mining to obtain reference data from existing databases has severe limitations. New statistical approaches to discard outliers and to compute reference limits have been recommended. On the other hand, perspectives opened by improved standardization through the implementation of the concept of traceability suggest new models for defining ‘common’ reference intervals that can be transferred to and adopted by different clinical laboratories, in order to reduce the proliferation of different reference intervals not always justified by differences in population characteristics or analytical methodology.
Introduction
The concept of ‘reference intervals’ as known today was developed by Gräsbeck and Saris in the late sixties and presented at a congress of the Scandinavian Society in 1969. 1 Previously, reference intervals were usually named ‘normal values’ or ‘normal ranges’ without a clear definition of the term. Twenty years later, the International Federation of Clinical Chemistry (IFCC) expert panel published the first official IFCC document on the theory of reference values. 2
One might ask why we are still talking about reference intervals in the era of evidence-based medicine, characterized by the use of decision limits, more than 30 years after the publication by Galen and Gambino of the landmark book ‘Beyond normality’. 3 There are several reasons for reconsidering this ‘old-fashioned’ subject. The most important one is that, while the theory is clearly defined, there is a big gap in its everyday application. Moreover, three new facts are driving renewed interest in the topic: the publication of the International Organization for Standardization (ISO) standard 15189:2007 on requirements for quality and competence of clinical laboratories, 4 the implementation of the European Directive 98/79 on in vitro diagnostic (IVD) medical devices 5 and, indirectly related to the Directive, the creation of the Joint Committee for Traceability in Laboratory Medicine (JCTLM). 6
The first document 4 states in clause 5.5.5 that ‘biological reference intervals shall be periodically reviewed’ and, particularly, requires that they be verified every time a variation in analytical and/or preanalytical procedures takes place. It is difficult for laboratories to comply with this requirement, considering the enormous number of different types of tests and the very rapid evolution of analytical technology. The issue is not easy for manufacturers either: the IVD Directive requires, in Annex I ‘Essential requirements’, part B, 8.7, statement l, that the information supplied by the manufacturer include ‘the reference intervals for the quantities being determined, including a description of the appropriate reference population’. 5 The third and probably most relevant element is the general requirement of the IVD Directive that for each method ‘the traceability of values assigned to calibrators and/or control materials must be assured through available reference measurement procedures and/or available reference materials of a higher order’. 5 This element, by improving a method's comparability with other traceable methods, could substantially modify the situation by allowing an easier and more correct definition of reference intervals.
In this review, after an update of the theory that underlines its critical aspects, the new perspectives and developments in the field of reference intervals are presented.
The theory of reference values
Reference values evolved with laboratory tests. They have no meaning per se, but only when referred to a particular context, usually a physiological situation. Until the 1970s, before Gräsbeck's publications and the work of the IFCC expert panel, the term ‘normal values’ was usually used. The use of the term ‘normal’ has, however, been discouraged because it can assume different meanings: Murphy listed seven of them (statistical, i.e. a Gaussian distribution; most representative of a class; most commonly present in a class; most suited for survival; that which does no harm, in medicine; conventional; ideal). 7 Thus, the term may be subjective and ambiguous; moreover, it implies that everything outside the reference range is ‘abnormal’, which, given the way the interval is calculated, is not true. For these reasons, the IFCC document introduced the term ‘reference values’. 2 Usually these values are health-associated, but they can also reflect specific physiological conditions, like pregnancy, or refer to specific population groups, such as professional athletes. The basic concept is that the values represent a specific population and thus depend on the choice of the subjects from whom they were obtained. First of all, the criteria for the selection of reference individuals have to be clearly defined; those individuals represent the reference population, from which a reference sample group is selected, on which the reference values are measured. The obtained values will assume a certain distribution (the reference distribution) and, by analysing it with appropriate statistical methods, reference limits can be calculated. Conventionally, these limits are set to include 95% of the measured values. The reference interval is defined by the reference limits and includes them. This has a series of consequences:
The reference interval of a specific measurand depends upon the intra- and inter-individual biological variability of the sample group of the reference subjects; thus, the numbers of selected subjects and their partition into subgroups may gain relevant importance;
The preanalytical aspects need to be strictly controlled so as to be reproducible when collecting samples from subjects of the general population;
Analytical aspects are essential. The standardization of the measurements allows comparison of data obtained on different groups of subjects and eventually their application to different populations in places and times different from those where they were obtained;
The method by which reference limits are calculated may modify significantly the results obtained, e.g. if inappropriate statistical models are applied or if the exclusion of outliers is not correctly performed.
Selection of the reference subjects
This represents the starting point. Usually, we select ‘healthy’ individuals, but what exactly do we mean by health? How can we define and recognize it? Several commentaries have been written on this topic. 8–10 The World Health Organization's definition, ‘a state of complete physical, mental and social wellbeing and not merely absence of disease or infirmity’, 11 cannot be a realistic starting point. On the other hand, the concept of health can differ between cultures and countries. In 1975, the Scandinavian Committee on Reference Values tried to define a list of pathological conditions to be excluded in order to consider an individual ‘healthy’. 12 However, applying this recommendation proved impractical, 13 especially when aged subjects were involved. When Horn and Pesce 14 looked at the third National Health and Nutrition Examination Survey (NHANES III), they found that no more than 10% of the subjects aged 70–80 fell into the ‘healthiest’ category.
Consequently, a more pragmatic approach is needed and health should be judged subjectively as the absence of signs of disease specifically related to the measurand(s). 8 As clearly stated in the IFCC document, 2 the first step should be the definition of the scope for use of the reference intervals and, secondly, the definition of the method used to select the reference individuals. The main pathological situations to be excluded are reported in Section 3.3 of the IFCC document. 15 This can be done through an anamnestic questionnaire (an example of such a questionnaire is reported in document C28-A2 of the Clinical and Laboratory Standards Institute (CLSI) 16 ), a physical examination and further investigations (e.g. laboratory tests, imaging studies). The exclusion criteria will depend, of course, on the particular analyte evaluated. For example, to determine a reference interval for haemoglobin or related haematological analytes, it would be wise to exclude subjects with iron deficiency, marked vitamin B12 or folic acid deficiency, inflammation or chronic respiratory disease, tumours, genetic abnormalities of haemoglobin synthesis, etc., as all these factors influence haemoglobin synthesis.
When selecting individuals, it is necessary to take into account all variables that can affect the concentration of the analyte: gender, age, environment, lifestyle, ethnicity, etc. In the example of haemoglobin, in addition to gender and age, stratification according to altitude of residence and smoking habits is also important. All these biological aspects can be used as partitioning criteria. In general, a comprehensive knowledge of physiology (in addition to pathology) is required to determine which factors may be of importance. As already stated, the width of the reference interval is influenced by three sources of variability: the intra- and inter-individual biological variability of the selected reference individuals and the analytical variability of the measurement system. Analytical variability, with the exception of analytes showing very low biological variation, such as electrolytes, usually has minimal influence on reference intervals. The effects of intra- and inter-individual variability are inextricably bound together, and the relative sizes of these two sources of variability can substantially affect the utility of a reference interval as a tool for interpreting an individual result. Harris demonstrated in 1974 that only if the intra-individual coefficient of variation (CVI) is substantially larger than the inter-individual one (CVG) will the distribution of the results in a single individual span the entire range of the reference interval, so that the reference interval can be a useful tool to evaluate that individual's state of health. 17 Unfortunately, this is a relatively uncommon situation; usually the CVI is smaller than the CVG and the range of values of an individual spans only a limited part of the distribution of the values of the reference population. Consequently, in the majority of cases the reference interval is of limited utility in evaluating the results of a single subject, and its sensitivity in detecting abnormal results is low.
Harris demonstrated that only if the CVI/CVG ratio (index of individuality) is >1.4 will the reference interval be a sensitive and useful tool, whereas if the ratio is <0.6 the utility of the reference interval is low. 18 In the latter case, the only way to improve the utility of a reference interval is to increase the ratio and, since the CVI cannot be modified, the only possible intervention is to reduce the CVG by stratifying the individuals into more homogeneous subgroups. 18–20
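As a minimal sketch, Harris' criteria can be expressed in code. The function names and the CVI/CVG figures below are illustrative assumptions, not values taken from the review:

```python
def index_of_individuality(cvi: float, cvg: float) -> float:
    """Index of individuality: ratio of intra-individual (CVI) to
    inter-individual (CVG) biological variation."""
    return cvi / cvg

def reference_interval_utility(ii: float) -> str:
    """Harris' criteria: the reference interval is a sensitive tool when
    the index exceeds 1.4 and of little utility when it is below 0.6."""
    if ii > 1.4:
        return "useful"
    if ii < 0.6:
        return "of limited utility"
    return "intermediate"

# Hypothetical analyte with CVI = 12% and CVG = 5%
ii = index_of_individuality(12.0, 5.0)
print(ii, reference_interval_utility(ii))  # 2.4 useful
```

In the far more common case where CVI is much smaller than CVG (ratio < 0.6), the code returns "of limited utility", mirroring the argument for stratifying into more homogeneous subgroups.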
A priori vs. a posteriori selection
Two factors essentially drive the choice between deciding in advance which individuals to select and how to partition them (a priori criterion), and collecting a large number of subjects, analysing them and deciding thereafter which are to be kept in the reference population and how to partition them (a posteriori criterion): first, knowledge of the biology of the analyte and, secondly, the available resources.
If the biology of the analyte is well known, the a priori approach is the most convenient one. This approach is recommended in the IFCC document on selection of individuals. 16 The same document however does not explicitly exclude the possibility of utilizing the existing data, primarily not collected for the scope of defining reference intervals, quoting Martin et al. 21
Indirect reference values
The use of existing databases containing thousands or even millions of patients' records is an exciting opportunity, recently exploited by several authors. 22–30 With appropriate software, it is an inexpensive and relatively rapid procedure. This approach must not be confused with the a posteriori approach, unless the database also contains detailed clinical information. The first papers using this approach were published in the 1960s 31,32 and were based on the postulate that the majority of laboratory results are ‘normal’. Considering the frequency distribution of results, it should be possible to apply statistical procedures to eliminate the extremes of the distribution curve, thereby excluding the less frequent results typical of ‘unhealthy’ subjects. However, care has to be taken, as several relevant limitations weigh against this approach. First, it does not fulfil the fundamental principle of the theory of reference values, which is the careful definition of the characteristics of the reference population. 33 With this approach we know almost nothing about the subjects we are using. Usually, the applied statistical calculation is based on a presumed distribution of the results of the studied population, but in some cases the assumed distribution may not be correct; for instance, in the case of a skewed distribution the usual statistical models fail. 34 Furthermore, there is little or no control over preanalytical variables. Finally, it is very difficult to demonstrate the metrological traceability of the obtained results; consequently, the observed intervals might be applicable only in the laboratory that produced them and cannot be adopted for use elsewhere. 35 In conclusion, even if information technology provides us with a powerful means of calculation, the data mining approach cannot be endorsed as the best way of defining reference intervals.
Only if the original data are obtained with carefully controlled methodology, the laboratory is able to provide traceable results and reliable clinical data are available can this approach be adopted. In the majority of cases, it can only represent a means to confirm and validate the findings obtained with the more scientifically sound a priori selection.
Preanalytical aspects
The samples for reference interval studies must be collected under conditions representative of those used in clinical practice. 36 Unfortunately, in clinical practice the preanalytical phase is usually poorly standardized. For this reason, when performing a reference interval study it is essential to accurately define and describe the preanalytical conditions to allow others to reproduce the same situation and to understand the effects of certain factors (e.g. the collection device or the posture of the individual). Table 1 shows the most important preanalytical conditions to be taken into account when a blood analyte is evaluated. For body fluids other than blood it may be necessary to include different factors, while, for specific analytes, more information may be needed (e.g. emotional stress level for certain hormones).
Main preanalytical factors to be considered in the production of reference values
Analytical aspects
This aspect is neglected in many publications on reference intervals. The IFCC document dealing specifically with this topic gave a series of recommendations for documenting the operating procedures, focusing on how internal quality control should be practiced during the production and application of reference values. 37 These recommendations are, however, useful only if the defined reference intervals are to be applied within the laboratory that defined them, because they allow the baseline conditions to be properly fixed and hence make it possible to understand whether the modification of certain analytical aspects may change the reference intervals. On the contrary, the recommended approach is not very effective in providing procedures that define reference intervals ‘transferable’ to different laboratories.
In 1991, the concepts of reference measurement systems and of the implementation of metrological traceability (defined as a ‘property of a measurement result relating the result to a stated metrological reference through an unbroken chain of calibrations of a measuring system or comparisons, each contributing to the stated measurement uncertainty’ 38 ) were still in an embryonic state. Although reference measurement procedures were already available for some common analytes and the concept of method hierarchy had been introduced more than 10 years earlier, 39 systematic application of these concepts was lacking. Only some years later was the concept of reference measurement systems formalized, based on the implementation of reference measurement procedures, the preparation of reference materials and the identification of reference measurement laboratories. 40 The reference measurement system represents a trueness-based approach in which different commercial methods providing results traceable to the system are able to produce comparable results in the clinical laboratories using these assays. ISO has produced two standards on this concept: ISO 17511:2003 41 and ISO 18153:2003. 42
Only reference intervals obtained with analytical procedures producing results traceable to the same reference measurement system should be transferred between laboratories (see ‘common reference intervals’ section).
Calculation of reference limits
This issue has been of most concern to authors dealing with reference values. There are three main problems: (i) the statistical methodology that provides the most effective way to extrapolate the results obtained on a sample population to the whole population itself; (ii) the partitioning of results among different groups (age, gender, etc.); (iii) the detection and discarding of outliers. Here, we give a brief outline of these aspects, referring the readers to more details from two excellent textbooks: Statistical Bases of Reference Values in Laboratory Medicine by Harris and Boyd 43 and Reference Intervals. A User's Guide by Horn and Pesce. 44
Statistical methods
Harris and Boyd 43 refer to the approach by Wootton et al., 45 who in 1951 applied parametric statistics for the first time to the calculation of reference intervals. However, these authors soon realized that this statistical model was applicable only in a minority of situations and, two years later, they proposed logarithmic transformation of data to achieve a Gaussian-like distribution. 46 However, the incorrect practice of defining the reference interval as the mean ± 2SDs, without any preliminary verification of the shape of the distribution of the data, has unfortunately continued for many years and is still sometimes used today.
Several publications on the use of fractiles for the definition of reference intervals appeared in the 1970s and 1980s, 47–50 but the milestone is represented by an IFCC publication in 1987. 51 This document clearly defined a number of elements that, even more than 20 years later, remain valid. First, it promoted the (arbitrary) choice of using the central 95% of the distribution for the reference interval calculation. If we exclude a few dissenting opinions, such as that of Jørgensen et al., 52 who proposed widening the interval to include 99.8% of the observed data to reduce false-positives in cases where a large battery of tests is requested, this approach is still widely accepted today. Secondly, the IFCC document recommended that reference limits should always be presented together with their 90% CIs. The width of the CIs decreases as the number of evaluated subjects increases and represents a reliable indicator of the uncertainty of the reference limits. The document also recommended the use of a non-parametric statistical method to calculate the reference limits. Even if parametric methods are theoretically more reliable, particularly if the sample of subjects is small, the uncertainty about the real ‘Gaussianity’ of the original distribution (or after its transformation) increases the uncertainty of the final estimate. A proposal for a simple and effective way to calculate reference limits was also included. This method is based on the calculation of the 0.025 and 0.975 fractiles. The α fractile cannot be calculated unless α is at least 1/N, where N is the sample size; thus, the determination of the 0.025 and 0.975 fractiles requires at least 40 values (α = 1/40 = 0.025). However, with only 40 subjects, the minimum and maximum values represent the lower and the upper limit of the reference interval and it is therefore impossible to estimate their CIs.
To calculate the uncertainty around the limits at least 120 subjects are needed: in this case, when the data are arranged in increasing order, the 2.5th centile is the third value in the series and the 97.5th centile is the 118th, while the 90% CI for the lower limit spans from the first to the seventh value and that for the upper limit spans from the 114th to the 120th value. The IFCC recommendation to use a minimum of 120 individuals per class derives from these considerations. Theoretically, this approach should be easy to apply, but the relatively high number of individuals to be enrolled may create practical difficulties, e.g. for paediatric populations, for expensive tests or for analytes with age and gender dependence that require partitioning among different classes. The number of subjects can be reduced by using parametric statistics, but this requires the data to have a Gaussian distribution. As an alternative, Horn et al. 53,54 proposed a ‘robust method’ based on the transformation of the original data according to Box and Cox, 55 followed by a relatively complex algorithm, based on robust indicators, able to provide correct answers even in less-than-ideal situations. This ‘robust’ algorithm gives different weights to the data, depending upon their distance from the mean. This approach should allow estimation of correct reference limits with samples of only 20 subjects. To calculate the 90% CIs around the limits, it is possible to use the so-called ‘bootstrap’ methodology. With this methodology, observations are ‘resampled’, with replacement, from the data, creating a ‘pseudosample’. From each pseudosample the reference interval is derived. This process is repeated a large number of times (1000–2000), yielding a distribution of upper and lower reference limits. From this distribution, the 5th and the 95th quantiles may be used to determine the 90% CI for each limit.
A critical drawback of this approach is that the 90% CIs can be very wide if the sample size is small (at least 80 individuals are needed to obtain acceptably small 90% CIs). Regression analysis has been proposed as an alternative technique to deal with small sample sizes and has been applied to the determination of age-dependent reference intervals. 56–58
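The non-parametric procedure and the bootstrap CIs described above can be sketched as follows. This is a simplified illustration, not the IFCC protocol itself; the function names and the rank convention round(α·(N+1)) are assumptions, chosen so that with N = 120 the limits fall on the 3rd and 118th ordered values, as in the text:

```python
import random

def nonparametric_limits(values):
    """Non-parametric 95% reference limits (0.025 and 0.975 fractiles),
    using the rank formula round(alpha * (N + 1)) on the sorted data."""
    n = len(values)
    s = sorted(values)
    lower = s[round(0.025 * (n + 1)) - 1]
    upper = s[round(0.975 * (n + 1)) - 1]
    return lower, upper

def bootstrap_ci(values, n_boot=1000, seed=0):
    """90% bootstrap CIs around each reference limit: resample with
    replacement, recompute the limits each time, then take the 5th and
    95th percentiles of the resulting distributions."""
    rng = random.Random(seed)
    lows, highs = [], []
    for _ in range(n_boot):
        pseudo = [rng.choice(values) for _ in values]  # one pseudosample
        lo, hi = nonparametric_limits(pseudo)
        lows.append(lo)
        highs.append(hi)
    lows.sort()
    highs.sort()
    i_lo, i_hi = int(0.05 * n_boot), int(0.95 * n_boot) - 1
    return (lows[i_lo], lows[i_hi]), (highs[i_lo], highs[i_hi])

# With synthetic ranked data 1..120, the limits are the 3rd and 118th values
print(nonparametric_limits(list(range(1, 121))))  # (3, 118)
```

Running the bootstrap on such small samples makes the drawback noted above visible directly: with fewer subjects, the spread of the resampled limits (and hence the 90% CIs) widens markedly.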
Partitioning criteria
As discussed in the section on the selection of reference subjects, the decision of whether or not to separate different groups is extremely important and several statistical methodologies have been proposed to achieve this. An intuitive approach is based on the calculation of the statistical significance of the difference between the mean values of two subclasses. This approach can lead to the identification of different subclasses, even for very small differences that are indeed statistically significant although clinically irrelevant, especially when the number of subjects per class is high. Sinton et al. 59 suggested two classes should be separated only if the difference between the respective means is greater than 1/4 of the interval calculated from 95% of the individuals of the combined distribution. This criterion was originally proposed for specific analytes (i.e. calcium, inorganic phosphate and alkaline phosphatase), but did not allow adequate separation when applied to other analytes. 60
The most popular partitioning method was proposed by Harris and Boyd 43,61 and subsequently endorsed by the CLSI document C28-A2. 16 In their studies, the authors first considered the idea that partitioning should lead to reduced inter-individual variability in the subgroups compared with that of the entire data group. However, they found that a worthwhile reduction in inter-individual variation was hard to achieve, even with large differences between subgroup means. Therefore, they abandoned the goal of inter-individual variability reduction as the basis for establishing partitioning criteria and focused on the proportions of the subgroups outside the reference limits of the entire population and suggested that ‘the problem of whether or not to compute separate pairs of 95% reference limits for subgroups of the population may be reduced to the question: Does a single pair of limits, derived from a combined sample of subpopulations, come close enough to satisfying this criterion of 2.5% below and 2.5% above for each subpopulation?’. 61 The problem is in deciding at which level to set the acceptability limit. Harris and Boyd proposed that if the percentage is higher than 4% or lower than 1%, it is necessary to define different reference limits. This criterion appears valid because it considers not only the means but also the standard deviations of the subgroups, as a different standard deviation by itself may produce different reference limits. Their proposed test consists of two steps: first, evaluation of the difference between the means and secondly, comparison of their standard deviations. Further details and practical examples may be found in the CLSI C28-A2 document. 16 However, this approach works well only with Gaussian distributions and with subclasses of similar size and standard deviation. 60 To overcome these limitations Lahti et al. 
62–64 have proposed a method based on similar concepts, but specifically allowing the estimation of the percentage of subjects in a subclass outside the reference intervals of the entire population in any situation. Following the criteria based on biological variability presented by Gowans et al., 65 they proposed creating a subclass when more than 4.1% or less than 0.9% of the subjects of the subgroup fall outside the limits of the entire group. If the percentage is between 1.8% and 3.2%, they suggest combining the groups. For intermediate (marginal) situations, the choice of whether or not to combine should be based on clinical criteria.
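The decision rule can be sketched in code. This is an illustrative simplification using the thresholds quoted above; the original papers by Lahti et al. contain refinements (e.g. checking each reference limit separately) that are glossed over here, and the function names are mine:

```python
def fraction_outside(subgroup, lower, upper):
    """Fraction of a subgroup's values falling outside the reference
    limits computed on the combined (entire) population."""
    return sum(1 for v in subgroup if v < lower or v > upper) / len(subgroup)

def partition_decision(subgroup, lower, upper):
    """Lahti-style rule: partition if more than 4.1% or fewer than 0.9%
    of the subgroup fall outside the combined limits; combine if the
    fraction is between 1.8% and 3.2%; otherwise the case is marginal
    and clinical criteria should decide."""
    p = fraction_outside(subgroup, lower, upper)
    if p > 0.041 or p < 0.009:
        return "partition"
    if 0.018 < p < 0.032:
        return "combine"
    return "marginal"
```

Note that a subgroup with far *fewer* than the expected 2.5% outside also triggers partitioning: the combined limits are then too wide for that subgroup, which harms sensitivity just as too-narrow limits harm specificity.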
Detection of outliers
Whatever the method used for the calculation of the reference interval, the presence of outliers can significantly modify the limits, even if complex mathematical and statistical methods are applied. 53,54 Thus, correct detection and exclusion of outliers is important. A simple but effective method to detect outliers is visual inspection of the distribution of the data. If an outlier is detected and if there are no obvious reasons to discard it (such as conditions of the subject, analytical problems, calculation or transcription errors), it is useful to apply statistical methods to justify its exclusion.
The most popular statistical method is the one proposed by Dixon, 66 which is based on the D/R ratio, where D is the absolute value of the difference between the outlier and the next or preceding value and R represents the entire range of the observations (maximum–minimum), outlier included. Following Reed et al., 47 the CLSI C28-A2 document proposes one-third as the limit for this ratio. The test is, however, not very sensitive; in particular, when there is more than one outlier, the presence of a less extreme outlier may mask the other(s). In this case, the suggestion is to first evaluate the less extreme suspected outlier and, if the test identifies it as a true outlier, also discard the more extreme one. Horn et al. 67 have proposed a more sophisticated two-step algorithm in which the data are first transformed using the Box and Cox method, 55 to obtain a Gaussian distribution, and outliers are then identified using the Tukey robust approach. 68 This method identifies the extremes using the central 50% of the distribution, thus eliminating the confounding effects of multiple outliers. It involves the computation of the lower and upper quartiles (25th and 75th percentiles) of the transformed data (Q1 and Q3), from which the interquartile range (IQR) (Q3–Q1) is calculated. Finally, the lower and upper ‘fences’ are computed: the lower fence as Q1 – 1.5 × IQR and the upper fence as Q3 + 1.5 × IQR. Any data point outside the fences is considered an outlier and discarded.
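Both outlier checks can be sketched as follows. The Box–Cox transformation step of Horn's algorithm is omitted here (the fences are applied to the raw data), and the linear-interpolation quartile convention shown is one of several in common use:

```python
def dixon_check(values):
    """Dixon D/R test: flag an extreme value when D/R > 1/3, where D is
    the gap between the suspect value and its nearest neighbour and R is
    the full range of the observations (suspect included)."""
    s = sorted(values)
    r = s[-1] - s[0]
    return {
        "max_is_outlier": (s[-1] - s[-2]) / r > 1 / 3,
        "min_is_outlier": (s[1] - s[0]) / r > 1 / 3,
    }

def tukey_fences(values, k=1.5):
    """Tukey fences: points below Q1 - k*IQR or above Q3 + k*IQR are
    treated as outliers, using only the central 50% of the data."""
    s = sorted(values)
    n = len(s)

    def quantile(q):
        # linear interpolation between order statistics
        pos = q * (n - 1)
        i = int(pos)
        frac = pos - i
        return s[i] + frac * (s[min(i + 1, n - 1)] - s[i])

    q1, q3 = quantile(0.25), quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]

data = [10, 11, 12, 13, 100]
print(dixon_check(data))   # {'max_is_outlier': True, 'min_is_outlier': False}
print(tukey_fences(data))  # [100]
```

Because the fences depend only on the quartiles, adding a second extreme value to `data` would not mask the first, which is exactly the robustness advantage over the Dixon ratio described above.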
Current situation and future developments
In 1960, Schneider wrote that ‘…practical medicine is basically founded on comparison. If medicine is to be scientific, we must not only understand the structural, functional and chemical relations operating in individuals, but we must also understand the bases of our comparisons’. 69 From all the issues discussed above, we could conclude that the target that Schneider drew in his publication has been reached. Unfortunately, while the theory is well-defined, its practical application in the everyday life of most clinical laboratories is far from optimal. Laboratories often use different reference intervals without any valid reason, such as variations in analytical methodology or population served. A typical example of the situation can be derived from a survey on the reference intervals used in Italy in 2005 for alanine aminotransferase. 70 In 93 laboratories, all claiming to use the IFCC procedure with pyridoxal phosphate addition (even though on different analytical platforms), the upper reference limit for adult males spanned from 40 U/L to 72 U/L, while the lower limit ranged from 0 U/L to 30 U/L. This was partly related to differences in reference intervals suggested by the manufacturers in their package inserts, although not every laboratory adopted manufacturers' values. This common situation is dangerous and misleading both for clinicians and patients (the same analytical result can be considered ‘normal’ in one laboratory and ‘abnormal’ in another, according to the reference interval in use). Moreover, it hampers the creation of common databases, i.e. the combination of data from different laboratories.
The reasons for these differences are multifactorial; they include the adoption of literature data or manufacturers' values without any critical appraisal and changes in analytical methodology that are not accompanied by corresponding changes in reference intervals. Establishing reference intervals based on the laboratory's own served population is a very costly and demanding process, requiring recruitment of appropriate reference individuals. Laboratories have easy access to pathological samples but rarely to samples from apparently healthy subjects. For certain types of samples, e.g. from paediatric subjects, access to healthy individuals is particularly difficult, since ethical issues may prevent phlebotomy simply for establishing reference intervals. Furthermore, it is very time-consuming to establish reference intervals for all analytes and to repeat the work for any change in methods or analytical systems.
A possible alternative to overcome this situation is the development and implementation of ‘common’ reference intervals.
Common reference intervals
The basis for adopting common reference intervals is simple: if analytical methods are the same or yield comparable results because they are correctly standardized, and the population has the same characteristics or, alternatively, it is known that the specific analyte is not influenced by ethnicity or environment, then the same, i.e. common, reference intervals can be used. Unfortunately, the practical application of this simple concept is not as easy as it would appear. A number of prerequisites, summarized in Table 2, need to be in place before it can be adopted.
Necessary pre-requisites for production and use of common reference intervals
IFCC, International Federation of Clinical Chemistry; JCTLM, Joint Committee on Traceability in Laboratory Medicine; EQAS, External Quality Assessment Scheme
Establishing common reference intervals
Assuming that, for a given analyte, a reference measurement system exists, the most demanding task in producing common reference intervals which can be adopted by any clinical laboratory operating under similar preanalytical and analytical conditions is the definition of an adequate set of reference values. This should include subjects from different ethnic groups and from various environments in order to document whether clinically significant differences exist, which would prevent the use of common reference intervals. The best way to obtain this information is to conduct a multicentre study involving clinical laboratories in different regions or countries. This approach has been pursued in Spain 71–75 and further developed in the Nordic countries. 76–80 In particular, it requires:
An a priori selection of reference individuals according to well-defined criteria, as specified before. The number of participating centres and enrolled individuals should be determined according to the number of subjects required for partitioning by age, gender, race, lifestyle, etc. To obtain sufficiently narrow CIs for the reference limits, the optimal number of individuals within each group should be around 500, 78 while 120 is the minimal sample size allowing non-parametric calculation of the confidence limits. Partitioning should follow the criteria of Lahti et al.; 62–64
A clear definition of the preanalytical phase. Ideally, to reproduce sample handling within clinical laboratories, the analyses should be performed on fresh samples. However, to reduce analytical variability it is usual to freeze the samples and analyse them in a single batch or a minimal number of batches. This approach is only acceptable if it has been demonstrated that freezing does not affect the analyte. If sample stability is confirmed, storage of additional aliquots for further use is highly recommended;
The use of methods providing results traceable to the reference measurement system and with high interlaboratory comparability. Traceability to the reference measurement system must be verified through the use of two or more commutable materials (e.g. frozen pools) with values assigned by the reference measurement procedure, ideally by a number of reference laboratories. Interlaboratory comparability must be checked using common quality control materials. An internal quality control programme must be implemented in each participating laboratory, with clearly defined a priori criteria for acceptance or rejection of each analytical run;
Proper data analysis for the calculation of reference limits. The data from the different centres must be compared to identify the presence of any analytical bias (from the quality control data) or atypical distribution of the reference values. Finally, before calculating reference limits, possible outliers must be detected and eliminated.
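The data-analysis step above can be sketched in code. The following is a minimal illustration, not the procedure used in any of the cited studies: it assumes Tukey's fences (one simple, widely used criterion) for outlier removal and the rank-based non-parametric estimate of the 2.5th and 97.5th percentiles, with the minimum of 120 reference values mentioned above. All numbers are simulated.

```python
import numpy as np

def tukey_outlier_filter(values, k=1.5):
    """Drop values outside Tukey's fences: [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    return values[(values >= lo) & (values <= hi)]

def nonparametric_reference_interval(values):
    """Rank-based 2.5th and 97.5th percentiles (requires n >= 120)."""
    values = np.sort(np.asarray(values, dtype=float))
    if values.size < 120:
        raise ValueError("at least 120 reference values are needed")
    return np.percentile(values, 2.5), np.percentile(values, 97.5)

rng = np.random.default_rng(42)
data = rng.normal(100, 10, 500)     # simulated reference values, n ~ 500
clean = tukey_outlier_filter(data)  # discard possible outliers first
lower, upper = nonparametric_reference_interval(clean)
print(f"reference interval: {lower:.1f} - {upper:.1f}")
```

In a real multicentre study the between-centre comparison (checking for analytical bias and atypical distributions) would precede this pooled calculation.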
Adopting common reference intervals
In order to be able to apply common reference intervals, a clinical laboratory has to verify the similarity of the preanalytical conditions to those adopted in the production of the intervals, the performance of the analytical system employed, and the characteristics of the population served.
Preanalytical conditions
The reference intervals can only be used if the same preanalytical conditions are applied (e.g. specimen type, fasting subjects, etc.), or if it is possible to demonstrate that any introduced modification has no significant effect, e.g. demonstrated equivalence between results obtained with heparin plasma and serum samples, analyte concentrations are not modified by meals, etc.
Analytical aspects
The method in use must produce results traceable to the reference measurement system for that specific analyte. For European countries, if the analytical system is ‘CE-marked’ it should be used according to the manufacturer's specifications. However, even though the European Directive on IVD medical devices 5 stipulates traceability as an essential requisite, a number of routine analytical systems may still be significantly biased when compared with the internationally accepted reference systems, as was recently demonstrated for the measurement of some enzymes. 81
The analytical quality of the method in use should be controlled in order to keep its total error within stated limits. Targets for allowable total error can be derived from the criteria related to biological variability. 82 A list of estimated within-subject and between-subject biological variations and analytical quality specifications can be found at Westgard's website. 83 The magnitude of the total error can be checked through participation in External Quality Assessment Schemes (EQAS), provided that the control samples are commutable and their target values are assigned by laboratories using reference methods.
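The derivation of allowable total error from biological variability can be illustrated with the commonly used desirable performance formulas (imprecision ≤ 0.5 CVw; bias ≤ 0.25 √(CVw² + CVg²); total error at the 95% level). The ALT variation figures below are approximate values of the kind tabulated in published biological-variation lists, used here only as an example.

```python
import math

def desirable_specs(cv_within, cv_between):
    """Desirable analytical performance specifications derived from
    biological variation (CVw = within-subject CV, CVg = between-subject
    CV, all expressed in %)."""
    cv_a = 0.5 * cv_within                                 # imprecision
    bias = 0.25 * math.sqrt(cv_within**2 + cv_between**2)  # bias
    tea = 1.65 * cv_a + bias                               # total error, 95%
    return cv_a, bias, tea

# Illustrative figures for alanine aminotransferase: CVw ~ 18%, CVg ~ 42%
cv_a, bias, tea = desirable_specs(18.0, 42.0)
print(f"CVa <= {cv_a:.1f}%  bias <= {bias:.1f}%  TEa <= {tea:.1f}%")
```

A laboratory's observed total error, e.g. from EQAS results on commutable samples, can then be compared against the computed TEa target.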
Characteristics of the population served by the laboratory
In a recent publication by Ichihara et al. 84 large between-city differences were demonstrated for several analytes in six Asian cities. If these results are confirmed and the observed differences considered large enough to merit separate reference intervals, the possibility of adopting common reference intervals is reduced to the analytes demonstrating no or low interpopulation variability. 85
In general, if race or life-style are known not to influence reference intervals, it is sufficient to verify the preanalytical and analytical aspects. If ethnicity or lifestyle are known to influence reference intervals or if no information is available, it is advisable that the clinical laboratory validates them on a small sample group derived from its own population, before their adoption. This validation can be done according to the CLSI document C28-A2, paragraph 8.2. 16 The advice is to examine 20 individuals representing the local apparently healthy population and satisfying the selection criteria. After discarding outliers, if no more than two of the 20 tested values fall outside the common interval, it can be adopted. If three or more values fall outside the common reference limits, the experiment should be repeated with another 20 subjects. If no more than two of the 20 repeat values fall outside the common interval, adopt the interval; if three or more values again fall outside, it probably means that the populations differ and a specific reference interval is needed, provided that all the preanalytical and analytical aspects are controlled. This type of binomial test works well if the reference values have a Gaussian-like distribution, but it is very insensitive if the distribution is skewed. In the latter case, more powerful statistical tests should be carried out, e.g. the Kolmogorov-Smirnov test which compares the full dataset from tested reference individuals with the 20 reference specimens for a given laboratory. A further approach may be the calculation of reference intervals from the values of 20 subjects using a robust statistical algorithm, like that proposed by Horn et al. 53,54 to check whether the obtained experimental limits are within the confidence limits of the common reference limits. 
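The 20-subject transference check described above is simple enough to express directly. The sketch below implements the decision rule as stated (adopt if no more than two of 20 values fall outside; otherwise repeat with another 20); the local values and the 10–40 U/L candidate interval are hypothetical.

```python
def validate_common_interval(local_values, lower, upper):
    """CLSI C28-A2-style transference check: test 20 local, apparently
    healthy subjects against a candidate common reference interval.
    Returns 'adopt' if no more than 2 of the 20 fall outside, otherwise
    'repeat' (a second group of 20 should then be examined)."""
    if len(local_values) != 20:
        raise ValueError("the check is defined for exactly 20 subjects")
    outside = sum(1 for v in local_values if v < lower or v > upper)
    return "adopt" if outside <= 2 else "repeat"

# Hypothetical local results for an analyte with a candidate common
# reference interval of 10-40 U/L (one value, 41, falls outside).
local = [12, 15, 18, 22, 25, 27, 30, 31, 33, 35,
         36, 37, 38, 39, 14, 16, 19, 21, 24, 41]
print(validate_common_interval(local, 10, 40))
```

As noted in the text, this binomial rule presupposes prior outlier removal and loses power for skewed distributions, where a Kolmogorov-Smirnov comparison against the full reference dataset is preferable.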
Finally, an alternative approach could be the application of one of the previously described statistical methods for data mining to the laboratory's stored data. 24,31,32 The comparison of the reference limits obtained, with those of the proposed common reference interval can allow judgement of their applicability.
The approach described for adopting common reference intervals is not straightforward. Development of reference measurement systems and compliance by manufacturers with calibration traceability can be a slow process. Establishing robust reference intervals is time-consuming and expensive. Clinical laboratories are usually disinclined to modify reference intervals as this is a demanding task, which also requires education of clinicians and patients. Large multicentre studies are needed for the correct definition of common reference intervals, at least for certain analytes, in order to make real progress in this field and bridge the large gap existing between sound theory and poor practice.
Adoption of validated reference intervals
If all the previously defined organizational, preanalytical and analytical requirements are fulfilled not only can reference intervals obtained experimentally in multicentre studies be adopted as common reference intervals, but also reference intervals defined by a single laboratory. The IFCC Committee on Reference Intervals and Decision Limits (C-RIDL) has recently published a paper on the validation of already published reference intervals for creatinine in serum. 58 Obviously, when adopting a validated reference interval developed in a single centre, the preliminary verification of the interval on the local population acquires greater importance.
Reference intervals vs. decision limits
The main characteristics related to these two concepts are reported in Table 3. Some confusion can arise from the use of previously defined reference intervals as decision limits in specific circumstances, e.g. in the screening of blood donors, all subjects above the upper reference limit for alanine aminotransferase could be excluded from donation. However, while reference intervals describe the biological characteristics of a well-defined (usually apparently healthy) population, decision limits depend upon the diagnostic question and are obtained from specific clinical studies in order to define the probability of the presence of a certain disease or of a different outcome. The decision limits selected are usually based on the degree of overlap of the two populations (diseased/non-diseased) and on the desired levels of clinical sensitivity and specificity. The purpose of these limits, as the word itself suggests, is to lead to decisions: individuals with values above or below the decision limit should be treated differently.
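The trade-off between sensitivity and specificity can be made concrete with a small sketch. One common (though not the only) way to pick a decision limit from two overlapping populations is to maximize Youden's J = sensitivity + specificity − 1 along the ROC curve; the populations below are simulated, not drawn from any study cited here.

```python
import numpy as np

def youden_cutoff(diseased, healthy):
    """Pick the decision limit maximizing Youden's J = sensitivity +
    specificity - 1, scanning candidate cut-offs over the pooled values
    (assumes higher values indicate disease)."""
    pooled = np.unique(np.concatenate([diseased, healthy]))
    best_cut, best_j = None, -1.0
    for cut in pooled:
        sens = np.mean(diseased >= cut)   # true-positive rate at this cut
        spec = np.mean(healthy < cut)     # true-negative rate at this cut
        j = sens + spec - 1
        if j > best_j:
            best_cut, best_j = cut, j
    return best_cut, best_j

rng = np.random.default_rng(0)
healthy = rng.normal(30, 8, 400)    # simulated non-diseased values
diseased = rng.normal(60, 12, 400)  # simulated diseased values
cut, j = youden_cutoff(diseased, healthy)
print(f"decision limit ~ {cut:.1f}, Youden J = {j:.2f}")
```

In practice the cut-off is often shifted away from the Youden optimum when the clinical cost of a false negative differs from that of a false positive.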
Differences between reference intervals and decision limits
ROC, receiver-operating characteristic
Individual reference intervals
Although Harris 86 defined the theoretical basis and the statistical methods for individual reference intervals more than 30 years ago, their implementation represents a challenge for the future. Today, the development of information technology allows us to archive a huge amount of data and to retrieve and process them rapidly. On the other hand, implementation of traceability concepts and the consequent improvement of assay standardization will increase result stability and comparability over time and location for many analytes. These premises will permit transformation of theory into practice. The experimental model is quite simple and requires the collection of several samples from the same individual during a period of stable health. The results of measurements on these samples for a given analyte will produce a temporal series, forming a baseline against which future results will be judged. A fundamental issue is the number of samples needed to define the baseline value with acceptable approximation. This depends upon the biological variability of the analyte, its analytical reproducibility and the applied mathematical models. 87 In clinical practice, this approach is already used in doping control programmes, in which the baseline values of haematological parameters for athletes are recorded and individual reference intervals calculated. This allows detection of the use of illegal substances causing significant changes in the individual's haematological analytes. 88,89
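The baseline idea described above can be sketched numerically. The code below is a deliberately simplified illustration of an individual reference interval: the homeostatic set point estimated as the mean of serial results, with limits at ±z standard deviations of the series (the observed SD combines within-subject biological and analytical variation). Harris's full treatment 86,87 addresses the required number of samples and more refined models; the haemoglobin series is hypothetical.

```python
import numpy as np

def individual_reference_interval(baseline_results, z=1.96):
    """Estimate a subject-specific interval from serial results collected
    during a period of stable health: mean +/- z * SD of the series."""
    x = np.asarray(baseline_results, dtype=float)
    if x.size < 5:
        raise ValueError("too few baseline samples for a stable estimate")
    mean, sd = x.mean(), x.std(ddof=1)
    return mean - z * sd, mean + z * sd

# Hypothetical haemoglobin series (g/L) from one athlete in stable health
baseline = [148, 151, 146, 150, 149, 147, 152]
lo, hi = individual_reference_interval(baseline)
print(f"individual interval: {lo:.1f} - {hi:.1f} g/L")
```

A future result falling outside this personal interval flags a change for that individual, even when it remains well inside the population-based reference interval.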
Conclusions
Even though a large number of publications on the topic of reference intervals already exist, it is clear that much work is needed to reach an optimal situation. In general, modifying existing reference intervals is always a delicate task, requiring a commitment to inform clinicians and their patients. The definition of common reference intervals will hopefully reduce significantly the number of different reference intervals employed for the same analyte, providing the clinician with more congruent and effective information. Laboratorians need to increase their efforts in these areas, trying to overcome the undoubted practical difficulties and the inactivity that has sometimes characterized the past. Otherwise, improvements in the theory will not be translated into clinical practice and patients will not obtain the expected benefits.
Footnotes
Acknowledgements
F. Ceriotti has been the chair of the IFCC Scientific Division Committee on Reference Intervals and Decision Limits (C-RIDL) since its creation in 2005. R. Hinzmann was the liaison between the Scientific Division Executive Committee and C-RIDL between 2005 and 2007.
