Abstract
Background
Gestational TSH and FT4 reference intervals may differ according to assay method, but the extent of variation is unclear and has not been systematically evaluated. We conducted a systematic review of published studies on TSH and FT4 reference intervals in pregnancy. Our aim was to quantify method-related differences in gestation reference intervals across four commonly used assay methods: Abbott, Beckman, Roche and Siemens.
Methods
We searched the literature for relevant studies, published between January 2000 and December 2020, in healthy pregnant women without thyroid antibodies or disease. For each study, we extracted trimester-specific reference intervals (2.5–97.5 percentiles) for TSH and FT4 as well as the manufacturer-provided reference interval for the corresponding non-pregnant population.
Results
TSH reference intervals showed a wide range of study-to-study differences with upper limits ranging from 2.33 to 8.30 mU/L. FT4 lower limits ranged from 4.40 to 13.93 pmol/L, with consistently lower reference intervals observed with the Beckman method. Differences between non-pregnant and first trimester reference intervals were highly variable, and for most studies, the TSH upper limit in the first trimester could not be predicted or extrapolated from non-pregnant values.
Conclusions
Our study confirms significant intra- and intermethod disparities in gestational thyroid hormone reference intervals. The relationship between pregnant and non-pregnant values is inconsistent and does not support the existing practice in many laboratories of extrapolating gestation references from non-pregnant values. Laboratories should invest in deriving method-specific gestation reference intervals for their population.
Introduction
Thyroid dysfunction is common in females of reproductive age and occurs in 2–5% of pregnant women.1,2 Uncorrected thyroid dysfunction in pregnancy has deleterious effects on fetal and maternal health including an increased risk of pregnancy loss and offspring intellectual impairment.3,4 Prompt detection and correction of thyroid dysfunction is therefore essential for optimal fetal and maternal outcomes.5–7 However, the laboratory diagnosis of thyroid dysfunction in pregnancy is confounded by a series of adaptive physiological changes that translate to clinically meaningful differences between pregnant and non-pregnant thyroid hormone reference intervals. In addition, thyroid hormone concentrations change through the course of pregnancy. Total thyroid hormone concentrations rise in early pregnancy due to increased production of thyroxine-binding globulin (TBG) together with stimulation of the thyroid-stimulating hormone (TSH) receptor by human chorionic gonadotrophin.8 The increased thyroid hormone output is in turn accompanied by a fall in TSH concentration through pituitary thyroid feedback.9 Free thyroid hormones, on the other hand, are maintained within the reference range, but free thyroxine (FT4) immunoassays are susceptible to method-dependent bias in pregnancy due to variations in albumin and TBG concentrations.
The challenges of method-dependent bias in TSH and FT4 reference intervals are well recognized,10,11 but the extent of assay-related variation in pregnancy is unclear and has not been systematically evaluated. Current international guidelines advocate the use of trimester-specific normative values derived from a healthy pregnant population in the evaluation of thyroid dysfunction in pregnancy.12 In reality, many laboratories lack gestation-specific reference intervals and apply arbitrary non-pregnant cut-offs, creating the potential for misdiagnosis and inappropriate therapy. In the absence of gestation-specific reference intervals, the American Thyroid Association (ATA) guidelines recommend that the first trimester upper and lower TSH reference limits should be set at 0.5 and 0.4 mU/L below the corresponding upper and lower non-pregnant limits, respectively. These empirical cut-offs are selected to reflect the magnitude of the anticipated difference in the non-pregnant and pregnant values based on the expected TSH drop in early gestation.12 However, the validity of this approach for different assay methods has not been systematically evaluated.
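The fixed-offset extrapolation described above can be written out as a short sketch. The function name, the clamping of the lower limit at zero and the input values are illustrative assumptions; only the 0.4 and 0.5 mU/L offsets come from the text.

```python
# Offsets (mU/L) subtracted from the non-pregnant TSH limits, as described in the text.
LOWER_OFFSET = 0.4
UPPER_OFFSET = 0.5


def extrapolate_first_trimester_tsh(np_lower, np_upper):
    """Return (lower, upper) first trimester TSH limits extrapolated from
    non-pregnant limits by fixed subtraction. Values are in mU/L."""
    # Assumption: a negative concentration is not meaningful, so clamp at zero.
    t1_lower = max(0.0, round(np_lower - LOWER_OFFSET, 2))
    t1_upper = round(np_upper - UPPER_OFFSET, 2)
    return t1_lower, t1_upper


# Hypothetical non-pregnant interval, not taken from any manufacturer or study.
print(extrapolate_first_trimester_tsh(0.40, 4.20))
```

The sketch makes the limitation of the rule visible: the output depends only on the non-pregnant interval, so any assay-specific behaviour in pregnancy is ignored by construction.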
Thus, we conducted a systematic review of published studies on TSH and FT4 reference intervals in pregnancy. Our primary aim was to quantify method-related differences in reference intervals across four frequently used manufacturer assays, namely, Abbott, Beckman, Roche and Siemens. In addition, we examined the relationship between pregnant and non-pregnant reference intervals, and thus the validity of extrapolating gestation reference intervals from non-pregnant intervals for the different assay methods.
Methods
Search strategy
We searched Medline for published articles on thyroid hormone reference intervals in pregnancy between January 2000 and December 2020. We used various combinations of the search terms: ‘thyroid function’, ‘FT4’, ‘thyroxine’, ‘TSH’, ‘thyrotropin’, ‘pregnancy’, ‘gestation’, ‘reference range’ and ‘reference interval’. We sourced additional publications from references in individual articles. Relevant articles were selected after reading through titles and abstracts, or full texts when the title or abstract information was insufficient to exclude the study.
Study selection and data extraction
We selected articles in which thyroid hormones were measured using one of four assay methods: Abbott Architect, Beckman Access or DxI, Roche Cobas or Elecsys and Siemens Advia Centaur. We included only studies that reported reference intervals as 2.5–97.5 centiles with gestational age information at the time of blood sampling. We excluded studies if they were not in English, had fewer than 120 patients, did not exclude women with positive antibodies or thyroid disease, or were conducted in areas with known excess or deficient iodine nutrition status. The extracted information comprised first author, country of study, population ethnicity, number of subjects, age distribution, trimester of sampling, TSH and FT4 reference intervals and reference intervals for the corresponding non-pregnant population. Non-pregnant reference intervals were extracted from the manufacturer-provided values as reported by the authors. Where manufacturer reference intervals were not stated, study-derived non-pregnant reference intervals were used if available. Study selection and data extraction were independently conducted by two reviewers (MA, DU) and differences were resolved by consensus and referral to other reviewers (OO, CE).
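The eligibility criteria above can be expressed as a simple screening function. This is a minimal sketch: the record structure, field names and example record are hypothetical and are not the extraction form used in the review.

```python
from dataclasses import dataclass

ELIGIBLE_MANUFACTURERS = {"Abbott", "Beckman", "Roche", "Siemens"}
MIN_SUBJECTS = 120


@dataclass
class StudyRecord:
    """Hypothetical screening record for one candidate study."""
    manufacturer: str
    n_subjects: int
    reports_2_5_to_97_5_centiles: bool
    reports_gestational_age: bool
    in_english: bool
    excluded_antibody_positive_or_thyroid_disease: bool
    abnormal_iodine_area: bool  # known iodine excess or deficiency


def exclusion_reasons(study):
    """Return the list of criteria a study fails; an empty list means eligible."""
    reasons = []
    if study.manufacturer not in ELIGIBLE_MANUFACTURERS:
        reasons.append("assay method not under review")
    if not study.reports_2_5_to_97_5_centiles:
        reasons.append("no 2.5-97.5 centile reference interval")
    if not study.reports_gestational_age:
        reasons.append("no gestational age at sampling")
    if not study.in_english:
        reasons.append("not in English")
    if study.n_subjects < MIN_SUBJECTS:
        reasons.append("fewer than 120 subjects")
    if not study.excluded_antibody_positive_or_thyroid_disease:
        reasons.append("antibody-positive or thyroid disease not excluded")
    if study.abnormal_iodine_area:
        reasons.append("iodine excess or deficiency")
    return reasons


# Made-up example record: fails only on sample size.
candidate = StudyRecord("Roche", 95, True, True, True, True, False)
print(exclusion_reasons(candidate))
```

Returning the reasons rather than a single yes/no flag mirrors how exclusions are usually tallied for a study selection flow chart.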
Study quality
We assessed the methodological quality of studies using the Newcastle–Ottawa Scale (NOS) for the assessment of non-randomized studies. The NOS was adapted for this study to assess study selection (3 points), representativeness of the sample to a healthy pregnant population (3 points) and the assessment and reporting of reference intervals (3 points).
Data analysis
Reference intervals were summarized for each study as 2.5–97.5 percentiles and grouped by assay method and trimester of pregnancy. Where multiple results were available in the same trimester, we selected the data point most representative of that trimester. We were unable to undertake a conventional meta-analysis, as most studies did not include standard measures of variance for the lower and upper reference limits. Thus, we described the range for the lower and upper reference limits for each assay method in each trimester and compared study-to-study as well as intermethod variation. In addition, we summarized the TSH and FT4 lower and upper reference limits using median and interquartile range, with each study represented as an unweighted data point. Method-dependent differences in reference limits were then compared using the Kruskal–Wallis test with the Bonferroni correction applied for multiple group comparisons. The Kruskal–Wallis test is a non-parametric method for comparing two or more independent samples, while the Bonferroni correction was applied to reduce the risk of a type 1 error from multiple comparisons. To explore the validity of extrapolating gestational reference intervals from non-pregnant values, we summarized the magnitude of the difference between non-pregnant (NP) and first trimester (T1) reference limits (NP–T1) for each study. Intermethod differences in NP–T1 medians were also compared with the Kruskal–Wallis test and Bonferroni correction. All analyses were conducted using Stata, version 15.1, StataCorp, Texas, USA.
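The comparison described above can be illustrated with a short, self-contained sketch. It is not the Stata procedure used in the review: the pairwise follow-up here simply re-applies the Kruskal–Wallis test to each pair of methods and multiplies each P value by the number of pairs, which is an assumption about how the Bonferroni correction was applied. All function names and data values are made up for illustration.

```python
import math
from itertools import combinations


def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def chi2_sf(x, df):
    """Upper tail probability of the chi-square distribution, integer df >= 1."""
    if x <= 0:
        return 1.0
    half = x / 2
    if df % 2 == 0:
        return math.exp(-half) * sum(half ** i / math.factorial(i) for i in range(df // 2))
    terms = sum(half ** (i - 0.5) / math.gamma(i + 0.5) for i in range(1, (df - 1) // 2 + 1))
    return math.erfc(math.sqrt(half)) + math.exp(-half) * terms


def kruskal_wallis(groups):
    """Return (H, P) with the standard tie correction and chi-square approximation."""
    pooled = [v for g in groups for v in g]
    n = len(pooled)
    ranks = average_ranks(pooled)
    total, start = 0.0, 0
    for g in groups:
        total += sum(ranks[start:start + len(g)]) ** 2 / len(g)
        start += len(g)
    h = 12 / (n * (n + 1)) * total - 3 * (n + 1)
    counts = {}
    for v in pooled:
        counts[v] = counts.get(v, 0) + 1
    correction = 1 - sum(t ** 3 - t for t in counts.values()) / (n ** 3 - n)
    if correction == 0:  # every value identical: no evidence of a difference
        return 0.0, 1.0
    h /= correction
    return h, chi2_sf(h, len(groups) - 1)


def pairwise_bonferroni(named_groups):
    """Bonferroni-adjusted P values for every pair of groups (capped at 1)."""
    pairs = list(combinations(named_groups, 2))
    adjusted = {}
    for a, b in pairs:
        _, p = kruskal_wallis([named_groups[a], named_groups[b]])
        adjusted[(a, b)] = min(1.0, p * len(pairs))
    return adjusted


# Made-up FT4 lower limits (pmol/L), one value per study; not data from the review.
limits = {
    "Abbott": [9.1, 10.2, 11.0, 11.8, 12.1],
    "Beckman": [6.0, 7.2, 7.9, 8.4, 8.8],
    "Roche": [11.2, 11.9, 12.3, 12.8, 13.1],
    "Siemens": [10.5, 11.4, 12.0, 12.6, 13.5],
}
h_stat, p_value = kruskal_wallis(list(limits.values()))
print(f"H = {h_stat:.2f}, P = {p_value:.4f}")
for pair, p_adj in pairwise_bonferroni(limits).items():
    print(pair, f"adjusted P = {p_adj:.4f}")
```

Because each study contributes one unweighted value, a large cohort and a small one carry the same weight in this comparison, which is the trade-off accepted when variance estimates for the reference limits are unavailable.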
Results
Study selection
The study selection flow chart is presented in Figure 1. After excluding duplicate retrievals, we identified 779 studies which we screened by reading through their titles or abstracts. The full texts of 134 articles were assessed for eligibility, of which 91 studies were excluded for various reasons including unavailability of 2.5–97.5 percentile reference intervals, non-exclusion of thyroid disease or antibody-positive individuals, use of assay methods other than those being assessed, samples <120 subjects and populations with iodine deficiency or excess (Figure 1). The final study sample thus comprised 43 studies.13–55

Figure 1. Study selection flow chart.
Study characteristics
The characteristics of included studies are shown in supplemental Table 1. Out of the 43 selected studies, 19 were conducted in Asian countries, predominantly China (n = 16), while 15 studies were from European countries. Other studies were from North America (n = 3), South America (n = 3), the Middle East (n = 2) and Australia (n = 1). The studies included a total number of 132,794 pregnant women, comprising 68,097 samples analysed by Abbott (14 studies), 15,164 by Beckman (9 studies), 30,903 by Roche (15 studies) and 21,819 by Siemens (11 studies). Nineteen studies excluded women with antibodies to either thyroid peroxidase (TPOAb) or thyroglobulin (TgAb),13,14,17–20,22,31,33,34,36,39,41,44,45,47,51,52,55 while 24 studies did not measure TgAbs and excluded women with positive TPOAb only.15,16,21,23–25,27–30,32,35,37,38,40,42,43,46,48–50,53,54 The median age of patients ranged from 24 to 35 years with TSH and FT4 reference intervals determined during the first, second and third trimesters in 42, 28 and 26 studies, respectively. Studies that presented data separately for patients with different ethnicities and with multiple assay methods are presented separately. The quality scores ranged from 6 to 9, and most studies scored between 7 and 8 points.
TSH reference intervals
TSH reference intervals (2.5–97.5 percentile) for the first to third trimesters are shown in Figures 2 to 4, respectively. In the first trimester, the TSH lower limit ranged from 0.01 to 0.59 mU/L, with most studies reporting a TSH lower limit <0.20 mU/L (Figure 2(a)). The upper limit showed greater study-to-study variation, and within-method variation was observed for all assay methods in the first trimester (Figure 2(a)). The Abbott assays showed the widest variation, with a TSH upper limit range of 2.33–8.30 mU/L, including a study by Dhatt et al. that reported extremely high upper limits in women of Arab and Asian ethnicity 15 (Figure 2(a)). The intramethod variation in TSH upper limits continued into the second and third trimesters, while the lower limits remained <0.50 mU/L in the second trimester and <0.60 mU/L in the third trimester (Figures 3(a) and 4(a)). Comparisons of medians across methods showed no significant method-related difference for the lower or upper TSH limit in any trimester (P > 0.05, supplemental Table 2). Three studies with intermethod measurements in the same subjects (Fan, 16 Springer, 42 Liu 18) also reported no consistent pattern of method-related differences in TSH reference intervals. The distribution of TSH lower and upper limits by trimester and assay method is shown in Figure 5. TSH limits for each assay were progressively higher with each trimester (Figure 5(a) and (b)).

Figure 2. First trimester TSH and FT4 reference ranges.

Figure 3. Second trimester TSH and FT4 reference ranges.

Figure 4. Third trimester TSH and FT4 reference ranges.

Figure 5. TSH lower and upper limits by assay method. Each circle represents the lower or upper limit reported in each study.
FT4 reference intervals
FT4 reference intervals (2.5–97.5 percentile) are shown in Figures 2 to 4. Reference intervals varied across studies in all trimesters, and this variation was present within as well as across assay methods. The Beckman method consistently yielded lower FT4 reference intervals than other assay methods. FT4 lower limits in the first trimester ranged from 7.16 to 12.37, 5.90 to 10.81, 10.30 to 13.41 and 9.01 to 13.93 pmol/L for the Abbott, Beckman, Roche and Siemens assays, respectively. The upper limits ranged from 15.96 to 24.60, 13.20 to 18.66, 18.00 to 22.50 and 16.73 to 26.49 pmol/L for the Abbott, Beckman, Roche and Siemens assays, respectively. The Beckman upper limit reported in some studies was lower than the Roche or Siemens lower limit in other studies. Relatively lower Beckman concentrations were also observed in the study by Liu et al., which measured FT4 using the Beckman, Abbott and Roche assays in the same patients.18 The distribution of FT4 lower and upper limits by trimester and assay method is presented in Figure 5. FT4 reference intervals became progressively lower with each trimester, but method-related differences persisted in the second and third trimesters. Comparison of median lower and upper FT4 limits consistently showed lower Beckman values compared with other methods, in all trimesters (P < 0.05, supplemental Table 2).
Difference between non-pregnant and first trimester reference intervals
To examine the validity of extrapolating gestational reference intervals from non-pregnant values, we determined the difference between non-pregnant and first trimester reference limits (NP–T1) for TSH and FT4 (Figure 6). For the TSH lower limit, most NP–T1 values were in the 0–0.5 mU/L range, and thus roughly consistent with the recommendation to derive gestation TSH lower limit by subtracting 0.4 from the non-pregnant lower limit. In contrast, there was greater variation for the upper limit with differences ranging from –3.98 to +2.72 mU/L. TSH upper limit NP–T1 was >1.0 mU/L in 18 studies, meaning that the recommended subtraction of 0.5 mU/L from the non-pregnant upper limit would have over-estimated the gestation TSH upper limits by at least 0.5 mU/L in these samples.

Figure 6. Non-pregnant minus first trimester (NP–T1) lower and upper reference limits. Circles represent data points from each study. The non-pregnant data were based on the manufacturer-provided reference range for the corresponding non-pregnant population, as reported in the study. The dashed vertical lines in panel (a) (0–0.4) and panel (b) (0–0.5) represent the expected NP–T1 difference based on guideline recommendations for the lower and upper TSH limits, respectively.
TSH upper limit NP–T1 was negative in eight studies, indicating that the 0.5 mU/L subtraction would under-estimate gestation TSH upper limits in these samples. Only 15 studies (4 Abbott, 1 Beckman, 6 Roche, 4 Siemens) had a TSH upper limit NP–T1 in the 0–1.0 mU/L range, i.e. roughly consistent with the 0.5 mU/L difference. No single assay method showed a consistent pattern of difference between non-pregnant and gestation upper TSH limits. Using the ratio of the non-pregnant and gestation TSH upper limits (NP/T1) also gave highly variable results (data not shown). NP–T1 values for the FT4 lower and upper limits were also variable and ranged from –2.76 to +2.50 pmol/L for the lower limit and –6.0 to +6.0 pmol/L for the upper limit, with no specific method-related patterns (Figure 6).
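The grouping of studies by TSH upper limit NP–T1 can be sketched as follows. The thresholds (below 0, 0 to 1.0 and above 1.0 mU/L) follow the text; the function names, category labels and example limits are illustrative assumptions rather than study data.

```python
GUIDELINE_UPPER_OFFSET = 0.5  # mU/L subtracted from the non-pregnant upper limit


def np_minus_t1(np_upper, t1_upper):
    """Non-pregnant minus first trimester TSH upper limit (mU/L)."""
    return round(np_upper - t1_upper, 2)


def classify_upper_limit(np_upper, t1_upper):
    """Say how the fixed subtraction would perform for one study."""
    diff = np_minus_t1(np_upper, t1_upper)
    if diff < 0:
        # Observed gestational limit is above the non-pregnant limit.
        return "rule under-estimates"
    if diff > 1.0:
        # Observed gestational limit is more than 1.0 mU/L below the non-pregnant limit.
        return "rule over-estimates"
    return "roughly consistent"


def rule_error(np_upper, t1_upper):
    """Extrapolated minus observed first trimester upper limit (mU/L)."""
    return round((np_upper - GUIDELINE_UPPER_OFFSET) - t1_upper, 2)


# Hypothetical (non-pregnant, first trimester) upper limits in mU/L.
examples = [(4.2, 2.5), (4.0, 4.5), (4.0, 3.4)]
for np_upper, t1_upper in examples:
    print(np_upper, t1_upper,
          classify_upper_limit(np_upper, t1_upper),
          rule_error(np_upper, t1_upper))
```

In this sketch, any study with NP–T1 above 1.0 mU/L has a rule error above 0.5 mU/L, which matches the over-estimation described for the 18 studies in that category.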
Ethnicity
We explored the influence of ethnicity on reference intervals by grouping the data according to the two most frequently represented ethnic groups in the studies, i.e. Chinese and Caucasians (21 studies each). Supplemental Figure 1 shows the distribution of TSH and FT4 reference limits according to trimester, assay method and ethnicity. Statistical comparison of reference limits by ethnicity was not feasible due to small group numbers. However, Roche assays tended to report higher TSH upper limits for Chinese compared with Caucasian patients (median TSH 4.80 vs. 3.40 mU/L, supplemental Figure 1(b)). A study of reference intervals in women of Arab and Asian ethnicity by Dhatt et al. reported no difference in TSH reference intervals but showed lower FT4 reference intervals in Arab compared with Asian women in trimesters 1 and 2 15 (Figures 3 and 4).
Discussion
We have undertaken a systematic review of published reports on thyroid hormone reference intervals in pregnancy with the aim of evaluating the variation across assay methods. We observed marked variation for the TSH upper limit with a wide range of study-to-study differences affecting all analytical methods. The Beckman assays yielded comparatively lower FT4 reference intervals that were incongruent with other methods. We also explored the validity of existing strategies in many laboratories of estimating gestational reference intervals from intervals derived from the non-pregnant population. Marked variation was observed in the difference between non-pregnant and first trimester reference intervals, and no single assay method showed a consistent pattern of difference. Our study thus confirms significant method-related disparities in gestational thyroid hormone reference intervals and highlights the limitations of applying general population reference intervals in pregnancy.
Method-related differences in FT4 and TSH measurements have been well documented in the non-pregnant population.10,56 In addition, the UK National External Quality Assessment Scheme (NEQAS) also reported method-related variation in thyroid function reference intervals including relatively lower FT4 concentrations for the Beckman assays.57 However, only a few studies have systematically addressed these differences in pregnancy. In the study by Springer et al., gestational thyroid hormone reference intervals were established with seven different analytical systems.42 The authors reported significant intermethod differences for both TSH and FT4 intervals, with the lowest FT4 intervals observed with the Beckman assay.42 Several authoritative narrative reviews of pregnancy reference intervals have previously confirmed these assay-dependent differences in FT4 and TSH intervals and highlighted their potential clinical implications.58,59 A meta-analysis of TSH and FT4 gradients from the non-pregnant to pregnant state also showed assay-related variation and suggested that the upper TSH cut-off in pregnancy could be approximated by subtracting 22% from the non-pregnant TSH upper limit.60 However, this analysis was limited to studies conducted exclusively in Chinese populations.60 In contrast, we were unable to show a consistent pattern of difference between the non-pregnant and pregnant TSH upper limit, perhaps due to inclusion of a wider range of studies in our analysis.
Our findings have implications for clinical practice. Uncorrected hypothyroidism carries an increased risk of fetal loss 6 and offspring intellectual impairment.61 Furthermore, maternal over-treatment with levothyroxine in pregnancy may increase the risk of cognitive dysfunction and attention deficit hyperactivity disorders in children.61,62 Over-estimating TSH upper limits would miss cases of gestational hypothyroidism, while under-estimation would wrongly diagnose hypothyroidism, putting women without thyroid dysfunction at risk of unnecessary and potentially harmful therapy. The need for assay-dependent reference intervals is even more pressing for FT4 due to the striking method discrepancies observed in these series. These considerations remain pertinent given that many laboratories lack gestation-specific reference intervals and continue to apply non-pregnant intervals in pregnancy. Our findings show that gestation reference intervals cannot reliably be deduced from the non-pregnant range and that the ATA recommendation to subtract 0.5 mU/L from the non-pregnant upper limit would over- or under-estimate the upper TSH limit in the majority of samples.
Ideally, each laboratory should derive its own gestational reference intervals based on the assay method and local population. This is not always practicable, particularly for small laboratories with limited resources. One approach would be for health authorities to collaborate at regional level to establish reference intervals for the commonly used assay methods within the region. The establishment of reference intervals should follow criteria set by international bodies.11,63 Furthermore, the reporting of gestational thyroid function tests should be assay and pregnancy specific and clinicians should be alert to the potential for method-related differences. For laboratories that lack gestation-specific data, the use of arbitrary cut-off points is now discouraged, and best practice in the circumstance would be to use reference intervals derived from a population with similar assay platform and comparable characteristics in terms of ethnicity and iodine nutrition. If non-pregnant reference intervals must be used, then clinicians need to be aware of the limitations of such an approach. Clinical studies investigating the impact of thyroid dysfunction should avoid outcome analyses based on fixed cut-offs and use comparable measures of population percentiles or multiples of medians as has previously been suggested.59
Our study has some limitations. Because our review covers a 20-year period, it is likely that assay methods would have changed with time and some of the older studies may not reflect current methods. In addition, we were only able to evaluate the most commonly used assay platforms, and as such, the variation in other assay methods is unknown. Also, some of the observed variation may reflect differences in laboratory quality standards as well as unmeasured confounders such as iodine nutrition. Lack of iodine nutrition data in most studies meant that we could not formally assess the impact of iodine status on reference intervals. For example, the study by Dhatt et al. in a mixed-ethnic United Arab Emirates population reported unequivocally raised TSH values, suggesting unrecognized iodine deficiency or thyroid dysfunction in their cohort.15 Lastly, we were unable to conduct a conventional meta-analysis of the lower and upper reference limits since most studies did not provide data distribution measures for these limits such as standard deviation or 95% confidence intervals. Instead, we adopted a pragmatic approach in which each study was represented as a single unweighted data point and medians for the lower and upper reference limits were compared using non-parametric methods. While this approach provides crude estimates of intermethod differences, it might have lacked the sensitivity to detect more subtle variation.
Our study’s strength is that it is the first systematic review to focus on assay-dependent differences in thyroid hormone reference intervals in pregnancy. We have used stringent inclusion criteria to systematically select relevant studies and to summarize a large body of data spanning 20 years. Lastly, we have probed the validity of current guideline recommendations and highlighted practical challenges facing laboratories without gestation-specific reference interval data.
In conclusion, we show wide variation in thyroid hormone reference intervals both within and across assay methods. We found no consistent relationship between the non-pregnant and pregnant reference intervals to permit extrapolation of pregnancy intervals from non-pregnant intervals. Future guidelines should acknowledge the limitations of current approaches, and efforts should now be invested in deriving gestation reference intervals that are assay and population specific.
Supplemental Material
sj-pdf-1-acb-10.1177_00045632211026955, sj-pdf-2-acb-10.1177_00045632211026955 and sj-pdf-3-acb-10.1177_00045632211026955 - Supplemental material for Method-dependent variation in TSH and FT4 reference intervals in pregnancy: A systematic review by Onyebuchi E Okosieme, Medha Agrawal, Danyal Usman and Carol Evans in Annals of Clinical Biochemistry
Footnotes
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
Ethical approval
Not applicable.
Guarantor
OEO.
Contributorship
Concept and design: OEO, CE; Data acquisition: OEO, MA, DU, CE; Data analysis: OEO; Writing and editing: OEO, MA, DU, CE; All authors approved the final version of the article.
Supplemental material
Supplemental material for this article is available online.
References
