Comparing cardiac output monitors and defining agreement: A systematic review and meta-analysis

Abstract

Background

Measuring cardiac output is common in critical care and perioperative medicine. Different monitoring systems are often judged against others in comparative studies. There is no agreed standard or definition on which to base the conclusions of such studies.

Objectives

To review comparative studies of cardiac output monitors using an agreement:tolerability index (ATI) as a measure of monitor precision. To compare the ATI of a monitor with the conclusions of authors regarding agreement and clinical utility.

Design

Systematic review of comparative studies of cardiac output monitoring systems. The precision of each monitor was standardised against an ATI using a tolerability interval based on the normal range for cardiac index. The conclusions of each study were described as positive, neutral or negative, depending on whether authors reported the monitor to be acceptably precise and/or clinically useful. Comparison was made between the precision of a monitor and the likelihood of it being favoured by authors.

Data sources

PubMed was searched up to March 2012.

Eligibility criteria

Studies published in English that compared two or more methods for measuring cardiac output in adult humans.

Results

A total of 213 papers documenting 409 separate comparisons of two methods of measuring cardiac output were included. ATIs for the different comparisons varied from 0.07 to 6.84 (where an ATI < 1 indicates acceptable agreement, 1–2 marginal and >2 unacceptable agreement). Thirty-one percent of authors defined their own terms for acceptable agreement. ATI was only moderately correlated with the conclusions of the authors (Spearman rho = 0.47, P < 0.0001).

Conclusions

Authors should define what constitutes acceptable agreement a priori when reporting comparative studies of cardiac output monitors. The ATI and the tolerability interval may be a useful basis for helping define acceptable precision.

Keywords

Cardiac output, thermodilution, measurement

Introduction

Monitors that measure cardiac output can be used both to diagnose states of low, normal or high cardiac output, and to assess the response to therapeutic interventions by continuous or serial measurements. The use of such monitors is common in critical care and is recommended in certain peri-operative situations.¹ Several different monitoring technologies exist. The ‘gold standard’ is normally cited as being a thermodilution technique using a pulmonary artery catheter, which is a common reference standard to which newer methods have been compared. Bland and Altman described a practical way of reporting the results of agreement studies in which a particular variable is measured by one method and then compared to measurements from a reference method.² The difference between the two methods is expressed as the bias (the average difference), and the scatter of differences about the bias is expressed as the 95% limits of agreement. Critchley and Critchley reviewed comparative studies of cardiac output monitors in 1999; only two new methods were included (Doppler and bioimpedance techniques) and most studies used thermodilution as the reference.³ They found that fewer than half of these studies provided any objective criteria by which to accept or reject the new method. Based on the thermodilution method having an assumed accuracy of ±10–20%, Critchley and Critchley recommended that future studies should accept a new method if the limits of agreement between a new and a reference technique were within ±30%. (This was based on combining notional errors of 20% from each method, which gives a pooled error of approximately 28%, rounded to 30%.)
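As a brief worked illustration of the combination of errors described above, and assuming the two notional 20% errors are independent and combined in quadrature (our reading of the text, not a formula quoted from Critchley and Critchley):

```latex
\sqrt{(20\%)^{2} + (20\%)^{2}} = 20\% \times \sqrt{2} \approx 28.3\% \approx 30\%
```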

Cecconi et al. have emphasised the fact that Critchley and Critchley recommended acceptability of agreement on the basis of the reference measurement technique being thermodilution with a precision of ±10–20%.⁴ They argue that the precision of the reference technique should be measured during any comparative study and that if the reference technique has wider precision than that assumed by Critchley and Critchley, then a limit of agreement of ±30% might not be acceptable. Cecconi and Rhodes have called for a consensus on how such comparative studies are conducted and reported to avoid such misinterpretation,⁵ and 10 years previously Critchley and Critchley recommended that these studies should report the mean cardiac output, the bias, the limits of agreement, and the percentage error.³

Columb has suggested that authors specifically define agreement a priori for the specific clinical scenario by defining a tolerability interval to guide their conclusions on whether the precision of a clinical monitor is acceptable.⁶ The tolerability interval is the range of values that are clinically acceptable and would not usually prompt any intervention to alter the variable. This could be a normal range or reference interval, but does not have to be so. The agreement:tolerability index (ATI) expresses the precision of a measuring device in relation to this ‘clinically acceptable’ range. An index greater than 2.0 implies that ‘extreme-to-extreme misclassification’ can be expected to occur as each side of the limits of agreement exceeds the tolerability range and this represents unacceptable agreement. An index between 1.0 and 2.0 implies that although extreme-to-extreme misclassification is unlikely, measurements in the tolerability range may be classified to either extreme so this ‘two-sided misclassification’ represents marginal agreement. An index less than 1.0 implies that at worst, patients in the tolerability range are only likely to be misclassified to an adjacent extreme or ‘one-sided misclassification’ and this represents acceptable agreement.⁶
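The three bands described above can be sketched as a small helper. This is an illustrative sketch only: the function name `classify_ati` is hypothetical, and assigning the boundary values 1.0 and 2.0 to the marginal band is an assumption based on the wording ‘less than 1.0’ and ‘greater than 2.0’.

```python
def classify_ati(ati: float) -> str:
    """Map an agreement:tolerability index (ATI) to one of the three bands.

    Assumption: the boundary values 1.0 and 2.0 fall in the marginal band.
    """
    if ati < 0:
        raise ValueError("ATI cannot be negative")
    if ati < 1.0:
        # At worst one-sided misclassification
        return "acceptable"
    if ati <= 2.0:
        # Two-sided misclassification is possible
        return "marginal"
    # Extreme-to-extreme misclassification can be expected
    return "unacceptable"
```

For example, an index of 0.68 would be labelled acceptable and an index of 2.36 unacceptable.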

In this model, extreme-to-extreme misclassifications represent the worst level of precision in that the true variable might be below the tolerability range, but the monitor displays a value above that range. For example, setting a tolerability interval as equal to the normal range for cardiac output would mean that an extreme-to-extreme misclassification could result in a patient with cardiogenic shock being recorded by a monitor as having a hyperdynamic circulation. Interventions based on such a misclassification might be directly opposite to what is clinically indicated, and such poor precision from a monitor should not be acceptable. Therefore, we have chosen to derive a tolerability interval for assessing cardiac output monitors from the normal range of cardiac index.

Several new technologies for measuring cardiac output have become available since the review by Critchley and Critchley.⁷ We sought to review all studies published to date that have compared two or more methods of measuring cardiac output, focussing on the reported precision of the monitor and conclusions of the authors on whether a monitor performs to a clinically acceptable level.

Materials and methods

A maximum clinical tolerability interval was agreed by the authors of 1.6 L.min⁻¹.m⁻² based on the normal range of cardiac index of 2.6 to 4.2 L.min⁻¹.m⁻²,⁸ approximating a ±24% range. We then searched PubMed⁹ up to March 2012 using the search strategy in Box 1.
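The arithmetic behind these figures can be laid out as follows. The choice of the midpoint of the normal range as the denominator for the percentage is an assumption; the text does not state it explicitly.

```latex
\text{TI}_{\text{CI}} = 4.2 - 2.6 = 1.6\ \text{L.min}^{-1}.\text{m}^{-2}, \qquad
\text{midpoint} = \frac{2.6 + 4.2}{2} = 3.4\ \text{L.min}^{-1}.\text{m}^{-2}, \qquad
\pm\frac{1.6 / 2}{3.4} \approx \pm 23.5\% \approx \pm 24\%
```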

Retrieved abstracts were independently screened by two authors (DN and IC) for papers describing original studies comparing two or more methods of measuring cardiac output (or cardiac index) in adult humans. The full texts of all potentially eligible papers identified by one or both authors were obtained and reviewed. Papers were excluded at this stage if they did not meet the above inclusion criteria or did not contain sufficient information to allow extraction of the mean difference in cardiac output (or index) and the limits of agreement for a comparison between two methods of measuring cardiac output.

Box 1. PubMed search strategy

(“Cardiac Output/analysis”[Mesh] OR

“Cardiac Output/instrumentation”[Mesh] OR

“Cardiac Output/methods”[Mesh] OR

“Cardiac Output/standards”[Mesh]) OR

“Catheterization, Swan-Ganz”[Mesh] OR

“Thermodilution/instrumentation”[Mesh] OR

“Thermodilution/methods”[Mesh] OR

“Thermodilution/standards”[Mesh] OR

“Thermodilution/statistics and numerical data”[Mesh] OR

“cardiac output monitor” OR

“haemodynamic monitor” OR

“hemodynamic monitor” OR

“flow monitor”

Limits: Humans, English, All Adult: 19+ years

For papers reporting comparisons between three or more methods of measuring cardiac output, data on each comparison were extracted separately. Therefore, the unit of analysis was the comparison, not the report. Where necessary, cardiac output data were estimated by measurements from figures or graphs included in the report. The conclusions of the authors of each report were classified as ‘positive’ (suggesting that the new method of measuring cardiac output performed well in comparison to the reference method), ‘neutral’ (if no particular conclusion was reached or if qualifying statements were included that seemed to dilute any general conclusion of rejection or acceptance) or ‘negative’ (suggesting poor performance). These subjective assessments were made by IC and DN independently. Discrepancies were then resolved by joint consideration of the study in question. The Cohen kappa statistic was used to assess agreement between raters.
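Agreement between two raters assigning categorical labels can be sketched as below. This is a minimal illustration of the unweighted Cohen kappa; whether a weighted or unweighted form was used in this review is not stated, and the function name and example labels are hypothetical.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Unweighted Cohen kappa for two equal-length sequences of labels."""
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed proportion of items on which the raters agree
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance from each rater's marginal label frequencies
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        raise ValueError("kappa is undefined when chance agreement is complete")
    return (observed - expected) / (1.0 - expected)


# Hypothetical labels for four comparisons (not data from this review)
example_a = ["positive", "positive", "neutral", "negative"]
example_b = ["positive", "neutral", "neutral", "negative"]
example_kappa = cohen_kappa(example_a, example_b)
```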

An agreement:tolerability index was calculated for each comparison using equation (1).

\text{Agreement:tolerability index} = \frac{\text{upper limit of agreement} - \text{lower limit of agreement}}{\text{maximum clinical tolerability interval}}

(1)
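Equation (1) can be sketched in code as follows. The function and constant names are illustrative only, and the limits of agreement in the usage note are invented for the example rather than taken from any included study.

```python
# Maximum clinical tolerability interval for cardiac index, L.min^-1.m^-2
TOLERABILITY_INTERVAL_CI = 1.6


def agreement_tolerability_index(lower_loa, upper_loa,
                                 tolerability_interval=TOLERABILITY_INTERVAL_CI):
    """Width of the 95% limits of agreement divided by the tolerability interval."""
    if tolerability_interval <= 0:
        raise ValueError("tolerability interval must be positive")
    if upper_loa < lower_loa:
        raise ValueError("upper limit of agreement must not be below the lower limit")
    return (upper_loa - lower_loa) / tolerability_interval
```

With hypothetical limits of agreement of -1.0 to +1.0 L.min⁻¹.m⁻², the index would be 2.0 / 1.6 = 1.25.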

For comparisons reporting cardiac output data only, the tolerability interval was converted from cardiac index to cardiac output units using equation (2).

\text{TI}_{\text{CO}} = \text{TI}_{\text{CI}} \times \text{BSA}

(2)

where TI = tolerability interval, CO = cardiac output, CI = cardiac index, BSA = body surface area, assumed to be 1.73 m².
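Substituting the values given in the text into equation (2) gives the tolerability interval in cardiac output units (a worked illustration only):

```latex
\text{TI}_{\text{CO}} = 1.6\ \text{L.min}^{-1}.\text{m}^{-2} \times 1.73\ \text{m}^{2} \approx 2.77\ \text{L.min}^{-1}
```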

Summary agreement:tolerability indices were produced using equation (3) for methods of measuring cardiac output where several separate comparisons had been made.

\text{Summary agreement:tolerability index} = \sum_{i} \left( \frac{n_{i} \times \text{ATI}_{i}}{\sum_{i} n_{i}} \right)

(3)

where i = study number, n_i = number of subjects in study i, ATI_i = agreement:tolerability index in study i.
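The subject-weighted summary in equation (3) can be sketched as below. The function name and the example values are hypothetical and are not drawn from the included studies.

```python
def summary_ati(studies):
    """Subject-weighted mean ATI.

    `studies` is an iterable of (n_subjects, ati) pairs, one per comparison.
    """
    studies = list(studies)
    total_n = sum(n for n, _ in studies)
    if total_n <= 0:
        raise ValueError("total number of subjects must be positive")
    # Each comparison contributes in proportion to its number of subjects
    return sum(n * ati for n, ati in studies) / total_n
```

For two hypothetical comparisons with 10 subjects (ATI 1.0) and 30 subjects (ATI 2.0), the summary would be (10 × 1.0 + 30 × 2.0) / 40 = 1.75.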

Confidence intervals for the agreement:tolerability indices were generated as follows. The maximum clinical tolerability interval is an arbitrary constant and so has no associated estimation error. This means that any error in the ATI comes from the limits of agreement and hence from any error in the estimation of the standard deviation of the bias. The standard error of the standard deviation of the bias was calculated using equation (4).

\text{se of sd} = s \times \sqrt{1 - \left( \sqrt{\frac{2}{n - 1}} \times \frac{\Gamma\left(\frac{n}{2}\right)}{\Gamma\left(\frac{n - 1}{2}\right)} \right)^{2}}

(4)

where s = standard deviation of the bias, n = number of subjects and Γ(·) is the gamma function.
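Equation (4) can be evaluated as sketched below. The gamma ratio is computed through `math.lgamma` to avoid overflow for large n; this is an implementation choice for the illustration, not necessarily how the original R and C++ code was written, and the function names are hypothetical.

```python
import math


def gamma_ratio_factor(n):
    """sqrt(2 / (n - 1)) * Gamma(n / 2) / Gamma((n - 1) / 2) for n >= 2."""
    if n < 2:
        raise ValueError("at least two subjects are required")
    log_ratio = math.lgamma(n / 2.0) - math.lgamma((n - 1) / 2.0)
    return math.sqrt(2.0 / (n - 1)) * math.exp(log_ratio)


def se_of_sd(s, n):
    """Standard error of the standard deviation s estimated from n subjects."""
    factor = gamma_ratio_factor(n)
    # max() guards against tiny negative values caused by rounding
    return s * math.sqrt(max(0.0, 1.0 - factor ** 2))
```

For large n the result approaches the familiar approximation s / √(2(n − 1)).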

Confidence intervals were then calculated using equation (5), on the assumption that this error was identical to the standard error of the ATI.

95\%\ \text{confidence interval of the ATI} = \text{ATI} \pm 1.96 \times \text{se}

(5)

where ATI = agreement:tolerability index, se = standard error of the standard deviation of the bias.

Where the lower confidence interval boundary would otherwise cross zero, it was truncated to zero as a negative ATI should not be possible. Cochran Q statistics were used due to the expected statistical and methodological heterogeneities in the studies. Spearman rho was used to assess the correlation of the ATI with the conclusions of the authors. Statistical significance was defined as P < 0.05 (two-sided). Analyses were carried out in R version 2.11.1.¹⁰ Some custom subroutines were written in C++ and compiled with the MinGW version of the GNU C++ compiler.¹¹
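Equation (5), together with the truncation of the lower bound at zero, can be sketched as follows. The function name is hypothetical and the numbers in the usage note are invented for illustration.

```python
def ati_confidence_interval(ati, se, z=1.96):
    """Approximate 95% confidence interval for an ATI, truncated at zero."""
    if se < 0:
        raise ValueError("standard error must be non-negative")
    # A negative ATI is not possible, so the lower bound is floored at zero
    lower = max(0.0, ati - z * se)
    upper = ati + z * se
    return lower, upper
```

With a hypothetical ATI of 0.3 and a standard error of 0.2, the untruncated lower bound would be negative, so the interval becomes 0 to 0.692.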

Results

The literature search results are detailed in Figure 1. Included studies are listed in the Supplementary material table W1, which is only available as online supplementary material due to its size. A total of 409 comparisons reported in 213 papers were ultimately included. Fourteen different methods of measuring cardiac output were employed, each with many variations. These included certain common physical principles such as indicator dilution, pulse contour analysis, Doppler ultrasound, bioimpedance and the Fick principle.

Figure 1.

Flow chart of literature search.

Agreement:tolerability indices for the individual comparisons ranged from 0.07 to 6.84 and there was significant heterogeneity (Cochran Q, P < 0.0001) in the estimates. The indices, conclusions of the authors and years of publication are shown in Figure 2.

Figure 2.

Graph of agreement tolerability index against year of report publication.

Several pairs of monitoring devices have been frequently compared. Summary ATIs were calculated for 12 of the most common comparisons. These are presented in Table 1. Forest plots for these comparisons are shown in Figures 3 to 11.

Figure 3.

Studies comparing bolus and continuous methods of measuring cardiac output via a pulmonary artery catheter. Comparison numbers in square brackets refer to the entry in Table W1 in the supplemental material.

Table 1.

Summary agreement:tolerability indices for frequent comparisons.

Methods	Number of studies	Number of subjects	Summary ATI	Range of ATIs	Summary percentage error^a
PAC - bolus vs. continuous	57	1173	1.24	0.58–3.18	30%
Other PAC comparisons	13	351	0.68	0.42–1.22	16%
PAC vs. PiCCO thermodilution	9	275	1.30	0.79–2.00	31%
PAC vs. PiCCO pulse contour	30	455	1.68	0.78–2.66	40%
PAC vs. PiCCO not specified	32	692	1.30	0.30–2.40	31%
PAC vs. FloTrac	41	1026	1.64	0.72–3.50	39%
PAC vs. LiDCO	9	192	1.71	0.79–4.18	41%
PAC vs. Oesophageal Doppler	27	537	1.66	0.49–4.62	40%
PAC vs. USCOM	11	402	1.82	0.52–2.27	44%
PAC vs. Bioimpedance techniques	31	1176	2.36	0.36–6.84	57%
PAC vs. Fick/rebreathing techniques	28	577	1.43	0.07–3.92	34%
PAC vs. echo	16	474	1.10	0.24–2.24	26%

PAC: pulmonary artery catheter.

^a Derived ± percentage errors at the tolerability interval representing the 95% limits of agreement.
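The derived percentage errors in Table 1 appear consistent with multiplying each summary ATI by the ±24% half-width of the tolerability interval. This relationship is our inference from the tabulated values, not a formula stated in the text, and the names below are hypothetical.

```python
# Half-width of the tolerability interval as a percentage (the ±24% quoted in the methods)
HALF_WIDTH_PCT = 24

# (comparison, summary ATI, tabulated percentage error) taken from Table 1
TABLE_1 = [
    ("PAC - bolus vs. continuous", 1.24, 30),
    ("Other PAC comparisons", 0.68, 16),
    ("PAC vs. PiCCO thermodilution", 1.30, 31),
    ("PAC vs. PiCCO pulse contour", 1.68, 40),
    ("PAC vs. PiCCO not specified", 1.30, 31),
    ("PAC vs. FloTrac", 1.64, 39),
    ("PAC vs. LiDCO", 1.71, 41),
    ("PAC vs. Oesophageal Doppler", 1.66, 40),
    ("PAC vs. USCOM", 1.82, 44),
    ("PAC vs. Bioimpedance techniques", 2.36, 57),
    ("PAC vs. Fick/rebreathing techniques", 1.43, 34),
    ("PAC vs. echo", 1.10, 26),
]


def summary_percentage_error(ati):
    """Assumed relationship: percentage error = ATI x 24%, rounded to a whole number."""
    return round(ati * HALF_WIDTH_PCT)


# True if the assumed relationship reproduces every row of the table
all_rows_match = all(summary_percentage_error(ati) == pct for _, ati, pct in TABLE_1)
```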

Thirty-one percent of papers (covering 30% of comparisons) gave an a priori definition of agreement. In 188 comparisons, the authors reported positive conclusions, neutral in 111 and negative in 110. The Cohen kappa statistic was 0.89 (P < 0.001) for agreement between the two independent raters of the reported conclusions. ATI was moderately correlated with the conclusions of the authors (Spearman rho = 0.47, P < 0.0001).

Figure 4.

Studies comparing PiCCO with PAC. Comparison numbers in square brackets refer to the entry in Table W1 in the supplemental material.

Figure 5.

Studies comparing FloTrac with PAC. Comparison numbers in square brackets refer to the entry in Table W1 in the supplemental material.

Figure 6.

Studies comparing LiDCO with PAC. Comparison numbers in square brackets refer to the entry in Table W1 in the supplemental material.

Figure 7.

Studies comparing oesophageal Doppler with PAC. Comparison numbers in square brackets refer to the entry in Table W1 in the supplemental material.

Figure 8.

Studies comparing USCOM with PAC. Comparison numbers in square brackets refer to the entry in Table W1 in the supplemental material.

Figure 9.

Studies comparing bioimpedance techniques with PAC. Comparison numbers in square brackets refer to the entry in Table W1 in the supplemental material.

Figure 10.

Studies comparing Fick or rebreathing-based techniques with PAC. Comparison numbers in square brackets refer to the entry in Table W1 in the supplemental material.

Figure 11.

Studies comparing echocardiography with PAC. Comparison numbers in square brackets refer to the entry in Table W1 in the supplemental material.

Discussion

Only 31% of authors provide an objective definition of acceptable agreement on which to base their conclusions. This compares to 44% of authors included in the 1999 study by Critchley and Critchley.³ When the precision of each monitor under study is standardised against a tolerability interval equal to the normal range for cardiac output, there is a high degree of overlap between a monitor considered by authors to be acceptably precise and one that is not. As represented by Figure 2, there is only a moderate correlation (r_s = 0.47) between the actual measured precision of a monitor and the likelihood that it will be considered to have clinical utility.

The definition of acceptable precision described by Critchley and Critchley³ is limited in that it only establishes a new method as having a similar precision to the reference method, without necessarily answering the tricky question of what is required of a monitor when used for guiding clinical decisions. Although their original assumption was that a precision of 20% would be clinically acceptable, this was based on what was seen to be acceptable to ‘most authors at the time’.¹² There is no indication of what level of precision is acceptable to most authors or clinicians today. Peyton and Chong have suggested that a precision of ±45% would be ‘a more realistic expectation of achievable precision in clinical practice’.¹³ We suggest that a monitor with precision as low as ±45% would approximately equate to an agreement:tolerability index of 1.9; at best marginal agreement and closer to the unacceptable threshold that risks extreme-to-extreme misclassifications. As shown in Figure 2, some authors make favourable conclusions regarding monitors that could describe a patient with low cardiac output as having high cardiac output (or vice versa), which seems difficult to justify.
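The figure of about 1.9 can be reproduced as follows, assuming that the ±45% is taken relative to the midpoint of the normal range for cardiac index (3.4 L.min⁻¹.m⁻²); this denominator is an assumption for the illustration.

```latex
\text{ATI} \approx \frac{2 \times 0.45 \times 3.4}{1.6} = \frac{3.06}{1.6} \approx 1.9
```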

Cecconi et al.⁴ have proposed the following for future comparative studies of cardiac output monitors:

The reference technique should be as accurate and precise as possible.

The precision of the reference technique should be measured within the study.

The bias and limits of agreement between the two techniques should be quoted.

The precision of the new tested technique should be calculated.

They also propose that the desired precision of the new technique should be described a priori. We agree with these recommendations and further recommend the definition of acceptable precision as described by Columb⁶ where it can be described based on a tolerability interval.

Figures 3 to 11 are forest plots of those studies that used a reference technique of thermodilution via a pulmonary artery catheter. Each figure is categorised by similar modalities or techniques. Table 1 shows a summary statistic for these different categories of monitoring technique, and perhaps suggests a hierarchy of precision. Summing results from a mix of papers with differing study protocols in differing patient populations in this way may not produce a valid assessment of which method performs best in practice and readers should be cautious not to over-interpret these data.

Some clinicians limit their interpretation of cardiac output data to analysis of trends, i.e. paying more attention to changes over time than the absolute values produced. However, although this approach may minimise the effect of bias, it might still be corrupted by low precision. Apparent temporal changes may simply reflect poor precision in measuring physiology that in reality is static. Consensus is limited on the ideal statistical method for assessing a monitor’s performance as a ‘trend-o-meter’.¹⁴ There may also be an argument for using cardiac output monitors as a tool to predict a likely outcome from an intervention. For example, indices used to predict ‘fluid responsiveness’ might give a range of probabilities that an intravenous fluid bolus will result in an increase in cardiac output;¹⁵ low precision of the monitor would contribute to lower predictive value.

The questions ‘what is a monitor for?’ and ‘how well must a monitor perform to be clinically useful?’ still need to be answered in the conduct and analysis of any such study. Ultimately, the ideal study design to assess the utility of a monitor would use a measure that has intrinsic value i.e. a clinically important outcome. Studies such as those demonstrating improved outcomes with the intra-operative use of oesophageal Doppler¹⁶ or pulse contour analysis as part of a goal directed therapy,¹⁷ to some extent, avoid the problems of comparative statistics that we describe. They can recommend the monitor as being successful of itself and not merely in reference to another. However, when reporting studies that seek to compare one cardiac output monitor with another (rather than report clinically significant end-points), authors should define and justify their terms for assessing agreement.

Our study has several important limitations. We have made subjective interpretations of the conclusions of each group of authors, distilling their discussion into a simple label of ‘positive’, ‘neutral’ or ‘negative’. Despite an effort to make unbiased interpretations of the papers, this subjectivity might be a source of error, relying as it does on an individual’s interpretation of another individual’s language. However, the Cohen kappa statistic at 0.89 suggests good concordance between the raters. Our ‘normal range’ definition of an acceptable tolerability interval could also be criticised for being arbitrary. At its most basic, it merely acts as a standard by which to compare the relative precision of monitors. We have also applied a reference range that pertains to a normal resting state, which might not be applicable to some of the clinical situations in the reviewed studies. However, we think it makes some sense that the reference range be considered in any measurement of a physiological variable. Although many authors seem to rely on finding an acceptably small bias, it is the scatter of the differences, as manifested by the 95% limits of agreement, that needs to be formally considered in assessing agreement. A systematic bias can be corrected by simple recalibration. So for the purposes of this report, and simplicity, the bias in the studies has been ignored and only the 95% agreement interval used to assess the agreement. We acknowledge that there will be some relevant data that has not been gathered through the process of the literature search; so, as our search was limited to English language reports only, our conclusions can only apply to these. However, we suggest that there are enough data to be broadly representative.

It was not practical to formally check that variance was constant across the range of included cardiac outputs in each study as the raw data would be required to verify and to transform as required. If variance is not constant for a particular study, then the ATI, being an average measure, may conceal poorer performance of the monitor at extreme values (particularly low values) of cardiac output. Many included studies took several readings of cardiac output from each subject with each method of measuring cardiac output. This can also result in a lower variance in the study than would have been the case if all the readings with a particular monitor came from different individuals. In an attempt to adjust for this, we used a weighting system based on the number of study participants to produce the summary ATIs, rather than the more conventional random effects model based on the inverse of the variance. In addition, as expected there was considerable methodological and statistical heterogeneity in the studies, as can easily be observed in the forest plots. So although we have presented summary weighted ATIs for various comparisons, these are presented for guidance only.

Conclusion

Comparative studies of cardiac output monitors are still not universally providing an a priori definition of acceptable precision on which to base conclusions and a tolerability interval approach may be useful. There is only a moderate correlation between the precision of cardiac output monitors and the opinions of the authors regarding clinical agreement.

Footnotes

Acknowledgement

Parts of this review were presented as posters at the 12th EBPOM: Congress in Evidence-Based Perioperative Medicine, 3–4 July 2013, London.

Declaration of conflicting interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The authors received no financial support for the research, authorship, and/or publication of this article.

References

1. MTG3 CardioQ-ODM (oesophageal Doppler monitor): guidance. National Institute for Health and Care Excellence, 2011, guidance.nice.org.uk/mtg3.

2. Bland, Altman. Statistical methods for assessing agreement between two methods of clinical measurement. Lancet 1986; 1: 307–310.

3. Critchley, Critchley. A meta-analysis of studies using bias and precision statistics to compare cardiac output measurement techniques. J Clin Monit Comput 1999; 15: 85–91.

4. Cecconi, Rhodes, Poloniecki. Bench-to-bedside review: the importance of the precision of the reference technique in method comparison studies – with specific reference to the measurement of cardiac output. Crit Care Lond Engl 2009; 13: 201.

5. Cecconi, Rhodes. Validation of continuous cardiac output technologies: consensus still awaited. Crit Care Lond Engl 2009; 13: 159.

6. Columb. Clinical measurement and assessing agreement. Curr Anaesth Crit Care 2008; 19: 328–329.

7. Armstrong, Fernando, Columb. Minimally- and non-invasive assessment of maternal cardiac output: go with the flow! Int J Obstet Anesth 2011; 20: 330–340.

8. Klabunde RE. Cardiovascular physiology concepts. 2nd ed., Philadelphia, PA: Lippincott Williams & Wilkins/Wolters Kluwer, 2012, p. 67.

9. Home – PubMed – NCBI. http://www.ncbi.nlm.nih.gov/pubmed/ (accessed 14 February 2014).

10. R Development Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, 2010, http://www.R-project.org.

11. GNU Compiler Collection – MinGW, http://www.mingw.org/.

12. Critchley LAH. Bias and precision statistics: should we still adhere to the 30% benchmark for cardiac output monitor validation studies? Anesthesiology 2011; 114: 1245.

13. Peyton, Chong. Minimally invasive measurement of cardiac output during surgery and critical care. Anesthesiology 2010; 113: 1220–1235.

14. Critchley, Lee, Ho AM-H. A critical review of the ability of continuous cardiac output monitors to measure trends in cardiac output. Anesth Analg 2010; 111: 1180–1192.

15. Cannesson, Le Manach, Hofer. Assessing the diagnostic accuracy of pulse pressure variations for the prediction of fluid responsiveness: a “gray zone” approach. Anesthesiology 2011; 115: 231–241.

16. Abbas, Hill. Systematic review of the literature for the use of oesophageal Doppler monitor for fluid replacement in major abdominal surgery. Anaesthesia 2008; 63: 44–51.

17. Pearse, Dawson, Fawcett. Early goal-directed therapy after major surgery reduces complications and duration of hospital stay. A randomised, controlled trial [ISRCTN38797445]. Crit Care Lond Engl 2005; 9: R687–R693.