Abstract
Cardiopulmonary exercise testing (CPET), performed using a range of approaches, is useful for objectively assessing patient disease severity in clinical and research settings. Still, a lack of trained specialists and/or improper data interpretation can substantially limit the effective use of CPET for the clinical classification of patients. This study aimed to test an automated disease likelihood scoring algorithm based on cardiopulmonary responses during a simplified step-test protocol. For patients with heart failure (HF), pulmonary hypertension (PAH), obstructive lung disease (OLD), or restrictive lung disease (RLD), we compared patient scores stratified into four “silos” generated by our novel algorithm against patient evaluations provided by expert clinicians. Patients with HF (n = 12), PAH (n = 9), OLD (n = 16), or RLD (n = 10) performed baseline pulmonary function testing followed by submaximal step-testing. Breath-by-breath measures of ventilation and gas exchange, in addition to oxygen saturation and heart rate, were collected continuously throughout testing. The algorithm demonstrated close alignment with patient assessments provided by clinical specialists: HF (r = 0.89, P < 0.01); PAH (r = 0.88, P < 0.01); OLD (r = 0.70, P < 0.01); and RLD (r = 0.88, P < 0.01). Furthermore, the algorithm was capable of differentiating the primary disease from other disease pathologies. Thus, these data suggest that scores from this simplified automated algorithm, used during step-testing to identify the likelihood that patients have HF, PAH, OLD, or RLD, correlate closely with patient assessments conducted by trained clinicians.
Introduction
Comprehensive cardiopulmonary exercise testing (CPET) has traditionally been offered as a clinical diagnostic tool in mid-sized to large medical centers. It is typically conducted as a test that pushes patients to maximal exertion while physiological data are acquired via 12-lead ECG, pulse oximetry, blood pressure, and metabolic cart systems. Successful completion of CPET commonly requires two trained specialists with oversight by a physician, and can often take 1 h or longer for preparation, testing, cool down, and discharge. In addition, the majority of commercially available metabolic cart systems display a large and sometimes daunting number of cardiopulmonary measurements, often breath-by-breath, which typically requires a higher level of expertise for test interpretation.1 Thus, the cumulative effect of the overall complexity of CPET, including specialized personnel and time involvement, modest-to-moderate patient risk, and the need for technological integration of multiple tools, can act as a barrier to the routine implementation of this powerful clinical assessment resource.
Previous reports from our group have demonstrated that various simplified forms of submaximal exercise testing can be used to improve the pathophysiological understanding of abnormal breathing patterns and gas exchange in patients with heart failure (HF) and/or pulmonary hypertension (PAH).2–6 As an important next step for this line of research,7 we propose there is immediate clinical utility in developing a simplified disease likelihood algorithm based on data acquired via a combination of rest and submaximal exercise physiological testing. Our novel algorithm would consider patient responses from a basic resting forced vital capacity (FVC) maneuver in addition to continuous breath-by-breath measurements from a sequence involving 2 min of rest, a 3-min progressive step test, and 1 min of recovery. The overarching goal of such a system is to simplify clinical exercise testing while alleviating burdens such as complex data interpretation.
To determine how well our disease likelihood algorithm identifies different cardiopulmonary disease types, we recruited adult patients with a primary diagnosis of restrictive lung disease (RLD), obstructive lung disease (OLD), chronic HF, or PAH. Algorithm scoring ranked the likelihood of each disease for every patient, and these ranks were then compared with separate patient scoring provided by expert clinician reviewers (i.e. professionals whose primary responsibility is the interpretation of clinical exercise tests). Accordingly, this study tested the hypotheses that: (1) scores provided by our novel disease likelihood scoring algorithm would align with those provided by clinicians; and (2) algorithm scores can be used to properly categorize patients in hierarchical order with respect to the most likely primary diagnosis.
Methods
Study design and patients
Participant characteristics and baseline pulmonary function across groups.
Data presented as means ± SD.
Table P value and χ2 value (dF = 3) represents Kruskal–Wallis H test for the overall group effect for each variable. Pairwise differences for significant Kruskal–Wallis H tests were assessed using Wilcoxon rank-sum tests.
P < 0.05, OLD vs. RLD.
P < 0.05, HF vs. OLD.
P < 0.05, PH vs. RLD.
HF, heart failure; PH, pulmonary hypertension; OLD, obstructive lung disease; RLD, restrictive lung disease; FVC, forced vital capacity; %pred., percent of predicted values; FEV1, forced expiratory volume in 1 s.
Patients performed upright seated pulmonary function testing at rest to assess FVC, forced expiratory volume in 1 s (FEV1), and the FEV1/FVC ratio according to ATS guidelines.8 This was followed by a brief rest period and a submaximal exercise step test. The exercise protocol consisted of three phases: 2 min of standing rest, 3 min of submaximal incremental stepping exercise, and 1 min of recovery. During the exercise phase, the metronome-controlled cadence was increased every minute (60, 80, and 100 beats/min), equivalent to step rates of 15, 20, and 25 steps/min, respectively. During all phases, ventilation and gas exchange were continuously measured breath-by-breath (SHAPE Medical Systems Inc., St. Paul, MN, USA). Heart rate (HR) and rhythm via 12-lead ECG as well as oxygen saturation (SpO2) via forehead pulse oximetry were also continuously monitored.
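The timing of the protocol described above can be written down as a simple schedule. The following Python sketch is illustrative only and is not part of the study software: the `Stage` class, the `stage_at` helper, and the factor of four metronome beats per complete step are assumptions inferred from the stated correspondence between 60, 80, and 100 per minute and 15, 20, and 25 steps/min.

```python
from dataclasses import dataclass

# Assumption: four metronome beats per complete step, inferred from the text
# (60, 80, 100 per minute correspond to 15, 20, 25 steps/min).
BEATS_PER_STEP = 4


@dataclass(frozen=True)
class Stage:
    name: str
    start_s: int        # start time, seconds from test onset
    end_s: int          # end time, seconds (exclusive)
    metronome_bpm: int  # 0 when no stepping is performed

    @property
    def steps_per_min(self) -> float:
        """Step rate implied by the metronome cadence."""
        return self.metronome_bpm / BEATS_PER_STEP


# 2 min standing rest, 3 x 1 min incremental stepping, 1 min recovery
PROTOCOL = [
    Stage("rest", 0, 120, 0),
    Stage("exercise 1", 120, 180, 60),
    Stage("exercise 2", 180, 240, 80),
    Stage("exercise 3", 240, 300, 100),
    Stage("recovery", 300, 360, 0),
]


def stage_at(t_s: float) -> Stage:
    """Return the protocol stage that contains time t_s (seconds)."""
    for stage in PROTOCOL:
        if stage.start_s <= t_s < stage.end_s:
            return stage
    raise ValueError(f"time {t_s} s is outside the 6-min protocol")
```

A helper of this kind could be used to label each breath-by-breath sample with its protocol phase before any phase-specific averaging.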
Key variables in disease pathology likelihood algorithm.
HF, heart failure; PAH, pulmonary arterial hypertension; OLD, obstructive lung disease; RLD, restrictive lung disease; VE/VCO2, ventilatory efficiency; O2p, oxygen pulse: oxygen consumption/heart rate; VO2, oxygen consumption; OUES, oxygen uptake efficiency slope; CircEq VO2 % pred, circulatory equivalent oxygen consumption; HR, heart rate; MPIph, multi-parameter index for pulmonary hypertension; GxCap, pulmonary capacitance: oxygen pulse × the partial pressure of end tidal CO2; SpO2, oxygen saturation; FEV1, forced expiratory volume in 1 s; PECO2, the partial pressure of mean expired CO2; PETCO2, the partial pressure of end tidal CO2; FVC % pred, % predicted of forced vital capacity.
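Two of the variables listed above are defined by simple arithmetic: oxygen pulse is oxygen consumption divided by heart rate, and GxCap is oxygen pulse multiplied by end-tidal CO2 pressure. A minimal Python sketch of those two definitions follows; the function names and the units (mL/min, beats/min, mmHg) are assumptions made for illustration, and the other variables in the table are not reproduced because their formulas are not given here.

```python
def oxygen_pulse(vo2_ml_min: float, hr_bpm: float) -> float:
    """Oxygen pulse (mL/beat) = oxygen consumption / heart rate."""
    if hr_bpm <= 0:
        raise ValueError("heart rate must be positive")
    return vo2_ml_min / hr_bpm


def gx_cap(vo2_ml_min: float, hr_bpm: float, petco2_mmhg: float) -> float:
    """GxCap = oxygen pulse x partial pressure of end-tidal CO2."""
    return oxygen_pulse(vo2_ml_min, hr_bpm) * petco2_mmhg
```

For example, with hypothetical values of VO2 = 1200 mL/min, HR = 100 beats/min, and PETCO2 = 35 mmHg, oxygen pulse is 12 mL/beat and GxCap is 420 in the corresponding product units.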
Using the silo system, scores representing each of the four disease categories were recorded for each participant. Importantly, the silo algorithm was designed to weight scores incrementally, from the most to the least likely of the primary diseases for a given participant. Additionally, using the same scoring scale, separate weighted scores for each of the four disease stratifications were recorded for each participant by three separate expert clinicians from the Mayo Clinic Cardiovascular Health Clinic stress testing practice. Thus, clinician scores were considered the “criterion” method, whereas the silo scoring system was considered the “practical” method to be tested for validity and reliability in identifying patients with HF, PAH, OLD, or RLD. Clinician experts in our stress practice typically review over 50 stress tests daily across a wide spectrum of patient disease etiologies.
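The comparison between the practical and criterion methods can be pictured as follows: each patient receives four scores from the algorithm and four from each clinician, the clinician scores are averaged, and the diseases are ordered from most to least likely. The Python sketch below illustrates only that bookkeeping. The numeric scores and the scale are hypothetical, and the sketch does not reproduce how the silo algorithm derives its scores.

```python
from statistics import mean

DISEASES = ("HF", "PAH", "OLD", "RLD")


def rank_diseases(scores: dict) -> list:
    """Order diseases from most to least likely (highest score first)."""
    return sorted(DISEASES, key=lambda d: scores[d], reverse=True)


def clinician_mean(ratings: list) -> dict:
    """Average the per-disease scores given by several clinicians."""
    return {d: mean(r[d] for r in ratings) for d in DISEASES}


# Hypothetical scores for a single patient (illustrative scale only)
silo = {"HF": 8, "PAH": 5, "OLD": 2, "RLD": 3}
clinicians = [
    {"HF": 9, "PAH": 4, "OLD": 1, "RLD": 3},
    {"HF": 7, "PAH": 5, "OLD": 2, "RLD": 2},
    {"HF": 8, "PAH": 6, "OLD": 3, "RLD": 4},
]

criterion = clinician_mean(clinicians)
# True when both methods place the same disease first
same_primary = rank_diseases(silo)[0] == rank_diseases(criterion)[0]
```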
Statistical analyses
Where applicable, data are presented as means ± SD. Relationships between silo and clinician scores for each group were assessed using Pearson product moment correlation tests. Mean differences between silo and clinician scoring across disease stratifications were tested using two-factor mixed model analysis of variance (ANOVA) models, which included silo-by-clinician interaction terms for each model. In the event of significant interaction terms, Tukey–Kramer post-hoc corrections were performed to identify where between-within significance occurred. Validity was determined using the following indices: standard error of estimate (SEE) with 95% confidence limits (CL); mean bias (mean difference between silo and mean clinician scores) with 95% limits of agreement (LOA, ± 1.96 × SD of differences between scores);9 and Pearson product moment correlation coefficient (r) with 95% CL. We interpreted the magnitude of r values based on the thresholds of Cohen10 as follows: small = 0.10; medium = 0.30; and large = 0.50; whereby larger is better. Interrater reliability was determined by intraclass correlation coefficients (ICC = σ²B/(σ²B + σ²W), where σ²B is the between variance and σ²W is the within variance) with 95% CL as well as standard error of measurement (St. EM = SD × √(1 − ICC)) with 95% CL.11,12 Interpretation of ICC was determined using the following thresholds: poor < 0.40; moderate 0.40 to < 0.74; excellent > 0.75. Statistical significance for all two-tailed tests was set at an alpha level of 0.05. Statistical analyses were performed using SPSS (version 22.0).
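The validity and reliability indices named above follow standard formulas, sketched here in plain Python for illustration. This is not the SPSS procedure used in the study. The function names are hypothetical, the SEE is assumed to come from a simple linear regression of criterion on practical scores, and the ICC is assumed to be a one-way random-effects estimate, since the ICC model is not specified in the text.

```python
import math
from statistics import mean, stdev


def pearson_r(x, y):
    """Pearson product moment correlation coefficient."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def see(x, y):
    """Standard error of estimate for a linear regression of y on x."""
    my = mean(y)
    syy = sum((b - my) ** 2 for b in y)
    return math.sqrt(syy * (1 - pearson_r(x, y) ** 2) / (len(x) - 2))


def bias_and_loa(practical, criterion):
    """Mean bias and 95% limits of agreement (bias +/- 1.96 x SD of diffs)."""
    diffs = [p - c for p, c in zip(practical, criterion)]
    bias = mean(diffs)
    half_width = 1.96 * stdev(diffs)
    return bias, bias - half_width, bias + half_width


def icc_oneway(ratings):
    """ICC = var_between / (var_between + var_within), one-way model.

    ratings: one list of k scores per patient (e.g. clinicians plus silo).
    """
    n, k = len(ratings), len(ratings[0])
    grand = mean(v for row in ratings for v in row)
    ms_between = k * sum((mean(row) - grand) ** 2 for row in ratings) / (n - 1)
    ms_within = sum(
        (v - mean(row)) ** 2 for row in ratings for v in row
    ) / (n * (k - 1))
    var_between = max((ms_between - ms_within) / k, 0.0)
    return var_between / (var_between + ms_within)


def standard_error_of_measurement(sd, icc):
    """St. EM = SD x sqrt(1 - ICC)."""
    return sd * math.sqrt(1 - icc)
```

As a quick check of the last formula, an SD of 2.0 with an ICC of 0.75 gives a standard error of measurement of 1.0.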
Results
All participants were able to complete exercise testing as described above. At end exercise, there were no significant differences for respiratory exchange ratio (HF versus PAH versus OLD versus RLD; 0.92 ± 0.09 versus 0.92 ± 0.06 versus 0.97 ± 0.14 versus 1.05 ± 0.23, respectively; P > 0.05), HR (HF versus PAH versus OLD versus RLD; 102 ± 18 versus 115 ± 16 versus 115 ± 14 versus 108 ± 16 bpm, respectively; P > 0.05), relative VO2 (HF versus PAH versus OLD versus RLD; 15.2 ± 4.2 versus 16.6 ± 4.8 versus 16.6 ± 3.7 versus 13.9 ± 4.3 mL/kg/min, respectively; P > 0.05), and the ratio of ventilation to maximal voluntary ventilation (HF versus PAH versus OLD versus RLD; 0.43 ± 0.16 versus 0.57 ± 0.15 versus 0.44 ± 0.21 versus 0.51 ± 0.17, respectively; P > 0.05).
For baseline pulmonary function, HF and RLD demonstrated significantly lower FVC than PAH and OLD (both P < 0.01), whereas there were no significant differences in FEV1 across groups (P > 0.05). Consequently, HF and RLD demonstrated higher FEV1/FVC when compared to PAH and OLD (P < 0.01, Table 1).
Figure 1 illustrates there was close alignment between silo and clinician scoring for the likelihood of each possible diagnosis. For HF, the score relationship was strong between silo and clinician (r = 0.89, P < 0.01; Fig. 1a). Scoring for PAH between silo and clinician was also strong (r = 0.88, P < 0.01; Fig. 1b). Likewise, scoring between silo and clinician was strong for both OLD (r = 0.70, P < 0.01) and RLD (r = 0.88, P < 0.01; Fig. 1c and d, respectively).
Relationship between silo score and scores by the reviewers for entire group (n = 47). (a) HF (r = 0.89, P < 0.01); (b) PAH (r = 0.88, P < 0.01); (c) OLD (r = 0.70, P < 0.01); and (d) RLD (r = 0.88, P < 0.01).
Validity indices of patient groups scored by clinicians (criterion) vs. silo algorithm (practical).
Validity indices reported with 95% confidence limits in parentheses. Validity indices were calculated using clinician mean scores (criterion) against corresponding silo scores (practical). Rows with dashes correspond with data in the top row for each patient group indicating the respective silo (no.) that was set to predict the primary likelihood of that patient condition (e.g. for HF, silo 1, indicated in parentheses as (1), was set to primarily predict the likelihood of HF, and therefore validity data for that Group-Silo pairing have been placed on the top row for patients with HF). There are validity scores for each condition within a given patient stratification because, irrespective of patient group, both clinicians and the silo algorithm scored patients for the likelihood of having each of the four conditions.
HF, heart failure; PAH, pulmonary hypertension; OLD, obstructive lung disease; RLD, restrictive lung disease; SEE, standard error of estimate; Std. SEE, standardized standard error of estimate; r, Pearson product moment correlation coefficient; bias, mean bias = mean difference between values.
Interrater reliability indices of patient groups scored by clinicians and silo algorithm.
Reliability indices reported with 95% confidence limits in parentheses. Interrater reliability indices were calculated using each clinician score as well as the corresponding silo score. Rows with dashes correspond with data in the top row for each patient group indicating the respective silo (no.) that was set to predict the primary likelihood of that patient condition (e.g. for HF, silo 1, indicated in parentheses as (1), was set to primarily predict the likelihood of HF, and therefore reliability data for that Group-Silo pairing have been placed on the top row for patients with HF). There are reliability scores for each condition within a given patient stratification because, irrespective of patient group, both clinicians and the silo algorithm scored patients for the likelihood of having each of the four conditions.
P < 0.0001.
P < 0.05.
HF, heart failure; PAH, pulmonary hypertension; OLD, obstructive lung disease; RLD, restrictive lung disease; ICC, intraclass correlation coefficient; St. EM, standard error of measurement; Std. St. EM, standardized standard error of measurement.
Score distribution patterns across diseases were similar between silo and clinician scores. In addition to being able to properly score the likelihood of a primary disease, each disease silo was capable of discretely differentiating the likelihood of a given secondary disease. As such, the overall score distribution pattern for the likelihood of HF was not different between silo and clinician (P > 0.05; Fig. 2a). Similar outcomes were observed for PAH (P > 0.05; Fig. 2b), OLD (P > 0.05; Fig. 2c), and RLD (P > 0.05; Fig. 2d).
The pattern of score distribution across disease in silo and the reviewers: HF; PAH; OLD; and RLD.
Finally, and consistent with observations above, for the HF silo in Fig. 2a, the diagnosis of HF demonstrated the highest score for both silo and clinicians (P < 0.05). Similar results were also observed for PAH (P < 0.05; Fig. 2b) and RLD (P < 0.05; Fig. 2d). However, OLD did not demonstrate a “highest” score for the OLD silo (P > 0.05; Fig. 2c).
Discussion
The present study tested how well a novel automated algorithm (i.e. silo scores), based on variables acquired from a brief and simple cardiopulmonary test, is able to properly differentiate primary and secondary disease types compared with similar assessments made by trained clinicians. These data suggest that for each silo (i.e. HF, PAH, OLD, and RLD), output scores closely correlated with scores provided by clinicians, which was accompanied by similar disease score distributions across methods. In general, the present observations suggest our automated cardiorespiratory algorithm consistently demonstrated close scoring alignment with primary patient disease class stratifications identified by experienced clinicians.
The observations from this study support the intent of our original disease likelihood/severity scoring algorithm, which was designed to simplify exercise testing implementation and interpretation for routine use in clinical and/or laboratory settings. These data are consistent with our previous reports showing that for a spectrum of cardiopulmonary diseases, clinical assessment of functional capacity does not always require maximal exercise testing.2,5,6,13 The present observations suggest that changes in key cardiopulmonary measurements can be detected with relatively mild to moderate levels of physical exertion. An additional strength of our proposed algorithm system is that submaximal exercise testing may reduce the risk of patient events, and with it the need for oversight and in-room personnel. Lastly, the physiological assessment tool in the present study illustrates how clinical exercise testing may be performed routinely without time-consuming data post-processing and advanced computations. Thus, with the present system, it may be possible to reduce the “extra layer” of interpretive expertise needed.
Our findings suggest that an automated scoring algorithm based on patterns of cardiopulmonary responses to a modified step-test exercise protocol of mild to moderate physical exertion may be used to track common patient clinical phenotypes including HF, PAH, RLD, or OLD. Given our sample size, we acknowledge that these observations are preliminary and cannot yet establish the wide-scale real-world clinical efficacy of our novel approach. Still, these data suggest there is immediate potential for applying the present simplified testing paradigm in less traditional clinical settings, where rapid feedback can help guide clinical decision making and offer a simpler way to track patients with chronic diseases and determine the need for more expert intervention.
Although it is well recognized that diseases of the cardiopulmonary system are not exclusively centralized to cardiac versus pulmonary limitations in patients, it remains a challenge for clinicians to routinely and quickly differentiate contributions from primary versus secondary underlying disease processes, as is the case, for example, in patients with HF, PAH secondary to HF, or PAH without HF. Therefore, it is promising that, although these conditions frequently share similar signs and symptoms, these data suggest the current approach may be helpful for separating out conditions that align closely. Moreover, it is important to highlight that the proposed value of this clinical exercise testing method lies in using a pseudo “activities of daily living” exercise setting (e.g. climbing stairs) where signs and symptoms of the present diseases are well acknowledged to be commonly exacerbated in patients.
In contrast to the proposed value of the present exercise testing method aimed at identifying primary disease in patients with HF, PAH, OLD, or RLD, we acknowledge that calibration of sub-condition OLD scoring with clinician scores did not consistently align within each stratum. This was accompanied by an inability of the silo algorithm to consistently and clearly distinguish OLD from other diseases. Therefore, our future direction will be to study a larger patient sample, including a broader spectrum of OLD severity, which we expect will strengthen the capability of our algorithm to identify this specific disease. Nevertheless, despite this study limitation, it is important to note that the chronic conditions tested in this study do not tend to exist by themselves and, hence, a clear distinction in patients with variable mild or modest disease is more difficult. With this, we suggest this may be a strength of our algorithm since output scores provide a more inclusive picture of “reality” rather than trying to suggest there is a single limiting factor during activity.
Footnotes
Clinical implications
Clinical CPET produces a complex array of measurements that can be integrated into algorithms allowing for a more simplified approach to screen and track patients. Ultimately, with simplified protocols, this allows CPET to move closer to a “point of care” approach or essentially a vital sign that could be pursued in primary care settings guiding the need and direction for sending patients to subspecialty clinics.
Conflict of interest
The author(s) declare that there is no conflict of interest.
Funding
This research received no specific grant from any funding agency in the public, commercial, or not-for-profit sectors.
In memoriam
We are grateful to Dr. James ‘Jim’ E. Hansen, Professor Emeritus at UCLA Harbor, for his contributions to this study. Dr. Hansen was supportive of sub-maximal exercise gas exchange testing to advance its clinical application. James Hansen passed away in May of 2017. He authored numerous publications and a textbook series on cardiopulmonary testing and its clinical utility in heart, lung and pulmonary vascular disease.
