Chronic obstructive pulmonary disease phenotypes using cluster analysis of electronic medical records

Abstract

Chronic obstructive pulmonary disease is a heterogeneous disease. In this retrospective study, we hypothesize that it is possible to identify clinically relevant phenotypes by applying clustering methods to electronic medical records. We included all the patients >40 years with a diagnosis of chronic obstructive pulmonary disease admitted to the University of New Mexico Hospital between 1 January 2011 and 1 May 2014. We collected admissions, demographics, comorbidities, severity markers and treatments. A total of 3144 patients met the inclusion criteria: 46 percent were >65 years and 52 percent were males. The median Charlson score was 2 (interquartile range: 1–4) and the most frequent comorbidities were depression (36%), congestive heart failure (25%), obesity (19%), cancer (19%) and mild liver disease (18%). Using the sphere exclusion method, nine clusters were obtained: depression–chronic obstructive pulmonary disease, coronary artery disease–chronic obstructive pulmonary disease, cerebrovascular disease–chronic obstructive pulmonary disease, malignancy–chronic obstructive pulmonary disease, advanced malignancy–chronic obstructive pulmonary disease, diabetes mellitus–chronic kidney disease–chronic obstructive pulmonary disease, young age–few comorbidities–high readmission rates–chronic obstructive pulmonary disease, atopy–chronic obstructive pulmonary disease, and advanced disease–chronic obstructive pulmonary disease. These clusters will need to be validated prospectively.

Keywords

asthma comorbidity chronic obstructive pulmonary disease epidemiology factor analysis phenotype

Introduction

Chronic obstructive pulmonary disease (COPD) is a heterogeneous disease characterized by persistent airflow limitation. It is caused by the inhalation of cigarette smoke and other noxious particles and gases. Mortality rates are 40.8 per 100,000 United States inhabitants every year, and as of 2010, chronic respiratory diseases were the fourth leading cause of death in the United States and are projected to be the third by 2020.^1,2

Until recently, international guidelines based specific treatment recommendations solely on airway obstruction, quality of life, number of exacerbations and exercise capacity, oversimplifying in practice a very heterogeneous group of patients. This approach has improved symptoms and decreased the number of COPD exacerbations, but its impact on survival has been disappointing.¹ The main causes of death in patients with respiratory conditions are cardiovascular disease and cancer,^3,4 with other comorbidities also playing an important role.⁵ Therefore, in its latest revision, the Global Initiative for Chronic Obstructive Lung Disease started providing guidance for the management of common comorbidities. However, its recommendations are limited by the lack of disease-specific outcome studies.¹

Several studies have attempted to capture the heterogeneity of COPD patients using clustering techniques in order to describe phenotypes, provide more personalized therapies and outline possible pathophysiological links.^6–13 However, most of them have had restrictive inclusion criteria and small sample sizes, have relied on highly specialized measurements and have rarely included United States subjects. The motivation of our study is to fill the gap left by previous similar studies by adding clinically relevant COPD phenotype categories¹⁴ using cluster analyses on readily available electronic medical record data. This can constitute the first step toward stratified treatment in patients with COPD.¹⁵

Materials and methods

Study location and patient population

This retrospective analysis included all the patients older than 40 years, admitted to the University of New Mexico Hospital, a 580-bed University tertiary hospital, between 1 January 2011 and 1 May 2014 and carrying a diagnosis of COPD (ICD9 codes: 490, 491, 492 or 496), regardless of their primary admission diagnosis.¹⁶

This study was conducted in accordance with the amended Declaration of Helsinki. The University of New Mexico Health Science Center review board approved the protocol and waived the need for informed consent, protocol number: 14-312.

Study design and data collection

We used i2b2, a de-identified replica of our hospital medical records system that includes data on diagnoses, procedures, prescriptions, hospital admissions and laboratory results.¹⁷

Our data collection included the following: demographics, comorbidities included in Charlson’s comorbidity index, presence of atopy, obesity, number of admissions, prescriptions for inhalers grouped as short-acting beta-agonists, long-acting beta-agonists, anticholinergics, steroids and their combinations, and prescriptions for oral steroids, beta-blockers and statins. We collected comorbidities according to previously validated methods.¹⁸ The i2b2 database does not include pulmonary function tests. To capture the severity of disease, we included weight loss¹⁹ and elevated plasma bicarbonate²⁰ among the variables collected. All the variables, including age (40–65 years and >65 years) and number of admissions (one admission and ≥two admissions), were coded as binary for the analysis. The denominator for the number of admissions was the duration of the study.
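The binary coding described above can be illustrated with a minimal sketch. The function name `code_patient`, the dictionary keys and the example values are hypothetical and are not taken from the study database; only the cut-offs (age >65 years, ≥2 admissions) come from the text.

```python
def code_patient(age_years, n_admissions, diagnoses, candidate_variables):
    """Return a dict of 0/1 indicators for one patient.

    All names here are illustrative assumptions; only the cut-offs
    (age > 65 years, admissions >= 2) follow the Methods text.
    """
    coded = {
        "age_over_65": int(age_years > 65),
        "admissions_ge_2": int(n_admissions >= 2),
    }
    # Every other variable is coded 1 if present and 0 if absent.
    for name in candidate_variables:
        coded[name] = int(name in diagnoses)
    return coded


# Hypothetical patient: 71 years old, three admissions, depression recorded.
example = code_patient(71, 3, {"depression"}, ["depression", "obesity"])
print(example)
```

Coding every variable as 0 or 1 means that each patient becomes a fixed-length binary vector, which is the input that a similarity-based clustering method needs.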

Data analysis

Cluster analysis is a set of methodologies that group objects (e.g. patients) based on their characteristics. We used the sphere exclusion method, which has applications in cheminformatics, bioinformatics and pattern recognition.^21,22 It is a disjoint, similarity-based method; that is, a patient can belong to only one cluster, and the measure used for grouping is similarity. In a multidimensional space with as many dimensions as variables, the distance metric between individuals is dissimilarity, which is the complement of similarity (1 − similarity). By definition, similarity ranges from 0, if all the variables are different, to 1, if they are all equal.
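The text does not name the similarity coefficient that was used. As a hedged illustration only, the sketch below assumes the simple matching coefficient (the fraction of binary variables on which two patients agree), which satisfies the stated property of being 0 when all variables differ and 1 when all are equal. Other coefficients common in cheminformatics, such as Tanimoto, behave differently. The function names are hypothetical.

```python
def simple_matching_similarity(a, b):
    """Fraction of positions at which two binary profiles agree.

    Assumption: the paper does not state its coefficient; simple
    matching is used here purely for illustration.
    """
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("profiles must be non-empty and of equal length")
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / len(a)


def dissimilarity(a, b):
    """Distance between two profiles, defined as 1 - similarity."""
    return 1.0 - simple_matching_similarity(a, b)


# Two hypothetical patients described by four binary variables.
p1 = [1, 0, 1, 0]
p2 = [1, 1, 1, 1]
print(simple_matching_similarity(p1, p2), dissimilarity(p1, p2))
```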

In sphere exclusion, the only input needed from the analyst is a similarity threshold. The algorithm first computes the similarity between all pairs of individuals. It then chooses the individual with the most “neighbors” within the specified similarity cut-off. These individuals form the first cluster and are excluded from further analysis. The process is then repeated iteratively until the only individuals left are singletons, that is, individuals without neighbors.^21–23
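The steps above can be sketched in a few lines of Python. This is a minimal illustration and not the MATLAB code used in the study: the simple matching similarity, the tie-breaking rule (lowest index wins) and all function and variable names are assumptions made for the example.

```python
def matching_similarity(a, b):
    """Assumed similarity: fraction of binary variables that agree."""
    return sum(1 for x, y in zip(a, b) if x == y) / len(a)


def sphere_exclusion(profiles, threshold, similarity=matching_similarity):
    """Disjoint clustering by repeated exclusion of the densest 'sphere'.

    Returns (clusters, singletons) as lists of row indices.
    """
    n = len(profiles)
    # Step 1: similarity between all pairs of individuals.
    sim = [[similarity(profiles[i], profiles[j]) for j in range(n)]
           for i in range(n)]
    remaining = set(range(n))
    clusters = []
    while remaining:
        # Step 2: find the individual with the most neighbors at or
        # above the threshold (ties go to the lowest index, an assumption).
        best_center, best_neighbors = None, []
        for i in sorted(remaining):
            neighbors = [j for j in remaining
                         if j != i and sim[i][j] >= threshold]
            if len(neighbors) > len(best_neighbors):
                best_center, best_neighbors = i, neighbors
        if best_center is None:
            break  # nobody has a neighbor: only singletons remain
        # Step 3: the center and its neighbors form a cluster and are excluded.
        members = sorted([best_center] + best_neighbors)
        clusters.append(members)
        remaining -= set(members)
    return clusters, sorted(remaining)


# Six hypothetical patients with four binary variables each.
profiles = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [1, 0, 1, 0],
]
clusters, singletons = sphere_exclusion(profiles, threshold=0.75)
print(clusters, singletons)
```

In this toy example the first three patients form one cluster, the next two form a second, and the last patient has no neighbor at the chosen threshold and is left as a singleton.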

To choose the optimal similarity threshold, the clustering algorithm is run over a range of similarity thresholds without excluding individuals at any step. In this case, each subject can belong to more than one cluster. For this data set, we found the optimal balance between the number of clusters and clustering overlap at a similarity threshold of 0.62.²³ Plotting the Dunn index²⁴ and correlation coefficient²⁵ for different similarity thresholds provided consistent results (Figure 1).

Figure 1.

Dunn index and clustering coefficient against similarity.
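For readers unfamiliar with the Dunn index plotted in Figure 1, the sketch below computes its classic form for a given partition: the smallest distance between points in different clusters divided by the largest distance between points in the same cluster. The distance function (fraction of mismatching binary variables) and the toy data are assumptions; the paper does not state which variant of the index was computed.

```python
from itertools import combinations


def mismatch_distance(a, b):
    """Assumed distance: fraction of binary variables that differ."""
    return sum(1 for x, y in zip(a, b) if x != y) / len(a)


def dunn_index(profiles, clusters, distance=mismatch_distance):
    """Minimum between-cluster distance over maximum within-cluster diameter."""
    if len(clusters) < 2:
        raise ValueError("the Dunn index needs at least two clusters")
    # Largest distance between any two members of the same cluster.
    max_diameter = max(
        (distance(profiles[i], profiles[j])
         for members in clusters
         for i, j in combinations(members, 2)),
        default=0.0,
    )
    # Smallest distance between members of two different clusters.
    min_separation = min(
        distance(profiles[i], profiles[j])
        for c1, c2 in combinations(clusters, 2)
        for i in c1
        for j in c2
    )
    if max_diameter == 0:
        return float("inf")  # all clusters are perfectly compact
    return min_separation / max_diameter


# Five hypothetical patients grouped into two clusters.
profiles = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
score = dunn_index(profiles, [[0, 1, 2], [3, 4]])
print(score)
```

Higher values indicate compact, well-separated clusters, which is why the index can be compared across candidate similarity thresholds.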

All the collected variables were candidates for the clustering algorithm. Number of admissions was also included given the clinical relevance of the frequent exacerbation phenotype.¹

After applying factor analysis to exclude inter-correlated variables, 24 common variables (out of 40 candidate variables) were selected with 10 latent variables (p = 0.54 for the H₀ model with 10 latent variables). The relevant variables were as follows: age, ICD9-CM codes 496 and 490, congestive heart failure, cerebrovascular disease, myocardial infarction, diabetes mellitus (DM) with complications, chronic kidney disease (CKD), obesity, depression, dementia, severe liver disease, plegias, rheumatologic disease, atopy, diagnosis of cancer, prescription for anticholinergic bronchodilators, prescription for fluticasone–salmeterol, prescription for albuterol–ipratropium, prescription for non-cardioselective beta-blocker, prescription for salmeterol, bicarbonate level >30 mEq/L, weight loss and number of admissions ≥2 (Appendix 1).

Analyses were computed using STATA/SE 13.1 (StataCorp LP, College Station, TX, USA) and MATLAB with the statistical toolbox installed (MATLAB version 8.3.0.532 (R2014a), Natick, MA, USA: The MathWorks Inc., 2014).

Results

A total of 3144 patients met the inclusion criteria. Of these, 1436 (46%) were older than 65 years, 1636 (52%) were males and 1745 (55.5%) had the ICD9 code 496. The median Charlson score was 2 (interquartile range (IQR): 1–4), with the most frequent comorbidities being depression (36%), congestive heart failure (25%), obesity (19%), cancer (19%) and mild liver disease (18%) (Table 1).

Table 1.

Demographics and general descriptors.

	All subjects (n = 3144)
Age greater than 65 years	1436 (45.7%)
Male (%)	1636 (52%)
COPD ICD9 code
ICD9 = 490	421 (13.4%)
ICD9 = 491	671 (21.3%)
ICD9 = 492	307 (9.8%)
ICD9 = 496	1745 (55.5%)
Number of comorbidities	2.4 ± 1.7
Charlson, median (IQR)	2 (1–4)
Admission ≥2	1603 (51%)
Advanced disease
Weight loss	413 (13.1%)
Bicarbonate > 30 mEq/L	300 (9.5%)

IQR: interquartile range; COPD: chronic obstructive pulmonary disease.

We obtained nine clusters with 189 patients remaining as outliers (Figure 2). The characteristics of each one of the clusters as compared to the rest are detailed below.

Figure 2.

Grayscale heat map with the results of cluster analysis (white = 0%, black = 100%). Clusters are represented in the horizontal axis: 1: depression–COPD, 2: malignancy–COPD, 3: coronary artery disease–COPD, 4: young age–low comorbidity–high readmission–COPD, 5: advanced malignancy–COPD, 6: cerebrovascular disease–COPD, 7: atopy–COPD, 8: DM–CKD–COPD and 9: advanced disease–COPD.

Cluster 1, the largest, contains 1748 patients and is characterized by depression and a large proportion of patients older than 65 years. These patients have relatively few comorbidities, without a clear pattern, and a Charlson score of 2 (IQR: 1–3).

We found two “malignancy” clusters: cluster 2 with 312 patients, few comorbidities and low number of readmissions and cluster 5 with 144 patients, signs of advanced disease and frequent readmissions.

We also identified two “cardiovascular” clusters: cluster 3, with 291 patients, a significant proportion of patients older than 65 years and predominantly coronary artery disease and congestive heart failure; and cluster 6, with 120 patients, a higher proportion of patients younger than 65 years and predominantly cerebrovascular disease.

Among the remaining clusters, cluster 4 has 152 patients, most of them younger than 65 years, with few comorbidities, the highest number of prescriptions for bronchodilators and frequent readmissions. Cluster 7 includes 81 patients, the majority younger than 65 years, with asthma/atopy and many readmissions. Cluster 8 has 64 patients, younger than 65 years, who suffer from CKD or diabetes and have few readmissions. Cluster 9, with 41 patients, has a high prevalence of signs of advanced disease and frequent readmissions. A classification rule consisting of a series of decisions that replicates the clustering algorithm can be found in Appendix 1 and Tables 3 and 4.

Discussion

In this study, we showed that patients with a diagnosis of COPD admitted to our hospital can be divided into clinically relevant phenotypes. Based on readily available data and using cluster analysis methodology, we obtained nine phenotypes, with only 6 percent of the patients as outliers. We also derived a classification rule for use in future validation and clinical practice. Some of our phenotypes confirm previously known phenotypes obtained using different methodologies (Figure 2, Table 5). Given our sample size and more inclusive criteria, we also observed new phenotype categories, not previously described, that may further stratify the COPD patient population.

The development of this classification scheme for patients with COPD can be used to generate new phenotype-specific outcomes and interventions.

The first and largest cluster describes the COPD–depression phenotype. Because the symptoms of the two conditions are similar, regular screening tools for depression have limited validity in COPD.²⁶ In pulmonary practices, the prevalence of depression in COPD patients is estimated at 40 percent, and this increases with the severity of ventilatory obstruction as measured by the forced expiratory volume in first second (FEV1).²⁷ The relationship between depression and COPD is complex, as depression can further worsen the social isolation, mobility impairment and quality of life of COPD patients. There are data suggesting that COPD precedes depression. In a prospective cohort study, the relative risk for developing depression 2 years after a new diagnosis of COPD was estimated at 2.21 (95% confidence interval: 1.64–2.97).²⁸ It is also known that there is a primary association between depressive symptoms and smoking and that depression severely limits the effectiveness of any smoking cessation intervention.²⁸ Understanding the COPD–depression phenotype could help develop COPD-specific depression screening tools and evaluate the effectiveness of preventive and therapeutic strategies.

Very relevant from a clinical perspective are the COPD–cardiovascular disease phenotypes. One is dominated by cerebrovascular disease and the other by coronary artery disease. The two phenotypes have different secondary prevention strategies and therapeutic needs, making the distinction clinically relevant.²⁹ The strong association between COPD and cardiovascular disease has been observed using different methodologies. In the Lung Health Study, a prospective cohort, more than 30 percent of the deaths were related to cardiovascular disease.⁴ In terms of chronology, members of our study group described trajectories of disease in a population-wide data registry with 6.2 million individuals. They found that all the trajectories starting with a diagnosis of atherosclerosis were followed by the diagnosis of COPD, supporting a pathophysiological link and a temporal relationship.³⁰ Five other studies using either only comorbidities or more complex data sets have each identified at least one cluster characterized by the presence of cardiovascular disease.^6–8,10,11 This pattern has generated great interest in discovering the underlying pathophysiology that leads to COPD after the onset of atherosclerosis, and so far, the common link is attributed to inflammation.

Another phenotype previously described in the literature and confirmed in our cohort is the COPD–asthma overlap. Using different sets of variables, at least three studies that employed cluster analysis identified this phenotype.^6,12,31 The COPD–asthma phenotype is of special interest as it highlights a subpopulation of patients, usually excluded from therapeutic clinical trials, who have a poor quality of life and consume a disproportionate amount of healthcare resources.¹

Our analysis also revealed five phenotypes not previously described using cluster analysis, and we feel these phenotypes are very common and relevant in clinical practice. All of them highlight the disconnection between most COPD studies and the real-life COPD spectrum seen in hospital wards and outpatient clinics. One of the malignancy clusters (number 5) incorporates patients with signs of more advanced disease, whereas cluster number 2 includes patients who rarely require hospital readmission despite a diagnosis of cancer. The link between malignancy and COPD has been well established, extending beyond lung cancers to extra-pulmonary cancers.^3,4 In the Danish health registry, the overall risk for cancer was elevated in COPD independent of comorbidities, suggesting a different pathogenesis than, for example, cardiovascular disease.³² Regarding the biological basis for these malignancy phenotypes, it has been noted that smokers overexpress repair genes and oncogenes and underexpress tumor suppression genes.³³ Although the expression of repair genes returns to normal several years after smoking cessation, oncogenes and tumor suppression genes continue to be altered decades later. The extent to which the expression of these genes is deranged may determine which individuals develop cancer versus other comorbidities.³⁴ From a therapeutic perspective, COPD patients with malignancies tend to receive limited treatment. Grouping them into a more homogeneous phenotype and providing access to care in integrated clinics may lead to better outcomes. This model of healthcare delivery has been successfully used in other pulmonary diseases such as cystic fibrosis.³⁵

A third phenotype not previously reported in the literature is the chronic kidney disease–diabetes mellitus (CKD-DM)–COPD cluster. We note the low readmission rates of this phenotype. Interestingly, in previous studies, both CKD and DM have been associated with hospital readmissions in patients admitted with any diagnosis, but this does not hold true for COPD patients.³⁶ To what extent this characteristic is attributable to their model of care or to a common underlying mechanism of disease is unknown. For example, DM affects the lung, with reductions in carbon monoxide diffusion capacity, FEV1 and forced vital capacity (FVC) that show a dose–response relationship with fasting plasma glucose.³⁷ Furthermore, a recent study reported improved asthma control in patients treated with thiazolidinediones.³⁸

Another new phenotype described in our cohort is the “advanced COPD phenotype,” represented by cluster 9. These patients, who have more advanced disease based on the frequency of weight loss and elevated serum bicarbonate, are also readmitted frequently.¹⁹ One potential intervention would be the initiation of end-of-life discussions, especially if higher mortality is associated with this phenotype.

Finally, cluster number 4 can intuitively be labeled as “COPD Resistant to Treatment” or “COPD Non-Compliant.”^39,40 Although these patients receive the highest number of prescriptions, they accrue very frequent readmissions. Since information on compliance was not available in our database, we were unable to differentiate between the two possible explanations.

Eight other studies have used similar clustering techniques to group COPD patients into phenotypes. They differ in the number and selection of individuals, the choice and number of variables and in the clustering techniques utilized. Choices at each of these levels can be expected to influence the results. This is reflected in Table 2 with each study describing phenotypes according to the variables selected for analysis.

Table 2.

Previous articles with >100 patients using cluster analysis and phenotypes as described by the authors.

	Number of patients	Population, country	Variables used for clustering	Clusters obtained
Baty et al.⁶	340,948	Population based	Comorbidities	1. Asthma/COPD
		Sweden		2. Anxiety/depression
				3. Malignancy
				4. Heart failure
				5. Coronary artery disease
Burgel et al.⁷	322	Pulmonary units	Age, smoking, airflow obstruction, exacerbations, BMI, QoL, anxiety, depression, MMRC	1. Young severe respiratory disease
		France		2. Older, mild respiratory disease, mild comorbidities
				3. Young, severe disease, mildly symptomatic
				4. Older, severe respiratory disease, severe comorbidities
Burgel et al.⁸	527	COPD clinics + NELSON^a	Comorbidities, COPD physiologic data	1. Low mortality, low comorbidities
		Belgium, Netherlands		2. Young, severe emphysema, low BMI, low comorbidities
				3. Older, less severe disease, high BMI, comorbidities
Fens et al.⁹	157	NELSON^a	pbFEV1, FEV1 reversibility, chronic bronchitis, coronary artery disease, BMI, dyspnea at rest, packs year, LABA, eNose signature, emphysema score	1. Mild COPD, minimal symptoms, good QoL
		Netherlands		2. Poor lung function, emphysema or chronic bronchitis and eNOSE
				3. Emphysema, preserved lung function
				4. Severe symptoms, mild lung disease
Vanfleteren et al.¹⁰	213	Pulmonary rehabilitation	Comorbidities	1. Less comorbidities
		Netherlands		2. Cardiovascular
				3. Cachectic
				4. Metabolic
				5. Psychological
Garcia-Aymerich et al.¹¹	342	COPD admissions	Symptoms, sputum microbiology, QoL, radiological emphysema score, PFT, nutrition, inflammation biomarkers (total of 224 variables)	1. Severe airflow limitation and poor respiratory performance
		Spain		2. Mild airflow limitation
				3. Mild airflow limitation and high BMI
Weatherall et al.¹²	175	Population based	pbFEV1/FVC %, pbFEV1%, FEV1 change %, FRC%, DLCO%, IgE serum, mean FeNO, sputum production and smoking history	1. Severe airflow obstruction, low QoL, overlap of asthma, emphysema and chronic bronchitis
		New Zealand		2. Emphysema
				3. Asthma with eosinophilic airway inflammation
				4. Mild airflow obstruction
				5. Chronic bronchitis non-smokers
Rennard et al.¹³	2164	Pulmonary clinics	Demographic, symptoms, biochemical, clinical/functional	1. Moderate–quasi stable
		North America and Europe		2. Functional emphysema
				3. Mixed
				4. Exacerbator emphysema
				5. Inflamed comorbid

BMI: body mass index; QoL: quality of life; MMRC: Modified Medical Research Council dyspnea scale; COPD: chronic obstructive pulmonary disease; pb: pre-bronchodilator; FEV1: forced expiratory volume in first second; FVC: forced vital capacity; LABA: long-acting beta-agonist inhaler; FRC: functional residual capacity; DLCO: carbon monoxide lung diffusion capacity; FeNO: fraction of exhaled nitric oxide; PFT: pulmonary function test.

^a NELSON: population-based cancer-screening trial in heavy smokers and ex-smokers.

Clustering techniques are most valuable when the phenotypes obtained describe the etiology, pathogenesis or clinical characteristics of the patients. Studies clustering detailed physiologic data and biomarkers with clinical characteristics may be better suited to investigate the pathogenesis and etiology of the different phenotypes, for example, the cluster with chronic bronchitis and an inflammatory eNose profile described by Fens et al.⁹ Studies based on readily available data like ours^6,10 can also generate hypotheses about pathogenesis as detailed above, but more importantly, they can describe phenotypes outside of the research setting. They can help organize care and generate clinical research with phenotype-specific interventions,⁴¹ for example, psycho-social interventions for the “Non-Compliant” phenotype or palliative care options for the advanced disease phenotype.

Our results are similar to the studies that used only comorbidities for clustering.^6,10 The larger number of clusters found in our study can be explained by its sample size, our inclusion criteria and the variable selection which also included severity of disease markers and medications. Baty et al.⁶ described a malignancy cluster; we were able to detail further, describing two malignancy clusters, one with signs of advanced disease and frequent readmissions and one without.

Another pre-requisite to make phenotyping useful is providing the classification rules for use in future validation and clinical practice⁴² (Appendix 1, Tables 3 and 4).

Previous cohort studies and industry-sponsored randomized controlled trials have had stringent enrollment criteria, excluding many clinically relevant and common COPD subpopulations. One strength of our study is the inclusion of all adult patients (>40 years) with a diagnosis of COPD. As stated above, these broad criteria have helped us to recognize a larger number of phenotypes than previous studies. This is also the largest US COPD population analyzed with this methodology.¹³

Another strength of our study is the use of the sphere exclusion method for clustering. It does not require a priori determination of the number of clusters, and using homogeneity measures to determine the optimal similarity threshold automatically (Figure 1) makes the clustering algorithm autonomous and minimizes the risk of bias.

Our study has several limitations. First, the proposed clusters rely on the available variables.²¹ We lacked data on spirometry, 6-min walk test, mortality and quality of life questionnaires. In their absence, our ability to detect some clinically relevant phenotypes may have been limited; for example, the previously described upper lobe–predominant emphysema with poor exercise performance was not observed.⁴³ We also had to use alternative measures of severity of disease (plasma bicarbonate and weight loss), which may not apply to all the phenotypes. However, under normal clinical settings, such information is rarely available. In a recent report, only 32 percent of newly diagnosed COPD patients had undergone spirometry testing,⁴⁴ and thus, relying on these data to clinically classify patients beyond the pulmonary office is, as of today’s practice, of limited value. Furthermore, even in the absence of these variables, we were able to detect clinically relevant phenotypes with plausibly different underlying pathogenesis. Finally, our methodology can only describe co-occurrence of diseases and not cause–effect relationships.

Another limitation of the study is its single-center design. Patients from New Mexico may have a different distribution of risk factors than patients in other areas of the United States. They have a higher exposure to biomass fuels, and the Hispanic population has also been found to be particularly resistant to the development of this disease.^45,46

The main impact of our study is the development of a classification scheme for patients with COPD that can be used to generate new phenotype-specific outcomes and interventions. While other studies have mainly focused on mortality and readmission rates, it is debatable whether these are the most or only relevant outcomes. For instance, two phenotypes can have the same mortality and hospitalization rate for very different reasons thus requiring different therapeutic approaches. Phenotype-specific outcomes such as time to myocardial infarction, depression recurrence or improved compliance along with mortality and readmission rates may offer an extra dimension to better differentiate between and within subpopulations of COPD patients.

Conclusion

Moving from generalized to stratified medicine, clustering studies are needed to both discover pathophysiology links and group patients with clinically similar phenotypes allowing for more personalized and integrated care. Our study confirms previous COPD phenotypes and adds new ones to better understand the interplay between COPD and comorbidities. The use of readily available data in defining our clusters makes this methodology appealing for validation and implementation at other centers.

Footnotes

Declaration of conflicting interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship and/or publication of this article.

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This study is supported by The University of New Mexico School of Medicine.

References

1. GOLD – the Global Initiative for Chronic Obstructive Lung Disease, http://goldcopd.org/gold-reports/ (accessed 4 November 2016).

2. CDC. Chronic Obstructive Pulmonary Disease (COPD) – data and statistics, http://www.cdc.gov/copd/data.htm (accessed 28 March 2015).

3. McGarvey, John, Anderson. Ascertainment of cause-specific mortality in COPD: operations of the TORCH Clinical Endpoint Committee. Thorax 2007; 62: 411–415.

4. Anthonisen, Connett, Enright. Hospitalizations and mortality in the Lung Health Study. Am J Respir Crit Care Med 2002; 166: 333–339.

5. Cully, Graham, Stanley. Quality of life in patients with chronic obstructive pulmonary disease and comorbid anxiety or depression. Psychosomatics 2006; 47: 312–319.

6. Baty, Putora, Isenring. Comorbidities and burden of COPD: a population based case-control study. PLoS ONE 2013; 8: e63285.

7. Burgel P-R, Paillasseur J-L, Peene. Two distinct chronic obstructive pulmonary disease (COPD) phenotypes are associated with high risk of mortality. PLoS ONE 2012; 7: e51048.

8. Burgel P-R, Paillasseur J-L, Caillaud. Clinical COPD phenotypes: a novel approach using principal component and cluster analyses. Eur Respir J 2010; 36: 531–539.

9. Fens, van Rossum AGJ, Zanen. Subphenotypes of mild-to-moderate COPD by factor and cluster analysis of pulmonary function, CT imaging and breathomics in a population-based survey. COPD 2013; 10: 277–285.

10. Vanfleteren, Spruit, Groenen. Clusters of comorbidities based on validated objective measurements and systemic inflammation in patients with chronic obstructive pulmonary disease. Am J Respir Crit Care Med 2013; 187: 728–735.

11. Garcia-Aymerich, Gómez, Benet. Identification and prospective validation of clinically relevant chronic obstructive pulmonary disease (COPD) subtypes. Thorax 2011; 66: 430–437.

12. Weatherall, Travers, Shirtcliffe. Distinct clinical phenotypes of airways disease defined by cluster analysis. Eur Respir J 2009; 34: 812–818.

13. Rennard, Locantore, Delafont. Identification of five chronic obstructive pulmonary disease subgroups with different prognoses in the ECLIPSE cohort using cluster analysis. Ann Am Thorac Soc 2015; 12: 303–312.

14. Han, Agusti, Calverley. Chronic obstructive pulmonary disease phenotypes: the future of COPD. Am J Respir Crit Care Med 2010; 182: 598–604.

15. Hingorani, Windt, Riley. Prognosis research strategy (PROGRESS) 4: stratified medicine research. BMJ 2013; 346: e5793.

16. Prieto-Centurion, Rolle. Multicenter study comparing case definitions used to identify patients with chronic obstructive pulmonary disease. Am J Respir Crit Care Med 2014; 190: 989–995.

17. CTSC Biomedical Informatics, http://hsc.unm.edu/research/ctsc/BMI/i2b2.shtml (accessed 29 March 2015).

18. Quan, Sundararajan, Halfon. Coding algorithms for defining comorbidities in ICD-9-CM and ICD-10 administrative data. Med Care 2005; 43: 1130–1139.

19. Prescott, Almdal, Mikkelsen. Prognostic value of weight change in chronic obstructive pulmonary disease: results from the Copenhagen City Heart Study. Eur Respir J 2002; 20: 539–544.

20. Groenewegen, Schols, Wouters EFM. Mortality and mortality-related factors after hospitalization for acute exacerbation of COPD. Chest 2003; 124: 459–467.

21. Everitt, Landau, Leese. Index. In: Cluster analysis. New York: John Wiley & Sons, Inc., 2011, pp. 321–330.

22. Taylor. Simulation analysis of experimental design strategies for screening random compounds as potential new drugs and agrochemicals. J Chem Inf Comput Sci 1995; 35: 59–67.

23. MacCuish, Nicolaou, MacCuish. Ties in proximity and clustering compounds. J Chem Inf Comput Sci 2001; 41: 134–146.

24. Dunn. A fuzzy relative of the ISODATA process and its use in detecting compact well-separated clusters. J Cybern 1973; 3: 32–57.

25. Farris. On the cophenetic correlation coefficient. Syst Biol 1969; 18: 279–285.

26. Wilson. Depression in the patient with COPD. Int J Chron Obstruct Pulmon Dis 2006; 1: 61–64.

27. Polsky, Doshi, Marcus. Long-term risk for depressive symptoms after a medical diagnosis. Arch Intern Med 2005; 165: 1260–1266.

28. Cinciripini, Wetter, Fouladi. The effects of depressed mood on smoking cessation: mediation by postcessation self-efficacy. J Consult Clin Psychol 2003; 71: 292–301.

29. Kernan, Ovbiagele, Black. Guidelines for the prevention of stroke in patients with stroke and transient ischemic attack: a guideline for healthcare professionals from the American Heart Association/American Stroke Association. Stroke 2014; 45: 2160–2236.

30. Jensen, Moseley, Oprea. Temporal disease trajectories condensed from population-wide registry data covering 6.2 million patients. Nat Commun 2014; 5: 4022.

31. Wardlaw, Silverman, Siva. Multi-dimensional phenotyping: towards a new taxonomy for airway disease. Clin Exp Allergy 2005; 35: 1254–1262.

32. Kornum, Sværke, Thomsen. Chronic obstructive pulmonary disease and cancer risk: a Danish nationwide cohort study. Respir Med 2012; 106: 845–852.

33. Rutgers, Postma, ten Hacken. Ongoing airway inflammation in patients with COPD who do not currently smoke. Chest 2000; 117: 262S.

34. Brody, Spira. State of the art. Chronic obstructive pulmonary disease, inflammation, and lung cancer. Proc Am Thorac Soc 2006; 3: 535–537.

35. Lobo, Rojas-Balcazar, Noone. Recent advances in cystic fibrosis. Clin Chest Med 2012; 33: 307–328.

36. Bahadori, FitzGerald. Risk factors of hospitalization and readmission of patients with COPD exacerbation – systematic review. Int J Chron Obstruct Pulmon Dis 2007; 2: 241–251.

37. Walter, Beiser, Givelber. Association between glycemic state and lung function: the Framingham Heart Study. Am J Respir Crit Care Med 2003; 167: 911–916.

38. Rinne, Feemster, Collins. Thiazolidinediones and the risk of asthma exacerbation among patients with diabetes: a cohort study. Allergy Asthma Clin Immunol Off J Can Soc Allergy Clin Immunol 2014; 10: 34.

39. Vestbo, Anderson, Calverley PMA. Adherence to inhaled therapy, mortality and hospital admission in COPD. Thorax 2009; 64: 939–943.

40. Chung. New treatments for severe treatment-resistant asthma: targeting the right patient. Lancet Respir Med 2013; 1: 639–652.

41. World Health Organization. Report of the International Pilot Study of Schizophrenia, http://apps.who.int//iris/handle/10665/39405 (1973, accessed 28 March 2015).

42. Weatherall, Shirtcliffe, Travers. Use of cluster analysis to define COPD phenotypes. Eur Respir J 2010; 36: 472–474.

43. Fishman, Martinez, Naunheim. A randomized trial comparing lung-volume-reduction surgery with medical therapy for severe emphysema. N Engl J Med 2003; 348: 2059–2073.

44. Han, Kim, Mardon. Spirometry utilization for COPD: how do we measure up? Chest 2007; 132: 403–409.

45. Sood, Petersen, Blanchette. Wood smoke exposure and gene promoter methylation are associated with increased risk for COPD in smokers. Am J Respir Crit Care Med 2010; 182: 1098–1104.

46. Ramírez-Venegas, Sansores, Quintana-Carrillo. FEV1 decline in patients with chronic obstructive pulmonary disease associated with biomass exposure. Am J Respir Crit Care Med 2014; 190: 996–1002.