Identification of gene expression signature for cigarette smoke exposure response

Abstract

Gene expression profiling data can be used in toxicology to assess both the level and impact of toxicant exposure, aligned with a vision of 21st century toxicology. Here, we present a whole blood-derived gene signature that can distinguish current smokers from either nonsmokers or former smokers with high specificity and sensitivity. Such a signature that can be measured in a surrogate tissue (whole blood) may help in monitoring smoking exposure as well as discontinuation of exposure when the primarily impacted tissue (e.g., lung) is not readily accessible. The signature consisted of LRRN3, SASH1, PALLD, RGL1, TNFRSF17, CDKN1C, IGJ, RRM2, ID3, SERPING1, and FUCA1. Several members of this signature have been previously described in the context of smoking. The signature translated well across species and could distinguish mice that were exposed to cigarette smoke from ones exposed to air only or had been withdrawn from cigarette smoke exposure. Finally, the small signature of only 11 genes could be converted into a polymerase chain reaction-based assay that could serve as a marker to monitor compliance with a smoking abstinence protocol.

Keywords

Gene expression, signature, blood, surrogate, species, smoking

Introduction

Cigarette smoke exposure has a negative impact on human health and is linked to the development of several fatal diseases.¹ The response to cigarette smoke exposure has been monitored by widely used biomarkers such as the levels of “nicotine equivalents”² in urine or the metabolites of the tobacco-specific lung carcinogen 4-(methylnitrosamino)-1-(3-pyridyl)-1-butanone.³ However, the use of these biomarkers is associated with several drawbacks, including their short half-life and interindividual differences in metabolism.^4,5 Hair nicotine and cotinine have been used to monitor smoke exposure among infants and adults and found to be more precise measures of exposure than urine cotinine levels.^6–8 Salivary cotinine concentrations have also been proposed as a noninvasive biomarker for environmental tobacco exposure in children.⁹ While hair and saliva nicotine and cotinine measurements may provide accurate verification of nonsmoking status and a useful measure of secondhand smoke exposure, they are restricted to a single constituent present in cigarette smoke. Moreover, such biomarkers do not offer insights into the biological mechanisms that are impacted in response to cigarette smoke exposure, a feature advocated by 21st century toxicity testing principles.¹⁰

New technologies, such as whole genome microarrays, have therefore been incorporated into toxicity testing to increase efficiency and to provide a more data-driven approach to exposure response assessment.¹¹ Several studies have described chemical-specific gene expression profiles associated with the adverse effects of active substances in various tissues,^12–14 and the large airway transcriptome of smokers has been well characterized. In line with the theory of the field of injury, molecular changes in response to smoke exposure can be detected even when no histological abnormalities are visible.^15–19

Sample acquisition from the primary site of exposure (e.g., the airways) is usually invasive and is therefore not convenient for exposure assessment and monitoring. As a minimally invasive alternative, peripheral blood sampling can be employed in the general population to establish systemic biomarkers. Several blood-based biomarkers of potential harm have been proposed, including those related to inflammation, oxidative stress, and platelet activation.²⁰ A more global picture of the impacted biology can be achieved by gene expression studies, and various exposures have been shown to alter gene expression profiles in blood.^21–23 Indeed, several studies have shown that gene expression in blood can distinguish, at the population level, subjects with early stage non-small cell lung cancer from those with non-malignant lung disease,^24,25 subjects with chronic obstructive pulmonary disease (COPD) from healthy smokers,²⁶ and even smokers with no detectable disease from nonsmokers.^27–33

A transcriptome-based exposure response signature could be as simple as the presence or absence of a single gene expression, or, more likely, could be characterized by the expression levels of a collection of genes, each contributing to a specific diagnosis. It is therefore distinct from a molecular biomarker that is generally based on differentially expressed genes (DEGs) between case and control populations.

Because of the large interindividual variations that can be expected in human populations, signatures should be robust and well designed, maintaining high specificity (Sp) and sensitivity (Se) across independent subject cohorts, laboratories, and nucleic acid extraction methods. However, many published signatures lack proper validation, making them overly optimistic.^34,35

In the interest of developing more robust and predictive gene signatures, the Industrial Methodology for PROcess Verification in Research (IMPROVER)^36,37 aimed to identify the best classification pipeline for outcome prediction based on microarray data. However, despite the success of the IMPROVER Diagnostic Signature Challenge (DSC), the development of computational methodologies that can be robust and versatile in clinical applications remains challenging.^38,39

The aim of the present study was to identify a whole blood-based gene signature for current smokers (CS) with the potential to distinguish between subjects who smoked and those who had stopped smoking (former smokers (FS)) or never smoked (nonsmokers (NS)). Taking advantage of the lessons learnt from the IMPROVER DSC, we developed a new methodology based on a prediction model that uses high fold-change genes extracted from several publicly available gene expression datasets that profiled samples from CS and NS or FS in the same tissue of interest. Preselecting genes with high fold-changes in several independent studies has the potential to reinforce the robustness of the signatures across studies. The validations were performed using independent datasets.

To assess the impact of exposures on human health, several experimental models other than clinical studies are regularly used. Prevailing rodent studies have both strengths and limitations, but the more translatable the signatures, the better they serve predictive toxicology and disease research. To comply with these demands, we showed that the whole blood-based signature can discriminate smoke-exposed mice from sham-exposed and even from mice that were withdrawn from smoke exposure.

Materials and methods

Generation of the smoker whole blood transcriptome dataset

BLD-SMK-01

We have produced a blood gene expression dataset, BLD-SMK-01, from PAXgene blood samples obtained from a banked repository (BioServe Biotechnologies Ltd, Beltsville, Maryland, USA) based on well-defined inclusion criteria. At the time of sampling, the subjects were between 23 and 65 years of age. Subjects with a disease history and those taking prescription medications were excluded. CS had smoked at least 10 cigarettes daily for at least 3 years. FS had ceased smoking at least 2 years prior to sampling and before quitting had smoked at least 10 cigarettes daily for at least 3 years. CS and NS were matched by age and gender. A total of 31 blood samples were obtained from CS, 30 from NS, and 30 from FS.

QASMC study

The Queen Ann Street Medical Center (QASMC) clinical case–control study was conducted at The Heart and Lung Centre (London, UK), according to Good Clinical Practices, and was registered at ClinicalTrials.gov with the identifier NCT01780298. It aimed to identify a biomarker or a panel of biomarkers that would enable differentiation between subjects with COPD (CS with a ≥10 pack-year smoking history at GOLD Stage 1 or 2) and three comparative groups of matched subjects: NS, FS, and CS. Sixty subjects in each group were enrolled (240 subjects in total). The additional goals of this study were to assess standard biomarkers of inflammation and to compare inflammatory cell responses and selected markers of inflammation in blood, induced sputum, and nasal samples. The 240 subjects included males (58%) and females (42%) aged between 40 and 70 years. All subjects in the comparative groups were matched by ethnicity, gender, and age (within 5 years) with the recruited COPD subjects. Blood samples were sent to AROS Applied Biotechnology AS (Aarhus, Denmark) for processing and hybridization to Affymetrix Human Genome U133 Plus 2.0 GeneChips (Affymetrix, Santa Clara, California, USA), as described below.

RNA isolation

Total RNAs (including microRNAs) were isolated using the PAXgene blood miRNA kit (catalog number 763134; Qiagen, Germany) according to the manufacturer’s instructions. The concentration and purity of the RNA samples were determined using an ultraviolet spectrophotometer (NanoDrop ND1000; Thermo Fisher Scientific, Waltham, Massachusetts, USA) by measuring the absorbance at 230, 260, and 280 nm. The RNA integrity was further checked using an Agilent 2100 Bioanalyzer (Agilent Technologies, Santa Clara, California, USA). Only RNAs with an RNA integrity number >6 were processed for further analysis.

RNA preparation and Affymetrix hybridization

Affymetrix probe sets targeting the 3′ ends of transcripts were prepared from 50 ng of RNA using the Ovation^® Whole Blood Reagent and Ovation^® RNA Amplification System V2 (NuGEN, San Carlos, California, USA). The quantity of complementary DNA (cDNA) was measured with a NanoDrop^® 1000 or 8000 spectrophotometer (Thermo Fisher Scientific) or a SpectraMax^® 384Plus microplate reader (Molecular Devices, Santa Clara, California, USA). The cDNA quality was determined by assessing the size of un-fragmented cDNA using an Agilent 2100 Bioanalyzer. The size distribution of the final fragmented and biotinylated product was also monitored using electropherograms. After labeling the cDNA, the fragments were hybridized to the GeneChip^® Human Genome U133 Plus 2.0 Array (Affymetrix) according to the manufacturer’s guidelines. Samples for target preparation were fully randomized for the Affymetrix gene expression microarray. After the investigation of artifacts on the chip scan, the data were processed through a standard quality control pipeline. Briefly, raw data files were read using the ReadAffy function of the affy package⁴⁰ from the Bioconductor (R 3.1.2 and Bioconductor 3.2) suite of microarray analysis tools⁴¹ available for the R statistical environment.⁴²

Population level analysis

For the population-level analysis (i.e., study of the average fold-changes), the data were subsequently normalized using GC-robust multiarray average (GCRMA). Background correction and quantile normalization were used to generate microarray expression values⁴³ from all arrays passing quality control checks. For the individual signature prediction model, the data were normalized with MAS5.⁴⁴ An overall linear model was fitted for each comparison of interest to generate raw p values for each probe set on the expression array based on moderated t-statistics (Smyth, 2004). The Benjamini–Hochberg false discovery rate method was used to correct for multiple testing effects.
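As an illustration of the multiple testing correction named above, the following minimal sketch implements the Benjamini–Hochberg step-up adjustment in plain Python. It is not the authors' pipeline, which ran in R/Bioconductor; the function name and the example p values are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p values in the original order."""
    m = len(p_values)
    # Indices of the p values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p value down so that adjusted values
    # never increase as the raw p values decrease (monotonicity).
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical raw p values for four probe sets.
raw = [0.01, 0.04, 0.03, 0.005]
print(benjamini_hochberg(raw))
```

Each raw p value is scaled by m/rank, and a running minimum from the largest p value downward keeps the adjusted values monotone.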

Individual sample prediction modeling

To achieve robustness in the prediction model, independent gene expression datasets from blood (GSE15289) and peripheral blood mononuclear cells (PBMCs; GSE42057) were obtained from the National Center for Biotechnology Information Gene Expression Omnibus (GEO; http://www.ncbi.nlm.nih.gov/gds/?term=GEO) and processed as described above (see Table 1). The Norwegian Women and Cancer (NOWAC) study (GSE15289)⁴⁵ dataset was composed of whole blood samples from postmenopausal women aged between 48 and 63 years, and included 211 NS and 74 CS. The Bahr et al. (GSE42057)⁴⁶ dataset included PBMC samples collected from 42 control subjects and 94 subjects with COPD of varying severity. All subjects were non-Hispanic Caucasians, and either CS or FS. These subjects were used to identify genes that exhibited large changes in average expression between samples from CS and NS or FS in each dataset. Average gene expression changes between CS and FS were used to guide signature extraction as follows:

Table 1.

Overall summary of the studies used to build, validate, or apply smoke exposure response signatures.

Study name	Study arms	Sample type	Species
BLD-SMK-01	31 CS, 30 NS, and 30 FS	Whole blood	Hs
QASMC	Blood: 60 COPD, 60 CS (healthy subjects), 60 FS, and 60 NS	Whole blood	Hs
NOWAC GSE15289	211 NS and 74 CS; postmenopausal women	Whole blood	Hs
GSE42057	36 CS (22 COPD and 14 healthy subjects) and 100 FS (72 COPD and 28 healthy subjects)	PBMC	Hs
E-MTAB-3150	Sham and smoking groups at 1, 2, 3, 4, 5, and 7 months and cessation group at 3, 5, and 7 months. 10 animals per group except for cessation 3 and 7 months, smoking 5 months, and sham 5 months where only nine CEL files passed QC.	Whole blood	Mm, C57Bl6

CS: current smokers; NS: nonsmokers; FS: former smokers; COPD: chronic obstructive pulmonary disease; QC: quality control; PBMC: peripheral blood mononuclear cells; Hs: Homo sapiens; Mm: Mus musculus.

L₁ and L₂ were the lists of the M (M = 1000 in our study) highest fold-change genes from each of the two independent datasets (GSE15289 and GSE42057). The L₁ and L₂ lists were then used for a priori filtering of the training dataset as follows:

For each N from 1 to M, the performance of a linear discriminant analysis (LDA)⁴⁷ model was evaluated using 5-fold cross-validation (repeated 100 times) on the genes in $L_{1} [1 : N] \cap L_{2} [1 : N]$, by computing the Matthews correlation coefficient (MCC), thus leading to MCC(N). The MCC combines the true/false positive and negative counts into a single evaluation metric.

N was selected for which the MCC(N) was maximum: $N_{max} = \arg\max_{N} \mathrm{MCC}(N)$.

The core gene list for the signature was defined by $L_{1} [1 : N_{max}] \cap L_{2} [1 : N_{max}]$ .

The model was built by computing an LDA model on the core gene list.

The datasets were centered prior to learning and testing, so that an LDA model had a zero intercept.
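The selection steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the cross-validated LDA evaluation is abstracted into a hypothetical `score_fn` callback that returns MCC(N) for a gene list, ties are broken in favor of the smallest N, and the gene names in the usage example are placeholders.

```python
import math


def mcc(tp, fp, tn, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


def select_core_genes(l1, l2, score_fn):
    """Sweep N and keep the intersection of the top-N lists with the best score.

    l1, l2   : gene lists ordered from highest to lowest fold-change.
    score_fn : hypothetical callback mapping a gene list to a
               cross-validated MCC (e.g., from an LDA model).
    """
    best_n, best_score, best_genes = 0, float("-inf"), []
    for n in range(1, min(len(l1), len(l2)) + 1):
        top_l2 = set(l2[:n])
        shared = [g for g in l1[:n] if g in top_l2]  # L1[1:N] intersect L2[1:N]
        if not shared:
            continue  # no common gene yet, nothing to evaluate
        score = score_fn(shared)
        if score > best_score:  # strict ">" keeps the smallest N on ties
            best_n, best_score, best_genes = n, score, shared
    return best_n, best_genes, best_score


# Toy usage with placeholder gene names and a made-up scoring rule.
l1 = ["A", "B", "C", "D"]
l2 = ["B", "A", "E", "C"]
toy_score = lambda genes: 0.8 if genes == ["A", "B"] else 0.5
print(select_core_genes(l1, l2, toy_score))
```

In practice `score_fn` would wrap the repeated 5-fold cross-validation of an LDA model on the training data restricted to the candidate genes.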

TaqMan^® quantitative reverse transcription-PCR assay

Reverse transcription reactions with 500 ng of starting RNA were performed using the iScript™ cDNA Synthesis Kit (catalog number 170-8890; Bio-Rad, Hercules, California, USA) according to the manufacturer’s instructions, and cDNAs were diluted to 10 ng/µL. A commercial human universal RNA (UHR) reference (catalog number 740000; Agilent Technologies) was added to the sample as a calibrator to reliably compare the data across multiple experiments and instruments. The probes used in the TaqMan^® assays spanned exons, and five housekeeping genes (B2M, GAPDH, FARP1, A4GALT, and GINS2) were chosen for the data normalization step. The quantitative polymerase chain reaction (qPCR) step was carried out using TaqMan^® assays and TaqMan^® Fast Advanced Master Mix (catalog number 444963, Life Technologies, Carlsbad, California, USA). Briefly, cDNAs were diluted to allow the application of 1.25 ng per well in a 384-well plate. In parallel, a master mix of TaqMan^® assay reagents and TaqMan^® Fast Advanced Master Mix was prepared for each assay, and the final reaction volume was 10 µL. qPCR was run using a Viia7 instrument (Life Technologies) and the automatic baseline and default threshold cycle (C_t) settings were applied for analysis. C_t values were normalized for each gene (by subtraction) with respect to the UHR C_t values and then to the GAPDH housekeeping gene values (leading to the so-called ΔΔC_t value).
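The two-step normalization can be written out as a short worked example. The sketch below assumes the conventional sign convention (sample minus calibrator, then target minus housekeeping gene); the function name and all C_t values are hypothetical and serve only to show the arithmetic.

```python
def delta_delta_ct(sample_ct, uhr_ct, target, reference="GAPDH"):
    """Compute a delta-delta-Ct value for one target gene in one sample.

    sample_ct, uhr_ct : dicts mapping gene symbol -> Ct value for the
                        sample and for the UHR calibrator, respectively.
    """
    # Step 1: calibrate each gene against the UHR reference (subtraction).
    d_target = sample_ct[target] - uhr_ct[target]
    d_reference = sample_ct[reference] - uhr_ct[reference]
    # Step 2: normalize the target gene to the housekeeping gene.
    return d_target - d_reference


# Hypothetical Ct values, for illustration only.
sample = {"LRRN3": 27.0, "GAPDH": 20.0}
uhr = {"LRRN3": 29.5, "GAPDH": 19.0}
print(delta_delta_ct(sample, uhr, "LRRN3"))  # (27.0 - 29.5) - (20.0 - 19.0)
```

Under this convention, a negative value indicates higher relative expression of the target gene in the sample than in the calibrator.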

Results

The establishment, validation, and translation of the exposure signature leveraged many datasets. Table 1 summarizes the studies involved in developing, validating, or applying these signatures.

Exposure signature establishment

Following RNA extraction and quality checking of the raw data files from the BLD-SMK-01 study, 82 samples remained for analysis, of which 28 were CS, 28 NS, and 26 FS. The population level transcriptomics analysis of BLD-SMK-01 samples revealed that there were no DEGs between NS and FS in whole blood (Figure 1), suggesting that it would be difficult to distinguish between them using the blood transcriptome. Conversely, many DEGs were identified between CS and NS or FS (Figure 1). Therefore, the signature was extracted based only on CS and NS samples from the BLD-SMK-01 study. The FS group was kept aside for validation purposes.

Figure 1.

Volcano plots for the DEGs in BLD-SMK-01. Volcano plots showing the estimated log₂ (fold-change) against −log₁₀ (adjusted p value). p values were computed based on moderated t-statistics and were adjusted by the Benjamini–Hochberg method. Left panel: Comparison of gene expression profiles between CS and NS. Middle panel: Comparison of gene expression profiles between CS and FS. Right panel: Comparison of gene expression profiles between FS and NS. DEGs: differentially expressed genes; CS: current smokers; NS: nonsmokers; FS: former smokers.

By applying the statistical modeling methodology for individual sample prediction described in the Materials and methods section, we obtained a prediction model based on the following core genes: LRRN3, SASH1, PALLD, RGL1, TNFRSF17, and CDKN1C. The 5-fold cross-validation MCC of this model was 0.77 (Se = 0.91; Sp = 0.85) when classifying CS samples versus NS samples.

The core genes in the signature were among those exhibiting high fold-changes in both NOWAC (GSE15289) and Bahr et al.’s (GSE42057) studies. The predictions based on the core genes improved on the performance of an LDA model based on all 77 genes that are in common between the 1000 highest fold-change genes in those two datasets (Se = 0.73; Sp = 0.81). When we studied predictive models obtained by leveraging each list of high fold-change genes individually, IGJ, RRM2, ID3, SERPING1, and FUCA1 were repeatedly identified as potential candidates in signatures with a high specificity and sensitivity. These five genes were also among those with a high fold-change in the blood transcriptomes of both NOWAC (CS vs. NS) and Bahr et al. (CS vs. FS) studies, and were used to complement the core gene signature, yielding an extended signature. The cross-validation MCC of the model based on the extended signature (LRRN3, SASH1, PALLD, RGL1, TNFRSF17, CDKN1C, IGJ, RRM2, ID3, SERPING1, and FUCA1) was 0.73 (Se = 0.88; Sp = 0.84) when classifying CS versus NS. The genes that were part of the core and extended signatures are further described in Table 2.

Table 2.

Extended blood-based smoking signature and known function of the gene product.^a

Gene	Known function of the gene product
LRRN3	Orphan receptor, essential for neural development.⁴⁸ Potential role in initiation of the primary immune response through mediation of interaction between T cells and dendritic cells.⁴⁹
CDKN1C	Present in nuclei of normal resting (G0) T cells from peripheral blood. Inhibits cdk2 activity and cell proliferation.⁵⁰
PALLD	Cytoskeletal protein required for organization of normal actin cytoskeleton.⁵⁰ Involved in establishing cell morphology, motility, adhesion and cell–extracellular matrix interactions in several cell types.⁵¹ Enriched in Th1 cells.⁵²
SASH1	Scaffold molecule that brings together a signaling complex downstream of TLR4 resulting in early endothelial responses.⁵³
RGL1	Involved in Ras and Ral signaling pathways as downstream effector protein.⁵⁴
TNFRSF17	Maturation antigen on B cells.⁵⁵
IGJ	Expression increased in early B cells compared with hemopoietic stem cells and pro-B cells.⁵⁶
RRM2	Upregulated in B cells from patients with idiopathic pulmonary arterial hypertension.⁵⁷ Expressed in CD8+ and CD4+ cytotoxic T lymphocytes generated from allostimulated PBMCs (together with LRRN3).⁵⁸
SERPING1	Involved in regulation of the complement cascade.⁵⁹
FUCA1	Involved in degradation of fucose-containing glycoproteins and glycolipids.⁶⁰
ID3	Inhibitor of E-protein transcription factors, required for CD8+ lineage development.⁶¹

PBMC: peripheral blood mononuclear cell.

^aCore signature genes: LRRN3, SASH1, PALLD, RGL1, TNFRSF17, and CDKN1C.

We compared our results with the cross-validation results of a model obtained when learning a sparse signature from BLD-SMK-01 alone (i.e., without using the two public datasets). We applied an approach comparable to that used by the best-performing team of the IMPROVER DSC.^38,39 The 5-fold cross-validation performance of this model reached Sp = 0.96 and Se = 0.93 in predicting smokers versus NS, slightly above the performance of models based on the core and extended signatures.

Although the cross-validation sensitivity and specificity of the prediction model based on the extended signature (Se = 0.88; Sp = 0.84) were slightly lower than those of the model obtained without using independent datasets (Sp = 0.96; Se = 0.93), its range of application was wider because of its robustness, as demonstrated by its application to independent studies and by the translatability of the signature to mouse, as shown below.

Verification of the exposure response signature in independent studies

To validate the core and extended signatures, we used the FS group from the BLD-SMK-01 dataset, as well as the blood dataset from the QASMC study. After checking the quality of the QASMC transcriptomics samples, 52 COPD, 58 CS, 58 FS, and 59 NS CEL files were available for predictions. To evaluate the prediction performance, the samples were stratified into two groups: CS (COPD and healthy smokers) and non-CS (NCS), the latter comprising both FS and NS. These groups enabled us to evaluate the robustness of the signature with respect to the COPD status. Each centered dataset was predicted using the model built on either the core gene signature or the extended signature. The classification performance of the signature against the QASMC study confirmed that the model was robust regardless of COPD status (Se = 0.90, Sp = 0.90 for the core signature and Se = 0.91, Sp = 0.90 for the extended signature; Table 3, Figure 2).
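The prediction step for an independent dataset can be illustrated with a minimal sketch: the dataset is centered gene by gene and a zero-intercept linear score is thresholded at zero. The weights and expression values below are hypothetical placeholders, not the fitted LDA coefficients of the signature.

```python
def center_columns(rows):
    """Subtract the per-gene (column) mean from every sample (row)."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[x - m for x, m in zip(row, means)] for row in rows]


def lda_scores(rows, weights):
    """Zero-intercept linear scores on the centered dataset."""
    centered = center_columns(rows)
    return [sum(w * x for w, x in zip(weights, row)) for row in centered]


def predict(rows, weights):
    """Positive score -> CS, otherwise NCS."""
    return ["CS" if s > 0 else "NCS" for s in lda_scores(rows, weights)]


# Hypothetical log-expression values (3 samples x 2 genes) and weights.
expression = [[8.0, 5.0], [6.0, 7.0], [7.0, 6.0]]
weights = [1.0, -0.5]
print(lda_scores(expression, weights))
print(predict(expression, weights))
```

One caveat of this sketch is that centering a dataset on its own mean implicitly assumes a class mix comparable to that of the training data.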

Table 3.

Prediction results using independent datasets (BLD-SMK-01 (FS) and QASMC) for various signatures.^a

	Truth/predicted	Core			Extended			Other
	Truth/predicted	CS	NCS	True rate	CS	NCS	True rate	BLD-SMK-01	Beineke
BLD-SMK-01	FS	3	23	0.88	4	22	0.85	0.73	0.73
QASMC	CS	99	11	0.90	100	10	0.91	0.81	0.87
QASMC	NCS	12	105	0.90	12	105	0.90	0.77	0.79

CS: current smokers; NCS: non-current smokers; FS: former smokers; LDA: linear discriminant analysis.

^aResults of an LDA model based on the set of genes described by Beineke et al. (2012) are reported in the far right column. Both core and extended signatures led to higher specificities and sensitivities than the signature derived from BLD-SMK-01 samples alone and the signature model based on the set of genes from Beineke et al.
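The true rates in Table 3 follow directly from the confusion counts. As a worked check, the sketch below recomputes them from the counts listed in the table, treating CS as the positive class (so the CS row gives the sensitivity and the NCS row the specificity); the helper name is illustrative.

```python
def true_rate(correct, incorrect):
    """Fraction of samples of one true class that were predicted correctly."""
    return correct / (correct + incorrect)


# Counts copied from Table 3 (correctly vs. incorrectly predicted samples).
core_se = true_rate(99, 11)       # QASMC CS, core signature
core_sp = true_rate(105, 12)      # QASMC NCS, core signature
extended_se = true_rate(100, 10)  # QASMC CS, extended signature
extended_sp = true_rate(105, 12)  # QASMC NCS, extended signature
fs_core = true_rate(23, 3)        # BLD-SMK-01 FS, core signature

for name, value in [("core Se", core_se), ("core Sp", core_sp),
                    ("extended Se", extended_se), ("extended Sp", extended_sp),
                    ("FS (core)", fs_core)]:
    print(f"{name}: {value:.2f}")
```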

Figure 2.

LDA scores for the training set (BLD-SMK-01, CS and NS) and test samples (BLD-SMK-01 FS and QASMC). A positive score is predictive of a CS status, while a negative score indicates a NCS status. LDA: linear discriminant analysis; CS: current smokers; NS: nonsmokers; FS: former smokers; NCS: non-current smokers.

The effects of additional covariates such as gender and age were also examined. Both BLD-SMK-01 and QASMC studies were balanced with respect to gender and age. No significant association between age or gender and smoking status was present, as indicated by:

BLD-SMK-01: χ ² (gender, smoking status) p = 1, t-test (Age vs. smoking status) p = 0.8.

QASMC: χ ² (gender, smoking status) p = 0.9, t-test (Age vs. smoking status) p = 0.46.
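For illustration, a χ² test of independence between gender and smoking status can be sketched in plain Python as below. This is not the authors' analysis code: the contingency counts are hypothetical, no continuity correction is applied, and the p value uses the closed-form χ² survival function, which this sketch covers only for 1 or 2 degrees of freedom. The t-test on age is not covered here.

```python
import math


def chi_square_independence(table):
    """Pearson chi-square test of independence for a small contingency table.

    table : list of rows of observed counts (e.g., gender x smoking status).
    Returns (statistic, degrees of freedom, p value).
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(col_totals) - 1)
    if df == 1:
        p = math.erfc(math.sqrt(stat / 2))  # chi-square survival, 1 df
    elif df == 2:
        p = math.exp(-stat / 2)             # chi-square survival, 2 df
    else:
        raise ValueError("closed-form p value implemented for df = 1 or 2 only")
    return stat, df, p


# Hypothetical gender (rows) x smoking status (columns: CS, NS, FS) counts.
balanced = [[14, 14, 13], [14, 14, 13]]
print(chi_square_independence(balanced))
```

A perfectly balanced table gives a statistic of zero and p = 1, which is the pattern expected for a gender-matched design.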

Each gene in the signature was also tested for association with gender and age in BLD-SMK-01. None of the genes showed analysis of variance p values < 0.05, except for PALLD, which showed a minor gender effect.

Previous work on smoking signatures from whole blood samples includes a study by Beineke et al. based on blood datasets from smokers and NS without cardiovascular disease.⁶² We were unable to leverage the microarray data from this study to further validate the prediction performance of our signature, because of the incompatible array platform (Agilent). In this earlier study, the authors reported a five gene signature (LRRN3, CLDND1, MUC1, GOPC, and LEF1) used in conjunction with age and gender as covariates in a logistic model with a cross-validated Sp value of 0.95 and a Se value of 0.79. The signature model was further validated in 180 independent subjects (from the PREDICT study, registered on ClinicalTrials.gov with identifier NCT00500617), based on quantitative reverse transcription (qRT)-PCR measurements with a Se value of 0.63 and a Sp value of 0.94. Our core and extended signatures outperformed those described by Beineke and co-workers based on Se and Sp values from the LDA model (Table 3).

Performance of the signature in a rodent inhalation study

Longitudinal clinical and epidemiological observations are critical in linking the exposure response signature to adverse outcomes in humans. Because of easier sampling and shorter times to disease, animal models are often used to elucidate exposure effects and disease mechanisms in primary tissues. To determine the translatability of our blood-based smoking exposure response signature into an animal model that manifests important aspects of human smoking-related emphysema,^63,64 we used blood samples from C57Bl6 mice exposed to cigarette smoke for 7 months. The study also included sham-exposed animals and animals exposed to cigarette smoke for 2 months then switched to fresh air (Philips et al., 2015).⁶⁵ For each group, blood samples were collected from 10 animals at 2, 3, 5, and 7 months. For the CS exposure and sham arms, samples were also collected after 4 months.

The exact model equation developed for the human samples did not perform well on the mouse samples, perhaps because five of the genes were expressed at very low levels in mouse samples. We therefore verified that the remaining set of six genes (LRRN3, PALLD, ID3, IGJ, RRM2, and FUCA1) belonging to the extended signature was still able to discriminate between exposed and non-exposed mice based on the blood transcriptome. To this end, we retrained an LDA model on the mouse blood transcriptomics data. Interestingly, the performance of a model based on these six genes in the human samples was only slightly lower than that of the models based on the full signatures (correct classification rate BLD-SMK-01 FS = 0.77, QASMC CS = 0.89, QASMC NCS = 0.79; Figure 3 and Table 4).

Figure 3.

LDA scores of the signature trained on the exposed and sham mouse blood samples collected at months 2, 3, 4, 5, and 7. A positive score is predictive of a CS status, while a negative score indicates a NCS status. The mainly negative scores of the cessation arm samples indicate that the signature no longer detects an exposure effect after cessation. LDA: linear discriminant analysis; CS: current smokers; NCS: non-current smokers.

Table 4.

Cross-validation results (5-fold cross-validation repeated 10 times independently) from the LDA model derived from mouse blood sample transcriptomics and associated prediction results.

	Truth\predicted	CS	NCS	True rate
C57Bl6 mice cross-validation	Smoking	31	9	0.78
C57Bl6 mice cross-validation	Sham	9.3	35.7	0.79
C57Bl6 mice prediction (smoking, sham training)	Smoking	35	5	0.86
	Cessation	6	21	0.78
	Sham	7	38	0.84

CS: current smoker; NCS: non-current smoker; LDA: linear discriminant analysis.
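The fractional counts in the cross-validation rows (e.g., 9.3 and 35.7) are consistent with confusion counts averaged over the repeated cross-validation runs; that reading is an assumption here. The sketch below shows such an averaging scheme. A nearest-centroid classifier is used as a simple stand-in for LDA, and all names and data are hypothetical.

```python
import random


def nearest_centroid(train_x, train_y, test_x):
    """Stand-in classifier (not LDA): assign each sample to the closest class mean."""
    centroids = {}
    for label in set(train_y):
        rows = [x for x, y in zip(train_x, train_y) if y == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]

    def dist(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))

    return [min(centroids, key=lambda c: dist(x, centroids[c])) for x in test_x]


def repeated_kfold_confusion(samples, labels, fit_predict, k=5, repeats=10, seed=0):
    """Average confusion counts over `repeats` random k-fold splits.

    Returns a dict mapping (true label, predicted label) -> mean count.
    """
    rng = random.Random(seed)
    classes = sorted(set(labels))
    totals = {(t, p): 0 for t in classes for p in classes}
    indices = list(range(len(samples)))
    for _ in range(repeats):
        rng.shuffle(indices)
        folds = [indices[i::k] for i in range(k)]  # k disjoint folds
        for fold in folds:
            held_out = set(fold)
            train = [i for i in indices if i not in held_out]
            preds = fit_predict([samples[i] for i in train],
                                [labels[i] for i in train],
                                [samples[i] for i in fold])
            for i, p in zip(fold, preds):
                totals[(labels[i], p)] += 1
    return {key: count / repeats for key, count in totals.items()}


# Toy, perfectly separable one-gene dataset (hypothetical values).
toy_x = [[0.0]] * 10 + [[10.0]] * 10
toy_y = ["NS"] * 10 + ["CS"] * 10
print(repeated_kfold_confusion(toy_x, toy_y, nearest_centroid))
```

Because every sample is held out exactly once per repeat, each row of the averaged table sums to the number of samples in that class, and non-integer entries appear whenever a sample is classified differently across repeats.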

We further retrained a logistic and an LDA model on the same set of genes as Beineke et al.⁶² While still performing reasonably in cross-validation, this gene set failed to translate to mice (data not shown).

Validation of the exposure response signature by PCR-based assay

To determine whether the discovered signature could be translated into a qRT-PCR-based exposure biomarker, expression levels of the genes in the extended signature were measured in a subset of 20 randomly selected human samples (10 CS and 10 NS). An LDA model was trained on normalized qRT-PCR data (see Materials and methods section) and assessed by 10-fold cross-validation (repeated 1000 times; 10-fold was chosen because of the small sample size), leading to a Sp value of 0.85 and Se value of 0.96. When applying the same technique to the core signature, a Se value of 0.80 and a lower Sp value of 0.62 were obtained (Table 5).

Table 5.

Cross-validation (10-fold cross-validation repeated 1000 times independently) results for LDA model of normalized qRT-PCR data.

	Extended signature			Core signature
Truth/predicted	CS	NS	True rate	CS	NS	True rate
CS	9.61	0.39	0.96	7.9	2.1	0.80
NS	1.36	7.64	0.85	3.39	5.61	0.62

CS: current smoker; NS: nonsmoker; LDA: linear discriminant analysis; qRT-PCR: quantitative reverse transcription polymerase chain reaction.

Discussion

Compared with single molecule measurements, gene expression profiling provides a global and more complete view of the biological processes in normal and pathological situations. When the expression trends of multiple genes are taken together, it is also possible to derive a signature or a classifier for a given physiological state from an exposure response to a disease state. While the primarily affected tissue offers a sample that more accurately represents the normal, exposed, or pathological state, it is often not realistic to classify subjects using tissue biopsies. Because of the ease of blood sampling using minimally invasive techniques, blood-based signatures hold great promise for biomarker discovery.^35,66 In this study, we derived a whole blood-based diagnostic signature that can serve as a biomarker for the smoking exposure response.

No significant association between age or gender and the genes in the blood-based signature was observed. Although age was an important covariate in two of the public datasets (GSE15289 and GSE42057), in which CS were on average older than NS or FS, this covariate was not included in the predictor, because it had no significant association with smoking status in the BLD-SMK-01 study. The core and extended signatures were also robust with respect to inter-study and interindividual variations as well as COPD status. They were validated in two independent datasets with a high specificity and sensitivity and outperformed the signature reported by Beineke et al.⁶²

Several genes present in our signature have been reported previously in the context of peripheral blood and smoking (LRRN3,^29,62,67 CDKN1C, PALLD,²⁸ SASH1,⁶⁸ and SERPING1³³) or smoking-related disease (FUCA1⁶⁸). LRRN3 expression was increased in CS compared with NCS, and LRRN3 overexpression has been reported in other smoker signatures from whole blood;^45,62 the gene has also been shown to be relatively hypomethylated in CS and relatively hypermethylated following smoking cessation.⁶⁷ LRRN3 encodes an orphan receptor, which is essential for neural development⁴⁸; however, information about its function in immune cells is very limited.⁴⁹

Although our signature performed well on mouse blood samples, we observed that the coefficients of the mouse model and the human models did not correlate positively. A closer look revealed that the most prominent gene in the extended signature, LRRN3, which is over-expressed in blood samples from smokers, was downregulated in mice exposed to smoke. This may be due to different numbers of white blood cells (WBCs) in mice and humans exposed to smoke. The total WBC count has been shown to be higher in healthy smokers than in NS.^20,69,70 Moreover, LRRN3 has been implicated in CD8+ T cell activation,⁷¹ a cell population that is increased in smokers as compared to NS and decreased upon smoking cessation according to a study based on cell type-specific antibodies and flow cytometry.⁷² Analysis of cell populations in the smoke-exposed mice used in this study showed no change in the relative numbers or types of circulating WBCs compared with sham mice (data not shown).

The use of rodent models is essential in predictive toxicology for testing new chemical compounds and evaluating disease endpoints, neither of which is feasible in human subjects. Ideally, a translatable exposure-response biomarker in a surrogate tissue between humans and experimental animals could aid our understanding of the link between the biomarker and the extent of damage in the target tissue. Our blood signature was translated to a rodent model with a high level of specificity and sensitivity, and similar to the human situation, could be used as a biomarker for smoking exposure response to complement the commonly used exposure markers, such as nicotine metabolites and carboxyhemoglobin levels in the blood.

Finally, a small signature, such as the one described here, allows the use of qRT-PCR. While Affymetrix gene expression profiling is a powerful technology to establish gene signatures, it is not the method of choice for using the signature in practical applications. When there is no need to follow the expression changes of the entire genome, the assay could potentially be developed into a kit with considerable savings in cost and time.
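To illustrate how a PCR-based readout of a single signature gene might be turned into a relative expression value, the sketch below uses the comparative Ct (2^-ΔΔCt) calculation. This is an assumption for illustration: the text above does not state which quantification method a kit would use, the calculation presumes roughly 100% amplification efficiency, and the function name and Ct values are hypothetical.

```python
def relative_expression(ct_target_sample, ct_ref_sample,
                        ct_target_control, ct_ref_control):
    """Fold change of a target gene by the comparative Ct (2^-ddCt) method.

    Each Ct is normalized to a reference (housekeeping) gene measured in the
    same sample, then the sample is compared against a control sample.
    """
    delta_sample = ct_target_sample - ct_ref_sample      # dCt, test sample
    delta_control = ct_target_control - ct_ref_control   # dCt, control sample
    delta_delta = delta_sample - delta_control           # ddCt
    return 2.0 ** (-delta_delta)


# Hypothetical Ct values, invented for illustration only (not study data).
fold = relative_expression(
    ct_target_sample=24.0, ct_ref_sample=18.0,    # e.g. a smoker sample
    ct_target_control=26.0, ct_ref_control=18.0,  # e.g. a nonsmoker sample
)
print(f"fold change = {fold:.1f}")
```

With these invented values the target gene crosses threshold two cycles earlier in the test sample after normalization, which corresponds to a four-fold higher relative expression.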

Conclusion

In conclusion, our systems toxicology approach enabled the construction of a robust whole blood-based smoker gene signature based on 11 genes that could distinguish CS from NCS with remarkable accuracy. The signatures presented in this study will not only allow us to monitor the smoking exposure response in humans, but should also permit the translation of the exposure response to a preclinical system.

Footnotes

Conflict of interest

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: The study was fully funded by PMI and all authors are employees of PMI.

References

1. US Department of Health. The health consequences of smoking: a report of the Surgeon General. Atlanta: US Department of Health and Human Services, Centers for Disease Control and Prevention, National Center for Chronic Disease Prevention and Health Promotion, Office on Smoking and Health. 2004, p. 62.

2. Hecht, Yuan, Hatsukami. Applying tobacco carcinogen and toxicant biomarkers in product regulation and cancer prevention. Chem Res Toxicol 2010; 23: 1001–1008.

3. Carmella, Akerkar, Richie. Intraindividual and interindividual differences in metabolites of the tobacco-specific lung carcinogen 4-(methylnitrosamino)-1-(3-pyridyl)-1-butanone (NNK) in smokers’ urine. Canc Epidemiol Biomarkers Prev 1995; 4: 635–642.

4. Acosta, Buchhalter, Breland. Urine cotinine as an index of smoking status in smokers during 96-hr abstinence: comparison between gas chromatography/mass spectrometry and immunoassay test strips. Nicotine Tob Res 2004; 6: 615–620.

5. Jacobson, Ferguson. Relationship between cotinine and trans-3′-hydroxycotinine glucuronidation and the nicotine metabolite ratio in Caucasian smokers. Biomarkers 2014; 19(8): 679–683.

6. Al-Delaimy. Hair as a biomarker for exposure to tobacco smoke. Tob Control 2002; 11: 176–182.

7. Al-Delaimy, Crane, Woodward. Is the hair nicotine level a more accurate biomarker of environmental tobacco smoke exposure than urine cotinine? J Epidemiol Community Health 2002; 56: 66–71.

8. Tzatzarakis, Vardavas, Terzi. Hair nicotine/cotinine concentrations as a method of monitoring exposure to tobacco smoke among infants and adults. Hum Exp Toxicol 2012; 31: 258–265.

9. Phillips, Bentley, Abrar. Low level saliva cotinine determination and its application as a biomarker for environmental tobacco smoke exposure. Hum Exp Toxicol 1999; 18: 291–296.

10. Krewski, Acosta Jr, Andersen. Toxicity testing in the 21st century: a vision and a strategy. J Toxicol Environ Health, B 2010; 13: 51–138.

11. Thomas, Philbert, Auerbach. Incorporating new technologies into toxicity testing and risk assessment: moving from 21st century vision to a data-driven framework. Toxicol Sci 2013; 136(1): 4–18.

12. Hamadeh, Bushel, Jayadev. Prediction of compound signature using high density gene expression profiling. Toxicol Sci 2002; 67: 232–240.

13. Morgan. Gene expression analysis reveals chemical-specific profiles. Toxicol Sci 2002; 67: 155–156.

14. Waring, Jolly, Ciurlionis. Clustering of hepatotoxins based on mechanism of toxicity using gene expression profiles. Toxicol Appl Pharmacol 2001; 175: 28–42.

15. Brody, Steiling. Interaction of cigarette exposure and airway epithelial cell gene expression. Annu Rev Physiol 2011; 73: 437–456.

16. Gower, Steiling, Brothers 2nd. Transcriptomic studies of the airway field of injury associated with smoking-related lung disease. Proc Am Thorac Soc 2011; 8: 173–179.

17. Gustafson, Soldi, Anderlind. Airway PI3K pathway activation is an early and reversible event in lung cancer development. Sci Transl Med 2010; 2: 26ra5.

18. Spira, Beane, Shah. Airway epithelial gene expression in the diagnostic evaluation of smokers with suspect lung cancer. Nat Med 2007; 13: 361–366.

19. Steiling, Ryan, Brody. The field of tissue injury in the lung and airway. Canc Prev Res (Phila) 2008; 1: 396–403.

20. Liu, Liang, Frost-Pineda. Relationship between biomarkers of cigarette smoke exposure and biomarkers of inflammation, oxidative stress, and platelet activation in adult cigarette smokers. Canc Epidemiol Biomarkers Prev 2011; 20: 1760–1769.

21. Forrest, Lan, Hubbard. Discovery of novel biomarkers by microarray analysis of peripheral blood mononuclear cell gene expression in benzene-exposed workers. Environ Health Perspect 2005; 113: 801–807.

22. Bushel, Heinloth. Blood gene expression signatures predict exposure levels. Proc Natl Acad Sci U S A 2007; 104: 18211–18216.

23. Chauhan, Howland, Wilkins. Identification of gene-based responses in human blood cells exposed to alpha particle radiation. BMC Med Genomics 2014; 7: 43.

24. Rotunno. A gene expression signature from peripheral whole blood for stage I lung adenocarcinoma. Canc Prev Res (Phila) 2011; 4: 1599–1608.

25. Showe, Vachani, Kossenkov. Gene expression profiles in peripheral blood mononuclear cells can distinguish patients with non-small cell lung cancer from patients with nonmalignant lung disease. Canc Res 2009; 69: 9202–9210.

26. Poliska, Csanky, Szanto. Chronic obstructive pulmonary disease-specific gene expression signatures of alveolar macrophages as well as peripheral blood monocytes overlap and correlate with lung function. Respiration 2011; 81: 499–510.

27. Büttner, Mosig, Funke. Gene expression profiles of T lymphocytes are sensitive to the influence of heavy smoking: a pilot study. Immunogenetics 2007; 59: 37–43.

28. Charlesworth, Curran, Johnson. Transcriptomic epidemiology of smoking: the effect of smoking on gene expression in lymphocytes. BMC Med Genomics 2010; 3: 29.

29. Dumeaux, Olsen, Nuel. Deciphering normal blood gene expression variation—The NOWAC postgenome study. PLoS Genet 2010; 6: e1000873.

30. Lampe, Stepaniants, Mao. Signatures of environmental exposures using peripheral leukocyte gene expression: tobacco smoke. Canc Epidemiol Biomarkers Prev 2004; 13: 445–453.

31. Lodovici, Luceri, De Filippo. Smokers and passive smokers gene expression profiles: correlation with the DNA oxidation damage. Free Rad Biol Med 2007; 43: 415–422.

32. Van Leeuwen, Van Agen, Gottschalk. Cigarette smoke-induced differential gene expression in blood cells from monozygotic twin pairs. Carcinogenesis 2007; 28: 691–697.

33. Votavova, Dostalova Merkerova, Fejglova. Transcriptome alterations in maternal and fetal cells induced by tobacco smoke. Placenta 2011; 32: 763–770.

34. Barbash, Soreq. Statistically invalid classification of high throughput gene expression data. Sci Rep 2013; 3: 1102.

35. Zeller, Blankenberg. Blood-based gene expression tests: promises and limitations. Circ Cardiovasc Genet 2013; 6: 139–140.

36. Meyer, Alexopoulos, Bonk. Verification of systems biology research in the age of collaborative competition. Nat Biotechnol 2011; 29: 811–815.

37. Meyer, Hoeng, Rice. Industrial methodology for process verification in research (IMPROVER): toward systems biology verification. Bioinformatics 2012; 28: 1193–1201.

38. Tarca, Lauria, Unger. Strengths and limitations of microarray-based phenotype prediction: lessons learned from the improver diagnostic signature challenge. Bioinformatics 2013; 29(22): 2892–2899.

39. Tarca, Than, Romero. Methodological approach from the best overall team in the improver diagnostic signature challenge. Syst Biomed 2013; 1: 4, 1–11.

40. Gautier, Cope, Bolstad. affy—analysis of Affymetrix GeneChip data at the probe level. Bioinformatics 2004; 20: 307–315.

41. Gentleman, Carey, Bates. Bioconductor: open software development for computational biology and bioinformatics. Genome Biol 2004; 5: R80.

42. R Development Core Team. R: a language and environment for statistical computing. Vienna: R Development Core Team, 2007.

43. Irizarry, Hobbs, Collin. Exploration, normalization, and summaries of high density oligonucleotide array probe level data. Biostatistics 2003; 4: 249–264.

44. Affymetrix. Statistical algorithms description document. Technical paper, 2002.

45. Dumeaux, Olsen, Nuel. Deciphering normal blood gene expression variation—The NOWAC postgenome study. PLoS Genet 2010; 6: e1000873.

46. Bahr, Hughes, Armstrong. Peripheral blood mononuclear cell gene expression in chronic obstructive pulmonary disease. Am J Respir Cell Mol Biol 2013; 49(2): 316–323.

47. McLachlan. Discriminant analysis and statistical pattern recognition. Hoboken: Wiley Interscience, 2004.

48. Taniguchi, Tohyama, Takagi. Cloning and expression of a novel gene for a protein with leucine-rich repeats in the developing mouse nervous system. Mol Brain Res 1996; 36: 45–52.

49. Ito, Masuko, Ito. Possible function of neuronal leucine-rich repeat protein 3 (NLRR3) in primary immune response. Hirosaki Med J 2010; 61: 46–57.

50. Otey, Carpen. α-Actinin revisited: a fresh look at an old player. Cell Motil Cytoskeleton 2004; 58: 104–111.

51. Parast, Otey. Characterization of palladin, a novel protein localized to stress fibers and cell adhesions. J Cell Biol 2000; 150: 643–656.

52. Äijö, Edelman, Lönnberg. An integrative computational systems biology approach identifies differentially regulated dynamic transcriptome signatures which drive the initiation of human T helper cell differentiation. BMC Genomics 2012; 13: 572.

53. Dauphinee, Clayton, Hussainkhel. SASH1 is a scaffold molecule in endothelial TLR4 signaling. J Immunol 2013; 191: 892–901.

54. Sood, Makalowska, Carpten. The human RGL (RalGDS-like) gene: cloning, expression analysis and genomic organization. Biochim Biophys Acta (BBA)-Gene Struct Expression 2000; 1491: 285–288.

55. Yang, Hase, Legarda-Addison. B cell maturation antigen, the receptor for a proliferation-inducing ligand and B cell-activating factor of the TNF family, induces antigen presentation in B cells. J Immunol 2005; 175: 2814–2824.

56. Hystad, Myklebust, Bø. Characterization of early stages of human B cell development by gene expression profiling. J Immunol 2007; 179: 3662–3671.

57. Ulrich, Taraseviciene-Stewart, Huber. Peripheral blood B lymphocytes derived from patients with idiopathic pulmonary arterial hypertension express a different RNA pattern compared with healthy controls: a cross sectional study. Respir Res 2008; 9: 20.

58. Hidalgo, Einecke, Allanach. The transcriptome of human cytotoxic T cells: similarities and disparities among allostimulated CD4+ CTL, CD8+ CTL and NK cells. Am J Transplant 2008; 8: 627–636.

59. Petersen, Thiel, Jensen. Control of the classical and the MBL pathway of complement activation. Mol Immunol 2000; 37: 803–811.

60. Intra, Perotti M-E, Pavesi. Comparative and phylogenetic analysis of α-l-fucosidase genes. Gene 2007; 392: 34–46.

61. Jones-Mason, Zhao, Kappes. E protein transcription factors are required for the development of CD4(+) lineage T cells. Immunity 2012; 36: 348–361.

62. Beineke, Fitch, Tao. A whole blood gene expression-based signature for smoking status. BMC Med Genomics 2012; 5: 58.

63. Guerassimov, Hoshino, Takubo. The development of emphysema in cigarette smoke-exposed mice is strain dependent. Am J Respir Cell Mol Biol 2004; 170: 974–980.

64. Takubo, Guerassimov, Ghezzo. α1-Antitrypsin determines the pattern of emphysema and function in tobacco smoke–exposed mice: parallels with human disease. Am J Respir Cell Mol Biol 2002; 166: 1596–1603.

65. Phillips, Veljkovic, Peck. A 7-month cigarette smoke inhalation study in C57BL/6 mice demonstrates reduced lung inflammation and emphysema following smoking cessation or aerosol exposure from a prototypic modified risk tobacco product. Food Chem Toxicol 2015; 80: 328–345.

66. Kittleson, Shui, Irizarry. Identification of a gene expression profile that differentiates between ischemic and nonischemic cardiomyopathy. Circulation 2004; 110: 3444–3451.

67. Wan, Qiu, Baccarelli. Cigarette smoking behaviors and time since quitting are associated with differential DNA methylation across the human genome. Hum Mol Genet 2012; 21: 3073–3082.

68. Verdugo, Zeller, Rotival. Graphical modeling of gene expression in monocytes suggests molecular mechanisms explaining increased atherosclerosis in smokers. PLoS One 2013; 8: e50888.

69. Fernández JAF, Prats, Artero JVM. Systemic inflammation in 222.841 healthy employed smokers and nonsmokers: white blood cell count and relationship to spirometry. Tob Induc Dis 2012; 10: 1–8.

70. Watanabe, Fukushima, Taniguchi. Smoking, white blood cell counts, and TNF system activity in Japanese male subjects with normal glucose tolerance. Tob Induc Dis 2011; 9: 12.

71. Chou, Ramirez. Accelerated aging in HIV/AIDS: novel biomarkers of senescent human CD8+ T cells. PLoS One 2013; 8: e64702.

72. Miller, Goldstein, Murphy. Reversible alterations in immunoregulatory T cells in smoking. Analysis by monoclonal antibodies and flow cytometry. Chest 1982; 82: 526–529.
