Abstract

Genomic-based studies of disease now involve diverse types of data collected on large groups of patients. A major challenge facing statistical scientists is how best to combine the data, extract important features, and comprehensively characterize the ways in which they affect an individual's disease course and likelihood of response to treatment. We have developed a survival-supervised latent Dirichlet allocation (survLDA) modeling framework to address these challenges. Latent Dirichlet allocation (LDA) models have proven extremely effective at identifying themes common across large collections of text, but applications to genomics have been limited. Our framework extends LDA to the genome by considering each patient as a “document” with “text” detailing his/her clinical events and genomic state. We then further extend the framework to allow for supervision by a time-to-event response. The model enables the efficient identification of collections of clinical and genomic features that co-occur within patient subgroups, and then characterizes each patient by those features. An application of survLDA to The Cancer Genome Atlas ovarian project identifies informative patient subgroups showing differential response to treatment, and validation in an independent cohort demonstrates the potential for patient-specific inference.

Keywords

latent Dirichlet allocation, time-to-event, survival, cancer genomics

Introduction

Technological advances continue to increase both the ease and accuracy with which measurements of the genome and phenome can be obtained and, consequently, genomic-based studies of diseases such as cancer often involve highly diverse types of data collected on large groups of patients. The primary goals of such studies involve identifying genomic features useful for characterizing patient subgroups as well as predicting patient-specific disease course and/or likelihood of response to treatment. Doing so requires computational methods that handle complex interactions, accommodate genetic heterogeneity, and allow for data integration across multiple sources.

A number of statistical methods are available for feature identification and prediction of a time-to-event phenotype such as overall survival or time to recurrence (for a review, see Chen et al.¹, Li and Li,² and Wei and Li³). Most often, classical models for a survival response are coupled with some dimension reduction methods for individual^4–6 or grouped predictors,^1,2,7,8 providing a concise representation of the genomic features affecting patient outcome. Although useful, the majority of these methods identify a set of covariates common to all patients and as a result may “distort what is observed” in the presence of heterogeneity.⁹ Survival-supervised clustering approaches naturally accommodate heterogeneity, providing for efficient and effective identification of patient subgroups.^10,11 However, these approaches do not identify salient features associated with subgroups and, as with the aforementioned methods, may sacrifice power and accuracy by focusing on one (or a few) data set(s) in isolation.

Latent Dirichlet allocation (LDA)¹² models are particularly well tailored for accommodating heterogeneity, selecting features, and characterizing complex interactions in a high-dimensional textual setting, but their application in genomics has been limited. By far, the most common application concerns identifying groups of words that co-occur frequently (topics) across large collections of text (eg, a collection of articles or abstracts). The derived topics provide insights into the collections’ content overall as well as into the specific content within a document, and estimated document-specific distributions over topics are useful in classifying new documents.^12–14

An extension allows for topic estimation to be supervised by a response that is suitably described by a generalized linear model.¹⁵ So-called supervised LDA (sLDA) debuted with a study of movie reviews (each considered a document) and estimated topics (collections of co-occurring words in a review) that determined the number of stars (supervising response) a movie received. Derived topics included ones having highest weight on words such as “power”, “perfect”, “fascinating”, and “complex”; another with highest weight on “routine”, “awful”, “featuring”, “dry”; a third on “unfortunately”, “least”, “flat”, “dull”; and so on. The movie-review-specific distribution over topics proved useful in classifying movies. Those with highest weight on the “power” topic generally had a high number of stars while those with highest weight on the “unfortunately” topic had a low number; those with weight on the “routine” topic most often ended up in the middle. Differences between the distributions also provided insights into differences between movies that received a similar number of stars.

Our interest here is not in evaluating movies. However, it is important to note that the questions addressed in Blei and McAuliffe¹⁵ are identical in structure to the most important questions we face in cancer genomics. In the former, questions include: “Given reviews and ratings for a group of movies, can we identify collections of words (topics) that discriminate movie reviews? Can each movie be described by a distribution over those topics? Can distributions over topics provide insights into differences between similarly rated movies? And can a movie-specific distribution over topics be used to predict what the rating of a new movie will be?” In cancer genomics, the questions include: “Given genomic, clinical, and survival information on a group of patients, can we identify collections of genomic and clinical features (topics) that define and discriminate among patient subgroups? Can a patient be well described by a distribution over those topics? Can distributions over topics provide insights into the genomic differences between two patients with similar survival? And can a patient-specific distribution over topics be used to predict survival of a new patient?”

To address these types of questions, we extend LDA for use in a clinical and genomic setting. Specifically, survival-supervised LDA (survLDA) is developed in the second section to facilitate topic supervision by a time-to-event response with censoring. Unlike in the textual domains of Blei et al.¹², Porteous et al.¹³, and Biro et al.¹⁴, the definition of a document is not obvious in this setting. The Methods section details the construction of documents, one for each patient, where words describe clinical events, treatment protocols, and genomic information from multiple sources. As we show in the Application of survLDA to the TCGA data section, application of survLDA to this collection of documents provides for the identification of topics useful in characterizing patient subpopulations as well as individual patients in a study of ovarian cancer conducted as part of The Cancer Genome Atlas (TCGA) project.¹⁶ Classification of new patients is considered in the third section, and we conclude with the Discussion section.

Methods

The LDA model

We briefly review the LDA model as detailed in Blei et al.¹² Assume there are D documents indexed by i = 1,…, D, each of which consists of N_i words. The vocabulary is the set of V unique words, indexed by v = 1,…, V, from which the documents’ words arise; it is usually taken to be the union of all words over documents. Further, assume that there are K latent “topics” indexed by k = 1,…, K, that govern the assignment of words to documents. Each topic corresponds to a discrete distribution over the V words in the vocabulary, with parameters given by the V-vector τ_k. Likewise, each document is assumed to be a mixture over the K topics with mixing coefficients θ_i (a K-vector parameter), indicating the proportion of words sampled from each topic.

For a given document i, N_i words arise from the following generative process, given the system-wide hyperparameters α (a K-vector Dirichlet parameter) and τ_1:K (the topic V-vectors):

1. Draw topic proportions θ_i ~ Dirichlet(α).

2. For each of the N_i words, indexed by j:

(a) Draw a topic assignment Z_ij | θ_i ~ Multinomial(1, θ_i), where Z_ij ∊ {1, …, K}.

(b) Draw a word W_ij | Z_ij, τ_1:K ~ Multinomial(1, τ_Zij), where W_ij ∊ {1, …, V}.
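The generative process above can be simulated directly. The following is a minimal sketch, assuming NumPy and 0-based indices for topics and words; the function name `sample_document` and the toy values of α and τ are hypothetical and are not taken from the paper.

```python
import numpy as np

def sample_document(alpha, tau, n_words, rng):
    """Sample one document from the LDA generative process.

    alpha: length-K Dirichlet parameter.
    tau:   K x V matrix; row k is topic k's distribution over the vocabulary.
    Topics and words are 0-based here (the text uses 1-based indices).
    """
    K, V = tau.shape
    theta = rng.dirichlet(alpha)                        # topic proportions theta_i
    z = rng.choice(K, size=n_words, p=theta)            # topic assignment Z_ij per word
    w = np.array([rng.choice(V, p=tau[k]) for k in z])  # word W_ij given its topic
    return theta, z, w

# Toy illustration (hypothetical sizes): 3 topics over an 8-word vocabulary.
rng = np.random.default_rng(0)
alpha = np.array([0.5, 0.5, 0.5])
tau = rng.dirichlet(np.ones(8), size=3)
theta, z, w = sample_document(alpha, tau, n_words=20, rng=rng)
```

Fitting reverses this process: the words w are observed, and θ, Z, α, and τ are estimated.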

With this model in place, a variational expectation-maximization (EM) algorithm may be used to estimate the joint posterior distribution of θ_i and $Z_{i, 1:N_i}$, given $w_{i, 1:N_i}$, α, and τ_1:K for each document i (expectation step [E-step]) and then to estimate the system-wide hyperparameters α and τ_1:K (maximization step [M-step]). Upon convergence, the variational EM yields optimal values for the key quantities of interest, namely posterior estimates of the topics (τ_1:K) and document-specific distributions over topics (θ_1:D). An extension of LDA by Blei and McAuliffe in 2008 allows for topic estimation to be supervised by a response that is suitably described by a generalized linear model. When time-to-event responses such as survival times are of interest, sLDA is not directly applicable since it does not accommodate censoring.

The survLDA model

The survLDA model assumes the same setup as in Section 2, but allows for topics to be supervised by a time-to-event outcome. For document i, the survival outcome is denoted by T_i; an indicator variable for death/censoring is also observed for each document, denoted by δ_i. The survival response $T_i \mid \bar{Z}_i, β, h_0$ is described by a Cox proportional hazards model¹⁷ with hazard function $h(t \mid \bar{Z}_i) = h_0(t) \exp\{β' \bar{Z}_i\}$, where $\bar{Z}_i$ is a K-vector with components $\bar{Z}_{ik} = \#\{Z_{ij} = k\} / N_i$. In this Cox proportional hazards model, each regression coefficient β_k captures the beneficial (negative) or deleterious (positive) effect of topic k on survival. We use a Weibull model for h₀, noting that alternative specifications (eg, nonparametric)¹⁸ may be used. The system-wide model parameters for the survLDA model include a K-vector Dirichlet parameter α and the topic V-vectors τ_1:K, just as in the LDA model described above. Specific to survLDA are survival response parameters β (a K-vector of regression coefficients) and h₀(·) (the baseline hazard). As in LDA, a variational EM algorithm is used to estimate the joint posterior distribution of θ_i and $Z_{i, 1:N_i}$, given $w_{i, 1:N_i}$, T_i, δ_i, α, τ_1:K, β, and h₀ for each document i (E-step) and then to estimate the system-wide hyperparameters α, τ_1:K, β, and h₀ (M-step). The derivation is given in the Appendix.
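To make the survival component concrete, the sketch below computes the empirical topic frequencies and the right-censored log-likelihood of a Weibull proportional hazards model. The parameterization h₀(t) = λ p t^(p-1) is an assumption made here for illustration (the text states only that a Weibull model is used), and the function names are hypothetical. This is the survival term only, not the variational EM procedure.

```python
import numpy as np

def zbar_from_assignments(z, K):
    """Empirical topic frequencies: Zbar_k = #{j : z_j = k} / N (0-based topics)."""
    return np.bincount(z, minlength=K) / len(z)

def weibull_ph_loglik(T, delta, zbar, beta, lam, p):
    """Right-censored log-likelihood under h(t | Zbar) = h0(t) * exp(beta' Zbar).

    Assumed baseline (not specified in the text): h0(t) = lam * p * t**(p - 1),
    so the cumulative baseline hazard is H0(t) = lam * t**p.
    T:     length-D array of observed times.
    delta: length-D array, 1 for an observed death and 0 for a censored time.
    zbar:  D x K matrix of empirical topic frequencies.
    """
    T = np.asarray(T, dtype=float)
    delta = np.asarray(delta, dtype=float)
    eta = zbar @ beta                                   # linear predictor per document
    log_hazard = np.log(lam * p) + (p - 1) * np.log(T) + eta
    cum_hazard = lam * T**p * np.exp(eta)
    # Deaths contribute log h(T) - H(T); censored times contribute only -H(T).
    return float(np.sum(delta * log_hazard - cum_hazard))
```

Under this parameterization, a positive β_k raises the hazard for documents with more words assigned to topic k, matching the interpretation of the coefficients given above.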

Document Construction in the TCGA Cohort

Unlike in the textual domains of Blei et al.¹², Porteous et al.¹³, and Biro et al.¹⁴ or in the movie review example described above, the definition of document is not obvious in this setting. To push the review analogy a bit further, whereas a movie review describes what is going on in a movie and provides an opinion on how the events were conveyed overall, we imagine patient reviews that describe what is going on in a patient with respect to genomic and clinical features. The analogy breaks down there, as the patient review does not contain an opinion on whether the features are positive or negative overall. Rather, the survLDA model is used to identify important features and estimate how these features relate to patient outcome as summarized by a time-to-event phenotype such as survival.

We use data from the TCGA ovarian project to construct patient-specific reviews or documents that summarize clinical and genomic features. For each of 511 patients in the TCGA ovarian cohort, clinical information such as age at diagnosis, date of surgery, surgical outcome, adjuvant therapies, time to recurrence, treatment at recurrence, overall survival, and dozens of other variables is available. Also available are high-throughput measurements of gene expression, methylation, single-nucleotide polymorphisms (SNPs)/copy number variations (CNVs), and microRNAs.

For document construction, we use words associated with drugs, gene expression, and methylation, noting that other data sources could be integrated in a similar way. Specifically, the vocabulary (the union of words across all documents) includes words associated with commonly administered drugs (platinum, taxol, doxorubicin, topotecan, and gemcitabine) as well as words derived from potentially relevant genes. For gene words, we consider the 991 genes from the 12 cancer-related pathways defined in Jones et al.¹⁹, since studies suggest that the vast majority of cancer-causing mutations lie in genes within these pathways. We also include the 5000 genes having mRNA expression that is most correlated with overall survival in the TCGA cohort as well as the 5000 having methylation that is most correlated. Given the considerable overlap between these lists, the two combined give 7452 unique genes for a total of 7897 genes from which words are derived.

Ideally, a patient's document will provide a comprehensive description of his/her clinical and genomic state. Toward this end, a patient's document received a drug word for each drug the patient received and a gene word for gene ‘X’ if the patient showed aberrant expression for that gene. To determine the direction of aberrant expression, we considered the association between gene expression and survival time. If increased expression was associated with decreased survival time for gene X, then any patient with expression above the 90th percentile for that gene received a gene word. Similarly, if decreased expression was associated with decreased survival time, then any patient with expression below the 10th percentile received a gene word. The same procedure was applied to methylation data. Once all documents were constructed, the term-frequency inverse-document frequency (tf-idf) statistic was applied to identify words with discriminating power, as is commonly done in LDA applications.²⁰ Term frequency, tf(t, d), is the normalized frequency of a word t in a document d: specifically, $\mathrm{tf}(t, d) = \frac{f(t, d)}{\max\{f(w, d) : w \in d\}}$, where f(t, d) is the frequency of the term t in document d and $\max\{f(w, d) : w \in d\}$ is the maximum frequency over all terms in the document. The inverse-document frequency is given by $\mathrm{idf}(t, D) = \log \frac{|D|}{|\{d \in D : t \in d\}|}$, where |D| is the number of documents in the corpus and $|\{d \in D : t \in d\}|$ is the number of documents in the corpus in which t appears. The idf of a rare term is high, whereas the idf of a common term is low. The tf-idf statistic, tf-idf(t, d, D) = tf(t, d) × idf(t, D), combines these two measures²¹ and is used here to identify terms that are relatively rare across documents (ie, discriminating), but relatively common within some sub-collection of documents.

To provide further detail, if a word appears in every document (eg, the word “the” shows up 10 times in each document), then the tf-idf value of this word in every document will be 0, because the proportion of documents containing the word is 1 and idf, the log of the inverse of that proportion, is log(1) = 0. On the other hand, a word's tf-idf value in a document will be higher if it shows up in some documents but not others. For example, if a word appears in 10% of the documents, then its tf-idf is about 2.3 (assuming documents of equal length, in which case tf is 1 and idf is log(10)). In the TCGA cohort, words with tf-idf ≥ 0.25 were retained in the final collection of documents, following the study by Horacek et al (2010).²²
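The tf-idf computation and the worked example can be reproduced with a short script. This is a minimal sketch using the definitions given in the text (max-normalized term frequency and natural-log idf); the function name `tf_idf` and the toy corpus are hypothetical.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {word: tf-idf} dict per document.

    tf(t, d)  = f(t, d) / max_w f(w, d)
    idf(t, D) = log(|D| / #{d in D : t in d})   (natural log)
    """
    n_docs = len(docs)
    # Number of documents containing each word (count each word once per document).
    doc_freq = Counter(word for doc in docs for word in set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        max_f = max(counts.values(), default=1)
        scores.append({word: (f / max_f) * math.log(n_docs / doc_freq[word])
                       for word, f in counts.items()})
    return scores

# Toy corpus: "the" is in all 10 documents, "g1" in 1 of 10 (10%).
docs = [["the", "g1"]] + [["the"] for _ in range(9)]
scores = tf_idf(docs)
```

Here `scores[0]["the"]` is 0 and `scores[0]["g1"]` equals log(10), about 2.3, as in the example above. Filtering then amounts to keeping words whose score meets the chosen cutoff (0.25 in the text).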

Prediction

Given a new patient with clinical and genomic data, it may be of interest to construct a document w_1:N and use it to predict survival. With a fitted model {α, τ_1:K}, the posterior mean $\bar{Z}_{\mathrm{new}} = E[\bar{Z} \mid w_{1:N}, α, τ_{1:K}]$ can be obtained in order to estimate from which topics this new patient draws words and in what proportions. As was the case during model fitting, this posterior must be approximated via variational inference. We do so by following the same procedure as outlined in the first subsection of the Appendix, except that all survival-related terms in the evidence lower bound are dropped; see the Prediction section in the Appendix for details.

Given $\bar{Z}_{\mathrm{new}}$, measures related to topic membership can be predicted for the new patient. This may be done qualitatively (eg, “This patient is predicted to belong strongly to the first topic and survival for that topic is poor, hence her prognosis is bad.”) or quantitatively (eg, predicting median survival time using the parametric survival model).
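As an illustration of the quantitative route, median survival has a closed form under a Weibull proportional hazards model. The sketch below assumes the baseline h₀(t) = λ p t^(p-1), a parameterization chosen here for illustration rather than taken from the text; the function name is hypothetical.

```python
import numpy as np

def weibull_ph_median(zbar_new, beta, lam, p):
    """Predicted median survival for a new patient.

    Under the assumed baseline h0(t) = lam * p * t**(p - 1), survival is
    S(t | Zbar) = exp(-lam * t**p * exp(beta' Zbar)).
    Solving S(t) = 0.5 gives t = (log 2 / (lam * exp(beta' Zbar)))**(1 / p).
    """
    eta = float(np.dot(beta, zbar_new))   # linear predictor for the new patient
    return float((np.log(2.0) / (lam * np.exp(eta))) ** (1.0 / p))
```

For example, a patient whose words come mostly from a topic with a positive coefficient has a larger linear predictor and therefore a shorter predicted median.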

Application of survLDA to the TCGA data

With documents constructed as described above for each of the 511 women, we applied survLDA. The supervising outcome of interest is all-cause mortality; and in all analyses, we used K = 7 topics, the last being the background topic. Application of survLDA provides two quantities of primary interest. The topics τ_1:K, or estimated distributions over words, identify clinical and genomic features that co-occur frequently in some groups of patients, but less frequently in others; and the document-specific distributions over topics θ_1:D characterize individual patients by specifying the proportions of their features coming from each topic. Of interest is determining the salient features in patient-specific documents that are represented by these topics and ultimately how the topics relate to overall survival.

Results

The left panel of Figure 1 shows a heat map with patients (columns) clustered according to topic membership for the six nonbackground topics (rows). The proportion of a patient's document words coming from a topic ranges from near 0 (almost no words, deep blue) to near 1 (virtually all words, red). As shown, most patient documents have the majority of words coming from a single topic, while some are best described by mixtures over topics. To see how differences among topics translate to differences in overall survival, the right panel of Figure 1 shows Kaplan–Meier curves for TCGA patients grouped by topic membership. Specifically, each patient is assigned to the topic having the highest weight in her document, as estimated by θ_1:D. Patients with documents having highest weight on topic 1, for example, show dramatically reduced survival (44% at 1.5 years), whereas patient documents best described by topic 2 show average survival (76% at 1.5 years). A closer look at the words underlying these topics provides some insight into the differences and identifies features that may be worthy of further investigation.

Figure 1

The left panel shows a heat map of the estimated patient-specific distributions over topics (θ) for each of 511 patients (the background topic is not shown). Topics are given in the rows; patients are clustered along the columns. Colors range from deep blue (topic underrepresented in the patient's document) to red (topic overrepresented). The right panel shows Kaplan–Meier survival curves for patients classified into one of the six nonbackground topics. Each patient was assigned to the topic having highest weight in his/her document, as estimated by θ_1:D.
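The grouping behind the right panel of Figure 1 can be sketched as follows: each patient is assigned to the topic with the largest estimated weight, and a Kaplan–Meier estimate is computed within each group. This is a minimal NumPy sketch with hypothetical function names; it is not the authors' code, and an established survival library would normally be used in practice.

```python
import numpy as np

def kaplan_meier(time, event):
    """Kaplan-Meier estimate.

    Returns the distinct event times and the estimated survival just after each.
    event is 1 for an observed death and 0 for a censored time.
    """
    time = np.asarray(time, dtype=float)
    event = np.asarray(event, dtype=int)
    times = np.unique(time[event == 1])
    surv, s = [], 1.0
    for t in times:
        at_risk = np.sum(time >= t)                    # still under observation at t
        deaths = np.sum((time == t) & (event == 1))
        s *= 1.0 - deaths / at_risk
        surv.append(s)
    return times, np.array(surv)

def km_by_top_topic(theta, time, event):
    """Assign each patient to the topic with the highest weight in theta (D x K)
    and return {topic index: (event times, survival estimates)}."""
    time, event = np.asarray(time), np.asarray(event)
    top = np.argmax(theta, axis=1)
    return {int(k): kaplan_meier(time[top == k], event[top == k])
            for k in np.unique(top)}
```

Topic indices are 0-based in this sketch, and a background topic, if present in `theta`, would need to be excluded before taking the maximum.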

The left panel of Figure 2 presents the topic-specific distributions over words for each topic. Red (blue) indicates an overabundance (dearth) of a word's weight in the corpus belonging to a particular topic. The right panel of Figure 2 shows a close-up view, highlighting 40 high-weight words that in part differentiate topics 1 and 2. A number of the results observed are consistent with prior studies. For example, CD163 expression levels have recently been shown to be prognostic of outcome in ovarian cancer patients, with higher expression associated with poor outcome.^23,24 This is consistent with what we observe, with an abundance of CD163 words in the poor outcome (topic 1) group. Similarly, increased expression of IGF2 has also been associated with poor survival in ovarian cancer patients.²⁵ Here we observe high methylation of IGF2AS (which is correlated with IGF2)²⁶ in the poor outcome group, which at first may seem to be a contradiction given that increased methylation often results in decreased expression. However, that is not the case for IGF2, where increased methylation correlates with increased expression.²⁵

Figure 2

The left panel shows a heat map of the topics derived from survLDA. Topics are shown in the columns; words are clustered along the rows. The colors range from blue (word underrepresented in the topic) to red (word overrepresented), with white in the middle (average representation). To aid in interpretation, we add the risk direction and data source from which each word was derived. For example, CYP19A1 – mRNA indicates that underexpression of CYP19A1 is associated with increased risk and that CYP19A1 words were entered into a document for patients with underexpression of CYP19A1. As the heat map shows, there are many words that distinguish topics 1 and 2, having high weight in one topic but not the other. The insets highlight 40 such words; those having high weight in topic 1 (topic 2) are shown in the upper (lower) right.

Other genes such as TRPC3, ALDH1A3, and FOXP1 have been studied in other cancers; and our results suggest that these genes may play important roles in ovarian cancer as well. Underexpression of TRPC3 has been correlated with poor prognosis in lung cancer,^27,28 as has hypermethylation of ALDH1A3 in bladder cancer.²⁹ FOXP1 is a relatively well-known tumor suppressor gene with increased expression associated with improved outcomes among breast cancer patients.³⁰ As in these studies, we observe TRPC3 underexpression and ALDH1A3 hypermethylation in our poor prognosis group and increased FOXP1 expression in patients with longer survival.

It is interesting to note that with the exception of CD163, these genes would not likely have been identified in this cohort using other approaches, as the marginal P-values from a Cox proportional hazards test are far from overwhelming (CD163 P = 0.013, IGF2AS P = 0.631, ALDH1A3 P = 0.951; FOXP1 P = 0.188; TRPC3 P = 0.282), indicating that although there are differences in the expression and/or methylation of these genes between patients primarily described by topics 1 and 2, those differences are obscured by heterogeneity in the full cohort. Although further investigation of these and other genes that display markedly different abundance patterns between patient subtypes might improve our understanding of the mechanisms that underlie differences between the groups, we note that a main advantage of LDA models in general and survLDA in particular is that topics describe co-occurrence of groups of words, not just occurrence of high-frequency words. The left panel of Figure 3 is a co-occurrence heat map showing the percentage of topic 1 patients having a given pair of words in their document. It is clear that the majority of topic 1 patients show high co-occurrence of topic 1 words and low co-occurrence of topic 2 words. The same holds true of patients best described by topic 2 words (right panel). Consequently, characterization of patient subtypes by these collections of genes taken together may prove to be more informative than characterization by individual genes. To further investigate whether these gene groups are meaningful, in the following section we evaluate their prognostic utility in independent patient populations.

Figure 3

Heat maps showing co-occurrence of the 40 high-weight topic 1 and topic 2 words shown in Figure 2. The left heat map considers the 25 patients having documents with highest weight on topic 1. Shown are the percentages of those patients having both words in their document, ranging from 0 (blue) to 100% (red). The black line separates topic 1 and topic 2 words. The right panel is similar, showing percentages of co-occurrence in documents of the 85 patients best described by topic 2 words.
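The quantity plotted in these heat maps, the percentage of documents in a group containing both words of a pair, can be computed from a binary presence matrix. The following is a minimal sketch; the function name and the toy documents are hypothetical.

```python
import numpy as np

def cooccurrence_percent(docs, words):
    """Percentage of documents containing each pair of words.

    Returns a symmetric len(words) x len(words) matrix; the diagonal holds the
    percentage of documents containing each single word.
    """
    # present[d, v] is 1.0 when word v occurs at least once in document d.
    present = np.array([[w in s for w in words] for s in map(set, docs)], dtype=float)
    return 100.0 * (present.T @ present) / len(docs)

# Toy group of four patient documents.
docs = [["a", "b"], ["a"], ["b", "c"], ["a", "b"]]
pct = cooccurrence_percent(docs, ["a", "b", "c"])
```

In this toy group, "a" and "b" appear together in two of four documents, so `pct[0, 1]` is 50.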

Prediction on Independent Data Sets

To evaluate survLDA for patient-specific prediction, we consider two independent data sets. Specifically, we consider 240 patients from the study by Tothill et al.³¹, conducted in Australia and consisting of patients with ovarian, tubal, and peritoneal cancers; we also consider 260 patients from the study by Yoshihara et al.³² conducted in Japan. These independent populations are referred to hereinafter as the validation patients. Although the TCGA data we have used are restricted to patients with stage III or IV serous ovarian adenocarcinomas, these independent studies are more heterogeneous and thus present a more challenging (and realistic) validation data set. Documents for the validation patients were derived as described in Section 2 with the quantile thresholds taken from the training set (TCGA) data. Identifying thresholds in the training data allows us to construct documents for validation patients one at a time, as would be required in any setting where patient-specific prediction was of interest. Drug and methylation words were not included as the validation data sets did not contain this information.
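Fixing the thresholds on the training data and then building one validation document at a time might look like the sketch below. It assumes expression is stored as a patients-by-genes array and that the risk direction of each gene (`risk_up`) has already been determined from the training data; all names are hypothetical, and the 10% tails follow the document construction described earlier.

```python
import numpy as np

def fit_thresholds(train_expr, q=0.10):
    """Per-gene lower and upper tail thresholds from the training matrix
    (patients x genes)."""
    lo = np.quantile(train_expr, q, axis=0)
    hi = np.quantile(train_expr, 1 - q, axis=0)
    return lo, hi

def patient_words(x, gene_names, risk_up, lo, hi):
    """Gene words for a single patient with expression vector x.

    risk_up[g] is True when higher expression of gene g is associated with
    shorter survival in the training data (assumed precomputed). A word is
    added when x[g] falls in the corresponding risk tail.
    """
    flagged = np.where(risk_up, x >= hi, x <= lo)
    return [gene_names[g] for g in np.flatnonzero(flagged)]

# Toy training matrix: 10 patients, 2 hypothetical genes "A" and "B".
train = np.arange(20, dtype=float).reshape(10, 2)
lo, hi = fit_thresholds(train)
risk_up = np.array([True, False])
doc = patient_words(np.array([17.0, 2.0]), ["A", "B"], risk_up, lo, hi)
```

Because `lo` and `hi` depend only on the training cohort, a new patient's document can be built without access to any other validation patient.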

With documents in hand, the survLDA output was used to predict topic membership for patients in the validation set, using the prediction approach given in Section 2. The left panel of Figure 4 shows survival for patients best described by words from topics 1 and 2, the two topics discussed earlier. There is a significant difference between survival in the two groups (P = 0.037). As in the training set, those patients predicted to belong to topic 2 have better survival, on average, than those patients predicted to belong to topic 1. Although statistically significant, the difference in the survival curves is attenuated relative to that observed in the training set (73% vs. 93% at 1.5 years in the validation set; 44% vs. 76% in the training set). Of course, some decrease in performance is expected as one moves to an independent validation set. Here we also lose some predictive ability as the validation set does not contain information about methylation or treatment, and so our predictor was built using a single data source (expression). Nevertheless, the ability to recover at least some information regarding outcome suggests that the topics are biologically relevant.

Figure 4

Topic-based prediction of overall survival in an independent patient cohort.

Discussion

A problem pervasive in genomic-based studies of disease concerns taking large, diverse data sets collected on a cohort of patients and using the information contained therein to characterize patient subtypes as well as individuals. Computational scientists often address this problem by performing analysis within a single data type and comparing results subsequently in an effort to identify a signal supported by the disparate analyses (eg, a gene's SNPs, expression, and methylation all associate with a phenotype). Comparing results manually has its obvious disadvantages. At the same time, meta-analysis approaches such as Fisher's combined probability test can be limited by low power³³; and efforts to combine data directly are challenged by measurements on different scales with differential dependencies. The survLDA-based framework proposed here addresses these challenges by transforming the information contained in high-throughput genomic screens into text. Doing so has both advantages and disadvantages.

One advantage is that data integration is seamless. In the implementation presented, a word for a gene is assigned to a patient's initial document if the gene shows extreme expression; the same word is assigned if the gene shows extreme methylation. In this way, a document may contain copies of words associated with extreme genomic features, measured from expression and/or methylation. Other types of data are easily incorporated into the framework. For example, just as with extreme expression or methylation, a gene word could be included in a document if that gene harbored a CNV.

A second advantage is that the threshold required for a gene to be included in the analysis is much lower than would be required with other methods. As detailed in the Methods section, some preselection of genes is done, but the selection does not require even nominally significant association with a survival endpoint, as is often required in survival studies with high-dimensional covariates.^4,7,34 This allows for the identification of many important genes, some previously known to be involved in cancers other than ovarian, which would not otherwise have been considered.

Although the identification of individual genes may prove useful, a main advantage of LDA in general and survLDA in particular is that it reveals groups of genomic aberrations that co-occur (topics) and then characterizes individual patients by those groups. The topics themselves are useful in that they define collections of genes, methylations, or other covariates among which undiscovered interactions might occur, while the patient-specific distributions over topics give insights into the similarities and differences among patients that go beyond the information that can be gained from grouping by like outcome.

Our findings on the predictive ability of the approach are mixed. In our earlier work,³⁵ we conducted simulation studies to assess predictive ability under a variety of settings. As that work suggested for the sample size considered here, there is some ability for prediction, but improvements are expected with increases in the number of patients as well as improvements in document creation strategies. As detailed in that work,³⁵ sample sizes larger than those considered here are required to improve prediction significantly. In general, more work is needed to better understand the specific effects of sample size, document size, word frequencies, and replication, which are determined in part by the method used for document construction. Our approach of assigning a word for any gene showing extreme expression or methylation was motivated by the study of Zilliox and Irizarry,³⁵ where the authors identify bimodal genes and, for each individual and each gene, assign a binary variable indicative of mode membership. The resulting gene expression ‘barcode’ for each patient proved useful in classifying patients into biologically meaningful groups³⁶ and the extrapolation of their approach proved to be an effective strategy here. Another possibility is to assign an increasing number of words in direct proportion with signal. For example, one could break a gene's expression into deciles and assign 1–10 words to each document (eg, a value between the sixth and seventh deciles gets seven words). We did not favor this approach for two main reasons. First, the approach assumes linearity of expression and methylation, which is often not the case. Second, the approach results in documents having few unique words, which reduces specificity of topics as well as document-specific distributions over topics. Document construction continues to be explored, and improvements are expected to prove useful in a number of settings.

In addition to the means by which covariates are translated into words, there are many aspects of the proposed methods that require further development. In particular, survLDA assumes the simplest of Dirichlet priors on the distributions of topics over patients and therefore the documents are considered conditionally independent given α. While this is a reasonable assumption for the TCGA data set we considered, there are other realms where correlation among the documents could arise. For example, one could have multiple documents arising from the same subject, one for each time point or tissue; or, when integrating multiple cancer types, subjects with the same type of cancer would be expected to be more alike than subjects with differing cancer types. Adding such hierarchy has already been explored to some extent for traditional LDA,³⁷ presenting a starting point for future methodological work.

Similarly, the composition of the topics themselves is essentially free. Were it not for our imposition of a background topic, the topics would be completely unstructured a priori. As it is, K − 1 topics are still governed solely by the data. This need not be the case, as methods similar to those proposed for construction of a background topic (see the Appendix) could be extended. In particular, the Dirichlet prior could be modified directly, or a set of restrictions could be imposed for each topic and groups of words so that certain words cannot appear together or may only appear together in certain topics.
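As one hedged illustration of modifying the Dirichlet prior directly, a topic's prior over words could be given an asymmetric concentration vector that places negligible mass on words not permitted in that topic. The sketch below is an assumption about how such a restriction might be encoded, not the construction used for the background topic; the vocabulary, the function names, and the values of `base` and `eps` are hypothetical.

```python
def restricted_concentration(vocab, allowed, base=1.0, eps=1e-6):
    """Dirichlet concentration for one topic: `base` for permitted words,
    a near-zero `eps` for words that should essentially never appear."""
    return [base if word in allowed else eps for word in vocab]


def prior_mean(concentration):
    """Mean of a Dirichlet distribution: each parameter divided by their sum."""
    total = sum(concentration)
    return [c / total for c in concentration]


# Hypothetical vocabulary mixing genomic and clinical words.
vocab = ["GENE_A_high", "GENE_B_low", "GENE_C_high", "STAGE_IV"]
# Assumed restriction: only these words may appear together in this topic.
allowed_in_topic = {"GENE_A_high", "STAGE_IV"}

eta = restricted_concentration(vocab, allowed_in_topic)
mean = prior_mean(eta)

for word, m in zip(vocab, mean):
    print(f"{word}: prior mean {m:.7f}")
```

Under this prior the permitted words share almost all of the expected probability mass, while the excluded words are effectively barred from the topic a priori. Whether a soft restriction of this kind or a hard constraint is preferable would depend on the inference scheme.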

In summary, it is becoming increasingly clear that studies aimed at solving the most challenging problems in cancer genomics involve highly diverse types of data collected on large groups of patients. Many methods will prove useful. We expect that advantage will be gained from methods that are able to integrate data and account for cohort heterogeneity, allow supervision by outcomes of interest such as survival, provide for patient-specific inference, and facilitate prediction of unobserved outcomes. The proposed approach provides tools for these purposes in an effort to help ensure that maximal information is obtained from genomic-based studies of disease.

Author Contributions

CK conceived the model and application, and wrote much of the paper. JD implemented an initial version of the model, figured out the extension of sLDA to survival data, analyzed data, wrote some of the paper, and wrote the appendix. SY improved upon the initial implementation and conducted further analysis. All authors reviewed and approved the final manuscript.

Footnotes

Acknowledgments

The authors wish to thank Michael Jordan for conversations that helped motivate this work and Michael Newton and Ning Leng for conversations that helped to improve the manuscript.

Appendix

References

1. Chen, Wang, Ishwaran. An integrative pathway-based clinical-genomic model for cancer survival prediction. Stat Probab Lett. 2010; 80: 1313–9.

2. […], Li. Dimension reduction methods for microarrays with application to censored survival data. Bioinformatics. 2004; 20: 3406–12.

3. Wei, Li. Nonparametric pathway-based regression models for analysis of genomic data. Biostatistics. 2007; 8: 265–84.

4. […], Luan. Kernel Cox regression models for linking gene expression profiles to censored survival data. In: Pacific Symposium on Biocomputing; 2003: 65–76.

5. Ghosh, Yuan. Combining multiple models with survival data: the PHASE algorithm. Technical report. Penn State University Department of Statistics: University Park, PA; 2010.

6. Pang, Datta, Zhao. Pathway analysis using random forests with bivariate node-split for survival outcomes. Bioinformatics. 2010; 26: 250–8.

7. Chen, Wang. Integrating biological knowledge with gene expression profiles for survival prediction of cancer. J Comput Biol. 2009; 16: 265–78.

8. […], Song, Huang. Supervised group lasso with applications to microarray data analysis. BMC Bioinformatics. 2007; 8: 60.

9. Aalen O.O. Heterogeneity in survival analysis. Stat Med. 1988; 7: 1121–37.

10. Dettling, Bühlmann. Supervised clustering of genes. Genome Biology. 2002; 3(12): 1–0069.

11. […], Gui. Partial Cox regression analysis for high-dimensional microarray gene expression data. Bioinformatics. 2004; 20(suppl 1): i208–15.

12. Blei D.M., Ng A.Y., Jordan M.I. Latent Dirichlet allocation. J Mach Learn Res. 2003; 3: 993–1022.

13. Porteous, Newman, Ihler, Asuncion, Smyth, Welling. Fast collapsed Gibbs sampling for latent Dirichlet allocation. In: KDD '08: Proceeding of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. New York, NY, USA; 2008: 569–77.

14. Biro, Szabo, Benczur. Latent Dirichlet allocation in web spam filtering. In: Castillo, Chellapilla, Fetterly, eds. AIR-Web. Beijing: ACM International Conference Proceeding Series; 2008: 29–32.

15. Blei D.M., McAuliffe J.D. Supervised topic models. In: Platt J.C., Koller, Singer, Roweis, eds. Advances in Neural Information Processing Systems 20. Cambridge, MA: MIT Press; 2008: 121–8.

16. National Cancer Institute and National Human Genome Research Institute. The Cancer Genome Atlas. 2011. Available from: http://cancergenome.nih.gov/2011.

17. Cox D.R. Regression models and life-tables. J R Stat Soc Series B Stat Methodol. 1972; 34: 187–220.

18. Cook T.D., DeMets D.L. Introduction to statistical methods for clinical trials. Statistical Science. US: Chapman and Hall/CRC; 2008: 366–82.

19. Jones, Zhang, Parsons D.W., et al. Core signaling pathways in human pancreatic cancers revealed by global genomic analyses. Science. 2008; 321: 1801–6.

20. Hong, Davison B.D. Empirical study of topic modeling in Twitter. In: Proceedings of the First Workshop on Social Media Analytics. Washington DC, USA; 2010: 80–8.

21. Salton, Fox, Wu. Extended Boolean information retrieval. Commun ACM. 1983; 26: 1022–36.

22. Horacek. Natural Language Processing and Information Systems: 14th International Conference on Applications of Natural Language to Information Systems. Helmut Horacek, Elisabeth Metais, Rafael Munoz, Magdalena Wolska, editors. NLDB 2009, Saarbrücken, Germany; June 24–26, 2009.

23. […] J.H., Moon J.M., Kim Y.B. Prognostic significance of serum soluble CD163 level in patients with epithelial ovarian cancer. Gynecol Obstet Invest. 2013; 75: 263–7.

24. Lim, Lappas, Riley, et al. Investigation of human cationic antimicrobial protein-18 (hCAP-18), lactoferrin and CD163 as potential biomarkers for ovarian cancer. J Ovarian Res. 2013; 75: 5.

25. Huang, Murphy S.K. Increased intragenic IGF2 methylation is associated with repression of insulator activity and elevated expression in serous ovarian carcinoma. Front Oncol. 2013; 3: 131.

26. […] T.H., Chuyen N.V., Li, Hoffman A.R. Loss of imprinting of IGF2 sense and antisense transcripts in Wilms’ tumor. Cancer Res. 2003; 63: 1900–5.

27. Saito, Minamiya, Watanabe, et al. Expression of the transient receptor potential channel c3 correlates with a favorable prognosis in patients with adenocarcinoma of the lung. Ann Surg Oncol. 2011; 18: 3377–83.

28. Yang S.L., Cao, Zhou K.C., Feng Y.J., Wang Y.Z. Transient receptor potential channel C3 contributes to the progression of human ovarian cancer. Oncogene. 2009; 28: 1320–8.

29. Kim Y.J., Yoon H.Y., Kim J.S., et al. HOXA9, ISL1, and ALDH1A3 methylation patterns as prognostic markers for non-muscle invasive bladder cancer: array-based DNA methylation and expression profiling. Int J Cancer. 2013; 133(5): 1135–42.

30. Fox S.B., Brown, Han, et al. Expression of the forkhead transcription factor FOXP1 is associated with estrogen receptor alpha and improved survival in primary human breast carcinomas. Clin Cancer Res. 2004; 10: 3521–7.

31. Tothill R.W., Tinker A.V., George; Australian Ovarian Cancer Study Group. Novel molecular subtypes of serous and endometrioid ovarian cancer linked to clinical outcome. Clin Cancer Res. 2008; 14: 5198–206.

32. Yoshihara, Tajima, Yahata. Gene expression profile for predicting survival in advanced-stage serous ovarian cancer across two independent datasets. PLoS One. 2010; 5: e9615.

33. Zaykin D.V., Zhivotovsky L.A., Westfall P.H., Weir B.S. Truncated product method for combining P-values. Genet Epidemiol. 2002; 22: 170–85.

34. Liu, Gartenhaus R.B., Chen X.W., Howell C.D., Tan. Survival prediction and gene identification with penalized global AUC maximization. J Comput Biol. 2009; 16: 1661–70.

35. Korthauer, Dawson J.A., Kendziorski. Survival-supervised latent Dirichlet allocation models. In: Do Kim-Anh, Qin Zhaohui Steve, Vannucci Marina, eds. Advances in Statistical Bioinformatics: Models and Integrative Inference for High-Throughput Data. Cambridge: Cambridge University Press; 2013: 366–82.

36. Zilliox M.J., Irizarry R.A. A gene expression bar code for microarray data. Nat Methods. 2007; 4: 911–3.

37. Teh Y.W., Jordan M.I., Beal M.J., Blei D.M. Hierarchical Dirichlet processes. J Am Stat Assoc. 2006; 101: 1566–81.

38. Wainwright, Jordan. Graphical models, exponential families, and variational inference. Found Trends Mach Learn. 2008; 1: 1–305.

39. Jordan, Ghahramani, Jaakkola, Saul. Introduction to variational methods for graphical models. Mach Learn. 1999; 37: 182–233.

40. Breslow. Covariance analysis of censored survival data. Biometrics. 1974; 30: 89–99.

Extending Information Retrieval Methods to Personalized Genomic-Based Studies of Disease
