Abstract
Genomic-based studies of disease now involve diverse types of data collected on large groups of patients. A major challenge facing statistical scientists is how best to combine the data, extract important features, and comprehensively characterize the ways in which they affect an individual's disease course and likelihood of response to treatment. We have developed a survival-supervised latent Dirichlet allocation (survLDA) modeling framework to address these challenges. Latent Dirichlet allocation (LDA) models have proven extremely effective at identifying themes common across large collections of text, but applications to genomics have been limited. Our framework extends LDA to the genome by considering each patient as a “document” with “text” detailing his/her clinical events and genomic state. We then further extend the framework to allow for supervision by a time-to-event response. The model enables the efficient identification of collections of clinical and genomic features that co-occur within patient subgroups, and then characterizes each patient by those features. An application of survLDA to The Cancer Genome Atlas ovarian project identifies informative patient subgroups showing differential response to treatment, and validation in an independent cohort demonstrates the potential for patient-specific inference.
Introduction
Technological advances continue to increase both the ease and accuracy with which measurements of the genome and phenome can be obtained and, consequently, genomic-based studies of diseases such as cancer often involve highly diverse types of data collected on large groups of patients. The primary goals of such studies involve identifying genomic features useful for characterizing patient subgroups as well as predicting patient-specific disease course and/or likelihood of response to treatment. Doing so requires computational methods that handle complex interactions, accommodate genetic heterogeneity, and allow for data integration across multiple sources.
A number of statistical methods are available for feature identification and prediction of a time-to-event phenotype such as overall survival or time to recurrence (for a review, see Chen et al. 1 , Li and Li, 2 and Wei and Li 3 ). Most often, classical models for a survival response are coupled with some dimension reduction methods for individual4–6 or grouped predictors,1,2,7,8 providing a concise representation of the genomic features affecting patient outcome. Although useful, the majority of these methods identify a set of covariates common to all patients and as a result may “distort what is observed” in the presence of heterogeneity. 9 Survival-supervised clustering approaches naturally accommodate heterogeneity, providing for efficient and effective identification of patient subgroups.10,11 However, these approaches do not identify salient features associated with subgroups and, as with the aforementioned methods, may sacrifice power and accuracy by focusing on one (or a few) data set(s) in isolation.
Latent Dirichlet allocation (LDA) 12 models are particularly well tailored for accommodating heterogeneity, selecting features, and characterizing complex interactions in a high-dimensional textual setting, but their application in genomics has been limited. By far, the most common application concerns identifying groups of words that co-occur frequently (topics) across large collections of text (eg, a collection of articles or abstracts). The derived topics provide insights into the collections’ content overall as well as into the specific content within a document, and estimated document-specific distributions over topics are useful in classifying new documents.12–14
An extension allows for topic estimation to be supervised by a response that is suitably described by a generalized linear model. 15 So-called supervised LDA (sLDA) debuted with a study of movie reviews (each considered a document) and estimated topics (collections of co-occurring words in a review) that determined the number of stars (supervising response) a movie received. Derived topics included ones having highest weight on words such as “power”, “perfect”, “fascinating”, and “complex”; another with highest weight on “routine”, “awful”, “featuring”, “dry”; a third on “unfortunately”, “least”, “flat”, “dull”; and so on. The movie-review-specific distribution over topics proved useful in classifying movies. Those with highest weight on the “power” topic generally had a high number of stars while those with highest weight on the “unfortunately” topic had a low number; those with weight on the “routine” topic most often ended up in the middle. Differences between the distributions also provided insights into differences between movies that received a similar number of stars.
Our interest here is not in evaluating movies. However, it is important to note that the questions addressed in Blei and McAuliffe 15 are identical in structure to the most important questions we face in cancer genomics. In the former, questions include: “Given reviews and ratings for a group of movies, can we identify collections of words (topics) that discriminate movie reviews? Can each movie be described by a distribution over those topics? Can distributions over topics provide insights into differences between similarly rated movies? And can a movie-specific distribution over topics be used to predict what the rating of a new movie will be?” In cancer genomics, the questions include: “Given genomic, clinical, and survival information on a group of patients, can we identify collections of genomic and clinical features (topics) that define and discriminate among patient subgroups? Can a patient be well described by a distribution over those topics? Can distributions over topics provide insights into the genomic differences between two patients with similar survival? And can a patient-specific distribution over topics be used to predict survival of a new patient?”
To address these types of questions, we extend LDA for use in a clinical and genomic setting. Specifically, survival-supervised LDA (survLDA) is developed in the Methods section to facilitate topic supervision by a time-to-event response with censoring. Unlike in the textual domains of Blei et al. 12 , Porteous et al. 13 , and Biro et al. 14 , the definition of a document is not obvious in this setting. The Methods section also details the construction of documents, one for each patient, where words describe clinical events, treatment protocols, and genomic information from multiple sources. As we show in the Application of survLDA to the TCGA data section, application of survLDA to this collection of documents provides for the identification of topics useful in characterizing patient subpopulations as well as individual patients in a study of ovarian cancer conducted as part of The Cancer Genome Atlas (TCGA) project. 16 Classification of new patients is considered in the Prediction on Independent Data Sets section, and we conclude with the Discussion section.
Methods
The LDA model
We briefly review the LDA model as detailed in Blei et al. 12
Assume there are
For a given document Draw topic proportions θ
For each of the Draw a topic assignment Draw a word
With this model in place, a variational expectation-maximization (EM) algorithm may be used to estimate the joint posterior distribution of θ
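For reference, the generative process reviewed above can be written out in the notation commonly used for LDA. This is a sketch only: the symbols $K$ (number of topics), $D$ (number of documents), $N_d$ (number of words in document $d$), $\alpha$ (Dirichlet parameter), and $\beta_{1:K}$ (topic-specific distributions over the vocabulary) are assumptions borrowed from the standard presentation of LDA and may differ from the symbols used in the original text.

```latex
\begin{align*}
&\text{For each document } d = 1, \dots, D: \\
&\quad \theta_d \sim \mathrm{Dirichlet}(\alpha) \\
&\quad \text{For each word position } n = 1, \dots, N_d: \\
&\qquad z_{d,n} \mid \theta_d \sim \mathrm{Multinomial}(1, \theta_d) \\
&\qquad w_{d,n} \mid z_{d,n}, \beta_{1:K} \sim \mathrm{Multinomial}(1, \beta_{z_{d,n}})
\end{align*}
```

Here $\theta_d$ is the document-specific distribution over topics, $z_{d,n}$ is the topic assignment of the $n$th word, and $w_{d,n}$ is the observed word.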
The SurvLDA Model
The survLDA model assumes the same setup as in Section 2, but allows for topics to be supervised by a time-to-event outcome. For document
Document Construction in the TCGA Cohort
Unlike in the textual domains of Blei et al. 12 , Porteous et al. 13 , and Biro et al. 14 or in the movie review example described above, the definition of document is not obvious in this setting. To push the review analogy a bit further, whereas a movie review describes what is going on in a movie and provides an opinion on how the events were conveyed overall, we imagine patient reviews that describe what is going on in a patient with respect to genomic and clinical features. The analogy breaks down there, as the patient review does not contain an opinion on whether the features are positive or negative overall. Rather, the survLDA model is used to identify important features and estimate how these features relate to patient outcome as summarized by a time-to-event phenotype such as survival.
We use data from the TCGA ovarian project to construct patient-specific reviews or documents that summarize clinical and genomic features. For each of 511 patients in the TCGA ovarian cohort, clinical information is available, including age at diagnosis, date of surgery, surgical outcome, adjuvant therapies, time to recurrence, treatment at recurrence, overall survival, and dozens of other variables. Also available are high-throughput measurements of gene expression, methylation, single-nucleotide polymorphisms (SNPs)/copy number variations (CNVs), and microRNAs.
For document construction, we use words associated with drugs, gene expression, and methylation, noting that other data sources could be integrated in a similar way. Specifically, the vocabulary (the union of words across all documents) includes words associated with commonly administered drugs (platinum, taxol, doxorubicin, topotecan, and gemcitabine) as well as words derived from potentially relevant genes. For gene words, we consider the 991 genes from the 12 cancer-related pathways defined in Jones et al. 19 , since studies suggest that the vast majority of cancer-causing mutations lie in genes within these pathways. We also include the 5000 genes having mRNA expression that is most correlated with overall survival in the TCGA cohort as well as the 5000 having methylation that is most correlated. Given the considerable overlap between these lists, the two combined give 7452 unique genes for a total of 7897 genes from which words are derived.
Ideally, a patient's document will provide a comprehensive description of his/her clinical and genomic state. Toward this end, a patient's document received a drug word for each drug the patient received and a gene word for gene ‘X’ if the patient showed aberrant expression for that gene. To determine the direction of aberrant expression, we considered the association between gene expression and survival time. If increased expression was associated with decreased survival time for gene X, then any patient with expression in the uppermost 10th percentile for that gene received a gene word. Similarly, if decreased expression was associated with decreased survival time, then any patient with expression in the lowest 10th percentile received a gene word. The same procedure was applied to methylation data. Once all documents were constructed, the term-frequency inverse-document frequency (
To provide further detail, if a word shows up exactly the same number of times across all the documents (eg, the word “the” shows up 10 times in each of all the documents we have), then the
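The behavior of term-frequency inverse-document-frequency weighting can be illustrated with a small sketch. The weighting below (raw count times the log of the inverse document frequency) is one common variant and is an assumption; the exact formula and the way the scores were used by the authors are not recoverable from the text here. The function name `tf_idf` and the toy documents are hypothetical.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {word: score} dict per document.

    Assumed weighting (a common variant, not confirmed by the source):
      tf  = raw count of the word in the document
      idf = log(number of documents / number of documents containing the word)
    """
    n_docs = len(documents)
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(doc))  # count each word once per document
    scores = []
    for doc in documents:
        counts = Counter(doc)
        scores.append(
            {w: c * math.log(n_docs / doc_freq[w]) for w, c in counts.items()}
        )
    return scores


# Toy patient documents (hypothetical): drug words and gene words.
docs = [
    ["platinum", "taxol", "CD163", "CD163"],
    ["platinum", "taxol", "FOXP1"],
    ["platinum", "TRPC3"],
]
scores = tf_idf(docs)
```

Under this variant, a word that appears in every document (here "platinum") has an inverse document frequency of log(1) = 0 and so carries no weight, whereas a word concentrated in a single document scores highest.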
Prediction
Given a new patient with clinical and genomic data, it may be of interest to construct a document
Given
Application of survLDA to the TCGA Data
With documents constructed as described above for each of the 511 women, we applied survLDA. The supervising outcome of interest is all-cause mortality, and in all analyses, we used
Results
The left panel of Figure 1 shows a heat map with patients (columns) clustered according to topic membership for the six nonbackground topics (rows). The proportion of a patient's document words coming from a topic ranges from near 0 (almost no words, deep blue) to near 1 (virtually all words, red). As shown, most patient documents have the majority of words coming from a single topic, while some are best described by mixtures over topics. To see how differences among topics translate to differences in overall survival, the right panel of Figure 1 shows Kaplan–Meier curves for TCGA patients grouped by topic membership. Specifically, each patient is assigned to the topic having the highest weight in her document, as estimated by θ1:

The left panel shows a heat map of the estimated patient-specific distributions over topics (θ) for each of 511 patients (the background topic is not shown). Topics are given in the rows; patients are clustered along the columns. Colors range from deep blue (topic underrepresented in the patient's document) to red (topic overrepresented). The right panel shows Kaplan–Meier survival curves for patients classified into one of the six nonbackground topics. Each patient was assigned to the topic having highest weight in his/her document, as estimated by θ1:
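The grouping used for the survival curves can be sketched as follows: assign each patient to her highest-weight topic, then compute a Kaplan–Meier (product-limit) estimate within each group. This is a minimal illustration on made-up numbers, not the authors' code. The names `assign_topics` and `kaplan_meier`, the toy `theta` matrix, and the toy survival times are all hypothetical, and the handling of the background topic is omitted.

```python
import numpy as np


def assign_topics(theta):
    """theta: (n_patients, n_topics) array of estimated topic proportions.
    Returns the index of the highest-weight topic for each patient."""
    return np.argmax(theta, axis=1)


def kaplan_meier(times, events):
    """Product-limit estimate of the survival function.

    times:  follow-up time for each patient
    events: 1 if death was observed at that time, 0 if censored
    Returns (distinct event times, survival probability just after each).
    """
    times = np.asarray(times, dtype=float)
    events = np.asarray(events, dtype=int)
    event_times = np.unique(times[events == 1])
    surv, s = [], 1.0
    for t in event_times:
        at_risk = np.sum(times >= t)                    # still under observation
        deaths = np.sum((times == t) & (events == 1))   # deaths at time t
        s *= 1.0 - deaths / at_risk
        surv.append(s)
    return event_times, np.array(surv)


# Toy data: 5 patients, 2 topics (illustrative values only).
theta = np.array([[0.8, 0.2], [0.1, 0.9], [0.7, 0.3], [0.4, 0.6], [0.9, 0.1]])
times = np.array([5.0, 20.0, 8.0, 15.0, 12.0])
events = np.array([1, 0, 1, 1, 0])

groups = assign_topics(theta)
curves = {
    int(k): kaplan_meier(times[groups == k], events[groups == k])
    for k in np.unique(groups)
}
```

Hard assignment by the largest proportion discards the mixture information visible in the heat map, so patients whose documents are spread over several topics are grouped with patients dominated by a single topic.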
The left panel of Figure 2 presents the topic-specific distributions over words for each topic. Red (blue) indicates an overabundance (dearth) of a word's weight in the corpus belonging to a particular topic. The right panel of Figure 2 shows a close-up view, highlighting 40 high-weight words that in part differentiate topics 1 and 2. A number of the results observed are consistent with prior studies. For example, CD163 expression levels have recently been shown to be prognostic of outcome in ovarian cancer patients, with higher expression associated with poor outcome.23,24 This is consistent with what we observe, with an abundance of CD163 words in the poor outcome (topic 1) group. Similarly, increased expression of IGF2 has also been associated with poor survival in ovarian cancer patients. 25 Here we observe high methylation of IGF2AS (which is correlated with IGF2) 26 in the poor outcome group, which at first may seem to be a contradiction given that increased methylation often results in decreased expression. However, that is not the case for IGF2, where increased methylation correlates with increased expression. 25

The left panel shows a heat map of the topics derived from survLDA. Topics are shown in the columns; words are clustered along the rows. The colors range from blue (word underrepresented in the topic) to red (word overrepresented), with white in the middle (average representation). To aid in interpretation, we add the risk direction and data source from which each word was derived. For example,
Other genes such as TRPC3, ALDH1A3, and FOXP1 have been studied in other cancers; and our results suggest that these genes may play important roles in ovarian cancer as well. Underexpression of TRPC3 has been correlated with poor prognosis in lung cancer,27,28 as has hypermethylation of ALDH1A3 in bladder cancer. 29 FOXP1 is a relatively well-known tumor suppressor gene with increased expression associated with improved outcomes among breast cancer patients. 30 As in these studies, we observe TRPC3 underexpression and ALDH1A3 hypermethylation in our poor prognosis group and increased FOXP1 expression in patients with longer survival.
It is interesting to note that with the exception of CD163, these genes would not likely have been identified in this cohort using other approaches, as the marginal

Heat maps showing co-occurrence of the 40 high-weight topic 1 and topic 2 words shown in Figure 2. The left heat map considers the 25 patients having documents with highest weight on topic 1. Shown are the percentages of those patients having both words in their document, ranging from 0 (blue) to 100% (red). The black line separates topic 1 and topic 2 words. The right panel is similar, showing percentages of co-occurrence in documents of the 85 patients best described by topic 2 words.
Prediction on Independent Data Sets
To evaluate survLDA for patient-specific prediction, we consider two independent data sets. Specifically, we consider 240 patients from the study by Tothill et al. 31 , conducted in Australia and consisting of patients with ovarian, tubal, and peritoneal cancers; we also consider 260 patients from the study by Yoshihara et al. 32 , conducted in Japan. These independent populations are referred to hereinafter as the validation patients. Although the TCGA data we have used are restricted to patients with stage III or IV serous ovarian adenocarcinomas, these independent studies are more heterogeneous and thus present a more challenging (and realistic) validation data set. Documents for the validation patients were derived as described in Section 2 with the quantile thresholds taken from the training set (TCGA) data. Identifying thresholds in the training data allows us to construct documents for validation patients one at a time, as would be required in any setting where patient-specific prediction was of interest. Drug and methylation words were not included as the validation data sets did not contain this information.
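The idea of fixing quantile thresholds in the training cohort and then building one validation document at a time can be sketched as below. This is an illustration under stated assumptions rather than the authors' implementation: the function names `fit_thresholds` and `build_document`, the `risk_direction` encoding (+1 when higher expression is associated with shorter survival, -1 when lower expression is), and the toy expression values are all hypothetical. The 10% tail follows the document construction rule described in the Methods.

```python
import numpy as np


def fit_thresholds(train_expr, risk_direction, tail=0.10):
    """Per-gene cutoffs estimated from the training cohort only.

    train_expr:     (n_patients, n_genes) expression matrix
    risk_direction: +1 if high expression is the risk direction, -1 if low
    Returns the upper-tail cutoff for +1 genes and the lower-tail cutoff
    for -1 genes.
    """
    upper = np.quantile(train_expr, 1.0 - tail, axis=0)
    lower = np.quantile(train_expr, tail, axis=0)
    return np.where(risk_direction > 0, upper, lower)


def build_document(expr_row, gene_names, risk_direction, cutoffs, drugs=()):
    """Build one patient's document: drug words plus a gene word for each
    gene whose expression is extreme in that gene's risk direction."""
    words = list(drugs)
    for value, gene, direction, cut in zip(
        expr_row, gene_names, risk_direction, cutoffs
    ):
        aberrant = value >= cut if direction > 0 else value <= cut
        if aberrant:
            words.append(gene)
    return words


# Toy training cohort: 10 patients, 2 hypothetical genes.
gene_names = ["GENE_A", "GENE_B"]
risk_direction = np.array([1, -1])
train_expr = np.column_stack([np.arange(1.0, 11.0), np.arange(1.0, 11.0)])
cutoffs = fit_thresholds(train_expr, risk_direction)

# Validation patients are processed one at a time with the fixed cutoffs.
doc_1 = build_document(np.array([9.5, 5.0]), gene_names, risk_direction, cutoffs)
doc_2 = build_document(np.array([3.0, 1.0]), gene_names, risk_direction, cutoffs)
```

Because the cutoffs depend only on the training data, a new patient's document does not change when other validation patients are added, which is what makes patient-by-patient prediction possible. The `drugs` argument is left empty here, mirroring the validation setting where drug words were unavailable.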
With documents in hand, the survLDA output was used to predict topic membership for patients in the validation set, using the prediction approach given in Section 2. The left panel of Figure 4 shows survival for patients best described by words from topics 1 and 2, the two topics discussed earlier. There is a significant difference between survival in the two groups (

Topic-based prediction of overall survival in an independent patient cohort.
Discussion
A problem pervasive in genomic-based studies of disease concerns taking large, diverse data sets collected on a cohort of patients and using the information contained therein to characterize patient subtypes as well as individuals. Computational scientists often address this problem by performing analysis within a single data type and comparing results subsequently in an effort to identify a signal supported by the disparate analyses (eg, a gene's SNPs, expression, and methylation all associate with a phenotype). Comparing results manually has its obvious disadvantages. At the same time, meta-analysis approaches such as Fisher's combined probability test can be limited by low power 33 ; and efforts to combine data directly are challenged by measurements on different scales with differential dependencies. The survLDA-based framework proposed here addresses these challenges by transforming the information contained in high-throughput genomic screens into text. Doing so has both advantages and disadvantages.
One advantage is that data integration is seamless. In the implementation presented, a word for a gene is assigned to a patient's initial document if the gene shows extreme expression; the same word is assigned if the gene shows extreme methylation. In this way, a document may contain copies of words associated with extreme genomic features, measured from expression and/or methylation. Other types of data are easily incorporated into the framework. For example, just as with extreme expression or methylation, a gene word could be included in a document if that gene harbored a CNV.
A second advantage is that the threshold required for a gene to be included in the analysis is much lower than would be required with other methods. As detailed in the Methods section, some preselection of genes is done, but the selection does not require even nominally significant association with a survival endpoint, as is often required in survival studies with high-dimensional covariates.4,7,34 This allows for the identification of many important genes, some previously known to be involved in cancers other than ovarian, which would not otherwise have been considered.
Although the identification of individual genes may prove useful, a main advantage of LDA in general and survLDA in particular is that it reveals
Our findings on the predictive ability of the approach are mixed. In our earlier work, 35 we conducted simulation studies to assess predictive ability under a variety of settings. As that work suggested for the sample size considered here, there is some ability for prediction, but improvements are expected with increases in the number of patients as well as improvements in document creation strategies. As detailed in that work, 35 sample sizes larger than those considered here are required to improve prediction significantly. In general, more work is needed to better understand the specific effects of sample size, document size, word frequencies, and replication, which are determined in part by the method used for document construction. Our approach of assigning a word for any gene showing extreme expression or methylation was motivated by the study of Zilliox and Irizarry, 35 where the authors identify bimodal genes and, for each individual and each gene, assign a binary variable indicative of mode membership. The resulting gene expression ‘barcode’ for each patient proved useful in classifying patients into biologically meaningful groups, 36 and the extrapolation of their approach proved to be an effective strategy here. Another possibility is to assign an increasing number of words in direct proportion to signal. For example, one could break a gene's expression into deciles and assign 1–10 words to each document (eg, a value between the sixth and seventh deciles gets seven words). We did not favor this approach for two main reasons. First, the approach assumes linearity of expression and methylation, which is often not the case. Second, the approach results in documents having few unique words, which reduces specificity of topics as well as document-specific distributions over topics. Document construction continues to be explored, and improvements are expected to prove useful in a number of settings.
In addition to the means by which covariates are translated into words, there are many aspects of the proposed methods that require further development. In particular, survLDA assumes the simplest of Dirichlet priors on the distributions of topics over patients and therefore the documents are considered conditionally independent given
Similarly, the composition of the topics themselves is essentially free. Were it not for our imposition of a background topic, the topics would be completely unstructured a priori. As it is,
In summary, it is becoming increasingly clear that studies aimed at solving the most challenging problems in cancer genomics involve highly diverse types of data collected on large groups of patients. Many methods will prove useful. We expect that advantage will be gained from methods that are able to integrate data and account for cohort heterogeneity, allow supervision by outcomes of interest such as survival, provide for patient-specific inference, and facilitate prediction of unobserved outcomes. The proposed approach provides tools for these purposes in an effort to help ensure that maximal information is obtained from genomic-based studies of disease.
Author Contributions
CK conceived the model and application, and wrote much of the paper. JD implemented an initial version of the model, figured out the extension of sLDA to survival data, analyzed data, wrote some of the paper, and wrote the appendix. SY improved upon the initial implementation and conducted further analysis. All authors reviewed and approved the final manuscript.
Acknowledgments
The authors wish to thank Michael Jordan for conversations that helped motivate this work and Michael Newton and Ning Leng for conversations that helped to improve the manuscript.
