Abstract
This article examines methods for automated question classification applied to cancer-related questions that people have asked on the web. This work is part of a broader effort to provide automated question answering for health education. We created a new corpus of consumer-health questions related to cancer and a new taxonomy for those questions. We then compared the effectiveness of different statistical methods for developing classifiers, including weighted classification and resampling. Basic methods for building classifiers were limited by the high variability in the natural distribution of questions, and typical refinement approaches, such as feature selection and merging categories, achieved only small improvements in classifier accuracy. The best performance was achieved using weighted classification and resampling methods, the latter yielding F1 = 0.963. Thus, it would appear that statistical classifiers can be trained on natural data, but only if the natural distributions of classes are smoothed. Such classifiers would be useful for automated question answering, for enriching web-based content, or for assisting clinical professionals in answering questions.
Introduction
This article considers the feasibility of using automated question classification to discriminate among a broad range of consumer-generated cancer questions. Question classification has been a key step in automated question answering (QA) and offers many potential benefits over a deep parsing strategy for interpreting questions; however, prior approaches1–6 have been too limited to answer the health questions of the general public. Question classification could also be used to assist clinical support staff in answering questions by suggesting a likely set of answer templates, or to provide metadata for questions on the web, so that questions posted in social media could be linked to similar questions or to sources on the web that might provide answers.7–9 Existing approaches to classifying health questions have two critical limitations: most often they have been developed using manually constructed corpora rather than questions posed by people who had a true information need, and they address only simple categorizations, such as factoids in the general domain. Our research, by contrast, starts with a new collection of authentic questions, develops a new, non-factoid taxonomy from it, and tests the hypothesis that a properly trained classifier can provide an effective means of identifying the information need expressed by a question. Creating such a classifier requires addressing the challenges of real data, including errors in spelling or grammar and the likelihood that the natural distribution of people’s information needs will not be uniform.
Question classification is a special case of document classification in which each question is considered a document and the classification is a label for the type of answer that the questioner is expecting.10–12 This class is referred to as the Expected Answer Type (EAT) of a question. The EAT, in conjunction with the topic of the question, can be used as input to an information extraction process (such as for open-domain question answering) or as a search key for real-time matching of questions to stored answers for closed-domain question-answering. Question classification requires having an appropriate taxonomy of EATs. Prior taxonomies for question-classification1–4 have been developed using synthetic questions (imagined by researchers) or from questions asked by clinical experts. Also, the questions have been limited to simple types (people, places, organizations, names of drugs, and so on), referred to as “factoids.” As questions from the general public might require a complex description and not just a simple entity, we investigated developing a new taxonomy that could make such distinctions. To obtain natural consumer-generated questions, we collected questions found on community-based question answering (cQA) websites. 13 An informal review of these data quickly revealed that no existing taxonomy would cover the range of questions posted by the public. Thus, collecting a natural corpus and creating such a taxonomy was essential.
We were most interested in assessing the feasibility of classifiers to support automated QA. Automated QA, while not new, is undergoing a resurgence, most notably with IBM’s Watson.14–16 QA is very appealing to information seekers because it saves them from searching for information or having to wait until a person can respond to their inquiry. Documents for disseminating health information, such as paper booklets or public websites, generally aim to serve a broad audience and thus present far more information than an individual requires, forcing many steps to get to the needed information. Automated QA is also appealing to information providers because it allows them to address the specific needs of their clients without consuming too many resources. Consumer-support services, including telephone answer-lines, one-on-one phone calls, and even web-based email, typically require at least one full-time staff person, and that person may need very specialized expertise (such as Medela’s lactation consultant).17,18
Automated QA would be especially useful for disseminating public health information, because it can be deployed to a broad audience using mobile phones, which we have found to be more prevalent than Internet use among some low-income populations.19–21 We have conducted both survey and observational studies and consistently found that many participants preferred the idea of asking health questions by text messages, rather than by conducting a web-search. A key concern is how to recognize the information need expressed by the question quickly and accurately, which led us to this investigation of question classifiers.
The research described here tests the hypothesis that a classifier trained on real users’ questions can provide an effective means of identifying the information need expressed by a question. There are three main parts to this research. First, we collected cancer questions from cQA websites 13 and performed some manual filtering to obtain relevant questions. Second, we iteratively built a taxonomy to divide questions by EAT and coded a sample of questions. Third, we created a set of test classifiers using our corpus as training and test data, and our taxonomy as our set of classifications, focusing on Supervised Machine Learning (SML) based techniques.11,12 The most common SML techniques used in question classification are statistical approaches, decision trees, and vector space algorithms, 11 but none of these techniques has been evaluated with real data. Real data are challenging for many SML techniques, because similar terms may appear in numerous forms and there may be an uneven distribution among the types. To address these concerns, we considered the performance of the basic classifiers and several simple methods to improve classifier performance (such as dimensionality reduction),11,22 as well as more sophisticated approaches, such as iterative resampling of the data.23,24 The results provide valuable insight for further research involving natural datasets.
Methods
This work addresses two main concerns: first, how can one develop a reliable coding scheme for consumers’ questions, and second, how well do different classifiers work after being trained on natural questions asked by the general public to address a true information need? This evaluation required obtaining a suitable corpus, creating a taxonomy, manually coding progressively larger samples of questions, and then using the final corpus to test a wide variety of well-known classifiers and techniques for improving their performance. Below we elaborate on each of these steps.
Building a corpus and a taxonomy
The first step was to build a corpus of natural questions coded using a taxonomy of EATs suitable for answering questions consumers have about cancer. We developed both together using an interleaved, iterative process. Using a large set of harvested questions that had been promoted and verified (as described below), each iteration involved coding a sample with the current version of the taxonomy, assessing inter-rater reliability, and extending and refining the taxonomy. Refinement stopped when no additional improvement in reliability was noted.
Question collection
To gather questions, we created a web-crawling application to visit selected sites. When crawling sites not restricted to cancer, the application uses keyword matching to identify relevant questions. The saved data are then imported into a Structured Query Language (SQL) database for processing and classification. The cQA websites crawled for our corpus were All Experts, 25 The American Society of Clinical Oncology, 26 The Cleveland Clinic, 27 Med Help, 28 Net Wellness, 29 and Your Cancer Questions. 30
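As a rough illustration of the relevance check, the following sketch flags a harvested item when its text contains any term from a small cancer keyword list. The class name, keyword list, and method here are hypothetical stand-ins for the crawler's actual matching rules, and the SQL import is omitted.

    import java.util.List;
    import java.util.Locale;

    // Minimal sketch (not the production crawler): flags a harvested item as
    // potentially cancer-related if its text contains any keyword from an
    // illustrative seed list.
    public class RelevanceFilter {
        private static final List<String> KEYWORDS = List.of(
                "cancer", "tumor", "chemotherapy", "oncology", "radiation", "biopsy");

        public static boolean isPotentiallyRelevant(String itemText) {
            String lower = itemText.toLowerCase(Locale.ROOT);
            return KEYWORDS.stream().anyMatch(lower::contains);
        }

        public static void main(String[] args) {
            System.out.println(isPotentiallyRelevant(
                    "What are the side effects of chemotherapy?")); // prints true
        }
    }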
Question promotion and verification
Our automated method for gathering questions was noisy. Some of the “questions” contained thousands of characters of background description, and it is beyond the scope of our work to filter this material automatically. Also, some of the questions were conjunctions of several questions, which again we do not aim to recognize or divide automatically as part of the current effort. Other issues we noted were items related to non-human species and items that were not questions in the rhetorical sense. To address these problems, we added some automated filtering based on length and a manual review of all items to filter out ones not relevant to the task, remove material not essential to the question, and subdivide compound questions. After a coder promoted a question, a second coder verified it to assure that it met our criteria. This process of promotion, verification, and coding was performed eight times on progressively larger samples of data, starting with a 15-item sample and ending with a sample of about 1500 items, as the taxonomy and coding protocols were refined. (The first two coding rounds, using the smallest samples, were considered pilot studies, and thus involved only a single coder.)
EAT Taxonomy construction
From the start, we structured the new taxonomy as a multi-level hierarchy, where the top-level categories are Factual, PatientSpecific, and NonClinician, as this suited our intended QA application; however, all coding was done at the level of terminal categories, which we refined over time. Factual questions are questions that can be answered directly with medical facts by an automated QA system. This includes traditional factoid questions as well as more complex information needs, such as “how” and “why” questions. PatientSpecific questions are also medical questions, except that they are about a specific person’s condition or treatment. These questions can only be answered by a clinician, so a QA system should either suggest consulting a provider or forward the question to a provider. NonClinician questions ask for non-medical information related to cancer. For example, questions about health insurance, legal issues, or emotional needs belong in this category. Such questions might also be deemed outside the scope of the system and addressed via an external resource, possibly through social media. The more specific categories of the taxonomy cover subtypes of questions, such as questions about the meaning of a term or an explanation of some medical procedure. The organization and rationale for the taxonomy were explained to the coders to help them discriminate among labels that might otherwise seem similar.
Taxonomy refinement
We used inter-coder agreement statistics31,33 measured from classifications of samples of our corpus as feedback to inform taxonomy construction. In this manner, we could objectively measure the impact of each revision of the taxonomy. Coders would independently classify a subset of the promoted questions by hand, using a software tool that we developed. The tool allowed them to select codes from a dropdown menu, eliminating the possibility of typographical errors in the coding. Inter-coder agreement statistics were then calculated and meetings were held to discuss questions and categories with low inter-coder agreement scores. The goal was to achieve an agreement of 0.70, which is generally considered reliable for a coding scheme. In the early stages, the most common reasons for low agreement were misinterpretation of the category meaning on the part of coders, an ambiguity between two or more categories in the taxonomy, or a gap in the taxonomy where a question did not fit into any category. After discussion, the taxonomy or the coding protocol was revised accordingly. After each revision, the manual classification task was restarted using the revised taxonomy. Altogether, there were two pilot studies (single coder) and six iterations (multiple coders). Table 1 shows the size of the raw samples used in each step and the number of items that were subsequently coded after promotion and verification.
Table 1. Sample sizes for the coding tasks.
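The agreement statistic itself is simple to compute. The sketch below assumes Cohen's kappa for two coders (the specific statistic is not named above, so this is an illustrative assumption), where codesA and codesB hold the two coders' labels for the same set of questions.

    import java.util.HashMap;
    import java.util.Map;

    // Sketch of a two-coder agreement computation, assuming Cohen's kappa.
    public class Agreement {
        public static double cohensKappa(String[] codesA, String[] codesB) {
            int n = codesA.length;
            Map<String, Integer> countA = new HashMap<>(), countB = new HashMap<>();
            int observed = 0;
            for (int i = 0; i < n; i++) {
                if (codesA[i].equals(codesB[i])) observed++;
                countA.merge(codesA[i], 1, Integer::sum);
                countB.merge(codesB[i], 1, Integer::sum);
            }
            double po = (double) observed / n;   // observed agreement
            double pe = 0.0;                     // chance agreement
            for (Map.Entry<String, Integer> e : countA.entrySet()) {
                int b = countB.getOrDefault(e.getKey(), 0);
                pe += ((double) e.getValue() / n) * ((double) b / n);
            }
            return (po - pe) / (1 - pe);         // target in this study was 0.70
        }
    }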
Filtered Taxonomy construction
While there is no fixed standard for the minimum number of examples needed to train an SML-based classifier, as it depends on specifics of the data, we felt that for a dataset of around 1300 questions one should have at least 25 instances of each class. Since a few of the classes in the final taxonomy had far fewer, we wrote a function that automatically mapped this taxonomy onto a smaller one in which similar classes with low question counts were merged. We refer to this as the Filtered Taxonomy. We then used this version to test classifiers.
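As an illustration of that mapping step, the sketch below folds any class with fewer than 25 coded questions into a hand-specified similar class. The merge table and class names are hypothetical and do not reproduce the actual Filtered Taxonomy of Figure 2.

    import java.util.HashMap;
    import java.util.Map;

    // Sketch of the class-merging step behind the Filtered Taxonomy.
    public class FilteredTaxonomy {
        private static final int MIN_INSTANCES = 25;

        public static Map<String, String> buildMapping(Map<String, Integer> classCounts,
                                                       Map<String, String> mergeTarget) {
            Map<String, String> mapping = new HashMap<>();
            for (Map.Entry<String, Integer> e : classCounts.entrySet()) {
                String cls = e.getKey();
                // Keep well-populated classes; fold sparse ones into a similar class.
                mapping.put(cls, e.getValue() >= MIN_INSTANCES
                        ? cls
                        : mergeTarget.getOrDefault(cls, cls));
            }
            return mapping;
        }
    }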
Training statistical classifiers
The second main part of this effort was to test different configurations of classifiers. All testing was done using Weka, 34 along with some functions of Lucene, 35 following widely used algorithms for text classification,1,3,11,12,36,37 as well as other techniques to address problems in question classification, such as sparseness of terms or unbalanced distribution of classes.11,12,23,24,38,39
Tests of classifier algorithms
For the basic classifiers, we tested naive Bayes (NB), 39 multinomial naive Bayes (MNB), 41 J48 trees, 42 and a sequential minimal optimization (SMO) implementation of support vector machines (SVM), 43 which is faster than LibSVM 44 and comparable to other implementations. 45 We used a linear kernel for SMO for the entirety of our testing, as pilot trials with radial and quadratic kernels performed consistently worse. All classifiers were tested with 10-fold cross-validation12,33 using the Filtered Taxonomy shown in Figure 2. We tested both the Level 1 Taxonomy Distribution (comprising just the top-level categories) and the Terminal Distribution (comprising all bottom-level categories).
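The sketch below shows the shape of one such run using Weka's Java API. It assumes the coded questions have been exported to an ARFF file (the file name questions.arff and its layout, one string attribute of question text plus a nominal class, are assumptions, not artifacts of this study); SMO's default polynomial kernel with exponent 1 corresponds to the linear kernel used here.

    import java.util.Random;
    import weka.classifiers.Evaluation;
    import weka.classifiers.functions.SMO;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;
    import weka.filters.Filter;
    import weka.filters.unsupervised.attribute.StringToWordVector;

    // Sketch of a single SMO run with 10-fold cross-validation in Weka.
    public class Level1Classification {
        public static void main(String[] args) throws Exception {
            Instances raw = new DataSource("questions.arff").getDataSet();
            raw.setClassIndex(raw.numAttributes() - 1);

            // Bag-of-words representation of the question text.
            StringToWordVector bow = new StringToWordVector();
            bow.setInputFormat(raw);
            Instances data = Filter.useFilter(raw, bow);

            // SMO's default polynomial kernel with exponent 1 is a linear kernel.
            SMO smo = new SMO();
            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(smo, data, 10, new Random(1));
            System.out.printf("Accuracy: %.1f%%  F1: %.3f  ROC: %.3f%n",
                    eval.pctCorrect(), eval.weightedFMeasure(), eval.weightedAreaUnderROC());
        }
    }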
Tests of dimensionality reduction
Since lexical variation among semantically similar questions is common, we tested several dimensionality reduction techniques. Here, dimensionality reduction involves removing terms from questions to improve performance without compromising accuracy. We implemented two Local Relevancy latent semantic indexing (LSI) techniques similar to a Ladder-Weighted LSI. 22 We also tested automated spelling correction and feature replacement. We first tried a flat-threshold Local Relevancy LSI technique, where a term was trimmed from the corpus if it appeared in more than a certain percentage of questions in every category. We tested thresholds in the range of 5–25 percent, in increments of 5 percent (there were no terms that appeared in 30 percent or more of questions in every category). The second Local Relevancy LSI technique we tried used a range of incidence rates. A term was trimmed from the corpus if its incidence rate across all categories fell within a specified percentage of the mean incidence rate. This eliminated terms that had a similar rate of occurrence across all categories, not just a high rate. We tested ranges of 30–50 percent of the mean in increments of 5 percent. (We did not use the LSI12,22 attribute selection in Weka, 34 as it required too much memory.)
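A minimal sketch of the flat-threshold rule follows: for each term, it computes the fraction of questions containing the term in each category and trims the term only if that fraction exceeds the threshold in every category. Tokenization here is naive whitespace splitting, an assumption for illustration only.

    import java.util.*;

    // Sketch of the flat-threshold trimming rule (threshold of 0.05-0.25 tested).
    public class FlatThresholdTrimmer {
        public static Set<String> termsToTrim(Map<String, List<String>> questionsByCategory,
                                              double threshold) {
            // term -> category -> number of questions containing the term
            Map<String, Map<String, Integer>> docFreq = new HashMap<>();
            for (var entry : questionsByCategory.entrySet()) {
                for (String q : entry.getValue()) {
                    for (String term : new HashSet<>(Arrays.asList(q.toLowerCase().split("\\s+")))) {
                        docFreq.computeIfAbsent(term, t -> new HashMap<>())
                               .merge(entry.getKey(), 1, Integer::sum);
                    }
                }
            }
            Set<String> trimmed = new HashSet<>();
            for (var e : docFreq.entrySet()) {
                boolean aboveInEveryCategory = questionsByCategory.entrySet().stream()
                        .allMatch(cat -> e.getValue().getOrDefault(cat.getKey(), 0)
                                > threshold * cat.getValue().size());
                if (aboveInEveryCategory) trimmed.add(e.getKey());
            }
            return trimmed;
        }
    }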
We tested spelling correction with Lucene’s 35 built-in spell-checking functionality. The Lucene spell checker uses a dictionary text file as input for the correct spelling of words and outputs a list of possible corrected spellings for each word in a document that does not match a word in the dictionary. We combined the Ispell 45 standard American English dictionary as compiled by WordList 46 and the Consumer Health Vocabulary 47 dictionaries into a single input file. To further reduce the number of terms, we also tested feature replacement strategies, including replacing sequences of digits with the token #NUMBER, replacing drug names with the token #DRUG, and dates with the token #DATE, using a method similar to Juan et al. 49
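The feature-replacement step can be sketched with simple pattern substitution, as below. The date pattern is a simplified assumption, and drug-name replacement, which would require a drug lexicon, is omitted here.

    import java.util.regex.Pattern;

    // Sketch of the feature-replacement step: dates first, then remaining digits.
    public class FeatureReplacer {
        private static final Pattern DATE = Pattern.compile("\\b\\d{1,2}/\\d{1,2}/\\d{2,4}\\b");
        private static final Pattern NUMBER = Pattern.compile("\\b\\d+\\b");

        public static String replace(String question) {
            String s = DATE.matcher(question).replaceAll("#DATE");
            return NUMBER.matcher(s).replaceAll("#NUMBER");
        }

        public static void main(String[] args) {
            System.out.println(replace("Diagnosed on 3/14/2012 with a 2 cm tumor"));
            // -> "Diagnosed on #DATE with a #NUMBER cm tumor"
        }
    }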
Tests of weighted classification
To address the lack of uniformity in the distribution of question types, we tested a weighted SVM. 38 To construct a weighted SVM, we used Weka’s LibSVM library with its default settings, calculated the weights based on the inverse percentage of each class’s appearance in the dataset, and multiplied that result by 100, an empirically chosen constant.
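The weight calculation itself is shown below, using the Level 1 class counts from the Results section purely for illustration (the weighted SVM was actually applied to the Filtered Taxonomy). It treats "percentage" as a fraction of the dataset, an interpretive assumption; the resulting values would be supplied to the per-class weight option of Weka's LibSVM wrapper, whose exact setter varies by version and is therefore not shown.

    import java.util.LinkedHashMap;
    import java.util.Map;

    // Sketch of the class-weight calculation: inverse of each class's share
    // of the dataset, multiplied by 100.
    public class ClassWeights {
        public static Map<String, Double> inverseFrequencyWeights(Map<String, Integer> classCounts) {
            int total = classCounts.values().stream().mapToInt(Integer::intValue).sum();
            Map<String, Double> weights = new LinkedHashMap<>();
            classCounts.forEach((cls, count) ->
                    weights.put(cls, (1.0 / ((double) count / total)) * 100.0));
            return weights;
        }

        public static void main(String[] args) {
            // Level 1 counts from the Results section, used only as an example.
            Map<String, Integer> counts =
                    Map.of("Factual", 561, "PatientSpecific", 613, "NonClinician", 105);
            System.out.println(inverseFrequencyWeights(counts));
        }
    }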
Tests of resampling
We also tested resampling,23,24 another method for handling imbalanced data. We used Weka’s supervised instance resampling filter. This filter was originally designed to support classification of large datasets by pulling out a portion of the data to generate a model. The filter can be “biased” to move the subset closer to a uniform class distribution by over-sampling under-represented classes and under-sampling over-represented classes. By keeping all the data, but adding a bias toward a more uniform class distribution, the filter balances a previously skewed dataset without reducing the overall sample size. We tested the filter using three values for bias: 1.0, 0.5, and 0.1.
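A sketch of this step with Weka's supervised Resample filter follows. The helper assumes that data is already the bag-of-words Instances with the class index set, and the repeated application mirrors the five iterations reported in the Results; the class and method names are illustrative.

    import weka.core.Instances;
    import weka.filters.Filter;
    import weka.filters.supervised.instance.Resample;

    // Sketch of rebalancing with Weka's supervised Resample filter.
    public class Rebalance {
        public static Instances resample(Instances data, double bias, int applications)
                throws Exception {
            Instances current = data;
            for (int i = 0; i < applications; i++) {
                Resample filter = new Resample();
                filter.setBiasToUniformClass(bias);   // 1.0 = fully biased toward uniform
                filter.setSampleSizePercent(100.0);   // keep the overall sample size
                filter.setRandomSeed(i + 1);
                filter.setInputFormat(current);
                current = Filter.useFilter(current, filter);
            }
            return current;
        }
    }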
Results
Test corpus
We collected items from six well-known health information websites that answer and archive consumer-generated questions. Formats for these sites include live chats, user-answered forums, and e-mailed or web-form-submitted questions. Our application retrieved more than 50,000 potentially relevant items. For the final testing of the classifiers, from a sample of 7692 items, a total of 1904 Raw Questions were processed, with 757 (39.7%) rejected by coders. The 1147 remaining Raw Questions yielded 1503 Promoted Questions. All 1503 Promoted Questions were classified by two coders, with 1279 placed in the final classified corpus. The remaining 224 (15%) questions were rejected by coders (during classification) or administrators (post-classification). These rejected questions were determined to have been promoted improperly, either because they had been edited or split incorrectly, or because they were not questions related to cancer in humans. The final Level 1 Distribution agreement was close to 0.7 and the final terminal-level distribution agreement was 0.55, which we deemed sufficient, given our task. 49 After these scores were calculated, examples with low agreement were discussed and the consensus coding was used. There were no relevant questions that coders were unable to place in a Level 1 Category.
Final EAT Taxonomy
Figures 1 and 2 show the final versions of the Full and Filtered Taxonomies, respectively. The Level 1 Distribution of classes was Factual 44 percent (N = 561), PatientSpecific 48 percent (N = 613), and NonClinician 8 percent (N = 105). The question count for the Level 1 Distribution shows that even with the largest subgroups possible, the corpus is still significantly unbalanced. In the Terminal Distribution, three categories, PatientRecommendation, EntityExplanation, and PatientExplanation, comprised over 57 percent of the data, while three others, NumericPropertyValue, Reference, and Definition, comprised only 8.33 percent of the data. The ratio of the largest class to the smallest in the Level 1 Distribution is 5.84, whereas in the Terminal Distribution it is 13.23, another indicator of the large degree of imbalance we found. The distribution of classes for the Filtered Taxonomy is shown in Figure 3.

Figure 1. Full Taxonomy.

Figure 2. Filtered Taxonomy.

Figure 3. Class distribution in the unbiased Filtered Taxonomy.
Results of testing with the Level 1 Taxonomy
The percentage of correctly classified questions for all configurations of the Level 1 Distribution classifiers is shown in Table 2. The first row, marked U, corresponds to the corpus without any transformations applied. The second row, marked SC, corresponds to the transformed corpus obtained by applying the Lucene spell checker alone. Corpora with the threshold transform applied are labeled Txx, where xx is the threshold percentage for trimming a term from the corpus. Similarly, corpora with the range transform applied are labeled Rxx, where xx is the maximum percentage deviation from the mean incidence rate within which a term’s incidence rates must fall for the term to be trimmed from the corpus.
Table 2. Classifier accuracy for the Level 1 Distribution.
U: unmodified corpus; SC: corpus with only spelling correction applied; NB: naive Bayes; MNB: multinomial naive Bayes; SMO: sequential minimal optimization.
As Table 2 shows, none of the algorithms performs significantly better than the others on our classification task. SMO and multinomial naive Bayes (MNB) slightly outperformed J48 and NB with our corpus. Dimensionality reduction, including spelling correction, term trimming, and feature replacement, also had little impact and, upon investigation, was found to have removed very few terms (less than 5%, and often much less).
Results of testing with the Filtered Taxonomy
Accuracy results for dimensionality reduction configurations of the Filtered Taxonomy are shown in Table 3. The comparative results among classifier algorithms are similar to those in the Level 1 Distribution, albeit lower. SMO and MNB outperformed J48 and NB; however, the magnitude of the difference was small. The results of using SMO with feature replacement similarly showed an insignificant improvement (F1 = 0.518; receiver operating characteristic (ROC) = 0.825) over the best prior configuration.
Table 3. Classifier accuracy for the Filtered Taxonomy.
NB: naive Bayes; MNB: multinomial naive Bayes; SMO: sequential minimal optimization.
The results of using a weighted SVM showed some improvement over applying dimensionality reduction techniques. Using only the basic approach (default settings, weights all calculated based on the inverse percentage of a class’s appearance, and multiplication of the result by 100) yielded an F1 score of 0.565 (with ROC only 0.737). Since this improvement still seemed minor, we did not pursue this method further, although there are many ways to tune a weighted SVM.
Resampling the data with a bias of 1.0 and using the SMO classifier without other dimensionality reductions led to the best results overall. Figure 4 shows the class distribution after rebalancing. The smoothing greatly improved the effectiveness of the classifier, to F1 = 0.846, suggesting that the imbalance of the data was the main cause of the low accuracy. Running the data with a bias of 0.5 and 0.1 yielded similar results (F1 = 0.793 and F1 = 0.789, respectively), indicating that even with relatively little balancing, the results are significantly improved. Repeated resampling of the data further improved the results, to F1 = 0.963 (ROC = 0.985), after five applications.

Figure 4. Distribution of classes in the Filtered Taxonomy re-sampled with bias = 1.0.
Discussion
Our results with a wide range of basic classifiers appear to be similar to those previously reported for classifiers trained on idealized (artificially created) questions for the open domain.35,46,50 This result suggests that while a classifier should be trained on a corpus of natural data, improvements to classifiers made on the basis of artificial data are likely to generalize to real data as well. However, because collections of natural questions are not likely to be as balanced as artificial datasets, additional statistical methods may be needed to compensate for the imbalance.
Classifiers trained for the Filtered (terminal) Taxonomy were initially much less accurate than classifiers trained on the Level 1 Distribution. This difference was not surprising given the lower levels of inter-coder agreement and the less uniform distribution of classes. Imbalance appears to have been the most significant factor, however, as repeated statistical rebalancing was effective in improving classifier accuracy.
Prior work on question classification did not seem to consider the possible impact of dimensionality reduction or spelling correction, so we wanted to test them. However, our results suggest that these methods have little impact on question-classification results. We examined these results by hand to learn why the methods had such a small impact. From our sample of about 1300 items, there were 584 instances where a term was changed by our spelling correction. Although term counts changed, a closer inspection of the corrected terms revealed that the spelling correction methods used were just as likely to create a new error as to correct one. For example, a user misspelled “chemotherapy” as “quimotherapy,” and it was corrected to “biotherapy.” Search engines make use of soundex algorithms (phonetic matching), which would be useful here as well. Replacing drug names with a general category also seems to have had a detrimental effect, possibly because specific drug names are more common in some categories than others.
We acknowledge the limitations of this work. Our results might be improved by coding more questions. Coding additional questions might have provided a training corpus with a more even distribution of questions; however, it is also possible that the distribution of question types is naturally uneven. Regarding the relevance of the training corpus for supporting interactive QA, although we selected questions submitted to websites that are similar in length to those that might be used in a QA dialog, most of the questions we found were embedded in longer narratives from which we manually extracted the questions. The degree to which these questions resemble questions sent in a more interactive modality (such as SMS) is still under investigation. In a related study, we examined the questions participants asked using text messages to be answered by a clinician. 21 These questions were found to be very similar to the cQA questions in our corpus in grammar and lexical choice. Additional questions might also be obtained from social media sites with length limits, such as Twitter.
Conclusion
We have developed a new cancer question corpus and EAT Taxonomy, as well as some tools for doing manual classification. The data confirm that people have a wide range of questions, but there may be a natural imbalance in the types of questions that they ask, creating challenges for training a classifier. With the application of methods to address imbalance in the distribution of question types, however, we found that statistical classification methods can be effective, especially for the task of discriminating between Factual and PatientSpecific questions. Hence, these methods could be used to develop an effective automated QA system or to create a system that filters Factual questions and refers the remainder to a person for an appropriate response.
Acknowledgements
The authors thank our colleagues at University of Wisconsin–Milwaukee who have read earlier versions of this work, including Dr Rashmi Prasad and Dr Jun Zhang. The authors thank Majid Rastegar-Mojarad, for his suggestions to look at weighted SVMs and resampling and his help in using Weka. The authors also thank Zong Xiong and Zahrah Dillard for their assistance.
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
