A hidden Markov model-based text classification of medical documents

Abstract

This study tests the application of a hidden Markov model (HMM) that uses prior knowledge to medical text classification (TC). HMMs have been applied widely in information processing, but comparatively rarely in TC. The Medical Subject Headings (MeSH) vocabulary supplies the prior knowledge for the model. A prototype HMM-based TC model is designed, and an experimental model based on the prototype is implemented to categorize medical documents into MeSH categories. A subset of OHSUMED is used for the experiments. Our results show that the performance of our model is comparable to that reported in the literature.

Keywords

hidden Markov model; HMM; MeSH; text classification; UMLS
