Text Mining with n-gram Variables

Abstract

Text mining is the process of turning free text into numerical variables and then analyzing them with statistical techniques. We introduce the command ngram, which implements the most common approach to text mining, the “bag of words”. An n-gram is a contiguous sequence of words in a text. Broadly speaking, ngram creates hundreds or thousands of variables, each recording how often the corresponding n-gram occurs in a given text. This is more useful than it sounds. We illustrate ngram with the categorization of text answers from two open-ended questions.
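The bag-of-words idea described above can be sketched in a few lines of Python. This is only an illustration of how n-gram count variables arise, not the implementation of the ngram command: the function names `ngrams` and `ngram_variables`, the lowercase and whitespace tokenization, and the two toy texts are all assumptions made for this example.

```python
from collections import Counter


def ngrams(text, n):
    """Return the contiguous n-word sequences in a text.

    Assumption: words are lowercased and split on whitespace only.
    """
    words = text.lower().split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def ngram_variables(texts, max_n=2):
    """Build one count variable per n-gram (n = 1..max_n) seen in any text.

    Returns the sorted list of n-grams (the variable names) and, for each
    text, a row of counts aligned with that list.
    """
    counts = [
        Counter(g for n in range(1, max_n + 1) for g in ngrams(t, n))
        for t in texts
    ]
    vocab = sorted(set().union(*counts))
    rows = [[c[g] for g in vocab] for c in counts]
    return vocab, rows


# Two hypothetical open-ended answers.
texts = ["the cat sat", "the cat saw the cat"]
vocab, rows = ngram_variables(texts, max_n=2)

for text, row in zip(texts, rows):
    print(text, dict(zip(vocab, row)))
```

Each row holds one text's counts, so the number of columns grows with the number of distinct n-grams in the corpus. With real survey answers this quickly reaches the hundreds or thousands of variables mentioned above.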

Keywords

st0502, ngram, bag of words, sets of words, unigram, gram, statistical learning, machine learning
