Text mining is the process of turning free text into numerical variables and then analyzing them with statistical techniques. We introduce the command ngram, which implements the most common approach to text mining, the “bag of words”. An n-gram is a contiguous sequence of n words in a text; for example, a unigram is a single word and a bigram is a pair of adjacent words. Broadly speaking, ngram creates hundreds or thousands of variables, each recording how often the corresponding n-gram occurs in a given text. This is more useful than it sounds. We illustrate ngram with the categorization of text answers from two open-ended questions.
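The bag-of-words idea can be sketched in a few lines of Python. This is only an illustration of the general technique under simple assumptions (lowercasing, splitting on whitespace), not the implementation of the ngram command. The function names `ngrams` and `bag_of_ngrams`, the parameters `max_n` and `threshold`, and the two example answers are hypothetical and chosen for this sketch.

```python
from collections import Counter


def ngrams(text, n):
    """Return the contiguous n-word sequences in text.

    Assumption for this sketch: words are lowercased and split on whitespace.
    """
    words = text.lower().split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def bag_of_ngrams(texts, max_n=2, threshold=1):
    """Build one count variable per n-gram (n = 1..max_n).

    An n-gram is kept only if it occurs in at least `threshold` texts
    (a hypothetical parameter for this sketch).
    Returns the sorted list of n-grams and one row of counts per text.
    """
    counts = [
        Counter(g for n in range(1, max_n + 1) for g in ngrams(t, n))
        for t in texts
    ]
    # Number of texts in which each n-gram appears at least once.
    doc_freq = Counter(g for c in counts for g in c)
    vocab = sorted(g for g, df in doc_freq.items() if df >= threshold)
    matrix = [[c[g] for g in vocab] for c in counts]
    return vocab, matrix


# Two made-up open-ended answers.
answers = ["the survey was too long", "too long and too boring"]
vocab, matrix = bag_of_ngrams(answers, max_n=2)
```

Each column of `matrix` plays the role of one variable and each row corresponds to one text, so the result can be passed to any statistical learning method that accepts numerical predictors. Even these two short answers produce 14 columns, which suggests why real data can yield hundreds or thousands of variables.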