Txttool: Utilities for Text Analysis in Stata

Abstract

This article describes txttool, a command that provides a set of tools for managing free-form text. The command integrates several built-in Stata functions with new text capabilities. The new capabilities include a utility to create a bag-of-words representation of text and an implementation of Porter's (1980, Program: Electronic library and information systems 14: 130–137) word-stemming algorithm. Collectively, these utilities provide a text-processing suite for text mining and other text-based applications in Stata.

Keywords

dm0077, txttool, text mining, Porter stemmer, bag of words, cleaning, stop words, subwords

References

Belotti and Depalo. 2010. Translation from narrative text to standard codes variables with Stata. Stata Journal 10: 458–481.

Benoit, Laver, and Mikhaylov. 2009. Treating words as data with error: Uncertainty in text statements of policy positions. American Journal of Political Science 53: 495–513.

Grimmer and Stewart, B. M. 2013. Text as data: The promise and pitfalls of automatic content analysis methods for political texts. Political Analysis 21: 267–297.

Laver, Benoit, and Garry. 2003. Extracting policy positions from political texts using words as data. American Political Science Review 97: 311–331.

Lowe and Benoit. 2013. Validating estimates of latent traits from textual data using human judgment as a benchmark. Political Analysis 21: 298–313.

Porter, M. F. 1980. An algorithm for suffix stripping. Program: Electronic library and information systems 14: 130–137.

Raciborski. 2008. kountry: A Stata utility for merging cross-country data from multiple sources. Stata Journal 8: 390–400.