Abstract
As a result of litigation over the past decade, major tobacco companies were compelled to make public a broad range of previously confidential documents. We have created a series of corpora from the tobacco industry documents (TIDs) for three purposes: (1) to establish baseline descriptions of various linguistic features of this unique set of texts; (2) to identify TIDs in which rhetorical manipulation (“deception”) may have occurred and to estimate the extent and prevalence of manipulation; (3) to analyze manipulation in order to classify it and develop means to identify similar manipulation in other industry document sets. Our three-part corpus creation strategy employed rigorous sampling methods. First, we drew a limited sample from the largest collection of TIDs, to determine a representative classification of text types and to estimate their proportions within the overall body of texts. Then, we created a reference corpus (500,000+ words) constituting a stratified random sample of all TIDs, whether or not they exhibit manipulation. Finally, we compiled a corpus of texts presumed to exhibit rhetorical manipulation. We assumed that multiple drafts of a text or versions of a text prepared for different audiences constituted rhetorical manipulation. This article presents our experience with the sampling methods utilized in this corpus-building process and our findings regarding text types comprising the reference corpus.
