Developing Methods for Very-Large-Scale Searches in ProQuest Historical Newspapers Collection and InfoTrac The Times Digital Archive: The Case of Two Million Versus Two Millions

Abstract

Historical corpora designed for linguistic research are often too small to provide statistically robust information about infrequent items. Alternative sources exist in the form of historical collections available online, but these databases may present methodological problems. Some of these problems can be circumvented, and useful results can be gleaned, including a proxy for incidence. In studies on the integration of the word million into the English system of number words, based on billions of words from historical newspapers, it was possible to determine that parity was reached between obsolescent (two millions) and Present-Day (two million) forms in American papers around 1880 and in The Times around 1920. The explosive growth in the use of million proved to start with WWII in the U.S. and in the 1950s in the U.K. This information could not be teased from a 20-million-word ‘megacorpus’ of commonly used diachronic and synthetic corpora designed by linguists.
