Some Current Quantitative Problems in Corpus Linguistics and a Sketch of Some Solutions

Abstract

This paper surveys a variety of methodological problems in current quantitative corpus linguistics. Some of the problems concern corpus linguistics in general, such as the impact that dispersion, type frequencies/entropies, and directionality have (or should have) on the computation of association measures, as well as the impact that neglecting the sampling structure of a corpus can have on a statistical analysis. Others involve more specialized areas in which corpus-linguistic work is currently booming, such as historical linguistics and learner corpus research. For each problem, initial ideas and pointers toward a resolution are provided and exemplified in some detail.

Keywords

association measures, mixed-effects/multi-level modeling, MuPDAR, token/type frequencies, variability-based neighbor clustering
