Estimating the prevalence and diversity of words in written language

Abstract

Recently, a new crowd-sourced language metric called word prevalence has been introduced, which estimates the proportion of the population that knows a given word. This measure has been shown to account for unique variance in large sets of lexical performance data. This article aims to build on the work of Brysbaert et al. and Keuleers et al. by introducing new corpus-based metrics that estimate how likely a word is to be an active member of the natural language environment, and hence known by a larger subset of the general population. The metrics are derived from an analysis of a newly collected corpus of over 25,000 fiction and non-fiction books and are shown to account for significantly more variance than past corpus-based measures.

Keywords

Lexical organisation; semantic diversity; big data; corpus studies

References

Adelman, J. S., Brown, G. D. A., & Quesada, J. F. (2006). Contextual diversity, not word frequency, determines word-naming and lexical decision times. Psychological Science, 17(9), 814–823.

Adelman, J. S., Sabatos-DeVito, M. G., Marquis, S. J., & Estes (2014). Individual differences in reading aloud: A mega-study, item effects, and some models. Cognitive Psychology, 68, 113–160.

Allen, L. R., & Garton, R. F. (1968). The influence of word-knowledge on the word-frequency effect in recognition memory. Psychonomic Science, 10, 401–402.

Baayen, R. H., Feldman, L. B., & Schreuder (2006). Morphological influences on the recognition of monosyllabic monomorphemic words. Journal of Memory and Language, 55, 290–313.

Baayen, R. H., Milin, & Ramscar (2016). Frequency in lexical processing. Aphasiology, 30, 1174–1220.

Balota, D. A., Yap, M. J., Cortese, M. J., Hutchison, K. A., Kessler, Loftis, . . . Treiman (2007). The English Lexicon Project. Behavior Research Methods, 39, 445–459.

Brysbaert, Mandera, McCormick, S. F., & Keuleers (2019). Word prevalence norms for 62,000 English lemmas. Behavior Research Methods, 51, 467–479.

Brysbaert & New (2009). Moving beyond Kučera and Francis: A critical evaluation of current word frequency norms and the introduction of a new and improved word frequency measure for American English. Behavior Research Methods, 41, 977–990.

Brysbaert, Stevens, Mandera, & Keuleers (2016). The impact of word prevalence on lexical decision times: Evidence from the Dutch Lexicon Project 2. Journal of Experimental Psychology: Human Perception and Performance, 42, 441–458.

Estes, W. K. (1975). Some targets for mathematical psychology. Journal of Mathematical Psychology, 12, 263–282.

Gardner, R. C., Lalonde, R. N., Moorcroft, & Evers, F. T. (1987). Second language attrition: The role of motivation and use. Journal of Language and Social Psychology, 6, 29–47.

Gernsbacher, M. A. (1984). Resolving 20 years of inconsistent interactions between lexical familiarity and orthography, concreteness, and polysemy. Journal of Experimental Psychology: General, 113(2), 256–281.

Hoffman, Ralph, M. A. L., & Rogers, T. T. (2013). Semantic diversity: A measure of semantic ambiguity based on variability in the contextual usage of words. Behavior Research Methods, 45, 718–730.

Hsiao & Nation (2018). Semantic diversity, frequency and the development of lexical quality in children’s word reading. Journal of Memory and Language, 103, 114–126.

Johns, B. T., & Dye, M. W. (2019). Gender bias at scale: Evidence from the usage of personal names. Behavior Research Methods, 51, 1601–1618.

Johns, B. T., Dye, & Jones, M. N. (2014). The influence of contextual variability on word learning. In Bello, Guarani, McShane, & Scassellati (Eds.), Proceedings of the 36th Annual Conference of the Cognitive Science Society (pp. 242–247). Austin: Cognitive Science Society.

Johns, B. T., Dye, M. W., & Jones, M. N. (2016a). The influence of contextual diversity on word learning. Psychonomic Bulletin & Review, 23, 1214–1220.

Johns, B. T., Gruenenfelder, T. M., Pisoni, D. B., & Jones, M. N. (2012). Effects of word frequency, contextual diversity, and semantic distinctiveness on spoken word recognition. Journal of the Acoustical Society of America, 132(2), EL74–EL80.

Johns, B. T., & Jamieson, R. K. (2018). A large-scale analysis of variance in written language. Cognitive Science, 42, 1360–1374.

Johns, B. T., & Jamieson, R. K. (2019). The influence of time and place on lexical behavior: A distributional analysis. Behavior Research Methods, 51, 2438–2453.

Johns, B. T., Jamieson, R. K., & Jones, M. N. (in press). The continued importance of theory: Lessons from big data approaches to cognition. In Woo, S. E., Proctor, & Tay (Eds.), Big data in psychological research. APA Books.

Johns, B. T., & Jones, M. N. (2010). Evaluating the random representation assumption of lexical semantics in cognitive models. Psychonomic Bulletin & Review, 17, 662–672.

Johns, B. T., Jones, M. N., & Mewhort, D. J. K. (2012). A synchronization account of false recognition. Cognitive Psychology, 65, 486–518.

Johns, B. T., Jones, M. N., & Mewhort, D. J. K. (2019). Using experiential optimization to build lexical representations. Psychonomic Bulletin & Review, 26, 103–126.

Johns, B. T., Mewhort, D. J. K., & Jones, M. N. (2017). Small worlds and big data: Examining the simplification assumption in cognitive modeling. In Jones, M. N. (Ed.), Big data in cognitive science: From methods to insights (pp. 227–245). Routledge.

Johns, B. T., Sheppard, Jones, M. N., & Taler (2016b). The role of semantic diversity in lexical organization across aging and bilingualism. Frontiers in Psychology, 7, Article 703.

Jones, M. N., Dye, & Johns, B. T. (2017). Context as an organizational principle of the lexicon. In Ross (Ed.), The psychology of learning and motivation (Vol. 67, pp. 239–283). Academic Press.

Jones, M. N., Hills, T. T., & Todd, P. M. (2015). Hidden processes in structural representations: A reply to Abbott, Austerweil, & Griffiths. Psychological Review, 122, 570–574.

Jones, M. N., Johns, B. T., & Recchia (2012). The role of semantic diversity in lexical organization. Canadian Journal of Experimental Psychology, 66, 115–124.

Jones, M. N., & Mewhort, D. J. (2007). Representing word meaning and order information in a composite holographic lexicon. Psychological Review, 114, 1–37.

Keuleers, Lacey, Rastle, & Brysbaert (2012). The British Lexicon Project: Lexical decision data for 28,730 monosyllabic and disyllabic English words. Behavior Research Methods, 44, 287–304.

Keuleers, Stevens, Mandera, & Brysbaert (2015). Word knowledge in the crowd: Measuring vocabulary size and word prevalence in a massive online experiment. Quarterly Journal of Experimental Psychology, 68, 1665–1692.

McDonald, S. A., & Shillcock, R. C. (2001). Rethinking the word frequency effect: The neglected role of distributional information in lexical processing. Language and Speech, 44, 295–322.

Morton (1969). Interaction of information in word recognition. Psychological Review, 76, 165–178.

Ramscar, Hendrix, Shaoul, Milin, & Baayen (2014). The myth of cognitive decline: Non-linear dynamics of lifelong learning. Topics in Cognitive Science, 6, 5–42.

Recchia & Jones, M. N. (2012). The semantic richness of abstract concepts. Frontiers in Human Neuroscience, 6, 315.

Schmidt, S. R. (1991). Can we have a distinctive theory of memory? Memory & Cognition, 19, 523–542.

van Heuven, W. J., Mandera, Keuleers, & Brysbaert (2014). SUBTLEX-UK: A new and improved word frequency database for British English. Quarterly Journal of Experimental Psychology, 67, 1176–1190.

Verkoeijen, P. P., Rikers, R. M., & Schmidt, H. G. (2004). Detrimental influence of contextual change on spacing effects in free recall. Journal of Experimental Psychology: Learning, Memory, & Cognition, 30, 796–800.

Wagenmakers, E. J., Ratcliff, Gomez, & McKoon (2008). A diffusion model account of criterion shifts in the lexical decision task. Journal of Memory and Language, 58, 140–159.

Westbury, C. F., Shaoul, Hollis, Smithson, Briesemeister, B. B., Hofmann, M. J., & Jacobs, A. M. (2013). Now you see it, now you don’t: on emotion, context, and the algorithmic prediction of human imageability judgments. Frontiers in Psychology, 4, Article 991.

Wickens, D. D. (1987). The dual meanings of context: Implications for research, theory, and applications. In Gorfein, D. S., & Hoffman, R. R. (Eds.), Memory and learning: The Ebbinghaus Centennial Conference (pp. 135–152). Lawrence Erlbaum.

Yap, M. J., & Balota, D. A. (2009). Visual word recognition of multisyllabic words. Journal of Memory and Language, 60(4), 502–529.
