Vocabulary Demands of Academic Spoken English Revisited: A Case of University Lectures and TED Presentations

Abstract

The article sheds light on the differences in the vocabulary demands of academic spoken discourse among three broad scientific disciplines: Life Sciences, Physical Sciences, and Social Sciences. Using the Academic Word List (AWL) and the British National Corpus/Corpus of Contemporary American English (BNC/COCA) wordlist, the present study analyzed the transcripts of 160 university lectures, 39 seminars, and 600 TED talks. Results from the analysis of the 2.5-million-token corpus demonstrated an order of lexical difficulty in which Life Sciences and Social Sciences were the most and least lexically demanding fields of study, respectively. Research findings also indicated a strong supportive relationship between the AWL and the BNC/COCA wordlist. Learners with limited vocabulary knowledge at the 1,000 and 2,000 levels could significantly increase their lexical coverage for academic lectures, seminars, and presentations with the support of the AWL. For disciplines with low lexical demands like Social Sciences, the vocabulary knowledge of nearly 2,570 word families (BNC/COCA 2,000 + AWL) could help students understand 95.07% of the running words in university lectures and seminars. Research findings offer implications for vocabulary teaching and learning.

Keywords

British Academic Spoken Corpus, TED, Academic Word List, BNC/COCA, lexical coverage

Introduction

In the light of globalization and internationalization in education, English has been adopted as the lingua franca of science and scholarship and is widely used as the medium of instruction at tertiary level. Nowadays, students can opt to study in English-speaking countries, enroll in distance learning courses, or participate in online educational programs offered by international universities regardless of their geographical locations. In addition, universities and educational institutions in non-English-speaking countries are also seeking to adapt to the impact of English on academic life by providing students with academic courses in English. As a result, students of different disciplines now have easy access to a wide variety of English-medium academic courses to enrich their knowledge.

Given the prevalence of English as a means of instruction, comprehending academic English is, no doubt, of critical importance to students’ academic pursuit and success. Van Zeeland and Schmitt (2013) claimed that there is a significant correlation between lexical coverage and listening comprehension; thus, students are expected to master a reasonable number of academic English words to gain a thorough understanding of a particular topic. Apart from reading academic written English in learning materials such as book chapters, research articles, and journals, students are also required to attend lectures, workshops, tutorials, and seminars in which they frequently encounter academic spoken English vocabulary. However, it is quite a challenging task for Second Language (L2) students at English-medium educational institutions to understand academic spoken English in both conventional lectures and authentic academic talks such as TED talks. Listening comprehension involves a complex process of decoding information from spoken texts, so it requires students to have substantial lexical breadth and depth (Matthews & Cheng, 2015); otherwise, messages can be easily misunderstood. Undeniably, an insufficient academic lexical repertoire is one of the major factors contributing to poor listening comprehension, thus affecting students’ academic performance.

To foster students’ comprehension of academic spoken English texts, it is crucial to explore the threshold of vocabulary size needed to understand academic spoken English. This research project aims to determine the spoken lexical coverage that is necessary to understand academic spoken English of different disciplines.

Literature Review

Vocabulary Knowledge, Lexical Demand, and Text Comprehension

The field of applied linguistics has long acknowledged the importance of vocabulary to most, if not all aspects of language (Barcroft, 2007; Cheng & Matthews, 2018; Ha, 2021b; Lewis, 2002; Nurmukhamedov & Webb, 2019; Webb, 2020). For receptive language skills including listening and reading, vocabulary could be considered to be the decisive factor to learners’ text comprehension (Lange & Matthews, 2020; Laufer & Ravenhorst-Kalovski, 2010; Qian & Lin, 2020; Schmitt et al., 2011; van Zeeland & Schmitt, 2013). Questions regarding the relationship between vocabulary knowledge and comprehension have always been around the notion of lexical coverage, or the proportion of words learners need to know to adequately understand a text (Nation, 2006). In fact, the relationship between lexical coverage and comprehension has been of interest to teachers for a significant period of time (Laufer, 2021). The problem was that teachers did not have any evidence-backed answers to rely on, and as Laufer (2021, p. 238) pointed out, most of the teachers in the 1980s had to work on the assumption that 80% coverage would lead to reasonable comprehension. Further studies on the issue (Hu & Nation, 2000; Laufer, 1989, 2013; Schmitt et al., 2011) established the two lexical coverage thresholds for acceptable comprehension (95%) and optimal comprehension (98%).

It is interesting that a small gap of 3% of the running words in a text could represent a huge gap in lexical knowledge. To be specific, Hu and Nation (2000) suggested if learners were to know 95% of the tokens in a 1,000-word text, they would encounter 50 unfamiliar words (1 unknown word per every 20 words). However, when learners were able to reach the 98% coverage threshold, the ratio between unknown/ known words would be down to 1/50, equaling 20 unknown words for a 1,000-word text (Hu & Nation, 2000). Since the 98% coverage reduces more than half of the unknown words encountered at the 95% threshold, learners at the 95% coverage threshold normally have to double or even triple their existing lexical resource in order to achieve the 98% coverage for the same type of text (Al-Surmi, 2014; Coxhead & Walls, 2012; Dang & Webb, 2014; Nation, 2006; Nurmukhamedov, 2017; Tegge, 2017; Webb & Macalister, 2013). It is worth noting that these coverage thresholds did not only apply to reading comprehension (Schmitt et al., 2011), but were also applicable to listening comprehension (van Zeeland & Schmitt, 2013).
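The arithmetic behind these two thresholds can be sketched in a few lines of Python. The function name `unknown_tokens` is ours, introduced purely for illustration; it simply converts a coverage percentage into the expected number of unfamiliar running words in a text of a given length.

```python
def unknown_tokens(text_length, coverage_pct):
    """Expected number of unknown running words at a given lexical coverage."""
    return round(text_length * (100 - coverage_pct) / 100)

for coverage in (95, 98):
    unknown = unknown_tokens(1000, coverage)
    # Express the same figure as "1 unknown word in every N running words".
    print(f"{coverage}% coverage: {unknown} unknown words, "
          f"about 1 in every {1000 // unknown}")
```

For a 1,000-word text, the sketch reproduces the figures cited from Hu and Nation (2000): 50 unknown words at 95% coverage and 20 at 98%.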

Vocabulary Demands of Academic Spoken English

Beyond the concept of lexical coverage, research in the field of vocabulary studies has also put forward the notion of lexical demand, which refers to the number of words required to cover 95% and 98% of a particular text genre (Nation, 2013; Webb & Nation, 2013).

Research on the lexical demands of texts often employs word-frequency lists as the main research methodology (Nation, 2022; Webb, 2020). These lists classify words into frequency levels based on the number of times they occur in a specific corpus (Nation, 2006, 2017, 2022). Several frequency lists have been created to date; still, the field of lexical profiling recognizes two as the most popular: Nation’s (2006) British National Corpus (BNC) lists and Nation’s (2017) British National Corpus/Corpus of Contemporary American English (BNC/COCA) lists. Nation’s (2006) BNC lists consist of 14 word levels and 2 supplementary lists, which are proper nouns (PN) and marginal words (MW). The BNC/COCA lists contain 25 word levels plus 4 supplementary levels of PN, MW, transparent compounds (TC), and acronyms. Each of the mentioned word levels includes 1,000 word families.

The word family is also a concept worth mentioning. For a wordlist to function reliably as a research methodology or a measurement instrument, it must be based on a clear and sound word counting unit. Word counting units determine what should be counted as a word from the perspective of the researcher or the creator of a wordlist. A word counting unit could be a token, type, lemma, flemma, or word family (Nation, 2013, 2016; Nation & Coxhead, 2021). A token is simply a word. Some researchers prefer the term running word, but the two are basically the same. If we were to count all the words in this article, as demanded by the journal, we would be counting tokens. Types are unique tokens. If someone counted the word “word” in this article only once, no matter how many times it re-occurred, they would be counting word types. The concept of lemma refers to a headword and its inflectional forms that share the same part of speech. The decision whether or not to include inflected forms of different parts of speech separates lemmas from flemmas. A flemma includes a headword and all its inflectional forms regardless of part of speech. For instance, “thought” as a verb and “thought” as a noun would be counted as two different lemmas but as one flemma. The word-frequency lists described in this paper, including the BNC, BNC/COCA, and other later-mentioned wordlists such as the AWL and ASWL, are based on a word counting unit called the word family. A word family, as Bauer and Nation (1993) proposed, is an inclusive definition of a word that includes a headword and all of its inflectional and derivational forms. For example, the word family of the headword “accept” would include “acceptability,” “acceptable,” “acceptably,” “acceptance,” “acceptances,” “accepted,” “accepting,” “acceptor,” “acceptors,” “accepts,” “unacceptability,” “unacceptable,” and “unacceptably.”
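To make the difference between counting units concrete, the following minimal Python sketch counts tokens, types, and word families in a single sentence. The `FAMILY_OF` mapping, the `tokenize` helper, and the sample sentence are our own illustrative assumptions; the mapping covers only a handful of members of the “accept” family listed above and is not drawn from any published wordlist file.

```python
import re

# Hypothetical, hand-made word-family mapping for illustration only;
# real lists such as the BNC/COCA define families far more completely.
FAMILY_OF = {
    "accept": "accept", "accepts": "accept", "accepted": "accept",
    "acceptable": "accept", "unacceptable": "accept", "acceptance": "accept",
}

def tokenize(text):
    """Lower-case the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

sample = "She accepted the offer, and her acceptance was accepted as acceptable."
tokens = tokenize(sample)                          # every running word
types = set(tokens)                                # unique word forms
families = {FAMILY_OF.get(t, t) for t in types}    # forms collapsed to headwords

print(len(tokens), len(types), len(families))
```

In this sentence, the 11 tokens reduce to 10 types (because “accepted” occurs twice) and to 8 word families (because “accepted,” “acceptance,” and “acceptable” all belong to the family of “accept”).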

For decades, these wordlists have served as a basis for several studies on lexical demand, informing countless teachers, students, and researchers of the necessary vocabulary knowledge to understand 95% and 98% of the words in various text genres (Al-Surmi, 2014; Dang & Webb, 2014; Ha, 2022a, 2022b; Nation, 2006; Nurmukhamedov, 2017; Nurmukhamedov & Sharakhimov, 2021; Phung & Ha, 2022; Webb & Rodgers, 2009a, 2009b).

Due to difficulties in data collection, corpus-driven studies on the vocabulary profile of academic spoken English have been considered an under-researched area (Dang et al., 2017; O’Keeffe et al., 2007). Lexical research based on Nation’s (2006) BNC wordlist pointed out that it would generally take learners 4,000 and 8,000 word families to achieve 95% and 98% coverage, respectively (Coxhead & Walls, 2012; Dang & Webb, 2014; Nurmukhamedov, 2017). However, as the field of vocabulary studies is moving toward Nation’s (2017) BNC/COCA lists (Schmitt et al., 2017) and the most widely used vocabulary tests have incorporated the BNC/COCA lists as the source for their test items (Ha, 2021a; McLean et al., 2015; McLean & Kramer, 2015; Webb et al., 2017), findings from research that utilized the BNC lists can barely serve as a guideline for present practices. One of the few attempts to employ the BNC/COCA lists to investigate the vocabulary demands of academic spoken discourse was Liu and Chen (2019). Liu and Chen (2019) examined a 4.37-million-token corpus of TED talks and concluded that learners would need to be familiar with 3,000 and 6,000 to 7,000 word families plus PN, MW, TC, and acronyms to understand 95% and 98% of the words in TED presentations.

The Academic Word List and the Academic Spoken Word List

In an attempt to create a wordlist that could help learners quickly adapt to academic discourses, Dang et al. (2017) created a four-level Academic Spoken Word List (ASWL) that contains 1,741 word families. The ASWL was aimed to be integrated with the BNC/COCA lists “in a systematic program to enhance learners’ comprehension of academic spoken English” (Dang et al., 2021, p. 7). Table 1 illustrates the components of the ASWL and the extent to which ASWL items are included in the BNC/COCA lists.

Table 1.

The Four Levels of the Academic Spoken Word List (ASWL; Dang et al., 2017, p. 979, cited in Dang et al., 2021, p. 9).

ASWL level	BNC/COCA level	Number of word families	Examples
Level 1	First 1,000	830	Alright, know, stuff
Level 2	Second 1,000	456	Therefore, determine, approach
Level 3	Third 1,000	380	Achieve, significant, aspect
Level 4	Fourth 1,000 onwards	75	Arbitrary, optimize, theorem

The ASWL was found to cover approximately 90% of most academic spoken corpora (Dang et al., 2017, 2021; Liu & Chen, 2019). Certain comparisons were made between the ASWL and the AWL to endorse the use of the ASWL (Coxhead & Dang, 2019; Liu & Chen, 2019). For example, Liu and Chen (2019) compared the coverage of the ASWL with that of the AWL and the BNC/COCA 2,000 separately. They concluded that learning the ASWL was a better option since it provided better coverage “compared to the AWL (89.4% vs. around 4%)” and offered similar coverage with a lower learning load compared to the BNC/COCA 2,000 (Liu & Chen, 2019, pp. 360–364).

However, we consider these comparisons uncritical and even unfair for two reasons. Firstly, the AWL was created as an addition to West’s (1953) General Service List (GSL) in a way that none of the AWL items were included in the GSL (Coxhead, 2000). In other words, the AWL was intended to contain purely academic words, and the creators of the list intentionally filtered out all the general or non-academic words in the list (Coxhead, 2000). In a recent study, Phung and Ha (2022, p. 6) found that AWL items only accounted for a very small proportion (3.7%) of the most frequent 1,000 word families in the BNC/COCA lists. The ASWL, on the other hand, contains a large proportion of the most frequent word families in the BNC/COCA list (83%, see Table 1). Given the fact that the first 1,000 word families in the BNC/COCA could cover approximately 80% of the words in most text types (Ha, 2022a, 2022b; Nurmukhamedov & Sharakhimov, 2021; Phung & Ha, 2022; and as later shown in Table 2 of this study), we would not call the differences in lexical coverage between the two wordlists (AWL and ASWL) surprising.

Table 2.

Cumulative Coverage Including PN, MW, TC, and AC for Each Corpus (%).

Word list	BASE			TED
Word list	LS	PS	SS	LS	PS	SS
1,000	83.96	86.51	86.26	78.47	81.35	83.84
2,000	89.96	92.19	92.53	86.69	89.13	91.38
3,000	94.53	96.29^a	96.91^a	92.25	94.28	96.26^a
4,000	95.97^a	97.52	97.97^b	94.28	96.46^a	97.53
5,000	96.85	98.15^b	98.44	95.67^a	97.31	98.20^b
6,000	97.32	98.47	98.76	96.69	97.91	98.60
7,000	97.75	98.72	99.01	97.42	98.34^b	98.87
8,000	98.03^b	98.92	99.18	97.91	98.61	99.07
9,000	98.27	99.02	99.28	98.22^b	98.86	99.19
10,000	98.50	99.18	99.36	98.44	99.02	99.27
11,000–25,000	98.64–99.52	99.33–99.75	99.44–99.67	98.64–99.40	99.14–99.53	99.34–99.60
Tokens	454,664	352,911	911,172	222,354	247,729	324,602

^a Reaching 95% coverage.

^b Reaching 98% coverage.

The second criticism lies in the fact that Liu and Chen (2019) completely ignored the fact that different wordlists could be learnt together (Dang & Webb, 2014). In an influential paper, Dang and Webb (2014) investigated the lexical coverage of university lectures and academic seminars using Nation’s (2006) BNC lists and Coxhead’s (2000) AWL. In addition to the findings reviewed above, they successfully pointed out a supportive relationship between the BNC and AWL. Their findings showed that, with knowledge of the AWL, learners would only need to know the 3,000 most frequent word families in the BNC to understand 95% of the running words in academic lectures and seminars. Without the AWL, the vocabulary size needed to reach the 95% coverage threshold would normally be 4,000 word families. In their paper, Liu and Chen (2019) only compared the ASWL to the AWL and BNC/COCA separately without examining the relationship between the AWL and BNC/COCA. Assuming there was a strong relationship between the BNC and the AWL, it is hypothesized that a similar bond could also be found between the BNC/COCA and AWL, especially when the first two 1,000-word levels in the BNC/COCA lists primarily contained words from spoken texts (Nation, 2020).

Adopting the research methodology developed by Dang and Webb (2014), Phung and Ha (2022) investigated the vocabulary demands of the four listening sections in the IELTS test. Their results showed a strong relationship between the AWL and the BNC/COCA lists. Their findings suggested that, with the support of knowledge of the AWL, students with vocabulary knowledge of the 2,000 most frequent word families in the BNC/COCA lists could understand 95% of the words in the first three sections of the IELTS listening test. If such a combination could significantly increase students’ coverage for a test of listening proficiency for academic purposes, then we would assume that similar increases could be found for university lectures, seminars, and TED presentations when word knowledge of the AWL and BNC/COCA is combined. To the best of our knowledge, no research has investigated the combined lexical coverage of the AWL and BNC/COCA, with emphasis on academic lectures, seminars, and TED.

Research Gaps and the Present Study

A research gap in the current literature that needs to be addressed is the methodological gap in Dang and Webb (2014) and Liu and Chen (2019). Given the influence of their findings, the present study seeks to (1) revisit the lexical profile of academic spoken English using the BNC/COCA lists and (2) examine the degree to which the AWL would increase the BNC/COCA coverage for academic discourse. The present study employed the British Academic Spoken English (BASE) corpus, the corpus that was investigated in Dang and Webb (2014), and a corpus of TED talks which was designed in accordance with the disciplinary sub-corpora presented in the BASE corpus. Such an addition would make findings on the lexical profile of academic spoken English more diverse and representative. An investigation of the lexical demands of the three academic disciplines of Life Sciences, Physical Sciences, and Social Sciences using the up-to-date BNC/COCA lists would be a valuable update of Dang and Webb’s (2014) findings and an expansion of Liu and Chen’s (2019) results.

As Dang et al.’s (2017) ASWL has been exclusively applied in most studies of academic spoken corpora, including university lectures, seminars, TED presentations, and so on (Coxhead & Dang, 2019; Dang et al., 2017, 2021; Liu & Chen, 2019), employing such a wordlist again in this study would be an unnecessary repetition. Therefore, instead of using the ASWL, the study would focus on the degree to which the AWL could improve the lexical coverage of the BNC/COCA lists and whether or not this combined coverage could be comparable to that of the ASWL for academic spoken discourse in previous studies.

In particular, the present study seeks to answer the following research questions:

RQ1: What vocabulary size of the BNC/COCA is required to understand 95% and 98% of the words in the BASE and TED disciplinary sub-corpora?

RQ2: Do vocabulary demands of academic spoken English vary among different fields of study?

RQ3: To what extent can the AWL contribute to learners’ existing word knowledge of the BNC/COCA lists?

Methodology

Data Collection

The TED Corpus

A total of 600 TED transcripts were manually collected by the researchers from the official TED website. The transcripts were then classified into three major branches of science: Life Sciences (200 talks), which included topics such as bacteria, bees, birds, coral reefs, fungi, dinosaurs, viruses, ecology, and fish; Physical Sciences (200 talks), which consisted of topics such as asteroids, space, technology, astronomy, engineering, physics, energy, and so on; and Social Sciences (200 talks), which was made up of topics including arts, language, economy, body language, books, business, religion, happiness, communication, culture, friendship, and so on. These sub-corpora were created for comparative purposes, as the BASE corpus was also divided into similar disciplines. While there were certain differences in size due to variations in the lengths of the presentations, an equal number of transcripts was collected for each sub-corpus, making them comparable and representative. Although the TED corpus in the present study was smaller than those used in Nurmukhamedov (2017) and Liu and Chen (2019), its size (794,685 tokens) was sufficient to yield reliable findings in corpus-driven studies (Biber, 1993; Biber et al., 1998; Chujo & Utiyama, 2005). The classification of talks was based on their topics, which were available on the official TED website (https://www.ted.com/topics).

The BASE Corpus

The British Academic Spoken English (BASE) corpus comprised 160 lectures at tertiary level and 39 seminars recorded at two universities in the UK between 1998 and 2005 (Nesi & Thompson, 2006). The corpus contained approximately 1.7 million tokens and was made up of four disciplinary sub-corpora: Arts and Humanities, Life and Medical Sciences, Physical Sciences, and Social Sciences. Except for the Physical Sciences sub-corpus, which contained 40 lectures and 9 seminars, all the other sub-corpora included 40 lectures and 10 seminars each.

Data Preparation

To make the corpora representative and comparable, all the TED- and BASE-related sub-corpora were classified into three major scientific disciplines: Life Sciences (LS), Physical Sciences (PS), and Social Sciences (SS). Therefore, the name of the Life and Medical Sciences sub-corpus in the BASE corpus was shortened as Life Sciences. Since topics in the fields of Arts and Humanities and Social Sciences share great similarities, the Arts and Humanities and Social Sciences sub-corpora in the BASE corpus were merged into Social Sciences.

Certain modifications were made to the transcripts prior to analysis. First, presenters’ IDs (nf1167, sf1168, and su1169) and context-supporting words (laugh, laughter, applause, cough, and cheers) were excluded since they were not part of the spoken discourse. Second, as the lexical profiler programs used in vocabulary analysis could not read hyphenated compounds, hyphens in hyphenated words (part-time, second-hand) were replaced by spaces so that the elements of these compounds, such as “part” and “time” or “second” and “hand,” could be classified according to their frequency. Finally, words that were falsely categorized as “Not in the lists” by a lexical profiler program due to typos were corrected and returned to their frequency levels.
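The first two cleaning steps could be sketched in Python as follows. This is not the procedure actually used in the study, which is not described at the code level; the function name `clean_transcript` is hypothetical, and the sketch assumes that speaker IDs follow the pattern of two lower-case letters plus four digits (as in nf1167) and that context-supporting words appear in parentheses, for example “(Laughter)”.

```python
import re

# Assumptions for illustration: speaker IDs look like "nf1167" (two letters
# plus four digits) and context words appear in parentheses, e.g. "(Laughter)".
SPEAKER_ID = re.compile(r"\b[a-z]{2}\d{4}\b")
CONTEXT_WORD = re.compile(r"\((?:laugh|laughter|applause|cough|cheers)\)",
                          re.IGNORECASE)
HYPHEN_IN_WORD = re.compile(r"(?<=\w)-(?=\w)")

def clean_transcript(text):
    """Drop speaker IDs and context words, then split hyphenated compounds."""
    text = SPEAKER_ID.sub(" ", text)
    text = CONTEXT_WORD.sub(" ", text)
    text = HYPHEN_IN_WORD.sub(" ", text)   # "part-time" -> "part time"
    return " ".join(text.split())          # collapse repeated whitespace

raw = "nf1167 I bought a second-hand bike (Laughter) for my part-time job"
print(clean_transcript(raw))
```

Applied to the sample line, the sketch returns “I bought a second hand bike for my part time job”. The third step, correcting typos that a profiler reports as “Not in the lists,” requires manual inspection and is not covered here.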

Data Analysis

A lexical profiler program named RANGE (Heatley et al., 2002) was used for data analysis. The RANGE program classifies words in a target text in accordance with the frequency levels set out by the accompanying wordlists. The present study utilized two major wordlists, the 25-level BNC/COCA lists (Nation, 2017) and the AWL (Coxhead, 2000), to analyze the lexical profile of the TED and BASE corpora. RANGE can also identify contractions (can’t, didn’t…) and connected speech (wanna, kinda…) and assign them to their families. For example, RANGE reads don’t as do and not, and wanna as a family member of want.
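The general logic of lexical profiling can be illustrated with a short Python sketch. It is not RANGE itself and does not reproduce its file formats or options; the miniature `LEVEL_OF` wordlist, the `EXPANSIONS` table, and the `profile` function are hypothetical stand-ins for the real base-word files, which list 1,000 word families per level.

```python
from collections import Counter

# Hypothetical miniature wordlists: each word form maps to a frequency level.
LEVEL_OF = {
    "the": 1, "we": 1, "do": 1, "not": 1, "want": 1, "know": 1,
    "approach": 2, "determine": 2,
    "significant": 3,
}
# Contractions and connected speech are resolved before look-up.
EXPANSIONS = {"don't": ["do", "not"], "wanna": ["want"]}

def profile(tokens):
    """Return cumulative coverage (%) per level and the off-list share (%)."""
    expanded = []
    for tok in tokens:
        expanded.extend(EXPANSIONS.get(tok, [tok]))
    # Level 0 stands for "not in the lists".
    counts = Counter(LEVEL_OF.get(tok, 0) for tok in expanded)
    total = len(expanded)
    cumulative, running = {}, 0
    for level in sorted(lvl for lvl in counts if lvl > 0):
        running += counts[level]
        cumulative[level] = round(100 * running / total, 2)
    off_list = round(100 * counts[0] / total, 2)
    return cumulative, off_list

tokens = "we don't know the significant approach mitochondria".split()
coverage, off_list = profile(tokens)
print(coverage, off_list)
```

For the eight running words in the sample (after “don’t” is split into “do” and “not”), the first level covers 62.5%, the first two levels 75%, and the first three levels 87.5%, with 12.5% falling outside the lists. The cumulative figures reported in this study are of the same kind, computed over whole sub-corpora.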

Results

RQ1 and RQ2

Table 2 presents the coverage of the BNC/COCA lists plus supplementary words (PN, MW, TC, and acronyms) over the BASE and TED disciplinary sub-corpora. In general, TED presentations were found to be relatively more lexically demanding than university lectures and seminars. This difference in lexical demands was reflected in the vocabulary knowledge needed to achieve 95% and 98% coverage. For instance, except for Social Sciences, TED talks in the PS and LS disciplines were 1,000 to 2,000 word families more lexically demanding than their counterparts in the BASE corpus.

It could be observed that the lexical demands differed significantly among the scientific disciplines. According to the word knowledge required to reach the 95% and 98% coverage thresholds, an order of lexical difficulty of LS > PS > SS could be established among the three scientific disciplines. It was interesting to see that such order of difficulty was reflected in both the TED and BASE sub-corpora.

If PN, MW, TC, and acronyms were assumed to be easily recognized and understood (Nation, 2006; Nurmukhamedov & Sharakhimov, 2021; Tegge, 2017), then students of social sciences disciplines such as economy, education, and linguistics would only need a vocabulary knowledge of 3,000 and 4,000 to 5,000 word families to comprehend 95% and 98% of the words in academic lectures and presentations, in the order given. Things were more complicated for Physical Sciences disciplines which included topics like space sciences and technology. While the analysis of the BASE-PS sub-corpus suggested that it only took learners 3,000 and 5,000 word families to gain 95% and 98% coverage, correspondingly, results from the analysis of the TED-PS sub-corpus showed that learners would need 4,000 and 7,000 most frequent word families in the BNC/COCA lists to gain 95% and 98% coverage, respectively. As the most lexically demanding discipline, lectures, seminars and presentations in the fields related to Life Sciences required the vocabulary knowledge at 4,000 to 5,000 and 8,000 to 9,000 levels to achieve the 95% and 98% coverage thresholds, in the given order.
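Reading the thresholds off Table 2 amounts to finding the first word level whose cumulative coverage meets a target. The Python sketch below does this for the two Life Sciences sub-corpora, using the values reported in Table 2; the function name `level_reaching` is ours.

```python
# Cumulative coverage (%) from Table 2 for two sub-corpora, levels 1,000-10,000.
LEVELS = [1000 * i for i in range(1, 11)]
TABLE_2 = {
    "BASE-LS": [83.96, 89.96, 94.53, 95.97, 96.85, 97.32, 97.75, 98.03, 98.27, 98.50],
    "TED-LS": [78.47, 86.69, 92.25, 94.28, 95.67, 96.69, 97.42, 97.91, 98.22, 98.44],
}

def level_reaching(coverages, threshold):
    """Smallest word level whose cumulative coverage meets the threshold, else None."""
    for level, cov in zip(LEVELS, coverages):
        if cov >= threshold:
            return level
    return None

for name, coverages in TABLE_2.items():
    print(name, level_reaching(coverages, 95), level_reaching(coverages, 98))
```

The sketch returns 4,000 and 8,000 for BASE-LS and 5,000 and 9,000 for TED-LS, matching the ranges given above. Note that it applies the thresholds strictly, whereas Table 2 marks 97.97% (BASE-SS at the 4,000 level) as reaching 98%, so the published table appears to allow for rounding.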

RQ3

Table 3 illustrates the proportion of academic vocabulary in each sub-corpus. In general, it could be seen that the AWL accounted for a comparable amount of words in the sub-corpora. Dang and Webb (2014) showed that the AWL could provide significant support to the high-frequency levels in the BNC lists. However, a certain degree of difference could be expected when it comes to the relationship between the AWL and the BNC/COCA lists.

Table 3.

The Proportion of AWL Items in Each Sub-Corpus (%).

BASE			TED
LS	PS	SS	LS	PS	SS
4.18	4.23	4.49	4.42	4.44	4.76

When the 25-level BNC/COCA lists were run on the 570 word families of the AWL (with all the hyphens replaced by spaces), it was found that the first, second, third, fourth, fifth, and sixth 1,000-word levels in the BNC/COCA lists accounted for 3.70%, 24.14%, 56.16%, 9.42%, 3.26%, and 1.12% of the words in the AWL, respectively. The coverage went below 1% from the seventh level onwards. These results indicated that most of the AWL items were included in the first three 1,000-word levels of the BNC/COCA lists. In other words, the strongest contribution of the AWL to the BNC/COCA could only be witnessed at the 1,000 level (where more than 90% of the AWL items were not yet covered) and the 2,000 level (approximately 70%).
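The percentages above can be accumulated to show how much of the AWL lies within, and beyond, each BNC/COCA level. The short Python sketch below uses only the six figures reported in this paragraph; the variable names are ours.

```python
from itertools import accumulate

# Share of AWL items (%) found at the first six BNC/COCA 1,000-word levels,
# as reported in the paragraph above.
share_by_level = [3.70, 24.14, 56.16, 9.42, 3.26, 1.12]

# Share of the AWL already contained in the first n levels ...
cumulative = [round(x, 2) for x in accumulate(share_by_level)]
# ... and the share that a learner at level n would still gain from the AWL.
outside = [round(100 - c, 2) for c in cumulative]

for n, (inside, rest) in enumerate(zip(cumulative, outside), start=1):
    print(f"first {n},000: {inside}% of the AWL inside, {rest}% beyond")
```

On these figures, 96.30% of the AWL lies beyond the first 1,000 word families and 72.16% beyond the first 2,000, which appears to be what the “more than 90%” and “approximately 70%” estimates refer to. By the 3,000 level, only 16% remains.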

To find out the extent to which knowing the AWL could contribute to the comprehension of TED talks and university lectures, the BNC/COCA lists were run on the AWL items that occurred in each TED and BASE disciplinary sub-corpus. After that, customized wordlists were created based on the AWL items that occurred at each BNC/COCA level. The new, customized “basewrd” files were then used together with RANGE to analyze the coverage of these AWL items in the target texts. The same technique was applied to each of the six sub-corpora. Table 3 illustrates the results of such analyses.

Table 4 gives information about the cumulative coverage of AWL items at each BNC/COCA word level. In general, nearly half of the AWL items were included in the first 2,000 most frequent word families in the BNC/COCA lists. The 3,000 level in the BNC/COCA lists alone was shown to cover another half of the AWL items, and nearly all of the AWL items were covered from the 4,000 level onwards. When we subtracted the cumulative coverage of the AWL items at each level from the total proportion of academic words in a sub-corpus, we found the remaining AWL items that had not been covered by BNC/COCA words (Table 4). This was the proportion of words that learners might gain if they knew the AWL. Take the SS sub-corpus of the BASE corpus as an example: if learners only knew the 1,000 most frequent word families in the BNC/COCA lists, their knowledge of the 570 word families in the AWL would enable them to understand 4.49 − 0.37 = 4.12% more of the lectures in their discipline.
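The subtraction described above can be written out for all three BASE sub-corpora. The following Python sketch copies its input values from Tables 3 and 4; the names `awl_total`, `awl_cumulative`, and `remaining_awl` are ours and serve only to illustrate the calculation.

```python
# Values copied from Tables 3 and 4 for the BASE sub-corpora (%).
awl_total = {"LS": 4.18, "PS": 4.23, "SS": 4.49}        # Table 3
awl_cumulative = {                                      # Table 4, levels 1,000-5,000
    "LS": [0.50, 1.95, 3.98, 4.14, 4.18],
    "PS": [0.36, 1.68, 4.00, 4.16, 4.21],
    "SS": [0.37, 1.95, 4.27, 4.43, 4.47],
}

def remaining_awl(discipline):
    """AWL coverage not yet provided by the BNC/COCA levels known so far."""
    total = awl_total[discipline]
    return [round(total - covered, 2) for covered in awl_cumulative[discipline]]

for discipline in awl_total:
    print(discipline, remaining_awl(discipline))
```

For the SS sub-corpus, the sketch yields 4.12, 2.54, 0.22, 0.06, and 0.02, the same values as the “Proportion of the remaining AWL items” row in Table 4.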

Table 4.

The Distribution of the AWL Items at Each 1,000-Word Level in the BNC/COCA Lists (%).

BNC/COCA word level	1,000	2,000	3,000	4,000	5,000
BASE
Cumulative coverage of AWL items
LS	0.50	1.95	3.98	4.14	4.18
PS	0.36	1.68	4.00	4.16	4.21
SS	0.37	1.95	4.27	4.43	4.47
Proportion of the remaining AWL items
LS	3.68	2.23	0.20	0.04	0.00
PS	3.87	2.55	0.23	0.07	0.02
SS	4.12	2.54	0.22	0.06	0.02
TED
Cumulative coverage of AWL items
LS	0.31	2.18	4.26	4.37	4.40
PS	0.30	2.12	4.22	4.36	4.41
SS	0.36	2.31	4.58	4.71	4.75
Proportion of the remaining AWL items
LS	4.11	2.24	0.16	0.05	0.02
PS	4.14	2.32	0.22	0.08	0.03
SS	4.40	2.45	0.18	0.05	0.01

Table 5 demonstrates the difference between knowing and not knowing the AWL at each level of vocabulary knowledge. Since the contribution of the AWL is most visible at high-frequency word levels and quickly fades away from the 3,000 level onwards, we focus only on the 1,000 and 2,000 BNC/COCA levels. Generally, learners who have limited vocabulary knowledge at the 1,000 level could understand 85% to 90% of the words in academic lectures, seminars, and presentations. More notably, knowledge of the AWL could help learners who knew the 2,000 most frequent word families get very close to the 95% coverage threshold for some less lexically demanding disciplines. For example, the AWL increased the lexical coverage at the 2,000 frequency level from 92.53% to 95.07% for the BASE-SS sub-corpus.
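The combined coverage is the BNC/COCA coverage at a given level plus the share of AWL items not yet covered at that level. The Python sketch below works through the BASE-SS example, with input values taken from Tables 2 and 4; the variable names and the helper `first_level_at` are our own.

```python
# BASE-SS values (%) at the 1,000-5,000 BNC/COCA levels.
LEVELS = [1000, 2000, 3000, 4000, 5000]
without_awl = [86.26, 92.53, 96.91, 97.97, 98.44]   # Table 2
remaining_awl = [4.12, 2.54, 0.22, 0.06, 0.02]      # Table 4

# Coverage once the remaining AWL items are added at each level.
with_awl = [round(base + extra, 2) for base, extra in zip(without_awl, remaining_awl)]

def first_level_at(coverages, threshold=95.0):
    """First word level whose coverage meets the threshold, else None."""
    for level, cov in zip(LEVELS, coverages):
        if cov >= threshold:
            return level
    return None

print(with_awl)
print(first_level_at(without_awl), first_level_at(with_awl))
```

The resulting figures (90.38, 95.07, 97.13, 98.03, and 98.46) agree with the “With AWL” row for BASE-SS in Table 5, and the 95% threshold moves from the 3,000 level to the 2,000 level once the AWL is known.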

Table 5.

Additional Coverage AWL Provided to Each Word Level in the BNC/COCA Lists (%).

BNC/COCA word level	1,000	2,000	3,000	4,000	5,000
BASE
Without AWL
LS	83.96	89.96	94.53	95.97	96.85
PS	86.51	92.19	96.29	97.52	98.15
SS	86.26	92.53	96.91	97.97	98.44
With AWL
LS	87.64	92.19	94.73	96.01	96.84
PS	90.38	94.73	96.52	97.59	98.17
SS	90.38	95.07	97.13	98.03	98.46
TED
Without AWL
LS	78.47	86.69	92.25	94.28	95.67
PS	81.35	89.13	94.28	96.46	97.31
SS	83.84	91.38	96.26	97.53	98.20
With AWL
LS	82.58	88.94	92.42	94.33	95.69
PS	85.48	91.45	94.50	96.53	97.34
SS	88.24	93.84	96.44	97.58	98.22

The figures in bold highlight lexical coverages reaching 95% thanks to the addition of the AWL items.

Discussion

Variation in Lexical Demands of Academic Discourse Among Disciplines

Data showed observable differences in lexical demands between scientific disciplines. An order of lexical difficulty of Life Sciences > Physical Sciences > Social Sciences was also supported by the results from the analyses of both the BASE and TED corpora, which were in line with Dang and Webb (2014). Students who wish to study in universities that use English as the medium of instruction would find those figures helpful. As the present study’s analyses were based on Coxhead’s (2000) AWL and Nation’s (2017) BNC/COCA wordlist, teachers and education advisors could utilize existing vocabulary tests that measure vocabulary knowledge of the same wordlists to evaluate students’ readiness for the lectures and presentations of their selected fields of study. Such practices could be carried out using recently developed tests of receptive phonological vocabulary knowledge (Ha, 2021a; McLean et al., 2015).

The Relationship Between AWL and BNC/COCA: A Word Learning Strategy

Due to certain differences in design between the BNC and BNC/COCA wordlists, the relationship between the AWL and these two wordlists differed significantly. For example, Dang and Webb’s (2014) study showed that the AWL could still contribute nearly 0.8% and more than 0.3% to the lexical coverage of the third and fourth levels of the BNC lists, respectively. In contrast, data from the present study showed that the contribution of the AWL to the BNC/COCA lists was barely visible from the 3,000 level. At the same time, since the first two 1,000-word levels of the BNC/COCA were made up of words taken primarily from spoken discourse (Nation, 2020), they did not contain as much academic vocabulary as the first two levels in the BNC lists (Coxhead & Dang, 2019, Table 3; Dang & Webb, 2014). This created a situation where the AWL and the first 1,000 or even 2,000 word families of the BNC/COCA formed a sound relationship which teachers and learners could take advantage of. These findings align well with Phung and Ha (2022), who investigated the lexical profile of the IELTS listening test using the same research methodology.

As shown in the present study, the combined 1,570 word families (AWL and BNC/COCA 1,000) covered 88% to 90% of the words in university lectures, seminars, and TED presentations in the field of Social Sciences. It is worth noting that Social Sciences was also the field that possibly included topics such as Culture, Design, Entertainment, and Global issues, which made up a substantial proportion of Liu and Chen’s (2019) TED corpus. Therefore, if we were to recommend wordlists to English as a Second Language (ESL) teachers and learners, we would suggest that vocabulary teaching and learning focus on the high- and mid-frequency levels of the BNC/COCA lists, for the following reasons. First, such a strategy would develop learners’ lexical knowledge systematically, which would later support the comprehension of various text genres, not only academic spoken discourse. Second, the suggested learning strategy would be continuously informed and supported by a strong body of lexical profiling studies that used the BNC/COCA lists in their research methodology (Hsu, 2018, 2022; Nurmukhamedov & Sharakhimov, 2021; Rahmat & Coxhead, 2021; Yang & Coxhead, 2020). Third, as the most popular and reliable tests of receptive vocabulary knowledge were developed from the AWL and BNC/COCA lists (McLean et al., 2015; McLean & Kramer, 2015; Webb et al., 2017), teachers and students who use the BNC/COCA lists or the AWL as guidelines for vocabulary learning and teaching could easily employ these tests to evaluate progress. Finally, even when students need to increase their vocabulary knowledge quickly to cope with specific academic demands, the combination of the AWL and BNC/COCA-2000 could be, in our opinion, the best solution.

Since we mention tests of vocabulary knowledge as a reason to support the teaching and learning of the AWL and BNC/COCA lists, it is also worth noting that researchers in the field of vocabulary assessment are becoming more critical of the formats of vocabulary tests (Schmitt et al., 2020). Specifically, strong criticisms have been raised of the four-option, multiple-choice format for its strategic or even blind guessing effects (Stewart et al., 2021; Stoeckel et al., 2021), and of the six-option, matching format for issues of local item dependency (Ha, 2022c; Kamimoto, 2014). A general conclusion that most of these authors agree on is the endorsement of the meaning-recall format, in which learners have to demonstrate their word knowledge through either L1 translation or explanation (Cheng et al., 2022; Ha, 2022c; Stewart et al., 2021; Stoeckel et al., 2021). However, as Nation and Coxhead (2021) discuss in their Chapter 9, each test format has its pros and cons and is therefore optimal only for a particular purpose or circumstance. As a result, it would not be realistic to suggest a one-size-fits-all test of vocabulary knowledge. Fortunately, two tests of receptive, phonological vocabulary knowledge are available in the field, one employing the meaning-recognition format (McLean et al., 2015) and the other the meaning-recall format (Cheng et al., 2022). Teachers of English are encouraged to consult Nation and Coxhead (2021) and the other papers cited to decide which of these two tests is more appropriate for their testing purposes.

Conclusion

The present study offers insight into the lexical demands of academic spoken English across various scientific disciplines. It shows that lectures and presentations in different fields of study require different vocabulary sizes to be successfully comprehended, which provides useful guidelines for students who wish to attend English as a Medium of Instruction (EMI) universities. At the same time, the findings offer implications for vocabulary teaching and learning strategies by indicating how learning academic vocabulary early could support learners in understanding academic spoken discourse.

Despite being informative, the present study has certain limitations. The most notable is the age of the BASE corpus, which was 17 years old at the time the present study was conducted. Moreover, the lectures and seminars recorded in the BASE corpus contained primarily British English. In other words, the BASE corpus did not fully reflect the academic spoken English currently used in universities around the world, which to some extent weakens certain claims made in the paper. Linguists are encouraged to revisit the issue using a more up-to-date and comprehensive corpus of academic lectures and seminars.

Footnotes

Acknowledgements

The recordings and transcriptions used in this study come from the British Academic Spoken English (BASE) corpus. The corpus was developed at the Universities of Warwick and Reading under the directorship of Hilary Nesi (Warwick) and Paul Thompson (Reading). Corpus development was assisted by funding from the Universities of Warwick and Reading, BALEAP, EURALEX, the British Academy, and the Arts and Humanities Research Council.

Author Contributions

All authors listed have made equally significant, direct, and intellectual contributions to the paper. All authors read and approved the manuscript for publication.

Declaration of Conflicting Interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) received no financial support for the research, authorship, and/or publication of this article.

ORCID iDs

Nguyen Huynh Trang

Duyen Thi Bich Nguye

Hung Tan Ha

References

Al-Surmi (2014). TV shows, word coverage, and incidental vocabulary learning. In Bailey & Damerow (Eds.), Teaching and learning English in the Arabic-speaking world (pp. 132–147). Routledge.

Barcroft (2007). When knowing grammar depends on knowing vocabulary: Native speaker grammaticality judgements of sentences with real and unreal words. The Canadian Modern Language Review/La revue canadienne des langues vivantes, 63(3), 313–343. https://doi.org/10.3138/R601-H212-5582-0737

Bauer & Nation, I. S. P. (1993). Word families. International Journal of Lexicography, 6(4), 253–279.

Biber (1993). Representativeness in corpus design. Literary and Linguistic Computing, 8, 243–257.

Biber, Conrad, & Reppen (1998). Corpus linguistics: Investigating language structure and use. Cambridge University Press.

Cheng & Matthews (2018). The relationship between three measures of L2 vocabulary knowledge and L2 listening and reading. Language Testing, 35(1), 3–25. https://doi.org/10.1177/0265532216676851

Cheng, Matthews, Lange, & McLean (2022). Aural single-word and aural phrasal verb knowledge and their relationships to L2 listening comprehension. TESOL Quarterly. Advance online publication. https://doi.org/10.1002/tesq.3137

Chujo & Utiyama (2005). Understanding the role of text length, sample size and vocabulary size in determining text coverage. Reading in a Foreign Language, 17, 1–22.

Coxhead (2000). A new academic word list. TESOL Quarterly, 34(2), 213–238. http://dx.doi.org/10.2307/3587951

Coxhead & Dang, T. N. Y. (2019). Vocabulary in university tutorials and laboratories. In Hyland & Wong (Eds.), Specialised English: New directions in ESP and EAP research and practice (pp. 120–134). Routledge.

Coxhead & Walls (2012). TED Talks, vocabulary, and listening for EAP. TESOLANZ Journal, 20, 55–65.

Dang, T. N. Y., & Webb (2014). The lexical profile of academic spoken English. English for Specific Purposes, 33, 66–76.

Dang, T. N. Y., Coxhead, & Webb (2017). The Academic Spoken Word List. Language Learning, 67(4), 959–997. https://doi.org/10.1111/lang.12253

Dang, T. N. Y., Coxhead, & Webb (2021). Vocabulary in academic spoken English. New Zealand Studies in Applied Linguistics, 26(2). https://eprints.whiterose.ac.uk/174018; https://www.alanz.org.nz/journal/vocabularyin-academic-spoken-english/

Ha, H. T. (2021a). A Rasch-based validation of the Vietnamese version of the listening vocabulary levels test. Language Testing in Asia, 11(1), 16. https://doi.org/10.1186/s40468-021-00132-7

Ha, H. T. (2021b). Exploring the relationships between various dimensions of receptive vocabulary knowledge and L2 listening and reading comprehension. Language Testing in Asia, 11, 20. https://doi.org/10.1186/s40468-021-00131-8

Ha, H. T. (2022a). Vocabulary demands of informal spoken English revisited: What does it take to understand movies, TV programs, and soap operas? Frontiers in Psychology, 13(1), 1–7. https://doi.org/10.3389/fpsyg.2022.831684

Ha, H. T. (2022b). Lexical profile of newspapers revisited: A corpus-based analysis. Frontiers in Psychology, 13(1), 1–10. https://doi.org/10.3389/fpsyg.2022.800983

Ha, H. T. (2022c). Test format and local dependence of items revisited: A case of two vocabulary levels tests. Frontiers in Psychology, 12(1), 1–6. https://doi.org/10.3389/fpsyg.2021.805450

Heatley, Nation, I. S. P., & Coxhead (2002). Range: A program for the analysis of vocabulary in texts. http://www.victoria.ac.nz/lals/about/staff/paul-nation

Hsu (2018). The most frequent BNC/COCA mid- and low-frequency word families in English-medium traditional Chinese medicine (TCM) textbooks. English for Specific Purposes, 51, 98–110. https://doi.org/10.1016/j.esp.2018.04.001

Hsu (2022). To what extent may EFL undergraduates with EMI develop English vocabulary? The case of civil engineering. LEARN Journal: Language Education and Acquisition Research Network, 15(1), 469–494.

Hu & Nation, I. S. P. (2000). Unknown vocabulary density and reading comprehension. Reading in a Foreign Language, 13(1), 403–430.

Kamimoto (2014). Local item dependence on the vocabulary levels test revisited. Vocabulary Learning and Instruction, 3(2), 56–68. https://doi.org/10.7820/vli.v03.2.kamimoto

Lange & Matthews (2020). Exploring the relationships between L2 vocabulary knowledge, lexical segmentation, and L2 listening comprehension. Studies in Second Language Learning and Teaching, 10(4), 723–749. https://doi.org/10.14746/ssllt.2020.10.4.4

Laufer (1989). What percentage of text lexis is essential for comprehension? In Lauren & Nordman (Eds.), Special language: From humans thinking to thinking machines (pp. 316–323). Multilingual Matters.

Laufer (2013). Lexical thresholds for reading comprehension: What they are and how they can be used for teaching purposes. TESOL Quarterly, 47(4), 867–872. http://dx.doi.org/10.1002/tesq.140

Laufer (2021). Lexical thresholds and alleged threats to validity: A storm in a teacup? Reading in a Foreign Language, 33, 238–246.

Laufer & Ravenhorst-Kalovski, G. C. (2010). Lexical threshold revisited: Lexical text coverage, learners’ vocabulary size and reading comprehension. Reading in a Foreign Language, 22(1), 15–30.

Lewis (2002). Implementing the lexical approach: Putting theory into practice. Thomson Heinle.

Liu, C. Y., & Chen, H.-H.-J. (2019). Academic spoken vocabulary in TED Talks: Implications for academic listening. English Teaching & Learning, 43, 353–368.

Matthews & Cheng (2015). Recognition of high frequency words from speech as a predictor of L2 listening comprehension. System, 52, 1–13. http://doi.org/10.1016/j.system.2015.04.015

McLean & Kramer (2015). The creation of a new vocabulary levels test. Shiken, 19(1), 1–11.

McLean, Kramer, & Beglar (2015). The creation and validation of a listening vocabulary levels test. Language Teaching Research, 19(6), 741–760. https://doi.org/10.1177/1362168814567889

Nation, I. S. P. (2006). How large a vocabulary is needed for reading and listening? The Canadian Modern Language Review, 63(1), 59–82.

Nation, I. S. P. (2013). Learning vocabulary in another language (2nd ed.). Cambridge University Press.

Nation, I. S. P. (2016). Making and using word lists for language learning and testing. John Benjamins.

Nation, I. S. P. (2017). The BNC/COCA Level 6 word family lists (Version 1.0.0) [Data file]. http://www.victoria.ac.nz/lals/staff/paul-nation.aspx

Nation, I. S. P. (2020). About the BNC/COCA headword lists. https://www.wgtn.ac.nz/lals/resources/paul-nations-resources/vocabulary-lists

Nation, I. S. P. (2022). Learning vocabulary in another language (3rd ed.). Cambridge University Press.

Nation, I. S. P., & Coxhead (2021). Measuring native-speaker vocabulary size. John Benjamins Publishing Company.

Nesi & Thompson (2006). The British academic spoken English corpus manual. https://www.reading.ac.uk/acadepts/ll/base_corpus/

Nurmukhamedov (2017). Lexical coverage of TED talks: Implications for vocabulary instruction. TESOL Journal, 8, 768–790. https://doi.org/10.1002/tesj.323

Nurmukhamedov & Sharakhimov (2021). Corpus-based vocabulary analysis of English podcasts. RELC Journal. Advance online publication. https://doi.org/10.1177/0033688220979315

Nurmukhamedov & Webb (2019). Lexical coverage and profiling. Language Teaching, 52(2), 188–200. https://doi.org/10.1017/S0261444819000028

O’Keeffe, McCarthy, & Carter (2007). From corpus to classroom: Language use and language teaching. Cambridge University Press. https://doi.org/10.1017/CBO9780511497650

Phung, D. H., & Ha, H. T. (2022). Vocabulary demands of the IELTS listening test: An in-depth analysis. SAGE Open, 12(1), 1–13. https://doi.org/10.1177/21582440221079934

Qian, D. D., & Lin, L. H. F. (2020). The relationship between vocabulary knowledge and language proficiency. In Webb (Ed.), The Routledge handbook of vocabulary studies (pp. 66–80). Routledge.

Rahmat, Y. N., & Coxhead (2021). Investigating vocabulary coverage and load in an Indonesian EFL textbook series. Indonesian Journal of Applied Linguistics, 10(3), 804–814. https://doi.org/10.17509/ijal.v10i3.31768

Schmitt, Cobb, Horst, & Schmitt (2017). How much vocabulary is needed to use English? Replication of van Zeeland & Schmitt (2012), Nation (2006) and Cobb (2007). Language Teaching, 50(2), 212–226.

Schmitt, Jiang, & Grabe (2011). The percentage of words known in a text and reading comprehension. The Modern Language Journal, 95(1), 26–43. https://doi.org/10.1111/j.1540-4781.2011.01146.x

Schmitt, Nation, & Kremmel (2020). Moving the field of vocabulary assessment forward: The need for more rigorous test development and validation. Language Teaching, 53, 109–120. https://doi.org/10.1017/S0261444819000326

Stewart, Stoeckel, McLean, Nation, & Pinchbeck (2021). What the research shows about written receptive vocabulary testing: A reply to Webb. Studies in Second Language Acquisition, 43, 462–471. https://doi.org/10.1017/S0272263121000437

Stoeckel, McLean, & Nation (2021). Limitations of size and levels tests of written receptive vocabulary knowledge. Studies in Second Language Acquisition, 43, 181–203. https://doi.org/10.1017/S027226312000025X

Tegge (2017). The lexical coverage of popular songs in English language teaching. System, 67, 87–98.

van Zeeland & Schmitt (2013). Lexical coverage in L1 and L2 listening comprehension: The same or different from reading comprehension? Applied Linguistics, 34(4), 457–479. https://doi.org/10.1093/applin/ams074

Webb (Ed.). (2020). The Routledge handbook of vocabulary studies. Routledge.

Webb & Macalister (2013). Is text written for children useful for L2 extensive reading? TESOL Quarterly, 47(2), 300–322.

Webb & Nation, I. S. P. (2013). Teaching vocabulary. In Chapelle (Ed.), Encyclopedia of applied linguistics (pp. 5670–5677). Wiley-Blackwell.

Webb & Rodgers, M. P. H. (2009a). The lexical coverage of movies. Applied Linguistics, 30(3), 407–427. https://doi.org/10.1093/applin/amp010

Webb & Rodgers, M. P. H. (2009b). Vocabulary demands of television programs. Language Learning, 59(2), 235–366. https://doi.org/10.1111/j.1467-9922.2009.00509.x

Webb, Sasao, & Ballance (2017). The updated vocabulary levels test. International Journal of Applied Linguistics, 168(1), 33–69. https://doi.org/10.1075/itl.168.1.02web

West (1953). A general service list of English words: With semantic frequencies and a supplementary word-list for the writing of popular science and technology. Longman.

Yang & Coxhead (2020). A corpus-based study of vocabulary in the new concept English textbook series. RELC Journal, 53(3), 597–611. https://doi.org/10.1177/0033688220964162