
Abstract

The article presents a lexical study that investigates the lexical demands of academic written texts at different levels of writing. By employing the British National Corpus/Corpus of Contemporary American English (BNC/COCA) word list and the Academic Word List (AWL), the present study analyzed data from the British Academic Written English (BAWE) corpus, which contained 2,761 student assignments, and the Public Library of Science One (PLOS ONE) corpus, which included 4,000 scholarly articles. Results from the analyses demonstrated significant differences in lexical difficulty between students’ assignments and scholarly publications. The proportion of academic vocabulary in written texts was also found to increase as the writers’ levels went up. Cross-discipline comparisons highlighted the differences in lexical difficulty between scientific disciplines.

Plain Language Summary

Lexical demands of Academic Written English

The present study was designed to find an answer to the question concerning the number of words needed to understand academic writing. Two corpora of students’ assignments and research articles which contained papers from a range of disciplines were analyzed. The results from the analyses showed the number of words needed to gain acceptable and optimal comprehension of academic written texts for different scientific disciplines and levels of writing. The findings also provided evidence for the differences in lexical difficulty between academic texts at various levels and of different disciplines.

Keywords

academic word list, BNC/COCA, lexical coverage, academic writing, vocabulary

Introduction

The field of writing research has long acknowledged the significance of lexical complexity to the quality of writing and the proficiency of the writers (Barrot & Agdeppa, 2021; Biber et al., 2016; Bulté & Housen, 2014; Casal & Lee, 2019; Crossley & McNamara, 2014; Taguchi et al., 2013). The relationship between complexity and writing quality has been investigated through a wide range of research methodologies, from the comparison of noun phrase and clausal elaboration (Biber & Gray, 2010), to the examination of the occurrence of complex structures (Casal, 2020; Lu et al., 2020), to the degree to which complex and difficult words were used (Choemue & Bram, 2021; H. S. Ha, 2019; Maamuujav, 2021; Zhang et al., 2021). These examinations of complexity in academic writing have, in Casal and Lu’s (2021, p. 96) opinion, opened “a novel scholarly space” and raised “important questions regarding the ways in which academic writing can be called syntactically complex.”

Among several factors determining writing quality, vocabulary, or the words used in a text, is believed to be the most crucial one (Choemue & Bram, 2021; H. S. Ha, 2019; Maamuujav, 2021; Zhang et al., 2021). Research conducted by Breeze (2008), Choemue and Bram (2021), Fairclough and Belpoliti (2016), Maamuujav (2021), and Zhang et al. (2021) demonstrated that texts produced by more advanced writers contained a wider range of less frequent and academic words compared to those written by less proficient learners. Corpus-driven, wordlist-based studies on the lexical profile of various text genres also showed that different types of text may have different lexical demands (Tegge, 2017, Table 7). In other words, it has been found that different types of texts require readers to know different amounts of vocabulary to adequately understand them, and larger lexical knowledge would be needed to comprehend more difficult texts.

An undeniable strength of lexical profiling studies based on word-frequency lists is that they provide readers with concrete, quantified indications of text difficulty rather than statistical evidence of associations or significant differences. For example, while a lexical knowledge of 3,000 word families from Nation’s (2006) British National Corpus word list was required to adequately comprehend informal spoken English (Al-Surmi, 2014; Nation, 2006; S. Webb & Rodgers, 2009a, 2009b), a vocabulary knowledge of 4,000 word families would be needed for acceptable comprehension of academic spoken discourse (Dang & Webb, 2014; Nurmukhamedov, 2017). These indications serve as a universal guideline for researchers to compare the difficulty of different types of texts as well as for English teachers and learners to select suitable reading and listening materials (Nurmukhamedov & Sharakhimov, 2023, Table 1; Tegge, 2017, Table 7).

Although there have been a number of studies on the vocabulary profile of academic written texts (Cobb & Horst, 2004; Hsu, 2011, 2014, 2018, 2022; Hu et al., 2021; Hyland & Tse, 2007; Rahmat & Coxhead, 2021; Sun & Dang, 2020; Yang & Coxhead, 2022), most of them focused on the vocabulary loads of textbooks. The only paper that could be said to have examined the vocabulary profile of research articles was Hsu (2011). However, Hsu (2011) only tapped into the lexical demands of research papers in one discipline (Business). In addition, Hsu’s (2011) study employed Nation’s (2006) BNC wordlist, which is believed to be relatively biased and outdated (Schmitt et al., 2017).

Therefore, to the best of our knowledge, no research has provided insights into the vocabulary profile of academic texts at different levels of writing from a multidisciplinary perspective. More importantly, while there has been evidence of statistically significant differences between texts produced by writers at different levels, the degree of these differences has not been depicted in a way that could be easily understood by a general readership. To fill this methodological gap in the field, the present study employed the British Academic Written English (BAWE) corpus, which included written assignments of undergraduate and post-graduate students from three universities in the UK, and the PLOS ONE corpus, which consisted of scholarly articles from a top-tier scientific journal, to tap into the lexical differences of academic writing at different levels. The study also attempts to shed light on the variation in lexical demands between papers from different scientific disciplines. As the research utilized Nation’s (2017) comprehensive, up-to-date British National Corpus/Corpus of Contemporary American English (BNC/COCA) word lists and Coxhead’s (2000) Academic Word List, its findings contribute to the AWL- and BNC/COCA-based universal guideline of lexical profiles of texts.

Literature Review

Lexical Coverage, Word-Frequency List, and the Indication of Text Difficulties

The field of applied linguistics has long recognized the significance of vocabulary. Most, if not all, of the findings suggest that a rich lexical resource is fundamental to the operation and development of various aspects of language, such as language skills (Cheng & Matthews, 2018; H. T. Ha, 2021b; Lange & Matthews, 2020; Laufer & Ravenhorst-Kalovski, 2010; Qian & Lin, 2020; Schmitt et al., 2011; van Zeeland & Schmitt, 2013) and even the knowledge of grammatical structures (Barcroft, 2007; Guo & Roehrig, 2011; Lewis, 2002; Zhang, 2012; Zhang & Koda, 2013).

The importance of vocabulary is most clearly highlighted by the concept of lexical coverage, the proportion of words (also known as tokens or running words) a learner needs to be familiar with to gain adequate comprehension of a text. The lexical coverage required for adequate understanding has been the subject of several studies, and researchers have provided evidence-based answers to the question concerning the number of words required for adequate comprehension. According to Laufer (1989), 95% coverage was considered to be the lower boundary required for a sufficient understanding of a text. In other words, the remaining 5% means that one word in every 20 words, or every two lines, would be unfamiliar to the readers (Hu & Nation, 2000). For learners who want to read for pleasure, 98% coverage was advised as the upper boundary (Hirsh & Nation, 1992; Laufer & Sim, 1985). With this threshold, one out of every 50 words would be unfamiliar to the reader, a degree considered acceptable for unassisted reading comprehension (Hu & Nation, 2000) as well as the ideal coverage for written text (Laufer et al., 2004). In studies of listening comprehension, it was also shown that learners could attain adequate listening comprehension if 95% coverage was known, whereas 98% coverage may help learners reach a high level of understanding (van Zeeland & Schmitt, 2013).
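The arithmetic behind these thresholds can be illustrated with a minimal Python sketch. The function names, the toy sentence, and the set of "known" words below are hypothetical and serve only to show how a coverage percentage and the corresponding rate of unfamiliar words might be computed; they are not taken from any of the cited studies.

```python
def lexical_coverage(tokens, known_words):
    """Return the percentage of running words (tokens) that appear in known_words."""
    if not tokens:
        raise ValueError("tokens must not be empty")
    known = sum(1 for token in tokens if token.lower() in known_words)
    return 100.0 * known / len(tokens)


def words_per_unknown(coverage_pct):
    """Average number of running words per unfamiliar word at a given coverage."""
    if coverage_pct >= 100:
        return float("inf")  # every token is known
    return 100.0 / (100.0 - coverage_pct)


# Toy example (hypothetical data): 9 of the 10 tokens are assumed to be known.
tokens = "the cat sat on the mat near the quixotic door".split()
known = {"the", "cat", "sat", "on", "mat", "near", "door"}

print(lexical_coverage(tokens, known))  # 90.0
print(words_per_unknown(95))            # 20.0 -> one unfamiliar word in every 20
print(words_per_unknown(98))            # 50.0 -> one unfamiliar word in every 50
```

The second function simply restates the 95% and 98% boundaries as one unfamiliar word per 20 and per 50 running words, respectively.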

Another breakthrough in vocabulary research could be said to be the creation of word-frequency lists. These lists rank words in accordance with the number of times they occur in a corpus that reflects the use of real-life English, and thereby offer an indication of word-frequency level. The idea was simple: words that occur more frequently in real-life English would have higher chances of being acquired than less frequent ones. Moreover, as less frequent words are less likely to be known than higher-frequency words, they are perceived to be “more difficult” and therefore should be ranked at higher “levels.” Evidence-based arguments that support the relationship between word frequency and word difficulty were provided in the recent discussion between Hashimoto (2021) and Stewart et al. (2022). Supporting evidence for the association between the frequency and difficulty of words was also strongly documented in validation studies of vocabulary tests that are based on word-frequency lists (H. T. Ha, 2021a; McLean et al., 2015; S. Webb et al., 2017) and in research that used these vocabulary tests to examine the vocabulary knowledge of learners (Dang et al., 2022; H. T. Ha, 2021b; Noreillie et al., 2018; Stæhr, 2009).

Typical examples of the most popular and widely referenced word-frequency lists in the field are Nation’s (2006) BNC and Nation’s (2017) BNC/COCA wordlists. In 2006, Paul Nation created the BNC word-frequency list based on the British National Corpus (BNC). The BNC is a 100-million-token corpus composed of texts from various genres (e.g., spoken, fiction, magazines, newspapers, and academic). The BNC word list contained fourteen 1,000-word-family levels ordered by the range, dispersion, and frequency of English terms found in the BNC to approximate a text’s vocabulary level. The list came with two supplementary lists of proper nouns (PN) and marginal words (MW). Despite being a methodological basis for a great number of lexical profiling studies (Tegge, 2017, Table 7; Nurmukhamedov & Sharakhimov, 2023, Table 1; Nurmukhamedov & Webb, 2019), the BNC wordlist is often viewed as biased and outdated since it is based on texts from the 1980s to the early 1990s which only reflect British English (Davies, 2008; Schmitt et al., 2017). Several years later, in 2012, Paul Nation created the BNC/COCA word lists from the British National Corpus and the Corpus of Contemporary American English. The inclusion of the COCA introduced the representation of American English to the lists, extended the range of word-frequency levels from 14 to 25, and also added two new supplementary lists of transparent compounds (TC) and acronyms (AC) (Nation, 2017, 2020). As Nation keeps updating the lists (Nation, 2020), the BNC/COCA could be seen as a more comprehensive and up-to-date version of the BNC and may serve as a “better indication of word frequency” (Schmitt et al., 2017, p. 218). The word list was also shown to outperform other wordlists (Dang et al., 2022; Dang & Webb, 2016).

Another wordlist worth mentioning is Coxhead’s (2000) Academic Word List (AWL). The AWL consists of 570 word families generated from a 3.5-million-token corpus containing academic texts from four disciplinary sub-corpora: Arts, Commerce (including economics, marketing, and management), Law, and the Sciences (including biology, computing, mathematics, etc.). The AWL was originally created to supplement the General Service List (GSL). This means that the 570 word families in the AWL were not included in West’s (1953) 2,000-word-family GSL, so that the two wordlists could stack and the AWL would provide additional coverage for learners who have mastered the 2,000 most frequent word families in the GSL (Coxhead, 2000). The AWL was found to account for approximately 10% of the running words in academic written texts (Cobb & Horst, 2004; Hyland & Tse, 2007) and 4% to 5% in academic spoken discourse (Dang & Webb, 2014; Nurmukhamedov, 2017; Phung & Ha, 2022).

The Lexical Coverage as an Indication of Text Difficulty

The values of word-frequency levels, together with lexical coverage thresholds, have provided researchers with valuable insights into the number of words learners need to know to reasonably comprehend a particular type of text. Moreover, as texts at higher levels of difficulty often contain “high-level” vocabulary and more “academic” texts are usually made up of a larger proportion of academic words (Phung & Ha, 2022), these lexical coverage values could also be used as an indication of the lexical difficulty of a text. Over decades, research based on Nation’s (2006) BNC wordlist has documented a strong lexical profile of different types of texts (see Tegge, 2017, Table 7; Nurmukhamedov & Sharakhimov, 2023, Table 1; Nurmukhamedov & Webb, 2019). Recent BNC/COCA-based lexical profiling studies have also managed to map out the vocabulary demands of certain text genres. Table 1 provides a summary of recent literature concerning the differences in lexical demands between various types of texts. Besides the conclusion that knowledge of the 5,000 most frequent word families in the BNC/COCA wordlist would safely secure 98% coverage of most spoken texts, it could be observed that soap operas were less lexically demanding than movies, TV programs, and podcasts. In addition, these four representatives of informal English were also shown to be easier to understand than TED talks, which usually represent more academic spoken discourse. Phung and Ha (2022) also provided useful data regarding the differences in the level of lexical difficulty and “academic” character between different sections of the IELTS listening test.
More interestingly, as all listed studies were based on the same wordlist (BNC/COCA), researchers and teachers can link the levels of difficulty as well as see and “feel” the degree to which a certain type of text is more difficult or easier to understand compared to various types of texts in a quantified way, whether they are spoken or written, academic or non-academic.

Table 1.

Coverage Provided by the BNC/COCA Lists and the AWL Among Various Text Genres.

Text type	95% coverage (BNC/COCA)	Proportion of AWL items (%)	98% coverage (BNC/COCA)	Study
Soap operas	2,000 + PN, MW, TC, AC	-	4,000 + PN, MW, TC, AC	H. T. Ha (2022a)
Movies	2,000 + PN, MW, TC, AC	-	5,000 + PN, MW, TC, AC	H. T. Ha (2022a)
TV programs	3,000 + PN, MW, TC, AC	-	5,000 + PN, MW, TC, AC	H. T. Ha (2022a)
Podcasts	3,000 + PN, MW, TC, AC	-	5,000 + PN, MW, TC, AC	Nurmukhamedov and Sharakhimov (2023)
IELTS listening test section 1	2,000 + PN, MW, TC, AC	1.89	3,000 + PN, MW, TC, AC	Phung and Ha (2022)
IELTS listening test section 2	3,000 + PN, MW, TC, AC	2.92	4,000 + PN, MW, TC, AC	Phung and Ha (2022)
IELTS listening test section 3	3,000 + PN, MW, TC, AC	3.39	4,000 + PN, MW, TC, AC	Phung and Ha (2022)
IELTS listening test section 4	3,000 + PN, MW, TC, AC	5.85	5,000 + PN, MW, TC, AC	Phung and Ha (2022)
TED talks	3,000 + PN, MW, TC, AC	3.79	5,000–7,000 + PN, MW, TC, AC	Liu and Chen (2019); Nurmukhamedov (2017)
Newspapers	3,000–4,000 + PN, MW, TC, AC	-	6,000–8,000 + PN, MW, TC, AC	H. T. Ha (2022b)
Subject-focused textbooks	5,000 + PN, MW, TC, AC	-	10,000 + PN, MW, TC, AC	Hsu (2018, 2022)

Research Gap and the Present Study

While various studies have been conducted to investigate the lexical coverage of various types of written and spoken texts, the lexical demands of academic research articles have been examined to a very limited extent. Moreover, to date, the extent to which academic writing differs according to scientific disciplines and levels of writers has not been examined. The only study that could be said to have tapped into disciplinary differences in academic discourse was Dang and Webb (2014). In their study, the four disciplinary sub-corpora of the British Academic Spoken English (BASE) corpus were examined. By using Nation’s (2006) BNC lists, Dang and Webb (2014) showed that Social Sciences and Arts and Humanities were the least lexically demanding disciplines and Life and Medical Sciences was the most difficult-to-understand branch of science. However, as their study only examined academic spoken English, the investigation into disciplinary differences in academic written English remains warranted.

By analyzing the BAWE corpus, which included 2,761 students’ written assignments, and the PLOS ONE corpus, which consisted of 4,000 published scholarly articles, the research aims to shed light upon the lexical differences between various scientific disciplines and levels of writers. In particular, the research aims to answer the following questions:

To what extent does the coverage of high-, mid-, and low-frequency words differ according to different scientific disciplines and levels of writers?

To what extent does the vocabulary size needed to reach 95% and 98% coverage differ according to different scientific disciplines and levels of writers?

To what extent does the proportion of AWL items differ according to different scientific disciplines and levels of writers?

Methodology

The Corpora

For the present study, two corpora were compiled. The British Academic Written English (BAWE) corpus, a collection of texts produced by undergraduate and Master’s students in various scientific disciplines in the UK, was employed. The BAWE corpus is available online at Coventry University (http://www.coventry.ac.uk/bawe) and through the Sketch Engine corpus query interface (http://www.sketchengine.co.uk/). The BAWE corpus consisted of 2,761 proficiently graded student assignments collected at the Universities of Warwick, Reading, and Oxford Brookes with ESRC funding as part of the project entitled “An investigation of genres of assessed writing in British Higher Education” between 2004 and 2007. This 6.5-million-token corpus was made up of four broad disciplinary sub-corpora: Arts and Humanities (705 assignments), Life Sciences (683 assignments), Physical Sciences (596 assignments), and Social Sciences (777 assignments), across four levels of study (Year 1, Year 2, and Year 3 undergraduate, and taught Masters). The BAWE corpus was made up of different types of students’ assignments, ranging from essays, case studies, reviews, methodology recounts, research reports, and proposals to empathy writing. However, as the main purpose of the corpus creators was to reflect academic writing from different levels and disciplines (Alsop & Nesi, 2009), the assignments were unevenly distributed across these types, with essays contributing nearly half of the papers in the corpus.

Two considerations led to the selection of the BAWE corpus. First, since it was a corpus created from graded university assignments, BAWE represented academic writing at the student level, which made it an optimal point of comparison with academic writing at the publication level. Second, as one of the study’s objectives was to see if the disciplinary differences in academic spoken discourse (Dang & Webb, 2014) could be found in academic written texts, the fact that BAWE contained students’ essays from the four academic disciplines presented in Dang and Webb (2014) made it the best choice for this purpose.

The second corpus, Public Library of Science (PLOS) ONE, was used in this study to examine the lexical demands of academic written English at the expert level. The present study collected 4,000 scholarly articles from PLOS ONE via AntCorGen, a freeware corpus generation tool that searches for documents in the PLOS ONE research database (Anthony, 2019). In 2006, the Public Library of Science (PLOS) introduced PLOS ONE, a peer-reviewed open access scientific journal. PLOS ONE covers preeminent research from any discipline of science. In advance of publication, each submission must be evaluated by a PLOS ONE Editorial Board member and then go through a rigorous peer-review process. Estimates suggest that by 2010, PLOS ONE had become the most prominent journal globally (Morrison, 2011). Therefore, the use of the PLOS ONE corpus was important for examining the lexicon at the expert level, allowing a contrast to be drawn between the lexical demands of student and professional academic written texts. It is important to note that the Arts and Humanities sub-corpus was included in the Social Sciences sub-corpus in the PLOS ONE corpus. Therefore, in the present study, 19,636,598 tokens were derived from three disciplinary sub-corpora: Biology and Life Sciences, Physical Sciences, and Social Sciences. Each sub-corpus included 1,000 texts, except for Social Sciences, which included 2,000 texts.

The Level-of-Writing Hypothesis

The present study bases its analysis on the hypothesis that the levels of academic written texts could be reflected in (1) the writers’ level of study and (2) the rigor of the examination that the papers had to go through. For the first assumption, the authors considered Year 1, Year 2, Year 3, and Masters to be four different levels of writers, as students would benefit from additional training and their academic writing would improve accordingly. This was also in line with the purpose of Alsop and Nesi (2009) when creating the BAWE corpus.

The second assumption was made in an attempt to set up another threshold of writing level that goes beyond taught Masters students’ papers. These papers must be comparable in number of texts and variation of disciplines, be consistent in evaluation criteria, and, of course, represent a higher quality of writing than the taught Masters assignments. For a corpus of academic written texts that could tick all these boxes, the PLOS ONE corpus generated by AntCorGen (Anthony, 2019) appeared to be an optimal choice for several reasons. First, it is arguably the only time-efficient way to collect a number of academic written texts comparable to the BAWE corpus’s 2,761 written assignments. Second, as PLOS ONE is a multidisciplinary mega-journal that publishes research articles from all branches of science, it offers the possibility to collect scholarly articles from all the disciplines covered in the BAWE corpus. Third, as journals have relatively different criteria for paper acceptance, which sometimes depend on the editors, the reviewers, and even the ranking of the journal, it was necessary, if not crucial, to collect data from a single journal. As all of the papers in the PLOS ONE corpus were collected from the same journal, they were assumed to have been evaluated using relatively similar criteria. Last and most important, as PLOS ONE is one of the most highly ranked and regarded scholarly journals and sets strict criteria for paper acceptance, we assumed that these criteria represented higher standards of writing quality than the standards applied by the raters/lecturers to Masters’ papers in the BAWE corpus.

Data Preparation

Once downloaded, the BAWE corpus comprised a collection of XML and text files. The BAWE text files were sorted into separate folders according to discipline (Arts and Humanities, Life Sciences, Physical Sciences, and Social Sciences), and each discipline was further divided into levels ranging from Y1 to Masters. The process was carried out using Windows command lines. The researcher re-checked the data several times during the data collection procedure and occasionally repeated the process to ensure accuracy. Following their placement in the folders, the files were tallied and confirmed to be equal to the number of assignments submitted in 2008. The initial subjects were the same as those included in the BAWE corpus, comprising four broad disciplines: Arts and Humanities, Life Sciences, Physical Sciences, and Social Sciences. However, in the PLOS ONE corpus, Arts and Humanities were included in Social Sciences since they share many characteristics. Accordingly, the Arts and Humanities sub-corpus was combined with the Social Sciences sub-corpus in this research. As a result, there were three broad disciplines (Life Sciences, Physical Sciences, and Social Sciences) in the BAWE corpus discussed in the present study. The PLOS ONE corpus was collected using AntCorGen (Anthony, 2019) in accordance with the three disciplines and the number of assignments in each BAWE subject sub-corpus. The same number of texts was collected for Life Sciences and Physical Sciences. Since the Arts and Humanities sub-corpus was included in the Social Sciences sub-corpus, the number of texts collected for Social Sciences was doubled. Table 2 below shows the components of the BAWE and PLOS ONE corpora.

Table 2.

General Information About the Two Corpora.

Registers	Number of texts	Number of words
PLOS ONE
ALL	4,000	19,636,598
BLS	1,000	6,630,285
PS	1,000	4,005,976
SS	2,000	9,000,336
BAWE
ALL	2,761	6,576,852
LS	683	1,359,041
PS	596	1,326,900
SS	1,482	3,890,911

Data Analysis

The BAWE and PLOS ONE corpora were analyzed using the RANGE software (Nation & Heatley, 2002), a computer program that classifies the vocabulary of a text according to the word lists that are supplied to it. It can be downloaded from Paul Nation’s website: http://www.vuw.ac.nz/lals/staff/paul-nation/nation.aspx. RANGE comes with free downloadable word-frequency lists such as the General Service List (GSL)/AWL, the British National Corpus (BNC), and the BNC/Corpus of Contemporary American English (COCA). Nation’s (2017) twenty-five 1,000-word-family BNC/COCA lists were used with RANGE. The range and frequency of words’ occurrence in the BNC/COCA were used to construct these lists. The RANGE software categorized less common terms that did not belong to the first 25,000 word families as proper nouns, marginal words, acronyms, transparent compounds, or not in the lists. Three other base word lists were used with RANGE to assess the percentage of academic terms in the BAWE and PLOS ONE corpora. Baseword lists 1 and 2 comprised the first and the second 1,000 words of the GSL, and baseword list 3 was Coxhead’s (2000) AWL. Since RANGE could not read hyphenated words, hyphens in hyphenated items were replaced by spaces, and the resulting components were then classified according to their frequency in the BNC/COCA wordlists. For example, the hyphens in terms such as “life-saving” and “early-onset” were deleted, and the words were categorized based on the frequency of their single-word components, that is, “life saving” and “early onset.”
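The general logic of this kind of profiling can be sketched in a few lines of Python. The sketch below is not the RANGE program and does not reproduce its behavior in detail: the function names are hypothetical, the word lists are tiny invented sets rather than the actual BNC/COCA, GSL, or AWL lists, and word forms are matched directly instead of being grouped into word families. It only illustrates the two steps described above, namely replacing hyphens with spaces and assigning each token to the first list that contains it.

```python
import re
from collections import Counter


def tokenize(text):
    """Replace hyphens with spaces, lowercase the text, and extract word tokens."""
    text = text.replace("-", " ")
    return re.findall(r"[a-z']+", text.lower())


def profile(text, word_lists):
    """Count tokens per word list.

    word_lists maps a list name to a set of word forms; the lists are checked
    in insertion order, and a token is assigned to the first list containing it.
    Tokens found in no list are counted under "not in the lists".
    """
    counts = Counter()
    for token in tokenize(text):
        for name, words in word_lists.items():
            if token in words:
                counts[name] += 1
                break
        else:
            counts["not in the lists"] += 1
    return counts


# Hypothetical toy lists; real list membership would come from the word-list files.
toy_lists = {
    "list_1": {"early", "life", "needs", "disease"},
    "list_2": {"saving", "onset"},
    "academic": {"analysis"},
}

counts = profile("Early-onset disease needs life-saving analysis", toy_lists)
total = sum(counts.values())
for name, n in counts.items():
    print(name, n, round(100.0 * n / total, 2))
```

Dividing each count by the total number of tokens gives the coverage figure for that list, which is the quantity reported in the tables that follow.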

Results

Research Question 1: Variation in the Coverage of High-, Mid-, and Low-Frequency Words

Table 3 presents the lexical coverage of the PLOS ONE and BAWE corpora at each level of the BNC/COCA word lists. The most frequent 1,000 word families clearly made up most of the tokens in the corpora, and the proportion of words covered decreased as the frequency levels went down. To be more specific, for the BAWE corpus, the first 1,000-word level accounted for 67.93% of running words, while the second 1,000-word level provided only 12.43% coverage. Coverage at the first 1,000 word families was similar across the disciplines of the BAWE corpus. For the Life Sciences and Physical Sciences sub-corpora, the first 1,000-word level accounted for 67.07% and 67.67% of the tokens, respectively. Coverage was slightly higher in the Social Sciences sub-corpus, at 68.31%. At the 2,000-word-family level, only 12.60%, 12.73%, and 12.27% were added in the Life Sciences, Physical Sciences, and Social Sciences sub-corpora, respectively. This demonstrated the importance of knowing the most frequent word families. Coverage per level in the BAWE corpus fell below 1% (0.74%) at the sixth 1,000 word families. The same held for the Life Sciences (0.85%) and Physical Sciences (0.71%) sub-corpora, while in the Social Sciences sub-corpus coverage was already under 1% (0.99%) at the fifth level.

Table 3.

Coverage at Each Level in BNC/COCA Word Lists.

Word list	PLOS ONE				BAWE
Word list	ALL	Life Sciences	Physical Sciences	Social Sciences	ALL	Life Sciences	Physical Sciences	Social Sciences
1,000	59.75	56.52	56.66	63.31	67.93	67.07	67.67	68.31
2,000	13.54	12.74	12.54	14.52	12.43	12.60	12.73	12.27
3,000	12.36	13.07	11.40	12.27	9.36	9.25	9.96	9.19
4,000	2.68	2.80	3.41	2.27	2.10	2.41	2.31	1.92
5,000	1.60	1.83	2.20	1.18	1.16	1.52	1.30	0.99
6,000	1.07	1.59	1.22	0.64	0.74	0.85	0.71	0.71
7,000	0.64	0.75	0.88	0.45	0.53	0.66	0.53	0.49
8,000	0.51	0.52	0.79	0.38	0.37	0.45	0.41	0.34
9,000	0.35	0.43	0.45	0.24	0.23	0.37	0.19	0.19
10,000	0.22	0.26	0.39	0.13	0.19	0.27	0.21	0.15
11,000	0.26	0.42	0.30	0.13	0.16	0.23	0.14	0.14
12,000	0.18	0.22	0.36	0.08	0.13	0.19	0.16	0.10
13,000	0.22	0.37	0.22	0.12	0.10	0.20	0.10	0.07
14,000	0.21	0.24	0.30	0.15	0.10	0.20	0.08	0.06
15,000	0.17	0.22	0.42	0.04	0.07	0.14	0.08	0.04
16,000	0.19	0.39	0.25	0.02	0.08	0.16	0.11	0.05
17,000	0.09	0.13	0.16	0.03	0.05	0.09	0.05	0.04
18,000	0.08	0.12	0.12	0.03	0.07	0.11	0.07	0.05
19,000	0.08	0.10	0.18	0.01	0.03	0.05	0.04	0.02
20,000	0.11	0.23	0.13	0.01	0.04	0.08	0.04	0.02
21,000	0.05	0.05	0.09	0.03	0.02	0.04	0.03	0.02
22,000	0.04	0.05	0.09	0.01	0.01	0.03	0.01	0.01
23,000	0.15	0.14	0.26	0.11	0.01	0.03	0.01	0.01
24,000	0.02	0.03	0.03	0.00	0.01	0.02	0.01	0.01
25,000	0.02	0.03	0.04	0.01	0.01	0.02	0.01	0.01
Proper nouns	1.46	1.80	1.52	1.20	2.41	1.22	0.92	3.32
Marginal words	1.68	1.94	2.41	1.19	0.78	0.63	1.09	0.73
Transparent compounds	0.33	0.35	0.20	0.38	0.30	0.31	0.34	0.27
Acronyms	0.85	1.01	1.48	0.47	0.35	0.51	0.44	0.26
Not in the lists	1.09	1.64	1.50	0.61	0.22	0.27	0.23	0.19
Token	19,636,598	6,630,285	4,005,976	9,000,336	6,576,852	1,359,041	1,326,900	3,890,911

For texts at the publication level in the PLOS ONE corpus, the coverage at each level differed. Whereas the first 1,000 word families accounted for 67.93% of running words in the BAWE corpus, only 59.75% of running words in the PLOS ONE corpus were covered by the first 1,000-word level. The first 1,000 word families provided 56.52% and 56.66% coverage in the Life Sciences and Physical Sciences sub-corpora of the PLOS ONE corpus, respectively. In the Social Sciences sub-corpus, however, coverage was higher, at 63.31%, as was also the case in the BAWE corpus. Compared to the BAWE corpus, lexical coverage at the first 1,000-word level of the PLOS ONE corpus was lower. Specifically, whereas the first 1,000-word level in the BNC/COCA list accounted for 67.07%, 67.67%, and 68.31% of the words in the Life Sciences, Physical Sciences, and Social Sciences sub-corpora of the BAWE, respectively, the corresponding figures in the PLOS ONE corpus were 56.52%, 56.66%, and 63.31%. Coverage per level in the PLOS ONE corpus fell below 1% (0.64%) at the seventh 1,000 word families, one frequency level further down the list than in the BAWE corpus (the sixth 1,000 word families). The same held for the Life Sciences (0.75%) and Physical Sciences (0.88%) sub-corpora, while in the Social Sciences sub-corpus coverage was already under 1% (0.64%) at the sixth level.

The results indicated that coverage at the first 1,000 most frequent word families was lower in the PLOS ONE corpus than in the BAWE corpus, and that the level at which coverage fell below 1% was one frequency level further down the list in the PLOS ONE corpus than in the BAWE corpus. In both the BAWE and PLOS ONE corpora, the Social Sciences sub-corpora were less lexically demanding than the other two sub-corpora.
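The comparison of the level at which per-level coverage first drops below 1% can be reproduced directly from the figures in Table 3. The following is a minimal sketch: the function name is hypothetical, and only the first few rows of the relevant Table 3 columns are typed in.

```python
def first_level_below(coverage_by_level, threshold=1.0):
    """Return the 1-based 1,000-word-family level at which coverage first falls below threshold."""
    for level, pct in enumerate(coverage_by_level, start=1):
        if pct < threshold:
            return level
    return None  # coverage never drops below the threshold in the supplied rows


# Per-level coverage (%) copied from Table 3, starting at the first 1,000 word families.
bawe_all = [67.93, 12.43, 9.36, 2.10, 1.16, 0.74, 0.53]
plos_all = [59.75, 13.54, 12.36, 2.68, 1.60, 1.07, 0.64]
bawe_social = [68.31, 12.27, 9.19, 1.92, 0.99, 0.71]
plos_social = [63.31, 14.52, 12.27, 2.27, 1.18, 0.64]

print(first_level_below(bawe_all))     # 6
print(first_level_below(plos_all))     # 7
print(first_level_below(bawe_social))  # 5
print(first_level_below(plos_social))  # 6
```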

Research Question 2: Variation in the Vocabulary Knowledge Needed to Reach 95% and 98% Coverage

Table 4 shows the cumulative coverage with and without proper nouns (PN), marginal words (MW), transparent compounds (TC), and acronyms (AC) for the BAWE and PLOS ONE corpora. The data demonstrated the importance of these supplementary words to text comprehension. Without PN, MW, TC, and AC, 98% coverage of students’ assignments could not be reached, and readers would need to know 10,000 word families to understand 95% of the words in the BAWE corpus. More notably, for the PLOS ONE corpus, if all the supplementary words were assumed to be unknown, readers could not reach either comprehension threshold: cumulative coverage peaked at 94.58% at the 25,000-word level.

Table 4.

Cumulative Coverage (%) Excluding and Including Proper Nouns (PN), Marginal Words (MW), Transparent Compounds (TC), and Acronyms (AC) for the BAWE and PLOS ONE Corpora.

Word list	BAWE		PLOS ONE
Word list	Without	With	Without	With
1,000	67.93	71.77	59.75	64.08
2,000	80.36	84.20	73.29	77.61
3,000	89.72	93.55	85.65	89.98
4,000	91.82	95.66	88.33	92.65
5,000	92.98	96.82	89.92	94.25
6,000	93.72	97.56	91.00	95.33
7,000	94.26	98.09	91.64	95.96
8,000	94.63	98.47	92.14	96.47
9,000	94.86	98.70	92.49	96.81
10,000	95.05	98.88	92.71	97.04
11,000	95.21	99.04	92.97	97.30
12,000	95.34	99.17	93.16	97.49
13,000	95.44	99.27	93.38	97.71
14,000	95.53	99.37	93.59	97.92
15,000	95.60	99.44	93.76	98.09
16,000	95.69	99.52	93.95	98.28
17,000	95.74	99.57	94.04	98.37
18,000	95.81	99.64	94.12	98.44
19,000	95.84	99.68	94.19	98.52
20,000	95.88	99.71	94.30	98.63
21,000	95.90	99.74	94.35	98.68
22,000	95.91	99.75	94.39	98.72
23,000	95.93	99.76	94.54	98.87
24,000	95.94	99.77	94.56	98.89
25,000	95.95	99.78	94.58	98.91
Proper nouns	2.41		1.46
Marginal words	0.78		1.68
Transparent compounds	0.30		0.33
Acronyms	0.35		0.85
Not in the lists	0.22	0.22	1.09	1.09
Tokens	6,576,852	6,576,852	19,636,598	19,636,598
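As a minimal arithmetic check of how the two column types in Table 4 relate, the sketch below assumes that each "With" value is the "Without" value plus the summed coverage of the four supplementary lists. The table does not state this rule explicitly, and the function names are hypothetical.

```python
# Coverage (%) of the four supplementary lists, copied from Table 4.
SUPPLEMENTARY = {
    "BAWE": {"PN": 2.41, "MW": 0.78, "TC": 0.30, "AC": 0.35},
    "PLOS ONE": {"PN": 1.46, "MW": 1.68, "TC": 0.33, "AC": 0.85},
}

# (without, with) coverage at the first 1,000-word level, from Table 4.
FIRST_LEVEL = {"BAWE": (67.93, 71.77), "PLOS ONE": (59.75, 64.08)}


def supplementary_total(corpus):
    """Sum the coverage of PN, MW, TC, and AC for one corpus."""
    return round(sum(SUPPLEMENTARY[corpus].values()), 2)


def coverage_with_supplementary(corpus, without):
    """Add the supplementary coverage to a 'without' figure (assumed rule)."""
    return round(without + supplementary_total(corpus), 2)


for corpus, (without, reported) in FIRST_LEVEL.items():
    computed = coverage_with_supplementary(corpus, without)
    print(corpus, supplementary_total(corpus), computed, reported)
```

For BAWE the computed value matches the reported 71.77%. For PLOS ONE it comes to 64.07% against the reported 64.08%, a gap of 0.01 that presumably reflects rounding in the published figures.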

In contrast, with the help of PN, MW, TC, and AC, only 4,000 word families were necessary to reach 95.66% coverage of the BAWE corpus, less than half of the 10,000 word families needed without the support of the four supplementary lists. To reach 98% coverage of the BAWE corpus, a learner should know 7,000 word families plus PN, MW, TC, and AC. There were remarkable differences between the BAWE and PLOS ONE corpora in the vocabulary necessary to reach 95% and 98% coverage. Whereas 4,000 word families were sufficient to reach 95.66% coverage of the BAWE corpus, the most frequent 6,000 word families plus PN, MW, TC, and AC were required to reach 95.33% coverage of the PLOS ONE corpus. To attain 98% coverage of the PLOS ONE corpus, knowledge of 14,000 word families plus PN, MW, TC, and AC was expected (97.92%, marginally below the threshold, which was strictly passed at the 15,000-word level with 98.09%). It is worth noting that the number of word families required to reach 98% coverage of the PLOS ONE corpus (14,000) was twice the number required for the BAWE corpus (7,000).
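The threshold lookups described above can be sketched as follows, using the "With" columns of Table 4. The optional `tolerance` argument is an assumption introduced here for illustration: a value of 0.1 percentage points happens to reproduce the levels reported in the text (for example, 97.92% being read as 98%), but the source does not state any such rule.

```python
# Cumulative coverage (%) including PN, MW, TC, and AC, from Table 4.
# Index 0 corresponds to the 1,000-word level, index 1 to 2,000, and so on.
BAWE_WITH = [71.77, 84.20, 93.55, 95.66, 96.82, 97.56, 98.09]
PLOS_WITH = [64.08, 77.61, 89.98, 92.65, 94.25, 95.33, 95.96, 96.47,
             96.81, 97.04, 97.30, 97.49, 97.71, 97.92, 98.09]


def families_needed(cumulative, target, tolerance=0.0):
    """Return the smallest vocabulary size (in word families) whose
    cumulative coverage is at least target - tolerance, or None."""
    for index, coverage in enumerate(cumulative):
        if coverage >= target - tolerance:
            return (index + 1) * 1000
    return None


print(families_needed(BAWE_WITH, 95))                  # 4000
print(families_needed(BAWE_WITH, 98))                  # 7000
print(families_needed(PLOS_WITH, 95))                  # 6000
print(families_needed(PLOS_WITH, 98))                  # 15000 (strict)
print(families_needed(PLOS_WITH, 98, tolerance=0.1))   # 14000 (as reported)
```

The strict and tolerant readings differ only for the 98% threshold in PLOS ONE, which is why the choice between 14,000 and 15,000 word families depends on how near-threshold values are treated.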

Table 5 shows the cumulative coverage (%), including proper nouns (PN), marginal words (MW), transparent compounds (TC), and acronyms (AC), of the Life Sciences sub-corpus from Year 1 to Master level (in the BAWE corpus) and at the Publication level (in the PLOS ONE corpus). As shown in the table, the lexical demands for 95% coverage were similar at the Year 1, Year 2, and Year 3 levels. Together with PN, MW, TC, and AC, a vocabulary of 5,000 word families provided 96.16% (Year 1), 95.95% (Year 2), and 95.36% (Year 3) coverage. At the Master and Publication levels, however, a larger vocabulary was needed. Specifically, 6,000 word families were required to reach 95.83% coverage at the Master level, while 8,000 word families yielded 94.92% coverage at the Publication level, marginally below the 95% threshold.

Table 5.

Cumulative Coverage Including Proper Nouns (PN), Marginal Words (MW), Transparent Compounds (TC), and Acronyms (AC) of the Life Sciences Sub-corpus in the BAWE and PLOS ONE Corpus.

Word list	Year 1	Year 2	Year 3	Masters	Publication
1,000	71.51	70.68	68.78	68.23	61.61
2,000	83.89	82.78	81.33	81.48	74.35
3,000	92.52	92.38	90.97	90.62	87.43
4,000	94.73	94.59	93.64	93.20	90.23
5,000	96.16	95.95	95.36	94.83	92.05
6,000	96.87	96.65	96.37	95.83	93.65
7,000	97.46	97.29	97.01	96.56	94.40
8,000	97.86	97.65	97.48	97.11	94.92
9,000	98.24	97.99	97.85	97.50	95.35
10,000	98.45	98.22	98.06	97.89	95.61
11,000	98.61	98.46	98.30	98.15	96.03
12,000	98.75	98.63	98.45	98.41	96.26
13,000	98.95	98.80	98.66	98.63	96.63
14,000	99.11	99.00	98.86	98.86	96.86
15,000	99.22	99.14	99.01	99.02	97.08
16,000	99.39	99.30	99.14	99.19	97.47
17,000	99.49	99.38	99.22	99.28	97.61
18,000	99.59	99.48	99.39	99.38	97.72
19,000	99.65	99.54	99.43	99.43	97.83
20,000	99.69	99.61	99.55	99.53	98.06
21,000	99.74	99.63	99.57	99.59	98.11
22,000	99.76	99.66	99.60	99.63	98.16
23,000	99.79	99.69	99.62	99.67	98.30
24,000	99.80	99.71	99.63	99.70	98.33
25,000	99.81	99.71	99.66	99.72	98.36
Not in the lists	0.19	0.29	0.34	0.28	1.64
Token	292,252	393,991	245,095	427,703	6,630,285

Likewise, the vocabulary sizes needed to reach 98% coverage were similar in Year 1 and Year 2 of the BAWE corpus: 9,000 word families plus PN, MW, TC, and AC provided 98.24% coverage (Year 1) and 97.99% coverage (Year 2). Slightly larger vocabularies were required at Year 3 and the Master level. At Year 3, the most frequent 10,000 word families provided 98.06% coverage, while 11,000 word families were needed to reach 98.15% coverage at the Master level. At the Publication level, by contrast, 20,000 word families were needed to reach 98% coverage. This number was 2.5 times the vocabulary needed to achieve 95% coverage at the Publication level (8,000) and roughly twice the number of word families needed to achieve 98% coverage from Year 1 to Master level (Year 1 and Year 2: 9,000; Year 3: 10,000; Master: 11,000).

Table 6 shows the cumulative coverage (%), including proper nouns (PN), marginal words (MW), transparent compounds (TC), and acronyms (AC), of the Physical Sciences sub-corpus from Year 1 to Master level (in the BAWE corpus) and at the Publication level (in the PLOS ONE corpus). The table revealed inconsistencies in the vocabulary required to reach 95% and 98% coverage of the Physical Sciences sub-corpus, with a marked discrepancy between the Publication level and the remaining levels. Knowledge of 4,000 word families plus PN, MW, TC, and AC was sufficient for learners to reach 95% coverage from Year 1 to Master level. Specifically, the 4,000 most frequent word families provided 95.59% (Year 1), 94.93% (Year 2), 95.20% (Year 3), and 96.24% (Master) coverage. At the Publication level, however, learners needed a vocabulary of up to 9,000 word families to achieve 95.17% coverage, more than twice as much as at the other four levels.

Table 6.

Cumulative Coverage (%) Including Proper Nouns (PN), Marginal Words (MW), Transparent Compounds (TC), and Acronyms (AC) of the Physical Sciences Sub-corpus in the BAWE and PLOS ONE Corpus.

Word list	Year 1	Year 2	Year 3	Masters	Publication
1,000	73.01	69.88	69.44	70.03	62.28
2,000	84.50	82.22	82.36	84.02	74.82
3,000	93.28	92.59	92.69	94.21	86.21
4,000	95.59	94.93	95.20	96.24	89.62
5,000	96.89	96.40	96.49	97.38	91.82
6,000	97.62	97.13	97.28	97.94	93.04
7,000	98.15	97.69	97.89	98.33	93.93
8,000	98.58	98.18	98.32	98.60	94.71
9,000	98.75	98.40	98.51	98.77	95.17
10,000	98.94	98.64	98.76	98.93	95.55
11,000	99.09	98.82	98.91	99.03	95.86
12,000	99.18	99.06	99.07	99.16	96.22
13,000	99.28	99.17	99.16	99.23	96.43
14,000	99.38	99.23	99.26	99.32	96.73
15,000	99.47	99.32	99.35	99.38	97.14
16,000	99.56	99.49	99.46	99.45	97.39
17,000	99.62	99.53	99.50	99.52	97.55
18,000	99.71	99.59	99.55	99.60	97.67
19,000	99.75	99.63	99.58	99.64	97.85
20,000	99.79	99.67	99.63	99.68	97.99
21,000	99.82	99.71	99.66	99.70	98.08
22,000	99.82	99.73	99.67	99.71	98.17
23,000	99.83	99.75	99.68	99.71	98.43
24,000	99.85	99.76	99.69	99.73	98.46
25,000	99.86	99.77	99.71	99.73	98.50
Not in the lists	0.14	0.23	0.29	0.27	1.50
Token	289,922	303,616	410,896	322,466	4,005,976

The vocabulary needed to reach 98% coverage was inconsistent from Year 1 to Master level. Whereas the most frequent 6,000 word families plus PN, MW, TC, and AC provided 97.94% coverage at the Master level, 7,000 word families were needed to reach 98.15% coverage in Year 1, and 8,000 word families accounted for 98.18% and 98.32% of the running words in the Year 2 and Year 3 sub-corpora, respectively. To gain 98% coverage of the Physical Sciences sub-corpus at the Publication level, on the other hand, 20,000 word families plus PN, MW, TC, and AC were needed (97.99%), roughly 2.5 to 3.3 times the number of word families needed to achieve 98% coverage from Year 1 to Master level (Year 1: 7,000; Year 2 and Year 3: 8,000; Master: 6,000). This number matched the vocabulary needed to gain 98% coverage of the Life Sciences sub-corpus at the Publication level (Table 5).

Table 7 shows the cumulative coverage (%), including proper nouns (PN), marginal words (MW), transparent compounds (TC), and acronyms (AC), of the Social Sciences sub-corpus from Year 1 to Master level (in the BAWE corpus) and at the Publication level (in the PLOS ONE corpus). Whereas the vocabulary required to reach 95% and 98% coverage of the Life Sciences and Physical Sciences sub-corpora displayed some inconsistencies, the Social Sciences sub-corpus showed relative uniformity in the vocabulary needed to reach both thresholds. As shown in the table, the vocabulary required to reach 95% coverage was identical from Year 1 to the Publication level. Knowledge of the most frequent 4,000 word families plus PN, MW, TC, and AC accounted for 96.47% coverage in Year 1, 96.16% in Year 2, 96.23% in Year 3, 96.31% at the Master level, and 95.60% at the Publication level.

Table 7.

Cumulative Coverage (%) Including Proper Nouns (PN), Marginal Words (MW), Transparent Compounds (TC), and Acronyms (AC) of the Social Sciences Sub-corpus in the BAWE and PLOS ONE Corpus.

Word list	Year 1	Year 2	Year 3	Masters	Publication
1,000	74.74	73.53	72.73	70.61	66.54
2,000	86.47	85.47	84.96	83.83	81.06
3,000	94.74	94.25	94.17	94.31	93.33
4,000	96.47	96.16	96.23	96.31	95.60
5,000	97.36	97.22	97.24	97.30	96.78
6,000	98.06	97.95	97.98	97.98	97.42
7,000	98.55	98.44	98.46	98.50	97.87
8,000	98.89	98.79	98.82	98.80	98.25
9,000	99.09	98.99	99.02	98.98	98.48
10,000	99.24	99.15	99.16	99.12	98.61
11,000	99.38	99.29	99.31	99.25	98.74
12,000	99.48	99.39	99.42	99.34	98.83
13,000	99.54	99.45	99.50	99.41	98.95
14,000	99.60	99.52	99.55	99.48	99.10
15,000	99.64	99.57	99.59	99.52	99.13
16,000	99.67	99.63	99.62	99.59	99.16
17,000	99.71	99.67	99.66	99.63	99.19
18,000	99.77	99.72	99.70	99.69	99.21
19,000	99.79	99.74	99.75	99.71	99.23
20,000	99.80	99.77	99.76	99.72	99.23
21,000	99.82	99.78	99.78	99.74	99.26
22,000	99.82	99.79	99.79	99.75	99.27
23,000	99.83	99.80	99.79	99.76	99.38
24,000	99.83	99.81	99.80	99.76	99.38
25,000	99.84	99.81	99.81	99.77	99.39
Not in the lists	0.16	0.19	0.19	0.23	0.61
Token	882,987	1,118,099	924,695	941,682	9,000,336

Similarly, the vocabulary needed to achieve 98% coverage of the Social Sciences sub-corpus did not differ from Year 1 to Master level. The most frequent 6,000 word families plus PN, MW, TC, and AC provided 98.06%, 97.95%, 97.98%, and 97.98% coverage from Year 1 to Master level, respectively. A slightly larger vocabulary was required at the Publication level, where the most frequent 8,000 word families plus PN, MW, TC, and AC provided 98.25% coverage. Unlike in the Life Sciences and Physical Sciences sub-corpora, the lexical demands for 95% and 98% coverage of the Social Sciences sub-corpus at the Publication level did not differ markedly from those at the other education levels.

Research Question 3: Variation in the Distribution of Academic Vocabulary

The distribution of the AWL in each sub-corpus of the BAWE and PLOS ONE corpora is shown in Table 8. Although the AWL was not evenly distributed across the level sub-corpora within each subject sub-corpus, AWL coverage generally increased from Year 1 (8.53%) to Year 2 (9.44%), Year 3 (10.03%), Master (11.00%), and Publication (12.85%) levels.

Table 8.

Coverage of the BAWE and PLOS ONE Corpus and Each Sub-corpus by the GSL and the AWL (%).

Corpus	GSL		AWL
Corpus	First 1,000 words	Second 1,000 words	AWL
PLOS ONE
ALL	65.75	5.32	12.85
BLS	62.29	4.71	12.48
PS	63.18	5.67	11.20
SS	69.24	5.60	13.80
BAWE—Y1
ALL	74.57	5.09	8.53
LS	72.88	5.87	8.73
PS	74.98	5.71	8.84
SS	75.00	4.64	8.36
BAWE—Y2
ALL	73.35	5.12	9.44
LS	72.01	5.53	9.85
PS	72.44	5.62	10.60
SS	74.07	4.83	8.98
BAWE—Y3
ALL	72.26	5.44	10.03
LS	69.45	6.22	9.55
PS	71.40	5.84	11.44
SS	73.39	5.06	9.52
BAWE—M
ALL	71.09	5.72	11.00
LS	69.43	7.00	9.08
PS	72.00	5.98	11.63
SS	71.54	5.04	11.66

For the PLOS ONE corpus, AWL coverage was highest in the Social Sciences sub-corpus (13.80%) and lowest in the Physical Sciences sub-corpus (11.20%). Coverage of the AWL in the Biology and Life Sciences sub-corpus was 12.48%. In the BAWE corpus, the AWL was not evenly distributed across the three sub-corpora and varied across the four education levels from Year 1 to Master. In the Life Sciences sub-corpus, AWL coverage ranged from 8.73% to 9.85%. In the Physical Sciences sub-corpus, it ranged from 8.84% to 11.63%. Similarly, it ranged from 8.36% to 11.66% in the Social Sciences sub-corpus.
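The pattern in Table 8 can be checked with a minimal sketch. It assumes that the PLOS ONE rows labelled BLS, PS, and SS can be treated as a fifth (Publication) level of the corresponding BAWE disciplines, which is an assumption made here for illustration; the function names are hypothetical.

```python
# AWL coverage (%) from Table 8, ordered Year 1, Year 2, Year 3, Master,
# Publication. The Publication value for "LS" is the PLOS ONE "BLS" row.
AWL = {
    "ALL": [8.53, 9.44, 10.03, 11.00, 12.85],
    "LS": [8.73, 9.85, 9.55, 9.08, 12.48],
    "PS": [8.84, 10.60, 11.44, 11.63, 11.20],
    "SS": [8.36, 8.98, 9.52, 11.66, 13.80],
}


def is_strictly_increasing(values):
    """True if every value is larger than the one before it."""
    return all(a < b for a, b in zip(values, values[1:]))


def coverage_range(values):
    """Return the (lowest, highest) coverage in a sequence."""
    return min(values), max(values)


for name, values in AWL.items():
    bawe_only = values[:4]  # Year 1 to Master
    print(name, is_strictly_increasing(values), coverage_range(bawe_only))
```

Under this reading, AWL coverage rises at every step for the whole corpora and for Social Sciences, but not for Life Sciences (which dips after Year 2) or Physical Sciences (where the Publication figure is below the Master figure).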

Discussion

Data from the analyses of the BAWE corpus showed that readers need 4,000 to 6,000 and 6,000 to 11,000 word families in the BNC/COCA list to understand 95% and 98% of the running words in students’ assignments, respectively. These demands are slightly higher than the lexical demands of online newspaper and magazine articles found in H. T. Ha (2022b) and, to some extent, comparable to Hsu’s (2018, 2022) findings concerning subject-focused textbooks. On the other hand, research articles published in PLOS ONE would require up to 9,000 and 20,000 word families to achieve 95% and 98% coverage, respectively. This signals that scholarly publications are much more difficult to understand than students’ assignments. Given that native English speakers and ESL learners can only increase their lexical resource by approximately 1,000 (Nation & Coxhead, 2021) and 500 (Ozturk, 2016; S. A. Webb & Chang, 2012) word families a year, respectively, these differences in lexical demands between students’ assignments and scholarly articles could be deemed significant.
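To give a rough sense of scale for the growth rates cited above, the sketch below estimates how many years it would take to close a vocabulary gap. It assumes constant linear growth, which is a simplification introduced here and not a claim made by the cited studies. The starting and target sizes are the whole-corpus figures reported for Table 4, and the function name is hypothetical.

```python
import math


def years_to_close_gap(current, required, families_per_year):
    """Years needed to grow from current to required word families,
    assuming a constant yearly gain (a simplifying assumption)."""
    gap = max(0, required - current)
    return math.ceil(gap / families_per_year)


# 98% coverage: 7,000 word families (BAWE) versus 14,000 (PLOS ONE, as reported).
print(years_to_close_gap(7000, 14000, 1000))  # native-speaker rate: 7 years
print(years_to_close_gap(7000, 14000, 500))   # ESL-learner rate: 14 years

# 95% coverage: 4,000 word families (BAWE) versus 6,000 (PLOS ONE).
print(years_to_close_gap(4000, 6000, 1000))   # 2 years
print(years_to_close_gap(4000, 6000, 500))    # 4 years
```

Even under this optimistic linear assumption, the gap at the 98% threshold corresponds to many years of vocabulary growth.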

Research findings also revealed disciplinary differences in lexical difficulty. Table 9 compares the present study’s results to Dang and Webb’s (2014) findings. Although the two studies employed different word lists, certain similarities could be observed. The most noticeable observation is that Social Sciences was the least lexically demanding branch of science, generally requiring only half of what was needed to understand texts in Life and Physical Sciences. Although the present study could not identify a significant difference in lexical difficulty between Physical and Life Sciences, its findings were in line with Dang and Webb (2014) in that these two were the most lexically demanding scientific disciplines. It is interesting to see that papers in the field of Social Sciences, whether they were students’ assignments or scholarly articles, only required as much lexical knowledge as online newspapers and magazines (H. T. Ha, 2022b). While the present study did not offer evidence to support a scientific explanation for this similarity, we would hypothesize that it was due to similarities in topics. As papers in the field of Social Sciences often deal with issues in society, including business, education, politics, and other aspects of human life, they share certain language features with online newspapers and magazines, which often write about the same topics.

Table 9.

Disciplinary Differences in Lexical Demands in Academic Written and Spoken English.

Written academic English (The present study, BNC/COCA)		Spoken academic English (Dang & Webb, 2014, BNC)
95%	98%	95%	98%
BAWE-LS (5,000–6,000)	BAWE-LS (9,000–11,000)	LS (5,000)	LS (13,000)
BAWE-PS (4,000)	BAWE-PS (6,000–8,000)	PS (4,000)	PS (10,000)
BAWE-SS (4,000)	BAWE-SS (6,000)	AH (4,000)	AH (7,000)
PLOS ONE-LS (8,000)	PLOS ONE-LS (20,000)	SS (3,000)	SS (5,000)
PLOS ONE-PS (9,000)	PLOS ONE-PS (20,000)
PLOS ONE-SS (4,000)	PLOS ONE-SS (8,000)

Regarding whether the lexical profile of academic written English differs according to the writers’ levels, the present study suggests that the picture is inconsistent. The distinction in vocabulary requirements across levels is particularly noticeable in the Life Sciences and Physical Sciences sub-corpora. From Year 1 to Master level in these two sub-corpora, the vocabulary required typically varies only slightly (by approximately 1,000 to 2,000 word families). Still, as previously noted, the vocabulary demand at the Publication level is markedly higher (by two to almost three times). In the Social Sciences sub-corpus, the vocabulary required to achieve 95% coverage remains stable from Year 1 to Publication, while a somewhat larger, but not strikingly larger, vocabulary is required to achieve 98% coverage at the Publication level. These findings suggest that while the distinction in lexical difficulty between students’ assignments and scholarly articles was evident, the lexical differences between students’ writing at different levels were barely visible. Life Sciences could be said to be the only case where the hypothesized order of lexical difficulty was confirmed.

Concerning the coverage of the AWL in academic written English, the results showed that the AWL accounted for 8.53% to 12.85% of the running words. This coverage is considerably higher than that reported in studies of academic spoken English (Dang & Webb, 2014), university-based economics lectures (Thompson, 2006), and TED Talks (Nurmukhamedov, 2017), in which the AWL accounted for 4.41%, 4.90%, and 3.79%, respectively. It is conceivable that the AWL highlights words encountered in academic written English rather than academic spoken discourse (Nurmukhamedov, 2017). The results of this study are consistent with other studies of academic written English such as Cobb and Horst (2004), Hyland and Tse (2007), Coxhead (2000), and Martínez et al. (2009), in which the AWL accounted for 11.6%, 10.6%, 10%, and 9.06%, respectively. At 12.85%, AWL coverage in the PLOS ONE corpus was the highest of the figures reported here. This figure reaffirms the high academic level of the PLOS ONE corpus and the large vocabulary size it requires.

Unlike the results concerning 95% and 98% coverage, the findings on the coverage of AWL items seemed to support the proposed hypothesis about levels of writing. Specifically, the proportion of AWL items grew as the writers’ levels went up, from Year 1 to the Publication level: 8.53% (Y1) to 9.44% (Y2) to 10.03% (Y3) to 11.00% (M) to 12.85% (PLOS ONE). This aligns with Phung and Ha’s (2022) findings concerning the percentage of AWL items in the four sections of the IELTS listening test, as illustrated in Table 1. Moreover, the results are in line with Djiwandono (2016), who examined the differences in lexical richness between students’ and lecturers’ academic essays and found that learners at a less advanced level used fewer academic words than those at a more advanced level. Taken together, these findings provide further evidence for the assumption that more academic texts tend to contain more academic vocabulary, which was also supported by Choemue and Bram (2021).

Conclusion

This study has provided notable findings on the lexical demands of academic written English from a multidisciplinary perspective. Although the results did not show visible differences in lexical difficulty between different levels of students’ assignments, they demonstrated a considerable difference in lexical demands between students’ assignments and published scholarly articles. Regarding the disciplinary differences in lexical demands, although we could not replicate the order of lexical difficulty of Life Sciences > Physical Sciences > Social Sciences presented in Dang and Webb (2014), data analysis of both the BAWE and PLOS ONE corpora supported the claim that Social Sciences is the least lexically demanding, or the easiest-to-understand, discipline.

Despite these findings, several limitations of the present research should be noted. Although the current study contributes to the literature on lexical coverage and vocabulary demands in an academic written setting, it examined the coverage of only three disciplines: Life Sciences, Physical Sciences, and Social Sciences. These disciplines made up less than 30% of the subjects covered in the PLOS ONE journal, and other disciplines may be worth exploring beyond the focus of the current study. Moreover, while coverage may be the most influential factor, it is only one of several elements that might affect understanding, and even 100% coverage does not necessarily guarantee comprehension. Other determinants, such as learners’ background knowledge (S. A. Stahl et al., 1989, 1991; S. A. Stahl & Jacobson, 1986), the relevance of unknown words in context (S. Stahl, 1990), and individual differences (Mezynski, 1983; S. Stahl, 1990), may also affect comprehension of academic written discourse. Another limitation concerns the lack of differentiation between types of academic writing. As the present study analyzed academic papers without a close examination of the various text types (case study, critique, essay, review, proposal, and others), the analyzed data lacked sensitivity and variation. The assumption concerning writers’ levels used in this study was also a limitation worth mentioning, as scholarly publications could come from authors with any academic background, some of whom might be taught postgraduate or even undergraduate students.

As a result, further studies are encouraged to dig deeper into the performance of specific groups of writers with different levels of language proficiency. It would be interesting to correlate measures of writers’ productive vocabulary knowledge with the lexical coverage of their writing. Future corpus-driven studies should also examine disciplinary differences in lexical difficulty across a wider range of disciplines. Currently, AntCorGen (Anthony, 2019) can collect more than 100,000 research articles from 11 different disciplines, and the present study did not exploit the full potential of this corpus generator program. An in-depth investigation into the differences in lexical difficulty between various types of academic written texts would also be a valuable addition that fills the gap left by the present study.

Footnotes

Acknowledgements

The data in this study come from the British Academic Written English (BAWE) corpus, which was developed at the Universities of Warwick, Reading and Oxford Brookes under the directorship of Hilary Nesi and Sheena Gardner (formerly of the Centre for Applied Linguistics, Warwick), Paul Thompson (formerly of the Department of Applied Linguistics, Reading) and Paul Wickens (School of Education, Oxford Brookes), with funding from the ESRC (RES-000-23-0800). Details of how to access to the corpus via the Oxford Text Archive can be found at

Authors’ Contribution

The first author was responsible for data collection and analysis. All authors contributed equally to the preparation of the manuscript.

Declaration of Conflicting Interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) received no financial support for the research, authorship, and/or publication of this article.

Ethics Statement for Animal and Human Studies

Not applicable.

ORCID iD

Hung Tan Ha

Data Availability Statement

The datasets generated during and/or analyzed during the current study are available from the corresponding author on reasonable request.

References

1. Alsop & Nesi (2009). Issues in the development of the British academic written English (BAWE) corpus. Corpora, 4(1), 71–83. https://doi.org/10.3366/e1749503209000227

2. Al-Surmi (2014). TV shows, word coverage, and incidental vocabulary learning. In Bailey & Damerow (Eds.), Teaching and learning English in the Arabic Speaking world (pp. 132–147). Routledge.

3. Anthony (2019). AntCorGen (Version 1.1.2) [Computer Software]. Waseda University. https://www.laurenceanthony.net/software

4. Barcroft (2007). When knowing grammar depends on knowing vocabulary: Native speaker grammaticality judgements of sentences with real and unreal words. The Canadian Modern Language Review/La revue canadienne des langues vivantes, 63(3), 313–343. https://doi.org/10.3138/R601-H212-5582-0737

5. Barrot, J. S., & Agdeppa, J. Y. (2021). Complexity, accuracy, and fluency as indices of college-level L2 writers’ proficiency. Assessing Writing, 47, 100510. https://doi.org/10.1016/j.asw.2020.100510

6. Biber & Gray (2010). Challenging stereotypes about academic writing: Complexity, elaboration, explicitness. Journal of English for Academic Purposes, 9(1), 2–20.

7. Biber, Gray, & Staples (2016). Predicting patterns of grammatical complexity across language exam task types and proficiency levels. Applied Linguistics, 37, 639–668.

8. Breeze (2008). Researching simplicity and sophistication in student writing. International Journal of English Studies, 8(1), 51–66.

9. Bulté & Housen (2014). Conceptualizing and measuring short-term changes in L2 writing complexity. Journal of Second Language Writing, 26, 42–65.

10. Casal, J. E. (2020). An integrated corpus and genre analysis approach to writing research and pedagogy: Development of graduate student genre knowledge [Unpublished doctoral dissertation, The Pennsylvania State University].

11. Casal, J. E., & Lee, J. J. (2019). Syntactic complexity and writing quality in assessed first-year L2 writing. Journal of Second Language Writing, 44, 51–62.

12. Casal, J. E., & Lu (2021). ‘Maybe complicated is a better word’: Second-language English graduate student responses to syntactic complexity in a genre-based academic writing course. International Journal of English for Academic Purposes: Research and Practice, 2021, 95–114. https://doi.org/10.3828/ijeap.2021.7

13. Cheng & Matthews (2018). The relationship between three measures of L2 vocabulary knowledge and L2 listening and reading. Language Testing, 35(1), 3–25. https://doi.org/10.1177/0265532216676851

14. Choemue & Bram (2021). Lexical richness in scientific journal articles: A comparison between ESL and EFL Writers. Indonesian Journal of EFL and Linguistics, 6(1), 147. https://doi.org/10.21462/ijefl.v6i1.349

15. Cobb & Horst (2004). Is there room for an AWL in French? In Bogaards & Laufer (Eds.), Vocabulary in a second language: Selection, acquisition, and testing (pp. 15–38). John Benjamins.

16. Coxhead (2000). A new academic word list. TESOL Quarterly, 34(2), 213–238.

17. Crossley, S. A., & McNamara, D. S. (2014). Does writing development equal writing quality? A computational investigation of syntactic complexity in L2 learners. Journal of Second Language Writing, 26, 66–79.

18. Dang, T. N. Y., & Webb (2014). The lexical profile of academic spoken English. English for Academic Purposes, 33, 66–76. https://doi.org/10.1016/j.esp.2013.08.001

19. Dang, T. N. Y., & Webb (2016). Evaluating lists of high-frequency words. ITL - International Journal of Applied Linguistics, 167, 132–158.

20. Dang, T. N. Y., Webb, & Coxhead (2022). Evaluating lists of high-frequency words: Teachers’ and learners’ perspectives. Language Teaching Research, 26, 617–641.

21.

Davies

(2008). The Corpus of Contemporary American English (COCA). https://www.english-corpora.org/coca/

22.

Djiwandono

P. I.

(2016). Lexical richness in academic papers: A comparison between students’ and lecturers’ essays. Indonesian Journal of Applied Linguistics, 5(2), 209.

23.

Fairclough

Belpoliti

(2016). Emerging literacy in Spanish among Hispanic heritage language university students in the USA: A pilot study. International Journal of Bilingual Education and Bilingualism, 19, 185–201. https://doi.org/10.1080/13670050.2015.1037718

24.

Guo

Roehrig

A. D.

(2011). Roles of general versus second language (L2) knowledge in L2 reading comprehension. Reading in a Foreign Language, 23(1), 42–64

25.

H. S.

(2019). Lexical richness in EFL undergraduate students’ academic writing. English Teaching, 74(3), 3–28.

26.

H. T.

(2021a). A rasch-based validation of the Vietnamese version of the listening vocabulary levels test. Language Testing in Asia, 11(1), 1–19. https://doi.org/10.1186/s40468-021-00132-7

27.

H. T.

(2021b). Exploring the relationships between various dimensions of receptive vocabulary knowledge and L2 listening and reading comprehension. Language Testing in Asia, 11, 1–20. https://doi.org/10.1186/s40468-021-00131-8

28.

H. T.

(2022a). Lexical profile of newspapers revisited: A corpus-based analysis. Frontiers in Psychology, 13(1), 800983–801010. https://doi.org/10.3389/fpsyg.2022.800983

29.

H. T.

(2022b). Vocabulary demands of informal spoken English revisited: What does it take to understand movies, TV programs, and soap operas? Frontiers in Psychology, 13(1), 831684–831687. https://doi.org/10.3389/fpsyg.2022.831684

30.

Hashimoto

B. J.

(2021). Is frequency enough?: The frequency model in vocabulary size testing. Language Assessment Quarterly, 18(2), 171–187. https://doi.org/10.1080/15434303.2020.1860058

31.

Hirsh

Nation

I. S. P.

(1992). What vocabulary size is needed to read unsimplified texts for pleasure? Reading in a Foreign Language, 8(2), 689–696.

32.

Hsu

(2011). The vocabulary thresholds of business textbooks and business research articles for EFL learners. English for Specific Purposes, 30, 247–257. https://doi.org/10.1016/j.esp.2011.04.005

33.

Hsu

(2014). Measuring the vocabulary load of engineering textbooks for EFL undergraduates. English for Specific Purposes, 33, 54–65. https://doi.org/10.1016/j.esp.2013.07.001

34.

Hsu

(2018). The most frequent BNC/COCA mid- and low-frequency word families in English-medium traditional Chinese medicine (TCM) textbooks. English for Specific Purposes, 51, 98–110. https://doi.org/10.1016/j.esp.2018.04.001

35.

Hsu

(2022). To what extent may EFL undergraduates with EMI develop English vocabulary? The case of civil engineering. LEARN Journal: Language Education and Acquisition Research Network, 15(1), 469–494.

36.

Gao

Qiu

(2021). Lexical coverage and readability of science textbooks for english-medium instruction secondary schools in Hong Kong. SAGE Open, 11(1), 1–9. https://doi.org/10.1177/21582440211001867

37. Nation, I. S. P. (2000). Unknown vocabulary density and reading comprehension. Reading in a Foreign Language, 13(1), 403–430.

38. Hyland & Tse (2007). Is there an “academic vocabulary”? TESOL Quarterly, 41(2), 235–253.

39. Lange & Matthews (2020). Exploring the relationships between L2 vocabulary knowledge, lexical segmentation, and L2 listening comprehension. Studies in Second Language Learning and Teaching, 10(4), 723–749. https://doi.org/10.14746/ssllt.2020.10.4.4

40. Laufer (1989). What percentage of text lexis is essential for comprehension? In Lauren & Nordman (Eds.), Special language: From human thinking to thinking machines (pp. 316–323). Multilingual Matters.

41. Laufer, Elder, Hill, & Congdon (2004). Size and strength: Do we need both to measure vocabulary knowledge? Language Testing, 21, 202–226.

42. Laufer & Ravenhorst-Kalovski, G. C. (2010). Lexical threshold revisited: Lexical text coverage, learners’ vocabulary size and reading comprehension. Reading in a Foreign Language, 22(1), 15–30.

43. Laufer & Sim, D. D. (1985). Measuring and explaining the reading threshold needed for English for academic purposes texts. Foreign Language Annals, 18(5), 405–411.

44. Lewis (2002). Implementing the Lexical approach: Putting theory into practice. Thomson Heinle.

45. Liu, C. Y., & Chen, H. H. J. (2019). Academic spoken vocabulary in TED talks: Implications for academic listening. English Teaching & Learning, 43, 353–368. https://doi.org/10.1007/s42321-019-00033-2

46. Casal, J. E., & Liu (2020). The rhetorical functions of syntactically complex sentences in social science research article introductions. Journal of English for Academic Purposes, 44, 1–16.

47. Maamuujav (2021). Examining lexical features and academic vocabulary use in adolescent L2 students’ text-based analytical essays. Assessing Writing, 49, 100540. https://doi.org/10.1016/j.asw.2021.100540

48. Martínez, I. A., Beck, S. C., & Panza, C. B. (2009). Academic vocabulary in agriculture research articles: A corpus-based study. English for Specific Purposes, 28, 183–198.

49. McLean, Kramer, & Beglar (2015). The creation and validation of a listening vocabulary levels test. Language Teaching Research, 19(6), 741–760. https://doi.org/10.1177/1362168814567889

50. Mezynski (1983). Issues concerning the acquisition of knowledge: Effects of vocabulary training on reading comprehension. Review of Educational Research, 53, 253–279.

51. Morrison (2011). PLoS ONE: Now the world’s largest journal? The Imaginary Journal of Poetic Economics. Retrieved September 6, 2021, from https://poeticeconomics.blogspot.com/2011/01/plos-one-now-worlds-largest-journal.html

52. Nation, I. S. P. (2006). How large a vocabulary is needed for reading and listening? The Canadian Modern Language Review, 63(1), 59–82.

53. Nation, I. S. P. (2017). The BNC/COCA Level 6 word family lists (Version 1.0.0) [Data file]. http://www.victoria.ac.nz/lals/staff/paul-nation.aspx

54. Nation, I. S. P. (2020). About the BNC/COCA headword lists. https://www.wgtn.ac.nz/lals/resources/paul-nations-resources/vocabulary-lists

55. Nation, I. S. P., & Coxhead (2021). Measuring native-speaker vocabulary size. John Benjamins Publishing Company.

56. Nation, I. S. P., & Heatley (2002). Range: A program for the analysis of vocabulary in texts [software]. http://www.victoria.ac.nz/lals/staff/paul-nation/nation.aspx

57. Noreillie, A. S., Kestemont, Heylen, Desmet, & Peters (2018). Vocabulary knowledge and listening comprehension at an intermediate level in English and French as foreign languages: An approximate replication study of Stæhr (2009). ITL - International Journal of Applied Linguistics, 169(1), 212–231. https://doi.org/10.1075/itl.00013.nor

58. Nurmukhamedov (2017). Lexical coverage of TED talks: Implications for vocabulary instruction. TESOL Journal, 8(4), 768–790. https://doi.org/10.1002/tesj.323

59. Nurmukhamedov & Sharakhimov (2023). Corpus-based vocabulary analysis of English podcasts. RELC Journal, 54, 7–21. https://doi.org/10.1177/0033688220979315

60. Nurmukhamedov & Webb (2019). Lexical coverage and profiling. Language Teaching, 52(2), 188–200. https://doi.org/10.1017/S0261444819000028

61. Ozturk (2016). Second language vocabulary growth at advanced level. Language Learning Journal, 44(1), 6–16. https://doi.org/10.1080/09571736.2012.708054

62. Phung, D. H., & H. T. (2022). Vocabulary demands of the IELTS listening test: An in-depth analysis. Sage Open, 12(1), 1–13. https://doi.org/10.1177/21582440221079934

63. Qian, D. D., & Lin, L. H. F. (2020). The relationship between vocabulary knowledge and language proficiency. In Webb (Ed.), The Routledge Handbook of vocabulary studies (pp. 66–80). Taylor & Francis.

64. Rahmat, Y. N., & Coxhead (2021). Investigating vocabulary coverage and load in an Indonesian EFL textbook series. Indonesian Journal of Applied Linguistics, 10(3), 804–814. https://doi.org/10.17509/ijal.v10i3.31768

65. Schmitt, Cobb, Horst, & Schmitt (2017). How much vocabulary is needed to use English? Replication of van Zeeland & Schmitt (2012), Nation (2006) and Cobb (2007). Language Teaching, 50(2), 212–226. https://doi.org/10.1017/s0261444815000075

66. Schmitt, Jiang, & Grabe (2011). The percentage of words known in a text and reading comprehension. Modern Language Journal, 95, 26–43.

67. Stæhr, L. S. (2009). Vocabulary knowledge and advanced listening comprehension in English as a foreign language. Studies in Second Language Acquisition, 31, 577–607. https://doi.org/10.1017/s0272263109990039

68. Stahl (1990). Beyond the instrumentalist hypothesis: Some relationships between word meanings and comprehension (Technical Report No. 505).

69. Stahl, S. A., Hare, V. C., Sinatra, & Gregory, J. F. (1991). Defining the role of prior knowledge and vocabulary in reading comprehension: The retiring of number 41. Journal of Reading Behavior, 23(4), 487–508. https://doi.org/10.1080/10862969109547755

70. Stahl, S. A., & Jacobson, M. G. (1986). Vocabulary difficulty, prior knowledge, and text comprehension. Journal of Reading Behavior, 18(4), 309–323. https://doi.org/10.1080/10862968609547578

71. Stahl, S. A., Jacobson, M. G., Davis, C. E., & Davis, R. L. (1989). Prior knowledge and difficult vocabulary in the comprehension of unfamiliar text. Reading Research Quarterly, 24(1), 27–43. https://doi.org/10.2307/748009

72. Stewart, Vitta, J. P., Nicklin, McLean, Pinchbeck, G. G., & Kramer (2022). The relationship between word difficulty and frequency: A response to Hashimoto (2021). Language Assessment Quarterly, 19(1), 90–101. https://doi.org/10.1080/15434303.2021.1992629

73. Sun & Dang, T. N. Y. (2020). Vocabulary in high-school EFL textbooks: Texts and learner knowledge. System, 93, 102279. https://doi.org/10.1016/j.system.2020.102279

74. Taguchi, Crawford, & Wetzel, D. Z. (2013). What linguistic features are indicative of writing quality? A case of argumentative essays in a college composition program. TESOL Quarterly, 47(2), 420–430.

75. Tegge (2017). The lexical coverage of popular songs in English language teaching. System, 67, 87–98. http://doi.org/10.1016/j.system.2017.04.016

76. Thompson (2006). A corpus perspective on the lexis of lectures, with a focus on economic lectures. In Hyland & Bondi (Eds.), Academic discourse across disciplines (pp. 253–270). Peter Lang.

77. van Zeeland & Schmitt (2013). Lexical coverage in L1 and L2 listening comprehension: The same or different from reading comprehension? Applied Linguistics, 34, 457–479. https://doi.org/10.1093/applin/ams074

78. Webb, S. A., & Chang, A. C.-S. (2012). Second language vocabulary growth. RELC Journal, 43(1), 113–126. https://doi.org/10.1177/0033688212439367

79. Webb & Rodgers, M. P. H. (2009a). The lexical coverage of movies. Applied Linguistics, 30(3), 407–427.

80. Webb & Rodgers, M. P. H. (2009b). Vocabulary demands of television programs. Language Learning, 59(2), 335–366.

81. Webb, Sasao, & Ballance (2017). The updated vocabulary levels test. ITL - International Journal of Applied Linguistics, 168(1), 33–69. https://doi.org/10.1075/itl.168.1.02web

82. West (1953). A general service list of English words. Longmans, Green & Co.

83. Yang & Coxhead (2022). A corpus-based study of vocabulary in the new concept English textbook series. RELC Journal, 53, 597–611. https://doi.org/10.1177/0033688220964162

84. Zhang (2012). Vocabulary and grammar knowledge in second language reading comprehension: A structural equation modeling study. The Modern Language Journal, 96(4), 558–575. https://doi.org/10.1111/j.1540-4781.2012.01398.x

85. Zhang & Koda (2013). Morphological awareness and reading comprehension in a foreign language: A study of young Chinese EFL learners. System, 41(4), 901–913. https://doi.org/10.1016/j.system.2013.09.009

86. Zhang & Chen (2021). Developmental features of lexical richness in English writings by Chinese L3 beginner learners. Frontiers in Psychology, 12, 752950. https://doi.org/10.3389/fpsyg.2021.752950

Lexical Demands of Academic Written English: From Students’ Assignments to Scholarly Publications
