What can gaze behaviors, neuroimaging data, and test scores tell us about test method effects and cognitive load in listening assessments?

Abstract

The aim of this study was to investigate how test methods affect listening test takers’ performance and cognitive load. Test methods were defined and operationalized as while-listening performance (WLP) and post-listening performance (PLP) formats. To achieve the goal of the study, we examined test takers’ (N = 80) brain activity patterns (measured by functional near-infrared spectroscopy (fNIRS)), gaze behaviors (measured by eye-tracking), and listening performance (measured by test scores) across the two test methods. We found that the test takers displayed lower activity levels across brain regions supporting comprehension during the WLP tests relative to the PLP tests. Additionally, the gaze behavioral patterns exhibited during the WLP tests suggested that the test takers adopted keyword matching and “shallow listening.” Together, the neuroimaging and gaze behavioral data indicated that the WLP tests imposed a lower cognitive load on the test takers than the PLP tests. However, the test takers performed better with higher test scores for one of two WLP tests compared with the PLP tests. By incorporating eye-tracking and neuroimaging in this exploration, this study has advanced the current knowledge on cognitive load and the impact imposed by different listening test methods. To advance our knowledge of test validity, other researchers could adopt our research protocol and focus on extending the test method framework used in this study.

Keywords

Cognitive load theory, construct-irrelevant variance, functional near-infrared spectroscopy, gaze behavior, keyword matching, listening test

Listening comprehension includes bottom-up and top-down cognitive processes. The former involves the decoding and accrual of incoming auditory information, whereby small units of language (e.g., phonemes and morphemes) are decoded and combined rapidly to create larger units (e.g., clauses and discourse) in the mind (Field, 1999; Ockey & Wagner, 2018; Vandergrift, 2007; Vandergrift & Goh, 2009). In contrast, the latter entails making inferences about the incoming auditory information and closing the gaps in the message, based on the listener’s contextual and world knowledge (Field, 1999; Vandergrift, 2007; Vandergrift & Goh, 2009). Importantly, these cognitive processes of constructing meaning from auditory information make listening comprehension a challenging language skill to assess, especially under test conditions.

For high-stakes, academic listening tests, some researchers have questioned whether the test methods used can elicit the cognitive processes that represent those used by listeners during real-life academic lectures (Aryadoust, 2013; Field, 2009, 2013). Here, “test method” refers to the presentation format of the audio text and test items, and it encompasses while-listening performance (WLP) and post-listening performance (PLP) tests (Aryadoust, 2012). A WLP test, such as the listening sections of the International English Language Testing System (IELTS), requires test takers to read and answer the test items while the audio text is being presented (Aryadoust, 2012). In contrast, a PLP test (e.g., the listening sections of the Test of English as a Foreign Language internet-based test) allows test takers to read and answer the test items only after they have listened to the audio text and taken notes (Aryadoust, 2018).

Through retrospective verbal reports, Field (2009) suggested that the simultaneous need to listen to the audio text and read and answer test items during WLP tests requires test takers to split their attentional resources across visual (i.e., the test items) and auditory information (i.e., the audio text). To manage this information during WLP tests, test takers must use their working memory (WM), which is “a hypothetical cognitive system responsible for providing access to information required for ongoing cognitive processes” (Wilhelm et al., 2013, p. 1). As an individual’s WM capacity is limited (Wilhelm et al., 2013), Field (2009) assumed that the need to multi-task during the WLP tests may impose a “heavier” cognitive load on the test takers than notetaking-while-listening in simulated lectures (i.e., the PLP conditions). Field (2009) also reported that several high-scoring test takers in WLP tests adopted what he called “shallow listening”; that is, they could neither recall the main points of the lecture nor generate a coherent mental representation of the message conveyed. Field (2009) suggested that “shallow listening” occurred when the test takers extensively used visual cues in the test items, such as keywords and phrases, and matched them against the words that were presented in the audio text. This test-taking behavior is known as keyword or lexical matching (Field, 2009), and it appears to be a strategy that test takers used to mitigate the limitations of their WM. Importantly, this strategy can introduce sources of variance in test performance that are irrelevant to the listening construct, otherwise termed as construct-irrelevant variance (Messick, 1996).

These findings from Field (2009), together with the notion of “heavier” cognitive load in WLP tests, inspired Aryadoust’s (2020) recent eye-tracking study, which compared the differences in test takers’ gaze behaviors when reading test items in the presence and absence of auditory texts during a computer-mediated WLP test. Aryadoust (2020) suggested that gaze behaviors such as fixations (i.e., the temporal pauses in eye movements on the test items) and visits (i.e., visits to regions of interest on the test items, from entry to exit) (Holmqvist et al., 2011) may be measured to interpret cognitive load indirectly. Relatedly, studies in the field of human–computer interaction have reported that cognitive load is (i) inversely related to fixation rate, and (ii) positively related to fixation duration (Wang et al., 2018; Zagermann et al., 2016).

Although eye-tracking technology appears to be promising for investigating cognitive load, language testing researchers have not used test-takers’ gaze behavioral measures to examine the cognitive load imposed by different listening test methods (i.e., WLP vs. PLP). Nonetheless, the authors of a recent neuroimaging study using functional near-infrared spectroscopy (fNIRS) reported that listening while answering questions during the WLP tests imposed a lower cognitive load upon the test takers compared with notetaking-while-listening during PLP tests (also known as PLP-Audio) (Aryadoust et al., 2020). Notably, this finding is in contrast with Field’s (2009) assumption that a higher cognitive load is imposed by WLP tests. Aryadoust et al.’s (2020) study was inspired by previous research suggesting that brain activity directly measured through fNIRS is sensitive to cognitive load, with studies highlighting the positive linear relationship between brain activity levels and cognitive load (Fishburn et al., 2014; Li et al., 2005). Specifically, Aryadoust et al. (2020) showed that the test takers’ activity levels across the brain regions supporting top-down (i.e., left dorsomedial prefrontal cortex (dmPFC)) and bottom-up language processing (i.e., left inferior frontal gyrus (IFG) and left posterior middle temporal gyrus (pMTG)) were lower during the WLP tests than notetaking-while-listening during PLP tests (PLP-Audio). The study’s mapping of top-down and bottom-up processing on the above-mentioned brain regions was based on previous neuroimaging studies of language comprehension that associated top-down processing with the dmPFC (Buchweitz et al., 2014; Ferstl et al., 2008; Ferstl & von Cramon, 2002; Hickok & Poeppel, 2004; Saur et al., 2008), and bottom-up processing with the IFG and pMTG (Buchweitz et al., 2014; Buchweitz et al., 2009; Jobard et al., 2007; Michael et al., 2001; Whitney et al., 2010).

Together, the foregoing interpretations by Aryadoust et al. (2020) and Field (2009) on cognitive load warrant further investigation, as these two studies used different research methods and reported contrasting findings. To address these interpretations, we will discuss a cognitive load theoretical framework that accounts for listeners’ cognitive architecture in the next section.

Cognitive load theory: Cognitive load in listening tests

A leading theory in the field of educational research and instructional design is Sweller’s (1988) cognitive load theory. This theory assumes that an individual’s WM capacity is limited, and suggests that task designers should reduce the sources of extraneous cognitive load so that individuals can direct their cognitive resources to effectively complete the primary task at hand (Chandler & Sweller, 1991; Sweller et al., 2011). Sweller et al. (2011) suggested that “extraneous cognitive load is generated by the manner in which information is presented to [listeners] and is under the control of [the task] designer.” Thus, the extraneous cognitive load imposed on the test takers’ WM during listening tests may arise from the test methods that determine the way information is presented (Sweller, 2011; Sweller et al., 2011).

Crucial to the cognitive load theory and in contrast with the views of Field (2009), all sensory information received is processed according to its modality. This means that the WM subsystem that processes visual information differs from the WM subsystem that processes auditory information (Baddeley, 2012; Baddeley & Hitch, 1974). Importantly, the information received is mentally integrated across time and space (Sweller, 2011; Sweller et al., 2011). Thus, if a test method requires test takers to integrate disparate sources of information that are temporally and/or spatially segregated, then this can introduce a heavy extraneous cognitive load (also known as the split-attention effect (Chandler & Sweller, 1992; Sweller et al., 2011)). Contrary to the findings of Field (2009), the split-attention effect—as described by Chandler and Sweller (1992) and Sweller et al. (2011)—suggests that WLP tests may not impose heavier extraneous cognitive load than notetaking-while-listening during PLP tests. It is possible that the need to temporally integrate visual (i.e., test items) and auditory (i.e., listening text) information across a single time interval and within the same space during WLP tests afforded¹ test takers to adopt keyword matching and “shallow listening” (Field, 2009). This strategy can, in turn, result in significantly lower cortical activity and lower cognitive load during WLP tests compared to PLP tests (Aryadoust et al., 2020)—as opposed to Field’s (2009) claim of higher cognitive load induced by the WLP format.

Sweller’s (1988) cognitive load theory can also be interpreted as suggesting that answering test items during the computer-mediated PLP tests may impose a greater extraneous load, as test takers need to recall information about the listening text from their long-term memory (Kintsch, 1988) and split their visual attention spatially and temporally while gazing back and forth between their notes (written on paper) and the test items (on the monitor). Hence, the need to channel information from long-term memory into WM and integrate this information with the visual information that is segregated across time and space may lead to higher cortical activations when answering test items during PLP tests. When all else is kept equal, this theory suggests that test takers may perform better during WLP tests given the lighter cognitive load. Yet, Aryadoust et al. (2020) reported that WLP tests induced a heavier cognitive load compared to answering test items during PLP tests, with higher activity levels across the left dmPFC, IFG, and pMTG. Furthermore, these authors reported the absence of significant differences in listening test scores across test methods (Aryadoust et al., 2020). Importantly, these findings contrasted with the results of Field (2009), who reported that listeners scored lower when identifying the main points of lectures during WLP tests than in non-test PLP conditions.

The present study

At present, there is an inconsistency between the theoretical framework (i.e., Sweller’s (1988) cognitive load theory) and the experimental findings across small (n < 30) studies regarding cognitive load and performance for WLP and PLP test methods (Aryadoust et al., 2020; Field, 2009). In order to advance current knowledge on cognitive load and push forth research boundaries, the present study combined eye-tracking and fNIRS technologies to study the cognitive load imposed by different listening test methods by examining test takers’ gaze behavioral measures and brain activation patterns, respectively. In addition, test scores were tabulated to assess test takers’ measured listening performance across the computer-mediated WLP and PLP tests. To record event-related brain activation, we adopted the neuroimaging framework by Aryadoust et al. (2020) and presented the audio texts and test items in segments. For the WLP tests, the segmented audio texts and test items were concurrently presented to the test takers in a single phase. In contrast, the PLP tests were presented to the test takers in two phases, consisting of PLP-Audio and PLP-Question phases. During the PLP-Audio phase, test takers could take notes (on paper) while listening to the segmented audio texts. During the PLP-Question phase, test takers read their notes while answering the test items presented on a computer monitor. As the segments have varying durations, all measures of gaze behaviors are normalized or expressed as a function of segment duration (i.e., they are divided by segment duration). The hypotheses for this study are as follows:

Hypothesis One: In line with Sweller’s (1988) cognitive load theory and the findings from Aryadoust et al. (2020), it is hypothesized that the cognitive load imposed by the WLP tests will be lighter than that imposed by notetaking-while-listening in the PLP-Audio phase. As test takers will need to temporally integrate visual (i.e., test items) and auditory (i.e., listening text) information across a single time interval, they are expected to adopt the keyword matching strategy and “shallow listening” during the WLP tests, causing lighter cognitive load. In the absence of eye-tracking data during notetaking-while-listening in the PLP-Audio phase, “shallow listening” during the WLP tests will be inferred only from the fNIRS data and is reflected as lower brain activity levels across the left dmPFC (i.e., top-down processing), IFG and pMTG (i.e., bottom-up processing) regions when compared with notetaking-while-listening during the PLP-Audio phase. This hypothesis will not be tested by using gaze behaviors (fixations and visits), as test items will not be presented during the PLP-Audio phase during which the test takers will be looking away from the computer monitor while taking notes.

Hypothesis Two: Based on Sweller’s (1988) cognitive load theory, the cognitive load imposed by the WLP tests will be lighter than the answering of test items in the PLP-Question phase. To investigate this hypothesis, we employ gaze behavioral measures (fixations and visits) and brain activation data. We note that the interpretation of fixations and visits in WLP and the PLP-Question phase in the present study is not as straightforward as that in previous eye-tracking studies. Therefore, we first elaborate on how these metrics are interpreted in our study.

First, in line with the eye-tracking literature on cognitive load (Wang et al., 2018; Zagermann et al., 2016), it is anticipated that test takers (i) fixate within the test items more frequently (i.e., a higher fixation rate), and (ii) therefore spend a lower proportion of their time fixating on the test items on average and in total (i.e., lower normalized average/total fixation durations) during WLP tests than in the PLP-Question phase. In other words, since test takers fixate on the written test items rapidly during the WLP tests (higher fixation rate), each fixation will be shorter (lower fixation duration). This is because, consistent with Hypothesis One, test takers are expected to adopt the keyword matching strategy and “shallow listening” during the WLP tests (Field, 2009). By contrast, during the PLP-Question test phase, test takers are expected to (i) split their attention across the computer monitor and notes and (ii) integrate their mental representations of the audio text (oral stimuli) with the information from their notes and test items. This temporal and spatial segregation of information in the PLP-Question phase is expected to result in a higher cognitive load.

Second, during the WLP tests, test takers are expected to (i) visit the test items less frequently (i.e., lower visit rates), but (ii) spend a larger proportion of their time visiting the test items on average and in total than in the PLP-Question test phase (i.e., higher normalized average/total visit duration). Lower visit rates during the WLP tests are predicted because test takers do not need to alternate their gaze between the computer screen and notes since they do not have any written notes during this test; by contrast, during the PLP-Question phase, test takers will have to alternate their visits from the computer screen presenting the test items to their notes on paper, resulting in higher visit rates. Since test takers spend time visiting the computer screen in the WLP test, the duration of each visit will be longer. This result is treated as an indication of lower cognitive load in the WLP tests, as this visit pattern indicates a minimal temporal and spatial segregation of information (Sweller, 1988).

Finally, the test takers’ brain activity levels across the left dmPFC, IFG, and pMTG regions are expected to be lower during the WLP tests than when answering test items in the PLP-Question test phase, indicating lower cognitive load in the WLP test compared with the PLP-Question test phase.

Hypothesis Three: We predict that the test takers will perform better with higher test scores in the WLP tests than in the PLP tests, as the former imposes a lighter cognitive load than the latter. In this study, we investigate this hypothesis by controlling for the potential differences in the participants’ listening ability via a psychometrically validated baseline listening test, as the inconsistent results on performance across the studies of Aryadoust et al. (2020) and Field (2009) may be owing to the absence of control variables.

Methods

Participants

A total of 109 neurotypical participants (M = 24.0, standard deviation (SD) = 3.7 years; 60 females and 49 males) volunteered for this research. They were students from a local university and were recruited via posters and an online chat group dedicated to participant recruitment. Of these, 69 participants conversed in English as their first language and 99 were right-handers. There were two parts to this research: (i) a listening test validation (i.e., the IELTS-test specimen (IELTS-TS), see the “Instruments” section for details), and (ii) an eye-tracking and fNIRS experiment. All 109 participants partook in the part of the research that concerned validating aspects of the listening-test score uses. They were then invited to participate in the eye-tracking and fNIRS experiment until the target of 80 participants was reached (M = 24.1, SD = 4.0 years; 45 females and 35 males). Of these 80 participants, 48 conversed in English as their first language and 71 were right-handed. This research was approved by the university’s Institutional Review Board. All participants provided informed consent and were compensated with $30 at the end of the study.

Instruments

Edinburgh Handedness Inventory

Research that investigated the associations between handedness and hemispheric language dominance showed that the incidence of right-hemispheric language dominance was higher in left-handed individuals than right-handed individuals (Knecht et al., 2000; Szaflarski et al., 2012). Thus, we established the participants’ handedness by using an online version of the Edinburgh Handedness Inventory (EHI) to evaluate the differences in brain activation levels between left-handed (M = −61.7, SD = 31.6) and right-handed participants (M = 77.7, SD = 18.8). Using IBM SPSS Version 25 (IBM Corporation, 2017), the Mann-Whitney U tests indicated no significant differences between the two groups across the dmPFC, IFG, and pMTG at α = 0.0056 (i.e., following post-hoc Bonferroni corrections for nine pairs of comparisons). In view of this result and the small number of left-handed participants (n = 9), left-handed and right-handed participants were treated as one group in subsequent analyses.
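The corrected threshold reported above can be reproduced with a short calculation. The following is a minimal sketch that assumes a family-wise α of 0.05 (not stated explicitly in the text, but implied by 0.0056 × 9 ≈ 0.05); the function names are illustrative and not part of the original analysis, which was run in SPSS.

```python
def bonferroni_alpha(family_alpha, n_comparisons):
    """Per-comparison significance threshold after Bonferroni correction."""
    return family_alpha / n_comparisons


def flag_significant(p_values, family_alpha=0.05):
    """Mark each p-value as significant or not at the corrected threshold.

    family_alpha = 0.05 is an assumption made for this sketch.
    """
    threshold = bonferroni_alpha(family_alpha, len(p_values))
    return [p < threshold for p in p_values]


# Nine comparisons, as described for the handedness check.
corrected = bonferroni_alpha(0.05, 9)
print(round(corrected, 4))
```

With nine comparisons, a p-value must fall below roughly 0.0056 to be treated as significant.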

Computer-mediated listening tests

Three sets of listening tests were used for this research. First, a psychometrically validated IELTS-TS was used to establish participants’ baseline listening comprehension abilities (Aryadoust, 2012). The test was conducted on a computer via Google Sheets and involved listening to (i) a dialogue about an everyday situation, (ii) a monologue radio broadcast, (iii) an academic dialogue, and (iv) an academic lecture (see Table 1 for test item descriptions). Based on the alpha (α) coefficients calculated using JASP Version 0.11.1 (JASP Team, 2019), the IELTS-TS possessed a high internal consistency reliability with Cronbach’s α coefficients of 0.88 and 0.87 with and without partial credit, respectively (McDonald, 1999; Revelle & Zinbarg, 2008).
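For readers unfamiliar with the coefficient, Cronbach’s α can be computed from a matrix of item scores as sketched below. This is an illustrative implementation of the standard formula, not the JASP routine used in the study; the function name and the toy data are hypothetical.

```python
from statistics import variance


def cronbach_alpha(scores):
    """Cronbach's alpha for a score matrix.

    scores: list of rows, one row per test taker and one column per item.
    Partial-credit values (e.g., 0.5) are handled like any other score.
    """
    n_items = len(scores[0])
    # Sample variance of each item across test takers.
    item_variances = [variance([row[j] for row in scores]) for j in range(n_items)]
    # Sample variance of the total scores.
    total_variance = variance([sum(row) for row in scores])
    return n_items / (n_items - 1) * (1 - sum(item_variances) / total_variance)


# Toy data (hypothetical): three test takers, two perfectly consistent items.
toy_scores = [[1, 1], [2, 2], [3, 3]]
alpha = cronbach_alpha(toy_scores)
```

Scoring the same responses with and without partial credit and passing each matrix to the function would yield the two coefficients compared in the text.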

Table 1.

Description of the listening tests.

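	IELTS-Test specimen	IELTS-9 WLP	IELTS-9 PLP	IELTS-11 WLP	IELTS-11 PLP
Sample size	109	40	40	40	40
Number of items	40	10	10	10	10
Item descriptions	Four sections (in total): 10 questions each, 1 mark per question, 40 marks (in total); Section 1: 6 MCQ, 4 blanks; Section 2: 5 MCQ, 5 blanks; Section 3: 1 MCQ, 9 blanks; Section 4: 10 blanks	Section 4 only: 11 blanks, 1 mark each for 9 blanks, and 0.5 mark for 2 blanks; 10 marks (in total)	Section 4 only: 11 blanks, 1 mark each for 9 blanks, and 0.5 mark for 2 blanks; 10 marks (in total)	Section 4 only: 10 blanks, 1 mark per blank; 10 marks (in total)	Section 4 only: 10 blanks, 1 mark per blank; 10 marks (in total)
Font type and sizes	Arial, 18-point (headers), 12-point (body texts)	Times New Roman, 38-point (headers), 20-point (sub-headers and body texts)	Times New Roman, 38-point (headers), 20-point (sub-headers and body texts)	Times New Roman, 38-point (headers), 20-point (sub-headers and body texts)	Times New Roman, 38-point (headers), 20-point (sub-headers and body texts)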

Note: Blanks = fill-in-the-blank items, IELTS = International English Language Testing System, MCQ = multiple-choice questions, PLP = post-listening performance, and WLP = while-listening performance.

Second, two lectures (i.e., IELTS-9 (Cambridge ESOL, 2013), and IELTS-11 (Cambridge ESOL, 2016)) with very similar linguistic features, speech rate (M = 141, SD = 3 words per minute), and topics (i.e., Whales and Dolphins, and Ocean Biodiversity) were selected from a corpus of 56 commercially available IELTS practice tests (Cambridge University Press, 2021) for the eye-tracking and fNIRS experiment (see Table 1 for test item descriptions). Following the methods of Green et al. (2010), the linguistic features across all 56 transcribed lectures, such as coherence, narrativity, cohesion, syntactic simplicity, and lexical diversity, were determined using Coh-Metrix (McNamara et al., 2014). Using Latent GOLD Version 5.1 (Statistical Innovations, 2016), the linguistic features were then submitted to a latent class cluster (LCC) analysis to identify similar cases (i.e., the lecture texts) and classify them into homogeneous groups called latent classes or clusters (Magidson & Vermunt, 2004). The LCC is a robust classification method that is well suited for identifying homogeneous groups of texts based on their linguistic features. By using LCC, we were able to avoid relying on arbitrary selection criteria set by experts, an approach that is widely used in content analysis but has recently been critiqued (see Alderson & Kremmel, 2013).

We generated two- and three-cluster models and then selected the more parsimonious two-cluster solution following comparisons with the three-cluster solution. The two-cluster solution divided the text corpus into two clusters (see Appendix 1 for further information). Compared with Cluster one, the lectures from Cluster two had higher mean scores on the five above-mentioned Coh-Metrix variables. Thus, we chose two lectures from Cluster two for this study. Importantly, the choice of lecture texts from the same cluster allowed us to control for the potential effects of text-related variables.

The WLP and PLP versions for each lecture were subsequently created (i.e., IELTS-9_WLP, IELTS-9_PLP, IELTS-11_WLP, and IELTS-11_PLP). These tests possessed medium to high internal consistency reliability with Cronbach’s α coefficients ranging from 0.52 to 0.75. To measure event-related brain activation, the two lectures and their test items were divided into segments (i.e., eight for IELTS-9, and six for IELTS-11; see Appendix 2 for segment details). The average lecture duration per segment was 31.7s (range: 13s to 48s) for the IELTS-9 lecture and 57.2s (range: 31s to 72s) for the IELTS-11 lecture, and thus each segment in the WLP tests and the PLP-Audio phase was set according to these durations. As for the PLP-Question phase, each segment duration was set at 40s.

Research design and data collection procedures

This study was conducted in a laboratory over 189 sessions, where each session involved one participant. Each session lasted between 45 and 75 minutes. The first part of the study involved (i) collecting demographic information (including details on handedness) and (ii) completing the IELTS-TS (Figure 1A). The aforementioned tasks were computerized and presented on a secondary 23-inch monitor (Hewlett Packard, CA, USA), which was connected to a primary laptop (HP Pavilion 15, Hewlett Packard). The participants sat through the IELTS-TS, with the audio texts presented through two computer speakers (see Appendix 3 for test protocol details).

Figure 1.

Setup for the experimental study and the research design.

The second part of the study involved the administration of the WLP and PLP tests. All participants sat approximately 65cm in front of the same 23-inch monitor, which was connected to the primary laptop (HP Pavilion 15, Hewlett Packard) (Figure 1B). An infrared eye-tracker (Tobii X3-120, Tobii AB, Stockholm, Sweden), which was affixed to the monitor, recorded the participants’ gaze behaviors at 120Hz. The participants wore an fNIRS headcap (NIRx Medical Technologies LLC, NY, USA) that was affixed with eight pairs of light-emitting sources and detector probes covering the left dmPFC, IFG, and pMTG. Each pair of sources and detectors was placed approximately 1.5cm apart and made contact with the participant’s scalp to attain optimal signals. The non-invasive near-infrared light from the sources penetrates the scalp to reach the brain tissue in the cortex. During the penetration, some wavelengths that are specific to brain tissues are absorbed. Wavelengths that are not absorbed follow a “banana-shape” trajectory called “photon banana” within the brain tissue and reach the surface. These unabsorbed wavelengths are then detected by the detectors. The intensity of the detected light is then measured and used to differentiate the brain regions with high concentrations of the oxygenated blood (representing significant brain activation), and the deoxygenated blood (representing less activation) (Rupawala et al., 2018). The sources and detectors on the headcap were connected to a portable fNIRS system (NIRSport, NIRx Medical Technologies LLC) that recorded data at 7.81Hz.

As both the eye-tracker (Tobii AB) and fNIRS systems (NIRx Medical Technologies LLC) operated via near-infrared light, the participants also wore a customized aluminum cap to minimize infrared light interference (i.e., from the eye-tracker) on the fNIRS signals (Figure 1B). To synchronize the stimuli presented with the fNIRS signals, a c-pod (Cedrus Corporation) was used to facilitate event marker delivery from SuperLab Version 5.0.5 (Cedrus Corporation, 2015) to NIRStar Version 15-0 (NIRx Medical Technologies LLC, 2016b) (Figure 2). Event markers were set at the start of all segments for the WLP tests, the PLP-Audio, and the PLP-Question phases to synchronize the fNIRS and eye-tracking data as screen recordings were captured in Tobii Pro Studio Version 3.4.8 (Tobii AB, 2017) during the listening tests.

Figure 2.

The protocols for the while-listening performance and post-listening performance tests.

At the start of each session, participant calibration was first performed for the fNIRS system (NIRx Medical Technologies LLC) using an automated calibration process in NIRStar Version 15-0 (NIRx Medical Technologies LLC, 2016b) to establish the optimal amplification factors for all data channels. Subsequently, a five-point calibration process was performed in Tobii Pro Studio Version 3.4.8 (Tobii AB, 2017) to establish each participant’s gaze in relation to the screen. Following calibration, the listening tests began. All participants were instructed to minimize large head movements during the tests to reduce noise in the fNIRS signals and drifts in gaze data (i.e., the systematic errors between the participants’ actual and recorded gaze positions). Additionally, their eye positions were monitored in Tobii Pro Studio Version 3.4.8 via the primary laptop to ensure consistency in head-to-screen position.

To carry out the experiments, a single-blinded, pseudo-randomized crossover design was adopted where all participants listened to two lectures (i.e., IELTS-9 and IELTS-11) that were presented in different test methods (i.e., WLP and PLP) (Figure 1A). The order of the “lectures” and “test methods” was counterbalanced as shown in Figure 1A. In between the listening tests, both the fNIRS and eye-tracking systems were recalibrated to ensure that the quality of the data was maintained.
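One way such a counterbalanced crossover could be set up is sketched below. The exact allocation scheme is given in Figure 1A and is not reproduced here; the four sequences, the seeded shuffle, and the function names are assumptions made for illustration only.

```python
import random
from itertools import product

LECTURES = ("IELTS-9", "IELTS-11")
METHODS = ("WLP", "PLP")


def build_sequences():
    """All four combinations of lecture order and test-method order."""
    sequences = []
    for first_lecture, first_method in product(LECTURES, METHODS):
        # The second test uses the other lecture and the other method.
        second_lecture = LECTURES[1 - LECTURES.index(first_lecture)]
        second_method = METHODS[1 - METHODS.index(first_method)]
        sequences.append(((first_lecture, first_method),
                          (second_lecture, second_method)))
    return sequences


def assign_sequences(participant_ids, seed=0):
    """Shuffle participants (seeded) and rotate them through the sequences."""
    rng = random.Random(seed)
    ids = list(participant_ids)
    rng.shuffle(ids)
    sequences = build_sequences()
    return {pid: sequences[i % len(sequences)] for i, pid in enumerate(ids)}


allocation = assign_sequences(range(80))
```

Under this hypothetical scheme, 80 participants yield 40 observations in each lecture-by-method cell, which matches the sample sizes listed in Table 1.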

Data processing

Listening test scores

For IELTS-TS, together with the answers provided by IELTS, answers that were closely paraphrased or demonstrated an understanding of the concept(s) and/or meaning of the word(s) in the answer key were given full credit (Field, 2009). Partial credit was used for scoring the answers to give a better reflection of the participants’ understanding of the test items. Answers with multiple concepts (e.g., “information desk”) were given partial credit for each concept presented (e.g., two total marks for “information desk,” one mark for “information,” and one mark for “desk”). Answers without any concepts were given zero.

For IELTS-9 and IELTS-11, closely paraphrased answers were also accepted, together with the answers provided by IELTS. Only one test item within IELTS-9 was given partial credit, as there were two blanks (i.e., a half mark per blank). All other test items were scored zero or one.

Gaze behavioral data

Upon initial inspection of the eye-tracking data, 12 datasets (out of 160; 8%) were removed from further analyses owing to off-calibrations and/or poor tracking quality (see Appendix 2 for further details). In total, 74 sets of eye-tracking data were included for each of the WLP and PLP tests.

The sequences and durations for all 14 segments were first annotated in Tobii Pro Studio Version 3.4.8 (Tobii AB, 2017) to crop the eye-tracking data for analysis. Each segment comprised one area of interest (AOI), that is, the area in which the test item(s) was(were) presented on screen. Thus, a total of 14 customized polygons were drawn over the AOIs. As Tobii Pro Studio Version 3.4.8 (Tobii AB, 2017) does not offer post-hoc drift corrections, all polygons comprised wider buffers beyond the text-based AOIs to accommodate potential drifts from slight changes in the participants’ head positions during the tests and eye-tracker inaccuracy (see Appendix 2 for further details). Subsequently, the raw eye-tracking data were interpolated with a maximum gap length of 75ms to replace missing data that arose from tracking issues (Komogortsev et al., 2010; Olsen, 2012), and smoothed by using a moving median filter with a window size of three samples to reduce noise (Juhola, 1991; Olsen, 2012). Fixation eye movements were then parsed from saccades using a velocity threshold-identification filter (Stuart et al., 2019), with a velocity threshold at 30°/s (Olsen, 2012) and a minimum fixation duration at 100ms (Rayner, 1991).
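The pre-processing chain described above (gap interpolation, median smoothing, velocity thresholding, and a minimum fixation duration) can be sketched as follows. This is a simplified illustration, not the Tobii Pro Studio implementation: it assumes gaze coordinates are already expressed in degrees of visual angle, represents missing samples as None, and computes velocity from consecutive samples only. All function names are hypothetical.

```python
import math
from statistics import median

SAMPLE_RATE_HZ = 120        # eye-tracker sampling rate reported in the text
MAX_GAP_MS = 75             # longest gap that is interpolated
VELOCITY_THRESHOLD = 30.0   # degrees per second
MIN_FIXATION_MS = 100       # shorter fixation candidates are discarded


def interpolate_gaps(values, max_gap_samples):
    """Linearly fill interior runs of None no longer than max_gap_samples."""
    out = list(values)
    i = 0
    while i < len(out):
        if out[i] is None:
            start = i
            while i < len(out) and out[i] is None:
                i += 1
            gap = i - start
            if start > 0 and i < len(out) and gap <= max_gap_samples:
                left, right = out[start - 1], out[i]
                for k in range(gap):
                    out[start + k] = left + (right - left) * (k + 1) / (gap + 1)
        else:
            i += 1
    return out


def median_filter3(values):
    """Moving median with a three-sample window (edges left unchanged)."""
    out = list(values)
    for i in range(1, len(values) - 1):
        window = values[i - 1:i + 2]
        if None not in window:
            out[i] = median(window)
    return out


def detect_fixations(xs, ys):
    """Return fixations as (start_index, end_index) pairs, end exclusive."""
    max_gap = int(MAX_GAP_MS * SAMPLE_RATE_HZ / 1000)
    xs = median_filter3(interpolate_gaps(xs, max_gap))
    ys = median_filter3(interpolate_gaps(ys, max_gap))
    min_samples = math.ceil(MIN_FIXATION_MS * SAMPLE_RATE_HZ / 1000)
    fixations, start = [], None
    for i in range(len(xs)):
        slow = False
        if i > 0 and None not in (xs[i], ys[i], xs[i - 1], ys[i - 1]):
            step = math.hypot(xs[i] - xs[i - 1], ys[i] - ys[i - 1])
            slow = step * SAMPLE_RATE_HZ < VELOCITY_THRESHOLD
        if slow and start is None:
            start = i - 1
        elif not slow and start is not None:
            if i - start >= min_samples:
                fixations.append((start, i))
            start = None
    if start is not None and len(xs) - start >= min_samples:
        fixations.append((start, len(xs)))
    return fixations
```

At 120Hz, the 75ms gap limit corresponds to nine samples and the 100ms minimum fixation duration to twelve samples, so very brief pauses between saccades are not counted as fixations.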

Six variables were extracted for every AOI: (i) fixation counts, (ii) average fixation duration, (iii) total fixation duration, (iv) visit counts, (v) average visit duration, and (vi) total visit duration (see Tobii AB (2016) for variable definitions). For each AOI, the durational variables were then expressed as a proportion of the segment duration (i.e., as a percentage), whereas the count data were expressed as a function of segment duration (i.e., fixations or visits per second). To compare these data across the listening tests, all six variables of interest (i.e., normalized average fixation and visit durations, normalized total fixation and visit durations, and fixation and visit rates) were averaged across the AOIs in each test.
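The normalization can be sketched for one durational variable and one count variable. The function names and the example numbers are hypothetical and serve only to show the arithmetic.

```python
from statistics import mean


def normalize_aoi(total_fixation_ms, fixation_count, segment_s):
    """Return (total fixation duration as % of segment, fixations per second)."""
    pct = 100.0 * (total_fixation_ms / 1000.0) / segment_s
    rate = fixation_count / segment_s
    return pct, rate


def average_over_aois(aois):
    """Average the normalized measures over all AOIs belonging to one test."""
    normalized = [normalize_aoi(*aoi) for aoi in aois]
    return mean(p for p, _ in normalized), mean(r for _, r in normalized)


# Hypothetical AOIs: (total fixation duration in ms, fixation count, segment duration in s)
aois = [(20000, 80, 40.0), (15000, 90, 30.0)]
print(average_over_aois(aois))
```

Because each AOI is normalized by its own segment duration before averaging, the test-level value is a mean of ratios, which can differ from the ratio of the test-level means.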

Neuroimaging data

Pre-processing was performed in NIRSlab Version 2016.06 (NIRx Medical Technologies LLC, 2016a) to reduce noise, improve data quality, and prepare the data for analysis (see Appendix 4 for further details). Overall, data quality was good, with approximately 75% of the channels passing the data quality check (i.e., M = 16, SD = 3 channels out of 20). Participants with fewer than 10 channels (i.e., 50%) passing the quality check were removed from further analysis, as their data could be biased by the few remaining channels; nine datasets were removed in total. Subsequently, an oxygenated hemoglobin (HbO) beta (β) value was computed for each channel and condition. The HbO β values were then averaged according to channel location within the left dmPFC, IFG, and pMTG, respectively, to produce one HbO β value for each brain area.

Only HbO values were analyzed, as these data best represent an individual’s brain activity levels relative to the deoxygenated and total hemoglobin levels (Strangman et al., 2002). Three sets of HbO values were extracted for each participant based on activity levels during the WLP, PLP-Audio, and PLP-Question tests. Thus, nine HbO values per participant were included in the analyses (i.e., three conditions by three brain regions).
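The exclusion rule and the regional averaging for one participant and one condition can be sketched as follows. The channel-to-region mapping and the beta values are hypothetical; the actual montage is not described in this section.

```python
from statistics import mean

MIN_GOOD_CHANNELS = 10  # exclusion rule from the text: fewer than 10 of 20 channels


def region_betas(betas, good_channels, region_of):
    """Average HbO beta values over the good channels in each region.

    Returns None when the participant fails the channel-count rule.
    """
    if len(good_channels) < MIN_GOOD_CHANNELS:
        return None
    grouped = {}
    for ch in good_channels:
        grouped.setdefault(region_of[ch], []).append(betas[ch])
    return {region: mean(values) for region, values in grouped.items()}


# Hypothetical layout: 12 good channels, four per region, for one condition.
region_of = {ch: ("dmPFC", "IFG", "pMTG")[ch // 4] for ch in range(12)}
betas = {ch: ch * 1e-5 for ch in range(12)}
print(region_betas(betas, list(range(12)), region_of))
```

Repeating this for the WLP, PLP-Audio, and PLP-Question conditions would give the nine values per participant described above.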

Statistical analysis

Normality across all test scores, gaze behaviors, and brain activity data was first assessed using the Shapiro-Wilk statistics. All data violated the assumption of normality, and they were unsuitable for data transformation and parametric analysis (i.e., analysis of covariance (ANCOVA) and linear mixed model). Thus, non-parametric analyses were performed using IBM SPSS Statistics (IBM Corporation, 2017).

Listening test scores

The overall test scores across test methods (i.e., WLP and PLP) and lectures (i.e., IELTS-9 and IELTS-11) were analyzed using a two-way rank analysis of covariance (RANCOVA) (Quade, 1967), according to the procedures described on the IBM SPSS support page (IBM Support, 2018). Although both lectures were similar in terms of their linguistic features, they were different in duration. Thus, “lectures” was included as a main factor in addition to “test methods.” In addition, the participants’ overall IELTS-TS scores were included in the analysis as a covariate to account for individual differences in measured listening ability.
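The logic of Quade's rank ANCOVA can be sketched in three steps: rank the dependent variable and the covariate, regress the centered score ranks on the centered covariate ranks, and run an analysis of variance on the residuals. The sketch below is not the SPSS procedure used in the study. It covers a single factor only (the study used two factors and their interaction), the helper names are hypothetical, and the data are invented.

```python
from statistics import mean


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def quade_residuals(scores, covariate):
    """Residuals from regressing centered score ranks on centered covariate ranks."""
    ry, rx = average_ranks(scores), average_ranks(covariate)
    dy = [r - mean(ry) for r in ry]
    dx = [r - mean(rx) for r in rx]
    slope = sum(a * b for a, b in zip(dx, dy)) / sum(a * a for a in dx)
    return [y - slope * x for x, y in zip(dx, dy)]


def one_way_f(values, groups):
    """Return (F, df_between, df_within) for a one-way ANOVA."""
    grand = mean(values)
    levels = sorted(set(groups))
    ss_between = ss_within = 0.0
    for level in levels:
        vals = [v for v, g in zip(values, groups) if g == level]
        m = mean(vals)
        ss_between += len(vals) * (m - grand) ** 2
        ss_within += sum((v - m) ** 2 for v in vals)
    df_b, df_w = len(levels) - 1, len(values) - len(levels)
    return (ss_between / df_b) / (ss_within / df_w), df_b, df_w


# Invented data: test scores, a baseline covariate, and the test method per participant.
scores = [7, 9, 6, 8, 9, 10, 8, 9]
baseline = [20, 25, 18, 30, 22, 28, 19, 27]
method = ["PLP"] * 4 + ["WLP"] * 4
print(one_way_f(quade_residuals(scores, baseline), method))
```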

Gaze behavioral data

The normalized gaze behavioral data for each IELTS lecture were analyzed independently using Mann-Whitney U tests to assess the overall differences between WLP and the PLP-Question phase. The effect sizes (r) for non-parametric statistics were calculated using the formula $r = \frac{Z}{\sqrt{N}}$, where Z represents the test statistic output and N reflects the number of observations (A. P. Field, 2013). According to A. P. Field (2013), an absolute r value ranging from 0.1 to 0.3 indicates a small effect size, 0.3 to 0.5 represents a medium effect size, and a value above 0.5 represents a large effect size.
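The effect-size calculation can be sketched from first principles. The sketch counts pairwise comparisons to obtain U and converts it to Z with the normal approximation, without tie or continuity corrections, so its Z may differ slightly from SPSS output. The function names, the "negligible" label for values below 0.1, and the data are assumptions.

```python
import math


def mann_whitney_u(x, y):
    """U statistic for x: number of (x, y) pairs with x > y, counting ties as 0.5."""
    return sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in x for b in y)


def z_from_u(u, n1, n2):
    """Normal approximation to the U distribution (no tie or continuity correction)."""
    mu = n1 * n2 / 2
    sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    return (u - mu) / sigma


def effect_size_r(z, n_observations):
    """r = Z / sqrt(N)."""
    return z / math.sqrt(n_observations)


def effect_label(r):
    """Thresholds as summarized in the text; 'negligible' is an added label."""
    size = abs(r)
    if size >= 0.5:
        return "large"
    if size >= 0.3:
        return "medium"
    if size >= 0.1:
        return "small"
    return "negligible"


# Invented normalized fixation durations (%) for two independent groups.
wlp = [0.91, 0.85, 0.88, 0.79, 0.95]
plp_question = [0.55, 0.61, 0.58, 0.49, 0.83]
u = mann_whitney_u(wlp, plp_question)
z = z_from_u(u, len(wlp), len(plp_question))
r = effect_size_r(z, len(wlp) + len(plp_question))
print(u, round(z, 2), round(r, 2), effect_label(r))
```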

Neuroimaging data

The HbO β values for each brain area were compared using Wilcoxon signed rank tests to assess differences between the (i) WLP and PLP-Audio and (ii) WLP and PLP-Question tests. The effect sizes were calculated based on A. P. Field’s (2013) recommendations.

Results

In this section, test scores are presented as the mean and 95% confidence interval (CI), whereas gaze behavioral and fNIRS data are reported as the mean and SD.

Test scores

Controlling for participants’ baseline listening performance, the two-way RANCOVA procedure indicated that the main effect of “test methods” was significant, F(1,156) = 24.63, p < .0001, whereas the main effect of “lectures” was not significant, F(1,156) = 1.95, p = .16. The interaction between “test methods” and “lectures” was also significant, F(1,156) = 12.76, p < .001, Figure 3A. Post-hoc analyses with Bonferroni correction indicated that participants scored significantly lower for IELTS-11_PLP (M = 7.33, 95% CI [6.76, 7.89]) than IELTS-11_WLP (M = 8.93, 95% CI [8.36, 9.49]), p < .0001. No difference in test scores was, however, observed between IELTS-9_PLP (M = 7.66, 95% CI [7.10, 8.23]) and IELTS-9_WLP (M = 8.13, 95% CI [7.56, 8.69]), p = .33. Additionally, the differences in test scores between IELTS-9 and IELTS-11 were significant for WLP (p < .001), but not for PLP (p = .13).

Figure 3.

Test scores and gaze behavioral data.

Gaze behavioral data

In terms of the normalized average fixation duration, the Mann-Whitney U tests showed that the participants fixated on the test items for a longer period on average during WLP (M = 0.87, SD = 0.11%) than the PLP-Question phase (M = 0.57, SD = 0.08%) for IELTS-9, U = 21561, p < .0001, r = −0.42 (Figure 3B). In contrast with IELTS-9, the participants’ normalized average fixation duration was shorter during WLP (M = 0.44, SD = 0.05%) than the PLP-Question phase (M = 0.56, SD = 0.08%) for IELTS-11, U = 10339.5, p < .0001, r = −0.49 (Figure 3B).

For (i) the normalized total fixation duration and (ii) fixation rate, the Mann-Whitney U tests indicated that the participants not only spent more time fixating on the test items in total but also made more fixations per unit time during WLP than the PLP-Question phase for IELTS-9 (i: WLP M = 53, SD = 11%, the PLP-Question phase M = 34, SD = 10%, U = 13637, p < .0001, r = −0.58, Figure 3C; ii: WLP M = 226, SD = 33 fixations/s, the PLP-Question phase M = 148, SD = 30 fixations/s, U = 10940, p < .0001, r = −0.64, Figure 3D). Similarly to IELTS-9, the normalized total fixation duration for IELTS-11 was also longer during WLP (M = 53, SD = 12%) compared with the PLP-Question phase (M = 33, SD = 10%), U = 7358.5, p < .0001, r = −0.60 (Figure 3C). In terms of the fixation rate, participants also had more fixations on the test items per unit time during WLP (M = 221, SD = 37 fixations/s) than the PLP-Question phase (M = 146, SD = 33 fixations/s), U = 7358.5, p < .0001, r = −0.65 (Figure 3D).

Moreover, the differences between WLP and the PLP-Question phase for (i) normalized average visit duration and (ii) normalized total visit durations were also statistically significant for IELTS-9 (i: WLP M = 26, SD = 9%, the PLP-Question phase M = 11, SD = 5%, U = 15397, p < .0001, r = −0.55, Figure 3E; ii: WLP M = 82, SD = 9%, the PLP-Question phase M = 59, SD = 12%, U = 2901, p < .0001, r = −0.60, Figure 3F). Similarly, the Mann-Whitney U tests also indicated that participants spent more time visiting the test items (i) on average and (ii) in total during WLP than the PLP-Question phase for IELTS-11 (i: WLP M = 15, SD = 6%, the PLP-Question phase M = 12, SD = 7%, U = 16326.5, p < .0001, r = −0.28, Figure 3E; ii: WLP M = 83, SD = 6%, the PLP-Question phase M = 59, SD = 14%, U = 7993.5, p < .0001, r = −0.58, Figure 3F).

In terms of the visit rate, the Mann-Whitney U test showed that participants visited the test items more frequently during the PLP-Question phase (M = 18, SD = 4 visits/s) than WLP (M = 16, SD = 4 visits/s) for IELTS-9, U = 32448, p < .0001, r = −0.20 (Figure 3G). Likewise, the participants also visited the test items more frequently during the PLP-Question phase (M = 17, SD = 5 visits/s) than WLP (M = 13, SD = 4 visits/s) for IELTS-11, U = 15063, p < .0001, r = −0.33 (Figure 3G).

Neuroimaging data

In terms of the brain activity levels, the Wilcoxon signed rank tests revealed significantly more activation during the PLP-Audio phase than in WLP across the left (i) dmPFC (the PLP-Audio phase β M = 4.19 × 10⁻⁵, SD = 9.75 × 10⁻⁵, WLP β M = −9.46 × 10⁻⁵, SD = 11.47 × 10⁻⁵, Z = −6.47, p < .0001, r = −0.77), (ii) IFG (the PLP-Audio phase β M = 5.84 × 10⁻⁵, SD = 10.50 × 10⁻⁵, WLP β M = −1.40 × 10⁻⁵, SD = 16.64 × 10⁻⁵, Z = −3.16, p = .002, r = −0.38), and (iii) pMTG (the PLP-Audio phase β M = −1.12 × 10⁻⁵, SD = 12.73 × 10⁻⁵, WLP β M = −6.30 × 10⁻⁵, SD = 14.08 × 10⁻⁵, Z = −2.11, p = .035, r = −0.25). As for the comparisons between the PLP-Question phase and WLP, there was significantly more activation in the former only in the left dmPFC (the PLP-Question phase β M = −3.57 × 10⁻⁵, SD = 18.19 × 10⁻⁵, WLP β M = −9.46 × 10⁻⁵, SD = 11.47 × 10⁻⁵, Z = −2.87, p = .004, r = −0.34). No differences were, however, observed for the left IFG and pMTG (p > .05, Table 2).

Table 2.

Comparisons of beta values by conditions.

Left brain area	Conditions	Beta values (×10⁻⁵)	Comparisons	Z	p-value	r
dmPFC	WLP	−9.46 ± 11.47
	PLP-Audio	4.19 ± 9.75	WLP vs. PLP-Audio	−6.47	< .0001	−0.77+++
	PLP-Question	−3.57 ± 18.19	WLP vs. PLP-Question	−2.87	.004	−0.34++
IFG	WLP	−1.40 ± 14.64
	PLP-Audio	5.84 ± 10.50	WLP vs. PLP-Audio	−3.16	.002	−0.38++
	PLP-Question	1.65 ± 19.94	WLP vs. PLP-Question	−1.04	.30	−0.12+
pMTG	WLP	−6.30 ± 14.08
	PLP-Audio	−1.12 ± 12.73	WLP vs. PLP-Audio	−2.11	.035	−0.25+
	PLP-Question	−6.22 ± 16.72	WLP vs. PLP-Question	−0.47	.64	−0.06

Note: All beta values are presented in mean ± standard deviation. + small effect size, ++ medium effect size, +++ large effect size. Abbreviations: dmPFC = dorsomedial prefrontal cortex, IFG = inferior frontal gyrus, PLP = post-listening performance, pMTG = posterior middle temporal gyrus, r = effect size, WLP = while-listening performance, and Z = Wilcoxon test statistic output.
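The effect sizes in Table 2 can be checked against the formula r = Z/√N. The sample size used for the Wilcoxon tests is not stated in this section; the sketch below assumes N = 71 (80 participants minus the nine excluded fNIRS datasets). Under that assumption, all six tabulated r values are reproduced to two decimal places.

```python
import math

N_ASSUMED = 71  # assumption: 80 participants minus the nine excluded fNIRS datasets

# Z statistics copied from Table 2: (brain area, comparison) -> Z
table_z = {
    ("dmPFC", "WLP vs. PLP-Audio"): -6.47,
    ("dmPFC", "WLP vs. PLP-Question"): -2.87,
    ("IFG", "WLP vs. PLP-Audio"): -3.16,
    ("IFG", "WLP vs. PLP-Question"): -1.04,
    ("pMTG", "WLP vs. PLP-Audio"): -2.11,
    ("pMTG", "WLP vs. PLP-Question"): -0.47,
}


def effect_size(z, n=N_ASSUMED):
    """r = Z / sqrt(N)."""
    return z / math.sqrt(n)


for (area, comparison), z in table_z.items():
    print(f"{area:6s} {comparison:22s} r = {effect_size(z):.2f}")
```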

Summary of findings

The results of the study showed that the cognitive load for the WLP tests was lighter compared with both notetaking-while-listening in the PLP-Audio phase and answering of test items during the PLP-Question phase. During the WLP tests, the test takers displayed significantly lower brain activity levels across (i) the left dmPFC (supporting top-down processing) and left IFG and pMTG (supporting bottom-up processing) than in the PLP-Audio phase, and (ii) the left dmPFC and IFG than in the PLP-Question phase. Across both IELTS lectures, the test takers fixated more frequently within the test items during the WLP tests than in the PLP-Question phase. However, they spent larger proportions of their time fixating on the test items, both on average and in total, during the WLP test than in the PLP-Question phase only for the IELTS-9 lecture. During both WLP tests, the test takers also visited the test items less frequently and spent larger proportions of their time visiting them (i.e., on average and in total) than in the PLP-Question phase. Lastly, the test takers obtained higher test scores only for IELTS-11_WLP compared with IELTS-11_PLP. There was, however, no significant difference in test scores between IELTS-9_WLP and IELTS-9_PLP.

Discussion

The aim of this study was to investigate the effects of different test methods on (i) cognitive load by examining test takers’ gaze behaviors and brain activity patterns, and (ii) measured listening performance via their test scores. The three hypotheses of the study are discussed below.

Hypothesis One: The cognitive load imposed by the WLP tests will be lighter than that imposed by notetaking-while-listening in the PLP-Audio phase.

In order to examine this hypothesis, the test takers’ brain activation patterns across regions supporting top-down and bottom-up language processing were measured. We found that test takers had lower brain activity levels across the left dmPFC, IFG, and pMTG during the WLP tests than in the PLP-Audio phase. These results are not only consistent with the findings of Aryadoust et al. (2020), but also indicate that the test takers may have used “shallow listening” during the WLP tests (Field, 2009). Although the above-mentioned brain regions all facilitate auditory and visual comprehension, previous research has shown that the left IFG and pMTG (supporting bottom-up processing) are least activated in the presence of lower linguistic complexity (i.e., word as opposed to discourse comprehension) and when listening comprehension is facilitated by reading (Buchweitz et al., 2014; Buchweitz et al., 2009; Jobard et al., 2007; Michael et al., 2001). Compared with notetaking-while-listening during the PLP-Audio phase, the reduced activity level of the left dmPFC, alongside the greater deactivation of the left pMTG, during the WLP tests suggests that top-down processing (i.e., inference-making and global coherence-building) was negligible during the WLP tests and that test takers were most likely predominantly involved in bottom-up comprehension (i.e., comprehension of literal meaning) (Buchweitz et al., 2014; Ferstl et al., 2008; Ferstl & von Cramon, 2002; Jobard et al., 2007).

It is important to note that these results indicate that the WLP tests impose a lighter cognitive load upon the test takers than notetaking-while-listening in the PLP-Audio phase, which is in line with both the cognitive load theory (Sweller, 2011; Sweller et al., 2011) and the study by Aryadoust et al. (2020). However, our results contrast with the findings of Field (2009), who suggested that the WLP tests induced a “heavier” cognitive load. Aside from differences in research methods, this difference in findings may stem from the researchers’ conceptualization of WM. Field (2009) may have regarded WM as a unitary entity (Figure 4A), with a single system being responsible for processing a range of inputs regardless of the modality (i.e., visual or auditory). In contrast, the cognitive load theory is based on the concept that a person’s WM comprises multiple processes or subsystems that process information according to the information’s modality (Figure 4B; see Sweller et al. (2011)). The former view, however, considers only the “number” of tasks performed without accounting for the way in which the information is presented. From this perspective, it appears that the test takers were managing three tasks concurrently during the WLP tests (i.e., reading of test items, listening to the texts, and answering of test items), whereas only two tasks were being performed during the PLP-Audio phase (i.e., listening to the texts and notetaking).

Figure 4.

Disparate interpretations of cognitive load.

Overall, the results of Hypothesis One highlight the importance of building a theoretical framework that accounts for listeners’ cognitive architecture when investigating cognitive load and processes during listening test conditions. Ideally, this framework should incorporate the concept of a multi-component WM system that processes information based on its modality, rather than a single, unified construct that does not account for the mode of information received (Baddeley, 2012; Baddeley & Hitch, 1974).

Hypothesis Two: The cognitive load for the WLP tests will be lighter compared with the answering of test items during the PLP-Question phase.

In order to investigate the second hypothesis, we examined the test takers’ gaze behaviors and brain activity patterns while they answered the test items during both WLP tests and the PLP-Question phase. First, we found that the test takers’ fixation rate on the test items was approximately 1.5 times higher during the WLP tests (p < .0001) than in the PLP-Question phase. This finding suggests that test takers were moving their eyes quickly, likely searching for keywords or phrases in the test items to match against those presented in the audio texts. Notably, this is consistent with (i) Field’s (2009) findings that test takers used a keyword matching strategy and “shallow listening” during WLP tests, and (ii) the eye-tracking literature on cognitive load (Wang et al., 2018; Zagermann et al., 2016). However, this finding contrasts with Field’s (2009) conclusion that WLP poses higher cognitive load; as previously discussed, the temporal and spatial segregation of the input during the PLP-Question phase causes higher cognitive load, which is evident from the lower fixation rate.

Second, and in partial conformity with our hypothesis, the average proportion of time that the test takers spent fixating on the test items was lower for IELTS-11_WLP than IELTS-11_PLP (i.e., lower fixation durations during the WLP test than the PLP-Question test for IELTS-11, as expected). On the other hand, the test takers spent larger proportions of time fixating (fixation duration) on the test items (on average and in total) during the WLP tests than the PLP-Question test for one of the lectures, namely IELTS-9. This disparity between the lectures is likely owing to differences in the segment durations when the average fixation duration is expressed as a percentage. The average segment duration of IELTS-11_WLP was approximately 17.2s longer than that of IELTS-11 PLP-Question, whereas the average segment duration of IELTS-9_WLP was approximately 8.3s shorter than that of IELTS-9 PLP-Question. Notably, the average fixation durations on the test items were similar for WLP and the PLP-Question phase across IELTS-9 (i.e., WLP M = 233, SD = 35ms; PLP M = 228, SD = 43ms) and IELTS-11 (i.e., WLP M = 236, SD = 34ms; PLP M = 227, SD = 38ms). Thus, when the average fixation durations on the test items across the WLP and PLP-Question phase for IELTS-11 were divided by their respective segment durations, the average proportion of time spent on the test items was lower for the former than the latter. It is important to note that the above-mentioned average fixation duration data across all four tests were well within the normative ranges reported for silent reading (i.e., 225–250ms; see Rayner (1991)). Together, the gaze behavior data indicate that the test takers had no difficulty sourcing cues in the test items across both the WLP and PLP-Question tests.
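The effect of the normalization can be illustrated with the figures quoted in this article. The sketch assumes that the mean audio segment durations reported for the WLP tests (31.7s for IELTS-9 and 57.2s for IELTS-11) are the segment durations used for normalization, and it derives the PLP-Question durations from the stated differences (8.3s longer and 17.2s shorter, respectively). The function name is hypothetical.

```python
def pct_of_segment(fixation_ms, segment_s):
    """Express a fixation duration as a percentage of the segment duration."""
    return 100.0 * (fixation_ms / 1000.0) / segment_s


# (mean fixation duration in ms, assumed mean segment duration in s)
conditions = {
    "IELTS-9_WLP": (233, 31.7),
    "IELTS-9_PLP-Question": (228, 31.7 + 8.3),
    "IELTS-11_WLP": (236, 57.2),
    "IELTS-11_PLP-Question": (227, 57.2 - 17.2),
}

for name, (fixation_ms, segment_s) in conditions.items():
    print(f"{name}: {pct_of_segment(fixation_ms, segment_s):.2f}% of segment")
```

These ratios of means come to roughly 0.74%, 0.57%, 0.41%, and 0.57%, which follow the same ordering as the reported 0.87%, 0.57%, 0.44%, and 0.56%. Exact agreement is not expected, presumably because the reported values were normalized per AOI before averaging.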

Third, in terms of the eye-tracking data, we also found that the test takers’ frequency of visits (i.e., visit rate) to the test items was 1.2 times higher during the PLP-Question phase than in the WLP tests (p < .0001). The higher visit rate during the PLP-Question phase suggests that test takers searched for answers by alternating their gaze back and forth between their handwritten notes and the test items shown on the monitor. As a result, they spent lower proportions of their time dwelling on the test items in each visit (i.e., lower normalized average visit duration) and across all visits (i.e., lower normalized total visit duration) when answering them during the PLP-Question phase compared with both WLP tests. This result indicates that the test takers’ visual attention was spatially and temporally split when answering the test items during the PLP-Question phase, as they could attend to either their notes on the paper or the test items on the computer screen at any given time. Under such circumstances, the test takers would need to temporarily maintain the spatially and temporally segregated visual information within their WM, and then integrate it with their mental representation of the audio text during the PLP-Question phase.

Taken together, these findings and interpretations conform with our neuroimaging data, which showed that the test takers had lower brain activity across the left IFG and dmPFC during the WLP tests than in the PLP-Question phase. Additionally, this also indicates that answering test items while listening during the WLP tests requires less bottom-up and top-down processing than answering test items in the PLP-Question phase. Despite the significant differences in activation patterns in the left IFG, there was no significant difference in the activity level of the left pMTG. In line with Jobard et al. (2007), the overall activity patterns of the left IFG (i.e., positive β values) and pMTG (i.e., negative β values) during the PLP-Question phase suggest that the test takers processed the test items at word level or in a bottom-up fashion. Apart from lexico-semantic processing (Buchweitz et al., 2014; Buchweitz et al., 2009; Jobard et al., 2007; Michael et al., 2001), the involvement of the IFG during the PLP-Question phase may reflect its role in recollecting episodic details owing to its functional connectivity with an area responsible for long-term memory, that is, the left hippocampus (Wais et al., 2010). Based on these findings, it is postulated that the test takers may have regenerated a mental representation of the audio texts during the PLP-Question phase and searched for keywords or phrases in their notes to match against the test items while answering. Evidently, this differs from the answering process during the WLP tests, in which test takers predominantly relied on the keyword matching strategy. Lastly, the higher activity in the dmPFC during the PLP-Question phase than in the WLP tests indicates that the test takers may have used top-down processing by applying their world knowledge when answering the test items. Overall, the results from testing Hypothesis Two are consistent with the findings of Hypothesis One, as both sets of results indicate that the WLP tests impose lower cognitive load upon the test takers than the PLP tests.

It is interesting to note that our neuroimaging data for the WLP tests and the PLP-Question phase contrasted with those of Aryadoust et al. (2020), who reported that their WLP tests induced a heavier cognitive load compared with the PLP-Question phase (but lower cognitive load compared with the PLP-Audio test). This disparity between our study and Aryadoust et al.’s (2020) findings may be a result of the differences in the types of test items used across the studies, given that multiple-choice questions (MCQs) formed the majority in the listening tests of Aryadoust et al. (2020), whereas fill-in-the-blank items were used in this study. The MCQs used in Aryadoust et al.’s (2020) study would require test takers to evaluate the status of several pieces of propositional information in the options, and this might have promoted the reading of test items at the sentential and global levels during answering. In contrast, the fill-in-the-blank items in the WLP tests may draw the test takers’ visual attention to the word level (i.e., bottom-up processing), leading to the adoption of a keyword matching strategy. This could explain why the left dmPFC, IFG, and pMTG were significantly more activated when answering MCQs during the WLP tests in Aryadoust et al.’s (2020) study, but significantly less activated when answering fill-in-the-blank items in the present study. Furthermore, the left dmPFC was activated while answering MCQs during the PLP-Question test in Aryadoust et al.’s (2020) study, but it was not significantly activated while answering the fill-in-the-blank items during the PLP-Question test in the present study. This finding suggests that the type of test items used, as well as the test method, may influence test takers’ cognitive load and cognitive processes while answering test items. This possibility is in line with the findings of In’nami and Koizumi’s (2009) meta-analysis, as MCQs were reportedly easier to answer than their open-ended counterparts (including fill-in-the-blank items), indicating that MCQs imposed lower cognitive load as they do not require answer production.

Hypothesis Three: Test takers will perform better with higher test scores for the WLP tests than in the PLP tests.

Overall, our study showed that the WLP tests imposed a lighter cognitive load than the PLP tests. With regard to the test scores, we found that the test takers obtained higher test scores only for IELTS-11_WLP compared with IELTS-11_PLP. Similarly to the results of Aryadoust et al.’s (2020) study, there was no difference in test scores between IELTS-9_WLP and IELTS-9_PLP. This inconsistency in measured listening performance across the WLP tests in the present study may be attributed to the test takers’ limited visual and auditory WM processing capacities. The absence of differences in test scores between IELTS-9_WLP and IELTS-9_PLP suggests that the test takers were capable of maintaining not only the audio text and test item information within their WM while answering during IELTS-9_WLP, but also the audio text information while notetaking in IELTS-9_PLP. As there was also no difference in test scores between IELTS-11_PLP and IELTS-9_PLP, we posit that the test takers were equally capable of handling the audio text information while notetaking across both PLP tests. In contrast, the amount of visual and auditory information probably exceeded the test takers’ visual and auditory WM capacities more often while answering during the IELTS-9_WLP test than during the IELTS-11_WLP test. Notably, the average duration of the audio segments for the IELTS-9_WLP test (i.e., 31.7s) was shorter than that for the IELTS-11_WLP test (i.e., 57.2s). With shorter durations, the density of information presented in each segment is likely higher in the IELTS-9_WLP test than in the IELTS-11_WLP test. When encountering a higher density of information, test takers may have had fewer cognitive resources available to maintain both existing and incoming information within their WM during the IELTS-9_WLP test. Importantly, our results were similar to those of Mayer et al. (1999), who reported that undergraduates showed no difference in recalling and transferring knowledge acquired from a multimedia presentation regardless of whether the narrations and animations were concurrently or successively presented in small chunks. With larger chunks, however, the recall and transfer of knowledge performance were better when the narrations and animations were simultaneously presented (Mayer et al., 1999).

The inflated test scores in IELTS-11_WLP, together with the test takers’ use of a keyword matching strategy and their lower cortical activity during comprehension, suggest that the test is probably confounded by sources of construct-irrelevant variance (Messick, 1996). Such sources of variance involve the test takers in cognitive processes that are not entailed in the focal construct definition, and this in turn allows test takers to use test-specific strategies that differ from those demanded by real-world tasks in the target language use domain (Messick, 1996). In line with Mayer et al. (1999), the construct-irrelevant phenomenon (WLP vs. PLP format) is more apparent when the tests are presented in longer segments (i.e., IELTS-11_WLP). Although this study did not specifically compare the differences in gaze behaviors and brain activity patterns between IELTS-11_WLP and IELTS-9_WLP, the patterns observed from the test takers’ frequency of visits to the test items (Figure 3G), and their average proportion of time spent on fixating and visiting the test items (Figures 3B and 3E), suggest that the durations of the audio texts can affect the cognitive load and processes involved. Extending from Aryadoust et al. (2020), who also presented their audio texts in shorter segments (i.e., approximately 28.5s), these findings reinforce the notion that shorter audio texts are less likely to demand test-specific processes that induce construct-irrelevant variance in test scores. Although the lengths of the segmented audio texts presented in Field’s (2009) study are unknown, the author’s results (i.e., WLP test scores < non-test PLP scores) may be attributed to the fact that the listeners’ performance was scored only on their ability to identify the main points of the lectures. Nonetheless, the present study suggests that lengthier listening segments can disadvantage test takers who take the tests in PLP rather than in WLP formats, owing to influences over their cognitive load.

Limitations and future directions

To measure event-related brain activity responses, this study was limited by the need to segment the listening tests. Nevertheless, our data suggest that the methods used in this study allowed us to gauge the potential impact of test methods on test takers’ cognitive load and to assess listening performance in full-fledged high-stakes tests. To improve our understanding of the effects of different test methods, we recommend that test developers and researchers adopt our methods, namely eye-tracking and fNIRS technology (as opposed to self-reports or verbal elicitations), to examine the neurocognitive mechanisms involved in listening tests while also identifying and minimizing potential sources of construct-irrelevant variance.

It should be noted that one of the reviewers of this paper pointed out that our usage of LCC is “not a standard methodology for selecting similar texts” and may be “a complex way to do a simple job.” The reviewer also suggested that “how to best match texts based on multivariable data is one that needs a good answer and standardized methodology” in language assessment work. Furthermore, this reviewer observed that authors contributing to the natural language processing literature on text similarity go beyond Coh-Metrix, and that this information needs to be surveyed and communicated to language testers. Here, we would like to echo this suggestion and call.

Conclusion

To our knowledge, this is the first study that integrates eye-tracking, neuroimaging, and test scores to examine test method effects and cognitive load in listening assessment. By incorporating Sweller’s (1988) cognitive load theory into our theoretical framework, we were able to explain the observed test method effects on gaze behavioral measures and brain activity patterns. Furthermore, we were able to resolve most of the outstanding inconsistency in the literature on cognitive load and test performance across listening test methods. Importantly, our study suggests that the length of the listening stimuli used may, in part, contribute to inconsistency in participants’ test performance and biometric data (i.e., gaze behavioral measures and brain activity patterns), despite similarity in listening texts’ linguistic features. In addition, different types of test items (e.g., fill-in-the-blank items vs. MCQs) can affect cognitive load and the processes involved in answering the items during listening tests.

Acknowledgements

We thank the anonymous reviewers for their careful reading of our manuscript and insightful comments.

Funding

The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article. This project was supported by the National Institute of Education of Nanyang Technological University, Singapore, under its Academic Research Fund (Project ID: RI 1/18 VSA). Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not reflect the views of the National Institute of Education.

ORCID iDs

Vahid Aryadoust

Stacy Foo

Li Ying Ng

References

Alderson

J. C.

Kremmel

(2013). Re-examining the content validation of a grammar test: The (im)possibility of distinguishing vocabulary and structural knowledge. Language Testing, 30(4), 535–556. https://doi.org/10.1177/0265532213489568

Aryadoust

(2012). Differential item functioning in while-listening performance tests: The case of the International English Language Testing System (IELTS) listening module. The International Journal of Listening, 26(1), 40–60. https://doi.org/10.1080/10904018.2012.639649

Aryadoust

(2013). Building a validity argument for a listening test of academic proficiency. Cambridge Scholars.

Aryadoust, V. (2018). The listening test of the Internet-Based Test of English as a Foreign Language (TOEFL iBT). In D. L. Worthington & G. D. Bodie (Eds.), The sourcebook of listening research: Methodology and measures (pp. 592–598). Wiley Blackwell. https://doi.org/10.1002/9781119102991.ch67

Aryadoust, V. (2020). Dynamics of item reading and answer changing in two hearings in a computerized while-listening performance test: An eye-tracking study. Computer Assisted Language Learning, 33(5–6), 510–537. https://doi.org/10.1080/09588221.2019.1574267

Aryadoust, V., Ng, L. Y., Foo, S., & Esposito, G. (2020). A neurocognitive investigation of test methods and gender effects in listening assessment. Computer Assisted Language Learning. Advance online publication. https://doi.org/10.1080/09588221.2020.1744667

Baddeley, A. (2012). Working memory: Theories, models, and controversies. Annual Review of Psychology, 63, 1–29. https://doi.org/10.1146/annurev-psych-120710-100422

Baddeley, A. D., & Hitch, G. (1974). Working memory. Psychology of Learning and Motivation, 8, 47–89. https://doi.org/10.1016/S0079-7421(08)60452-1

Buchweitz, A., Mason, R. A., Meschyan, G., Keller, T. A., & Just, M. A. (2014). Modulation of cortical activity during comprehension of familiar and unfamiliar text topics in speed reading and speed listening. Brain and Language, 139, 49–57. https://doi.org/10.1016/j.bandl.2014.09.010

Buchweitz, A., Mason, R. A., Tomitch, L. M. B., & Just, M. A. (2009). Brain activation for reading and listening comprehension: An fMRI study of modality effects and individual differences in language comprehension. Psychology and Neuroscience, 2(2), 111–123. https://doi.org/10.3922/j.psns.2009.2.003

Cambridge ESOL. (2013). Cambridge IELTS 9: Student’s book with answers authentic examination papers from Cambridge ESOL. Cambridge University Press.

Cambridge ESOL. (2016). Cambridge IELTS 11: Student’s book with answers authentic examination papers from Cambridge ESOL. Cambridge University Press.

Cambridge University Press. (2021). Cambridge English Exams & IELTS. Cambridge University Press.

Cedrus Corporation. (2015). Superlab (Version 5.0.5) [Computer software]. Cedrus.com https://libraryguides.vu.edu.au/apa-referencing/7DatasetsSoftwareTests

Chandler, P., & Sweller, J. (1991). Cognitive load theory and the format of instruction. Cognition and Instruction, 8(4), 293–332. https://doi.org/10.1207/s1532690xci0804_2

Chandler, P., & Sweller, J. (1992). The split-attention effect as a factor in design of instruction. British Journal of Educational Psychology, 62(2), 233–246. https://doi.org/10.1111/j.2044-8279.1992.tb01017.x

Ferstl, E. C., Neumann, J., Bogler, C., & von Cramon, D. Y. (2008). The extended language network: A meta-analysis of neuroimaging studies on text comprehension. Human Brain Mapping, 29, 581–593. https://doi.org/10.1002/hbm.20422

Ferstl, E. C., & von Cramon, D. Y. (2002). What does the frontomedian cortex contribute to language processing: Coherence or theory of mind? NeuroImage, 17, 1599–1612. https://doi.org/10.1006/nimg.2002.1247

Field, A. P. (2013). Discovering statistics using IBM SPSS statistics (4th ed.). SAGE Publications.

Field, J. (1999). Key concept: Bottom up versus top down. ELT Journal, 53(4), 338–339. https://doi.org/10.1093/eltj/53.4.338

Field, J. (2009). The cognitive validity of the lecture-based question in the IELTS Listening paper. In P. Thompson (Ed.), International English Language Testing System (IELTS) research reports 2009 (Vol. 9, pp. 17–65). British Council and IELTS Australia. https://www.ielts.org/-/media/research-reports/ielts_rr_volume09_report1.ashx

Field, J. (2013). Cognitive validity. In A. Geranpayeh & L. Taylor (Eds.), Examining listening: Research and practice in assessing second language listening (pp. 112–117). Cambridge University Press.

Fishburn, F., Norr, M., Medvedev, A., & Vaidya, C. (2014). Sensitivity of fNIRS to cognitive state and load. Frontiers in Human Neuroscience, 8(76). https://doi.org/10.3389/fnhum.2014.00076

Gibson, J. J. (1979). The ecological approach to visual perception. Houghton Mifflin.

Green, A., Ünaldi, A., & Weir, C. (2010). Empiricism versus connoisseurship: Establishing the appropriacy of texts in tests of academic reading. Language Testing, 27(2), 191–211. https://doi.org/10.1177/0265532209349471

Hickok, G., & Poeppel, D. (2004). Dorsal and ventral streams: A framework for understanding aspects of the functional anatomy of language. Cognition, 92(1), 67–99. https://doi.org/10.1016/j.cognition.2003.10.011

Holmqvist, K., Nyström, M., Andersson, R., Dewhurst, R., Jarodzka, H., & van de Weijer, J. (2011). Eye tracking: A comprehensive guide to methods and measures. Oxford University Press.

IBM Corporation. (2017). IBM SPSS Statistics for Windows (Version 25) [Computer software]. IBM Corporation.

IBM Support. (2018). Can SPSS do a nonparametric or rank analysis of covariance (Quade’s test)? https://www.ibm.com/support/pages/can-spss-do-nonparametric-or-rank-analysis-covariance-quades-test

In’nami, Y., & Koizumi, R. (2009). A meta-analysis of test format effects on reading and listening test performance: Focus on multiple-choice and open-ended formats. Language Testing, 26(2), 219–244. https://doi.org/10.1177/0265532208101006

JASP Team. (2019). JASP (Version 0.11.1) [Computer software]. jasp-stats.org. https://jasp-stats.org/

Jobard, G., Vigneau, M., Mazoyer, B., & Tzourio-Mazoyer, N. (2007). Impact of modality and linguistic complexity during reading and listening tasks. NeuroImage, 34, 784–800. https://doi.org/10.1016/j.neuroimage.2006.06.067

Juhola, M. (1991). Median filtering is appropriate to signals of saccadic eye movements. Computers in Biology and Medicine, 21(1), 43–49. https://doi.org/10.1016/0010-4825(91)90034-7

Kintsch, W. (1988). Comprehension: A paradigm for cognition. Cambridge University Press.

Knecht, S., Dräger, B., Deppe, M., Bobe, L., Lohmann, H., Flöel, A., Ringelstein, E. B., & Henningsen, H. (2000). Handedness and hemispheric language dominance in healthy humans. Brain, 123(12), 2512–2518. https://doi.org/10.1093/brain/123.12.2512

Komogortsev, O. V., Gobert, D. V., Jayarathna, S., Koh, D. H., & Gowda, S. M. (2010). Standardization of automated analyses of oculomotor fixation and saccadic behaviors. IEEE Transactions on Biomedical Engineering, 57(11), 2635–2645. https://doi.org/10.1109/TBME.2010.2057429

Li, C., Gong, H., Zeng, S., & Luo, Q. (2005). Verbal working memory load affects prefrontal cortices activation: Evidence from a functional NIRS study in humans. In V. V. Tuchin (Ed.), Proceedings of SPIE 5696, Complex dynamics and fluctuations in biomedical photonics II (pp. 33–40). https://doi.org/10.1117/12.590222

Magidson, J., & Vermunt, J. K. (2004). Latent class models. In D. Kaplan (Ed.), The SAGE handbook of quantitative methodology for the social sciences (pp. 176–199). SAGE Publications. https://dx.doi.org/10.4135/9781412986311

Mayer, R. E., Moreno, R., Boire, M., & Vagge, S. (1999). Maximizing constructivist learning from multimedia communications by minimizing cognitive load. Journal of Educational Psychology, 91(4), 638–643. https://doi.org/10.1037/0022-0663.91.4.638

McDonald, R. P. (1999). Test theory: A unified treatment. Lawrence Erlbaum.

McNamara, D. S., Graesser, A. C., McCarthy, P. M., & Cai, Z. (2014). Automated evaluation of text and discourse with Coh-Metrix. Cambridge University Press.

Messick, S. (1996). Validity and washback in language testing. Language Testing, 13(3), 241–256. https://doi.org/10.1177/026553229601300302

Michael, E. B., Keller, T. A., Carpenter, P. A., & Just, M. A. (2001). fMRI investigation of sentence comprehension by eye and by ear: Modality fingerprints on cognitive processes. Human Brain Mapping, 13(4), 239–252. https://doi.org/10.1002/hbm.1036

NIRx Medical Technologies LLC. (2016a). NIRSLab (Version 2016.06) [Computer software]. Nirx.net. https://nirx.net/nirslab-1

NIRx Medical Technologies LLC. (2016b). NIRStar (Version 15-0) [Computer software]. Nirx.net. https://nirx.net/nirstar-1

Ockey, G. J., & Wagner, E. (2018). Assessment of L2 listening: Moving towards authenticity. John Benjamins.

Olsen, A. (2012). The Tobii I-VT fixation filter: Algorithm description. Tobii Technology. http://www.vinis.co.kr/ivt_filter.pdf

Quade, D. (1967). Rank analysis of covariance. Journal of the American Statistical Association, 62(320), 1187–1200. https://doi.org/10.1080/01621459.1967.10500925

Rayner, K. (1991). Eye movements in reading and information processing: 20 years of research. Psychological Bulletin, 124(3). https://doi.org/10.1037/0033-2909.124.3.372

Revelle, W., & Zinbarg, R. E. (2008). Coefficients alpha, beta, omega, and the glb: Comments on Sijtsma. Psychometrika, 74, Article 145. https://doi.org/10.1007/s11336-008-9102-z

Rupawala, M., Dehghani, H., Lucas, S. J. E., Tino, P., & Cruse, D. (2018). Shining a light on awareness: A review of functional near-infrared spectroscopy for prolonged disorders of consciousness. Frontiers in Neurology, 9, Article 350. https://doi.org/10.3389/fneur.2018.00350

Saur, D., Kreher, B. W., Schnell, S., Kümmerer, D., Kellmeyer, P., Vry, M.-S., Umarova, R., Musso, M., Glauche, V., Abel, S., Huber, W., Rijntjes, M., Hennig, J., & Weiller, C. (2008). Ventral and dorsal pathways for language. Proceedings of the National Academy of Sciences, 105(46), 18035–18040. https://doi.org/10.1073/pnas.0805234105

Statistical Innovations. (2016). Latent GOLD (Version 5.1) [Computer software].

Strangman, G., Culver, J. P., Thompson, J. H., & Boas, D. A. (2002). A quantitative comparison of simultaneous BOLD fMRI and NIRS recordings during functional brain activation. NeuroImage, 17(2), 719–731. https://doi.org/10.1006/nimg.2002.1227

Stuart, S., Hickey, A., Vitorio, R., Welman, K., Foo, S., Keen, D., & Godfrey, A. (2019). Eye-tracker algorithms to detect saccades during static and dynamic tasks: A structured review. Physiological Measurement, 40(2), 02TR01. https://doi.org/10.1088/1361-6579/ab02ab

Sweller, J. (1988). Cognitive load during problem solving: Effects on learning. Cognitive Science, 12(2), 257–285. https://doi.org/10.1207/s15516709cog1202_4

Sweller, J. (2011). Cognitive load theory. In J. P. Mestre & B. H. Ross (Eds.), The psychology of learning and motivation, Vol. 55: Cognition in education (pp. 37–76). Academic Press. https://doi.org/10.1016/B978-0-12-387691-1.X0001-4

Sweller, J., Ayres, P., & Kalyuga, S. (2011). Cognitive load theory. Springer. https://doi.org/10.1007/978-1-4419-8126-4

Szaflarski, J. P., Rajagopal, A., Altaye, M., Byars, A. W., Jacola, L., Schmithorst, V. J., Schapiro, M. B., Plante, E., & Holland, S. K. (2012). Left-handedness and language lateralization in children. Brain Research, 1433, 85–97. https://doi.org/10.1016/j.brainres.2011.11.026

Tobii AB. (2016). Tobii Studio user's manual version 3.4.5. Tobiipro.com. https://www.tobiipro.com/siteassets/tobii-pro/release-notes/tobii-pro-studio-release-notes.pdf/?v=3.4.8

Tobii AB. (2017). Tobii Pro Studio (Version 3.4.8) [Computer software]. Tobiipro.com. https://www.tobiipro.com/product-listing/tobii-pro-studio/

Vandergrift, L. (2007). Recent developments in second and foreign language listening comprehension research. Language Teaching, 40(3), 191–210. https://doi.org/10.1017/S0261444807004338

Vandergrift, L., & Goh, C. C. M. (2009). Teaching and testing listening comprehension. In M. Long & C. Doughty (Eds.), The handbook of language teaching (pp. 395–411). Wiley-Blackwell.

Wais, P. E., Rubens, M. T., Boccanfuso, J., & Gazzaley, A. (2010). Neural mechanisms underlying the impact of visual distraction on retrieval of long-term memory. The Journal of Neuroscience, 30(25), 8541–8550. https://doi.org/10.1523/JNEUROSCI.1478-10.2010

Wang, T.-K., Huang, J., Liao, P.-C., & Piao, Y. (2018). Does augmented reality effectively foster visual learning process in construction? An eye-tracking study in steel installation. Advances in Civil Engineering, 2018, 2472167. https://doi.org/10.1155/2018/2472167

Whitney, C., Kirk, M., O’Sullivan, J., Lambon Ralph, M. A., & Jefferies, E. (2010). The neural organization of semantic control: TMS evidence for a distributed network in left inferior frontal and posterior middle temporal gyrus. Cerebral Cortex, 21(5), 1066–1075. https://doi.org/10.1093/cercor/bhq180

Wilhelm, O., Hildebrandt, A., & Oberauer, K. (2013). What is working memory capacity, and how can we measure it? Frontiers in Psychology, 4(433). https://doi.org/10.3389/fpsyg.2013.00433

Zagermann, J., Pfeil, U., & Reiterer, H. (2016). Measuring cognitive load using eye tracking technology in visual computing. In M. Sedlmair & P. Isenberg (Eds.), Beliv'16: Proceedings of the sixth workshop on beyond time and errors on novel evaluation methods for visualization (pp. 78–85). Association for Computing Machinery. https://doi.org/10.1145/2993901.2993908