Abstract

Constructing language assessments is a bit like building sandcastles. Language, like sand, is a dynamic phenomenon in a particular ecosystem (for language: Beckner et al., 2009; Haugen, 2001; for sand: Hyndes et al., 2022). Both phenomena shift when there are changes elsewhere in the system, for example, shifts in patterns of language use due to human mobility or new technology. Both language assessments and sandcastles are built from the substance of interest itself (language; sand) along with various kinds of observational infrastructure, for example, tasks and scoring methods for language assessments; buckets and driftwood for sandcastles. Although these infrastructures are designed to capture the phenomena, the very act of capture reshapes the “natural” patterns and the relationships between elements.
As Janna Fox and Natasha Artemeva explain in the opening chapter of their book Reconsidering Context in Language Assessment: Transdisciplinary Perspectives, Social Theories, and Validity, the issue of context in language assessment is thorny, ambiguous and enduring. Their quest is to lay out a toolset of theories and methods that help us to understand and account for context in development and validation research. Firmly rooted in Messick’s (1989) notion of consequential validity and the calls of McNamara (2007) and others to broaden the theoretical scope of language testing, the authors introduce (Chapters 1–3), justify (Chapter 4), demonstrate (Chapters 5–7), and reflect on (Chapter 8) the use of theoretical approaches for understanding “decisions, actions and consequences in context” (p. 4). Drawing on their experiences of working across disciplines, the authors propose a transdisciplinary approach that enables (a) the use of a range of theoretical orientations and (b) contributions from people with diverse disciplinary expertise and/or various relationships to the assessment activities. Transdisciplinarity, as set out in this book, is about partnerships across disciplinary alignments, in which divergent views are forced into discussion as “a catalyst for insight and innovation” (p. 10).
True to its transdisciplinary goal, the book is presented as a conversation in which the authors mediate between two disciplinary clusters that share a mutual interest in language assessment. One cluster is “assessment-centred” communities, comprising test developers, psychometricians, and graduate students in educational measurement or psychology who may be familiar with a more cognitivist approach to research. The other disciplinary cluster is “language-centred” communities, comprising language teachers, writing researchers and graduate students in discourse, culture or writing studies, and others who tend toward more socially oriented approaches to building knowledge. In the first part of the book, the authors set out the key terms of the title—context, language assessment, transdisciplinary perspectives, social theories, and validity—seeking to introduce readers with a language background (i.e., “language-centred” communities) to the concept of validity, and readers with an assessment background (i.e., “assessment-centred” communities) to social theorizations of language ability and language use. While some readers will work comfortably within both areas, the subsequent chapters offer plenty of food for thought even for those already familiar with validity or social theories.
Chapter 2, titled “Validity as an evolving rhetorical art,” sets out the trajectory of validity theory from the mid-20th century to the present. Readers of Language Testing will be familiar with the cast of theorists involved (Meehl, Cronbach, Messick, Chapelle, McNamara, Cizek, Kane, and Moss, among others) and with the evolution of the concept of validity across editions of the Standards for Educational and Psychological Testing. They will likely also recognize the arc of the validity narrative: from distinct validities (content, predictive/concurrent, construct) to Messick’s (1989) unitary concept of construct validity which, in combination with Kane’s argument-based scaffolding, has dominated the educational and language assessment literature since the 1990s. What is refreshing in this discussion is the forensic attention the authors pay to their primary sources as they interrogate the history and nature of validation methodology. In pursuing their argument for transdisciplinarity, Fox and Artemeva emphasize that foundational theorists were not only alert to the impact of context on experience and behavior (especially Cronbach), but they also advocated strongly for multiple methodologies and viewpoints (especially Messick). The authors go on to contend that, despite the considerable developments in validity theory, we remain beholden to sides in the paradigm wars—the “battle between positivist philosophical principles and interpretivist ones”—a division that tends to be simplified as a methodological distinction between quantitative and qualitative research (Bryman, 2008, pp. 13–14). They propose a broad epistemological and methodological landscape for validation evidence, drawing on Moss’s advocacy for “methodological pluralism” in educational assessment, among other voices (e.g., Addey et al., 2020; Maddox, 2015; Moss & Haertel, 2016).
The problem of recontextualizing the phenomenon of interest in order to observe it, as we do with language use in assessment contexts, is akin to the “observer’s paradox” in sociolinguistic research (Labov, 1972). For language assessment, however, it might be more apt to speak of an observer’s dilemma: the issue is not merely that the sample gathered in its new micro-context might be contaminated through the act of observation; it is that the sample is assumed to represent the construct-relevant performance of an individual, and that this assumption forms the basis for inferences and uses. In this new micro-context of the assessment situation, language patterns are in a nested system, subject to its rules, conditions, relations and expectations of “communicative competence” (Harding et al., 2022). Chapter 3 addresses this “conundrum of context” wherein “meaningful interpretability and generalisability” is expected to apply to diverse contexts of assessment use which may shape the construct in unintended ways (pp. 80–81). Fox and Artemeva argue that robust, contextualized argument-based validation agendas call for a more apt theoretical toolkit. Thus, they make the case for social theories. They provide a historical overview that traverses various gear shifts: the cognitive revolution, the communicative turn, the discursive turn, the social turn, the multilingual turn, the visual turn. This is followed by a selection of social theorists and approaches that can be useful in grappling with the conundrum of context. The proposed theoretical approaches range from those more concerned with learning and human sociality (situated learning, communities of practice, cultural-historical activity theory, distributed cognition) to those more focused on language and language ability (rhetorical genre studies, English as a Lingua Franca, intercultural communication and interactional competence).
Turning to research trends, the authors of Chapter 4 (Fox, Montiero, & Artemeva) provide a current state of play with a meta-review of papers in four key language assessment journals over the last couple of decades: Language Testing (LT), Language Assessment Quarterly (LAQ), International Journal of Testing, and Educational Measurement: Issues and Practice. Starting with keywords and word roots and then building a more substantial classification scheme for the theoretical orientations of papers, the authors report that while a cognitive, individualist agenda is dominant in the journals, an awareness of social theoretical perspectives is on the rise. There is increasing attention given to context in publications and a “modest” but increasing use of social theories, concepts, and perspectives to inform research, especially in LT and LAQ, which share a language specialization (pp. 131–132). The ensuing discussion traces positions on the role of “context” in language assessment, from seeing context as one background variable within a quantitative, individualist paradigm to a relational view of scores, inferences, and contexts manifesting in social consequences. The remainder of the chapter traverses broad topics such as fairness, bias, and impact. Although a bit unwieldy in its structure, the chapter revolves around the central theme of assembling socially oriented approaches that grapple with key processes in the field: assessing learners (e.g., dynamic assessment), situating assessment (e.g., ecological theory) and interpreting scores (e.g., critical language testing).
The subsequent three chapters (Part II—Chapters 5–7) present three separate studies, each of which demonstrates how diverse social theories and transdisciplinary research partnerships can contribute to research on language assessment. The first study is the development and validation of a high-stakes oral proficiency test for pilots and air traffic controllers (Montiero & Fox), the second is the development of a diagnostic rating scale used to identify new engineering students in need of academic support (Fox & Artemeva), and the third examines classroom-based assessments in technologically mediated spaces (Hartwick & Fox). Across these three examples, an array of theoretical tools and approaches was employed: English as a Lingua Franca (Jenkins et al., 2011), intercultural communication (Baker, 2011), interactional competence (Young, 2011), distributed cognition (Hutchins, 1995), rhetorical genre studies (Freedman & Medway, 1994), affordance theory (Gibson, 2015), learning-oriented assessment (Turner & Purpura, 2016), and complex adaptive systems (Larsen-Freeman & Cameron, 2008) among them. While demonstrating the applicability of these various theoretical approaches, the three studies offer very different assessment methods, from high-stakes testing to classroom-based methods.
The final chapter, titled “Language Assessment in the wild” (Fox & Artemeva), takes a narrative approach, offering the authors’ personal reflections on their transdisciplinary collaborations. I was struck by how central communication is to their multi-party transdisciplinary efforts. When talking across disciplinary boundaries (with researchers, traditionally defined) and discourse boundaries (with practitioners, stakeholders, and other groups), revelations about differences can emerge. These differences range from understanding each other’s terminologies to reconciling divergent worldviews. The project backstories recounted in this chapter, which draw again on Messick’s (1989) advocacy for “mutual confrontation of theoretical systems” (p. 61), are illuminating, not just for the insights they provide on the practicalities of transdisciplinarity, but for the fascinating and frustrating tangle of project schedules, staff turnover, policy constraints, embedded practices, disciplinary expectations and personalities that will be familiar to anyone who has embarked on major assessment projects in institutional and government contexts. The chapter reminds us that transdisciplinarity is not just about disciplines developing insightful new networks, but about bringing together different worlds. In this sense, the field of language testing/assessment might be considered well-placed for transdisciplinarity because its central phenomena—score meanings—are in fact objects that communicate across worlds (Macqueen et al., 2016).
Given the book’s big-picture focus, questions of the vulnerability, and even isolation, of the field of language assessment are discussed. We are a field that has arisen from what may seem to colleagues in other disciplines to be relatively mundane practices, yet the recent surge in artificial intelligence tools based on Large Language Models (LLMs) is, if anything, a clarion call for expertise that is both language-informed and assessment-informed, that is, expertise from the two communities addressed in this book.
Validity theory has been concerned with the consequences of test use, and this is largely the lens of this book, but if we view assessment methods themselves as the consequences of technological developments, the scope of validity inquiry expands, and with it, ethical considerations. For example, Paullada and colleagues (2021) draw attention to the datasets underpinning LLM applications, observing that “prevailing data practices tend to abstract away the human labor, subjective judgments and biases, and contingent contexts involved in dataset production” (pp. 1–2). Arguably, this broader societal impact is within the ethical remit of tests that rely on LLMs for their infrastructure, as are the environmental impacts of such tools (Bender et al., 2021). At the level of what is actually operationalized in LLM use, the construct validity of LLMs has been called into question when there is a mismatch between the training of the model (e.g., to decode language) and its use (e.g., to provide world knowledge) (Raji et al., 2021), echoing the construct validity concerns that arise when language tests are used for purposes they were not designed for. In relation to the capacities claimed of LLMs, Bender and Koller (2020) have called for “precise language use when talking about the success of current models and for humility in dealing with natural language” (p. 5193), a call which resonates strongly with the transdisciplinary challenges set out in Chapter 8. Furthermore, a critical awareness of the contextual forces driving assessment practices is increasingly necessary across the various activities associated with language assessment: from considering market pressure as a force for cheaper test methods and the impact of this pressure on constructs, to retheorizing the very notion of an “individual performance” in the age of artificial intelligence.
Thus, the book’s twin foci of context and transdisciplinarity are timely. I agree with Fox and Artemeva’s sense that we are at “a turning point in the evolution of research practices” with a “growing awareness and evidence of transdisciplinarity” (p. 6). Transdisciplinarity is gaining traction in applied linguistics more generally (Douglas Fir Group, 2016; Hiver et al., 2022; Larsen-Freeman, 2012). Larsen-Freeman (2012) proposed that a transdisciplinary approach may allow dynamism and context to be properly considered, rather than treated simply as a backdrop against which the action occurs (p. 208). While the role of context has occupied relatively long discussion threads in the fields of educational measurement (Ercikan & Roth, 2009; Messick, 1989; Mislevy, 2018; Moss, 1998) and language assessment (Bachman, 2007; Chalhoub-Deville, 2016; McNamara, 1997, 2007; McNamara & Roever, 2006; Shohamy, 2001, 2007), particularly in relation to assessment use and consequences, it remains a challenge to actually define context. As Larsen-Freeman (2012) noted, it is not possible to study everything, so boundaries have to be set (p. 208). “Context,” then, is not something that simply surrounds an assessee’s performance; it is inextricably embedded within the performance itself. In this sense, three layers of context are operationalized within assessment constructs (Knoch & Macqueen, 2020; Macqueen, 2022): (1) the societal context that engenders the status afforded to assessed languages and the trust afforded to assessment practices (e.g., Broadfoot, 1979; Shohamy, 2006), (2) the infrastructural context of the assessment practices in which societal patterns are recontextualized for observation through context-based notions such as “target language use tasks” and “criterion” (e.g., Bachman & Palmer, 2010, p. 63; Brindley, 1991, p. 140), and (3) the simulation context which emerges at the moment of engagement between an assessee and the other layers of context (e.g., McNamara, 1997, 2007).
What Fox and Artemeva contribute to the field is the articulation of a set of theories and commensurate methodologies that enable us to interrogate these contextual layers in validation research. The authors’ efforts to set out a disciplinary matrix around language assessment reveal our disciplinary strengths, vulnerabilities and potential avenues of collaboration. May it inspire us to look beyond our sandcastles, to the sea, the sky and the beach.
