Abstract

This volume is a valuable addition to the core literature on argument-based test validation, which has gained prominence in the field of language assessment over the last 10–15 years. The co-editors, Carol A. Chapelle and Erik Voss, state in Chapter 1 that the purpose of this volume is “to add clarity to research in language testing and assessment by providing an introduction to argument-based validation along with examples illustrating the use of the framework, terms, and concepts to investigate arguments for language tests and assessments” (p. 2). With their concise yet clear discussion of what constitutes argument-based test validation and how to design and conduct empirical studies that take this approach, the co-editors and the chapter authors have certainly achieved this goal.
This volume comprises four main parts. Part 1 (Chapters 2 and 3) discusses the theoretical underpinnings and historical development of argument-based language assessment validation. Chapter 2 presents an accessible overview of Kane’s framework of argument-based validation (Kane, 1992, 2006, 2013). This framework consists of two parts: the interpretation/use argument, developed initially to support the design of an assessment and its use in a specific context, and the validity argument, built on the basis of empirical support obtained for the interpretation/use argument. Of particular value is a theoretical comparison of Kane’s validity argument with Bachman and Palmer’s (2010) assessment use argument (AUA), another major argument-based validation framework in the field. Various similarities between the two often blur their important conceptual differences (e.g., how the frameworks operationalize claims and inferences). Thus, this chapter alone contributes greatly to the further theoretical development of argument-based validation and informs the training of researchers and practitioners in the field. Chapter 3, a systematic review of empirical argument-based L2 assessment validation studies, demonstrates how this approach has spread from its origin (the United States) to other parts of the world over the last decade, putting the theory into practice in various assessment contexts.
Parts 2 and 3, the main parts of this volume, together present a variety of case studies for argument-based validation. While the running theme across them is technology-mediated language testing, the focal assessments and their contexts of use are diverse, ranging from a commercial language proficiency test to institutional assessments used to make various decisions (e.g., placement, achievement testing, progress check). The case studies are organized into two parts based on the aspects of a seven-inference version of Kane’s (1992, 2006, 2013) framework they highlight. Part 2 (Chapters 4–9) presents six studies focusing on five inferences for test score interpretations: domain definition, evaluation, generalization, explanation, and extrapolation. These chapters highlight critical issues to consider in ensuring the reliability and validity of language assessments that are put to use. Key issues addressed in these chapters include conducting thorough domain analyses to design assessment tasks of high relevance to job-related performance (domain definition; Chapter 4); developing a well-designed online platform to enhance reliability and validity of performance assessment scores (evaluation; Chapter 5); securing score consistency across raters, tasks, forms, and occasions (generalization; Chapters 6 and 7); explicating the nature of the target construct by examining its relationship to other related yet distinct constructs (explanation; Chapter 8); and understanding the degree to which task requirements and assessment criteria correspond to those of important language use tasks in the target language use domain (extrapolation; Chapter 9). Part 3 (Chapters 10–12) presents three studies focusing on the remaining two inferences concerning test score uses and consequences: utilization and consequence implication.
The main issues examined are how the use of a final exam (Chapter 10) and placement tests (Chapters 11 and 12) closely aligned with instructional content and language demands impacts different stakeholder groups. Finally, in Part 4 (Chapter 13), Chapelle and Voss provide a synthesis of all case studies presented in this volume, discussing their contributions to the field and the importance of adopting a critical stance in understanding argument-based validation.
This volume is expected to make a number of important contributions to the field, two of which are noted below. The first is the rich exemplification of what an argument-based validation study should look like. This takes the form of a fully developed interpretation/use argument and a corresponding validity argument that synthesizes the results of each case study. The systematic review in Chapter 3 points out ambiguities and inaccuracies in the ways in which some previous studies operationalized the inferences for argument-based validation. This may be at least partly attributable to the lack of concrete examples showing how the concepts in this approach can be translated into appropriate research designs. Thus, the detailed accounts of how the research questions in the case studies were derived from assumptions associated with the specific inferences in the interpretation/use arguments certainly offer valuable guidance to future investigators.
Second, as noted by Chapelle and Voss, this volume communicates a strong message that argument-based validation is not reserved for high-stakes testing. The fact that most chapters in this volume deal with medium- and low-stakes assessments for institutional and instructional settings suggests that argument-based validation certainly has a place in the L2 classroom. Thus, together with Bachman and Damböck’s (2018) nontechnical introduction to AUA, this volume offers hints as to how argument-based validation could contribute to training practitioners and enhancing their language assessment literacy.
Despite these strengths, three issues remain for further consideration. The first is methodological. This volume certainly provides abundant examples of mixed-methods investigations, addressing a limitation of previous argument-based validation studies that did not combine quantitative and qualitative analyses effectively, as noted in Chapter 3 and by the co-editors. At the same time, however, many of the case studies rely to an extent on certain types of methods (interviews, surveys, correlational analyses) and do not necessarily represent the wider range of typical test validation methods (Xi & Sawaki, 2016). This reliance is driven by issues such as the research questions addressed, the practicality of research administration, and data availability. However, because the methodology employed limits the quality of the empirical basis for test validation, it might also be worth considering other methods (e.g., think-aloud protocol analyses, observations, experiments), where appropriate.
A second issue is the volume’s primary focus on assessments under development and in the early validation stages. As a result, many of the case studies were conducted to provide support for the proposed assessment design and use from the viewpoint of a “confirmationist” (Kane, 2013, p. 17). Thus, the volume treats only sporadically the critical appraisal of initial support obtained for a validity argument, which is needed to make that argument more robust. While examples of how backing can be obtained for particular warrants are abundant, presenting more study examples dealing with the later stages of validity investigations would have allowed the volume to paint a more balanced picture.
A final and related issue for consideration concerns the need for further elaboration on how rebuttals can be conceptualized. An important aspect of a validity investigation is the degree to which enough evidence can be obtained to reject a potential rebuttal, which corresponds to a weakness in a validity argument. In this regard, Chapter 7, which examined a rebuttal generated from an outside evaluator’s viewpoint for a speaking proficiency assessment for classroom use, was informative. The field would benefit from more explicit guidance on issues such as when and how different types of rebuttals could be specified. For instance, Xi (2010) suggests that a rebuttal that is specific to a fairness argument (e.g., potential presence of bias against a subgroup in test content) plays a distinct role from a rebuttal specified for a broader validity argument (e.g., bias in scoring that undermines intended score interpretation for the entire test taker population). The provision of more concrete study examples that illustrate how such rebuttals can be stated and examined to strengthen a validity argument would help broaden the scope of future L2 assessment validation studies.
Despite these issues that require continued discussion in the field, this volume contributes greatly to deepening our understanding of argument-based validation and improving the quality of future argument-based L2 assessment validation studies.
