Abstract

The need to assess communicative language skills has been steadily rising in academic and professional settings. This trend forces language teachers and test developers alike to confront the challenges of rating complex language performance in a valid and reliable manner. The recent publication Scoring Second Language Spoken and Written Performance offers a concise and much-needed overview of issues in rater-mediated assessment. Drawing on their rich expertise in language testing research and development, the authors untangle the complexities associated with rating productive language performance in a manner highly useful for graduate students, researchers, and practitioners in language testing.
The monograph has seven chapters, framed by an Introduction, which exemplifies rating practices in five established examinations, and a Conclusion, which summarizes the main chapters and highlights future directions. Chapter 1 lays the foundation for the remaining chapters by first introducing an expanded model of the key factors involved in performance assessment (e.g., rating criteria, rating procedures, task), based on McNamara (1996) and McNamara et al. (2019). Various rater effects (e.g., leniency/harshness, (in)consistency, halo, central tendency) are then discussed with reference to previous research. The chapter ends by discussing factors that have been shown to affect rating quality: rater-related factors (e.g., rater experience, background), rating procedures, rater training, and rating criteria.
Chapter 2 provides a systematic overview and discussion of frequently used rating quality measures. Considering both the observed and scaled rating traditions, the authors provide a framework which distinguishes between measures of consensus (e.g., percentage agreement) and consistency (e.g., correlations) on the one hand, and measures of rating quality (when no expert ratings are available) and rater accuracy (when expert ratings are available) on the other. Each statistical technique is then described in more detail, and in doing so the authors also evaluate its usefulness and practicality. The chapter is rounded off effectively by summarizing each measure’s ability to detect rater effects and outlining at which stage of the assessment cycle it may be useful.
Rater cognition forms the focus of Chapter 3. The authors begin by reviewing methodological approaches that have been used to investigate rater cognition, namely, verbal protocols and the recording of eye movements with eye-tracking technology. They then review the literature on rater cognition by discussing commonly investigated areas, including the focus of rating and rater decision-making behaviors. In doing so, the authors draw on a wide range of rater cognition studies not only within second language assessment but also in general education, such as school exam marking. They emphasize the importance of rater cognition research in corroborating, elucidating, or even challenging quantitative findings on raters’ rating quality.
In Chapter 4, the authors summarize current practices and research on how rating quality can be improved before, during, and after the rating process. This chapter describes in some detail the different options available to researchers and practitioners when recruiting, training, standardizing, and monitoring raters, as well as methods for resolving discrepant scores. This includes a brief discussion of how to decide which approach to training (top-down or community of practice) and which mode (online or face-to-face) might be more appropriate for the context.
Chapter 5 focuses primarily on the development and use of rating scales. It discusses different types of rating scales, rating scale development approaches, construct representation in rating scales, and potential issues with rating scales. The authors briefly touch upon comparative judgment, or scoring without a rating scale, which has been used primarily in school contexts, but rarely in second language assessment.
The longest chapter, Chapter 6, provides an overview of practices, benefits, and issues related to the use of technology in scoring language performance. The authors first discuss online marking, where technology is used to distribute performances to raters, record decisions, and support real-time and post hoc quality management of the rating process. The main portion of this chapter describes current research on the automated scoring and evaluation of written and spoken language while highlighting the potential benefits and drawbacks. In acknowledgment of the ever-increasing role of technology in language assessment, the chapter closes by delineating areas in need of empirical investigation and key questions for a research agenda in this area.
The monograph culminates with a discussion on validating scoring processes in Chapter 7. The chapter begins by tracing the evolution of validity conceptualizations and then explicates how validating scoring processes has been situated within different validation approaches, covering the pre-Messick era, Weir’s sociocognitive framework, Kane’s argument-based approach, and Bachman and Palmer’s assessment use argument. In doing so, the authors critically examine the relative clarity of the guidance each validation approach offers with regard to collecting evidence to support the validation of scoring processes. The chapter concludes by discussing validation processes in two specific scoring contexts: automated scoring and classroom-based assessment.
Focusing specifically on the challenges of scoring productive language performance, this monograph addresses a major gap in the current publication landscape and effectively complements key readings such as Weigle (2002) and Luoma (2004), which cover broader issues associated with assessing the productive skills. Information that tends to be scattered across various sources on test development, educational measurement, test validation, and rater training is gathered, evaluated, and structured in this concise book. We therefore believe that it will likely succeed in appealing to the wide readership specified in its introduction: graduate students, researchers, test developers, and teachers. Each chapter provides succinct summaries which meet the authors’ goal to “untangle and clarify” (p. 5) some major issues related to scoring. Each section has the potential to function as a roadmap and help students and researchers when conceptualizing their projects. In addition, a particular strength of this publication is that it constitutes a valuable resource for practitioners involved in test development projects. The clear presentation and evaluation of current research and practical options may help practitioners navigate the complex decision-making processes during various stages of test development. For example, test developers can consult this monograph for a synopsis of different score resolution methods in Chapter 4, or approaches toward rating scale development in Chapter 5. In their coverage of the topic, the authors have struck a balance between depth and breadth, which should engage readers at varying levels of knowledge and expertise in the subject area. This feature could also make the book as a whole or selected chapters a powerful resource for establishing common ground in collaborative projects.
The deliberate structuring of this monograph is another striking aspect and aids comprehension. It first focuses on relatively straightforward concepts, such as the different factors affecting rating quality, and then unpacks them in subsequent chapters. The final chapter draws together all previous sections and moves on to a more advanced discussion of how scoring processes affect validity. This chapter continues the line of reasoning presented in Knoch and Chapelle (2018) by arguing that issues related to scoring extend beyond considerations of reliability and affect the validity of score interpretations and uses. In addition to its coherent overall structure, the use of figures and tables is effective throughout, illustrating complex concepts and processes or summarizing the most essential points. For example, a figure in Chapter 1 (p. 8) helpfully visualizes the intricate interrelationships between the numerous factors contributing to rating quality and, by extension, score quality. Another example is a table in Chapter 4 (p. 47) summarizing various rating quality measures and their (in)ability to detect rater effects. By consulting this table, readers can compare measures and narrow down the options that potentially suit their needs.
While informative and useful for various readerships, this monograph should not be understood as a manual. It introduces relevant issues, presents options, and identifies future directions in a succinct and approachable way, but readers interested in learning the statistical techniques presented in Chapter 2, for example, will need to consult other sources (e.g., Eckes, 2015). Incorporating a “further reading” section into each chapter would have been a valuable addition for less experienced researchers. Also, the areas at the intersection of scoring language performance and other fields, such as data science and the behavioral and cognitive sciences, are discussed only sparingly. However, we feel that the true strength of this monograph may be that it firmly prioritizes the concepts and issues most pressing for test developers and researchers while providing a point of departure for those who want to venture deeper into the subject at hand.
