Abstract
ChatGPT has shown considerable potential for Automated Item Generation, but the quality of ChatGPT-generated items in language assessment remains insufficiently substantiated. This research recruited 121 participants to systematically compare ChatGPT-generated and official CET-4 reading comprehension materials, examining the psychometric properties of the test items with Item Response Theory and the linguistic features of the reading passages with Coh-Metrix. Key findings are as follows: (1) the generated items underrepresented higher-order reading skills; (2) the generated items were less difficult than the official ones, showed weaker discrimination, and provided measurement information mainly for lower-performing students; (3) only 22.9% of the distractors functioned effectively, indicating insufficient distractor performance; and (4) ChatGPT-generated passages exhibited irregular lexical distribution, higher lexical complexity, and weaker cohesion than CET-4 passages, but simpler sentences. Although the ChatGPT-generated passages were less readable than the CET-4 passages, the corresponding items were easier and less discriminating. This discrepancy can be attributed to inadequate distractor functioning, which allows test-takers to eliminate options without fully comprehending the passage, and to the underrepresentation of higher-order reading skills. The findings support the conclusion that ChatGPT can function effectively as a supplementary tool in low-stakes assessment; however, substantial refinements in item quality are needed before it is applied in high-stakes testing.
