Abstract
This article investigates the accuracy of examinee classification into performance categories and the estimation of the theta parameter for several item response theory (IRT) scaling techniques when applied to six administrations of a test. Previous research has investigated only two administrations; however, many testing programs equate tests across multiple administrations. As such, this article seeks to examine the long-term sustainability of IRT scaling methods. Three different types of shifts in the ability distribution were examined: no change, a mean shift, and a change in skewness. Haebara, Stocking-Lord, mean/sigma, mean/mean, and fixed common item parameter (FCIP) scaling were compared with respect to bias, root mean square error, and classification of examinees into performance categories. Results indicate that FCIP may be the most suitable for complex changes in examinee performance, whereas the methods performed quite similarly for simple changes.
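As a minimal sketch of the evaluation criteria named above (not the authors' actual simulation design), bias and root mean square error for theta recovery are typically computed as the mean signed error and the root of the mean squared error between generating and estimated abilities. All values below are simulated for illustration only:

```python
import numpy as np

# Hypothetical illustration: bias and RMSE, two of the recovery criteria
# named in the abstract, computed for simulated theta estimates.
rng = np.random.default_rng(0)
theta_true = rng.normal(0.0, 1.0, size=1000)              # generating abilities
theta_hat = theta_true + rng.normal(0.1, 0.3, size=1000)  # estimates with a small systematic shift

bias = np.mean(theta_hat - theta_true)                    # average signed error
rmse = np.sqrt(np.mean((theta_hat - theta_true) ** 2))    # root mean square error
print(f"bias={bias:.3f}, rmse={rmse:.3f}")
```

In a study like this one, these statistics would be computed per scaling method and per administration, so that error accumulation across the equating chain can be tracked.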