
Abstract

We aimed to develop an item bank for computerized adaptive testing of eating disorders (CAT-ED) in Chinese university students to increase measurement precision and improve test efficiency. A total of 1,025 Chinese undergraduates answered a series of questions about eating disorders in a paper-and-pencil test. A total of 133 items from four well-validated Chinese-version eating disorder scales were used to construct the CAT-ED item bank through the following analyses. First, unidimensionality, model fit, local independence, item fit, discrimination and differential item functioning (DIF) were tested. Then, two simulation studies were conducted to evaluate the effectiveness and soundness of the CAT-ED by calculating concurrent criterion-related validity, sensitivity and specificity. The final item bank comprised 77 items, which met the requirements of local independence, item fit, high discrimination and absence of DIF. The mean number of items administered in the CAT with the stopping rule fixed at SE ≤ 0.3 was 11. The results showed that the CAT-ED had acceptable reliability, validity, sensitivity and specificity.

Keywords

computerized adaptive testing, eating disorders, item response theory, psychological assessment, screening

Introduction

Eating disorders (EDs) are a group of mental disorders characterized by abnormal eating behaviors and attitudes, which are associated with the interaction of environmental events with an individual’s biological and developmental characteristics (Treasure et al., 2010). In the Diagnostic and Statistical Manual of Mental Disorders (Fifth Edition), EDs primarily include anorexia nervosa (AN), bulimia nervosa (BN), binge-eating disorder (BED), avoidant/restrictive food intake disorder (ARFID), rumination disorder, and pica. In recent years, many studies on EDs have been published, including epidemiological studies (e.g., Mitchison & Mond, 2015; Udo & Grilo, 2018), clinical medical studies (e.g., Yang et al., 2018), and psychological studies (e.g., Marzilli et al., 2018), which indicates sustained research interest in EDs. Summarizing and contextualizing reviews of recent epidemiologic data on EDs in Asia and the Pacific, Thomas et al. (2016) found that EDs are transnational in distribution and that their prevalence is very high. As a group of severe and disabling conditions, EDs harm individuals and their families and impose both mental and physical burdens on sufferers (Erskine et al., 2016).

For EDs, effective evaluation is a critical step before treatment. Accordingly, researchers and clinical practitioners from different disciplines have applied many methods to assess eating disorders (Chowdhury & Lask, 2001; Fairburn & Beglin, 1994; Rosen, 1996; Schaefer et al., 2020; Smyth et al., 2001). In psychology, patients’ self-report is usually regarded as an important reference for assessing the severity of EDs. Thus, based on classical test theory (CTT), psychometric researchers have developed several self-report scales for EDs, such as the Eating Attitudes Test (EAT-26; Garner et al., 1982), the SCOFF Questionnaire (Morgan et al., 2000), the Eating Disorders Inventory (EDI; Garner et al., 1983), the Eating Symptoms Checklist-21 (ESC-21; Leung & Mak, 2003), the Eating Disorders Inventory-3 (EDI-3; Garner, 2004) and the Eating Disorders Examination Questionnaire (EDE-Q; Fairburn & Beglin, 1994). For decades, these scales have been validated by researchers and have played an important role in diagnosing or measuring EDs. However, these scales have some shortcomings. First, under CTT, estimates of test performance depend heavily on the specific sample. For example, when the same test is administered to different groups of subjects, test performance may differ; thus, the interpretation of test outcomes is inconsistent. Second, CTT estimates measurement error imprecisely, and reliability and validity may vary across groups of subjects. Moreover, CTT yields only the average measurement accuracy for a test group and cannot give the measurement accuracy for subjects at different trait levels. Third, all participants must answer exactly the same items for scores to be comparable. If the scale contains too many items, it increases the respondents’ burden, and some respondents may lose patience because of the long testing time. Therefore, given these three points, a new measurement mode is needed to increase measurement accuracy and improve test efficiency.

Item response theory (IRT) was proposed as an alternative to CTT to overcome the abovementioned shortcomings (Hambleton et al., 1991; van der Linden & Hambleton, 1997). With the popularization of personal computers and the internet, computerized adaptive testing (CAT) has become one of the most valuable applications of IRT (Wainer, 2000). CAT is a testing method that uses IRT to build the item bank, employs a computerized algorithm to automatically select the most appropriate items for each participant, and then accurately estimates the participant’s trait level (Wainer, 2000). In effect, it creates tests of different lengths for subjects at different trait levels. Alternatively, the test can be fixed at a short length for every subject, which also reduces testing time. Since its appearance, CAT has shown considerable application prospects and has received widespread research attention. To date, CAT has been successfully applied to diagnose or measure psychological conditions such as personality disorder (Simms et al., 2011), depression (Gibbons et al., 2012), and anxiety (Gibbons et al., 2014).

To our knowledge, the application of CAT to EDs has not yet been formally discussed in the literature. Given the limitations of traditional psychometric scales in measuring EDs and the advantages of CAT in assessing mental health and psychiatric illnesses, we applied CAT technology to the measurement of EDs. In this study, we aim to address the abovementioned issues by developing an efficient and accurate item bank for EDs, and to give psychometric researchers, potential users and interested readers the complete theoretical and technical details of the development of CAT-ED. We also hope that this research will promote the application of CAT in the fields of mental health and psychiatry. More specifically, this article comprises three parts. First, we describe the detailed development steps of the CAT-ED item bank, which meets the psychometric requirements of CAT, using four well-validated psychological scales for EDs. Second, two simulation studies are conducted to evaluate the psychometric properties, efficiency, reliability, and validity of the CAT-ED item bank, verifying its performance with various accuracy indicators. The two simulation studies differ in purpose: the first is conducted over a wide range of trait levels to show test performance for simulated subjects at various trait levels, whereas the second tests the performance of CAT-ED in a test environment close to reality. Finally, we list the limitations of this study and provide directions for future research.

Method

Participants

Data collection began in December 2018 and was completed in March 2019. Considering both sampling convenience and sample representativeness, a total of 1,168 college students were recruited from seven colleges in 10 Chinese provinces: Shandong, Hebei, Jiangxi, Sichuan, Liaoning, Hunan, Hubei, Henan, Guangdong and Anhui. These samples represent Chinese college students from North, South, East, Central and Southwest China. To protect personal privacy, each respondent was assigned a unique code number for identification. To screen out individuals who might have responded randomly, three lie detection items, each worded with the opposite meaning of a paired item, were embedded in the survey. For example, the item “Facing pressure makes me feel powerful” was paired with the lie detection item “The stressful scene makes me feel helpless.” Participants who gave the same answer to both items in any of the three pairs were excluded. In addition, subjects with missing answers on any demographic item were excluded. Finally, 1,025 respondents with an average age of 19.7 years (SD = 2.2, range 17 to 24 years) were retained for the follow-up analyses. Detailed demographic information about these samples can be found in Table 1. All participation was voluntary and anonymous. The present study was carried out following the recommendations for psychometric studies on mental health of the Research Center of Mental Health, Jiangxi Normal University. Informed consent was obtained from all participants in accordance with the Declaration of Helsinki.

Table 1.

Demographic Characteristics of Subjects (N = 1,025).

Variables	Category	Frequency	Percentage (%)
Gender	Male	255	24.9
Gender	Female	770	75.1
Age (years)	17–18	296	28.8
Age (years)	19–20	377	36.8
Age (years)	21–22	254	24.8
Age (years)	23–24	80	7.8
Age (years)	Missing	18	1.8
Region	Rural	557	54.3
Region	City	468	45.7

Instruments

Taking into account the localization, reliability and validity of eating disorder scales, we identified the commonly used Chinese versions by searching the Chinese literature database. Four well-validated psychological scales (Chinese versions) measuring EDs were then selected to construct the initial item bank: the Eating Attitudes Test (EAT-26; Garner et al., 1982), the Eating Symptoms Checklist-21 (ESC-21; Leung & Mak, 2003), the Eating Disorders Inventory-3 (EDI-3; Garner, 2004), and the Eating Disorders Examination Questionnaire (EDE-Q; Fairburn & Beglin, 1994). In addition to these four scales, the SCOFF Questionnaire was used to test the concurrent criterion-related validity of CAT-ED. Due to space limitations, basic information about these scales, such as the reliability, validity and number of items of the original and Chinese versions, is shown only in Table S1 in the Supporting Information.

Originally, there were 149 items in the initial item bank of CAT-ED, not counting the SCOFF Questionnaire (five items), which was used to test criterion-related validity. However, among these items, 32 pairs of items (e.g., “I think people are happiest when they are children” vs. “Childhood is the happiest period of life”; “I think my hip circumference is too large” vs. “I think my hips are too big”) were considered to have a similar meaning through the following review process. First, we invited four third-year master’s degree students, two from Chinese language and literature and two from psychology. They were asked to identify the pairs that had the same or similar meaning in the original item bank. When opinions differed, they discussed the pair until they reached an agreement. Next, we invited another four third-year master’s degree students, again two from Chinese language and literature and two from psychology. They were asked to judge whether the pairs obtained from the previous step had the same or similar meaning. In this step, we selected the pairs that all invited evaluators judged to have the same or similar meaning. Finally, we randomly deleted one item from each chosen pair. Ultimately, 133 items made up the initial item bank.

Statistical Analysis

Data analysis was divided into three steps: construction of the item bank, simulation studies, and validation of the effectiveness of CAT-ED. All details of data processing are described below. Data analysis and the simulation of item administration were mainly carried out with the R package mirt (Version 1.3; Chalmers, 2012) and the R package catR (Version 3.16; Magis & Barrada, 2017). The former was used for item bank construction and parameter calibration, and the latter was used to simulate the process of administering items to participants.

Construction of Item Bank for CAT-ED

Step 1: Test the unidimensionality of the item bank

Unidimensionality is a crucial assumption in IRT. An item bank is regarded as unidimensional if a person’s responses result from the latent trait that the items measure rather than from other factors; that is, responses to each item are affected by a single latent construct of the test taker. In this study, the assumption was tested with the method recommended by Andrich (1996) and Reckase (1979) under the framework of IRT. If the ratio of the variance explained by the first eigenvalue to that explained by the second eigenvalue is greater than 4 and the first factor explains more than 20% of the variance (Reckase, 1979), the data can be considered to basically conform to this assumption (Reeve et al., 2007). Some researchers (Wu et al., 2019; Zhang et al., 2019) have tested this assumption according to these standards when developing a unidimensional CAT item bank. We evaluated the dimensionality of our item bank using exploratory factor analysis (EFA) in SPSS 20.0, with principal component analysis (PCA) as the extraction method and no rotation. First, according to the EFA results, items whose loading on the first factor was less than 0.3 were removed. Then, the test for unidimensionality was repeated until no item in the item bank had a loading below 0.3 (Reeve et al., 2007).

Step 2: Conduct the model selection

Selecting an appropriate IRT model is a prerequisite for accurate parameter estimation. In this research, the graded response model (GRM; Samejima, 1970), partial credit model (PCM; Masters, 1982) and generalized partial credit model (GPCM; Muraki, 1992) were used as candidate models to fit the items in the CAT-ED item bank. The formulas for these models are as follows:

(a) Graded Response Model (GRM). GRM assumes that there are $m$ ordered categories in an item and estimates $m - 1$ boundary response functions (BRFs). The equation for BRF is expressed by

P_{ik}^{*} (θ) = \frac{\exp [D a_{i} (θ - b_{ik})]}{1 + \exp [D a_{i} (θ - b_{ik})]},

(1)

where $P_{ik}^{*} (θ)$ represents the probability that a subject with ability $θ$ scores equal to or higher than $k$ on item $i$ ; $a_{i}$ is the discrimination parameter of item $i$ ; $b_{ik}$ is the location parameter for category $k$ ; $θ$ is the latent trait level, and $D$ is equal to 1.7 in this study.

The probability of choosing a response category is not calculated directly from equation (1); it can be calculated using the following formula:

P_{ik} (θ) = P_{ik}^{*} (θ) - P_{i, k + 1}^{*} (θ), k = 0, 1, \dots, m - 1, \text{ with } P_{i0}^{*} (θ) = 1 \text{ and } P_{im}^{*} (θ) = 0

(2)

In this formula, $P_{ik}$ is the probability that a subject chooses response category $k$ of item $i$ . Formula (2) is also called the “option” probability function, and it is the form of the model used in this study.
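Equations (1) and (2) can be illustrated with a minimal Python sketch. It is not the mirt implementation used in the study. The function names and the item parameters are hypothetical, and the sketch assumes the usual convention that the boundary curve equals 1 below the lowest category and 0 above the highest, so that an item with m categories (indexed 0 to m − 1) has m − 1 location parameters.

```python
import math

D = 1.7  # scaling constant reported in the text


def boundary_prob(theta, a, b):
    """Equation (1): probability of scoring in category k or higher."""
    return 1.0 / (1.0 + math.exp(-D * a * (theta - b)))


def grm_category_probs(theta, a, thresholds):
    """Equation (2): probabilities of the m ordered categories.

    thresholds holds the m - 1 ordered location parameters of one item.
    """
    # Pad the boundary curve with 1 (lowest) and 0 (highest); assumed convention.
    cum = [1.0] + [boundary_prob(theta, a, b) for b in thresholds] + [0.0]
    return [cum[k] - cum[k + 1] for k in range(len(thresholds) + 1)]


# Hypothetical four-category item evaluated at theta = 0.
probs = grm_category_probs(theta=0.0, a=1.2, thresholds=[-1.0, 0.0, 1.5])
```

Because the second location parameter equals the trait level in this example, the two lowest categories together receive probability 0.5, and the four category probabilities sum to 1.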

(b) Generalized partial credit model (GPCM) and partial credit model (PCM)

Assuming there are a total of $m_{j}$ score categories in item $j$ , the probability of a score in a particular category $k$ can be directly calculated by

P_{jk} (θ) = \frac{\exp (\sum_{v = 1}^{k} a_{j} (θ - b_{jv}^{*}))}{\sum_{c = 1}^{m_{j}} \exp (\sum_{v = 1}^{c} a_{j} (θ - b_{jv}^{*}))}, k = 1, 2, \dots, m_{j}

(3)

where $θ$ is the trait level of the subject; $a_{j}$ is the discrimination parameter for item $j$ , and $b_{jv}^{*}$ is the location parameter. It is worth noting that when $a_{j}$ in equation (3) equals 1, equation (3) reduces to the partial credit model; that is, PCM is a special case of GPCM.
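The following Python sketch evaluates equation (3) exactly as written and obtains the PCM by fixing the discrimination at 1. The function names and parameter values are hypothetical, and subtracting the largest exponent is only a numerical-stability step that leaves the ratio unchanged.

```python
import math


def gpcm_category_probs(theta, a, steps):
    """Equation (3): steps[v - 1] plays the role of b*_jv for v = 1..m_j."""
    exponents = []
    running = 0.0
    for b in steps:
        running += a * (theta - b)  # cumulative sum over v = 1..c
        exponents.append(running)
    top = max(exponents)  # stabilises exp() without changing the ratio
    weights = [math.exp(z - top) for z in exponents]
    total = sum(weights)
    return [w / total for w in weights]


def pcm_category_probs(theta, steps):
    """PCM as the special case of the GPCM with a_j = 1."""
    return gpcm_category_probs(theta, 1.0, steps)


# Hypothetical three-category item.
gpcm = gpcm_category_probs(theta=0.5, a=1.4, steps=[0.3, -0.5, 1.0])
pcm = pcm_category_probs(theta=0.5, steps=[0.3, -0.5, 1.0])
```

One property of the formula as printed is that the first location term appears in every category's exponent and therefore cancels in the ratio, so only the remaining location parameters affect the probabilities.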

The optimal model was selected using test-level model-fit indices, including the −2 log-likelihood (−2LL; Spiegelhalter et al., 1998), Akaike’s information criterion (AIC; Akaike, 1974) and the Bayesian information criterion (BIC; Schwarz, 1978). The smaller the values of −2LL, AIC and BIC, the better the model fit. Model selection was conducted with the flexMIRT software (Version 3.51; Cai, 2017).
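As a reminder of how the two information criteria build on −2LL, the sketch below uses their standard definitions, AIC = −2LL + 2p and BIC = −2LL + p ln N, where p is the number of estimated parameters and N the sample size. The −2LL values and parameter counts are invented purely for illustration and are not the study's results (those are in Table S2); only N = 1,025 comes from the text.

```python
import math


def aic(neg2ll, n_params):
    """Akaike's information criterion from the -2 log-likelihood."""
    return neg2ll + 2 * n_params


def bic(neg2ll, n_params, n_obs):
    """Bayesian information criterion from the -2 log-likelihood."""
    return neg2ll + n_params * math.log(n_obs)


N_OBS = 1025  # sample size reported in the text

# Made-up fit summaries for two hypothetical candidate models.
candidates = {
    "model_A": {"neg2ll": 1000.0, "n_params": 20},
    "model_B": {"neg2ll": 990.0, "n_params": 30},
}

aic_values = {name: aic(c["neg2ll"], c["n_params"]) for name, c in candidates.items()}
bic_values = {name: bic(c["neg2ll"], c["n_params"], N_OBS) for name, c in candidates.items()}
best_by_bic = min(bic_values, key=bic_values.get)
```

In this toy case the second model has the smaller −2LL but pays a larger penalty for its extra parameters, so both criteria prefer the first model.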

Step 3: Evaluate local independence

IRT assumes local independence: for subjects with the same ability, the response to one item is not affected by the response to any other item on the same test. Yen (1984) proposed the Q₃ statistic to test local independence between items. Following Cohen (1988), if the Q₃ value for a pair of items is higher than 0.36, the pair is considered locally dependent, and within each such pair the item with the larger accumulated Q₃ is deleted. We calculated the Q₃ statistic using the residuals function in the R package mirt.
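Yen's Q₃ for a pair of items is the Pearson correlation, across respondents, of the residuals between observed item scores and the scores expected under the fitted model. The sketch below shows that computation and the 0.36 screening rule on made-up numbers; it is not the mirt routine used in the study, and all names and data are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def q3(obs_a, exp_a, obs_b, exp_b):
    """Correlation of model residuals for two items across respondents."""
    res_a = [o - e for o, e in zip(obs_a, exp_a)]
    res_b = [o - e for o, e in zip(obs_b, exp_b)]
    return pearson(res_a, res_b)


# Made-up observed scores and model-expected scores for five respondents.
obs_a, exp_a = [0, 1, 2, 3, 1], [0.5, 1.2, 1.6, 2.4, 1.3]
obs_b, exp_b = [1, 1, 3, 3, 0], [0.8, 1.4, 2.2, 2.6, 1.0]

value = q3(obs_a, exp_a, obs_b, exp_b)
locally_dependent = value > 0.36  # threshold stated in the text
```

In practice the expected scores would come from the calibrated GRM evaluated at each respondent's estimated trait level.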

Step 4: Check the item model fit of the remaining items

When conducting IRT-based analysis, testing the goodness of item fit is considered to be an important step. If the item has a low degree of fit to the selected model, it will affect the measurement accuracy to some extent (Köhler & Hartig, 2017). The test for item fit is applied to examine whether the item fits the IRT model well. In our research, we followed the suggestion of Orlando and Thissen (2003) and used the S-χ² statistic to check each item’s degree of fit. When the P value of the S-χ² of an item is less than the critical value of .01, it indicates that this item poorly fits the IRT model and should be excluded from the item bank (Flens et al., 2017). To test the goodness of fit for each item, the function called itemfit in the R package mirt is used.

Step 5: Choose items with high discrimination parameters

Discrimination is an essential indicator of the quality of a single item and of the whole test. In IRT, the discrimination parameter is closely related to the information provided by items and by the whole test: the greater the discrimination parameter, the higher the information. Item discrimination therefore has a significant impact on measurement accuracy and is commonly used to decide whether an item is kept in a CAT item bank (Wainer, 2000). Baker and Kim (2004) pointed out that to estimate the person parameters effectively, the discrimination parameter is generally required to be between 0.5 and 2. Therefore, items with a discrimination parameter below 0.5 were deleted. Using the functions coef and mirt in the R package mirt, we obtained each item’s discrimination parameter from the response data.

Step 6: Test Differential Item Functioning (DIF)

DIF analysis is used to identify systematic differences caused by demographic variables such as gender and age (Gaynes et al., 2002); therefore, it is closely related to the fairness of the test. Adaptive testing may be more susceptible than fixed testing to the impact of DIF on validity because DIF may have a greater impact on a short CAT evaluation (Reeve et al., 2007; Wainer, 2000). To build an unbiased item bank, logistic regression (LR) was used for the DIF test in this study, specifically with McFadden’s pseudo R² (Choi et al., 2011). The change in McFadden’s pseudo R² was used as the effect size, and the hypothesis of no DIF was rejected when the R² change exceeded .02; such an item was considered biased and was deleted (Choi et al., 2011). Referring to existing IRT-related and CAT-related studies (Gaynes et al., 2002; Liu et al., 2020; Tan et al., 2018), we selected two demographic variables that have been the focus of DIF detection in these studies: region (rural, city) and gender (male, female). We conducted the DIF analysis using the R package lordif (Version 0.3-3; Choi et al., 2011).
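The effect-size rule can be sketched as follows. McFadden's pseudo R² for a model is 1 minus the ratio of its log-likelihood to that of the intercept-only model, and the DIF effect size is the difference in pseudo R² between a model that predicts the item response from the trait alone and a model that adds the group terms. The log-likelihood values below are invented for illustration, the function names are hypothetical, and the sketch assumes the comparison is between the trait-only model and the model with both group and trait-by-group terms, which the text does not specify.

```python
def mcfadden_r2(loglik_model, loglik_null):
    """McFadden's pseudo R-squared relative to the intercept-only model."""
    return 1.0 - loglik_model / loglik_null


def r2_change(loglik_null, loglik_trait_only, loglik_with_group):
    """Gain in pseudo R-squared from adding group terms to the trait-only model."""
    return mcfadden_r2(loglik_with_group, loglik_null) - mcfadden_r2(loglik_trait_only, loglik_null)


def shows_dif(change, critical=0.02):
    """Flag an item when the R-squared change exceeds the critical value in the text."""
    return change > critical


# Invented log-likelihoods for two hypothetical items.
change_item_1 = r2_change(-1400.0, -1100.0, -1060.0)  # about 0.029
change_item_2 = r2_change(-1400.0, -1100.0, -1090.0)  # about 0.007
flags = [shows_dif(change_item_1), shows_dif(change_item_2)]
```

Under these made-up values only the first item would be flagged and removed.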

Psychometric Properties of CAT-ED

Two parts are mainly involved in the psychometric properties of CAT-ED: the algorithm of CAT and its performance evaluation.

Algorithm of CAT-ED and evaluation

The core of ensuring the implementation of CAT is the algorithm, which allows the computer to automatically select an appropriate item for the subject and give them the test results. It mainly involves four steps: initial item selection, parameter estimation, item selection algorithm and stopping rule (Chen & Cook, 2009).

Step 1: Initial item selection. The first item was chosen with the random method suggested by Magis and Barrada (2017), because no prior information about the respondent is available at the start of a test. With this method, the computer system randomly selects an item from the CAT-ED item bank and administers it to the subject.

Step 2: Parameter estimation. After a participant responds to the selected items, the current trait level is estimated from those responses with an appropriate estimation method (Wainer, 2000). In this research, expected a posteriori (EAP) estimation, a Bayesian method introduced by Bock and Mislevy (1982), was applied because it uses prior information about the latent variable, needs no iteration, and estimates the latent variable with high accuracy and efficiency. EAP is defined as

θ_{i} = \frac{\sum_{h = 1}^{q} Z_{h} L_{i} (Z_{h}) W (Z_{h})}{\sum_{h = 1}^{q} L_{i} (Z_{h}) W (Z_{h})}

(4)

where $Z_{h}$ refers to one of the $θ$ quadrature points; $L_{i} (Z_{h})$ represents the likelihood function at $Z_{h}$ based on responses to $i$ items, and $W (Z_{h})$ is a weight associated with that quadrature point. The weight represents the height of the density curve at the corresponding quadrature points.
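A minimal Python sketch of equation (4) for GRM items follows. It is not the catR implementation used in the study. The standard normal prior, the 61 equally spaced quadrature points on [−4, 4], the function names and the item parameters are all assumptions made for illustration.

```python
import math

D = 1.7  # scaling constant reported in the text


def grm_probs(theta, a, thresholds):
    """GRM category probabilities, categories indexed 0..len(thresholds)."""
    cum = [1.0] + [1.0 / (1.0 + math.exp(-D * a * (theta - b))) for b in thresholds] + [0.0]
    return [cum[k] - cum[k + 1] for k in range(len(thresholds) + 1)]


def eap_estimate(responses, items, n_points=61, lower=-4.0, upper=4.0):
    """Equation (4): posterior mean of theta over a fixed quadrature grid."""
    step = (upper - lower) / (n_points - 1)
    numerator = denominator = 0.0
    for h in range(n_points):
        z = lower + h * step
        # Height of the standard normal density (assumed prior); the
        # normalising constant cancels between numerator and denominator.
        weight = math.exp(-0.5 * z * z)
        likelihood = 1.0
        for (a, thresholds), x in zip(items, responses):
            likelihood *= grm_probs(z, a, thresholds)[x]
        numerator += z * likelihood * weight
        denominator += likelihood * weight
    return numerator / denominator


# Two hypothetical four-category items as (a, thresholds).
items = [(1.2, [-1.0, 0.0, 1.0]), (1.0, [-0.5, 0.5, 1.5])]
low = eap_estimate([0, 0], items)   # lowest category on both items
high = eap_estimate([3, 3], items)  # highest category on both items
```

With no responses the estimate is the prior mean of 0, which is why EAP yields a finite estimate even after a single extreme response.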

Step 3: Item selection algorithm. To select the next item according to the participant’s current trait level, the maximum Fisher information (MFI; Baker, 1992) criterion was used as the selection strategy each time the respondent’s trait estimate was updated. Fisher information is related to the measurement error of the estimated latent trait: the more information an item provides, the more accurate the trait estimate. The item information function is

I_{j} (\hat{θ}) = \sum_{k = 1}^{K} \frac{{[P'_{k} (\hat{θ})]}^{2}}{P_{k} (\hat{θ})}

(5)

where $I_{j} (\hat{θ})$ is the item information function of item j at $\hat{θ}$ , the estimated latent trait level; $P_{k} (\hat{θ})$ is the probability of obtaining score k; K is the number of score categories of item j, and $P'_{k} (\hat{θ})$ is the first derivative of $P_{k} (\hat{θ})$ with respect to $\hat{θ}$ .
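Equation (5) and the MFI rule can be sketched for GRM items as follows. For the logistic boundary curve of equation (1), the derivative with respect to the trait is D·a·P*(1 − P*), so the derivative of each category probability is the difference of two such terms. The function names and the three-item bank are hypothetical.

```python
import math

D = 1.7  # scaling constant reported in the text


def boundary(theta, a, b):
    """Boundary response function of equation (1)."""
    return 1.0 / (1.0 + math.exp(-D * a * (theta - b)))


def item_information(theta, a, thresholds):
    """Equation (5) for a GRM item: sum over categories of P'_k squared over P_k."""
    cum = [1.0] + [boundary(theta, a, b) for b in thresholds] + [0.0]
    info = 0.0
    for k in range(len(cum) - 1):
        p = cum[k] - cum[k + 1]
        # Derivative of each boundary curve is D * a * P* * (1 - P*).
        dp = D * a * (cum[k] * (1.0 - cum[k]) - cum[k + 1] * (1.0 - cum[k + 1]))
        info += dp * dp / p
    return info


def select_next_item(theta_hat, bank, administered):
    """MFI rule: the unused item with the largest information at theta_hat."""
    candidates = [j for j in range(len(bank)) if j not in administered]
    return max(candidates, key=lambda j: item_information(theta_hat, *bank[j]))


# Hypothetical bank of (a, thresholds) pairs.
bank = [(0.6, [-1.0, 0.0, 1.0]), (1.8, [-0.5, 0.5, 1.0]), (1.8, [2.0, 2.5, 3.0])]
first_pick = select_next_item(0.0, bank, administered=set())
second_pick = select_next_item(0.0, bank, administered={1})
```

At a trait estimate of 0, the highly discriminating item whose location parameters lie near 0 is chosen first; once it has been used, the weakly discriminating but well-targeted item is preferred over the highly discriminating item located far from the estimate.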

Step 4: Stopping rule. The stopping rule is a key aspect of CAT and is related to the reasonable termination of the test (Reckase, 2009; Wainer, 2000). In CAT, the accuracy of the estimated trait levels will be affected by the test length. Consequently, many researchers either fixed the test length or fixed the measurement accuracy as the stopping rule according to Yao (2013). The minimum standard error (SE) stopping rule was used here for CAT-ED, which ensured and fixed the measurement accuracy for each subject. The information function is used to calculate the information provided by the responding items and reflects the measurement accuracy simultaneously. The formula is expressed as,

SE ({\hat{θ}}_{i}) = \frac{1}{\sqrt{\sum_{j = 1}^{n} I_{j} (\hat{θ})}}

(6)

where SE stands for the standard error of measurement, and $n$ is the number of items to which the examinee has responded. The higher the information, the smaller the SE. In this study, the SE threshold of the stopping rule was set at 0.2, 0.3, 0.4, 0.5, and 0.6, following a previous study (Tan et al., 2018).
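The stopping rule in equation (6) can be sketched as below. Inverting the formula shows how much accumulated information each threshold demands: a threshold of SE requires total information of at least 1/SE², for example about 11.1 for SE ≤ 0.3 and 25 for SE ≤ 0.2. The function names are hypothetical, and ending the test when the bank is exhausted is an added assumption about how unreachable thresholds are handled.

```python
import math


def standard_error(item_infos):
    """Equation (6): SE from the summed information of the administered items."""
    return 1.0 / math.sqrt(sum(item_infos))


def information_needed(se_threshold):
    """Total information at which the SE first reaches the threshold."""
    return 1.0 / se_threshold ** 2


def should_stop(item_infos, se_threshold, bank_size):
    """Stop when the SE threshold is met or no items remain (assumed fallback)."""
    if not item_infos:
        return False
    if len(item_infos) >= bank_size:
        return True
    return standard_error(item_infos) <= se_threshold


# Thresholds listed in the text mapped to the information they require.
required = {se: information_needed(se) for se in (0.2, 0.3, 0.4, 0.5, 0.6)}

# Hypothetical items that each contribute 4.0 units of information.
stop_after_two = should_stop([4.0, 4.0], se_threshold=0.3, bank_size=77)
stop_after_three = should_stop([4.0, 4.0, 4.0], se_threshold=0.3, bank_size=77)
```

With these hypothetical values, two items give SE ≈ 0.354 and the test continues, while three items give SE ≈ 0.289 and the test stops.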

Design on exploring psychometric properties of CAT-ED

To evaluate the effectiveness of the CAT-ED algorithm, we designed two simulation studies. The first simulation is based on simulated latent trait values (regarded as each simulated subject’s true trait value) and simulated response patterns. The second is a post-hoc simulation, in which item responses are not randomly generated but taken from a given real response pattern; that is, the real responses of real subjects to the paper-and-pencil test are used to simulate the adaptive administration process. The main difference between the two simulation studies lies in the source of the subjects and the response patterns. Apart from that, the item parameters used in the two studies are identical; they were estimated with the functions coef and mirt in the R package mirt by maximum likelihood estimation (MLE) based on the response data of the 1,025 real subjects.

The simulation study aims to show the measurement accuracy for simulated subjects at different trait levels between −3.5 and 3.5. Thus, different trait levels of EDs for subjects are simulated and deemed their true levels of EDs in this study. Specifically, within the values of the ED trait ( $θ$ ) ranging from −3.5 to 3.5 at 0.25 intervals, 100 subjects were simulated at each interval, that is, a total of 2,900 participants were simulated. Meanwhile, the response of administered items in CAT-ED was also simulated with the function genPattern in the R package catR. Finally, the indices of bias, mean absolute deviation (MAD), root mean squared error (RMSE) and Pearson product-moment correlation coefficient r (r is the correlation coefficient between the true theta and the estimated theta of each simulated subject) are calculated to assess the effectiveness of the CAT-ED item bank and adaptive algorithm. The formulas of bias, MAD and RMSE are respectively expressed as,

Bias = \frac{1}{N} \sum_{i = 1}^{N} ({\hat{θ}}_{i} - θ_{i})

(7)

MAD = \frac{1}{N} \sum_{i = 1}^{N} ∣ {\hat{θ}}_{i} - θ_{i} ∣

(8)

RMSE = \sqrt{\frac{1}{N} \sum_{i = 1}^{N} {({\hat{θ}}_{i} - θ_{i})}^{2}}

(9)

where N is the number of subjects; the estimated trait level of test taker i is represented by ${\hat{θ}}_{i}$ , and the true trait level of test taker i is represented by $θ_{i}$ . The smaller the MAD and RMSE and the closer the bias is to zero, the more accurate the parameter estimation.
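The simulation grid and equations (7) to (9) can be sketched in Python as follows. The grid reproduces the design stated in the text (29 trait levels from −3.5 to 3.5 in steps of 0.25, with 100 simulated subjects each). The function names and the three estimate and true-value pairs used to exercise the formulas are made up.

```python
import math


def bias(estimates, truths):
    """Equation (7): mean signed error."""
    return sum(e - t for e, t in zip(estimates, truths)) / len(truths)


def mad(estimates, truths):
    """Equation (8): mean absolute deviation."""
    return sum(abs(e - t) for e, t in zip(estimates, truths)) / len(truths)


def rmse(estimates, truths):
    """Equation (9): root mean squared error."""
    return math.sqrt(sum((e - t) ** 2 for e, t in zip(estimates, truths)) / len(truths))


# Design described in the text: 29 trait levels, 100 simulated subjects each.
grid = [-3.5 + 0.25 * i for i in range(29)]
true_thetas = [theta for theta in grid for _ in range(100)]

# Made-up estimates for three subjects whose true trait level is 0.
toy_truths = [0.0, 0.0, 0.0]
toy_estimates = [0.1, -0.1, 0.3]
toy_bias = bias(toy_estimates, toy_truths)
toy_mad = mad(toy_estimates, toy_truths)
toy_rmse = rmse(toy_estimates, toy_truths)
```

The toy values show why the three indices are reported together: positive and negative errors partly cancel in the bias (0.1) but not in the MAD (about 0.167) or the RMSE (about 0.191).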

Then, based on the real item parameters of CAT-ED and the real response pattern of each real subject for each item, the adaptive administration process was simulated in the post-hoc simulation study. The administration process in the post-hoc simulation was the same as in the simulation study except for the source of respondents and their response data. According to Magis and Barrada (2017), in post-hoc simulations, when the true trait level of each subject cannot be provided, it is common to treat the latent trait estimated from the full vector of responses as the best guess of the true trait level. Therefore, we estimated the true trait level of each real subject with the function thetaEst in the R package catR using the EAP method. To explore the soundness and effectiveness of the algorithm in an environment close to reality, we determined the correlation coefficients between the true theta and the estimated theta of each real subject under each stopping rule. Then, we calculated the average test length and the standard error of measurement of the 1,025 participants on CAT-ED. In the post-hoc simulation study, the standard error is derived from the second derivative of the likelihood function based on the item parameters and estimated trait levels. In addition, the average saving ratio of the CAT-ED item bank (the proportion of items in the item bank that are not administered) under each stopping rule was calculated.
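The average saving ratio defined above is simple to compute once the test length of each participant is known. In the sketch below, the function name and the three test lengths are hypothetical; the bank size of 77 and the mean of 11 items at SE ≤ 0.3 are the figures reported in the abstract, and because that mean is rounded the resulting ratio is only approximate.

```python
def saving_ratio(test_lengths, bank_size):
    """Average proportion of the item bank that is not administered."""
    mean_length = sum(test_lengths) / len(test_lengths)
    return 1.0 - mean_length / bank_size


# Hypothetical test lengths whose mean is 11, with the final 77-item bank.
ratio = saving_ratio([9, 11, 13], bank_size=77)
```

A mean length of 11 items out of 77 corresponds to a saving ratio of roughly 0.857.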

Performance Evaluation of CAT-ED

The main purpose is to explore the accuracy of trait estimation, concurrent criterion-related validity and predictive utility (sensitivity and specificity) of CAT-ED. Of note, the data used in this analysis were drawn from real subjects.

Accuracy of trait estimation

According to Kolen et al. (1996), conditional standard errors of measurement (SEM) are an effective indicator reflecting measurement error in IRT. Therefore, this indicator is frequently used in many IRT-related applied studies such as by Ferrando (2003) and Irwin et al. (2012). In this study, conditional SEM was calculated to test the accuracy of trait estimation under different stopping rules.

Evaluation of CAT-ED’s validity

Concurrent criterion-related validity

In this study, concurrent criterion-related validity was computed as the Pearson correlation between the estimated trait value in CAT-ED and the standard score on the criterion scale (the SCOFF Questionnaire).

Predictive utility (sensitivity and specificity)

In this study, sensitivity is the probability that a respondent with an ED is correctly identified as having an ED, while specificity is the probability that a subject without an ED is correctly identified as not having one. The higher the sensitivity and specificity, the better the diagnostic performance. The receiver operating characteristic (ROC) curve is a composite indicator reflecting sensitivity and specificity (Hand, 2010); it plots sensitivity on the ordinate against 1 − specificity on the abscissa and is often used to evaluate diagnostic performance (Lusted, 1960), including in CAT studies (Gibbons et al., 2014; Tan et al., 2018). In ROC analysis, the area under the ROC curve (AUC) is often considered an important indicator of the diagnostic effectiveness of a clinical tool. In general, the larger the AUC, the higher the diagnostic accuracy; in other words, the closer the AUC is to 1, the better the diagnostic performance.
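These three quantities can be sketched in Python on toy data. The study itself used SPSS 20.0, so this is only an illustration: the function names, scores, labels and cutoff are made up, and the AUC is computed through its equivalence with the probability that a randomly chosen case scores higher than a randomly chosen non-case, with ties counted as one half.

```python
def sensitivity_specificity(scores, labels, cutoff):
    """Classify score >= cutoff as a case; labels use 1 = case, 0 = non-case."""
    tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= cutoff)
    fn = sum(1 for s, y in zip(scores, labels) if y == 1 and s < cutoff)
    tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s < cutoff)
    fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= cutoff)
    return tp / (tp + fn), tn / (tn + fp)


def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of cases and non-cases."""
    cases = [s for s, y in zip(scores, labels) if y == 1]
    controls = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if c > n else 0.5 if c == n else 0.0 for c in cases for n in controls)
    return wins / (len(cases) * len(controls))


# Made-up trait estimates and case labels for four subjects.
scores = [0.1, 0.4, 0.35, 0.8]
labels = [0, 0, 1, 1]
sens, spec = sensitivity_specificity(scores, labels, cutoff=0.5)
area = auc(scores, labels)
```

Here one of the two cases falls below the cutoff, so sensitivity is 0.5 while specificity is 1.0, and three of the four case and non-case pairs are ordered correctly, giving an AUC of 0.75.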

To conduct the ROC analysis, subjects were divided into two groups, those with and those without EDs, according to the following three criteria. (1) Generally, a score of ≥4 on an individual item indicates clinical significance (Keane et al., 2017). Because the EDE-Q has 22 items in total, a total score greater than 88 (22 × 4 = 88) on the EDE-Q (Chinese version) was used as the screening criterion. (2) A total score of ≥2 on the SCOFF Questionnaire indicates that a subject is likely to have an ED (Morgan et al., 2000). (3) We also included an additional question (Have you been diagnosed with an eating disorder including anorexia, bulimia or atypical eating disorder by a professional clinician?) in the questionnaire during data collection. Of note, subjects were considered patients with EDs only when their responses on the paper questionnaire met all three criteria simultaneously; otherwise, they were classified as not having EDs. A total of 31 of the 1,025 subjects were identified as patients with EDs. This yielded a dichotomous variable indicating whether each participant had an ED. We then used the estimated ED trait score in CAT-ED under each stopping rule together with this dichotomous variable to conduct the ROC analysis in SPSS 20.0. Finally, we determined the sensitivity, specificity and AUC under the chosen stopping rules in the post-hoc simulation study.
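The joint classification rule can be written compactly as below. The function and argument names are hypothetical; the cutoffs follow the text (EDE-Q total greater than 88, SCOFF total of at least 2, and a self-reported clinical diagnosis), and all three conditions must hold at once.

```python
def classified_as_ed(edeq_total, scoff_total, clinician_diagnosed):
    """Apply the three criteria from the text; all must be met simultaneously."""
    return edeq_total > 88 and scoff_total >= 2 and clinician_diagnosed


# Three hypothetical respondents.
respondents = [
    {"edeq_total": 95, "scoff_total": 3, "clinician_diagnosed": True},
    {"edeq_total": 88, "scoff_total": 3, "clinician_diagnosed": True},   # EDE-Q not above 88
    {"edeq_total": 95, "scoff_total": 1, "clinician_diagnosed": True},   # SCOFF below 2
]
labels = [int(classified_as_ed(**r)) for r in respondents]

# Share of the sample classified as cases, from the counts reported in the text.
case_rate = 31 / 1025
```

With 31 of 1,025 subjects meeting all three criteria, about 3% of the sample forms the positive group in the ROC analysis.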

Results

Construction of Item Bank for CAT-ED

Unidimensionality

A total of 23 items were removed from the initial item bank because their loadings on the first factor were less than 0.3. Consequently, 110 items were retained at this step. EFA was then conducted again, and the results showed that the first eigenvalue was 38.273, the second eigenvalue was 6.167, and the first factor explained 34.793% of the variance, which satisfied the assumption of unidimensionality.
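The reported eigenvalues can be checked against two common unidimensionality heuristics with a short Python sketch. The thresholds used here (a first-to-second eigenvalue ratio of at least 4 and at least 20% of variance explained by the first factor) are assumptions for illustration; this section does not state which cutoffs the authors applied. The sketch also assumes the EFA was run on a correlation matrix, so that the total variance equals the number of items.

```python
first_eigenvalue = 38.273
second_eigenvalue = 6.167
n_items = 110  # items retained after removing the 23 low-loading items

# Ratio of the first to the second eigenvalue.
eigen_ratio = first_eigenvalue / second_eigenvalue

# Proportion of variance explained by the first factor, assuming a
# correlation-matrix EFA in which total variance equals the item count.
variance_share = first_eigenvalue / n_items

# Illustrative thresholds (assumptions, not values stated in this section).
RATIO_THRESHOLD = 4.0
VARIANCE_THRESHOLD = 0.20
looks_unidimensional = (
    eigen_ratio >= RATIO_THRESHOLD and variance_share >= VARIANCE_THRESHOLD
)
```

Under these assumptions the ratio is about 6.2 and the variance share is about 34.79%, in line with the reported 34.793% up to rounding.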

Model selection

We compared the fit of three models: GRM, PCM and GPCM. The results suggested that GRM fitted the data better than GPCM and PCM because its −2LL, AIC and BIC values were smaller than those of the other two models. Thus, GRM was used for the follow-up analyses. Detailed values of these indices can be found in Table S2 in the Supporting Information.
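For reference, the two information criteria are usually defined as below (Akaike, 1974; Schwarz, 1978), where $k$ is the number of estimated model parameters and $N$ is the sample size (1,025 respondents in this study). These are the textbook forms; the software used by the authors may report variants.

```latex
\mathrm{AIC} = -2LL + 2k, \qquad \mathrm{BIC} = -2LL + k \ln N
```

For all three indices, smaller values indicate a better trade-off between fit and model complexity.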

Local independence

In this step, following the criterion of Cohen (1988), 12 items were deleted from the item bank because the Q3 statistic showed that they were locally dependent on other items. After the deletion, the assumption of local independence was satisfied.
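The Q3 statistic (Yen, 1984) is the correlation between the model residuals of two items. A minimal Python sketch of the flagging step is given below; it assumes the residuals (observed minus model-expected item scores) have already been computed from the fitted IRT model. The function names, the toy residuals and the cutoff of 0.9 are hypothetical, and the cutoff actually applied in the study is not stated in this section.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def q3_flags(residuals, threshold):
    """Return item pairs whose absolute Q3 exceeds the threshold.

    residuals: dict mapping item name -> list of residuals, one per respondent.
    threshold: cutoff on |Q3| (a user-supplied assumption).
    """
    flagged = []
    for item_i, item_j in combinations(residuals, 2):
        q3 = pearson(residuals[item_i], residuals[item_j])
        if abs(q3) > threshold:
            flagged.append((item_i, item_j, q3))
    return flagged


# Toy residuals for three items and four respondents (invented values).
toy_residuals = {
    "item_a": [0.5, -0.3, 0.2, -0.4],
    "item_b": [0.4, -0.2, 0.3, -0.5],
    "item_c": [-0.1, 0.2, -0.2, 0.1],
}
flagged_pairs = q3_flags(toy_residuals, threshold=0.9)
```

In practice, one item from each flagged pair is typically removed, which is consistent with the item deletions reported in this step.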

Item fit

The p values of the S-χ² statistic for 19 items were less than the critical value of .01 (Flens et al., 2017); therefore, these items were removed from the item bank.

Discrimination parameters

One item had a discrimination parameter of less than .5 (Baker & Kim, 2004) and was therefore deleted from the remaining item bank at this step.

Differential item functioning (DIF)

There was significant DIF between males and females on one item, whose R² change of .0262 exceeded the critical value of .02 (Choi et al., 2011). No items showed DIF across regions. This item was therefore removed from the item bank at this step.

The CAT-ED item bank can meaningfully measure ED only after all the abovementioned procedures are completed. After 56 items were deleted through these steps, the remaining 77 items satisfied the basic assumptions of IRT. These items therefore make up the final CAT-ED item bank and were used for further analysis. Because of space limitations, the item parameters and item fit statistics of the item bank are shown in Table S3 in the Supporting Information.
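The item counts reported across the screening steps can be tallied with a short Python sketch. The dictionary keys are labels chosen for illustration; the counts come from the steps above, with single-item removals at the discrimination and DIF steps.

```python
initial_items = 133

# Items removed at each screening step, in the order reported above.
removed_per_step = {
    "unidimensionality": 23,
    "local independence": 12,
    "item fit": 19,
    "discrimination": 1,
    "DIF": 1,
}

total_removed = sum(removed_per_step.values())
final_items = initial_items - total_removed
```

The tally gives 56 removed items and a final bank of 77 items, matching the totals stated in the text.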

Psychometric Properties of CAT-ED

Algorithm of CAT-ED and its evaluation

Results of simulation study

The recovery indices (bias, MAD and RMSE) and the correlation coefficient (r) between the true and estimated ED trait scores of the simulated subjects are shown in Table 2. Regardless of which stopping rule is chosen, r is greater than .95, which indicates high consistency between the true and estimated values of subjects’ $θ$. The bias under each stopping rule is close to 0, which demonstrates that the proposed item bank recovers the person parameters well. However, MAD and RMSE both exceed 0.5 under the rules of SE ≤ 0.5 and SE ≤ 0.6, which suggests that these two stopping rules may not provide sufficient measurement accuracy for CAT-ED.

Table 2.

Recovery of ED Trait Estimation Under Each Stopping Rule in the Simulation Study.

Stopping rule	r	Bias	MAD	RMSE
SE ≤ 0.2	.993**	0.032	0.207	0.271
SE ≤ 0.3	.988**	0.008	0.280	0.357
SE ≤ 0.4	.979**	0.011	0.386	0.494
SE ≤ 0.5	.966**	0.001	0.506	0.650
SE ≤ 0.6	.956**	0.012	0.631	0.790

Note. The r stands for Pearson product-moment correlation coefficient. MAD = mean absolute deviation; RMSE = root mean squared error.

** Significant at the .01 level.
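The recovery indices reported in Table 2 can be computed as in the following minimal Python sketch. It assumes the usual definitions (bias as the mean signed error, MAD as the mean absolute error, RMSE as the root mean squared error); the function name and the toy values are illustrative and are not taken from the study.

```python
from math import sqrt


def recovery_indices(true_theta, est_theta):
    """Return (bias, MAD, RMSE) of estimated versus true trait scores."""
    errors = [est - true for true, est in zip(true_theta, est_theta)]
    n = len(errors)
    bias = sum(errors) / n                      # mean signed error
    mad = sum(abs(e) for e in errors) / n       # mean absolute deviation
    rmse = sqrt(sum(e * e for e in errors) / n) # root mean squared error
    return bias, mad, rmse


# Toy true and estimated trait scores (invented values).
true_theta = [0.0, 1.0, -1.0, 0.5]
est_theta = [0.1, 0.8, -1.1, 0.7]
bias, mad, rmse = recovery_indices(true_theta, est_theta)
```

A bias near 0 with a large MAD or RMSE, as under the looser stopping rules, means that errors cancel on average but individual estimates are still imprecise.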

Results of the post-hoc simulation study

The number of items used, the standard error of measurement and the Pearson product-moment correlation coefficient (r) under each stopping rule are displayed in Table 3. As shown in Table 3, the average number of items used was reduced by more than 60% regardless of which stopping rule was chosen, which shows the great advantage of CAT in improving test efficiency, and the mean SE under each stopping rule was acceptable. Although the subjects were administered only a small part of the item bank, there were still high correlations between the CAT $θ$ under each stopping rule and the full-test $θ$. In addition, we calculated the test information and the number of administered items at different trait levels under different stopping rules, as shown in Figure 1. The figure shows that CAT-ED provided more information in the middle and on the right side of the estimated latent ED score range. For example, under the stopping rule of SE ≤ 0.3, approximately 20 items are administered to most participants with middle or high trait scores, and the value of information nearly exceeds 50. These results further indicate that CAT-ED not only has high measurement accuracy but also substantially improves test efficiency.

Table 3.

Psychometric Properties of CAT-ED Item Bank in the Post-hoc Simulation Study.

Stopping rule	Mean items used	Average saving ratio	SD of items used	Mean SE (θ)	r
SE ≤ 0.2	29	0.62	18.71	0.13	.98**
SE ≤ 0.3	11	0.86	10.01	0.19	.94**
SE ≤ 0.4	6	0.92	4.72	0.24	.91**
SE ≤ 0.5	4	0.95	2.34	0.29	.90**
SE ≤ 0.6	3	0.96	1.42	0.35	.86**

Note. The r stands for the Pearson product-moment correlation coefficient between subjects’ estimated latent trait levels from the CAT $θ$ under each stopping rule and the full-test $θ$.

** Significant at the .01 level.
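The average saving ratios in Table 3 appear to follow from the mean number of items used and the size of the final item bank. The Python sketch below assumes the ratio is defined as 1 minus the mean number of administered items divided by the 77 items in the bank; this definition is an assumption, although it reproduces the tabled values from the rounded means.

```python
BANK_SIZE = 77  # items in the final CAT-ED item bank

# Mean number of items used under each stopping rule (from Table 3).
mean_items_used = {
    "SE <= 0.2": 29,
    "SE <= 0.3": 11,
    "SE <= 0.4": 6,
    "SE <= 0.5": 4,
    "SE <= 0.6": 3,
}

# Assumed definition: proportion of the bank that did not need to be administered.
saving_ratio = {
    rule: round(1 - used / BANK_SIZE, 2) for rule, used in mean_items_used.items()
}
```

Under this definition, even the strictest rule (SE ≤ 0.2) saves about 62% of the bank, consistent with the reduction of more than 60% noted in the text.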

Figure 1.

Number of administered items and test information at different trait levels under different stopping rules.

Performance Evaluation of CAT-ED

Accuracy of trait estimates

For the CAT-ED item bank, the conditional SEM at different trait levels under each stopping rule is shown in Figure 2. According to Ferrando (2003), an SEM of 0.5 (which corresponds to a reliability of 0.80) can be considered the maximum error acceptable for practical purposes; by this standard, Figure 2 indicates that the conditional SEM under each stopping rule is acceptable. Additionally, the full CAT-ED item bank shows the highest accuracy, and SE ≤ 0.6 shows the lowest. Accuracy decreases as the stopping rule becomes looser, because the stricter the stopping rule, the more items a subject has to answer. The conditional SEM is smallest when the subject’s trait score is close to 1.5, which means that the CAT-ED item bank measures most accurately at a trait score of about 1.5.
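The correspondence between an SEM of 0.5 and a reliability of 0.80 can be reproduced under one common convention, assuming the latent trait is scaled to unit variance and reliability is taken as the ratio of trait variance to trait-plus-error variance. This is an assumption about the convention in use, not a formula quoted from Ferrando (2003).

```latex
\rho(\theta) = \frac{\sigma_\theta^2}{\sigma_\theta^2 + \mathrm{SEM}(\theta)^2}
\quad\Rightarrow\quad
\rho = \frac{1}{1 + 0.5^2} = 0.80 \qquad (\sigma_\theta^2 = 1)
```

Under the same convention, a stopping rule of SE ≤ 0.3 would correspond to a reliability of roughly $1/(1 + 0.09) \approx 0.92$.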

Figure 2.

The conditional SEM at different trait levels under each stopping rule.

Concurrent criterion-related validity

The Pearson product-moment correlation coefficients between the CAT-ED estimates under each chosen stopping rule and the SCOFF questionnaire are all higher than 0.6 (p < .001), which indicates that the concurrent criterion-related validity of CAT-ED is acceptable. Detailed results can be found in Table S4 in the Supporting Information.

Predictive utility (sensitivity and specificity)

The results of the ROC analysis are listed in Table 4. As shown, CAT-ED performs well in the diagnostic accuracy of ED. Under every stopping rule, the specificity and sensitivity of CAT-ED are within a reasonable range. The AUC values under all stopping rules are also higher than 0.7, which is widely used as the lower bound for moderate predictive utility (Forkmann et al., 2013). Thus, all stopping rules in this research are acceptable in predictive utility. Moreover, the AUC remains above 0.8 under the stopping rule of SE ≤ 0.3. These results also suggest that although the test length has been greatly shortened, prediction precision drops only slightly. Given space constraints, only the AUC, sensitivity and specificity are reported here; the ROC curves are shown in Figure S1 in the Supporting Information.

Table 4.

The Diagnostic Effects of CAT-ED for the Post-hoc Simulation Study.

Stopping rule	AUC (95% CI)	Specificity	Sensitivity
The whole item bank	0.837 (0.813–0.861)	0.752	0.750
SE ≤ 0.2	0.837 (0.814–0.861)	0.768	0.741
SE ≤ 0.3	0.825 (0.800–0.849)	0.742	0.731
SE ≤ 0.4	0.822 (0.797–0.847)	0.715	0.784
SE ≤ 0.5	0.814 (0.788–0.839)	0.727	0.731
SE ≤ 0.6	0.815 (0.789–0.840)	0.768	0.701

Discussion

In the past, researchers developed various self-report scales for measuring ED; however, these scales were constructed under classical test theory (CTT) and therefore inevitably have some drawbacks. A recent approach is to reanalyze the psychometric characteristics of established fixed-length instruments using IRT and convert them to a computerized adaptive version (Childs et al., 2000). In this study, the authors adopted IRT to construct and calibrate the item bank of CAT-ED and conducted a series of simulation studies to explore its performance using a Chinese sample. Additionally, the theoretical and technical details of the development and validation of CAT-ED are described in full. The final results showed that CAT-ED had acceptable psychometric characteristics in the following respects: all remaining items were consistent with the assumption of unidimensionality; there was no strong local dependence among the items; the discrimination parameters of all items were above 0.5; no DIF was found in the item bank; the item bank showed a high degree of concurrent criterion-related validity (e.g., r = .707 and .688 when SE ≤ 0.2 and SE ≤ 0.3, respectively); and the diagnostic accuracy of CAT-ED (AUC = 0.837 for the full test) was good. In addition, the authors concluded that SE ≤ 0.3 was the optimal stopping rule after jointly considering the test length and measurement accuracy of CAT-ED under the different stopping rules.

Previous studies have shown that, compared with scales developed under CTT, CAT can significantly improve measurement accuracy (Gibbons et al., 2012; Simms et al., 2011), which was also confirmed in this study. Additionally, subjects’ ED trait levels can be estimated quickly and with little loss of measurement accuracy even when they answer only a small number of items (such as 6 items), which is nearly impossible under the framework of CTT. The main contributions of this work are as follows: (a) converting existing ED measurement tools frequently applied in China to a CAT version for the first time and developing a well-calibrated item bank for CAT-ED; (b) establishing the theoretical and technical premises for potential practitioners to use computerized adaptive testing to measure ED in a clinical environment; and (c) offering a new perspective for ED measurement that improves both measurement accuracy and test efficiency.

Although the results of this research are encouraging, there are several limitations, which the authors believe point to directions for follow-up research. First, because the MFI item selection strategy was used in both simulation studies, some well-validated items may have had a very low usage rate during administration. From an economic point of view, this is unlikely to be an efficient use of the cost of developing an item bank. Therefore, future work should adopt a better-optimized item selection strategy that balances the utilization rate of the items while preserving measurement accuracy. Second, based on the preset screening criteria, only 31 subjects were identified as having EDs, which is only approximately 3% of the sample. An epidemiologic study of ED in China reported prevalence rates of 1.05% for AN, 2.98% for BN, and 3.53% for BED among female university students in Wuhan, China (Tong et al., 2014). Compared with these rates, the percentage of patients with EDs in this study seems relatively small, which may have skewed the reported sensitivity and specificity. Therefore, to ensure the predictive performance of the proposed item bank, future work should re-examine the sensitivity and specificity of CAT-ED through a more comprehensive investigation. Finally, we evaluated the performance of CAT-ED only in a simulated CAT environment rather than in a real CAT administration. To evaluate the real effects of CAT-ED in the clinical environment in detail, it is necessary to establish a real CAT system and administer it to ED patients.
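The first limitation above can be illustrated with a minimal Python sketch of maximum Fisher information (MFI) selection under the graded response model. The item information formula is the standard one for the GRM; the function names and the two-item bank with its parameter values are hypothetical and are not taken from the calibrated CAT-ED bank.

```python
from math import exp


def grm_item_information(theta, a, thresholds):
    """Fisher information of one graded response model item at theta.

    a: discrimination; thresholds: increasing boundary locations b_1 < ... < b_{K-1}.
    """
    # Cumulative probabilities P*(X >= k), padded with 1 and 0 at the ends.
    cum = [1.0] + [1.0 / (1.0 + exp(-a * (theta - b))) for b in thresholds] + [0.0]
    info = 0.0
    for k in range(len(cum) - 1):
        p_k = cum[k] - cum[k + 1]  # probability of category k
        # Derivative of the category probability with respect to theta.
        dp_k = a * (cum[k] * (1 - cum[k]) - cum[k + 1] * (1 - cum[k + 1]))
        if p_k > 0:
            info += dp_k ** 2 / p_k
    return info


def select_next_item(theta, bank, administered):
    """MFI rule: pick the unused item that is most informative at theta."""
    candidates = [name for name in bank if name not in administered]
    return max(candidates, key=lambda name: grm_item_information(theta, *bank[name]))


# Hypothetical two-item bank: name -> (discrimination, thresholds). Invented values.
bank = {
    "high_a": (2.5, [-0.5, 0.5, 1.5]),
    "low_a": (0.8, [-0.5, 0.5, 1.5]),
}
first = select_next_item(0.5, bank, administered=set())
second = select_next_item(0.5, bank, administered={first})
```

Because information grows with the square of the discrimination parameter, MFI tends to pick the most discriminating items first at almost every trait level, so less discriminating items are reached only when the stopping rule has not yet been met. This is one way the uneven item usage described above can arise.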

Conclusion

The results of the data analysis suggest that CAT-ED performs well. CAT-ED meets the requirements of CAT development under IRT and performs well in accuracy, reliability, validity, sensitivity and specificity. CAT-ED can be used as an effective tool for measuring individuals’ ED traits, and it promises to offer a new perspective for measuring ED traits with psychological scales.

Supplemental Material

sj-docx-1-sgo-10.1177_21582440221141273 – Supplemental material for Developing an Item Bank of Computerized Adaptive Testing for Eating Disorders in Chinese University Students

Supplemental material, sj-docx-1-sgo-10.1177_21582440221141273 for Developing an Item Bank of Computerized Adaptive Testing for Eating Disorders in Chinese University Students by Kai Liu, Longfei Zhang, Dongbo Tu and Yan Cai in SAGE Open

Footnotes

Acknowledgements

We thank all the subjects for participating in this research. Sincere gratitude goes to YZ, FH and MY for assistance in data collection.

Declaration of Conflicting Interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by the National Natural Science Foundation of China (Grant No. 62167004 and 32160203).

Ethics Statement

This work was approved by the Research Center of Mental Health, Jiangxi Normal University. The project name is Psychometrics Studies on Mental Health. The project code is HM20200150030.

ORCID iD

Kai Liu

Supplemental Material

Supplemental material for this article is available online.

References

Akaike (1974). A new look at the statistical model identification. IEEE Transactions on Automatic Control, 19(6), 716–723. https://doi.org/10.1109/TAC.1974.1100705

Andrich (1996). A hyperbolic cosine latent trait model for unfolding polytomous responses: Reconciling Thurstone and Likert methodologies. British Journal of Mathematical and Statistical Psychology, 49(2), 347–365. https://doi.org/10.1111/j.2044-8317.1996.tb01093.x

Baker, F. B. (1992). Item response theory: Parameter estimation techniques. Marcel Dekker.

Baker, F. B., & Kim, S. H. (2004). Item response theory: Parameter estimation techniques (2nd ed.). Marcel Dekker.

Bock, R. D., & Mislevy, R. J. (1982). Adaptive EAP estimation of ability in a microcomputer environment. Applied Psychological Measurement, 6(4), 431–444. https://doi.org/10.1177/014662168200600405

Cai (2017). flexMIRT® (Version 3.51): Flexible multilevel multidimensional item analysis and test scoring [Computer software]. Vector Psychometric Group.

Chalmers, R. P. (2012). mirt: A multidimensional item response theory package for the R environment. Journal of Statistical Software, 48(6), 1–29. https://doi.org/10.18637/jss.v048.i06

Chen, S.-K., & Cook, K. F. (2009). SIMPOLYCAT: An SAS program for conducting CAT simulation based on polytomous IRT models. Behavior Research Methods, 41(2), 499–506. https://doi.org/10.3758/brm.41.2.499

Childs, R. A., Dahlstrom, W. G., Kemp, S. M., & Panter, A. T. (2000). Item Response Theory in personality assessment: A demonstration using the MMPI-2 Depression Scale. Assessment, 7(1), 37–54. https://doi.org/10.1177/107319110000700103

Choi, S. W., Gibbons, L. E., & Crane, P. K. (2011). Lordif: An R package for detecting differential item functioning using iterative hybrid ordinal logistic regression/item response theory and Monte Carlo simulations. Journal of Statistical Software, 39(8), 1–30. https://doi.org/10.18637/jss.v039.i08

Chowdhury & Lask (2001). Clinical implications of brain imaging in eating disorders. Psychiatric Clinics of North America, 24(2), 227–234. https://doi.org/10.1016/s0193-953x(05)70219-7

Cohen (1988). Statistical power analysis for the behavioral science. Technometrics, 31(4), 499–500. https://doi.org/10.1080/00401706.1989.10488618

Erskine, H. E., Whiteford, H. A., & Pike, K. M. (2016). The global burden of eating disorders. Current Opinion in Psychiatry, 29(6), 346–353. https://doi.org/10.1097/YCO.0000000000000276

Fairburn, C. G., & Beglin, S. J. (1994). Assessment of eating disorders: Interview or self-report questionnaire? International Journal of Eating Disorders, 16(4), 363–370. https://doi.org/10.1002/1098108x(199412)16:4<363::aideat2260160405>3.0.co;2-#

Ferrando, P. J. (2003). The accuracy of the E, N and P trait estimates: An empirical study using the EPQ-R. Personality and Individual Differences, 34(4), 665–679. https://doi.org/10.1016/S0191-8869(02)00053-3

Flens, Smits, Terwee, C. B., Dekker, Huijbrechts, & de Beurs (2017). Development of a computer adaptive test for depression based on the Dutch-Flemish version of the PROMIS Item Bank. Evaluation & the Health Professions, 40(1), 79–105. https://doi.org/10.1177/0163278716684168

Forkmann, Kroehne, Wirtz, Norra, Baumeister, Gauggel, Elhan, A. H., Tennant, & Boecker (2013). Adaptive screening for depression — Recalibration of an item bank for the assessment of depression in persons with mental and somatic diseases and evaluation in a simulated computer-adaptive test environment. Journal of Psychosomatic Research, 75(5), 437–443.

Garner, D. M. (2004). Eating disorder inventory-3: Professional manual. Psychological Assessment Resources.

Garner, D. M., Olmstead, M. P., & Polivy (1983). Development and validation of a multidimensional eating disorder inventory for anorexia nervosa and bulimia. International Journal of Eating Disorders, 2(2), 15–34. https://doi.org/10.1002/1098-108x(198321)2:2<15::aid-eat2260020203>3.0.co;2-6

Garner, D. M., Olmsted, M. P., Bohr, & Garfinkel, P. E. (1982). The Eating Attitudes Test: Psychometric features and clinical correlates. Psychological Medicine, 12(4), 871–878. https://doi.org/10.1017/S0033291700049163

Gaynes, B. N., Burns, B. J., Tweed, D. L., & Erickson (2002). Depression and health-related quality of life. The Journal of Nervous and Mental Disease, 190(12), 799–806. https://doi.org/10.1097/01.NMD.0000041956.05334.07

Gibbons, R. D., Weiss, D. J., Pilkonis, P. A., Frank, Moore, Kim, J. B., & Kupfer, D. J. (2012). Development of a computerized adaptive test for depression. Archives of General Psychiatry, 69(11), 1104–1112. https://doi.org/10.1001/archgenpsychiatry.2012.14

Gibbons, R. D., Weiss, D. J., Pilkonis, P. A., Frank, Moore, Kim, J. B., & Kupfer, D. J. (2014). Development of the CAT-ANX: A computerized adaptive test for anxiety. American Journal of Psychiatry, 171(2), 187–194. https://doi.org/10.1176/appi.ajp.2013.13020178

Hambleton, R. K., Swaminathan, & Rogers, H. J. (1991). Fundamentals of item response theory. SAGE.

Hand, D. J. (2010). Evaluating diagnostic tests: The area under the ROC curve and the balance of errors. Statistics in Medicine, 29(14), 1502–1510. https://doi.org/10.1002/sim.3859

Irwin, D. E., Stucky, B. D., Langer, M. M., Thissen, DeWitt, E. M., Lai, J. S., Yeatts, K. B., Varni, J. W., & DeWalt, D. A. (2012). PROMIS Pediatric Anger Scale: An item response theory analysis. Quality of Life Research, 21(4), 697–706. https://doi.org/10.1007/s11136-011-9969-5

Keane, Clarke, McGrath, Farrelly, & MacHale (2017). Eating Disorder Examination Questionnaire (EDE-Q): Norms for female university students attending a university primary health care service in Ireland. Irish Journal of Psychological Medicine, 34(1), 7–11. https://doi.org/10.1017/ipm.2015.35

Köhler & Hartig (2017). Practical significance of item misfit in educational assessments. Applied Psychological Measurement, 41, 388–400. https://doi.org/10.1177/0146621617692978

Kolen, M. J., Zeng, & Hanson, B. A. (1996). Conditional standard errors of measurement for scale scores using IRT. Journal of Educational Measurement, 33, 129–140. https://doi.org/10.1111/j.1745-3984.1996.tb00485.x

Leung & Mak (2003). Disordered eating attitudes and behaviors among Chinese adolescent boys in Hong Kong. Journal of Psychosomatic Research, 55(2), 148. https://doi.org/10.1016/s0022-3999(03)00155-7

Liu & Cai (2020). Development and validation of an item bank for drug dependence measurement using computer adaptive testing. Substance Use & Misuse, 55(14), 2291–2304. https://doi.org/10.1080/10826084.2020.1801743

Lusted, L. B. (1960). Logical analysis in Roentgen diagnosis. Radiology, 74(2), 178–193. https://doi.org/10.1148/74.2.178

Magis & Barrada, J. R. (2017). Computerized adaptive testing with R: Recent updates of the package catR. Journal of Statistical Software, 76(1), 1–19. https://doi.org/10.18637/jss.v076.c01

Marzilli, Cerniglia, & Cimino (2018). A narrative review of binge eating disorder in adolescence: Prevalence, impact, and psychological treatment strategies. Adolescent Health Medicine and Therapeutics, 9(9), 17–30. https://doi.org/10.2147/ahmt.s148050

Masters, G. N. (1982). A Rasch model for partial credit scoring. Psychometrika, 47(2), 149–174. https://doi.org/10.1007/BF02296272

Mitchison & Mond (2015). Epidemiology of eating disorders, eating disordered behaviour, and body image disturbance in males: A narrative review. Journal of Eating Disorders, 3(1), 20. https://doi.org/10.1186/s40337-015-0058-y

Morgan, J. F., Reid, & Lacey, J. H. (2000). The SCOFF questionnaire: Assessment of a new screening tool for eating disorders. BMJ Clinical Research, 319(7223), 1467–1468. https://doi.org/10.1136/ewjm.172.3.164

Muraki (1992). A generalized partial credit model: Application of an EM algorithm. Applied Psychological Measurement, 16(2), 159–176. https://doi.org/10.1177/014662169201600206

Orlando & Thissen (2003). Further investigation of the performance of S-X²: An item fit index for use with dichotomous item response theory models. Applied Psychological Measurement, 27, 289–298. https://doi.org/10.1177/0146621603027004004

Reckase, M. D. (1979). Unifactor latent trait models applied to multifactor tests: Results and implications. Journal of Educational Statistics, 4(3), 207–230. https://doi.org/10.3102/10769986004003207

Reckase, M. D. (2009). Multidimensional item response theory. Springer.

Reeve, B. B., Hays, R. D., Bjorner, J. B., Cook, K. F., Crane, P. K., Teresi, J. A., Thissen, Revicki, D. A., Weiss, D. J., Hambleton, R. K., Liu, Gershon, Reise, S. P., Lai, J. S., & Cella (2007). Psychometric Evaluation and calibration of health-related quality of life item banks: Plans for the patient-reported Outcomes Measurement Information System (PROMIS). Medical Care, 45(5 Suppl 1), S22–S31. https://doi.org/10.1097/01.mlr.0000250483.85507

Rosen, J. C. (1996). Body image assessment and treatment in controlled studies of eating disorders. International Journal of Eating Disorders, 20(4), 331–343. https://doi.org/10.1002/(sici)1098-108x(199612)20:4<331::aid-eat1>3.0.co;2-o

Samejima (1970). Estimation of latent ability using a response pattern of graded scores. Psychometrika, 35(S1), 1–97. https://doi.org/10.1007/bf03372160

Schaefer, L. M., Engel, S. G., & Wonderlich, S. A. (2020). Ecological momentary assessment in eating disorders research: Recent findings and promising new directions. Current Opinion in Psychiatry, 33(6), 528–533. https://doi.org/10.1097/YCO.0000000000000639

Schwarz (1978). Estimating the dimension of a model. Annals of Statistics, 6(2), 461–464. https://doi.org/10.1214/aos/1176344136

Simms, L. J., Goldberg, L. R., Roberts, J. E., Watson, Welte, & Rotterman, J. H. (2011). Computerized adaptive assessment of personality disorder: Introducing the CAT-PD project. Journal of Personality Assessment, 93(4), 380–389. https://doi.org/10.1080/00223891.2011.577475

Smyth, Wonderlich, Crosby, Miltenberger, Mitchell, & Rorty (2001). The use of ecological momentary assessment approaches in eating disorder research. International Journal of Eating Disorders, 30(1), 83–95. https://doi.org/10.1002/eat.1057

Spiegelhalter, D. J., Best, N. G., & Carlin, B. P. (1998). Bayesian deviance, the effective number of parameters, and the comparison of arbitrarily complex models. Research Report 98-009.

Tan, Cai, & Zhang (2018). Development and validation of an Item Bank for depression screening in the Chinese population using computer adaptive testing: A Simulation Study. Frontiers in Psychology, 9, 1225. https://doi.org/10.3389/fpsyg.2018.01225

Thomas, J. J., Lee, & Becker, A. E. (2016). Updates in the epidemiology of eating disorders in Asia and the Pacific. Current Opinion in Psychiatry, 29(6), 354–362. https://doi.org/10.1097/YCO.0000000000000288

Tong, Miao, Wang, Yang, Lai, Zhang, & Hsu, L. K. (2014). A two-stage epidemiologic study on prevalence of eating disorders in female university students in Wuhan, China. Social Psychiatry, 49(3), 499–505. https://doi.org/10.1007/s00127-013-0694-y

Treasure, Claudino, A. M., & Zucker (2010). Eating disorders. Lancet, 375(9714), 583–593. https://doi.org/10.1016/S0140-6736(09)61748-7

Udo & Grilo, C. M. (2018). Prevalence and correlates of DSM-5–Defined eating disorders in a nationally representative sample of U.S. Adults. Biological Psychiatry, 84(5), 345–354. https://doi.org/10.1016/j.biopsych.2018.03.014

van der Linden, W. J., & Hambleton, R. K. (1997). Handbook of modern item response theory. Springer.

Wainer (2000). Computerized adaptive testing: A primer (2nd ed.). Lawrence Erlbaum Associates Inc.

Cai (2019). A computerized adaptive testing advancing the measurement of subjective well-being. Journal of Pacific Rim Psychology, 13, e6. https://doi.org/10.1017/prp.2019.6

Yang, Turki, Tan, Wei, & Chang (2018). Weighted gene co-expression network analysis reveals dysregulation of mitochondrial oxidative phosphorylation in eating disorders. Genes, 9(7), 325. https://doi.org/10.3390/genes9070325

Yao (2013). Comparing the performance of five multidimensional CAT selection procedures with different stopping rules. Applied Psychological Measurement, 37(1), 3–23. https://doi.org/10.1177/0146621612455687

Yen, W. M. (1984). Effects of local item dependence on the fit and equating performance of the three-parameter logistic model. Applied Psychological Measurement, 8(2), 125–145. https://doi.org/10.1177/014662168400800201

Zhang, Wang, Gao, & Cai (2019). Development of a computerized adaptive testing for Internet addiction. Frontiers in Psychology, 10, 1010. https://doi.org/10.3389/fpsyg.2019.01010
