Abstract
Successful exams require a balance of easy, medium, and difficult questions. Question difficulty is generally either estimated by an expert or determined after an exam is taken. The latter provides no utility for the generation of new questions and the former is expensive both in terms of time and cost. Additionally, it is not known whether expert prediction is indeed a good proxy for estimating question difficulty.
In this paper, we analyse and compare two ontology-based measures for predicting the difficulty of multiple choice questions, and we evaluate each measure, as well as prediction by 15 experts, against the exam performance of 12 residents on a corpus of 231 medical case-based questions in multiple choice format. We find one ontology-based measure (relation strength indicativeness) to be of comparable performance (accuracy = 47%) to expert prediction (average accuracy = 49%).
Keywords
Introduction
Multiple choice question (MCQ) examinations are widely used to assess the knowledge and skills of students and the quality of teaching instruments. Using good-quality questions is essential for achieving these purposes. Several criteria exist for measuring question quality, as discussed in [9,10,53]. Good quality questions need to be, among other things, 1) valid (i.e., they measure what they are supposed to measure); 2) discriminating (i.e., they discriminate between high- and low-information students); 3) fair (i.e., their results are not biased in favour of a subgroup within the cohort); and 4) of appropriate difficulty. Difficulty of MCQs is usually either estimated by experts or determined statistically after the exam has been administered (this is based on a recent systematic review on difficulty prediction of assessment questions [27]).
The difficulty criterion is of particular importance owing to its effect on the other quality criteria. Knowledge about the difficulty level and the sources of difficulty in questions provides insights into whether the other quality criteria are satisfied. With regards to validity, being able to answer the question ‘what makes a particular question easy or difficult?’ is an important step towards understanding ‘what does the question measure?’ For example, questions that are difficult due to their linguistic complexity are usually not valid in tests other than language tests, because it is not clear whether students’ failure to answer them stems from the language factor or from a lack of the knowledge or skills of interest. In addition, inappropriately difficult or easy questions tend to discriminate poorly because either almost none of the students or almost all of them answer correctly. Finally, the difficulty level of questions is a major determinant of the fairness of exams, especially when different exam forms are used (equally difficult forms are needed) or when question selection is allowed (equally difficult questions are needed).
While information about the difficulty of questions is essential for designing exams, percentage correct can only be retrospectively determined. Traditional means of estimating difficulty are by obtaining it from previous administrations of the questions, if previous statistics are available, or by relying on experts’ estimation, which is usually the case in small-scale exams.
With recent advances in automated procedures for generating questions [2,3,5,13,19,30,39], the need for automated, scalable estimation of question difficulty has grown accordingly. (For an overview of the field of automatic question generation, the reader is referred to the systematic reviews reported in [4,28].)
The majority of existing automatic difficulty prediction models are machine-learning-based approaches [see, for example, 11,34,44,47] that have merely been used for finding correlations in existing data as opposed to prediction. Existing cross-validated models that have been developed for prediction [8,16,23] are highly domain-specific, which limits their utility. In prior work [5,30], however, we developed two ontology-derived measures that are based on a domain-independent model of difficulty.
Since the aforementioned ontology-based measures have neither been evaluated thoroughly nor compared to each other in a systematic way, we continue the work carried out in [5,30] by evaluating the performance of the previously proposed measures. Specifically, we extend our work by collecting data about student performance on a set of auto-generated questions that were validated in [30]. The data about student performance are used as a gold standard against which expert prediction (available from [30]) and the predictions of the ontology-based measures are compared. This allows us to validate our measures and determine whether they are suitable replacements for expert estimation when constructing exams.
This paper aims to address the following research questions:
How accurate is expert prediction of difficulty against student performance?
How well do experts perform in comparison to guessing?
How accurate are automatic difficulty prediction (ADP) methods against student performance?
How well does each method perform in comparison to guessing?
How well does each method perform in comparison to the other method? and
How well does each method perform in comparison to domain experts?
We collected difficulty information for 231 questions through a study involving 15 medical experts and a cohort of 12 residents. Similar to studies conducted on other domains [25,49] we found that the difficulty of case-based MCQs was moderately predicted by domain experts (average accuracy = 49%). We also found the automated measure proposed in [30] to be of comparable performance to experts (accuracy = 47%) and to represent an economical alternative.
The main contributions of this paper are:
User studies in the medical domain investigating the predictive performance of domain experts and automated ontology-based measures;
A detailed analysis of the performance of ontology-based measures for difficulty prediction that shows, by example, the minimum set of criteria that need to be considered in evaluating the performance of similar measures; and
A fairly large question set (231 questions, of which 92 were answered by at least 10 participants), annotated with percentage correct and expert prediction, that can be used for testing the performance of new approaches to difficulty prediction (available at: https://github.com/grkurdi/A-Comparative-Study-of-Methods-for-a-Priori-Prediction-of-MCQ-Difficulty-dataset).
Multiple choice questions
MCQs consist of two components:
The stem, which presents the problem or scenario; and
The options, which comprise the correct answer (the key) and a number of incorrect answers (the distractors).
Writing high-quality MCQs is known to be challenging and expensive. The challenges faced by exam developers are apparent from the low quality of MCQ examinations as indicated by several studies investigating their quality. For example, the authors of [20,33,37,38,42] found more than 50% of investigated MCQs to contain at least one item-writing flaw, i.e., a violation of best practices suggested in MCQ-writing guidelines. A related quality indicator is the proportion of functional distractors, those selected by at least 5% of examinees [21,36,48].
Given the challenges faced by test developers in constructing high-quality MCQs, automated approaches for question generation have come into play. Ontologies have been increasingly used, in research contexts, as a source for automatic generation of questions [3,5,13,30,39]. We attribute their increased use to the following reasons. The first reason is the availability of ontologies with potential educational value. These ontologies contain exact facts and represent domains of interest precisely and non-ambiguously in a machine-processable way. Besides that, ontologies are supported by standard reasoning services, and the development of further supporting tools and services is an active research area. Another reason is that, compared to texts, the process of finding good distractors is easier. As an example, consider the question ‘Which city is located in the UK?’, generated from a Wikipedia article (Wikipedia has been used as a source for question generation by [18,31,46]); with an ontology, plausible distractors, such as cities that are not located in the UK, can be retrieved systematically.
One point worth mentioning is that underlying difficulty models are not part of most existing question generation approaches. According to Alsubait [4], apart from the similarity-based approach (outlined in Section 3.1), only two question generation approaches [26,54] take generating questions with controlled difficulty into account, but without providing an experimental evaluation of the performance of difficulty prediction. The automatic measures compared in this paper represent the existing, non-domain-specific measures of MCQ difficulty. Other measures are either variants of the similarity approach [50], designed for questions with other response formats, or domain- or question-specific [8,16,51,54].
One of the limitations of current question generation approaches is the simplicity of the generated questions in terms of cognitive level, i.e., the mental process involved in question-solving as described in Bloom’s taxonomy [7], a popular classification of cognitive levels.
The target of difficulty prediction is to assign difficulty levels (easy, medium, or difficult), as derived from percentage correct (to be discussed in Section 4.2.4). The two ontology-based measures compared are described in this section.
Similarity-based measure
A plausible prediction model was proposed in [5], in which the similarity between the key and the distractors was suggested as an indicator of MCQ difficulty. Increasing the similarity between the key and distractors results in increasing the difficulty of MCQs. The rationale is that more knowledge is required to differentiate between key and similar distractors. As an example, consider the following question (Q1) taken from [4]. The most similar distractor to the key, and the most difficult to eliminate, is the option ‘the tongue’ since this option shares with the key the feature of being a body part. On the other hand, elimination of the options ‘disease’ and ‘glossitis’ is easier since they do not have shared features with the key.
the tongue
glossitis
the gums
a disease
To control the difficulty of questions, Alsubait et al. [5] developed a similarity measure that is based on Jaccard similarity [24] and intended to be used with ontologies. The similarity measure is defined as follows:
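Given that the measure is described as Jaccard-based and computed over subsumers, it presumably takes the following form, where Subsumers(X) denotes the set of subsumers of a concept X retrieved from the ontology (the exact notation in [5] may differ):

```latex
\mathrm{sim}(A, B) = \frac{\lvert \mathrm{Subsumers}(A) \cap \mathrm{Subsumers}(B) \rvert}{\lvert \mathrm{Subsumers}(A) \cup \mathrm{Subsumers}(B) \rvert}
```

Under this reading, similarity ranges from 0 (no shared subsumers) to 1 (identical subsumer sets), so distractors with higher similarity to the key are predicted to yield more difficult questions.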
Different variants of the similarity measure were proposed, each using a different set of subsumers (subsumers are retrieved using the OWL API [22]):
Atomic similarity, in which only atomic subsumers of the two concepts are considered; and
Sub-similarity, in which both atomic and complex (i.e., sub-expression) subsumers of the two concepts are considered.
Preliminary studies showed that the similarity measure has good difficulty prediction [4,5]. In the absence of other domain-independent measures that are empirically supported, the similarity measure is considered the gold standard for automatic difficulty prediction. However, one of the limitations of this measure is that it does not take into account the contribution of the stem to the difficulty of questions. While this did not represent a problem for questions with simple stems (e.g. ‘What is X?’ where X is a concept name), we believe that the stem is a major influence on the difficulty of case-based questions, which are characterised by stems that contain multiple terms (i.e. multi-term questions). In addition, the similarity measure is developed on the assumption that all relational axioms have the same strength (i.e. a disease is either associated or not associated with a clinical finding). However, this is not always the case, especially in the medical domain, where relations between diseases and clinical findings vary in strength (a finding may be strongly or only weakly indicative of a disease).
Relation strength indicativeness measure
The relation strength indicativeness (RSI) measure estimates difficulty by combining several calculations that exploit the relational axioms of an ontology, along with their strength rankings. Relational axioms here are of the form A ⊑ ∃R.B; the corresponding Manchester OWL syntax is: A SubClassOf R some B.
The proposed difficulty measure targets more complex types of questions, such as Q2 below, compared to simple questions such as Q1. The two main calculations RSI uses involve the stem entities and the option entities of a question.
Consider the following case-based medical MCQ (Q2), similar to those generated in [30] (a simplified and modified version of a generated question is used for the sake of an uncomplicated example):
Dysmenorrhea
HIV infection
Urethritis
RSI’s primary data source is an OWL ontology representation of Elsevier’s Merged Medical Taxonomy (EMMeT), dubbed EMMeT-OWL [30,40]. EMMeT content is maintained by a group of medical experts including physicians and nurses. The maintenance includes adding and removing relationship instances as well as manually adjusting rankings on the strength of the relationship instance. This is based on evidence from Elsevier content, which includes books, journals, and First Consult/Clinical Overviews.
RSI uses the EMMeT relation

A snippet of EMMeT-OWL used to provide data for Q2, with annotations.
Since the question is asking for the most likely diagnosis, the option entities are diagnoses. (An option entity is a class name that was selected as an option for the question.)
For each option entity, RSI considers the relational axioms that connect the stem entities to that option, together with the strength rankings of those axioms. The strengths are combined into a per-option difficulty score (optDiff ), and the overall question difficulty is simply the average of the optDiff scores over all options.
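The calculation can be sketched in code. The following is a minimal illustration rather than the exact definitions from [30]: the strength table, the complement-of-mean aggregation, and the function names are assumptions; only the use of per-option relation strengths (optDiff) and the final averaging step are taken from the text.

```python
# Hypothetical sketch of an RSI-style calculation. Relation strengths
# between stem entities and each option are combined into a per-option
# difficulty (optDiff); question difficulty is the average over options.
# The strength table and the aggregation below are illustrative
# assumptions, not the definitions from the paper.

def opt_diff(stem_entities, option, strength):
    """Difficulty contributed by one option, from relation strengths.

    `strength[(entity, option)]` is assumed to be a ranking in [0, 1],
    with 1 meaning the entity is maximally indicative of the option.
    Missing pairs are treated as strength 0 (not indicative).
    """
    strengths = [strength.get((e, option), 0.0) for e in stem_entities]
    # Highly indicative relations make the option easy to judge, so
    # difficulty is taken here as the complement of the mean strength.
    return 1.0 - sum(strengths) / len(strengths)

def question_difficulty(stem_entities, options, strength):
    # Overall difficulty is simply the average of the per-option values.
    diffs = [opt_diff(stem_entities, o, strength) for o in options]
    return sum(diffs) / len(diffs)
```

For a Q2-like question with one stem entity strongly related to one option and weakly to another, the measure rewards options that the stem entities do not clearly indicate with higher difficulty.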
We demonstrate the use of RSI using Q2.
The questions studied and reviewed in [30] often use more complex stems. These include multiple types of stem entities such as: risk factors (via the
Method
To evaluate the performance of both experts and automated measures, we conducted two experiments: an expert review and a mock exam. Both experiments are described below.
Expert review
In a previous study [30], we carried out an expert review to evaluate the ontology-based approach we developed for generating medical case-based MCQs. As part of the review, experts rated the usefulness of generated questions (i.e. whether or not they are ready to use in an exam context) and predicted their difficulty. In what follows, we explain aspects of the review that are centered around expert prediction of difficulty.
Subjects
Fifteen experts were recruited to review the questions and were paid for their participation. The recruitment was conducted by our industrial partner.
Demographic characteristics of domain experts
The EMMeT-OWL ontology, which contains definitions of concepts such as diseases, clinical findings, drugs, symptoms, and risk factors, was utilised as a source for question generation. Four physician specialities (internal medicine, cardiology, orthopedics, and gastroenterology) were selected and a total of 3,407,493 case-based questions were automatically generated from these specialities. The generated questions belong to four templates: ‘What is the most likely diagnosis?’, ‘What is the most likely clinical finding?’, ‘What is the drug of choice?’, and ‘What is the differential diagnosis?’ (see Appendix B for examples). A stratified random sample of 435 questions was selected for expert review. Five stratifiers were used: speciality, question template, the number of distractors (key-distractor combinations in the case of differential diagnoses questions), the number of stem entities, and difficulty as predicted by the RSI measure. We aimed for an equal number of questions from each stratum but this was not possible due to the small number of questions in some strata. We obtained expert ratings for these 435 questions as described next.
Procedure
The expert review was conducted through a bespoke web-based questionnaire tool. Each expert reviewed approximately 30 questions belonging to their speciality. (Question options were generated such that they belong to a shared speciality, determined using an EMMeT relation.)
Each question was displayed individually and experts were asked to solve the displayed question without a time limit. To facilitate expert decision regarding question quality, the experts were shown the correct answer after answering the question and were shown an explanation of the incorrectness of the selected option(s) if they answered incorrectly. The following data about the performance of domain experts were collected:
Selected answer(s).
Score: Each single-response question answered correctly is given one mark, while an incorrect answer is given zero marks. With regards to questions with multiple responses (i.e., differential diagnosis questions), a mark for each correct answer is added to the question’s final mark, and a mark of zero is given for fully incorrect answers. (As the exam was experimental and no marks were displayed to participants, negative marking was not used. One could argue that participants could obtain the full mark on multiple-response questions, only differential diagnosis questions in our sample, by selecting all options; we ensured that this was not the case by looking for such a pattern in the responses to differential diagnosis questions.)
Time to solve: The time starts by displaying the question on the screen and ends by the expert clicking the ‘submit’ button.
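The marking scheme described above can be summarised in code; a minimal sketch (the function names are illustrative):

```python
def score_single(selected, key):
    """Single-response scoring: one mark if the selected option is the
    key, zero otherwise (no negative marking)."""
    return 1 if selected == key else 0

def score_multiple(selected, keys):
    """Multiple-response (differential diagnosis) scoring: one mark is
    added for each correct answer selected; a fully incorrect answer
    scores zero. No negative marking is applied."""
    return len(set(selected) & set(keys))
```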
After answering each question, experts were instructed to rate different aspects of the questions (e.g., usefulness, difficulty, and correctness of explanation); an explanation for each of the options was displayed at this stage of the review. Experts first classified the appropriateness of each question as one of the following:
Appropriate: The question is appropriate as a Board exam question; the level of knowledge required to answer the question is that of a resident specialist or practicing specialist.
Inappropriate/no medical knowledge needed: The question can be answered correctly by people with little to no medical knowledge, i.e., (far) below the level of the targeted exam audience.
Inappropriate/guessable: The correct answer is guessable based on syntactic clues. For example, similar words between the stem and the key can clue examinees to the correct answer.
Inappropriate/confusing: The syntax or terminology is not intelligible and/or the key does not logically follow from the question stem.
Inappropriate/other: The question is inappropriate for other reasons.
They were then asked to classify the question as belonging to one of the following difficulty levels:
Easy: More than 70% of examinees would be expected to answer the question correctly;
Medium: 30% to 70% of examinees would be expected to answer the question correctly; or
Difficult: Less than 30% of examinees would be expected to answer the question correctly.
They were also provided with an optional comment box for any additional information that they may have wanted to add. The main aim of obtaining expert prediction is to compare it with student performance. Therefore, we did not collect their predictions for questions rated as inappropriate to use in an exam context, since these questions would not be used in the mock exam.
To obtain the empirical difficulty of the selected set of questions (i.e. percentage correct), we administered the questions to a cohort of residents. Details about the cohort, the questions, and the procedure are explained next.
Subjects
To recruit residents, experts who work in universities asked for volunteers. Twelve residents, with a mean age of 32 years (standard deviation = 2.3), participated in this experiment and were paid for their participation. Participants completed a demographics questionnaire, which asked them to indicate their age, sex, and practical experience (i.e. number of years working as a practitioner). Table 2 summarises their demographic information.
Demographic characteristics of residents who took the mock exam
We used disproportional stratified random sampling, aiming for equal group proportions whenever possible, to select questions from our sample space, which consists of the auto-generated questions rated as appropriate by at least one domain expert in the expert study (345 questions). We used this sampling technique to obtain a representative sample of each group in the population, which was not possible using other sampling techniques (e.g., random sampling or proportional stratified sampling) due to the large difference in size between groups in the population. We decided to include questions rated as appropriate by at least one domain expert because one of the main reasons for disagreement on appropriateness was related to the difficulty of questions: the questions causing disagreement were found to be too easy or too difficult, and therefore inappropriate, by one of the reviewers, as suggested by their comments. Including these questions in the mock exam allows further investigation of their difficulty.
We based stratification on four stratifiers: speciality, template, difficulty as predicted by our measure, and difficulty as predicted by the domain experts. Stratifying by speciality was necessary to ensure that residents from different specialities were tested on questions covering areas they are expected to be knowledgeable about. In addition, using templates as a stratifier allowed us to investigate the applicability of the measures to different question types and to investigate whether differences in difficulty can be attributed to the intrinsic nature of the templates themselves. Finally, stratifying based on our difficulty measure and the experts’ predictions was used to allow investigation of the performance of these measures in predicting empirical difficulty.
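As a rough sketch, the sampling procedure described above can be implemented as follows; the function and parameter names are hypothetical, and the equal per-stratum target is relaxed for small strata, as in the text:

```python
import random
from collections import defaultdict

def stratified_sample(questions, stratum_of, per_stratum, seed=0):
    """Disproportional stratified sampling: draw up to `per_stratum`
    questions from each stratum, aiming for equal group sizes even when
    the strata differ widely in size. `stratum_of(q)` returns a
    question's stratum, e.g. a tuple of (speciality, template,
    predicted difficulty). Strata smaller than `per_stratum` contribute
    all of their questions."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for q in questions:
        strata[stratum_of(q)].append(q)
    sample = []
    for members in strata.values():
        sample.extend(rng.sample(members, min(per_stratum, len(members))))
    return sample
```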
The sample size for each speciality was determined considering a reasonable duration of testing (60-minute exam). This resulted in a sample of 231 questions in total to be administered to the residents involved in the experiment. The distribution of these questions is stated in Table 3. Variation in the number of questions across specialities was due to the unequal number of experts in each speciality and, therefore, the unequal number of reviewed questions. The selected questions were reviewed for linguistic issues and minimal edits were applied where necessary. For example, the stem ‘A patient with a history of acetaminophen presents with...’ was edited to read: ‘A patient who has used acetaminophen presents with ...’. This step was carried out to eliminate the effect of linguistic ambiguity on empirical difficulty.
Distribution of question sample per speciality and question type (Template 1 = What is the most likely diagnosis?, Template 2 = What is the drug of choice?, Template 3 = What is the most likely clinical finding?, and Template 4 = What is the differential diagnosis?)
A web-based system was developed to administer the questions and collect performance data. Residents agreed to complete a 60-minute mock exam using their own machines and were assigned questions belonging to their speciality, in addition to internal medicine questions (domain experts indicated that all residents are expected to have knowledge of internal medicine). The following data were collected: selected answer(s); score (calculated as in the expert review); and time to solve (measured as in the expert review).
A standard test theory analysis [12] was conducted for internal medicine questions that were administered to ten residents or more. The possible values that difficulty (percentage correct) can take and how they are interpreted is as follows:
Easy: percentage correct > 70%;
Medium: 30% ⩽ percentage correct ⩽ 70%; and
Difficult: percentage correct < 30%.
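These cut-offs map directly to a small classification function; a sketch:

```python
def difficulty_level(percentage_correct):
    """Map percentage correct to a difficulty level using the cut-offs
    above: easy > 70%, medium 30-70% (inclusive), difficult < 30%."""
    if percentage_correct > 70:
        return "easy"
    if percentage_correct >= 30:
        return "medium"
    return "difficult"
```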
The percentage correct was then compared to difficulty as predicted by the aforementioned measures. However, this type of item analysis was not possible for questions belonging to the other three specialities due to the low number of participants they had been administered to (1 to 5 residents at most).
We designed a new approach for analysing difficulty data for questions answered by fewer than ten participants. To investigate the relation between expert prediction and empirical difficulty, we grouped the questions based on expert prediction, resulting in three groups: easy, medium, and difficult questions according to the experts. We then computed the percentage correct for each group by dividing the total number of correct responses to all questions in the group by the total number of responses (correct and incorrect) to all questions in the group. One would expect the number of correct responses to difficult questions to be low and therefore the percentage correct for the difficult group to be low. A similar procedure was followed to investigate the relation between automated difficulty measures and percentage correct.
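The grouped analysis can be sketched as follows; the data layout (per-question lists of correct/incorrect responses) is an assumption for illustration:

```python
from collections import defaultdict

def grouped_percentage_correct(responses, prediction):
    """Pool responses by predicted difficulty and compute, per group,
    total correct responses / total responses (correct + incorrect).
    `responses` maps a question id to a list of booleans (one per
    resident); `prediction` maps a question id to its predicted level."""
    totals = defaultdict(lambda: [0, 0])  # level -> [correct, all]
    for qid, answers in responses.items():
        level = prediction[qid]
        totals[level][0] += sum(answers)
        totals[level][1] += len(answers)
    return {level: 100.0 * c / n for level, (c, n) in totals.items()}
```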
While studies concerned both with investigating expert ability in predicting question difficulty [25,49] and with building difficulty models [5,23] use the accuracy metric (Appendix A) for performance evaluation, we extend the evaluation by using approaches and metrics borrowed from the information retrieval and machine learning communities. The analysis was extended to include other metrics because accuracy does not reflect the performance of prediction when the distribution of classes (easy, medium, and difficult questions in our case) is not balanced. Another reason is that difficulty is an ordinal variable; it is therefore important to find how close or far away the prediction is from the empirical difficulty.
The following metrics, which are standard in classification problems, were used to compare measures for difficulty prediction: accuracy, precision, recall, F-score, and Kappa. We also used the evaluation metric, ‘average relative error’, which was used in the study reported in [23] for evaluating the performance of different machine learning models for predicting the difficulty of reading comprehension questions. We explain how we calculated these metrics in Appendix A.
Residents’ performance on the mock exam. Score is calculated as the percentage of the total possible scores
Since different performance metrics focus on different aspects of the prediction, it is essential to consider all of them, prioritising them based on the problem at hand, when comparing the performance of the different methods. That is, which metrics do we care about when different metrics give contradictory results? For example, it is often the case that a classification method has high precision but low recall, or vice versa; deciding on the superior method depends on whether precision or recall is prioritised. Our discussion of metrics is guided by the following characteristics of the problem of predicting question difficulty:
The distribution of difficulty levels is not balanced, with difficult questions being the minority class. This is apparent from the distribution of difficulty levels in the test set, in addition to the literature on MCQ examinations [for example, see: 32,35,36,42].
All of the classes are of importance, with a slight preference for good performance on difficult questions, for two reasons: difficult questions are the minority class, and appropriately difficult questions play an important role in discriminating between low- and high-information students.
As we were interested in performance across all difficulty levels, we averaged precision over the difficulty levels, thereby penalising prediction methods that perform well on only some of them. Similar calculations were performed for recall and F-score.
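As an illustration, the averaged metrics can be computed as follows. This is a minimal sketch: the macro-averaging mirrors the description above, while the ordinal ‘average relative error’ is one plausible reading of the metric used in [23], not necessarily its exact definition, and Kappa is omitted for brevity.

```python
LEVELS = ["easy", "medium", "difficult"]

def macro_metrics(actual, predicted):
    """Accuracy plus precision/recall/F-score computed per difficulty
    level and macro-averaged, so a method that does well on only some
    levels is penalised."""
    n = len(actual)
    accuracy = sum(a == p for a, p in zip(actual, predicted)) / n
    precisions, recalls, fscores = [], [], []
    for level in LEVELS:
        tp = sum(a == level and p == level for a, p in zip(actual, predicted))
        predicted_n = sum(p == level for p in predicted)
        actual_n = sum(a == level for a in actual)
        precision = tp / predicted_n if predicted_n else 0.0
        recall = tp / actual_n if actual_n else 0.0
        f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        precisions.append(precision)
        recalls.append(recall)
        fscores.append(f)
    def avg(values):
        return sum(values) / len(values)
    return accuracy, avg(precisions), avg(recalls), avg(fscores)

def average_relative_error(actual, predicted):
    """Treats difficulty as ordinal (easy=0, medium=1, difficult=2) and
    averages the distance between predicted and actual level, relative
    to the maximum possible distance."""
    rank = {level: i for i, level in enumerate(LEVELS)}
    distances = [abs(rank[a] - rank[p]) for a, p in zip(actual, predicted)]
    return sum(distances) / (len(distances) * (len(LEVELS) - 1))
```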
To answer the question of ‘whether experts and automated measures do better than random guessing?’, we compare their performance with the performance of the following three naive baselines:
Random guesser which assigns difficulty levels arbitrarily;
Weighted guesser which assigns difficulty levels according to their distribution in the test set; and
Majority class classifier which assigns the most common difficulty level in the test set (medium) to all questions.
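The three baselines are simple to implement; a sketch (function names assumed):

```python
import random
from collections import Counter

LEVELS = ["easy", "medium", "difficult"]

def random_guesser(actual, rng=None):
    """Assigns difficulty levels arbitrarily."""
    rng = rng or random.Random(0)
    return [rng.choice(LEVELS) for _ in actual]

def weighted_guesser(actual, rng=None):
    """Assigns levels according to their distribution in the test set."""
    rng = rng or random.Random(0)
    levels, weights = zip(*Counter(actual).items())
    return rng.choices(levels, weights=weights, k=len(actual))

def majority_classifier(actual):
    """Assigns the most common level in the test set to all questions."""
    majority = Counter(actual).most_common(1)[0][0]
    return [majority] * len(actual)
```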
Residents’ performance
Following the description of the difficulty levels in Section 4.2.4, 39.1% (n = 36) of the 92 internal medicine questions were easy, 44.6% (n = 41) were medium, and 16.3% (n = 15) were difficult. We consider this a good indicator of the questions’ suitability as a test set, since this distribution of difficulty levels is similar to that reported in analyses of real exams (for example, see [35,42]). Residents’ scores range from 58.49 to 77.65 with an average of 67.69 (± 5.85) (see Table 4 for details). Comparing these results to those achieved by domain experts (range = 63.64 to 80.65, mean = 72.09 ± 5.30) indicates that participants are adequately knowledgeable.
Performance of the measures
Is expert prediction a good proxy for difficulty?
Resident performance (in percent) on questions belonging to different difficulty levels as predicted by: a) domain experts; b) relation strength indicativeness measure. Raw numbers are presented between parentheses
Overall, the accuracy of expert prediction ranges between 46% and 53%. As Table 6 illustrates, the accuracy of experts is close (less than 10% variation in accuracy between experts). However, looking at other metrics, more variation in performance between and within experts can be seen. Of interest are the low values for precision, recall, and thus F-score on difficult questions compared to easy and medium questions. (We performed a one-way repeated-measures ANOVA to compare the effect of the actual difficulty of questions on the F-scores achieved by experts; the F-score differed significantly between the difficulty levels (F(2,8) = 10.96, p < 0.05).)
Performance of different methods on difficulty prediction of internal medicine questions. The rank of each method among others is enclosed in parentheses and boldface indicates the method with the best performance in each metric (Q = questions, Acc. = accuracy, Rel. error = relative error, E = easy, M = medium, D = difficult, and Avg. = average)
A point of interest is whether or not there are consistent patterns characterising expert prediction. An example of a pattern is experts having a tendency to underestimate or overestimate the difficulty of questions. Looking at the data, we found 44 questions for which experts overestimated the difficulty compared to 21 questions for which experts underestimated the difficulty. This suggests that experts tend to overestimate difficulty as opposed to underestimating it. We ran a further analysis of the relation between experts’ performance on questions (getting the question right or wrong) and their prediction. The analysis aimed to answer two questions: 1) Is there a relation between experts’ performance and their prediction accuracy? and 2) Is there a relation between experts’ performance and overestimation or underestimation of difficulty? Regarding the first question, the data suggest that experts were more accurate in their prediction when they answered the questions correctly. The prediction of 51% of questions solved correctly was accurate compared to 36% of questions solved incorrectly. Concerning the second question, experts overestimated the difficulty of 63% of the questions they solved correctly, compared to 81% of the questions they solved incorrectly, which hints at an increase in the percentage of overestimation when questions are solved incorrectly. However, the small number of observations, especially the observation about questions solved incorrectly, precludes making a strong conclusion about expert performance and prediction.
Given that expert prediction is considered a major component of the evaluation framework for difficulty measures, as is apparent from the heavy reliance on expert prediction as a source of validation in multiple studies [5,6,29], the performance of domain experts was lower than anticipated. Nevertheless, all experts outperform the three baseline classifiers on each of the prioritised metrics (i.e., accuracy, Kappa, average precision, average recall, and average F-score); the exception is the relative error metric, on which the majority classifier outperforms the experts. This is because the majority of the questions in the test set belong to the medium level, and therefore the distance between any misclassified level and the actual difficulty level is minimal.
With regards to questions belonging to other specialities, a Fisher’s exact test was performed. The Fisher’s exact test was selected because of the low frequencies observed in some cells (Table 5).
To summarise, the results indicate that experts predicted question difficulty only moderately well. They also suggest an adverse effect of experts’ own performance on their prediction accuracy and a tendency among experts to overestimate question difficulty.
While preliminary evaluations of the similarity measure [5] showed that it has potential for predicting question difficulty, the current evaluation shows that the accuracy of this measure on its own is lower than that of two of the baseline classifiers (Table 6). However, it is important to note that the similarity measure was previously evaluated on questions with simple stems (i.e. stems consisting of at most two concepts), whereas most of the questions in our dataset have more complex stems containing two to five concepts. The complexity of the stem is expected to contribute to the difficulty of a question in a way that is not captured by the similarity measure, which seems a plausible explanation for its low performance. Taking into account the contribution of both stem and options to difficulty, as combined in the relation strength indicativeness measure (Section 3.2), improves the performance on all metrics except for recall on difficult questions, as shown in Table 6. The performance of the relation strength indicativeness measure is also better than that of the random and weighted guessers.
Another observation we made is that the similarity measure tends to overestimate the difficulty of questions: the predicted difficulty of 45 questions (48.91%) was higher than the empirical difficulty, while the predicted difficulty of 14 questions (15.22%) was lower. We observed a similar pattern for the relation strength indicativeness measure. We expect that cohort exposure to the examined concepts, particularly through reviewing previous or sample exam papers, moderates the effect of the difficulty factors captured by the automated measures. Investigating the relation between cohort characteristics and difficulty remains an area for future research.
Performing Fisher’s exact test on questions belonging to other specialities did not reveal a significant difference between the frequencies of correct and incorrect responses to questions belonging to different difficulty levels (as predicted by the relation strength indicativeness measure). Results obtained from internal medicine indicate that the distance between predicted difficulty and empirical difficulty is higher for automatic prediction than for expert prediction. Classifying easy questions as difficult, and vice versa, is expected to have a strong impact on the frequency of correct and incorrect responses in each group (Table 5). We therefore attribute the failure to detect a significant relation to the high value of the average relative error (Table 6).
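For reference, a test of this kind can be run as follows. Note that SciPy’s `fisher_exact` supports 2×2 tables only (a full contingency table over three difficulty levels would require, e.g., R’s `fisher.test`), and the counts below are invented purely for illustration:

```python
from scipy.stats import fisher_exact

# Hypothetical 2x2 table: rows = two predicted difficulty groups,
# columns = counts of correct vs incorrect responses (illustrative values)
table = [[30, 10],
         [12, 18]]

# Two-sided test; returns the sample odds ratio and the exact p-value
odds_ratio, p_value = fisher_exact(table, alternative="two-sided")
```

As in the paper, an exact test is preferable to a chi-squared test here because some cells may contain low frequencies.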
How well did the automated measures perform in comparison to domain experts?
The performance of our measure is competitive with that of domain experts. Looking at Table 6, the relation strength indicativeness measure ranks higher than the low-performing experts on all prioritised metrics except for the relative error metric. This indicates that the difficulty levels assigned by domain experts are closer to the actual difficulty levels than those assigned by the automated measure, which can be explained by the ability of domain experts to recognise other features (e.g. linguistic features) that play a role in the difficulty of questions. For example, while the relation strength indicativeness measure predicts questions with indicative stems and low-similarity distractors to be easy, the linguistic complexity of a question or the use of rare concepts can increase its difficulty. In addition, experts have pedagogical content knowledge (i.e. knowledge about challenging concepts that students find difficult to understand or have misconceptions about), which gives them an advantage over automated measures.
Methodological reflection
The studies reported in this paper were pilot studies. Conducting similar studies with a larger number of experts and a larger student cohort would increase confidence in the results. To allow replication and extension of our work, the question set and the associated data have been made available online.
While we have investigated expert performance on question difficulty prediction, our investigation focused on medical questions, and therefore the generalisability of these results to other domains is unknown. It is possible that other domains are more mature, in the sense that their pedagogical content knowledge is better established; this, in turn, would improve expert prediction and could yield different results. In addition, we find it worthwhile and interesting to look at domain experts’ characteristics (e.g. teaching experience and exam construction experience) and how these contribute to their predictive performance; however, the amount of data we have was too limited for such an analysis. Another factor that is expected to improve expert prediction, and that requires additional studies, is interaction and familiarity with the cohort to be tested.
Similarly, while we believe that research on medical, case-based questions has a major impact due to the heavy use of these questions in medical education and in Board exams [17,41], it would be interesting to investigate the utility of the automatic measures evaluated in this paper in predicting the difficulty of other types of questions.
Automatic measures for difficulty prediction are developed for the purpose of controlling the difficulty of automatically generated questions. This does not preclude their use for predicting the difficulty of human-authored questions (after parsing these questions). One limitation of the current study is that our test set consists of automatically generated questions only, which are very similar in terms of their linguistic structure. Difficulty prediction measures might perform worse on human-authored questions, which are expected to be inherently more diverse in their linguistic structure. Another difference between auto-generated and human-authored questions is that, as mentioned earlier, the percentage of flawed questions is higher among the latter; this is another expected source of performance variation between measures on the two sets of questions. However, obtaining human-authored examination questions annotated with student performance was difficult because of exam security issues. Further studies that investigate whether the results hold for human-authored questions are therefore needed.
Another point that needs to be emphasised here is that, although the questions in the test set belong to four templates, these templates have different characteristics (e.g. the number of concepts in the stem and the number of correct answers). In addition, we varied the questions’ characteristics within each template. If the questions had all been similar, we would have had no confidence in the generality of the test set or the generalisability of the results; the varied characteristics of the question set therefore increase our confidence in generalising the results.
Finally, it is worth mentioning that the performance of both automatic measures investigated in this paper is heavily dependent on the completeness and correctness of the ontology in use. Thus, an interesting next step would be investigating the variation in performance when ontologies with different characteristics (e.g. size and expressivity) are used. Taking a different perspective, the performance of these measures can also be used as an indication of ontology quality.
Conclusion
To the best of our knowledge, this study is the first to compare the performance of domain experts, naive baselines, and automated methods for MCQ difficulty prediction. With respect to RQ1, experts predicted the difficulty of questions only moderately well and were more accurate in predicting easy and medium questions than difficult questions. Regarding RQ2, the comparison shows that the relation strength indicativeness measure outperforms the similarity-based measure. Moreover, the former measure is of comparable performance to that of domain experts, who are heavily relied on in practice. We consider this a major success, since the measure can serve as an economical alternative. We believe that the ability of our model to explain its decisions (why a particular question is classified as belonging to a particular difficulty level), whether the decision is correct or not, is another strength. These justified decisions can lead exam designers to consider new aspects of questions, which in turn provide new insights into the difficulty and validity of questions.
However, investigating additional factors that can be used to predict the difficulty of both automatically generated and human-authored questions remains a subject of ongoing research. In doing so, the criteria presented in this study should be considered the minimum set of evaluation criteria.
Finally, while we have made an attempt at creating an annotated question set that can be used for testing the performance of prediction methods, a larger question set is needed to cross-validate the results and gain more confidence in their consistency, as well as to establish statistical significance. In addition, a larger question set would allow the use of standard machine learning algorithms for building prediction models and investigating whether such models outperform the ontology-based measures compared in this study.
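As a sketch of this machine-learning direction, a standard classifier could be cross-validated on question features once a larger annotated set exists. The features and labels below are random stand-ins (not real data), and the feature names are purely hypothetical:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Hypothetical feature matrix: one row per question, with invented columns
# such as stem concept count, distractor similarity, relation strength
X = rng.random((60, 3))
# Random stand-in labels: 0 = easy, 1 = medium, 2 = difficult
y = rng.integers(0, 3, size=60)

# 5-fold cross-validated accuracy of a simple decision tree
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=5)
```

With real data, such a model could then be compared against the ontology-based measures on the same prioritised metrics.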
Footnotes
Calculation of the evaluation metrics
Let $D = \{\mathit{easy}, \mathit{medium}, \mathit{difficult}\}$ be the set of difficulty levels and, for a level $d \in D$, let $TP_d$, $FP_d$ and $FN_d$ denote the number of questions correctly classified as $d$, incorrectly classified as $d$, and belonging to $d$ but classified otherwise, respectively.

For each level $d$, precision and recall are defined as $\mathit{Precision}_d = TP_d / (TP_d + FP_d)$ and $\mathit{Recall}_d = TP_d / (TP_d + FN_d)$, with the F-score being their harmonic mean; the averaged metrics are the means of these values over the three levels.

The value of $\mathit{Precision}_d$ ranges from 0 to 1, with higher values indicating that the classifier is less likely to identify questions as being of level $d$ unless they actually belong to it.

The value of $\mathit{Recall}_d$ ranges from 0 to 1, with a value of 1 indicating that the classifier has identified all questions in level $d$.
Let max be a function that returns the maximum possible error, obtained when each question is assigned the difficulty level farthest from its actual level. The average relative error is then the mean distance between predicted and actual difficulty levels, normalised by this maximum.

Finally, to define chance-corrected agreement, Cohen’s kappa is computed as $\kappa = (p_o - p_e) / (1 - p_e)$, where $p_o$ is the observed agreement between predicted and actual levels and $p_e$ is the agreement expected by chance.

The value of $\kappa$ is less than or equal to 1, with a value of 1 indicating perfect agreement.
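As a concrete check of these metrics, per-level precision, recall and F-score, together with Cohen's kappa, can be computed as follows (the label sequences are invented purely for illustration):

```python
from collections import Counter

LEVELS = ["easy", "medium", "difficult"]

def per_class_metrics(actual, predicted, level):
    """Precision, recall and F-score for one difficulty level."""
    tp = sum(1 for a, p in zip(actual, predicted) if a == level and p == level)
    fp = sum(1 for a, p in zip(actual, predicted) if a != level and p == level)
    fn = sum(1 for a, p in zip(actual, predicted) if a == level and p != level)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_score = (2 * precision * recall / (precision + recall)
               if precision + recall else 0.0)
    return precision, recall, f_score

def cohens_kappa(actual, predicted):
    """Chance-corrected agreement between actual and predicted levels."""
    n = len(actual)
    p_o = sum(1 for a, p in zip(actual, predicted) if a == p) / n
    a_counts, p_counts = Counter(actual), Counter(predicted)
    p_e = sum(a_counts[l] * p_counts[l] for l in LEVELS) / (n * n)
    return (p_o - p_e) / (1 - p_e) if p_e != 1 else 1.0

# Invented toy data: six questions with actual and predicted levels
actual    = ["easy", "easy", "medium", "medium", "medium", "difficult"]
predicted = ["easy", "medium", "medium", "medium", "difficult", "difficult"]
```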
Example questions
Acknowledgement
We would like to thank all participants of our experiments for their valuable contributions to our work.
Funding
This work was funded by an EPSRC grant (ref: EP/P511250/1) under an Institutional Sponsorship (2016) for The University of Manchester, along with a partial contribution from Elsevier. The funding acts as a secondment to an initial EPSRC grant (ref: EP/K503782/1) awarded as an Impact Acceleration Account (2016) for The University of Manchester.
