Abstract
Due to their test economy and objective evaluability, multiple-choice items are used much more frequently to test knowledge than constructed-response questions. However, studies point out that dependencies may exist between the individual test result and the test format (multiple-choice or constructed-response). Studies testing economic knowledge (one dimension of economic competence) mainly use multiple-choice items and indicate gender-specific performance in the corresponding tests in favour of male test-takers. As explanations for these “gender differences”, gender-specific affinities and differences in cognitive abilities are mentioned. The test format itself is also mentioned but has hardly been investigated in detail to date. In order to answer the question to what extent students’ test performance depends on the item format, we test economic knowledge using two test formats (constructed-response and multiple-choice) with the same content. Results from 201 business and business education students show that the use of constructed-response items can compensate for existing gender differences in 53% of all cases. This underlines that no general, gender-specific advantage or disadvantage can be assumed in relation to the item format. However, the mixed use of constructed-response and multiple-choice items seems promising to compensate for potential gender differences.
Introduction
To enable mobility between universities in Europe within the Bologna Process, the study structure was changed to a two-tiered structure with Bachelor’s and Master’s degrees. The associated necessity of crediting the performance of students in each module is reflected in an increased number of examinations (Winkel, 2010). Interest in selected-response items, especially multiple-choice (MC) items, which enable automatic and thus economical and objective evaluation, has therefore increased compared to constructed-response (CR) items. Previously used mainly in the medical field, MC tests have been extended to other subjects and are now one of the most widely used test formats in higher education to measure cognitive performance (Förster et al., 2017; Kuhn et al., 2016). However, MC-based tests are not only used to evaluate the results of specific courses. They are common in large (international) comparison studies as well.
This is also the case regarding the measurement of economic knowledge as the core dimension of economic competence. The American Test of Economic Literacy (TEL: Soper, 1979), its German version Wirtschaftskundlicher Bildung Test (WBT: Beck and Krumm, 1998), and further versions like the TEL2 (Soper and Walstad, 1987), the TEL3 (Walstad and Rebeck, 2001), and currently the Test of Economic Literacy fourth edition (TEL4; Walstad et al., 2013), which consist exclusively of MC items, are used internationally and at different stages of education (e.g. pupils, students, or teachers). The results of many studies using those tests not only point to insufficient economic knowledge in general but also indicate gender-specific performance in the corresponding tests (Bank and Retzmann, 2013; Förster and Zlatkin-Troitschanskaia, 2010; Goldhaber and Anthony, 2007; Schumann and Eberle, 2014; Siegfried, 2019; Beck and Wuttke, 2004). Regardless of national borders and language, girls and women often achieve significantly lower test results than their male counterparts (e.g. Förster et al., 2015; Lüdecke-Plümer and Sczesny, 1998; Soper and Walstad, 1987). Explanations for these gender-based differences in performance include gender-specific affinities and socialization (Jackstadt and Grootaert, 1980; Walstad and Robson, 1997), differences in cognitive abilities (Hirschfeld et al., 1995; Jackstadt and Grootaert, 1980) as well as the test format itself (Becker et al., 1990; Ferber et al., 1983; Lüdecke-Plümer and Sczesny, 1998; Lumsden and Scott, 1987; Walstad and Robson, 1997; Beck and Wuttke, 2004).
While, for example, gender-specific affinities mainly indicate that men and women have different attitudes towards economic topics and that measures should be taken to help women develop a higher interest in economics, the impact of gender differences caused by the item format can be much more far-reaching: if there is a gender gap due to the item format, the diagnosis itself is already wrong. Female students who score lower on MC items due to their gender are, therefore, disadvantaged in the overall result of a test or an examination. Furthermore, students may misinterpret such results by attributing lower abilities to themselves than they actually have in comparison to their male counterparts. This, in turn, can influence attitudes and interest in economic topics as well as the choice of studies and courses at school and at university (Schumann and Jüttler, 2015). Looking further ahead, when students enter the labour market, their opportunities might be reduced as a consequence of lower results in (economic) MC tests. Furthermore, diagnostic tests are a central element for the evaluation of performance, and corresponding educational interventions refer directly to these results. In the case of gender-based differences due to the test format, there is the potential for misjudgments and thus misguided teaching.
The gender differences described occur not only in an economic context but also in other contexts such as biology, mathematics, history, and chemistry, where the item format, among other things, is also discussed as a cause. Particularly for social research, it is therefore of interest whether and to what extent the item format causes gender differences in knowledge tests and which item characteristics may be responsible for these differences. This article aims to address these questions in the context of economic knowledge in higher education.
To do so, we provide definitions of the central constructs and summarize the state of research in the second chapter, and develop the research questions in the third chapter. In the fourth chapter, we describe our research design, the data, and the instruments used in the empirical analysis. The fifth chapter presents the results of the empirical study. In the last chapter, we draw conclusions from the study and give an outlook on further research.
Theoretical background
Economic competence
Economic competence is more than just economic knowledge. Referring to Weinert’s (2001) definition of competence, namely the abilities and skills to solve problems under cognitive, volitional, and motivational conditions, economic competence is defined as an interplay of (1) economic knowledge, (2) interest in economic problems, and (3) attitudes toward economics (ATE) in order to enable a reflected and responsible solution of an economic problem (Beck, 1993; Schumann and Eberle, 2014; Schumann et al., 2011; Wuttke, 2008). Economic knowledge includes knowledge about fundamental economic concepts, micro- and macroeconomics as well as international relations (Beck, 1993; Beck et al., 2005) and enables individuals to understand a specific economic problem and to apply economic concepts and principles to solve the problem (Beck, 1993; Schumann et al., 2011). Interest in economic problems demonstrates to what extent the person experiences enrichment, positive feelings, and the willingness to use available knowledge or acquire knowledge when dealing with the problem and search for potential solutions (Hidi et al., 2004; Prenzel et al., 1986; Schiefele, 1991). Furthermore, the attitude of a person toward economics decides whether and if so in what way (agreeing or disagreeing) and what valence (strongly or weakly) a situation or problem is seen from an economic point of view (Beck, 1989; Walstad, 1987). Hence, aversive ATEs tend to lead to the avoidance of the confrontation with economic content. In contrast, positive attitudes support a deepened engagement with the topic.
Gender differences in knowledge testing (in economics)
Modeling and supporting economic competence has been a major issue in research in many countries, starting in the United States in 1961 (e.g. Committee for Economic Development (CED), 1961). An early test developed to systematically assess economic knowledge as a central component of economic competence was the TEL in 1979 (Soper, 1979). Since then, many studies using different MC tests with different groups of participants have been conducted internationally in the field of economics. Although test results have been available for many years and for different groups of test-takers, the results reported have hardly changed. Male pupils, students, and teachers usually perform significantly better in economic knowledge tests than their female classmates, fellow students, or colleagues (pupils: e.g. Beck, 1993; Becker et al., 1990; Grob and Maag Merki, 2001; Hirschfeld et al., 1995; Lüdecke-Plümer and Sczesny, 1998; Schumann and Eberle, 2014; Siegfried, 1979; Soper and Walstad, 1987; Watts, 1985; students: e.g. Walstad and Robson, 1997; Siegfried, 2019; Siegfried and Wuttke, 2016; Beck and Wuttke, 2004). Similar findings have been outlined for the other two dimensions of economic competence, namely ATEs and economic interest (Beck, 1993; Marlin, 1991; Schumann and Eberle, 2014; Siegfried, 2019; Siegfried and Wuttke, 2016; Walstad, 1987).
A search for an explanation for these gender differences reveals a long tradition similar to that of the testing for economic competence. Findings that might explain the gap refer to different factors:
Parents’ profession: Male students achieve better economic test results if their father has a professional, business, or managerial position, while the mother’s profession has no influence. The same holds for female students if the mother has a professional position (Jackstadt and Grootaert, 1980). Since in the past, more fathers than mothers have had a professional position, this could explain the gender differences.
Mathematical understanding: Performance in economics tests is correlated with the test-takers’ ability in mathematics. Male test-takers usually score better in mathematics than female ones, which might lead to better results in economics (Hirschfeld et al., 1995; Schumann and Eberle, 2014).
Country of origin: Male students in the United States, Great Britain, Germany, and Austria perform significantly better than female students, whereas no gender differences occur in Japan or Korea (Förster et al., 2015; Lüdecke-Plümer and Sczesny, 1998).
The number of learning opportunities in economics: Students with commercial vocational training show better results than students without commercial vocational training. Moreover, students in general education with economic courses perform better in economic knowledge tests than students without these courses (Lüdecke-Plümer and Sczesny, 1998; Schumann and Eberle, 2014; Beck and Wuttke, 2004). Interestingly, Watts (1985) shows that without any economic course history, male students show a higher economic understanding than female students do already by the fifth grade. However, some studies found evidence that these results remain stable or even increase during the attendance of economics courses in high school and college (Schmidt et al., 2016; Siegfried, 1979; Walstad and Robson, 1997). Others point out that through economic courses, the gender gap decreases (Siegfried and Ackermann, under review; Siegfried, 2019).
Experience through (part-time) work: Jackstadt and Grootaert (1980) also show that part-time work has a negative impact on the performance in economic knowledge testing among men, but not among women.
Economic content: Studies of Walstad and Robson (1997), wherein they analyze item difficulty of tests on economic knowledge, suggest that gender differences occur only in certain economic contents, namely those that predominantly draw on mathematical skills.
Item format: Gender difference in favour of males tends to be lower in CR tests, whereas it tends to be higher in MC tests (Ferber et al., 1983; Lumsden and Scott, 1987).
Gender achievement and assessment format
Due to the widespread findings regarding gender differences in the three dimensions of economic competence in favour of males, a number of researchers have raised the question of whether the item type has an inherent bias for or against male or female students. Research on a variety of item types (e.g. MC or CR items) in different subjects has provided support for differences in gender achievement that favour females for CR items and males for MC items (Bennett, 1993; DeMars, 1998; Federer et al., 2016; Lumsden and Scott, 1987); some studies, however, did not find a gender gap regarding the item format (Lafontaine and Monseur, 2009). Explanations used for the gender differences found in relation to the item format vary significantly.
As a whole, the picture seems to be rather complex, as the following findings show.
Interaction of item format and content or level of knowledge
The use of certain item formats is often associated with the measurement of certain construct-relevant dimensions of skills. Thus, CR items are often used for writing skills or applied exercises as well as higher knowledge levels, whereas MC items are used for vocabulary skills or factual and theoretical knowledge as well as lower knowledge levels (Arthur and Everaert, 2012; Breland et al., 1994; Lafontaine and Monseur, 2009; Taylor and Lee, 2012). Hence, if men do better in an MC test, the conclusion is often drawn that men have better factual knowledge than women. In fact, however, it cannot be distinguished whether the item format favours men or whether their domain-specific abilities give them an advantage. Breland et al. (1994), as well as Taylor and Lee (2012), found indications for this conclusion when analyzing the content of the test and whether MC or CR items had been used. They outline that whereas males tend to perform better on items asking for the identification or selection of information in a given test (in math: geometry and algebra), females seem to perform better on items that require their own interpretation and synthesis of given material (in math: statistical interpretation and mathematical reasoning). Moreover, a study by Wright et al. (2016) in the domain of biology outlined that more difficult items, namely items that correspond to a higher level of knowledge in the sense of Bloom et al. (1956), tend to favour males disproportionately in comparison to items corresponding to a lower level of knowledge (see also, for the content of math, Ryan and Chiu, 2001).
Item format addressing gender-specific skills
The main difference between MC items and CR items is that most CR items require productive verbal skills in particular, even though reading the questions and answer options of MC items clearly also requires verbal skills (Haladyna and Rodriguez, 2013). As was discussed in the previous point, women may have an advantage in the writing skills that are necessary to cope with CR items, although writing skills per se are not relevant for the construct itself. Hence, the differences in the test results of males and females could be explained by construct-irrelevant skills required to answer CR items in comparison to MC items. In some studies and domains, evidence was found for this explanation, since writing skills affect student performance in math and economics tests (Abedi et al., 2003; Ferber et al., 1983). However, in other domains such as history, writing and language skills seem not to affect test scores (Breland et al., 1994).
Difficulty of the item format
CR items seem to be more challenging for students because they usually require the respondents to find an answer in their own words (Arthur and Everaert, 2012; Ben-Shakhar and Sinai, 1991; Ferber et al., 1983). Ferber et al. (1983) found evidence that more prior knowledge is needed to answer CR items compared to MC items with the same content. Thus, only in CR-based testing did students with an economics major reach a significantly higher test score compared to those students with no economics major, whereas an economics major did not affect MC-based test scores. Moreover, the number of years in school is not related to the test scores reached based on MC items. Those findings bring the authors to the conclusion “that even the worst students acquire a certain facility in making shrewd guesses” (ibid., 33). Furthermore, CR items generally require more time (Rodriguez, 2003) and, therefore, may be perceived as more demanding because more persistence is needed to deal with the item in question.
Item format and guessing
A characteristic of MC items is the possibility of deriving or guessing the correct solution using the exclusion principle (Lindner et al., 2015). Thus, it is possible to guess the answer to an MC item even if the participant does not know the correct answer, merely by excluding a single answer option (Lesage et al., 2013; Lindner et al., 2015). However, this option is not used by all students to the same extent, since more risk-averse students may rather leave out questions they cannot answer with certainty (Lesage et al., 2013). This personal variance in the guessing tendency can lead to systematic distortions of the test results, based on personality traits (e.g. fearfulness and conscientiousness) or gender. Even if the findings are inconsistent, many studies show that males are more likely to guess, since they are more dauntless, more competitive, and show overconfidence when answering questions (Bucher-Koenen et al., 2016; Riener and Wagner, 2018), which makes them perform better in MC tests. Females, in contrast, are often less willing to take risks than men, often have less confidence, and have more problems performing under pressure (Riener and Wagner, 2018). Hence, not only students’ knowledge of the content but also their risk aversion, confidence, and performance under pressure might play an important role in MC tests (Ben-Shakhar and Sinai, 1991; Croson and Gneezy, 2009; Marin and Rosa-Garcia, 2011; Riener and Wagner, 2018).
Interest and attitudes toward the test topic
With regard to the aforementioned necessity of productive language and writing skills in CR items, in contrast to merely selecting an answer in MC items, the interest (e.g. regarding the content) of the participant has to be taken into account. In this context, Niemivirta and Tapola (2007) refer to self-efficacy and interest in the content of the task as possible factors influencing individual test performance. Moreover, Schumann and Eberle (2014) found evidence that a higher level of interest in economics and a positive ATE are positively correlated with test results in an economic knowledge test. Thus, low test results can be seen not only as an indicator of a low knowledge level but also as an indicator of a lower competence level, since the interest in economics, ATEs, and economic knowledge are related to each other.
Research question
As shown in the previous chapter, there is quite a large number of studies that refer to possible gender differences in (economic) MC tests and also give possible explanations for these differences. However, most studies (e.g. Lafontaine and Monseur, 2009; Reardon et al., 2018; exceptions that offer a direct comparison: Ferber et al., 1983; Lumsden and Scott, 1987) do not compare different item formats while using the same content. Against this background, it is not possible to clearly differentiate between a possible influence of the item format on the one hand and of the content on the other. In order to untangle these effects, we address three questions in our study:
In line with the large number of studies reporting a gender gap in economic knowledge in favour of males when MC items are used, we assume that there will be no gender differences in the total test score of our economic knowledge test, since MC and CR items are used in equal numbers (Hypothesis 1).
Based on previous studies, we assume that on item level the gender gap will be smaller or non-existent for CR items compared to MC items in the same content (Hypothesis 2).
Since little is known about the effect of different item formats with the same content on gender-specific competence scores, the research question is raised to what extent the performance in CR items in comparison to MC items can be explained by the individual characteristics of the participant (gender; grades in Math, German, and Politics and Economics; type of high school degree; vocational education and training; and attended learning opportunities in economics at university).
Method
Procedure
To test the hypotheses and to answer the research question, we developed, in a first step, CR items that mirror the respective closed items of the TEL4 (Walstad et al., 2013). From all TEL4 items, we chose those that demonstrated gender differential item functioning (DIF) in previous studies. That is, we selected items that are more difficult for women than for men and are therefore gender discriminatory and potentially distort the test results due to gender (Walstad and Robson, 1997: 163). A total of eight items with gender DIF from eight different content areas were identified. In addition, seven further items from the TEL4 were used, which can be assigned to the same content areas as the gender DIF items but do not show gender DIF. The aim was to compare gender DIF and non-gender DIF items and to exclude influences due to the content. In a pilot study in January 2018, the items were tested with 112 students in Economics and Economics and Business Education who voluntarily participated in the test. No incentive was given to the students. A two-group design and a computer-based test were used. Both groups had the same content, but whereas the questions of one group were based on CR items, the same questions were presented to the other group in the form of MC items. Participants were randomly assigned to one of the two groups (MC items or CR items). This procedure is important to avoid systematic effects due to certain individual characteristics or preferences. However, in order to achieve an approximately equal distribution of male and female test-takers in each group, participants were asked for their gender before randomization. According to Kröhne and Martens (2011), this design does not have as much power as a one-group design, since it is not possible to evaluate how an individual solves both response formats, but the effect of the response format can be analyzed on the basis of identical questions.
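The gender-balanced random assignment described above can be illustrated with a minimal sketch. It assumes a simple stratified (blocked) randomization: participants are shuffled within each gender and then alternated between the two conditions. The function name `stratified_assignment`, the seed, and the gender split of the 112 pilot participants are hypothetical; the source does not report how the randomization was implemented.

```python
import random


def stratified_assignment(participants, seed=None):
    """Randomly assign participants to 'MC' or 'CR' within each gender stratum.

    participants: list of (participant_id, gender) tuples.
    Returns a dict mapping participant_id to its condition.
    """
    rng = random.Random(seed)
    strata = {}
    for pid, gender in participants:
        strata.setdefault(gender, []).append(pid)

    assignment = {}
    for ids in strata.values():
        rng.shuffle(ids)  # random order within the stratum
        # Alternating over the shuffled list keeps the two conditions
        # within one participant of each other for every gender.
        for position, pid in enumerate(ids):
            assignment[pid] = "MC" if position % 2 == 0 else "CR"
    return assignment


# Hypothetical sample: 112 participants; the 70/42 gender split is invented.
participants = [(i, "f" if i < 70 else "m") for i in range(112)]
groups = stratified_assignment(participants, seed=1)
```

Because the alternation happens within each stratum, the gender ratio is the same in both conditions regardless of the seed.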
When analyzing the results of the tests, it became clear that some of the CR items were not answered from an economic perspective or were misunderstood by the participants. In order to identify the cause of this, a think-aloud study (Eccles and Arsal, 2017) was conducted in April and May 2018 with 12 participants. Participants in the think-aloud study received the CR items and were asked to verbalize all their thoughts while solving the task (e.g. strategies, difficulties, and questions). All conversations were recorded in a protocol. The aim was not only to uncover difficulties in understanding but also to work out hints for reformulations so that the CR items could offer test-takers a solution space equivalent to that of the MC items. As a result, a total of 10 items were adapted. An example of a reformulation process can be found in Table 1.
Table 1. Example of a reformulation of a CR item.
Note. MC: multiple-choice; CR: constructed-response.
In a third step, we again used a two-group design and a computer-based test. Questions were identical for both groups, but different item formats were used in each test. Hence, both tests consist of 15 items including half MC and half CR items, organized in such a way that when question x is a CR item in one test, question x is an MC item in the other test. Thus, across both tests, all questions (30 items in total, 15 CR items and 15 MC items) have been answered as both CR and MC items. Again, participants were randomly assigned to one of the two tests after being asked for their gender, in order to achieve an approximately equal distribution of male and female test-takers in the two tests.
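The complementary structure of the two test versions can be sketched as follows. The rule used here (odd-numbered questions are CR in the first form) is purely an assumption for illustration; the source does not state which questions were presented in which format in which test. The names `build_forms`, `form_a`, and `form_b` are hypothetical.

```python
def build_forms(n_items=15):
    """Build two test forms with complementary item formats."""
    form_a, form_b = {}, {}
    for item in range(1, n_items + 1):
        # Assumption: odd questions are CR in form A, even ones MC.
        format_a = "CR" if item % 2 == 1 else "MC"
        # Form B always uses the other format for the same question.
        format_b = "MC" if format_a == "CR" else "CR"
        form_a[item] = format_a
        form_b[item] = format_b
    return form_a, form_b


form_a, form_b = build_forms()

# Pooling both forms yields every question once per format.
versions = {(item, fmt) for form in (form_a, form_b) for item, fmt in form.items()}
```

With 15 questions, one form necessarily contains eight items of one format and seven of the other, while the pooled set contains 15 CR and 15 MC versions.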
In order to ensure uniform evaluation of each CR item, a coding manual was developed deductively by taking the correct answers from the corresponding MC items. The procedure was then extended inductively, as test-takers cannot be expected to reproduce the exact wording. Therefore, synonyms and equivalent expressions were added to the coding guide. After coder training, all CR item responses were coded by two independent coders, achieving a Cohen’s kappa of 0.92–0.98.
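For readers unfamiliar with the agreement measure, Cohen’s kappa corrects the observed agreement between two coders for the agreement expected by chance. The following minimal sketch computes it for one item; the codes (1 = correct, 0 = incorrect) are invented for illustration and are not data from the study.

```python
from collections import Counter


def cohens_kappa(codes_a, codes_b):
    """Cohen's kappa for two coders who rated the same responses."""
    if len(codes_a) != len(codes_b) or not codes_a:
        raise ValueError("Both coders must rate the same non-empty set of responses.")
    n = len(codes_a)
    # Proportion of responses on which the coders agree.
    observed = sum(a == b for a, b in zip(codes_a, codes_b)) / n
    # Agreement expected by chance from each coder's category frequencies.
    freq_a, freq_b = Counter(codes_a), Counter(codes_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both coders used a single identical category throughout
    return (observed - expected) / (1 - expected)


# Invented codes for ten responses to one CR item.
coder_1 = [1, 1, 1, 0, 0, 0, 1, 0, 1, 0]
coder_2 = [1, 1, 1, 0, 0, 0, 1, 0, 0, 0]
kappa = cohens_kappa(coder_1, coder_2)  # observed 0.9, expected 0.5, kappa 0.8
```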
Data sources and questionnaire materials
Since the aim of the study is to gain insight into the influence of the item format on the test result in the economic context, an attempt was made to eliminate as many other potential influences on the test results as possible, especially interest and attitudes. Students of economics and business have chosen this study track themselves. Therefore, it can be assumed that, regardless of gender, interest in and a positive attitude toward the content of economics are given (see also Schumann and Jüttler, 2015). This should make it possible to rule out interest and attitudes toward economic contents as gender difference–inducing factors and to focus on the item format. Thus, we focused on students in a study track with economic contents. In our case, we selected Bachelor students in Economics and Business Education at different universities in Germany (74 male and 127 female). They participated in the study voluntarily and without any incentive (random sample). The students are aged between 18 and 39 (M = 22.8, SD = 2.84). About 30% have a high school degree from a commercial school, whereas 59% finished a general education school. The students were mainly in the fourth semester (M = 3.9, SD = 1.79) and had already attended eight economic courses at the university on average.
Grades for German, Math, and Politics and Economics were asked to test for possible influences of math and writing skills and of prior knowledge in economics (Table 2).
Table 2. Sample size and characteristics.
Note. N: sample; M: mean; SD: standard deviation.
Since the following analyses examine differences between males and females in the performance on economics test items, it is important to analyze to what extent males and females differ in sample characteristics (and thus potential influencing variables). These include cognitive ability, which is operationalized in this study by the grades in Mathematics, German, and Politics and Economics, as well as the semester, the number of attended courses (opportunities to learn) in economics, and prior knowledge of economics from previous commercial education. T-tests for independent samples show only one significant difference between males and females, namely their current semester (t(176) = −2.951, p = 0.004). At the time of the study, females are on average one semester further along than males. However, there is no significant difference regarding the number of attended learning opportunities in economics.
The surveys took place during the summer semester of 2018 and required 20 minutes. In order to ensure that the tests were carried out as uniformly as possible, the questionnaire contains detailed instructions.
The following data was collected:
Economic knowledge
Economic knowledge was measured using 15 out of 45 items of the German version of the TEL4 (Walstad et al., 2013). Eight items showed DIF effects for gender and seven items showed no DIF effect for gender. The items correspond to the following contents: (1) economic institutions (2 items); (2) scarcity, choice, and productive resources (2 items); (3) voluntary exchange and trade (2 items); (4) markets and prices (2 items); (5) economic role of government (2 items); (6) money and inflation (2 items); (7) labour markets and income (1 item); (8) supply and demand (1 item); and (9) output, income, employment, and the price level (1 item).
ATE
To analyze possible effects of attitudes, we used the German version of the ATEs developed by Soper and Walstad (1983) and translated by Beck and Krumm (1998). The instrument comprises 14 items (alpha = 0.85) using a 5-point Likert-type scale (1 = totally agree, 5 = totally disagree).
Interest in economic problems
Interest in economic problems as a further dimension of economic competence was measured with a questionnaire consisting of 11 items (Wild and Winteler, 1990) (Likert-type scale of 1 = does not apply to 4 = does apply; item example “For me it is of great importance to learn to better understand economic interrelationships,” alpha = 0.81).
Furthermore, biographical data, such as age, gender, grades in Mathematics, German, and Politics and Economics, and completion of commercial training, were collected.
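The alpha values reported for the attitude and interest scales are presumably Cronbach’s alpha, a measure of internal consistency based on the item variances and the variance of the sum score. A minimal sketch of the standard formula is shown below; the response matrix is invented, and the sketch assumes that any negatively worded items have already been reverse-coded.

```python
from statistics import variance


def cronbach_alpha(responses):
    """Cronbach's alpha for a list of respondents' item-score rows."""
    k = len(responses[0])  # number of items in the scale
    # Sample variance of each item across respondents.
    item_variances = [variance([row[j] for row in responses]) for j in range(k)]
    # Sample variance of the respondents' sum scores.
    total_variance = variance([sum(row) for row in responses])
    return k / (k - 1) * (1 - sum(item_variances) / total_variance)


# Invented responses: four respondents, three Likert-type items.
example = [
    [1, 2, 2],
    [2, 3, 3],
    [3, 3, 4],
    [4, 5, 5],
]
alpha = cronbach_alpha(example)
```

Values close to 1 indicate that the items vary together; the tiny invented matrix above is far more consistent than real questionnaire data would be.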
Results
Gender differences in economic competence
In a first step, we calculated the overall test scores in the economic knowledge test as well as the mean scores for ATEs and interest in economics of each individual (Table 3). Results show that male students significantly outperform female students in all three dimensions of economic competence (knowledge, interest, and attitudes), but the size of the gender effect is only small to medium (|d| = 0.34–0.54; economic knowledge: t(188) = 2.74, p = 0.007; ATEs: t(196) = 3.68, p < 0.001; interest in economics: t(194) = 2.27, p = 0.027; Table 3).
Table 3. Mean differences in economic competence.
Note. N: sample; M: mean; SD: standard deviation; Grouping variable: gender (female = 1, male = 0). T-test for independent samples.
Despite a balanced ratio of CR and MC items, the gender gap in favour of men cannot be overcome. Thus, Hypothesis 1, that there will be no gender differences in favour of males in the test used, should be rejected. Interestingly, in spite of the self-selection of the students for the study of economics and business education, differences between the genders in favour of male students in interest in economics and in ATEs become apparent.
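The group comparisons above rest on two quantities: the t statistic for independent samples and Cohen’s d as a standardized effect size. The sketch below shows how both follow from the group means and the pooled standard deviation. It assumes the classical equal-variance t-test (the source does not say whether a variance correction was applied), and the scores are invented.

```python
from math import sqrt
from statistics import mean, variance


def pooled_t_and_d(group_a, group_b):
    """Return (t, degrees of freedom, Cohen's d) for two independent groups."""
    n_a, n_b = len(group_a), len(group_b)
    df = n_a + n_b - 2
    # Pooled variance weights each group's sample variance by its df.
    pooled_var = ((n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)) / df
    pooled_sd = sqrt(pooled_var)
    difference = mean(group_a) - mean(group_b)
    t = difference / (pooled_sd * sqrt(1 / n_a + 1 / n_b))
    d = difference / pooled_sd
    return t, df, d


# Invented knowledge scores, not data from the study.
male_scores = [9, 11, 10, 12, 8]
female_scores = [8, 9, 7, 10, 6]
t_value, df, cohens_d = pooled_t_and_d(male_scores, female_scores)
```

The two statistics are linked by t = d / sqrt(1/n_a + 1/n_b), which is why a small-to-medium d can still be significant in a sample of about 200 students.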
Gender differences in item format
To investigate the gender gap in specific items (Hypothesis 2), t-tests for independent samples for each test item (CR as well as MC) were calculated (Table 4).
Table 4. Mean differences in the solution rate of CR and MC items.
Note. Content areas: 1. economic institutions, 2. scarcity, choice, and productive resources, 3. voluntary exchange and trade, 4. markets and prices, 5. economic role of government, 6. money and inflation, 7. labour markets and income, 8. supply and demand, and 9. output, income, employment, and the price level. *Test data with missing values are not included in the analyses; a: gender DIF; f: female; m: male; N: sample; M: mean; SD: standard deviation; MC: multiple-choice; CR: constructed-response. Grouping variable: gender (female = 1, male = 0). T-test for independent samples. Bold highlights refer to significant values with p < 0.10.
First of all, results show that women have a better solution rate than men for 3 out of 15 CR items (namely items 2, 12, and 14) and 4 out of 15 MC items (namely items 2, 8, 11, and 13). Men thus achieve a higher solution rate than women for a total of 23 out of 30 items (12 CR items and 11 MC items).
Furthermore, it becomes clear that for four MC items (items 2, 6, 9, and 14), the gender differences with regard to the solution rate are significant with an advantage for men (the effects are small to medium). However, these gender differences are no longer significant if these items are formulated as CR items. For four further items (items 3, 4, 7, and 10), the use of the CR format makes it possible to align the solution rates of men and women. However, the differences in the solution rate between males and females in favour of males for the corresponding MC items are not significant.
For item 5, the significant gender difference in favour of males turns out to be larger in the CR item (t(73) = 2.26, p = 0.03) than in the MC item (t(104) = 1.74, p = 0.09). In the case of three further items (items 1, 8, and 15), the gender differences in the solution rate increase to the advantage of males for the CR items compared to the MC items. However, these differences are not significant.
Moreover, it is interesting to note that compared to the MC items, fewer students answered the CR items correctly. This is true for both male and female test-takers.
Influencing factors on item solution
To analyze the extent to which individual characteristics of the participants (such as gender or grades in different subjects), their ATEs, and their interest in economics influence the probability of answering a question correctly, logistic regression analyses were run for each item and each item format (CR or MC).
In the first step, the quality of each overall regression model was assessed using the Omnibus Test of Model Coefficients, which indicates whether the model fits the data significantly better than the null model. The Omnibus Test tests the null hypothesis that all predictor coefficients of the model are zero in the population. Furthermore, Nagelkerke’s R² was used to quantify the proportion of variation explained by the predictors, relative to the null model (Field, 2013).
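The two model-quality measures named above can be computed from the log-likelihoods of the null model and the fitted model. The following is a minimal sketch using the standard definitions (likelihood-ratio chi-square for the omnibus test, Cox and Snell R² rescaled by its maximum for Nagelkerke's R²). The function names and the log-likelihood values are hypothetical and do not come from the study.

```python
from math import exp


def omnibus_chi2(ll_null, ll_model):
    """Likelihood-ratio chi-square: 2 * (LL_model - LL_null)."""
    return 2.0 * (ll_model - ll_null)


def nagelkerke_r2(ll_null, ll_model, n):
    """Cox and Snell R² divided by its maximum attainable value."""
    cox_snell = 1.0 - exp(2.0 * (ll_null - ll_model) / n)
    max_cox_snell = 1.0 - exp(2.0 * ll_null / n)
    return cox_snell / max_cox_snell


# Hypothetical log-likelihoods and sample size, for illustration only
ll_null, ll_model, n = -60.0, -53.5, 100

chi2 = omnibus_chi2(ll_null, ll_model)
r2 = nagelkerke_r2(ll_null, ll_model, n)
print(f"omnibus chi2 = {chi2:.2f}, Nagelkerke R2 = {r2:.3f}")
```

The omnibus chi-square is compared against a chi-square distribution whose degrees of freedom equal the number of predictors in the model.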
For the 30 questions examined, the independent variables can explain the variation of each item only partly, since only 16 of the models established with the respective independent variables fit the data significantly (according to the Omnibus Test) and can, therefore, be taken into account (Nagelkerke’s R² = 0.08–0.38, which, according to Cohen (1992), corresponds to a medium effect). For the other models, the null hypothesis that all predictor coefficients are equal to 0 cannot be rejected. However, in order to concentrate on the comparison of CR and MC items with the same content, only those eight models which allow a direct comparison of CR and MC are displayed in Table 5 (Appendix 1) and subsequently taken into consideration.
The results show that there do not seem to be any systematic influencing factors, even if the grades in Math and German, as well as a degree in commercial vocational education, provide a significant explanation for the probability of a correct answer for at least three items (items 8: CR and MC, 9: CR and MC, and 12: CR). For example, for item 8 in CR format, this means that when the grade in German (b = −1.687, Wald χ2(1) = 4.633, p = 0.031) increases by one unit, the odds of solving the item (rather than not solving it) change by the factor Exp(B) = 0.185. Thus, participants are more likely to solve the item if they have better grades in German (which means a smaller number for the grade in German). A negative influence also arises with the grade in Math. With a higher grade (and therefore a worse grade) in Math, the probability of solving the item correctly decreases (b = −1.103, Wald χ2(1) = 5.355, p = 0.021). The opposite is the case for the grade in Politics and Economics (Wald χ2(1) = 5.765, p = 0.016), where a positive influence with b = 1.82 emerges. Thus, participants with a higher grade in Politics and Economics (and therefore participants with worse grades in this subject) have a higher probability of solving the item. Interestingly, the relationships between the individual characteristics of the test persons and the probability of solving the item are different if the same item is formulated as an MC item. Thus, results for item 8 in MC format, for example, show that the grade in German is now positively (but not significantly, b = 0.672, Wald χ2(1) = 0.38, p = 0.537) associated with the solution probability (the worse the grade in German, the higher the probability of solving the item correctly), whereas the relationship between the grade in Politics and Economics (b = 3.119, Wald χ2(1) = 2.867, p = 0.09) and the probability of solving the test item correctly is still positive.
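The link between a coefficient b, its standard error, the Wald statistic, and Exp(B) can be made explicit with a short calculation. The sketch below uses the values reported in Table 5 for the German grade in the item 8 CR model (b = −1.687, SE = 0.784); the variable names are illustrative, and small deviations from the tabled Wald value are to be expected because the tabled b and SE are rounded.

```python
from math import exp

# Values taken from Table 5 (item 8, CR format, predictor: Grade German)
b = -1.687
se = 0.784

# Exp(B): multiplicative change in the odds of solving the item
# when the grade increases by one unit (i.e. gets worse by one grade)
odds_ratio = exp(b)

# Wald chi-square with 1 degree of freedom: (b / SE) squared
wald = (b / se) ** 2

# Percentage change in the odds per one-unit increase in the grade
pct_change = (odds_ratio - 1.0) * 100.0

print(f"Exp(B) = {odds_ratio:.3f}, Wald = {wald:.2f}, "
      f"odds change = {pct_change:.1f}%")
```

An Exp(B) below 1 corresponds to a negative b: each additional grade unit (a worse grade) multiplies the odds of solving the item by roughly 0.19, which is a reduction of about 81%.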
In summary, the logistic regression analyses show that there are no systematic predictors of the solution probability, neither across all items nor with regard to the item format. There are only two exceptions. First, the grade in German is a predictor of the probability of solving three CR items, but of none of the MC items. However, it should be noted that the influences are both positive and negative. Second, when multiple predictors are taken into account, gender does not seem to be a significant predictor.
Discussion
The aim of this study was to analyze the influence of gender on MC items by comparing answers to MC and CR items with the same content in an economic knowledge test. Even though economic knowledge is only one dimension of economic competence, this focus represents the first systematic step in examining the potential connections between item format and gender differences. Thus, the study focuses on one dimension of a comprehensive construct of economic competence, which comprises not only economic components but also values and the assumption of social responsibility (Beck, 1989; Schumann and Eberle, 2014). The findings that can be derived from this can be transferred to other test forms and contents (see also the paper by Ackermann and Siegfried in the same volume). This can lead to test formats that do not discriminate against women and thus allow a better diagnosis of competence. On this basis, it should be possible to foster the economic competence of men and women equally and thus educate them equally to act as responsible economic citizens.
The study addressed three questions, namely (1) whether gender differences in favour of males exist for economic knowledge, when different item formats are used, (2) whether the gender gap in favour of males will be smaller or non-existent for CR items compared to MC items with the same content, and (3) whether the performance on CR items in comparison to MC items can be explained by the individual characteristics of the participants.
Regarding the first question, results show that men outperform women in all three dimensions of economic competence. Focusing on the present economic knowledge test, the gender-specific effects in favour of men are still present, even if we use MC and CR items in the same test. Men perform on average 5% better than their female colleagues. This is in line with the findings of Lüdecke-Plümer and Sczesny (1998), as well as Schumann and Eberle (2014), who observed that male test persons performed on average 5%–9% better than female test persons in the German version of the TEL.
A more detailed analysis of the various items (CR or MC) reveals that the formulation of CR items makes it possible to at least approximate the solution rates of males and females. This is achieved in a total of 8 out of 15 CR items. Nevertheless, there is one CR item that shows a significant difference between men and women in the solution rate in favour of men. Furthermore, it becomes clear that women show a higher solution rate than their male colleagues in only 7 out of 30 items (three CR items, four MC items). However, in this context, it is interesting to note that depending on the item format, there are considerable differences in the solution rate on the same question. Regardless of gender, the participants achieve higher solution rates for MC items than for CR items. This suggests that many questions are answered primarily by the exclusion principle of distractors (Brückner, 2017; Lindner et al., 2015). Thus, with regard to the use of the item formats, it should be investigated which construct-relevant, but also construct-irrelevant, factors are inevitably included in the different item formats (see Ackermann and Siegfried in the same issue).
Furthermore, looking at the explanatory potential of the individual characteristics of the test persons in relation to the solution probability of an individual item (question 3), it becomes clear that there do not seem to be any systematic influencing factors. For example, previous vocational training, but also the grades in Math and German, are to be noted. Interestingly, for most items (CR as well as MC), the number of attended opportunities to learn in economics is not related significantly to the chance of solving the items correctly. These findings are only true for MC items, which is in line with the findings by Ferber et al. (1983), who refer to the point that the guessing potential in MC items might help less able students to solve the item correctly, even if they do not know the correct answer. For CR items, however, exactly the opposite should be the case, but this cannot be shown in this study.
There are certain limitations that have to be considered: in our comparison of male and female test persons’ economic knowledge, only a selection of items was used. However, the complete TEL4 consists of 45 MC questions. It is therefore not possible (1) to make a statement about the economic competence of our test-takers in general or (2) to draw a complete picture of possible gender differences hidden in the test due to the test format.
Moreover, the presented gender gap in this article only refers to the solution rates of females in comparison to males. However, the solution rate gives no information on how the difficulty of the item varies according to gender (as represented by the DIF). The adjustment of the solution rate can, therefore, be an artifact because open items, for example, become relatively more difficult for men than for women and thus have a DIF (and become more unfair) for men.
Another bias can be assumed due to the voluntary participation in the test. Generally, the participants – male and female – might have had a higher affinity for economics than the basic population.
As Croson and Gneezy (2009) point out, the lack of competition among people with an affinity for competition (as was shown for men, see chapter “Gender Achievement and Assessment Format”) can lead to a lower willingness to make an effort. The extent to which gender differences in favour of males change when the examination circumstances change should also be examined.
It should also be noted that the test participants were students of business and economics or business education, so that a basic interest in and understanding of economics was assumed, although the data indicate that the interest and attitudes of students differ systematically between men and women. However, the extent to which the present results are also achieved with test participants who do not have such economic preferences should be tested.
Conclusion
Due to the systematic comparison of CR and MC items, this study makes a valuable contribution to the discussion on possible effects of the item format on gender-specific performance. Our results provide some evidence that the use of exclusively MC-based tests is a biased way to test economic knowledge. In fact, men outperform women mostly in MC items, whereas CR items can support the decrease in gender difference in the solution rate but do not necessarily lead to a better performance of female test-takers. Hence, CR items per se cannot compensate for gender differences, but the use of CR items helps to make the performance of male and female students more comparable.
One motivation for the use of MC items is their efficiency in terms of administration and scoring (e.g. Rodriguez, 2003). Due to their format, they are furthermore objective and reliable. However, if they are affected by systematic gender bias, test-takers may be tested reliably, but not validly. Moreover, a frequently mentioned disadvantage of the MC format is that it does not sufficiently test what schools are trying to achieve, i.e. the ability to solve a job-relevant problem. This has to be considered especially if the test is used in schools.
In contrast, the most common argument favouring CR items is the assumption that they measure some kind of deeper understanding (Rodriguez, 2003). Despite this advantage, CR items are nonetheless subject to some significant restrictions, which pertain mainly to the scoring of given answers. CR scoring tends to be more complex and subjective, thereby reducing the reliability of assessment (Rodriguez, 2003). It is also more time-consuming and therefore more expensive. Moreover, immediate scoring during the assessment process, as is necessary for adaptive testing, can be realized only to a limited extent.
Taking the findings of this study and the conclusions drawn so far into account, a variation of the item format seems to be a target-oriented approach to exploit the respective gender-specific and test-theoretical advantages. This does not necessarily have to happen within a single test or test session but can extend over a school year or semester.
Appendix 1
Logistic regression analyses.
| Content | Item number: format | Predictor | b | SEβ | Wald’s χ² | df | p | Exp(B) |
|---|---|---|---|---|---|---|---|---|
| (1) | 2: CR (df = 6, χ² = 13.06, p = 0.04, Nagelkerke’s R² = 0.38, d = 0.26) | Gender (f = 1, m = 0) | 0.59 | 0.903 | 0.428 | 1 | 0.51 | 1.805 |
| | | Grade German | –0.951 | 0.773 | 1.513 | 1 | 0.22 | 0.386 |
| | | Grade P&E | 1.543 | 0.676 | 5.218 | 1 | 0.02 | 4.68 |
| | | Commercial education | 2.516 | 1.071 | 5.523 | 1 | 0.02 | 12.377 |
| | | VOCAT | 0.746 | 0.956 | 0.609 | 1 | 0.43 | 2.109 |
| | | OTL in economics | 0.027 | 0.078 | 0.117 | 1 | 0.73 | 1.027 |
| | | Constant | –8.161 | 3.28 | 6.19 | 1 | 0.01 | 0 |
| | 2: MC (df = 5, χ² = 11.77, p = 0.04, Nagelkerke’s R² = 0.27, d = 0.13) | Gender (f = 1, m = 0) | 1.502 | 0.841 | 3.193 | 1 | 0.07 | 4.492 |
| | | Grade German | 0.292 | 0.561 | 0.27 | 1 | 0.60 | 1.339 |
| | | Commercial education | –1.087 | 0.844 | 1.657 | 1 | 0.20 | 0.337 |
| | | VOCAT | 2.705 | 1.261 | 4.599 | 1 | 0.03 | 14.954 |
| | | OTL in economics | 0.112 | 0.092 | 1.474 | 1 | 0.23 | 1.119 |
| | | Constant | –3.538 | 2.868 | 1.522 | 1 | 0.22 | 0.029 |
| (4) | 8: CR (df = 7, χ² = 15.19, p = 0.03, Nagelkerke’s R² = 0.32, d = 0.18) | Gender (f = 1, m = 0) | –1.689 | 0.914 | 3.41 | 1 | 0.06 | 0.185 |
| | | Grade German | –1.687 | 0.784 | 4.633 | 1 | 0.03 | 0.185 |
| | | Grade Math | –1.103 | 0.477 | 5.355 | 1 | 0.02 | 0.332 |
| | | Grade P&E | 1.819 | 0.758 | 5.765 | 1 | 0.02 | 6.167 |
| | | Commercial education | –0.49 | 0.798 | 0.377 | 1 | 0.54 | 0.613 |
| | | VOCAT | –0.375 | 0.788 | 0.226 | 1 | 0.63 | 0.687 |
| | | OTL in economics | –0.047 | 0.073 | 0.422 | 1 | 0.52 | 0.954 |
| | | Constant | 9.367 | 3.607 | 6.746 | 1 | 0.01 | 11697.97 |
| | 8: MC (df = 2, χ² = 6.03, p = 0.05, Nagelkerke’s R² = 0.24, d = 0.09) | Grade German | 0.672 | 1.09 | 0.38 | 1 | 0.54 | 1.959 |
| | | Grade Math | –0.829 | 0.556 | 2.222 | 1 | 0.14 | 0.437 |
| | | Grade P&E | 3.119 | 1.842 | 2.867 | 1 | 0.09 | 22.62 |
| | | Constant | –0.385 | 2.331 | 0.027 | 1 | 0.87 | 0.681 |
| (5) | 9: CR (df = 2, χ² = 6.59, p = 0.04, Nagelkerke’s R² = 0.15, d = 0.11) | Gender (f = 1, m = 0) | –1.023 | 0.613 | 2.779 | 1 | 0.09 | 0.36 |
| | | Grade German | –0.863 | 0.415 | 4.331 | 1 | 0.04 | 0.422 |
| | | Constant | 2.738 | 1.44 | 3.614 | 1 | 0.06 | 15.459 |
| | 9: MC (df = 4, χ² = 10.56, p = 0.03, Nagelkerke’s R² = 0.17, d = 0.12) | Grade Math | 0.639 | 0.308 | 4.305 | 1 | 0.04 | 1.894 |
| | | Grade P&E | –0.38 | 0.362 | 1.098 | 1 | 0.30 | 0.684 |
| | | Commercial education | –1.114 | 0.549 | 4.124 | 1 | 0.04 | 0.328 |
| | | OTL in economics | –0.053 | 0.049 | 1.172 | 1 | 0.28 | 0.948 |
| | | Constant | 2.123 | 1.249 | 2.889 | 1 | 0.09 | 8.356 |
| (6) | 12: CR (df = 6, χ² = 12.81, p = 0.04, Nagelkerke’s R² = 0.23, d = 0.17) | Gender (f = 1, m = 0) | 0.53 | 0.627 | 0.714 | 1 | 0.40 | 1.699 |
| | | Grade German | 1.334 | 0.523 | 6.498 | 1 | 0.01 | 3.798 |
| | | Grade Math | –0.644 | 0.334 | 3.726 | 1 | 0.05 | 0.525 |
| | | Commercial education | 0.924 | 0.609 | 2.3 | 1 | 0.13 | 2.52 |
| | | VOCAT | 0.717 | 0.611 | 1.375 | 1 | 0.24 | 2.049 |
| | | OTL in economics | –0.038 | 0.057 | 0.442 | 1 | 0.51 | 0.963 |
| | | Constant | –3.802 | 2.164 | 3.088 | 1 | 0.08 | 0.022 |
| | 12: MC (df = 1, χ² = 3.98, p = 0.05, Nagelkerke’s R² = 0.08, d = 0.06) | Commercial education | 1.088 | 0.564 | 3.716 | 1 | 0.05 | 2.968 |
| | | Constant | –0.877 | 0.799 | 1.205 | 1 | 0.27 | 0.416 |
(1) economic institutions, (4) markets and prices, (5) economic role of government, (6) money and inflation; f: female; m: male; P&E: Politics & Economics; OTL: opportunities to learn; VOCAT: vocational education training.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by the Gleichstellungsbüro (Equal Opportunities Office) of Goethe-University Frankfurt (Kleinere Projekte zur Genderforschung (gender research projects); number: 3.00.32).
