Abstract
This article provides a meta-analysis of experimental research findings on the existence of bias in subjective grading of student work, such as essay writing. Twenty-three analyses, from 20 studies, with a total of 1935 graders, met the inclusion criteria for the meta-analysis. All studies involved graders being exposed to a specific type of information about a student other than the student's performance on a task. The hypothesized biasing characteristics included racial/ethnic background, education-related deficiencies, physical unattractiveness, and poor quality of prior performance. The statistically significant overall between-groups effect size was g = 0.36. Moderator analyses showed no significant difference in effect size related to whether the work graded was from a primary school student or a university student. No one type of biasing characteristic showed a significantly higher effect size than the others. The results suggest that bias can occur in subjective grading when graders are aware of irrelevant information about the students.
Bias in grading can be either conscious (Malouff, 2008) or unconscious (Malouff, Stein, Bothma, Coulter, & Emmerton, 2014). The focus of the bias could be prior experience with a student (e.g. a halo effect), a physical characteristic such as sex, race, or physical attractiveness, or an assigned status, such as being classified as gifted or learning disabled.
Because of the perceived possibility of bias in grading, Brennan (2008) and Warren Piper, McNulty, and O’Grady (1996), as well as a student union in the United Kingdom (Middlesex Students’ Union, n.d.), called for keeping students anonymous when possible during grading. Nobel Prize winner Daniel Kahneman (2011) made the same recommendation, based on his findings of bias in decision making.
In line with this recommendation, some universities require instructors to keep students anonymous when possible during grading, for instance, La Trobe University (2008) and the University of Melbourne (Brennan, 2008). Anonymity may be important during subjective grading, where the grader uses judgment, as in evaluating essays. It usually will not matter much during objective grading, for instance, grading responses to multiple-choice and true-false items.
Academics who want to keep students anonymous during subjective grading have used various methods, including asking students to submit work with a student number, code number, or bar code rather than their name (Haeran & Cowling, 2010; Malouff, Emmerton, & Schutte, 2013) or simply covering the students' names during grading.
Some studies conducted a few decades ago looked for evidence of apparent bias in actual grading undertaken in schools and universities. The usual non-experimental research method involved comparing scores assigned before and after blinding graders to some potentially biasing student characteristic, such as sex (e.g. Bradley, 1984; Dennis & Newstead, 1994). These findings are important because they involve grading by real teachers or assessors in the course of their work; hence, the findings have good external validity. However, the lack of experimental control limits the internal validity of the findings: factors other than bias could have affected the results. For instance, if women received higher scores in a year when students were kept anonymous, cohort differences could explain the results. Studies of actual grading have produced mixed evidence regarding the existence of bias (Belsey, 1988; Perry-Langdon, 1990).
Many studies have examined grading bias using an experimental method that typically involves giving randomly selected graders possibly biasing information about students while withholding that information from other graders. For instance, Sprietsma (2013) randomly assigned participants to grade work labeled as completed by someone with a name typical of immigrants or a name typical of natives. These studies have value in their high internal validity, which allows one to rule out interpretations of the results other than unfair bias. An experimental research design is the gold-standard method of determining causation (Shadish & Ragsdale, 1996); other methods are seriously limited when it comes to determining causation (Pirog, Buffardi, Chrisinger, Singh, & Briney, 2009). The experimental method has a limitation in that the graders typically do not evaluate assigned work of their own students.
One way to quantify the effect size of possibly biasing information is to aggregate the different findings using meta-analysis, while looking for moderators relevant to the question of generalization and to other characteristics of the research sample and method. The present article provides such a meta-analysis of experimental studies of bias in grading. The research hypothesis was that, overall, studies would show evidence that information about student characteristics can unfairly affect the grading of their work. In addition, the meta-analysis examined whether specific characteristics of the studies, such as whether the grader used a rubric, were associated with effect size, as such characteristics have been shown to affect results (Gerritson, 2013; Kayapınar, 2014).
Method
To be included in this meta-analysis, studies had to (1) use the experimental method, (2) examine whether bias occurred in grading student academic work, and (3) provide adequate statistical information for meta-analysis, including the number of participants in each of two conditions and outcome statistics that could be converted into a between-groups effect size. Studies that reported using an “experimental” method were considered to have done so even when the report did not explicitly state that graders were randomly assigned to conditions. Explicit mention of random assignment was then evaluated as a moderator.
The following databases were searched on 25 September 2015: PsycINFO, Google Scholar, ERIC, Informit A+ Education, Taylor and Francis, and ProQuest Education, using the key term bias combined with grading, marking, or assessment, going as far back in each database as possible. Terms were searched for in the abstract or in the article itself.
When a title was relevant to the meta-analysis, one member of the research team read the abstract. If the abstract was relevant, that researcher read the article. In addition, the introduction of each relevant article was read to look for references to other relevant articles. For articles published in the past five years that fit the three aforementioned inclusion criteria, the lead author of the article was contacted and asked for any unpublished findings relevant to the meta-analysis. Articles that were included were checked in Google Scholar to see whether they had been cited, and any citing articles were then checked to see whether they fitted the inclusion criteria. Where possible, searches included published articles, book chapters, dissertations, and unpublished studies. The aim was to include all relevant studies, published or not, to avoid the file-drawer effect (Rosenthal, 1979), in which published studies by themselves point to an unrealistically high effect size.
Figure 1 shows the search flowchart, which depicts most of the exclusion criteria used in this meta-analysis. In addition, several relevant studies were excluded because they lacked sufficient effect-size information for meta-analysis. This group includes a study by Seraydarian and Busse (1981) showing no biasing effect on essay grading related to the popularity of the child's first name. In very similar studies, Erwin and Calev (1984) found a significant effect and Harari and McDavid (1973) found mixed evidence of an effect. In a study with only 15 graders, Batten, Batey, Shafe, Gubby, and Birch (2013) found no significant evidence that information about university students' prior performance biased grading of their essays. Likewise, Chase (1986) found no evidence that knowledge of prior performance affected grading of the essays of primary school students. In a study with real graders of the work of high school students, Baird (1998) found no significant evidence of bias related to student sex.
Figure 1. Flowchart of the search for relevant studies.
These exclusions reduced the original pool of 896 retrieved records to 20 articles and 23 separate analyses that were included in the meta-analysis. Three of the 20 articles had multiple samples, each with an experimental group and control group, and no overlapping participants. All 23 samples were used in the meta-analysis.
Several moderators were examined by coding the data on (1) whether the study article explicitly stated use of random assignment or merely described the between-groups research design as experimental, (2) whether the grader was experienced in grading (e.g. an actual teacher), (3) the type of students (primary school or university), (4) whether the grader used a grading rubric, and (5) the specific type of bias (e.g. sex). These moderators were selected because they might influence the outcome or generalizability of results and because they could be coded as reported in many of the included studies. Two of these moderators might reasonably be expected to reduce bias: use of experienced graders and use of a grading rubric. The other three moderators pertained to generalizability of the studies.
Bias was coded as positive if the results were in the direction hypothesized by the researchers. One researcher coded effect-size data and moderator data, and the other checked the coding. Inter-coder agreement was assessed by setting aside nine studies, with a total of 10 samples, for independent coding. Each analysis included eight coding decisions, involving N, outcome results, and status on six moderators. Three of the studies had multiple outcomes that required the coding of an additional seven outcome results. Of the 87 total independent coding decisions, the coders agreed on 78 (90%). In cases of disagreement on any of the studies included in the meta-analysis, final decisions were made by consensus.
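The 87 decisions and the rounded 90% agreement rate follow directly from these counts:

```latex
10 \text{ samples} \times 8 \text{ decisions} + 7 \text{ additional outcomes} = 87,
\qquad
\frac{78}{87} \approx 0.897 \approx 90\%.
```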
The current meta-analysis followed Lipsey and Wilson (2001) and employed the Comprehensive Meta-Analysis Program Version 2 (Borenstein, Hedges, Higgins, & Rothstein, 2005) to calculate the overall weighted effect size (Hedges' g unbiased), a standardized mean difference that allows statistical comparison of different groups. The Q statistic, which functions similarly to an ANOVA, assessed the significance of differences in effect size between different subsets of studies. The homogeneity Q was used to assess the extent to which effect sizes within a set of studies varied more than one would expect by chance.
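For readers unfamiliar with these quantities, the following minimal Python sketch shows the standard computations underlying Hedges' g, its inverse-variance weighting, and the homogeneity Q. The study data here are hypothetical, and the actual analyses used the CMA program rather than this code:

```python
import math

def hedges_g(m1, m2, sd1, sd2, n1, n2):
    """Standardized mean difference with Hedges' small-sample correction."""
    sd_pooled = math.sqrt(((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2))
    d = (m1 - m2) / sd_pooled
    j = 1 - 3 / (4 * (n1 + n2) - 9)  # small-sample correction factor
    return j * d

def variance_g(g, n1, n2):
    """Approximate sampling variance of Hedges' g."""
    return (n1 + n2) / (n1 * n2) + g**2 / (2 * (n1 + n2))

def fixed_effect_summary(gs, vs):
    """Inverse-variance weighted mean effect and homogeneity Q (df = k - 1)."""
    ws = [1 / v for v in vs]
    g_bar = sum(w * g for w, g in zip(ws, gs)) / sum(ws)
    q = sum(w * (g - g_bar)**2 for w, g in zip(ws, gs))
    return g_bar, q

# Hypothetical study data: (mean_control, mean_bias, sd_control, sd_bias, n1, n2)
studies = [(72, 68, 10, 11, 40, 40), (66, 61, 12, 12, 25, 25), (71, 70, 9, 10, 50, 50)]
gs = [hedges_g(*s) for s in studies]
vs = [variance_g(g, s[4], s[5]) for g, s in zip(gs, studies)]
print(fixed_effect_summary(gs, vs))
```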
Results
Table 1. Characteristics of each study included in the meta-analysis.
Note. Ed: educational; PPP: poor prior performance; SES: low socio-economic status; MR: mentally retarded; primary: supposed work of primary school children assessed; university: supposed work of university students assessed. Stated: random assignment explicitly stated in the article; unstated: study set up as an experiment, but with no explicit statement of random assignment. In one study, a random-numbers table was used to assign an n of 9 to one condition and an n of 10 to each of the other three conditions.
The classic fail-safe N, Orwin's fail-safe analysis, and the trim-and-fill method assessed the impact of possibly missing studies on the overall effect size. The classic fail-safe N was 461, indicating that 461 unretrieved studies with an effect size of zero would be needed to reduce the overall effect size to nonsignificance. Orwin's fail-safe analysis showed that 26 additional studies with a zero effect size would bring the combined effect size down to 0.15, which one might consider a trivial between-groups difference in that it represents a mean difference of only 15% of the pooled standard deviation. A trim-and-fill analysis, which aims to identify funnel-plot asymmetry resulting from publication bias, indicated no evidence of publication bias, leaving the original effect size unchanged.
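Both fail-safe statistics have simple closed forms, sketched below in Python for readers who want to probe the robustness claim. The inputs are illustrative round numbers; the reported values of 461 and 26 came from the CMA program's internal study-level statistics, which these inputs will not exactly reproduce:

```python
def classic_fail_safe_n(z_values, z_crit=1.645):
    """Rosenthal's fail-safe N: how many unretrieved zero-effect studies
    would be needed to push the combined one-tailed p above .05."""
    z_sum = sum(z_values)
    return (z_sum / z_crit) ** 2 - len(z_values)

def orwin_fail_safe_n(k, mean_es, criterion_es, missing_es=0.0):
    """Orwin's fail-safe N: how many additional studies averaging
    missing_es would pull the mean effect down to criterion_es."""
    return k * (mean_es - criterion_es) / (criterion_es - missing_es)

# With k = 23 studies and a mean effect of 0.36, Orwin's formula gives
# about 32 here; the reported value of 26 reflects the exact study-level
# statistics used by the CMA program.
print(orwin_fail_safe_n(k=23, mean_es=0.36, criterion_es=0.15))
```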
Table 2. Moderator analysis.
Discussion
The overall meta-analytic effect of 0.36 indicates that scores assigned to work in the hypothesized-bias condition differed from scores assigned in the control condition by 36% of a standard deviation. This finding suggests that students who are members of a group against which there is a bias will tend to receive lower grades than students outside this group. However, the results were heterogeneous, with some studies showing trends in a direction opposite to that hypothesized, suggesting that “reverse” bias may take place. Overall, the results indicated that bias can occur and can have a substantial effect.
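In standardized-mean-difference terms (the general definition of Hedges' g, stated here for clarity; because bias was coded as positive when results ran in the hypothesized direction, a positive g corresponds to lower scores in the bias condition):

```latex
g = \frac{\bar{X}_{\text{control}} - \bar{X}_{\text{bias}}}{SD_{\text{pooled}}},
\qquad
g = 0.36 \;\Rightarrow\;
\bar{X}_{\text{control}} - \bar{X}_{\text{bias}} = 0.36 \, SD_{\text{pooled}}.
```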
The types of bias with significant meta-analytic findings included bias against students who have negative educational labels, students who are members of specific ethnic or racial groups, students who have previously performed poorly, and less attractive students.
The findings do not offer evidence of why biased grading occurs, but one could hypothesize that all the types of bias examined involve implicit expectancies about the quality of student performance on the present task. When the grading has subjective elements involving opinions as to quality, these expectancies, based on characteristics external to the assessment piece, may color the grader's perception of the student's work enough to affect assigned scores.
The findings reported in this article are stronger than the mixed findings of experimental studies that provided insufficient data to be included in this meta-analysis (Baird, 1998; Batten et al., 2013; Chase, 1986; Erwin & Calev, 1984; Harari & McDavid, 1973; Graham & Dwyer, 1987 [untrained graders]; Kehle, 1976 [physical attractiveness analyses]; Seraydarian & Busse, 1981). The meta-analytic findings reported here are also clearer than the results of non-experimental field studies (see e.g. Bradley, 1984; Dennis & Newstead, 1994).
The present findings are consistent with findings of unintended, implicit, or unconscious bias in realms of evaluation other than the grading of student academic work, such as halo effects in rating others (Cooper, 1981) and implicit prejudicial orientations toward members of various groups that historically have been subject to discrimination (Greenwald & Banaji, 1995).
The main limitation of the present findings is that the studies did not examine grading of student work by the actual teachers of the students. Hence, it might be safest to view the results as suggestive of bias in actual grading.
The present results suggest that, when feasible, it may be worthwhile for graders of student work to keep themselves unaware of potentially biasing information about students, as recommended by a student union in the United Kingdom (Middlesex Students’ Union, n.d.) and some experts (see, e.g. Brennan, 2008; Kahneman, 2011, p. 83).
The heterogeneity of the study findings seems to indicate an occasional boomerang effect in the studies when the graders see through the experimental instructions, guess that the study is about grading bias, and consciously or unconsciously avoid bias or even exercise bias in the direction opposite of the bias that was hypothesized. However, it is unknown whether any of the graders did in fact see through the experimental design, and various factors such as sample differences could have produced the heterogeneity.
The moderator analyses showed no significant effects and only a few trends. A lower effect size tended to be associated with grading with rather than without a rubric. Similarly, a higher effect size tended to be associated with inexperienced rather than experienced graders, and a higher effect size also tended to emerge in studies that explicitly stated that graders were randomly assigned to conditions. Whether the students were in primary school or in university seemed unrelated to effect size, as did the specific type of bias studied. However, the moderator results are best regarded as suggestive only, as the analyses had low power due to the limited number of relevant studies.
A potentially fruitful direction for further research is the trend toward less bias when graders use rubrics and are experienced in grading. Across studies in which the graders used a rubric, the weighted effect size (0.24) was statistically significant but much lower than the effect size when graders did not use a rubric (0.39). The direct comparison of effect size with and without rubrics by Gerritson (2013) showed only a trivial level of bias for graders using a rubric (g = 0.05), compared with a substantial effect for graders not using a rubric (g = 0.36). It makes sense that using rubrics might decrease bias effects: rubrics tend to remove some of the subjective elements in grading, resulting in increased reliability (Jonsson & Svingby, 2007).
Across all studies that used experienced graders, the weighted effect size (0.35) was notably lower than the effect size for inexperienced graders (0.46). In a study involving the training of graders, Graham and Dwyer (1987) reported significant bias for untrained graders but no significant bias for trained graders. That study was reviewed as part of the meta-analysis; however, the nonsignificant result for trained graders could not be included because the article lacked even minimal information about the size and direction of the effect.
Future research on bias might most profitably examine to what extent bias occurs in various circumstances with actual graders and to what extent use of rubrics or training reduces bias. These studies could help determine in what circumstances the present findings of bias generalize to actual grading.
The best methods for future research would probably include using experimental designs in which the researcher refrains from making the research hypothesis too obvious and, at the end of the study, asks the graders to guess the research hypothesis; their responses might provide useful information regarding the effects of the experimental manipulation. The best reporting methods would probably include describing the randomization process in detail in the study report and providing an effect size, such as Hedges' g, regardless of whether the result was statistically significant.
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
