Abstract
In many assessment problems—aptitude testing, hiring decisions, appraisals of the risk of recidivism, evaluation of the credibility of testimonial sources, and so on—the fair treatment of different groups of individuals is an important goal. But individuals can be legitimately grouped in many different ways. Using a framework and fairness constraints explored in research on algorithmic fairness, I show that eliminating certain forms of bias across groups for one way of classifying individuals can make it impossible to eliminate such bias across groups for another way of dividing people up. And this point generalizes if we require merely that assessments be approximately bias-free. Moreover, even if the fairness constraints are satisfied for some given partitions of the population, the constraints can fail for the coarsest common refinement, that is, the partition generated by taking intersections of the elements of these coarser partitions. This shows that these prominent fairness constraints admit the possibility of forms of intersectional bias.
Introduction
Individual identity is multifaceted. Hannah, for instance, is a woman, an American, from New York City (specifically the Upper West Side), but a resident of the South, a person who spent several formative years in England, a Cambridge graduate, a philosophy DPhil, an academic, from an upper middle class background, an advocate of risk literacy, a runner, a violist, Jewish, a flexitarian, heterosexual, an effective altruism enthusiast, a mother, a sister, a wife, and a fan of Andrei Tarkovsky and Townes van Zandt. Any one of these properties applies to a large number of other people, defining a subgroup of a general population of individuals. For a given individual, different social contexts may make membership in different groups more or less salient. The relative importance attached to membership in such groups is also a matter of individual discretion, at least to some degree. But one and the same individual can be a member of all of these groups without contradiction. Since similar remarks apply to any individual, there are many legitimate ways to group individuals in a population, from marital status or nationality to religion or taste in music.
Fair treatment of different groups is an objective common to many domains of assessment including aptitude testing in psychometrics (Borsboom et al., 2008), hiring decisions in the labor market (Fang and Moro, 2011), risk assessment in the criminal justice system (Kleinberg et al., 2017; Pleiss et al., 2017), and evaluation of the credibility of testimonial sources in epistemology (Stewart and Nielsen, 2020). Consider the case of risk assessment. Using the same actuarial techniques that are used to calculate insurance premiums, statistical software is employed in the U.S. criminal justice system to assess an individual’s risk of re-offending. Given the type of crime committed, age, sex, employment status at the time of arrest, criminal history, etc., an individual is assigned (what can be thought of as) a probability of recidivism. Such scores are used in sentencing and parole decisions, among other things. A 2016 ProPublica analysis of the risk scores of the COMPAS statistical tool for Broward County, Florida, found a form of bias in the data on the tool’s predictions (Angwin et al., 2016b). The rate of false positives—the percentage of non-recidivists given a high risk score—was roughly twice as great among black defendants as among white defendants. And the rate of false negatives—the percentage of recidivists given a low risk score—was roughly twice as great among white defendants as among black defendants. The bias is that these types of errors were asymmetrically distributed across black and white sub-populations, affecting the lives of black and white people in very different ways.
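To make the error-rate comparison concrete, here is a minimal sketch, in Python, of how false positive and false negative rates are computed per group from labeled predictions. The data and group totals are illustrative assumptions of mine, not the COMPAS figures; they merely reproduce the kind of asymmetry ProPublica reported.

```python
from collections import defaultdict

def error_rates(records):
    """Per-group false positive rate (non-recidivists flagged high risk)
    and false negative rate (recidivists given a low risk score)."""
    counts = defaultdict(lambda: {"fp": 0, "neg": 0, "fn": 0, "pos": 0})
    for group, reoffended, flagged_high_risk in records:
        c = counts[group]
        if reoffended:
            c["pos"] += 1
            c["fn"] += not flagged_high_risk
        else:
            c["neg"] += 1
            c["fp"] += flagged_high_risk
    return {g: (c["fp"] / c["neg"], c["fn"] / c["pos"]) for g, c in counts.items()}

# Illustrative counts only, not the COMPAS data: errors are asymmetrically
# distributed across the two groups.
data = ([("black", False, True)] * 4 + [("black", False, False)] * 6 +
        [("black", True, False)] * 2 + [("black", True, True)] * 8 +
        [("white", False, True)] * 2 + [("white", False, False)] * 8 +
        [("white", True, False)] * 4 + [("white", True, True)] * 6)
for group, (fpr, fnr) in error_rates(data).items():
    print(f"{group}: false positive rate {fpr:.0%}, false negative rate {fnr:.0%}")
```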
Research on algorithmic fairness studies the prospects of unbiased assessment. Bias in error rates is one form of bias, but not the only form, and it is often considered not the most important. Can bias in error rates and other important forms of bias be simultaneously eliminated? One lesson that emerges from some of these studies is that eliminating one form of bias can make it impossible to eliminate another. Sometimes, then, we face a conflict between eliminating different forms of bias. Here, I argue that, not only do we face a conflict in eliminating different forms of bias, we also face a conflict in eliminating one form of bias across different groupings. Eliminating a certain form of bias across groups for one way of categorizing people in a population can make it impossible to eliminate that form of bias across groups for another way of classifying them. This conflict is significant to the extent that multiple classifications are relevant. And they often are: consider the various classes mentioned in standard non-discrimination clauses, for example. Moreover, even if our assessments are unbiased for certain ways of classifying people—say, for both a race classification that includes black and white categories and a gender classification that includes categories for women and men—bias can persist for the coarsest common refinement of these classifications—in this case, the single classification that includes the groups of black women, black men, white women, and white men. In other words, forms of intersectional bias are possible for the prominent fairness constraints in the fair algorithms literature. Given the conceptions of fairness encoded in these constraints, and confronted with the sorts of limitations in achieving fairness across various classifications discussed below, we must reconcile ourselves to lingering bias against some groups.
Identity and Population Partitions
In any assessment problem, there is a particular population of individuals that is relevant. For instance, in parole decisions in Broward County in 2015, there are the people coming before the parole boards in the county that year. In SAT testing in the U.S. for the last decade, there is the population of people who took the exam in that time frame. In Facebook’s assessment of the trustworthiness of its users in an effort to combat “fake news,” there is the set consisting of nearly all of the platform’s users.
Any population can be divided into groups according to various individual properties or identities. For the question of fair treatment in assessment, certain groups are more customary to consider than others. Often, history and social context make salient particular categories. In the ProPublica story mentioned above, the focus is on the disparate treatment of different races. In particular, black and white defendants were treated differently. My interest here is in the possibility of fair treatment across various partitions of a population. A partition of a (finite) set $X$ is a collection of non-empty, mutually disjoint subsets of $X$ (cells or groups) whose union is $X$. Each way of classifying the individuals in a population, by race, by gender, by age bracket, and so on, induces a partition of that population.
Why should we be concerned about the multiple group memberships of any given individual and the multiple possible ways a population can be partitioned? In his book Identity and Violence, Sen argues that there is moral urgency to considering the various aspects of individual identity. Reckoning with “the power of competing identities,” says Sen, “leads to other ways of classifying people, which can restrain the exploitation of a specifically aggressive use of one particular categorization” (Sen, 2007, p. 4). Sen has in mind the way in which considering one’s humanitarian or religious affiliations might weaken the pull of a violent nationalistic or racist movement, for example. Throughout the book, Sen criticizes the idea that there is a uniquely appropriate or privileged partition.

The insistence, if only implicitly, on a choiceless singularity of human identity not only diminishes us all, it also makes the world much more flammable. The alternative to the divisiveness of one preeminent categorization is not any unreal claim that we are all much the same. That we are not. Rather, the main hope of harmony in our troubled world lies in the plurality of our identities, which cut across each other and work against sharp divisions around one single hardened line of vehement division that allegedly cannot be resisted. (Sen, 2007, p. 16)
My concern is a bit different from Sen’s, but there is a relevant lesson in the passage quoted just above. To wit, a single, fixed partition is overly constraining, and may frustrate our goals and lead to sub-optimal outcomes. The goal in assessment that is our focus is the fair treatment of different groups. But since there are various legitimate ways to partition a population into groups, restricting our attention to a single partition potentially commits us to ignoring important forms of group bias.
Consider once again the bias found against black people in the COMPAS data. In that same Broward County data set, there is a similar amount of bias in error rates against women compared to men, as a companion piece in ProPublica makes clear (Angwin et al., 2016a). Bias against either group is ethically relevant. Satisfying certain central fairness constraints (described in the following section) for a race partition does not imply that those constraints are satisfied for a gender partition. Still other partitions could be pertinent. The relevant social identities cannot be decided a priori, without appeal to contingent social context and values. Sen points out that even intuitively unimportant aspects of personal identity can become important. Consider, for example, those who wear a size 8 shoe, or those born between nine and ten in the morning, local time. If size 8 shoes were to become extremely difficult to find—think “high noon” of Soviet civilization or broken supply chains due to a novel coronavirus pandemic—then being someone who wears that shoe size may become an important part of one’s identity and grounds for solidarity with those similarly unshod. Likewise, if an authoritarian ruler were to elect to severely curtail the freedoms of people born between nine and ten in the morning due to some supernatural belief or other, then the hour of one’s birth and the persecution it entails for some is, again, likely to become an important aspect of one’s identity and grounds for solidarity (Sen, 2007, pp. 26–27). In either case, new forms of bias become pressing considerations. The priority of particular partitions in eliminating bias might reasonably depend not just on past history of discrimination, but also on current deprivation. Which groups suffer discrimination and deprivation is a matter to which we may frequently need to reattend, and the answer may well depend in part on the particular assessment problem confronting us, or on the particular population, as I explain at the end of Section 4. In general, and more to the point for this essay, there can be bias against many different groups.
Fair Assessment
In order to foreground the issue of different population partitions, let’s assume that there is just a single property $A$ whose assessment is at issue: recidivism, say, or aptitude, or credibility. An assessor assigns to each individual in a (finite) population $X$ a score that can be interpreted as an estimate of the probability that the individual has property $A$. Formally, an assessor is a function $f$ from $X$ to the unit interval $[0, 1]$.
In order to talk about population proportions or frequencies, let’s introduce a uniform probability distribution $P$ on $X$: for any subset $S$ of $X$, $P(S) = |S|/|X|$, the proportion of the population that belongs to $S$. The base rate of a group $G$ is then $P(A \mid G)$, the proportion of $G$’s members that have property $A$.
The interesting issue concerns what properties $f$ should have if assessment is to be unbiased. The first constraint to consider is calibration. An assessor $f$ is calibrated for a partition of $X$ if, for every group $G$ in the partition and every score $s$ that $f$ assigns in $G$, $P(A \mid f = s, G) = s$, where $f = s$ abbreviates the event $\{x \in X : f(x) = s\}$. In words: within every group, among the individuals who receive score $s$, the proportion that have property $A$ is exactly $s$.
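Because the population is finite and $P$ is uniform, checking calibration is a matter of counting. Here is a minimal sketch along those lines; the toy population and the function name are my own, and exact arithmetic with fractions avoids spurious floating-point failures.

```python
from fractions import Fraction

def is_calibrated(A, partition, f):
    """Within every group, among those assigned score s, the fraction
    having property A must equal s exactly."""
    for group in partition:
        for s in {f(x) for x in group}:
            bucket = [x for x in group if f(x) == s]
            if Fraction(sum(x in A for x in bucket), len(bucket)) != s:
                return False
    return True

# Toy population of six; the assessor predicts each group's base rate.
X = {1, 2, 3, 4, 5, 6}
A = {1, 2, 4}                        # those who have the property
partition = [{1, 2, 3}, {4, 5, 6}]
f = lambda x: Fraction(2, 3) if x <= 3 else Fraction(1, 3)
print(is_calibrated(A, partition, f))  # True
```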
Why does it make sense to think of calibration as a fairness constraint? One reason is that it guards against a form of bias in confidence. If it rains on fewer than 70 percent of the days on which a weather forecaster announces a 70 percent chance of rain, the forecaster is overconfident in rain; if it rains on more than 70 percent of such days, the forecaster is underconfident. A calibrated assessor is, in this sense, neither overconfident nor underconfident about property $A$ for any group in the partition. A second line of motivation concerns the meaning of scores: if $f$ is calibrated, then a given score is associated with the same frequency of property $A$, namely the score itself, no matter the group to which the recipient of the score belongs.

Figure 1. Calibration Curve.
I think there are some reasonable concerns one might have about construing calibration as a fairness property. It seems to me that this latter line of motivation in terms of meaning—and to some extent even the previous one in terms of under- and overconfidence—fails to fully motivate calibration as a fairness constraint. Scores for individuals in different groups can mean the same without the assessor satisfying the full calibration constraint. Calibration implies that $P(A \mid f = s, G) = P(A \mid f = s, G')$ for all groups $G$ and $G'$ of the partition and all scores $s$, since both sides must equal $s$. But this weaker condition (call it predictive equity) can hold even when the common value differs from $s$. Sameness of meaning across groups, then, motivates only predictive equity, not calibration itself.
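The corresponding check for predictive equity is a small variation on the calibration sketch above; the example population is again my own. It illustrates the point just made: a single, uncalibrated score can carry the same meaning in every group.

```python
from fractions import Fraction

def satisfies_predictive_equity(X, A, partition, f):
    """Check that, for each score s, P(A | f = s, G) is the same in
    every group G containing at least one individual with score s."""
    for s in {f(x) for x in X}:
        freqs = set()
        for group in partition:
            bucket = [x for x in group if f(x) == s]
            if bucket:
                freqs.add(Fraction(sum(x in A for x in bucket), len(bucket)))
        if len(freqs) > 1:
            return False
    return True

# A miscalibrated assessor can satisfy predictive equity: the score 1/2
# corresponds to a frequency of 2/3 in both groups.
X = {1, 2, 3, 4, 5, 6}
A = {1, 2, 4, 5}
partition = [{1, 2, 3}, {4, 5, 6}]
f = lambda x: Fraction(1, 2)
print(satisfies_predictive_equity(X, A, partition, f))  # True, though f is not calibrated
```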
There are at least two cautions about relaxing calibration to predictive equity worth considering. The first is that fairness is not the only goal in assessment. We also care about the property being assessed, after all. We care about maintaining public safety, admitting a talented class of freshmen, trusting credible testimonial sources, making prudent loan decisions, etc. That is, there is typically a purpose for which an assessment is conducted, with fairness acting as a sort of constraint. So, it may be reasonable to retain the form of accuracy that calibration adds to predictive equity. Not only should it be the case that a score means the same thing in every group; the score should mean what it says, with a score of $s$ corresponding to a frequency of $s$. The second caution, to which I return below, is that predictive equity is subject to limitative results of its own (Observation 2).
Rather than considering ways to relax calibration, we might consider alternative fairness constraints, ones that might potentially supplement calibration. Even if calibration is necessary for unbiased assessment, it may not be sufficient. An assessor that simply predicts the group base rate for everyone in the group will be calibrated. Yet, an innocent person in a group with a high recidivism base rate, for example, might have grounds for complaint when he receives a higher risk score than his counterpart in a group with a lower base rate. Similarly, it is consistent with calibration for a recidivist in a low-base-rate group to receive a lower risk score than a non-recidivist in a high-base-rate group. One reading of these points is that there are other forms of bias to consider besides the one calibration attempts to eliminate. This reading seems supported by ProPublica’s analysis of the COMPAS data. The sort of bias that they charge the statistical tool with is not a failure of calibration, but a disparity in error rates across groups. I turn now to a constraint meant to eliminate exactly this type of bias.
To introduce the constraint, we need a few auxiliary definitions. For a binary assessor that sorts individuals into high risk and low risk, the false positive rate for a group is the proportion of the group’s members lacking property $A$ that are classified as high risk, and the false negative rate is the proportion of the group’s members having property $A$ that are classified as low risk. Generalizing to arbitrary score-based assessors, say that $f$ satisfies equalized odds for a partition if, for all groups $G$ and $G'$ of the partition and every score $s$, $P(f = s \mid A, G) = P(f = s \mid A, G')$ and $P(f = s \mid \overline{A}, G) = P(f = s \mid \overline{A}, G')$, where $\overline{A}$ is the complement of $A$ in $X$. In words: among individuals with property $A$, scores are distributed in the same way in every group, and likewise among individuals without property $A$. Equalized odds rules out precisely the sort of asymmetry in error rates across groups that ProPublica reported.
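A sketch of the equalized odds check under the same toy setup (the data are mine). Note that the base-rate assessor from the calibration sketch above, though calibrated for this partition, fails equalized odds because the base rates differ, an instance of the tension recorded in Theorem 1 below.

```python
from fractions import Fraction

def satisfies_equalized_odds(X, A, partition, f):
    """Check that scores are identically distributed among A-members in
    every group, and likewise among non-A-members."""
    scores = {f(x) for x in X}
    for members_of in (lambda g: [x for x in g if x in A],
                       lambda g: [x for x in g if x not in A]):
        dists = []
        for group in partition:
            members = members_of(group)
            if members:
                dists.append({s: Fraction(sum(f(x) == s for x in members),
                                          len(members)) for s in scores})
        if dists and any(d != dists[0] for d in dists):
            return False
    return True

# Predicting each group's base rate is calibrated (see above) but, with
# unequal base rates, violates equalized odds.
X = {1, 2, 3, 4, 5, 6}
A = {1, 2, 4}
partition = [{1, 2, 3}, {4, 5, 6}]
f = lambda x: Fraction(2, 3) if x <= 3 else Fraction(1, 3)
print(satisfies_equalized_odds(X, A, partition, f))  # False
```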
In this essay, I will grant that calibration and equalized odds are prima facie compelling fairness constraints, though I consider it legitimate to subject them to further scrutiny in general and plan to do so in future work. On the one hand, the reader might agree that the properties are important formalizations of unbiased assessment. Many have found them to have considerable intuitive plausibility. So, the consequences of the properties would seem ethically important for such readers. On the other hand, because of the prominence of these sorts of statistical properties in theories of algorithmic fairness, it is crucial to scrutinize them, to explore their consequences and their limitations. So, even if the reader is unconvinced of the normative status of the properties, the consequences of these properties are relevant to a sober evaluation of them.
Unfortunately, there are limits to the extent to which assessments can be unbiased. Let’s look at one central limitative result. An assessor is perfect if it assigns a score of 1 to every individual who has property $A$ and a score of 0 to every individual who lacks it: $f(x) = 1$ for all $x \in A$ and $f(x) = 0$ for all $x \notin A$.
Theorem 1 (Kleinberg et al., 2017). Let $f$ be an assessor that is calibrated and satisfies equalized odds for a given partition of $X$. Then either (i) the base rates in all groups of the partition are exactly the same, or (ii) $f$ is perfect.
Theorem 1 is widely regarded as an impossibility or triviality result for fair assessment. Corbett-Davies et al. report that Kleinberg et al. “prove that except in degenerate cases, no algorithm can simultaneously satisfy” calibration and equalized odds (Corbett-Davies et al., 2017, p. 799); on the basis of this result, journalists at ProPublica published a follow-up article entitled “Bias in Criminal Risk Scores Is Mathematically Inevitable, Researchers Say” (Angwin and Larson, 2016). The idea is that perfection, as discussed, is very rarely achievable in real-life, interesting assessment problems. Similarly, that the base rates for the relevant groups are exactly the same is only very rarely the case. As a result, outside of very rare circumstances, it is impossible to achieve both fairness properties. Results like Theorem 1 give us reason to explore ways to relax or modify the fairness constraints.
Each of calibration and equalized odds is meant to eliminate a certain form of bias. What Theorem 1 establishes is that, for a fixed way of carving the population into groups, eliminating one form of bias makes it impossible to eliminate another. Next, I consider requiring that the individual fairness constraints on assessment hold for all partitions of the population. Clearly, we cannot expect these constraints to be jointly satisfied in assessment for multiple partitions since, by Theorem 1, they cannot be simultaneously satisfied for a single partition. Instead, I consider each property on its own. Under the assumption that each fairness constraint eliminates a form of bias that is desirable to eliminate, I study the possibility of eliminating one form of bias across multiple ways of dividing the population into groups.
Let’s consider each constraint in turn, starting with calibration. An alternative way to strengthen calibration for a single partition is to require it for multiple partitions rather than imposing a different sort of fairness constraint like equalized odds on the same partition. Again, Theorem 1 gives us reason to seek such alternatives. When confronted with the limitation expressed in Theorem 1, a number of people have suggested to me in person that calibration is clearly the more compelling condition. Perhaps bias of types that calibration fails to exclude—for example, some version of bias in error rates—could be reduced by requiring calibration for multiple partitions. The best case would be for calibration to hold for all partitions, since that would exclude bias in confidence against any group and maybe other forms of bias to boot. Observation 1 states a limitation on this strategy.
Observation 1. Let $f$ be an assessor. Then $f$ is calibrated for all partitions of $X$ if and only if $f$ is perfect.
Calibration for all partitions, then, is only achievable in the typically unrealistic case of perfect assessment (cf. Hébert-Johnson et al., 2018, p. 1940). In other words, outside of the unrealistic case of perfect assessment, there will be bias in confidence against some group. Observation 1 complicates any automatic inference from failure of calibration for some group to intentional bias on behalf of the assessor (for further discussion of this point, see Stewart and Nielsen, 2020, Sec. 5).
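The “only if” direction of Observation 1 can be made vivid with singleton groups: an imperfect assessor misassesses someone, and the partition isolating that individual witnesses a calibration failure. A minimal sketch of the construction, with toy data of my own:

```python
from fractions import Fraction

def calibration_counterexample(X, A, f):
    """For an imperfect assessor f, return a partition of X for which f
    is not calibrated, by isolating one misassessed individual."""
    for x in X:
        truth = Fraction(1) if x in A else Fraction(0)
        if f(x) != truth:
            # Within the group {x}, everyone with score f(x) has
            # A-frequency `truth`, which differs from f(x).
            return [g for g in ({x}, X - {x}) if g]
    return None  # f is perfect, hence calibrated for every partition

X = {1, 2, 3}
A = {1, 2}
f = lambda x: Fraction(2, 3)  # calibrated for the trivial partition {X}
print(calibration_counterexample(X, A, f))  # [{1}, {2, 3}]
```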
Next, let’s consider the weaker predictive equity property. Say that an assessor $f$ makes perfect distinctions if no individual who has property $A$ receives the same score as any individual who lacks it: $f(x) \neq f(y)$ whenever $x \in A$ and $y \notin A$. An assessor that makes perfect distinctions need not be perfect (the scores need not be 0 and 1), but its scores never blur the line between those with the property and those without.
Observation 2. Let $f$ be an assessor. Then $f$ satisfies predictive equity for all partitions of $X$ if and only if $f$ makes perfect distinctions.
Aside from assessors that make perfect distinctions, scores will not “mean” the same thing for all groups; there will be bias against some group. In large populations, making perfect distinctions is very difficult—not as difficult as perfect assessment, but difficult nonetheless. Observations 4 and 5 below relate perfect distinctions to perfection.
Say that $f$ is perfectly non-discriminating if it assigns one and the same score to all individuals who have property $A$ and one and the same score to all individuals who lack it; that is, $f$ is constant on $A$ and constant on the complement of $A$.
Observation 3. Let $f$ be an assessor. Then $f$ satisfies equalized odds for all partitions of $X$ if and only if $f$ is perfectly non-discriminating.
Here is another way to think about perfect non-discrimination. A higher bar than making perfect distinctions would be making perfect distinctions while limiting assessments to just two scores. Then, the binary assessor perfectly sorts the population into two groups: those having property $A$ and those lacking it. A non-constant, perfectly non-discriminating assessor is precisely of this kind.
Certain mathematical relationships between the limitations in Observations 1, 2, and 3 are easy to state. For instance, a simple numerical transformation can convert an assessor that makes perfect distinctions into a perfect assessor.
Observation 4. Let $f$ be an assessor that makes perfect distinctions. Then there is a transformation $t$ of the scores such that $t \circ f$ is a perfect assessor: let $t(s) = 1$ if $s$ is a score that $f$ assigns to some member of $A$, and $t(s) = 0$ otherwise.
Barocas et al. point out that, for any assessor that satisfies predictive equity, there is a transformation of it that is calibrated (Barocas et al., 2019, Proposition 1, p. 52). On the basis of this observation, they conclude that predictive equity and calibration are “essentially equivalent notions” (Barocas et al., 2019, p. 52). In my view, such an interpretation is unwarranted, but these sorts of transformations at least succinctly state a type of connection between concepts. The existence of a transformation to a perfect assessor characterizes those assessors that make perfect distinctions and so characterizes assessors that satisfy predictive equity for all partitions. For non-homogeneous populations, the existence of an injective transformation to a perfect assessor characterizes those assessors whose scores sort the population into a two-cell partition consisting of those with property $A$ and those without; that is, the non-constant, perfectly non-discriminating assessors, which, by Observation 3, are exactly the non-constant assessors satisfying equalized odds for all partitions.
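Here is a sketch of the transformation in Observation 4, under the assumption that the input assessor makes perfect distinctions; the data and names are illustrative. Because the score sets of $A$-members and non-members are disjoint, relabeling the former as 1 and the latter as 0 yields a perfect assessor.

```python
def perfecting_transformation(X, A, f):
    """Given an assessor f that makes perfect distinctions, return the
    composite t . f, where t sends A-scores to 1 and all others to 0."""
    a_scores = {f(x) for x in A}
    non_a_scores = {f(x) for x in X - A}
    assert a_scores.isdisjoint(non_a_scores), "f must make perfect distinctions"
    t = lambda s: 1 if s in a_scores else 0
    return lambda x: t(f(x))

# Illustrative use: 0.9 and 0.7 are assigned only to A-members, 0.2 only
# to non-members, so relabeling yields a perfect assessor.
X = {1, 2, 3, 4}
A = {1, 2}
f = {1: 0.9, 2: 0.7, 3: 0.2, 4: 0.2}.get
g = perfecting_transformation(X, A, f)
print([g(x) for x in sorted(X)])  # [1, 1, 0, 0]
```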
Using the limitations, we can also easily mark some logical relationships between the fairness constraints when they hold for all partitions for non-constant assessors.
Observation 5. Let $f$ be a non-constant assessor. If $f$ is calibrated for all partitions of $X$, then $f$ satisfies equalized odds for all partitions of $X$. If $f$ satisfies equalized odds for all partitions of $X$, then $f$ satisfies predictive equity for all partitions of $X$.
Since the observation is a fairly immediate consequence of preceding ones, I will just sketch a quick supporting argument here. If $f$ is calibrated for all partitions, then, by Observation 1, $f$ is perfect; a perfect assessor is plainly perfectly non-discriminating, so, by Observation 3, $f$ satisfies equalized odds for all partitions. If $f$ is non-constant and satisfies equalized odds for all partitions, then, by Observation 3, $f$ assigns one score to everyone in $A$ and a different score to everyone outside $A$; hence $f$ makes perfect distinctions and, by Observation 2, satisfies predictive equity for all partitions.
We could think of what happens when a constraint is satisfied for all partitions as revealing what ideal of fairness the constraint is committed to. As satisfying one of the constraints is supposed to represent a form of fair assessment for the groups in a partition, satisfying a constraint for all partitions represents fair assessment for all groups. This is, plausibly, the ideal case. For the three criteria under consideration here, the ideals are very simple and so are the relationships between them (Figure 2).

Figure 2. Relations among Fairness Ideals for Non-Constant Assessors.
I want to consider two objections to the significance of the foregoing limitative results. Both concern the potentially overly exacting nature of what is being asked for in avoiding bias completely against all groups. First, we might consider satisfying certain fairness constraints approximately rather than exactly. That is, we could confine the amount of bias to which any group is subject to a certain margin of tolerance. Second, we might consider avoiding bias for a certain collection of partitions, even if that collection is not the set of all partitions. I discuss these objections in turn.
One potential source of stringency that could be driving the limitative results is the requirement that a constraint be satisfied exactly. Instead, we could consider requiring that an assessor satisfy a fairness constraint approximately. Kleinberg et al. consider the possibility of satisfying multiple fairness criteria approximately in light of Theorem 1. On this approach, an assessor is approximately fair for some margin of tolerance if, for each group, the assessments are within that margin. The guiding idea is that the fairness standards are relaxed to requiring only that assessors are unbiased “enough” for each group. Only sufficiently small amounts of bias, in other words, are tolerated. Let’s look at each constraint in turn.
Say that $f$ is $\varepsilon$-calibrated for a partition if, for every group $G$ in the partition and every score $s$ that $f$ assigns in $G$, $|P(A \mid f = s, G) - s| \leq \varepsilon$. Correspondingly, say that $f$ is $\varepsilon$-perfect if $f(x) \geq 1 - \varepsilon$ for all $x \in A$ and $f(x) \leq \varepsilon$ for all $x \notin A$.
Observation 1′. Let $f$ be an assessor and let $0 \leq \varepsilon < 1/2$. Then $f$ is $\varepsilon$-calibrated for all partitions of $X$ if and only if $f$ is $\varepsilon$-perfect.
Put another way, relaxing calibration in a continuous fashion is equivalent to relaxing perfection in a continuous way. Small deviations from calibration allow only (equally) small deviations from perfect assessment. (For the “only if” direction, consider singleton groups: $\varepsilon$-calibration for the partition isolating $x$ requires $|1 - f(x)| \leq \varepsilon$ if $x \in A$ and $|0 - f(x)| \leq \varepsilon$ otherwise. For the converse, note that when $\varepsilon < 1/2$, an $\varepsilon$-perfect assessor never assigns a common score to a member of $A$ and a non-member, so every score bucket within a group is homogeneous.) Observation 1 is the limiting case in which $\varepsilon = 0$.
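Observation 1′ can be checked by brute force on a small population; the following sketch (helper names and data mine) enumerates every partition of a four-element set and confirms that $\varepsilon$-calibration for all of them coincides with $\varepsilon$-perfection.

```python
from fractions import Fraction

def all_partitions(xs):
    """Enumerate all partitions of a list (Bell-number many of them)."""
    if not xs:
        yield []
        return
    first, rest = xs[0], xs[1:]
    for part in all_partitions(rest):
        for i in range(len(part)):
            yield part[:i] + [[first] + part[i]] + part[i + 1:]
        yield [[first]] + part

def eps_calibrated(A, partition, f, eps):
    """|P(A | f = s, G) - s| <= eps for every group G and score s."""
    for group in partition:
        for s in {f(x) for x in group}:
            bucket = [x for x in group if f(x) == s]
            if abs(Fraction(sum(x in A for x in bucket), len(bucket)) - s) > eps:
                return False
    return True

def eps_perfect(X, A, f, eps):
    """Scores within eps of 1 on A and within eps of 0 off A."""
    return (all(1 - f(x) <= eps for x in A) and
            all(f(x) <= eps for x in X if x not in A))

X, A, eps = [1, 2, 3, 4], {1, 2}, Fraction(1, 10)
f = {1: Fraction(19, 20), 2: Fraction(9, 10), 3: Fraction(1, 10), 4: Fraction(0)}.get
print(all(eps_calibrated(A, p, f, eps) for p in all_partitions(X)))  # True
print(eps_perfect(X, A, f, eps))                                     # True
```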
Say that $f$ satisfies $\varepsilon$-predictive equity for a partition if, for all groups $G$ and $G'$ of the partition and every score $s$, $|P(A \mid f = s, G) - P(A \mid f = s, G')| \leq \varepsilon$ whenever both conditional probabilities are defined.
Observation 2′. Let $f$ be an assessor and let $0 \leq \varepsilon < 1$. Then $f$ satisfies $\varepsilon$-predictive equity for all partitions of $X$ if and only if $f$ makes perfect distinctions.
Observations 2 and 2′ show that, unlike calibration, predictive equity gains nothing from approximation: for any tolerance short of total permissiveness, satisfying $\varepsilon$-predictive equity for all partitions already requires making perfect distinctions. If some member of $A$ and some non-member share a score, the partition isolating the two of them exhibits a gap of 1 between the relevant conditional probabilities.
Finally, say that an assessor satisfies $\varepsilon$-equalized odds for a partition if, for all groups $G$ and $G'$ of the partition and every score $s$, $|P(f = s \mid A, G) - P(f = s \mid A, G')| \leq \varepsilon$ and $|P(f = s \mid \overline{A}, G) - P(f = s \mid \overline{A}, G')| \leq \varepsilon$ whenever the relevant conditional probabilities are defined.
Observation 3′. Let $f$ be an assessor and let $0 \leq \varepsilon < 1$. If $f$ satisfies $\varepsilon$-equalized odds for all partitions of $X$, then $f$ is perfectly non-discriminating; and conversely.
On the basis of Observations 1′, 2′, and 3′, then, relaxing the constraints to approximate versions does not substantially improve the prospects of satisfying them for all partitions: approximate calibration for all partitions still demands approximately perfect assessment, and approximate predictive equity and approximate equalized odds for all partitions collapse into their exact counterparts.
When we require, not that the criteria hold exactly for all partitions, but only that they hold approximately for all partitions, the logical connections between the characterizing conditions are a bit different. These connections are represented in Figure 3. We can take any non-constant, $\varepsilon$-perfect assessor ($0 < \varepsilon < 1/2$) that assigns two different scores within $A$: it is $\varepsilon$-calibrated for all partitions and, since its scores separate $A$ from its complement, it makes perfect distinctions and so satisfies $\varepsilon$-predictive equity for all partitions; but it is not perfectly non-discriminating, and so it fails $\varepsilon$-equalized odds for some partition. The implication from the calibration ideal to the equalized odds ideal thus lapses in the approximate setting, though each of the approximate calibration and approximate equalized odds ideals still implies the approximate predictive equity ideal.

Figure 3. Relations among Approximate Fairness Ideals for Non-Constant Assessors.
The second objection counsels restraining our ambitions in a different way. One might be inclined to think that, while (a particular type of) unbiased assessment for multiple partitions is often desirable, we have overshot the mark by requiring it for all partitions. Consider calibration. There are simple examples of populations that allow for an imperfect assessor that is simultaneously calibrated for, say, two different non-trivial ways of partitioning the population.
Example 1 (Calibration for Two Binary Partitions).
So one idea might be that there is a small set of relevant partitions, say those corresponding to the protected classes, for which the fairness constraint should be required to hold. The trouble is that, for some populations, even two binary partitions cannot be jointly accommodated: no imperfect assessor is calibrated for both.
Example 2 (No Calibration for Two Binary Partitions).
One feature of Example 2 that may incline some to regard it as a corner case is that the base rate for property $A$ in that population is rather special.
Still, it makes sense to investigate conditions that are both necessary and sufficient for the existence of an imperfect but calibrated assessor, for example, for all partitions in a given set of relevant partitions.
To summarize, on the one hand, requiring the satisfaction of a fairness constraint for some single partition is generally unsatisfactory since we may care about the fair treatment of groups from different partitions. On the other hand, requiring any of the fairness constraints considered here be satisfied for all partitions of the population or all partitions of some cardinality places unrealistically high demands on assessment as Observations 1, 2, and 3 establish. Kearns et al. make a similar point: “we cannot insist on any notion of statistical fairness for every subgroup of the population: for example, any imperfect classifier could be accused of being unfair to the subgroup of individuals defined ex-post as the set of individuals it misclassified. This simply corresponds to ‘overfitting’ a fairness constraint” (2018, p. 2565). The foregoing observations refine this point, providing, for each fairness constraint, an explicit characterization of when the constraint holds or holds approximately for all partitions. What about imposing the fairness constraint on only some set of partitions? There are at least three problems with resisting the limitative nature of Observations 1, 2, or 3 by relaxing the assumption that the relevant fairness constraint holds for all partitions (or all partitions of a given cardinality) to the assumption that it holds for just multiple partitions. First, by the observations above, bias against some group is a foregone conclusion. Which and how many partitions are ethically relevant is not invariant across assessment problems and cannot be decided a priori, so the implied bias may be more or less ethically relevant. As Examples 2 and 4 show, for some assessment problems, we can run into impossibilities even for small sets of partitions. These examples can be adapted to show that this sort of limitation emerges even for approximate versions of the fairness constraints. Second, for some given set of partitioning categories, certain populations may admit non-trivial fair assessments while others do not. For example, it could be the case that the population of Broward County defendants in 2015 admits non-trivial fair assessment for races and genders, while the 2016 population does not. This raises yet further concerns about fair treatment: some populations can be treated fairly, while others cannot unless the assessment meets the highest bar of perfection. The third problem that arises for resisting the limitative nature of the foregoing observations by focusing only on a set of pre-determined partitions is the prospect of intersectional bias, which I turn to next.
We have seen that membership in multiple social groups is universal and, in a sense, a truism. Intersectionality theory is concerned with membership in multiple socially disadvantaged groups, and with how such membership can compound disadvantage in a nonlinear way, so to speak. Calibration, predictive equity, and equalized odds each presents a particular conception of unbiased assessment. A natural question to ask is whether any of these conceptions of unbiased assessment admits the possibility of intersectional bias. Do the intersectionality theorist’s concerns emerge here?
Kimberlé Crenshaw, who introduced the term “intersectionality,” makes use of a court case to explain how bias against black women, for example, is consistent with the lack of that form of bias against black people or against women (Crenshaw, 1989). In DeGraffenreid v. General Motors, five black women alleging discrimination by General Motors’ seniority-based system sued the company. Prior to 1964, General Motors did not hire black women. All of the black women hired after 1970 lost their jobs through a seniority-based layoff during a later recession. The district court rejected the plaintiffs’ attempt to bring a suit on behalf of black women in particular rather than on behalf of black people or women. According to the court, the suit must present “a cause of action for race discrimination, sex discrimination, or alternatively either, but not a combination of both” (qtd. in Crenshaw, 1989, p. 141). The court noted that, while General Motors did not hire black women prior to 1964, it did hire female employees for a number of years prior to 1964. So there was no sex discrimination. And what if General Motors had hired black people—specifically black men—for a number of years prior to 1964? Crenshaw’s point is that this would not really absolve General Motors of the charge of discrimination against black women. It certainly does not follow that there could be no discrimination against black women.
To address the spectre of intersectional bias in the fair assessment setting considered here, we need to introduce the notion of the coarsest common refinement of a set of partitions. As technical work that deals with partitions has shown, the coarsest common refinement of a set of partitions is a very handy concept (e.g., Aumann, 1976).
A partition $\rho$ refines a partition $\pi$ if every cell of $\rho$ is a subset of some cell of $\pi$. The coarsest common refinement of a set of partitions $\pi_1, \ldots, \pi_n$ of $X$ is the partition whose cells are the non-empty intersections $G_1 \cap \cdots \cap G_n$ with $G_i \in \pi_i$ for each $i$; it refines each $\pi_i$, and every partition that refines each $\pi_i$ also refines it.
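Computationally, the coarsest common refinement is exactly the abstract’s “partition generated by taking intersections.” A minimal sketch, with illustrative cell labels of my own:

```python
from itertools import product

def coarsest_common_refinement(*partitions):
    """Intersect one cell from each partition; keep the non-empty results."""
    return [cell for cells in product(*partitions)
            if (cell := frozenset.intersection(*cells))]

# Illustrative six-element population with a gender and a race partition.
gender = [frozenset("abc"), frozenset("def")]
race = [frozenset("abd"), frozenset("cef")]
print(coarsest_common_refinement(gender, race))
# Cells: {a, b}, {c}, {d}, {e, f}
```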
How does the fairness framework under consideration in this essay bear on the issue of intersectionality? It is not true that if an assessor satisfies one of the fairness constraints for each of several partitions, then it satisfies that constraint for their coarsest common refinement. Consider the following example for calibration. Let $X$ be a population of twelve individuals, and let $f(x) = 2/3$ for all $x \in X$. Again, having property $A$ is represented by an asterisk in Table 3. Consider two binary partitions, one for gender, consisting of the men and the women, and one for race, consisting of the black and the white members of the population. The partitions and assessment scores are also displayed in Table 3. The assessor $f$ is calibrated for both the gender partition and the race partition: in all of those groups, two thirds of those who receive an assessment of $2/3$ have property $A$. The coarsest common refinement is the four-cell partition composed of the groups of black men, black women, white men, and white women. Since all three black women have property $A$, $P(A \mid f = 2/3, \text{black women}) = 1 > 2/3$, and so $f$ is underconfident in (and so not calibrated for) black women. At the same time, since only one of the three black men and one of the three white women have property $A$, $f$ is overconfident in both black men and white women.
Table 3. Intersectional Bias. An asterisk marks having property $A$; every individual receives the score $f(x) = 2/3$.

            black            white
  men       x1*, x2, x3      x7*, x8*, x9*
  women     x4*, x5*, x6*    x10*, x11, x12
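To check the example’s arithmetic, the following sketch recomputes the relevant conditional frequencies for the population in Table 3. Since everyone receives the score $2/3$, $P(A \mid f = 2/3, G)$ is simply $G$’s base rate; the individual-level records below transcribe the table.

```python
from fractions import Fraction

# Transcription of Table 3: (gender, race, has_A) for twelve individuals,
# each of whom is assigned the score 2/3.
people = (
    [("man", "black", True)] + [("man", "black", False)] * 2 +
    [("woman", "black", True)] * 3 +
    [("man", "white", True)] * 3 +
    [("woman", "white", True)] + [("woman", "white", False)] * 2
)

def freq(group):
    """Base rate of property A within a group of records."""
    return Fraction(sum(has_a for _, _, has_a in group), len(group))

tests = [("men", lambda g, r: g == "man"),
         ("women", lambda g, r: g == "woman"),
         ("black", lambda g, r: r == "black"),
         ("white", lambda g, r: r == "white"),
         ("black men", lambda g, r: g == "man" and r == "black"),
         ("black women", lambda g, r: g == "woman" and r == "black"),
         ("white women", lambda g, r: g == "woman" and r == "white")]
for label, test in tests:
    print(label, freq([p for p in people if test(p[0], p[1])]))
# men, women, black, white: all 2/3, so f is calibrated for both partitions;
# black women: 1; black men and white women: 1/3, so f is not calibrated
# for the coarsest common refinement.
```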
Two further points are worth emphasizing. First, the example also shows that it is conceptually possible for membership in multiple socially disadvantaged groups to lead to favorable assessment bias. This could be the case if the property being assessed is an unwelcome one, such as recidivism: since $f$ is underconfident in black women, members of that group receive lower risk scores than their group’s frequency of the property warrants, and in contexts where high scores trigger adverse treatment, that bias works in their favor. Second, the phenomenon is not peculiar to calibration; parallel examples can be constructed for the other constraints, as the next observation records.
Observation 6. Let $\pi_1, \ldots, \pi_n$ be partitions of $X$ with coarsest common refinement $\pi$, and let $f$ be an assessor. Even if $f$ is calibrated for each of $\pi_1, \ldots, \pi_n$, $f$ may fail to be calibrated for $\pi$. Even if $f$ satisfies predictive equity for each of $\pi_1, \ldots, \pi_n$, $f$ may fail to satisfy predictive equity for $\pi$. Even if $f$ satisfies equalized odds for each of $\pi_1, \ldots, \pi_n$, $f$ may fail to satisfy equalized odds for $\pi$.
I end on at least a slightly more positive note. Suppose that we manage to specify some number of population partitions for which unbiased assessment is most ethically relevant in a particular assessment problem. The next observation provides one consideration in favor of focusing on unbiased assessment for the coarsest common refinement of the relevant partitions.
Observation 7. Let $\pi_1, \ldots, \pi_n$ be partitions of $X$ with coarsest common refinement $\pi$, and let $f$ be an assessor. If $f$ is calibrated for $\pi$, then $f$ is calibrated for each of $\pi_1, \ldots, \pi_n$. If $f$ satisfies predictive equity for $\pi$, then $f$ satisfies predictive equity for each of $\pi_1, \ldots, \pi_n$. If $f$ satisfies equalized odds for $\pi$, then $f$ satisfies equalized odds for each of $\pi_1, \ldots, \pi_n$.
Observation 7 assures us that, if we can specify the population partitions that are ethically relevant in an assessment problem, then satisfying a particular fairness constraint for the coarsest common refinement implies that the constraint is satisfied on all of the relevant partitions. The reason is simple: every group of a coarser partition is a union of cells of the refinement, and the conditional frequencies the constraints govern are weighted averages over those cells. The constraint only needs to be satisfiable in the coarsest common refinement in a non-trivial way. This contrasts with the lack of a corresponding guarantee, indicated in Observation 6, when we focus on unbiased assessments for the coarser partitions, as the district court did in DeGraffenreid v. General Motors. Nevertheless, satisfaction of one of these fairness constraints on the coarsest common refinement of a set of partitions is only a sufficient condition for the constraint’s satisfaction for all partitions in the set; it is not necessary, as simple examples illustrate. As a result, focusing only on the coarsest common refinement unduly restricts the set of fair assessors. Furthermore, we generally do not know at the time of assessment whether non-trivial unbiased assessment is possible for the coarsest common refinement.
There are multiple ways to carve a population, multiple social identities, for which it may be important to avoid biased assessments. Fixing a single partition of identities is overly restrictive, committing us to ignoring both relevant forms of bias against other groups and changing social context. Allowing even a set of partitions to ossify into the relevant partitions may fail to make us sufficiently attentive. In the DeGraffenreid v. General Motors decision, the court resisted the idea that relevant classifications are open to reconsideration or refinement: “The prospect of the creation of new classes of protected minorities, governed only by the mathematical principles of permutation and combination, clearly raises the prospect of opening the hackneyed Pandora’s box” (qtd. in Crenshaw, 1989, p. 142). Pandora’s box or not, the alternative seems to be refusal to confront different possible forms of group bias. If Sen is right, such dogmatism also “makes the world much more flammable.”
But we confront limitations in taking the relevant classes in fair assessment to be “governed only by the mathematical principles of permutation and combination.” Requiring that certain forms of bias be avoided for all possible social groupings is overly constraining, placing unrealistic demands on assessment, as Observations 1, 2, and 3 attest. Put another way, bias against some group is inevitable for non-trivial assessment problems. Requiring only that an assessor be “close” to bias-free for all groups does not change this picture much (Observations 1′, 2′, and 3′).
Where does this leave us? What the foregoing analysis makes clear is that, not only is there a conflict between eliminating different forms of bias, there are also serious limits to the extent to which a given form of bias can be eliminated across different partitions. Often, many partitions are important. In “auditing” an assessor for bias (e.g., Kearns et al., 2018), outside of some rather restrictive cases, we are guaranteed to find bias against some group. While that is pertinent data for the ethical evaluation of an assessor, what are the prescriptive implications for assessment? Some have recently argued for rejecting nearly all of the statistical criteria of fairness proposed in the literature for reasons of a different nature than those I have considered (e.g., Hedden, 2021). Do the observations recorded here add to this case? It seems there are two broad approaches we might pursue. First, we could reconcile ourselves to bias against some groups, since it is essentially inevitable on this way of understanding fair assessment, hoping and doing what we can to ensure that the implied forms of bias minimally impact what we take to be the most relevant protected classes for the given time and place (cf. Hébert-Johnson et al., 2018). Second, we could seek a different conception of fair assessment.
Acknowledgements
Thanks to Marshall Bierson, Mike Bishop, Yang Liu, Michael Nielsen, Ignacio Ojea Quintana, Shanna Slank, Tom Sterkenburg, Reuben Stern, Borut Trpin, audiences at the Center for Advanced Studies (CAS) at LMU Munich and the Faculty of Philosophy at the University of Groningen, three anonymous referees at Social Choice and Welfare, and two anonymous referees at the Journal of Theoretical Politics for helpful conversations and feedback. I am grateful to CAS and Longview Philanthropy for providing research leave, and to the Cambridge-LMU Strategic Partnership for funding the Decision Theory and the Future of Artificial Intelligence group.
