
Abstract

When there are no witnesses, no substantive material evidence, no plausible motive, and, moreover, nothing beyond speculation to indicate that a crime has even occurred, then the prosecution of suspected serial killer nurses will be difficult. It will lean heavily on statistical arguments. There are two pillars to these arguments; the first is to show, on the basis of a believed inexplicable spike of deaths and life-threatening events, that the best explanation is that of criminal activity. While identifying a cluster is an awkward statistical question, it is not the one we address here. Our focus is on the second pillar of the argument, one that aims to show the accused as the most likely culprit in the light of roster chart evidence. The basis of this evidence comes from what is judged to be an unusually high correlation of the accused’s presence with the events when they took place. We show that such roster chart evidence is unreliable. A visual inspection of any chart will be misinterpreted, the reader coming to an erroneous conclusion. We consider a recent case and, via careful calculation, show that what was believed to be strong evidence of culpability is, in truth, anything but. The often heard statement “whenever something went wrong, the nurse was always there,” has no basis in fact.

Keywords

criminology, probability, roster chart, serial killer nurses, order statistics

Introduction and background

An inexplicable spike in the rate of adverse events

A cluster of events within the health care sector, a clinic or a hospital, for example, can trigger alarm bells. The events we have in mind are serious, life-threatening, and possibly death itself, so that the potential origins of the events capture our whole interest. An example might be that of a neonatal intensive care unit where deaths and collapses are to be expected to some degree. The problem then is not the nature of the events but the rates of these events, a concentration of events, a cluster on the time scale that will often be described as an “inexplicable spike,” that is to say something most unusual, something unanticipated, and that, at an initial glance, appears to defy explanation.

Such a spike will rightly produce concern. Observing more deaths or serious events, such as collapses, than we expect to see will necessarily raise questions. Year in year out, month in month out, we will see variability in any observed rate. This is to be expected. However, we might sense that the rate we are observing appears to push the boundaries defined by statistical variability a little too far; that is, it looks like an outlier. This is not an easy statistical problem. Far from it, although for the purposes of this communication, we will assume that whatever investigative techniques have been used, and to what extent they rely on statistical analyses, a conclusion has been reached. The conclusion is that, within the unit, there is unmistakable evidence of an outlier, an inexplicable spike that—it might be argued—points to deliberately inflicted harm.

Uncovering correlations via a roster chart

Once deliberate harm has been agreed upon, then only one important question remains. Who? Unlike finding someone face down in a back alley with a knife in their back, where the culprit has to be singled out from a very large number of potential suspects, for a case in a health care setting, there can only be a limited number of possible suspects. If we are able to identify several suspicious events, then one obvious way to proceed would be to look for some common denominator. Does one of the doctors or nurses stand out from the others in terms of their presence in the vicinity of suspicious events for many, if not all, of the suspicious events? Should that turn out to be the case, then we might argue that close attention needs to be paid to the activities of the particular person present at the time of the several events. The doctor or nurse in question may become a suspect after giving consideration to other relevant factors.

How do we study the correlation between the presence of a particular nurse and the occurrence of suspicious events? Any such study, in order to claim validity, would require the implementation of a carefully constructed experimental design. A considerable amount of training and experience in epidemiology and statistics would normally be required. An absence of this kind of knowledge can lead us very much astray, and several real examples exist. It is quite easy to simulate a chart, or several charts, based on a condition of independence between the columns of the chart (potential suspects) and the outcome. It is interesting to note that such simulations will often give an impression that certain clusters, hence certain correlations, are present in the data. Natural intuition can let us down, and more formal statistical approaches are needed.
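The simulation mentioned above can be sketched in a few lines. This is a minimal illustration only: the chart dimensions (25 events, 30 nurses), the common presence probability of 0.3, and the function names are assumptions made for the example and are not taken from any real roster.

```python
import random


def simulate_chart(n_events=25, n_nurses=30, p_present=0.3, seed=0):
    """Return a chart as a list of rows (events) of 0/1 presence indicators.

    Every cell is drawn independently with the same probability, so no
    nurse (column) has any real association with the events. All default
    values are hypothetical.
    """
    rng = random.Random(seed)
    return [
        [1 if rng.random() < p_present else 0 for _ in range(n_nurses)]
        for _ in range(n_events)
    ]


def presence_counts(chart):
    """Number of events at which each nurse (column) was present."""
    return [sum(column) for column in zip(*chart)]


if __name__ == "__main__":
    chart = simulate_chart()
    counts = presence_counts(chart)
    print("average presences per nurse:", sum(counts) / len(counts))
    print("largest count for a single nurse:", max(counts))
```

Even under exact independence, the largest column count will sit above the average, and a reader scanning the chart for a "stand-out" nurse will usually find one. Repeating the simulation with different seeds gives a feel for how often this happens by chance alone.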

We introduce binary variables, $X_{i}$ and $Y_{i}$, for event $i$. $X_{i}$ assumes the value 1 when the event is deemed suspicious and 0 otherwise. $Y_{i}$ assumes the value 1 if the nurse under consideration is present at event $i$ and 0 otherwise. The chart is then a visual aid and an indirect response to the central question: what is $\Pr(Y_{i} = 1 | X_{i} = 1)$, the probability of presence given that an event is deemed suspicious, and, immediately following this question, does this probability differ in any significant way between the suspect and a randomly chosen nurse? And if it does, can any difference be attributed to other factors, such as the average number of shifts on duty? It is the quantity $\Pr(Y_{i} = 1 | X_{i} = 1)$ that will enable us to address the relevant questions. A scientifically rigorous assessment can only be based on this probability since the information hidden within and by the chart can easily mislead us if used without taking some additional steps. While the visual impression can be a beginning, it cannot be the end point. Consider Figure 1. An assessment based on nothing more than the chart might easily convince us that we have our culprit, nurse 23. But do we?

Figure 1.

A chart believed to show a strong statistical correlation between the presence of nurse 23 and suspicious events. The visual impression that ignores bias is very strong.

Selection, outcome, and confirmation bias

If our visual impressions from Figure 1 are all we have to go on, it could well surprise us to learn that the nurse working the most shifts is more likely to be present when a suspicious event occurs than is nurse 23. We show in this paper why that would be the case. At first glance, it is certainly counterintuitive and serves to illustrate how necessary are careful calculations. Let us keep in mind that, while impressions may stimulate our research, only a relatively unbiased evaluation of the relevant probabilities in different situations will enable us to make sober and reliable decisions. So why then is our intuition so far off when looking at Figure 1? Simply because the figure is not really what it purports to be. At a meeting of the Legal Section of the Royal Statistical Society in September 2024, it was described as a fake. The figure contains major biases that convey a very misleading impression. Ignoring these biases, we could easily conclude that, for nurse 23, the probability of being present at a suspicious event is of the order 100% (“any time something bad happens, the nurse is always there”). As we will see, a correct analysis estimates this probability to be an order of magnitude smaller than that.

Unbounded impact of hidden biases

Bias is a common theme in statistical theory. For a sample of observations, we can compute the sum of squares about the mean and divide the result by the sample size, $n$. This provides an estimate of the true variance and, in many ways, is the best estimate we can come up with since it corresponds to the maximum likelihood estimate in most cases. It is, however, a biased estimate. For an unbiased estimate, we would need to divide by $n - 1$. The difference is small, of the order $1 / n$ relative to the variance itself, and so very quickly is of no practical importance. This is not so for the biases we discuss here. Their impact is not bounded and can easily be so large that the only visible thing is the bias itself. Failure to address the issue of bias results in an analysis of no value. We show below that the bias implying the perfect correlation in Figure 1 is of the order 900%. Too big to be ignored.
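The variance example can be made concrete with a short sketch. The function names are ours and purely illustrative; the point is only that the two estimates differ by the factor $(n - 1) / n$, which tends to 1 as the sample grows, in contrast to the unbounded biases discussed in this paper.

```python
def sum_of_squares(xs):
    """Sum of squared deviations about the sample mean."""
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** 2 for x in xs)


def variance_mle(xs):
    """Biased estimate: divide the sum of squares by n."""
    return sum_of_squares(xs) / len(xs)


def variance_unbiased(xs):
    """Unbiased estimate: divide the sum of squares by n - 1."""
    return sum_of_squares(xs) / (len(xs) - 1)


if __name__ == "__main__":
    sample = [1.0, 2.0, 3.0, 4.0]
    print(variance_mle(sample), variance_unbiased(sample))
    # The ratio of the two estimates is (n - 1) / n whatever the data.
    for n in (5, 50, 500):
        print(n, (n - 1) / n)
```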

Outcome bias and how to spot it

Most of us, including skilled probabilists, will fall into the trap of outcome bias at one point or another. It has been given many labels—the Texas sharpshooter fallacy, the lottery blunder, among others. The problem arises from our calculating the probability of some event after the event has occurred, and is motivated by the event itself. The probability of winning the lottery is tiny, but what can we say about the probability relating to the person who won? We would all agree, it is not tiny; in fact, it is 100%, which is very different from the probability we would have assigned before the draw took place.

Now, just about all of us would get the above example. But different forms of the very same idea are not so easily detected. Take a DNA test, for instance. A suspect seen running away from the crime scene by several witnesses, and with known links to the victim, is found after detention to have left traces of his DNA on the victim himself. This is the kind of language used to describe such evidence. In truth, we do not know that the suspect’s DNA was found at the crime scene. What we do know is that the DNA, sampled at the crime scene, and analyzed for a very small number of genetic markers, found a very high degree of correspondence with those same genetic markers of the suspect. In order to quantify that degree of correspondence, we are told that there would be less than a one in a million chance of an individual, randomly chosen from the population, exhibiting so high a degree of matching. A smoking gun! Possibly, although (as an aside) a sharp defense lawyer may want to ask where precisely that one in a million calculation came from. Let us leave that, though, and assume that we have our culprit. How about the following situation? No suspects were seen. There were no witnesses. No known links to the victim. A DNA sample was taken and run through a national database containing around 20 million profiles. A match, that is to say a profile with a fixed small number of genetic markers corresponding to those found at the crime scene, is identified. An arrest is made.

The above two situations may appear to be similar. They are, however, very different. The second one can be subject to a large outcome bias that requires involved calculations in order to quantify. More details are given in O’Quigley.¹ We can see that the one in one million probability is likely to be very far off, and it is not at all implausible that we are observing something far less unusual than we had believed. But let us return to our main concern, clusters of potentially suspect outcomes in the setting of a clinic or a hospital. It is not hard to see that many of the very investigations themselves are little more than the result of outcome bias combined with a total lack of understanding of how, why, and when such biases arise.
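A rough sense of the size of the outcome bias in the database search can be had from a back-of-the-envelope sketch. It is not the involved calculation referred to above. It assumes, purely for illustration, that every profile in the database belongs to someone unrelated to the crime and matches by coincidence, independently, with probability exactly one in a million; the function names are hypothetical.

```python
import math


def expected_coincidental_matches(db_size, match_prob):
    """Expected number of purely coincidental matches in the database."""
    return db_size * match_prob


def prob_at_least_one_match(db_size, match_prob):
    """Probability of one or more coincidental matches: 1 - (1 - p)^N.

    Written with log1p/expm1 to stay accurate when p is tiny and N is large.
    """
    return -math.expm1(db_size * math.log1p(-match_prob))


if __name__ == "__main__":
    n_profiles = 20_000_000   # "around 20 million profiles"
    p_match = 1e-6            # "one in a million", taken at face value
    print(expected_coincidental_matches(n_profiles, p_match))
    print(prob_at_least_one_match(n_profiles, p_match))
```

Under these simplifying assumptions, around 20 coincidental matches would be expected and at least one is close to certain, so a match found by trawling the database is far less unusual than the one in a million figure suggests.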

Selection bias and how to spot it

Any introductory epidemiology course will devote time to the topic of selection bias. The topic might be viewed as the central concern of epidemiologists doing their best to ensure the validity of their findings. Selection bias occurs when the participants included in a study are not representative of the target population, leading to systematic differences between the study sample and the population of interest. This bias can distort the observed relationships between exposures and outcomes, compromising the validity of the study’s conclusions. Selection bias, while particularly problematic in epidemiology, is also a real concern in clinical research, where accurate, unbiased results are crucial for guiding public health decisions and medical practice. Selection bias can show itself in many forms, the most common being sampling bias.

Sampling bias arises when certain groups in the population are overrepresented or underrepresented in the study sample. For example, recruiting participants from a hospital setting may exclude healthier individuals who do not require hospitalization. Self-selection bias can occur when individuals who volunteer to participate differ systematically from those who do not. For instance, volunteers for a fitness study may be more health-conscious than the general population. In longitudinal studies, selection bias can emerge if participants who drop out differ to any extent from those who remain. For example, individuals with severe illnesses might be more likely to discontinue participation. Another form of selection bias comes under the heading, exclusion bias. This results from the inappropriate exclusion of certain participants during study design or data analysis. Excluding older adults from a clinical trial may limit our ability to generalize the results to that age group. The flip side of exclusion bias would be inclusion bias, whereby certain subsets or subgroups are overrepresented. This is very much the case in the focus of our study here, where the suspect nurse is included in the study to a degree well beyond what we might anticipate from a fair representation.

Selection bias arises both during the study design phase, when the inclusion and exclusion criteria can be overly restrictive, and at the recruitment stage,² when recruitment strategies may fail to reach a diverse and representative sample. Even when the sample is in some sense representative, we can run into difficulties at the data analysis stage, when certain subgroups are excluded post hoc, often unintentionally. This can lead to strongly biased results. In particular, selection bias may lead to inaccurate estimates of correlations between exposures and outcomes, either exaggerating or underestimating true effects. A well-known example would be smoking and lung cancer where a strong emphasis is put on older smokers, included in the study because they have already developed lung cancer. Without making careful adjustments, the degree of association can appear to be stronger than it truly is as a result of neglecting information on younger smokers who have yet to develop the disease if ever they do.

The most effective way to reduce or eliminate the risk of misleading inferences, consequent upon selection bias, is through a rigorous experimental design.³ Whenever possible, this would involve randomization so that, at least in principle, all the individuals in the target population have an equal chance of being selected. A close and critical examination of recruitment methods can help identify groups that are under- or overrepresented. While not perfect, we can often make up for ground lost at the design stage by careful statistical modeling alongside other adjustment techniques such as re-weighting and sensitivity analysis. In some cases, we may be able to reduce most if not all of the bias and, certainly, that is an important first step. No less important is to identify as many sources of selection bias as possible, some of which we can try to address and some of which may be beyond our reach, and to describe them and their potential implications as accurately as possible. This will allow any decision makers to gain realistic insights into the lack of precision of findings, inevitable sampling error, as well as systematic errors, and to draw any conclusions with that in mind. Additionally, this kind of critique can be of help to future investigators as a means to building designs less prone to errors in inference linked to selection bias.⁴

Confirmation bias and gut feelings

The chart shown in Figure 1 results from a sequence of iterations.⁵,⁶ At a meeting of the Royal Statistical Society Legal Section on 17 September 2024, the chart was described as essentially “worthless” since it would be near impossible to obtain without such a sequence of iterations. At the outset,⁶ Rose finds the chart to have contained several holes, that is to say, suspicious events at which the accused was not present. This is the point at which concerns and gut feelings drive the construction of the chart. The accusers’ focus was entirely on nurse 23. They agreed with hospital management that they had no material evidence of any kind; there was nothing concrete pointing in the direction of nurse 23. They were adamant, though. They had gut feelings. Now, almost all of the cases being divided into suspicious and non-suspicious were anything but clear-cut. We know this from the accusers themselves who not only came to conclusions very different from those of highly qualified neonatal pathologists but, on many occasions, reclassified events from suspicious to non-suspicious and vice versa.

The original chart, identified by Rose and Elston⁶ and before the gaps were filled in on the basis of gut feelings—otherwise known as confirmation bias—is shown in Figure 2. This is not the precise chart since there were changes to events, some being added and some removed, as well as the set of nurses not being the same. Our purpose is simply to show the impact of not modifying the original chart. One of the strongest biases—failure to account for time on duty—remains and explains why the presence of nurse 23 continues to stand out in the chart. The nurse’s presence appears strong, although the visual impact of Figure 2 is so much weaker than the visual impact of Figure 1 that we can easily suppose that it would not have made anything like so great an impression on the jury.

Figure 2.

The appearance of a chart that, following iterative updating sparked by selection and confirmation bias, would result in Figure 1. The visual impression is not only much weaker but, unlike Figure 1, a cursory glance reveals nothing so very extraordinary.

One single hole in the unbroken line of presence, and the case for the prosecution team suffers greatly. Why? For the simple reason that the prosecution themselves would be pointing to something suspicious that did not involve the suspect. Or, even worse for the prosecution, that the definition of what constitutes a suspicious event is so vague that it cannot be relied upon; this, in turn, leads to the inevitable question: just what do you mean by a suspicious event? Close scrutiny here would also result in the prosecution’s case collapsing.

And here, the largest bias of all kicks in. The accusers are convinced; their concerns, their gut feelings, cannot be wrong. Confirmation bias takes hold, and a reexamination of the holes in the line of presence takes place. A case of death by asphyxiation, the most obvious of all triggers for judging an event to be suspicious, turns out to have taken place when the accused was not present. Closer examination and this case is reclassified as non-suspicious. The same procedure is applied to the other holes. This results in having rather fewer cases than we started out with, a much less impressive looking chart, but that also can be fixed by the simple device of expanding the number of cases under consideration, some of which—on the basis of gut feelings—may look suspicious. Inevitably, the confirmation bias, fueled by the strength of concerns and gut feelings, will manifest itself through a sequence of iterations that result in Figure 1.
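The effect of filling in the holes can be mimicked with a toy sketch. Everything in it is hypothetical: events are generated with suspicion and presence fully independent, with invented probabilities, and the single `fill_holes` step stands in for the whole sequence of reclassifications described above.

```python
import random


def simulate_events(n_events=61, p_suspicious=0.4, p_present=0.3, seed=0):
    """List of (suspicious, present) pairs drawn independently.

    The probabilities are illustrative assumptions, not estimates.
    """
    rng = random.Random(seed)
    return [
        (rng.random() < p_suspicious, rng.random() < p_present)
        for _ in range(n_events)
    ]


def presence_rate(events):
    """Share of suspicious events at which the nurse was present."""
    flagged = [present for suspicious, present in events if suspicious]
    return sum(flagged) / len(flagged) if flagged else float("nan")


def fill_holes(events):
    """Reclassify as non-suspicious every suspicious event the nurse missed."""
    return [(suspicious and present, present) for suspicious, present in events]


if __name__ == "__main__":
    events = simulate_events()
    print("before reclassification:", presence_rate(events))
    print("after reclassification: ", presence_rate(fill_holes(events)))
```

Whatever the rate before reclassification, the rate afterwards is 100% by construction whenever any suspicious event remains. The perfect line of presence is a product of the procedure and carries no information about the nurse.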

Digging more deeply

In our view, the kinds of chart discussed here ought not to be admissible by a court. Most jurors will not fully understand why these charts can be so misleading. But misleading they are, and, as currently used, they fail completely to even begin to address the relevant question, the one that supposedly motivated the chart’s construction in the first place; that is, how to assess the probability of the suspect finding themselves in the close vicinity of the events making up a cluster of supposed crimes. And, no less important, after accounting for the differing lengths of time on duty, does this probability differ significantly from that for any other randomly selected nurse? At best, the chart is a distraction from this. Practical conclusions need to be based on a more careful analysis and not simply on the visual impressions that can be gained from a roster chart.

A more careful analysis begins with a more carefully formulated question. It has to be something more precise than—we have come up with a list of suspicious events, specifically 25 of them, and it turns out that nurse 23 is present for all 25; how likely is that? There is no way of tackling that question, especially since we are told nothing about how the list was cobbled together and the extent to which the repeated inclusion of nurse 23 was mostly due to the influence of selection and confirmation bias—in the investigator’s mind, being nurse 23 can be enough to view an event as suspicious.

The question needs to be more specific, and we suggest the following. Given that an event is deemed suspicious, then: (i) what are the chances that a randomly chosen nurse is present, and (ii) what are the chances that nurse 23 is present? This does not eliminate strong biases in the definition of a suspicious event, but does open up a way, via a simple model and some simple working assumptions, to address our main question without leaning so heavily on those empirical quantities that reflect little more than the bias itself. We can estimate, in close to a fully unbiased way, the probability of presence for a non-accused nurse at events, whether considered to be suspicious or non-suspicious. It is this knowledge that we rely upon, in conjunction with what can be said about relative risk, to infer the probability of the presence of nurse 23 indirectly when events have been deemed to be suspicious. The details are outlined in the following two sections.

Statistical methodology

First, some notation. For the $i$th event, $i = 1, \dots, 61,$ $X_{i}$ assumes the value 1 when the event is deemed suspicious and is 0 otherwise. $Y_{i}$ takes the value 1 to indicate presence and 0 to indicate absence. The inferences we make will lean wholly on simple probability structures, and some of the unknown parameters of those structures will be obtained from the data. An example of that would be the ratio $25 : 61$ of events deemed suspicious. We take that number to correspond to the probability of a randomly chosen event being designated suspicious. Other aspects we may not wish to tie down fully. For example, not only could any one of the 61 events turn out to be a suspicious event but, when counting the number of suspicious events, we do not stop at 25. We allow, in an abstract probabilistic way, the number of potential suspicious events to go all the way to 61. Of course, for the non-accused nurses, the probability of being present for more than 25 suspicious events is so small that it is close to zero. The distinction is nevertheless of importance and allows us to put together a simple model in order to address the main questions of interest.

We define $S$ to be the set of suspicious events and $S^{C}$ to be the set of non-suspicious events. We write $\Pr(A | B)$ to denote the probability of seeing an outcome $A$ when $B$ has occurred or is known to hold. For $i = 1, \dots, 61,$ we write $p_{0}(i) = \Pr(Y_{i} = 1 | X_{i} = 1)$ to indicate the probability for a randomly chosen nurse to be present at event $i$ when the event is deemed to be suspicious. For nurse 23, we write this same probability as $p_{23}(i).$ The odds are defined to be $ϕ_{0}(i) = p_{0}(i) / (1 - p_{0}(i))$ and $ϕ_{23}(i) = p_{23}(i) / (1 - p_{23}(i))$, respectively. The odds ratio is then $ψ(i) = ϕ_{0}(i) / ϕ_{23}(i)$ and the relative risk is given by $θ(i)$, where $θ(i) = p_{0}(i) / p_{23}(i).$ We also consider the related definition, $p_{0}^{*}(i) = \Pr(X_{i} = 1 | Y_{i} = 1)$ and so on, leading to a new set of definitions which we identify by a star so that we have $p_{0}^{*}(i)$, $p_{23}^{*}(i)$, $ψ^{*}(i)$, and $θ^{*}(i)$ in place of $p_{0}(i)$, $p_{23}(i)$, $ψ(i)$, and $θ(i).$ In general, $p_{0}^{*}(i) \neq p_{0}(i)$ and $p_{23}^{*}(i) \neq p_{23}(i).$

Inference will be based on summary statistics that are not indexed by $i,$ in particular, cumulative measures of presence. For a randomly chosen nurse, the probability of being present for $j$ events or less is given by $F_{0} (j)$ . For nurse 23, we write this as $F_{23} (j) .$ We also need the probabilities of being present for $j$ events or less, given that the events are deemed to be suspicious. We write these as $F_{0} (j : S)$ and $F_{23} (j : S) .$ While our main focus is on $F_{23} (j : S),$ in which the possible dependence on $S$ is clearly expressed, for a randomly chosen nurse, the first of the following three propositions helps point the way forward.

Proposition 1.

For the set of non-suspected nurses, $F_{0} (j : S) = F_{0} (j : S^{C}) = F_{0} (j) .$

This describes independence between presence on duty and whether an event occurring during that shift is labeled as suspicious. This has to be a fair working assumption for this group of nurses. The labeling itself is somewhat tenuous and, coming years after the shift was worked, it is hard to conceive of anything but the most minimal degree of association. This is not to say that no possible association could be conjectured. It could be argued that the suspected nurse, being relatively senior, would have some small influence on which of the other nurses were present at those shifts. This could be explored, but we do not do so here. We need two further results:

Proposition 2.

Switching the order of conditioning, we write $ψ (i) = ψ^{*} (i) .$

This powerful result is the basis of case-control methodology and is shown in almost all introductory epidemiology texts. Our interest is in the relative risk and, since the odds ratio approximates the relative risk, we can say that $θ (i) \approx θ^{*} (i) .$ This approximation can be accurate when the rates themselves are relatively small; otherwise, it can be quite crude. In the following special case, we have a strong result.
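The textbook fact behind Proposition 2 can be checked numerically on a generic $2 \times 2$ table. The sketch below is illustrative only: the cell counts `a`, `b`, `c`, `d` and the function names are assumptions for the example, not quantities from the case, and the table is the standard exposure-by-outcome layout rather than the specific nurse comparison defined above.

```python
def odds(p):
    """Odds corresponding to a probability p."""
    return p / (1.0 - p)


def odds_ratio_by_rows(a, b, c, d):
    """Odds ratio conditioning on rows of the table [[a, b], [c, d]]."""
    return odds(a / (a + b)) / odds(c / (c + d))


def odds_ratio_by_columns(a, b, c, d):
    """Odds ratio for the same table, conditioning on columns instead."""
    return odds(a / (a + c)) / odds(b / (b + d))


def relative_risk_by_rows(a, b, c, d):
    """Ratio of the two row-wise probabilities of falling in column 1."""
    return (a / (a + b)) / (c / (c + d))


if __name__ == "__main__":
    # Moderate rates: the odds ratio is symmetric, the relative risk differs.
    table = (12, 88, 5, 95)
    print(odds_ratio_by_rows(*table), odds_ratio_by_columns(*table))
    print(relative_risk_by_rows(*table))
    # Small rates: odds ratio and relative risk nearly coincide.
    rare = (2, 998, 1, 999)
    print(odds_ratio_by_rows(*rare), relative_risk_by_rows(*rare))
```

Both ways of conditioning reduce to the cross-product ratio $ad / bc$, which is why the order of conditioning does not matter for the odds ratio, whereas the relative risk agrees with it only approximately and only when the rates are small.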

Proposition 3.

Assuming proportional hazards, we can write, $θ (i) = ψ (i) = ψ^{*} (i) .$

The proposition follows as an immediate consequence of Theorem 1 in Xu and O’Quigley.⁷ We can exploit the result to obtain an estimate of $θ (i)$ given what we know about the probabilities of observing a suspicious event when present, both for a randomly chosen nurse and for nurse 23. We can write more simply, just $θ$ for the relative risk since it does not depend on $i$ under proportional hazards. This is a crucial condition, and we return to it in the next section. Our model can be written:

$$\log [1 - F_{23} (j : S)] = θ \times \log [1 - F_{0} (j : S)] = θ \times \log [1 - F_{0} (j)] . \tag{1}$$

A fair challenge to this model might be that it treats the discrete counts $j$ as though a continuous variable, whereas the proportional hazards model itself is structured around a time continuum. The model could be restructured within a discrete framework, but, as is generally the case when approximating discrete phenomena with a continuous model, unless there are a very small number of categories (not the case here), then the level of accuracy of the approximation will be more than enough for inferential purposes.
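Model (1) is simple to evaluate once a baseline distribution and a value of $θ$ are in hand. The sketch below solves (1) for $F_{23}$, giving $F_{23} = 1 - (1 - F_{0})^{θ}$. The baseline values and the value of $θ$ used in the example are invented for illustration and are not estimates from the data.

```python
def accused_cdf(baseline_cdf, theta):
    """Apply model (1): log(1 - F23) = theta * log(1 - F0), pointwise.

    baseline_cdf holds values of F0(j) for increasing j; theta is the
    relative risk, assumed constant in j under proportional hazards.
    """
    if theta <= 0:
        raise ValueError("theta must be positive")
    return [1.0 - (1.0 - f) ** theta for f in baseline_cdf]


if __name__ == "__main__":
    f0 = [0.10, 0.50, 0.90]          # hypothetical baseline values of F0(j)
    print(accused_cdf(f0, 1.0))      # theta = 1 leaves the baseline unchanged
    print(accused_cdf(f0, 2.0))
```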

A complete Bayesian analysis would consider the rival explanations for the observed events, together with all sources of information, that is, the observations themselves, any other relevant information we may have, and the plausibility of differing explanations. The analysis then produces posterior probabilities that can guide investigators in their search for the truth. The philosophical basis of Bayesian approaches is increasingly convincing to applied researchers, and we can anticipate these methods gaining in popularity in the legal setting. For now, we do not adopt such an approach and, in particular, the plausibility of different explanations plays no part in our analysis. That said, we can make use of Bayesian thinking—we might describe this as a partial Bayesian analysis—by exploiting all the information we have on relative risk. We return to this in the following section.

We now have all the tools we need. The goal of a statistical analysis is to provide us with support for making logical conclusions. Any analysis, including Bayesian analyses, will lean on assumptions, working conditions, and approximations. Bringing these under scrutiny will involve further study of two components. The first of these components is to consider how credible the assumptions are, while the second, the more important consideration, will focus on the impact of plausible departures from these assumptions and approximations. Fully scrutinizing all nooks and crannies of these would result in a 100-page document, and, since there are always several approaches to any statistical analysis, and we do not have 100 pages, the question to keep in mind is whether or not it is likely that any changes to the analytic approach would have more than a minor impact on the conclusions.

The first step of an analysis is to choose a sampling framework within which, in an abstract sense, we can view the observations as having been generated. The next step is to decide which outcomes are to be treated as observations on a random variable and which ones we take as fixed. Sample size is generally the outcome of a random variable, although it is almost always treated as fixed. We take the sample size of 61 for the events studied here to be fixed. Depending on the different experimental contexts, we will sometimes view the number of suspicious events as fixed at 25 and sometimes not, allowing for variability. Different, equally reasonable, experimental constructs could be employed here. They will generally have no more than a small impact on our conclusions.

Statistical case against nurse 23

The concerns and gut feelings resulting in the chart shown in Figure 1 appear to show a strong correlation between the occurrence of suspicious events and the presence of nurse 23. A very superficial and incorrect analysis might conclude the probability of the nurse being present at all 25 suspicious events to be 100% or somewhere close. That number, though, is grossly inflated as a result of selection bias and confirmation bias, the avoidance of which results in a far more accurate estimate of this probability, around 10%. There are simply no grounds for the statement, often repeated, that “whenever something goes wrong, the accused nurse is always there.” But let’s not discard the chart into the trash just yet. Within it, the information we need can be found. Some work is required.

Working constraints and conditions

Every statistical analysis leans to some degree on working assumptions or conditions; the more stringent the conditions, the less generally applicable the results. The conditions may describe restrictions on the nature of the observations, but may also indicate requirements concerning the anticipated properties of the methods we use under varying hypotheses. Since we aim for the greatest generality, we have two things in mind when looking at the working conditions: (1) to find broad physical support for the conditions and (2) to assess to what extent realistic departures from these conditions will impact any conclusions. Statisticians refer to this second concern as one that describes robustness. Here, we clearly identify our working conditions so that the critical reader, the statistical analyst in particular, can reproduce our findings with the help of the previous section, challenge them through different assumptions or carry out a study on how robust they are. The following condition helps to support our methodological approach. Furthermore, should the condition fail to hold, whereby inferences could change by a change in the labeling, then clearly that would be problematic.

Condition 1.

Statistical inferences should be invariant to changes in the labeling of the events.

The 61 events studied are ordered from 1 to 61, but we will not always know precisely how the ordering has been accomplished. The ordering may correspond to a time ordering, the first event in time being given the label 1 and so on, but, in general, we will not know this. When we do, we may be able to further exploit the added information. For now, we assume that the ordering is somewhat artificial, and so we would like our results to remain unchanged following any permutation of the ordering. Simply put, this means that our attention is focused on the number of events and their concentration over the whole time frame under study, but not on the order in which they may have occurred.

Of no less importance, the constraint implies a lack of correlation between the placing of the “suspicious” events on the time scale, 0 to 61, and the “non-suspicious” events on this same scale. In particular, the way the data are generally presented, including our own presentation, implies that the first 25 events are all suspicious and the latter 36 events are non-suspicious. While the data are presented that way, our analysis does not assume such a structure and, indeed, the results would be unchanged following any permutation of event labels.

Distribution of presence for non-suspect nurses

Our purpose is to say as much as we can about $F_{0} (j : S)$ and $F_{23} (j : S),$ that is, the times of presence at suspicious events, both for non-suspect nurses and for the suspect nurse. We begin with the non-suspect nurses. Figure 3 shows the relevant data with all the information concerning nurse 23 removed. The estimates of times on duty for the non-suspect nurses are based only on the data shown in the figure.

Figure 3.

The chart shown to the jury, but with all information on nurse 23 removed.

Next, we need to make inferences about the missing observations; these are those observations that would have been in the roster chart had the non-suspicious events been included.

Condition 2.

Following Little and Rubin,⁸ all possible configurations of the 36 non-suspicious events are assigned probabilities under a missing completely at random (MCAR) assumption.

For example, from Figure 1, for nurse 33, we see four events, and so we estimate as 4 out of 25 the probability of nurse 33 being present for an event. We do not know this nurse’s presence for those 36 non-suspicious events, but based on MCAR, we take the number to be binomially distributed with $n = 36$ and $p = 4/25$. No new data are generated, but all possible outcomes and the probabilities we can associate with those outcomes are evaluated. For a nurse other than nurse 23, the assumption is equivalent to assuming independence between their being present on a particular shift and the designation, many years later, of that shift being one during which something suspicious is deemed to have occurred. More complex models could be constructed, but, at the very least, that looks like a reasonable first approximation. Our next assumption is:
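As an illustration of the MCAR step for nurse 33, the following minimal sketch evaluates every possible count of presences at the 36 non-suspicious events, using only the figures quoted above ($n = 36$, $p = 4/25$). The function and variable names are ours, introduced for illustration; nothing is simulated.

```python
from math import comb


def binomial_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k presences in n events, each with probability p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


# Nurse 33: present at 4 of the 25 suspicious events (read from Figure 1).
p_hat = 4 / 25
n_missing = 36  # non-suspicious events absent from the chart

# Probability attached to each possible unobserved count 0, 1, ..., 36.
pmf = [binomial_pmf(k, n_missing, p_hat) for k in range(n_missing + 1)]

# Expected number of unobserved presences, which equals n * p.
expected_missing = sum(k * q for k, q in enumerate(pmf))
```

The list `pmf` is the full set of outcome probabilities that the text refers to; its mean, `expected_missing`, is simply $36 \times 4/25$.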

Condition 3.

For the non-suspected nurses, the time on duty, ranging from 0 to 61 of the studied events, is taken to follow a beta-binomial model. The beta distribution is the maximum likelihood fit to the observations obtained from Figure 1, excluding the data from nurse 23.

Figure 4 shows Condition 3 to provide a good fit to the observations. Other models could be employed, but without appealing to some very involved structure that would be difficult to justify, it is hard to see how we could improve on Figure 4. Our next step is straightforward and amounts to sampling a probability of presence from this distribution and then deriving the distribution of presence under a binomial assumption and a sample size of 61.

Figure 4.

A histogram showing the raw data for the presence of the unsuspected nurses. A beta-binomial distribution is fit to the data using maximum likelihood estimates.

The result is shown in Figure 5.

Note that we could obtain something similar to this figure more directly based on information from human resources at the hospital in question. This information could also be used to validate our results. Such information is typically difficult to obtain. However, it would be worth checking when available. The figure estimated on this basis provides an excellent summary of the data.

Figure 5.

For the unsuspected nurses, the beta-binomial distribution shows that the probability of covering 25 shifts or more out of 61, given that something suspicious occurs, is 0.013.
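The compounding step described above (a probability of presence drawn from a beta law, followed by a binomial count out of 61) has a closed form, the beta-binomial distribution. The sketch below evaluates it and the tail probability of 25 or more presences. The shape parameters `ALPHA` and `BETA` are hypothetical placeholders, not the maximum likelihood estimates used for Figure 4, so the tail value it produces should not be expected to equal 0.013.

```python
from math import exp, lgamma


def log_beta(a: float, b: float) -> float:
    """Logarithm of the beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def beta_binomial_pmf(k: int, n: int, a: float, b: float) -> float:
    """P(X = k) when p ~ Beta(a, b) and X | p ~ Binomial(n, p)."""
    log_comb = lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
    return exp(log_comb + log_beta(k + a, n - k + b) - log_beta(a, b))


N_EVENTS = 61
ALPHA, BETA = 1.5, 6.0  # hypothetical values, for illustration only

pmf = [beta_binomial_pmf(k, N_EVENTS, ALPHA, BETA) for k in range(N_EVENTS + 1)]
mean_presence = sum(k * q for k, q in enumerate(pmf))  # n * a / (a + b)
tail_25 = sum(pmf[25:])  # P(X >= 25) under the hypothetical parameters
```

Replacing `ALPHA` and `BETA` by fitted values would give the analogue of the quantity reported in the caption of Figure 5.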

Distribution of presence for suspected nurse

As a result of selection bias, the empirical estimate of presence for the suspected nurse—at 100%—has no value. We need something else, and we describe that here. We will exploit what we know about the distribution of presence for the non-suspected nurses, and we will then combine this information with an estimate of the relative risk.

We specify two further conditions:

Condition 4.

The relative risk is taken to be constant in time, time here being the indices $1, \dots, 61$ identifying the events. This corresponds to the proportional hazards model of Section “Statistical methodology” that defines the relationship between $F_{23} (j : S)$ and $F_{0} (j) .$

Note that this condition will automatically hold under Condition 1, in which case, assuming Condition 4 to hold does not imply any loss in generality.

Condition 5.

The two extremes for the unobserved times of presence for the suspected nurse for the set of non-suspicious events are taken to be either: (i) zero, or (ii) distributed according to a binomial law with a rate equal to one-third more than the overall mean rate of all nurses, and with $n = 36$.

These two conditions enable us to obtain a simple estimate of relative risk. The first of these is readily defended against challenge as a result of Condition 1. The second condition appears to be quite reasonable and quite conservative, given the context. By conservative, we mean being more prejudicial than warranted against nurse 23 since it is known that the nurse worked more than one-third over the average hours. In any event, if needed, we could proceed without the condition by putting a more Bayesian flavor on the approach. We know, for instance,⁶ that there were 10 holes in the first version of the chart. These events were re-classified as non-suspicious, so we might infer that, for nurse 23, there was a presence for at least 10 non-suspicious events. This information could be combined with uncertainty over these counts, as well as any uncertainty on the entries to the table of suspicious events, in such a way as to obtain a prior distribution for $θ .$ The task is by no means straightforward, but it is the kind of challenge with which a skilled Bayesian statistician is familiar.

Knowledge of the relative risk $θ$, combined with the simple model of equation (1), will enable us to make inferences about the probabilities of presence corresponding to nurse 23. We are not without knowledge of the possible values that might be assumed by the unknown relative risk $θ$. For this, we use estimates of $ψ^{*}$ and appeal to Proposition 3. We are unable to say much concerning $p_{23} (i)$, but we can say something about $p_{23}^{*} (i)$, that is, we first condition on presence and subsequently count the number of suspicious events. The problem of identifying the unknown $θ$ then comes down to deciding how many entries to the corresponding column in the table are plausible while, in parallel, doing the same for the missing entries for the 36 non-suspicious events. If, for the data we do not have concerning those 36 non-suspicious events, we take the most extreme position, namely that nurse 23 is never present, then, clearly, given presence, the probability of observing a suspicious event is 100%.

This provides us with one extreme for the range of plausible values for $θ$. To see this, note that under the independence assumption, the probability of a non-suspected nurse being present at a suspicious event, given that he or she is present at one of the 61 events, would simply be $25 / 61 = 0.41$. The relative risk is then the ratio of the two probabilities, $1 / 0.41 = 61 / 25 = 2.44$. It would be fair to argue that this corresponds to the highest possible value for the relative risk.
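The arithmetic behind this upper bound can be sketched as follows. The function `relative_risk` is our own illustrative helper: it takes nurse 23 to be present at all 25 suspicious events, as in the chart, and at a given number of the 36 non-suspicious events, and compares the resulting rate with the baseline $25/61$.

```python
N_SUSPICIOUS = 25
N_TOTAL = 61


def relative_risk(non_suspicious_presences: int) -> float:
    """Rate of suspicious events given presence, relative to the baseline 25/61."""
    rate_nurse_23 = N_SUSPICIOUS / (N_SUSPICIOUS + non_suspicious_presences)
    baseline_rate = N_SUSPICIOUS / N_TOTAL
    return rate_nurse_23 / baseline_rate


# Extreme case: never present at a non-suspicious event, giving 61 / 25 = 2.44.
upper_bound = relative_risk(0)

# The ratio falls as presence at non-suspicious events grows,
# reaching 1 when the nurse is present at all 36 of them.
profile = [relative_risk(k) for k in range(0, 37)]
```

Any presence at non-suspicious events lowers the ratio, which is why 2.44 can be regarded as the largest value the relative risk could take.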

For the lowest value, we might suppose nurse 23 to have been present for a percentage of the non-suspicious events that equates to the average time on duty of this nurse. We know this time to be greater than one-third above the average. These numbers result in a relative risk of the order 1.2. We might then proceed by building a distribution on the interval (1.20, 2.44). A natural candidate prior would be a beta model on the interval (0, 1) re-scaled to the interval (1.20, 2.44). The parameters of the beta model would reflect the prior weighting, favoring values closer to the lower or the upper ends of the interval. Given that the beta model provides for a conjugate prior, it is very simple to run a series of Bayesian analyses based on prior pseudo-data put together by the investigators.
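A minimal sketch of such a re-scaled beta prior is given below. The interval endpoints are those of the text; the shape parameters passed to the sampler are hypothetical and would, in practice, encode the investigators' prior weighting toward one end of the interval or the other.

```python
import random

THETA_LOW, THETA_HIGH = 1.20, 2.44  # plausible range for the relative risk


def sample_theta_prior(a: float, b: float, rng: random.Random) -> float:
    """Draw from a Beta(a, b) law on (0, 1) re-scaled to (THETA_LOW, THETA_HIGH)."""
    u = rng.betavariate(a, b)
    return THETA_LOW + (THETA_HIGH - THETA_LOW) * u


rng = random.Random(0)  # fixed seed so that the sketch is reproducible

# Hypothetical symmetric weighting: Beta(2, 2) favors neither end of the interval.
draws = [sample_theta_prior(2.0, 2.0, rng) for _ in range(10_000)]
prior_mean = sum(draws) / len(draws)
```

Choosing `a` larger than `b` shifts the prior mass toward 2.44, and choosing `b` larger than `a` shifts it toward 1.20.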

We obtain our relative risk estimate of $θ$ in the following way. Using MCAR for the non-accused nurses, and all 25 shifts of presence together with a binomial simulation based on Condition 5 for nurse 23, we obtain a complete table. From this, we obtain the empirical rate for the non-accused nurses, which is just their presence during suspicious events divided by their total presence. For nurse 23, we have the presence at 25 suspicious events divided by the sum of 25 plus the binomial simulation for presence in the non-suspicious category. This is our second rate. Our relative risk estimate of $θ$ is simply the corresponding odds ratio $ψ^{*}$ for the two rates. We can repeat the process many times to obtain an average value for $θ$, which is approximately 1.9. We then use the value 1.9 for $θ$, together with a proportional hazards transform of the survival function that corresponds to the distribution for the non-accused nurses shown in Figure 5. The survival function is simply the summed probabilities taken from Figure 5, that is, the probability of being greater than or equal to $x$ rather than just the probability of being equal to $x$. This results in the top left-hand figure shown in Figure 6. The probability of being present at 25 or more suspicious events is estimated to be around eight times higher for nurse 23 than that for a randomly chosen non-accused nurse.
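The final step can be checked numerically. Equation (1) is not reproduced in this section, so the sketch below assumes the proportional hazards transform takes the form $S_{23}(x) = S_{0}(x)^{1/θ}$; this form is our assumption, adopted because it is consistent with the figures quoted in the text. The inputs are the tail probability 0.013 from Figure 5 and the estimate $θ = 1.9$.

```python
THETA = 1.9               # estimated relative risk, from the text
TAIL_NON_ACCUSED = 0.013  # P(25 or more presences), non-accused nurses, Figure 5


def ph_transform(survival: float, theta: float) -> float:
    """Assumed proportional hazards transform: S_23(x) = S_0(x) ** (1 / theta)."""
    return survival ** (1.0 / theta)


# Tail probability for nurse 23 under the assumed transform (roughly 0.10).
tail_nurse_23 = ph_transform(TAIL_NON_ACCUSED, THETA)

# How many times larger this is than the non-accused tail (roughly eight).
ratio = tail_nurse_23 / TAIL_NON_ACCUSED
```

Under this assumed form, the tail probability for nurse 23 comes out near 10% and the ratio near eight, in line with the values stated in the surrounding text.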

Figure 6.

Top two figures show the distribution of times on shift for all nurses and, via a proportional hazards transform, the distribution for nurse 23. The bottom two figures show this distribution for those nurses working 20% and 33% more than the average, respectively.

When something suspicious happens, where is nurse 23?

Finding nurse 23 to be about eight times more likely to be present at 25 suspicious events than any other nurse, chosen randomly from the group of non-suspected nurses, might give us food for thought. While the probability of presence at all 25 events is small, even for nurse 23—an order of magnitude less than would be guessed by a glance at Figure 1—for the non-suspected nurses it is close to zero. That alone might strike us as evidence against nurse 23. Would that be true though?

The top right-hand figure in Figure 6 presents the area under the curve (AUC), which equates to the mean number of shifts of presence at a suspicious event given that it cannot exceed 25: 9.9 for the non-suspected nurses and 13.6 for nurse 23. Even though we are very far from the wildly incorrect assessment of the nurse being always there when something suspicious takes place, there does appear to be evidence that the probability of presence for nurse 23 is greater than for a randomly chosen nurse. Suppose, however, that, instead of contrasting the experience of nurse 23 with that of all nurses, we consider only those nurses whose time on duty exceeds the overall average by 20% or more. The result can be seen in the bottom left-hand figure of Figure 6. The blue line does not move. The black line moves a lot, and we can see that much of the difference has disappeared, the mean for the non-suspected group having increased to 12.4. Let’s take it one more step and consider only those nurses whose time on duty exceeds the overall average by 33% or more. The result can be seen in the bottom right-hand figure of Figure 6. Again, the blue line does not move, and the black line has moved so much that the entire difference has disappeared, the means now being the same.
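The identity behind the AUC interpretation is that, for a count $X$ on $0, 1, \dots$, summing the survival function $P(X \geq x)$ over $x = 1, \dots, 25$ gives the mean of $X$ capped at 25. The sketch below verifies this on a hypothetical binomial distribution chosen purely for illustration; it is not the distribution of Figure 5 and does not reproduce the values 9.9 or 13.6.

```python
from math import comb


def restricted_mean_from_survival(pmf: list[float], cap: int) -> float:
    """Area under the survival curve: sum of P(X >= x) for x = 1, ..., cap."""
    return sum(sum(pmf[x:]) for x in range(1, cap + 1))


def restricted_mean_direct(pmf: list[float], cap: int) -> float:
    """Mean of min(X, cap), computed directly from the probabilities."""
    return sum(min(k, cap) * q for k, q in enumerate(pmf))


# Hypothetical distribution of presence out of 61 events (illustration only).
n, p = 61, 0.2
pmf = [comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)]

auc = restricted_mean_from_survival(pmf, 25)
direct = restricted_mean_direct(pmf, 25)  # agrees with auc up to rounding
```

The two computations agree, which is why the areas in Figure 6 can be read as mean numbers of shifts of presence, restricted to at most 25.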

It turns out that nurse 23 is a member of this group just described since, on average, the nurse’s time on shift duty was more than 35% above the overall average. In other words, what was taken to be strongly incriminating evidence has simply evaporated. Can anything then be added to the believed 100% correlation of the presence of nurse 23 at suspicious events? Yes. A lot can be said. Using no more than the estimated relative risk and comparing nurse 23 with a randomly chosen nurse, we find a correlation $R^{2}$ of 9%,⁹ a considerable drop from the commonly quoted 100%. The partial $R^{2}$, which adjusts for the times on duty, drops even further, to $< 1\%$.

How robust are these findings? Could they be impacted in any significant way by small changes to the data or by working with estimated parameters for the beta law that differ from the maximum likelihood estimates, but remain plausible? We carried out a limited study on this and concluded that the findings are reliable. Specifically, while modest changes in the parameters of the beta law will result in small changes to the areas under the curve at event 25 and beyond, the estimated relative risk is not impacted; the above arguments and the conclusion of the following section remain unaltered. Another study, once again a quite limited one, showed that, if we were to avoid framing the relevant distributions within a parametric beta model and appeal instead to a re-sampling approach, any changes would be small.

Summary of the statistical case against nurse 23

Ignoring selection bias, outcome bias, and confirmation bias and making an ill-considered deduction on the basis of the entries in Figure 1 would have us conclude that $p_{23} (i)$, the probability of nurse 23’s presence given the occurrence of a suspicious event ($X_{i} = 1$), is 100% or somewhere close. The degree of error here is far too large to ignore, and we need to proceed in a less direct way. We still make use of the information in the chart, but only to make inferences on the distribution of times on duty for the non-accused nurses. We take the estimate of that distribution to be reliable and, in conjunction with the model of equation (1) and what we can say about the relative risk parameter $θ$, we use all of the combined information to evaluate the probability of nurse 23 being present for any given number of suspicious events.

Nurse 23’s presence at all 25 suspicious events, in terms of probability, while very far from the crude empirical estimate that ignores biases, is still about eight times what is found for the unsuspected nurses. However, if instead of contrasting nurse 23’s presence to the distribution of presence for a randomly chosen nurse selected from all nurses, we were to only consider those nurses working the most shift hours, we find that for this hard-working subgroup, the probability of presence when something suspicious takes place is estimated to be about the same as that for nurse 23. Knowing that the hours worked by nurse 23 put the nurse well within this hard-working subgroup, we can conclude that nothing unusual was observed. When something suspicious occurs, the probability of being in the vicinity of those events is the same for nurse 23 as for about five or six of the hardest-working nurses. If this kind of presence is the litmus test for guilt, then the statistical case against nurse 23 is non-existent. Non-existent in that there is simply no evidence pointing toward nurse 23 as the culprit, any more than to those other five or six potential suspects, all of whom showed a degree of presence at suspicious events no less than the presence of nurse 23.

Deeper and more thorough investigations would surely be needed in order to unmask, from among the five or six potential culprits, the one who is truly the author of those crimes. Unless, that is, there is no culprit. Unless, that is, there never was any crime.

Conclusion

The immediate conclusion is that the evidence against nurse 23 in the form of the now famous chart (Figure 1) is non-existent. This conclusion will not be impacted in any significant way as a result of measured challenges to the working assumptions and approximations of this paper. The chart—the only one seen by the jurors—despite giving an initially abrupt visual impression, does not demonstrate an unusually high correlation between events, labeled as suspicious, and the presence of nurse 23. The apparent correlation is entirely explicable by the combined influence of selection bias, confirmation bias, and the well-above-average time on duty of the nurse when compared with others. Entering those times on duty into the equation leads to the complete disappearance of the spurious correlation.

A number of take-home messages suggest themselves. We limit ourselves to just one: that the kind of chart shown in Figure 1 ought never to be allowed as evidence in any court. Only one purpose could be construed for it: to mislead. And mislead it does! It misleads juries (we have ample evidence of that via several miscarriages of justice) and, well before the chart reaches the jury, it will mislead prosecutors and the police alike. Finding the visual impact of the chart thoroughly compelling, the prosecution and the police can easily be convinced that their case is solved; so much so that other, far more plausible, explanations for the apparent increase in deaths and collapses are wholly neglected. Our one take-home message would be simple. It would be unequivocal: keep the charts, those hopelessly biased and misleading charts, out of the courtroom. No jury will ever glean anything of value from them.

Footnotes

Acknowledgements

I am grateful to the editor and the reviewers for several detailed and thoughtful comments. I believe that these comments have resulted in a sharper and more accurate presentation. I would like to thank Sean Devlin for writing the code needed to obtain all of the results, numerical and graphical. Sean also spared no effort in checking and re-checking the calculations and using different methods to validate the findings. Together with other colleagues, he provided significant input to put to the test those working statistical concepts upon which the writer is relying in order to claim the reliability of the results.

ORCID iD

John O’Quigley

Funding

The author received no financial support for the research, authorship, and/or publication of this article.

Declaration of conflicting interests

The author declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

References

1. O’Quigley. Suspected serial killers and unsuspected statistical blunders. Med Sci Law 2025; 65: 28–35.

2. Rothman. Modern epidemiology. 2008.

3. Hernán, Robins. Causal inference. 2010.

4. Delgado-Rodriguez, Llorca. Bias. J Epidemiol Commun Health 2004; 58(8): 635–641.

5. O’Quigley. Logical and statistical errors in the investigation and prosecution of suspected serial killer nurses. Open Sci Framework 2024. DOI 10.17605/OSF.IO/2QUP7.

6. Rose. Why the Letby case isn’t closed. Unherd 2025; 36: 955–980.

7. O’Quigley. Proportional hazards estimate of the conditional survival function. J R Stat Soc Ser B: Stat Methodol 2000; 62: 667–680.

8. Little RJA, Rubin. Statistical analysis with missing data. Vol. 793, New York: John Wiley & Sons, 2019.

9. O’Quigley, Flandre. Predictive capability of proportional hazards regression. Proc Natl Acad Sci USA 1994; 91: 2310–2314.

Use of roster charts in the investigation and prosecution of nurses suspected of inflicting deliberate harm on patients
