Abstract
The replication crisis has led to more focus on methodology in psychological research. Meta-researchers have called for more direct replications, larger sample sizes, and preregistration of research plans. In this paper we point to two other aspects of research: the ontology of the research phenomenon and the emotions that are part of scientific work. We present an ethnographic case study of a replication in psychology and argue that these three aspects (method, ontology, and emotions) are inextricably linked. By analysing the emotional concerns of the replication researchers, we show that they are especially concerned about fidelity in the various relations in and around the replication study: the reliability of a connection, the trustworthiness of partners, the validity of an inference from one statement to another, the truth of a representation, etc. Their emotions worked as affective entanglements between these various fidelity relations. Central to this tangle of fidelity relations is the relation between the replicators and their object of study. During the replication process, the replicators’ view of what the phenomenon actually is changed fundamentally, eventually nearing a performative approach in which the ontology of the psychological phenomenon is understood as a process, enacted in the underlying praxis.
Introduction
In the last few decades, concerns have been raised about the reproducibility of research findings, especially in psychology. Central to the crisis that grew in the discipline from 2010 onwards has been a loss of confidence in the knowledge that psychologists have accumulated and in the standards that have been or remain in place for that knowledge production. Studies and events that have contributed to what has been called a “replication crisis” (Pashler & Harris, 2012) in the field of psychology include: the publication of a study claiming evidence of clairvoyance (Bem, 2011); replication studies that failed to reproduce the results of famous research (e.g., Doyen et al., 2012); a survey that showed a high prevalence of questionable research practices in psychology (John et al., 2012); and a large-scale effort to replicate psychological experiments that reproduced less than half of the original results (Open Science Collaboration, 2015). In response to these problems, a community of meta-researchers called for methodological improvements such as more direct replications (closely following the protocol of the original study), larger sample sizes, and preregistration of research plans (e.g., see https://www.cos.io).
In this paper, we widen the focus and look at the connection between method and two other aspects of scientific research: the ontology that it enacts, and the emotions that are part of scientific work. We present an ethnographic case study of a replication that we followed from the set-up to the interpretation of the results. 1 What the case shows, we argue, is that these three aspects—method, ontology, and emotions—are inextricably linked. In the next two sections, we lay the groundwork for our analysis by discussing ontology and emotions respectively.
Ontology
A major point of contention in the replication crisis in psychology is the variability of human behaviour due to its context sensitivity. Against the meta-researchers’ calls for more direct replication studies, critics from social psychology in particular have argued that experimental effects cannot always be expected to be reproduced in replication studies. According to them, social psychological phenomena depend on factors that vary between social, cultural, and historical contexts, and an experimental manipulation that produced a certain response in one study may have a different effect in a later experiment somewhere else, even if the experimental procedure is the same (e.g., Crandall & Sherman, 2016; Greenfield, 2017; Iso-Ahola, 2017; Stroebe & Strack, 2014).
Whereas these critics couple their rejection of direct replication with an epistemology and methodology geared towards the discovery of an underlying reality of stable, basic psychological mechanisms, others have recently drawn more radical ontological conclusions from the variability of human behaviour. Archer (2024) has argued that psychological phenomena are innately plastic. ‘Human beings are irreducibly complex psycho-biological systems whose emergent psychological properties are ontologically distinct from the physical causal mechanisms studied by the natural sciences’ (Archer, 2024, p. 577). Moreover, for Archer, people are systems that are more open to the context than natural systems typically are. Controlling this context in an experimental setting is impossible, and he therefore rejects both direct replication and experimentalism in general as viable methods in psychology. Newton (2023, p. 5) also contrasts the stability of the biological processes in human psychology and the instability of the socio-cultural aspects of human psychology. Human psychological fluidity is driven by language and technology, but also by the fact that people react to psychology and may change their behaviour in response to psychological theories and practices. Newton does not reject experimentalism but does plead for methodological pluralism to take into account the hybrid biosocial nature of human beings. 2
The theme of the intimate connection between psychological research and psychological phenomena is taken further by Derksen and Morawski (2022, 2024), who suggest a shift to a performative view of research, as co-creating rather than discovering realities. The ontology of psychological phenomena is then inextricably bound up with the scientific practices involved with them. Such a performative understanding is also advocated by Van Geert and De Ruiter (2022). They too see the replication crisis as a result of the “context-specificity” of many of the empirical findings in social science, adding that this context-specificity in turn is a result of the processual nature of psychological phenomena (p. 254). Van Geert and De Ruiter therefore argue for a process ontology for psychology, instead of the substance ontology that dominates the discipline and underlies the requirement of reproducibility by direct replication. Part and parcel of the process ontology is the praxis of psychology itself: it is this praxis that enacts the ontology underlying psychological research. Drawing on Bruno Latour’s actor-network theory, they describe psychology’s objects as constructions emerging out of the interconnected actions of agents in a network that includes both researchers and the participants in their research, as well as a heterogeneous collection of other actants such as students, publishers, funders, and readers outside academia.
Emotions
The replication crisis has been marked as much by the emotions it has engendered as by the debates about method and ontology that it gave rise to. Psychologists describe how having your work replicated can feel like an attack (Bavel & Cunningham, 2017); it can be “hurtful” and “stressful” (Lakens, 2016). In fact, the debates that follow replication attempts are often so heated that a “tone debate” was deemed necessary to discuss ways to make the conversation more fruitful (Derksen & Field, 2022). But the emotions do not just follow replication studies. In the ethnographic fieldwork that was part of our project “Replication in action,” we noticed that conducting a replication study can itself be emotionally fraught, and in some cases we were struck by the intensity of these emotions. 3 Replication researchers used terms like “devastating,” “insanely time-consuming,” “(we) can only lose,” “a complete shock,” or “dramatic” when they talked about their replication work. Moreover, several authors have shown that the connection that psychologists and other scientists have with their object of study is itself emotionally charged. Simon Cohn described how neuroscientists had to develop intimate relationships with their research participants in order to create objective facts in the fMRI scanner (Cohn, 2008). Pickersgill (2012) analysed the emotional relationships between neuroscientists, their colleagues, work, and research participants, and how these co-create the kinds of knowledge that laboratories produce. Fitzgerald (2013) analysed how emotions, thoughts, and feelings (e.g., for children and their families) are entangled in laboratory work on autism. And Peterson (2016) showed how doing experimental research with babies requires a good (“happy and calm,” p. 2) relationship with the baby, and how researchers have an emotional (“squealing with joy,” p. 5) relation with their data. 
That is, emotions can be understood as affective arrangements, or affective entanglements between researchers and their research objects (e.g., Myers, 2008; Slaby et al., 2019). Not only are researchers emotionally affected by research objects such as participants, psychological phenomena, or data, but their emotions can or should also affect these objects (e.g., by calming the babies).
Fidelity
For this paper, we followed one replication project from the early stages up to the publication of the report. This gave us the opportunity to study what we call the “emotional concerns” of the replicators throughout the replication process. We show that these concerns focused on the quality of the relations that were involved in their efforts. We initially phrased this quality as “trouw.” Trouw is a Dutch word, both a noun and an adjective, that means fidelity, faith(ful) or loyal(ty), and is related to reliability (betrouwbaarheid), as well as to trust and confidence (vertrouwen). We think fidelity is a good translation for “trouw,” because it can refer to the quality of a relationship between people (loyalty, for example), as well as to the relations between things (the exactness of a copy, for example). Thus, we choose the term fidelity because it captures the replicators’ concerns about the various relations in and around the replication study: the reliability of a connection, the trustworthiness of partners, the validity of an inference from one statement to another, the truth of a representation, etc. Their emotions arose from threats to that fidelity and the difficulty of maintaining the fidelity of all relations at the same time. The replicators felt responsible for this network of relations, and the disturbances created in and by their replication regularly necessitated repair work. Moreover, the replication affected the relation that the replicators had with their object of study. The replication raised fundamental ontological questions, and these were not merely intellectual issues but deeply felt emotional concerns.
Fieldwork and Methods
In 2021, we initiated an ethnographic research project to study 28 replication studies, with 16 of these in the behavioural sciences (see https://replicationinaction.blog/the-project/). Most of these studies were still ongoing at the start of our project. We conducted interviews, analysed documents (e.g., emails, protocols), and followed several projects in action (e.g., observing research meetings or experimental practices). In one of the studies, the first author (JB) was able to follow a significant number of meetings and activities. We decided to focus on this one case study to analyse the emotions involved. 4
Our case study concerned a direct replication of “Jones et al.,” an experiment from the 1980s that tested a theory about teaching strategies. 5 Four-year-old children were instructed in four different ways (conditions) to solve a puzzle. The experiments were recorded and coded in order to count the correct and unaided operations of the child. The original study was done in one lab with 32 children (4 x 8), one instructor, and one coder/researcher. One of the four conditions (instruction A) led to significantly better puzzle results, and this specific type of instruction is now widely recognised as an effective teaching strategy. The replication study was done in one lab and multiple daycare organizations, with over 250 children (64 per condition), four instructors, five coders, and a team of four researchers. The puzzles were identical to the original material and the instructions followed those in the original publication.
In October 2021, JB conducted a collective interview with the PI of the study, Maria (associate professor), her colleague Anna (assistant professor), and Susan (postdoc). The latter would conduct the study together with eight student assistants (including Nina, Joyce, Kate, Rachel, and Else). After that, JB attended approximately 50 (mostly online) meetings that Susan had with the project leaders (Maria and Anna), a methodologist (Harold), the student coders, or the student instructors. JB made extensive notes from all meetings and conversations, transcribed the interview and one more personal conversation with Susan, collected relevant documents (protocol, preregistration, papers, etc.), and coded all these documents with Atlas.ti.
There were some moments of reflection where the replication team responded to the findings of the ethnographic project: Susan and JB had a more personal conversation about JB’s findings and Susan’s struggles halfway through the replication project (Susan: “What do you think? Is there a future for replication research? Or is it too frustrating?”). The ethnographic project organized a workshop to discuss its first findings, which was attended by some replication team members (Derksen et al., 2025). The replication team invited JB to a dinner to celebrate the end of their project, at which they also discussed how the replication had made them think differently about science (they were “a bit shocked”). The replication team provided feedback on a draft version of this paper, and approved the first submitted version. Finally, the ethnographic work used for this paper was also part of another paper in which we mainly focussed on the diversity of negotiations, tacit knowledge, and practices surrounding replication more generally (Brenninkmeijer et al., 2025).
For the analysis of this paper, JB coded the quotes with emotions in her material (e.g., “I am very frustrated”) to analyse the underlying concerns of researchers. Since emotions are not always explicitly mentioned, but are often expressed more indirectly, she also analysed those occasions in which concerns were expressed with more general (but emotionally loaded) phrases, utterances (“it is hard,” sighing), or silences. 6 Because the term “concern” can refer both to a problem and to an emotion, we use the term emotional concern to indicate those concerns that are emotionally loaded.
Preparation Phase: Staying Faithful to the Original Study
The fieldnotes of the collective interview with Maria, Anna, and Susan mention “three happy women” who were very attuned to each other. They told us that the original study was important in their field and for teaching practices, and that they were enthusiastic about the original author’s work: “I’m secretly a bit of a fan of his work” (Maria). They explained that they found replication important: “I think it’s important to do replications, and I thought it would be super interesting to actually do it myself” (Maria). And they seemed happy about doing the research: “It brings together all the elements that I like about research” (Susan). They also appeared to be confident about the impact and relevance of this replication: “I hope that the way we are trying to set this up so transparently (. . .) That this will help other researchers to conduct that research in other contexts” (Anna). 7
And Susan said about replication in general:
I find it very important that everything that is said and done in education is based on what we really know (and is) proven to work. (. . . So . . .) if you can find great studies, and you can repeat them, then you can strengthen that knowledge base of effective educational research.
That is, at the start of their replication project, the researchers expressed faith in the original researcher’s work, in their position in the field, and in replication as a knowledge-strengthening tool.
In the same interview, however, the researchers also discussed several problems they had encountered while preparing their replication study. For example, while writing the proposal, it became clear that they would have to remake all the materials and draw up a protocol themselves.
I asked (Jones): what do you still have from your studies? Do you still have the protocol? But he is retired and said: everything is . . . (gone). There is nothing left. Nothing recorded, no puzzle. Nothing at all anymore. (. . .) So yes, that is difficult for such a replication study. (Maria)
And although the funding committee had accepted the proposal and protocol, Maria seemed to feel a bit uncertain about this limited information: “Yes, gaps . . . where are there no gaps? The only thing we really know for sure is that puzzle and its dimensions.” The “gaps” in the information about the original study were a problem, because they made it difficult to set up a faithful replication of the original study.
Another concern that was mentioned in the interview was the personal relation between the original author and the original instructor. As Anna explained:
In the original study there was one experimenter. And that was his wife. And so we find that a bit complicated, because (. . .) the experimenter was someone very close to the author. (. . .) And we already had quite some conversations with each other, like what the impact of this could have been. And what difference this could make in what we will possibly find. Yes. I think that is a rather typical aspect of this study that we are replicating.
Anna anticipated that the close relationship between the original researchers, which they obviously could not reproduce, might have affected the role and actions of the experimenter (Mrs Jones) and hence could create differences with their replication. This was also a reason that Susan felt a bit tense about the results: “I do think we can replicate the experiment. But I don’t expect to replicate the effect [laughs]. Yes, so there is still . . . there is tension there.” Here the tension between “replicating the experiment” and “not replicating the effect” was expressed with laughter. Susan was about to replicate a study which had been very influential in her field and which was much admired, and she seemed to feel uneasy because she did not expect to reproduce the results.
These uncertainties about what happened in the original experiment and the tensions about possible differences between the original study and their replication were among the reasons that they organized an expert panel to discuss their questions. One of these experts was Sara, a researcher who had been trained by the original researchers to conduct the puzzle experiment for a different paper and age group. In an email to the expert panel, the replication team wrote:
By using a preregistration and by having our protocol and preregistration reviewed by an expert panel before we start gathering data, we aim to increase the quality of the design of the replication and hope to prevent discussion afterwards about whether the replication was conducted well.
With the expert panel, these researchers sought a form of confirmation from experts in the field: not only to increase the quality of their study but also to hedge against criticism afterwards. This shows that the uneasiness was not only about the relation between their replication and the original study but also about their relation with other researchers and not being seen as good enough themselves. To avoid straining their own reputation in the field as trustworthy researchers, they had many discussions about the meaning of the prospective results. As Susan explained: When is it a replication and what does it mean if we don’t find the same results? We are trying to clarify that now before we have the results. To prevent all kinds of painful, difficult conversations afterwards. (. . .) The whole idea is that you (work) super transparently to, yes, to sort of (prevent) arguments.
Pilot and Training: The Problem of the Parent
Before they started data collection, the research team fine-tuned the experimental protocol by doing and recording pilot sessions, and they used the pilot videos to signal problems and train the instructors to respond in the same way. JB attended a meeting with Susan and student instructor Rachel in which they aimed to “calibrate” the instructions. From our fieldnotes: We watch a video in which Rachel tries to persuade a reluctant child to complete a puzzle. The child keeps running to his mother. Susan sighs a lot; she indicates that parents have a great impact, and that children always seek confirmation from their parents. That it is much easier without parents. That children nowadays are also different from in the past. They are protected more, which makes it difficult to “detach” them (from their parent). (. . .) Susan: We should also be able to compare it with the daycare situation (there are no parents there either). (. . .) What is the natural behaviour, (and what is the) effect of the daycare or the parent? We have to decide what to do with the parent.
Watching the video, Susan realized that parents had an effect on the experiment. She worried that children behaved differently when their parents were around, which made it difficult to make them act as “good” research participants. Susan: “There is not only interaction between you and him; he’s also interacting with his mother.” Moreover, the parent’s presence might not only affect the child’s behaviour; the experimenter might also act differently. Susan: “Some kids are also throwing those pieces around. That shouldn’t happen. But you can’t solve that with mother around.” Or as Susan formulates this some weeks later in a meeting with Harold: “That parent is watching. (That) makes (you take) completely different decisions.”
Susan was concerned not only about what the close relationship between the parent and the child meant for the experiment and the data but also about what this said about the child’s natural behaviour, about the comparison between children now and in the past, as well as what this all meant for their comparison of different locations. That is, Susan connected the personal relationship between the parent and the child to a variety of other relations in the experiment: the relation between child and experimenter, lab behaviour and natural behaviour, children then and children now, and children in the lab and children in daycare. Realizing the extent of these complications, all stemming from the close relationship between the children and their parents, she sighed repeatedly.
In order to decide what to do with the parents, Susan searched in her documents to find out what Jones had done with them. In the expert panel, Sara had told them that she was alone with the child. But according to the original paper and personal communication with Jones, the parent had been in the room, but at a distance, and was asked not to interact. Susan: “They say two different things. How nice! (. . .) But, well, the parent has a very big influence. [Sighs].”
Susan had signalled a problem that enacted several other problematic relations within the experiment. To deal with this, she could do two things: follow the original author because it was a replication, or not follow the original author and make it a better study. However, to her frustration (“How nice!”) it was not clear what the original author had done. She decided to put the parent behind an observation window, if the situation allowed this (“If we don’t get the parent detached from the child, then he [or she] is there”). This could improve some problematic relations in the experiment (e.g., the instructor could perhaps prevent children from throwing around puzzle pieces), but not all: it was still unclear if modern children behave the same as those 40 years ago, or if children in the daycare centres behave the same as in a lab, if children behave the same with or without their parents (and what is their “natural behaviour”), if instructors behave the same when the parent is watching, if parents were around in the original experiment, and if parents would be around in the replication study. These uncertainties were emotionally challenging, as we can see from Susan’s sighing and comments.
The Experiment: Being Faithful to the Protocol
The relationship between the child and the parent was seen as problematic because it affected the relationship between the experimenter and the child: That relationship had to have a level of trust to evoke the right behaviour in the child. In the experimental protocol, the researchers wrote that the child should be “at ease”; that the tutor should connect to the child in “a friendly way”; that s/he should “nudge the child in a gentle way”; that the child should “never be pushed,” and so forth, all to reproduce the “warm and supportive atmosphere” from the original study. That is, the personal relationship between experimenter and child had to be good in order to reproduce the original atmosphere. However, while conducting the experiment, the instructors observed that it was quite difficult to be warm and supportive while following a strict protocol. They noticed that some of the instructors’ actions felt good but were not according to the protocol. And when the instructors followed the protocol, it did not always feel natural. The researchers struggled with faithfully following the protocol on the one hand and creating a warm and supportive atmosphere on the other.
While discussing the video (. . .) they discover that some reactions that feel natural (putting some pieces together for the child) cannot be classified under the levels (of the protocol). “Very didactic,” says Susan, “but an error in the experiment.” (. . .) Susan and Rachel capture decisions in “rules.” (For example), the “three-second rule”: if a child does nothing for three seconds, you have to give further help. They have dropped this rule. Calmness turned out to be more important. Susan says that Rachel is more restless than she has seen in videos before. “That’s because I wanted to keep the child’s attention,” Rachel says. Susan understands this but still thinks she should go back to that earlier feeling. She laughs and says to me that following your feelings isn’t really good for a replication.
The researchers therefore strove to find a balance between feelings and protocol, calmness and rules, tinkering and control. This balancing was useful and necessary to be able to conduct the experiment, but Susan also realized that it was not very good for the reliability of their replication study, because it was hard to repeat. Again, this tension is expressed in laughter.
After the pilot, training, and finalization of the preregistration and the protocol, Susan started data collection with three student instructors. From our fieldnotes of one of their experiments: I observe what is happening in the room through the observation window. (. . .) Susan guides the boy with words [to make the puzzle]. Sometimes she points at something or demonstrates something. (. . .) It all looks nice, pleasant, relaxed. (. . .) [When the puzzle is finished] the boy has to make the puzzle again, now all by himself. He starts with the wrong layer. It takes long (and) he starts to lose his attention. (. . .) Finally, the boy succeeds in putting the puzzle together with a little help from Susan.
JB observed a formal setting (an observation window, cameras, stickers on the floor that indicated the positions of instructor, puzzle, and participant) with an informal, friendly atmosphere during the experiment. She also noticed that the experiment took a long time (approximately 40 minutes) and that the child and the instructor needed a lot of patience to finish the session. The child succeeded, but after the child and mother had left, Susan told us that she had made many mistakes. She explained that the child repeatedly asked her to do the puzzle with him (“do you want to do this piece?”). From our fieldnotes: That is a request for help. (And I chose to go along with him). Otherwise, I would disturb the atmosphere. The trade-off is (always) the atmosphere. I therefore agreed to his request. (Even though this meant a deviation from the protocol).
Susan decided that the relationship between her and the research participant was more important than following the protocol. Susan: “If the child gets stuck, we look at what is needed instead of (following a) mechanical rule. We will see that in the results. It’s chaotic.” In other words, Susan’s decision to stay friendly and supportive instead of following the protocol harmed her trust in the data (made these chaotic).
Susan explained furthermore that Jones et al. had not reported on the interactions between instructor and child, while she thought this interaction was very important. In the original study, all interaction was done by Mrs Jones. Susan: “But you want it [the effect] apart from the person.” So, basically the question was how to create a Mrs Jones effect without the help of Mrs Jones. In a later (recorded) conversation, Susan explained how difficult this was for her. She told us about a consultation she had had with Sara, the research trainee of Jones: [Sara said] “I remember very well how the Joneses stood looking over my shoulder when I had to practise with those children. They both stood there: ‘no, you have to do this, kind of like this.’” So she was incredibly coached on how to do that task. And after that meeting [with Sara] I completely broke down. What she said affected me deeply. It’s very nice to get a bit of an insight into how those people worked at the time, but there is no one looking over my shoulder. (. . .) Something was transmitted by the Joneses that is not described in the articles, but that did play a major role. And that is what you are investigating. Then you can endlessly contrast decision rules, coding schemes, and protocols, but I think, yes, it is that piece of coaching, interaction behaviour between those researchers, which is quite decisive.
Here, again, we see a tension between staying faithful to the replication protocol with its “decision rules, coding schemes, and protocols” and being faithful to the original study with its “coaching and interaction behaviour.” This contradiction meant Susan became “deeply affected” and “completely broke down” because she was on her own. No one helped her bridge the gap between the formal and the informal, no one “incredibly coached” her the way that the Joneses had helped their research assistant. It seemed deeply unfair, because this coaching seemed to have been “quite decisive.” The personal relationship between the original researchers meant that the experimental effect was not “apart from the person,” and hence could not be faithfully replicated by Susan.
Preparing a Data Set: Lying Awake about Variables, Red Cells and Preregistrations
Susan and the research team worked hard to set up a controlled environment and create a good experimental atmosphere. But still, the experiment enacted a situation that was “chaotic,” and instead of correctly following the protocol, Susan felt she made many mistakes. Moreover, together with Harold, she discovered that there were also measuring errors and other mistakes in the original study. In a conversation with JB, she expressed her discomfort: “What are you replicating when there are mistakes in the original results?”
At a time when Susan still had to conduct about 200 experimental sessions, she seemed to have lost confidence in the study’s usefulness. However, two weeks later, Susan looked combative in a meeting with Maria and Anna. She had prepared some slides to explain something about the sample. The first slide was entitled “Jones,” and showed that there was one instructor, one location, eight children per condition, and parents present. The next slide was about the replication and showed that there were four instructors, two locations, 64 children per condition, and only occasionally parents present. Susan explained that besides these variables (instructor, location, parents) they also randomized for sex and age (4–4.5 / 4.5–5 years old). To be able to analyse whether some of these variables in their replication study affected their results, Susan suggested dividing their experimental data into 128 different cells: 4 instructors × 4 instructions × 2 locations × 2 ages × 2 sexes. However, not all four instructors had done an equal number of experiments with younger and older, girls and boys, in the lab and daycare, and with the four different instructions. Ideally all 128 cells should have sufficient participants, but then they would need “10,000 children or so.” Alternatively, Susan explained to Maria and Anna that they could try to divide the children more equally over the 128 cells (e.g., by giving the instructors an equal number of children per location), which could result in two or three children per cell. However, getting this equal distribution among the cells was not easy. They were especially concerned about one instructor, Else, who had tested many more children than the rest but mostly in the daycare situation. They discussed the fact that Else was a bit unwilling to go to the lab but that she was a very good instructor. Did they have to pause her work to enable the others to catch up? Or did it not really matter that they had some empty (Else-lab) cells?
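The arithmetic behind this proposal can be made explicit. The following is only a rough check, assuming the planned 64 children per condition (256 in total; the final sample is reported only as “over 250”) and a perfectly even spread over the cells, which the team did not achieve:

```latex
\begin{align*}
\text{cells} &= 4 \times 4 \times 2 \times 2 \times 2 = 128 \\
\text{children} &= 4 \times 64 = 256 \\
\text{children per cell} &= \frac{256}{128} = 2
\end{align*}
```

Under these assumptions each cell holds about two children, the lower end of the “two or three” that Susan mentioned, so even a small imbalance between instructors and locations can leave a cell empty.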
Susan: “Yes, those red cells, they keep me awake at night. It’s really less beautiful, that dataset like that. It hurts.”
Susan’s pain expressed her relation to the replication and the data: after all her work, she was left with empty cells. Moreover, when they discussed this “red cells” problem with Harold, their expectation was again tempered. According to Harold, it was not only about these empty cells: there was not enough statistical power (participants per cell) anyway to study all these interaction effects, and they should concentrate on testing the original hypothesis. For Anna and Susan this seemed rather disappointing. As an instructor, Susan had experienced how hard it was to follow the experimental protocol, and she expected that it could matter who conducted the experiment with what kind of child in what situation.
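The arithmetic behind the “red cells” problem and Harold’s objection about power can be made concrete. The sketch below is only an illustration: the factor levels follow the description above, but the labels for instructors and instructions are placeholders, and the total of 256 children assumes that the four instructions correspond to four conditions of 64 children each.

```python
from itertools import product

# Factor levels as described in the text; the labels themselves are placeholders.
FACTORS = {
    "instructor": ["I1", "I2", "I3", "I4"],
    "instruction": ["A", "B", "C", "D"],
    "location": ["lab", "daycare"],
    "age_group": ["4-4.5", "4.5-5"],
    "sex": ["girl", "boy"],
}


def all_cells(factors):
    """Return every combination of factor levels (one tuple per design cell)."""
    return list(product(*factors.values()))


def mean_children_per_cell(n_children, factors):
    """Average cell size if the children were spread perfectly evenly."""
    return n_children / len(all_cells(factors))


cells = all_cells(FACTORS)
print(len(cells))                                # 128
print(mean_children_per_cell(256, FACTORS))      # 2.0 (assumed 4 x 64 children)
print(mean_children_per_cell(10_000, FACTORS))   # 78.125 (Susan's "10,000 or so")
```

Even with a perfectly balanced design, about two children per cell leaves very little to estimate interaction effects from, which is the point Harold makes.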
I think (that) we’ve already failed to replicate [the experiment]. We are not going to find any major effects. (. . .). So, [then you can] conclude: you didn’t do it well.
But you have been working on this for two years. (You are all academics. If you can’t get it done) then I don’t see it happening in education either. It’s not that you don’t do your best.
But what does it mean?
If you work hard (and) after two years, you don’t succeed. (Then) this effect can be thrown in the trash. It is not replicable. (It has been a coincidental effect).
But (. . .) I am not Mrs Jones. I lose my natural teaching style [in this experimental setting].
(. . .) It is a very relevant result. You have not failed.
But what the hell are we doing? We try to use this in education.
Anna and Susan felt uncomfortable that they could not study what interaction effects might affect the original effect, because this could have consequences for their personal reliability as good researchers. Harold, on the other hand, found their academic expertise reliable enough to claim that the original effect was not trustworthy, and hence that the effect did not exist in teaching either. For Susan this was a concern because she felt that her teaching in the experiment did not relate so well to teaching in class. She seemed to worry about the consequences of the replication (with her not so “natural teaching style”) for education.
In a subsequent meeting, however, it turned out that Susan, Anna, and Maria had preregistered studying some of these interaction effects. From our notes:
Your main hypothesis is: those effects don’t matter. I wouldn’t change the main analysis.
(. . .) but actually we put in the preregistration (. . .) location and instructors.
[They show that this is in the text. Harold seems to sigh. Looks at the ceiling.]
Okay, then we will include that. Then that is what you’ll have to do.
Since a preregistration is a strategy that should prevent differences between the researchers’ intentions and actions, Harold also had to follow it—otherwise it could harm other people’s confidence in the study. However, following the preregistration could strain the trustworthiness of their analysis as a good representation of the original study, as well as the reliability of the statistical inferences (not enough statistical power). This seemed a bit frustrating to Harold.
Reliability of the Coding
Data collection proceeded with some emotional ups and downs, but in June 2021 the researchers were very excited because the end of data collection was finally in sight. From our notes:
The good news is the sample. (. . .)
We’ll have more than 300 soon.
Fantastic. Wow!
The next step, which started during data collection, was the coding of all videos. Jones had not explained how he did the coding, so the replication researchers developed a coding scheme themselves. Susan involved five student assistants to code the behaviour of all 300 children every 10 or 15 seconds (and in a later round also the instructors’ behaviour). The coders had to code if the child made an “operation” and if so, if this operation was with or without the help of the instructor and if it was correct or incorrect. However, during coding it turned out that it was not always clear what an operation was, if an operation was correct or not, and what “help” actually meant. When a child picked up a piece and put it down somewhere else, was this then a correct operation (make it ready for action), an incorrect operation (does not know what to do with the piece), or no operation (a meaningless action)? And was it “helping” if the instructor arranged the pieces, gave a compliment, or just smiled? The problem of the coders was not only that they had to assign the right code (e.g., “correct operation, unaided, stage 2”) but also that they all had to give the same codes in the same situation. The IRR—inter-rater reliability score (the extent of agreement between the coders)—had to be good to continue their work (Maria: “The next step depends on the IRR”). To make the coding reliable, they had regular meetings to practice and discussed how to code the often very unpredictable behaviour of the children. From one of their coding meetings:
I had a child who spread all the pieces (around him). (So) he searched for pieces and placed them widely (around him). What would you do?
Good one . . . [laughs]
Operation or not?
As we have it now, it is selecting. But if the child puts (the pieces) away (it is) no new event. Or?
It is so sensitive to interpretation.
I don’t know actually.
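The text does not say which IRR statistic the team used. One common choice when more than two coders rate the same material is Fleiss’ kappa, which compares observed agreement with the agreement expected by chance. The sketch below is a minimal illustration under that assumption; the code labels and the example segments are invented and do not come from the study.

```python
from collections import Counter


def fleiss_kappa(ratings, categories):
    """Fleiss' kappa for segments that are each rated by the same number of coders.

    ratings: list of lists; each inner list holds the codes that the coders
             assigned to one video segment.
    categories: all codes a coder could assign (codes outside it are ignored).
    """
    n_coders = len(ratings[0])
    if n_coders < 2 or any(len(row) != n_coders for row in ratings):
        raise ValueError("every segment needs the same number (>= 2) of codes")

    counts = [Counter(row) for row in ratings]
    n_segments = len(ratings)

    # Observed agreement: share of agreeing coder pairs, averaged over segments.
    p_observed = sum(
        (sum(c[cat] ** 2 for cat in categories) - n_coders)
        / (n_coders * (n_coders - 1))
        for c in counts
    ) / n_segments

    # Chance agreement, based on how often each code was used overall.
    total = n_segments * n_coders
    p_chance = sum(
        (sum(c[cat] for c in counts) / total) ** 2 for cat in categories
    )

    if p_chance == 1.0:
        return float("nan")  # only one code was ever used; kappa is undefined
    return (p_observed - p_chance) / (1.0 - p_chance)


# Hypothetical example: five coders, four segments, three invented codes.
CODES = ["correct", "incorrect", "none"]
segments = [
    ["correct", "correct", "correct", "correct", "correct"],
    ["correct", "incorrect", "none", "correct", "incorrect"],
    ["none", "none", "none", "incorrect", "none"],
    ["incorrect", "incorrect", "correct", "incorrect", "none"],
]
print(round(fleiss_kappa(segments, CODES), 3))
```

A value of 1 means the coders always agreed, and values near 0 mean agreement no better than chance. Ambiguous cases such as the child who spread the pieces around him are exactly what pulls such a score down.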
The IRR was important to show the fidelity of the replication to the original study. Jones’ single coder had to be convincingly substituted by five different coders, all coding in a coordinated way. In fact, if different coders coded differently, it did not really make sense to calculate the results. The IRR directly affected the trustworthiness of the data, which was related to the interpretations and competence of the researchers who developed the coding scheme and trained the coders. Susan told her supervisor Maria that it would be their fault if the coding was not reliable, not the coders’ fault. Harold, on the other hand, had a diametrically opposite take on the issue. He said that if the coding was not (inter-)reliable, it was not bad research but a finding. From our fieldnotes:
If that cannot be processed, cannot be coded, then that is a finding. Then Jones did that out of the blue. But it is not reliable. (. . .) It is the problem of the original study. It is a relevant outcome.
That’s a super big relief. (. . .) This makes me happy. But how do we rigorously demonstrate this?
You’ve tried to do it. The videos are still there: others can do it again. (. . .)
It makes me happy and nervous. In education it is pushed so much. (If it is that subjective/difficult to transfer), that is quite a result.
Harold’s reversal of the situation—low IRR as a finding which affected the trustworthiness of Jones’ study not theirs—made Susan very relieved, but it also made her nervous. She was concerned about how they could prove that the reliable or consistent coding of the original study was not doable, and what it would mean for the teaching practice that was based on that study. That is, the reliability of the coding affected their own trustworthiness as researchers, who had perhaps developed a bad coding scheme, and/or that of the original author who had perhaps developed an un-codable experiment. Moreover, it could also strain faith in current teaching strategies, which were perhaps based on untrustworthy experimental results. Susan felt happy because Harold said that they were not the unfaithful aspect in this chain, but also nervous because this might undermine fidelity elsewhere (for example, in teaching strategies).
All these emotions and concerns occurred before the IRR test was actually done. A few weeks later, when the test had indeed shown that the coding was not reliable, Anna and Susan had a meeting to discuss what to do, and how to convey this bad news to the coders with whom they were meeting right after. They did not seem to be so happy anymore with Harold’s solution that unreliable coding is a finding instead of a failure. Instead, they were very concerned about the impact of this on the coders, and the time new coding would take.
My greatest wish is to discuss (how we tell the) coders (about the IRR)? I find that very difficult. My concern is that it will be demotivating and confusing. (. . .)
(I think we should) involve them in this process. (We should) let them think along. They are loyal to us. (. . .) Pushing won’t help.
I worked very hard for 20 months. I gave it all, but I’ll have no coded data in September. I’m so done with this. I have to keep on going. It has a major impact on my life and career. Continuing is not good for me. (. . .) I don’t want to sacrifice any more. (I feel a) deep frustration. (. . .) This has major consequences for my career. (. . .) I can’t bear it anymore. I never want a project like this again.
I get that.
We should go to Harold (with the problem): we can’t get that coding scheme reliable. Now what?
In this conversation, we see intense emotions and concerns. At this stage of the project, so many issues have accumulated in so many relations that the situation seems to become unmanageable. The coders, who have been so loyal to the project, must be told that their IRR is low. Anna proposed engaging the coders in the process of looking for a solution, but that would take time. This was not acceptable to Susan; she needed the data by September. She was deeply worried about the impact on her life and career. Again, they looked to Harold for a solution: perhaps he could help them to make the coding simpler. They decided that Susan would continue working until the planned end of her postdoc (September) and would then take a step back: Susan: “(then I’ll take) a sabbatical from the project. (I don’t want to have the project under my care anymore, like a baby.)”
The project as a baby that must be cared for, coders whose loyalty must not be betrayed, the close relationship between Susan and Anna and their confidence in Harold, and the high impact of this project on Susan’s life and career: these showed the different relations that required fidelity from one or more parties to be good. And although these relations between the researchers, between the researchers and the research, and between the research and the researcher’s life and career have to be good in all research, the situation in which the researchers had to replicate the unclear and perhaps unfeasible decisions of an earlier researcher put these relations, we argue, under even more tension. Nevertheless, Susan and Anna did not give up on this emotional journey (perhaps out of fear of throwing the baby out with the bathwater), and decided to take more time, adjust the coding scheme, and redo and re-check the coding to increase the IRR. In the conversation with the coders which directly followed this emotional meeting, Anna said: (I wonder) what are we doing. I kept looking at myself. (I thought, it’s my fault that I can’t follow it.) But it’s so complex that it’s impossible to follow. (. . .) Now (we have to) think: what simplification or rougher measurement can we use? We are very confident that we all understand it well. But now (we take) the Jones route. Making it easier. (We have to) make concessions.
Anna said that she first thought that she was the unreliable aspect in the coding, but now she realised that the coding was too complex to be reliably done. From now on, they would take the Jones route, who had not reported about the coding at all. So, although they initially had higher methodological standards than Jones and made a detailed coding scheme, they now concluded that this scheme was too detailed. Susan added that it was not that the coders coded unreliably but that the coding was affected by “noise”: “[It gives me] a stomach ache. But maybe (we should redo the coding). (It is) noise. (That’s different from mistakes). (We have to find out) where the noise comes from.”
The decision that it was their coding scheme that was impossible to follow reliably exonerated the coders and Jones. And although this also gave them some hard feelings (frustration and stomach ache), it also made it possible to continue their project and maintain the quality of their relations. They adjusted (simplified) the coding scheme, and the coders redid the coding. Apparently, this improved the IRR of the coders because it was no longer a concern in later meetings.
Calculating the Adherence Rate
Some weeks later, the reliability of the data was again a concern. This time, the worry was not whether the coding was reliable but whether the instructors had correctly followed the (not always very clear) instruction steps. Jones et al. had reported a relatively high “adherence” rate of 70% correct instruction steps, but Susan seriously doubted whether their instructors would even achieve a 50% adherence rate. She wondered what their contribution would be if it turned out that the instructors had not been able to adhere to the original instructions (i.e., if the instructors did not execute the experiment in the correct way). Susan: “I think it is a sensitive point. What those results mean. We are attackable. We did not replicate [the experiment].” However, Harold comforted them, saying that they were the experts and that the problem was with the original authors (who were not very clear about these exact steps) and the complexity of teaching “in practice” (which made it hard to follow a strict protocol with instruction steps). Harold: “If experts like you cannot do this after three years, then in practice it is certainly not possible. Many meeting hours preceded this. Then Jones should have explained this better.” In other words, the fear that the adherence rate of the research team was too low potentially undermined the replication’s fidelity to the original study, and potentially made them open to criticism by Jones or potential reviewers. At the same time, fears about adherence rates problematized the relation between the experiment and actual teaching practices.
Despite Harold’s calming words, the adherence rate became an increasingly stressful and complicated topic in the research meetings. At a certain moment, even Harold seemed to be thrown a bit off balance by the complexity of the data and the concepts behind them.
We don’t know what failure was [of the child], but we know what success is.
Sara [Jones’ assistant] said: “You just have to assess that and I’m sure you will do that right if I read your protocol.”
[Silence]
[Sighs] Can you get the (mock) data again?
[They discuss the data sheet of one experiment]
(all these actions of the child) are one attempt? (. . .). You respond to this as a human being. You can’t say, “Go, make your puzzle!”
According to the protocol, (the instructor should have made a different action), while everything in you says: “I will give the child another chance.” (. . .) But that subtlety doesn’t exist with Jones. He only said: 70% of mistakes are repetition.
Okay (it’s a) mistake. (. . .). Wow, this is difficult. (. . .) Okay, so one mistake so far, maybe two mistakes. (. . .) [He reads the data as he talks]. “Now, I want to draw again!” [laughs]. This is so difficult!
Harold seemed to realize how complex the interaction between the instructor and the child is. Indeed, you cannot say: “Make your puzzle!” to a four-year-old child. That one of the original instructors had told the replicators that they should assess themselves what a failure is for the child (and hence how the instructor should respond) probably did not fit his idea of a faithful replication either. Harold was silent and sighed. But he pulled himself together and decided to just follow the protocol and to define this as a mistake of the instructor. That is, Harold found it important that the protocol was followed accurately, although he realized that strictly following all instruction steps was actually impossible with children, and hence also not a very accurate model of the natural (teaching) situation. Moreover, accurately following the protocol and defining all “human” responses as a mistake strained their trust in Jones, who apparently had made almost none of these mistakes. The emotions (expressed in sighing, silence, feeling vulnerable) that we see here were about the concern that a faithful replication (following their reconstruction of Jones’ protocol to the letter) was not possible with a child, and hence not faithful to the natural teaching situation of the child. However, not following Jones and their own protocol also felt unfaithful. As Anna formulated it: “Does it still matter what the results are? If adherence is so low?”
The fact that Mrs Jones’ instructions had been 70% correct was quite a stressful outcome for Susan and Anna, who were afraid that their instructors were making many more incorrect actions. However, at a certain point Harold noticed that Jones never specified that an incorrect step (e.g., giving help at level 2 instead of level 3) was a mistake; Jones only wrote that a step in the wrong direction (e.g., giving help when no help was needed) was a mistake. So, by literally following Jones they would not have made as many mistakes as they thought they had. Moreover, when Harold calculated all steps in the right direction as correct, he also ended up with a 70% adherence rate for the replication. This made everyone very happy.
(Look,) how well the instructors did. 70%, the same as Jones. This is an important manipulation check. [Happy responses.]
Compliments to you, Susan!
Pfff . . . So we can still train people. That is a relief for teacher training.
(. . .) This is very strong evidence that the instruction just worked.
So happy with that!
So, although nothing had actually changed in the experiment or data, this slightly different definition and calculation of mistakes had huge consequences. It reestablished their faith in the fidelity of various relationships in the project: the reliability of the coordination in the experiment; their trustworthiness in their personal relations with Jones, their colleagues, and the children; and the faithfulness of their representation of Jones’ protocol. Not surprisingly, this shift evoked a wide spectrum of emotions.
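How much the definition of a “mistake” can move the adherence rate is easy to show with a toy calculation. The sketch below rests on assumptions that are not in the study: help is modelled as a numeric level with 0 meaning no help, a “step in the right direction” is read as giving help exactly when help was prescribed, and the ten steps are invented for illustration.

```python
# Each step: (prescribed help level, help level actually given); 0 means no help.
# These ten steps are invented for illustration only.
STEPS = [(0, 0), (2, 2), (3, 2), (1, 1), (0, 1),
         (2, 3), (3, 3), (1, 0), (2, 1), (0, 0)]


def strict_adherence(steps):
    """Share of steps where the given help level matches the prescribed one."""
    return sum(given == prescribed for prescribed, given in steps) / len(steps)


def directional_adherence(steps):
    """Share of steps in the right direction: help given exactly when prescribed."""
    return sum(
        (given > 0) == (prescribed > 0) for prescribed, given in steps
    ) / len(steps)


print(strict_adherence(STEPS))       # 0.5
print(directional_adherence(STEPS))  # 0.8
```

The same records yield two quite different rates, which mirrors how the recalculation changed the picture without any change in the experiment or the data.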
A Faithful Replication and Its Consequences
So far, we have seen that the replication researchers expressed a variety of emotions concerning fidelity in the various and sometimes conflicting relations in their replication. However, as soon as the results were in, these relations changed again. From our fieldnotes: The analyses are ready. All (results) are more or less the same. The bottom line is not only that [condition A] does not stand out but that everything is about equally distributed. “It just doesn’t matter at all what the teacher does,” laughs Susan. Anna says several times that she still needs to process this. They discuss how this (result) is possible. The Mrs Jones effect is mentioned. Harold calls Jones’ results “quite remarkable.”
Again, we see interesting changes in some of the relations. From the first interview, the researchers seemed to doubt that they would reproduce Jones’ “big” effects, but now that they had indeed not reproduced these results, they seemed to be surprised, especially Anna. The surprise was not that they found different results but that they found no effect at all. (Anna: “There is simply no difference between the conditions. Just none!”). This immediately strained several other important relations in the project: between the (“remarkable”) original study and their replication; current teaching strategies and their scientific support; the effect of the experimenter on the participant; and the relation between the researchers and their field. From our fieldnotes:
I think it’s the Mrs Jones effect. She could do this. She was very well trained. (. . .) But that doesn’t work for the teacher. If you [follow instruction a, b, c, or d]: apparently that is not the magic. This raises questions. What is (good) teaching? (. . .)
[about Jones et al.] if you just look at the ns [number of participants], it’s just (impossible) (. . .) that you will find this (effect) with eight test subjects with so much noise.
It gives me confidence. This is a bomb. They can no longer say that (our instruction in the experimental condition) was not okay [because results in all conditions are more or less the same].
The magic of Mrs Jones, teachers in class who are not trained like her, impossible original results, a study with so much noise, colleagues who might think that your instruction is not okay, results that might have the impact of a bomb; these all indicated that confidence in several relations was about to collapse. As we can tell from an earlier meeting, this was not the conclusion that Anna and Susan had hoped for. Anna: “Our background shows that we would have preferred to find it.”
Moreover, the question that followed was what relations this “bomb” would actually destroy. (Susan: “And now? What is our message? I think that remains a struggle. The results are quite clear, but what does it mean?”) For Harold, this seemed relatively easy. During several meetings about their paper in progress, Harold summed up what the results were, what went well in their replication, and what went wrong in the original study. In fact, he advised not to go too deeply into the theory. (Harold: “we can’t replicate it. Period. And we don’t immediately know why.”) For Susan and Anna, however, this was not satisfactory. At conferences, they were presenting the results of their replication to a field of colleagues who were in favour of Jones and who wanted to know what they thought this meant for the theory and all the later studies of the phenomenon. At a certain point, Anna flipped a switch in their discussions with Harold: The line we are taking so far is very much: “How is it possible that they found it and we did not?” And I think that Susan set out the line (in the paper) more like: “Why don’t we find the effect?” And to look for this in the operationalization of the instruction. So: “Why is it not surprising, in retrospect, that this condition, operationalized in this way, does not give the expected effect.”
Susan and Anna were no longer unsure about the reliability of their performance and results, but their relations and discussions with peers also obstructed the argument that Jones’ study and theory were unreliable. They decided to focus on the complexity of the phenomenon: instead of presenting Jones et al. in their scientific paper as an undoable or unreliable study, they presented it as an important theory that needed some rethinking on its operationalisations. The expected explosion in their field did not occur and their work was received with positive attention. In the peer review process their work was called “a very careful replication.”
Discussion
In this paper, we followed the emotional concerns of Susan, Anna, and Harold throughout the replication process. We showed that emotional concerns are not only a result of failed replications but also form an integral part of the replication process. These replicators were especially concerned about the quality of the many complicated and sometimes conflicting relations between the heterogeneous elements in and around their replication study: the relations between the original study and the replication, participants’ behaviour in different locations and situations, the coding of different coders, the instructions of different instructors, between the data and their interpretation, and the replication results and the theory. Their concerns focussed on a set of qualities that are related to what we called “fidelity.”
We can distinguish different kinds of fidelity relations. Replication researchers expressed emotional concerns regarding fidelity in several personal relationships. They worried about Jones’ trustworthiness and about their own trustworthiness in the eyes of their colleagues or reviewers. They were also concerned about the close relationship between the Joneses, and between the children and their parents, because these personal relationships made it hard to replicate the original situation. On a more positive note, Susan and Anna had a close relationship, they trusted and protected their student coders and instructors, and had a lot of confidence in the experience of Harold—whose support strengthened their confidence in their own work. Another fidelity relation was one of representation: the replication should represent the essential features of the original study, by following its protocol. The fidelity of this representational relation was threatened by the missing information, the Mrs Jones effect, and the preregistered analyses of the interaction effects (which differed from the original analysis). Other fidelity-related representational relations were those between contemporary children and children 40 years ago (at the time of the original study), and between children in the lab and in daycare. The researchers were also concerned about coordinational and inferential fidelity relations: They were eager to demonstrate that the coders and the instructors coded and instructed in the same coordinated way—and that their conclusions reliably derived from their data. A final fidelity relation is that between the research practice and the phenomenon it aims to grasp: the ontological fidelity between the (replication) experiment, in particular the experimental protocol, and this particular teaching strategy. 
At the start of the project the replicators were still loyal to the teaching strategy as described by Jones, and committed to the replication experiment as a good way to prove its efficacy. During the replication project, their concerns about fidelity relations grew and their loyalty to Jones and his protocol declined, until they finally concluded that it is not surprising that they could not find the same effect, because this teaching strategy is very hard to operationalize.
In other words, the emotions of the replicators (e.g., their uncertainty) were not only generated by concerns about fidelity in specific relations (e.g., between the original study and the replication); these emotions and relations could also affect fidelity issues in other relations (e.g., between themselves and their colleagues). The emotional concerns about fidelity issues could be seen as affective entanglements (e.g., Myers, 2008; Slaby et al., 2019) that connected apparently unrelated situations and aspects of the study. As a result, small events in the study could have important consequences for fidelity somewhere else. For example, the fact that one instructor (Else) conducted more experiments in the daycare than the other instructors affected fidelity in some inferential relations (e.g., between data and interaction effects); re-calculating the instructors’ adherence rate in the replication radically changed the original author’s trustworthiness; the replication’s null results strained the researchers’ confidence in the evidence base of teaching practices; the problems of following the protocol in a well-coordinated way made the original experimenter a “magical” (unreliable) instructor; the problem that two coders coded differently was very bad for Susan’s confidence in her career; and the strong connection between parents and children decreased the representativeness of participant behaviour across locations, and of children nowadays with respect to those in the 1980s.
Central to this tangle of fidelity relations is, as mentioned, the relation between the replicators and their object of study. As Anna said about the start of their project: “Our background shows that we would have preferred to find it.” As the project goes on and the web of relations that supports it comes under tension, their loyalty to Jones’ effect starts to waver. Perhaps the strategy as defined by the protocol does not work. The promise of “an effect apart from the person” is not fulfilled. Susan calls the protocol “a mechanical rule” that is at odds with what the teaching situation demands. Even Harold, who has emphasized following the original protocol to the letter, is impressed by how complicated it is to help a toddler make a puzzle. At the end of the project, the replicators conclude that the “magic” is not in the protocol with its rigid rules for instruction and helping. As Susan said: “Something was transmitted by the Joneses that is not described in the articles, but that did play a major role.” The strategy cannot be isolated from the teacher implementing it and formulated in a set of steps and decision rules. That means that teaching teachers to use the strategy requires something else than informing them of these rules and steps. It requires what Susan calls “coaching.”
Thus, during the replication process, the team’s view of what the phenomenon (the teaching strategy and its effect) actually is has fundamentally changed. Although its precise nature remains unknown (further research is needed), they now see it as intimately tied to the relation between an individual child and an individual teacher. Moreover, this change is the result of the close contact between experimenters and children. In one of their discussions with Harold, Anna explained: “by really getting up close to those children, that has changed it. (. . .) Gradually, we started to revise our assumptions.”
Thus, the replicators end up close to embracing a process-orientated, relational view of the teaching strategy, realising that it exists in and through the developing relation between a teacher and a learner. This means that the phenomenon is fundamentally at odds with an epistemology that relies on direct replication as a method for confirming its reality. Direct replication studies, at least the way they have been implemented by psychologists in response to the replication crisis, depend on precisely defined experimental protocols that prescribe among other things the actions of the experimenter. The replicators in our study, however, discovered that following an experimental protocol based on Jones’ work does not produce the phenomenon. They hypothesize that this is due to its “mechanical” nature, i.e., the fact that it is a protocol. A better protocol would not help to produce the phenomenon and make it reproducible.
It remains to be seen how the replicators will further enact this new ontology of the phenomenon, that is to say: how this teaching strategy will be realised in the practice of their future research projects, their teacher education, advice to schools, communication to a wider audience, and so on. Now that the link between the phenomenon and the experimental protocol (or any set of rigid decision rules) is broken, the network of relations around the phenomenon must be rebuilt. In the process, the various forms of fidelity in these relations will also change.
This replication study shows the kind of ontological shift that is advocated by among others Van Geert and De Ruiter (2022) and Derksen and Morawski (2022, 2024). Here, this shift is not just theoretical in nature but happens in real time in a team of researchers. For them, it is a lived experience which involves an entire network of emotionally charged relations. Taking care of the fidelity of these relations while the replication work simultaneously created tensions in and between them was an emotional, sometimes painful process. Thus, method and ontology are not just abstract topics of discussion; in research practices they are entangled with issues of loyalty, reliability, trust, and other fidelity relations that are emotionally charged. Methodological improvements and new ontologies require more than a change of mind. They are constituted by changes in a heterogeneous network of relations that researchers have feelings about.
Footnotes
Acknowledgements
We would like to thank Susan, Anna, Maria, and Harold for being so open and trusting with us, Anne Beaulieu for her thoughtful comments on a draft version of this paper, and the reviewers for urging us to discuss ontology.
Ethical Considerations
Informed Consent
Informed consent to participate in our study was given via email and verbally (during interview).
Funding
The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This project was funded by NWO Open Competition-SSH [406.20.FR.007].
Declaration of Conflicting Interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Data Availability Statement
Long-term storage facilities for our interviews and fieldnotes are available via the Amsterdam UMC. In accordance with the General Data Protection Regulation, we will not make any of the generated data openly available, because this would not protect the privacy of the observed and interviewed researchers in a sufficient and responsible manner. Further investigation using this data will only be possible via one of the authors of this paper.
