Abstract
Philosophers of science have long questioned how collective scientific knowledge grows. Although disparate answers have been posited, empirical validation has been challenging due to limitations in collecting and systematizing large historical records. Here, we introduce new methods to analyze scientific knowledge formulated as a growing network of articles on Wikipedia and their hyperlinks. We demonstrate that in Wikipedia, concept networks in subdisciplines of science do not grow by expanding from their central core to reach an ancillary periphery. Instead, science concept networks in Wikipedia grow by creating and filling knowledge gaps. Notably, the process of gap formation and closure may be valued by the scientific community, as evidenced by the fact that it produces discoveries that are more frequently awarded Nobel prizes than other processes. To determine whether and how the gap process is interrupted by paradigm shifts, we operationalize a paradigm as a particular subdivision of scientific concepts into network modules. Hence, paradigm shifts are reconfigurations of those modules. The approach allows us to identify a temporal signature in structural stability across scientific subjects in Wikipedia. In a network formulation of scientific discovery, our findings suggest that data-driven conditions underlying scientific breakthroughs depend as much on exploring uncharted gaps as on exploiting existing disciplines and support policies that encourage new interdisciplinary research.
Philosophers of science question the way science works and why. For millennia, they have posited mechanisms and offered explanations. Several recent and particularly compelling theories have been difficult to validate, in large part due to challenges in collecting and systemizing large historical records. Fortunately, Wikipedia—the largest online encyclopedia—contains millions of articles that not only form hyperlinked networks of concepts but also include a history of when a concept was discovered. We use this resource to formulate the process of science in terms of knowledge growth, or—more precisely—the growth of concept networks. Across scientific subjects, we demonstrate that networks in Wikipedia, as an important and interesting case, grow both by expanding frontiers and by filling knowledge gaps, striking a balance between bubbling out and bubbling in.Significance Statement
Introduction
In the philosophy of science, thinkers from disparate eras have attempted to reason about the various processes underlying scientific progress. Some of the most compelling theories have come from the last 50 years. For example, in 1959 Karl Popper described the development of scientific ideas as a sequence in which previous theories are falsified (Popper, 2008). In 1962, Kuhn suggested instead that progress was best described as periods of
Despite the variety of theories regarding scientific progress, such as cultural accounts of scientific practice (Chemla and Keller, 2017), many suggest a dependence of new discoveries on the existing body of knowledge. Newton writes in 1675, “If I have seen further, it is by standing on the shoulders of giants” (Newton, 1675). Indeed, discoveries, including calculus, are often
Here, we expand upon the concept of a body of knowledge, not as amorphous, but as comprised of distinct relationships between concepts. We formalize this structured body of knowledge as a Building a growing concept network from Wikipedia. (a) Each network is made up of Wikipedia articles that are indexed by subject in the natural sciences, mathematics, and the social sciences. Each node corresponds to an article; the node’s name is the title of the article, and the node’s birth year is the first year listed in the introduction or history sections as the year when the concept was conceived (highlighted in yellow). Each directed edge corresponds to a hyperlink, from the article that is hyperlinked to the article that hyperlinks. (b) The final concept networks display greater clustering, modularity, and coreness than null networks constructed to reflect Fereyabend’s hypothesis that science lacks a characteristic pattern of discovery. (c) By comparing the birth year of core nodes to those of neighboring peripheral nodes, we observed that there is no clear 
In this study, we develop methods to operationalize the theories of the philosophers Kuhn, Feyerabend, and Lakatos as falsifiable hypotheses using network theory. We then test each hypothesis—from which we infer the theories that are most supported by the data—in Wikipedia as an important and interesting case. To operationalize and test these hypotheses, we use cutting-edge methods in network science, algebraic topology, and control theory. Here, we find that science concept networks grow by creating and filling knowledge gaps, a process which is valued by the scientific community. Moreover, we operationalize a paradigm as a particular subdivision of scientific concepts into network modules and find that paradigm shifts vary in their magnitude in a similar temporal signature across scientific subjects. Our findings not only shed light upon the large-scale, empirical evidence of classical philosophies of science but also demonstrate operationalizations of qualitative social phenomena that may be readily applied in other contexts.
Results
Given the notions of characteristic patterns in discovery by Kuhn and Lakatos, or the lack thereof by Feyerabend, we first quantified the structure of concept networks using computational tools and concepts from network science. As an initial test of Kuhn’s hypothesis of
How might the structure of concept networks arise through the course of history? Might core concepts emerge first, followed by peripheral concepts, in a cumulative growth pattern in which we build deeply upon foundational early discoveries in the field? Or might the “foundational” concepts change as the field grows and changes, such that new discoveries can sometimes replace earlier discoveries as more central to the field? To address this question, we compared the birth year of core nodes to the birth year of neighboring peripheral nodes. We observed that there is no clear lead-lag relationship between a core node and its neighboring peripheral nodes (Figure 1(c)). That is, core nodes do not necessarily precede neighboring peripheral nodes in their discovery, suggesting that the discoveries viewed as “central” to a field can change throughout the field’s lifetime. Notably, there is similarly no clear lead-lag relationship between core and periphery nodes within specific modules (Figure S1). Interestingly, a few core nodes are born consistently earlier than most peripheral nodes. These nodes, such as the node “Hydrology” in the subject “Earth Science”, may serve as concepts that are central to the subject as a whole and comprise a “hard inner” part of the core (Figure S2). Taken together, these results point to both an outward expansion and an inward exploration of concepts, in which the core of a research programme is often updated by new discoveries that are influenced by discoveries that occur in the periphery.
How then does a body of knowledge grow? Thus far, we understand that real concept networks are highly clustered and display processes of both inward and outward growth. Might concept networks fill gaps in knowledge (Fontana et al., 2020) in a manner that is conceptually akin to Kuhn’s puzzle-solving Real concept networks maintain shorter and fewer gaps. (a) Examples of cavities in 0, 1, and 2 dimensions. Cavities are born when a new, added node creates a topological gap; cavities die when the gap is closed by a new node and its edges tessellate the gap. (b) Barcode for the 
After computing the persistent homology of real networks (Figure 2(b)), we next wished to determine whether the observed gap formation and closure was similar to (or different from) the same processes expected in network null models. We consider two network null models. The first is a randomly rewired null model, in which the edges are relocated uniformly at random. The second is a genetic null model that explicitly examines the process of scientific discovery. The genetic null model simulates the process by which scientists learn about existing concepts and then slowly mutates those concepts to create new ones. Importantly—and unlike true science—, the genetic null model contains no preference for how new concepts are mutated, for example as could be operationalized by an objective or fitness function (Figure 2(c)). The model is initialized as a subset of a real network, containing “core nodes” that were born before a predetermined year (500 BC in our simulations). For each node, we iteratively mutate a
At this point, the topology of the real, rewired, and genetic null model networks can be compared. We find that for persistent cavities that have already died by
Whereas the first part of Kuhn’s theory regards puzzle-solving, the second part regards paradigm shifts that radically change the way scientists view concepts. In our network formulation, we operationalized paradigms as temporally varying modular structure, where an “incommensurable” paradigm shift would drastically change the membership of nodes to modules. Accordingly, we built a multilayer network with each layer containing the concept network at each year (Bianconi, 2018) and used multilayer community detection to identify modules at each year and understand how those modules change in constitution over years (Mucha et al., 2010). To summarize these data, we followed prior work to calculate the number of times that a node changes its membership to a module for each year (Figure 3(a)) (Killick et al., 2012). Then, we performed change point detection on the number of changes observed to identify epochs of stability in module membership (Figure 3(b); Methods) (Killick and Eckley, 2014). We found that after an initial, short epoch of little change, the network enters an epoch of moderate and lasting change (Figure 3(c)). Then, the network enters a short epoch of much greater change, after which the network stabilizes in the last epoch. This signature is not observed in randomly rewired networks (Figure S4). The data demonstrate that shifts in the structure of concepts occur not as abrupt Kuhnian reconfigurations but as gradual Lakatosian modifications (Daston, 2016). Moreover, the findings demonstrate that subjects (or fields) display a shared signature of structural stability and instability across time. Concept networks undergo a signature pattern in structural stability. (a) Module membership (colored) of nodes across years for the 
Finally, we ask whether we can predict the perceived merit of a discovery by measuring the node’s theoretical ability to influence a body of knowledge due to its location within the gappy topology. We operationalize perceived merit in two ways: (i) a network-based measure and (ii) the receipt of a Nobel prize (Jin et al., 2021; Li et al., 2019; Szell et al., 2018), although we acknowledge the imperfect nature of the latter assessment (Nature, 2018, The Lancet, 2018). After constructing a single network containing all nodes from all subjects, we calculated the network’s impulse response as a measure of a node’s potential influence. Arising from dynamical systems theory, the impulse response quantifies how much a network “responds” to an “impulse” that perturbs one node (Figure 4(a); Methods). We observed that nodes that more frequently participate in the birth or the death of cavities have higher impulse responses (birth: Pearson’s correlation coefficient Network topology reveals the impact of concepts. (a) Illustration of the response of a network to an impulse applied to a node. (b) Nodes that more frequently participate in the birth or death of persistent cavities have higher impulse response, which is a dynamical-systems measure of a node’s influence on a network (birth: 
In summary, our findings reveal that human knowledge, in the case of articles in Wikipedia, grows by filling gaps in knowledge, perhaps driven by the collective curiosity of individual scientists (Golman and Loewenstein, 2018; Merton, 1974), through inward and outward exploration and gradual modifications to network structure. Moreover, knowledge discovered while creating and filling knowledge gaps is likely to be more influential and more frequently awarded in the scientific community.
Model assessment
Feyerabend
While Feyerabend’s theory arrived latest in chronological time (1975 vs. Kuhn’s in 1962 and Lakatos’s in 1970), we tested his hypothesis first because his ideas about the growth of scientific knowledge are the most different from the rest. Feyerabend posits that there is no single set of scientific methodologies employed in practice by all scientists (Feyerabend, 2010). Note that this claim stands in stark contrast to the Popperian and Kuhnian positions that science does have a characteristic pattern of discovery. Feyerabend was accepting of—and even promoted—a competition of theories and was careful to not demarcate science and its practice from other human endeavors, including storytelling and myth. These commitments served to counter the effects of a history of power structures and the influence of Western philosophy upon—and in justification of—science. To model Feyerabend’s theory, we use an edge-rewiring process that produces a random network topology (Maslov and Sneppen, 2002). It is critical to note here that we are not modeling the scientific process as a process of random rewiring; instead, we are modeling the network structure resulting from science without a characteristic pattern of discovery (Feyerabend’s thesis) as the culmination of a random rewiring of connections. In comparing edge-rewired networks to their real counterparts, we observed that real networks often display significantly non-random clustering on two topological scales: at the nodal scale, clustering manifests in a high clustering coefficient, whereas at the mesoscale, clustering manifests in the existence of modules and cores (Borgatti and Everett, 2000; Clauset et al., 2004; Fagiolo, 2007). Thus, a network formulation of knowledge suggests that the processes underlying network growth constrain the network topology, in opposition to Feyerabend’s theory.
Further, for Feyerabend new significant discoveries in science come as a result of reframing or seeing nearly all things completely anew, rather than filling otherwise well-recognized gaps in knowledge. In contrast, in our study we have shown that novel discoveries are the result of new cavities being formed or filled in the network and that these cavities are of higher-dimension, shorter period, and decreased frequency in real networks than in random or genetic networks. Hence, our results do not support the Feyerabend position of incommensurability and of novelty in knowledge being about a complete reconfiguration of the concept network. These observed distinctions between the data and Feyerabend’s predictions motivate the examination of other philosophical positions.
Lakatos
Lakatos hypothesized that science progresses as a
In this study, we operationalized Lakatos’s hypothesis as growth within the core-periphery structure of a concept network (Borgatti and Everett, 2000). From the perspective of network topology, nodes in the “core” are densely connected to each other and form the topological center of the network; in contrast, nodes in the “periphery” tend to connect to the core but not to other peripheral nodes. In general, peripheral nodes have fewer connections than core nodes. The core-periphery structure of a network has important implications for the modification of scientific theories. For example, it is more difficult to modify nodes in the core than those in the periphery because core nodes are densely connected to one another. Similarly, it is easier to modify nodes in the periphery than those in the core because periphery nodes are only loosely connected to the network. The core-periphery structure hence reflects Lakatos’s research programme with respect to both topology and scientific modification. In our study, we observed that concept networks do indeed display a core-periphery structure; however, we also observed that concept networks grow both “outward”, with core nodes preceding neighboring peripheral nodes, and “inward”, with core nodes preceded by neighboring peripheral nodes. This result supports an interesting aspect of Lakatos’s theory of science: it is seen as a layered core with an outer layer that is often updated by new discoveries that are influenced, in turn, by discoveries that occur in the periphery.
Kuhn
Kuhn’s ideas of scientific progress posit that there are two periods of science (Kuhn and Hacking, 2012). One period is called
To complement our casting of normal science as creating and filling cavities, we operationalized paradigms in terms of a concept network’s modular structure. Our choice was motivated by an appreciation of the following fact: the view that scientists hold about a body of knowledge can depend upon how a subject is organized into parts or
Interestingly, the process of cavity filling in real networks reveals the importance of curiosity as a catalyst for scientific progress. Kuhn recognized quite early that “normal science” would push scientists (and knowledge formation) into a constrained and conservative path—one that does not encourage true creativity and innovation (Kuhn, 2000). For Kuhn, the innovation occurred either when iconoclastic individuals were lucky enough to uncover something novel, or when the accumulated failures of a scientific research programme made the search for more creative solutions necessary and better rewarded. In our study, we provide no analysis of individuals, scientific commitments, or practices. Yet, our results on cavity filling and its propensity for being rewarded in the scientific community suggest not two disparate periods of constrained normal science and innovative paradigm shifts but rather a single process in which scientists are continually and collectively driven, perhaps by curiosity, to uncover novel information that “connects the dots” among existing pieces of knowledge.
Discussion
The question of precisely how individuals and groups “do science” has troubled scholars for millennia. In the past century, philosophers of science have proposed several distinct and well-defined processes whereby scientific knowledge might grow. Prominent theories include those by Kuhn, Lakatos, and Feyerabend, which posit different patterns of growth that are supported by historical evidence. However, progress in devising rigorous empirical assessments of these theories has been hampered by difficulties in gathering large amounts of historical data and in operationalizing the theories in a falsifiable way. In this study, we formulated the body of knowledge as a concept network (Hesse, 1974) and operationalized philosophical theories in terms of the network’s growing structure. We demonstrated that, in the case of Wikipedia, concept networks are non-randomly organized, being characterized by high modularity and notable clustering. The networks do not grow strictly outward, as one might have naively expected. Rather, they expand both outward and inward. The inward expansion manifests as a filling of network cavities, which produces concepts that are topologically influential and more often awarded Nobel prizes. Across almost all subjects or fields, the modular structure of concept networks morphs along a signature trajectory of stability and instability consistent with Kuhn’s notion of a paradigm shift. Broadly, our mathematical formulations of historical data pave the way to describe, understand, and even potentially guide scientific progress for individuals and funding agencies (Fortunato et al., 2018). Furthermore, our findings provide a data-driven approach to identifying novel contributions, especially those by underrepresented groups whose works are typically devalued yet are vital for vibrant scientific innovation (Hofstra et al., 2020; Reardon, 2013). In the remainder of this section, we expand upon the insights revealed by our findings.
Network science
Our effort to quantitatively evaluate philosophical theories of science was made possible by formalizing large-scale data into well-defined expressions. In our formalization, we primarily employed network science and algebraic topology. Using the former, we extracted graphs from a large database of text in Wikipedia pages, and using the latter, we extracted simplicial complexes from the same. Network representations are intuitive models for concepts and conceptual relations. Such representations have previously proven useful in the study of concept networks from Wikipedia, and in the characterization of their topology using measures of centrality, shortest paths, and clustering (Bellomi and Bonato, 2005; Lydon-Staley et al., 2021; Matas et al., 2017). Further, network representations can reveal patterns in data that are not quantifiable by observing each individual pairwise interaction—in this case, individual pages or hyperlinks—but that are only quantifiable by considering the entire network or subsection of a network. By complementing network science with algebraic topology, one can study higher order structures in concept networks, and thereby quantify the birth and death of topological cavities. This approach has previously proven useful in, for example, understanding the exposition of concepts in college textbooks (Christianson et al., 2020) and the growth of knowledge gaps in the semantic networks of toddlers (Sizemore et al., 2018). By modeling concept networks as units and pairwise or even higher-order relationships among units, one can operationalize hypotheses about the structure of knowledge and its change over time.
Methodological assumptions and limitations
In operationalizing the complex social process of knowledge discovery, we made a several assumptions. Each was chosen to better enable us to form empirically testable hypotheses. These methodological assumptions and limitations are explained more extensively in the Supplemental Discussion and are summarized here. First, to model knowledge discovery, we assume that the growing Wikipedia networks model how
Furthermore, we acknowledge the limitations and assumptions of modeling scientific discovery as growing concept networks. First, the process of scientific discovery involves a network of commitments and social practices in addition to the concepts themselves (Longino, 2002). Thus, the findings presented here would best be considered complementary to qualitative studies of theory and process. Our approach is useful because it allows us to operationalize and quantitatively test certain dynamics in the structure of concept networks on a large scale. Second, while the network structure is different from the process used to obtain that structure, we aim to infer and constrain the possible processes that produced the concept network that we see today in the structure of Wikipedia. Indeed, concepts can evolve over time in their name or contents or can even be disconnected from the scientific canon (Foster et al., 2015; Shwed and Bearman, 2010).
We note the limitations in our use of Nobel prizes as a measure of impact. As a highly subjective and often biased measure of impact (Lunnemann et al., 2019), we wish to test in future studies the robustness of our results not only in other datasets for Nobel prizes (Li et al., 2019) but also in other prizes (Jin et al., 2021). Moreover, while the goal of prizes is often to award for impact among other things, the effects of Nobel prizes on scientific practice is complex. Nobel prizes do not affect the citation impact of a Nobel laureate (Farys and Wolbring, 2017), but they do produce more papers and entrants in the topic associated with the scientific prize (Jin et al., 2021). Thus, scientific awards influence and are influenced by the body of scientific knowledge.
Finally, we acknowledge limitations in the way we represent Wikipedia pages as a concept network. First, without proper context, the title of a Wikipedia article can be ambiguous, and it may be difficult to infer the concept to which it best relates. This limitation is mitigated by the hyperlinks that connect a Wikipedia page to other pages; yet, even here it is important to acknowledge that the existence of a hyperlink involves some aspect of chance. Second, Wikipedia pages only reflect the current understanding of a concept and its relation to other concepts. Third, Wikipedia itself may be laden with the predispositions of the editors of Wikipedia regarding their philosophies of science. Moreover, their editing of Wikipedia may be affected by their political biases (Harvard Business School et al., 2018) or their styles of information seeking, which may influence which articles they choose to read and later edit (Lydon-Staley et al., 2021). Fourth, Wikipedia articles may actually affect scientific practice itself, an observation that highlights the impact of secondary source information, especially a widely available source like Wikipedia, on primary research (Thompson and Hanley, 2017). Thus, understanding patterns of growth on Wikipedia as they relate to primary sources seems ever more important. In summary, we applied the network methods presented here to Wikipedia as an initial and interesting test case, but we look forward to future efforts applying such methods to other datasets, primary or otherwise.
Future Directions
In summary, we operationalize and test the theories of prominent philosophers of science using concept networks of Wikipedia articles. We significantly extend the state of the field by building upon previous studies that have examined the topology of Wikipedia (Bellomi and Bonato, 2005; Matas et al., 2017; Yamada et al., 2020) and network principles in scientific practices (Clauset et al., 2017; Fontana et al., 2020; Fortunato et al., 2018; Sinatra et al., 2016). Moreover, we depart from prior work by using algebraic topology to test hypotheses regarding the sociology, history, and philosophy of science in a large-scale, empirical study. These methods can readily be applied to related questions about the structure and dynamics of concept networks. For example, differences in language or culture may influence the structure of concepts as encoded in Wikipedia pages. Other investigators may use network structure and document embedding to form more accurate and nuanced models of scientific discovery that explain the qualitative observations and interpretations of philosophers of science. Further, the methods demonstrated here may reveal insights into observed gender disparities in the processes of knowledge generation and scientific engagement (Ford and Wajcman, 2017). Our development of data-driven approaches hence lays the groundwork for others to further describe and understand the process of scientific discovery and its relation to diversity, equity, and inclusion.
Methods
Building growing concept networks from wikipedia
Network measures.
Null networks
To model Feyerabend’s hypothesis of anarchical scientific progress, we used an
Network methods
The network measures we used are summarized in Table 1. To calculate clustering coefficients, we compute the clustering coefficients of each node in the network with the Python library
Persistent homology
In our study, we hypothesized that processes of scientific discovery create and fill gaps in concept networks. A tool from applied algebraic topology, called persistent homology, provides a well-defined formulation of such gaps as persistent topological cavities that evolve as a network grows (Hatcher, 2002). To calculate persistent homology for a growing network, we first create a correspondence between
Temporally varying modularity
To detect modules across time, we first built a multilayer network from our growing networks. In the multilayer network, each layer is the network reflecting concepts discovered up to that year. Each node in a layer (e.g. at time
In performing this analysis, we observed that the module membership of nodes changed at different rates across time, with almost no changes in module membership in the second half of the period evaluated. Hence, we hypothesized that there may be epochs of module stability, with greater stability observed later in history. To identify such epochs, we first computed the number of changes in module membership across time as any time when the module membership is different than in the previous time point. Then, we summed the number of changes for each time point to obtain a single variable across time. Next, we used the R implementation of binary segmentation in the software package
Impulse response
To quantify the impact of an article-node on the subject network, we use the impulse response measure from linear systems theory (Kailath, 1980). Here, the network is represented as an adjacency matrix
To quantify the influence of a node on all topics using the impulse response function, we built a large network that consisted of all nodes in all subjects under study. In addition, we took the impulse response to a time horizon
Nobel prizes
We used Nobel prizes in Physics, Chemistry, and Physiology or Medicine as an external measure of influence. To identify which nodes in the concept networks received Nobel prizes, we parsed the Wikipedia articles “List of Nobel laureates in Physics”, “List of Nobel laureates in Chemistry”, and “List of Nobel laureates in Physiology or Medicine”. In the section “Laureates” for each article, there is a table of Nobel laureates that includes a rationale column, which describes the work of a laureate that motivated the Nobel prize with hyperlinks to articles that describe the laureate’s discoveries. For example, for the scientist Maria Skłodowska-Curie, the rationale column states, “for their joint researches on the radiation phenomena discovered by Professor Henri Becquerel” with a hyperlink to the article “Radiation”. By obtaining all hyperlinks to articles in the rationale columns, we identified nodes in the concept networks that were Nobel prize-winning nodes. All other nodes were identified as non-Nobel prize-winning.
Supplemental Material
Supplemental Material - Historical growth of concept networks in Wikipedia
Supplemental Material for Historical growth of concept networks in Wikipedia by Harang Ju, Dale Zho, Ann S Blevins, David M Lydon-Staley, Judith Kaplan, Julio R Tuma and Dani S Bassett in Collective Intelligence
Footnotes
Author Contributions
Conceptualization, H.J., D.Z.; Methodology, H.J., D.Z., A.S.B.; Software, H.J.; Validation, H.J.; Formal analysis, H.J.; Investigation, H.J.; Resources, H.J., D.S.B.; Data curation, H.J.; Writing – Original Draft, H.J.; Writing – Review & Editing, H.J., D.Z., A.S.B., D.M.L.-S., J.K., J.R.T., D.S.B; Visualization, H.J., D.Z., A.S.B., D.S.B.; Supervision, H.J., D.S.B.; Project administration, H.J., D.S.B.; Funding acquisition, D.S.B.
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: The authors acknowledge support from the John D. and Catherine T. MacArthur Foundation, the Alfred P. Sloan Foundation, the ISI Foundation, the Paul Allen Foundation, the Army Research Laboratory (W911NF-10-2-0022), the Army Research Office (Bassett-W911NF-14-1-0679, Grafton-W911NF-16-1-0474, DCIST-W911NF-17-2-0181), the Office of Naval Research, the National Institute of Mental Health (2-R01-DC-009209-11, R01-MH112847, R01-MH107235, R21-MH-106799), the National Institute of Child Health and Human Development (1R01-HD086888-01), National Institute of Neurological Disorders and Stroke (R01-NS099348), the National Science Foundation (BCS-1441502, BCS-1430087, NSF PHY-1554488 and BCS-1631550), and the National Institute on Drug Abuse (K01DA047417).
Data availability statement
All code is available on https://github.com/harangju/wikinet. All data used in the study are available on https://www.dropbox.com/sh/43qg5dop650kdul/AACTz1bRi5t1pWXxSBaurK9La. The latest multistream dumps of Wikipedia are available on
.
Diversity statement
Recent work in several fields of science has identified a bias in citation practices such that papers from women and other minority scholars are under-cited relative to the number of such papers in the field (Caplar et al., 2017; Dion et al., 2018; Dworkin et al., 2020; Mitchell et al., 2013). Here, we sought to proactively consider choosing references that reflect the diversity of the field in thought, form of contribution, gender, race, ethnicity, and other factors. First, we obtained the predicted gender of the first and last author of each reference by using databases that store the probability of a first name being carried by a woman (Dworkin et al., 2020). By this measure (and excluding self-citations to the first and last authors of our current paper), our references contain 25.53% woman(first)/woman(last), 9.09% man/woman, 10.64% woman/man, and 54.74% man/man. This method is limited in that a) names, pronouns, and social media profiles used to construct the databases may not, in every case, be indicative of gender identity and b) it cannot account for intersex, non-binary, or transgender people. Second, we obtained predicted racial/ethnic category of the first and last author of each reference by databases that store the probability of a first and last name being carried by an author of color (Ambekar et al., 2009; Sood and Laohaprapanon, 2018). By this measure (and excluding self-citations), our references contain 5.06% author of color (first)/author of color(last), 15.64% white author/author of color, 16.35% author of color/white author, and 62.95% white author/white author. This method is limited in that a) names, Census entries, and Wikipedia profiles used to make the predictions may not be indicative of racial/ethnic identity, b) it cannot account for Indigenous and mixed-race authors, or those who may face differential biases due to the ambiguous racialization or ethnicization of their names, and c) it lacks baselines for this field of study to which to compare measures of probabilities. We look forward to future work that could help us to better understand how to support equitable practices in science.
Supplemental Material
Supplemental material for this article is available online.
References
Supplementary Material
Please find the following supplemental material available below.
For Open Access articles published under a Creative Commons License, all supplemental material carries the same license as the article it is associated with.
For non-Open Access articles published, all supplemental material carries a non-exclusive license, and permission requests for re-use of supplemental material or any part of supplemental material shall be sent directly to the copyright owner as specified in the copyright notice associated with the article.
