Abstract
Medical information on English Wikipedia was accessed over 2 billion times in 2018. Our goal was to develop an automated system to assist Wikipedia volunteers to improve articles with high-quality sources from journals such as The Cochrane Library. We created an automated indexing system by linking available reviews from the Cochrane library with disease-related Wikipedia articles and evaluating the relationship between the quality and importance of these articles with the number of relevant and cited Cochrane reviews. We first conducted a bibliometric analysis, identifying disease-related Wikipedia articles and relevant/cited Cochrane reviews. Citations were thematically coded, and descriptive statistics were calculated. Finally, separate multinomial logistic regression analyses were conducted for article quality and importance. The indexing system identified 4381 disease-related Wikipedia articles, 1193 (27%) of which cited a Cochrane review. Higher quality Wikipedia articles were more likely to cite a Cochrane review (p = 0.002), while lower quality articles were less likely to cite a Cochrane review (p < 0.0005). A greater number of Cochrane reviews are available for more ‘important’ Wikipedia articles (p < 0.005), and these articles were more likely to cite a Cochrane review (p < 0.005). This approach to an indexing system can be leveraged by Wikipedia contributors and editors seeking to update disease-related Wikipedia articles with relevant Cochrane reviews (thus improving their quality), and online information seekers in need of additional information to supplement their Wikipedia search.
Introduction
An increasing proportion of healthcare professionals have spent their entire lives in an environment where smartphones, streaming and web-surfing have been ubiquitous. 1 These ‘digital natives’ expect web-access to be universal and information to be perpetually accessible. 2 The way information is consumed and implemented in the healthcare setting is changing rapidly as increasing numbers of individuals occupy and migrate to the digital space.3,4
Healthcare professionals experience a theoretical unknown in 50 per cent of their clinical encounters, 5 while patients can experience thousands of questions over the timespan of a disease.6,7 A typical avenue to fulfilling these ‘information needs’ is the Internet; 280 million health-related queries are made on search engines like Google every day.8,9
Wikipedia is usually among the top search results on Google.10,11 Created in 2001, Wikipedia is ‘a multilingual, web-based, free content encyclopaedia project supported by the Wikimedia Foundation and based on a model of openly editable content’. 12 Wikipedia is the fifth most popular site on the web with 2,125,792 visits per day 13 and is the most popular online healthcare resource globally.14–16 Indeed, the use of Wikipedia among students,17–19 physicians,17,20,21 pharmacists 22 and nurses23,24 to fulfil their information needs is widespread. Medical-related Wikipedia articles are accessed over 10 million times per day across all languages. 25 This, despite several groups advocating caution when relying on Wikipedia as a primary information resource.26,27 These cautionary notes notwithstanding, the reputation of Wikipedia in academia continues to improve, 28 as does the perception of its quality. 29
Medical-related article development on Wikipedia is supported by volunteers, many of whom are affiliated with the WikiProject Medicine Foundation.18,30 Efforts to improve the quality of health-related articles on Wikipedia are numerous and include initiatives to encourage medical professionals and students to contribute to Wikipedia, translation efforts, partnerships with health organisations and institutions, and developing offline content for parts of the world that do not have access to the Internet.14,27,31 Cochrane began working with WikiProject Medicine in 2014 to improve the quality and content of the health evidence available on Wikipedia. This partnership is intended to support the inclusion of evidence produced and shared by Cochrane, where appropriate, and the continued update of Wikipedia articles when new reviews are published. 32
Cochrane Reviews are systematic reviews of primary research in human health care and health policy, and are internationally recognised as representing a high standard in evidence-informed health care. 33 Wikipedia has a precedent of sharing information from Cochrane Reviews to improve the quality of its health-related articles.34–36 Currently, over 3000 Cochrane systematic reviews are cited in Wikipedia articles. 37 One of the difficulties of maintaining congruence between medical-related Wikipedia articles and available Cochrane reviews is the continually evolving evidence landscape. English Wikipedia contains over 32,000 medical-related articles and there are more than 7000 active Cochrane reviews published in The Cochrane Library. In addition, approximately 30 new Cochrane reviews and 30 updated reviews are published each month. 38 Ensuring that the evidence being shared is presented accurately and appropriately and maintaining congruence between these two data sets represents a significant burden for a global community of Wikipedians. The inevitable discord between available Cochrane reviews and their citation on Wikipedia represents a significant opportunity for computer-automation technologies to harmonise content from a high-quality evidence resource (The Cochrane Library) with one of the most popular online knowledge bases (Wikipedia).
The primary aim of this investigation was to develop a free, automated indexing system that links reviews from the Cochrane Library with health-related Wikipedia articles and that can be used by online information seekers with minimal computational burden. Using this system, our objectives were as follows:
To index health-related Wikipedia articles for which Cochrane reviews were available (‘relevant’ Cochrane reviews), to determine whether those reviews were cited, and to verify whether the cited reviews were the latest version available or needed updating.
To identify important health-related Wikipedia articles for which no Cochrane reviews were available.
To evaluate the relationship between Wikipedia article quality, importance and its citation of relevant Cochrane review articles.
This hypothesis-confirming research seeks to corroborate Wikipedia’s own guidelines for article quality, which specify that higher quality articles must have inline citations from reliable sources. 39
Methods
Data collection
Between January and April 2019, a system of search implementation to identify disease-related Wikipedia articles and Cochrane reviews that were relevant to those articles was developed. This tool, named WP2Cochrane, is now publicly available, and is presented as an interactive Jupyter Notebook (https://mybinder.org/v2/gh/ajoorabchi/WP2Cochrane/master?urlpath=lab/tree/index.ipynb) that serves as a tutorial for the process of linking disease-related Wikipedia articles with relevant Cochrane reviews. The Jupyter Notebook pipeline provides scripts that process the raw data into tabularised results that can enable others to replicate and validate our analyses. The linkage results in HTML and CSV format and the Python source code are available on the project’s GitHub repository (https://github.com/ajoorabchi/WP2Cochrane). For non-technical users, a browser extension was developed for Google Chrome containing the Cochrane2WP tool (https://chrome.google.com/webstore/detail/wikipedia-%20-cochrane-libr/cehpfefpnicpmejgidpkgeenapnfcakm; Figure 1). The extension activates when the user is browsing Wikipedia. While browsing a disease-related Wikipedia article, users are provided with a list of related Cochrane reviews by clicking on the extension’s icon. The extension retrieves mapping data from the Cochrane2WP’s repository on GitHub and, therefore, its results are updated on a monthly basis to accommodate newly published Cochrane reviews and new disease articles published on Wikipedia. The Cochrane2WP script is currently hosted on a Google Cloud server and set to run once a month. The result of each new run will be automatically uploaded to the tool’s GitHub repository, and the historical results will be kept in a persistent storage folder for future trend analyses (https://github.com/ajoorabchi/WP2Cochrane/tree/master/persistent_storage).

Figure 1. The Google Chrome extension for the WP2Cochrane tool.
The process of Wikipedia to Cochrane library linkage commences by first retrieving the list of disease-related Wikipedia articles as indexed on Wikidata, using a SPARQL query (https://w.wiki/3kg). Wikidata is a machine-readable knowledge base and a central store of structured data for Wikipedia. 40 Using the Wikidata Query Service (https://query.wikidata.org/), all entities on Wikidata that have a valid corresponding Wikipedia article and are identified as an instance of a ‘disease’ were retrieved (e.g. dyslexia; https://www.wikidata.org/wiki/Q132971). As of 30 April 2019, this list included 11,621 diseases on Wikidata, with 4381 corresponding disease-related Wikipedia articles.
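The retrieval step described above can be sketched as follows. This is a minimal illustration and not the query published at https://w.wiki/3kg: it assumes that ‘instance of’ is Wikidata property P31 and that ‘disease’ is item Q12136, and the function names build_disease_query and parse_bindings are hypothetical. The example parses a hand-written response in the standard SPARQL JSON results shape, so it runs without network access.

```python
import json

# Assumed Wikidata identifiers: P31 = "instance of", Q12136 = "disease".
DISEASE_QID = "Q12136"


def build_disease_query(disease_qid=DISEASE_QID):
    """Return a SPARQL query for disease items that have an English Wikipedia article."""
    return f"""
SELECT ?item ?itemLabel ?article WHERE {{
  ?item wdt:P31 wd:{disease_qid} .
  ?article schema:about ?item ;
           schema:isPartOf <https://en.wikipedia.org/> .
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en". }}
}}
""".strip()


def parse_bindings(response_text):
    """Flatten a SPARQL JSON response into {qid, label, article_url} records."""
    bindings = json.loads(response_text)["results"]["bindings"]
    records = []
    for row in bindings:
        records.append({
            # The entity URI ends with the item identifier, e.g. .../entity/Q132971
            "qid": row["item"]["value"].rsplit("/", 1)[-1],
            "label": row.get("itemLabel", {}).get("value", ""),
            "article_url": row["article"]["value"],
        })
    return records


# Offline example: one hand-written binding for the dyslexia item cited in the text.
sample = json.dumps({"results": {"bindings": [{
    "item": {"type": "uri", "value": "http://www.wikidata.org/entity/Q132971"},
    "itemLabel": {"type": "literal", "value": "dyslexia"},
    "article": {"type": "uri", "value": "https://en.wikipedia.org/wiki/Dyslexia"},
}]}})
print(parse_bindings(sample))
```

In practice the query string would be submitted to the Wikidata Query Service and the JSON response passed to the parser; the count of returned items will differ from the April 2019 figures as Wikidata evolves.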
In the second stage of the process, a search on PubMed is automatically conducted to identify all published Cochrane reviews for each of the 4381 disease-related articles on Wikipedia. PubMed currently indexes all reviews published in The Cochrane library. 41 By combining Boolean operators, we conducted a search for each disease-related article in the aggregated list, using the article title and its reroute terms. Due to the potential use of terminological derivations for the same medical concept on Wikipedia and Cochrane (e.g. ‘sprained ankle’ vs ‘ankle sprain’), a list of abbreviations was parsed from the Medical Subject Headings (MeSH) list along with text simplification (lower case, no punctuation) to increase the reliability of the method.
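To make the Boolean search construction concrete, the sketch below builds one PubMed query string per article from its title and alternative names, after the text simplification described above. It is an assumption-laden illustration rather than the WP2Cochrane code: the function names are hypothetical, and the use of the [Title/Abstract] field tag and the "Cochrane Database Syst Rev"[Journal] filter is our guess at a reasonable way to restrict results to Cochrane reviews.

```python
import string


def normalise(term):
    """Lower-case a term, replace punctuation with spaces and collapse whitespace."""
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    return " ".join(term.lower().translate(table).split())


def build_pubmed_query(title, aliases=()):
    """OR together the title and its aliases, then AND with a Cochrane journal filter."""
    seen, terms = set(), []
    for raw in (title, *aliases):
        term = normalise(raw)
        if term and term not in seen:  # skip empty strings and duplicates
            seen.add(term)
            terms.append(f'"{term}"[Title/Abstract]')
    journal = '"Cochrane Database Syst Rev"[Journal]'  # assumed filter, not from the source
    return f"({' OR '.join(terms)}) AND {journal}"


print(build_pubmed_query("Sprained ankle", ["Ankle sprain", "ankle sprain"]))
```

Replacing punctuation with spaces is a crude choice (an apostrophe in a possessive eponym becomes a space, for example), so a production pipeline would likely need additional handling for such names.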
Data extraction and coding
The quality and importance of all disease-related Wikipedia articles (as judged by the articles’ editors) were first extracted. Wikipedia article quality is measured on a grade scale with nine levels (in order of descending quality: featured article, A-class article, good article, B-class article, C-class article, start article, stub article and list article); 42 however, no ‘A-class’ articles were identified from the list of disease-related Wikidata entries, so this grade was removed from further coding and analysis. Wikipedia article importance is measured on a scale with four levels (top, high, mid and low); WikiProject Medicine’s importance scale typically answers the question, ‘How important is it to Wikipedia’s coverage of this project’s subject area that there should be an article for this topic’. 39
In the second step, every available, relevant and cited Cochrane review was identified via its associated PubMed ID (PMID). ‘Available’ Cochrane reviews were classified as such on the basis that they were:
Published by the Cochrane library and indexed on PubMed.
‘Relevant’ Cochrane reviews were classified as such on the basis that they were:
Published by the Cochrane library and indexed on PubMed.
Determined to be relevant to a disease-related Wikipedia article with a corresponding entry on Wikidata.
‘Cited’ Cochrane reviews were classified as such on the basis that they were:
Published by the Cochrane library and indexed on PubMed.
Determined to be relevant to a disease-related Wikipedia article with a corresponding entry on Wikidata.
Cited in that disease-related Wikipedia article.
A Cochrane review was considered to be relevant to a Wikipedia article when the title of the Wikipedia article or its variations in Wikidata appeared in the title or abstract of the review. De-duplication of Cochrane review updates was achieved by applying a title-matching filter to the review list, as updated Cochrane reviews on the same topic retain their titles but are assigned a unique PMID per each update. PMIDs are numerical and sequential; therefore, the title with the largest PMID number will be the most up-to-date version of the review. Instances where an outdated Cochrane review was cited on Wikipedia were recorded (i.e. Cochrane published an ‘updated’ review, but the updated version was not yet cited or reflected in the Wikipedia article). The advantage of this method is that it is computationally efficient, and does not require access to the full texts of the Cochrane review articles such that semantic mapping of full texts can be conducted; the authors envisage that many users may not have access to full-text reviews to enable semantic mapping.
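The relevance rule and the title-based de-duplication described above can be sketched in a few lines. This is an illustrative reconstruction under stated assumptions, not the project’s code: the record fields (pmid, title, abstract), the function names and the PMIDs in the example are all invented for demonstration.

```python
def title_key(title):
    """Normalise a review title so that updates of the same review share a key."""
    return " ".join(title.lower().split())


def is_relevant(names, review):
    """True if any article name occurs in the review title or abstract (case-insensitive)."""
    haystack = f"{review['title']} {review.get('abstract', '')}".lower()
    return any(name.lower() in haystack for name in names)


def latest_versions(reviews):
    """Keep, for each title, only the review with the largest PMID (the newest update)."""
    latest = {}
    for review in reviews:
        key = title_key(review["title"])
        if key not in latest or review["pmid"] > latest[key]["pmid"]:
            latest[key] = review
    return list(latest.values())


def outdated_citations(reviews, cited_pmids):
    """Return cited PMIDs for which a same-title review with a larger PMID exists."""
    newest = {title_key(r["title"]): r["pmid"] for r in latest_versions(reviews)}
    return sorted(
        r["pmid"] for r in reviews
        if r["pmid"] in cited_pmids and r["pmid"] < newest[title_key(r["title"])]
    )


# Invented records: two versions of one review plus an unrelated review.
reviews = [
    {"pmid": 100, "title": "Interventions for ankle sprain", "abstract": "Original version."},
    {"pmid": 250, "title": "Interventions for ankle sprain", "abstract": "Updated version."},
    {"pmid": 180, "title": "Exercise for low back pain", "abstract": ""},
]
relevant = [r for r in reviews if is_relevant(["Sprained ankle", "Ankle sprain"], r)]
print([r["pmid"] for r in latest_versions(relevant)])
print(outdated_citations(relevant, cited_pmids={100}))
```

Because matching is a plain substring test on titles and abstracts, it needs no full-text access, which is the efficiency argument made above; the trade-off is that it cannot detect reviews that discuss a condition under wording absent from the alias list.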
Citation compilation
Once the linking process is complete, the WP2Cochrane system compiles the results into a number of HTML and CSV formatted files which can be easily used by contributors and editors of WikiProject Medicine, without the need for any technical knowledge. The tool creates two different sets of result files. In the first set, all the disease-related Wikipedia articles and their corresponding Cochrane reviews are listed at once. In the second set, the Wikipedia articles and their linkage results are divided between multiple files based on the task force(s) that they belong to. The WikiProject Medicine currently has 17 different task forces (https://en.wikipedia.org/wiki/Wikipedia:WikiProject_Medicine/Task_forces). Members of each task force are assigned a subset of Wikipedia articles most relevant to their speciality to maintain (e.g. cardiology, dermatology, neurology). Our system identifies the task force(s) to which disease-related articles belong and clusters the results accordingly. This feature is designed to further assist Wikipedians to focus their efforts on the articles which are most relevant to their speciality. Figure 2 shows a screenshot of the HTML results page for the neurology task force.

Figure 2. Sample HTML results for the neurology task force.
As shown in the figure, results are presented in a seven-column table listing Wikipedia articles and their properties (i.e. title, class, importance, taskforces and corresponding Wikidata disease), and a list of related Cochrane reviews per article. We have used colour coding to indicate the status of each Cochrane review in the list (green: up-to-date and cited, red: up-to-date and not cited, orange: out-of-date and cited, grey: out-of-date and not cited). In cases where a Cochrane review has multiple versions, they are grouped together (showing the same background colour) and listed chronologically (latest version first). The interactive HTML tables are made using the DataTables (https://datatables.net/) plug-in. This enables users to interactively sort, search and filter the results. All the results files are available online (https://github.com/ajoorabchi/WP2Cochrane).
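The colour scheme and the task-force clustering can be expressed compactly. The sketch below is a hypothetical illustration of that logic only: the function names, the record layout and the placeholder article titles are assumptions, and the actual tool renders HTML tables with DataTables rather than printing dictionaries.

```python
from collections import defaultdict

# Colour per (up_to_date, cited) combination, as described in the text.
STATUS_COLOURS = {
    (True, True): "green",    # up-to-date and cited
    (True, False): "red",     # up-to-date and not cited
    (False, True): "orange",  # out-of-date and cited
    (False, False): "grey",   # out-of-date and not cited
}


def review_colour(up_to_date, cited):
    """Return the display colour for one Cochrane review."""
    return STATUS_COLOURS[(bool(up_to_date), bool(cited))]


def group_by_task_force(articles):
    """Map each task force to the titles of the articles assigned to it."""
    groups = defaultdict(list)
    for article in articles:
        # An article may belong to several task forces, or to none.
        for task_force in article["task_forces"] or ["(unassigned)"]:
            groups[task_force].append(article["title"])
    return dict(groups)


# Placeholder records; the task-force assignments are invented for illustration.
articles = [
    {"title": "Article A", "task_forces": ["neurology"]},
    {"title": "Article B", "task_forces": ["neurology", "cardiology"]},
    {"title": "Article C", "task_forces": []},
]
print(review_colour(up_to_date=True, cited=False))
print(group_by_task_force(articles))
```

An article that belongs to several task forces appears in each corresponding group, which mirrors the per-task-force result files described above.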
Statistical analysis
Descriptive statistics for: (1) available Cochrane reviews; (2) relevant Cochrane reviews and (3) cited Cochrane reviews were first computed. By aggregating duplicate Cochrane reviews (based on title) and their associated PMIDs, it was possible to identify (4) instances where a disease-related Wikipedia article cited an out of date Cochrane review where a newer one was available (‘out-of-date’ review citations).
We then sought to evaluate the association between outcomes 1–4 and Wikipedia article quality and importance. First, a Spearman’s rank-order correlation was run to assess the relationship between: (a) Wikipedia article quality; (b) Wikipedia article importance and (1) the number of relevant Cochrane reviews; (2) the number of cited Cochrane reviews. Next, our intention was to run a cumulative odds ordinal logistic regression with proportional odds to determine the effect of (1) the number of available Cochrane reviews and (2) the number of cited Cochrane reviews on Wikipedia article quality (including featured, good, B-class, C-class, start, stub and list) and importance (top, high, mid, low). However, preliminary testing revealed a violation of the assumption of proportional odds, as assessed by a full likelihood ratio test comparing the fitted model to a model with varying location parameters. To accommodate this, separate multinomial logistic regression analyses were conducted for each potential expression of the dependent variable (quality and importance), coded in binary format. The independent variables and covariates in the multinomial logistic regression for the different grades of quality were importance, and relevant and cited Cochrane reviews, respectively. The independent variables and covariates in the multinomial logistic regression for the different grades of importance were quality, and relevant and cited Cochrane reviews, respectively.
The significance threshold for this analysis was set using a Bonferroni correction for multiple comparisons at p < 0.025 (0.05/2, for the two dependent variables). All data were analysed using Statistical Analytics Software (Version 18, SPSS Inc., Chicago, IL, USA).
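For readers without access to SPSS, the rank-order correlation and the Bonferroni threshold can be illustrated with the standard library alone. This is a sketch, not the analysis that was run: the function names are hypothetical, the data are invented, and the integer coding of quality (1 = featured, larger numbers = lower grades) is an assumption. Under that coding, a negative coefficient means that higher quality articles cite more reviews.

```python
from math import sqrt


def average_ranks(values):
    """Assign 1-based ranks, giving tied values the mean of the ranks they span."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho: the Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sqrt(sum((a - mx) ** 2 for a in rx))
    sy = sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)


# Bonferroni-adjusted threshold for two dependent variables (quality, importance).
ALPHA = 0.05 / 2

# Invented data: assumed quality codes (1 = best) and numbers of cited reviews.
quality_code = [1, 2, 3, 4, 5, 6]
cited_reviews = [9, 7, 7, 3, 1, 0]
print(spearman_rho(quality_code, cited_reviews), ALPHA)
```

The sketch returns only the coefficient; a p-value to compare against the adjusted threshold would come from a statistics package.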
Results
The system of search implementation identified 11,621 disease entries on Wikidata with 4381 corresponding Wikipedia articles. 43 Of the 4381 disease-related Wikipedia articles, 1193 cited a Cochrane review, with <1 per cent of these citing an out-of-date review. Of the remaining 3188 Wikipedia articles, 836 (26%) did not cite a Cochrane review, despite at least one ‘relevant’ one being identified.
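The denominators behind these proportions are easy to misread, so the short check below recomputes them from the counts reported above. The variable names are ours; no figures are introduced beyond those in the text.

```python
total_articles = 4381          # disease-related Wikipedia articles
citing = 1193                  # articles citing at least one Cochrane review
not_citing = total_articles - citing
uncited_with_relevant = 836    # non-citing articles with a relevant review available


def percentage(part, whole):
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)


print(not_citing)                                     # articles citing no review
print(percentage(citing, total_articles))             # share of all articles that cite
print(percentage(uncited_with_relevant, not_citing))  # share of non-citing articles
```

The 26 per cent figure therefore describes the 836 articles as a share of the 3188 non-citing articles, not the 3188 as a share of all 4381.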
The characteristics of these 4381 articles with regards to quality and importance are presented in Table 1.
Table 1. Characteristics of the included Wikipedia articles, stratified by quality and importance.
Characteristics of included articles
In total, 1193 (27%) of the disease-related Wikipedia articles cited one or more Cochrane Reviews. There were 48 instances (1.1% of all citations) where an out of date Cochrane review was cited when a newer version was available (i.e. cited review was ‘out of date’). Thirty-nine of these instances were from separate disease-related Wikipedia articles. There were seven instances where a single disease-related Wikipedia article cited two out of date Cochrane reviews, and one instance where a single disease-related Wikipedia article cited three out of date Cochrane reviews (specifically, the article on ‘Deep Vein Thrombosis’). Due to the low number of citations to out of date Cochrane reviews, this was not included in inferential statistical analyses.
Spearman analysis
Preliminary analysis showed the relationship to be monotonic between: (a) Wikipedia article quality; (b) Wikipedia article importance and (1) the number of available Cochrane reviews; (2) the number of cited Cochrane reviews, as assessed by visual inspection of a scatterplot. There was a statistically significant, moderate negative correlation between (a) and (b), and (1) and (2), indicating that there were more relevant Cochrane reviews for more important and higher quality Wikipedia articles, and that these articles were also more likely to cite a Cochrane review (Table 2).
Table 2. Results of the Spearman analysis evaluating the correlation between Wikipedia article quality and importance, and the number of relevant and cited Cochrane reviews in those articles.
Multinomial logistic regression analyses
Separate multinomial logistic regression analyses were conducted for each potential expression of the dependent variable. For Wikipedia article quality, this included seven models (featured article vs non-featured article; good article vs non-good article; B-class article vs non-B-class article; C-class article vs non-C-class article; start article vs non-start article; stub article vs non-stub article; list article vs non-list article). For Wikipedia article importance, this included four models (top importance vs non-top importance; high importance vs non-high importance; mid importance vs non-mid importance; low importance vs non-low importance).
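The binary coding of the dependent variable can be sketched as a one-versus-rest transformation, producing one 0/1 outcome vector per grade listed above. This is an illustration of the coding step only (the model fitting itself was done in SPSS); the function name and the example labels are hypothetical.

```python
QUALITY_GRADES = ["featured", "good", "B-class", "C-class", "start", "stub", "list"]
IMPORTANCE_GRADES = ["top", "high", "mid", "low"]


def one_vs_rest(labels, grades):
    """Return {grade: [1 if label == grade else 0, ...]}, one outcome vector per grade."""
    unknown = set(labels) - set(grades)
    if unknown:
        raise ValueError(f"Unexpected grade(s): {sorted(unknown)}")
    return {grade: [int(label == grade) for label in labels] for grade in grades}


# Invented article grades for illustration.
quality_labels = ["stub", "good", "start", "stub", "B-class"]
outcomes = one_vs_rest(quality_labels, QUALITY_GRADES)
print(outcomes["stub"])   # outcome for the stub vs non-stub model
print(len(outcomes))      # number of quality models
```

Each vector would then serve as the outcome of its own regression, with the number of relevant and cited reviews entered as covariates.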
Results of the multinomial logistic regression analyses, delineated by article quality and importance, are presented in Tables 3 and 4, respectively.
Table 3. Results of the multinomial logistic regression analyses for article quality, wherein article importance was included as an ordinal independent variable and the number of relevant and cited Cochrane reviews were included as continuous covariates.
Table 4. Results of the multinomial logistic regression analyses for article importance, wherein article quality was included as an ordinal independent variable and the number of relevant and cited Cochrane reviews were included as continuous covariates.
To summarise the statistically significant findings at the a priori significance threshold, there was a negative correlation between an article’s classification as a ‘Good’ article and the number of times it cited a Cochrane review (p = 0.002), and a positive correlation between an article’s classification as a ‘Starting’ or ‘Stub’ article and the number of times it cited a Cochrane review (p < 0.0005 in both cases). This indicates that higher quality (specifically, ‘Good’) Wikipedia articles were more likely to cite a Cochrane review than lower quality (specifically, ‘Starting’ or ‘Stub’) Wikipedia articles. In contrast, an article’s ‘featured status’ (the highest level on the Wikipedia quality scale) 39 was not explained by its citation of Cochrane reviews; however, this was likely due to the low number of ‘featured’ articles in the dataset (N = 22).
With regard to Wikipedia article importance, there was a statistically significant, negative correlation between the number of relevant Cochrane reviews and Wikipedia articles rated as being of ‘top’ or ‘high’ importance (p = 0.0005 in both cases). There was a statistically significant, positive correlation between the number of relevant Cochrane reviews and Wikipedia articles rated as being of ‘mid’ (p = 0.004) or ‘low’ (p = 0.002) importance. This indicates that more Cochrane reviews are available for more ‘important’ Wikipedia articles. There was also a statistically significant, positive correlation between the number of times that Wikipedia articles of ‘mid’ (p = 0.001) or ‘low’ (p < 0.0005) importance cited a Cochrane review, indicating that less important Wikipedia articles are less likely to cite Cochrane reviews.
Discussion
We developed an automated indexing system that links relevant reviews from the Cochrane Library with disease-related Wikipedia articles. This indexing system can be freely and readily accessed and operated online (https://mybinder.org/v2/gh/ajoorabchi/WP2Cochrane/master?urlpath=lab/tree/index.ipynb). The advantage of providing the pipeline in the Jupyter Notebook format is that others can quickly reproduce the results in this article, update them if necessary, and potentially conduct alternative analyses of the same dataset. The source code is also hosted on GitHub (https://github.com/ajoorabchi/WP2Cochrane). We encourage users with the necessary technical expertise to add, improve and customise the pipeline as they see fit. The current system can also be leveraged by online information seekers (including healthcare professionals) as they pursue a more comprehensive understanding of a Wikipedia topic using an indexed Cochrane review, and by Wikipedia editors, students or administrators seeking to update or improve disease-related Wikipedia articles.
Using this system, we indexed disease-related Wikipedia articles for which Cochrane reviews were available (‘relevant’ Cochrane Reviews), and evaluated whether those reviews were cited or not cited. We conducted inferential statistical analysis to evaluate the relationship between the quality and importance of a dataset of 4381 disease-related Wikipedia articles, and the number of relevant and cited Cochrane reviews in those articles. Preliminary assumption testing revealed that it was not appropriate to undertake a cumulative odds ordinal logistic regression analysis (due to a violation of the assumption of proportional odds, which likely occurred because the estimated parameters for citation count were not the same for predicting article quality/importance on their respective ordinal scales). As such, separate multinomial logistic regression analyses were conducted. This analysis revealed that there was a negative correlation between Wikipedia articles’ quality and their propensity to cite a Cochrane review. Specifically, articles classed as being ‘Good’ by the Wikipedia community (N = 69) were more likely to cite a Cochrane review (p = 0.002), whereas articles classed as ‘Starting’ (N = 1711) or ‘Stub’ (N = 906) articles by the Wikipedia community were less likely to cite a Cochrane review (p < 0.0005 in both cases). This indicates that higher quality Wikipedia articles (specifically, ‘Good’ articles) were more likely to cite a Cochrane review than lower quality (specifically, ‘Starting’ or ‘Stub’) Wikipedia articles. The relationship between ‘featured’ article status (the highest quality grade; N = 22) and citation of a Cochrane review was non-significant; however, this was likely due to the low sample size of this article group. These findings corroborate Wikipedia’s own guidelines for article quality, which specify that higher quality articles must have inline citations from reliable sources. 39 Our findings provide the first objective evidence of the successful implementation of these guidelines in practice.
Assuming that there is a relationship between a disease-related Wikipedia article’s citation of a Cochrane Review and article quality, these findings confirm the validity of Wikipedia’s methods of ensuring article quality in that updating Wikipedia articles that do not currently cite relevant Cochrane reviews is likely to improve their quality. This finding would have implications for the online community of Wikipedia editors as well as for students and professionals in healthcare currently using Wikipedia as an information resource. For example, editors could use the indexing system to expediently identify disease-related Wikipedia articles for which relevant Cochrane reviews are available but not cited, and out of date Cochrane reviews that are currently cited on Wikipedia that need to be updated. Students and professionals who are using Wikipedia to fulfil their healthcare information needs could review the list of non-cited but relevant Cochrane reviews, should they seek to gain further information on a disease-related Wikipedia topic. However, more research is needed to elucidate the relationship between a Wikipedia article’s quality and its citation of a relevant Cochrane review. For the purposes of the present investigation, we relied upon Wikipedia’s own quality grading scale, which is described as a measure of ‘how close we are to a distribution-quality article on a particular topic’. 42 ‘Distribution quality’ as determined by the Wikipedia community may not necessarily reflect scientific quality, as determined by educational or scientific societies; 42 our results should be interpreted with this in mind.
Assuming that there is a relationship between the importance of a disease-related Wikipedia article and the number of available Cochrane reviews, it can be inferred from our findings that any instances of a dissociation between an article deemed by the Wikipedia community to be of ‘top’ or ‘high’ importance and the number of available, relevant Cochrane reviews for that article could potentially direct the Cochrane community to conduct reviews to fill these gaps. For example, the article on ‘Frostbite’ (https://en.wikipedia.org/wiki/Frostbite) is rated as being of ‘high’ importance by the WikiProject Medicine’s dermatology taskforce; the PubMed ‘Clinical Queries’ tool identifies 687 articles relevant to the treatment of ‘Frostbite’ (access date: 1 May 2019), yet no Cochrane reviews exist for this condition. The Cochrane community could systematically review the available scientific literature corpus of this condition to generate a Cochrane review, which could then be leveraged by the Wikipedia community as a high-quality source of information on this condition. Our analysis revealed that more important Wikipedia articles were more likely to cite a relevant Cochrane review, corroborating this finding.
The dataset of Wikipedia articles included in this analysis was identified through Wikidata. Because Wikidata is a machine-readable knowledge base and central storage platform of structured Wikipedia data, 40 we were able to use it to parse metadata about Wikipedia articles in an automated manner. Our dataset of 4381 Wikipedia articles and their associated importance and quality were extracted on the basis that they were classified as a ‘disease’ on Wikidata (the precise Wikidata query is available here: https://w.wiki/3kg). This dataset is smaller than those included in previously conducted studies evaluating Wikipedia articles in the fields of ‘health & fitness’ (18,805 articles, access date: February 2017) 42 and ‘medicine’ (11,314 articles, access date: October 2017), 28 because we undertook an extra filtering step to include only those Wikipedia articles that directly describe a disease and hence were considered relevant in the context of linkage to the Cochrane library. This was deemed appropriate to investigate Cochrane review citations, the research questions for which are population-centred, 44 on the Wikipedia platform.
Of the 4381 disease-related Wikipedia articles, 1193 cited at least one Cochrane review and <1.1 per cent of these cited an out-of-date review. Of practical importance is the finding that a Cochrane review was available for 836 of the 3188 (26%) Wikipedia articles that did not presently cite a Cochrane review. For example, our system identified five relevant Cochrane reviews on ‘mumps’, none of which were cited in the Wikipedia article on this topic. Similarly, our system identified 49 Cochrane reviews on ‘lymphoma’, none of which were cited in the Wikipedia article on this topic. Our analysis would suggest that updating these Wikipedia articles with these Cochrane Reviews would improve their quality on the basis of Wikipedia’s own rating system. In light of the multi-societal disposition to consult the Internet for health-related information19,45–49 and the widely prevalent use of Wikipedia specifically to fulfil these information needs,11,19,23,50,51 its value as a free source of knowledge, and the importance of ensuring that this knowledge reflects high-quality scientific evidence, are indisputable. As the most referenced scientific journal among medical Wikipedia articles, 37 the Cochrane Database of Systematic Reviews is an invaluable source of the secondary research cited on Wikipedia. 52 Wikipedia relies heavily on crowd-sourced peer review to ensure the quality of its knowledge corpus. 53 Consequently, facilitating the peer-review process on Wikipedia via the indexing of relevant Cochrane reviews is a public health priority. There have been repeated calls for experienced medical professionals to get more actively involved in improving the accuracy of health-related Wikipedia articles, 14 and Wikipedia has sought to encourage its community to improve the quality of its articles. 
WikiProject was launched by the Wikimedia Foundation for this reason,18,30 and WikiProject Medicine, along with its community of editors, might benefit from the indexing system outlined in this article in expediting the editorial process towards improving disease-related Wikipedia articles.
Limitations
Despite the novelty of the indexing system, and the insights gained from the analyses outlined in this article, a series of limitations must be acknowledged. First, the indexing system and the associated analyses were limited to the English-language version of Wikipedia. While this is the largest version of Wikipedia, the possibility that other language Wikipedias might have divergent patterns cannot be discounted. Next, and as has already been alluded to, the validity of Wikipedia’s own grading systems of quality and importance may not reflect the sentiments of the scientific community. Further research is required to substantiate whether these scales consistently and reliably reflect scientific robustness (e.g. using the DISCERN criteria) 54 or whether there is a dissociation of perceived importance between the Wikipedia and scientific or academic communities for different disease-related Wikipedia topics. Finally, because the indexing system described in this report relies on the title of the disease-related Wikipedia article, its terminological derivations and its reroute terms to identify relevant Cochrane reviews, there is an assumption that disease-related Wikipedia articles are both correctly titled (i.e. there are no misspellings) and that these titles reflect the prevailing terminology in scientific research. Due to the potential for incorrect use of search and retrieval ontologies among the scientific community, and the occasional use of non-scientific descriptors as article titles on Wikipedia (e.g. see the Wikipedia article on ‘Charley horse’ to describe a lower limb haematoma; access date: 1 May 2019), it is difficult to establish the precision and recall of the indexing system.
Conclusion
The quality of Wikipedia articles relies on the willingness of experienced and knowledgeable volunteers to devote their time and effort to improving existing Wikipedia articles. Volunteers with WikiProject Medicine help to oversee the medical content shared on Wikipedia.18,30 Presently, efforts to improve the quality of these articles centre upon appropriate citation of high-quality secondary evidence (including systematic reviews).34–36 In this article, we describe an automated system that indexes reviews from the Cochrane Database of Systematic Reviews to relevant disease-related Wikipedia articles. Using this system, we identified an association between the quality of disease-related Wikipedia articles and their citation of Cochrane reviews, whereby higher quality articles were more likely to cite relevant Cochrane reviews. Similarly, a greater number of Cochrane reviews are available for Wikipedia articles deemed to be of greater importance by the community; instances of a dissociation between these variables, whereby no, or fewer, reviews are available for more important disease topics, can be used to inform the completion of future reviews. Our indexing system can be accessed via a web browser and is freely available, such that online information seekers can supplement their perusal of disease-related Wikipedia articles with relevant, but uncited, Cochrane reviews where necessary. Similarly, researchers can use this system to replicate, update or further explore our findings, or conduct their own analyses. Technically proficient readers can improve and customise the codebase by forking it on GitHub for their own purposes. Finally, the potential impact of article improvements on Wikipedia should also be considered through the lens of how articles are improved and evidence is shared on the platform.
Given that many ‘important’ Wikipedia articles are ‘watched’ by numerous active Wikipedia volunteers, adding a single new sentence of evidence with a citation (e.g. evidence from a Cochrane review) may attract increased editor traffic and prompt further edits by other volunteers, so that a small contribution may lead to substantial improvement of the article.
In future, we plan to enhance the usability and impact of our system by:
(a) Developing an NLP-based component capable of measuring the semantic similarity between the textual content of a Cochrane review and a Wikipedia article. This extension will enable us to rank and filter candidate Cochrane reviews according to their semantic relatedness to the given Wikipedia article, and hence reduce the amount of screening work required from Wikipedians.
(b) Developing a WP2Cochrane robot. There are currently over 2000 software robots running on Wikipedia. These robots carry out a wide range of repetitive tasks, helping Wikipedians with the maintenance of millions of articles. Our indexing system currently stores its results in multiple HTML and CSV files. The WP2Cochrane robot, once approved by the WikiProject Medicine community, will be able to edit the Wikipedia articles’ talk pages directly, and periodically update them with a list of relevant Cochrane reviews. Talk pages (also known as discussion pages) are administration pages where editors can discuss improvements to articles (https://en.wikipedia.org/wiki/Help:Talk_pages).
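As an illustration of the ranking step envisaged in (a), the sketch below scores candidate reviews against an article using TF-IDF weighting and cosine similarity. This is a simple lexical stand-in chosen for brevity, not the planned NLP component, which has not yet been built; a true semantic measure would also need to handle synonyms and paraphrase. All function names are hypothetical, and the inputs are assumed to be plain-text strings (article text and review abstracts).

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_vectors(documents):
    """Build one sparse TF-IDF vector (term -> weight dict) per document."""
    token_lists = [tokenize(doc) for doc in documents]
    n_docs = len(token_lists)
    # Document frequency: the number of documents in which each term occurs.
    doc_freq = Counter(term for tokens in token_lists for term in set(tokens))
    vectors = []
    for tokens in token_lists:
        counts = Counter(tokens)
        # Smoothed IDF keeps terms shared by every document above zero weight.
        vectors.append({
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors; 0.0 if either is empty."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_reviews(article_text, reviews):
    """Rank candidate reviews by similarity to the article.

    reviews: dict mapping a review title to its abstract text.
    Returns a list of (title, score) pairs, highest score first.
    """
    titles = list(reviews)
    vectors = tfidf_vectors([article_text] + [reviews[t] for t in titles])
    article_vec, review_vecs = vectors[0], vectors[1:]
    scored = [(t, cosine(article_vec, vec)) for t, vec in zip(titles, review_vecs)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

Reviews scoring below a chosen threshold could then be filtered out before the list is presented to Wikipedians; any such threshold would need to be calibrated empirically.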
Footnotes
Acknowledgements
The authors acknowledge Mr Peter Megyesi as the lead developer of a user-facing application that houses the system (the ‘SciScanner’ app).
Author’s note
Jennifer Dawson is also affiliated with CHEO Research Institute, Ottawa, Ontario, Canada.
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship and/or publication of this article: This research was co-funded by the European Regional Development Fund (ERDF) under Ireland’s European Structural and Investment Funds Programmes 2014-2020 under the ‘SciScanner’ project title, and the Health Research Board (ARPP-A-2018-002).
