Abstract
Objectives
Healthcare research performance is increasingly assessed through research indicators. We performed a systematic review to identify the indicators that have been used to measure healthcare research performance. We evaluated their feasibility, validity, reliability and acceptability; and finally assessed the utility of these indicators in terms of measuring performance in individuals, specialties, institutions and countries.
Design
A systematic review was performed by searching EMBASE, PsycINFO, Ovid MEDLINE and Cochrane Library databases between 1950 and September 2010.
Setting
Studies of healthcare research were appraised. Healthcare was defined as the prevention, treatment and management of illness and the preservation of mental and physical wellbeing through the services offered by the medical and allied health professions.
Participants
All original studies that evaluated research performance indicators in healthcare were included.
Main outcome measures
Healthcare research indicators, data sources, study characteristics, results and limitations for each study were studied.
Results
The most common research performance indicators identified in 50 studies were: number of publications (n = 38), number of citations (n = 27), Impact Factor (n = 15), research funding (n = 10), degree of co-authorship (n = 9), and h index (n = 5). There was limited investigation of feasibility, validity, reliability and acceptability, although the utility of these indicators was adequately described.
Conclusion
Currently, there is only limited evidence to assess the value of healthcare research performance indicators. Further studies are required to define the application of these indicators through a balanced approach for quality and innovation. The ultimate aim of utilizing healthcare research indicators is to create a culture of measuring research performance to support the translation of research into greater societal and economic impact.
Introduction
Academic healthcare is the synergy between studying disease mechanisms, identifying new treatments, improving patient care and training healthcare professionals. 1–3 Although the contribution of research to healthcare over the last century has been remarkable, academic healthcare often endures the inequality and lack of transition between basic and clinical research, fails to drive technology and innovation in clinical practice, underrates the role of education, and disregards social and global accountability. 1,3,4
To tackle these deficits, a system is required for academic healthcare researchers to measure research performance according to an accepted global benchmark, so that innovation and quality of research can be improved and new discoveries can be translated into medical advances. 5 Currently, systems such as The Research Assessment Exercise (UK) and Institutional Assessment Framework (Australia) have attempted to appraise academic research in general. 6,7 The Research Excellence Framework is a new development to assess the quality of research in UK higher education institutions, 8 although currently there are no validated systems to accurately measure performance in healthcare research. This has been difficult to implement due to the operational complexity of the discipline (Figure 1). 5 To design a system that can successfully measure healthcare research performance it is imperative to determine which indicators can measure this more accurately.

Figure 1. Elements of academic healthcare performance 5 (in colour online)
The objectives of this article are to: (1) identify existing indicators which specifically assess healthcare research performance; (2) assess each indicator to determine its feasibility, validity, reliability and acceptability; and (3) evaluate the utility of each indicator in terms of individual, specialty, institutional and global perspective.
Methods
This study was performed following guidelines from the preferred reporting items for systematic reviews and meta-analyses (PRISMA). 9
Data sources and searches
Studies to be included in the review were identified by searching the following databases: (1) EMBASE (1980–September 2010), (2) PsycINFO (1967–September 2010), (3) Ovid MEDLINE (1950–September 2010), and (4) Cochrane Library.
All databases were searched using the following free text search: ‘academic OR university OR education OR scientific OR institution’ AND ‘performance OR competence OR quality OR productivity’ AND ‘assessment OR evaluation OR indicator OR peer review’ AND ‘index OR bibliometric OR impact factor OR citation OR benchmark’ AND ‘health care OR medicine OR surgery OR physician OR biomedical OR hospital OR scientist’. The search was expanded by using all possible suffix variations of the keywords. Additional studies were identified by searching the bibliographies of the studies that had been identified through the electronic search. A keyword search was chosen rather than Medical Subject Headings (MeSH), because there was a lack of established MeSH terms in this area of research.
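The structure of the free text search can be illustrated with a short sketch, in which terms within each concept group are joined with OR and the groups are joined with AND. The function name `build_query` and the quoting of multi-word terms are assumptions made for illustration; the exact syntax accepted by each database interface differs and is not reproduced here.

```python
# Concept groups copied from the free text search described above.
CONCEPT_GROUPS = [
    ["academic", "university", "education", "scientific", "institution"],
    ["performance", "competence", "quality", "productivity"],
    ["assessment", "evaluation", "indicator", "peer review"],
    ["index", "bibliometric", "impact factor", "citation", "benchmark"],
    ["health care", "medicine", "surgery", "physician", "biomedical",
     "hospital", "scientist"],
]


def build_query(groups):
    """Join terms within a group with OR and join the groups with AND.

    Hypothetical helper: multi-word terms are wrapped in double quotes,
    which is an assumption about the target search interface.
    """
    def quote(term):
        return f'"{term}"' if " " in term else term

    clauses = ["(" + " OR ".join(quote(t) for t in group) + ")"
               for group in groups]
    return " AND ".join(clauses)


query = build_query(CONCEPT_GROUPS)
print(query)
```

Keeping the concept groups as data makes it straightforward to report the search exactly as it was run and to rerun it when the review is updated.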
Study selection
We included all original studies that evaluated research performance indicators which measured performance across individuals, specialties, institutions and countries in healthcare. For this study, healthcare was defined as the prevention, treatment, and management of illness and the preservation of mental and physical wellbeing through the services offered by the medical and allied health professions. 10 There were no language restrictions. We excluded all studies that did not have data relevant to healthcare.
Two authors (VP and HA) independently reviewed the titles and abstracts of the retrieved articles, and selected publications to be included in this review. The full texts of these publications were reviewed by the two authors, who selected the relevant articles for inclusion in the review. When there was disagreement, a third author (SA) was consulted and a decision was made by agreement of all authors.
Data extraction and quality assessment
Two authors (VP and HA) independently extracted data from the full text, which included source of article, study design, study period, type of performance indicator, data source, study population and their sample size, type of statistical analysis, outcomes and methodological limitations. Disagreements in data extraction were resolved by discussion and consensus between all authors. Study quality was assessed using the Oxford Centre for Evidence-based Medicine Levels of Evidence classification. 11
Data synthesis and analysis
The methodology of the included studies was heterogeneous; it was therefore not possible to pool data and statistically analyse the results. The indicators that were identified were analysed in terms of their: (1) utility (the usefulness of indicators at individual, specialty, institutional and global levels); (2) feasibility (measure of whether the indicator is capable of being used); (3) validity (measure of the relevance of the indicator: content, convergent and discriminant validity); (4) reliability (measure of the reproducibility or consistency of an indicator); and (5) acceptability (the extent to which the indicator is accepted by researchers). 12,13
Results
Study selection
We retrieved 6705 potentially relevant articles, of which 1185 duplicate articles were identified and excluded. Of the remaining articles, 5385 were excluded after title and abstract review. Review of the full text and bibliography of the remaining 135 articles identified 50 studies for inclusion in the review (Figure 2, Table 1).

Figure 2. Selection of articles for the systematic review
Study characteristics
All evidence was level 4 according to the Oxford Centre for Evidence-based Medicine. 11
Most studies were performed in North America (n = 20) 14–33 and Western Europe (n = 19). 34–52 Fewer studies were performed in Eastern Europe (n = 5), 53–57 South America (n = 3), 58–60 Asia (n = 2) 61,62 and Australia (n = 1). 63 The studies were published from 1973 until 2009, but the majority of the studies were published after the millennium (n = 34). 17,19,24,27,28,31–36,38–40,42–47,49–51,53–63 The design of each study was retrospective and observational.
Forty-two studies used Thomson Scientific's Institute for Scientific Information database (ISI). 14–18,20–30,32–34,36,37,39–46,49,51,53–56,58–63 Out of these, 10 studies used one additional database: Scopus (n = 2), 40,63 MEDLINE (n = 5), 24,28,44,52,62 PsycINFO (n = 1), 26 National Institutes of Health (NIH) (n = 1) 27 and institutional (n = 1); 58 three studies used two additional databases: MEDLINE and PsycINFO (n = 1), 54 EMBASE and MEDLINE (n = 1) 51 and PsycINFO and NIH (n = 1). 17 Out of the studies that did not use ISI, four studies used one database: MEDLINE (n = 2) 35,47 and institutional (n = 2); 49,57 four studies used two databases: institutional and MEDLINE (n = 1), 38 Scopus and Spanish Office of Patents and Trademarks (n = 1), 50 NIH and MEDLINE (n = 1) 19 and Scopus and Google (n = 1). 31
Only seven studies assessed research performance over a lifetime 31,40,52,54,58–60 in comparison to 24 studies assessing research performance over a 1–5-year period. 15,16,19–23,27–30,35,36,43,45,47–51,55,57,61,62
The main methodological limitation was the use of a single bibliometric database as the only information source in 32 studies. 14–16,18,20–23,25,27,29,30,32–37,39,41–43,45–47,49,51–53,55,56,59–61
Type of indicators
The types of indicator that were used to measure research performance in each study included number of publications (n = 38), 14,16–30,32–40,42–44,47,48,51–57,60 number of citations (n = 27), 14–17,20–23,26–28,30,32–34,36,37,39,41–43,45,51,52,55,57,63 Impact Factor (n = 15), 20,24,28,35,38,42,44,46,48,49,52,54,56,61,62 research funding (n = 10), 17–19,27,29,30,32,33,35,56 degree of co-authorship (n = 9), 20,31,37,38,41,49,52,56,57 population size (n = 6), 24,33,40,44,49,63 gross domestic product (n = 5), 24,33,40,44,49 h index (n = 5), 27,31,58–60 peer review (n = 6), 32,34,35,43,51,52 g index (n = 1), 31 age-weighted citation ratio (AWCR) (n = 1), 31 number of conference presentations (n = 1), 28 number of patents (n = 1), 50 number of doctoral students (n = 1), 17 number of editorial responsibilities (n = 1) 17 and gender (n = 2). 28,52 Twelve studies evaluated one indicator only, 15,25,41,45–47,50,53,58,59,61,62 whereas 16 studies evaluated two indicators, 14,18,19,23,29,36,39,48,54, 55,60,63 seven studies evaluated three indicators, 30,34,37,38,40,43,57 nine studies evaluated four indicators 20,27,31,32,34,35,42,44,56 and four studies evaluated five indicators. 17,28,33,49
Number of publications
The simplest measure of research productivity in healthcare is the number of published articles a researcher or group of researchers produces within a time span. 14,16–30,32–44,47–49,51–57,60,63 This indicator can be presented by document type so that letters, editorials, reviews and conference papers can be excluded. 47 It is relatively easy to calculate using bibliometric databases such as ISI, MEDLINE and Scopus, but these databases will ignore non-journal publications. It can be difficult to retrieve all the publications for certain researchers because of the commonality of names. 18 The number of publications does not take into account the size of the research group, the type of research or the quality of the publication. To address this problem, publications per author, population size or publications in top ranked journals can be considered. 47,49 Although the number of publications is commonly used to measure research performance in individuals, specialties, institutions and countries, often as a benchmark, there are no studies formally validating this indicator in healthcare. However, a few studies have shown significant correlation between the number of publications and other measures of research performance, such as citations, peer review and research funding. 19,30,32,34,35
Number of citations
The impact of healthcare research can be measured by counting the number of citations received by a researcher or group of researchers from published articles within a time span. 14–17,20–23,26–28,30,32–34,36,37,39,41–43,45,51,52,55,57,63 Bibliometric databases such as ISI, Scopus and Google Scholar are required to extract citation counts, which are subject to error because the databases are affected by commonality of names, typographical errors, variation of literature sources and geographical bias. 45,63 Citation analysis assumes that there is a positive association between the citing and referenced article, which does not account for articles that can be cited for negative impact. Citation counts are typically higher in older articles, falsely elevated by self-citations, and can vary between document type and specialty. 36,45 In order to make comparisons across specialties, relative citation factor can be used to normalize citation counts. 36,45 As well as specialties, the number of citations has been used to measure research performance in individuals, institutions and countries, but there are no studies formally validating this indicator in healthcare. One study, with a small sample size, has demonstrated a low correlation between number of publications and citations. 22 However, the majority of studies have shown significant correlation between the number of citations and other measures of research performance, such as publications, co-authorship, peer review and research funding.
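The idea of normalizing citation counts across specialties can be sketched as follows. The precise definition of the relative citation factor used in the cited studies is not given in this review, so the sketch assumes one simple form: the mean citations per paper of a research group divided by the mean citations per paper of its specialty. The function names and all figures are hypothetical.

```python
def citations_per_paper(citations):
    """Mean number of citations per paper for a list of citation counts."""
    if not citations:
        raise ValueError("at least one paper is required")
    return sum(citations) / len(citations)


def normalized_citation_score(group_citations, specialty_citations):
    """Group mean citations per paper relative to the specialty mean.

    Assumed normalization (not taken from the cited studies): a score of
    1.0 means the group is cited at the average rate for its specialty.
    """
    baseline = citations_per_paper(specialty_citations)
    if baseline == 0:
        raise ValueError("specialty baseline has no citations")
    return citations_per_paper(group_citations) / baseline


# Hypothetical data: group A works in a highly cited specialty, group B
# in a sparsely cited one.
group_a = [10, 20, 30]
specialty_a = [5, 10, 15, 20, 50]
group_b = [2, 4, 6]
specialty_b = [1, 2, 3]

score_a = normalized_citation_score(group_a, specialty_a)
score_b = normalized_citation_score(group_b, specialty_b)
print(score_a, score_b)
```

In this invented example group A has five times the raw citations per paper of group B, yet group B performs better relative to its own specialty, which is the distortion that normalization is intended to expose.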
Impact Factor (IF)
The Journal Impact Factor is calculated by dividing the number of current year citations to the source items published in that journal during the previous two years by the number of source items published in the journal during those two years. 55 It is an evaluation tool provided by ISI Thomson Reuters Journal Citation Reports® which is used to measure the scientific impact of journals. 56 Evaluating research performance using IF can have a marked effect on performance rankings. 20,24,35,38,42,46,48,54,56,61,62 However, the IF is influenced by publication language, document type, citation patterns, open access journals, fast track publications and co-authorship, as well as disregarding publications from zero impact journals. 61 More importantly, there is large IF variation between healthcare specialties. For this reason, IF may not reflect quality of research performance, but instead the different publication and citation patterns within specialties. 61 Normalizing the IF can provide a more realistic assessment of research quality, which has been demonstrated at an institutional level. 62
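The two-year calculation can be written as a short sketch. It follows the standard definition, in which current year citations to items published in the previous two years are divided by the number of source items published in those two years. The function name and the journal figures are hypothetical.

```python
def impact_factor(citations_this_year, items_previous_two_years):
    """Two-year Journal Impact Factor.

    citations_this_year: citations received in the current year by items
        the journal published in the previous two years.
    items_previous_two_years: number of source items the journal
        published in those two years.
    """
    if items_previous_two_years <= 0:
        raise ValueError("the journal must have published source items")
    return citations_this_year / items_previous_two_years


# Hypothetical journal: 150 source items in each of 2008 and 2009, which
# together received 600 citations during 2010.
if_2010 = impact_factor(citations_this_year=600,
                        items_previous_two_years=150 + 150)
print(if_2010)  # 2.0
```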
H index
The h index of a researcher is the largest number ‘h’ such that ‘h’ of the researcher's publications have at least ‘h’ citations each during a time span. 64 Initially the h index was introduced in physics to address the limitations of publication number, which does not account for research quality, and citation number, which can be disproportionately influenced by a small number of highly cited papers. 64 The h index simultaneously evaluates the quality and sustainability of research productivity, 64 and can be calculated without difficulty by bibliometric databases such as ISI, Scopus and Google Scholar. In healthcare, the h index has been shown to be a useful statistic to evaluate a researcher's contribution within a given specialty and may even be helpful as a promotional tool. 27,31,58–60 General drawbacks of bibliometrics, such as commonality of names and publication language are shared by the h index, which is also positively biased to senior researchers with older publications. 31,58,59
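A minimal sketch of the h index calculation is given below, assuming the citation counts of a researcher's publications have already been extracted from a bibliometric database. The function name and the citation counts are hypothetical.

```python
def h_index(citations):
    """Largest h such that h publications have at least h citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            h = rank
        else:
            break  # counts only decrease from here, so h cannot grow
    return h


# Two hypothetical researchers with the same total number of citations (44).
steady = [10, 10, 9, 8, 7]
one_hit = [25, 8, 5, 3, 3]

print(h_index(steady))   # 5
print(h_index(one_hit))  # 3
```

The second researcher has the same total citation count but a lower h index, because a single highly cited paper contributes no more than any other paper that clears the threshold.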
Indicators such as the g index and Age Weighted Citation Ratio (AWCR) have been proposed to address these limitations, but there is strong correlation between both of these measures and the h index. 31 In addition, the h index has been shown to overcome the disadvantages of multiple authorship and self-citation. 31 There is consensus that the h index cannot be used to measure research performance between different specialties because of diverse publication and citation practices. 27,31,58–60
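Neither the g index nor the AWCR is defined in this review, so the following sketch relies on commonly used definitions, which are assumptions here: the g index is taken as the largest number g such that the g most cited publications together have at least g squared citations, and the AWCR is taken as the sum over publications of citations divided by the age of the publication in years. Function names and data are hypothetical.

```python
def g_index(citations):
    """Largest g such that the top g papers have at least g*g citations."""
    ranked = sorted(citations, reverse=True)
    total = 0
    g = 0
    for rank, count in enumerate(ranked, start=1):
        total += count
        if total >= rank * rank:
            g = rank
    return g


def awcr(papers):
    """Age-weighted citation measure (assumed definition).

    papers: iterable of (citations, age_in_years) pairs, with age >= 1.
    """
    total = 0.0
    for citations, age in papers:
        if age < 1:
            raise ValueError("age must be at least one year")
        total += citations / age
    return total


# Hypothetical researcher.
citation_counts = [25, 8, 5, 3, 3]
print(g_index(citation_counts))  # 5

# The same volume of citations counts for more when the papers are recent.
print(awcr([(10, 5), (8, 2), (3, 1)]))  # 9.0
```

Under these definitions the g index gives more credit to highly cited papers than the h index does, and the AWCR reduces the advantage of older publications.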
Research funding
Research funding is a term covering any financial support for scientific research. This indicator poses an analytical problem, because it is an example of circular cause and effect. Based on bibliometrics, it is difficult to differentiate whether more research funding improves a researcher's performance or if superior performing researchers receive more research funding. Regardless, most studies show significant correlation between research funding and research performance at an individual and institutional level. 17,19,27,30,32,35,56 Developed countries with higher research spending also have higher research productivity. 18,33
Degree of co-authorship
Co-authorship reflects the extent to which a researcher or research group collaborates with others to publish articles. Authors can collaborate at an international, institutional, departmental or individual level. In healthcare, several studies have demonstrated that research performance is improved with international collaboration. 42,49,56 The role of co-authorship at an organizational level has been shown to have a positive impact on performance and has been considered as a novel evaluation tool. 37,38 However, the role of co-authorship at an individual level is uncertain, but indicators such as the h index overcome this potential limitation. 20,31,57
GDP and population size
GDP is a measure of a country's overall economic output and population size is the number of individuals in a region. Adjusting research performance indicators for GDP and population size allows fairer comparison of global performance. 24,33,44 However, GDP and population size may also be markers of performance in their own right. 40,49,63
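The adjustment can be illustrated with a brief sketch that expresses publication output per million inhabitants and per billion units of GDP. The choice of these two denominators, the function names and the country figures are all assumptions made for illustration and are not taken from the included studies.

```python
def publications_per_million_population(publications, population):
    """Publication count adjusted for population size."""
    return publications / (population / 1_000_000)


def publications_per_billion_gdp(publications, gdp):
    """Publication count adjusted for economic output (GDP in one currency)."""
    return publications / (gdp / 1_000_000_000)


# Hypothetical countries: (publications, population, GDP).
countries = {
    "Country A": (50_000, 250_000_000, 10_000_000_000_000),
    "Country B": (5_000, 5_000_000, 250_000_000_000),
}

adjusted = {
    name: (
        publications_per_million_population(pubs, pop),
        publications_per_billion_gdp(pubs, gdp),
    )
    for name, (pubs, pop, gdp) in countries.items()
}

for name, (per_million, per_billion) in adjusted.items():
    print(name, per_million, per_billion)
```

In this invented example Country A publishes ten times as many articles as Country B, but Country B is more productive on both adjusted measures.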
Uncommon indicators
It is difficult to quantify the value of indicators such as peer review, number of conference presentations, number of patents, number of doctoral students, number of editorial responsibilities, and gender because of limited research in these areas. 17,28,32,34,35,43,50–52
Feasibility, validity, reliability and acceptability
Feasibility of using publications, grants, doctoral students and editorial responsibilities to measure research performance was assessed by a survey in one study. 17 The respondents generally agreed with the use of these four indicators. Seven studies measured convergent validity by correlating number of publications with number of citations. 16,21,23,26,30,37,52 One study demonstrated significant reliability of textbook citations to measure research performance (P < 0.001). 26 No other studies assessed research performance indicators in terms of feasibility, validity, reliability and acceptability.
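Convergent validity of this kind is assessed by correlating two indicators across the same set of researchers or institutions. A minimal sketch using the Pearson correlation coefficient is shown below. The included studies may have used other statistics, so the choice of Pearson's r is an assumption, and the data are invented.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length samples."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("samples must have equal length of at least 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return covariance / sqrt(var_x * var_y)


# Hypothetical publication and citation counts for five researchers.
publications = [5, 12, 20, 31, 44]
citations = [40, 95, 210, 260, 480]

r = pearson_r(publications, citations)
print(round(r, 3))
```

A coefficient close to 1 would suggest that the two indicators rank researchers similarly, although correlation alone does not establish that either indicator measures research quality.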
Utility
Twenty-one studies compared research performance between individuals 14,19–22,25,28,30–32,34,35,38,46,51–55,57,60 and 14 between specialties. 15,16,20,21,29,31,36,40–42,45,46,59,61 All individuals were researchers in a range of healthcare specialties, and the most common specialties were medicine in general (14 studies) 15,18–20,27,35,36,38–40,46,47,56,58 and psychology (11 studies). 16,21–23,25,26,37,41,43,54 Eleven studies compared research performance between institutions, which included universities, national academies and hospitals in the USA, UK, Canada, Australia, New Zealand, France, Germany, Italy, Switzerland, Finland, Serbia, Croatia, Romania, Brazil and Iran. 23,25,27,37,42,48,53,54,56,59,62 Thirteen studies compared research performance between countries, of which nine studies assessed performance globally and three studies assessed performance of the USA with the UK, Europe and Brazil. 15,18,24,33,39,40,43,44,47,48,50,59,63
Discussion
This is the first systematic review which identifies indicators for assessment of research performance in healthcare. The most widely used indicators include bibliometrics such as number of publications, number of citations and IF, h index, g index and AWCR. Less commonly used indicators include degree of co-authorship, number of conference presentations, number of patents, research funding, number of doctoral students, number of editorial responsibilities, peer review, gender, gross domestic product and population size. The utility of these indicators in assessing research performance in individuals, specialties, institutions and countries has been well described, but their feasibility, validity, reliability and acceptability have not been formally evaluated.
The number of publications and the number of citations they receive are simple ways to signify influence. Although they are the most commonly used indicators, they are hard to compare between specialties or career stages. However, this shortcoming can be overcome by normalizing these indicators to scientific disciplines and experience at both individual and institutional levels.
The h index considers both the research productivity and its impact, although its use is limited by variations in individuals' age and their discipline. Several other variants of the h index have been developed to address these drawbacks, for instance the g-index which provides higher scores for increased numbers of citations.
The journal IF should be used cautiously, preferably as an adjunct to other methods, because it only considers the impact of journals and does not assess the performance of individual researchers or the impact of their publications. 65
Research funding and degree of co-authorship can be used in addition to the above mentioned indicators to measure individual, specialty and institutional performance. When measuring performance at global level, GDP and population size should be added to the performance assessment metrics.
Bibliometric research outputs are readily accessible from databases such as ISI, Scopus and Google Scholar. The methods of extracting these outputs should be transparent in all databases so that researchers are able to make an informed decision on the sources of their performance statistics (Table 2). A universally accepted framework needs agreement by the decision-makers in academia to standardize research outputs, so that the economic and societal impact of research can be measured. A recent example includes the STAR METRICS working group in the United States (Science and Technology in America's Reinvestment – Measuring the EffecT of Research on Innovation, Competitiveness and Science) who are developing a common empirical infrastructure. 66
There are several limitations at a study and review level. Studies will be biased when authors evaluate their own performance or the performance of their affiliated specialties, institutions or countries. There is different coverage of peer-reviewed publications between bibliometric databases, so a source level bias will exist in studies which use a single data source. This systematic review was limited by the poor quality of the studies. In addition, meta-analysis could not be performed because of the diversity of the studies which did not have homogeneous methods or results.
This study has several implications: (1) further studies are needed to determine the feasibility, validity, reliability and acceptability of current and future research performance indicators; (2) specifically, it is important to assess the value of the h index because it measures the importance, broad impact, consistency and sustainability of a scientist's research; (3) co-authorship networks and changes in collaboration patterns over time should be analysed to establish whether they are important tools to assess and develop research performance; (4) the use of the IF to evaluate a researcher's performance needs to be investigated, since the IF has only been designed to measure journal performance; (5) researchers and policymakers can then debate what role the indicators should play, both in terms of the weighting and the level they should be incorporated into the decision-making process; (6) the balanced scorecard is a performance measurement framework that adds strategic non-financial performance measures to traditional financial metrics (Figure 3). 67 Although designed for business and industry, the balanced scorecard can be modified for non-profit and non-manufacturing research institutions. 68 This approach needs to be adapted by institutions to present a more unbiased view of research performance. This multifaceted method of research performance evaluation will require a multidimensional model of analysis utilizing a broad range of robust analytical techniques; 69 (7) enhanced healthcare research indices should be translated into improved healthcare outcomes because the principal aim of healthcare research is to improve patient wellbeing. It is now imperative to consider healthcare outcomes as opposed to research outputs. The use of healthcare outcomes can then determine important factors such as the societal and economic impact of healthcare research, in addition to awards of clinical prestige and quality. 70–73,79

Figure 3. Balanced scorecard showing performance areas of an organization 67 (in colour online)
Conclusions
Recently, there has been greater awareness of the importance of research performance indicators in healthcare. As a result the prevalence and usage of metrics such as number of publications, number of citations, IF and h index has increased. However, the assessment of feasibility, validity, reliability and acceptability of these indicators has been poorly investigated. Future studies are required to improve the current standards and accuracy of performance evaluation. It is imperative to have a balanced approach when measuring research performance in healthcare, which should consider quality and innovation. There is an increased need to consider the role of healthcare research outcomes in achieving societal and economic impact. The ultimate aim is to accurately quantify the research performance of healthcare individuals and institutions to cultivate an environment which can support translational medicine to improve the quality of patient care.
DECLARATIONS
Competing interests
None declared
Funding
None
Ethical approval
Not applicable
Guarantor
VMP
Contributorship
All authors contributed equally
Acknowledgements
None
