Abstract

As a clinician, I have been keen to apply the principles of evidence-based medicine, informed by evidence from systematic reviews and the specific needs of each of my patients. However, in my attempts to practise evidence-based medicine, I have often had to confront two frustrations. First, many clinical decisions involve more than two possible interventions, yet systematic reviews are usually based only on comparisons in pairs of interventions. Second, the comparisons researched do not cover all the relevant interventions.
Many others have experienced these frustrations. For example, after assessing 18 trials testing 24 interventions for children with acute pyelonephritis, John Ioannidis 1 asked: ‘How do we make sense of this complex network and guess the best choice(s)?’
It was in 2009 that I first became aware of an approach to addressing quandaries resulting from such complex networks. The solution involves combining direct and indirect treatment comparisons using evidence from randomised trials to produce a synthesis linking data from all the interventions in a ‘network’. In doing this, each trial shares at least one direct comparison with another trial. Analysis of the network – network meta-analysis (NMA) – could then produce a ranking of the likelihood of each of the interventions being the most effective.
NMA as part of the development of research synthesis
NMA is a method of research synthesis, so any history of this specific method must acknowledge that it is just one part of the overall history of research synthesis. That wider history has been described by others.2,3 This article focuses on the origins and development of this method of synthesis, which has also been described as a ‘mixed treatment comparison’ or ‘multiple treatments comparison’. 4
The concept of NMA
NMA is an extension of traditional, pairwise meta-analysis, but three main advantages are claimed for NMA. First, it allows more than two interventions to be compared simultaneously; second, interventions can be compared even if they have not been directly compared in trials; and third, it increases precision of the estimate of effect size by ‘borrowing strength’ 5 if the key assumptions described below are valid.
When seeking to compare the effects of interventions A, B and C, if there are trials of A versus B and B versus C, but not of A versus C, an estimate of the relative effects of A compared to C can be made by indirect comparison using the results of the trials of A versus B and B versus C. This indirect comparison uses the results of meta-analysis of all trials in each direct comparison, as explained by Bucher et al., 6 to combine, in a single synthesis, all the available direct comparisons and all the indirect comparisons that can be estimated between the selected interventions.
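The indirect comparison described above can be sketched in a few lines of Python. This is a minimal illustration rather than the procedure of any particular software package. It assumes that effects are expressed on an additive scale such as the log odds ratio, that each direct comparison is pooled with a fixed-effect inverse-variance meta-analysis, and that the two sets of trials are independent. The function names and all numbers are hypothetical.

```python
import math
from statistics import NormalDist


def pool_fixed_effect(trials):
    """Inverse-variance fixed-effect pooling of (estimate, standard_error) pairs."""
    weights = [1.0 / se ** 2 for _, se in trials]
    pooled = sum(w * est for w, (est, _) in zip(weights, trials)) / sum(weights)
    pooled_se = math.sqrt(1.0 / sum(weights))
    return pooled, pooled_se


def indirect_comparison(d_ab, se_ab, d_bc, se_bc, level=0.95):
    """Indirect estimate of A versus C through the common comparator B.

    d_ab is the effect of A relative to B and d_bc the effect of B relative
    to C, both on an additive scale (for example log odds ratios).
    """
    d_ac = d_ab + d_bc                            # point estimates add
    se_ac = math.sqrt(se_ab ** 2 + se_bc ** 2)    # variances add (independence assumed)
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    return d_ac, se_ac, (d_ac - z * se_ac, d_ac + z * se_ac)


# Hypothetical log odds ratios from two sets of trials (illustrative only).
ab_trials = [(-0.40, 0.20), (-0.60, 0.20)]   # A versus B
bc_trials = [(-0.10, 0.30), (-0.30, 0.30)]   # B versus C

d_ab, se_ab = pool_fixed_effect(ab_trials)
d_bc, se_bc = pool_fixed_effect(bc_trials)
d_ac, se_ac, ci_ac = indirect_comparison(d_ab, se_ab, d_bc, se_bc)
print(f"A vs C (indirect): {d_ac:.2f}, SE {se_ac:.2f}, "
      f"95% CI {ci_ac[0]:.2f} to {ci_ac[1]:.2f}")
```

Because the variances add, the standard error of the indirect estimate is larger than that of either direct estimate, which is why indirect evidence alone tends to be less precise than direct evidence of similar size.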
A network diagram can be drawn illustrating the direct comparisons between A, B and C. In more complex networks, multiple combinations of direct comparisons are used to estimate the same indirect comparison, which are then incorporated in the synthesis. These networks might have more than three interventions and have varying complexity of geometry. Among the simplest is a star, in which the intervention at the centre is the common comparator to all the other interventions. More complex forms of network geometry, when there are common comparators for only a few of a more diverse range of interventions, involve multiple conjoined loops and sidearms.
Figure 1 shows an example of a network diagram from an NMA comparing six interventions, A to F. The solid lines show where trial results of direct comparisons exist. In this example, there is no single common comparator for all the interventions, but each intervention shares at least one comparator with another intervention in the network. The thickness of each connecting line represents the number of trials between the pair of interventions at either end of the line, and the size of each node (letter within a circle) represents the number of participants receiving the intervention. Some network diagrams also include the actual numbers of trials, participants or both.

Figure 1. Network diagram example.
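The geometry of a network can also be represented as a simple data structure. The sketch below is illustrative only: the edge lists are hypothetical and are not the comparisons shown in Figure 1, and the function names are invented for this example. Each edge carries the number of trials, which is what the line thickness in a network diagram encodes.

```python
from collections import defaultdict


def build_adjacency(trial_counts):
    """Map each intervention to the set of interventions it was directly compared with."""
    adjacency = defaultdict(set)
    for a, b in trial_counts:
        adjacency[a].add(b)
        adjacency[b].add(a)
    return dict(adjacency)


def is_connected(adjacency):
    """True if every intervention can be reached from every other via direct comparisons."""
    nodes = list(adjacency)
    if not nodes:
        return True
    seen = {nodes[0]}
    stack = [nodes[0]]
    while stack:
        node = stack.pop()
        for neighbour in adjacency[node]:
            if neighbour not in seen:
                seen.add(neighbour)
                stack.append(neighbour)
    return len(seen) == len(adjacency)


def star_centre(adjacency):
    """Return the common comparator if the network is a star, otherwise None."""
    n = len(adjacency)
    if n < 3:
        return None
    centres = [node for node, nb in adjacency.items() if len(nb) == n - 1]
    leaves = [node for node, nb in adjacency.items() if len(nb) == 1]
    if len(centres) == 1 and len(leaves) == n - 1:
        return centres[0]
    return None


# Hypothetical networks: keys are pairs of interventions, values are numbers of trials.
star = {("Placebo", "A"): 3, ("Placebo", "B"): 2, ("Placebo", "C"): 5}
looped = {("A", "B"): 4, ("B", "C"): 2, ("A", "C"): 1,
          ("C", "D"): 3, ("D", "E"): 2, ("E", "F"): 1}

for name, network in (("star", star), ("looped", looped)):
    adjacency = build_adjacency(network)
    print(name, "connected:", is_connected(adjacency),
          "common comparator:", star_centre(adjacency))
```

A connectivity check of this kind is a useful first step, because interventions in separate, unconnected parts of a network cannot be compared with each other, directly or indirectly.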
Transitivity and consistency are two key assumptions for these analyses. Assessing the validity of these requires input of both clinical and methodological expertise. Transitivity is the assumption that effect modifiers (the clinical and methodological characteristics that can affect the outcome) are similar in each direct comparison involving the same intervention. 7 It cannot be assessed statistically but requires critical interpretation of the effect modifiers in the trials whose results are considered for synthesis. Before an NMA is conducted, those trials must be assessed for significant differences in their populations, interventions, outcomes, methodological features and reporting. 8 It is also important to note that even placebo response has been found to vary over time, which might affect the transitivity assumption when placebo is a common comparator. 9
Consistency is an extension of transitivity. It is the assumption of agreement between the results of direct and indirect comparisons for each pair of interventions. This can be assessed statistically but only when there are both direct and indirect comparisons of one or more pairs of interventions within a network, known as ‘closed loops’. The assumption of transitivity needs to be reconsidered if inconsistency is detected. If inconsistency is not detected statistically, however, that does not automatically validate the transitivity assumption. 7
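One simple way to check consistency statistically within a single closed loop is to compare the direct and indirect estimates of the same comparison. The sketch below assumes the two estimates are independent and approximately normally distributed, which is reasonable only when they come from different two-arm trials. The function name and numbers are hypothetical, and more elaborate models exist for whole networks.

```python
import math
from statistics import NormalDist


def inconsistency_test(direct, se_direct, indirect, se_indirect):
    """Compare direct and indirect estimates of the same comparison.

    Returns the difference (direct minus indirect), its standard error,
    the z statistic and a two-sided p value. Assumes the two estimates are
    independent and approximately normally distributed.
    """
    difference = direct - indirect
    se_difference = math.sqrt(se_direct ** 2 + se_indirect ** 2)
    z = difference / se_difference
    p_value = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    return difference, se_difference, z, p_value


# Hypothetical log odds ratios for A versus C in one closed loop.
diff, se_diff, z, p = inconsistency_test(
    direct=-0.90, se_direct=0.30,
    indirect=-0.70, se_indirect=0.25,
)
print(f"difference {diff:.2f}, SE {se_diff:.2f}, z {z:.2f}, p {p:.2f}")
```

A large p value here does not demonstrate consistency. Tests of this kind often have low power when few trials inform a loop, so the clinical assessment of transitivity remains necessary.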
Origins and evolution of methods for NMA
In 1989, Eddy described the ‘confidence profile method’ (CPM), 10 and the publication by Eddy et al. 11 in the following year, ‘A Bayesian method for synthesizing evidence’, described: ‘a collection of meta-analysis techniques based on Bayesian methods for interpreting, adjusting, and combining evidence to estimate parameters and outcomes important to the assessment of health technologies.’
These techniques were collectively called the ‘confidence profile method’ and the article explained indirect comparison with the following example: The approach is to use the available evidence to derive probability distributions for the various pairs that have been directly compared. A distribution for the relative effects of other pairs can then be calculated by a series of convolutions. The concept is illustrated by calculating the difference between the test scores of Tom and Bill from knowledge of the differences in scores between Tom and George, and George and Bill.
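The convolution step can be written out for the test-score example. The notation below is introduced here for illustration and is not taken from Eddy et al.; the normal-distribution case is an added assumption, chosen because it gives a closed form.

```latex
% D_{TG}: Tom's score minus George's; D_{GB}: George's score minus Bill's.
% The two differences are assumed independent, with densities f_{TG} and f_{GB}.
\begin{align}
D_{TB} &= D_{TG} + D_{GB}, \\
f_{TB}(d) &= \int_{-\infty}^{\infty} f_{TG}(x)\, f_{GB}(d - x)\, \mathrm{d}x .
\end{align}
% Special case (assumption): both differences are normally distributed.
\begin{equation}
D_{TG} \sim \mathcal{N}(\mu_{TG}, \sigma_{TG}^{2}),\quad
D_{GB} \sim \mathcal{N}(\mu_{GB}, \sigma_{GB}^{2})
\;\Longrightarrow\;
D_{TB} \sim \mathcal{N}\!\left(\mu_{TG} + \mu_{GB},\; \sigma_{TG}^{2} + \sigma_{GB}^{2}\right).
\end{equation}
```

In the normal case the means add and the variances add, so the derived distribution for Tom versus Bill is always more dispersed than either of the distributions it was built from.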
A series of methodology publications through the 1990s built on the CPM approach, including notably those by Smith et al. 15 and by Higgins and Whitehead. 5
Higgins and Whitehead wrote about borrowing strength from external trials in a meta-analysis. They argued that: ‘Many meta-analysis papers include data from three or more treatments, but only consider pairwise comparisons of, say treatment A with control and treatment B with control. There would seem to be little reason not to combine all treatments into one analysis.’ the first to articulate that relative effects of different treatments can be jointly estimated in a single meta-analysis model to improve power. This landmark paper introduced the basis for the methodology which, now extended and refined, is increasingly known as network meta-analysis.
Lumley acknowledged the limitation of his methods, which were restricted to trials with only two intervention groups: ‘Meta-analyses with large numbers of multi-armed trials present difficulties for network meta-analysis, and extensions to handle multi-armed trials correctly should be investigated.’
Ades subsequently described methods to encompass multi-arm trials and multiple outcomes. In his 2003 article, 18 he stated: ‘The aim of nearly all meta-analysis has been to summarize evidence comparing one or sometimes more treatments. Usually only a single outcome is examined, and if there is more than one outcome these are explored in separate meta-analyses, rather than simultaneously. This paper concerns the possibility of combining information from different studies on different, but structurally related, outcomes, and using the data to construct a single model which expresses the relationships between the different kinds of data.’
A review of the methods for NMA, with particular emphasis on the issue of inconsistency between direct and indirect evidence, was published in 2008 by Salanti et al. 20 They explained that inconsistency in estimates of intervention effects obtained from direct and indirect comparisons may indicate diversity, bias or a combination of both, and they described modelling to test for consistency. Their review considered potential sources of inconsistency, including genuine diversity in the characteristics of included trials, selection bias, study quality and sponsorship bias. It stressed the importance of planning in advance for investigation of inconsistency, because clinical and epidemiological assessment of inconsistency may be hampered by factors such as reporting deficiencies or a lack of sufficient studies for some comparisons. Salanti et al. 20 also highlighted that attention to the geometry (the overall pattern of comparisons among interventions) and the asymmetry of networks (the extent to which specific comparisons of interventions are represented more heavily than others in the number of included trials or participants) can be used to inform the design of the new trials that would most usefully add to the overall network.
A review of NMA methods, published in 2016 by Efthimiou et al., 8 summarised newer publications on the use of NMA methods. This included various models for performing NMA, statistical methods for assessing inconsistency, software options, investigating sources of potential bias and reporting results.
The use of individual participant data (IPD) in meta-analyses has many advantages over the use of aggregate data, including improving the quantity and quality of data, which has resulted in it being considered ‘the gold standard in evidence synthesis’. 21 The use of this approach, initially using traditional, pairwise meta-analysis methods, increased between the early 1990s and 2008 to around 50 publications per year. 22 The number of systematic reviews using IPD was found to be 10 to 22 per year with no discernible growth trend in the years leading up to 2015. 23 Gao et al. found that the first IPD using NMA methods was published in 2007 and that 21 IPDs using NMA methods had been published by June 2019. 24 There are limitations as well as advantages to use of IPD. Guidance has been published on the best use of IPD meta-analysis generally 25 and specifically on the use of NMA methods with IPD. 21
Multiple outcomes multivariate meta-analysis (MOMA) is another approach to meta-analysis that has been increasing in recent years. 26 Relevant studies that might be considered for synthesis may not report the same outcomes, which could result in their exclusion from traditional meta-analyses, but MOMA allows for inclusion where outcomes can be regarded as highly correlated. Guidance on conducting this type of synthesis using NMA methods has been published in recent years,26–28 including the use of IPD. 29
Interest has developed recently in creating and maintaining continuously updated meta-analyses using NMA methods 30 and a major project of this kind for COVID-19-related interventions began in 2020. 31
Approaches to conducting NMA
A simple meta-regression approach can be used for NMA if there is no multi-arm trial in the network. 32 However, if the network includes multi-arm trials, other methods are more appropriate. Bayesian methods have been used most frequently, 33 partly because this approach can most naturally produce estimates of ranking probabilities for the interventions being compared (to give the probability that each intervention is most effective through to least effective) 34 ; but frequentist methods to approximate ranking have also been described. 35 The hierarchical model approach is detailed by Lu and Ades 19 and by Salanti et al. 20
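The idea of ranking probabilities can be illustrated with a small simulation. Bayesian software derives these probabilities from draws of the joint posterior distribution of all relative effects. The sketch below substitutes independent normal draws for each intervention, which is a simplification (real posterior estimates are correlated), and uses hypothetical effects relative to a reference intervention, with lower values taken to be better.

```python
import random


def rank_probabilities(effects, n_draws=10000, lower_is_better=True, seed=0):
    """Estimate the probability that each intervention takes each rank.

    effects maps intervention name -> (mean, sd) of its effect relative to a
    common reference. Draws are independent normals (a simplifying assumption).
    Returns name -> list of probabilities, index 0 being the best rank.
    """
    rng = random.Random(seed)
    names = list(effects)
    counts = {name: [0] * len(names) for name in names}
    for _ in range(n_draws):
        draw = {name: rng.gauss(mean, sd) for name, (mean, sd) in effects.items()}
        ordered = sorted(names, key=lambda name: draw[name],
                         reverse=not lower_is_better)
        for position, name in enumerate(ordered):
            counts[name][position] += 1
    return {name: [c / n_draws for c in row] for name, row in counts.items()}


# Hypothetical log odds ratios versus the reference (illustrative only).
effects = {
    "Reference": (0.0, 0.0),   # fixed at zero by definition
    "A": (-0.30, 0.15),
    "B": (-0.45, 0.25),
    "C": (-0.10, 0.10),
}
for name, probabilities in rank_probabilities(effects).items():
    print(name, [round(p, 2) for p in probabilities])
```

An intervention with a favourable but imprecise estimate can have a high probability of being ranked first and also a non-trivial probability of being ranked poorly, which is one reason rankings should be read alongside the effect estimates and their uncertainty.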
An alternative approach is multivariate meta-analysis, which can be conducted using Bayesian 36 or frequentist methods. 37 A further approach, based on graph-theoretical methods, has been described by Rucker. 38
The frequentist approach treats the intervention effect as a fixed but unknown value and reports a confidence interval: a range constructed so that, over repeated sampling, it would contain the true value with a stated frequency, usually 95%. The Bayesian approach instead describes uncertainty about the intervention effect with a probability distribution. It starts from a ‘prior’, which might be based on existing evidence or might be a ‘best guess’, and updates it with the data to give a ‘posterior’ distribution. The credible interval from a Bayesian meta-analysis is the range that contains 95% of the posterior probability for the effect, given the data and the prior.
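The two kinds of interval can be contrasted numerically for a single effect estimate. The sketch below assumes a normally distributed estimate and a normal prior, for which the posterior has a closed form. The function names, the estimate and both priors are hypothetical.

```python
import math
from statistics import NormalDist

Z95 = NormalDist().inv_cdf(0.975)  # about 1.96


def confidence_interval(estimate, se):
    """Frequentist 95% confidence interval for a normally distributed estimate."""
    return estimate - Z95 * se, estimate + Z95 * se


def credible_interval(estimate, se, prior_mean, prior_sd):
    """Posterior mean and 95% credible interval for a normal prior and normal likelihood."""
    prior_precision = 1.0 / prior_sd ** 2
    data_precision = 1.0 / se ** 2
    post_var = 1.0 / (prior_precision + data_precision)
    post_mean = post_var * (prior_precision * prior_mean + data_precision * estimate)
    post_sd = math.sqrt(post_var)
    return post_mean, (post_mean - Z95 * post_sd, post_mean + Z95 * post_sd)


# Hypothetical log odds ratio and standard error.
estimate, se = -0.50, 0.20

print("confidence interval:", confidence_interval(estimate, se))
print("vague prior:", credible_interval(estimate, se, prior_mean=0.0, prior_sd=100.0))
print("sceptical prior:", credible_interval(estimate, se, prior_mean=0.0, prior_sd=0.20))
```

With a very vague prior the credible interval is numerically almost identical to the confidence interval, although its interpretation differs. With a sceptical prior centred on no effect, the posterior is pulled towards zero.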
Use of network meta-analyses in published systematic reviews
In 1999, Dominici et al. 39 used Bayesian methods and data from 46 trials of treatments to prevent migraine headache to produce a ranking of treatments. They stated their aims as follows: ‘In this article we present a meta-analysis of these 46 trials with the goal of synthesizing existing evidence about which treatments are most effective and of quantifying the remaining uncertainty about treatment effectiveness. We hope that the results and methods will be useful in supporting clinical treatment decisions and will help guide the planning of new trials. The critical statistical aspects of this goal are the estimation of treatment effects on a common scale and the relative ranking of treatments, both within classes and overall. This requires indirect comparisons among treatments that may never have been tested together in the same trial.’
‘The clinical trials in hypertension have provided a patchwork of evidence about the health benefits of antihypertensive agents. Some trials used placebo or untreated controls, and others used active-treatment comparison groups. Among the latter, the choice of treatment and comparison therapies has varied from one trial to the next. Several approaches to the synthesis of these complex data are possible. The Blood Pressure Trialists, for instance, conducted a prospective series of mini-meta-analyses, but this method left many ‘unresolved issues’ due to multiple comparisons and low power. In this study, we used a new technique, called network meta-analysis, to synthesize the available evidence from placebo-controlled and comparative trials in a single meta-analysis.’

Estimate for the number of published network meta-analyses 2004 to 2020.
The most significant impact of evidence synthesis on healthcare is likely to come from the use of the evidence generated by these research projects in national clinical guidelines. A review of NICE clinical guidelines published or updated in 2015 and 2016 found that they made extensive use of meta-analysis to identify evidence to support their recommendations. 42 NMA methods were used far less often than traditional, pairwise meta-analysis but were used or considered for nearly one-quarter of the guidelines reviewed, showing that evidence produced using network meta-analysis methods is influencing recommendations for the UK National Health Service.
Guidance relating to conduct and reporting of NMA
Efforts to improve the quality of reporting of systematic reviews have been based on defining and promoting internationally recognised standards. There is evidence that publication of such reporting standards results in improved quality of reporting, based on comparisons before and after the publication of both the QUOROM statement 44 and the PRISMA Statement, 45 and that reporting to PRISMA statement standards is strongly associated with higher study quality, as assessed by a widely used critical appraisal tool (CAT). 46 Panic et al. 47 found that endorsement of the PRISMA statement by journals in their instructions for authors was associated with improved quality of reporting, regardless of whether the authors declared that they had followed the statement, and with higher study quality.
When the first standards for reporting systematic reviews and meta-analyses were published in 1999 in the QUOROM Statement, the concept of NMA was still in its infancy. By the time of the PRISMA Statement in 2009, NMA warranted a mention as a form of meta-analysis that combined direct and indirect comparisons. However, the PRISMA 2009 statement did not make recommendations relating to reporting that addressed the specifics of NMA methodology. For standards for conducting systematic reviews, it directed readers to the guidance published by the Cochrane Collaboration 48 and the Centre for Reviews and Dissemination, 49 both of which contained very limited guidance on the use of NMA methodology in systematic reviews. Reporting standards for use of NMA methods were subsequently published as a PRISMA extension statement in 2015. 50
The Methods Guide 2008 Update of the National Institute for Health and Care Excellence (NICE) 51 included a section on NMA for the first time and the NICE Decision Support Unit’s Evidence Synthesis TSD series, 52 published initially in 2011, expanded on that guidance. Also in 2011, the International Society for Pharmacoeconomics and Outcomes Research (ISPOR) published reports on the interpretation and conduct of NMA.13,14 Ades 53 observed that this ‘seems to represent the first position statement from an academic body on these methods’.
The Cochrane Comparing Multiple Interventions Methods Group was established in 2010 and has since produced guidance on the use of NMA methods in Cochrane Reviews and promoted training in using these methods. Version 6 of the Cochrane Handbook for Systematic Reviews of Interventions, published in 2019, contained ‘a major new core chapter’ addressing for the first time NMA within the handbook. This guidance emphasises that NMA is ‘more statistically complex than a standard meta-analysis’, consequently, ‘close collaboration’ between a statistician with expertise in NMA methods and those with expertise in the clinical content area is essential in the design and conduct of a review to ensure that studies selected for inclusion in NMA fulfil the assumptions of transitivity and consistency. 54
Critical appraisal of NMA
Critical appraisal involves assessing the report of a study for methodological quality and any likelihood of bias and considering whether these might affect the validity of the reported results. A 2018 review of the published CATs for systematic reviews 42 found that none of the most widely used CATs for systematic reviews contained content specifically relevant to appraising research synthesis using the NMA methodology. Three tools had been published, which include content that is relevant to appraising the use of NMA methods.13,55,56 These three tools, however, might not be suitable for end-users without specialist statistical knowledge of NMA methodology, so there is still potentially a role for a new CAT to support generalist end-users. A tool has been constructed using the CASP format, 57 which might form the basis for further development. 42
Other approaches have been developed to assess confidence in the evidence produced using NMA methods. For example, in 2014, the GRADE Working Group reported guidance for establishing the quality of treatment effect estimates obtained from NMA. 58 Their approach involved rating the quality of each direct and indirect effect estimate for each pairwise comparison within the NMA and then rating the NMA effect estimate for each pairwise comparison. In 2018, the GRADE Working Group recommended modifications to their 2014 guidance with a view to making the process more efficient, acknowledging that the original approach, ‘may appear onerous in networks with many interventions’. 59
In 2014, Salanti et al. published a modification of the GRADE guidance, in particular, drawing a distinction between rating the effect estimates for each pairwise comparison and rating the ranking of all the interventions within a network. 60 More recently, in 2020, Salanti and others published a new approach: Confidence in Network Meta-Analysis (CINeMA). 61 They stated that uptake of the earlier 2014 GRADE system and that reported by Salanti and others in the same year had been limited by ‘the complexity of the methods and the lack of suitable software’.
CINeMA is also based on the GRADE framework but instead of considering direct and indirect evidence for each pairwise comparison separately, it considers the impact of every study in the network. A web-based application makes it easy to apply to even large networks. 62
A further approach, ‘threshold analysis’, has been developed specifically to assess confidence in NMA results used in guideline development. 63 The authors argue that their approach is needed because GRADE approaches do not assess the influence of the NMA evidence on a resulting recommendation: ‘Threshold analysis quantifies precisely how much the evidence could change (for any reason, such as potential biases, or simply sampling variation) before the recommendation changes, and what the revised recommendation would be. If it is judged that the evidence could not plausibly change by more than this amount, then the recommendation is considered robust; otherwise, it is sensitive to plausible changes in the evidence.’
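The underlying idea can be shown with a deliberately simplified toy example. The published method works with the individual data points feeding the NMA and with the decision rule used by the guideline. The sketch below looks only at final point estimates, assumes that lower values are better and that the intervention with the lowest estimate is recommended, and uses hypothetical names and numbers.

```python
def decision_thresholds(estimates):
    """Smallest change to each estimate that would alter the recommendation.

    Assumes lower values are better and that the treatment with the lowest
    point estimate is recommended. Returns {treatment: (shift, new_choice)},
    where shift is the change in that treatment's estimate at which the
    recommendation switches to new_choice.
    """
    ranked = sorted(estimates, key=estimates.get)
    best, runner_up = ranked[0], ranked[1]
    result = {}
    for name, value in estimates.items():
        if name == best:
            # The best estimate must worsen (increase) past the runner-up.
            result[name] = (estimates[runner_up] - value, runner_up)
        else:
            # Any other estimate must improve (decrease) past the current best.
            result[name] = (estimates[best] - value, name)
    return result


# Hypothetical log odds ratios versus a reference (illustrative only).
estimates = {"A": -0.70, "B": -0.50, "C": -0.10}
for name, (shift, new_choice) in decision_thresholds(estimates).items():
    print(f"{name}: a shift of {shift:+.2f} would make {new_choice} the recommendation")
```

If a shift of the size reported for a treatment is judged implausible, given what is known about bias and sampling variation, the recommendation is robust to that part of the evidence.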
Conclusions
NMA is a relatively new form of evidence synthesis and evolution of the methodology is continuing. As outlined by Salanti, 64 the introduction of NMA faced similar scepticism to that raised originally about traditional, pairwise meta-analysis. However, there has been gradually wider uptake of the method by researchers and use of the resulting evidence by decision makers in health and social care. The Cochrane review with the largest number of included studies is an NMA of 585 randomised trials of drugs to prevent postoperative nausea and vomiting. 65 Consensus guidance relating to the conduct and reporting of NMA is now widely available.
The use of NMA methods can overcome some limitations of traditional, pairwise meta-analysis. The design and conduct of NMAs require multidisciplinary input from expert methodologists and clinical topic experts. Further research is needed to clarify whether end-users who do not have specialist statistical knowledge can assess the quality and validity of evidence produced in systematic reviews using NMA methods, even with a critical appraisal tool optimised for such studies.
