Abstract
Nearly 60% of new HIV infections in the United Kingdom are estimated to occur in men who have sex with men (MSM). Age-disassortative partnerships in MSM have been suggested to spread the HIV epidemics in many Western developed countries and to contribute to ethnic disparities in infection rates. Understanding these mixing patterns in transmission can help to determine which groups are at a greater risk and guide public health interventions. We analyzed combined epidemiological data and viral sequences from MSM diagnosed with HIV at the national level. We applied a phylodynamic source attribution model to infer patterns of transmission between groups of patients. From pair probabilities of transmission between 14,603 MSM patients, we found that potential transmitters of HIV subtype B were on average 8 months older than recipients. We also found a moderate overall assortativity of transmission by ethnic group and a stronger assortativity by region. Our findings suggest that there is only a modest net flow of transmissions from older to young MSM in subtype B epidemics and that young MSM, in both Black and White groups, are more likely to be infected by one another than expected in a sexual network with random mixing.
Introduction
Men who have sex with men (MSM) account for 40% of new HIV diagnoses in Europe. 1 In the United Kingdom (UK), nearly 60% of new infections are estimated to occur in MSM, although diagnoses have recently shown signs of decline, particularly in London. 2 It has been estimated that the largest contribution to transmission in the UK is attributable to young HIV-positive MSM. 3 More generally, since the early work from Morris et al., 4 sex between young MSM and older partners has been suggested to increase the risk of infection 5,6 and to represent a significant driver of the epidemic in North America. 7 This disassortative age mixing pattern is also considered in interaction with mixing by ethnicity. 8,9 Among MSM, black men appear to be more affected by HIV in both the UK and US contexts, and age mixing patterns have been evaluated to illuminate this ethnic disparity in prevalence. 10–12 In addition to the question of transmission patterns by age and ethnicity, it is unclear whether the geographic variation in diagnosis rate for MSM solely reflects the demographic distribution of groups at greater risk in the country, or can also be explained by a varying extent of transmission between persons of different regions. 13 Assessing the primary sources of infection in these different demographic groups could help to design more effective intervention strategies.
Several studies have used phylogenetics to infer transmission patterns based on coclustering of persons from different demographic or risk groups. For instance, occurrences of clustering observed between older and younger MSM are suggestive of a flow of transmission from old to young, as prevalence tends to increase with age. 14,15
However, there are several limitations to the interpretation of genetic clustering in terms of transmission. Clustering of genetically similar viruses is influenced by time since infection when patients are sampled, which is confounded by patients' age as well as CD4 and clinical stage of infection. Also, the extent of clustering depends on the fraction of infected persons sampled, which makes direct inference of transmission patterns difficult using genetic clustering. 16–18 In particular, the direction of putative transmission events cannot be resolved by pairwise genetic distance alone, and it is not possible to estimate flows of transmission between age groups based on clustering observations.
In this study, we applied a phylogenetic source attribution (SA) method that infers the probability of potential transmission (infector probability) between pairs of patients among ∼15,000 MSM diagnosed in the UK with available genetic sequences. 19 SA methods based on consensus pol-sequence data cannot be used to infer transmission pairs with high confidence, but can provide useful insights when studied in aggregate over thousands of putative transmission pairs. In general, direction of transmission cannot be inferred from consensus HIV sequence data, but in combination with clinical stage of infection at the time of sequencing, directionality can be inferred probabilistically in some cases, as when for example a patient with chronic infection is linked to a patient with early infection.
By combining phylogenetic analysis with stage of infection data and independent estimates of incidence and prevalence in the population, we are able to quantify potentially imbalanced transmission patterns between different risk groups. To this end, we used sequencing data routinely collected for drug resistance testing, patient-level data informative of the time since infection to account for biased sampling, and population estimates of background prevalence and incidence to account for potentially unsampled individuals that could be the sources of infection. In estimating transmission pair probabilities, our objective was to reveal patterns of transmission in MSM according to age, ethnicity, and geography. In particular, we searched for evidence of source-sink relationships in transmission patterns between age groups and examined the hypothesis that there is a net flow of transmissions from old to young MSM overall or by ethnicity.
Materials and Methods
Data
We used partial HIV-1 pol sequences collected in the UK HIV Drug Resistance Database 20 linked with characteristics of patients newly diagnosed with HIV from the UK Collaborative HIV Cohort study database and the national HIV/AIDS Reporting System database, 21 as of end of August 2016. Among MSM diagnosed with HIV after 1997 in the UK, 58% had at least one sequence. The data were fully anonymized.
We analyzed adult patients reported as MSM; infected by HIV-1 subtype A1, B, C, or CRF-02AG (the four most represented subtypes); and having a nucleotide sequence while treatment naive. The first sequence per patient with length >950 nucleotides was included. CD4 count values closest to and within a maximum of 1 year of the date of sequence sampling were used to define five stages of infection, comprising early HIV infection (stage 1) and four stages of declining CD4 with thresholds at 500, 350, and 200 cells/mm³. 22 In our sample, 81% of patients had a CD4 count. A positive result from the avidity-based recent infection testing algorithm (RITA) led to classifying a patient as at stage 1. RITA results at diagnosis were available from 2009 onward and, from that year, were recorded for 46% of patients.
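The staging rule described above can be sketched as a small function. This is only an illustration under stated assumptions: the function name is hypothetical, the thresholds are treated as inclusive lower bounds (the text does not say which side of each threshold is inclusive), and stage 1 is assigned here only through a positive RITA result.

```python
from typing import Optional


def infection_stage(cd4: Optional[float], rita_recent: Optional[bool]) -> Optional[int]:
    """Return a stage from 1 to 5, or None when no marker is available.

    Assumption: thresholds of 500, 350, and 200 cells/mm3 are inclusive
    lower bounds of stages 2, 3, and 4 respectively.
    """
    if rita_recent:          # a positive RITA result classifies the patient as stage 1
        return 1
    if cd4 is None:          # no CD4 count within 1 year of sampling
        return None
    if cd4 >= 500:
        return 2
    if cd4 >= 350:
        return 3
    if cd4 >= 200:
        return 4
    return 5
```

Patients for whom the function returns `None` correspond to those with missing CD4 count and missing RITA result, who are handled probabilistically in the source attribution step.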
Age of patients was categorized in quartiles of age at the date of resistance testing. Difference in age between patients was calculated relative to year of birth. Ethnicity categories were grouped in seven classes: White; Black Caribbean; Black African; Other or unspecified black; Indian, Pakistani, or Bangladeshi (South Asian); Other Asian or Oriental; and Other and mixed. Regions of diagnosis were categorized in five classes: London; South of England; Midlands and East of England; North of England; Northern Ireland, Scotland, and Wales. In analyses of assortativity, the unknown category was treated as missing data.
Sequence processing
Partial HIV-1 pol sequences from the UK were sampled from 1997 to July 2015 with a majority obtained after 2009. Subtypes were determined with REGA version 3. 23 To infer importation of viral lineages, a BLAST search 24 was performed for each UK sequence to identify the global sequence from the Los Alamos HIV sequence database (LANL) 25 with highest similarity. We retained 1,780 unique matching global sequences, as more than one UK sequence may have the same BLAST match. Four reference alignments 26 per each subtype were also added to UK sequences to serve as outgroup for rooting the phylogenetic trees. All alignments were obtained with MAFFT version 7. 27 Drug resistance mutation sites were stripped from the alignments. 28
Phylogenetic analysis
Phylogenetic trees were constructed with ExaML by maximum likelihood-based inference with a gamma distribution model for rate heterogeneity among sites. 29 One hundred bootstrap replicates of each tree were computed to account for phylogenetic uncertainty.
We calculated root-to-tip distance and regressed distance on time from the MRCA to sampling. By iterating Grubbs' test, 30 we identified 0.3% of sequences overall as outliers in terms of divergence time and evolutionary rate. We applied a least-squares dating algorithm 31 to rooted trees and sampling times to estimate the substitution rate and dates of ancestral nodes.
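As a rough illustration of the root-to-tip step, the sketch below fits an ordinary least-squares line of genetic distance on sampling date and computes the Grubbs statistic for the most extreme residual. It is not the pipeline used in the study: the function names are hypothetical, the choice to screen regression residuals is an assumption, and the comparison of the statistic with a critical value from the t distribution, which a complete Grubbs' test requires, is omitted.

```python
import statistics


def root_to_tip_regression(dates, distances):
    """Least-squares fit of distance = intercept + rate * date.

    Returns the slope (substitutions/site/year), the date at which the
    fitted distance is zero (an estimate of the root date), and residuals.
    """
    mean_t = sum(dates) / len(dates)
    mean_d = sum(distances) / len(distances)
    sxx = sum((t - mean_t) ** 2 for t in dates)
    sxy = sum((t - mean_t) * (d - mean_d) for t, d in zip(dates, distances))
    rate = sxy / sxx
    intercept = mean_d - rate * mean_t
    root_date = -intercept / rate
    residuals = [d - (intercept + rate * t) for t, d in zip(dates, distances)]
    return rate, root_date, residuals


def grubbs_statistic(values):
    """Return (index, G) for the value farthest from the mean, where
    G = |value - mean| / sample standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    index = max(range(len(values)), key=lambda i: abs(values[i] - mean))
    return index, abs(values[index] - mean) / sd
```

In an iterated procedure, the flagged sequence would be removed and the regression refitted until no remaining value exceeds the critical threshold.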
We analyzed the four main subtypes separately to account for different evolutionary rates. The Fitch algorithm was used to reconstruct ancestral host status (UK vs. global) and determine distinct clades of virus transmitted in the UK. 32 The dated subtype B phylogeny comprised 18,484 taxa and for computational reasons was split into subtrees (clades) for further analyses. The tree splitting step consisted of iteratively testing thresholds of forward times (above the root) to slice 33 the large tree into clades with a maximum size of 1,000 taxa (viruses from UK patients). Thus, for each of the 100 bootstrap trees for subtype B, the resulting clades were different.
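The ancestral host status reconstruction can be illustrated with a compact version of Fitch parsimony on a binary tree. This is a sketch, not the implementation used in the study: the nested-tuple tree encoding, the function names, and the tie-breaking rule for ambiguous nodes are all assumptions made for the example.

```python
def fitch(tree, tip_state):
    """Fitch parsimony on a binary tree given as nested 2-tuples of tip names.

    tip_state maps each tip name to "UK" or "global". Returns the annotated
    root node and the minimum number of state changes.
    """
    changes = 0

    def bottom_up(node):
        nonlocal changes
        if isinstance(node, str):  # tip: its state set is the observed state
            return {"name": node, "set": {tip_state[node]}, "children": []}
        left, right = (bottom_up(child) for child in node)
        common = left["set"] & right["set"]
        if not common:             # empty intersection implies one change
            changes += 1
        return {"name": None, "set": common or (left["set"] | right["set"]),
                "children": [left, right]}

    def top_down(node, parent_state):
        options = node["set"]
        # Keep the parent state when allowed; otherwise break ties
        # arbitrarily (alphabetical order, an assumption of this sketch).
        node["state"] = parent_state if parent_state in options else sorted(options)[0]
        for child in node["children"]:
            top_down(child, node["state"])

    root = bottom_up(tree)
    top_down(root, None)
    return root, changes


def uk_tips(node):
    """Names of UK tips below a node."""
    if not node["children"]:
        return [node["name"]] if node["state"] == "UK" else []
    return [name for child in node["children"] for name in uk_tips(child)]


def uk_clades(node, parent_state=None):
    """One list of UK tips per inferred entry into the UK."""
    if node["state"] == "UK" and parent_state != "UK":
        return [uk_tips(node)]
    return [clade for child in node["children"]
            for clade in uk_clades(child, node["state"])]
```

For example, with the tree `((("uk1", "uk2"), "g1"), "g2")` the root is reconstructed as global and the two UK tips form a single UK clade.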
Probabilistic source attribution
We applied a phylogenetic SA method that uses a population genetic model to derive probabilities that a given individual (donor) is the source of infection for another individual (recipient) in the sample. These probabilities, termed infector probabilities, account for the epidemiological and sampling processes by incorporating into their calculation the time-scaled phylogeny, patient data on stage of infection, and population-level data on occurrence of infection. 19 The method was evaluated in a previous simulation study. 18
For population-level epidemic statistics, we used updated incidence estimates from the CD4-based back-calculation method for the MSM population and prevalence estimates from the Bayesian multiparameter synthesis of surveillance data, as reported by Public Health England in 2017. 13 To account for uncertainty in those input parameters, we randomly drew five pairs of incidence and prevalence values per bootstrap replicate (2,000 in total) from normal distributions inferred from the credible intervals of those estimates. Incidence and prevalence were assumed to be proportional across subtypes.
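One way to turn a credible interval into a normal distribution and draw input pairs is sketched below. The assumptions are loud ones: the interval is taken to be a symmetric 95% interval, the mean is placed at its midpoint, incidence and prevalence are drawn independently, and the function names are hypothetical. The text does not specify these details.

```python
import random

Z95 = 1.959964  # standard normal quantile for a two-sided 95% interval


def normal_from_interval(lower, upper):
    """Mean and standard deviation of a normal distribution whose central
    95% mass spans (lower, upper). Assumes a symmetric interval."""
    mean = (lower + upper) / 2
    sd = (upper - lower) / (2 * Z95)
    return mean, sd


def draw_pairs(incidence_ci, prevalence_ci, n_draws=5, seed=None):
    """Draw n_draws (incidence, prevalence) pairs, independently."""
    rng = random.Random(seed)
    inc_mean, inc_sd = normal_from_interval(*incidence_ci)
    prev_mean, prev_sd = normal_from_interval(*prevalence_ci)
    return [(rng.gauss(inc_mean, inc_sd), rng.gauss(prev_mean, prev_sd))
            for _ in range(n_draws)]
```

Calling `draw_pairs` once per bootstrap tree, with five draws each, gives 5 × 100 × 4 subtypes = 2,000 parameter pairs, which matches the total stated above.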
The SA method uses a continuous-time Markov chain model to reconstruct the likely state of a lineage at the time of transmission given the CD4 stage of infection at time of sampling. The definition of stages of infection and progression rates were based on Cori et al., 22 as described in our previous analysis. 18 In case of missing CD4 count and missing RITA results at sampling, individuals were assigned a stage with probability relative to the average duration of respective stages. The method assumes that each infected patient corresponds to a single lineage of virus, ignoring multiple infections, and that internal nodes in the phylogeny correspond to a transmission event between hosts. To limit calculations to non-negligible pairing, only coalescent events within a limit of 20 years before sequence sampling were incorporated to compute infector probabilities.
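The assignment of a stage to individuals with no CD4 count and no RITA result can be sketched as a weighted random draw. The durations below are placeholders chosen only for illustration and are not the values from Cori et al.; the function names are likewise hypothetical.

```python
import random

# Placeholder mean durations (years) of stages 1 to 5.
# These numbers are NOT taken from the source; substitute the published values.
MEAN_DURATION_YEARS = {1: 0.5, 2: 3.0, 3: 2.5, 4: 3.5, 5: 2.0}


def stage_probabilities(durations=MEAN_DURATION_YEARS):
    """Probability of each stage, proportional to its mean duration."""
    total = sum(durations.values())
    return {stage: d / total for stage, d in durations.items()}


def sample_stage(rng, durations=MEAN_DURATION_YEARS):
    """Draw one stage for an individual with missing CD4 and RITA data."""
    stages = list(durations)
    weights = [durations[s] for s in stages]
    return rng.choices(stages, weights=weights, k=1)[0]
```

Under this scheme a person sampled at a random time is more likely to be in a long-lasting stage than in the short early stage, which is the rationale for weighting by duration.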
Statistical procedures
Infector probabilities
To characterize transmission patterns by patients' covariates, we first computed a symmetric mixing matrix M as the normalized sum of infector probabilities representing aggregated number of transmissions between category k
Code availability
The code used in this article is available as an R package:
Results
Characteristics of the study population
The demographic and geographic composition of the 19,847 HIV-1 partial pol sequences from treatment-naive patients diagnosed in the UK is described in Table 1. Most gay and bisexual men diagnosed in the UK were infected with subtype B (93%). Therefore, the patterns of transmission inferred from the reconstructed phylogeny of subtype B sequences largely dominate those of all MSM patients. Patients infected with a non-B subtype were on average sampled later (median year of 2008 for subtype B, 2009 for subtypes A1 and C, and 2011 for CRF02AG) and were on average younger (median age of 35 for subtype B, 34 for subtypes A1 and C, and 32 for CRF02AG).
Table 1. Characteristics of the Study Population
In terms of ethnicity, the majority (84%) of patients were white persons. Patients infected with C or CRF02AG were more commonly of non-white ethnicity: Black African for 11% and 16% and from other non-white ethnicity for 19% and 26%, respectively.
In terms of geography, half of subtype B and 71% of subtype CRF02AG sequences were sampled in Greater London. Apart from London, subtype A was especially prevalent in North of England (27%).
Infector probabilities
Across 100 bootstrap tree replicates for each subtype, we computed infector probabilities for an average of 554,514 potential transmission pairs involving 14,603 patients (Table 2). The remaining 5,244 individuals from the initial sample, besides 250 outliers in tree reconstruction, could not be connected by a probability of transmission due to their isolation in distinct clades or the time limit imposed on coalescent events. Although the distribution of infector probabilities varies across bootstrap replicates, almost all estimates are very small (Supplementary Fig. S1). This confers very low confidence in any particular pair, and interpretations in terms of transmission are only applicable at a group level. Given the n by n matrix of probabilities that a patient i transmitted to a patient j, the sum
Table 2. Phylogenetic Reconstruction and Source Attribution Results by Subtype
Results are averaged over 100 bootstrap replicates. Global sequences are unique sequences from Los Alamos HIV sequence database matching UK sequences from a BLAST search. Outliers are UK sequences identified as outliers in root-to-tip regression. Mean in-degree represents the probability that the donor of a given recipient is included in the sample.
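The two aggregate quantities used here, the in-degree of a recipient and a group-level mixing matrix, can be sketched from a list of pairwise infector probabilities. This is an illustration under assumptions: the `(donor, recipient, probability)` triple format and the function names are hypothetical, and the matrix is normalized to sum to 1, which may differ from the normalization used in the study.

```python
from collections import defaultdict


def in_degree(pairs):
    """Sum of infector probabilities pointing at each recipient.

    pairs is an iterable of (donor, recipient, probability). The mean of
    the returned values over recipients is the mean in-degree.
    """
    total = defaultdict(float)
    for _donor, recipient, prob in pairs:
        total[recipient] += prob
    return dict(total)


def group_matrix(pairs, group):
    """Aggregate infector probabilities by (donor group, recipient group)
    and normalize so that all cells sum to 1."""
    cells = defaultdict(float)
    for donor, recipient, prob in pairs:
        cells[(group[donor], group[recipient])] += prob
    total = sum(cells.values())
    return {key: value / total for key, value in cells.items()}
```

Because each cell sums many small probabilities, the group-level matrix can be informative even though no single pair is estimated with confidence.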
Age difference between donors and recipients
Table 3 shows the mean difference in age between donors and recipients, weighted by infector probabilities. A significant difference is only detectable for subtype B, donors being on average less than 8 months older than recipients. For subtype B, most transmission pairs in our sample involved individuals less than 30 years of age (Fig. 1M). The largest proportion (46%) of infections acquired by young individuals was attributable to individuals in the same age category (Fig. 1R). Strong assortativity in transmission mixing is seen in this youngest age category, indicating that young MSM are preferentially infected by young MSM. This preferential mixing is also seen among individuals over 44 years. The overall assortativity coefficient was moderate with

Figure 1. Patterns of transmission of HIV subtype B by age in quartiles. The four graphics depict transmission from donor categories in column to recipient categories in row (from x-axis to y-axis). Axes labels represent ranges of quartiles of age. (
Table 3. Difference in Year Between Age of Donor and Age of Recipient
Results are averaged across 100 bootstrap replicates and intervals are 2.5 and 97.5 percentiles.
Age difference is calculated relative to year of birth.
Number of p-values <.05 for two-tailed weighted t-test of the age difference, either positive (donor older than recipient) or negative (donor younger than recipient).
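The weighted mean age difference reported in the table can be sketched as follows. The input format and function name are hypothetical, and the significance test mentioned in the footnote is not reproduced here.

```python
def weighted_mean_age_difference(pairs, birth_year):
    """Mean of (recipient birth year - donor birth year), weighted by the
    infector probability of each pair.

    A positive value means donors are on average older than recipients.
    pairs is an iterable of (donor, recipient, probability).
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for donor, recipient, prob in pairs:
        weighted_sum += prob * (birth_year[recipient] - birth_year[donor])
        weight_total += prob
    return weighted_sum / weight_total
```

Using year of birth rather than age at sampling keeps the difference independent of when each patient happened to be sequenced.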
Transmission by ethnicity
The vast majority (85%) of MSM infected with subtype B viruses were of white ethnicity. We estimated that 82% of all transmissions in our sample occurred between white individuals, and that recipients of all ethnicities had a majority of white donors. The probability of having been infected by a white individual was 92% for whites, 77% for Indian, Pakistani, or Bangladeshi, 75% for other Asians, 55% for Black Africans, and 54% for Black Caribbean. Conversely, a majority of transmissions originating from donors of any ethnic group were estimated to affect white recipients. Figure 2a shows the level of assortativity in transmission of subtype B viruses between ethnic groups. Interethnic transmission (cumulated pair probabilities outside the diagonal) represented 17% overall and 58% when excluding the white category. Overall assortativity was moderate (

Figure 2. Assortativity in transmission of HIV-1 subtype B by ethnicity and region of diagnosis. Lighter shades represent higher assortativity.
We estimated the probability of transmission of subtype B viruses between young (<30) and older MSM (30+) either from white or black ethnicity (Fig. 3). The relative excess of transmission within age categories observed previously is observed for both white and black ethnicities, and overall assortativity by age was similar (

Figure 3. Patterns of transmission of HIV-1 subtype B between young MSM (less than 30) and older MSM by ethnicity:
Transmission by geographical region
Analyses of transmission by region show the largest level of assortativity, indicating an overall strong spatial structure of the epidemics (Fig. 2b). Assortativity coefficients were 0.56 for subtype B and 0.49 for subtype CRF02AG. For those two subtypes, Figure 4 shows the probability for a donor in a given region to transmit to a recipient of each respective region. For subtype B (left), the majority of transmissions (at least 60%) occur within the same region but donors from every region contributed to infections diagnosed in London (10% for North of England, Northern Ireland, Scotland, and Wales, 20% for the Midlands and East England, and 30% for the South of England). For subtype CRF02AG, there was a higher probability for donors from North of England (60%) or Northern Ireland, Scotland, and Wales (70%) to infect recipients in London than individuals within the same region.
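For readers unfamiliar with assortativity coefficients, the sketch below computes Newman's coefficient for a categorical mixing matrix, together with the share of transmission falling outside the diagonal. Using Newman's definition is an assumption of this example, since the exact formula applied in the study is not shown in this text; function names and the dictionary input format are also hypothetical.

```python
def assortativity(matrix, categories):
    """Newman's assortativity coefficient r for a mixing matrix.

    matrix maps (donor category, recipient category) to a non-negative
    weight. r = 1 for fully within-category transmission and r = 0 when
    mixing matches what the marginal distributions alone would predict.
    """
    total = sum(matrix.get((a, b), 0.0) for a in categories for b in categories)
    e = {(a, b): matrix.get((a, b), 0.0) / total
         for a in categories for b in categories}
    trace = sum(e[(c, c)] for c in categories)
    donor_share = {a: sum(e[(a, b)] for b in categories) for a in categories}
    recipient_share = {b: sum(e[(a, b)] for a in categories) for b in categories}
    expected = sum(donor_share[c] * recipient_share[c] for c in categories)
    return (trace - expected) / (1 - expected)


def off_diagonal_share(matrix, categories):
    """Fraction of total weight between different categories."""
    total = sum(matrix.get((a, b), 0.0) for a in categories for b in categories)
    within = sum(matrix.get((c, c), 0.0) for c in categories)
    return 1 - within / total
```

The coefficient corrects the raw within-category share for group sizes, which is why a large majority group can show a high diagonal share yet only moderate assortativity.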

Figure 4. Patterns of transmission of HIV-1 subtype B (left) and CRF02AG (right), by geography. Each flow diagram, obtained from D matrix described in Methods section, has connections proportional to the probability of transmission from a donor given his region (left side) to recipients from respective regions (right side). The map is colored by groups of region of diagnosis: London, South of England (S_England); Midlands and East of England (ML_E_England); North of England (N_England); Northern Ireland, Scotland, and Wales (NI_S_W). Color images are available online.
Discussion
The objective of this study was to describe patterns of HIV transmission between age, ethnicity, and geographical categories in the United Kingdom. We used a phylodynamic inference based on sequences collected among diagnosed MSM, which accounts for incomplete sampling and stage of infection at sampling time. By modeling an epidemic process that is compatible with the evolution of transmitted viruses and epidemiological surveillance data, we characterized past transmission events among nearly 15,000 MSM patients at the national level.
Pair probabilities averaged over phylogenies and aggregated by age groups indicated a modest overall net flow of transmission from older to young MSM. This result is compatible with other studies reporting coclustering of young and older patients 14,15 as we do not observe pure assortative mixing, with probable transmission occurring in both directions across age groups. But our results indicate that on average, flow from old to young is mostly compensated by the transmission from young to old (Fig. 1). And when the flow is imbalanced, as for transmission of subtype B viruses, the difference is small. We observed an overall preferential mixing in transmission by age with greater assortativity both in the youngest and oldest age groups and more random mixing in intermediate age groups. Understanding age mixing patterns in transmission can help to determine which groups are at a greater risk and potentially guide public health interventions. 35 Our findings confirm that young MSM infect one another more than expected by random mixing, which supports the idea that prevention benefit could be enhanced by focusing on this small group. 36 This result also corroborates the observation of recent clusters of young MSM sustaining the epidemic in the Netherlands. 37
We showed an overall preferential pairing by ethnicity in conjunction with substantial mixing between white men and men of other ethnicities. This can be explained by the overwhelming proportion of white men in the population. But in non-white groups, more than half of transmission was interethnic, revealing that a substantial amount of transmission has occurred between ethnic groups among MSM. A similar pattern for sexual partnership between ethnic groups was reported in Britain. 10 Although we found a relatively higher assortativity among black MSM in general and a non-negligible mixing between black ethnic groups from different origins (African, Caribbean, and other), HIV transmission appears less assortative among black MSM in the UK than it is in the USA. 38 We assessed whether intergenerational transmission was different in white and black MSM and found a similar level of age assortativity in both groups. Therefore, like others in the US context, 9 we did not find support in our findings for explaining a disparity in HIV prevalence by age mixing. 7,8
Finally, we found a strong geographical structure for the epidemics among MSM, with region of diagnosis as the variable associated with the highest level of assortativity. This implies that interventions in a particular location would take time to diffuse to a wider population. It should be noted that region of diagnosis can be different than the region of residency or of actual transmission, which may lead to an underestimation of the true level of geographical structure.
Several potential limitations of our study relate to the assumptions of the phylogenetic inference and SA method. First, as stated in Methods section, the SA method neglects some effects of within-host evolution, which can cause discordance between phylogenies and transmission trees. 39 This approximation is reasonable if within-host evolution generates coalescence time considerably shorter than between hosts at the population level. Second, we incorporated crude estimates of incidence and prevalence in the inference of infector probabilities. These were assumed constant over the period and proportional across subtypes. However, variation of these inputs within credible limits had limited impact on average infector probabilities (Supplementary Fig. S2). Third, the direction in transmission was derived from CD4 count and RITA result data that were partially complete.
Nevertheless, our analysis aimed to improve the use of phylogenetic information relative to genetic clustering in two ways. First, by providing a rough measure of transmission probability, which unlike linkage into clusters can indicate a directionality and gives more weight to pairs with higher credibility. Notably, output matrices and patterns between groups would be symmetrical if based on clustering. Second, by correcting for biases stemming from incomplete sampling of the infected host population. Lastly, the SA method was fast to compute and scaled easily to phylogenies based on many thousands of sequences. The approach we take is generalizable to many different settings and has wider applicability to other large pathogen sequence databases.
Future directions for this work include applying the analysis to the heterosexual population, where phylogenetic information could contribute to assess age disparity in mixing across gender. 40,41 Another direction would be to use methods exploiting next-generation sequencing that account for within-host evolution and enhance resolution in identifying transmission. 39,42
In conclusion, this study has leveraged available patient data and viral sequences to provide evidence of assortativity in HIV transmission by age, ethnicity, and geography. Understanding these patterns of transmission is important for modeling the impact of intervention strategies.
Footnotes
Acknowledgments
This work was supported by the National Institute for Health Research (NIHR) Health Protection Research Units in Modeling Methodology and Sexually Transmitted Infections (HPRU-2012-10080). E.M.V. is supported by the National Institutes of Health (R01AI087520). O.R. and C.F. are supported by Bill & Melinda Gates Foundation: Phylogenetics Networks to Address Transmission of HIV (OPP1084362). A.T. is supported by UK HIV Drug Resistance Database grant from the Medical Research Council (164587). The authors thank the Imperial College High Performance Computing Service (doi: 10.14469/hpc/2232).
Author Contributions
S.L.V. designed the study, performed the analysis, and wrote the article; O.R. contributed to the phylogenetic analysis and writing the article; V.D., A.E.B., O.N.G., A.T., and D.D. contributed to data collection, molecular sequencing, data monitoring, and article evaluation. C.F. contributed to article editing and project leading. E.M.V. designed the study, contributed to article review and editing and project leading.
Author Disclosure Statement
No competing financial interests exist.
Supplementary Material
Supplementary Figure S1
Supplementary Figure S2
Supplementary Figure S3
Supplementary Table S1
References