Abstract
Decades of research has yet to provide a vaccine for HIV, the virus which causes AIDS. Recent theoretical research has turned attention to mucosa pH levels over systemic pH levels. Previous research in this field developed a computational approach for determining pH sensitivity that indicated higher potential for transmission at mucosa pH levels present during intercourse. The process was extended to incorporate a principal component analysis (PCA)-based machine learning technique for classification of gp120 proteins against a known transmitted variant called Biomolecular Electro-Static Indexing (BESI). The original process has since been extended to the residue level by a process we termed Electrostatic Variance Masking (EVM) and used in conjunction with BESI to determine structural differences present among various subspecies across Clades A1 and C. Results indicate that structures outside of the core selected by EVM may be responsible for binding affinity observed in many other studies and that pH modulation of select substructures indicated by EVM may influence specific regions of the viral envelope protein (Env) involved in protein-protein interactions.
Introduction
AIDS was discovered nearly 40 years ago, but a vaccine for the virus that causes the disease remains imponderable. The challenge that researchers face is the overwhelming mutation rate of the virus due to host immune system pressure after introduction into the body.
HIV is typically transmitted during sexual intercourse where an acidic mucosa pool exists. Because protein structures and their ability to interact with other proteins are affected by pH, 1 we focus our attention on this key component. HIV transmission occurs when the gp120 portion of the viral envelope protein (Env), attached to the periphery of the virus, makes contact with a CD4 protein receptor on host T-cell membranes. Interaction between the 2 structures initiates a binding process and subsequent introduction of the viral RNA.
A study completed by Boeras et al 2 in 2011 concluded that the highest populations of HIV variants are not the subspecies that transmit from one host to the next. Their determinations were backed by statistical analysis of population subspecies and transmission data through direct investigation of human volunteer donors. With the large pool of quasi-species extracted, and the capture of variants at the time of transmission, this data set presents a potential to determine differences in protein structure and the role of pH that may explain the transmission bottleneck.
We focus our efforts around the sequences provided by Boeras et al as the foundation of our latest theoretical methods in an effort to narrow the field of research to those Env quasi-species with a higher potential of producing an infection from host to host.
Background
The high rate of mutation obtained by HIV allows antigenic regions targeted by host immune responses to vary greatly across HIV variants. Most research has focused on inducing the so-called broadly neutralizing antibodies (bnAbs) that target protein antigenic regions conserved due to functional requirements of the binding process. 3 The gp120 extracellular subunit of Env is responsible for binding CD4 on the surface of host T cells to begin infection; this subunit is a common target for bnAbs. 4 Env fragments selected via computational optimization to potentially invoke the production of bnAbs are often employed in current work for vaccine production. 5 Studies using these methods have varied from successful 6 to unsuccessful. 7 One potential explanation is that environmental impacts on gp120-CD4 interactions are not considered during Env selection. In particular, isolating bnAbs from a blood/plasma environment (slightly basic pH) might obfuscate the impact of mucosal environments (often acidic pH) on transmission. Therefore, it is reasonable to assume that both Env structure and binding affinity with CD4 and/or bnAbs will be altered under physiological conditions which are more consistent with sexual transmission.
Recent experimental and computational studies have shown that pH does in fact impact both Env conformation and CD4 binding. In 2013, Stieh et al hypothesized that electrophoresis, which is commonly used to characterize and separate cells and micro-organisms,8,9 could be applied at a protein level and performed direct experimentation to reveal a pattern of change in surface electrostatics across the pH range of the human body. Their findings produced a fingerprint of trimeric gp120 indicating a change in electrophoretic mobility from negative toward positive as pH increased. 1 The study was performed in a multidisciplinary, collaborative effort with computer scientists to develop a corresponding analytical protocol using off-the-shelf general public license (GPL) based software. The pipeline produced similar results to those of laboratory experiments developed by Stieh et al in that a determinable difference was seen from negative to positive with advancing pH. Stieh et al concluded that the experimental process and the computational data were in agreement.
In 2016/2017, Morton et al enhanced and refined the process introduced by Stieh et al to incorporate protein modeling via Modeler, 10 parallel processing, structure energy minimization by Gromacs,11,12 and advanced floating point data compression through ZFP 13 that allowed for larger studies to be performed and a greater depth of analysis to take place. 14 A classification method called Biomolecular Electro-Static Indexing (BESI) was developed based on principal component analysis (PCA), cosine similarity analysis (CSA), and loosely based on latent semantic indexing (LSI). Nearly 1 million adaptive Poisson-Boltzmann solver (APBS) 15 calculations were executed by Morton et al with the entire computational process taking approximately 60 days to complete on a small compute cluster with 256 cores.
During 2016 to 2017, Howton and Phillips 16 introduced a prototype method that extended Stieh et al to the protein residue level. The approach used by these authors exercised the hypothesis that strains in chronic infection, the so-called chronic control (CC) strains, will likely have adapted to systemic pH and will be less efficient at binding CD4 under acidic conditions when compared with transmitted founder (TF) strains. Using computational modeling, some differences between subclasses (TF and CC) and clades (B and C) were discovered using a more extensive set of 28 Env proteins. 16 However, the specific molecular mechanism (eg, surface residues and mutations) responsible for the pH sensitivity of the gp120-CD4 interaction could not be determined using the resulting data. The main difficulty was assumed to stem from a small sample size and a broad range of sexual-transmission-type studies.
In 2018, Morton et al 17 developed a method of protein residue analysis that examines the surface charge fluctuations of amino acids called Electrostatic Variance Masking (EVM). This method aligns all sequence structures together and determines the charge variance of exposed surfaces across the set. This information is then used to image those amino acids via transparency against a representation of the structure in an alternate mode such as New Cartoon in VMD. 18 The imagery produces a unique view of charge active residues that are similar across all structures examined to date. The process reveals what were hypothesized to be the residues responsible for modulating the binding process by exposing the high variation of electrostatic charge across the pH range of the human body.
Target Data
From a pool of more than 900 HIV Env sequences, Boeras et al provided 252 gp120 protein assemblies drawn against 20 individuals from Rwanda and Zambia. The donors consisted of couples where one was known to be infected and the other was expected to acquire infection at some point. Samples were taken prior to communication of the disease and after infection of the recipient occurred. The naming conventions used for the sequences indicate the country of origin, the sex, a subject pair identifier, and a donor (D)/recipient (R) indicator. Our selection of sequences is based on the BESI scores of the donor sequence data for each couple and is represented in Table 1.
List of donors taken from Boeras et al. 2
BESI, Biomolecular Electro-Static Indexing.
Subject indicates the country of origin, couple identifier, and sex, respectively. D indicates the subject’s status as the donor. Scores are the highest and lowest BESI scores for the sequence set.
The subject pair is not mentioned in Boeras et al’s study.
We use previously processed data from Morton et al 14 to reduce the overall processing time considerably.
Methods
Residue surface charges
We calculate the charges of individual amino acids that have solvent-accessible surfaces as described by Howton and Phillips, 16 enhanced and performed by Morton et al, 17 that include energy minimization steps performed by Gromacs11,12 and compression levels approaching 2 orders of magnitude provided by ZFP. 13 With the latter enhancement, we are able to process larger studies across more solvation states that allow a more granular investigation of the substructures involving gp120.
Phylogenetic tree
The phylogenetic tree inferred for the selected high and low BESI scores for each donor is constructed as follows. Sequences were aligned with MAFFT v7.273 using the L-INS-i strategy. 19 A maximum likelihood (ML) phylogenetic tree was inferred using the RAxML software, version 8.2.11, 20 with the HIVW amino acid model of substitution 21 and 100 bootstrap replicates. Trees were midpoint-rooted and rendered using APE version 5.0. 22 Expression of the phylogenetic tree involves minor differences from Morton et al 14 where recipient sequences are unused for this study.
BESI
With the focus of investigation being the transmission of the virus, our attention is directed to the donor group from Boeras et al. Using BESI as prescribed in Morton et al, 14 we select the maximum and minimum scores available from each donor into a correlation of BESI and phylogeny to produce Figure 1 which provides a graphical representation of BESI and evolution. One can see that, for each subclade of the tree, a higher and a lower score have been selected based on the gradient scale left of the inference. Note that the inferred tree also distinctly differentiates between donor categories where the sequence name represents the country Zambia (Z) or Rwanda (R) with a 3-digit code for a subject number. The fifth character is gender specific which is self-explanatory. All additional characters are attributes of the sequence that are explained in Boeras et al 2 if the reader chooses.

BESI vs phylogeny for the selected highest and lowest BESI scores applied as a gradient to the phylogenetic tree. As the BESI scores increase, the shading moves more toward the red. Each subclade of the tree is a specific donor in the study.
The reader should note that at this point no additional calculations have been made with the data; we have simply selected a subset of what was processed in Morton et al 14 and presented the results in a different manner.
Electrostatic variance masking
Selection of residues that show surface charge response to pH shifts involves calculating the electrostatic potential variance of each residue across all aligned sequences vertically. Where gaps are encountered in the alignment, a value of 0 is assigned. For each residue, the median value of individual residues for each model at a specific pH is taken to create a 1 × 61 vector for the pH range of 3.0 to 9.0 in 0.1 increments. The vectors are stacked row by row to create an array of dimensions M × 61, where M is the number of sequences involved in the study. The mean value of each column is then calculated to produce a vector for which the variance is determined and stored. This is repeated for each alignment position. This method allows us to effectively filter out residues with small variations in mean surface charge across the pH shift.
For each sequence alignment, a reverse mapping is created to align selections with correct residue numbers on the individual proteins. Where a gap exists in the alignment, a hyphen (-) is assigned. This allows the determination of a cutoff value for variance where a selection of a gap in some determined sequence can easily be detected. To determine a starting value for selection, the ceiling of one-half the standard deviation is calculated for the variance data. Assuming a gap is selected, the value is incremented by 1 until a uniform selection across all sequences can be determined.
The selected residues of the gp120 protein are then applied to a VMD representation 18 to display the substructures involved. This method of imaging residue structures participating in the mechanistic functions of the binding process is EVM.
HXB2CG alignment
We align the assemblies to HXB2CG as described by Korber-Irrgang et al 23 in Numbering positions in HIV relative to HXB2CG. This provides a common numbering scheme for amino acids and allows us to describe those residues that EVM selects in a concise manner.
Comparing BESI and variable loop lengths
For each sequence used in this study, we extract residue information directly from amino-acid-based text files. The 5 variable loops associated with HIV are extracted by aligning to HXB2CG and clipping the loops inclusively at defined residue numbers provided by Los Alamos National Laboratory (LANL) 24 in the HXB2 annotated spreadsheet. The information is correlated with BESI scores and presented via scatter plots grouped by variable loop number.
Tropism of loop V3
A method of prediction for a V3 tropism test of the major HIV-1 subtypes was developed by Cashin et al25-29 to determine specificity for CCR5 and CXCR4 usage during the binding process. We extract this information using the provided web-based tool and present the data in table format as a comparison of co-receptor predicted binding mode, BESI score, and clade.
Results
Our results are focused around EVM, HXB2, variable loop lengths, and V3 tropism. We skip through BESI as the information used here is an extension of the analysis performed by Morton et al 14 and presented in High-Throughput Structural Modeling of the HIV Transmission Bottleneck. Sequence data are available in Appendix 1.
EVM
Performing the process prescribed by EVM produced a uniform selection of amino acids for each of the chosen sequences. Statistical information returned from this data set is as follows:
The variance data for the entire set of gp120 structures analyzed in this study are displayed in Figure 2. We separated Clades A1 and C into 2 graphs to display differences in Figures 3 and 4. We note similarities between the 2 representations and differences in amplitude for later analysis. A scree plot is generated to provide a sorted view of the data in Figure 5. The red horizontal line indicates the cutoff value chosen.

Raw EVM plot of the variance values for the aligned protein sequences. The small subset of amino acids (11.0%) experiencing surface charge modulation due to varying pH levels at or above the selection value (variance = 65) contain the largest amount of variation (73.6%).



Scree plot of the variance values for the aligned protein sequences. The red line indicates the selected cutoff value displaying the large amount of variance (73.6%) across a small subset of amino acids (11.0%).
For the purposes of this discussion, we have selected a single gp120 structure as the subject of explanation for all the remaining graphics. The calculated sequence-based residue map for this Env is as follows:
R56MCF21aug0511_plasmid_1v
14 16 18 31 58 63 65 66 69 73 81 90 91 92 93 160 162 175 177 206 210 211 212 214 215 221 222 224 234 244 248 255 257 329 330 335 337 377 380 382 396 398 401 406 422 423 424 425 426 428 430 431 433 434 436 438
We apply the amino acid maps in VMD by first creating an additional representation in the interface. We use “New Cartoon” colored by secondary structure to represent the entire assembly. The second representation is limited to the selected residues provided by EVM as a single color (red) in transparency. Figure 6 is marked to present the α2 helix oriented left of the binding site and labeled accordingly. All the remaining images of Env structure and substructure are oriented identically for this article. All sequence pair imagery can be examined as shown in Appendix 1. The region of selection is highly conserved and localized at the Env center. Assuming that the process continues to provide similar results for the other analyzed structures, the power of the tool to exhibit differences in assembly makeup will become apparent.

Electrostatic Variance Masking of R56MCF21aug0511_plasmid_1v. This figure indicates the orientation of the assembly with the α2 helix situated to the left and distinguished by the label and arrow to confirm the binding site position. All images of Env structure and substructure are oriented identically for this article.
To further expound on the selection process, a WebLogo30,31 representation is generated for the aligned sequences. Sequence logos present a unique method of graphical representation that displays the presence of like amino acids across the set of sequences by lettering height. Figure 7 displays the logos for all selected substructures in this study, and Figures 8 and 9 produce the logos for sequences in Clades A1 and C, respectively. We again note the minor discrepancies in content between the 2 clades for future analysis and disregard the differences in height due to the number of sequences present in each clade of this study.

Sequence logos representation of the EVM selection process for all structures. The figure displays the conservation of residues in the EVM process across all sequences.

Sequence logos representation of the EVM selection process for structures in Clade A1. The figure displays the conservation of residues in Clade A1.

Sequence logos representation of the EVM selection process for structures in Clade C. The figure displays the conservation of residues in Clade C.
HXB2CG characteristics
For this study, we aligned all assemblies to HXB2 using the procedure described by Korber-Irrgang et al 23 in Numbering positions in HIV relative to HXB2CG. Residue selections provided by EVM and mapped back to HXB2 position identification via the annotated spreadsheet 24 were identical containing the following list:
47 49 51 64 91 96 98 99 102 106 114 123 124 125 126 199 201 214 216 245 249 250 251 253 254 260 261 263 273 283 287 294 296 370 371 376 378 426 429 431 445 447 450 455 470 471 472 473 474 476 478 479 481 482 484 486
Per the annotated spreadsheet, we note the following pertinent EVM selections: Residues 64 and 91 are adjacent to 65 and 92, respectively, which are interface contacts with gp41; 123 is a co-receptor binding site outside of V3 and adjacent to 122 of the same function; 124 to 126 are CD4 contact residues; 199 is a co-receptor-specific (R5/X4) site; 201 is adjacent to 202 which is a co-receptor binding site outside of V3; 249 to 251 where 251 is a co-receptor-specific (R5/X4) site; 253 is adjacent to 252 which is an interface contact with gp41; 261 and 263 are adjacent to glycosite 262; 283 is a CD4 contact residue; 294 is adjacent to glycosite 295; 296 is the beginning of V3 loop; 370 is a CD4 contact residue and 371 is adjacent; 376 is adjacent to 377, a co-receptor binding site outside of V3; 378 is cystine linked to a counterpart at 445; 426, 429, and 431 are CD4 contact residues; 445 is cystine linked to a counterpart at 378; 447 is adjacent to glycosite 448; 455 is a CD4 contact residue; 470 is V5 loop end and adjacent to CD4 contact residue 469; 471 to 476 are CD4 contact residues.
Variable loop lengths
Derdeyn et al 32 observed that transmitted quasi-species appeared to have shorter variable loop lengths than the larger populations of the donor. Whereas our study focuses on the donor-specific envelope structures based on BESI score, we observe the research of Boeras et al 2 in that a small subset of quasi-species actually cross the transmission barrier. Our observation conjoins the 2 aforementioned examinations to reason that Derdeyn et al observed an attribute of the transmission bottleneck.
We applied BESI scores to variable loop lengths in scatter plots for the sequences used for this study. Figures 10 and 11 present a discernible correlation with BESI score and variable loop length. We note that our small sample size may influence other observations in this regard and distinguish loops V2 and V5 as exposing the potential need for further investigation. All variable loop graphs can be examined as shown in Appendix 1.

Scatter plot of variable loop V2 in comparison with BESI scores. Red indicates the control quasi-species. In total, 55% of the residues fall at or inside the quadrant containing our control variant (red dot) which is a BESI score of 0.5 or greater.

Scatter plot of variable loop V5 in comparison with BESI scores. Red indicates the control quasi-species. In total, 50% of the residues fall at or inside the quadrant containing our control variant (red dot) which is a BESI score of 0.5 or greater.
V3 co-receptor tropism
Our investigation into the usefulness of BESI, what the process is keying on and how to best describe using the method, requires the examination of peripheral attributes. Here we have applied a method of predicting the tropic mode of V3 co-receptor binding using a process developed by Cashin et al25-29 (Table 2). Although the resulting data are inconclusive, we provide the same explanation of limited sample size as a potential detriment to the observations.
Tropic mode of V3 co-receptor binding in comparison with BESI score and Clade revealing no pattern of distinction available with the current sample size.
BESI, Biomolecular Electro-Static Indexing.
Discussion
HIV has been the focus of research around the world for nearly 4 decades. The virus has eluded scientists over this period due to the fast mutation rates and the ability to overwhelm the human immune response system. With some progress being made in the quality and length of life, a vaccine has still to be determined for the infectious disease. 33 Typical studies of the binding site overlook the significance of pH and the effects acidic fluids, common in genital mucous, have on antibody binding functions. 34
The binding site of gp120 is correctly identified through residue-specific pH sensitivity without the need to determine the area based on the presence of CD4 or other counterpart protein structures. In addition, the selection process identifies highly reactive residues adjacent to common glycosite and specific co-receptor positions implying that pH modulation of these amino acids could influence activities common to those locations.
We note that our sample size is too small to evaluate against variable loop lengths although loops V2 and V5 do indicate the potential to produce interesting data. Furthermore, we observe no discernible characteristics in a comparison of V3 co-receptor tropism, BESI, and clade based on the small sample size used here.
These results suggest that the highly conserved and localized amino acid cluster is not responsible for variation in the ability of a particular mutation to infect another cell, but the variation of the remaining structure due to folding and loop lengths may. Figure 12 shows the core representation depicted in Figure 6. The darker shaded areas are exposed surfaces of the CD4 binding site.

The core structure selected by EVM. The dark shading is the exposed surface at the CD4 binding site. The orientation of this image is identical to Figure 6 and is the same gp120.
We noted differences in variance data between Clades suggesting that some discernible variations may exist that provide additional insight into the binding process. Although these fluctuations are noted, the number of sequences selected for each clade precludes the useful comparison of the data and will need to be analyzed at a later date with a balanced set of sequences.
We conclude that BESI, in conjunction with EVM, provides a unique view of the gp120 Env and may provide additional focus on a subset of mutations for vaccine research. The process reveals differences in the outer structures of the protein and displays the power to distinguish features both visually and analytically.
Footnotes
Appendix 1
Acknowledgements
We would like to thank Cynthia Derdeyn (Emory University) for providing the sequence data and Peter Hraber (Los Alamos National Laboratory) for helpful discussions about the sequence data and analyses used in the study by Boeras et al.
Funding:
The author(s) received no financial support for the research, authorship, and/or publication of this article.
Declaration of conflicting interests:
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Author Contributions
SPM developed the methods of analysis, wrote the manuscript, performed all molecular simulations, simulation analysis and created all simulation data figures. JBP developed the alignment methods utilized for phylogeny. JLP developed the simulation protocols, reviewed manuscript, and simulation analysis.
