Abstract
Phage display is a powerful technique for selecting high-affinity antibodies for different purposes, including biopharmaceuticals. Next-generation sequencing (NGS) has emerged as a robust companion to it, making it possible to assess billions of variable domain sequences from selected sublibraries. A central difficulty in handling these data is finding the selected clones. Here, we present the AutomaTed Tool For Immunoglobulin Analysis (ATTILA), a new tool to analyze and find the enriched variable domains throughout a biopanning experiment. ATTILA is a workflow that combines publicly available tools with in-house programs and scripts to find the fold-change frequency of deeply sequenced amplicons generated from selected VH and VL domains. We analyzed the same human Fab library NGS data using ATTILA in 5 different experiments, as well as 2 biopanning experiments, with regard to performance, accuracy, and output. These analyses proved suitable for assessing library variability and for listing the most enriched variable domains, as ATTILA provides a report with the amino acid sequence of each identified domain, along with its complementarity-determining regions (CDRs), germline classification, and fold change. Finally, the methods employed here demonstrate a suitable way to combine amplicon generation and NGS data analysis to discover new monoclonal antibodies (mAbs).
Introduction
The use of phage-display antibody combinatorial libraries has boosted the isolation of innovative antibodies. Approximately 30% of the antibodies approved for clinical treatment in the last 5 years were isolated from phage-displayed libraries using different selection strategies. 1 Originally, the validation of selected clones was based on either the enrichment of specific clones throughout the selection rounds 2 or the biological activities of a few randomly selected clones. 3 However, the low throughput of Sanger sequencing limited these strategies, as enriched clones were assessed from only a few sequences or from activity-screened clones (by antigen binding, for example). Thirty years after the pioneering work describing the phage-display technique, 4 new methodologies such as next-generation sequencing (NGS) have pushed the technology toward modern standards for selecting biologically active clones.
The development of NGS increased by several orders of magnitude the number of individual clone sequences obtained, allowing the determination of complete variable domain repertoires. As a consequence, NGS became the preferred strategy to determine which antibody phage clones were successfully selected in phage-display experiments. 5 Most of these approaches rely on repertoire studies and clonal skewing to find the winning antibody sequences.6-8 Some methods focus on the VH complementarity-determining region-3 (CDR3) to address the diversity of phage libraries.9,10 The role of CDR3 in antigen (Ag) recognition and binding is noteworthy, but other regions (CDR1 or 2, and even some framework residues) have also been described as crucial in some antibody-antigen interactions. Thus, novel approaches addressing the whole sequence may be better suited to finding reliable high-affinity antibody domains.
In this work, we describe a new workflow—AutomaTed Tool For Immunoglobulin Analysis (ATTILA)—that makes it possible to identify variable domains enriched after selection. The methodology combines the generation of VH and VL amplicons from antibody-displaying phages, before and after selection, and the use of NGS to acquire their coding sequences. After a filtering process, ATTILA can establish the frequency of each variable domain sequence in a given selection round. It generates a report with the fold change of each enriched sequence, giving the amino acid sequence, complementarity-determining region (CDR) identification, and germline classification of the variable domains. We also report a human Fab VH-VL combinatorial library displayed on phage, 11 which was examined to validate the ATTILA workflow. We present ATTILA analyses based on heavy variable domain (VH) sequences, submitted to NGS sequencing in either Illumina or 454 sequencers. The ATTILA workflow also assesses the variability of the library and the error rate of the whole process and analyzes the changes in selected VHs throughout the selection cycles from 5 different biopanning experiments. Finally, we show that ATTILA is a powerful tool to point out specifically selected variable domains using the combination of polymerase chain reaction (PCR) and NGS to identify antibodies from phage-display libraries enriched by panning. Based on the ATTILA results, antibodies harboring the most enriched VH and VL can be produced and further characterized.
Methods
Describing ATTILA
AutomaTed Tool for Immunoglobulin Analysis (ATTILA) is a workflow that combines both third-party and in-house programs for analyzing phage-display selection of antibodies by NGS sequencing data derived from Illumina or 454 sequencers. It compares the content of VH and VL clones in both unselected library
The ATTILA workflow takes, as input, each sequencing set of 4 libraries, 2 from the heavy chain (VH) and 2 from the light chain (VL). Both contain a sample of the original phage-display library
Next, a program translates antibody variable domain sequences, using a dedicated approach to choose the most probable open reading frame (ORF). The chosen ORF has no stop codons, except TAG, which is read as glutamine in supE44 Escherichia coli strains, and presents conserved framework residues flanking CDR1 and CDR3. A Perl script calculates the relative frequency of each unique translated subsequence delimited by the first Cysteine (C) residue before CDR1 and W/FGXG, after CDR3. Another Perl script compares the relative frequencies of each sequence in
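The frame-selection rule described above can be illustrated with a minimal Python sketch. It is not the ATTILA code, which the text describes as an in-house translation program plus Perl scripts. The names `translate`, `best_domain`, `relative_frequencies`, and the regular expression `DOMAIN` are hypothetical, and the pattern only approximates the delimiters named in the text, without any of the distance rules ATTILA applies.

```python
import re
from collections import Counter

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {a + b + c: AMINO[16 * i + 4 * j + k]
               for i, a in enumerate(BASES)
               for j, b in enumerate(BASES)
               for k, c in enumerate(BASES)}
CODON_TABLE["TAG"] = "Q"  # amber codon read as glutamine in supE44 hosts

# Assumed pattern: a first Cys, a later Cys, then W/F-G-X-G after CDR3.
DOMAIN = re.compile(r"C.*?C.*?[WF]G.G")


def translate(dna, frame):
    """Translate one forward frame; codons with unknown bases become 'X'."""
    dna = dna.upper()
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(frame, len(dna) - 2, 3))


def best_domain(dna):
    """Return the delimited domain from the first acceptable frame, else None."""
    for frame in range(3):
        protein = translate(dna, frame)
        if "*" in protein:
            continue  # a stop codon other than TAG disqualifies this frame
        match = DOMAIN.search(protein)
        if match:
            return match.group(0)
    return None


def relative_frequencies(reads):
    """Relative frequency of each unique domain among the accepted reads."""
    counts = Counter(d for d in map(best_domain, reads) if d)
    total = sum(counts.values())
    return {domain: n / total for domain, n in counts.items()}
```

Calling `relative_frequencies(list_of_read_strings)` returns a dictionary keyed by translated domain. Reads with no acceptable frame are dropped before the frequencies are normalized, which is one possible design choice rather than a documented ATTILA behavior.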

Figure 1. The ATTILA workflow. It retrieves selected VH and VL domain sequences from phage-display experiments. The ATTILA workflow reads antibody phage-display NGS data, either single-end or paired-end sequences in FASTQ format, and delivers a report of the most enriched VH and VL sequences after panning. The steps marked A and B are the workflow stages examined in the following figures. ATTILA indicates AutomaTed Tool For Immunoglobulin Analysis; NGS, next-generation sequencing.
Data checking and enrichment analysis
To check the ATTILA results, we examined 2 output files. The first was a nucleotide FASTA file obtained after NGS quality check and filtering. The second was a list of VH clones predicted after translation and pattern detection, named the VDJ data set. A VDJ file can be in nucleotide or amino acid format. The VDJ sequence files were compared using bash commands and Blast 17 to retrieve ATTILA data. The VH gene usage was computed using Blastn 17 against a Kabat germline database (obtained from GenBank, National Center for Biotechnology Information [NCBI]), keeping the best hits with an e-value ≤ 10^-10 for family assignment. The UpSet plot (Figure 3) was generated with the VDJ amino acid data set.
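As an illustration of the comparison that underlies an UpSet plot, the following Python sketch groups sequences by the exact set of samplings in which they occur. It is an assumption-laden example, not the procedure used here (the text mentions bash commands and Blast). The function name `exclusive_intersections` and the sample labels are hypothetical.

```python
def exclusive_intersections(samples):
    """Group sequences by the exact set of samples that contain them.

    `samples` maps a sample name to an iterable of amino acid sequences.
    The result maps a frozenset of sample names to the set of sequences
    found in all of those samples and in no other sample.
    """
    membership = {}
    for name, seqs in samples.items():
        for seq in set(seqs):  # ignore duplicates within one sample
            membership.setdefault(seq, set()).add(name)

    groups = {}
    for seq, names in membership.items():
        groups.setdefault(frozenset(names), set()).add(seq)
    return groups


# Toy usage with made-up sequences and hypothetical sample labels.
toy = {
    "sampling_1": ["CAAAC", "CGGGC"],
    "sampling_2": ["CGGGC", "CTTTC"],
    "sampling_3": ["CGGGC"],
}
for names, seqs in exclusive_intersections(toy).items():
    print(sorted(names), len(seqs))
```

The size of each group corresponds to one bar of an UpSet plot drawn in exclusive-intersection mode. Sequences private to one sampling fall in the single-name groups.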
VH and VL amplicons for NGS sequencing
The VH and VL coding genes from each round of a given experiment were amplified from pooled phagemid preparations. For PCR, the following primers with Illumina adapters were used: 5′leadVH—TCGTCGGCAGCGTCAGATGTGTATAAGAGACAGGCTGCCCAACCAGCCATGGCC; 3′VH_rev—GTCTCGTGGGCTCGGAGATGTGTATAAGAGACAGCGATGGGCCCTTGGTGGAGGC; 5′Vkappa—TCGTCGGCAGCGTCAGATGTGTATAAGAGACAGGGGCCCAGGCGGCCGAGCTC; and 3′Vkappa_rev—GTCTCGTGGGCTCGGAGATGTGTATAAGAGACAGAAGACAGATGGTGCAGCCACAGT.
The reactions were performed using Platinum Taq DNA Polymerase (Invitrogen) according to the manufacturer’s instructions, and the cycling was as follows: 95°C for 2 minutes; 30 cycles of 95°C for 1 minute, 65°C for 1 minute, and 72°C for 1 minute, followed by an extra 5-minute incubation at 72°C. The amplicons were resolved on a 0.8% agarose gel, from which they were extracted and purified using UltrafreeDA columns (Millipore), according to the manufacturer’s instructions, prior to NGS.
Immunoglobulin Fab library and NGS sequencing
All experiments were performed with a previously described Fab phage-display library 11 based on the pComb3X vector. 3 The library was deeply sequenced 5 times on the Illumina MiSeq platform and once by 454 pyrosequencing (Roche). For each sequencing experiment, VH and VL amplicons were obtained as described above. The NGS raw data are shown in Supplementary Table S1.
Selection procedure
Two phage-display panning experiments were also assessed here. In the first experiment, Fabs were selected against a synthetic glycopeptide. The number of washes was increased throughout the selection, typically 5, 10, 15, and 15 washes in rounds 1 to 4, respectively. The elution was performed using an acid solution. The PCR amplicons for VH and VL were obtained as above, from the original library as well as from the second, third, and fourth selection rounds. In the second experiment, the library was panned against a biotin-labeled peptide, and the elution was performed by 2 different protocols: either by disfavoring binding using traditional acid elution or by competition with an unlabeled peptide. Four rounds of selection were performed, increasing the number of washes as described above, and the sets of VH and VL amplicons were also obtained by PCR from the original library and from round 4.
Results
Developing ATTILA workflow
The ATTILA workflow (Figure 1) can be used to analyze NGS sequences from PCR amplicons obtained from phage-displayed libraries. It compares the content of VH and VL clones in both the unselected library
The ATTILA workflow finds candidate sequences displaying known conserved Cysteine residues flanking both CDR1 and 3 and the conserved β-bulge residues GXG at the end of CDR3. 12 Based on the CDR’s definition 13 and on observations of the germline sequence profile, ATTILA establishes a valid rearranged domain (VDJ) of heavy or (VJ) light chains by computing the minimum and maximum distances between the first C residue before CDR1 and the conserved C residue just before CDR3, and between residue C and conserved
The
subset reflects the variability of the Fab library
The original library clone content was independently assessed in 5 distinct experiments performed over a 4-year time interval. The NGS data were processed, filtered, and used for VDJ pattern identification. In Figure 2, the input files and their respective

The VH gene usage of raw input sequences and rearranged VDJ after translation and pattern detection. Raw VH domain sequences were analyzed after the sequence filtering step (marked as A in Figure 1) and labeled as Input, and after rearranged VDJ pattern detection (marked as B in Figure 1). The VH gene usage was assigned by Blast against a Kabat VH germline database and is shown in proportional stacked bars for 5 independent libraries’
The
The original phage libraries were sampled 5 times and the size of the intersection among these

Assessing the diversity of the antibody phage library. The antibody phage-display library was independently sampled 5 times. (A) The UpSet plot of intersection between
Error rate
Sequencing error may yield artifactual sequences not previously found in the phage-display library. Because the experimental error rate is difficult to estimate in highly variable immunoglobulin sets, we instead estimated the artifactual appearance of extra Cysteine residues in the V gene coding sequence. As observed in Table 1, as many as 3% of the sequences in correct in-frame VDJ subsets contained at least one additional Cysteine residue, a potential error due to PCR or sequencing procedures. Despite the elevated error rate, there was no clear deviation of the
Cysteine error rate.
Total VDJ indicates the total number of valid rearranged sequences; 1C, 1 additional non-conventional Cysteine; >1C, more than 1 additional non-conventional Cysteine.
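The counting behind such a table can be sketched in a few lines of Python. This is a hedged illustration, not the authors' script. It assumes each input string spans exactly the stretch from the conserved Cysteine before CDR1 to the conserved Cysteine before CDR3, so that 2 Cysteines are expected and any further one is counted as non-conventional. The function name `cysteine_error_summary` and the dictionary keys are hypothetical.

```python
def cysteine_error_summary(segments):
    """Summarize extra Cysteines in Cys-to-Cys segments.

    Each segment is assumed to start at the conserved Cys before CDR1 and
    end at the conserved Cys before CDR3 (an assumption of this sketch),
    so two Cysteines are expected per segment.
    """
    total = len(segments)
    one_extra = sum(1 for s in segments if s.count("C") - 2 == 1)
    more_extra = sum(1 for s in segments if s.count("C") - 2 > 1)
    flagged = one_extra + more_extra
    return {
        "total_vdj": total,
        "1C": one_extra,
        ">1C": more_extra,
        # Guard against an empty input rather than dividing by zero.
        "fraction_with_extra": flagged / total if total else 0.0,
    }


# Toy usage with made-up segments.
print(cysteine_error_summary(["CAAAC", "CACAC", "CCCAC", "CAAC"]))
```

The `fraction_with_extra` value is the quantity comparable to the percentages discussed in the text, under the stated assumption about segment boundaries.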
Evolution of the selected library subsets
The changes in VH family content and the emergence of selected phage clones were observed in 2 independent panning experiments. In the first, 4 rounds of phage selection were sequenced: unselected library

The VH sequences are selected during panning experiments. The evolution of representative sequence contents is shown during selection steps (
Phage clones that did not interact with the selection target were diluted along with the panning procedure, and the selection round sequence data sets accompany this. The 4 most frequent
In the second panning experiment, phages were selected for binding to a biotinylated peptide and eluted either with an acid elution or with a competing peptide. Only
Enrichment analysis reveals the winning clones
The enrichment analysis performed by ATTILA compares the frequency of rearranged sequences in the last selected round

Enrichment of VH domain sequences observed in the panning experiments. The fold changes of the 50 most enriched clones, as predicted with ATTILA, are plotted in descending order: (A) panning experiment 1; (B) panning experiment 2 with peptide elution; and (C) panning experiment 2 with acid elution.
The enrichment of VL clones was also calculated and reported by ATTILA using rules similar to those for VH. Supplementary Figure S3 shows the enrichment of VL for the experiments reported above. Enrichment is less pronounced than that observed for VH, and the fold change of the most enriched VL ranged from 700 to 1400. In 2 of these experiments, the presence of Fabs bearing highly selected VH and VL domains was also confirmed by PCR, using CDR3 (H and L) targeting primers (data not shown). In a biopanning experiment where only VLs were selected from a human VL library, the 2 most enriched VLs reported by ATTILA were tested for their ability to bind to the same antigen used in the selection procedure. Both clones harboring the newly selected VLs showed better antigen binding than the original clone (Supplementary Figure S4).
Discussion
The use of high-throughput sequencing technology to analyze phage-display results has been proposed as an alternative to Sanger sequencing or biological activity selection protocols. 10 The prevalence of clones is inferred from changes in sequence abundance along the selection process. Here we propose a workflow for sequence enrichment analysis after phage-display library panning, based on simple frequency changes. The ATTILA workflow is freely available on GitHub.
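A simple frequency-change calculation of this kind can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the ATTILA implementation. The function name `fold_changes` is hypothetical, and the `pseudocount` used for sequences absent from the unselected library is an assumption of this sketch, since the text does not say how such sequences are handled.

```python
from collections import Counter


def fold_changes(initial, selected, pseudocount=1):
    """Rank sequences by frequency in `selected` over frequency in `initial`.

    Both arguments are lists of translated domain sequences, one entry per
    read. A sequence unseen in `initial` is treated as if it had been seen
    `pseudocount` times (an assumption of this sketch).
    """
    if not initial or not selected:
        raise ValueError("both data sets must contain at least one sequence")
    init_counts = Counter(initial)
    sel_counts = Counter(selected)
    ranked = []
    for seq, n_sel in sel_counts.items():
        freq_sel = n_sel / len(selected)
        freq_init = max(init_counts[seq], pseudocount) / len(initial)
        ranked.append((seq, freq_sel / freq_init))
    # Most enriched sequences first.
    return sorted(ranked, key=lambda item: item[1], reverse=True)


# Toy usage with made-up sequence labels.
library = ["seqA"] * 1 + ["seqB"] * 9
round_4 = ["seqA"] * 8 + ["seqB"] * 1 + ["seqC"] * 1
for seq, fc in fold_changes(library, round_4):
    print(seq, round(fc, 2))
```

Normalizing each count by the size of its own data set keeps the ratio comparable when the library and the selected round were sequenced at different depths.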
The use of NGS for the identification of selected phage clones has led to the proposal of only a few automated workflows. REceptor LIgand Contacts (RELIC) 18 was one of the first software packages for phage-display analysis, and it enables users to align sequences and find motifs in phage-display experimental results. At present, it is limited by biases associated with the original phage-display technique. MIMOP 19 integrates 2-dimensional and 3-dimensional analyses to predict potential epitopic regions, performed by the MimAlign and MimCons programs, respectively. SLiMFinder 20 is a probabilistic method to identify short linear motifs (SLiMs) with a very high level of specificity and a low false discovery rate. It can be applied to many problems in this domain, including peptides from phage-display libraries.
VDJFasta 6 uses Hidden Markov Models to analyze antibody variable domain repertoires. N2GSAb 21 is a tool focused on HCDR3 that finds entire clones using a smart PCR strategy to derive phage clones. However, high-affinity antibodies contain uniquely mutated variable domains that account not only for affinity but also for selectivity and bona fide structural features, both equally important if the aim is to obtain clinical Abs. These subtle amino acid substitutions may be lost as a consequence of clonal skewing or CDR3-centered analysis. ImmuneDB 8 both stores and analyzes NGS immune receptor sequencing data by aggregating tools to process raw reads for gene usage, infer clones, aggregate data, and run downstream analyses. DEAL (Diversity Estimator of Antibody Library) 7 is an algorithm to estimate library complexity.
Finally, PHASTpep 22 makes it possible to discover peptides from phage display and NGS that target a selected cell type, which enhances clinical translatability by circumventing complications with systemic use.
PHASTpep is the software that is most similar to ATTILA, presenting several standard features such as translation of sequences, translation validation, enrichment frequency calculation, and normalization.
On the other hand, the ATTILA program translateab9 is able to detect the entire VH and VL domain sequences, using an efficient translation method in which the choice of the ORF is based not only on the absence of stop codons, but also on the presence of canonical immunoglobulin motifs. ATTILA also identifies the antibody germline genes that gave rise to these domains. The ATTILA workflow has already proved suitable for discovering antibodies to a given target: using a combination of the most enriched VH and VL sequences, a single-chain fragment variable (scFv) against an α-dystroglycan mucin glycopeptide was constructed and showed selective binding to the tumor cell surface. 23 It can also be used to select individual domains with improved ability to bind to their antigen (Supplementary Figure S4).
The successful acquisition of a high-affinity antibody phage clone depends on the size and diversity of the phage-display library. Hence, we estimated the size of the library by comparing 5 different sequencing experiments on a single library. As observed in Figure 3, every library sampling revealed a large number of unique sequences, mostly specific to a given experiment. Sequences appearing in all library samplings were scarce, suggesting that the sequenced library was large enough to keep yielding novel sequences across multiple sequencing events. This library was derived from human peripheral blood mononuclear cells (PBMCs), and its estimated size was
The diversity of individual VDJ reflects the actual size of the library of rearranged immunoglobulin gene fragments, but the size estimated by sequencing is distorted by experimental error. Experimental sequence error arises from either the PCR steps or the sequencing procedure, artifactually enlarging the library sequence universe. 24 It therefore leads to an overestimate of the actual variability of the VDJ gene fragment set. Here, the experimental error rate was measured indirectly, at the protein level, assuming that no Cysteine residues are naturally found between the conserved Cysteine residues in FW1 and FW3. Under this assumption, any additional Cysteine residue must reflect an artifact introduced during either PCR or sequencing. We observed a significant number of sequences containing additional Cysteines, suggesting that, at the protein level, at least 1% to 3% of observed sequences are artifactual.
Sequence abundance along selection cycles revealed the increase in selected phage clones. Counting VDJ sequence data sets prior
Conclusions
A workflow for deriving a rearranged antibody variable domain sequence was created and incorporated into a software package named ATTILA. The ATTILA workflow can retrieve VH and VL sequences by comparing a phage-display library,
Footnotes
Acknowledgements
The authors are thankful to Prof Concepta McManus for English language revision of the manuscript.
Funding:
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was financially supported by FAP-DF (Grant Number 0193.001784/2017) and Capes and CNPq support scholarships.
Declaration of Conflicting Interests:
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Author Contributions
A. Q. Maranhão, M. M. Brigido, and H. M. Silva: concept and design of the workflow. R. K. A. França, T. C. De-Leo, and R. T. Burtet performed the wet bench experiments. H. M. Silva and W. M. C. Silva: programming. A. Q. Maranhão, M. Dias-Baruffi, and M. M. Brigido directed all experiments. All authors contributed to writing and reviewing the manuscript.
