Abstract
Multiple sequence alignments are usually phylogenetically driven and studied in the framework of evolution. Sometimes, however, it is useful to study residue conservation at positions unconstrained by evolutionary rules. We present a supervised method to access a layer of information that is difficult to appreciate visually when many protein sequences are aligned. This new tool (MAGA; http://cbdm-01.zdv.uni-mainz.de/~munoz/maga/) locates positions in multiple sequence alignments that are differentially conserved in manually defined groups of sequences.
Introduction
Protein sequence alignments are based on the comparison of residues to detect similarities, which usually leads to conclusions about whether residues are conserved, mutated, deleted, or inserted in evolution. Tools like Clustal Omega, 1 T-Coffee, 2 MAFFT, 3 and MUSCLE 4 are some of the algorithms used daily by researchers to align sequences into multiple sequence alignments (MSAs). Here, we present a tool that postprocesses such alignments to facilitate the inference of residue properties specific to sequence groups defined by the user. This is a crucial step in the use of protein sequence alignments for the prediction of protein function.
The idea of locating specificity-determining sites (SDSs) by subfamily analysis of the sequences dates back to 1993. 5 Many other approaches have been developed since then following this line of thought.6-8 All of them use unsupervised approaches that automatically place the sequences in groups by analyzing the SDSs, some also taking into consideration local physicochemical properties and phylogenetic information.
For exploratory purposes, we believe that a user-supervised approach, in which the user can adapt the groups considered, can be of great help. Consider, for example, a situation in which the user has functional information about the proteins of a family that is independent of the phylogeny. Examining the conservation in groups of proteins defined by the user may reveal residues associated with that specific functional information.
In this article, we describe the MAGA (Motifs from Annotated Groups in Alignments) web tool, a simple way to infer conservation information from manually defined groups of sequences in alignments. We also provide 2 case studies to exemplify its use and showcase how it can provide meaningful results to researchers in a transversal way, not necessarily phylogenetically based.
Main Text
Workflow
MAGA takes as input an MSA of protein sequences in FASTA format. In a first iteration, MAGA preprocesses the alignment and generates a profile with the residues shared by all input sequences, as a normal aligner would do. Then, it lets the user allocate the sequences into up to 6 groups and label them. Group assignments can be made manually or by uploading a file with assignments. The file must have 1 row per assignment with 2 columns: a sequence ID and the number identifying its assigned group. Next, MAGA produces a profile per group with both the shared residues and the group-conserved residues, colored depending on the assigned group. To consider a residue at a position as conserved in a group, it must meet the following conditions:
There must be an amino acid that is more prevalent than all the other amino acids together for that group in that position.
The most prevalent amino acid must be more prevalent than the gaps in that group at that position.
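The two conditions above can be sketched as a small check on one alignment column of one group. This is a minimal illustration and not MAGA's actual implementation; the function name `conserved_residue`, the use of `-` as the gap character, and the assumption of uppercase one-letter codes are ours.

```python
from collections import Counter

GAP = "-"  # assumed gap character; the article does not specify it


def conserved_residue(column):
    """Return the group-conserved residue of one alignment column, or None.

    `column` is a string (or list) with the characters found at one
    alignment position in the sequences of a single group.
    """
    counts = Counter(column)
    gaps = counts.pop(GAP, 0)
    if not counts:
        return None  # the column contains only gaps
    top, top_n = counts.most_common(1)[0]
    others = sum(counts.values()) - top_n
    # Condition 1: more prevalent than all other amino acids together.
    # Condition 2: more prevalent than the gaps.
    if top_n > others and top_n > gaps:
        return top
    return None


print(conserved_residue("AAAG-"))  # A
print(conserved_residue("AAGG"))   # None
```

Both comparisons are strict, following the wording "more prevalent than", so a tie between two amino acids or between the top amino acid and the gaps yields no conserved residue.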
The color code of the profiles displaying the group-conserved amino acids follows the notation:
Conserved in all sequences → Colored in green.
Per group, conserved in >50% but <75% of the sequences → Colored in red, blue, indigo, orange, violet, or black, depending on the group.
Per group, conserved in ⩾75% but <100% of the sequences → In italics and colored in red, blue, indigo, orange, violet, or black, depending on the group.
Per group, conserved in all the sequences → In bold and colored in red, blue, indigo, orange, violet, or black, depending on the group.
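The per-group tiers of this color code could be expressed as follows. This is a sketch under stated assumptions: the order in which colors map to group indices, the function name `group_style`, and the choice to return `None` at or below 50% are ours and are not confirmed by the article. The green "conserved in all sequences" case applies to the whole alignment and is kept as a separate constant.

```python
# Assumed mapping of group index (0-5) to color; the article lists the
# colors but does not state which group receives which one.
GROUP_COLORS = ["red", "blue", "indigo", "orange", "violet", "black"]

# Style for residues conserved in all sequences of the alignment.
ALL_SEQUENCES_STYLE = ("green", "plain")


def group_style(n_conserved, n_sequences, group_index):
    """Map conservation within one group to (color, emphasis), or None.

    Integer arithmetic is used so that the 50%, 75%, and 100% thresholds
    are compared exactly.
    """
    if not 0 <= n_conserved <= n_sequences or n_sequences == 0:
        raise ValueError("invalid counts")
    color = GROUP_COLORS[group_index]
    if n_conserved == n_sequences:
        return (color, "bold")    # conserved in all sequences of the group
    if 4 * n_conserved >= 3 * n_sequences:
        return (color, "italic")  # at least 75% but below 100%
    if 2 * n_conserved > n_sequences:
        return (color, "plain")   # above 50% but below 75%
    return None                   # assumption: not highlighted


print(group_style(3, 4, 1))  # ('blue', 'italic')
```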
The group assignment of the sequences can be iteratively modified. As an example, in a first iteration, one may cluster the sequences [A, B, C, D, E, F] based on taxonomy as [A, B, C] + [D, E, F], then differently based on a phenotype as [A, C] + [B, E] + [D, F], and finally, based on whether they present a sequence feature or not as [A, E, F] + [B, C, D]. Previous results are shown directly below the current results so that the results from different arrangements can be easily compared.
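As an illustration of the assignment file described in the workflow (1 row per assignment, an ID and a group number), the following sketch parses such a file and reproduces two of the example arrangements. The whitespace delimiter, the function name `parse_assignments`, and the error handling are assumptions; the article only specifies the two columns and the limit of 6 groups.

```python
from collections import defaultdict


def parse_assignments(text, max_groups=6):
    """Parse 'ID group_number' rows into {group_number: [IDs]}.

    Columns are assumed to be separated by whitespace (tabs or spaces).
    """
    groups = defaultdict(list)
    for line_no, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # skip blank rows
        fields = line.split()
        if len(fields) != 2:
            raise ValueError(f"line {line_no}: expected 2 columns")
        seq_id, group = fields[0], int(fields[1])
        groups[group].append(seq_id)
    if len(groups) > max_groups:
        raise ValueError(f"more than {max_groups} groups defined")
    return dict(groups)


# Same six sequences, grouped first by taxonomy and then by phenotype.
taxonomy = parse_assignments("A 1\nB 1\nC 1\nD 2\nE 2\nF 2\n")
phenotype = parse_assignments("A 1\nB 2\nC 1\nD 3\nE 2\nF 3\n")
print(taxonomy)   # {1: ['A', 'B', 'C'], 2: ['D', 'E', 'F']}
print(phenotype)  # {1: ['A', 'C'], 2: ['B', 'E'], 3: ['D', 'F']}
```

Keeping each arrangement as its own file makes it easy to rerun the same alignment under a different grouping.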
MAGA offers the possibility to convert the amino acids in the alignment to categories, to detect regions with similar physicochemical properties. The equivalences used are as follows: [0] = DE (negatively charged); [1] = RHK (positively charged); [2] = FWY (aromatic); [3] = IVL (aliphatic); [4] = STNQ (with polar uncharged side chains); [5] = A; [6] = M; [7] = G; [8] = P; and [9] = C. One can even use the same grouping with and without this feature, to compare their outputs, as shown in the section “Case study 1: Argonaute protein family.”
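The category conversion can be sketched as a simple lookup table built from the equivalences listed above. The function name `to_categories` and the decision to leave gaps and any other characters unchanged are assumptions made for illustration.

```python
# Category digit -> amino acids, as listed in the text.
CATEGORIES = {
    "0": "DE",    # negatively charged
    "1": "RHK",   # positively charged
    "2": "FWY",   # aromatic
    "3": "IVL",   # aliphatic
    "4": "STNQ",  # polar uncharged side chains
    "5": "A",
    "6": "M",
    "7": "G",
    "8": "P",
    "9": "C",
}

# Inverted table: amino acid -> category digit.
AA_TO_CATEGORY = {aa: cat for cat, aas in CATEGORIES.items() for aa in aas}


def to_categories(sequence):
    """Replace each amino acid by its category digit.

    Characters outside the 20 standard amino acids (for example gaps)
    are left unchanged; this is an assumption, not documented behavior.
    """
    return "".join(AA_TO_CATEGORY.get(ch, ch) for ch in sequence)


print(to_categories("DEKF-G"))  # 0012-7
```

After conversion, the same conservation conditions can be applied to the digits, so that a column mixing I, V, and L counts as conserved in category 3.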
Case study 1: Argonaute protein family
We prepared an MSA of 34 sequences from the Argonaute protein family, from 3 subfamilies: AGO, PIWI, and CE (Supplemental File 1). The CE is an Argonaute

Case study 1: Argonaute family. Results obtained in MAGA when using as input an MSA with 34 sequences from 3 related protein subfamilies (CE, PIWI, and AGO) and executed with amino acids and with categories. (A) Sequences are grouped in MAGA based on their subfamily, and conserved residues in the alignment positions 1060-1081 are highlighted depending on the groups in which they are conserved: [CE] + [AGO] (pink), [PIWI] + [AGO] (blue), [CE] + [PIWI] (yellow), and all (green). (B) Structure of human protein AGO3 (PDB:5VM9); in red, RNA chain; in colors based on their conservation, alignment positions 1060-1081. (C) Detailed interaction RNA-helix, with [CE] + [AGO] conserved residues (pink) pointing toward the RNA chain. MAGA indicates Motifs from Annotated Groups in Alignments; MSA, multiple sequence alignment; RNA, ribonucleic acid.
Case study 2: evolution of fish anti-freeze proteins
Fish have several families of anti-freeze proteins (AFPs). The type II family constitutes a suitable case study for MAGA because its members originated from multiple independent events of duplication and evolution from C-type 4 lectins.11 Lectins are vertebrate proteins that bind carbohydrates, and in particular, C-type lectins have a carbohydrate recognition domain with 2 coordinated Ca2+ ions and 4 conserved cysteine bridges.12 The carbohydrate binding is calcium-dependent because the binding pocket is constructed with one of the coordinated Ca2+ ions.13 However, we would expect that the AFPs evolved from lectins will no longer need to bind carbohydrates and will lose the selective pressure to keep the pocket that recognizes the carbohydrate. To test this hypothesis with MAGA, we searched for fish homologs of the type II anti-freeze protein of the sea raven (

Case study 2: Evolution of fish anti-freeze proteins. Results obtained in MAGA when using as input an MSA with 64 proteins, which include 24 fish anti-freeze proteins (AFPs) and 40 C-type 4 lectins (CLs). These AFPs evolved from duplicated CLs. (A) Phylogenetic tree derived from the MSA. AFPs are marked with red Xs. Their position in different branches indicates that they emerged in multiple independent events. The outlier at the bottom is the human C-type lectin domain family 4 member E. (B) Part of the MSA with CLs at the top and AFPs at the bottom. The label “CL” or “AFP” has been added to the sequence identifiers as a prefix for clarity. Residues are colored according to type and conservation, and the GEPNN motif has been marked (using Jalview).16 MAGA conservation computed for the region is shown above. Red boxes indicate the differentially conserved GEPNN motif, present in human C-type lectin domain family 4 member E (PDB:3WH2), which is less conserved in the AFPs (eg, it is “TKPDD” in the AFP from
Conclusions
Amino acid conservation in an MSA reflects the phylogenetic proximity of the sequences, functionality, or both. The idea behind MAGA (http://cbdm-01.zdv.uni-mainz.de/~munoz/maga/) is to find positions conserved due to functionality in sequences that are not necessarily phylogenetically related. Taxonomic information in this respect is still important to rule out the evolutionary component when detecting a conserved residue in a manually defined group of sequences.
Although it is trivial to detect group-conserved residues or regions when working with small MSAs, the task becomes much harder for larger MSAs and when analyzing conserved physicochemical properties in a position, not just amino acids, which are more difficult to detect visually. The MAGA web tool addresses this problem by allowing the detection of sequence conservation in a transversal, non-phylogenetically driven way. We believe it will help users to detect meaningful biological functions and motifs in an exploratory analysis.
Supplemental Material
Supplemental material, SuppFile1.fasta.txt for MAGA: A Supervised Method to Detect Motifs From Annotated Groups in Alignments by Pablo Mier and Miguel A Andrade-Navarro in Evolutionary Bioinformatics
Supplemental material, SuppFile2.fasta.txt for MAGA: A Supervised Method to Detect Motifs From Annotated Groups in Alignments by Pablo Mier and Miguel A Andrade-Navarro in Evolutionary Bioinformatics
Footnotes
Acknowledgements
The authors thank Miguel Almeida and René Ketting (IMB/JGU, Mainz) for insightful discussions regarding the Argonaute protein family.
Funding:
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by Deutsche Forschungsgemeinschaft (AN735/4-1 to M.A.A.-N.).
Declaration of conflicting interests:
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Author Contributions
P.M. and M.A.A.-N. conceived the project. P.M. developed and implemented MAGA. M.A.A.-N. generated and analyzed the case studies. M.A.A.-N. supervised the project. P.M. and M.A.A.-N. drafted the manuscript, and read and approved the final manuscript.
Availability of data and materials
Supplemental material
Supplemental material for this article is available online.
References
