Abstract
Pathogen-host protein-protein interaction systems examine the interactions between the protein repertoires of 2 distinct organisms. Some of these pathogen proteins interact with the host protein system and may manipulate it for their own advantages. In this work, we designed an R script by concatenating 2 functions called rowDM and rowCVmed to infer pathogen-host interaction using previously reported microarray data, including host gene enrichment analysis and the crossing of interspecific domain-domain interactions. We applied this script to the
Introduction
Because infectious diseases are a major health problem worldwide, understanding the molecular mechanisms among pathogen-host interactions (PHIs) is key in addressing this concern. The design of an appropriate strategy to combat a specific pathogen depends on our understanding of the specific PHI. Most studies have focused on identifying protein-protein interactions (PPIs) within a single organism (intraspecies PPI prediction). It is difficult to infer new PPIs between 2 different species (interspecies PPI prediction) because the development of an interaction database depends on experimentally verified PHI data that are costly in time, equipment, and budget to produce.
1
Therefore, the design of computational strategies is worthwhile to elucidate infection mechanisms when experimental PHI data are scarce. The interactions between pathogen proteins and their hosts allow the pathogens to manipulate host cellular mechanisms for their own advantage, such as escaping from host immune responses. For instance, the
Materials and Methods
Data sets
All microarray data used in this work for
To predict DDI, we downloaded the collection of known and predicted DDIs from the database DOMINE v2.0 release 2010. 17 We converted the lists “INTERACTION” into comma-separated value files. This list is included in Additional file 1.
Because each
The algorithm
We implemented 2 new functions in R named: “rowDM,” which estimates the mean deviance for each row in a matrix and “rowCVmed,” which estimates the median variation coefficient for each row in a matrix. The output of both functions is a numerical vector ordered from the highest variability to 0; then, a percentile threshold is defined for each vector to choose a set of genes with extreme variability, ie, a desired amount of differentially expressed genes, such that its variability is greater than the percentile chosen.
The procedure used to identify PHI based on microarray and DDI data is explained in the following steps:
The R script is provided in the Supplementary Information S11.
Results and Discussion
Application of rowDM and rowCVmed functions
Toxoplasma GEO series analysis
On implementing the program, we sought to evaluate its performance in a descriptive manner, comparing it against the GEO2R tool (www.ncbi.nlm.nih.gov/geo/geo2r), which was used by other authors. The GSE44191 and GSE16115 series include RNA expression values for 3 type 1 toxoplasma strains (RH-ERP, RH-JSR, and GT1) from human foreskin fibroblasts (HFFs) infected for 24 hours with each of these strains. To evaluate the performance of the R program, we applied this code script to identify the
The most relevant differentially expressed gene candidates obtained with both functions from the
Abbreviations: FEA, functional enrichment analysis; GEO, Gene Expression Omnibus; NA, not applicable; NF-κB, nuclear factor κB.
Thereafter, we examine the GSE22315 series, which contains RNA values for 6 more representative toxoplasma strains (type I: GT1 and RH, type II: ME49 and Prugniaud, and type III: CTG and VEG), taken 12 hours after infection in HFF cells. For this series, we used the critical values in both functions to obtain a list with the 100 most variable genes in their RNA values (Table 1). Among them, we found that the ROP18 protein considered along with ROP5 as the main virulence factors in the toxoplasma type I genetic background. The ROP18 gene has low expression in the type III strains, considered less virulent, at least in the mice hosts.
30
Similarly, as in the other toxoplasma series, ROP38 along with ROP8, ROP46, ROP20, and ROP19A were also found with high variability in their expression, which was also observed with GEO2R (Table 1 and Supplementary Information S4A and S4B). Finally, we analyzed the GSE20145 series, which compares the RNA values for the 3 canonical
Host GEO series (human and mouse) and FEA
Continuing with the descriptive performance analysis of our script, we examined the host gene response during
After exploring
After that we examine the GSE27972 series that compares the RNA levels from mouse bone marrow–derived macrophages (BMdMs) infected with
Mapping toxoplasma Gene Ontology terms to Pfam entries
It was observed that ROP kinases in the
DDIs: “Pkinase PF00069 versus gene set domains” and mapping to the KEGG signaling pathways 04630 (JAK-STAT), 04064 (NF-κB), and 04010 (MAPK)
The previous human and mouse gene sets obtained with our functions from the GSE44189, GSE25468, GSE27972, and GSE81016 series were also mapped to Pfam domain entries to identify functional domains that could interact with the
In addition, because the differential expression of ROP38 influences HFF gene expression associated with the MAPK, JAK-STAT, and NF-κB signaling pathways,10,23 we remapped the gene sets obtained with both functions from the GSE44189, GSE25468, GSE27972, and GSE81016 series to the KEGG pathway IDs 04630, 04064, and 04010 to identify possible targets involved in some of these signaling pathways.
We observed interesting transcription factors such as the NFKB1 p105 subunit in the GSE25468 series and the NFKB inhibitor zeta (NFKBIZ) in the retinal human GSE81016 series. These 2 proteins contain inhibitory ankyrin repeat domains, which have been shown to interact with kinase activity proteins 32 (Table 2). Likewise, we also found suppression of SOCS2 and SOCS5 cytokine signaling in the GSE81016 series; these proteins are regulators of the JAK/STAT signaling pathway. Both SOCS2 and SOCS5 contain SH2 domains that can be phosphorylated by JAK proteins33,34 (Table 2). We saw important kinase proteins, such as PIK3R1, PRKCA, PRKCG, PRKCB, and the GTPase HRAS, that have been reported as key molecules in the MAPK signaling, which activate antiapoptotic genes. 35 In our list, we evidenced the presence of inflammasome activator MAP2K3 and the proapoptotic proteins NFKBIZ, MAP3K5, as well the c-JUN transcription factor (Table 2).
Domain-domain interaction.
Abbreviations: BMdM, bone marrow–derived macrophage; GEO, Gene Expression Omnibus; HFF, human foreskin fibroblast; NF-κB, nuclear factor κB; TF, transcription factor.
The most differentially expressed gene sets obtained from the functions rowDM and rowCVmed which have domains that can interact with the Pkinase domain PF00069. Those genes were also mapped for 3 signal pathways JAK/STAT, NF-κB, and MAPK. Red: survival factors HRAS, PIK3R1, PRKCA, PRKCG, and PRKCB that activate antiapoptotic genes. Green: MAP2K3 is related to inflammasome activation. Blue: MAP3K5, NFKBIZ, and JUN (TF) are apoptosis activators.
After all of these results, our hypothesis is that the lower expression of ROP38 in type I

Suggested proteins belonging to the MAPK signaling pathway (KEGG 04010) that could potentially interact with the Pkinase domain (PF00069), as observed in ROP38. On the left is a heat map showing ROP38 toxoplasma; this was the most differentially expressed gene among type I (RH), type II (Prugniaud), and type III (VEG). On the right, up/downregulated genes in human WERI retinal cells (GSE81016 series); these proteins contain domains to interact with PF00069 and are mapped for the MAPK signaling pathway.
Because there is no gold standard consensus to identify differentially expressed genes in array experiments, 37 most of the methodologies to identify differentially expressed genes in array experiments are based on a variety of statistical tests, such as analysis of variance38–40 or false discovery rate.41–43 We tried to evaluate the performance of our R script differently; first, we proposed critical values to obtain a free chosen output of differentially expressed genes; second, we compare the outputs obtained through our functions against those reported by other authors. It was shown that with our functions we could reproduce similar outcomes to those previously reported, even with a smaller gene set than that proposed by the original authors (Table 1). Similarly, all the differentially expressed gene sets obtained with our functions were also evidenced by the GEO2R tool. The aforementioned demonstrated that our R script can be used by other researchers, even those with no knowledge of R programming, and can be applied to other pathogen-host coexpression experiments to predict more PHI for specific domains. Our R script does not seek to compete with other approaches, such as GSEA, 44 or other R packages, such as limma or GSEABase45,46; rather, we wrote an R script code to identify plausible pathogen genes involved in pathogenesis, particularly those with high expression variability. Furthermore, using their functional domains, we aim to identify the host domains that are differentially expressed and potential interactors, as well as map the different cellular pathways.
Conclusions
Predicting interactions between host and pathogen proteins is a never-ending problem with important implications for public health. We have presented an R script that integrates differential gene expression calculations, enrichment analyses, and the crossing of interspecific DDI to predict interactions between pathogen and host proteins (PHI). When applied to the
Footnotes
Funding:
The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by COLCIENCIAS through grant 111356933664 and Universidad del Quindío.
Declaration of conflicting interests:
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Author Contributions
AFA and GES designed the approach and implemented the algorithm. JEG-M contributed to the discussion and revision of the manuscript. All authors read and approved the final manuscript.
Availability of Data and Material
Additional file 1 contains all the additional information.
References
Supplementary Material
Please find the following supplemental material available below.
For Open Access articles published under a Creative Commons License, all supplemental material carries the same license as the article it is associated with.
For non-Open Access articles published, all supplemental material carries a non-exclusive license, and permission requests for re-use of supplemental material or any part of supplemental material shall be sent directly to the copyright owner as specified in the copyright notice associated with the article.
