Abstract
Introduction
Recent next-generation sequencing (NGS) technology can produce a huge amount of genomic data for a wide array of bacteria. However, proteome data remain incomplete because many coding sequences lack a proper functional prediction, which makes it difficult to understand pathogenesis and virulence determination. The products of these sequences are labeled as hypothetical proteins (HPs).2,20,21 Nearly 30% to 40% of the genes in most bacterial genomes are classified as unknown or hypothetical.22 These HPs are predicted from translated nucleic acid sequences based on sequence similarity, but biochemical and functional characterization is necessary to confirm their existence experimentally.23 Therefore, the functional annotation of hypothetical proteins has become an important focus in bioinformatics.24 Homology-dependent gene annotation can assign functions to HPs based on their correlation with known proteins, providing knowledge of new structures, functions, interactions, and pathways.23-27
A well and precise annotation of
As
Many proteins from
Materials and Methods
The methodology overview flowchart is presented in Figure 1.

Methodology overview flowchart for the functional annotation and analysis of
Extraction of genomic data
The entire sequence of
Gene ontology prediction
To determine the HP’s functions, Blast2GO with an
Family and domain prediction
Conserved domains and protein functions were searched based on domain structure. The Simple Modular Architecture Research Tool (SMART) was used in general mode to identify and annotate genetically mobile domains of signaling, extracellular, and chromatin-associated proteins.34 Furthermore, NCBI Batch CD-Search was applied, which searches multiple protein sequences using RPS-BLAST to compare the query HP sequences against databases of conserved domain models.35 The Batch CD-Search tool searched against the CDD database (58235 PSSMs), and the threshold was set at 0.01.
The HMMER website provides access to the protein homology search algorithms of the HMMER 3.3.2 software suite and uses profile hidden Markov model libraries to annotate the HP sequences with protein families and domains.36
The cut off was set at 0.01 for significant
Subcellular localization determination
In the study, PSORTb v3.0 and CELLO v.2.5 (Subcellular Localization Predictor) were used with default parameters for gram-negative bacteria to determine the cellular locations of the HPs.42-44 The PSORTb database contains information obtained from both laboratory experiments and computational prediction.45 CELLO uses a 2-level support vector machine (SVM) system comprising 4 SVM classifiers, and the final assignment is determined by jury votes from these classifiers.20,43
Determination of transmembrane proteins
TMHMM 2.0 and HMMTOP 2.0 were run with default parameters to predict the transmembrane helices and topology of the HPs in the study.46,47 SignalP 5.0, which uses a neural network architecture involving a conditional random field, was used to predict the presence of signal peptides and the location of their cleavage sites.48
Physicochemical prediction parameters
The ProtParam tool in Expasy was used to compute several physical and chemical parameters of the HPs, such as molecular weight, theoretical pI, amino acid composition, instability index, aliphatic index, extinction coefficient, and grand average of hydropathicity (GRAVY).49
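As an illustration of what some of these parameters measure, the following minimal Python sketch computes GRAVY, the aliphatic index, and the extinction coefficient at 280 nm from a sequence. It is not the ProtParam implementation: it assumes the standard published formulas (Kyte-Doolittle hydropathy values; aliphatic index from the mole percent of Ala, Val, Ile, and Leu; 1490, 5500, and 125 M−1 cm−1 per Tyr, Trp, and cystine), and the function names and the toy sequence are hypothetical.

```python
# Kyte-Doolittle hydropathy scale (standard published values) used for GRAVY.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def gravy(seq):
    """Grand average of hydropathicity: mean Kyte-Doolittle value per residue."""
    return sum(KYTE_DOOLITTLE[aa] for aa in seq) / len(seq)


def aliphatic_index(seq):
    """Aliphatic index from the mole percent of Ala, Val, Ile, and Leu."""
    def mole_percent(aa):
        return 100.0 * seq.count(aa) / len(seq)

    return (mole_percent("A") + 2.9 * mole_percent("V")
            + 3.9 * (mole_percent("I") + mole_percent("L")))


def extinction_coefficient(seq, cys_paired=True):
    """Molar extinction coefficient at 280 nm in M^-1 cm^-1.

    With cys_paired=True, every pair of Cys residues is assumed to form a
    cystine; with cys_paired=False, all Cys residues are assumed reduced.
    """
    cystines = seq.count("C") // 2 if cys_paired else 0
    return 1490 * seq.count("Y") + 5500 * seq.count("W") + 125 * cystines


# Hypothetical toy sequence, not one of the HPs from the study.
toy = "MKWVTFISLLLLFSSAYSC"
print(round(gravy(toy), 3), round(aliphatic_index(toy), 2),
      extinction_coefficient(toy))
```

A positive GRAVY value indicates an overall hydrophobic sequence and a negative value a hydrophilic one, while the extinction coefficient depends only on the Trp, Tyr, and Cys content.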
Virulent HP detection
MP3 is a tool that can accurately predict virulent proteins in genomic and metagenomic data using an SVM and HMM approach.50 DeepVF uses a deep learning-based hybrid framework to identify virulence factors.51 The BLAST search tool in the Virulence Factor Database (VFDB) identified various virulence factors among the submitted HPs. VFDB contains information about virulence factors from several bacterial pathogens.52 Finally, PHI-base was used for virulence factor detection, as it contains curated information from research articles on genes affecting pathogen-host interactions. Only lethal and hypervirulence proteins were selected after completing a BLAST search (PHIB-BLAST) against PHI-base 4.12 protein sequences.53 Virulent HPs predicted by 2 or more tools were then identified and further analyzed.
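The "2 or more tools" rule can be sketched as a simple vote count. The snippet below is a minimal illustration, not the pipeline used in the study: it assumes each tool's output has already been reduced to a set of protein IDs flagged as virulent, and the function name and the IDs are hypothetical.

```python
from collections import Counter


def consensus_virulent(predictions, min_tools=2):
    """Return protein IDs flagged as virulent by at least `min_tools` tools.

    `predictions` maps a tool name to the set of protein IDs it flagged.
    """
    votes = Counter()
    for hits in predictions.values():
        votes.update(set(hits))  # each tool contributes one vote per protein
    return sorted(pid for pid, n in votes.items() if n >= min_tools)


# Hypothetical example input; the IDs are placeholders, not study results.
predictions = {
    "MP3": {"HP_001", "HP_002", "HP_003"},
    "DeepVF": {"HP_002", "HP_003"},
    "VFDB": {"HP_003"},
    "PHI-base": {"HP_004"},
}
print(consensus_virulent(predictions))  # ['HP_002', 'HP_003']
```

Requiring agreement between tools that rely on different evidence (machine learning models versus curated databases) trades some sensitivity for fewer false positives.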
Predictions of antigenicity, allergenicity, and toxicity index
The antigenicity of the virulent proteins was predicted using the VaxiJen v2.0 server 54 and the ANTIGENpro server. 55 Toxicity and allergenicity of those proteins were predicted using the ToxIBTL server 56 and the AllerCatPro v. 2.0 server, 57 respectively.
Protein-protein interaction
String 11.5 database was utilized to predict the protein-protein interactions (PPIs) for the proteins from
String 11.5 search for
The network analyzer plugin in Cytoscape 3.9.0 was used to validate the PPI networks.65,66 Cytoscape 3.9.0 was also used to better visualize the interactions of the potential virulent HPs with other proteins and among themselves. In Cytoscape, protein molecules are represented as nodes and molecular interactions as edges. Furthermore, the network analyzer tool can compute multiple network topological parameters, including node degrees, edges, neighbor interactions, and network characteristics.
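To make the node and edge terminology concrete, the following Python sketch computes two simple topological parameters, node degree and network density, from an undirected edge list. It is an illustrative stand-in rather than the Cytoscape plugin itself; the function names and the edge list are hypothetical.

```python
from collections import defaultdict


def build_network(edges):
    """Build an undirected neighbor map from (protein_a, protein_b) pairs."""
    neighbors = defaultdict(set)
    for a, b in edges:
        if a == b:
            continue  # ignore self-interactions
        neighbors[a].add(b)
        neighbors[b].add(a)
    return neighbors


def node_degrees(neighbors):
    """Degree of a node = number of distinct interaction partners."""
    return {node: len(partners) for node, partners in neighbors.items()}


def density(neighbors):
    """Fraction of possible undirected edges that are present."""
    n = len(neighbors)
    m = sum(len(partners) for partners in neighbors.values()) // 2
    return 0.0 if n < 2 else 2 * m / (n * (n - 1))


# Hypothetical interactions; duplicate edges are collapsed by the set.
edges = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "D"), ("A", "B")]
network = build_network(edges)
print(node_degrees(network))
print(round(density(network), 3))
```

A protein with a high degree, such as node C in this toy network, is a hub whose removal would disconnect other nodes.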
Results and Discussion
Functional annotation of E. cloacae HPs reveals their association with several biological, molecular, and cellular processes
A total of 604 proteins out of 4707 (12.83%) were labeled as HP in the
Blast2GO was used to perform a primary prediction for the HPs, which returned 214 HPs with known protein domains or families along with their GO IDs (Supplementary Table 1). This pool of 214 HPs was further analyzed with the NCBI Batch CD-Search, SMART, SUPERFAMILY, Pfam, and InterPro tools to assign functions (Supplementary Table 2). Among the 214 HPs, functional characterization with strong confidence was possible for 78 HPs, for which 3 or more tools predicted similar functions. The NCBI BLASTp tool was used to manually annotate the functions of these 78 HPs according to their homologous proteins (Table 1). Using multiple tools increases the reliability of the functional prediction. Moreover, as domains are the fundamental units of protein structure, folding, and function, domain identification is crucial for annotating the biological functions of a protein.67
Functionally annotated hypothetical proteins and their homologous accession from
GO function analysis
Analysis of the predicted GO terms for the 78 HPs revealed their association with different GO categories: biological process, cellular component, and molecular function (Figure 2). For biological process, 34 proteins were identified with distinct GO terms. Twelve of them were involved in protein transport, and 18 proteins had functions in metabolic processes. The cellular component category had 47 different GO terms, among which 38 were an intrinsic part of the membrane. Finally, among 53 GO terms in molecular function, 33 were enzymes and 22 were binding proteins.
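A tally of this kind can be sketched in a few lines of Python. The example below is only an illustration under stated assumptions: the annotations are taken to be available as (protein ID, GO category, GO term) records, the function name is hypothetical, and the records are invented placeholders rather than results from the study.

```python
from collections import defaultdict


def summarize_go(annotations):
    """Count distinct proteins per GO term, grouped by GO category.

    `annotations` is an iterable of (protein_id, category, term) records.
    """
    by_category = defaultdict(lambda: defaultdict(set))
    for protein, category, term in annotations:
        by_category[category][term].add(protein)  # sets ignore duplicates
    return {
        category: {term: len(proteins) for term, proteins in terms.items()}
        for category, terms in by_category.items()
    }


# Hypothetical records; the last one repeats the first on purpose.
annotations = [
    ("HP_001", "biological_process", "protein transport"),
    ("HP_002", "biological_process", "protein transport"),
    ("HP_002", "molecular_function", "ATP binding"),
    ("HP_003", "cellular_component", "membrane"),
    ("HP_001", "biological_process", "protein transport"),
]
summary = summarize_go(annotations)
print(summary["biological_process"])  # {'protein transport': 2}
```

Because one protein can carry several GO terms, counts within a category may sum to more than the number of annotated proteins.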

GO categories distribution of the HPs from
Enzymes
Enzymes produced by gram-negative bacteria play a significant role in the host: they provide support and nutrients for growth, create favorable growth conditions by modifying the local environment, drive the pathogenesis of several infections, and aid metabolism.68 A total of 33 proteins were characterized as enzymes, of which 15 are hydrolases and 11 are transferases.
Analysis of several infections by gram-negative anaerobes involving tissue invasion and inflammation, necrosis, or suppuration has revealed that hydrolytic enzymes have roles in the pathogenesis of infection.68 Furthermore, studies of different hydrolases have supported their potential role in pathogenesis.69-72 Four proteins (TOZ47235.1, TOZ47607.1, TOZ41437.1, and TOZ48018.1) were identified as α/β hydrolases, which are likely to be involved in immune system evasion and modulation, detoxification, and metabolic adaptation.73
The α/β hydrolases have also been found to play a major role as virulence factors in
Similarly, 11 proteins were identified as transferase enzymes. Transferases are necessary for lipoprotein biosynthesis and spore germination, and they aid the full virulence of bacteria.75 TOZ44254.1 was predicted to be a glycosyl transferase protein. Mutations in glycosyl transferase family proteins can alter extracellular polysaccharide and lipopolysaccharide synthesis, resulting in reduced disease symptoms.78,79 TOZ46360.1 and TOZ50295.1 were predicted to be a CDP-alcohol phosphatidyltransferase family protein and UDP-GlcNAc, respectively. Both families are associated with lipid biosynthesis.80-82 Alteration of the synthesized phospholipids has a crucial role in virulence and several human diseases.83,84
TOZ38888.1 and TOZ41165.1 were predicted to be lyase enzymes. Lyases have essential functions in the virulence of pathogenic gram-negative bacteria in the host.68 TOZ38888.1 is a pyridoxal-phosphate (PLP)-dependent enzyme; PLP-dependent enzymes are a ubiquitous class of biocatalysts. In several free-living prokaryotes, almost 1.5% of all genes encode PLP-dependent enzymes.85 PLP-dependent enzymes with desulphydrase activity help in amino acid metabolism and adaptation to nutrient sources in a new environment, and they can sometimes function as virulence factors.86,87
TOZ48897.1 was annotated as RpiB/LacA/LacB family sugar-phosphate isomerase. This family of proteins takes part in the lactose catabolism pathway. 88
Binding proteins
There are 22 proteins characterized as binding proteins, among which 5 proteins were DNA binding, 3 were RNA binding and 5 were ATP binding ones. HPs with DNA-binding function can contribute to the virulence by altering the expression of virulence factors, which have been observed during
TOZ46383.1 was found to be a CTP synthase, which converts UTP to CTP, a necessary step in the pyrimidine metabolic pathway of bacteria that cause community-acquired respiratory tract infection (RTI).95 In addition, TOZ48266.1 was identified as an ABC transporter 6-transmembrane domain-containing protein; such proteins are considered to have roles in nutrient uptake and drug resistance. Moreover, there is evidence that ABC transporters are directly or indirectly involved in bacterial virulence.96 Furthermore, TOZ50233.1 was characterized as a biotin-dependent carboxyltransferase protein. These proteins have roles in fatty acid, amino acid, and carbohydrate metabolism.97-100 Furthermore, their activity plays an important role in the virulence of organisms like
Transporter proteins
Eight proteins were characterized as having transmembrane transporter activity. TOZ50430.1 was characterized as a formate/nitrite transporter (FNT) protein. Bacterial FNTs monitor the transport of small monoacids.104
In addition, FNTs can perform as a virulence factor in
Regulatory proteins
Regulation in bacteria is a complex network system that controls the expression of various genes and maintains bacterial pathogenesis, growth, and survival.2 TOZ43300.1 was identified as a diguanylate cyclase, which functions in the regulation of cellular processes and in signal transduction. Interestingly, diguanylate cyclase is necessary for biofilm development. Its product, cyclic dimeric GMP (c-di-GMP), also acts as a messenger for bacterial virulence, motility, adhesion, secretion, and community behavior.109
TOZ47572.1 was predicted to be an alpha-2-macroglobulin (A2M) protein. In invasive bacteria, A2M proteins can structurally mimic proteins of eukaryotic innate immunity. Bacterial A2Ms are located in the periplasm, where they trap external proteases and provide cellular protection.110 Both pathogenically invasive and saprophytically colonizing species possess A2M, and most of them exploit higher eukaryotes as hosts. Therefore, bacterial A2Ms could serve as useful targets to increase vaccine efficacy in infections.111
Membrane protein
A total of 38 proteins were characterized as integral components of the membrane and 1 protein as an extrinsic component of the membrane. TOZ40775.1 was annotated as an OmpA family protein. Members of this family are surface-exposed porins with antiparallel β-barrels in the outer membrane.112 HMMTOP and TMHMM also predicted the presence of transmembrane helices in this protein (Supplementary Table 4). Several pathogenic roles, including adhesion, invasion, intracellular survival, and evasion of host defenses, have been assigned to OmpA. In various cases, OmpA proteins are considered potential vaccine candidates.112
TOZ49620.1 was annotated as a TerC family protein. This type of protein is widely found in bacterial species and may influence host-pathogen interaction.113
Moreover, TerC family proteins in
TOZ44410.1 and TOZ49766.1 were both characterized as EAL domain-containing proteins. The EAL domain is a ubiquitous signal transduction protein domain that hydrolyzes the second messenger cyclic dimeric GMP (c-di-GMP), its exclusive substrate.114,115 The second messenger c-di-GMP regulates many lifestyle aspects and the virulence of several gram-negative bacteria.116
Moreover, EAL domain protein VieA from
Virulent protein prediction
MP3, DeepVF, VFDB, and PHI-base were used to predict virulence factors with a high level of confidence. A total of 23 HPs were predicted to be virulent by 2 or more tools; the remaining HPs were either identified by only one tool or not predicted to be virulent at all (Supplementary Table 3). As virulence factors help bacteria to colonize and cause disease, knowledge of their biological functions and mechanisms is necessary to understand their role in bacterial pathogenesis.2 Moreover, virulence factors are potential therapeutic targets in bacterial infections.124 Characterized virulence factors include several secretion systems (type I to type VI secretory systems), 2-component signal transduction systems, quorum sensing, and biofilm formation.125,126 A large number of pathogenic bacteria rely on virulent proteins; therefore, identifying inhibitors of factors essential for virulence is a new research interest and a different molecular approach from traditional drug discovery.127 Annotated virulent HPs can support a better target-based approach and aid against bacterial infections as an adjunct therapy to different antibiotics.125
Virulent HPs with therapeutic potential
The antigenicity of the virulent HPs was studied, and 7 of them were found to have antigenic potential. All 7 proteins are likely to be non-allergenic and nontoxic. The subcellular localization of the proteins was also explored, and the 7 antigenic proteins were either membrane-bound or periplasmic (Table 2). Our findings suggest that each of these 7 proteins could be a promising candidate for vaccine development.128-131
Prediction of antigenicity, allergenicity, toxicity, and subcellular localization of the virulent HPs.
Abbreviation: HPs, hypothetical proteins.
Subcellular localization and physicochemical prediction
In the study, the amino acid sequences of the 78 HPs were analyzed using various tools, such as PSORTb v3.0, CELLO v.2.5, TMHMM 2.0, HMMTOP 2.0, and ProtParam, to assess their subcellular location and physicochemical properties (Supplementary Table 4). However, more attention was paid to the virulent HPs that were predicted to have roles in pathogenesis.
The cellular location, secretion or signaling ability, and transmembrane helices of the 23 HPs were predicted. Nine of them were found to have transmembrane helices predicted by both HMMTOP and TMHMM (TOZ49766.1, TOZ48809.1, TOZ40775.1, TOZ44410.1, TOZ43300.1, TOZ45909.1, TOZ49630.1, TOZ47361.1, and TOZ42186.1). CELLO predicted 19 of the 23 proteins to be inner membrane, outer membrane, or periplasmic proteins. However, PSORTb predicted 9 proteins as cytoplasmic or cytoplasmic membrane proteins and 7 proteins as outer membrane proteins. The SignalP 5.0 server predicted that 10 of the 23 proteins contain signal peptides for several secretion pathways. Five of them were predicted to carry standard secretory signal peptides cleaved by Signal Peptidase I, and 5 more were predicted to carry lipoprotein signal peptides cleaved by Signal Peptidase II. All 10 proteins were predicted to be transported by the Sec translocon.
The theoretical pI is the pH at which a molecule carries no net electric charge and therefore does not move in a direct-current electric field.132,133 For the virulent proteins, the theoretical pI ranged from 4.58 to 9.47, and the molecular weight of these 23 virulent HPs ranged from 11390.68 to 179998.3 Da. Together, these 2 parameters can guide 2D gel electrophoresis in laboratory experiments. The extinction coefficient of the virulent HPs at 280 nm ranged from 8450 to 228165 M−1 cm−1, depending on the Cys (cysteine), Trp (tryptophan), and Tyr (tyrosine) content. The extinction coefficient indicates the amount of light absorbed by a protein at a specific wavelength, which is useful when purifying and separating a protein with a spectrophotometer. The high extinction coefficients of some HPs result from a high content of Cys, Trp, and Tyr.132,134,135 The instability index estimates the stability of a protein in a test tube. Proteins with an instability index below 40 are predicted to be stable.136 In the study, the instability index of the 23 predicted virulent HPs ranged from 20.3 to 59.09, and 16 of the 23 proteins were stable. Stable proteins have a longer half-life. The half-lives of several virulent effector proteins are integral to their function. For example, in
PPI of virulent proteins
Interactions between proteins play a fundamental role in the biological processes of an organism.141 Through PPI, the cellular functions of proteins can be analyzed, since the execution of a function depends on contact or regulatory interactions with other proteins.60,142 Furthermore, PPI can be used to infer the function of an unidentified or hypothetical protein from the evidence of its interactions with the known proteome of a particular organism, as proteins rarely act without interacting with other biomolecules. Therefore, the PPI network is required to understand protein function and complexity as well as biological networks and pathways.60,143,144
PPI network analysis was performed for the 23 predicted virulent proteins to identify their functions and roles in pathogenesis. Only 20 of them were identified by STRING (Supplementary Table 5) and interactions between them and other
TOZ48059.1 is an efflux transporter outer membrane subunit protein that interacts with 18 different proteins. It interacts strongly with 2 two-component system sensor kinase proteins, a multidrug efflux periplasmic linker protein, and a macrolide transporter ATP-binding/permease protein (ECL_A036, ECL_04898, ECL_00055, ECL_02770). These proteins help bacteria survive antibiotics and contribute to virulence.125,145-147 TOZ48059.1 also interacts with at least 5 cus proteins (cusA, cusB, cusF, cusR, cusS). The Cus protein complex helps maintain copper homeostasis and mediates resistance to copper stress by cation efflux.148,149 The innate immune system often harnesses the toxic properties of copper to help the host kill bacteria. Bacteria counter this defense by relying on copper tolerance genes for virulence within the host.150
The proteins TOZ41378.1 and TOZ40438.1 are Cu(+)/Ag(+) efflux RND transporter outer membrane proteins and demonstrated interactions with 18 and 16 proteins, respectively (Fig S1). They interact strongly with each other, with TOZ48059.1, and with most of its interacting proteins. These 3 proteins remain in one cluster. The cluster appears to function as a 2-component regulatory system with high strength (log10 observed/expected value of 1.43). The majority of the interacting proteins also contain a histidine kinase domain and a GAF domain, which are associated with osmoregulation, hyphal development and virulence in bacteria like
TOZ48307.1 is an outer membrane lipoprotein carrier protein, which interacts with 15 other proteins. These proteins also form a cluster (Fig S1). TOZ48307.1 interacts with 4 acyl carrier proteins (ACP) (ECL_04843, ECL_04852, ECL_048550, and ECL_04854). In
Finally, TOZ43300.1, which was predicted to be a diguanylate cyclase, interacts with 115 different proteins (Figure 3). The interacting proteins were mostly related to the 2-component regulatory system, biofilm formation, diguanylate cyclase activity, and intracellular signal transduction. Environmental factors help induce the formation of bacterial biofilms, which are multicellular microbial communities encased within an extracellular matrix. Bacteria use the 2-component signal transduction system (TCS) strategy to connect changes in environmental input signals to changes in physiological output and to coordinate the input signals that control biofilm formation.160
In several

Protein-protein interaction network of protein TOZ43300.1, which is a diguanylate cyclase.
Conclusions
Hypothetical proteins form a large portion of a bacterial proteome and play crucial biological roles. Identifying and functionally annotating these proteins will help us understand the organism better. For this study, 78 HPs were from
Supplemental Material
sj-jpg-1-bbi-10.1177_11779322221115535 – Supplemental material for Functional Annotation of Hypothetical Proteins From the Enterobacter cloacae B13 Strain and Its Association With Pathogenicity
Supplemental material, sj-jpg-1-bbi-10.1177_11779322221115535 for Functional Annotation of Hypothetical Proteins From the Enterobacter cloacae B13 Strain and Its Association With Pathogenicity by Supantha Dey, Sazzad Shahrear, Maliha Afroj Zinnia, Ahnaf Tajwar and Abul Bashar Mir Md. Khademul Islam in Bioinformatics and Biology Insights
sj-xlsx-2-bbi-10.1177_11779322221115535
sj-xlsx-3-bbi-10.1177_11779322221115535
sj-xlsx-4-bbi-10.1177_11779322221115535
sj-xlsx-5-bbi-10.1177_11779322221115535
sj-xlsx-6-bbi-10.1177_11779322221115535
Footnotes
Acknowledgements
The authors acknowledge high performance computing facility support from Centre for Bioinformatics Learning Advancement and Systematics Training (cBLAST), University of Dhaka. The authors also acknowledge support of Biomolecular Research Foundation (BMRF), Dhaka, Bangladesh.
Author Contributions
ABMMKI conceived the project. SD, SS, and AT collected the data; SD, SS, and MAZ performed the analyses. SD, SS, MAZ, and ABMMKI wrote the manuscript. The manuscript was reviewed and approved by all authors.
Funding:
The author(s) received no financial support for the research, authorship, and/or publication of this article.
Declaration of Conflicting Interests:
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Data and Software Availability
All data are provided in the tables, figures, supplementary file, and supplementary tables. This work used freely and publicly available software and tools, most of them online and a few offline. The necessary links and references for the software and tools are provided in the Materials and Methods section.
Supplemental Material
Supplemental material for this article is available online.
References
