Abstract
While it has long been thought that all genomic novelties are derived from existing material, many genes lacking homology to known genes were found in recent genome projects. Some of these novel genes were proposed to have evolved de novo, ie, out of noncoding sequences, whereas some have been shown to follow a duplication and divergence process. Their discovery called for an extension of the historical hypotheses about gene origination. Besides the theoretical breakthrough, increasing evidence has accumulated that novel genes play important roles in evolutionary processes, including adaptation and speciation events. Different techniques are available to identify genes and classify them as novel. Their classification as novel is usually based on their similarity to known genes, or lack thereof, detected by comparative genomics or by searches against databases. Computational approaches are further prime methods that can be based on existing models or leverage biological evidence from experiments. Identification of novel genes remains, however, a challenging task. With constant updates to software and technologies, no gold standard, and no available benchmark, the evaluation and characterization of genomic novelty is a vibrant field. In this review, the classical and state-of-the-art tools for gene prediction are introduced. The current methods for novel gene detection are presented; the methodological strategies and their limits are discussed along with prospective approaches for further studies.
Introduction
To build and run living cells, organisms store a great part of their necessary biological program in genes, segments of their DNA sequences. In ever-changing environments, organisms competing for resources must adapt and evolve. The emergence of novel functions at the gene level provides such a possibility. Changes in the biological program can mediate unprecedented alternatives that can lead to adaptation. Changes are subjected to natural selection, may contribute to the emergence of innovative traits, and lead to the adaptation and evolution of organisms. 1
The understanding of relationships between genes is one of the fundamental bases of molecular evolution analyses, phylogeny, and many other fields, from pure theoretical biology to applied biotechnology. 1
Based on the sequence similarity, or homology, between genes, it is possible to establish associative links between genes from different genomes. The analysis of the yeast genome 2 uncovered for the first time a set of genes without homologs in other species. These genes were named novel genes, and new hypotheses were proposed to explain their presence regarding the molecular evolution theory. After their initial discovery, novel protein-coding genes were found in most, if not all, newly sequenced genomes in amounts of 10%-20%.3–6 The importance of the functions encoded by novel genes has been underestimated for a long time. They play important roles in crucial biological functions, including developmental processes, sexual reproduction, behavior, or morphological phenotypic traits.7,8 Furthermore, it has been shown that novel genes can become essential in a short time span in Drosophila melanogaster.9,10
In this study, a review of the current theories of novel gene emergence is provided with a special emphasis on the methodological and computational challenges brought by a special type of novel genes, the de novo genes.
The first part of this review also provides an introduction to the biological background to define what genes consist of, and by extension what a novel gene is. Several types of genes exist, the first described and the most well known being the protein-coding gene. As a result, the first novel genes studied were also protein-coding genes, but more and more attention is currently given to nonprotein-coding genes. As evidence for the presence of novel genes in various species accumulated from different research groups, several definitions of what a novel gene is emerged. Historically, novel genes were thought to originate by duplication of ancient, also called parental, genes followed by independent divergence. 11 However, this theory is nowadays regarded as incomplete, as it excludes the concept of de novo genes. De novo genes are previously nonfunctional genomic sequences that acquired enough modifications to become transcribed. The number of de novo genes is not as low as one would expect; for instance, Wu et al. 3 found 60 genes classified as de novo in the human genome.
The second part of this review is dedicated to the methodology for gene detection/prediction. These methods can be divided into two groups: those relying primarily on biological experiments, such as high-throughput sequencing, combined with software to analyze them, and fully computational methods relying on existing biological knowledge and datasets. Prior to the classification and study of novel genes, the classical annotation of the whole set of genes in one or several genomes in a considered clade is the essential first step. Gene prediction can be performed ab initio, by a comparative approach using annotations of known genes in other genomes, or by relying on biological experiments. Ab initio gene prediction methods make use of different intrinsic properties of the genomic sequence of a species, such as the nucleotide or k-mer composition of a sequence or the length of an open reading frame (ORF). Additional evidence for gene prediction can be extrinsic, such as gene predictions from sister species, genic features, syntax analysis, or known regulatory elements. Biological experiments such as RNA sequencing (RNA-seq) provide a common material to improve gene prediction of transcribed genes. 12 Nowadays, the majority of gene models are computationally predicted, and while they may be supported by high-throughput sequencing, they are rarely validated experimentally.
What Are Novel Genes?
Proving that a gene is novel is a difficult task. In this section, we present a more detailed description of the characteristics a gene needs to exhibit to qualify as novel.
What is a gene?
First, a novel gene needs to be categorized as a gene, ie, a genomic nucleotide sequence that possesses the features of a gene. Several types of genes exist, the first type described being the protein-coding gene. A protein-coding gene is characterized based on the presence of some specific elements at the genomic level. 13 To be protein-coding, the gene must have a coding sequence (CDS). The CDS is a crucial part of a gene as it is the sequence transcribed into RNA and later on translated into an amino acid sequence.
However, a protein-coding gene displays other features that are almost as important. It is debated whether those elements are an actual part of the gene or, more precisely, only adjuncts enabling a gene to fulfill its function. 14 For instance, cis-regulatory elements, in the untranslated regions (UTR) of the gene, before the CDS (5′-UTR) or after (3′-UTR), may fulfill many different roles 15 : (a) a polyadenylation signal (3′-UTR) that protects the messenger RNA (mRNA) from degradation and in eukaryotes mediates its transportation outside of the nucleus, (b) a promoter that may contain the transcription start site (TSS) and will mediate the initiation of transcription, and (c) transcription factor binding sites (TFBS) that are necessary for a guided transcription to start. For most eukaryotic genes, their structure is not continuous but made of successive exons and introns.16,17 Upon gene expression, the transcription machinery produces an RNA copy of the gene from which introns are spliced out. Alternative splicing may also subtract or combine targeted exons, enabling optional combinations of different exons, potentially altering the function of the product. The remaining concatenated exons form the mRNA that is later translated into a peptide by the ribosomal machinery.
Rules and structures that define classical genes are already flexible and need to be even more flexible to define novel genes.
What is novel?
Primarily, novelty is defined as the absence of similarity with other sequences. This definition is mostly assessed by comparison of the primary sequences to the knowledge to date, archived in public databases.
Different terms associated with slightly different definitions are currently used regarding the study of novel genes in biology. Hereafter, a definition is proposed for the terms “orphan gene”, “taxonomically restricted gene”, “novel gene”, and “de novo gene”6,18 (see definition boxes). The “orphan” adjective was originally associated with genes that are specific to a single organism. However, with the emerging next-generation sequencing techniques and a rapidly increased coverage of sequenced species, many formerly assumed species-specific genes were subsequently found in other organisms. The term “taxonomically restricted” was then introduced to refer to genes not only limited to a single organism but also found in phylogenetically related species. 19
Orphan genes
Orphan (or taxonomically restricted) genes are classified based on a given phylogeny. A gene that is only found inside a single species or a branch, but not outside, is orphan in that specific branch.
Novel genes
Novel genes are classified by their age. Genes that have emerged inside a defined time frame are novel genes. The time frame is not fixed and needs to be defined for each study. All novel genes are orphan in a specific clade, but, depending on the time frame, not all orphan genes are classified as novel.
De novo genes
De novo genes are defined based on their mechanism of emergence, ie, out of previously noncoding DNA. This might, eg, occur via acquisition of transcriptional regulation, consecutive point mutations, or genomic rearrangements.
In this paper, the term “orphan gene” is used interchangeably for both cases, original “orphans” and “taxonomically restricted genes”. The emergence of orphan genes can be explained by at least three events. 20 (a) Fast-evolving genes can diverge beyond the level of recognition of homology searches. The homology of such fast-evolving genes cannot be traced back to their ancestral genes, but they are not entirely novel genes. (b) Genes can be lost in the other species they are compared with, leading to incorrect dating and their false classification as novel. A gene that is lost in other species might be found as a pseudogene, ie, a gene-like fragment with homology to the novel gene that is not transcribed could survive in the DNA sequence. 21 (c) True orphan genes appear on a specific phylogenetic branch and evolved in a specific lineage. This evolution of orphan genes might be mediated by recombination mechanisms acting on previously available DNA or occur de novo.
Therefore, not all orphan genes are necessarily novel genes, whereas all novel genes can be qualified as orphan in a specific phylogenetic tree. The distinction between the more general group of orphan genes, ie, novel and ancient taxonomically restricted genes, and the narrower group of novel genes is based on the emergence time of the studied gene. 22 It has to be kept in mind that no universal time limit is clearly defined to make the distinction between orphan and novel genes. This time limit is strongly dependent on the set of species studied, in particular the sampling of the species and how representative of the true evolutionary path the set is. An important subset of novel genes is grouped under the term de novo. 23 De novo genes are characterized by their mechanism of origin, which consists of their creation out of previously noncoding sequences. 8 In this paper, a separation will be made between the terms de novo and orphan, based on the mechanism of origin of the gene, out of noncoding sequence or not. The general term novel will be used to refer to the emergence of a gene in a phylogenetic tree, keeping in mind that the emergence should be recent. Figure 1 shows the different terms explained on a phylogenetic tree.

Figure 1. Definition of novel genes based on phylogeny. Circles in the tree show gain of orphan (gray), novel (red), and de novo (green) genes.
Origins of novel genes
Many other remarkable features can be noted regarding the origin and outcome of novel genes. Several events can mediate novelty at the gene level, including, but not limited to, duplication, truncation, elongation, juxtaposition, fusion, and translocation of genes, mediated by recombination mechanisms or nucleotide mutations.24,25 Prevalent recombination mechanisms are (nonallelic) homologous or illegitimate recombination. Transposable element activity is able to include RNA sequences in the genome by retroposition 26 and can shuffle the structure of the genome and sequences containing genes or parts of genes. Further mechanisms, ie, all mechanisms that are able to insert any sequence in the genome, can lead to the formation of novel genes indirectly with various degrees of implication. Sequences can be inserted in proximity to existing regulatory elements, which enables their transcription. Conversely, regulatory motifs can be inserted in the vicinity of untranscribed regions or rewire the transcriptional response. 27 One example of such an indirect supporter of gene emergence is horizontal gene transfer, which results in the transfer of a gene or parts of a gene from one organism (eg, a virus) to another organism. Horizontal gene transfer has been shown to play important roles in prokaryotes, 28 but is far less characterized in eukaryotes.29,30
Different outcome scenarios are possible for genes impacted by a duplication or translocation event. (a) Duplicated genes can undergo a subfunctionalization process where the original function of a gene is separated into subfunctions. Both genes are needed to provide the original function. 31 (b) Neofunctionalization describes a process where one duplicated gene evolves and acquires a new function that might lead to its classification as a novel gene. 31 The original gene is still able to fulfill its original function while the duplicate can accumulate mutations under relaxed selective pressure. (c) Another common case is gene fusion, denoted by the emergence of a new gene by joining two neighboring genes followed by intergenic splicing. 24 (d) An existing gene can be extended with nongenic DNA through the loss of a stop codon or modification of splicing sites. (e) Novelty can also encompass the expression profile of the genes when regulatory elements are modified. It is possible that novel features make use of the existing regulation from neighboring genes or benefit from steady transcription happening in peculiar genomic regions. 32
Novel or mutated sequences can have profound implications, such as lowered selection pressure. Signaling reprogramming can occur when a novel gene product disturbs the network of protein interactions responsible for signal transmission or transduction. Newly transcribed sequences or changes in regulation can alter protein dosage, enzyme activity, or specificity. Selection can also act on noncoding RNA genes, ie, genes not translated into proteins, trans-regulatory elements, or de novo gene creation. Noncoding RNA genes, historically more complicated to study, gained interest once computational methods and power became available to achieve predictions. The contribution of functional noncoding RNA genes, often understated in genome projects, might nevertheless encompass a wide panel of functions: structural contribution, protein complex formation or stabilization, catalysis, regulation, immune defense, protein synthesis, or self-propagating elements like retrotransposons. The number of characterized noncoding RNA genes is constantly increasing; most of them have been linked to regulatory functions. 33
The distinction whether a gene is newly derived from an ancient gene after a duplication process or evolved de novo can be made on the basis of the evolutionary process involved. Duplicated, or shuffled, genes that use existing regulatory elements of the parental gene and share the classical features of gene structure might be detected with conventional methods of gene prediction. Though novel genes that emerged after duplication have not evolved out of a noncoding sequence, they might have diverged beyond the level of recognition as paralogs, resulting in their false classification as de novo. Buljan et al. 24 showed fusion of exons from neighboring genes to be the most common mechanism of origin for novel genes that emerged after a duplication process.
New genetic material can also be created fragment-wise as protein domains and thus create a novel gene by modification of the old domain layout. Domains are evolutionary or structurally conserved units that can be used as building blocks for proteins 34 and can subsequently be integrated into proteins. A domain-centric view on proteins offers at least three advantages for the annotation of novel genes. (a) Many novel genes are not derived from previously noncoding sequences, but from rearrangements or reuse of existing protein domains. This modular use of domains can explain many novel proteins when an ancient domain is found in the gene. The domain might also be found within a context of low sequence similarity to other genes that include that domain. (b) Protein domains are able to gain multiple copies in a genome and a single novel domain can enable the emergence of multiple novel genes in a short time span. 34 (c) Protein domains are usually described by probabilistic models, namely, hidden Markov models (HMMs). The search with HMMs is far more sensitive than it would be directly with amino acid sequences. Furthermore, breaking down genes into domains is also a more robust way to explain novel genes and compare them with the existing ones based on domain content. A domain-centric view on gene or exon shuffling leads to a similar finding of domain fusion as the main factor of novel domain arrangements, followed by fission of a domain arrangement into two distinct arrangements.35,36
De novo gene creation, in contrast, can be defined as the emergence of regulated transcription of a hitherto untranscribed DNA fragment. The acquisition of transcription might be achieved when a promoter sequence is newly created by mutations or inserted by DNA rearrangement events. Two competing models propose mechanisms for the de novo emergence of protein-coding genes. In the transcription-first model, a DNA sequence is transcribed to RNA. The now genic DNA can acquire an ORF afterward under evolutionary pressure. 10 An ORF of a certain length is not needed before transcription occurs in the transcription-first model. The second model that is considered here is the evolution of new genes through proto-ORFs. 37 In the proto-ORF model, the first step is the presence of an ORF, which is occasionally transcribed afterward by acquiring regulatory elements. Noncoding RNA genes might arise in an analogous way to the two models for protein-coding genes. In the case of noncoding RNA genes, conservation is not given for an amino acid sequence or protein structure, but for the secondary structure of the RNA sequence. De novo genes, in comparison to ancient genes, tend to share common properties such as a fast evolution, shortness, and fewer exons, 38 as well as a lower transcription level, 39 a tissue-specific transcription, 40 and a higher abundance of transposable elements. The fast evolution and simplicity of de novo genes support the concept of initial transcription and translation before a gene needs to gain more complex elements of regulation or splicing.
The functional characterization of novel genes is still in its infancy. Functional annotation of sequences is mostly based on homology to genes of known function and the subsequent annotation with, eg, Gene Ontology terms, 41 which is not possible in most cases by the definition of novel genes. The annotation of novel genes with protein domains might be possible, but many domains also lack functional annotation. The remaining possibilities of functional characterization involve experiments such as knock out or knock down of those genes, or the analysis of their expression under different conditions. Li and Wurtele, 42 for example, showed a successful characterization of an Arabidopsis thaliana orphan gene by expression in soybean.
Methods to Detect Novel Genes
Biological/analytical methods for gene detection
High-throughput generation of transcriptomic data has been an essential contribution to the annotation of known genes and the detection of novel genes. While assessing the gene expression in a biological sample, transcriptomic methods provide evidence of which portions of the genome are effectively transcribed. The comparison of transcriptomic data with known gene models of an organism can result in the prediction of transcribed sequences hitherto unknown as genes. Here, two major technologies based directly on experimental biological evidence are reviewed: RNA-seq and ribosome profiling.
RNA-Seq consists of the determination of the RNA sequences in a given biological sample coming from cells under normal activity or cells under a controlled stimulus. In the perspective of novel protein-coding gene detection, efforts are put on sequencing mRNA. 43 Several RNA-seq solutions exist. The most popular RNA-seq methods are the Illumina platforms, relying on massively parallel sequencing of short DNA fragments. While affordable and offering the possibility of deep sequencing to accurately assess the expression level or mutations, the technology requires computational processing to reconstruct best-effort, full-length RNA sequences and extract the biological sense of the sequences. In the perspective of novel gene detection, the long-read technology from Pacific Biosciences displays valuable merits regarding identification of isoforms and improved sensitivity. 44
Different methods exist to handle and interpret short-read data, of which mapping of the reads on a reference sequence and de novo assembly of the reads into transcripts are the most advanced. The choice of a method mostly depends on the availability of a reference genome, its quality, completeness, accuracy, and its distance to the studied subject. When available, the genome sequence can be used to place the short fragments in correct order by a mapper. 45 De novo assembly of the RNA sequences, on the contrary, will only use the read overlaps to merge and extend them into continuous consensus sequences. Both strategies have strengths and drawbacks. Mapping methods will likely produce better quality gene models if the genomic sequence of the studied species is available, but will likewise fail to detect signals in individual-specific structural variations. Pure de novo methods will sample isoforms and individual-specific transcripts, but might fail on accurate boundary predictions. A few methods allow a combination and fusion of the two strategies, resulting in overall improved gene models.46,47
However, not all transcripts necessarily encode a protein, 33 and therefore, novel genes can be missed when only targeting mRNA. The global cell RNA pool, along with protein-coding mRNAs, is populated with several other RNA classes such as ribosomal, transfer, long noncoding, small nuclear, micro, and small interfering RNAs (rRNA, tRNA, lncRNA, snRNA, miRNA, and siRNA, respectively). Though the roles of noncoding RNA now receive growing interest, the experimental and computational challenges that underlie the prediction of their functions have led researchers to mainly focus on protein-coding genes.
The ribosome profiling technique,48–51 by directly focusing on RNA fragments protected by ribosomes, is well adapted to the detection of ORFs and novel protein-coding genes. The protected fragments are sequenced and a direct reading of which genes are translated is possible. The two techniques, namely, RNA-seq and ribosome profiling, need the use of a read mapper utility such as TopHat 52 to align the sequenced reads against the reference genome. Transcriptional properties inferred with RNA-seq methods can help to identify further novel genes. The overall lower expression of novel genes in general can give hints to hitherto unknown genes, as well as a tissue-specific expression of novel genes in, eg, testis for animals.39,40
Another promising technique for completing genome annotation is the use of mass spectrometry (MS). The advances in high-throughput MS have opened a new field termed proteogenomics.53–55 A typical application of proteogenomics consists of mapping short peptides produced by MS techniques to protein sequence databases. The technique and its implications are reviewed in the study by Nesvizhskii. 56 The method of proteogenomics has, for example, been successfully applied in A. thaliana, 53 mouse, or human 57 to create a precise annotation of known genes and to discover novel protein-coding genes. Many microbial genomes that lack a high quality of annotation can benefit from proteogenomics to improve gene prediction. 58 Moreover, the impact of proteogenomics can be of significant importance in nonmodel organisms. However, assessing the correctness of a peptide identification is a challenging problem and is highly sensitive to false discovery. 56
Combining technologies also seems to be a promising perspective. For example, in their study, Sun et al. 59 used a combination of tandem MS and RNA-seq data to detect ORFs in wrongly annotated noncoding RNA sequences.
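The idea of merging reads by their overlaps can be illustrated with a toy example. The following is a minimal sketch, assuming error-free reads from a single strand and a greedy merge of the longest suffix-prefix overlap; the function names and the `min_len` parameter are hypothetical, and production assemblers rely on more elaborate graph-based strategies than this.

```python
def overlap(a, b, min_len):
    """Length of the longest suffix of `a` that equals a prefix of `b`.

    Returns 0 when no overlap of at least `min_len` bases exists.
    """
    max_len = min(len(a), len(b))
    for n in range(max_len, min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of sequences with the longest overlap."""
    contigs = list(dict.fromkeys(reads))  # drop duplicate reads, keep order
    while True:
        best_n, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:
            return contigs  # nothing left to merge
        merged = contigs[best_i] + contigs[best_j][best_n:]
        contigs = [c for k, c in enumerate(contigs)
                   if k not in (best_i, best_j)] + [merged]


reads = ["ATGGCGT", "GCGTACC", "TACCTGA"]
print(greedy_assemble(reads))  # ['ATGGCGTACCTGA']
```

The greedy choice keeps the sketch short but can misassemble repeats and does not model sequencing errors or isoforms, which is one reason boundary predictions from pure de novo methods can be inaccurate.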
All biological data need to be processed computationally, where the mapping of short sequence reads to the genome is the most crucial part.
Computational methods for gene detection
When no biological data are available, algorithms are used to predict genes directly from the genomic sequence. Most genome databases, like Ensembl, have their own pipeline for gene annotation, 60 but they are not designed for the detection of novel genes. 61 In this section, a review of the classical gene annotation methodology is provided along with the methods aiming at detecting novel genes.
Gene prediction can be accomplished with different types of information. External genetic data can be used to annotate genomic DNA sequences using candidate genes found by similarity of known features of other species. Homologs may be provided by databases of DNA, cDNA, or amino acid sequences. 62 Projector 63 and GeneWise 64 are programs that compare and align two related DNA sequences and predict the gene structure of one sequence based on the gene structure of the second sequence, assuming that similar sequences share a similar gene structure. Gene prediction based on homology usually makes use of fast heuristic alignment programs such as Blast 65 or Exonerate. 66
Ab initio gene prediction software uses intrinsic properties of the sequence to find genes. Thus, for the prediction of a protein-coding gene, a strong emphasis is put on the detection of an intact ORF. ORF detection can be done with tools such as getorf included in EMBOSS 67 or OrfPredictor. 68 The frequency of nucleotide hexamers, or in general k-mers, is a good predictor for coding and noncoding sequences. Accordingly, k-mer frequency is used by several gene prediction programs, eg, SORFIND 69 or Genview. 70 HMM approaches with the k-mer composition are used by more recent programs such as GeneMark 71 and Eugene. 72 Other approaches look for recognizable parts in the sequence. As most protein-coding genes consist of more than one exon, the splice site prediction can be used to determine the exon boundaries of a gene and thus the gene structure. An example of a splice site predictor based on multiple sequence alignments is SPLICEVIEW, 73 whereas NetGene2 74 uses a neural network approach. Splice site prediction can be achieved using RNA-seq data with the KisSplice 75 or FineSplice tools. 76 The RSVP tool predicts splice variants of genes based on a genomic sequence and incorporates information from RNA-seq reads. 77 A comparison of the computational effort of these tools and their dependence on biological data is given in Table 1.
Table 1. List of tools and methods used for novel gene prediction.
The Augustus 78 program combines both intrinsic and extrinsic methods of gene prediction. Modern gene prediction pipelines like Maker 79 or Ensembl 80 use a combination of all methods to get the best possible evidence for each predicted gene.
A few properties can give clues about the coding potential of a sequence. The most widely used properties to predict coding potential are the sequence homology to known databases, the presence of a stop codon, and the amino acid composition of a sequence. These properties need to be inferred and processed with statistical analysis or machine learning approaches used by programs such as CPC 81 or CPAT. 82 Other techniques, based on statistical scores or evolutionary simulation, such as PhyloCSF, 83 ReEVOLVER, 84 or the t1/2 statistic, 85 are also giving promising results to detect the coding potential of a nucleotide sequence. However, methods based on evolutionary simulation require the targeted potential genes to have accumulated enough mutations to be distinguished from other genes. These methods need improvement to be used in the context of novel genes, because the recent appearance of a gene is not necessarily followed by accumulation of mutations.
The first accurate and widely used gene prediction tools were GENSCAN 86 for eukaryotes and GLIMMER 87 for prokaryotes. GENSCAN and the successor of GLIMMER, GLIMMERHMM, detect the presence of a gene in a DNA sequence using only the input genome sequence and a generalized hidden Markov model (GHMM) to define a general gene structure. New methods that use other sources of information, such as sister genomes, multiple sequence alignments (MSA), expressed sequence tags (EST, short sequenced cDNA fragments), or combinations of them, now outperform GENSCAN; examples of such tools are TWINSCAN and N-SCAN. 88 FragGeneScan is a tool that can predict protein-coding regions from short-read data, using sequencing error models and codon usage with an HMM. 89 GHMM is a popular machine learning methodology among these tools, but has the disadvantage of requiring the development of a statistical model a priori. The CONTRAST 90 software, using supervised machine learning algorithms such as support vector machine and conditional random field, has been shown to outperform the previously cited methodologies.
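To make the notion of ORF detection concrete, the following is a minimal sketch that scans the three reading frames of both strands for stretches running from an ATG to the first in-frame stop codon. It is not the algorithm of getorf or OrfPredictor; the function names and the `min_codons` threshold are hypothetical, and the standard stop codons are assumed.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
START_CODON = "ATG"


def reverse_complement(seq):
    """Reverse complement of an upper-case DNA string (A, C, G, T)."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def find_orfs(seq, min_codons=3):
    """Return (strand, start, end, orf) tuples for ATG...stop stretches.

    Coordinates are 0-based, end-exclusive, and refer to the scanned
    strand (for "-" that is the reverse complement). The stop codon is
    included in the reported ORF and in the codon count.
    """
    seq = seq.upper()
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == START_CODON:
                    start = i  # first ATG opens the ORF
                elif start is not None and codon in STOP_CODONS:
                    if (i + 3 - start) // 3 >= min_codons:
                        orfs.append((strand, start, i + 3, s[start:i + 3]))
                    start = None  # look for the next ORF in this frame
    return orfs


print(find_orfs("CCATGAAATAGCC"))  # [('+', 2, 11, 'ATGAAATAG')]
```

In practice the length threshold matters a great deal for novel genes: de novo genes tend to be short, so a threshold tuned to conserved genes may discard them, while a permissive one yields many spurious ORFs.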
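As an illustration of how k-mer composition can be turned into a coding-potential score, the sketch below compares k-mer frequencies learned from a set of known coding sequences with those from noncoding sequences. It is a simplified log-odds score under stated assumptions (overlapping k-mers, a uniform pseudocount) and does not reproduce the scoring of CPC, CPAT, or any other cited program; all function and parameter names are hypothetical.

```python
import math
from collections import Counter


def kmer_counts(seqs, k=6):
    """Count all overlapping k-mers in a collection of sequences."""
    counts = Counter()
    for s in seqs:
        s = s.upper()
        for i in range(len(s) - k + 1):
            counts[s[i:i + k]] += 1
    return counts


def log_odds_score(seq, coding_counts, noncoding_counts, k=6, pseudocount=1.0):
    """Mean log ratio of coding vs noncoding k-mer frequencies.

    `coding_counts` and `noncoding_counts` are Counters as returned by
    kmer_counts. Positive values mean the sequence looks more like the
    coding training set; 0.0 is returned for sequences shorter than k.
    """
    seq = seq.upper()
    n_kmers = 4 ** k  # number of possible k-mers, used for smoothing
    c_total = sum(coding_counts.values()) + pseudocount * n_kmers
    n_total = sum(noncoding_counts.values()) + pseudocount * n_kmers
    scores = []
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        p_coding = (coding_counts[kmer] + pseudocount) / c_total
        p_noncoding = (noncoding_counts[kmer] + pseudocount) / n_total
        scores.append(math.log(p_coding / p_noncoding))
    return sum(scores) / len(scores) if scores else 0.0


# Tiny demonstration with k=2 so that the toy training sets are informative.
coding = kmer_counts(["ATGATGATG"], k=2)
noncoding = kmer_counts(["TTTTTTTTT"], k=2)
print(log_odds_score("ATGATG", coding, noncoding, k=2) > 0)  # True
```

A score of this kind inherits the bias of its training set: if the coding model is trained on ancient, well-conserved genes, young genes whose composition still resembles noncoding DNA may score low.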
Gene classification
The previously listed methods of gene detection are canonical and not specific to the detection of novel genes. The classification of a gene as novel can only be made based on comparison with other species. After the detection of a gene, one needs to find if this gene is novel by searching for potential homologs. Sequence databases represent current biological knowledge and are used to classify a gene as a novel gene. Most sequence databases are synchronized among each other, but might differ in their treatment of features such as splicing variants, including sources, or just in their composition or meta content. Refseq and Uniprot are two major curated protein sequence databases, publicly available and maintained, which are suitable for purposes of gene prediction and classification. A common basis for the classification of annotated genes as novel or orphan is the known gene repertoire of a species. Orphans can be detected by homology searches of known genes to gene databases. A gene that lacks homology to any other gene in another clade can be classified as orphan. NCBIs blastp program has been shown to be sufficient at the identification of homology 91 based on sequence similarity. Different expectation value (E-value) cutoffs for blastp are used in literature for the classification of homology, ranging from 10-10 to 10-3 E-value cutoffs5,91 scale with the size of the used target database and should be adapted to the performed study. A drawback of orphan finding by homology is fast evolving genes, which can be falsely classified as novel by lack of detectable homology to their ancestors. Methods of homology detection with increased sensitivity that are not only based on sequence similarity would help circumvent the problem of hidden homology and support novel gene detection.
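The homology-based classification can be sketched as a filter over search results. The example below assumes that blastp was run with tabular output (`-outfmt 6`, whose default columns place the query ID first, the subject ID second, and the E-value eleventh) and that the result lines are already in memory. The function name, the `ignore_subject` hook, and the default cutoff are hypothetical choices for illustration, not a recommended protocol.

```python
def candidate_orphans(query_ids, blast_lines, evalue_cutoff=1e-3,
                      ignore_subject=lambda subject_id: False):
    """Return the queries without any accepted hit at the given cutoff.

    `blast_lines` are tab-separated lines in default BLAST tabular layout.
    `ignore_subject` lets the caller skip hits that should not count as
    evidence of homology, eg, proteins of the focal species itself.
    """
    with_homolog = set()
    for line in blast_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment headers
        fields = line.split("\t")
        query, subject, evalue = fields[0], fields[1], float(fields[10])
        if ignore_subject(subject):
            continue
        if evalue <= evalue_cutoff:
            with_homolog.add(query)
    return [q for q in query_ids if q not in with_homolog]


hits = [
    "g1\tsp2_a\t90.0\t100\t10\t0\t1\t100\t1\t100\t1e-50\t200",
    "g2\tsp2_b\t30.0\t40\t28\t0\t1\t40\t1\t40\t0.5\t25",
]
print(candidate_orphans(["g1", "g2", "g3"], hits))  # ['g2', 'g3']
```

Here g2 has only a weak hit above the cutoff and g3 has no hit at all, so both remain candidates. Rerunning the filter with several cutoffs is a simple way to see how strongly the orphan count depends on this choice.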
Novel gene finding can be performed during species comparison analyses by clustering known genes based on orthology. 92 Programs like OrthoMCL 93 or ProteinOrtho 94 take the annotated gene sets of different species as input and cluster them into families. Genes that are not clustered with any other gene, ie, that have no orthologous genes in the other species, can be considered potential novel genes. The treatment of paralogs is important when designing a clustering approach and has to be considered together with a second important choice: how many manually selected species are included in the analysis, and which ones. The number of genes lacking homology to other genes that are found by a clustering is directly linked to the number and relatedness of the species used for the clustering. Species that are closely related to the species of interest can yield reliable results with more orthologs than more distantly related species; however, more distantly related species with better gene annotation should also be considered.
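The selection of unclustered genes described above can be sketched as follows. This is a minimal illustration that assumes the clustering result has already been parsed into lists of gene identifiers; it does not reproduce the output format of OrthoMCL or ProteinOrtho, and all names are hypothetical. Clusters that contain genes of a single species only (groups of paralogs) are treated as lacking orthologs, which is one possible way of handling paralogs.

```python
def genes_without_orthologs(gene_to_species, clusters, focal_species):
    """Return genes of focal_species that share no cluster with another species.

    gene_to_species: dict mapping every annotated gene ID to its species.
    clusters: iterable of gene-ID lists, one list per inferred family.
    Genes that appear in no cluster at all are returned as well.
    """
    has_ortholog = set()
    for cluster in clusters:
        species_in_cluster = {gene_to_species[gene] for gene in cluster}
        # A family spanning at least two species provides orthology support.
        if len(species_in_cluster) > 1:
            has_ortholog.update(cluster)
    return sorted(
        gene
        for gene, species in gene_to_species.items()
        if species == focal_species and gene not in has_ortholog
    )
```

With this choice, a species-specific family of two paralogs is reported in full, whereas a stricter design could require that candidates have no similar sequence at all. Adding or removing a species from the input changes the result directly, which reflects the dependence on species sampling noted above.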
Phylogenetic approaches can be used after a gene clustering to find orphan genes that are restricted to a clade. A defined set of species that represents a clade of interest, together with a phylogenetic tree of these species, can serve as input for a phylogenetic analysis. Once genes have been clustered by homology, orthologous genes can be assigned to branches of the tree, which gives the time of emergence of orphan genes. Phylostratigraphy is a similar approach to estimate the age of genes and to find orphan genes and their point of emergence.95,96 Phylostratigraphy uses a set of species defining outgroups at important ancestral nodes in a phylogenetic tree. The gene content of each node is inferred by comparing the gene content of descendant nodes with that of the corresponding outgroup. Subsequently, the branch where a gene emerged can be assigned and defines its clade and time of emergence.
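The age assignment described above can be sketched in a few lines. The sketch assumes that homology searches have already been run and that each species in the database has been mapped to the index of the oldest node it shares with the focal species on the focal lineage (0 for the oldest node, the last index for the focal species itself). The function name and the data layout are hypothetical simplifications, not the procedure of a specific phylostratigraphy program.

```python
def assign_phylostratum(hit_species, species_to_stratum, lineage):
    """Return the name of the oldest lineage node supported by a homolog.

    hit_species: species in which the gene has a detectable homolog.
    species_to_stratum: dict mapping a species to the index (in lineage)
        of its last common ancestor with the focal species.
    lineage: node names ordered from oldest to youngest; the last entry
        is the focal species itself.
    """
    strata = [species_to_stratum[species] for species in hit_species]
    # Without any hit the gene is assigned to the youngest stratum,
    # ie, it is a candidate species-specific orphan.
    oldest = min(strata, default=len(lineage) - 1)
    return lineage[oldest]
```

A single distant hit is enough to place a gene in an old stratum, so the result depends on the same E-value cutoff and database choices as any other homology-based classification.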
Whether a putative novel gene is actually novel or has been lost in other species might be clarified by the existence of a pseudogene. Pseudogenes can be detected by searching for homology to putative novel genes at the DNA level and can be compared with a database of collected pseudogenes such as Pseudogene.org. 97 The PseudoPipe pipeline predicts pseudogenes in the genome using BLAST and a clustering algorithm, but is limited to mammalian genomes. 98 Alternatively, the classification might be confounded when the observed loss of the gene results from errors or a lack of evidence at the genome assembly or gene prediction steps. Such issues can be resolved at a rather small scale by simple experiments such as targeted PCR or targeted resequencing.
Novel genes can also be defined based on their domain content. The coverage of proteins with domains is nowadays usually quite good and ever improving. The proportion of proteins annotated with at least one protein domain ranges from ~50% in plants to ~75% in insects.99–101 This domain coverage is derived from PFAM, one of the most widely used protein domain databases. 102 Novel genes, in terms of domains, can arise through the emergence of a novel domain and/or the rearrangement of ancient domains. The main difficulty in detecting genetic novelty lies in the quality of the domain models, as it deeply impacts the domain annotation of a protein. To detect and link the appearance of a novel gene to domain rearrangements or the emergence of a novel domain, the domain content of several sister species needs to be compared. Comparison with sister species provides better phylum coverage of sequenced genomes and proteomes, which is crucial for the analysis. Protein domains present in sister species are assigned to a phylogenetic tree, and the content of the ancestral nodes is predicted using a parsimony reconstruction method such as Dollo parsimony. Dollo parsimony limits the number of times that a character, in this case a domain, can be gained in a tree and is therefore well suited to the study of domain gains, which are assumed to be rare events. The domains present in the ancestral nodes can be used to detect the time frame of emergence of novel domains and genes with a program like ProteinHistorian. 103 It has been shown that different mechanisms can enable the origination of new domain arrangements, such as domain gain or loss, fusion, or fission. However, the comparative analysis needed to reliably determine the presence of new domain arrangements requires genomes of high quality, good domain annotations, and reliable phylogenetic trees. Domain annotations use HMMs that are based on MSAs of parts of protein sequences.
By definition, a novel or de novo sequence is less likely than a well-studied sequence to have known sequences to align to in an MSA, and is subsequently less likely to be annotated by an HMM. Therefore, the detection of novel domains based only on MSAs is limited to clades with a high number of species or to genes with many orthologs or paralogs.
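The Dollo parsimony reconstruction mentioned above can be illustrated with a minimal sketch. If a domain may be gained only once, the gain is placed at the last common ancestor of all species that carry the domain, and every absence below that node is interpreted as a loss. The tree encoding (a child-to-parent dictionary) and the function names are assumptions made for this example; it is not the implementation used by ProteinHistorian.

```python
def path_to_root(node, parent):
    """Return the list of nodes from node up to the root, node included."""
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    return path


def dollo_gain_node(species_with_domain, parent):
    """Return the node at which a domain is gained under Dollo parsimony.

    species_with_domain: leaves of the tree in which the domain is present.
    parent: dict mapping every non-root node to its parent node.
    Returns None when the domain is absent from all species.
    """
    species = list(species_with_domain)
    if not species:
        return None
    # Ancestors of the first species, ordered from the leaf to the root.
    candidates = path_to_root(species[0], parent)
    for leaf in species[1:]:
        ancestors = set(path_to_root(leaf, parent))
        # Keep only nodes that are ancestors of every species seen so far.
        candidates = [node for node in candidates if node in ancestors]
    # The first remaining node is the last common ancestor.
    return candidates[0]
```

For a tree ((A, B) N1, C) root, a domain found only in A and B is gained at N1, whereas a domain found in A and C is gained at the root and counted as lost in B. The second case shows how a single annotation error in one species can shift the inferred time of emergence, which is why high-quality annotations are required.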
Other methodologies can be used to circumvent the intrinsic MSA problem and detect true novel domains. 104 The Seg-HCA method has been proposed to annotate domains on whole proteomes. 105 The method is based on the hydrophobic cluster analysis (HCA) of protein sequences, which uses the hydrophobic pattern of a sequence.106,107 Seg-HCA discriminates hydrophobic clusters to delineate sequences with domains. These annotated portions of sequences are closely related to the structural definition of a protein domain with the presence of a hydrophobic core.
Discussion
Recently developed sequencing techniques can give new insights into the biology of species. Sequencing of single strains, at either the DNA or the RNA level, can help to detect very recent population-specific genes and transcripts. Zhao et al. 40 found 142 putative de novo candidate genes in six D. melanogaster strains with RNA-seq methods. The analysis of population data might make it possible to find very recent sequence changes that lead to the creation of genes, as well as to support previously detected putative genes. Further perspectives in sequencing techniques include, for example, single-cell sequencing, new sequencing devices for long reads, and strand-specific sequencing. Single-cell sequencing is a promising technique for sequencing bacterial species that cannot be cultivated or tissues that currently cannot be sequenced with reliable results. Different cells of an individual can be studied to find genes and their possible evolution, for example, in cells involved in the immune system or tumor cells, in which the transcription of genetic elements that correspond to novel genes could be activated in a deleterious way. 108 The enhancement of current gene prediction with antisense transcripts, assisted by strand-specific RNA-seq, can lead to the detection of further ancient and novel genes. 109 A review of current RNA-seq technologies and future perspectives is provided in Ref. 43. The MinION, a small novel sequencing device by Oxford Nanopore, is theoretically able to produce sequence reads as long as the underlying DNA or RNA chain. 110 Longer sequence reads, as produced by novel techniques such as the MinION or Pacific Biosciences technologies, 111 can help to improve mapping quality and circumvent problems with repeated sequences or structural variations in the genome, leading to better evidence for gene predictions. The sequencing of more closely related species will also help in the classification of novel genes.
The study of novel genes should not be limited to novel protein-coding genes. The minimal size of genetic features, either protein-coding genes or RNA genes, has been reevaluated in the past few years. Genes, previously thought to need a minimal size of 200-300 base pairs, have been shown to be functional as smaller units such as small secreted proteins acting as trans-effectors, antimicrobial peptides, and small RNAs. Several types of small peptides have been assigned to different functional profiles, for example, enriched in signaling or regulation. 112 The loss of minimal size requirements and the discovery of genes other than protein-coding genes strongly affect the prediction of novel genes, as, for example, a minimum gene length or the presence of an ORF is no longer a required property of a gene. De novo genes might lack recognizable regulatory elements, depending on their current state of evolution.
Additionally, gene structure is not always canonical and can hold or miss features that are currently used for gene prediction. Several RNA transcripts might be reliably mapped to the same genomic location, which can be the result of different mechanisms. While alternative splicing leads to more than one transcript per gene, usually of different lengths, it is also possible that two genes overlap at the DNA level, either on the same or on the opposite strand. The phenomenon of overprinting describes the case in which a gene is “embedded” in another one through different start and stop sites in a genetic sequence. 113 Other cases of diffuse gene structure might be caused by differently fused genes or genes in different reading frames. The mapping of transcript evidence, as well as the identification of gene boundaries, becomes even more complex when one transcript contains more than one gene (polycistronic genes). Polycistronic genes are common in prokaryotes, but have also been reported in eukaryotic genomes. 114 The occurrence of different transcripts of the same DNA sequence makes the definition of a gene fuzzy and interferes with its prediction. Interpretation of transcriptomic data has to take all the cases mentioned above into account when constructing a reliable gene structure.
The detection of novel genes is as diverse as the processes leading to their emergence. The classification of a gene as novel, defined by missing sequence homology to other known genes outside the lineage where it emerged, highly depends on the sequence alignment tools that are used to detect homology and on the parameters of these tools. Even when a gene is classified as novel, its mechanism of emergence is not easily accessible.
Novelty at the domain level needs to be considered in gene prediction, as it appears that many genes evolved by recombination of ancient domains. 115 The impact of novel domains on functional innovation and changes in phenotypes is a promising research field for novel gene prediction and for understanding evolution. Computational methods that use evolutionary reconstruction, statistical models, or machine learning algorithms are showing very promising results for the annotation of genomes. However, de novo detection is still a very challenging problem, as sequences from sister species are crucial to make a confident prediction and to date events.
In summary, computational methods for predicting novel genes are based on sequence properties, the detection of known elements, and comparison with sister species. Novel computational techniques to predict correct structures for genes with the unusual properties of novel genes are at the onset of development and will open new perspectives.
Conclusions
The analysis of emerging de novo genes only recently became a topic in genomic research. Whereas most novel genes emerged from ancient genetic material, up to 20% evolved de novo from noncoding intergenic or intronic sequences. 3 Current methods to detect de novo genes rely on conventional gene prediction workflows. The first basis for predicting genes is the intrinsic properties of the DNA sequence itself, ie, the prediction of features that a gene can consist of, such as ORFs, known promoters, TFBS, and splicing sites, or features of the sequence such as nucleotide or k-mer composition. A second basis for gene prediction is external data. External data include, but are not limited to, comparison with known genes of related species, and can also be new biological data from experiments. A major drawback in the classification of novel genes is the risk of false positives, as the prediction will be only as good as the completeness of the set of genes that the novel genes are compared to.
Finding generic gene descriptors, at least for certain species, and adapting known gene prediction methods to those generic descriptors is a major goal for gene annotation, for both ancient and novel genes. However, known properties of de novo genes often differ from those of ancient ones and are also not necessarily consistent within or across species.
The detection and classification of (novel) genes can be improved in several ways. (1) The ever-improving taxonomic coverage of important clades enables better comparisons of genomes for the classification of predicted genes as novel. (2) Better sequence comparison tools can help to find hidden homologies and avoid wrong gene predictions and classifications by combining DNA/RNA and amino acid sequences with profile-based or machine learning methods. (3) Generic gene descriptors can help to avoid overfitting of gene prediction methods. (4) Further biological techniques can improve the identification of specifically transcribed or translated sequences.
Novel genes are thought to impact organisms and their aptitude to adapt by providing, at the population level, a varied set of tinkering and novelties. The prediction of novel genes, the classification of known genes as novel, and possible explanations of mechanisms of emergence are crucial for understanding recent evolutionary traits. The development of new computational and experimental methods is necessary to build, on top of the existing knowledge of genomes, the tools to unravel the genesis of novel and de novo genes and their impact on species and evolution.
Author Contributions
Wrote the first draft of the manuscript: SK. Contributed to the writing of the manuscript: SK, LM, TB-F. All authors reviewed and approved of the final manuscript.
