Abstract
During the journey from the discovery that DNA is the source of genetic information and the elucidation of the double-helical structure of the DNA molecule to the assembly of the human genome sequence and beyond, bioinformatics has become an integral part of modern biology. Bioinformatics relies substantially on significant contributions made by scientists in various fields, including, but not limited to, linguistics, biology, mathematics, computer science, and statistics. In the field of toxicogenomics, there is an ever-increasing amount of data available to elucidate the toxic mechanisms and/or adverse effects of xenobiotics. Annotation, in combination with various bioinformatics analytical tools, can play a crucial role in the understanding of genes and proteins, and can potentially help draw meaningful conclusions from various data sources. This article attempts to present a simple overview of bioinformatics, together with a discussion of annotation.
UNDERSTANDING BIOLOGY THROUGH BIOINFORMATICS
In 1944, Avery, MacLeod, and McCarty demonstrated, in a series of elegant transformation experiments with pneumococcal bacteria, that DNA is the source of genetic information.
The field of bioinformatics relies substantially on contributions made by researchers from many fields, including but not limited to biology, mathematics, computer science, statistics, and linguistics. This interdisciplinary field can be defined as a ‘scientific discipline that encompasses all aspects of biological information acquisition, processing, storage, distribution, analysis, and interpretation.’ Bioinformatics employs computational approaches to help understand and answer complex biological questions, which require the analysis and interpretation of huge and complex sets of data before biologically valid conclusions can be drawn. These analyses empower the researcher with substantial predictive power, which, in addition to the traditional methods of biology, is beginning to positively influence the basic ways in which science is done. It is becoming widely accepted that the combination of experimental and computational analysis is essential to a complete understanding of cellular pathways and of biology in general (Baxevanis and Francis Ouellette 2001). The field of toxicology is rapidly evolving to increase our understanding of mechanisms of toxicity through the assessment of the biochemical, molecular, and genetic bases of the toxicity of various chemical substances. To increase our understanding of the biological mechanisms and pathways underlying the toxicity of various compounds, investigators in the field of toxicology are taking advantage of the ever-increasing amounts of biological data from both internal and external sources. To facilitate such complex data analyses, bioinformatics is beginning to play a crucial role in toxicological studies (Burchiel et al. 2001).
One of the earlier influences of bioinformatics has been an explosion in the number of databases and resources that are now publicly available. These resources have not only been critical in providing biological information for determining the direction of further research, but have also contributed substantially to bioinformatics education. In addition to providing complex sets of biological information, these resources also present tools to submit, mine, retrieve, and analyze biological information (Baxevanis and Francis Ouellette 2001). A partial list of publicly available resources is presented in Table 1. This list is by no means exhaustive and represents only a minor subset of the resources that may be useful in genome annotation and analysis; it does not include, for example, ADMET (absorption, distribution, metabolism, excretion, and toxicity) bioinformatics knowledge systems. Bioinformatics-based tools are now used every day in the workplace, and the Internet has completely changed the way we search for information. For example, literature searching is no longer a matter of looking up references in printed indexes (Table 2). An approach taken by several laboratories and industries is to develop bioinformatics tools that efficiently mine various literature resources to retrieve a collection of references of interest to a particular investigator or project.
Comparative analysis has a long tradition in biology, and it has now been extended to the sequences of genes and proteins through various bioinformatics tools. ‘Sequence alignment’ refers to the explicit mapping of residues (individual bases or amino acids) between two or more sequences. This type of analysis is aimed at understanding the structural, functional, or evolutionary relationships among sequences. Whereas ‘pairwise sequence alignment’ refers to the comparative analysis of two sequences, ‘multiple sequence alignment’ refers to the comparative analysis of more than two sequences. Pairwise sequence alignments are useful in identifying regions of homology between a pair of DNA or protein sequences. Multiple sequence alignments are useful in identifying sites in DNA or protein sequences that may be functionally important and/or conserved. Because it is possible to use a particular sequence as a query to search an entire database of sequences and extract matches, sequence database searching is a common task carried out in various laboratories on a regular basis to gain an understanding of the queried sequence. With the rapid sequencing of various genomes, sequence information represents a data type that is abundantly available. An uncharacterized DNA sequence can be examined with various bioinformatics tools to predict open reading frames, genes, introns, exons, promoter sites, repeat sequences, poly(A) signals, and the like. For genome-wide studies, bioinformatics tools incorporating sophisticated statistical analyses, mathematical modeling, and visualization are also available for analyzing data generated from studies employing DNA microarrays. Structural modeling of proteins and molecules is another active area in which various tools and algorithms help scientists understand and design novel studies (Mount 2001).
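The dynamic-programming idea underlying pairwise sequence alignment can be illustrated with a minimal sketch of the classic Needleman–Wunsch global alignment algorithm. This is purely illustrative (the match, mismatch, and gap scores are arbitrary choices, and real analyses would use established tools such as BLAST or EMBOSS rather than hand-rolled code):

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Globally align sequences a and b; return (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap          # align prefix of a against gaps
    for j in range(1, m + 1):
        score[0][j] = j * gap          # align prefix of b against gaps
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
    # Traceback to recover one optimal alignment
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append('-'); i -= 1
        else:
            out_b.append(b[j - 1]); out_a.append('-'); j -= 1
    return score[n][m], ''.join(reversed(out_a)), ''.join(reversed(out_b))
```

The same dynamic-programming table, with scores read from a substitution matrix such as BLOSUM62 and with affine gap penalties, underlies the rigorous alignment modes of the standard sequence-comparison tools.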
ANNOTATION
Advances in genomics have contributed to the development of a subdiscipline of toxicology, toxicogenomics, which, as the name implies, merges genomics with toxicology and employs the knowledge of genomics to assess genome-wide effects of xenobiotics. These genome-wide studies are facilitated by microarray technology, which allows the measurement of transcriptional modulation of thousands of genes in the genome following exposure to a xenobiotic of choice. To analyze the vast amounts of data generated by these microarray experiments, bioinformatics-based algorithms and tools, both public and proprietary, are extensively employed. An active area of bioinformatics research is the development of robust algorithms and tools to organize, store, analyze, visualize, and interpret the vast amounts of gene expression information generated by microarray experiments (Butte 2002). Because transcriptional responses are believed to be among the earliest reactions of a cell following exposure to xenobiotics, they may provide a preliminary indication of the pathways and/or biological mechanisms being affected by xenobiotics. These responses, in combination with proteomics and other experimental data, may have the potential to provide initial signals of adverse effects of compounds early in the pipeline during drug discovery and development. However, the combination of toxicogenomics with various other data also presents novel informatics challenges concerning annotation, data integration, data analysis, data mining, and database and tool development (Mattes et al. 2004).
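A very common first step in analyzing such expression data is to compute per-gene fold changes between control and treated samples and flag the genes whose expression is substantially modulated. The sketch below is a minimal, illustrative version of that step (the gene names and expression values are invented for illustration, and real microarray analyses would add normalization and statistical testing of replicate variability):

```python
import math

def fold_changes(control, treated):
    """log2 fold change per gene, from replicate expression values
    in control vs. treated samples (dicts of gene -> list of values)."""
    mean = lambda xs: sum(xs) / len(xs)
    return {g: math.log2(mean(treated[g]) / mean(control[g])) for g in control}

def modulated(fc, threshold=1.0):
    """Genes whose |log2 fold change| meets the threshold (1.0 = 2-fold)."""
    return sorted(g for g, v in fc.items() if abs(v) >= threshold)

# Hypothetical expression values (arbitrary units) for two genes,
# two replicates each; cyp1a1 is strongly induced, actb is unchanged.
control = {"cyp1a1": [100, 100], "actb": [500, 500]}
treated = {"cyp1a1": [800, 800], "actb": [500, 500]}
```

A simple fold-change cutoff like this is only a screening heuristic; in practice it is combined with significance testing across replicates before genes are carried forward for interpretation.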
For scientists who might at first have been limited to working with one or a few genes, these genome-wide studies allow thousands of genes to be monitored. Even after arriving at a small set of genes that might be of interest for further analysis, the scientist is often confronted with a lack of knowledge of the functions of various genes. Although the genomes of more than 800 organisms have been sequenced, the biological functions of a majority of genes remain unknown, or are known only through homology-based predictions from well-characterized genes. To increase our understanding of various genes, genome annotation can play, and has played, a critical role. Genome annotation is described by Lincoln Stein as “the process of taking the raw DNA sequence produced by the genome-sequencing projects, and adding the layers of analysis and interpretation necessary to extract its biological significance and place it into the context of our understanding of biological process” (Stein 2001). “The value of the genome is only as good as its annotation. It is the annotation that bridges the gap from the sequence to the biology of the organism. The aim of high-quality annotation is to identify the key features of the genome—in particular, the genes and their products,” writes Stein, emphasizing the importance of annotation (Stein 2001).
Genome annotation comprises annotation at various levels: the nucleotide level, the protein level, and the process level (Stein 2001). Nucleotide-level annotation focuses on examining the nucleotide sequence to identify genes, noncoding RNAs, transcriptional regulatory regions, genetic markers, repetitive elements, segmental duplications, and nucleotide polymorphisms, and on linking available genetic, cytogenetic, or hybrid maps. Various bioinformatics tools are currently available for these analyses, and improving current algorithms and developing new ones is an active area of research.
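The simplest example of a nucleotide-level annotation task is scanning a sequence for open reading frames. The sketch below finds ORFs on the forward strand only, defined naively as an ATG start codon followed by an in-frame stop codon; real gene-prediction tools additionally scan the reverse complement, model splice sites, and score codon usage:

```python
def find_orfs(dna):
    """Return (start, end) half-open intervals, including the stop codon,
    for naive ORFs (ATG ... in-frame stop) on the forward strand."""
    stops = {"TAA", "TAG", "TGA"}
    dna = dna.upper()
    orfs = []
    for frame in range(3):              # the three forward reading frames
        i = frame
        while i + 3 <= len(dna):
            if dna[i:i + 3] == "ATG":
                j = i + 3               # scan codon by codon for a stop
                while j + 3 <= len(dna) and dna[j:j + 3] not in stops:
                    j += 3
                if j + 3 <= len(dna):   # in-frame stop codon found
                    orfs.append((i, j + 3))
                    i = j + 3           # resume scanning after this ORF
                    continue
            i += 3
    return orfs
```

Even this toy scanner illustrates why nucleotide-level annotation is nontrivial: in higher eukaryotes, introns interrupt the reading frame, so raw ORF scanning must be supplemented with statistical gene models and homology evidence.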
Protein-level annotation focuses on creating a catalogue of the proteins encoded in an organism’s genome, recognizing their names and assigning putative functions. Because a majority of the proteins are not characterized and their functions are largely unknown, the initial process involves categorizing these predicted proteins into subsets of proteins or protein families, based on the homologies, presence of various functional domains, and motifs as well as similarities to well-characterized proteins from other species (Stein 2001). Table 1 lists some of the resources that can serve as excellent sources of annotation.
Process-level annotation focuses on associating the genome with biological processes. Literature-based annotation can serve as an excellent means of providing annotation at the process level. Although literature-based annotation is extremely labor intensive, developing targeted literature searches focused on the desired level of annotation may prove extremely successful. Several model organism databases provide excellent examples of the efforts of various groups to develop highly valuable resources based on various levels of annotation (Stein 2001). In addition, developing a standard vocabulary to describe the functions of genes is proving to be a critical step in annotation and its universal application. For example, the Gene Ontology Consortium uses a standard vocabulary to describe the functions of genes in three independent hierarchies, describing molecular function, biological process, and cellular component, respectively, for a particular gene. This process not only allows annotation to be assigned at a very specific level, but also allows standardized usage of annotation to describe a gene's function (Gene Ontology Consortium 2001). A further advantage of these hierarchies is the ability to use higher-level generic terms to predict the functions of similar or related family members identified through the application of various bioinformatics tools. Literature mining to predict the functions of genes, pathways, and protein networks, among other things, is an active area of research. As annotation protocols and literature-mining algorithms for functional annotation improve further, we stand to gain novel insights from experiments requiring analyses of sets of genes generated from microarray experiments.
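The practical value of such hierarchies comes from the fact that an annotation made at a specific term implicitly applies at every ancestor term, so genes annotated at different depths can still be compared at a common, more generic level. The sketch below illustrates this propagation over a tiny, hypothetical GO-like hierarchy (the term names and parent links are invented for illustration; real GO terms form a directed acyclic graph in which a term may have several parents):

```python
def ancestors(term, parents):
    """Collect all ancestor terms of `term`, given a child -> [parents] map
    describing a directed acyclic graph of terms."""
    seen = set()
    stack = [term]
    while stack:
        t = stack.pop()
        for p in parents.get(t, []):
            if p not in seen:       # guard against revisiting shared ancestors
                seen.add(p)
                stack.append(p)
    return seen

# Toy, hypothetical fragment of a biological-process hierarchy
parents = {
    "glucose metabolic process": ["carbohydrate metabolic process"],
    "carbohydrate metabolic process": ["metabolic process"],
    "metabolic process": ["biological_process"],
}
```

With this propagation in hand, a set of genes from a microarray experiment can be summarized by counting how many of their specific annotations roll up to each generic term, which is the basis of the widely used GO term-enrichment analyses.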
