Nucleic Acids Research, 2005, Vol. 33, Database issue D442-D446
© 2005, the authors
Nucleic Acids Research, Vol. 33, Database issue © Oxford University Press 2005; all rights reserved
DED: Database of Evolutionary Distances
owski*Institute of Molecular Evolutionary Genetics and Department of Biology, Pennsylvania State University, University Park, PA 16802, USA
* To whom correspondence should be addressed at Department of Biology, Pennsylvania State University, 514 Mueller Lab, University Park, PA 16802, USA. Tel: +1 814 865 5025; Fax: +1 814 865 9366; Email: wojtek{at}psu.edu
Received August 15, 2004; Revised and Accepted October 12, 2004
| ABSTRACT |
|---|
A large database of homologous sequence alignments with good estimates of evolutionary distances can be a valuable resource for molecular evolutionary studies and phylogenetic research in particular. We recently created a database containing 159 921 transcripts from human, mouse, rat, zebrafish and fugu species. Approximately 16 000 homology groups were identified with the help of Ensembl homology evidence. At the macro-level, the database allows us to answer queries of the form:
- What is the average k-distance between 5' untranslated regions of human and mouse?
- List the 10 groups with the highest Ka/Ks ratio between mouse and rat.
- List all identical proteins between human and rat.
| INTRODUCTION |
|---|
The previous decade in biology witnessed an unprecedented accumulation of molecular sequence data. However, as Sydney Brenner remarked, "The great challenge in biological research today is how to turn data into knowledge" (1). Evolution, in spite of being recognized for decades as crucially important for understanding life, was until recently the most speculative area of biology. This situation has changed radically with the molecular approaches that are now possible thanks to the availability of large amounts of molecular sequence data. However, in order to be useful for evolutionary studies, sequences have to be carefully selected and grouped into homology clusters. This is the most important preparatory step, and the most tedious one, in any evolutionary analysis. For many analyses, homologous sequences have to be further classified as orthologous (i.e. sequences that diverged from their last common ancestor at a speciation event) or paralogous (i.e. sequences that were created by ancestral gene duplication). This distinction is especially important for molecular phylogeny, as it is necessary to work with orthologous genes to infer a species phylogeny from a gene phylogeny. Interestingly, despite the vast amount of sequence data from different organisms, there have been surprisingly few large-scale gene comparison studies between different species or groups of organisms (2–6). Information on expected evolutionary distances or protein/gene identity between different organisms (e.g. human and zebrafish) or taxonomic groups (e.g. mammals and reptiles) is difficult to obtain. To fill this gap, we have created the Database of Evolutionary Distances (DED), which contains sequence information from several vertebrate species clustered into homology groups. It also includes multiple sequence alignments for both protein and nucleotide sequences, along with phylogenetic trees and a graphical representation of sequence relationships within a homology group. A large number of links to external databases makes further data exploration as easy as a click of a mouse.
Our DED should be useful for gene function assignment, molecular phylogenetic studies, searches for lateral gene transfer, reconstruction or identification of biochemical pathways in poorly characterized organisms, and studies of sequence evolution patterns. Simple, yet powerful, web interfaces provide a convenient way to access the data. The results are displayed in easy-to-understand tabulated and/or graphical forms.
| SEQUENCE DATA |
|---|
The basic objects stored in our database are genes and their associated transcripts. For each gene, we maintain all its known transcript variants, and for each transcript we store its sequence, coding region annotation and exon–intron structure. Currently, our database is based on Ensembl release 20 (7) of human, mouse, rat, zebrafish and fugu data (see Table 1). A total of 159 921 vertebrate transcripts stored in the database represent 126 842 unique genes clustered in homology groups (see below).
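As an illustration only, the gene and transcript objects described above could be modelled along the following lines. The class names, field names and the coordinate convention (0-based, end-exclusive positions on the mRNA) are assumptions made for this sketch and do not describe the actual DED schema.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Transcript:
    """Hypothetical transcript record: sequence, CDS annotation, exon structure."""
    transcript_id: str
    sequence: str                       # full mRNA sequence
    cds_start: int                      # 0-based, inclusive (assumed convention)
    cds_end: int                        # 0-based, exclusive (assumed convention)
    exons: List[Tuple[int, int]] = field(default_factory=list)  # exon spans on the mRNA

    @property
    def five_prime_utr(self) -> str:
        return self.sequence[:self.cds_start]

    @property
    def cds(self) -> str:
        return self.sequence[self.cds_start:self.cds_end]

    @property
    def three_prime_utr(self) -> str:
        return self.sequence[self.cds_end:]


@dataclass
class Gene:
    """Hypothetical gene record holding all known transcript variants."""
    gene_id: str
    species: str
    transcripts: List[Transcript] = field(default_factory=list)


# Toy record with invented identifiers and sequence.
toy_gene = Gene(
    "GENE_A",
    "human",
    [Transcript("TR_A1", "GGATGAAATAGCC", 2, 11, [(0, 6), (6, 13)])],
)
toy_transcript = toy_gene.transcripts[0]
print(toy_transcript.five_prime_utr, toy_transcript.cds, toy_transcript.three_prime_utr)
```

Keeping the coding region as a pair of coordinates on the stored mRNA, rather than as a separate sequence, lets the untranslated regions and the coding sequence be derived on demand.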
Based on the information retrieved from Ensembl, the gene and transcript objects in our database were cross-referenced with objects in external databases such as RefSeq, Pfam and GO. As expected, the human genes and transcripts have the most external links associated with them (140 302), while those of zebrafish have the fewest (34 008). Surprisingly, rat records have relatively few external links (42 896), possibly reflecting the transient status of the rat genome annotation. As the source of our data, Ensembl is the most frequently linked external database, followed by the EMBL database and LocusLink (for details see Table 2).
| HOMOLOGY GROUPS |
|---|
Single-linkage clustering was used to create homology groups from pairwise homology information obtained through EnsMart (8). Overall, 16 127 groups were formed from 150 158 pairwise homology relations. Although not all species are present in each group, there are 8402 groups that contain transcripts from all five species. There are several one-to-many homology relationships annotated in Ensembl. In such cases, our use of single-linkage clustering results in homology groups that contain multiple genes from the same species. Figure 1 shows the distribution of group sizes. For each homology group, CLUSTAL W (9) is used to compute two multiple sequence alignments: one from the mRNA sequences and one from the amino acid sequences.
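Single-linkage clustering of pairwise relations amounts to finding the connected components of the homology graph. A minimal sketch using a union-find structure is given here; the function name and the gene identifiers are invented for illustration, and this is not the code used to build DED.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def single_linkage_groups(pairs: Iterable[Tuple[str, str]]) -> List[List[str]]:
    """Group identifiers so that any two linked by a chain of pairs share a group."""
    parent: Dict[str, str] = {}

    def find(x: str) -> str:
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a        # merge the two components

    groups = defaultdict(set)
    for member in list(parent):
        groups[find(member)].add(member)
    return sorted((sorted(g) for g in groups.values()), key=lambda g: g[0])


# Toy pairwise homology relations (hypothetical identifiers).
toy_pairs = [
    ("human_A", "mouse_A"),
    ("mouse_A", "rat_A"),
    ("human_B", "mouse_B"),
    ("rat_A", "rat_A2"),   # a one-to-many relation pulls a second rat gene into the group
]
toy_groups = single_linkage_groups(toy_pairs)
print(toy_groups)
```

Because a single link is enough to merge two components, one spurious pairwise relation can join two otherwise distinct groups, which is the behaviour illustrated by the group structure picture discussed under User Interfaces.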
The multiple sequence alignments are then stored in a compressed format within the MySQL database. Compression is achieved by noting that a gapped sequence that belongs to an alignment can be obtained from the ungapped transcript (or protein) sequence already stored in the database if one knows the location of the gaps. Instead of storing a whole alignment, we store only information about the location and length of gaps in the alignment. This procedure results in a 100-fold reduction of the required storage.
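The idea can be sketched as a pair of functions that convert an aligned row to a list of gap runs and back. The encoding chosen here, a list of (position in the gapped row, run length) pairs, is an assumption for illustration; the format actually stored in DED is not specified in the text.

```python
from typing import List, Tuple

GapRuns = List[Tuple[int, int]]


def encode_gaps(gapped: str, gap: str = "-") -> Tuple[str, GapRuns]:
    """Split an aligned row into its ungapped sequence and its gap runs."""
    runs: GapRuns = []
    i = 0
    while i < len(gapped):
        if gapped[i] == gap:
            j = i
            while j < len(gapped) and gapped[j] == gap:
                j += 1
            runs.append((i, j - i))   # run starts at column i and spans j - i columns
            i = j
        else:
            i += 1
    return gapped.replace(gap, ""), runs


def decode_gaps(ungapped: str, runs: GapRuns, gap: str = "-") -> str:
    """Rebuild the aligned row from the stored sequence and its gap runs."""
    pieces = []
    column = 0   # current column in the gapped row
    used = 0     # number of residues already copied from the ungapped sequence
    for start, length in runs:
        take = start - column
        pieces.append(ungapped[used:used + take])
        used += take
        pieces.append(gap * length)
        column = start + length
    pieces.append(ungapped[used:])
    return "".join(pieces)


toy_row = "AC--GT---A"
toy_sequence, toy_runs = encode_gaps(toy_row)
restored_row = decode_gaps(toy_sequence, toy_runs)
print(toy_sequence, toy_runs, restored_row)
```

Since the ungapped sequence is already present in the transcript or protein table, only the short list of runs needs to be stored per alignment row.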
| DISTANCE COMPUTATION |
|---|
In calculating distances, only the transcript with the longest coding region is taken into consideration. mRNA alignments are used for calculation of p and k distances of coding sequences and untranslated regions. We use Kimura's two-parameter model to compute k distances. If the coding regions do not align perfectly with each other, only the common part of each distinct mRNA region is considered for calculation.
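For reference, the p distance is the proportion of compared sites that differ, and the Kimura two-parameter distance is K = -(1/2) ln[(1 - 2P - Q) sqrt(1 - 2Q)], where P and Q are the proportions of transitional and transversional differences. The sketch below computes both for a pair of aligned nucleotide sequences. Skipping columns that contain a gap or an ambiguous base is an assumption of this example, and the function names are illustrative rather than those of the Bioperl modules used for DED.

```python
import math
from typing import Tuple

NUCLEOTIDES = set("ACGT")
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def count_substitutions(seq1: str, seq2: str) -> Tuple[int, int, int]:
    """Return (compared sites, transitions, transversions) for two aligned rows."""
    if len(seq1) != len(seq2):
        raise ValueError("aligned sequences must have equal length")
    sites = transitions = transversions = 0
    for a, b in zip(seq1.upper(), seq2.upper()):
        if a not in NUCLEOTIDES or b not in NUCLEOTIDES:
            continue  # assumption: ignore gaps and ambiguous bases
        sites += 1
        if a == b:
            continue
        if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
            transitions += 1      # purine<->purine or pyrimidine<->pyrimidine
        else:
            transversions += 1    # purine<->pyrimidine
    return sites, transitions, transversions


def p_distance(seq1: str, seq2: str) -> float:
    sites, ts, tv = count_substitutions(seq1, seq2)
    if sites == 0:
        raise ValueError("no comparable sites")
    return (ts + tv) / sites


def k2p_distance(seq1: str, seq2: str) -> float:
    sites, ts, tv = count_substitutions(seq1, seq2)
    if sites == 0:
        raise ValueError("no comparable sites")
    p, q = ts / sites, tv / sites
    term1, term2 = 1 - 2 * p - q, 1 - 2 * q
    if term1 <= 0 or term2 <= 0:
        raise ValueError("sequences too divergent for the two-parameter correction")
    return -0.5 * math.log(term1 * math.sqrt(term2))


# Toy alignment: one transition, one transversion, one gap column.
toy_a = "ACGTACG-TAC"
toy_b = "ACATACGGTCC"
print(p_distance(toy_a, toy_b), k2p_distance(toy_a, toy_b))
```

The corrected k distance is always at least as large as the p distance, because it accounts for multiple substitutions at the same site.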
Protein sequence alignments are used for protein identity calculations and serve as a template for the coding sequence alignment that is used in synonymous (Ks) and non-synonymous (Kn) distance calculations. Currently, Ks and Kn are obtained using the Nei–Gojobori method (10) as implemented in the PAML package (11). All other pairwise comparison analyses were carried out using Bioperl 1.4 modules (12).
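Using a protein alignment as a template for the coding sequence alignment means placing one codon under each aligned residue and three gap characters under each protein gap, so that gaps never split a codon. A minimal sketch follows. It assumes the coding sequence excludes the stop codon and corresponds residue for residue to the protein; the identity function uses one common definition (matches over columns where both sequences have a residue), which may differ from the definition used in DED.

```python
def thread_codons(aligned_protein: str, cds: str, gap: str = "-") -> str:
    """Project a gapped protein row onto its coding sequence, codon by codon."""
    if len(cds) % 3 != 0:
        raise ValueError("coding sequence length must be a multiple of three")
    codons = [cds[i:i + 3] for i in range(0, len(cds), 3)]
    residues = sum(1 for aa in aligned_protein if aa != gap)
    if residues != len(codons):
        raise ValueError("protein and coding sequence lengths do not match")
    codon_iter = iter(codons)
    return "".join(gap * 3 if aa == gap else next(codon_iter) for aa in aligned_protein)


def percent_identity(row1: str, row2: str, gap: str = "-") -> float:
    """Percentage of identical residues over columns without gaps (assumed definition)."""
    columns = [(a, b) for a, b in zip(row1, row2) if a != gap and b != gap]
    if not columns:
        raise ValueError("no aligned residue pairs")
    return 100.0 * sum(a == b for a, b in columns) / len(columns)


# Toy protein alignment and matching coding sequences (invented for illustration).
protein_1, cds_1 = "MK-L", "ATGAAACTG"
protein_2, cds_2 = "MRQL", "ATGAGACAGCTT"
codon_row_1 = thread_codons(protein_1, cds_1)
codon_row_2 = thread_codons(protein_2, cds_2)
print(codon_row_1)
print(codon_row_2)
print(percent_identity(protein_1, protein_2))
```

The resulting codon-aligned rows have equal length and preserve the reading frame, which is what codon-based methods such as Nei–Gojobori require as input.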
| USER INTERFACES |
|---|
A simple search interface allows users to search the database by keyword or accession number from Ensembl (or other databases linked to Ensembl records such as Swiss-Prot, RefSeq, Gene Ontology, etc.). Genes matching the search criteria and the homology groups that they belong to are displayed. Clicking on the hyperlink for a homology group listed in the search results leads to a page with the full description of the group consisting of seven sections (see Figure 2): (i) description of group members with links to external databases; (ii) pairwise comparison analysis results in a tabular format; (iii) pictorial representation of alignments mapped to exon–intron structures, which helps visualize conservation of gene structure; (iv) protein alignment; (v) mRNA alignment; (vi) phylogenetic tree; (vii) group structure picture, which shows the pairwise homology relationships that resulted in the construction of the group (Figure 3 shows a case where one possibly false homology relationship resulted in the merging of two distinct homology groups).
By default, only the description of group members and pairwise comparison results are shown. User preferences stored in a cookie are used to determine the set of sections to be shown.
A more elaborate accession search interface can be used for larger scale analyses. It enables calculation of some evolutionary parameters at a global scale (i.e. it summarizes results for a selected group of genes or, if no limit is specified, for all genes present in the database). Extensive filtering options allow a user to restrict the analysis to alignments that satisfy certain length and similarity constraints. This helps avoid some statistical biases due to data sampling artefacts or erroneous comparison of paralogous genes. The summary of overall evolutionary statistics, shown in Figure 4, is in agreement with published literature (2,3,13–15).
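To make the kind of filtered, global summary described above concrete, the following sketch runs two such queries against an in-memory SQLite table. The table name, column names and all values are invented for illustration; they do not reflect the DED MySQL schema or any real measurements.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE pairwise (
           group_id   INTEGER,
           species_a  TEXT,
           species_b  TEXT,
           aln_length INTEGER,  -- aligned coding length in nucleotides
           identity   REAL,     -- protein identity in percent
           ka         REAL,     -- non-synonymous distance
           ks         REAL      -- synonymous distance
       )"""
)
# Toy rows with made-up values.
conn.executemany(
    "INSERT INTO pairwise VALUES (?, ?, ?, ?, ?, ?, ?)",
    [
        (1, "mouse", "rat", 900, 95.0, 0.02, 0.20),
        (2, "mouse", "rat", 300, 80.0, 0.10, 0.25),
        (3, "mouse", "rat", 60, 50.0, 0.30, 0.30),
        (4, "human", "mouse", 1200, 85.0, 0.08, 0.50),
    ],
)


def summarize(conn, species_a, species_b, min_length=0, min_identity=0.0):
    """Return (count, mean identity, mean Ka, mean Ks) for alignments passing the filters."""
    cursor = conn.execute(
        """SELECT COUNT(*), AVG(identity), AVG(ka), AVG(ks)
           FROM pairwise
           WHERE species_a = ? AND species_b = ?
             AND aln_length >= ? AND identity >= ?""",
        (species_a, species_b, min_length, min_identity),
    )
    return cursor.fetchone()


def top_ka_ks(conn, species_a, species_b, limit=10):
    """Return (group_id, Ka/Ks) for the groups with the highest ratio."""
    cursor = conn.execute(
        """SELECT group_id, ka / ks AS ratio
           FROM pairwise
           WHERE species_a = ? AND species_b = ? AND ks > 0
           ORDER BY ratio DESC
           LIMIT ?""",
        (species_a, species_b, limit),
    )
    return cursor.fetchall()


summary = summarize(conn, "mouse", "rat", min_length=100, min_identity=60.0)
ranking = top_ka_ks(conn, "mouse", "rat", limit=2)
print(summary)
print(ranking)
```

In this toy data set the length and identity thresholds exclude the short, poorly conserved alignment of group 3 from the averages, which is the kind of sampling artefact the filters are meant to guard against.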
This interface makes it convenient to verify published results regarding evolutionary rates of groups of proteins. For instance, it was shown in a recent study that sperm-specific proteins evolve at a faster rate than other proteins (16). The paper listed either the RefSeq id or the EMBL accession number for each of the analyzed proteins. By entering the RefSeq ids in one entry box and the EMBL accession numbers in another entry box, one can confirm these results in seconds in the accession search page.
| CONCLUSIONS AND PERSPECTIVES |
|---|
Evolutionary analysis is a key step in many biological investigations from classical systematics to comparative genomics and bioinformatics. Very often, researchers are interested in knowing how the results of a comparison of a single gene or set of genes fit a global picture. However, such global information is hard to obtain or does not exist. To fill this gap, we have created the DED, which contains sequence information from several vertebrate species clustered into homology groups. This database should be useful in a wide range of biological investigations including gene function assignment, molecular phylogenetic studies and sequence evolution patterns.
Our database depends on other primary databases for sequence, structure and homology information. However, because of the extensive post-processing involved, it is not possible to update our database and keep it synchronized with the primary (source) databases at all times. At present, we plan to update the DED at least twice a year and add new genomes at the time of scheduled updates. In addition, we also plan to add sequence information from organisms whose genomes are not yet fully sequenced.
| Notes |
|---|
The online version of this article has been published under an open access model. Users are entitled to use, reproduce, disseminate, or display the open access version of this article for non-commercial purposes provided that: the original authorship is properly and fully attributed; the Journal and Oxford University Press are attributed as the original place of publication with the correct citation details given; if an article is subsequently reproduced or disseminated not in its entirety but only in part or as a derivative work this must be clearly indicated. For commercial re-use permissions, please contact journals.permissions{at}oupjournals.org.
| REFERENCES |
|---|
1. Brenner,S. (2002) Ontology recapitulates philology. Scientist, 16, 12.
2. Makalowski,W., Zhang,J. and Boguski,M.S. (1996) Comparative analysis of 1196 orthologous mouse and human full-length mRNA and protein sequences. Genome Res., 6, 846–857.
3. Makalowski,W. and Boguski,M.S. (1998) Evolutionary parameters of the transcribed mammalian genome: an analysis of 2,820 orthologous rodent and human sequences. Proc. Natl Acad. Sci. USA, 95, 9407–9412.
4. Mushegian,A.R., Garey,J.R., Martin,J. and Liu,L.X. (1998) Large-scale taxonomic profiling of eukaryotic model organisms: a comparison of orthologous proteins encoded by the human, fly, nematode, and yeast genomes. Genome Res., 8, 590–598.
5. Wheelan,S.J., Boguski,M.S., Duret,L. and Makalowski,W. (1999) Human and nematode orthologs - lessons from the analysis of 1800 human genes and the proteome of Caenorhabditis elegans. Gene, 238, 163–170.
6. Glazko,G.V. and Nei,M. (2003) Estimation of divergence times for major lineages of primate species. Mol. Biol. Evol., 20, 424–434.
7. Birney,E., Andrews,T.D., Bevan,P., Caccamo,M., Chen,Y., Clarke,L., Coates,G., Cuff,J., Curwen,V., Cutts,T. et al. (2004) An overview of Ensembl. Genome Res., 14, 925–928.
8. Kasprzyk,A., Keefe,D., Smedley,D., London,D., Spooner,W., Melsopp,C., Hammond,M., Rocca-Serra,P., Cox,T. and Birney,E. (2004) EnsMart: a generic system for fast and flexible access to biological data. Genome Res., 14, 160–169.
9. Thompson,J.D., Higgins,D.G. and Gibson,T.J. (1994) CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res., 22, 4673–4680.
10. Nei,M. and Gojobori,T. (1986) Simple methods for estimating the numbers of synonymous and nonsynonymous nucleotide substitutions. Mol. Biol. Evol., 3, 418–426.
11. Yang,Z. (1997) PAML: a program package for phylogenetic analysis by maximum likelihood. Comput. Appl. Biosci., 13, 555–556.
12. Stajich,J.E., Block,D., Boulez,K., Brenner,S.E., Chervitz,S.A., Dagdigian,C., Fuellen,G., Gilbert,J.G., Korf,I., Lapp,H. et al. (2002) The Bioperl toolkit: Perl modules for the life sciences. Genome Res., 12, 1611–1618.
13. Waterston,R.H., Lindblad-Toh,K., Birney,E., Rogers,J., Abril,J.F., Agarwal,P., Agarwala,R., Ainscough,R., Alexandersson,M., An,P. et al. (2002) Initial sequencing and comparative analysis of the mouse genome. Nature, 420, 520–562.
14. Aparicio,S., Chapman,J., Stupka,E., Putnam,N., Chia,J.M., Dehal,P., Christoffels,A., Rash,S., Hoon,S., Smit,A. et al. (2002) Whole-genome shotgun assembly and analysis of the genome of Fugu rubripes. Science, 297, 1301–1310.
15. Gibbs,R.A., Weinstock,G.M., Metzker,M.L., Muzny,D.M., Sodergren,E.J., Scherer,S., Scott,G., Steffen,D., Worley,K.C., Burch,P.E. et al. (2004) Genome sequence of the Brown Norway rat yields insights into mammalian evolution. Nature, 428, 493–521.
16. Torgerson,D.G., Kulathinal,R.J. and Singh,R.S. (2002) Mammalian sperm proteins are rapidly evolving: evidence of positive selection in functionally diverse genes. Mol. Biol. Evol., 19, 1973–1980.






