Nucleic Acids Research Advance Access originally published online on October 11, 2007
Nucleic Acids Research 2008 36(Database issue):D267-D270; doi:10.1093/nar/gkm852
© 2007 The Author(s)
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
OPTIC: orthologous and paralogous transcripts in clades
Department of Physiology, Anatomy and Genetics, MRC Functional Genetics Unit, University of Oxford, Le Gros Clark Building, Oxford OX1 3QX, UK
* To whom correspondence should be addressed. Tel: +44 1865 2 85 85 4; Fax: +44 1865 2 85 86 2; Email: andreas.heger{at}dpag.ox.ac.uk
Correspondence may also be addressed to Chris Ponting. Tel: +44 1865 2 85 85 5; Fax: +44 1865 2 85 86 2; Email: chris.ponting{at}dpag.ox.ac.uk
Received August 13, 2007. Revised September 25, 2007. Accepted September 26, 2007.
ABSTRACT
The genome sequences of a large number of metazoan species are now known. As multiple closely related genomes are sequenced, comparative studies that previously focussed on only pairs of genomes can now be extended over whole clades. The orthologous and paralogous transcripts in clades (OPTIC) database currently provides sets of gene predictions and orthology assignments for three clades: (i) amniotes, including human, dog, mouse, opossum, platypus and chicken (17 443 orthologous groups); (ii) a Drosophila clade of 12 species (12 889 orthologous groups) and (iii) a nematode clade of four species (13 626 orthologous groups). Gene predictions, multiple alignments and phylogenetic trees are freely available to browse and download from http://genserv.anat.ox.ac.uk/clades. Further genomes and clades will be added in the future.
INTRODUCTION
New technologies and reduced costs are driving a marked increase in the number of genomes being sequenced. This steep rise in data presents opportunities for predicting evolutionary relationships of genes not between pairs of genomes, as previously, but instead among genomes from a clade of closely related species. Computational tools for gene prediction, orthology assignment and multiple alignment now need to be developed using phylogenetic approaches. To meet this challenge, we have developed a pipeline for gene prediction and orthology assignment for any clade of genomes (Heger and Ponting, in press). The current release of the orthologous and paralogous transcripts in clades (OPTIC) database contains three clades: 12 species from the genus Drosophila, an amniotic clade of five mammals with chicken as outgroup, and four Caenorhabditis nematodes (Table 1).
The pipeline predicts orthology for both orthologous groups and simple 1:1 ortholog sets. Here, orthologous groups contain orthologs and in-paralogs but exclude out-paralogs (1), those duplicated genes that were each present in the last common ancestor of a clade. Simple 1:1 ortholog sets are derived from orthologous groups by examining the gene tree and extracting sub-trees that contain exactly one gene per species. To enable inferences of gene duplication or loss, or positive selection on individual codons, we supply amino acid or nucleotide multiple sequence alignments, and phylogenetic trees, for each orthologous group. All data may be searched or downloaded freely from http://genserv.anat.ox.ac.uk/download/clades.
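The derivation of simple 1:1 ortholog sets from a gene tree can be illustrated with a minimal sketch. It is not the pipeline's own code. It assumes a hypothetical tree encoding (nested tuples whose leaves are strings of the form `species|gene`) and interprets "exactly one gene per species" as "no species occurs more than once in the sub-tree", keeping only the largest such sub-trees.

```python
def leaves(tree):
    """Return all leaf labels below a node (a leaf is a plain string)."""
    if isinstance(tree, str):
        return [tree]
    found = []
    for child in tree:
        found.extend(leaves(child))
    return found


def species_of(leaf):
    """Extract the species prefix from a 'species|gene' label (assumed format)."""
    return leaf.split("|", 1)[0]


def one_to_one_sets(tree, min_species=2):
    """Collect maximal sub-trees in which no species occurs more than once."""
    members = leaves(tree)
    species = [species_of(m) for m in members]
    if len(species) == len(set(species)):
        # Every species below this node is represented by a single gene.
        return [sorted(members)] if len(members) >= min_species else []
    # At least one species is duplicated here, so descend into the children.
    result = []
    for child in tree:
        result.extend(one_to_one_sets(child, min_species))
    return result


# Toy gene tree: one clean sub-tree and one with a human-specific duplication.
toy_tree = (
    (("human|A1", "mouse|A1"), "dog|A1"),
    (("human|A2", "human|A3"), "mouse|A2"),
)
print(one_to_one_sets(toy_tree))
```

In this toy tree only the first sub-tree qualifies, because the second contains two human genes and none of its smaller sub-trees spans two species.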
Database construction
The pipeline requires a set of genome sequences and ENSEMBL (2) gene sets for each genome. If a gene set for a genome is unavailable, we predict transcripts by homology from a reference transcript set and thereafter automatically derive a gene set from them (Heger and Ponting, in press). A quality control step removes partial predictions and marks as pseudogenes those predictions that contain in-frame stop-codons and frameshift insertions and deletions. Both genes and pseudogenes comprise a predicted gene set. ENSEMBL and predicted gene sets are then submitted to an orthology assignment process. A full description of the pipeline, including parameter settings, is provided on the web site. Briefly, the pipeline implements the following steps:
- Gene prediction by homology from a transcript set using Exonerate (3).
- Pairwise orthology assignment between all pairs of genomes using:
  - BlastP (4) all-against-all alignments of all translated transcripts and
  - PhyOP (5) tree-based orthology assignment of genes.
- Graph-based grouping of genes from all species into clusters.
- Multiple alignment of translated exons using MUSCLE (6).
- Estimation of phylogenetic tree topology using NJTree (7).
- Decomposition of clusters into orthologous groups.
- Branch length estimation using codeml from the PAML package (8).
- Computation of simple 1:1 ortholog sets.
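As an illustration of the graph-based grouping step listed above, the following sketch clusters genes by taking connected components over pairwise orthology links. The text does not specify the grouping algorithm, so connected components (implemented here with union-find) are an assumption standing in for it, and the `species|gene` identifiers are hypothetical.

```python
from collections import defaultdict


def cluster_genes(pairs):
    """Group genes into clusters: connected components of the pairwise graph."""
    parent = {}

    def find(gene):
        # Register unseen genes as their own root, then walk up to the root.
        parent.setdefault(gene, gene)
        while parent[gene] != gene:
            parent[gene] = parent[parent[gene]]  # path halving
            gene = parent[gene]
        return gene

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two components

    clusters = defaultdict(set)
    for gene in parent:
        clusters[find(gene)].add(gene)
    return sorted(sorted(members) for members in clusters.values())


# Hypothetical pairwise orthology assignments between three species.
links = [
    ("dmel|g1", "dsim|g1"),
    ("dsim|g1", "dyak|g1"),
    ("dmel|g2", "dsim|g2"),
]
print(cluster_genes(links))
```

Each resulting cluster would then be passed on to the alignment and tree-building steps.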
Data are stored in a relational database and gene predictions are displayed within a GMOD genome browser (http://www.gmod.org). Software is open source and available without charge on request to the authors.
Database contents
For the current release, we have applied our pipeline to three metazoan clades (Table 1) each containing between 4 and 12 species. Genes were predicted for Drosophila and Caenorhabditis species genome assemblies using D. melanogaster (9) and C. elegans (10) protein-coding transcripts as templates. Mammalian and chicken gene sets were from ENSEMBL release 42 (2). The web server provides an up-to-date list of genome assemblies for the current release.
We find 12 889 orthologous groups in the Drosophila clade, 17 443 groups in the amniotic clade and 13 626 groups in the four Caenorhabditis species. Of these, 10 563 orthologous groups in the Drosophila clade, 9675 groups in the amniotic clade and 6545 groups in the Caenorhabditis clade contain the full species complements. The numbers of simple 1:1 ortholog sets are smaller (5241, 7587, and 5987 for the three clades, respectively) owing to gene duplications and absences from incomplete assemblies.
For each orthologous group, we provide:
- Transcript predictions: Predicted transcripts are available as exonic genomic coordinates, and as peptide and coding sequences.
- Orthologs: Orthologous groups and simple 1:1 ortholog sets.
- Multiple alignments: Multiple alignments of transcripts and genes within an orthologous group are provided both as aligned nucleic acid sequences and as aligned peptide sequences. Frameshift insertions or deletions in pseudogenes have been removed, and stop-codons have been masked in order to facilitate downstream analyses. Genes have been aligned by concatenating exons of all transcripts while maintaining frame.
- Phylogenetic trees: For each orthologous group, we provide a phylogenetic tree. The topology of the tree has been calculated from NJTree, while branch lengths (nucleotide substitutions per site) have been assigned using PAML.
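The masking of stop-codons mentioned for the multiple alignments can be illustrated as follows. This is a sketch under stated assumptions rather than the pipeline's implementation: it assumes the standard genetic code, an ungapped coding sequence that starts in frame, and `NNN` as the mask, none of which is specified in the text.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code (assumption)


def mask_stop_codons(cds, mask="NNN"):
    """Replace every in-frame stop codon of a coding sequence with a mask."""
    cds = cds.upper()
    # Split into codons, reading frame 0; a trailing partial codon is kept as is.
    codons = [cds[i:i + 3] for i in range(0, len(cds), 3)]
    return "".join(mask if codon in STOP_CODONS else codon for codon in codons)


print(mask_stop_codons("ATGTAAGGCTGA"))
```

Masked codons can then be treated as missing data by downstream tools such as codon-based rate estimation.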
Database access and web service
The web service permits interactive data querying and browsing of orthologous groups and simple 1:1 ortholog sets for each clade (Figure 1). Species distributions of orthologous groups are summarized by phylogenetic profiles that denote the presence (+) or absence (0) of one or more genes in a group. For example, a search for orthologous groups in the amniotic clade with the phylogenetic profile +++000 lists 542 orthologous groups that contain genes in human, mouse and dog, but have no orthologs in opossum, platypus and chicken.
In queries for simple 1:1 ortholog sets, 1 indicates that exactly one copy of this gene is present and – indicates that this particular species should not be considered. Thus, the profile 111––– yields 13 788 simple 1:1 ortholog sets that contain exactly one gene in human, dog and mouse, and any number of homologs in opossum, platypus or chicken.
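The profile symbols described above can be made concrete with a small matching function. This is only a sketch of the query semantics, not the web service's code. The species order of the profile (human, dog, mouse, opossum, platypus, chicken) and the per-group gene counts are assumptions for illustration, and both the ASCII hyphen and the en dash are accepted as the "ignore" symbol.

```python
# Assumed column order of an amniote phylogenetic profile.
SPECIES = ["human", "dog", "mouse", "opossum", "platypus", "chicken"]


def matches_profile(gene_counts, query, species=SPECIES):
    """Check a group's per-species gene counts against a profile query.

    '+' requires one or more genes, '0' requires none, '1' requires exactly
    one, and '-' (or an en dash) means the species is not considered.
    """
    if len(query) != len(species):
        raise ValueError("query length must equal the number of species")
    for symbol, name in zip(query, species):
        count = gene_counts.get(name, 0)
        if symbol == "+":
            if count < 1:
                return False
        elif symbol == "0":
            if count != 0:
                return False
        elif symbol == "1":
            if count != 1:
                return False
        elif symbol not in ("-", "\u2013"):
            raise ValueError(f"unknown profile symbol: {symbol!r}")
    return True


# Hypothetical groups, given as species -> number of genes.
eutherian_only = {"human": 2, "dog": 1, "mouse": 1}
with_chicken = {"human": 1, "dog": 1, "mouse": 1, "chicken": 3}

print(matches_profile(eutherian_only, "+++000"))  # presence/absence query
print(matches_profile(with_chicken, "111---"))    # 1:1 query ignoring outgroups
```

The first group satisfies `+++000` but not `111---` because it holds two human genes, whereas the second satisfies `111---` but not `+++000` because chicken genes are present.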
For each orthologous group and simple 1:1 ortholog set, multiple alignments and a phylogenetic tree may be displayed. A synteny viewer also allows an assessment of whether orthologs occur in regions of conserved synteny. Genes of particular interest can be located either by identifier or by genomic location. Computational biologists interested in performing large-scale analyses can download complete datasets from the download area.
OUTLOOK
OPTIC is designed to provide precalculated phylogenetic datasets that are of benefit to clade genomic analyses. Our approach complements other existing projects (2,7,11,12) in four respects: (i) we apply the pipeline to diverse, and not just experimental model, organisms; (ii) we define clades with respect to phylogenetic distances that are amenable to evolutionary rate analysis (roughly, where the number of synonymous substitutions per synonymous site is <2.0 (5)); (iii) our orthology relationships are inferred by considering all species equally, in a phylogenetic approach; and (iv) we use all exons across all alternative transcripts as opposed to the longest transcripts only. A particularly useful feature of OPTIC is its provision of multiple alignments either for genes as concatenated exons, or for alternative transcripts.
We plan to update gene predictions and orthology assignments and add more genomes and clades when they become available.
ACKNOWLEDGEMENTS
This study was funded by the Medical Research Council, UK. We are grateful to Leo Goodstadt for many helpful discussions. We would like to thank the various genome sequencing centers and ENSEMBL for making their genomic data and gene sets freely available for download. Funding to pay the Open Access publication charges for this article was provided by the Medical Research Council, UK.
Conflict of interest statement. None declared.
REFERENCES
1. Remm M, Storm CE, Sonnhammer EL. Automatic clustering of orthologs and in-paralogs from pairwise species comparisons. J. Mol. Biol. (2001) 314:1041–1052.
2. Hubbard TJP, Aken BL, Beal K, Ballester B, Caccamo M, Chen Y, Clarke L, Coates G, Cunningham F, et al. Ensembl 2007. Nucleic Acids Res. (2007) 35:D610–D617.
3. Slater GSC, Birney E. Automated generation of heuristics for biological sequence comparison. BMC Bioinformatics (2005) 6:31.
4. Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ. Gapped blast and psi-blast: a new generation of protein database search programs. Nucleic Acids Res. (1997) 25:3389–3402.
5. Goodstadt L, Ponting CP. Phylogenetic reconstruction of orthology, paralogy, and conserved synteny for dog and human. PLoS Comput. Biol. (2006) 2:e133.
6. Edgar RC. Muscle: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res. (2004) 32:1792–1797.
7. Li H, Coghlan A, Ruan J, Coin LJ, Heriche J, Osmotherly L, Li R, Liu T, Zhang Z, et al. Treefam: a curated database of phylogenetic trees of animal gene families. Nucleic Acids Res. (2006) 34:D572–D580.
8. Yang Z. Paml: a program package for phylogenetic analysis by maximum likelihood. Comput. Appl. Biosci. (1997) 13:555–556.
9. Crosby MA, Goodman JL, Strelets VB, Zhang P, Gelbart WM. Flybase: genomes by the dozen. Nucleic Acids Res. (2007) 35:D486–D491.
10. Bieri T, Blasiar D, Ozersky P, Antoshechkin I, Bastiani C, Canaran P, Chan J, Chen N, Chen WJ, et al. Wormbase: new content and better access. Nucleic Acids Res. (2007) 35:D506–D510.
11. O'Brien KP, Remm M, Sonnhammer ELL. Inparanoid: a comprehensive database of eukaryotic orthologs. Nucleic Acids Res. (2005) 33:D476–D480.
12. Huerta-Cepas J, Dopazo H, Dopazo J, Gabaldon T. The human phylome. Genome Biol. (2007) 8:R109.
13. Heger A, Ponting CP. Evolutionary rate analysis of orthologues and paralogues from twelve Drosophila genomes. Genome Res. (2007) in press.