Nucleic Acids Research Advance Access originally published online on November 30, 2007
Nucleic Acids Research 2008 36(Database issue):D263-D266; doi:10.1093/nar/gkm1020

Nucleic Acids Research, 2008, Vol. 36, Database issue D263-D266
© 2007 The Author(s)
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

This article appears in the following Nucleic Acids Research issue: Database issue

Articles

InParanoid 6: eukaryotic ortholog clusters with inparalogs

Ann-Charlotte Berglund1, Erik Sjölund2, Gabriel Östlund2 and Erik L. L. Sonnhammer2,*

1Linnaeus Centre for Bioinformatics, Uppsala University, BMC Box 598, 75124, Uppsala and 2Stockholm Bioinformatics Center, Albanova, Stockholm University, SE-10691 Stockholm, Sweden

*To whom correspondence should be addressed. Tel: +46 8 55378567; Fax: +46 8 55378214; Email: Erik.Sonnhammer{at}sbc.su.se

Received September 15, 2007. Revised October 23, 2007. Accepted October 27, 2007.


    ABSTRACT
 
The InParanoid eukaryotic ortholog database (http://InParanoid.sbc.su.se/) has been updated to version 6 and is now based on 35 species. We collected all available ‘complete’ eukaryotic proteomes and Escherichia coli, and calculated ortholog groups for all 595 species pairs using the InParanoid program. This resulted in 2 642 187 pairwise ortholog groups in total. The orthology-based species relations are presented in an orthophylogram. InParanoid clusters contain one or more orthologs from each of the two species. Multiple orthologs in the same species, i.e. inparalogs, result from gene duplications after the species divergence. A new InParanoid website has been developed which is optimized for speed both for users and for updating the system. The XML output format has been improved for efficient processing of the InParanoid ortholog clusters.


    INTRODUCTION
 
Many analyses in comparative genomics depend on correct mapping of orthologs between species. Orthologs are defined as genes in different species deriving from a single gene in the last common ancestor (1), and are therefore likely to have the same function. If an ortholog undergoes duplication in one species, the copies are referred to as inparalogs (2). Inparalogs are by definition co-orthologs to one or more orthologs in another species. In contrast, two genes deriving from a duplication that predated the speciation event between the species are referred to as outparalogs. The InParanoid program was developed to identify clusters of inparalogs while avoiding inclusion of outparalogs.
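
The distinction between inparalogs and outparalogs depends only on whether the duplication happened after or before the speciation event. A minimal sketch of that rule follows; the function name and the use of explicit event ages are illustrative assumptions, and InParanoid itself infers these relations from sequence similarity scores rather than from dated events.

```python
def classify_paralogs(duplication_age, speciation_age):
    """Classify two paralogs relative to a speciation event.

    Both ages are hypothetical values in arbitrary time units before
    the present, so a larger number means an older event.
    """
    # A duplication younger than the speciation happened within one
    # lineage after the split, so the copies are inparalogs.
    if duplication_age < speciation_age:
        return "inparalogs"
    # Otherwise the duplication predated the split: outparalogs.
    return "outparalogs"


recent = classify_paralogs(duplication_age=10, speciation_age=50)
ancient = classify_paralogs(duplication_age=80, speciation_age=50)
print(recent, ancient)
```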

InParanoid was one of the first comprehensive ortholog databases (3,4), but more than 15 different ortholog databases now exist (5). One reason for this multitude is that different research questions have different needs. For instance, the COGs database (6) contains very large clusters of orthologs that often include outparalogs (7). At the other extreme, the Homologene database (8) often places inparalogs in different clusters. For some applications, one extreme or the other may be appropriate. However, the average user is normally interested in simply finding all orthologs in species Y to a gene in species X, including all inparalogs but excluding outparalogs. InParanoid was developed to optimally serve this type of user.

Two benchmarks have recently been published that try to objectively assess the quality of different ortholog databases (9,10). In both these tests, which look either at accuracy of functional annotation or at inferred accuracy, InParanoid was top ranked. This suggests that InParanoid is successful at balancing the false-negative and false-positive rate, and is appropriate as a general-purpose orthology tool.

The InParanoid program has been upgraded to version 2.0. This release fixes a number of minor bugs that could lead to incorrect cluster merging or bootstrap values. These problems were, however, rare.

We here present InParanoid 6, comprising 34 eukaryotic species and one prokaryotic outgroup. The website has been completely reconstructed and has new front- and back-ends, yet looks very similar to the old site. The new design makes it much faster for the user, and allows easier updating of the system. With the new back-end, we can handle much larger datasets in the future without performance problems.


    DATA AND IMPLEMENTATION
 
The data were gathered from three different sources: Ensembl, NCBI and model organism databases (MODs). We only considered eukaryotic genomes sequenced to a coverage greater than 6X, with <1% unknown amino acids (X in the protein sequences). Most MOD data were packaged and uploaded to us directly by the staff at TAIR, WormBase, FlyBase, ZFIN, dictyBase, SGD and MGI, but the data for three MODs were downloaded from their repositories. Before running InParanoid, each proteome was made non-redundant by keeping only the longest transcript from each gene. If this is not done first, different transcripts from the same gene can end up in different clusters if they exist in more than one species. Below we list only the non-redundant number of proteins for each species.
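
The redundancy filter described above can be sketched as follows. This is only an illustration, assuming each record is a (gene ID, transcript ID, protein sequence) tuple; the record layout, the function name and the example sequences are hypothetical and are not taken from the InParanoid pipeline.

```python
def longest_transcript_per_gene(records):
    """Keep one protein per gene: the longest translated transcript.

    records: iterable of (gene_id, transcript_id, protein_sequence).
    Returns a dict mapping gene_id -> (transcript_id, protein_sequence).
    """
    best = {}
    for gene_id, transcript_id, seq in records:
        current = best.get(gene_id)
        # Strict '>' keeps the first transcript seen when lengths tie,
        # which makes the result deterministic for a given input order.
        if current is None or len(seq) > len(current[1]):
            best[gene_id] = (transcript_id, seq)
    return best


# Hypothetical example: geneA has two transcripts, geneB has one.
records = [
    ("geneA", "A-001", "MKTAYIAKQR"),
    ("geneA", "A-002", "MKTAYIAKQRQISFVK"),
    ("geneB", "B-001", "MSLLT"),
]
nonredundant = longest_transcript_per_gene(records)
for gene_id, (transcript_id, seq) in nonredundant.items():
    print(gene_id, transcript_id, len(seq))
```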

Nine of the proteomes were uploaded to us by MOD staff. Together, we have formed an informal consortium of MODs that want to cross-reference each other using orthologs from InParanoid. We particularly welcome this system as it allows us to use the most complete and recent set of proteins for each organism, and ensures that we use identifiers that work in the MODs so that web links to proteins are valid. We hope that more MODs will join in and provide their proteomes in a new and robust XML format that will be introduced for the next release.

From Ensembl, data was obtained for Aedes aegypti (transcripts for 15 419 genes), Anopheles gambiae (13 277), Apis mellifera (13 448), Bos taurus (22 280), Canis familiaris (19 314), Ciona intestinalis (14 278), Gallus gallus (16 715), Gasterosteus aculeatus (20 879), Homo sapiens (22 983), Macaca mulatta (22 045), Monodelphis domestica (19 597), Pan troglodytes (20 982), Rattus norvegicus (23 299), Takifugu rubripes (22 008), Tetraodon nigroviridis (28 005) and Xenopus tropicalis (18 473). Apis mellifera was taken from Ensembl release 38 and all other proteomes from release 43.

From NCBI, we obtained Candida glabrata (5192), Cryptococcus neoformans (6487), Debaryomyces hansenii (6318), Entamoeba histolytica (9772), Escherichia coli K12 (4243), Kluyveromyces lactis (5336) and Yarrowia lipolytica (6544).

The MODs uploaded proteomes for Arabidopsis thaliana (26 819), Caenorhabditis briggsae (19 334), Caenorhabditis elegans (20 084), Caenorhabditis remanei (25 595), Danio rerio (12 303), Dictyostelium discoideum (13 523), Drosophila melanogaster (13 854), Mus musculus (23 132) and Saccharomyces cerevisiae (5792). From other MODs we obtained Oryza sativa (77 853) (from http://www.gramene.org), Drosophila pseudoobscura (9871) (from http://www.flybase.org) and Schizosaccharomyces pombe (5003) (from http://www.sanger.ac.uk).

InParanoid clustering
NCBI–Blast comparisons using these datasets were performed between each pair of species, involving four whole proteome runs per species pair (normal runs both ways plus two self-self runs). For the 35 proteomes this amounts to 595 species pairs, requiring 1225 whole-proteome Blast searches. These were executed on the SBC compute cluster comprising about 300 Linux nodes. The pairwise Blast results were used as the input for the InParanoid ortholog clustering procedure (3).
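
The run counts follow from simple combinatorics. Each species pair uses four runs, but the self-self run of a proteome is shared by every pair that proteome takes part in, so the total is two cross-species runs per pair plus one self-self run per proteome. The short check below reproduces the figures quoted above; the function name is an assumption made for illustration.

```python
from math import comb


def blast_run_counts(n_proteomes):
    """Return (species pairs, whole-proteome Blast runs needed)."""
    pairs = comb(n_proteomes, 2)      # unordered species pairs
    cross_runs = 2 * pairs            # A vs B and B vs A for each pair
    self_runs = n_proteomes           # one A vs A run, reused across pairs
    return pairs, cross_runs + self_runs


pairs, runs = blast_run_counts(35)
print(pairs, runs)  # 595 1225
```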

The output from InParanoid 6 is available as XML, SQL, HTML and native format for downloading at the InParanoid homepage, and is searchable via the web interface. The XML format was defined in the RELAX NG schema language.


    INPARANOID CONTENT
 
The 35 species present in the InParanoid database result in 595 pairwise ortholog lists. The information in these lists was used to generate a phylogenetic tree that reflects the level of orthology between the different species. We calculated the orthology distance from species A to B, dAB, by

[Formula: image not reproduced in this text version]
and used the average orthology distances (dAB + dBA)/2 to construct a UPGMA tree, shown in Figure 1. This so-called ‘orthophylogram’ shows quantitatively the level of orthology between different clades. In general, it agrees with the standard taxonomic species tree, but we noted a few exceptions. Opossum (M. domestica), a marsupial mammal, is clustered together with placental mammals, and the zebrafish D. rerio clustered as an outgroup to the land animals rather than together with other fish. The latter anomaly is very minor as all fish are still neighbors in the tree, but the placement of opossum is surprising. If this placement is correct, then marsupials could have evolved from a particular lineage of placental mammals. Another difference is found in the yeast clade. In the taxonomy, K. lactis, S. cerevisiae and D. hansenii are clustered together, while C. glabrata is placed outside this group. These are arranged differently in the orthophylogram in that C. glabrata has traded places with D. hansenii, which is now placed as an outgroup to K. lactis, S. cerevisiae and C. glabrata. InParanoid's grouping is supported by 25S rDNA sequences (11). Surprisingly, the green plants are placed as a subgroup among single-cell organisms, next to the fungal group.
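
The tree construction can be sketched in a few lines. Because the distance formula is not reproduced here, the sketch assumes that dAB is one minus the fraction of proteins in A that have an ortholog in B; that definition is an assumption, chosen to be consistent with the Figure 1 caption. The human and chimpanzee fractions are the ones quoted in this article, whereas the mouse fractions are invented purely to give the example a third leaf.

```python
from itertools import combinations


def symmetric_distance(frac_ab, frac_ba):
    """Average of the two directed distances (dAB + dBA)/2.

    ASSUMPTION: dAB = 1 - fraction of proteins in A with an ortholog in B.
    """
    return ((1.0 - frac_ab) + (1.0 - frac_ba)) / 2.0


def upgma(names, dist):
    """Plain UPGMA clustering.

    dist maps frozenset({a, b}) -> distance for every pair of names.
    Returns the tree as nested tuples.
    """
    clusters = {name: 1 for name in names}  # cluster label -> leaf count
    dist = dict(dist)
    while len(clusters) > 1:
        # Merge the closest pair of current clusters.
        a, b = min(combinations(clusters, 2), key=lambda p: dist[frozenset(p)])
        size_a, size_b = clusters.pop(a), clusters.pop(b)
        merged = (a, b)
        for other in clusters:
            # Size-weighted average keeps every leaf equally weighted.
            d_a = dist[frozenset((a, other))]
            d_b = dist[frozenset((b, other))]
            dist[frozenset((merged, other))] = (
                (d_a * size_a + d_b * size_b) / (size_a + size_b)
            )
        clusters[merged] = size_a + size_b
    return next(iter(clusters))


names = ["human", "chimp", "mouse"]
dist = {
    # 88.4% and 94% are the human and chimpanzee figures from the text.
    frozenset(("human", "chimp")): symmetric_distance(0.884, 0.94),
    # HYPOTHETICAL values, for illustration only.
    frozenset(("human", "mouse")): symmetric_distance(0.76, 0.74),
    frozenset(("chimp", "mouse")): symmetric_distance(0.74, 0.72),
}
tree = upgma(names, dist)
print(tree)
```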


Figure 1. Orthophylogram of all 35 species in InParanoid 6. This UPGMA tree is based on the average fraction of orthologs between species. For instance, on average 91.2% of the proteins in H. sapiens and P. troglodytes are orthologous. The tree topology generally corresponds to the standard taxonomy, but a few exceptions were noted (see text).

 
It is worth noting that on average only 91.2% of the proteins in H. sapiens and chimpanzee P. troglodytes are orthologous. The individual figures are 88.4% for human and 94% for chimpanzee. This is surprisingly low since the genome-wide nucleotide divergence between human and chimpanzee is estimated to be only 1.23% (12). The much higher difference observed for orthologs is not due to unique proteins in either proteome, as the fraction of homologs reported by InParanoid is 96.7% for human and 99.3% for chimpanzee. Rather, it reflects that the sequences were too divergent to be considered orthologs. This is, however, often caused by incomplete sequencing or errors in gene annotation.

The average number of inparalogs per cluster ranges from 1.001 (in Drosophila pseudoobscura when compared to D. melanogaster) to 7.160 (in O. sativa when compared to D. rerio). The overall mean number of inparalogs per species was 1.54, and the median was 1.25. The distribution of cluster sizes is shown in Figure 2. The highly duplicated genome of O. sativa is responsible for all average cluster sizes above four, and generates a separate peak in the distribution around five. In fact, O. sativa had on average more than four inparalogs per ortholog group when compared to every non-plant species. It is surprising that the average number of inparalogs in O. sativa was so high when compared with D. rerio; when compared with D. rerio's phylogenetic neighbors the number was only around five. Although the rice proteome clearly contains the largest number of genes, our figures are probably somewhat overestimated. Evidence for this is that we were not able to find shared gene identifiers between any rice proteins in the MOD, so alternative transcripts of the same gene may not have been collapsed. This problem will be resolved in the future by collaborating directly with the rice MOD staff to get a better-annotated rice proteome.
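
The statistics above can be computed from the pairwise cluster lists roughly as follows. The cluster representation (a dict from species name to a list of protein identifiers), the species labels X and Y and the protein identifiers are all hypothetical; this is a sketch of the calculation, not the code used to produce the published numbers.

```python
from statistics import mean, median


def mean_inparalogs_per_cluster(clusters, species):
    """Average number of proteins that `species` contributes per cluster."""
    return mean(len(cluster[species]) for cluster in clusters)


# Hypothetical ortholog clusters for one species pair, X versus Y.
clusters = [
    {"X": ["x1"], "Y": ["y1", "y2", "y3"]},
    {"X": ["x2", "x3"], "Y": ["y4"]},
    {"X": ["x4"], "Y": ["y5"]},
    {"X": ["x5"], "Y": ["y6"]},
]

avg_x = mean_inparalogs_per_cluster(clusters, "X")  # 5 proteins / 4 clusters
avg_y = mean_inparalogs_per_cluster(clusters, "Y")  # 6 proteins / 4 clusters

# One value per species per comparison; the overall mean and median are
# then taken over all such values from all species pairs.
per_species_averages = [avg_x, avg_y]
overall_mean = mean(per_species_averages)
overall_median = median(per_species_averages)
print(avg_x, avg_y, overall_mean, overall_median)
```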


Figure 2. Histogram of average number of inparalogs/cluster per species for all species–species comparisons in InParanoid 6. The peak around five inparalogs/cluster is entirely caused by O. sativa, rice.

 

    DATA AVAILABILITY
 
The InParanoid database is freely available at http://inparanoid.sbc.su.se. In addition to the data that can be searched and browsed using the web interface, FASTA files containing all proteins, protein description files, and ortholog tables in raw, SQL and XML format are available for each pairwise InParanoid analysis. The InParanoid program is freely available upon request to inparanoid{at}sbc.su.se.


    ACKNOWLEDGEMENTS
 
We thank Tomas Ohlson for database assistance and all the MOD staff that have provided their data. This study was funded by grants from Stockholm University, Royal Institute of Technology, Pharmacia, and the Knut and Alice Wallenberg Foundation.

Funding to pay the Open Access publication charges for this article was provided by Stockholm University.

Conflict of interest statement. None declared.


    REFERENCES
 

  1. Fitch WM. Distinguishing homologous from analogous proteins. Syst. Zool. (1970) 19:99–113.

  2. Sonnhammer ELL, Koonin EV. Orthology, paralogy and proposed classification for paralog subtypes. Trends Genet. (2002) 18:619–620.

  3. Remm M, Storm CEV, Sonnhammer ELL. Automatic clustering of orthologs and in-paralogs from pairwise species comparisons. J. Mol. Biol. (2001) 314:1041–1052.

  4. O'Brien KP, Remm M, Sonnhammer ELL. Inparanoid: a comprehensive database of eukaryotic orthologs. Nucleic Acids Res. (2005) 33:D476–D480.

  5. Alexeyenko A, Lindberg J, Perez-Bercoff A, Sonnhammer ELL. Overview and comparison of ortholog databases. Drug Discovery Today: Technol. (2006) 3:137–143.

  6. Tatusov RL, Fedorova ND, Jackson JD, Jacobs AR, Kiryutin B, Koonin EV, Krylov DM, Mazumder R, Mekhedov SL, et al. The COG database: an updated version includes eukaryotes. BMC Bioinformatics (2003) 4:41.

  7. Dessimoz C, Boeckmann B, Roth AC, Gonnet GH. Detecting non-orthology in the COGs database and other approaches grouping orthologs using genome-specific best hits. Nucleic Acids Res. (2006) 34:3309–3316.

  8. Wheeler DL, Barrett T, Benson DA, Bryant SH, Canese K, Chetvernin V, Church DM, DiCuccio M, Edgar R, et al. Database resources of the National Center for Biotechnology Information. Nucleic Acids Res. (2007) 35:D5–D12.

  9. Hulsen T, Huynen MA, de Vlieg J, Groenen PM. Benchmarking ortholog identification methods using functional genomics data. Genome Biol. (2006) 7:R31.

  10. Chen F, Mackey AJ, Vermunt JK, Roos DS. Assessing performance of orthology detection strategies applied to eukaryotic genomes. PLoS ONE (2007) 2:e383.

  11. Dujon B, Sherman D, Fischer G, Durrens P, Casaregola S, Lafontaine I, De Montigny J, Marck C, Neuvéglise C, et al. Genome evolution in yeasts. Nature (2004) 430:35–44.

  12. Chimpanzee Sequencing and Analysis Consortium. Initial sequence of the chimpanzee genome and comparison with the human genome. Nature (2005) 437:69–87.




