
Nucleic Acids Research 2005 33(Database Issue):D216-D218; doi:10.1093/nar/gki007

Nucleic Acids Research, 2005, Vol. 33, Database issue D216-D218
© 2005, the authors
Nucleic Acids Research, Vol. 33, Database issue © Oxford University Press 2005; all rights reserved

ProtoNet 4.0: A hierarchical classification of one million protein sequences

Noam Kaplan*, Ori Sasson1, Uri Inbar1, Moriah Friedlich1, Menachem Fromer1, Hillel Fleischer1, Elon Portugaly1, Nathan Linial1 and Michal Linial

Department of Biological Chemistry, Institute of Life Sciences and 1 School of Computer Science and Engineering, The Hebrew University of Jerusalem, Israel

* To whom correspondence should be addressed at The Hebrew University, Department of Biological Chemistry, Givat Ram Campus, Jerusalem, Israel, 91904. Tel: +972 2 6585433; Fax: +972 2 6586448; Email: kaplann{at}cc.huji.ac.il

Received September 10, 2004; Accepted September 14, 2004


    ABSTRACT
 
ProtoNet is an automatic hierarchical classification of the protein sequence space. ProtoNet version 4.0 (2004) presents the analysis of over one million proteins merged from the SwissProt and TrEMBL databases. In addition to rich visualization and analysis tools for navigating the clustering hierarchy, we have incorporated several improvements that allow a simplified view of the scaffold of the proteins. We developed an unsupervised, biologically valid method that condenses the ProtoNet hierarchy to only 12% of its clusters. A large portion of these clusters was automatically assigned high-confidence biological names according to their correspondence with functional annotations. ProtoNet is available at: http://www.protonet.cs.huji.ac.il.


    INTRODUCTION
 
ProtoNet (1), launched in 2002, is an automatic hierarchical clustering of the SwissProt and TrEMBL (2) protein databases. The clustering process is based on an all-against-all BLAST (3) similarity search. The E-scores of these similarities drive a continuous bottom-up clustering process that joins the two most similar protein clusters at each step, resulting in a hierarchy of protein clusters at various degrees of biological granularity. This hierarchy is structured as a collection of trees, in which the root clusters contain all the proteins of the tree and the rest of the clusters represent subdivisions of the proteins into smaller groups. Several interesting biological insights about protein families and the evolution of structural and functional relations between proteins can be obtained by browsing this global hierarchical organization of the protein world (4). Furthermore, ProtoNet can be used to assess the function of novel protein sequences by finding the best matching cluster for the new sequence. In this paper, we describe several new developments in ProtoNet 4.0, including the increase in the number of sequences analyzed from 114 033 proteins in SwissProt (version 2.1) to 1 072 911 sequences (SwissProt and TrEMBL, version 4.0) and the improved methodology for simplifying the scaffold of the protein hierarchy. ProtoNet is available at: http://www.protonet.cs.huji.ac.il.
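The bottom-up merging step described above can be illustrated with a minimal sketch. It is not the ProtoNet implementation: the merging rule used here (the average of log10 E-scores over all cross-cluster pairs), the default E-score of 100 for unreported pairs, and all names such as `cluster_bottom_up` are assumptions made for illustration. The sketch also merges until a single root remains, whereas ProtoNet yields a collection of trees.

```python
import math
from itertools import combinations


def cluster_bottom_up(proteins, e_scores, max_e=100.0):
    """Greedy bottom-up clustering; returns the list of merges performed.

    proteins: list of protein identifiers.
    e_scores: dict mapping frozenset({a, b}) to a BLAST E-score.
    Pairs missing from e_scores are treated as having E-score max_e.
    """
    def distance(a, b):
        # Assumption: average log10 E-score over all cross-cluster pairs
        # (lower means more similar). E-scores are clamped to avoid log(0).
        total = 0.0
        for p in a:
            for q in b:
                e = e_scores.get(frozenset((p, q)), max_e)
                total += math.log10(max(e, 1e-300))
        return total / (len(a) * len(b))

    clusters = [frozenset([p]) for p in proteins]
    merges = []
    while len(clusters) > 1:
        # Join the two most similar clusters at each step.
        a, b = min(combinations(clusters, 2), key=lambda pair: distance(*pair))
        clusters = [c for c in clusters if c not in (a, b)]
        clusters.append(a | b)
        merges.append((a, b, a | b))
    return merges


# Toy input with hypothetical protein identifiers and E-scores.
e_scores = {
    frozenset(("P1", "P2")): 1e-50,
    frozenset(("P3", "P4")): 1e-30,
    frozenset(("P1", "P3")): 1e-5,
}
merges = cluster_bottom_up(["P1", "P2", "P3", "P4"], e_scores)
for left, right, merged in merges:
    print(sorted(left), "+", sorted(right), "->", sorted(merged))
```

Each recorded merge corresponds to one internal node of a binary tree. The exhaustive pairwise search is only practical for toy inputs; it is used here to keep the merging rule visible.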


    HIERARCHY CONDENSATION
 
Owing to the immense size of the ProtoNet hierarchy and the number of protein clusters it contains, navigating the full hierarchy is very difficult. Furthermore, many of the clusters are biologically irrelevant and uninteresting (e.g. huge root clusters that contain hundreds of thousands of unrelated proteins). In order to get a condensed yet biologically relevant view of the hierarchy, we searched for a parameter intrinsic to the clustering process that would indicate which clusters are biologically relevant. The parameter we found measures the stability of a cluster during the process, on the assumption that a stable cluster is also more biologically relevant. We found this assumption to hold: a small subset of clusters selected for high stability retains the biological validity of ProtoNet (N. Kaplan et al., manuscript in preparation).

The default condensation of the ProtoNet hierarchy retains 12% of the clusters. However, the ProtoNet website now offers an ‘advanced mode’, in which users can control the level of condensation and the method by which it is performed, yielding a larger or smaller set of clusters as required. Note that condensation changes the trees from binary to non-binary, and some browsing options have been developed accordingly (see ‘Website Enhancements’).
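Why condensation turns a binary tree into a non-binary one can be seen in a small sketch. The stability values, the cutoff of 0.5 and the function `condense` are hypothetical; the actual stability parameter is not defined in this paper. The sketch only shows the structural effect of dropping clusters: the retained descendants of a dropped cluster are attached to its nearest retained ancestor.

```python
def condense(children, root, keep):
    """Return a condensed child map containing only root and clusters in `keep`.

    children: dict mapping a cluster id to a list of its child cluster ids.
    Dropped clusters pass their retained descendants up to the nearest
    retained ancestor, so a node may end up with more than two children.
    """
    condensed = {}

    def retained_descendants(node):
        # Nearest retained clusters strictly below `node`.
        result = []
        for child in children.get(node, []):
            if child in keep:
                result.append(child)
                build(child)
            else:
                result.extend(retained_descendants(child))
        return result

    def build(node):
        condensed[node] = retained_descendants(node)

    build(root)
    return condensed


# Toy binary tree with hypothetical cluster ids and stability scores.
children = {"R": ["A", "B"], "A": ["C", "D"], "B": ["E", "F"]}
stability = {"A": 0.2, "B": 0.9, "C": 0.8, "D": 0.7, "E": 0.6, "F": 0.9}
keep = {c for c, s in stability.items() if s >= 0.5}  # assumed cutoff

condensed = condense(children, "R", keep)
print(condensed["R"])  # the root now has three children
```

In this toy tree the unstable cluster A is dropped, so its children C and D become direct children of the root alongside B.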


    DATABASE UPDATES
 
ProtoNet has gone through a major update of all database sources. Primarily, the protein database from which the ProtoNet tree is constructed has been updated and extended to include the TrEMBL protein database as well as SwissProt. This results in a leap from 114 033 to 1 072 911 protein sequences. Although TrEMBL is not manually validated by experts, it provides a much more extensive view of the protein world including whole genomes and thorough representation of several key organisms (see Table 1).


Table 1. Representation of selected species in ProtoNet

 

    CLUSTER NAMES
 
A major objective in bioinformatics is to assign a biological function to proteins. We have developed an automatic high-confidence method that assigns a functional annotation to ProtoNet protein clusters. The method finds the functional annotation from the InterPro (6), GO (5), SwissProt or ENZYME (7) databases that best fits the proteins of the cluster and assigns it a score reflecting how well it fits the cluster. If this score is above a certain threshold, the annotation is assigned as the cluster's name. Understandably, not all protein clusters have an existing annotation that fits them well. Clusters whose best-fitting annotation does not pass the threshold remain nameless, which possibly suggests a novel functional group or a cluster associated with mixed functions. By applying this method, we were able to assign biological names to 78% of the clusters that contain 20 proteins or more. Because a high threshold is used, the annotation can be assigned to the cluster with high confidence.
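The score-and-threshold naming scheme can be sketched as follows. The paper does not state the scoring function or the threshold value, so this sketch assumes a Jaccard index between the cluster and the set of proteins carrying each annotation, with an arbitrary threshold of 0.8; the function name and the toy data are likewise hypothetical.

```python
def name_cluster(cluster, annotations, threshold=0.8):
    """Return (annotation, score) for the best fit, or (None, score) if below threshold.

    cluster: set of protein ids.
    annotations: dict mapping an annotation term to the set of proteins carrying it.
    """
    best_term, best_score = None, 0.0
    for term, annotated in annotations.items():
        union = cluster | annotated
        # Assumed fit measure: Jaccard index of the two protein sets.
        score = len(cluster & annotated) / len(union) if union else 0.0
        if score > best_score:
            best_term, best_score = term, score
    if best_score >= threshold:
        return best_term, best_score
    return None, best_score


# Toy data with hypothetical protein ids and annotation terms.
annotations = {
    "kinase": {"P1", "P2", "P3", "P4", "P5", "P6"},
    "membrane": {"P1", "P9"},
}
named = name_cluster({"P1", "P2", "P3", "P4", "P5"}, annotations)
unnamed = name_cluster({"P7", "P8"}, annotations)
print(named[0], unnamed[0])
```

The first cluster shares five of six proteins with the "kinase" annotation and is named; the second matches no annotation and stays nameless, as described for clusters that fall below the threshold.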


    WEBSITE ENHANCEMENTS
 
Several enhancements have been made to the ProtoNet website in order to allow easier and more in-depth analysis of the ProtoNet trees.

Browsing cluster names
Cluster names are extremely useful for quickly browsing the ProtoNet trees, because they eliminate the need to check the proteins of each cluster in order to get an impression of the hierarchy. Furthermore, the assignment of a biological function to clusters suggests an easy scheme for assigning a function to proteins of unknown function: a protein can be assigned the functions of all clusters to which it belongs. This scheme can be used not only for each of the 1 072 911 proteins from which the ProtoNet hierarchy is constructed, but also for new protein sequences supplied by the user (using the ‘Classify your protein’ option on the website, which finds the most suitable cluster for a new protein sequence).
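The inheritance scheme amounts to walking from a protein's smallest cluster up to the root and collecting every cluster name encountered. A minimal sketch, assuming the hierarchy is available as a child-to-parent map; the cluster ids, names and the function `inherited_names` are hypothetical.

```python
def inherited_names(protein_cluster, parent, names):
    """Collect names of all named clusters from a protein's cluster up to the root."""
    found = []
    node = protein_cluster
    while node is not None:
        if node in names:
            found.append(names[node])
        node = parent.get(node)  # None once the root is passed
    return found


# Toy hierarchy: C1 is contained in C5, which is contained in the root C9.
parent = {"C1": "C5", "C5": "C9"}
names = {"C1": "Hexokinase", "C9": "Kinase"}  # C5 is unnamed

functions = inherited_names("C1", parent, names)
print(functions)
```

The names come out ordered from the most specific cluster to the most general one, and unnamed clusters along the path are skipped.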

Browsing the tree
Subtree view
In order to cope with the change to a non-binary tree, we have introduced the ProtoBrowser (Figure 1), which shows the tree in the vicinity of the cluster that is being displayed. Instead of presenting only the branch of the tree to which the cluster belongs, the new display allows easy navigation to neighboring clusters and an enhanced global overview of biological protein families.



 
Figure 1. The ProtoBrowser allows viewing the near vicinity of a cluster in the ProtoNet hierarchy. Blue triangle-shaped icons represent protein clusters. The cluster currently being viewed is the cluster A268586, which appears in the center in red. Clusters that include proteins with 3D solved structures are marked PDB.

 
Functionality view
PANDORA (8) is a web-based tool that allows an in-depth biological analysis of large protein sets. It is a natural choice when trying to biologically interpret large protein clusters that contain hundreds of proteins. ProtoNet now offers a direct link from its cluster page to PANDORA, providing the ability to easily understand what biological groups the cluster is built from and analyze them from several different biological aspects.

Similarity view
When viewing a protein cluster, it is sometimes helpful to obtain an in-depth look into the sequence similarity between the proteins of the cluster. This could allow the user to identify if there is a natural partitioning into subgroups or if the inner similarity of the cluster is uniform. In order to address this, the website offers a cluster similarity matrix (Figure 2), which shows an all-against-all color-coded matrix of all protein pairs in the cluster, colored according to the BLAST E-score between the two proteins. This also facilitates access to the BLAST result, which can be obtained simply by clicking on the appropriate cell in the matrix.
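How such a color-coded matrix might be derived from pairwise E-scores is sketched below. The number of shades, the logarithmic binning, the lower clamp of 1e-100 and the treatment of the diagonal as maximally similar are all assumptions; only the two endpoints (no color for E-scores above 100, darkest color for E-scores close to 0) follow the description of the matrix.

```python
import math


def shade(e_score, levels=5, max_e=100.0, min_e=1e-100):
    """Map a BLAST E-score to a shade from 0 (white) to `levels` (darkest)."""
    if e_score is None or e_score > max_e:
        return 0  # no reported similarity
    e = max(e_score, min_e)  # clamp so that log10 is defined for E = 0
    span = math.log10(max_e) - math.log10(min_e)
    fraction = (math.log10(max_e) - math.log10(e)) / span
    return min(levels, max(1, math.ceil(fraction * levels)))


def similarity_matrix(proteins, e_scores):
    """Build an all-against-all matrix of shade levels for one cluster."""
    return [
        [
            # Assumption: a protein is treated as maximally similar to itself.
            levels_for_self if p == q else shade(e_scores.get(frozenset((p, q))))
            for q in proteins
        ]
        for p in proteins
    ]


levels_for_self = shade(0.0)

# Toy cluster with hypothetical ids: P1 and P2 are similar, P3 matches nothing.
matrix = similarity_matrix(["P1", "P2", "P3"], {frozenset(("P1", "P2")): 1e-100})
for row in matrix:
    print(row)
```

A block of dark cells along the diagonal, as for P1 and P2 here, is what would suggest a natural subgroup within the cluster.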



 
Figure 2. Example of a cluster similarity matrix. Colored cells represent different degrees of similarity, ranging from white (no similarity: BLAST E-score higher than 100) to dark blue (high similarity, BLAST E-score close to 0). It is evident that the cluster A222801 is roughly divided into 3 subsets: in the upper left of the diagonal there are proteins that show no similarity to each other or to any protein in the cluster; in the center of the diagonal there is a subset of proteins that are similar to each other but to no other proteins; and at the bottom right of the diagonal there is another subset of proteins that are similar to each other but not to other proteins.

 
Maintenance and updating
The ProtoNet source databases are generally updated twice a year. The next ProtoNet release is planned to include the UniProt Ref 100 protein database. Other future plans include allowing the users to select a subset of proteins from ProtoNet according to their needs (e.g. selecting for study only the proteins from the SwissProt database or only the proteins of a specific species) and expanding links to further biological databases such as OMIM (9) and DIP (10).


    ACKNOWLEDGEMENTS
 
We thank the entire current and previous ProtoNet teams for their endless support. Special thanks to Alex Savenok for the web design as well as for the development of the visualization tools. We acknowledge fellowship support from the Sudarsky Center for Computational Biology (SCCB) to N.K., U.I., M.F. and M.F. This study is partially supported by the EU NoE BioSapiens consortium and the CESG consortium supported by the NIH.


    Notes
 
The online version of this article has been published under an open access model. Users are entitled to use, reproduce, disseminate, or display the open access version of this article for non-commercial purposes provided that: the original authorship is properly and fully attributed; the Journal and Oxford University Press are attributed as the original place of publication with the correct citation details given; if an article is subsequently reproduced or disseminated not in its entirety but only in part or as a derivative work this must be clearly indicated. For commercial re-use permissions, please contact journals.permissions{at}oupjournals.org.


    REFERENCES
 

  1. Sasson,O., Vaaknin,A., Fleischer,H., Portugaly,E., Bilu,Y., Linial,N. and Linial,M. (2003) ProtoNet: hierarchical classification of the protein space. Nucleic Acids Res., 31, 348–352.

  2. Boeckmann,B., Bairoch,A., Apweiler,R., Blatter,M.C., Estreicher,A., Gasteiger,E., Martin,M.J., Michoud,K., O'Donovan,C., Phan,I. et al. (2003) The SWISS-PROT protein knowledgebase and its supplement TrEMBL in 2003. Nucleic Acids Res., 31, 365–370.

  3. Altschul,S.F., Madden,T.L., Schaffer,A.A., Zhang,J., Zhang,Z., Miller,W. and Lipman,D.J. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res., 25, 3389–3402.

  4. Shachar,O. and Linial,M. (2004) A robust method to detect structural and functional remote homologues. Proteins, in press.

  5. Camon,E., Magrane,M., Barrell,D., Lee,V., Dimmer,E., Maslen,J., Binns,D., Harte,N., Lopez,R. and Apweiler,R. (2004) The Gene Ontology Annotation (GOA) Database: sharing knowledge in Uniprot with Gene Ontology. Nucleic Acids Res., 32 (Database issue), D262–266.

  6. Mulder,N.J., Apweiler,R., Attwood,T.K., Bairoch,A., Barrell,D., Bateman,A., Binns,D., Biswas,M., Bradley,P., Bork,P. et al. (2003) The InterPro Database, 2003 brings increased coverage and new features. Nucleic Acids Res., 31, 315–318.

  7. Bairoch,A. (2000) The ENZYME database in 2000. Nucleic Acids Res., 28, 304–305.

  8. Kaplan,N., Vaaknin,A. and Linial,M. (2003) PANDORA: keyword-based analysis of proteins sets by integration of annotation sources. Nucleic Acids Res., 31, 5617–5626.

  9. McKusick,V.A. (1998) Mendelian Inheritance in Man. Johns Hopkins University Press, Baltimore.

  10. Salwinski,L., Miller,C.S., Smith,A.J., Pettit,F.K., Bowie,J.U. and Eisenberg,D. (2004) The database of interacting proteins: 2004 update. Nucleic Acids Res., 32 (Database issue), D449–451.





