Nucleic Acids Research, 2004, Vol. 32, Database issue D122-D124
© 2004 Oxford University Press

DBSubLoc: database of protein subcellular localization

Tao Guo, Sujun Hua, Xinglai Ji and Zhirong Sun*

Institute of Bioinformatics, Department of Biological Sciences and Biotechnology, Tsinghua University, Beijing 100084, China

*To whom correspondence should be addressed. Tel/Fax: +86 10 62772237; Email: sunzhr{at}mail.tsinghua.edu.cn

Received August 14, 2003; Revised September 29, 2003; Accepted October 14, 2003


    ABSTRACT
We have built a protein subcellular localization annotation database, the DBSubLoc database, which is available at http://www.bioinfo.tsinghua.edu.cn/dbsubloc.html. Annotations were taken from primary protein databases, model organism genome projects and literature texts, and were then analyzed to extract the subcellular localization features of the proteins. The proteins are also classified into different categories. Based on sequence alignment, non-redundant subsets of the database have been built, which may provide useful information for subcellular localization prediction. The database now contains >60 000 protein sequences including ~30 000 protein sequences in the non-redundant data sets. Online download, search and Blast tools are also available.


    INTRODUCTION
Subcellular localization is one of the key features of a protein, since it is closely related to biological function (1). During translation or later, proteins will be transported into different regions such as cytoplasm, membrane system, nuclear region, mitochondrion, etc., or may be secreted out of the cell. As high-throughput genome sequencing projects have produced an enormous amount of raw protein sequence data, it is very important to annotate their functional features, including subcellular localization.

Most known protein subcellular localizations have been determined by experimental methods, and some others can be inferred from very high sequence similarity. Several bioinformatics methods have now been developed to predict the subcellular location of proteins, making use of either sorting signals (2) or the amino acid composition of the sequences (3–9). Two factors affect the prediction. One is the number of protein sequences available for training the artificial intelligence method; the other is the number of target subcellular locations covered in the data set, which determines the capability of the prediction method.
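As an illustration of the composition-based approach mentioned above, the following minimal sketch computes the amino acid composition of a sequence as a 20-element vector, the kind of feature such predictors typically take as input. The function name and the example sequence are invented for illustration and are not taken from any of the cited methods.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def amino_acid_composition(sequence: str) -> list[float]:
    """Return the fraction of each standard residue, in AMINO_ACIDS order."""
    seq = sequence.upper()
    # Non-standard symbols (e.g. X, B, Z) are ignored in this sketch.
    counts = Counter(ch for ch in seq if ch in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        return [0.0] * len(AMINO_ACIDS)
    return [counts[aa] / total for aa in AMINO_ACIDS]


# Hypothetical toy sequence, for demonstration only.
vector = amino_acid_composition("MKKLLA")
```

A classifier would then be trained on such vectors, one per protein, with the annotated subcellular location as the label.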

We have built the DBSubLoc database to collect and manage information related to subcellular localization, and to make it into an integrated platform to improve both the amount and the quality of the data sets, which may provide useful information for prediction methods, and also for research into functional relations of proteins. Sequence analyses were also performed to produce high-quality non-redundant sub-datasets.

The DBSubLoc database has been built on annotations from primary protein sequence databases: Swiss-Prot/TrEMBL (10) and the Protein Information Resource (PIR) (11). Annotations were also provided by model organism genome projects such as SGD (Saccharomyces cerevisiae) (12), TAIR (Arabidopsis thaliana) (13), FlyBase (Drosophila melanogaster) (14) and MGD (Mus musculus) (15). We selected only full-length and unambiguous proteins to build the DBSubLoc database. Repetitive sequences and short sequences of <20 amino acids were excluded.

In the selected sequences, there are two types of subcellular location annotation. One type is written in natural language, which is easy for humans to understand but hard for programs to process; the other is cross-referenced to the Gene Ontology (GO) term database (16), which is ideal for further processing. Most of the natural-language annotations were converted to specific GO cellular component terms by automatic keyword recognition or manual identification. For annotations that give very complex descriptions of the subcellular localization features, or that describe proteins localized to multiple cellular components, the GO cross-references were determined manually. Some subcellular localization features were not determined directly by experimental methods; these low-quality annotations, marked with ‘by similarity’, ‘probable’ or ‘potential’, are also collected in the DBSubLoc database, but most of them are eliminated from the non-redundant data sets. As a result, each protein entry in DBSubLoc is cross-referenced to at least one GO cellular component term that indicates its subcellular location. The proteins annotated in model organism genomes are cross-referenced to the NCBI Taxonomy database (http://www.ncbi.nlm.nih.gov/Taxonomy/tax.html), and are also categorized by taxonomic class (e.g. virus, archaea, bacteria, eukaryote).
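The keyword recognition step can be pictured with a small sketch like the one below. It is not the procedure actually used to build DBSubLoc: the keyword table, the function name and the example annotations are assumptions made for illustration. Only the three low-quality markers and the use of GO cellular component terms come from the description above.

```python
import re

# Hypothetical keyword table (an assumption of this sketch). The identifiers
# are standard GO cellular component terms.
KEYWORD_PATTERNS = [
    (r"cytoplasm", "GO:0005737"),               # cytoplasm
    (r"\bnucle(us|ar)\b", "GO:0005634"),        # nucleus
    (r"mitochondri", "GO:0005739"),             # mitochondrion
    (r"\bsecreted\b|extracellular", "GO:0005576"),  # extracellular region
    (r"plasma membrane", "GO:0005886"),         # plasma membrane
]

# Qualifiers that mark annotations not determined directly by experiment.
LOW_QUALITY_MARKERS = ("by similarity", "probable", "potential")


def annotate(text: str) -> tuple[list[str], bool]:
    """Map a free-text location annotation to GO terms and a low-quality flag."""
    lowered = text.lower()
    go_terms = [go for pattern, go in KEYWORD_PATTERNS if re.search(pattern, lowered)]
    low_quality = any(marker in lowered for marker in LOW_QUALITY_MARKERS)
    return go_terms, low_quality


# Invented example annotations.
single = annotate("Nuclear (By similarity).")
multiple = annotate("Cytoplasmic and nuclear.")
unresolved = annotate("Integral membrane protein.")
```

In this sketch an empty term list, as for the third example, would mark an annotation that needs manual identification, and an annotation matching several terms would be a candidate for manual review.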

Based on the DBSubLoc database, we have also provided subsets composed of non-redundant protein sequences for each taxonomic class. Using Blast, protein sequences were compared with each other and grouped according to their sequence similarity. In each non-redundant subset, the sequence similarity between any two protein entries is <60%.
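One common way to obtain such a subset is a greedy filter that keeps an entry only if it is below the threshold against every entry already kept. The sketch below illustrates that idea under loud assumptions: the grouping procedure actually used for DBSubLoc is not specified beyond the use of Blast, and here Blast is replaced by Python's `difflib.SequenceMatcher` ratio purely as a stand-in similarity score. The entry identifiers and sequences are invented.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Stand-in similarity score in [0, 1]; NOT a Blast alignment score."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def non_redundant(entries, threshold: float = 0.60):
    """Greedily keep entries whose similarity to every kept entry is below threshold."""
    kept = []
    for entry_id, seq in entries:
        if all(similarity(seq, kept_seq) < threshold for _, kept_seq in kept):
            kept.append((entry_id, seq))
    return kept


# Invented toy entries: "b" differs from "a" by a single residue.
entries = [
    ("a", "MKTAYIAKQRQISFVKSHFSRQ"),
    ("b", "MKTAYIAKQRQISFVKSHFSRA"),
    ("c", "GGPLLWWDDEEGGPLLWWDDEE"),
]
kept_ids = [entry_id for entry_id, _ in non_redundant(entries)]
```

The result of a greedy filter depends on input order, so in practice entries might be sorted first, for example to prefer experimentally determined annotations over those marked ‘by similarity’.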


    DATABASE CONTENT
In the DBSubLoc database, each entry is composed of several records that describe one protein. Each entry contains the following information: the unique integer identity of the entry in the database; the name and text description of the protein; the taxonomic name of its source organism; the text annotations of its subcellular location; the amino acid sequence; and cross-references to other databases. Each cross-reference record indicates one link to another database, such as Swiss-Prot, the GO term database, the NCBI Taxonomy database or PubMed. The cross-reference record provides the referenced database name and the unique identifier in that database. As the database grows, more cross-references will be appended to existing entries.

The DBSubLoc database and the non-redundant sub-datasets are released as plain text files. The format is similar to that of a Swiss-Prot data file. Each line in the file is one record of an entry in the ‘KEY VALUE’ format; for example, the ‘ID 10000001’ record means that the unique identity of this entry is 10000001. Cross-reference records begin with a ‘CX’ key, and each value contains one cross-reference in the ‘Reference Database: Reference ID’ format; for example, the ‘CX GO: 0005737’ record means that the protein entry is linked to entry 0005737 of the GO term database. The sequence of the protein may be spread over several records with the ‘SQ’ key. A detailed description of the format can be found on the web page. Tables 1 and 2 give brief statistical information on the full data set and the non-redundant data sets, listing the number of entries in each. More statistical information is available at the website.
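A reader for this kind of file might look like the following minimal sketch. Only the ‘ID’, ‘CX’ and ‘SQ’ keys and the ‘Reference Database: Reference ID’ layout come from the description above. The sketch assumes that each ‘ID’ record starts a new entry, which the text does not state, and the sample sequence is invented. Any other keys are stored unchanged, since their names are documented only on the web page.

```python
from collections import defaultdict


def parse_entries(lines):
    """Parse 'KEY VALUE' records into a list of entry dictionaries."""
    entries = []
    current = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        key, _, value = line.partition(" ")
        value = value.strip()
        if key == "ID":
            # Assumption: an ID record opens a new entry.
            current = {"id": value, "xrefs": [], "sequence": "",
                       "other": defaultdict(list)}
            entries.append(current)
        elif current is None:
            continue  # ignore anything before the first ID record
        elif key == "CX":
            database, _, ref_id = value.partition(":")
            current["xrefs"].append((database.strip(), ref_id.strip()))
        elif key == "SQ":
            # The sequence may span several SQ records; join them.
            current["sequence"] += value.replace(" ", "")
        else:
            current["other"][key].append(value)
    return entries


# Invented sample records; the sequence is not a real DBSubLoc entry.
sample = """ID 10000001
CX GO: 0005737
SQ MKTAYIAKQR
SQ QISFVKSHFS
""".splitlines()
entries = parse_entries(sample)
```

Keeping unknown keys in a generic mapping lets the reader tolerate records it does not recognize rather than failing on them.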


Table 1. Brief statistical information of the full DBSubLoc database
 

Table 2. Brief statistical information of the non-redundant DBSubLoc database
 

    DATABASE ACCESS
We provide free access to the DBSubLoc database for education and research users. The website is available at http://www.bioinfo.tsinghua.edu.cn/dbsubloc.html. Users can download the database release file or smaller taxonomy-categorized files. Users can also search the database with protein name, protein identity or cross-referenced database identity. An online sequence alignment service is also provided at the website. Users can submit one protein sequence to search for homologous sequences in the complete DBSubLoc database or in one of its non-redundant subsets. With the development of the DBSubLoc database, new database releases, more services and tools will be provided.


    FUTURE DEVELOPMENT
We aim to collect more data and information, and to classify and purify them into an efficient relational data model for further research. The non-redundant data sets are to be tested in developing new prediction methods. Because of the complexity of cellular components, more work is needed to improve the purification and partitioning of the data sets. More annotations will be added from the literature, from other databases and by prediction.


    ACKNOWLEDGEMENTS
 
This work was funded by the National Key Foundational Research Grant in China (863-2002AA23/03/, 2002AA234041, 973-2003CB715900, NSFC-90303017).


    REFERENCES

  1. Eisenhaber,F. and Bork,P. (1998) Wanted: subcellular localization of proteins based on sequence. Trends Cell Biol., 8, 169–170.

  2. Nielsen,H., Engelbrecht,J., Brunak,S. and von Heijne,G. (1997) A neural network method for identification of prokaryotic and eukaryotic signal peptides and prediction of their cleavage sites. Int. J. Neural Syst., 8, 581–599.

  3. Gardy,J.L., Spencer,C., Wang,K., Ester,M., Tusnady,G.E., Simon,I., Hua,S., deFays,K., Lambert,C., Nakai,K. and Brinkman,F.S. (2003) PSORT-B: Improving protein subcellular localization prediction for Gram-negative bacteria. Nucleic Acids Res., 31, 3613–3617.

  4. Hua,S. and Zhirong,S. (2001) Support vector machine approach for protein subcellular localization prediction. Bioinformatics, 17, 721–728.

  5. Nakashima,H. and Nishikawa,K. (1994) Discrimination of intracellular and extracellular proteins using amino acid composition and residue-pair frequencies. J. Mol. Biol., 238, 54–61.

  6. Reinhardt,A. and Hubbard,T. (1998) Using neural networks for prediction of the subcellular location of proteins. Nucleic Acids Res., 26, 2230–2236.

  7. Yuan,Z. (1999) Prediction of protein subcellular locations using Markov chain models. FEBS Lett., 451, 23–26.

  8. Feng,Z.P. (2001) Prediction of the subcellular location of prokaryotic proteins based on a new representation of the amino acid composition. Biopolymers, 58, 491–499.

  9. Feng,Z.P. (2002) An overview on predicting the subcellular location of a protein. In Silico Biol., 2, 291–303.

  10. Boeckmann,B., Bairoch,A., Apweiler,R., Blatter,M.-C., Estreicher,A., Gasteiger,E., Martin,M.J., Michoud,K., O’Donovan,C., Phan,I. et al. (2003) The SWISS-PROT protein knowledgebase and its supplement TrEMBL in 2003. Nucleic Acids Res., 31, 365–370.

  11. Wu,C.H., Huang,H., Arminski,L., Castro-Alvear,J., Chen,Y., Hu,Z.-Z., Ledley,R.S., Lewis,K.C., Mewes,H.-W., Orcutt,B.C. et al. (2002) The Protein Information Resource: an integrated public resource of functional annotation of proteins. Nucleic Acids Res., 30, 35–37.

  12. Dwight,S.S., Harris,M.A., Dolinski,K., Ball,C.A., Binkley,G., Christie,K.R., Fisk,D.G., Issel-Tarver,L., Schroeder,M., Sherlock,G. et al. (2002) Saccharomyces Genome Database (SGD) provides secondary gene annotation using the Gene Ontology (GO). Nucleic Acids Res., 30, 69–72.

  13. Huala,E., Dickerman,A., Garcia-Hernandez,M., Weems,D., Reiser,L., LaFond,F., Hanley,D., Kiphart,D., Zhuang,J., Huang,W. et al. (2001) The Arabidopsis Information Resource (TAIR): a comprehensive database and web-based information retrieval, analysis and visualization system for a model plant. Nucleic Acids Res., 29, 102–105.

  14. The FlyBase Consortium (2003) The FlyBase database of the Drosophila genome projects and community literature. Nucleic Acids Res., 31, 172–175.

  15. Blake,J.A., Richardson,J.E., Bult,C.J., Kadin,J.A., Eppig,J.T. and the members of the Mouse Genome Database Group (2003) MGD: The Mouse Genome Database. Nucleic Acids Res., 31, 193–195.

  16. Ashburner,M., Ball,C.A., Blake,J.A., Botstein,D., Butler,H., Cherry,J.M., Davis,A.P., Dolinski,K., Dwight,S.S., Eppig,J.T. et al. The Gene Ontology Consortium (2000) Gene Ontology: tool for the unification of biology. Nature Genet., 25, 25–29.

