Nucleic Acids Research Advance Access published online on June 6, 2007

Nucleic Acids Research, doi:10.1093/nar/gkm390
© 2007 The Author(s)
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.


Web Server Issue

DOMAC: an accurate, hybrid protein domain prediction server

Jianlin Cheng*

School of Electrical Engineering and Computer Science, University of Central Florida, Orlando, FL, 32816, USA

*To whom correspondence should be addressed. Tel: (407) 823-0230; Fax: (407) 823-5419; Email: jcheng{at}cs.ucf.edu

Received January 12, 2007. Revised March 28, 2007. Accepted May 1, 2007.


    ABSTRACT
 
Protein domain prediction is important for protein structure prediction, structure determination, function annotation, mutagenesis analysis and protein engineering. Here we describe an accurate protein domain prediction server (DOMAC) that combines template-based and ab initio methods. The preliminary version of the server was ranked among the top domain prediction servers in the seventh edition of the Critical Assessment of Techniques for Protein Structure Prediction (CASP7), 2006. The DOMAC server and datasets are available at: http://www.bioinfotool.org/domac.html


    INTRODUCTION
 
Protein domains are structural, functional and evolutionary units of proteins. The prediction of domains from sequence information can improve tertiary structure prediction (1), enhance protein function annotation (2), aid structure determination (3) and guide protein engineering (4) and mutagenesis (5).

A number of different methods have been developed to identify domains starting from primary sequences. These methods can be roughly classified into four categories: template-based methods (6–10), ab initio (template-free) methods (11–22), the hybrid approach combining template-based and ab initio methods (23), and meta-domain prediction methods (24).

Here we describe an accurate, hybrid domain prediction server (DOMAC) that integrates homology modeling, domain parsing and ab initio methods. The preliminary implementation of the server [under the name FOLDpro (25)] participated in the domain prediction evaluation of the seventh edition of the Critical Assessment of Techniques for Protein Structure Prediction (CASP7) (26,27), where it was ranked among the top domain prediction servers.


    IMPLEMENTATION
 
Our hybrid approach uses the template-based method to predict domains for proteins that have homologous template structures in the Protein Data Bank (PDB) (28), and an ab initio method based on neural networks (29) to predict domains for de novo proteins. It predicts protein domains in two steps.

First, it uses PSI-BLAST (30) to search the target sequence against the NCBI non-redundant sequence database and construct a profile. The profile is then used to search a template structure library built from the proteins in the PDB to identify templates, similar to the PDB-BLAST approach (31).

Second, if significant templates are identified (e-value ≤ 0.001), it generates a structure model for the target with Modeller (32) based on the template structures. When multiple significant templates are available, they are combined to improve model quality. It then uses the accurate domain parsing tool PDP (33) to parse the model into domains. If the parsed domains do not cover the whole target sequence, DOMAC assigns the uncovered regions to adjacent domains.
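
The last step, assigning uncovered regions to adjacent domains, can be illustrated with a minimal sketch. The text does not specify how an adjacent domain is chosen, so the rule used here (each uncovered residue joins the domain of the nearest covered residue, with ties going to the N-terminal side) is an assumption, and the function name `fill_uncovered` is hypothetical rather than part of DOMAC.

```python
from typing import Dict, List, Tuple

Segment = Tuple[int, int]  # 1-based, inclusive (start, end)


def fill_uncovered(domains: Dict[int, List[Segment]], length: int) -> Dict[int, List[Segment]]:
    """Assign each uncovered residue to the domain of the nearest covered residue.

    Assumed rule (not taken from the paper): nearest covered residue wins,
    ties are resolved in favour of the lower (N-terminal) position.
    """
    # Per-residue labels; 0 means "not covered by any parsed domain".
    labels = [0] * (length + 1)
    for dom_id, segments in domains.items():
        for start, end in segments:
            for pos in range(start, end + 1):
                labels[pos] = dom_id

    covered = [pos for pos in range(1, length + 1) if labels[pos] != 0]
    if not covered:
        raise ValueError("no residue is covered by any domain")

    filled = labels[:]
    for pos in range(1, length + 1):
        if labels[pos] == 0:
            nearest = min(covered, key=lambda c: (abs(c - pos), c))
            filled[pos] = labels[nearest]

    # Collapse per-residue labels back into contiguous segments.
    result: Dict[int, List[Segment]] = {}
    start = 1
    for pos in range(2, length + 2):
        if pos == length + 1 or filled[pos] != filled[start]:
            result.setdefault(filled[start], []).append((start, pos - 1))
            start = pos
    return result
```

For a hypothetical 100-residue target parsed into domains 5 to 40 and 51 to 95, this rule extends the first domain to residues 1 to 45 and the second to residues 46 to 100.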

If no significant homologous template is found, DOMAC invokes the ab initio domain predictor DOMpro (29). DOMpro uses neural networks in conjunction with sequence profiles, predicted secondary structure and predicted relative solvent accessibility to predict domain boundaries. The secondary structure and relative solvent accessibility are predicted by SSpro (34) and ACCpro (35) in the SCRATCH suite (36). DOMpro tries to identify domain boundary positions based on the compositional bias of sequence and structural features in domain linker regions.
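
The choice between the two branches described above can be sketched as follows. Only the e-value cutoff of 0.001 comes from the text; the `TemplateHit` type, the `predict_domains` function and the two callables standing in for the Modeller/PDP pipeline and for DOMpro are hypothetical placeholders, not the server's actual interfaces.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

E_VALUE_CUTOFF = 0.001  # significance threshold stated in the text

# Per domain: a list of 1-based, inclusive (start, end) segments.
Domains = List[List[Tuple[int, int]]]


@dataclass(frozen=True)
class TemplateHit:
    """A hit from the template library search (hypothetical record type)."""
    template_id: str  # PDB code + chain id
    e_value: float


def predict_domains(
    sequence: str,
    hits: Sequence[TemplateHit],
    template_based: Callable[[str, List[TemplateHit]], Domains],
    ab_initio: Callable[[str], Domains],
) -> Tuple[str, Domains]:
    """Use the template-based branch if any hit is significant, else ab initio."""
    significant = [hit for hit in hits if hit.e_value <= E_VALUE_CUTOFF]
    if significant:
        # All significant templates are passed on, since the text says
        # multiple templates are combined when available.
        return "template-based", template_based(sequence, significant)
    return "ab initio", ab_initio(sequence)
```

Returning the method name alongside the domains mirrors the server output, which reports whether a prediction is template-based or ab initio.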

The preliminary implementation of DOMAC participated in CASP7 and was ranked first among 13 domain prediction servers. Since then, we have significantly sped up the template identification process without sacrificing accuracy, and added a module that updates the template library weekly to incorporate newly released proteins in the PDB.


    RESULTS
 
We first describe the performance of the preliminary implementation of DOMAC in CASP7 (under the server name FOLDpro). We compare it with 12 other server predictors in CASP7 using two evaluation metrics: the CASP evaluation metric (37) and domain number accuracy.

The CASP metric (NDO: normalized domain overlap score) measures the overlap between predicted and true domains without explicitly checking domain number or domain boundaries (37). It counts the residues that are correctly and wrongly overlapped between true and predicted domains, and summarizes these counts into a single score. The best score for a target is 1 and the worst score is 0. The domain number accuracy is defined as the percentage of targets with correct domain number predictions.
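
The domain number accuracy is simple enough to state as code. The following is a minimal sketch; the function name and the choice to score only targets present in both dictionaries are assumptions, and the target names in the usage note are made up for illustration.

```python
from typing import Dict


def domain_number_accuracy(true_counts: Dict[str, int], predicted_counts: Dict[str, int]) -> float:
    """Percentage of targets whose predicted domain number equals the true one.

    Only targets that have both a true and a predicted domain number are
    scored (an assumption; the text does not discuss missing predictions).
    """
    shared = true_counts.keys() & predicted_counts.keys()
    if not shared:
        raise ValueError("no target has both a true and a predicted domain number")
    correct = sum(1 for target in shared if true_counts[target] == predicted_counts[target])
    return 100.0 * correct / len(shared)
```

With four hypothetical targets whose true domain numbers are 1, 2, 2 and 3 and whose predicted numbers are 1, 2, 1 and 3, three of the four predictions match, so the accuracy is 75%.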

Table 1 reports the performance of 13 servers on 95 targets in CASP7. The CASP score is the average domain overlap score across all predicted targets. The domain number accuracy is computed by comparing the domain number predictions with the official domain definitions released by CASP7. On both evaluation metrics, the preliminary implementation of DOMAC (FOLDpro) yielded the best performance.


Table 1. The performance of 13 domain prediction servers in CASP7

 
We also evaluate DOMAC on the three categories of CASP7 targets: highly homologous, homologous and analogous/ab initio. The domain number prediction accuracy of DOMAC is 96%, 94% and 88% in the three categories, respectively.

However, because the majority (68 out of 95) of CASP7 targets are single-domain proteins, the domain prediction accuracy is very likely overestimated.

Thus, we evaluate DOMAC on a larger, balanced, high-quality dataset manually curated by Holland et al. (2). The publicly released version of Holland's benchmark2 dataset has 156 proteins, consisting of 54 single-domain proteins, 69 two-domain proteins, 25 three-domain proteins, 4 four-domain proteins, 3 five-domain proteins and 1 six-domain protein. We evaluate the template-based and ab initio methods separately on the whole dataset. Table 2 reports the specificity and sensitivity of each method in each category in terms of domain numbers. The overall domain number prediction accuracy of the template-based and ab initio methods is 75% and 46%, respectively.


Table 2. The specificity and sensitivity of domain number prediction on Holland's dataset using the template-based and ab initio methods
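
The text does not spell out how specificity and sensitivity are defined per domain-number category. The sketch below assumes a common convention: for the k-domain category, sensitivity is the fraction of true k-domain proteins predicted to have k domains, and specificity is the fraction of proteins predicted to have k domains that truly have k. The function name is hypothetical.

```python
from collections import Counter
from typing import Dict, Sequence, Tuple


def per_class_scores(true_counts: Sequence[int], predicted_counts: Sequence[int]) -> Dict[int, Tuple[float, float]]:
    """Return {domain_number: (specificity, sensitivity)} under the assumed definitions."""
    if len(true_counts) != len(predicted_counts):
        raise ValueError("true and predicted lists must have the same length")
    n_true = Counter(true_counts)       # proteins that truly have k domains
    n_pred = Counter(predicted_counts)  # proteins predicted to have k domains
    n_hit = Counter(t for t, p in zip(true_counts, predicted_counts) if t == p)

    scores: Dict[int, Tuple[float, float]] = {}
    for k in sorted(set(n_true) | set(n_pred)):
        # A category with no predictions (or no true members) scores 0.0.
        specificity = n_hit[k] / n_pred[k] if n_pred[k] else 0.0
        sensitivity = n_hit[k] / n_true[k] if n_true[k] else 0.0
        scores[k] = (specificity, sensitivity)
    return scores
```

Scoring each category separately keeps the many single- and two-domain proteins from masking poor performance on the rarer multi-domain categories.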

 
Moreover, we assess the accuracy of domain boundary prediction, which is important for generating hypotheses for crystallizing individual protein domains. Following the convention of previous work (7,22), a predicted boundary within 20 residues of a true domain boundary is considered correct.

The domain boundary specificity and sensitivity are 50% and 76.5% for the template-based method, and 27% and 14% for the ab initio method. Thus, the accuracy of the template-based method is sufficient for guiding crystallization experiments, whereas the ab initio method is not always reliable enough for general, practical use.
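
The 20-residue criterion can be made concrete with a short sketch. The definitions assumed here are not stated in the text: specificity is taken as the fraction of predicted boundaries that lie within the tolerance of some true boundary, and sensitivity as the fraction of true boundaries that have a predicted boundary within the tolerance. Treating a distance of exactly 20 residues as correct is also an assumption.

```python
from typing import Sequence, Tuple


def boundary_scores(
    true_boundaries: Sequence[int],
    predicted_boundaries: Sequence[int],
    tolerance: int = 20,
) -> Tuple[float, float]:
    """Return (specificity, sensitivity) of boundary prediction for one protein."""

    def near(position: int, others: Sequence[int]) -> bool:
        # True if any boundary in `others` is within `tolerance` residues.
        return any(abs(position - other) <= tolerance for other in others)

    correct_predicted = sum(1 for p in predicted_boundaries if near(p, true_boundaries))
    recovered_true = sum(1 for t in true_boundaries if near(t, predicted_boundaries))

    specificity = correct_predicted / len(predicted_boundaries) if predicted_boundaries else 0.0
    sensitivity = recovered_true / len(true_boundaries) if true_boundaries else 0.0
    return specificity, sensitivity
```

For a hypothetical protein with true boundaries at residues 100 and 250 and predicted boundaries at 110, 180 and 400, only the prediction at 110 is correct, giving a specificity of 1/3 and a sensitivity of 1/2.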


    USE OF WEB SERVICE
 
DOMAC is used through a simple, intuitive input form. Since the reliability assessment of domain predictions is still an open issue, the user is advised to refer to the accuracy reported on Holland's dataset when deciding how to use these predictions. The input form requires only three inputs: email address, target name and protein sequence. DOMAC usually makes predictions within 15 min and sends the results back to users by email.

Domain prediction results include the user-defined target name, the protein sequence, the predicted domain number, the start and end positions of each domain and the method (template-based or ab initio). For template-based prediction, it also reports the PDB codes of the templates. Figure 1 shows an output example for the CASP7 target T0324.


Figure 1. Domain prediction result of CASP7 target T0324. The protein is predicted to have two domains. Domain 1 has two non-continuous segments, spanning from residues 1 to 16 and residues 82 to 208, respectively. Domain 2 spans from residues 17 to 81. The templates used to make the domain prediction are identified by PDB code + chain id. The chain in a single-chain protein is always assigned chain id ‘A’ instead of ‘-’.
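
The T0324 result also shows why a domain has to be represented as a list of segments rather than a single range. The sketch below encodes the prediction from the caption and checks that the segments partition the sequence. The data layout and function names are hypothetical, and the sequence length of 208 is an assumption inferred from the last residue listed in the caption.

```python
from typing import Dict, List, Tuple

Segment = Tuple[int, int]  # 1-based, inclusive (start, end)

# Predicted domains of CASP7 target T0324, as given in the Figure 1 caption.
T0324_DOMAINS: Dict[int, List[Segment]] = {
    1: [(1, 16), (82, 208)],  # two non-continuous segments
    2: [(17, 81)],
}
T0324_LENGTH = 208  # assumed: the caption lists residue 208 as the last position


def is_partition(domains: Dict[int, List[Segment]], length: int) -> bool:
    """True if every residue 1..length belongs to exactly one domain."""
    counts = [0] * (length + 1)
    for segments in domains.values():
        for start, end in segments:
            if start < 1 or end > length or start > end:
                return False
            for pos in range(start, end + 1):
                counts[pos] += 1
    return all(count == 1 for count in counts[1:])


def domain_sizes(domains: Dict[int, List[Segment]]) -> Dict[int, int]:
    """Number of residues in each domain, summed over its segments."""
    return {dom: sum(end - start + 1 for start, end in segs) for dom, segs in domains.items()}
```

Under these assumptions the two domains cover all 208 residues without overlap, with 143 residues in domain 1 and 65 in domain 2.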

 

    CONCLUSION AND FUTURE WORK
 
We have developed a hybrid domain prediction web service integrating template-based and ab initio methods. The template-based method is accurate enough for guiding protein structure prediction, structure determination, function annotation, mutagenesis analysis and protein engineering. However, the ab initio method still needs to be improved for practical use. Protein domain architecture is largely shaped by gene recombination events, such as gene fusion, fission, domain swapping and exon exchange. Leveraging the evolutionary recombination signals embedded in the multiple sequence alignment of a protein family, together with the exon boundaries (or splicing sites) in its gene structure, may therefore help improve ab initio domain prediction significantly.


    ACKNOWLEDGEMENTS
 
J.C. is very grateful to Dr Pierre Baldi for his support during J.C.'s PhD research at the University of California, Irvine.

Funding to pay the Open Access publication charges for this article was provided by the New Faculty Start-Up Grant at the University of Central Florida.

Conflict of interest statement. None declared.


    REFERENCES
 

  1. Chivian D, Kim DE, Malmstrom L, Bradley P, Robertson T, Murphy P, Strauss CE, Bonneau R, Rohl CA, et al. Automated prediction of CASP-5 structures using the Robetta server. Proteins (2003) 53(S6):524–533.

  2. Holland T, Veretnik S, Shindyalov IN, Bourne PE. A benchmark for domain assignment from protein 3-dimensional structure and its applications. J. Mol. Biol (2006) 361:562–590.

  3. Campbell I, Downing A. Building protein structure and function from modular units. Trends Biotechnol (1994) 12:168–172.

  4. Guerois R, Serrano L. Protein design based on folding models. Curr. Opin. Struct. Biol (2001) 11:101–106.

  5. Nielsen P, Yamada Y. Identification of cell-binding sites on the Laminin α5 N-terminal domain by site-directed mutagenesis. J. Biol. Chem (2001) 276:10906–10912.

  6. Heger A, Holm L. Exhaustive enumeration of protein domain families. J. Mol. Biol (2003) 328:749–767.

  7. Marsden RL, McGuffin LJ, Jones DT. Rapid protein domain assignment from amino acid sequence using predicted secondary structure. Protein Sci (2002) 11:2814–2824.

  8. von Ohsen N, Sommer I, Zimmer R, Lengauer T. Arby: automatic protein structure prediction using profile-profile alignment and confidence measures. Bioinformatics (2004) 20:2228–2235.

  9. Gewehr JE, Zimmer R. SSEP-Domain: protein domain prediction by alignment of secondary structure elements and profiles. Bioinformatics (2006) 22:181–187.

  10. Coin L, Bateman A, Durbin R. Enhanced protein domain discovery using taxonomy. BMC Bioinformat (2004) 5:56.

  11. Park J, Teichmann SA. DIVCLUS: an automatic method in the GEANFAMMER package that finds homologous domains in single- and multi-domain proteins. Bioinformatics (1998) 14:144–150.

  12. Gouzy J, Corpet F, Kahn D. Whole genome protein domain analysis using a new method for domain clustering. Comput. Chem (1999) 23:333–340.

  13. Bryson K, McGuffin LJ, Marsden RL, Ward JJ, Sodhi JS, Jones DT. Protein structure prediction servers at University College London. Nucleic Acids Res (2005) 33:W36–W38.

  14. George RA, Heringa J. SnapDRAGON: a method to delineate protein structural domains from sequence data. J. Mol. Biol (2002) 316:839–851.

  15. Linding R, Russell RB, Neduva V, Gibson TJ. GlobPlot: Exploring protein sequences for globularity and disorder. Nucleic Acids Res (2003) 31:3701–3708.

  16. Nagarajan N, Yona G. Automatic prediction of protein domains from sequence information using a hybrid learning system. Bioinformatics (2004) 20:1335–1360.

  17. Wheelan SJ, Marchler-Bauer A, Bryant SH. Domain size distributions can predict domain boundaries. Bioinformatics (2000) 16:613–618.

  18. Sim J, Kim SY, Lee J. PPRODO: prediction of protein domain boundaries using neural networks. Proteins (2005) 59:627–632.

  19. Chen L, Wang W, Ling S, Jia C, Wang F. Kemadom: a web server for domain prediction using kernel machine with local context. Nucleic Acids Res (2006) 34:W158–W163.

  20. Adams R, Das S, Smith T. Multiple domain protein diagnostic patterns. Prot. Sci (1996) 5:1240–1249.

  21. George R, Heringa J. Protein domain identification and improved sequence similarity search using PSI-BLAST. Protein Struct. Funct. Genet (2002) 48:672–681.

  22. Liu J, Rost B. Sequence-based prediction of protein domains. Nucleic Acids Res (2004) 32:3522–3530.

  23. Kim DE, Chivian D, Malmstrom L, Baker D. Automated prediction of domain boundaries in CASP6 targets using Ginzu and RosettaDOM. Proteins (2005) 61(Suppl. 7):193–200.

  24. Saini HK, Fischer D. Meta-DP: domain prediction meta server. Bioinformatics (2005) 21:2917–2920.

  25. Cheng J, Baldi P. A Machine Learning Information Retrieval Approach to Protein Fold Recognition. Bioinformatics (2006) 22:1456–1463.

  26. Moult J, Fidelis K, Zemla A, Hubbard T. Critical assessment of methods of protein structure prediction (CASP)-round V. Proteins (2003) 53(Suppl. 6):334–339.

  27. Moult J, Fidelis K, Rost B, Hubbard T, Tramontano A. Critical assessment of methods of protein structure prediction (CASP) - round 6. Proteins (2005) 61(Suppl. 7):3–7.

  28. Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H, Shindyalov IN, Bourne PE. The Protein Data Bank. Nucleic Acids Res (2000) 28:235–242.

  29. Cheng J, Sweredoski MJ, Baldi P. DOMpro: Protein domain prediction using profiles, secondary structure, relative solvent accessibility, and recursive neural networks. Data Mining and Knowledge Discovery (2006) 13:1–10.

  30. Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res (1997) 25:3389–3402.

  31. Rychlewski L, Jaroszewski L, Li W, Godzik A. Comparison of sequence profiles. Strategies for structural predictions using sequence information. Protein Sci (2000) 9:232–241.

  32. Sali A, Blundell TL. Comparative protein modelling by satisfaction of spatial restraints. J. Mol. Biol (1993) 234:779–815.

  33. Alexandrov N, Shindyalov I. PDP: protein domain parser. Bioinformatics (2003) 19:429–430.

  34. Pollastri G, Przybylski D, Rost B, Baldi P. Improving the prediction of protein secondary structure in three and eight classes using recurrent neural networks and profiles. Proteins (2002) 47:228–235.

  35. Pollastri G, Baldi P, Fariselli P, Casadio R. Prediction of coordination number and relative solvent accessibility in proteins. Proteins (2002) 47:142–153.

  36. Cheng J, Randall AZ, Sweredoski MJ, Baldi P. SCRATCH: a protein structure and structural feature prediction server. Nucleic Acids Res (2005) 33(Web Server issue):W72–W76.

  37. Tai CH, Lee WJ, Vincent JJ, Lee B. Evaluation of domain prediction in CASP6. Proteins (2005) 61(Suppl. 7):183–192.

  38. Söding J. Protein homology detection by HMM-HMM comparison. Bioinformatics (2005) 21:951–960.

  39. Bau D, Martin AJM, Mooney C, Vullo A, Walsh I, Pollastri G. Distill: A suite of web servers for the prediction of one-, two- and three dimensional structural features of proteins. BMC Bioinformat (2006) 7:402.

