Skip Navigation

Nucleic Acids Research 2006 34(Web Server issue):W164-W168; doi:10.1093/nar/gkl166
This Article
Right arrow Abstract Freely available
Right arrow Print PDF (96K) Freely available
Right arrow Screen PDF (99K) Freely available
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrow Commercial Re-use Guidelines
for Open Access NAR Content
Google Scholar
Right arrow Articles by Vullo, A.
Right arrow Articles by Tosatto, S. C. E.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Vullo, A.
Right arrow Articles by Tosatto, S. C. E.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?

© The Author 2006. Published by Oxford University Press. All rights reserved
The online version of this article has been published under an open access model. Users are entitled to use, reproduce, disseminate, or display the open access version of this article for non-commercial purposes provided that: the original authorship is properly and fully attributed; the Journal and Oxford University Press are attributed as the original place of publication with the correct citation details given; if an article is subsequently reproduced or disseminated not in its entirety but only in part or as a derivative work this must be clearly indicated. For commercial re-use, please contact journals.permissions@oxfordjournals.org


Article

Spritz: a server for the prediction of intrinsically disordered regions in protein sequences using kernel machines

Alessandro Vullo, Oscar Bortolami1, Gianluca Pollastri* and Silvio C. E. Tosatto1

School of Computer Science and Informatics, University College Dublin Ireland 1 Department of Biology and CRIBI Biotechnology Centre, University of Padova Italy

*To whom correspondence should be addressed. Tel: +353 1 716 2926; fax: +353 1 269 7262; Email: gianluca.pollastri{at}ucd.ie

Received February 13, 2006. Revised March 1, 2006. Accepted March 20, 2006.


    ABSTRACT
 TOP
 ABSTRACT
 INTRODUCTION
 PROGRAM DESCRIPTION
 SERVER DESCRIPTION
 DISCUSSION
 REFERENCES
 
Intrinsically disordered proteins have long stretches of their polypeptide chain, which do not adopt a single native structure composed of stable secondary and tertiary structure in the absence of binding partners. The prediction of intrinsically disordered regions in proteins from sequence is increasingly becoming of interest, as the presence of many such regions in the complete genome sequences are discovered and important functional roles are associated with them. We have developed a machine learning approach based on two support vector machines (SVM) to discriminate disordered regions from sequence. The SVM are trained and benchmarked on two sets, representing long and short disordered regions. A preliminary version of Spritz was shown to perform consistently well at the recent biannual CASP-6 experiment [Critical Assessment of Techniques for Protein Structure Prediction (CASP), 2004]. The fully developed Spritz method is freely available as a web server at http://distill.ucd.ie/spritz/ and http://protein.cribi.unipd.it/spritz/.


    INTRODUCTION
 TOP
 ABSTRACT
 INTRODUCTION
 PROGRAM DESCRIPTION
 SERVER DESCRIPTION
 DISCUSSION
 REFERENCES
 
Over the last couple of years there has been growing experimental and theoretical interest into what has come to be known as intrinsic disorder in proteins (111). Some proteins appear to have long stretches of their polypeptide chain which do not adopt a single native structure. These disordered fragments instead appear to exist in a state that is unstructured but different from unfolded proteins. Frequently such disordered fragments interact with other proteins and can become structured under certain conditions, such as ligand binding or phosphorylation. Recent studies have determined the functional classes of signal transduction and transcription to be overrepresented among disordered proteins (12,13). Perhaps linked to this functional bias is a higher percentage of disordered regions in Eukaryotes compared to Prokaryotes and Archea, since the former require a more sophisticated control of communication (12,13). From a structural point of view, disordered regions frequently do not appear on the electron density map in X-ray crystallographic studies and are highly flexible according to NMR data. While disorder is not overly represented in the PDB for the above reasons, long ‘loopy’ regions have also been observed in some structures (13). Predicting the location of such unstructured regions is therefore important in protein structure prediction, as these disordered fragments cannot and should not be predicted. This has prompted the inclusion of a disorder prediction session in the last Critical Assessment of techniques for protein Structure Prediction (CASP) experiments (14), starting with CASP-5 in 2002.

Perhaps due to the inclusion in CASP, there has been recently increased interest in predicting disorder and associated features from sequence (12,1525). The most prominent feature of many, in particular long, disordered sequences is low sequence complexity. This has been used for a long time to filter sequence database searches with the SEG filter (26). Long disordered regions show a high net charge and have few hydrophobic residues (1,2). A similar trend has also recently been observed for short disordered sequence patterns (21). This sequence bias has been used to develop machine learning based predictors in analogy to secondary structure prediction. A training set of long and/or short disordered regions is used to train a neural net or support vector machine to discriminate disordered from ordered sequence fragments. The results are quite encouraging, as disorder appears to be easier to predict than secondary structure (27). Long disordered regions in particular are consistently predicted by most methods. The performance on short disordered regions is subject to higher fluctuations. Overprediction is still an issue with most methods, i.e. more residues are predicted to be disordered than is really the case. However, at least some of these residues correspond to the more flexible parts of the protein structure, pointing at an intuitive correlation between disorder and flexibility. In analogy to secondary structure prediction (2830), recently some of the best performing disorder predictors have started to use a combination of predictors. Probably because short and long disordered regions may be the result of different processes, and both may be further classified into different ‘flavours’ (22), approaches relying on multiple specialized predictors have led to improved performances. In this paper we present our novel server for the prediction of intrinsically disordered region in proteins, based on two separate support vector machines (SVM), one specialized to recognize long disordered regions, one short disordered regions. The server, called Spritz, can be accessed from http://distill.ucd.ie/spritz/ and http://protein.cribi.unipd.it/spritz/.


    PROGRAM DESCRIPTION
 TOP
 ABSTRACT
 INTRODUCTION
 PROGRAM DESCRIPTION
 SERVER DESCRIPTION
 DISCUSSION
 REFERENCES
 
Overview
The server Spritz is implemented using two specialized binary classifiers, one for short regions of disorder and the other for long disordered fragments. The purpose of doing so is to develop different, disjoint expertize by taking advantage of the different class distributions in the two cases. These classifiers are derived from independent datasets as discussed in the following.

LD dataset
We first assemble a subset of completely disordered sequences, each with over 45 disordered residues, from DISPROT release 1.2 (31) by removing sequences containing errors of annotations and then using the most up to date GI accession numbers. The final set is completed by incorporating an equal sized subset of chains classified as having no disordered fragments and derived from the PDBselect25, March 2002 release. The filtered and balanced set contains 293 sequences corresponding to 34 159 residues, 17 001 (49.7%)of which are classified as belonging to long regions of disorder.

SD dataset
A collection of short disordered sequence fragments is compiled from the November 2004 release of the PDB (32). We filter out proteins that are not solved by X-ray diffraction, are less than 30 AA and have resolution worse than 2.5 Å. In order to obtain non overlapping entries with the LD set, we finally select those chains having no more than 20 disordered amino acids and sharing at most 25% sequence similarity. The final set contains 1017 sequences corresponding to 278 600 residues, 8824 (3.2%) of which are classified as belonging to short regions of disorder. There is strong unbalance towards the ordered class, reflecting the PDB distribution.

Specialized SVM classifiers
The LD and SD datasets are used independently to derive two binary classifiers. These classifiers are both implemented with probabilistic soft-margin SVM or C-SVMs (33,34) mounted with a Gaussian kernel. Note that differently from (12), we use a non linear kernel which is less biased (i.e. more expressive) than the linear one but also more prone to overfitting. We tackle the overfitting problem by running model selection on independent validation data (35). An additional problem is represented by the strong unbalance of class distribution in the SD set. To mitigate this problem, we train SVMs using asymmetric costs, i.e. a larger penalty for disorder misclassifications. This penalty is equal to the ratio of the number of negative (i.e. ordered) to positive examples in the training set.

Both classifiers use as inputs residue attributes extracted from a window of k residues. For each of the k residues, 20 amino acid frequencies are input into the SVM, computed from multiple alignments obtained from three runs of PSI-Blast (36) against the non-redundant NCBI sequence database redundancy reduced at 90%. For each amino acid in the window, we consider also secondary structure information predicted by the Porter server (37) in the form of three probability values. In our tests secondary structure information leads to gains of roughly 3% in two-state classification accuracy. After initial bootstrapping, all the experiments and the final stage (the server) are implemented using a window of size five (respectively three) for the LD (respectively SD) set predictor.


    SERVER DESCRIPTION
 TOP
 ABSTRACT
 INTRODUCTION
 PROGRAM DESCRIPTION
 SERVER DESCRIPTION
 DISCUSSION
 REFERENCES
 
The web server has two interfaces, one for single and one for multiple queries (see also Figure 1). The former takes as input a bare protein sequence while the latter accepts input in FASTA format with multiple entries for batch querying. Submissions of up to 32 768 characters are accepted. The user has the option to adjust the classifier output according to the estimated false positive rate (FPR). The fraction of disordered amino acids that are expected to be recovered can be estimated from the FPR and the ROC curve provided in the help page. The user can select either predictor from the interface and may provide an FPR (default 0.05).


Figure 1
View larger version (38K):
[in this window]
[in a new window]
 
Figure 1 The Spritz web server interface.

 
The server outputs the following sequences: (i) the prediction of secondary structure in three classes (C = coil; E = strand; H = helix), obtained using Porter (37); (ii) the prediction of protein disorder in two classes (O = ordered; D = disordered). The results in plain ASCII text format are sent by email.


    DISCUSSION
 TOP
 ABSTRACT
 INTRODUCTION
 PROGRAM DESCRIPTION
 SERVER DESCRIPTION
 DISCUSSION
 REFERENCES
 
To assess the performance of our method we run a 5-fold cross validation (5-fold CV) procedure with both predictors. CV performance for both short and long disorder experts is given in Table 1 together with an independent assessment of both final predictors (i.e. trained on the whole respective dataset) on the set represented by the DR category targets of the most recent CASP-6 competition (38). We report Matthews correlation coefficient (C), the area under the ROC curve (AUC) and the two state (ordered/disordered) classification accuracy Q2 for a false hit rate set to 5%. To show that these results are competitive with state-of-the-art methods, Table 1 also reports the performance of two top ranking CASP-6 methods. VSL-1 (group id 193) (39) and a preliminary version of Spritz used in CASP-6, participating as group CaspIta (group id 096). We also report the performances of the short disorder expert on the LD set, and of the long disorder expert on the SD set. In both cases the specialized predictor performs significantly better on its own class of disorder, justifying the choice of distinct experts.


View this table:
[in this window]
[in a new window]
 
Table 1 Results of SVM trained on SD and LD dataset

 
Figure 2 shows the ROC curve of both the long disorder (SVM LD) and the short disorder (SVM SD) experts as computed from the CASP-6 targets. This plot is very similar to what we obtain on the 5-folds CV experiments and suggests a threshold switch between the experts, since it shows the long disorder expert consistently outperforming the alternative above 5% FPR. Nonetheless, the choice of the expert is left to the user, to increase the flexibility of the method. To help the user's choice, in the web help page we provide the expected fraction of disordered residues recovered for a given FPR, for both experts.


Figure 2
View larger version (27K):
[in this window]
[in a new window]
 
Figure 2 ROC curve of both the long disorder (SVM LD) and the short disorder (SVM SD) experts as computed from the CASP-6 targets.

 


    ACKNOWLEDGEMENTS
 
A.V. and G.P. are supported by an Embark Fellowship from the Irish Research Council for Science, Engineering and Technology, Science Foundation Ireland grants 04/BR/CS0353 and 05/RFP/CMS0029, grant RP/2005/219 from the Health Research Board of Ireland and UCD President's Award 2004 and Seed Funding scheme 2005 grants. S.C.E.T. is funded by a ‘Rientro dei cervelli’ grant from the Italian Ministry for Education, University and Research (MIUR). Funding to pay the Open Access publication charges for this article was provided by Science Foundation Ireland's grant 04/BR/CS0353.

Conflict of interest statement. None declared.


    REFERENCES
 TOP
 ABSTRACT
 INTRODUCTION
 PROGRAM DESCRIPTION
 SERVER DESCRIPTION
 DISCUSSION
 REFERENCES
 

  1. Uversky, V.N., Gillespie, J.R., Fink, A.L. (2000) Why are ‘natively unfolded’ proteins unstructured under physiologic conditions? Proteins, 41, 415–427[CrossRef][Web of Science][Medline] .

  2. Uversky, V.N. (2002) What does it mean to be natively unfolded? Eur. J. Biochem, . 269, 2–12[Web of Science][Medline] .

  3. Dunker, A.K., Brown, C.J., Lawson, J.D., Iakoucheva, L.M., Obradovic, Z. (2002) Intrinsic disorder and protein function Biochemistry, 41, 6573–6582[CrossRef][Medline] .

  4. Dunker, A.K. and Obradovic, Z. (2001) The protein trinity—linking function and disorder Nat. Biotechnol, . 19, 805–806[CrossRef][Web of Science][Medline] .

  5. Bracken, C., Iakoucheva, L.M., Romero, P.R., Dunker, A.K. (2004) Combining prediction, computation and experiment for the characterization of protein disorder Curr. Opin. Struct. Biol, . 14, 570–576[CrossRef][Web of Science][Medline] .

  6. Fink, A.L. (2005) Natively unfolded proteins Curr. Opin. Struct. Biol, . 15, 35–41[CrossRef][Web of Science][Medline] .

  7. Dyson, H.J. and Wright, P.E. (2002) Coupling of folding and binding for unstructured proteins Curr. Opin. Struct. Biol, . 12, 54–60[CrossRef][Web of Science][Medline] .

  8. Wright, P.E. and Dyson, H.J. (1999) Intrinsically unstructured proteins: re-assessing the protein structure-function paradigm J. Mol. Biol, . 293, 321–331[CrossRef][Web of Science][Medline] .

  9. Romero, P., Obradovic, Z., Dunker, A.K. (2004) Natively disordered proteins: functions and predictions Appl. Bioinformatics, 3, 105–113[CrossRef][Medline] .

  10. Dyson, H.J. and Wright, P.E. (2005) Intrinsically unstructured proteins and their functions Nature Rev. Mol. Cell Biol, . 6, 197–208[CrossRef][Web of Science][Medline] .

  11. Tompa, P. (2002) Intrinsically unstructured proteins Trends Biochem. Sci, . 27, 527–533[CrossRef][Web of Science][Medline] .

  12. Ward, J.J., Sodhi, J.S., McGuffin, L.J., Buxton, B.F., Jones, D.T. (2004) Prediction and functional analysis of native disorder in proteins from the three kingdoms of life J. Mol. Biol, . 337, 635–645[CrossRef][Web of Science][Medline] .

  13. Liu, J., Tan, H., Rost, B. (2002) Loopy proteins appear conserved in evolution J. Mol. Biol, . 322, 53–64[CrossRef][Web of Science][Medline] .

  14. Moult, J., Fidelis, K., Rost, B., Hubbard, T., Tramontano, A. (2005) Critical assessment of methods of protein structure prediction (CASP)—round 6 Proteins, 61, 3–7 .

  15. Radivojac, P., Obradovic, Z., Smith, D.K., Zhu, G., Vucetic, S., Brown, C.J., Lawson, J.D., Dunker, A.K. (2004) Protein flexibility and intrinsic disorder Protein Sci, . 13, 71–80[CrossRef][Web of Science][Medline] .

  16. Obradovic, Z., Peng, K., Vucetic, S., Radivojac, P., Brown, C.J., Dunker, A.K. (2003) Predicting intrinsic disorder from amino acid sequence Proteins, 53, 566–572 .

  17. Romero, P., Obradovic, Z., Li, X., Garner, E.C., Brown, C.J., Dunker, A.K. (2001) Sequence complexity of disordered protein Proteins, 42, 38–48[CrossRef][Web of Science][Medline] .

  18. Jones, D.T. and Ward, J.J. (2003) Prediction of disordered regions in proteins from position specific score matrices Proteins, 53, 573–578 .

  19. Linding, R., Jensen, L.J., Diella, F., Bork, P., Gibson, T.J., Russell, R.B. (2003) Protein disorder prediction: implications for structural proteomics Structure (Camb), 11, 1453–1459 .

  20. Linding, R., Russell, R.B., Neduva, V., Gibson, T.J. (2003) GlobPlot: exploring protein sequences for globularity and disorder Nucleic Acids Res, . 31, 3701–3708[Abstract/Free Full Text] .

  21. Lise, S. and Jones, D.T. (2005) Sequence patterns associated with disordered regions in proteins Proteins, 58, 144–150[CrossRef][Web of Science][Medline] .

  22. Vucetic, S., Brown, C.J., Dunker, A.K., Obradovic, Z. (2003) Flavors of protein disorder Proteins, 52, 573–584[CrossRef][Web of Science][Medline] .

  23. Oldfield, C.J., Cheng, Y., Cortese, M.S., Brown, C.J., Uversky, V.N., Dunker, A.K. (2005) Comparing and combining predictors of mostly disordered proteins Biochemistry, 44, 1989–2000[CrossRef][Medline] .

  24. Coeytaux, K. and Poupon, A. (2005) Prediction of unfolded segments in a protein sequence based on amino acid composition Bioinformatics, 21, 1891–1900[Abstract/Free Full Text] .

  25. Dosztanyi, Z., Csizmok, V., Tompa, P., Simon, I. (2005) The pairwise energy content estimated from amino acid composition discriminates between folded and intrinsically unstructured proteins J. Mol. Biol, . 347, 827–839[CrossRef][Web of Science][Medline] .

  26. Wootton, J.C. (1994) Non-globular domains in protein sequences: automated segmentation using complexity measures Comput Chem, . 18, 269–285[CrossRef][Web of Science][Medline] .

  27. Melamud, E. and Moult, J. (2003) Evaluation of disorder predictions in CASP5 Proteins, 53, 561–565 .

  28. Cuff, J.A. and Barton, G.J. (1999) Evaluation and improvement of multiple sequence methods for protein secondary structure prediction Proteins, 34, 508–519[CrossRef][Web of Science][Medline] .

  29. Albrecht, M., Tosatto, S.C.E., Lengauer, T., Valle, G. (2003) Simple consensus procedures are effective and sufficient in secondary structure prediction Protein Eng, . 16, 459–462[Abstract/Free Full Text] .

  30. Ward, J.J., McGuffin, L.J., Buxton, B.F., Jones, D.T. (2003) Secondary structure prediction with support vector machines Bioinformatics, 19, 1650–1655[Abstract/Free Full Text] .

  31. Vucetic, S., Obradovic, Z., Vacic, V., Radivojac, P., Peng, K., Iakoucheva, L.M., Cortese, M.S., Lawson, J.D., Brown, C.J., Sikes, J.G., et al. (2005) DisProt: a database of protein disorder Bioinformatics, 21, 137–140[Abstract/Free Full Text] .

  32. Deshpande, N., Addess, K.J., Bluhm, W.F., Merino-Ott, J.C., Townsend-Merino, W., Zhang, Q., Knezevich, C., Xie, L., Chen, L., Feng, Z.K., et al. (2005) The RCSB Protein Data Bank: a redesigned query system and relational database based on the mmCIF schema Nucleic Acids Res, . 33, D233–D237[Abstract/Free Full Text] .

  33. Shawe, Taylor and Cristianini, N. Kernel Methods for Pattern Analysis, (2004) Cambridge, UK Cambridge University Press .

  34. Platt, J. Probabilistic Outputs of Support Vector Machines and Comparison to Reguralised Likelihood Methods, (2000) Cambridge, MA MIT press .

  35. Keerthi, S.S. and Lin, C.J. (2003) Asymptotic behaviors of support vector machines with Gaussian kernel Neural Comput, . 15, 1667–1689[CrossRef][Web of Science][Medline] .

  36. Altschul, S.F., Madden, T.L., Schaffer, A.A., Zhang, J., Zhang, Z., Miller, W., Lipman, D.J. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs Nucleic Acids Res, . 25, 3389–3402[Abstract/Free Full Text] .

  37. Pollastri, G. and McLysaght, A. (2005) Porter: a new, accurate server for protein secondary structure prediction Bioinformatics, 21, 1719–1720[Abstract/Free Full Text] .

  38. Jin, Y. and Dunbrack, R.L.Jr. (2005) Assessment of disorder predictions in CASP6 Proteins, 61, 167–175 .

  39. Obradovic, Z., Peng, K., Vucetic, S., Radivojac, P., Dunker, A.K. (2005) Exploiting heterogeneous sequence properties improves prediction of protein disorder Proteins, 61, 176–182[Web of Science][Medline] .


Add to CiteULike CiteULike   Add to Connotea Connotea   Add to Del.icio.us Del.icio.us    What's this?


This article has been cited by other articles:


Home page
J. Biol. Chem.Home page
R. Saxena, P. Kanudia, M. Datt, H. H. Dar, S. Karthikeyan, B. Singh, and P. K. Chakraborti
Three Consecutive Arginines Are Important for the Mycobacterial Peptide Deformylase Enzyme Activity
J. Biol. Chem., August 29, 2008; 283(35): 23754 - 23764.
[Abstract] [Full Text] [PDF]


Home page
FASEB J.Home page
T. Parasassi, M. De Spirito, G. Mei, R. Brunelli, G. Greco, L. Lenzi, G. Maulucci, E. Nicolai, M. Papi, G. Arcovito, et al.
Low density lipoprotein misfolding and amyloidogenesis
FASEB J, July 1, 2008; 22(7): 2350 - 2356.
[Abstract] [Full Text] [PDF]


Home page
Nucleic Acids ResHome page
C.-T. Su, C.-Y. Chen, and C.-M. Hsu
iPDA: integrated protein disorder analyzer
Nucleic Acids Res., July 13, 2007; 35(suppl_2): W465 - W472.
[Abstract] [Full Text] [PDF]


Home page
BioinformaticsHome page
O. V. Galzitskaya, S. O. Garbuzynskiy, and M. Yu. Lobanov
FoldUnfold: web server for the prediction of disordered regions in protein chain
Bioinformatics, December 1, 2006; 22(23): 2948 - 2949.
[Abstract] [Full Text] [PDF]


This Article
Right arrow Abstract Freely available
Right arrow Print PDF (96K) Freely available
Right arrow Screen PDF (99K) Freely available
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrow Commercial Re-use Guidelines
for Open Access NAR Content
Google Scholar
Right arrow Articles by Vullo, A.
Right arrow Articles by Tosatto, S. C. E.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Vullo, A.
Right arrow Articles by Tosatto, S. C. E.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?