Nucleic Acids Research 2006 34(Web Server issue):W182-W185; doi:10.1093/nar/gkl189
© The Author 2006. Published by Oxford University Press. All rights reserved
The online version of this article has been published under an open access model. Users are entitled to use, reproduce, disseminate, or display the open access version of this article for non-commercial purposes provided that: the original authorship is properly and fully attributed; the Journal and Oxford University Press are attributed as the original place of publication with the correct citation details given; if an article is subsequently reproduced or disseminated not in its entirety but only in part or as a derivative work this must be clearly indicated. For commercial re-use, please contact journals.permissions@oxfordjournals.org
DiANNA 1.1: an extension of the DiANNA web server for ternary cysteine classification
F. Ferrè1 and P. Clote1,2,*
1 Department of Biology, Boston College, Chestnut Hill, MA, USA
2 Department of Computer Science (courtesy appointment), Boston College, Chestnut Hill, MA, USA
*To whom correspondence should be addressed. Tel: +1 617 552 1332; Fax: +1 617 552 2011; Email: clote@bc.edu
Received February 14, 2006. Revised March 20, 2006. Accepted March 20, 2006.
ABSTRACT
DiANNA is a recent state-of-the-art artificial neural network
and web server, which determines the cysteine oxidation state
and disulfide connectivity of a protein, given only its amino
acid sequence. Version 1.0 of DiANNA uses a feed-forward neural
network to determine which cysteines are involved in a disulfide
bond, and employs a novel architecture neural network to predict
which half-cystines are covalently bound to which other half-cystines.
In version 1.1 of DiANNA, described here, we extend functionality
by applying a support vector machine with spectrum kernel to
the cysteine classification problem, namely to determine whether
a cysteine is reduced (free in sulfhydryl state), half-cystine
(involved in a disulfide bond) or bound to a metallic ligand.
In the latter case, DiANNA predicts the ligand among iron, zinc,
cadmium and carbon. Available at:
http://bioinformatics.bc.edu/clotelab/DiANNA/.
INTRODUCTION
Cysteine residues play a unique role in determining protein
stability and function. Cysteines may be reduced (free, where
sulfur occurs in the reactive sulfhydryl form) or oxidized;
the latter may be involved in a disulfide bond, i.e. a half-cystine,
or instead covalently bound to a metallic ligand that is part
of a prosthetic group. Experimental determination of cysteine
species (free, half-cystine, ligand-bound) is non-trivial, and
often only the knowledge of the three-dimensional structure
indicates the species. For this reason, cysteine classification
is an important bioinformatics problem that may be approached
by using machine learning methods. In this paper, we apply support
vector machines (SVM) to the ternary cysteine classification
problem, to determine whether a given cysteine is free, a half-cystine
or ligand-bound. To the best of our knowledge, the present paper
describes the only existent ternary cysteine classification
program.
It is reasonable to assume that each species of cysteine resides in a distinct micro-environment which influences the cysteine redox potential and its steric accessibility. This hypothesis is confirmed and exploited in several machine learning approaches for cysteine classification that, while different, share the common feature that the discrimination is based on the analysis of the cysteine sequence context, using a symmetric sequence window of length w centered about each cysteine. Particular effort has been spent on the binary classification problem to discriminate intra-chain half-cystines from free cysteines, the latter being the most represented species. For this problem, various methods have yielded steadily increasing prediction accuracies (1,2). Nevertheless, other species of cysteines exist, namely ligand-bound cysteines and half-cystines involved in inter-chain disulfide bonds. Such cysteines reside in possibly different micro-environments, and hence may be discernible from other species. Only one attempt has been made to discriminate ligand-bound cysteines; specifically, Passerini and Frasconi (3) obtained a prediction accuracy of 90% for the binary classification problem of distinguishing ligand-bound cysteines from half-cystines.
DiANNA 1.1 is the only software which performs ternary cysteine classification; all other cysteine classification web servers consider only the binary classification problem of discriminating free cysteines from intra-chain half-cystines. In this paper, we apply an SVM with (a variant of) the spectrum kernel (4) to classify cysteines into three different species: free, half-cystine or ligand-bound. For predicted ligand-bound cysteines, we further refine the classification by predicting the bound ligand to be iron, zinc, cadmium or carbon. Although we have some results concerning inter-chain disulfide bonds (data not shown), the DiANNA web server is intended only for use with single-chain proteins.
DATASET
To test and train a ternary SVM predictor for cysteine classification, it was necessary to build a dataset in which each cysteine species is well represented. This was done as follows. From the Protein Data Bank (5), we extracted the set of single-chain proteins containing ligand-bound cysteines, and produced a non-redundant collection by using the program UniqueProt (6) with HSSP distance set to 0. This produced a list of 202 chains, denoted by UP.
To enrich the small number (60) of half-cystine examples (which is probably not representative), we considered the 967 non-redundant protein chains used in (1) for training and testing a neural network to predict cysteine oxidation state (dataset MA). We merged the UP and MA datasets, and re-applied UniqueProt to eliminate redundancy between the two lists. From each redundancy cluster, we selected one member containing ligand-bound cysteines, if available (if not, we selected the representative member proposed by UniqueProt). In this fashion, we obtained a dataset (denoted UPMA) of 526 chains, with adequate representation of each of the three cysteine classes.
Table 1 displays the number of cysteines in each species, and Table 2 presents the number of chains containing each species. From each protein in UPMA, we extracted symmetric windows of size w centered around each cysteine. Different values of w were tested, and the best results were obtained for w = 17 [the same value led to the best performance in (3)]. The annotated UPMA list is available at URL http://bioinformatics.bc.edu/clotelab/DiANNA/UPMA_annotated.html.
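The window extraction step can be illustrated with a minimal Python sketch. It is not the authors' code: the function name `cysteine_windows` is hypothetical, and padding the chain termini with the placeholder residue `X` is an assumption, since the text does not say how cysteines closer than (w - 1)/2 residues to a terminus were handled.

```python
def cysteine_windows(sequence, w=17, pad="X"):
    """Return (position, window) pairs, one per cysteine in `sequence`.

    Each window has odd length `w` and is centered on the cysteine.
    Positions are 1-based. Padding the termini with `pad` is an
    assumption of this sketch, not something stated in the paper.
    """
    if w % 2 == 0:
        raise ValueError("w must be odd so that the window is symmetric")
    half = w // 2
    seq = sequence.upper()
    padded = pad * half + seq + pad * half
    windows = []
    for i, residue in enumerate(seq):
        if residue == "C":
            # Residue i of `seq` sits at index i + half of `padded`,
            # so the slice [i, i + w) is centered on it.
            windows.append((i + 1, padded[i:i + w]))
    return windows
```

With the default w = 17, every returned window has the cysteine at index 8, with eight flanking positions on each side.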
Table 2 Breakdown of protein chains which contain at the same time half-cystines (HC), free cysteines (FC) and ligand-bound cysteines (LC), for each of the three datasets considered in this paper
SVM PREDICTION USING STRING KERNELS
SVMs were introduced by Vapnik within the context of a mathematically rigorous statistical learning theory; for a very clear exposition of this topic see (7). Often demonstrating better prediction accuracy than neural networks, SVMs have become increasingly popular in bioinformatics, with applications including translation initiation site determination (8), remote homology detection in proteins (9), viral protease cleavage site prediction (10) and fast computation of Z-scores for minimum free energy of RNA (11).
To apply SVMs to the ternary cysteine classification problem, we use the spectrum representation (4), which describes an amino acid sequence by the vector of counts of the k-mers which occur in it; i.e. for peptide p, define
$$\Phi_k(p) = \langle \phi_a(p) : a \in A^k \rangle,$$
where $\phi_a(p)$ is the number of occurrences of the k-mer a in p, and A is the set of 1-letter codes of amino acids. Leslie et al. use the terms spectrum kernel and mismatch kernel in (4,13), and Busuttil et al. use the term profile-based kernel in (14). More rigorously speaking, these authors actually apply classical kernels [e.g. the linear kernel in (4,13)] to new representations of amino acid sequences: the spectrum representation, the mismatch representation and the profile-based spectrum representation. In this paper, we obtained the best results when k = 3, so that the amino acid sequence p in each size w window is encoded by the vector $\Phi_3(p)$ of $20^3 = 8000$ coordinates, giving the number of occurrences of each 3-mer in p. With the spectrum representation, we used the software libSVM (12) with a degree 2 polynomial kernel and cost parameter C = 1; for an explanation of these parameters see (12).
To train and test the SVMs we used 5-fold cross-validation, splitting positive and negative datasets into five random subsets of approximately the same size. Using libSVM, the SVM multiclass classifier outputs, for each cysteine in the input sequence, the probability of being a free cysteine (FC), a half-cystine (HC) or ligand-bound (LC). To measure the performance of the algorithm we used the Q3 score, which is the ratio between correctly predicted examples and the total number of examples. The Q3 score is commonly used for the performance evaluation of three-state (sheet, helix, coil) secondary structure predictors; e.g. see (15). Additionally, we computed the Qp score, which is the fraction of proteins for which all cysteines are correctly classified. The results (Table 3) show that the highest Q3 and Qp scores are obtained using the spectrum representation with a degree 2 polynomial kernel (scores of 0.78 and 0.53, respectively). Although the papers (13) and (14,16) report that the mismatch and profile-based kernels outperform the spectrum kernel in protein classification experiments, we found that this is not the case for cysteine oxidation state prediction. Additional data describing the results of binary classification experiments can be found in the web supplement at the DiANNA web site.
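The two performance measures defined above can be written down directly. The sketch below is illustrative only: the function names and the input layout (one list of class labels per protein, in cysteine order) are assumptions, and the labels in the usage note are made up for the example.

```python
def q3_score(true_labels, predicted_labels):
    """Fraction of cysteines, pooled over all proteins, classified correctly."""
    flat_true = [y for protein in true_labels for y in protein]
    flat_pred = [y for protein in predicted_labels for y in protein]
    correct = sum(t == p for t, p in zip(flat_true, flat_pred))
    return correct / len(flat_true)


def qp_score(true_labels, predicted_labels):
    """Fraction of proteins in which every cysteine is classified correctly."""
    perfect = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return perfect / len(true_labels)
```

For three hypothetical proteins with true labels `[["FC", "HC", "HC"], ["LC", "LC"], ["FC"]]` and predictions `[["FC", "HC", "FC"], ["LC", "LC"], ["FC"]]`, five of six cysteines are correct (Q3 = 5/6) and two of three proteins are entirely correct (Qp = 2/3). Qp can never exceed Q3 by much on multi-cysteine proteins, since a single wrong cysteine makes the whole protein count as a miss.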
Table 3 Performance measure (Q3 and Qp scores) for the three-class prediction of LC, HC, FC using different kernels and input representation
Table 4 displays the number of examples in dataset UPMA for each distinct ligand type in ligand-bound cysteines. For the cases for which we have at least 39 examples (i.e. Zn, Fe, Cd, C) we investigated whether machine learning can be used to discriminate the atomic species bound, i.e. whether the sequence context of each type of ligand is significantly different. Experiments were performed where the positive set consisted of amino acid sequences symmetrically flanking those cysteines bound to a specific ligand (say iron), while the negative set consisted of sequences flanking cysteines bound to a different ligand. In the case of cadmium (Cd) and carbon (C), we randomly resampled the positive training set (which is substantially smaller than the negative training set) until the number of positive and negative examples was the same (note that the test set is unchanged). As in ternary cysteine classification, we found that the best discrimination was obtained using the degree 2 polynomial kernel with the spectrum representation. Results are reported in Table 5 and Figure 1.
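The balancing step for the Cd and C training sets can be sketched as follows. This is one plausible reading of "randomly resampled", namely drawing extra positive examples with replacement until the two classes have equal size; the exact procedure is not given in the text, and the function name and `seed` parameter are hypothetical.

```python
import random


def oversample_positives(positives, negatives, seed=0):
    """Return a positive training set grown to the size of `negatives`.

    All original positives are kept; additional ones are drawn at random
    with replacement (an assumption of this sketch). Only the training
    split should be passed in, so that the test set stays unchanged.
    """
    if not positives:
        raise ValueError("need at least one positive example")
    rng = random.Random(seed)
    balanced = list(positives)
    while len(balanced) < len(negatives):
        balanced.append(rng.choice(positives))
    return balanced
```

Applying the function to the training fold only, inside each cross-validation round, keeps duplicated examples from leaking into the test fold.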

Figure 1 ROC curves for the prediction of cysteines covalently bound to specific ligands. [For an explanation of receiver operating characteristic (ROC) curves see (20)].
WEB SERVER
DiANNA 1.1 has a simple user-friendly web interface, which allows the user to obtain a prediction of the state (free, half-cystine or ligand-bound) for each cysteine in an input protein. The ternary SVM predictor outputs the highest probability class, and, for those cysteines predicted as ligand-bound, the most likely ligand is displayed (among iron, zinc, cadmium, carbon), by a winner-takes-all decision. Additionally, as described previously (17,18), DiANNA 1.1 uses a state-of-the-art method to predict the disulfide connectivity, i.e. which cysteines form a disulfide bond with which other cysteines. A screen shot of the DiANNA 1.1 web server output for a ternary classification prediction is shown in Figure 2. Additionally, DiANNA 1.1 allows all possible binary classification predictions for the three cysteine classes (free, half-cystine, ligand-bound). The web server interface is largely self-explanatory. The upper panel of Figure 2 displays the input form, including the pull-down menu, which allows the user to choose the classifier used for cysteine state prediction (ternary classifier, or one of three binary classifiers). The lower panel of Figure 2 displays the output of the ternary cysteine state classifier, indicating the probability of each class (half-cystine, free cysteine, ligand-bound). In the case of predicted ligand-bound cysteines, the predicted ligand is listed in the right-most column. The user enters a protein in FASTA format, possibly including a FASTA comment, and chooses either to predict the cysteine state for each cysteine, or to determine the disulfide connectivity. The latter function has already been described in (17).
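The decision rule described above (report the highest probability class, then pick a ligand by winner-takes-all for ligand-bound predictions) can be sketched in a few lines of Python. The function name, the dictionary layout and the numerical scores in the usage note are hypothetical; the server's actual output format may differ.

```python
def classify_cysteine(class_probs, ligand_scores=None):
    """Return (predicted_class, predicted_ligand) for one cysteine.

    `class_probs` maps "FC", "HC" and "LC" to probabilities from the
    ternary classifier. `ligand_scores` maps ligand names to the scores
    of the per-ligand classifiers. A ligand is reported only when the
    winning class is "LC"; otherwise the second element is None.
    """
    best_class = max(class_probs, key=class_probs.get)
    ligand = None
    if best_class == "LC" and ligand_scores:
        # Winner-takes-all: the ligand with the highest score is reported.
        ligand = max(ligand_scores, key=ligand_scores.get)
    return best_class, ligand
```

For instance, with class probabilities `{"FC": 0.1, "HC": 0.2, "LC": 0.7}` and ligand scores `{"Fe": 0.3, "Zn": 0.6, "Cd": 0.05, "C": 0.05}`, the sketch returns `("LC", "Zn")`.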

Figure 2 DiANNA ternary cysteine classification prediction input and output example. Upper panel: The DiANNA web-server update allows the user to choose between disulfide connectivity prediction and cysteine classification (ternary cysteine classification is only available in the 1.1 update). In the latter case, the user can type or paste a FASTA sequence in a text box, then choose among four different classification predictions by means of a drop down menu (i.e. the ternary LC versus HC versus FC classification, and the three binary classifications LC versus HC, LC versus FC and HC versus FC). Lower panel: Output for the ternary classification. For each cysteine in the submitted sequence, the SVM model predicts the probability of being half-cystine, free cysteine or ligand-bound. The class having the highest probability is highlighted. If a specific cysteine is predicted as ligand-bound, a tentative prediction of the putative ligand (out of four possible ligands) is given.
CONCLUSION
Given the amino acid sequence of a protein, DiANNA (17) is a state-of-the-art method to predict disulfide connectivity topology. Version 1.0 of the DiANNA web server, described in (18), additionally predicts the oxidation state of each cysteine (free or half-cystine), by using our implementation of the neural network of Fariselli et al. (19). In version 1.1 of the DiANNA web server, described in this paper, we replace the binary classifier of (19) by an SVM with degree 2 polynomial kernel for the spectrum representation (4). Using libSVM, we obtain a ternary classifier, capable of discriminating between free cysteines, half-cystines and ligand-bound cysteines. Moreover, for the latter, DiANNA 1.1 predicts the type of ligand. To the best of our knowledge, this is the first application of string-based kernels to sequence windows; until this paper, such kernels had been used only for protein classification.
ACKNOWLEDGEMENTS
We would like to thank J. Waldispühl for help with the web interface design, and anonymous referees for some valuable suggestions. Work of P.C. was partially supported by NSF DBI-0543506. Funding to pay the Open Access publication charges for this article was provided by NSF grant DBI-0543506.
Conflict of interest statement. None declared.
REFERENCES
1. Martelli, P.L., Fariselli, P., Malaguti, L., Casadio, R. (2002) Prediction of the disulfide bonding state of cysteines in proteins with hidden neural networks. Protein Eng., 15, 951-953.
2. Chen, Y.C., Lin, Y.S., Lin, C.J., Hwang, J.K. (2004) Prediction of the bonding states of cysteines using the support vector machines based on multiple feature vectors and cysteine state sequences. Proteins, 55, 1036-1042.
3. Passerini, A. and Frasconi, P. (2004) Learning to discriminate between ligand-bound and disulfide-bound cysteines. Protein Eng. Des. Sel., 17, 367-373.
4. Leslie, C., Eskin, E., Noble, W.S. (2002) The spectrum kernel: a string kernel for SVM protein classification. Pac. Symp. Biocomput., 564-575.
5. Berman, H.M., Battistuz, T., Bhat, T.N., Bluhm, W.F., Bourne, P.E., Burkhardt, K., Feng, Z., Gilliland, G.L., Iype, L., Jain, S., et al. (2002) The Protein Data Bank. Acta Crystallogr. D Biol. Crystallogr., 58, 899-907.
6. Mika, S. and Rost, B. (2003) UniqueProt: creating representative protein sequence sets. Nucleic Acids Res., 31, 3789-3791.
7. Vapnik, V. (1995) The Nature of Statistical Learning Theory. Springer, NY.
8. Zien, A., Ratsch, G., Mika, S., Scholkopf, B., Lengauer, T., Muller, K.R. (2000) Engineering support vector machine kernels that recognize translation initiation sites. Bioinformatics, 16, 799-807.
9. Jaakkola, T., Diekhans, M., Haussler, D. (1999) Using the Fisher kernel method to detect remote protein homologies. Proc. Int. Conf. Intell. Syst. Mol. Biol., 149-158.
10. Narayanan, A., Wu, X., Yang, Z.R. (2002) Mining viral protease data to extract cleavage knowledge. Bioinformatics, 18, S5-S13.
11. Washietl, S., Hofacker, I.L., Stadler, P.F. (2005) Fast and reliable prediction of noncoding RNAs. Proc. Natl Acad. Sci. USA, 102, 2454-2459.
12. Fan, R.-E., Chen, P.-H., Lin, C.-J. (2005) Working set selection using the second order information for training SVM. J. Machine Learning Res., 6, 1889-1918.
13. Leslie, C.S., Eskin, E., Cohen, A., Weston, J., Noble, W.S. (2004) Mismatch string kernels for discriminative protein classification. Bioinformatics, 20, 467-476.
14. Busuttil, S., Abela, J., Pace, G. (2004) Support vector machines with profile-based kernels for discriminative protein classification. Genome Inform., 15, 191-200.
15. Jones, D.T. (1999) Protein secondary structure prediction based on position-specific scoring matrices. J. Mol. Biol., 292, 195-202.
16. Kuang, R., Ie, E., Wang, K., Siddiqi, M., Freund, Y., Leslie, C. (2004) Profile-based string kernels for remote homology detection and motif extraction. Proc. IEEE Comput. Syst. Bioinform. Conf., 152-160.
17. Ferrè, F. and Clote, P. (2005) Disulfide connectivity prediction using secondary structure information and diresidue frequencies. Bioinformatics, 21, 2336-2346.
18. Ferrè, F. and Clote, P. (2005) DiANNA: a web server for disulfide connectivity prediction. Nucleic Acids Res., 33, W230-W232.
19. Fariselli, P., Riccobelli, P., Casadio, R. (1999) Role of evolutionary information in predicting the disulfide-bonding state of cysteine in proteins. Proteins, 36, 340-346.
20. Gribskov, M. and Robinson, N. (1996) The use of receiver operating characteristic (ROC) analysis to evaluate sequence matching. Comput. Chem., 20, 25-34.
