Nucleic Acids Research Advance Access originally published online on May 25, 2007
Nucleic Acids Research 2007 35(Web Server issue):W538-W542; doi:10.1093/nar/gkm254
Nucleic Acids Research, 2007, Vol. 35, No. suppl_2 W538-W542
© 2007 The Author(s)
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
CytoSVM: an advanced server for identification of cytokine-receptor interactions
Jin-Rui Xu1, Jing-Xian Zhang1, Bu-Cong Han1, Liang Liang1 and Zhi-Liang Ji1,2,*
1Key Laboratory for Cell Biology & Tumor Cell Engineering, the Ministry of Education of China, School of Life Sciences and 2The Key Laboratory for Chemical Biology of Fujian Province, Xiamen University, Xiamen 361005, FuJian Province, P R China
*To whom correspondence should be addressed. Tel: 86-0592-2182897; Fax: 86-0592-2181015; Email: appo{at}bioinf.xmu.edu.cn; zhiliang.ji{at}gmail.com;
Received January 23, 2007. Revised March 24, 2007. Accepted April 8, 2007.
ABSTRACT
The interactions between cytokines and their complementary receptors are the gateway to understanding a wide variety of cytokine-specific cellular activities such as immunological responses and cell differentiation. To discover novel cytokine-receptor interactions, an advanced support vector machine (SVM) model, CytoSVM, was constructed in this study. The model was trained iteratively, in an enriched way, using 449 mammalian (excluding rat) cytokine-receptor interactions and about 1 million virtually generated positive and negative vectors. A final independent evaluation on rat data yielded a sensitivity of 97.4%, a specificity of 99.2% and a Matthews correlation coefficient (MCC) of 0.89. This performance is better than that of normal SVM-based models. Based on this well-optimized model, a web-based server was created that accepts a primary protein sequence and reports its probabilities of interacting with one or several cytokines. Moreover, the model was applied to identify putative cytokine-receptor pairs in the whole genomes of human and mouse. Excluding currently known cytokine-receptor interactions, a total of 1609 novel cytokine-receptor pairs with probability >80% were discovered in the human genome after further transmembrane analysis. These cover 220 novel receptors (excluding their isoforms) for 126 human cytokines. The screening results have been deposited in a database. Both the server and the database can be freely accessed at http://bioinf.xmu.edu.cn/software/cytosvm/cytosvm.php.
INTRODUCTION
The binding of cytokines to their receptors on cell membranes triggers cellular activities such as immunological regulation, cell growth, differentiation, apoptosis and migration in vertebrates (1). Characterizing novel cytokine-receptor pairs is therefore a shortcut to understanding these cytokine-mediated signaling pathways.
The traditional isolation and characterization methods for identifying cytokine-receptor pairs are significantly limited by the short half-life, low plasma concentrations, pleiotropy and redundancy of cytokines. The situation has improved with the application of modern molecular technologies such as cloning. Furthermore, as a complement to experimental approaches, searches for new cytokines or their receptors are now often conducted by identifying genes highly homologous to known cytokine/receptor genes. Currently, 203 human cytokine-receptor pairs have been characterized, as presented in the KEGG pathway database (2). Unfortunately, it has become increasingly difficult to discover new cytokine-receptor partners unless new sequence features are identified. In particular, the functions of peptides without significant sequence similarity to known cytokines/receptors are difficult to probe with homology-based or clustering methods.
Various alternative methods for describing protein interactions have been developed in recent years. These include evolutionary analysis (3,4), hidden Markov models (5), structural considerations (6–8), protein/gene fusion (9,10), motif recognition (11), family classification by sequence clustering (12) and functional family prediction by statistical learning methods (13,14). The support vector machine (SVM) is a two-class classifier that has previously been used to classify cytokine families (http://www.bioinfo.tsinghua.edu.cn/%7Ehn/CTKPred/index.html) (14). In this study, we constructed an improved SVM model, CytoSVM, for the identification of cytokine-receptor interactions on the basis of protein primary sequences. This model was further applied to screen the whole genomes of human and mouse for novel cytokine-receptor pairs.
CONSTRUCTION OF CytoSVM MODEL
CytoSVM is a model based on the statistical learning algorithm SVM. This algorithm has been well studied and applied to a variety of protein classification problems, including protein functional class (13,15), fold recognition (16), analysis of solvent accessibility (17), prediction of secondary structures (18) and protein–protein interactions (19). As a method that uses sequence-derived physicochemical properties of proteins as the basis for classification, SVM may be particularly useful for functional classification of distantly related proteins and of homologous proteins with different functions (13). This feature makes SVM a potentially attractive method for probing novel cytokine receptors, especially when the sequence diversity of cytokine receptors cannot be properly handled by sequence homology-based approaches.
The data sets
The positive data pool
The positive data (the true cytokine-receptor interactions) were collected from the KEGG pathway database (2) and the literature. These interaction pairs cover 449 distinct known cytokine-receptor interactions in mammals other than rat. To be eligible for model construction, every sequence was represented by a feature vector assembled from encoded representations of tabulated residue properties, including amino acid composition, hydrophobicity, normalized van der Waals volume, polarity, polarizability, charge, surface tension, secondary structure and solvent accessibility, for each residue in the sequence (13,15–19). A positive vector for an interaction pair was formed by joining the vectors of the cytokine and its complementary receptor. To enlarge the positive data pool, four virtual vectors were generated around each positive vector by slightly (by about 1/1000) increasing or decreasing the values of vector elements in the multi-dimensional space. As a result, a total of 2243 positive data (449 true positives and 1794 virtual positives) were prepared for model training.
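The vector construction described above can be illustrated with a minimal Python sketch. It is not the authors' encoding: only amino acid composition is used as a stand-in for the full set of residue properties, and the function names, the random perturbation scheme and the seed are assumptions made for illustration.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def composition_vector(sequence):
    """Return the fractions of the 20 standard amino acids in a sequence."""
    seq = sequence.upper()
    counts = [seq.count(aa) for aa in AMINO_ACIDS]
    total = sum(counts)
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [c / total for c in counts]

def pair_vector(cytokine_seq, receptor_seq):
    """Join the cytokine vector and the receptor vector into one pair vector."""
    return composition_vector(cytokine_seq) + composition_vector(receptor_seq)

def virtual_vectors(vector, n=4, scale=1e-3, seed=0):
    """Return n copies of vector, each element shifted by at most `scale`
    of its own value (assumed perturbation scheme, not from the paper)."""
    rng = random.Random(seed)
    return [[v * (1.0 + rng.uniform(-scale, scale)) for v in vector]
            for _ in range(n)]
```

With this sketch, `virtual_vectors(pair_vector(cytokine, receptor))` yields four near-copies of one positive vector, which mirrors the enlargement of the positive pool described in the text.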
The negative data pool
The negative data pool includes both true and virtual data. The true negatives are 126 literature-reported non-cytokine–protein interactions, which represent only a very limited part of the sequence and structural features of non-cytokine–receptor interactions. To cover all possible negative conditions, a large number of virtual negative interaction pairs were generated as follows: 7816 seed sequences representing diverse domain families, excluding those containing any known cytokine or cytokine receptor, were extracted from the Pfam protein families database (20). These Pfam seeds were paired with mammalian cytokines in all possible combinations to form the virtual negative interactions. The negative interaction pairs were converted from sequences to vectors in the same way as described above. In total, about 1 million negative data were prepared in the negative data pool.
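The all-combinations pairing can be sketched as follows. The identifiers and the `excluded_ids` filter are hypothetical; the paper does not describe how cytokine- or receptor-containing Pfam families were flagged.

```python
from itertools import product

def virtual_negative_pairs(cytokine_ids, pfam_seed_ids, excluded_ids=frozenset()):
    """Pair every cytokine with every Pfam seed that is not excluded.

    excluded_ids stands for seeds that contain a known cytokine or
    cytokine receptor (hypothetical bookkeeping, assumed for this sketch).
    """
    allowed = [s for s in pfam_seed_ids if s not in excluded_ids]
    return list(product(cytokine_ids, allowed))
```

Each returned `(cytokine, seed)` tuple would then be converted to a joined feature vector in the same way as a positive pair.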
The SVM algorithm
The theory of SVM has been well described in the literature (21,22). The structural and physicochemical features of a protein interaction are represented by a feature vector derived from the primary sequences as described above. The vector is projected into a hyperspace in which a hyperplane classifies the protein interaction pair as either positive (cytokine–receptor interaction) or negative (non-cytokine–receptor interaction), depending on which side of the hyperplane the vector lies. In this study, an RBF kernel function K(xi, xj) was adopted to map the input vector into a high-dimensional feature space:
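$$K(\mathbf{x}_i, \mathbf{x}_j) = \exp\left(-\frac{\lVert \mathbf{x}_j - \mathbf{x}_i \rVert^2}{2\sigma^2}\right) \qquad (1)$$
where σ is the width of the kernel.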
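As an illustration, an RBF kernel can be evaluated in a few lines of Python. This sketch assumes the common Gaussian form exp(−‖xj − xi‖²/(2σ²)); the width parameter `sigma` and its default value are assumptions, since the value used for CytoSVM is not stated here.

```python
import math

def rbf_kernel(x_i, x_j, sigma=1.0):
    """Gaussian RBF kernel between two equal-length feature vectors."""
    if len(x_i) != len(x_j):
        raise ValueError("vectors must have the same length")
    squared_distance = sum((a - b) ** 2 for a, b in zip(x_i, x_j))
    return math.exp(-squared_distance / (2.0 * sigma ** 2))
```

The kernel equals 1 for identical vectors and decays toward 0 as the vectors move apart, so it acts as a similarity score between two interaction pairs.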
The output of the SVM model is the predicted class of the input, which is converted into a posterior probability by fitting a sigmoid (23):
$$P(y = 1 \mid \mathbf{x}) = \frac{1}{1 + \exp\left(A f(\mathbf{x}) + B\right)} \qquad (2)$$
where f(x) is the output of the SVM, and the parameters A and B are estimated by minimizing the negative log likelihood of the training data. A higher probability indicates higher confidence in a positive prediction.
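The sigmoid mapping from an SVM output to a probability can be sketched as below, assuming the form of Platt (23), P = 1/(1 + exp(A·f(x) + B)). The values of A and B in any call are placeholders; in practice they come from fitting on training data.

```python
import math

def platt_probability(f_x, A, B):
    """Map an SVM decision value f_x to a probability with a fitted sigmoid.

    Computes 1 / (1 + exp(A * f_x + B)) in a form that avoids overflow
    for large positive or negative arguments.
    """
    z = A * f_x + B
    if z >= 0:
        e = math.exp(-z)
        return e / (1.0 + e)
    return 1.0 / (1.0 + math.exp(z))
```

With a negative A, which is the usual outcome of the fit, larger decision values map to probabilities closer to 1.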
Evaluation and performance measure
As in the case of all discriminative methods (24), the performance of SVM classification can be measured by the numbers of true positives (TP), true negatives (TN), false positives (FP) and false negatives (FN); the sensitivity SE = TP/(TP + FN), which is the accuracy of cytokine–protein interaction prediction; and the specificity SP = TN/(TN + FP), which is the accuracy of non-cytokine–protein interaction prediction. The overall performance of the model can be measured using both the Matthews correlation coefficient (MCC) below:
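$$\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FN)(TP + FP)(TN + FP)(TN + FN)}} \qquad (3)$$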
and a receiver operating characteristic (ROC) plot (25). A ROC plot shows the true positive rate against the false positive rate over the possible decision thresholds of a model. The area under the ROC curve (AUC) is usually adopted as a scalar measure that gauges one facet of performance (25). In this study, the ROC plot (please refer to http://bioinf.xmu.edu.cn/software/cytosvm/help.htm#roccurve) and AUC were used to compare the performance of different SVM models (Table 1). The enriched-SVM model with virtual positives (model M1) showed the best performance and was therefore chosen to classify the cytokine–receptor interactions.
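The measures defined in this section can be computed directly from the four counts. The sketch below is a plain-Python illustration; the pairwise AUC estimate is one standard way to obtain the area under the ROC curve and is not necessarily how the authors computed it.

```python
import math

def sensitivity(tp, fn):
    """SE = TP / (TP + FN)."""
    return tp / (tp + fn)

def specificity(tn, fp):
    """SP = TN / (TN + FP)."""
    return tn / (tn + fp)

def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient; returns 0.0 if a margin is empty."""
    denominator = math.sqrt((tp + fn) * (tp + fp) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0
    return (tp * tn - fp * fn) / denominator

def roc_auc(positive_scores, negative_scores):
    """AUC as the fraction of (positive, negative) score pairs ranked
    correctly, counting ties as half."""
    wins = sum((p > n) + 0.5 * (p == n)
               for p in positive_scores for n in negative_scores)
    return wins / (len(positive_scores) * len(negative_scores))
```

Because negatives greatly outnumber positives in this kind of screen, MCC is more informative than raw accuracy: a handful of false positives lowers it noticeably even when specificity stays above 99%.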
The enriched-SVM model
The model construction used all 2243 real and virtual positive vectors in the positive data pool and about 1 million negative vectors in the negative data pool. To represent all negative sequence and structural features while reducing the severe imbalance between positive and negative data, the vectors in the negative data pool were randomly divided into 230 groups of about 4200 negative vectors each. Each of these 230 negative groups was combined with the 2243 positive vectors, giving 230 data sets for model construction. Of these, 229 were used for independent training runs, and the remaining one was held out for testing.
The SVM model was initialized by 229 independent training runs and then optimized through several rounds of training in an enriched way. The negative support vectors (vectors close to the hyperplane on the negative side) that determine the hyperplanes of the 229 independent models were extracted to form a new negative data pool. This pool was again arranged into groups for the next round of learning. The iterative learning process, or enriched selection of support vectors, was continued in search of the globally optimal separating hyperplane (OSH) until the positive and negative data were nearly balanced, at a ratio of about 1:3 in this case. The optimized model was first tested on the held-out data set to assess its theoretical performance, achieving a sensitivity of 100%, a specificity of 99.98% and an MCC of 0.99. Because training and testing on data from the same pool may lead to overfitting, the model was further evaluated independently on 79 real cytokine–receptor interactions and 2360 generated negative data in rat, achieving a sensitivity of 97.4%, a specificity of 99.2% and an MCC of 0.89 (Table 2). This performance is comparable to that of other computational approaches for protein–protein interactions.
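The grouping and enrichment loop can be outlined as below. This is a sketch under stated assumptions, not the authors' implementation: the SVM trainer is left as an injected callable (`select_support_vectors`, a hypothetical name) that returns the negatives ending up as support vectors, the held-out test group is omitted, and the stopping rule simply checks the negative-to-positive ratio.

```python
import random

def split_into_groups(items, n_groups, seed=0):
    """Shuffle items and deal them into n_groups groups of near-equal size."""
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    return [shuffled[i::n_groups] for i in range(n_groups)]

def enrich_negatives(positives, negatives, select_support_vectors,
                     group_size=4200, target_ratio=3.0, max_rounds=20, seed=0):
    """Shrink the negative pool until len(pool) <= target_ratio * len(positives).

    select_support_vectors(positives, group) must return the members of
    `group` that are negative support vectors of a model trained on
    positives + group (assumed interface).
    """
    pool = list(negatives)
    for round_no in range(max_rounds):
        if len(pool) <= target_ratio * len(positives):
            break
        n_groups = max(1, -(-len(pool) // group_size))  # ceiling division
        new_pool = []
        for group in split_into_groups(pool, n_groups, seed + round_no):
            new_pool.extend(select_support_vectors(positives, group))
        if len(new_pool) >= len(pool):  # no further reduction is possible
            break
        pool = new_pool
    return pool
```

Any SVM library that exposes the support vectors of a trained model could be wrapped to provide `select_support_vectors`; the final model would then be trained on the positives plus the returned pool.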
THE ACCESS OF SERVER AND DATABASE
The descriptions of server
The web-based server built on the optimized CytoSVM model can be freely accessed at http://bioinf.xmu.edu.cn/software/cytosvm/PredictReceptor.php (Figure 1). The server runs in a Linux environment and accepts queries through a dynamic interface written in PHP. The default input is the primary protein sequence of a putative receptor/cytokine in standard FASTA format or raw data format. The server is case-insensitive; wildcard characters such as * and -, as well as non-amino-acid characters, are removed from the sequence automatically. Prediction by protein name is provided as an option. To start a prediction, the user must also choose a cytokine/receptor or a cytokine/receptor family. The output is a list of cytokines/receptors predicted to interact with the query sequence, together with their probabilities. Clicking on the name of a cytokine/receptor leads to a detailed information page, where the user can find links to search for other putative receptors interacting with this cytokine in the human or mouse genome.
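The input handling described above can be approximated with a short Python function. It is an illustration only: the server itself is written in PHP, and restricting the alphabet to the 20 standard residues is an assumption about what counts as a non-amino-acid character.

```python
STANDARD_RESIDUES = set("ACDEFGHIKLMNPQRSTVWY")

def clean_query(text):
    """Return the upper-case residue string from FASTA or raw sequence input.

    FASTA header lines (starting with '>') are skipped; wildcard characters
    such as '*' and '-', digits, whitespace and any other character outside
    the 20 standard residues are dropped.
    """
    residues = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(">"):
            continue
        residues.extend(ch for ch in line.upper() if ch in STANDARD_RESIDUES)
    return "".join(residues)
```

For example, a FASTA record and the same residues pasted as raw lower-case text both reduce to one clean upper-case string before feature extraction.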
The descriptions of database
In this study, the well-optimized CytoSVM model was also applied to screen for putative cytokine–receptor interactions in the whole genomes of human and mouse. After further transmembrane analysis, 1609 novel cytokine-receptor pairs with probability >80% (3346 pairs with probability >50%), covering 220 novel receptors (excluding their isoforms) for 126 human cytokines, were identified in the human genome (http://bioinf.xmu.edu.cn/software/cytosvm/statistics.php). The predicted results were deposited in a database at http://bioinf.xmu.edu.cn/software/cytosvm/BrowseSearch.php. The database runs on a Linux/Apache/PHP platform and is managed by the Oracle 9i RDBMS, which supports multiple simultaneous accesses. The user can search for the putative receptors of a given cytokine by selecting it from the cytokine classification list (Figure 2). Quick search by keyword is also supported for finding putative interactions of cytokines or receptors. Each search returns only interactions with probability >50%. Clicking on the name of a cytokine or receptor leads to a detailed information page, where the general properties of the interacting partners are shown. Statistics on putative cytokine-receptor pairs in the human genome and help documents are also provided to aid use of the database and server.
CONCLUSION
In conclusion, a web-based enriched-SVM model, CytoSVM, was successfully constructed in this study to predict putative cytokine–receptor interactions. As a complement to homology-based methods and other computational approaches to protein–protein interaction prediction, CytoSVM is able to functionally annotate proteins that have poor sequence similarity to known proteins. The application of CytoSVM to the genome-scale discovery of novel cytokine–receptor interactions broadens the understanding of cytokines' physiological activities at the systems level. These predicted interactants make it possible to identify novel cytokine-involved cellular processes. Furthermore, they may prompt the identification of new therapeutic targets for the treatment of various diseases. It is thus expected that the clues provided by this study will guide experimental verification in the future.
Acknowledgments
The support from the National Natural Science Foundation of China (#30400573) and the Program for New Century Excellent Talents in University (NCET) of MOE and Xiamen University is gratefully acknowledged. Funding to pay the Open Access publication charges for this article was provided by NCET 2006 to ZL Ji.
Conflict of interest statement. None declared.
Footnotes
The authors wish it to be known that, in their opinion, the
second and third authors contributed equally to this work.
REFERENCES
1. Oppenheim JJ. Cytokines: past, present, and future. Int. J. Hematol (2001) 74:3–8.
2. Altermann E, Klaenhammer TR. PathwayVoyager: pathway mapping using the Kyoto Encyclopedia of Genes and Genomes (KEGG) database. BMC Genomics (2005) 6:60–66.
3. Pazos F, Ranea JA, Juan D, Sternberg MJ. Assessing protein co-evolution in the context of the tree of life assists in the prediction of the interactome. J. Mol. Biol (2005) 352:1002–1015.
4. Goh CS, Bogan AA, Joachimiak M, Walther D, Cohen FE. Co-evolution of proteins with their interaction partners. J. Mol. Biol (2000) 299:283–293.
5. Fujiwara Y, Asogawa M. Protein function prediction using hidden Markov models and neural networks. NEC. Res. Dev (2002) 43:238–241.
6. Di Gennaro JA, Siew N, Hoffman BT, Zhang L, Skolnick J, Neilson LI, Fetrow JS. Enhanced functional annotation of protein sequences via the use of structural descriptors. J. Struct. Biol (2001) 134:232–245.
7. Teichmann SA, Murzin AG, Chothia C. Determination of protein function, evolution and interactions by structural genomics. Curr Opin Struct Biol (2001) 11:354–363.
8. Chen R, Weng Z. Docking unbound proteins using shape complementarity, desolvation, and electrostatics. Proteins (2002) 47:281–294.
9. Enright AJ, Iliopoulos I, Kyrpides NC, Ouzounis CA. Protein interaction maps for complete genomes based on gene fusion events. Nature (1999) 402:86–90.
10. Marcotte EM, Pellegrini M, Ng HL, Rice DW, Yeates TO, Eisenberg D. Detecting protein function and protein-protein interactions from genome sequences. Science (1999) 285:751–753.
11. Hodges HC, Tsai JW. 3D-Motifs: an informatics approach to protein function prediction. FASEB. J (2002) 16:A543–A543.
12. Enright AJ, Van Dongen S, Ouzounis CA. An efficient algorithm for large-scale detection of protein families. Nucleic Acids Res (2002) 30:1575–1584.
13. Cai CZ, Han LY, Ji ZL, Chen X, Chen YZ. SVM-Prot: web-based support vector machine software for functional classification of a protein from its primary sequence. Nucleic Acids Res (2003) 31:3692–3697.
14. Huang N, Chen H, Sun Z. CTKPred: an SVM-based method for the prediction and classification of the cytokine superfamily. Protein Eng. Des. Sel (2005) 18:365–368.
15. Karchin R, Karplus K, Haussler D. Classifying G-protein coupled receptors with support vector machines. Bioinformatics (2002) 18:147–159.
16. Ding CH, Dubchak I. Multi-class protein fold recognition using support vector machines and neural networks. Bioinformatics (2001) 17:349–358.
17. Yuan Z, Burrage K, Mattick JS. Prediction of protein solvent accessibility using support vector machines. Proteins (2002) 48:566–570.
18. Hua S, Sun Z. A novel method of protein secondary structure prediction with high segment overlap measure: support vector machine approach. J. Mol. Biol (2001) 308:397–407.
19. Bock JR, Gough DA. Predicting protein–protein interactions from primary structure. Bioinformatics (2001) 17:455–460.
20. Bateman A, Coin L, Durbin R, Finn RD, Hollich V, Griffiths-Jones S, Khanna A, Marshall M, Moxon S, et al. The Pfam protein families database. Nucleic Acids Res (2004) 32:D138–D141.
21. Burges C. A tutorial on support vector machines for pattern recognition. Data Min. Knowl. Disc (1998) 2:121–167.
22. Vapnik VN. The Nature of Statistical Learning Theory (1995) New York: Springer.
23. Platt JC. Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. In: Advances in Large Margin Classifiers. Smola AJ, Bartlett PL, Schölkopf B, Schuurmans D, eds. (2000) MIT Press, Cambridge, MA. 61–74.
24. Baldi P, Brunak S, Chauvin Y, Andersen CA, Nielsen H. Assessing the accuracy of prediction algorithms for classification: an overview. Bioinformatics (2000) 16:412–424.
25. Hanley JA, McNeil BJ. The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology (1982) 143:29–36.
