


Nucleic Acids Research Advance Access published online on May 25, 2007

Nucleic Acids Research, doi:10.1093/nar/gkm254

© 2007 The Author(s)
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.


Web Server Issue

CytoSVM: an advanced server for identification of cytokine-receptor interactions

Jin-Rui Xu1, Jing-Xian Zhang1, Bu-Cong Han1, Liang Liang1 and Zhi-Liang Ji1,2,*

1 Key Laboratory for Cell Biology & Tumor Cell Engineering, the Ministry of Education of China, School of Life Sciences and 2 The Key Laboratory for Chemical Biology of Fujian Province, Xiamen University, Xiamen 361005, Fujian Province, P. R. China

*To whom correspondence should be addressed. Tel: 86-0592-2182897; Fax: 86-0592-2181015; Email: appo{at}bioinf.xmu.edu.cn; zhiliang.ji{at}gmail.com;

Received January 23, 2007. Revised March 24, 2007. Accepted April 8, 2007.


    ABSTRACT
 
The interactions between cytokines and their complementary receptors are the gateway to understanding a large variety of cytokine-specific cellular activities such as immunological responses and cell differentiation. To discover novel cytokine-receptor interactions, an advanced support vector machine (SVM) model, CytoSVM, was constructed in this study. The model was trained iteratively, in an enriched way, on 449 mammalian (excluding rat) cytokine-receptor interactions and about 1 million virtually generated positive and negative vectors. A final independent evaluation on rat data gave a sensitivity of 97.4%, a specificity of 99.2% and a Matthews correlation coefficient (MCC) of 0.89. This performance is better than that of conventional SVM-based models. Based on this optimized model, a web server was created that accepts a primary protein sequence and reports its probabilities of interacting with one or several cytokines. The model was also applied to identify putative cytokine-receptor pairs in the whole genomes of human and mouse. Excluding currently known cytokine-receptor interactions, a total of 1609 novel cytokine-receptor pairs with probability >80% were discovered in the human genome after further transmembrane analysis. These cover 220 novel receptors (excluding their isoforms) for 126 human cytokines. The screening results have been deposited in a database. Both the server and the database can be freely accessed at http://bioinf.xmu.edu.cn/software/cytosvm/cytosvm.php.


    INTRODUCTION
 
The binding of cytokines to their receptors on cell membranes triggers cellular activities such as immunological regulation, cell growth, differentiation, apoptosis and migration in vertebrates (1). Characterization of novel cytokine-receptor pairs is therefore a shortcut to understanding these cytokine-mediated signalling pathways.

Traditional isolation and characterization methods for identifying cytokine-receptor pairs are significantly limited by the short half-lives, low plasma concentrations, pleiotropy and redundancy of cytokines. Modern molecular technologies such as cloning have improved the situation. Furthermore, as a complement to experimental approaches, searches for new cytokines or their receptors are now often conducted by identifying genes highly homologous to known cytokine/receptor genes. Currently, 203 human cytokine-receptor pairs have been characterized, as presented in the KEGG pathway database (2). Unfortunately, it has become increasingly difficult to discover new cytokine-receptor partners unless new sequence features are identified. In particular, the functions of peptides without significant sequence similarity to known cytokines/receptors are difficult to probe with homology-based or clustering methods.

Various alternative methods for describing protein interactions have been developed in recent years. These include evolutionary analysis (3,4), hidden Markov models (5), structural considerations (6–8), protein/gene fusion (9,10), motif recognition (11), family classification by sequence clustering (12) and functional family prediction by statistical learning methods (13,14). The support vector machine (SVM) is a two-class classifier that has previously been used in the classification of cytokine families (http://www.bioinfo.tsinghua.edu.cn/%7Ehn/CTKPred/index.html) (14). In this study, we constructed an improved SVM model, CytoSVM, for the identification of cytokine-receptor interactions on the basis of protein primary sequences. This model was further applied to screen the whole genomes of human and mouse for novel cytokine-receptor pairs.


    CONSTRUCTION OF CytoSVM MODEL
 
CytoSVM is a model based on the statistical learning algorithm SVM. This algorithm has been well studied and applied to a variety of protein classification problems, including protein functional class (13,15), fold recognition (16), analysis of solvent accessibility (17), prediction of secondary structure (18) and protein–protein interactions (19). Because it classifies proteins on the basis of sequence-derived physicochemical properties, SVM may be particularly useful for the functional classification of distantly related proteins and of homologous proteins with different functions (13). This feature makes SVM a potentially attractive method for probing novel cytokine receptors, especially when the sequence diversity of cytokine receptors cannot be handled properly by sequence homology-based approaches.

The data sets
The positive data pool
The positive data (the true cytokine-receptor interactions) were collected from the KEGG pathway database (2) and the literature. These interaction pairs cover 449 distinct known cytokine-receptor interactions in mammals other than rat. For model construction, every sequence was represented by a feature vector assembled from encoded representations of tabulated residue properties, including amino acid composition, hydrophobicity, normalized van der Waals volume, polarity, polarizability, charge, surface tension, secondary structure and solvent accessibility for each residue in the sequence (13,15–19). A positive vector for an interaction pair was formed by joining the vectors of the cytokine and its complementary receptor. To enlarge the positive data pool, four virtual vectors were generated around each positive vector by slightly (by about 1/1000) increasing or decreasing the values of the vector elements in the multi-dimensional space. As a result, a total of 2243 positive data (449 true positives and 1794 virtual positives) were prepared for model training.
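As a rough illustration of the vector construction described above, the sketch below encodes each sequence by amino acid composition only (the other listed descriptors are omitted), joins the cytokine and receptor vectors, and generates four virtual neighbours by scaling the elements by about one part in a thousand. The function names, the four up/down patterns and the toy sequences are assumptions made for illustration; the text does not specify how the four perturbations were chosen.

```python
# Illustrative only: composition features stand in for the full descriptor set.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition(seq):
    """Fraction of each of the 20 standard amino acids in seq."""
    seq = seq.upper()
    return [seq.count(a) / len(seq) for a in AMINO_ACIDS]


def pair_vector(cytokine_seq, receptor_seq):
    """Join the cytokine and receptor vectors into one interaction vector."""
    return composition(cytokine_seq) + composition(receptor_seq)


def virtual_vectors(vec, rel_step=1e-3):
    """Four neighbours of vec, each half scaled up or down by rel_step.

    The four (cytokine half, receptor half) patterns are an assumed scheme.
    """
    half = len(vec) // 2
    up, down = 1 + rel_step, 1 - rel_step
    patterns = [(up, up), (down, down), (up, down), (down, up)]
    return [
        [x * a for x in vec[:half]] + [x * b for x in vec[half:]]
        for a, b in patterns
    ]


# Hypothetical toy sequences, not real cytokine or receptor sequences.
positive = pair_vector("MKTLLLAVAG", "MGSWACDEFHIK")
neighbours = virtual_vectors(positive)
print(len(positive), len(neighbours))  # 40 4
```

Multiplicative scaling keeps zero-valued elements at zero, so the virtual vectors stay close to the real one in every dimension.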

The negative data pool
The negative data pool includes both true and virtual data. The true negatives are 126 literature-reported non-cytokine–protein interactions, which cover only a very limited part of the sequence and structural features of non-cytokine–receptor interactions. To cover all possible negative conditions, a large number of virtual negative interaction pairs were generated as follows: 7816 seed sequences representing diverse domain families, excluding those containing any known cytokine or cytokine receptor, were extracted from the Pfam protein families database (20). These Pfam seeds were paired with the mammalian cytokines in all possible combinations to form the virtual negative interactions. These negative interaction pairs were converted from sequences to vectors in the same way as described above. In total, about 1 million negative data were placed in the negative data pool.
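The all-combinations pairing of cytokines with retained Pfam seeds can be sketched as a Cartesian product. The identifiers and the exclusion set below are hypothetical placeholders, and the function name is an assumption rather than part of the CytoSVM code.

```python
from itertools import product


def virtual_negative_pairs(cytokines, pfam_seeds, excluded_ids):
    """Pair every cytokine with every Pfam seed that is not excluded."""
    kept = [seed for seed in pfam_seeds if seed not in excluded_ids]
    return list(product(cytokines, kept))


# Hypothetical identifiers used purely for illustration.
cytokines = ["cytokine_A", "cytokine_B"]
pfam_seeds = ["seed_1", "seed_2", "seed_3", "seed_4"]
excluded = {"seed_3"}  # e.g. a family containing a known cytokine or receptor

pairs = virtual_negative_pairs(cytokines, pfam_seeds, excluded)
print(len(pairs))  # 2 cytokines x 3 retained seeds = 6
```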

The SVM algorithm
The theory of SVM has been well described in the literature (21,22). The structural and physicochemical features of a protein interaction are represented by a feature vector derived from the primary sequences as described above. The vector is projected into a hyperspace in which a hyperplane is used to classify the protein interaction pair as either positive (cytokine–receptor interaction) or negative (non-cytokine–receptor interaction), depending on which side of the hyperplane the vector lies. In this study, an RBF kernel function K(xi, xj) was adopted to map the input vector into a high-dimensional feature space:

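\[ K(x_i, x_j) = \exp\left(-\frac{\lVert x_i - x_j \rVert^{2}}{2\sigma^{2}}\right) \tag{1} \]

where \sigma is the kernel width parameter.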
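For concreteness, a minimal sketch of an RBF kernel in its common Gaussian form is given here. The width parameter sigma and its default value are assumptions; the parameterization and value used for CytoSVM are not stated in the text.

```python
import math


def rbf_kernel(x_i, x_j, sigma=1.0):
    """Gaussian RBF kernel: exp(-||x_i - x_j||^2 / (2 * sigma^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x_i, x_j))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


print(rbf_kernel([0.0, 0.0], [0.0, 0.0]))  # identical vectors give 1.0
print(rbf_kernel([0.0, 0.0], [3.0, 4.0], sigma=5.0))  # exp(-25/50) = exp(-0.5)
```

The kernel value falls from 1 towards 0 as two interaction vectors move apart, so nearby vectors influence each other's classification most strongly.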
The output of the SVM model is the predicted class of the input, which is associated with a posterior probability by fitting a sigmoid (23):

\[ P(y = 1 \mid x) = \frac{1}{1 + \exp\left(A f(x) + B\right)} \tag{2} \]
where f(x) is the output of the SVM, and the parameters A and B are estimated by minimizing the negative log likelihood of the training data. A higher probability indicates higher confidence in a positive prediction.
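The sigmoid mapping from SVM output to probability, and the objective from which A and B are estimated, can be sketched as follows. This follows the general form of Platt's method (23) but uses plain 0/1 labels rather than Platt's regularized targets; the parameter values and decision values are hypothetical.

```python
import math


def platt_probability(f, A, B):
    """Map an SVM decision value f to P(y = 1 | f) = 1 / (1 + exp(A*f + B))."""
    return 1.0 / (1.0 + math.exp(A * f + B))


def negative_log_likelihood(decision_values, labels, A, B):
    """Objective minimised over A and B; labels are 1 (positive) or 0 (negative)."""
    nll = 0.0
    for f, y in zip(decision_values, labels):
        p = platt_probability(f, A, B)
        nll -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return nll


# Hypothetical parameters; A is negative so that larger f means higher probability.
A, B = -2.0, 0.0
print(platt_probability(0.0, A, B))  # 0.5 on the hyperplane when B = 0
```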

Evaluation and performance measure
As with all discriminative methods (24), the performance of SVM classification can be measured by the numbers of true positives (TP), true negatives (TN), false positives (FP) and false negatives (FN); by the sensitivity SE = TP/(TP + FN), which is the accuracy of predicting cytokine–receptor interactions; and by the specificity SP = TN/(TN + FP), which is the accuracy of predicting non-interactions. The overall performance of the model can be measured using both the Matthews correlation coefficient (MCC):

\[ \mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FN)(TP + FP)(TN + FP)(TN + FN)}} \tag{3} \]
and a receiver operating characteristic (ROC) plot (25). A ROC plot shows the true positive rate against the false positive rate across the possible decision thresholds of a model. The area under the ROC curve (AUC) is usually adopted as a scalar measure that gauges one facet of performance (25). In this study, the ROC plot (see http://bioinf.xmu.edu.cn/software/cytosvm/help.htm#roccurve) and the AUC were used to compare the performance of different SVM models (Table 1). The enriched-SVM model with virtual positives (model M1) performed best and was therefore chosen to classify the cytokine–receptor interactions.
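The measures defined above can be computed from confusion-matrix counts and prediction scores as in the following sketch. The counts and scores are hypothetical and are not taken from Table 1 or Table 2; the AUC is computed by its rank interpretation (the probability that a randomly chosen positive scores higher than a randomly chosen negative), which is one standard way to obtain it.

```python
import math


def sensitivity(tp, fn):
    return tp / (tp + fn)


def specificity(tn, fp):
    return tn / (tn + fp)


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient; returns 0.0 if the denominator is zero."""
    denom = math.sqrt((tp + fn) * (tp + fp) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


def auc(pos_scores, neg_scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


# Hypothetical counts and scores for illustration only.
tp, fn, tn, fp = 90, 10, 80, 20
print(sensitivity(tp, fn), specificity(tn, fp))  # 0.9 0.8
print(mcc(tp, tn, fp, fn))
print(auc([0.9, 0.8, 0.4], [0.7, 0.3, 0.2]))
```

Unlike sensitivity and specificity taken alone, the MCC uses all four counts, which matters when negatives greatly outnumber positives.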



 
Table 1. The descriptions of different SVM models

 
The enriched-SVM model
Model construction used all 2243 real and virtual positive vectors in the positive data pool and the roughly 1 million negative vectors in the negative data pool. To represent the full range of negative sequence and structural features while reducing the severe imbalance between positive and negative data, the vectors in the negative data pool were randomly divided into 230 groups of about 4200 negative vectors each. Each of these 230 negative groups was combined with the 2243 positive vectors, giving 230 data sets for model construction. Of these, 229 were used for independent training runs and the remaining one was held out for testing.
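The grouping scheme can be sketched as a random partition of the negative pool, with each group paired with the full positive set. Pool sizes here are tiny stand-ins for the 2243 positives, about 1 million negatives and 230 groups reported above; the function name and the fixed random seed are assumptions.

```python
import random


def build_data_sets(positives, negatives, n_groups, seed=0):
    """Shuffle the negatives, split them into n_groups, and pair each group
    with all positives to form one data set per group."""
    rng = random.Random(seed)
    shuffled = list(negatives)
    rng.shuffle(shuffled)
    groups = [shuffled[i::n_groups] for i in range(n_groups)]
    return [(list(positives), group) for group in groups]


# Small stand-in pools for illustration.
positives = [("pos", i) for i in range(5)]
negatives = [("neg", i) for i in range(23)]
data_sets = build_data_sets(positives, negatives, n_groups=4)

# All but one data set are used for training; the last is held out for testing.
training_sets, test_set = data_sets[:-1], data_sets[-1]
print(len(training_sets), len(test_set[1]))
```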

The SVM model was initialized by 229 independent training runs and then optimized through several rounds of training in an enriched way. The negative support vectors (vectors close to the hyperplane on the negative side) that define the hyperplanes of the 229 independent models were extracted to form a new negative data pool. This pool was again divided into groups for the next round of learning. The iterative learning process, or enriched selection of support vectors, was continued in search of the globally optimal separating hyperplane (OSH) until the positive and negative data approached a near balance, with a ratio of about 1:3 in this case. The optimized model was first tested on the held-out data set to assess its theoretical performance, achieving a sensitivity of 100%, a specificity of 99.98% and an MCC of 0.99. Because repeated training on the same data pool can cause overfitting, the model was further evaluated independently on 79 real cytokine–receptor interactions and 2360 generated negative data from rat, achieving a sensitivity of 97.4%, a specificity of 99.2% and an MCC of 0.89 (Table 2). Such performance is comparable to that of other computational approaches to predicting protein–protein interactions.
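The control flow of the enrichment loop can be sketched as below. The selector passed in is a placeholder: in practice it would train an SVM on each data set and return the negatives that end up as support vectors, whereas the stand-in used here simply keeps the half of each group nearest to the mean of one-dimensional positives. All names, the group size and the toy data are assumptions.

```python
def enrich(positives, negatives, group_size, select_support_vectors,
           target_ratio=3.0, max_rounds=20):
    """Iteratively shrink the negative pool to boundary-defining vectors.

    select_support_vectors(positives, group) must return the negatives in
    `group` that act as support vectors for a model trained on that data set.
    The loop stops once negatives are at most target_ratio times positives.
    """
    pool = list(negatives)
    for _ in range(max_rounds):
        if len(pool) <= target_ratio * len(positives):
            break
        groups = [pool[i:i + group_size] for i in range(0, len(pool), group_size)]
        kept = []
        for group in groups:
            kept.extend(select_support_vectors(positives, group))
        if len(kept) >= len(pool):  # no further reduction possible
            break
        pool = kept
    return pool


def nearest_half(positives, group):
    """Stand-in for SVM training: keep the negatives closest to the positives."""
    centre = sum(positives) / len(positives)
    ranked = sorted(group, key=lambda v: abs(v - centre))
    return ranked[:max(1, len(ranked) // 2)]


# Toy one-dimensional data: positives near 1, negatives below 0.
positives = [1.0, 1.1, 0.9, 1.2]
negatives = [-i * 0.1 for i in range(1, 65)]
enriched = enrich(positives, negatives, group_size=8,
                  select_support_vectors=nearest_half)
print(len(enriched))  # 8 negatives remain for 4 positives
```

Passing the selector as a function keeps the loop independent of any particular SVM implementation.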



 
Table 2. The evaluation of CytoSVM model

 

    THE ACCESS OF SERVER AND DATABASE
 
The descriptions of server
The web-based server built on the optimized CytoSVM model can be freely accessed at http://bioinf.xmu.edu.cn/software/cytosvm/PredictReceptor.php (Figure 1). The server runs under Linux and accepts queries through a PHP-coded dynamic interface. The default input is the primary protein sequence of a putative receptor or cytokine in standard FASTA format or as raw sequence. Input is case insensitive; wildcard characters such as ‘*’ and ‘-’ and non-amino acid characters are removed from the sequence automatically. Prediction by protein name is provided as an option. To start a prediction, the user must also choose a cytokine/receptor or a cytokine/receptor family. The output is a list of the cytokines/receptors predicted to interact with the query sequence, each with its probability. Clicking on the name of a cytokine/receptor leads to a detailed information page with links to search for other putative receptors interacting with this cytokine in the human or mouse genome.
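The input handling described above might look like the following sketch, which accepts FASTA or raw input, normalizes case and drops every character outside the 20 standard amino acid letters. This is not the server's PHP code; the function name is hypothetical, and treating ambiguity codes such as X or B as non-amino acid characters is an assumption.

```python
AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def clean_query(text):
    """Return (header, sequence) for the first record of FASTA or raw input.

    Case is normalised and any character that is not a standard amino acid
    letter (including '*' and '-') is removed.
    """
    header = None
    residues = []
    for line in text.strip().splitlines():
        line = line.strip()
        if line.startswith(">"):
            if header is not None or residues:
                break  # stop at the start of a second record
            header = line[1:].strip()
            continue
        residues.extend(ch for ch in line.upper() if ch in AMINO_ACIDS)
    return header, "".join(residues)


print(clean_query(">query1 hypothetical protein\nmkt-LLa*\nGG12 x"))
# ('query1 hypothetical protein', 'MKTLLAGG')
```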


 
Figure 1. The interface of CytoSVM server.

 
The descriptions of database
In this study, the optimized CytoSVM model was also applied to screen for putative cytokine–receptor interactions in the whole genomes of human and mouse. After further transmembrane analysis, 1609 novel cytokine-receptor pairs with probability >80% (3346 pairs with probability >50%), covering 220 novel receptors (excluding their isoforms) for 126 human cytokines, were identified in the human genome (http://bioinf.xmu.edu.cn/software/cytosvm/statistics.php). The predicted results were deposited in a database at http://bioinf.xmu.edu.cn/software/cytosvm/BrowseSearch.php. The database runs on a Linux/Apache/PHP platform and is managed by an Oracle 9i RDBMS, which supports multiple simultaneous accesses. Users can search for the putative receptors of a given cytokine by selecting it from the cytokine classification list (Figure 2). Quick keyword search is also supported for finding putative interactions of cytokines or receptors. Each search returns only interactions with a probability >50%. Clicking on the name of a cytokine or receptor opens a detailed information page showing the general properties of the interacting partners. Statistics on putative cytokine-receptor pairs in the human genome and help documents are also provided to aid access to the database and server.
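The probability cut-offs mentioned above (>50% for search results, >80% for the high-confidence set) amount to a simple threshold filter, sketched here on invented rows. The row layout, names and probabilities are hypothetical and do not reflect the actual database schema.

```python
def filter_predictions(predictions, threshold=0.5):
    """Keep (cytokine, protein, probability) rows above threshold, best first."""
    kept = [row for row in predictions if row[2] > threshold]
    return sorted(kept, key=lambda row: row[2], reverse=True)


# Hypothetical rows; names and probabilities are invented for illustration.
rows = [
    ("cytokine_A", "protein_1", 0.93),
    ("cytokine_A", "protein_2", 0.41),
    ("cytokine_B", "protein_3", 0.67),
]
reported = filter_predictions(rows)                       # probability > 50%
high_confidence = filter_predictions(rows, threshold=0.8)  # probability > 80%
print(len(reported), len(high_confidence))  # 2 1
```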


 
Figure 2. The interface of CytoSVM database.

 

    CONCLUSION
 
In conclusion, a web-based enriched-SVM model, CytoSVM, was constructed in this study to predict putative cytokine–receptor interactions. As a complement to homology-based methods and other computational approaches to protein–protein interaction prediction, CytoSVM can functionally annotate proteins that have poor sequence similarity to known proteins. Applying CytoSVM to the genome-scale discovery of novel cytokine–receptor interactions broadens the understanding of the physiological activities of cytokines at the systems level. These predicted interactants make it possible to identify novel cytokine-involved cellular processes and may also prompt the identification of new therapeutic targets for the treatment of various diseases. We expect that the clues provided by this study will guide experimental verification in the future.


    Acknowledgments
 
Support from the National Natural Science Foundation of China (#30400573) and the Program for New Century Excellent Talents in University (NCET) of MOE and Xiamen University is gratefully acknowledged. Funding to pay the Open Access publication charges for this article was provided by NCET 2006 to ZL Ji.

Conflict of interest statement. None declared.


    Footnotes
 
The authors wish it to be known that, in their opinion, the second and third authors contributed equally to this work.


    REFERENCES
 

  1. Oppenheim JJ. Cytokines: past, present, and future. Int. J. Hematol. (2001) 74: 3–8.

  2. Altermann E, Klaenhammer TR. PathwayVoyager: pathway mapping using the Kyoto Encyclopedia of Genes and Genomes (KEGG) database. BMC Genomics (2005) 6: 60–66.

  3. Pazos F, Ranea JA, Juan D, Sternberg MJ. Assessing protein co-evolution in the context of the tree of life assists in the prediction of the interactome. J. Mol. Biol. (2005) 352: 1002–1015.

  4. Goh CS, Bogan AA, Joachimiak M, Walther D, Cohen FE. Co-evolution of proteins with their interaction partners. J. Mol. Biol. (2000) 299: 283–293.

  5. Fujiwara Y, Asogawa M. Protein function prediction using hidden Markov models and neural networks. NEC Res. Dev. (2002) 43: 238–241.

  6. Di Gennaro JA, Siew N, Hoffman BT, Zhang L, Skolnick J, Neilson LI, Fetrow JS. Enhanced functional annotation of protein sequences via the use of structural descriptors. J. Struct. Biol. (2001) 134: 232–245.

  7. Teichmann SA, Murzin AG, Chothia C. Determination of protein function, evolution and interactions by structural genomics. Curr. Opin. Struct. Biol. (2001) 11: 354–363.

  8. Chen R, Weng Z. Docking unbound proteins using shape complementarity, desolvation, and electrostatics. Proteins (2002) 47: 281–294.

  9. Enright AJ, Iliopoulos I, Kyrpides NC, Ouzounis CA. Protein interaction maps for complete genomes based on gene fusion events. Nature (1999) 402: 86–90.

  10. Marcotte EM, Pellegrini M, Ng HL, Rice DW, Yeates TO, Eisenberg D. Detecting protein function and protein-protein interactions from genome sequences. Science (1999) 285: 751–753.

  11. Hodges HC, Tsai JW. 3D-Motifs: an informatics approach to protein function prediction. FASEB J. (2002) 16: A543–A543.

  12. Enright AJ, Van Dongen S, Ouzounis CA. An efficient algorithm for large-scale detection of protein families. Nucleic Acids Res. (2002) 30: 1575–1584.

  13. Cai CZ, Han LY, Ji ZL, Chen X, Chen YZ. SVM-Prot: web-based support vector machine software for functional classification of a protein from its primary sequence. Nucleic Acids Res. (2003) 31: 3692–3697.

  14. Huang N, Chen H, Sun Z. CTKPred: an SVM-based method for the prediction and classification of the cytokine superfamily. Protein Eng. Des. Sel. (2005) 18: 365–368.

  15. Karchin R, Karplus K, Haussler D. Classifying G-protein coupled receptors with support vector machines. Bioinformatics (2002) 18: 147–159.

  16. Ding CH, Dubchak I. Multi-class protein fold recognition using support vector machines and neural networks. Bioinformatics (2001) 17: 349–358.

  17. Yuan Z, Burrage K, Mattick JS. Prediction of protein solvent accessibility using support vector machines. Proteins (2002) 48: 566–570.

  18. Hua S, Sun Z. A novel method of protein secondary structure prediction with high segment overlap measure: support vector machine approach. J. Mol. Biol. (2001) 308: 397–407.

  19. Bock JR, Gough DA. Predicting protein-protein interactions from primary structure. Bioinformatics (2001) 17: 455–460.

  20. Bateman A, Coin L, Durbin R, Finn RD, Hollich V, Griffiths-Jones S, Khanna A, Marshall M, Moxon S, et al. The Pfam protein families database. Nucleic Acids Res. (2004) 32: D138–D141.

  21. Burges C. A tutorial on support vector machines for pattern recognition. Data Min. Knowl. Disc. (1998) 2: 121–167.

  22. Vapnik VN. The Nature of Statistical Learning Theory. (1995) New York: Springer.

  23. Platt JC. Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. In: Smola AJ, Bartlett PL, Schölkopf B, Schuurmans D, eds. Advances in Large Margin Classifiers. (2000) MIT Press, Cambridge, MA. 61–74.

  24. Baldi P, Brunak S, Chauvin Y, Andersen CA, Nielsen H. Assessing the accuracy of prediction algorithms for classification: an overview. Bioinformatics (2000) 16: 412–424.

  25. Hanley JA, McNeil BJ. The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology (1982) 143: 29–36.

