
Nucleic Acids Research, 2003, Vol. 31, No. 13 3692-3697
© 2003 Oxford University Press

SVM-Prot: web-based support vector machine software for functional classification of a protein from its primary sequence

C.Z. Cai(1,2), L.Y. Han(1), Z.L. Ji(1), X. Chen(1) and Y.Z. Chen(1,*)

(1) Department of Computational Science, National University of Singapore, Blk SOC1, Level 7, 3 Science Drive 2, Singapore 117543, Singapore
(2) Department of Applied Physics, Chongqing University, Chongqing 400044, PR China

*To whom correspondence should be addressed. Tel: +65 68746877; Fax: +65 67746756; Email: yzchen{at}cz3.nus.edu.sg

Received February 14, 2003; Revised March 19, 2003. Accepted April 2, 2003


    ABSTRACT
 
Prediction of protein function is of significance in studying biological processes. One approach to function prediction is to classify a protein into a functional family. Support vector machine (SVM) is a useful method for such classification, which may involve proteins with diverse sequence distribution. We have developed web-based software, SVMProt, for SVM classification of a protein into a functional family from its primary sequence. The SVMProt classification system is trained from representative proteins of a number of functional families and seed proteins of Pfam curated protein families. It currently covers 54 functional families and additional families will be added in the near future. The computed accuracy for protein family classification is in the range of 69.1–99.6%. SVMProt shows a certain degree of capability for the classification of distantly related proteins and homologous proteins of different function and thus may be used as a protein function prediction tool that complements sequence alignment methods. SVMProt can be accessed at http://jing.cz3.nus.edu.sg/cgi-bin/svmprot.cgi.


    INTRODUCTION
 
Knowledge about protein function is essential for understanding biological processes (1,2). As the gap between the amount of sequence information and functional characterization widens, increasing efforts are being directed at the development of computational tools for protein function prediction (2–5). Various methods have been developed, including sequence similarity (6–8), evolutionary analysis (9,10), structure-based approaches (11), protein/gene fusion (12,13), protein interaction (14,15) and family classification by sequence clustering (16,17).

In the absence of clear sequence or structural similarities, the criteria for comparison of distantly-related proteins become increasingly difficult to formulate (17). Moreover, not all homologous proteins have analogous functions (9). The presence of a shared domain within a group of proteins does not necessarily imply that these proteins perform the same function (18). Many proteins sharing promiscuous domains (e.g. SH2, WD40, DnaJ) are known to have very different functions (12). These problems often hinder some of the clustering-based methods (16). In addition to the development of algorithms to overcome these problems (16), different approaches that combine or complement existing methods are being explored (3,9,17,19).

It is of interest to consider protein functional family classification as a method for facilitating protein function prediction. It is expected to be particularly useful in the cases described above and may thus be used as a protein function prediction tool to complement sequence alignment methods. Functional families of various proteins have been documented (20–23). A method for the classification of proteins with diverse sequence distribution is also available: support vector machines (SVM) (24), a statistical learning method, have recently been used for classification of G-protein coupled receptors (25) and DNA-binding proteins (26). SVM has also been employed in a number of other protein studies including protein–protein interaction prediction (15), fold recognition (27), solvent accessibility (28) and structure prediction (29,30). The prediction accuracy ranges from 65 to 91.4% in these studies. Thus SVM classification of protein functional family may potentially be developed into a protein function prediction tool to complement methods based on sequence similarity and clustering.

Instead of direct comparison or clustering of sequences, SVM classification is based on the analysis of physicochemical properties of a protein generated from its sequence (25–30). Samples of proteins known to be in a functional class (positive samples) and those not in the class (negative samples) are used to train an SVM system to recognize specific features and classify proteins as either inside or outside the functional class. Such an approach may be applied to functional prediction for both distantly related and closely related proteins. Proteins of a specific functional class share common structural and chemical features essential for performing similar functions (20–22). Given sufficient samples of proteins of a specific function, SVM can be trained and used to recognize proteins with characteristics for a particular function (15,25,26).

We have developed web-based software, SVMProt, for the classification of a protein into a functional class from its primary sequence. The functionally distinguished classes of proteins are collected from several databases (20–23,31,32) that include all major classes of enzymes, receptors, transporters, channels, DNA-binding proteins and RNA-binding proteins. The core SVM program used in SVMProt is SVM*, which has recently been developed and tested for the classification of DNA-binding proteins (26). SVMProt is specifically trained and tested on each of the functional classes currently collected. Its usefulness for protein functional classification is evaluated. Its capability in the classification of distantly related proteins and homologous proteins of different function is also studied.


    SOFTWARE ACCESS
 
The SVMProt web page is at http://jing.cz3.nus.edu.sg/cgi-bin/svmprot.cgi and is shown in Figure 1. The sequence of a protein, in RAW format and containing no non-amino acid letters, can be entered in the window provided. A sequence of fewer than 50 amino acids is not accepted. The computed result is displayed in a separate window, as shown in Figure 2. Depending on the computed result, one of the following four outcomes is displayed. If the input protein is predicted to belong to one or more functional families, then the name of each family is displayed. For some protein families, a cross-link to the respective protein family database is provided; cross-links for more families will be added. If the input protein is predicted to not belong to any of the functional classes currently included in SVMProt, then the message ‘Your input protein is not in any of the functional classes currently covered by SVMProt’ is displayed. If the input sequence contains invalid characters or has an abnormal composition, such as a long stretch of consecutive single letters, then the message ‘invalid character ...’ or ‘your input sequence is not a valid sequence’ is displayed. If the input sequence is shorter than 50 amino acids, then the message ‘your input sequence is less than 50 amino acids’ is displayed.
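
The input checks described above can be sketched roughly as follows. This is an illustrative sketch only, not the server's code: the set of accepted letters (the 20 standard one-letter codes), the order of the checks, the whitespace stripping and the run-length threshold `MAX_SINGLE_LETTER_RUN` are all assumptions, since the text gives only the 50-residue minimum and the wording of the messages.

```python
from itertools import groupby

# Assumption: only the 20 standard one-letter amino acid codes are accepted.
VALID_RESIDUES = set("ACDEFGHIKLMNPQRSTVWY")
MIN_LENGTH = 50               # stated in the text
MAX_SINGLE_LETTER_RUN = 20    # hypothetical threshold; the text gives no number


def longest_run(seq):
    """Length of the longest stretch of one repeated letter."""
    return max((len(list(group)) for _, group in groupby(seq)), default=0)


def check_sequence(raw):
    """Return the message quoted in the text for a rejected input, or None if it passes."""
    seq = "".join(raw.split()).upper()  # assumption: whitespace and case are ignored
    if set(seq) - VALID_RESIDUES:
        return "invalid character ..."
    if len(seq) < MIN_LENGTH:
        return "your input sequence is less than 50 amino acids"
    if longest_run(seq) > MAX_SINGLE_LETTER_RUN:
        return "your input sequence is not a valid sequence"
    return None
```

A sequence that passes all three checks would then go on to feature-vector construction and classification.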



 
Figure 1. SVMProt web page.

 


 
Figure 2. Example of the SVMProt output returned to the user.

 

    METHODS
 
Table 1 lists the protein functional families currently covered by SVMProt. These include 46 families of enzymes from BRENDA (20), G-protein coupled receptors from GPCRDB (21), nuclear receptors from NucleaRDB (21), tyrosine receptor kinases derived from NCBI (31), five families of channels and one family of transporters from TCDB (22) and LGICdb (23) and DNA- and RNA-binding proteins derived from SWISS-PROT (32). Additional families of transporters will be added very soon. Other families of proteins are being sought and collected. The updated list of functional classes is provided on the SVMProt web page.


 
Table 1. List of protein families currently covered by SVMProt, statistics of datasets and prediction results. Predicted results are given in TP (true positive), FN (false negative), TN (true negative), FP (false positive), and Q (overall accuracy). Number of positive or negative samples in testing and independent evaluation sets is TP+FN or TN+FP, respectively
 
SVMProt is trained for protein classification in the following manner. First, every protein sequence is represented by a specific feature vector assembled from encoded representations of tabulated residue properties, including amino acid composition, hydrophobicity, normalized Van der Waals volume, polarity, polarizability, charge, surface tension, secondary structure and solvent accessibility, for each residue in the sequence (15,25–30). Three descriptors, composition (C), transition (T) and distribution (D), are used to describe the global composition of each of these properties (33). C is the number of amino acids of a particular property (such as hydrophobicity) divided by the total number of amino acids. T characterizes the percent frequency with which amino acids of a particular property are followed by amino acids of a different property. D measures the chain length within which the first, 25, 50, 75 and 100% of the amino acids of a particular property are located.

A hypothetical protein sequence AEAAAEAEEAAAAAEAEEEAAEEAEEEAAE, as shown in Figure 3, has 16 alanines (n1 = 16) and 14 glutamic acids (n2 = 14). The compositions of these two amino acids are n1 × 100.00/(n1 + n2) = 53.33 and n2 × 100.00/(n1 + n2) = 46.67, respectively. There are 15 transitions from A to E or from E to A in this sequence and the percent frequency of these transitions is (15/29) × 100.00 = 51.72. The first, 25, 50, 75 and 100% of As are located within the first 1, 5, 12, 20 and 29 residues, respectively. The D descriptor for As is thus 1/30 × 100.00 = 3.33, 5/30 × 100.00 = 16.67, 12/30 × 100.00 = 40.0, 20/30 × 100.00 = 66.67 and 29/30 × 100.00 = 96.67. Likewise, the D descriptor for Es is 6.67, 26.67, 60.0, 76.67, 100.0. Overall, the amino acid composition descriptors for this sequence are C = (53.33, 46.67), T = (51.72) and D = (3.33, 16.67, 40.0, 66.67, 96.67, 6.67, 26.67, 60.0, 76.67, 100.0), respectively.
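
The worked example above can be reproduced with a short script. This is a minimal sketch rather than the SVMProt implementation: the function name `ctd` and the residue-to-class mapping argument are hypothetical, and rounding fractional ranks down (so that 25% of 14 Es is taken as the third E) is an assumption chosen because it matches the numbers quoted for the E residues.

```python
from itertools import combinations


def ctd(sequence, classes):
    """Composition, transition and distribution descriptors, all in percent.

    `classes` maps each residue letter to a class label (hypothetical interface).
    """
    labels = [classes[residue] for residue in sequence]
    names = sorted(set(classes.values()))
    n = len(labels)

    # C: percentage of residues that fall in each class.
    comp = [round(100.0 * labels.count(c) / n, 2) for c in names]

    # T: percentage of adjacent pairs that switch between two given classes,
    # counted in either direction.
    trans = []
    for a, b in combinations(names, 2):
        switches = sum(1 for p, q in zip(labels, labels[1:]) if {p, q} == {a, b})
        trans.append(round(100.0 * switches / max(n - 1, 1), 2))

    # D: chain position (percent of length) of the first, 25%, 50%, 75% and
    # 100% residue of each class; fractional ranks are rounded down (assumption).
    dist = []
    for c in names:
        positions = [i for i, lab in enumerate(labels, start=1) if lab == c]
        if not positions:
            dist.extend([0.0] * 5)
            continue
        ranks = [1] + [max(1, pct * len(positions) // 100) for pct in (25, 50, 75, 100)]
        dist.extend(round(100.0 * positions[r - 1] / n, 2) for r in ranks)
    return comp, trans, dist


# Each residue is its own class here, as in the two-letter example.
C, T, D = ctd("AEAAAEAEEAAAAAEAEEEAAEEAEEEAAE", {"A": "A", "E": "E"})
# C == [53.33, 46.67]
# T == [51.72]
# D == [3.33, 16.67, 40.0, 66.67, 96.67, 6.67, 26.67, 60.0, 76.67, 100.0]
```

With three classes per property, the same function returns three C values, three T values (one per pair of classes) and 15 D values.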



 
Figure 3. Hypothetical sequence for illustration of derivation of the feature vector of a protein.

 
Descriptors for other properties can be computed by a similar procedure and all the descriptors are combined to form the feature vector. In most studies, amino acids are divided into three classes for each property and thus the three descriptors for each property consist of 21 elements: three for C, three for T and 15 for D (15,25–30,33).

SVMProt is fed and trained with examples of proteins of a particular functional family (positive samples) and those that do not belong to this family (negative samples). The feature vectors of these positive and negative samples are input into the SVMProt system. The trained SVMProt system can then be used to classify a protein into either the positive group (protein is predicted to be in the family) or the negative group (protein is predicted to not belong to the family). Because protein feature vectors describe global composition of various physicochemical properties, SVMProt cannot address such questions as which part of a protein sequence is likely to match with a protein family.

All distinct protein members found by us in each family are used to construct positive samples for training SVMProt. More proteins are being sought and will be added to the training and testing of SVMProt. The negative samples for training are selected from seed proteins of the curated protein families in the Pfam database (34), excluding those that belong to the family under study. Training sets of both positive and negative samples are further screened so that only essential proteins that optimally represent each class are retained. The SVMProt training system for each family is optimized and tested by using separate testing sets of both positive and negative samples. Where possible, all the remaining distinct proteins in each functional family (not in the training set of that family) are used as positive samples and all the remaining representative seed proteins in Pfam curated families are used to construct negative samples in a testing set. The performance of SVMProt classification is further evaluated by using independent sets of both positive and negative samples. There is no duplicate protein in each training, testing or independent evaluation set. The numbers of positive and negative samples of proteins in the training, testing and independent evaluation sets of every functional class are given in Table 1.

The theory of SVM has been described in the literature (15,24–30), so only a brief description is given here. SVM is based on the structural risk minimization (SRM) principle from statistical learning theory (24). In linearly separable cases, SVM constructs a hyperplane that separates two different groups of feature vectors with a maximum margin. A feature vector is represented by $x_i$, with physicochemical descriptors of a protein as its components. The hyperplane is constructed by finding another vector $w$ and a parameter $b$ that minimize $\|w\|^2$ and satisfy the following conditions:

$$w \cdot x_i + b \ge +1 \quad \text{for } y_i = +1 \text{ (positive group)} \tag{1}$$

$$w \cdot x_i + b \le -1 \quad \text{for } y_i = -1 \text{ (negative group)} \tag{2}$$

where $y_i$ is the group index, $w$ is a vector normal to the hyperplane, $|b|/\|w\|$ is the perpendicular distance from the hyperplane to the origin and $\|w\|^2$ is the squared Euclidean norm of $w$. After the determination of $w$ and $b$, a given vector $x$ can be classified by:

$$f(x) = \mathrm{sign}(w \cdot x + b) \tag{3}$$

In non-linearly separable cases, SVM maps the input variable into a high dimensional feature space using a kernel function $K(x_i, x_j)$. An example of a kernel function is the Gaussian kernel, which has been extensively used in different studies (15,24–30):

$$K(x_i, x_j) = \exp\left(-\frac{\|x_j - x_i\|^2}{2\sigma^2}\right) \tag{4}$$

A linear support vector machine is applied to this feature space and the decision function is then given by:

$$f(x) = \mathrm{sign}\left(\sum_{i} \alpha_i^0 y_i K(x, x_i) + b\right) \tag{5}$$

where the coefficients $\alpha_i^0$ and $b$ are determined by maximizing the following Lagrangian expression:

$$W(\alpha) = \sum_{i} \alpha_i - \frac{1}{2} \sum_{i} \sum_{j} \alpha_i \alpha_j y_i y_j K(x_i, x_j) \tag{6}$$

under conditions:

$$\alpha_i \ge 0 \quad \text{and} \quad \sum_{i} \alpha_i y_i = 0 \tag{7}$$

A positive or negative value from Eq. 3 or Eq. 5 indicates that the vector x belongs to the positive or negative group, respectively. To further reduce the complexity of parameter selection, hard margin SVM with threshold instead of soft margin SVM with threshold is used in SVMProt.
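
The kernel decision rule described above, in which the sign of a weighted sum of kernel values plus an offset decides the group, can be illustrated with a small sketch. It assumes the common form of the Gaussian kernel, exp(-||x - z||^2 / (2 sigma^2)); the support vectors, coefficients, offset and kernel width below are made-up toy values, not parameters of SVMProt, and the function names are hypothetical.

```python
import math


def gaussian_kernel(x, z, sigma):
    """Gaussian kernel exp(-||x - z||^2 / (2 sigma^2)) (assumed form)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def decision_value(x, support_vectors, labels, alphas, b, sigma):
    """Weighted sum of kernel values between x and each support vector, plus the offset b."""
    total = sum(
        alpha * y * gaussian_kernel(x, sv, sigma)
        for sv, y, alpha in zip(support_vectors, labels, alphas)
    )
    return total + b


def classify(x, support_vectors, labels, alphas, b, sigma):
    """Return +1 (positive group) or -1 (negative group) from the sign of the decision value."""
    value = decision_value(x, support_vectors, labels, alphas, b, sigma)
    return 1 if value > 0 else -1


# Hypothetical toy model: one support vector per group in a 2-D feature space.
SUPPORT_VECTORS = [(0.0, 0.0), (2.0, 2.0)]
LABELS = [1, -1]
ALPHAS = [1.0, 1.0]   # chosen so that sum(alpha_i * y_i) == 0
B = 0.0
SIGMA = 1.0

near_positive = classify((0.1, 0.1), SUPPORT_VECTORS, LABELS, ALPHAS, B, SIGMA)
near_negative = classify((1.9, 2.0), SUPPORT_VECTORS, LABELS, ALPHAS, B, SIGMA)
```

In a trained system the feature vectors would be the CTD-based protein descriptors, and the coefficients and offset would come from the optimization step rather than being set by hand.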

Scoring of SVM classification of proteins has been estimated by a reliability index and its usefulness has been demonstrated by statistical analysis (29). A slightly modified reliability score, R-value, is used in SVMProt:

where d is the distance between the position of the vector of a classified protein and the optimal separating hyperplane in the hyperspace. There is a statistical correlation between R-value and expected classification accuracy (probability of correct classification) (29). Thus another quantity, P-value, is introduced to indicate the expected classification accuracy. P-value is derived from the statistical relationship, shown in Figure 4, between the R-value and actual classification accuracy based on the analysis of 9932 positive and 45 999 negative samples of proteins.



 
Figure 4. Statistical relationship between the R-value and P-value (probability of correct classification) derived from analysis of 9932 positive and 45 999 negative samples of proteins.

 
As in the case of all discriminative methods (24,35), the performance of SVMProt classification can be measured by the numbers of true positives (TP), true negatives (TN), false positives (FP) and false negatives (FN), and by the overall accuracy (Q) given below:

$$Q = \frac{TP + TN}{TP + TN + FP + FN}$$

    RESULTS AND REMARKS
 
The results for the classification of each of the functional classes are given in Table 1, which lists all the computed TP, TN, FP, FN and Q values. The overall accuracy Q of protein classification ranges from 69.1 to 99.6%, which is on average slightly higher than that obtained in other SVM studies of proteins (15,24–30). One possible reason for this improvement is the use of representative proteins of Pfam curated families as negative samples for SVM classification, which provides a more comprehensive sampling of proteins not in a functional class.

Some proteins of low sequence similarity share similar function (36–38). Efforts have been directed at exploring various novel approaches to predicting the function of these distantly related proteins (16,37,39). SVMProt is tested on 24 randomly selected distantly related proteins in seven families. The BLAST sequence similarity E-value of each of these proteins against most members of its family is significantly higher than the commonly accepted value of 0.05 for similar proteins. Thus alignment methods may not work well for these proteins. Fourteen proteins are correctly classified by SVMProt, which accounts for 58.3% of all distantly related proteins studied. This suggests that, to a certain extent, SVMProt is useful for the classification of distantly related proteins.

Homologous proteins do not necessarily have analogous function (9) and they can be difficult to distinguish using sequence alignment methods. SVMProt is tested on four pairs of homologous proteins of different families and the results are shown in Table 2. While all eight proteins are correctly classified into their respective family, only five of them are not classified into the family of their respective homolog, representing 62.5% of all the homologous proteins examined. This limited study seems to indicate that SVMProt has a certain degree of capability for classification of homologous proteins of different functions. Further analysis is needed to provide a more objective assessment.


 
Table 2. Assessment of SVMProt classification of homologous proteins of different functions
 
The ability of SVMProt in the classification of some distantly related proteins and homologous proteins of different functions probably results from the use of a combination of physicochemical properties to represent a protein. Protein function is determined by specific structural and chemical features at substrate binding sites (20). Some of these function-related features might be captured by the residue properties such as hydrophobicity, normalized Van der Waals volume, polarity, polarizability, charge, surface tension, secondary structure and solvent accessibility which are used in the construction of the SVMProt feature vectors for proteins.

As shown in Table 1, there are several families with a high Q score (~90%) but a relatively modest TP : FN ratio (<100 : 37). Generally, SVMProt gives an accurate prediction of TNs. The imbalance between the number of proteins in a family and those outside of the family may thus lead to cases of high Q score with modest TP : FN ratio. Examination of FN proteins of these families shows that many of these proteins either belong to more than one family or contain a domain shared by proteins in another family. These proteins are often classified into the related family. An analysis of a broad range of families indicates that a substantial portion (61.3%) of incorrectly classified proteins are of low sequence similarity to most of the other members of their family (i.e. the sequence similarity E-value of each of these proteins against most members of its family is significantly higher than 0.05). The percentage of low sequence similarity proteins in a family is not expected to be very high. Therefore, our study seems to suggest that sequence distance has a certain level of influence on the accuracy of SVM classification.
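
The effect of class imbalance on Q can be seen with a small calculation. The sketch below assumes Q is the usual overall accuracy, (TP + TN) divided by all samples, expressed in percent; the counts are hypothetical and are not taken from Table 1.

```python
def overall_accuracy(tp, tn, fp, fn):
    """Overall accuracy Q in percent: correct predictions over all samples."""
    return 100.0 * (tp + tn) / (tp + tn + fp + fn)


def sensitivity(tp, fn):
    """Share of true family members that are recovered, in percent."""
    return 100.0 * tp / (tp + fn)


# Hypothetical family: 137 members and 1300 non-members in the test set.
tp, fn = 100, 37      # a TP : FN ratio of 100 : 37
tn, fp = 1200, 100

q = overall_accuracy(tp, tn, fp, fn)    # about 90.5
recall = sensitivity(tp, fn)            # about 73.0
```

Because negatives outnumber positives roughly ten to one in this made-up case, Q stays near 90% even though more than a quarter of the family members are missed.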

Several factors may affect the prediction accuracy. One is the diversity of protein samples. It is likely that not all possible types of proteins are adequately represented in some functional classes. This can be improved as more protein data become available. SVM prediction may be further improved by using a more comprehensive and refined set of protein descriptors. The SVM optimization procedure and feature vector selection algorithm may also be improved by adding constraints and by incorporating independent component analysis and kernel PCA in the preprocessing steps.

Our study suggests that SVM has potential in the classification of proteins into functional families. SVMProt appears to have a certain level of capability for classification of distantly related proteins and homologous proteins of different functions and, thus, potentially may be used as a protein function prediction tool that complements sequence alignment methods. Further improvements on protein functional family coverage, sample collection and SVM algorithm may enable the development of SVMProt into a useful protein function prediction tool.


    REFERENCES
 

  1. Eisenberg,D., Marcotte,E.M., Xenarios,I. and Yeates,T.O. (2000) Protein function in the post-genomic era. Nature, 405, 823–826.

  2. Bork,P., Dandekar,T., Diaz-Lazcoz,Y., Eisenhaber,F., Huynen,M. and Yuan,Y. (1998) Predicting function: from genomes and back. J. Mol. Biol., 283, 707–725.

  3. Pellegrini,M. (2001) Computational methods for protein function analysis. Curr. Opin. Chem. Biol., 5, 46–50.

  4. Teichmann,S.A. and Mitchison,G. (2000) Computing protein function. Nat. Biotechnol., 18, 27.

  5. Huynen,M., Snel,B., Lathe,W. and Bork,P. (2000) Predicting protein function by genomic context: quantitative evaluation and qualitative inferences. Genome Res., 10, 1204–1210.

  6. Bork,P. and Koonin,E.V. (1998) Predicting functions from protein sequences—where are the bottlenecks? Nature Genet., 18, 313–318.

  7. Baxevanis,A.D. (1998) Practical aspects of multiple sequence alignment. Methods Biochem. Anal., 39, 172–188.

  8. Schuler,G.D. (1998) Sequence alignment and database searching. Methods Biochem. Anal., 39, 145–171.

  9. Benner,S.A., Chamberlin,S.G., Liberles,D.A., Govindarajan,S. and Knecht,L. (2000) Functional inferences from reconstructed evolutionary biology involving rectified databases—an evolutionarily grounded approach to functional genomics. Res. Microbiol., 151, 97–106.

  10. Eisen,J.A. (1998) Phylogenomics: improving functional predictions for uncharacterized genes by evolutionary analysis. Genome Res., 8, 163–167.

  11. Teichmann,S.A., Murzin,A.G. and Chothia,C. (2001) Determination of protein function, evolution and interactions by structural genomics. Curr. Opin. Struct. Biol., 11, 354–363.

  12. Marcotte,E.M., Pellegrini,M., Ng,H.L., Rice,D.W., Yeates,T.O. and Eisenberg,D. (1999) Detecting protein function and protein–protein interactions from genome sequences. Science, 285, 751–753.

  13. Enright,A.J., Iliopoulos,I., Kyrpides,N. and Ouzounis,C.A. (1999) Protein interaction maps for complete genomes based on gene fusion events. Nature, 402, 86–90.

  14. Aravind,L. (2000) Guilt by association: contextual information in genome analysis. Genome Res., 10, 1074–1077.

  15. Bock,J.R. and Gough,D.A. (2001) Predicting protein–protein interactions from primary structure. Bioinformatics, 17, 455–462.

  16. Enright,A.J., Van Dongen,S.V. and Ouzounis,C.A. (2002) An efficient algorithm for large-scale detection of protein families. Nucleic Acids Res., 30, 1575–1584.

  17. Enright,A.J. and Ouzounis,C.A. (2000) GeneRage: a robust algorithm for sequence clustering and domain detection. Bioinformatics, 16, 451–457.

  18. Henikoff,S., Greene,E.A., Pietrokovski,S., Bork,P., Attwood,T.K. and Hood,L. (1997) Gene families: the taxonomy of protein paralogs and chimeras. Science, 278, 609–614.

  19. Ponting,C.P. (2001) Issues in predicting protein function from sequence. Brief Bioinform., 2, 19–29.

  20. Schomburg,I., Chang,A. and Schomburg,D. (2002) BRENDA, enzyme data and metabolic information. Nucleic Acids Res., 30, 47–49.

  21. Horn,F., Vriend,G. and Cohen,F.E. (2001) Collecting and harvesting biological data: the GPCRDB and NucleaRDB information systems. Nucleic Acids Res., 29, 346–349.

  22. Saier,M.H.Jr (2000) A functional-phylogenetic classification system for transmembrane solute transporters. Microbiol. Mol. Biol. Rev., 64, 354–411.

  23. Le Novere,N. and Changeux,J.-P. (2001) LGICdb: the ligand-gated ion channel database. Nucleic Acids Res., 29, 294–295.

  24. Burges,C.J.C. (1998) A tutorial on Support Vector Machine for pattern recognition. Data Min. Knowl. Disc., 2, 121–167.

  25. Karchin,R., Karplus,K. and Haussler,D. (2002) Classifying G-protein coupled receptors with support vector machines. Bioinformatics, 18, 147–159.

  26. Cai,C.Z., Wang,W.L. and Chen,Y.Z. (2003) Support Vector Machine classification of physical and biological datasets. Inter. J. Mod. Phys. C., in press.

  27. Ding,C.H.Q. and Dubchak,I. (2001) Multi-class protein fold recognition using support vector machines and neural networks. Bioinformatics, 17, 349–358.

  28. Yuan,Z., Burrage,K. and Mattick,J.S. (2002) Prediction of protein solvent accessibility using support vector machines. Proteins, 48, 566–570.

  29. Hua,S.J. and Sun,Z.R. (2001) A novel method of protein secondary structure prediction with high segment overlap measure: support vector machine approach. J. Mol. Biol., 308, 397–407.

  30. Cai,Y.D., Liu,X.J., Xu,X.B. and Chou,K.C. (2002) Prediction of protein structural classes by support vector machines. Comput. Chem., 26, 293–296.

  31. Wheeler,D.L., Church,D.M., Federhen,S., Lash,A.E., Madden,T.L., Pontius,J.U., Schuler,G.D., Schriml,L.M., Sequeira,E., Tatusova,T.A. and Wagner,L. (2003) Database resources of the National Center for Biotechnology. Nucleic Acids Res., 31, 28–33.

  32. Boeckmann,B., Bairoch,A., Apweiler,R., Blatter,M.-C., Estreicher,A., Gasteiger,E., Martin,M.J., Michoud,K., O'Donovan,C., Phan,I. et al. (2003) The SWISS-PROT protein knowledgebase and its supplement TrEMBL in 2003. Nucleic Acids Res., 31, 365–370.

  33. Dubchak,I., Muchnik,I., Holbrook,S.R. and Kim,S.-H. (1995) Prediction of protein folding class using global description of amino acid sequence. Proc. Natl Acad. Sci. USA, 92, 8700–8704.

  34. Bateman,A., Birney,E., Cerruti,L., Durbin,R., Etwiller,L., Eddy,S.R., Griffiths-Jones,S., Howe,K.L., Marshall,M. and Sonnhammer,E.L. (2002) The Pfam protein families database. Nucleic Acids Res., 30, 276–280.

  35. Baldi,P., Brunak,S., Chauvin,Y., Anderson,C.A.F. and Nielsen,H. (2000) Assessing the accuracy of prediction algorithms for classification: an overview. Bioinformatics, 16, 412–419.

  36. Nagano,N., Porter,C.T. and Thornton,J.M. (2001) The (βα)8 glycosidases: sequence and structure analyses suggest distant evolutionary relationships. Protein Eng., 14, 845–855.

  37. Frishman,D. and Argos,P. (1992) Recognition of distantly related protein sequences using conserved motifs and neural networks. J. Mol. Biol., 228, 951–962.

  38. Miyata,Y. and Nishida,E. (1999) Distantly related cousins of MAP kinase: biochemical properties and possible physiological functions. Biochem. Biophys. Res. Commun., 266, 291–295.

  39. Yang,A.S. (2002) Structure-dependent sequence alignment for remotely related proteins. Bioinformatics, 18, 1658–1665.





