Nucleic Acids Research Advance Access originally published online on May 25, 2007
Nucleic Acids Research 2007 35(Web Server issue):W357-W362; doi:10.1093/nar/gkm231
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Nucleic Acids Research, 2007, Vol. 35, No. suppl_2 W357-W362
© 2007 The Author(s)
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
Articles |
PI2PE: protein interface/interior prediction engine
Institute of Molecular Biophysics and School of Computational Science and Department of Physics, Florida State University, Tallahassee, FL 32306, USA
*To whom correspondence should be addressed. Tel: +1 850 645 1336; Fax: +1 850 644 7244; Email: zhou{at}sb.fsu.edu
Received January 17, 2007. Revised March 8, 2007. Accepted March 29, 2007.
| ABSTRACT |
|---|
|
|
|---|
The side chains of the 20 types of amino acids, owing to a large extent to their different physical properties, have characteristic distributions in interior/surface regions of individual proteins and in interface/non-interface portions of protein surfaces that bind proteins or nucleic acids. These distributions have important structural and functional implications. We have developed accurate methods for predicting the solvent accessibility of amino acids from a protein sequence and for predicting interface residues from the structure of a protein-binding or DNA-binding protein. The methods are called WESA, cons-PPISP and DISPLAR, respectively. The web servers of these methods are now available at http://pipe.scs.fsu.edu. To illustrate the utility of these web servers, cons-PPISP and DISPLAR predictions are used to construct a structural model for a multicomponent proteinDNA complex.
| INTRODUCTION |
|---|
|
|
|---|
The growth of protein structures in the Protein Data Bank (PDB) and expansion of our understanding of protein physical properties are constantly enhancing our ability to predict structural and functional features. Many prediction methods are now automated and accurate. We have contributed three such methods: WESA for predicting the solvent accessibility of amino acids from a protein sequence (1), cons-PPISP for predicting interface residues from the structure of a protein which binds a second protein (2,3), and DISPLAR for predicting interface residues from the structure of a protein which binds DNA (4). Both WESA and DISPLAR were found to have higher prediction accuracy than competing methods (1,4). Cons-PPISP was shown to be able to complement experimental techniques such as NMR chemical shift perturbation in mapping proteinprotein interfaces (3). The methods have found uses in protein structure prediction (5) and in docking of protein complexes (6). For wide access to them, we have now developed web servers for these methods. Here we describe the functionality of the web servers and illustrate their utility by a structural model, built from predictions of the web servers, for a multicomponent proteinDNA complex.
The web servers are components of PI2PE, the protein interface/interior prediction engine, located at http://pipe.scs.fsu.edu. Together, they serve as a pipeline from protein sequences to tertiary structures, then onto quaternary structures of binary complexes, and finally onto superstructures of functioning, multicomponent complexes.
The three prediction methods have similar designs. The input in each case consists of data for a list of residues. In WESA, the list is comprised of a central residue and equal numbers of its sequential neighbors on the left and on the right. In both cons-PPISP and DISPLAR, the list is comprised of a central residue and a number of its spatial neighbors. The input data in WESA are sequential profiles, as given by the position-specific scoring matrix produced by PSI-blast (7). In cons-PPISP and DISPLAR, solvent accessibilities (defined as percentages of exposed surface areas of residues) are also included in the input.
| METHODS |
|---|
|
|
|---|
All the three prediction methods used structures collected from the PDB for training. The dataset of WESA was comprised of 2148 proteins chains with sequence identity <25%. For cons-PPISP, the dataset was comprised of 1256 protein chains that form either heterodimers (involving 458 chains) or homodimers (accounting for the remaining 798 chains of the dataset). The sequence identities among the 1256 chains were <30%. For DISPLAR, the dataset was comprised of 264 protein multimers that form complexes with DNA in their PDB entries. The identities among multimeric entries were <50%. The three protein lists can be downloaded at http://pipe.scs.fsu.edu.
To illustrate why the prediction methods work, in Figure 1 we show the distributions of the 20 types of amino acids in interior and surface regions of folded proteins, and in interface and non-interface portions of protein surfaces that bind proteins or DNA. The results were calculated on the respective datasets. It is clear that the distributions exhibit distinctive patterns. Expectedly non-polar amino acids show preference for the interior whereas polar and charged residues for the surface. Non-polar amino acids similarly (albeit less prominently) prefer interfaces of proteinprotein complexes. On the other hand, proteinDNA interfaces appear to be more dictated by electrostatic complementarity (instead of hydrophobicity), with positively charged arginine and lysine significantly enriched while negatively charged aspartate and glutamate significantly depleted in the interfaces.
|
Another feature, captured by sequence profiles, that distinguishes interior from surface and interface from non-interface is sequence conservation. For structural and functional reasons, interior and interface positions are expected to be more conserved. Such a trend is indeed shown through a comparison in conservation scores between interior and surface positions and between interface and non-interface positions (Supplementary Figure 1).
In addition to sequence profiles, cons-PPISP and DISPLAR also use solvent accessibilities as part of the input. In proteinprotein complexes, interface positions consistently have higher solvent accessibilities than non-interface positions. On the other hand, in proteinDNA complexes, positively charged arginine and lysine have higher solvent accessibilities in interface positions than in non-interface positions, but negatively charged aspartate and glutamate show the opposite tendency (Supplementary Figure 2).
Both cons-PPISP and DISPLAR are based on training neural networks. WESA is a metamethod, based on a weighted ensemble of five separate methods, one of which is neural network training. Further details on the implementations of the methods can be found in the original papers (14). WESA has consistently shown a two-state (interior/surface) prediction accuracy
80%. Cons-PPISP predictions cover >50% of actual proteinprotein interface residues and have >70% accuracy. DISPLAR predictions cover >60% of actual proteinDNA interface residues and have >80% accuracy.
| FUNCTIONALITY OF THE WEB SERVERS |
|---|
|
|
|---|
The direct web link for the WESA web server is http://pipe.scs.fsu.edu/wesa.html. Once there, the user is asked to provide the sequence, in FASTA format, of the protein on which solvent accessibility is predicted. In addition, the user is asked to type in a unique identifier, at the user's choice, for referencing the particular WESA submission, and an e-mail address, for receiving the prediction. A link (http://pipe.scs.fsu.edu/output_wesa.txt) also provides a sample output (obtained from submitting the sequence of PDB entry 1who), with explanations for the columns of numbers for each residue. In short, for each residue in the sequence, the results predicted by five separate methods and the WESA metamethod are given as 1 for exposed or 0 for buried, along with the prediction confidence (ranging from 0.00 for no confidence at all to 1.00 for full confidence). The threshold for an exposed residue is set at 20% solvent exposure. Two figures displaying the actual and WESA-predicted buried residues of 1who are found at the web server.
The direct web link for the cons-PPISP web server is http://pipe.scs.fsu.edu/ppisp.html. The protein structure, on which an interface prediction is to be made, must be provided in PDB format, either by uploading or by pasting. In addition, the user must specify the chain(s) in the structure to be used for prediction. Here there are three common possibilities. (1) The PDB file does not have chain ID; "_"is to be entered. (2) The PDB file has a single chain ID (say "A"), or, it has multiple chains but only a single chain is to be used for prediction; "A" must be entered. (3) Multiple chains (e.g. A, B and C) in a PDB file are to be treated as a single structure and used together for interface prediction; the user must enter "A,B,C". A sample output can be found at http://pipe.scs.fsu.edu/output_ppisp.txt.
The direct web link for the DISPLAR web server is http://pipe.scs.fsu.edu/displar.html. Input and output formats are very similar to cons-PPISP predictions.
For convenience of using the web servers, we provide scripts for running predictions in a batch mode. The scripts allow the user to submit multiple jobs by a unix command from his/her terminal, as if the prediction programs are installed on the local computer. In reality the scripts upload the sequences or PDB files to the PI2PE web servers. The predictions are run at the servers, and results are sent back to the local computer and saved in files specified by the scripts.
| ILLUSTRATIVE APPLICATIONS |
|---|
|
|
|---|
We now present interface residues predicted by the cons-PPISP and DISPLAR web servers to illustrate their utility. First, results are given for three proteins that competitively bind proteins and DNA and have unbound structures in the PDB. Next, predictions for a protein that simultaneously binds DNA and another protein are given, and the results are used to build structures for the proteinprotein binary complex and the proteinDNA ternary complex. Full lists of predicted interface residues for the proteins are given in Supplementary Table S1.
Figure 2A displays the cons-PPISP predictions for the ribonuclease barnase, in reference to the structure of the complex with its inhibitor, barstar (PDB entry 1brs [PDB] ). The predictions were made on the unbound structure of barnase, 1a2p. Twenty predicted residues line up the actual interface. DISPLAR also correctly predicted the same portion of the barnase surface for binding DNA. As shown in Figure 2B, the 27 predicted interface residues are concentrated around the binding site for a tetradeoxynucleotide, d(CGAC) (as defined in 1brn). Six residues (I55-S57, R59, Y97 and Y103) are common in the two sets of predictions. It is of interest to note that cons-PPISP predicted two clusters on the unbound structure of barstar (1bta), one is in the actual interface with barnase (shown in Figure 2A) and the other defines an unknown binding site. DISPLAR did not make any positive predictions on 1bta, which is not expected to bind DNA.
|
We applied cons-PPISP to the tumor repressor p53 core domain (2fej). In Figure 3A, the predictions are displayed on the structure of p53 in complex with either 53BP1 or 53BP2 (1gzh or 1ycs). The five predicted residues (N239-M243) lie in the interfaces with the p53-binding proteins. A second cluster of 22 residues (listed in Supplementary Table 1) was also predicted; many of these residues are found in the dimerdimer interface of a tetramer of the p53 core domain bound to DNA (8). Some of these residues are also implicated in the binding with the E6/E6-AP complex (9). When DISPLAR was applied on 2fej, 19 residues were predicted. These also lie in the interface with DNA (as found in 1tsr) (Figure 3B). The protein- and DNA-binding sites on p53 partly overlap. Cons-PPISP predicted residues are close to some of the DISPLAR predicted residues, but no residue was predicted by both methods.
|
The two-component transcriptional activator PhoB regulates its DNA-binding activity through transiently forming a domaindomain complex, thereby blocking the DNA-binding site. We used cons-PPISP to predict interface residues on the unbound structures of both the effector and receiver domains (1qqi and 1b00). The structure for full-length PhoB is not available, but we found the structure (1ys6) for a close homolog, PrrA, in the PDB. The root-mean-square deviations (RMSD) of 1qqi and 1b00 from the corresponding domains in 1ys6 are 2.4 and 2.1 Å, with sequence identities over aligned positions at 35 and 41%, respectively. When 1qqi and 1b00 are aligned onto 1ys6, the 23 and 21 predicted residues on the two respective domains indeed mostly line the interface as modeled on 1ys6 (Figure 4A). DISPLAR predicted 25 residues on 1qqi, which are located in the actual interface between the effector domain and DNA (as found in 1gxp) (Figure 4B). The binding sites for the receiver domain and DNA on the effector domain (1qqi) overlap, and eight residues (R68, T69, D71, H73 and V93-T96) were predicted by both cons-PPISP and DISPLAR. No DNA-contacting residues were predicted by DISPLAR on the receiver domain (1b00), which is not known to binding DNA.
|
Core binding factors (CBF) form a heterodimer between the
and ß subunits, which in turn forms a ternary complex with DNA. Using the unbound structures of CBF
and CBFß (1eaq and 1ilf, respectively), we predicted interface residues between these two proteins by cons-PPISP. Examined on the dimer structure of CBF
and CBFß (1e50), the 20 and 23 predicted residues on the two subunits indeed line the actual interface (Figure 5A). Previously we have used cons-PPISP predictions to drive the docking of unbound structures (6). Here we use the predictions to score docked structures, obtained by running ZDOCK 2.3 (10) with 15° sampling. The best 200 ZDOCK structures (out of a total of 2000) were re-ranked according to the number of cons-PPISP predicted residues among the interfacial residues (defined as having 10-Å contacts across the interface). A structure with an RMSD of 2.2 Å was ranked second according to cons-PPISP predictions (improved from ZDOCK's ranking of 49th). This docked structure is shown in Figure 5B.
|
The dimer structure of CBF
and CBFß, 1e50, was used to predict DNA-contacting residues by DISPLAR. Nineteen residues, all on CBF
, were predicted. As shown in Figure 6A, CBF
is indeed the subunit that contacts DNA in the ternary complex (1h9d), and the predicted CBF
residues line the DNA-binding site. We docked the dimer structure with a B-DNA decamer built in InsightII (Accelrys Software Inc., San Diego, CA, USA) from the sequence in 1h9d. Parameters for DNA nucleotides required for running ZDOCK were taken from Fanelli and Ferrari (11). A docked structure for the ternary complex, ranked 113th in ZDOCK but improved to 11th by DISPLAR predictions, was found to have an RMSD of just 1.2 Å from 1h9d and is shown in Figure 6B.
|
| FUTURE DEVELOPMENTS |
|---|
|
|
|---|
The continuous growth of the sequence database (ftp://ftp.ncbi.nlm.nih.gov/blast/db/) and the PDB (http://www.rcsb.org) will further improve the accuracy of the three prediction methods. We plan to periodically upload the NCBI the non-redundant (nr) onto PI2PE (http://pipe.scs.fsu.edu). In addition, we plan to expand the datasets for the three predictors by including new entries from the PDB; re-training will be done.
The high accuracy of WESA predictions suggests that solvent accessibility is now a matured field. Future methodological developments will thus focus on proteinprotein and proteinDNA interface predictions. A strategy that contributed to the success of WESA is the combination of complementary methods. We plan to combine cons-PPISP and DISPLAR with approaches based on phylogenetic tree (12,13), surface patch characteristics (14), secondary structure (15), empirical scoring function (16), support vector machine (1720) and Bayesian network (21,22).
Ultimately proteins need to be studied within their functioning units, which often are multicomponent protein complexes. The PI2PE web servers will contribute to better understanding of these complexes.
| SUPPLEMENTARY DATA |
|---|
|
|
|---|
Supplementary Data are available at NAR Online.
| ACKNOWLEDGEMENTS |
|---|
We thank Dr Jeff McDonald for assistance in setting up the web servers. This work was supported in part by NIH grant GM058187. Funding to pay the Open Access publication charges for this article was provided by the NIH (grant GM058187).
Conflict of interest statement. None declared.
| REFERENCES |
|---|
|
|
|---|
- Chen H, Zhou H-X. Prediction of solvent accessibility and sites of deleterious mutations from protein sequence. Nucleic Acids Res (2005) 33:31933199.
[Abstract/Free Full Text] - Zhou H-X, Shan Y. Prediction of protein interaction sites from sequence profile and residue neighbor list. Proteins (2001) 44:336343.[CrossRef][Web of Science][Medline]
- Chen H, Zhou H-X. Prediction of interface residues in proteinprotein complexes by a consensus neural network method: Test against NMR data. Proteins (2005) 61:2135.[CrossRef][Web of Science][Medline]
- Tjong H, Zhou H-X. DISPLAR: an accurate method for predicting DNA-binding sites on protein surfaces. Nucleic Acids Res (2007) 35:14651477.
[Abstract/Free Full Text] - Zhang Y. Template-based modeling and free modeling by I-TASSER in CASP7. Proteins (2007) (in press).
- van Dijk ADJ, de Vries SJ, Dominguez C, Chen H, Zhou H-X, Bonvin AMJJ. Data-driven docking: HADDOCK's adventures in CAPRI. Proteins (2005) 60:232238.[CrossRef][Web of Science][Medline]
- Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res (1997) 25:33893402.
[Abstract/Free Full Text] - Kitayner M, Rozenberg H, Kessler N, Rabinovich D, Shaulov L, Haran TE, Shakked Z. Structural basis of DNA recognition by p53 tetramers. Mol. Cell (2006) 22:741753.[CrossRef][Web of Science][Medline]
- Wang PL, Sait F, Winter G. The wildtype conformation of p53: epitope mapping using hybrid proteins. Oncogene (2001) 20:23182324.[CrossRef][Web of Science][Medline]
- Chen R, Li L, Weng Z. ZDOCK: an Initial-stage protein-docking algorithm. Proteins (2003) 52:8087.[CrossRef][Web of Science][Medline]
- Fanelli F, Ferrari S. Prediction of MEF2ADNA interface by rigid body docking: a tool for fast estimation of protein mutational effects on DNA binding. J. Struct. Biol (2006) 153:278283.[CrossRef][Web of Science][Medline]
- Landau M, Mayrose I, Rosenberg Y, Glaser F, Martz E, Pupko T, Ben-Tal N. ConSurf 2005: the projection of evolutionary conservation scores of residues on protein structures. Nucleic Acids Res (2005) 33:W299W302.
[Abstract/Free Full Text] - Morgan DH, Kristensen DM, Mittleman D, Lichtarge O. ET viewer: an application for predicting and visualizing functional sites in protein structures. Bioinformatics (2006) 22:20492050.
[Abstract/Free Full Text] - Murakami Y, Jones S. SHARP2: proteinprotein interaction predictions using patch analysis. Bioinformatics (2006) 22:17941795.
[Abstract/Free Full Text] - Hoskins J, Lovell S, Blundell TM. An algorithm for predicting proteinprotein interaction sites: abnormally exposed amino acid residues and secondary structure elements. Protein Sci (2006) 15:10171029.[CrossRef][Web of Science][Medline]
- Liang S, Zhang C, Liu S, Zhou Y. Protein binding site prediction using an empirical scoring function. Nucleic Acids Res (2006) 34:36983707.
[Abstract/Free Full Text] - Bradford JR, Westhead DR. Improved prediction of proteinprotein binding sites using a support vector machines approach. Bioinformatics (2005) 21:14871494.
[Abstract/Free Full Text] - Bordner AJ, Abagyan R. Statistical analysis and prediction of proteinprotein interfaces. Proteins (2005) 60:353366.[CrossRef][Web of Science][Medline]
- Kuznetsov IB, Gou Z, Li R, Hwang S. Using evolutionary and structural information to predict DNA-binding sites on DNA-binding proteins. Proteins (2006) 64:1927.[CrossRef][Web of Science][Medline]
- Wang L, Brown SJ. BindN: a web-based tool for efficient prediction of DNA and RNA binding sites in amino acid sequences. Nucleic Acids Res (2006) 34:W243W248.
[Abstract/Free Full Text] - Bradford JR, Needham CJ, Bulpitt AJ, Westhead DR. Insights into proteinprotein interfaces using a Bayesian network prediction method. J. Mol. Biol (2006) 362:365386.[CrossRef][Web of Science][Medline]
- Yan C, Terribilini M, Wu F, Jernigan RL, Dobbs D, Honavar V. Predicting DNA-binding sites of proteins from amino acid sequence. BMC Bioinformatics (2006) 7:262.[CrossRef][Medline]
This article has been cited by other articles:
![]() |
N. Tuncbag, G. Kar, O. Keskin, A. Gursoy, and R. Nussinov A survey of available tools and web servers for analysis of protein-protein interactions and interfaces Brief Bioinform, May 1, 2009; 10(3): 217 - 232. [Abstract] [Full Text] [PDF] |
||||
![]() |
S. Qin and H.-X. Zhou meta-PPISP: a meta web server for protein-protein interaction site prediction Bioinformatics, December 15, 2007; 23(24): 3386 - 3387. [Abstract] [Full Text] [PDF] |
||||
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||







