SDPpred: a tool for prediction of amino acid residues that determine differences in functional specificity of homologous proteins
1 Department of Bioengineering and Bioinformatics, Moscow State University, Vorob'evy gory, 1-73, Moscow, 119992, Russia, 2 State Scientific Center GosNIIGenetika, 1st Dorozhny pr., 1, Moscow, 113545, Russia and 3 Institute for Problems of Information Transmission RAS, Bolshoi Karetny per., 19, Moscow, 127994, Russia
* To whom correspondence should be addressed. Tel: +7 095 9394331; Fax: +7 095 2090579; Email: abr{at}belozersky.msu.ru
Received February 12, 2004; Revised and Accepted March 24, 2004
| ABSTRACT |
|---|
SDPpred (Specificity Determining Position prediction) is a tool for prediction of residues in protein sequences that determine the proteins' functional specificity. It is designed for analysis of protein families whose members have biochemically similar but not identical interaction partners (e.g. different substrates for a family of transporters). SDPpred predicts residues that could be responsible for the proteins' choice of their correct interaction partners. The input of SDPpred is a multiple alignment of a protein family divided into a number of specificity groups, within which the interaction partner is believed to be the same. SDPpred does not require information about the secondary or three-dimensional structure of proteins. It produces a set of the alignment positions (specificity determining positions) that determine differences in functional specificity. SDPpred is available at http://math.genebee.msu.ru/~psn/.
| INTRODUCTION |
|---|
Many protein families contain homologous proteins that have a common biological function but different specificity towards substrates, ligands, effectors, DNA, proteins and other interacting molecules, including other monomers of the same protein. All these interactions must be highly specific. The proteins can be assigned to specificity groups based on experimental data or comparative genomic analysis.
Identification of residues that account for protein specificity might be useful in many biological studies. For instance, these residues can be used for planning experiments on functional analysis or protein redesign. One obvious application of SDPpred (Specificity Determining Position Prediction) is to minimize the number of point mutations required to switch the specificity of an enzyme, regulator or transporter. Analysis of the predicted residues can also provide a deeper insight into the nature of functional specificity. Our experience is that specificity determining positions (SDPs) include not only residues located in active sites of proteins, but also residues involved in establishing contact between subunits.
Construction of phylogenetic trees does not always allow one to assign specificity to all members of a family. An algorithm that extends the idea of SDP analysis and addresses this problem will be put on the web in the near future. It will predict the specificity of unclassified family members, which will make it possible to use SDPpred as a tool for detailed protein annotation.
Amino acid residues that determine differences in protein specificity and account for correct recognition of interaction partners are usually thought to correspond to those positions of a protein multiple alignment where the distribution of amino acids is closely associated with grouping of proteins by specificity. SDPpred searches for positions that are well conserved within specificity groups but differ between them. These positions are called SDPs. Such positions, though obvious in alignments containing a small number of proteins and specificity groups, become challenging to find in large protein families with a variety of specificities. Prediction of SDPs is reasonable not only for protein families whose members have different interaction partners, but also for any family containing specificity groups of any nature (e.g. proteins of different thermostability).
Recently, a number of algorithms addressing the problem described above have been developed. Several approaches exploit information about protein structure or functional sites (1,2). Some methods use only protein sequences (3–8). SDPpred implements the algorithm described in detail in (8), which has several advantages over other methods. First, it does not use any information about protein structure. The procedure is based solely on statistical analysis of an alignment, and thus it can be applied to protein families that do not contain any members with resolved three-dimensional (3D) structure. Second, it automatically calculates the number of SDPs and the probability of occurrence of these positions by chance. It does not incorporate any ad hoc cutoff setting and thus does not require any prior knowledge about special properties of the analyzed family. Third, substitutions within specificity groups are weighted according to physical properties of amino acids using a substitution matrix, so that substitutions to amino acids with similar properties are only weakly penalized. Finally, SDPpred incorporates information about evolutionary distance within and between groups by using different amino acid substitution matrices. To the best of our knowledge, there is currently no publicly available server that addresses the problem of identification of SDPs.
| ALGORITHM DESCRIPTION |
|---|
The algorithm implemented in SDPpred is described in detail in (8). Some of its features are inherited from the method of (7).
Briefly, consider a multiple protein sequence alignment. The proteins are divided into N specificity groups, numbered by i = 1,...,N. The goal is to identify columns (positions) in the alignment in which the amino acid distribution is closely associated with the grouping by specificity. This association in column p of the alignment is measured by the mutual information
$$I_p = \sum_{i=1}^{N} \sum_{\alpha=1}^{20} f_p(\alpha, i) \log \frac{f_p(\alpha, i)}{f_p(\alpha)\, f(i)}$$

where $\alpha = 1,\dots,20$ is a residue type, $f_p(\alpha, i)$ is the ratio of the number of occurrences of residue $\alpha$ in group i at position p to the length of the whole alignment column, $f_p(\alpha)$ is the frequency of residue $\alpha$ in the whole alignment column, and $f(i)$ is the fraction of proteins belonging to group i. The mutual information reflects the statistical association between two discrete random variables, $\alpha$ and i.
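The mutual information of a single column can be illustrated with a short Python sketch. It uses raw (unsmoothed) frequencies and the natural logarithm; the function name `column_mutual_information` and the input layout (one residue character and one group label per protein) are assumptions made for this illustration, not part of SDPpred.

```python
import math
from collections import Counter


def column_mutual_information(column, groups):
    """Mutual information between residue type and group label in one column.

    column: residues of one alignment column, one character per protein.
    groups: group label of each protein, in the same order as `column`.
    Raw frequencies are used (no smoothing); the logarithm is natural.
    """
    if len(column) != len(groups):
        raise ValueError("column and groups must have the same length")
    n = len(column)
    joint = Counter(zip(column, groups))  # counts of (residue, group) pairs
    residue_counts = Counter(column)      # counts of each residue in the column
    group_counts = Counter(groups)        # number of proteins in each group

    mi = 0.0
    for (res, grp), count in joint.items():
        f_joint = count / n               # f_p(alpha, i)
        f_res = residue_counts[res] / n   # f_p(alpha)
        f_grp = group_counts[grp] / n     # f(i)
        mi += f_joint * math.log(f_joint / (f_res * f_grp))
    return mi
```

A fully conserved column carries no information about the grouping and yields zero, whereas a column that separates two equally sized groups perfectly yields log 2.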
To address the facts that the frequencies are calculated from a small sample and that substitutions to amino acids with similar physical properties should be only weakly penalized, the observed amino acid frequencies are modified. Instead of using $f(\alpha, i) = n(\alpha, i)/n(i)$, where $n(\alpha, i)$ is the number of occurrences of residue $\alpha$ in group i and $n(i)$ is the size of group i (here i is a single group or the whole alignment), SDPpred uses smoothed frequencies

[Equation for the smoothed frequencies not reproduced here; see (8).]

The smoothed frequencies involve the probability of amino acid substitution $\beta \to \alpha$ according to the matrix corresponding to the average identity in group i, and a smoothing parameter that takes values between 0 and 1. SDPpred uses matrices of the BLOSUM series (9) for groups with average identity ≤60% and their analogs calculated as described in (10) for groups with larger average identity. Additionally, zero frequencies are avoided automatically, and thus the necessary pseudocounts are introduced in a natural way.

To calculate the statistical significance of the obtained values of $I_p$, each column is shuffled, yielding the distribution $F(I^{sh})$. To offset the background similarity of proteins, which is higher within groups than between groups, SDPpred calculates $I^{exp}$, the expected mutual information for column p, as a linear transform of $I^{sh}$, as described in (7).
Then, Z-scores are calculated:
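$$Z_p = \frac{I_p - I_p^{exp}}{\sigma(I_p^{sh})}$$

where $\sigma(I_p^{sh})$ is the standard deviation of the mutual information over the shuffled versions of column p.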
A high Z-score value indicates a position where the amino acid distribution is much more closely associated with grouping by specificity than for an average position of the alignment, and which is thus likely to be an SDP.
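The shuffling step can be sketched in Python as follows. This is a simplified illustration, not the SDPpred procedure: it assumes the expected value is simply the mean of the shuffled mutual information, whereas SDPpred uses a linear transform of the shuffled values as described in (7), and it uses raw rather than smoothed frequencies. The names `mutual_information` and `shuffle_z_score` are hypothetical.

```python
import math
import random
import statistics
from collections import Counter


def mutual_information(column, groups):
    """Mutual information (natural log) between residues and group labels."""
    n = len(column)
    joint = Counter(zip(column, groups))
    res = Counter(column)
    grp = Counter(groups)
    # f(a, g) / (f(a) f(g)) simplifies to c * n / (res[a] * grp[g])
    return sum((c / n) * math.log(c * n / (res[a] * grp[g]))
               for (a, g), c in joint.items())


def shuffle_z_score(column, groups, n_shuffles=1000, seed=0):
    """Z-score of the observed mutual information against shuffled columns.

    ASSUMPTION: the expected value is the plain mean of the shuffled values,
    standing in for the linear transform used by SDPpred.
    """
    rng = random.Random(seed)
    observed = mutual_information(column, groups)
    shuffled_column = list(column)
    shuffled_values = []
    for _ in range(n_shuffles):
        rng.shuffle(shuffled_column)  # permute residues, keep group labels fixed
        shuffled_values.append(mutual_information(shuffled_column, groups))
    mean = statistics.fmean(shuffled_values)
    sd = statistics.pstdev(shuffled_values)
    if sd == 0.0:
        return 0.0  # e.g. a fully conserved column: shuffling changes nothing
    return (observed - mean) / sd
```

Shuffling the residues while keeping the group labels fixed preserves the amino acid composition of the column and destroys only its association with the grouping, which is why the shuffled values serve as a null distribution.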
Given a series of Z-scores corresponding to every position of the multiple alignment, one needs to evaluate the significance of the Z-scores in order to tell whether an observed Z-score is sufficiently high to indicate an SDP. SDPpred uses an automated procedure for setting the threshold, based on the computation of the Bernoulli estimator. The observed Z-scores are arranged in decreasing order: Z1, Z2,.... The threshold is defined as

[Equations defining the threshold not reproduced here; see (8).]
The described procedure depends on the distribution of Z-scores. It can be proved that the distribution of the mutual information lies asymptotically between the Gaussian and exponential distributions. On real data the procedure is robust relative to the distribution, and the set of SDPs is almost the same assuming Gaussian and exponentially distributed Z-scores.
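A cutoff procedure of this kind can be sketched in Python. The sketch rests on two assumptions that are not taken from the text: that the cutoff is the rank k minimizing the probability of observing at least k of n Z-scores as high as the k-th largest one by chance, and that Z-scores follow a standard Gaussian distribution under the null model. All function names are hypothetical.

```python
import math


def gaussian_tail(z):
    """P(Z >= z) for a standard normal variable."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


def binomial_tail(n, k, q):
    """P(at least k successes in n independent trials with success probability q)."""
    if q <= 0.0:
        return 0.0 if k > 0 else 1.0
    if q >= 1.0:
        return 1.0
    log_q, log_p = math.log(q), math.log1p(-q)
    total = 0.0
    for i in range(k, n + 1):
        # log of C(n, i) * q**i * (1 - q)**(n - i), computed in log space
        log_term = (math.lgamma(n + 1) - math.lgamma(i + 1)
                    - math.lgamma(n - i + 1) + i * log_q + (n - i) * log_p)
        total += math.exp(log_term)
    return min(total, 1.0)


def bernoulli_cutoff(z_scores):
    """Return (k_star, probabilities) for Z-scores sorted in decreasing order.

    probabilities[k - 1] is the chance probability of seeing at least k
    Z-scores as high as the k-th largest; k_star minimizes it (ASSUMED rule).
    """
    ordered = sorted(z_scores, reverse=True)
    n = len(ordered)
    probabilities = [binomial_tail(n, k, gaussian_tail(ordered[k - 1]))
                     for k in range(1, n + 1)]
    k_star = min(range(1, n + 1), key=lambda k: probabilities[k - 1])
    return k_star, probabilities
```

Under these assumptions, the top `k_star` positions would be reported, and the list of probabilities is the kind of curve whose minimum sets the cutoff. Replacing `gaussian_tail` with an exponential tail would give the alternative null distribution mentioned above.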
The results of testing, which agree well with available structural and experimental data, are described in (8). In that study, we analyzed two protein families: the LacI family of bacterial transcription factors and the MIP family of membrane channels in bacteria. Both these families include proteins with resolved 3D structure, which was used to evaluate predictions. In both cases, the fraction of contacting residues among SDPs is much larger than in the whole alignment (Table 1). Interestingly, in the case of the MIP family we not only described the channel very well (all residues known to interact with the substrate are either conserved or belong to the predicted set of SDPs), but also identified some residues that lie on the surface of contact between the subunits and cluster together, possibly forming structural clasps (Figure 1) (11).
| DESCRIPTION OF THE WEB INTERFACE |
|---|
The only information needed for prediction of SDPs is a multiple alignment of protein sequences divided into specificity groups. SDPpred does not require any information about protein 3D structure.
SDPpred can analyze alignments of length up to 2000 positions, containing at most 1000 proteins. There can be up to 1000 specificity groups; however, it is recommended that each group contain at least three sufficiently divergent sequences. On the other hand, the average identity in each group should not be <25%. Having more than two groups also strongly improves the quality of prediction due to more efficient elimination of the background evolutionary similarity.
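Before submitting an alignment, the recommendations on group size and average identity can be checked with a short Python sketch. The text does not define how average identity is computed, so the definition used here (the mean, over all sequence pairs in a group, of the fraction of identical residues among columns where neither sequence has a gap) is an assumption, as are all function names.

```python
from itertools import combinations


def pairwise_identity(a, b):
    """Fraction of identical residues over columns where neither sequence has a gap."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    return sum(x == y for x, y in pairs) / len(pairs)


def average_identity(seqs):
    """Mean pairwise identity of aligned sequences (1.0 for a single sequence)."""
    pairs = list(combinations(seqs, 2))
    if not pairs:
        return 1.0
    return sum(pairwise_identity(a, b) for a, b in pairs) / len(pairs)


def check_groups(groups, min_size=3, min_identity=0.25):
    """Return warnings for groups that violate the recommendations.

    groups: dict mapping group name -> list of aligned sequences.
    """
    warnings = []
    for name, seqs in groups.items():
        if len(seqs) < min_size:
            warnings.append(f"{name}: fewer than {min_size} sequences")
        if average_identity(seqs) < min_identity:
            warnings.append(f"{name}: average identity below {min_identity:.0%}")
    return warnings
```

An empty list of warnings means every group has at least three sequences and an average identity of at least 25% under the assumed definition.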
A typical SDPpred query is shown in Figure 2. The aligned sequences should be in FASTA, GDE or Pfam plain text alignment format (with gaps as dashes and all characters in upper case). Alignments in one of these formats can be easily obtained from databases (e.g. Pfam) or alignment programs (e.g. ClustalW). The alignment should be manually edited according to the specificity group assignments. The specificity groups should be separated by lines beginning with the equals sign and containing the name of the following group (e.g. =Group1).
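To make the input layout concrete, here is a minimal Python sketch that reads a FASTA alignment annotated with `=GroupName` separator lines. It is an illustration of the format described above, not the server's parser; the function name `parse_grouped_fasta` and the example sequences in the usage note are made up.

```python
def parse_grouped_fasta(text):
    """Parse a FASTA alignment split into groups by lines such as '=Group1'.

    Returns a dict: group name -> list of (sequence name, aligned sequence).
    """
    groups = {}
    current_group = None
    name, chunks = None, []

    def flush():
        # Store the sequence collected so far under the current group.
        nonlocal name, chunks
        if name is not None:
            groups[current_group].append((name, "".join(chunks)))
        name, chunks = None, []

    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("="):      # start of a new specificity group
            flush()
            current_group = line[1:].strip()
            groups.setdefault(current_group, [])
        elif line.startswith(">"):    # FASTA header
            flush()
            if current_group is None:
                raise ValueError("sequence found before the first '=' group line")
            name = line[1:].strip()
        else:                         # sequence data, possibly wrapped over lines
            if name is None:
                raise ValueError("sequence data found before a FASTA header")
            chunks.append(line.upper())
    flush()
    return groups
```

For example, the text `=Group1`, `>seqA`, `AC-DEF`, `=Group2`, `>seqB`, `AC-DEY` (one item per line) yields two groups with one aligned sequence each.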
The user has to select the number of shuffles for computation of the statistical significance (between 1000 and 10 000). An alignment of a thousand sequences divided into several hundred specificity groups takes a couple of hours to analyze if each column is shuffled 10 000 times. Using fewer shuffles reduces the required time proportionally, but makes the results less reliable. Typically, the top of the SDP list remains the same, but minor variations may appear near the cutoff. An average query of 72 sequences divided into 12 specificity groups, where the average protein length is 400 amino acids and the alignment length is 587 positions, is analyzed in 4 min.
The last parameter is the maximum percentage of gaps allowed in a column to be analyzed. Columns with a greater fraction of gaps are excluded from the analysis. Typically, this number should not exceed 30%, but in some cases (e.g. when the user is analyzing group-specific loops) it may be reasonable to set this parameter to a higher value. However, a large percentage of allowed gaps produces many SDPs at the termini of the alignment, which are likely to be incorrect.
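The effect of the gap parameter can be sketched in Python as a simple column filter. The function name `columns_within_gap_limit` and the use of 0-based column indices are assumptions for this illustration.

```python
def columns_within_gap_limit(sequences, max_gap_percent=30.0):
    """Return 0-based indices of columns whose gap percentage does not exceed the limit.

    sequences: aligned sequences of equal length, with gaps written as '-'.
    """
    if not sequences:
        return []
    length = len(sequences[0])
    if any(len(s) != length for s in sequences):
        raise ValueError("all aligned sequences must have the same length")
    kept = []
    for p in range(length):
        gaps = sum(s[p] == "-" for s in sequences)
        if 100.0 * gaps / len(sequences) <= max_gap_percent:
            kept.append(p)  # column is analyzed; others are excluded
    return kept
```

Raising `max_gap_percent` admits more columns, including gap-rich ones near the alignment termini, which is where spurious predictions are most likely.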
| OUTPUT |
|---|
SDPpred outputs the set of SDPs, i.e. the alignment positions that are likely to determine differences in functional specificity among the given groups. This set can be visualized in several ways. The user can switch between the alignment of the family with the SDPs highlighted, the detailed description of each SDP, and the plot of probabilities, from which the minimum is chosen to set the cutoff (Figure 3). The plot is particularly useful when there are several local minima of close significance, in which case it might be useful to consider them all.
The predicted SDPs can be mapped on to any protein of the alignment. Amino acid residues corresponding to the SDPs in the selected protein are listed below the alignment on the alignment page and in the tables describing SDPs in detail.
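Mapping alignment columns to residues of one protein amounts to counting non-gap characters. A minimal Python sketch, assuming 0-based alignment columns and 1-based residue numbering that starts at the first residue of the aligned sequence (the function name is hypothetical):

```python
def map_columns_to_residues(aligned_seq, columns):
    """Map alignment columns to residues of one aligned protein.

    aligned_seq: one row of the alignment, with gaps written as '-'.
    columns: 0-based alignment columns (e.g. predicted SDPs).
    Returns {column: (residue, 1-based residue number)}, or None for a gap.
    """
    wanted = set(columns)
    mapping = {}
    residue_number = 0
    for col, ch in enumerate(aligned_seq):
        if ch != "-":
            residue_number += 1  # count only real residues
        if col in wanted:
            mapping[col] = None if ch == "-" else (ch, residue_number)
    return mapping
```

If the aligned fragment does not start at the first residue of the full-length protein, the numbering has to be shifted by the corresponding offset.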
| PROSPECTS |
|---|
We plan to implement the algorithm that predicts specificity for those members of the family whose specificity is unknown. The algorithm is described in detail in (8).
| Notes |
|---|
The online version of this article has been published under an open access model. Users are entitled to use, reproduce, disseminate, or display the open access version of this article provided that: the original authorship is properly and fully attributed; the Journal and Oxford University Press are attributed as the original place of publication with the correct citation details given; if an article is subsequently reproduced or disseminated not in its entirety but only in part or as a derivative work this must be clearly indicated.
| REFERENCES |
|---|
1. Lichtarge,O., Bourne,H.R. and Cohen,F.E. (1996) An evolutionary trace method defines binding surfaces common to protein families. J. Mol. Biol., 257, 342–358.
2. Johnson,J.M. and Church,G.M. (2000) Predicting ligand-binding function in families of bacterial receptors. Proc. Natl Acad. Sci. USA, 97, 3965–3970.
3. Livingstone,C. and Barton,G. (1993) Protein sequence alignments: a strategy for the hierarchical analysis of residue conservation. Comput. Appl. Biosci., 9, 745–756.
4. Casari,G., Sander,C. and Valencia,A. (1995) A method to predict functional residues in proteins. Nat. Struct. Biol., 2, 171–178.
5. Gaucher,E.A., Gu,X., Miyamoto,M.M. and Benner,S.A. (2002) Predicting functional divergence in protein evolution by site-specific rate shifts. Trends Biochem. Sci., 27, 315–321.
6. Hannenhalli,S.S. and Russell,R.B. (2000) Analysis and prediction of functional sub-types from protein sequence alignments. J. Mol. Biol., 303, 61–76.
7. Mirny,L.A. and Gelfand,M.S. (2002) Using orthologous and paralogous proteins to identify specificity-determining residues in bacterial transcription factors. J. Mol. Biol., 321, 7–20.
8. Kalinina,O.V., Mironov,A.A., Gelfand,M.S. and Rakhmaninova,A.B. (2004) Automated selection of positions determining functional specificity of proteins by comparative analysis of orthologous groups in protein families. Protein Sci., 13, 443–456.
9. Henikoff,S. and Henikoff,J.G. (1992) Amino acid substitution matrices from protein blocks. Proc. Natl Acad. Sci. USA, 89, 10915–10919.
10. Sutormin,R.A., Rakhmaninova,A.B. and Gelfand,M.S. (2003) BATMAS30: the amino acid substitution matrix for alignment of bacterial transporters. Proteins, 51, 85–95.
11. Kalinina,O.V., Gelfand,M.S., Mironov,A.A. and Rakhmaninova,A.B. Amino acid residues forming specific contacts between subunits in tetramers of the membrane channel GlpF. Biofizika, in press.