Nucleic Acids Research 2004 32(Web Server Issue):W76-W82; doi:10.1093/nar/gkh425
© 2004, the authors
Nucleic Acids Research, Vol. 32, Web Server issue © Oxford University Press 2004; all rights reserved
ProteMiner-SSM: a web server for efficient analysis of similar protein tertiary substructures
Darby Tien-Hau Chang,
Chien-Yu Chen,
Wen-Chin Chung,
Yen-Jen Oyang*,
Hsueh-Fen Juan1 and
Hsuan-Cheng Huang2
Department of Computer Science and Information Engineering, National Taiwan University, Taipei, Taiwan, ROC, 1 Institute of Biotechnology and Department of Chemical Engineering, National Taipei University of Technology, Taipei, Taiwan, ROC and 2 Institute of Biological Chemistry, Academia Sinica, Taipei, Taiwan, ROC
* To whom correspondence should be addressed. Tel: +886 2 23625336 #431 Fax: +886 2 23688675; Email: yjoyang{at}csie.ntu.edu.tw
Correspondence may also be addressed to Chien-Yu Chen. Email: cychen{at}mars.csie.ntu.edu.tw
Present address: Chien-Yu Chen, Graduate School of Biotechnology and Bioinformatics, Yuan-Ze University, Chung-Li, Taiwan
Received February 15, 2004; Revised April 1, 2004; Accepted April 12, 2004
 |
ABSTRACT
|
|---|
Analysis of proteinligand interactions is a fundamental
issue in drug design. As the detailed and accurate analysis
of proteinligand interactions involves calculation of
binding free energy based on thermodynamics and even quantum
mechanics, which is highly expensive in terms of computing time,
conformational and structural analysis of proteins and ligands
has been widely employed as a screening process in computer-aided
drug design. In this paper, a web server called ProteMiner-SSM
designed for efficient analysis of similar protein tertiary
substructures is presented. In one experiment reported in this
paper, the web server has been exploited to obtain some clues
about a biochemical hypothesis. The main distinction in the
software design of the web server is the filtering process incorporated
to expedite the analysis. The filtering process extracts the
residues located in the caves of the protein tertiary structure
for analysis and operates with O(
nlog
n) time complexity, where
n is the number of residues in the protein. In comparison, the

-hull algorithm, which is a widely used algorithm in computer
graphics for identifying those instances that are on the contour
of a three-dimensional object, features O(
n2) time complexity.
Experimental results show that the filtering process presented
in this paper is able to speed up the analysis by a factor ranging
from 3.15 to 9.37 times. The ProteMiner-SSM web server can be
found at
http://proteminer.csie.ntu.edu.tw/. There is a mirror
site at
http://p4.sbl.bc.sinica.edu.tw/proteminer/.
 |
INTRODUCTION
|
|---|
One of the fundamental issues in drug design is analysis of
protein-ligand interactions (
1,
2). The detailed and accurate
analysis of proteinligand interactions involves calculation
of binding free energy based on thermodynamics and even quantum
mechanics (
3,
4). However, this approach is highly expensive
in terms of computing time. As a result, conformational and
structural analysis of proteins and ligands has been widely
employed as a screening process in computer-aided drug design
(
5
8).
In this paper, a web server designed for efficient analysis of similar protein tertiary substructures, named ProteMiner-SSM, is presented. Figure 1 illustrates one application that the design of ProteMiner-SSM addresses. In this application, the biochemist is given the crystal structure of a protein bound with a specific ligand and wants to conduct a search in the Protein Data Bank (PDB) database (9) for the other proteins that contain a similar binding site and therefore could interact with the specific ligand. In one experiment reported in this paper, ProteMiner-SSM has been exploited to investigate whether some proteins in the caspase family contain a similar binding site to the structure of integrin reported in (10). The experimental results provide biochemists with some valuable clues that conform to a biochemical hypothesis.
In terms of the application illustrated in
Figure 1, it is apparent
that only the substructures in the caves of the protein tertiary
structure are of interest. Therefore, in order to expedite the
analysis process, it is desirable to incorporate a mechanism
that can effectively extract the residues in the caves of the
protein tertiary structure. In this paper, an efficient filtering
process with O(
nlog
n) time complexity is employed, where
n is
the number of residues in the protein. In comparison with the

-hull algorithm (
11), which is a widely used algorithm in computer
graphics for identifying those instances on the contour of a
three-dimensional (3D) object, the filtering process employed
in this paper features a lower time complexity, O(
nlog
n) versus
O(
n2). Experimental results show that the filtering process
presented in this paper is able to speed up the analysis by
a factor ranging from 3.15 to 9.37 times.
The next section of this paper elaborates the software design of ProteMiner-SSM. We then report the experiments conducted to evaluate the performance of ProteMiner-SSM. Finally, concluding remarks are presented, and two appendices give the mathematical basis of Equations 1 and 2 in the text.
 |
SOFTWARE DESIGN OF ProteMiner-SSM
|
|---|
ProteMiner-SSM carries out analysis in two steps. In the first
step, a filtering process based on an efficient kernel density
estimation algorithm is invoked to identify the crucial tertiary
substructures on the contour of the protein that the analysis
should focus on. In the second step, the geometric hashing algorithm
in computer graphics (
12,
13) is invoked to compare the crucial
substructures of the target protein and the binding/active site
of the reference protein. In this paper, we refer to the protein
that contains the binding/active site of interest as the reference
protein and the proteins in PDB against which the alignment
is to be performed as the target proteins.
ProteMiner-SSM conducts analysis at the residue level with each residue represented by its alpha carbon in the vector space. In other words, a protein substructure is defined by the coordinates of the alpha carbons included in the substructure. The efficient kernel density estimation algorithm that forms the basis of the filtering process treats the set of residues {s1, s2,..., sn} of a protein as n samples randomly taken from a probability distribution in the 3D vector space and employs the learning algorithm that we have recently proposed (14,15) to construct an approximate probability density function of the following form:
 | (1) |
where
is a vector in an m-dimensional vector space and in this paper m=3,
- ß is the parameter that controls the smoothness of the approximation function,
-
where R(si) is the distance between sample si and its k-th nearest neighbor, k is a parameter to be set by the user, and
(
) is the Gamma function (18),
-
One interesting observation is that, regardless
of which ß = (
i/
i) ratio is employed, we have

. If this observation can be proved to be generally
correct, then we can further simplify
Equation 1 and obtain
 | (2) |
The mathematical basis of
Equations 1 and
2 is elaborated in the appendices.
As the approximate probability density function shown in Equations 1 and 2 is a continuous and smooth function in the vector space, we can expect that the function values at the residues located on the contour of the protein tertiary structure are generally smaller than the function values at the inner residues. Accordingly, we can set a threshold of the function values to distinguish those residues that are located on the contour from those that are not.
With the residues on the contour of the protein tertiary structure been successfully identified, the next task of the filtering process is to further classify each of these residues depending on whether it is located in a cave or not. This task can be carried out by applying Equation 1 or 2 again but with a larger ß value. Applying Equation 1 or 2 with a larger ß value implies that the approximate probability density function obtained is smoother. As a result, the function values at those residues that are located in a cave will be generally larger than the function values at those residues that are on the contour of the protein tertiary structure but not in a cave. Accordingly, a threshold can be set to classify these residues.
With the filtering process applied to both the reference protein and the target protein, the next task that ProteMiner-SSM carries out is conducting structural alignment on the crucial substructures identified. In ProteMiner-SSM, we have adopted the common practice for carrying out protein structural alignment with the geometric hashing algorithm (58,16). With this practice, the coordinate systems examined by the geometric hashing algorithm are limited to those defined by the two backbone bonds connected to the alpha carbon of each residue. With the filtering process incorporated, in our implementation, the geometric hashing algorithm further narrows down its search space to only the coordinate systems defined by the residues located in the caves. In the design of ProteMiner-SSM, the likelihood of residue substitution is also taken into account. If the entry in the PAM 250 matrix (1,17) corresponding to a pair of residues aligned by the geometric hashing algorithm is <2, then this pair of residues is excluded from the list of those successfully aligned.
The discussions presented in this section so far elaborate the basics of the software design of ProteMiner-SSM. Additional details, including parameter settings and time complexity analysis, can be found in the supplementary material.
 |
EXPERIMENTAL RESULTS
|
|---|
This section reports two experiments conducted to evaluate the
performance of ProteMiner-SSM. The main objective of the first
experiment is to test the accuracy of ProteMiner-SSM. The second
experiment demonstrates how biochemists can exploit ProteMiner-SSM
to facilitate their research works.
In the first experiment, three datasets, each of which contains a reference structure and a number of target proteins, are used to test whether ProteMiner-SSM is able to identify the region on the contour of the target protein that contains a similar substructure as the reference protein. Table 1 shows the characteristics of these three reference protein structures. The first two reference structures are two enzymes in PDB, PDB ID = 1HDZ
[PDB]
(alcohol dehydrogenase) and 1BL5
[PDB]
(Isocitrate dehydrogenase), and the third reference structure, PDB ID = 1L5G
[PDB]
, contains an integrin
Vß3 bound with a peptide ligand as reported in (10). For each of the two enzyme proteins, five proteins from the same family in PDB are employed as the target proteins. For integrin, the alternative structures of integrin
Vß3 with different bindings, PDB ID = 1JV2
[PDB]
and 1M1X
[PDB]
, are employed as the target proteins. Table 2 reports the results of the first experiment. The experimental results show that, with a high degree of accuracy, ProteMiner-SSM is able to identify the residues in the binding/active sites of the target protein. The only miss occurs when protein 1HJ6
[PDB]
is aligned with reference protein 1BL5
[PDB]
. However, as Table 2 shows, the miss is not due to the filtering process invoked to expedite the analysis. Without the filtering process, the geometric hashing algorithm still can only successfully align seven out of the eight residues in the active site of protein 1HJ6
[PDB]
with the residues in the active site of the reference protein.
In the second experiment, ProteMiner-SSM is invoked to figure
out whether some proteins in the Caspase family may contain
a similar binding site to the structure of integrin reported
in (
10).
Table 3 shows the results output by ProteMiner-SSM.
It is observed that caspase-7, PDB ID = 1F1J
[PDB]
and 1K86
[PDB]
, Procaspase-7,
PDB ID = 1GQF
[PDB]
, caspase-8, PDB ID = 1F9E
[PDB]
and caspase-9, PDB ID
= 1JXQ
[PDB]
, have the largest numbers of residues successfully aligned
with the residues in the binding site of integrin. This result
is in conformity with a hypothesis theorized by biochemists.
However, the outputs of ProteMiner-SSM can only be regarded
as interesting clues and, as shown in
Table 4, it is typical
that multiple possible alignments are found. Therefore, more
in-depth analyses, such as protein docking or protein affinity
analysis, must be conducuted to further confirm the hypothesis.
View this table:
[in this window]
[in a new window]
|
Table 4. Two possible mappings from the second experiment of the residues in the crucial substructures of caspase-8 to the residues in the binding site of integrin Vß3
|
|
The results in
Tables 1

4 also show that the filtering
process incorporated in ProteMiner-SSM is able to speed up the
analysis by a factor ranging from 3.15 to 9.37 times. However,
for the case reported in
Table 3, the experimental results reveal
that a certain degree of accuracy has been traded for efficiency.
On the other hand, no such tradeoff has been observed for the
cases reported in
Table 2. Nevertheless, our experience is that
the loss of accuracy due to the filtering process is generally
within an acceptable range. In the supplementary material, we
present in-depth discussions on parameter setting.
 |
CONCLUSION AND FUTURE WORK
|
|---|
In this paper, a web server designed for efficient analysis
of similar protein tertiary substructures is presented. In one
experiment presented in this paper, ProteMiner-SSM has been
exploited to investigate whether some proteins in the caspase
family contain a similar binding site to the structure of integrin

Vß3, and the experimental results are in conformity
with the biochemical hypothesis. However, the predictions made
by ProteMiner-SSM can only be regarded as interesting clues
that require more in-depth investigations to be conducted. The
experimental results also show that the filtering process presented
in this paper is able to speed up the analysis process by a
factor ranging from 3.15 to 9.37 times.
As the experiences from this research work have been encouraging, it is of interest to investigate how to extend the ideas presented in this paper to other protein analysis problems. Possible topics include protein function prediction, protein structural clustering and protein structural classification.
 |
SUPPLEMENTARY MATERIAL
|
|---|
Supplementary Material is available at NAR Online.
 |
APPENDIX A
|
|---|
The efficient kernel density estimation algorithm that forms
the basis of the filtering process employed in this paper treats
a given set of instances {
s1,
s2,...,
sn} in the vector space
as
n samples randomly taken from a probability distribution
and constructs an approximate probability density function of
the following form:
 | (A.1) |
where

is
a vector in the vector space and ||

si|| is the distance
between vectors

and
si. Accordingly, the task that the efficient
kernel density estimation algorithm carries out is to determine
the values of
wi and
i in
Equation A.1, so that

provides a good approximation of the original probability density
function
f. In fact, the kernel density estimation problem described
here can be transformed to a kernel smoothing problem, if we
employ the following equation to estimate the values of
f at
si,
i = 1, 2,...,
n:
 | (A.2) |
where
- m is the dimension of the vector space,
- R(si) is the distance between instance si and its k-th nearest neighbor,
- [R(si)m
m/2/
((m/2) + 1)] is the volume of a hypersphere with radius R(si) in an m-dimensional vector space,
(·) is the Gamma function (18) and
- k is a parameter to be set by the user.
As shown in Equation A.1, the efficient kernel density estimation algorithm places one spherical Gaussian function at each instance. For an instance si, the efficient kernel density estimation algorithm conducts a mathematical analysis on a synthesized data set. The synthesized data set is derived from two ideal assumptions and serves as an analogy of the distribution of the instances in the proximity of si. The first ideal assumption is that the sampling density in the proximity of si is sufficiently high and, therefore, the variation of the probability density function f at si and its neighboring instances approaches 0. The second ideal assumption is that the instances in the proximity of si are evenly spaced by a distance determined by the value of f(si). The details of the synthesized data set are elaborated in the following:
- Instance si is located at the origin and the neighboring instances are located at (h1
i, h2
i, ..., hm
i), where h1, h2, ..., hm are integers and
i is the average distance between two adjacent instances in the proximity of si. How
i is determined will be addressed later on.
- The values of the probability density function at the instances in the synthesized data set, including si, are all equal to f(si). The value of f(si) is estimated based on Equation A.2.
The efficient
kernel density estimation algorithm begins with an analysis
on the synthesized data set to figure out the values of
wi and
i that make function
gi(·) defined in the following virtually
a constant function equal to
f(
si),
 | (A.3) |
In other words, the objective is to make
gi(
x) a good approximator of
f(
x) in the proximity of
si. Let
x = (
x1,
x2, ...,
xm), then we have
It is shown in Appendix B that, with
i =
i,
Therefore, with
i =
i,
gi(
x) defined in
Equation A.3 is virtually a constant function. In fact, it can be shown
that, as long as
i 
0.45 ·
i,
gi(
x) is virtually a constant
function. Accordingly, the next thing to do is to find the appropriate
value of
wi that makes
gi(
x) approximately equal to
f(
si). We
have
where ß
=
i/
i. Therefore, we need to set
wi as follows:
If we employ
Equation A.2 to estimate
the value of
f(
si), then we have
 | (A.4) |
So far, we have found that if we set an appropriate ratio of
ß =
i/
i and set
wi according to
Equation A.4, we can
make
gi(
x) a good approximator of
f(
x) in the proximity of
si.
The only remaining issue is to derive a closed form of
i. In
this paper,
i is set to the average distance between two adjacent
instances in the proximity of sample
si. In an
m-dimensional
vector space, the number of uniformly distributed instances,
N, in a hypercube with volume
V can be computed by
N
V/
m, where

is the spacing between two adjacent instances. Accordingly,
we set
 | (A.5) |
Finally, with Equations A.4 and A.5 incorporated into Equation A.1, we obtain an approximate probability density function of the form shown in Equation 1 in the main text.
 |
APPENDIX B
|
|---|
Let

, where

and

are two coefficients
and
y is a real number. We have
Since
q(
y) is a symmetric and periodical function, if we want
to find the global maximum and minimum values of
q(
y), we only
need to analyze
q(
y) within the interval

. Let

and
y0 = (

/2)·(j/n) +

, where
n 
1 and 0
j
n 1 are
integers, and

. We have
Let us consider the special case with

=

. Then, we have
Let

. Since ( 1/
2)(
t
h

)exp( (
t h

)
2/2
2) is a decreasing function for
t 
[(
h 1)

, (
h + 1)

] and is an increasing function for
t 
[(
h 1)

, (
h + 1)

], we have
-
-
- for h
0 and h
1,
Therefore,
where
If

0, then we have for any
 | (B.1) |
On
the other hand, if

< 0, then we have for any
 | (B.2) |
Combining
Equations B.1 and
B.2, we obtain, for all

,
Similarly,
we can show that
where
If we set
n = 100 000,
then we have, with

=

, 2.506628261
q(
y)

2.506628288, for

.
 |
Notes
|
|---|
The online version of this article has been published under
an open access model. Users are entitled to use, reproduce,
disseminate, or display the open access version of this article
provided that: the original authorship is properly and fully
attributed; the Journal and Oxford University Press are attributed
as the original place of publication with the correct citation
details given; if an article is subsequently reproduced or disseminated
not in its entirety but only in part or as a derivative work
this must be clearly indicated.
 |
REFERENCES
|
|---|
- Krane,D.E. and Raymer,M.L. ( (2002) ) Fundamental Concepts of Bioinformatics, Benjamin Cummings.
- Lesk,A.M. ( (2002) ) Introduction to bioinformatics, Oxford University Press, New York.
- Atkins,P.W. and Depaula,J. ( (2001) ), Physical Chemistry, 7th edn. W H Freeman & Co.
- Bourne,P.E. and Weissig,H. (eds) ( (2003) ) Structural Bioinformatics, Wiley-Liss Inc., New Jersey.
- Boutonnet,N.S., Rooman,M.J., Ochagavia,M.E., Richelle,J. and Wodak,S.J. ( (1995) ) Optimal protein structure alignments by multiple linkage clustering: application to distantly related proteins. Protein Eng., , 8, , 647662.[ISI][Medline]
- Orengo,C. and Taylor,W. ( (1996) ) SSAP: sequential structure alignment program for protein structure comparison. Methods Enzymol., , 266, , 617635.[ISI][Medline]
- Pennec,X. and Ayache,N. ( (1994) ) An O(n2) algorithm for 3D substructure matching of proteins. In Califano,A., Rigoutsos,I. and Wolson,H.J. (eds), Shape and Pattern Matching in Computational Biology. Proceedings of the First Internatinal Workshop, Seattle, Plenum Publishing, pp. 2540.
- Pennec,X. and Ayache,N. ( (1998) ) A geometric algorithm to find small but highly similar 3D substructures in proteins. Bioinformatics, , 14, , 516522.[Abstract/Free Full Text]
- Berman,H.M., Westbrook,J., Feng,Z., Gilliland,G., Bhat,T.N., Weissig,H., Shindyalov,I.N. and Bourne,P.E. ( (2000) ) The Protein Data Bank. Nucleic Acids Res., , 28, , 235242.[Abstract/Free Full Text]
- Xiong,J.P., Stehle,T., Zhang,R., Joachimiak,A., Frech,M., Goodman,S.L. and Arnaout,M.A. ( (2002) ) Crystal structure of the extracellular segment of integrin alpha Vbeta3 in complex with an Arg-Gly-Asp ligand. Science, , 296, , 151155.[Abstract/Free Full Text]
- Edelsbrunner,H. and Mucke,E.P. ( (1994) ) Three-dimensional alpha shapes. ACM Trans. Graphics, , 13, (1), 4372.
- Haim,J.W. ( (1997) ) Geometric hashing: an overview. IEEE Comput. Sci. Eng., , 4, (4), 1021.
- Lamdan,Y. and Wolfson,H. ( (1988) ) Geometric Hashing: A General and Efficient Model-Based Recognition Scheme. Proceedings of International Conference on Computer Vision, pp. 238249.
- Oyang,Y.-J., Chang,D.T.-H., Chen,C.-Y. and Hwang,S.-C. ( (2003) ) Expediting Protein Structural Analysis with an Efficient Kernel Density Estimation Algorithm. Proceedings of IEEE 5th International Symposium on Multimedia Software Engineering, Taichung, Taiwan.
- Oyang,Y.-J., Hwang,S.-C., Ou,Y.-Y., Chen,C.-Y. and Chen,Z.-W. ( (2002) ) A Novel Learning Algorithm for Data Classification with Radial Basis Function Networks. Proceedings of 9th International Conference on Neural Information Processing (ICONIP-2002), Singapore.
- Tu,J.-T. ( (2003) ) Protein active site prediction by matching 3D structural data. Master thesis, Department of Computer Science and Information Engineering, National Taiwan University.
- Altschul,S.F. ( (1991) ) Amino acid substitution matrices from an information heoretic perspective. J. Mol. Biol., , 219, , 555565.[CrossRef][ISI][Medline]
- Artin,E. ( (1964) ) The Gamma Function, Holt, Rinehart and Winston, New York.

CiteULike
Connotea
Del.icio.us What's this?
This article has been cited by other articles:

|
 |

|
 |
 
P. F. Gherardini and M. Helmer-Citterich
Structure-based function prediction: approaches and applications
Brief Funct Genomic Proteomic,
July 3, 2008;
(2008)
eln030v1.
[Abstract]
[Full Text]
[PDF]
|
 |
|

|
 |

|
 |
 
E. S. C. Shih, R.-c. R. Gan, and M.-J. Hwang
OPAAS: a web server for optimal, permuted, and other alternative alignments of protein structures.
Nucleic Acids Res.,
July 1, 2006;
34(Web Server issue):
W95 - W98.
[Abstract]
[Full Text]
[PDF]
|
 |
|

|
 |

|
 |
 
D. T.-H. Chang, Y.-Z. Weng, J.-H. Lin, M.-J. Hwang, and Y.-J. Oyang
Protemot: prediction of protein binding sites with automatically extracted geometrical templates.
Nucleic Acids Res.,
July 1, 2006;
34(Web Server issue):
W303 - W309.
[Abstract]
[Full Text]
[PDF]
|
 |
|

|
 |

|
 |
 
D. T.-H. Chang, Y.-J. Oyang, and J.-H. Lin
MEDock: a web server for efficient prediction of ligand binding sites based on a novel optimization algorithm
Nucleic Acids Res.,
July 1, 2005;
33(suppl_2):
W233 - W238.
[Abstract]
[Full Text]
[PDF]
|
 |
|