Nucleic Acids Research Advance Access published online on October 11, 2007
Nucleic Acids Research, doi:10.1093/nar/gkm794
© 2007 The Author(s)
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
MegaMotifBase: a database of structural motifs in protein families and superfamilies
Ganesan Pugalenthi1, P. N. Suganthan1, R. Sowdhamini2,* and Saikat Chakrabarti3
1School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore 639798, Singapore, 2National Centre for Biological Sciences, UAS-GKVK Campus, Bellary Road, Bangalore 560 065, India and 3National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA
* To whom correspondence should be addressed. Tel: +91-80-23666250; Fax: +91-80-3636662; Email: mini{at}ncbs.res.in. Correspondence may also be addressed to Saikat Chakrabarti. Tel: 301-594-6474; Email: chakraba{at}mail.nlm.nih.gov
Received August 14, 2007. Revised September 16, 2007. Accepted September 17, 2007.
ABSTRACT
Structural motifs are important for the integrity of a protein
fold and can be employed to design and rationalize protein engineering
and folding experiments. Such conserved segments represent the
conserved core of a family or superfamily and can be crucial
for the recognition of potential new members in sequence and
structure databases. We present a database, MegaMotifBase, that
compiles a set of important structural segments or motifs for
protein structures. Motifs are recognized on the basis of both
sequence conservation and preservation of important structural
features such as amino acid preference, solvent accessibility,
secondary structural content, hydrogen-bonding pattern and residue
packing. This database provides 3D orientation patterns of the
identified motifs in terms of inter-motif distances and torsion
angles. Important applications of structural motifs are also
provided in several crucial areas such as similar sequence and
structure search, multiple sequence alignment and homology modeling.
MegaMotifBase can be a useful resource to gain knowledge about
the structure–function relationships of proteins. The database
can be accessed from the URL
http://caps.ncbs.res.in/MegaMotifbase/index.html
INTRODUCTION
Previous studies have pointed out that short segments of sequence and/or structural elements are required for retention of the fold and function of a protein (1,2). Sequence-based representations, however, are only an approximation to the underlying structural and functional information. Motifs identified at the 3D structure level therefore provide more significant and reliable information. We had earlier identified such structurally invariant segments, through careful manual intervention, for superfamilies whose proteins are distantly related but retain a similar fold and biological function (3,4).
Here, we present a database, MegaMotifBase, which provides a set of important structural segments or motifs for protein structures related at the family or superfamily level on the basis of conservation of both sequence and structural features. Motifs among structurally aligned proteins are recognized by the conservation of amino acid preference and solvent accessibility and are examined for the conservation of important structural features such as secondary structural content, hydrogen-bonding pattern and residue packing (3–5). These motifs may form the common structural core by maintaining a particular spatial pattern when compared across different proteins belonging to the same family or superfamily. Such motifs can also be employed to design and rationalize protein engineering and folding experiments. MegaMotifBase provides a comprehensive compilation of structural motifs identified through a completely automated method for a large number of families (1032) and superfamilies (1194) of proteins, in contrast to earlier efforts (3,4,6), which had poor coverage and required extensive manual supervision. This database can be accessed from the URL http://caps.ncbs.res.in/MegaMotifbase/index.html.
KEY FEATURES OF THE DATABASE
- Identification and collection of important conserved structural segments that are crucial for the integrity of the fold and can be projected as the minimum structural requirements for a new member to be considered as part of a pre-existing family or superfamily. It is also possible to use simple sequence conservation to recognize motifs.
- Interactive 3D views of the motifs on the superposed structures are displayed for better understanding and visualization.
- Spatial orientations of the motifs, in terms of inter-motif distances and torsion angles, are provided. This enables users to analyze the structural variations that occur even at the conserved core of the fold owing to poor sequence identity and evolutionary pressures.
- Options are provided for scanning multiple structural motifs, along with their spatial orientation, in a given query protein structure and for scanning a query structure against the entire structural motif database. This could be very useful in protein classification and in the assignment of family or superfamily relationships to newly solved protein structures of unknown function.
- Options are also provided to search for similar sequences by a multimotif-based database scanning procedure called SCANMOT (7). This scanning option provides an opportunity to identify distantly related sequences for each family or superfamily. The specificity of the search engine is increased by utilizing the inter-motif spacing and pairwise global alignment of the query and hits.
- The current version of the database provides options to align similar sequences with the query protein structure(s). It gives the user control over the alignment by accepting sequence–structure motif regions as input to the alignment program, to achieve a more structurally relevant and functionally useful alignment of protein sequences. The alignment algorithm keeps the locally conserved regions of the sequences fixed and aligns the rest by normal progressive alignment. The chances of global misalignment are thereby reduced and the possibility of obtaining a better overall alignment is increased (8).
- The database also allows users to build 3D models of similar protein sequences of unknown structure.
- The entire database of motifs and the alignments can be downloaded as a flat file for further use and analyses (Figure 1).
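The spatial orientation listed among the features above (inter-motif distances and torsion angles) can be illustrated with a minimal sketch. It assumes that each motif is reduced to the centroid of its Cα coordinates and that a torsion angle is taken over four such points; the exact definitions used by the database are not stated here, so this reduction and all function names are hypothetical.

```python
import math

def centroid(coords):
    """Mean position of a list of (x, y, z) coordinates (assumed motif representation)."""
    n = len(coords)
    return tuple(sum(c[i] for c in coords) / n for i in range(3))

def distance_matrix(points):
    """Symmetric matrix of Euclidean distances between points."""
    return [[math.dist(p, q) for q in points] for p in points]

def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))

def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def torsion(p0, p1, p2, p3):
    """Signed dihedral angle in degrees defined by four points."""
    b0 = _sub(p0, p1)
    b1 = _sub(p2, p1)
    b2 = _sub(p3, p2)
    norm = math.sqrt(_dot(b1, b1))
    b1 = tuple(x / norm for x in b1)
    # Project b0 and b2 onto the plane perpendicular to the central axis b1.
    v = _sub(b0, tuple(_dot(b0, b1) * x for x in b1))
    w = _sub(b2, tuple(_dot(b2, b1) * x for x in b1))
    return math.degrees(math.atan2(_dot(_cross(b1, v), w), _dot(v, w)))
```

Under these assumptions, a spatial orientation matrix for a family would be `distance_matrix([centroid(m) for m in motifs])`, where `motifs` is a list of per-motif Cα coordinate lists.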
CONTENTS OF THE DATABASE
MegaMotifBase compiles structural motifs at different levels
of the protein classification hierarchy.
Structural motifs at the family level
We have collected 1032 structural alignments of protein domains that are related at the family level from the HOMSTRAD (9) database. Motifs among structurally aligned proteins were recognized by the conservation of amino acid preference and solvent accessibility and examined for the conservation of other important structural features such as secondary structural content, hydrogen-bonding pattern and residue packing. Sequentially conserved regions were identified from the multiple alignments by examining the nature of amino acid exchanges using a standard 20 × 20 substitution matrix (10). Solvent accessibility was measured using the PSA program from the JOY4.0 suite (11). The SSTRUC and HBOND programs, which are also part of the JOY4.0 suite, were used to identify secondary structural positions and hydrogen bonds, respectively. Residue packing was measured in terms of the Ooi number (12), which gives the number of residues surrounding the Cα atom of each residue in a protein. Higher Ooi numbers indicate that the residue is in a well-packed environment.
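As a rough illustration of the packing measure, the sketch below counts, for each residue, the other Cα atoms lying within a fixed radius of its own Cα atom. The cutoff radius is not stated in this article; the 8 Å default, the exclusion of the residue itself from its own count and the function name are assumptions made for the example.

```python
import math

def ooi_numbers(ca_coords, radius=8.0):
    """Count neighbouring C-alpha atoms within `radius` angstroms of each residue.

    ca_coords: list of (x, y, z) C-alpha coordinates, one per residue.
    radius:    cutoff in angstroms (assumed value, not taken from the article).
    """
    counts = []
    for i, a in enumerate(ca_coords):
        neighbours = sum(
            1
            for j, b in enumerate(ca_coords)
            if j != i and math.dist(a, b) <= radius  # the residue itself is excluded
        )
        counts.append(neighbours)
    return counts
```

Residues buried in the core would be expected to receive higher counts than residues on an exposed loop.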
A structural feature is considered conserved at an alignment position if it is present in all, or all but one, of the members within the alignment. We found this condition to be more practical for families with poor structural representation. The structural motifs identified are mapped onto the alignment using different color codes and often represent the conserved core of the family. Motifs are ranked according to the extent of conservation of the structural features. The 3D orientation pattern of the structural motifs is conveyed through graphic displays and spatial orientation matrices.
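The "all or all but one" rule can be written down directly. The sketch below is a minimal illustration, assuming the feature has already been reduced to a true/false value per member and alignment position; grouping conserved positions into segments with a minimum length `min_len` is a hypothetical extra step, since no length threshold is given in the text.

```python
def conserved_positions(feature_table, max_missing=1):
    """Flag alignment positions where at most `max_missing` members lack the feature.

    feature_table[m][p] is True if member m shows the feature at position p.
    """
    n_members = len(feature_table)
    n_positions = len(feature_table[0])
    flags = []
    for p in range(n_positions):
        present = sum(1 for m in range(n_members) if feature_table[m][p])
        flags.append(present >= n_members - max_missing)
    return flags

def conserved_segments(flags, min_len=3):
    """Group consecutive conserved positions into (start, end) segments.

    min_len is an illustrative threshold, not a value from the article.
    """
    segments, start = [], None
    for i, flag in enumerate(list(flags) + [False]):  # sentinel closes a trailing run
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            if i - start >= min_len:
                segments.append((start, i - 1))
            start = None
    return segments
```

Setting `max_missing=0` gives the stricter rule in which every member must show the feature.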
Structural motifs at the superfamily level
Structural motifs for multimember superfamilies
A superfamily is a level of the classification hierarchy that groups proteins of different families having similar structure and function. These proteins might have very low sequence identities but retain the same fold through well-conserved secondary structural parts. Identification of structural motifs for superfamilies is therefore even more valuable, since the evolutionary divergence makes it impossible to derive conserved sequence or structural segments simply by residue conservation. We identified structural motifs for all the multimember superfamilies (628) available in the latest PASS2 and SCOP (version 1.63) databases (13,14), following the same protocol described above for proteins related at the family level.
Structural motifs for single member superfamilies
A majority of the entries in the protein structural databank are single member superfamilies, for which it is hard to derive structural motifs owing to the paucity of structural homologues. Important conserved segments for these 566 superfamilies [PASS2 database (13)] have been identified and compiled into MegaMotifBase. Conserved regions, recognized by permitted amino acid exchanges, were mapped onto the structure, and the content of various structural features (solvent accessibility, secondary structure content, hydrogen bonding and residue packing) was examined. Only the conserved segments with high structural feature content were projected as sequence–structural templates for the particular superfamily member. Interactive 3D displays of the templates on the 3D structure [in Chime® and RASMOL (15)] are provided for better understanding and visualization. A static image of the 3D structure is provided using MOLSCRIPT (16).
We also provide the application of sequence–structural templates in three different areas: multimotif-based sequence search, multiple sequence alignment and homology modeling. In each case, the inclusion of the sequence-structural templates can give rise to sensitive and accurate results. This emphasizes the need for inclusion of singletons to provide added value to the recognition of additional members, comparative modeling and in designing experiments.
APPLICATIONS
The availability of structural motifs is useful since these
conserved patterns form the common core by maintaining a particular
spatial orientation pattern. These motifs can also assist in
the identification of potential new members of an existing family
and/or superfamily. Scanning of multiple structural motifs,
along with their spatial orientation, in a given query protein
structure could be very useful in protein structural classification.
In the MegaMotifBase database, we also provide the application of
motifs in three other crucial areas: motif-based similar sequence
search, multiple sequence alignment and homology modeling.
In each case, the inclusion of the sequence–structural
motifs can give rise to sensitive and accurate results.
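The idea of scanning several motifs together with their spacing can be sketched as follows. This is not the SCANMOT implementation; it is a hypothetical illustration that assumes motifs are given as regular expressions, that spacing is measured from motif start to motif start, and that a fixed tolerance (in residues) is allowed around each expected spacing.

```python
import re

def scan_motifs(sequence, motifs, spacings, tolerance=2):
    """Find ordered placements of all motifs that respect the expected spacings.

    motifs:    list of regular-expression patterns (assumed representation).
    spacings:  expected start-to-start distance between consecutive motifs.
    tolerance: allowed deviation from each expected spacing (illustrative value).
    Returns a list of placements, each a list of motif start positions.
    """
    # A lookahead is used so that overlapping occurrences are also reported.
    starts = [
        [m.start() for m in re.finditer("(?=" + pattern + ")", sequence)]
        for pattern in motifs
    ]
    placements = [[s] for s in starts[0]]
    for k in range(1, len(motifs)):
        extended = []
        for placement in placements:
            for s in starts[k]:
                if abs((s - placement[-1]) - spacings[k - 1]) <= tolerance:
                    extended.append(placement + [s])
        placements = extended
    return placements
```

A sequence that contains every motif but in the wrong arrangement yields no placement, which is how a spacing constraint can raise the specificity of a search.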
ACKNOWLEDGEMENTS
G.P. and P.N.S. acknowledge the financial support offered by
A*Star (Agency for Science, Technology and Research). R.S. acknowledges
Wellcome Trust, UK and National Centre for Biological Sciences
(TIFR) for financial and infrastructural support. S.C. acknowledges
Intramural Research Program of the National Library of Medicine
at NIH/DHHS. Funding to pay the Open Access publication charges
for this article was provided by Wellcome Trust, U.K.
Conflict of interest statement. None declared.
REFERENCES
1. Farber GK, Petsko GA. The evolution of alpha/beta barrel enzymes. Trends Biochem. Sci. (1990) 15:228–234.
2. Kannan N, Selvaraj S, Gromiha MM, Vishveshwara S. Clusters in alpha/beta barrel proteins: implications for protein structure, function, and folding: a graph theoretical approach. Proteins (2001) 43:103–112.
3. Chakrabarti S, Venkatramanan K, Sowdhamini R. SMoS: a database of structural motifs of protein superfamilies. Protein Eng. (2003) 16:791–793.
4. Chakrabarti S, Sowdhamini R. Regions of minimal structural variation among members of protein domain superfamilies: application to remote homology detection and modeling using distant relationships. FEBS Lett. (2004) 569:31–36.
5. Pugalenthi G, Suganthan PN, Sowdhamini R, Chakrabarti S. SMotif: a server for structural motifs in proteins. Bioinformatics (2007) 23:637–638.
6. Chakrabarti S, Manohari G, Pugalenthi G, Sowdhamini R. SSToSS - sequence-structural templates of single-member superfamilies. In Silico Biol. (2006) 6:311–319.
7. Chakrabarti S, Anand AP, Bhardwaj N, Pugalenthi G, Sowdhamini R. SCANMOT: searching for similar sequences using a simultaneous scan of multiple sequence motifs. Nucleic Acids Res. (2005) 33:W274–W276.
8. Chakrabarti S, Bhardwaj N, Anand PA, Sowdhamini R. Improvement of alignment accuracy utilizing sequentially conserved motifs. BMC Bioinformatics (2004) 5:167–178.
9. Mizuguchi K, Deane CM, Blundell TL, Overington JP. HOMSTRAD: a database of protein structure alignments for homologous families. Protein Sci. (1998) 7:2469–2471.
10. Johnson MS, Overington JP. A structural basis for sequence comparisons. An evaluation of scoring methodologies. J. Mol. Biol. (1993) 233:716–738.
11. Mizuguchi K, Deane CM, Blundell TL, Johnson MS, Overington JP. JOY: protein sequence-structure representation and analysis. Bioinformatics (1998) 14:617–623.
12. Nishikawa K, Ooi TJ. Radial locations of amino acid residues in a globular protein: correlation with the sequence. J. Biochem. (1986) 100:1043–1047.
13. Bhaduri A, Pugalenthi G, Sowdhamini R. PASS2: an automated database of protein alignments organised as structural superfamilies. BMC Bioinformatics (2004) 5:35–41.
14. Murzin AG, Brenner SE, Hubbard T, Chothia C. SCOP: a structural classification of proteins database for the investigation of sequences and structures. J. Mol. Biol. (1995) 247:536–540.
15. Sayle A, Milner-White EJ. RASMOL: biomolecular graphics for all. Trends Biochem. Sci. (1995) 20:374–376.
16. Kraulis PJ. MOLSCRIPT: a program to produce both detailed and schematic plots of protein structures. J. Appl. Crystallogr. (1991) 24:946–950.
