Nucleic Acids Research Advance Access published online on November 9, 2009
Nucleic Acids Research, doi:10.1093/nar/gkp944
© The Author(s) 2009. Published by Oxford University Press.
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.5/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
GWIDD: Genome-wide protein docking database
Petras J. Kundrotas1,
Zhengwei Zhu1 and
Ilya A. Vakser1,2,*
1Center for Bioinformatics and 2Department of Molecular Biosciences, 2030 Becker Drive, The University of Kansas, Lawrence, KS 66047, USA
*To whom correspondence should be addressed. Tel: +1 785 584 1057; Fax: +1 785 864 5558; Email: vakser{at}ku.edu
Received August 10, 2009. Revised September 27, 2009. Accepted October 12, 2009.
 |
ABSTRACT
|
|---|
Structural information on interacting proteins is important
for understanding life processes at the molecular level. Genome-wide
docking database is an integrated resource for structural studies
of protein–protein interactions on the genome scale, which
combines the available experimental data with models obtained
by docking techniques. Current database version (August 2009)
contains 25 559 experimental and modeled 3D structures for 771
organisms spanned over the entire universe of life from viruses
to humans. Data are organized in a relational database with
user-friendly search interface allowing exploration of the database
content by a number of parameters. Search results can be interactively
previewed and downloaded as PDB-formatted files, along with
the information relevant to the specified interactions. The
resource is freely available at
http://gwidd.bioinformatics.ku.edu.
 |
INTRODUCTION
|
|---|
Function of proteins in the living cell is determined by their
ability to interact with other biologically relevant molecules
(other proteins, DNA, RNA, small ligand, etc.). Thus understanding
mechanisms of these interactions is critically important for
studying life processes at the molecular level. Genome sequencing
provided vast amount of information on proteins, spanning the
entire universe of life from viruses to the highest eukaryotic
organisms. In the post-genomic era, the efforts focus on the
function assignment of the sequenced proteins based on their
three-dimensional (3D) structures and/or participation in interactions.
Because of the limitations of the experimental techniques for
structural characterization, computational methods play a vital
role (
1).
Success in recreating maps of interactions for specific organisms and/or specific biochemical pathways emphasize the need for large-scale modeling efforts to deliver 3D structures of the protein complexes. Computational methods for structural modeling of the protein–protein interactions (PPIs) historically started with ab initio (or template free) methods based on shape complementarity and were later supplemented by the constraints derived from statistical analysis of properties of known protein complexes or from the experimentally acquired additional biochemical/biophysical knowledge (2). Most of the existing docking servers (3) employ constrained-based template-free approach. Despite the significant progress in development of the template-free algorithms, their accuracy in the high-throughput applications is limited.
Accumulation of experimental data in the last decade have caused paradigm shift in 3D modeling of individual proteins from ab initio to template-based techniques. A similar trend is underway in structural modeling of protein complexes (protein docking). Recently, several groups assessed quality of the models produced by the homology/threading docking techniques where a protein complex is modeled based on similarity to another protein complex with the known structure (4–8). It was demonstrated that the majority of the homology-docking models are of acceptable and medium quality, according to the CAPRI criteria (3). It was estimated that the homology docking can account for a significant part (15–20%) of known PPI (7). Structural alignment techniques were also benchmarked on various sets of protein complexes (9,10).
Success in developing the high-throughput modeling techniques makes it feasible to create a long-needed comprehensive resource, which would reflect large-scale efforts in structural modeling of known protein complexes. Genome-wide docking database (GWIDD) provides annotated collection of experimental and modeled 3D structures of protein–protein complexes from the entire universe of life spanning from viruses to humans. The database provides user-friendly interface for searching and browsing database content and downloading experimental and modeled structures of protein complexes.
 |
DATABASE CONTENT AND DESCRIPTION
|
|---|
Source of PP1s data
PPIs are imported to GWIDD from external sources specialized
in collecting and curating PPI. Currently they include BIND
(
http://www.bind.ca) (
11) and DIP (
http://dip.doe-mbi.ucla.edu)
(
12,
13) databases. These databases were chosen because their
content is not restricted to a single genome or group of genomes
like in many other PPI databases [e.g. different flavors of
MINT (
14) or MIPS (
15)]. They are also regularly updated providing
up-to-date pool of the initial data. The interactions are obtained
through either high-throughput discovery methods or small-scale
experiments and thus are of diverse reliability. However, at
the current stage, evaluation of credibility of the PPI data
is outside the scope of GWIDD.
Current content of the database
The ultimate goal of the GWIDD resource is to provide the 3D structures for all known PPI. At the current stage, the following steps are taken toward this goal. First, if an interaction is found in protein data bank (PDB), this structure is used and no modeling is performed. Otherwise, a search for a pair of homologous sequences from a complex with known structure is performed and the model is build by homology docking (6,7). We used earlier-described criteria for statistical significance of the sequence alignments (7) with an additional requirement that both alignments contain at least 80% of the target sequences. In the future, for the interactions not covered by these two steps, we will use other docking methods (e.g. structural alignment, template free docking), which will be incorporated in the upcoming GWIDD releases. However, even the current limited modeling approach provides structures for 14 635 PPI, which together with the available non-redundant X-ray structures (10 924) constitutes >20% of the currently known PPI. Summary of the GWIDD content is provided in Table 1. As of August 2009, GWIDD contained 126 897 binary interactions, involving 43 976 proteins from 771 different organisms spanning the entire universe of life (Table 1). Among those, 6079 entries are either cross-organism interactions or do not have organism annotation in the source data. Thus they are not present in Table 1, although will appear in search results. The distribution of available structures (X-ray and modeled) is shown in Figure 1 for 10 organisms with the largest numbers of structured GWIDD entries. The database is automatically updated every half year.

View larger version (31K):
[in this window]
[in a new window]
[Download PowerPoint slide]
|
Figure 1. Number of experimental structures (dark gray bars) and structures modeled by homology docking (light gray bars) for 10 organisms with the largest structural coverage in GWIDD. Numbers at the bars indicate the total amount of non-identical interactions, including those with no structure, in DIP and BIND databases.
|
|
Implementation of the database and its user interface
The data from the external source databases have different formats
and different levels of details. Thus such data are unified
into a single dataset of PPI, removing redundancy and retaining
common data fields for all the sources. Due to the large amount
of data and complex data dependency as well as complex query
requirement, all interaction data are stored in a relational
database, except for large files, such as PDB ones, which are
stored directly in the file system and are linked from the relational
database. Implementation of the web interface is based on LAPP
(Linux-Apache-PostgreSQL-PHP) software stack. Web user interface
is built using PHP and jQuery library, where PHP is for web
presentation and logic as well as back-end database access.
jQuery is responsible for AJAX and other JavaScript-based dynamic
features. Visualization of protein structures is implemented
utilizing Jmol (
http://www.jmol.org) technology. Homology docking
was performed by NEST (
16), BLAST (
17) and in-house profile-to-profile
alignment program used previously for the benchmarking of homology
docking (
7). The above parts are joined by a set of Python scripts.
User interface description
The database can be freely accessed at http://gwidd.bioinformatics.ku.edu. The default option offered to users is search of the database by keywords related to a single interaction partner (Protein A part in Figure 2A). Other search options are available by clicking tabs Sequence (explicit input or upload of sequence in the FASTA format) or Structure (upload of a PDB-format file). When searching by keywords, user can either enter any keyword in the protein description (name of organism, cellular location, biological function, etc.) or choose from the series of drop-down menus containing lists of all organisms currently in GWIDD. By repeating the selection with the box Add another organism to the list checked, user can choose several organisms. When the box is unchecked, the search will clear the list of previously selected organisms. Also, in each submenu, user can select all listed organisms by a single click on the top Select All position. An option to search by standard taxonomy ID with link to taxonomy database http://www.uniprot.org is also provided for convenience. Search results for the Keyword tab can be, for example, all PPI related to a certain pathway (defined by the keyword) or all interactions within certain organism or group of organisms. Search results for the Sequence and Structure tabs contain all interactions with the input sequence as one of the interaction partners (in the case of input PDB file the sequence extracted from the SEQRES tags or, if the SEQRES part is not available, from ATOM tags for the C
atoms). The amino acid sequences from different sources can differ in length even for the same protein (e.g. due to unresolved residues in the X-ray structure). Thus advanced options are provided in the sequence search parts. An example of search by organism and its results is shown in Figure 2.
If information related to the other interaction partner is also
known, user can enable the second part of the search interface
(Protein B, see
Figure 2A) by checking the corresponding
box and input the information similarly to Protein A.
In addition, search results can be filtered by the availability
of different types of GWIDD entries (experimental structures,
modeled structures or interactions with no structures). Online
help is provided in pop-up windows (question marks inside blue
circles, see
Figure 2). Search results screen (
Figure 2B) displays
all interactions in the database satisfying the input search
criteria in the form of expandable list of GWIDD interaction
IDs with minimum additional information. The expanded item in
the list contains the name and the GWIDD IDs of the interacting
partners along with information on the type of 3D structure
available for this interaction (if applicable). For the homology-docking
models, the alignments used to build the model are provided
and the model quality is assessed by the sequence identity criteria
(
5). For the available structures, links are provided to download
the PDB-format file along with the text file containing relevant
information, as well as to the visualization screen where the
structure is displayed in colored-by-chain space-filled interactive
representation.
 |
COMPARISON TO OTHER EXISTING RESOURCES
|
|---|
There are several resources available that are similar in spirit
(genome-wide approach to PPI) to the GWIDD resource. Michigan
molecular interactions (MiMIs), database (
http://mimi.ncibi.org)
(
18) provides one cohesive view of molecules found in several
popular interaction databases, including BIND, HPRD, IntAct,
GRID and others, with complementary or conflicting data among
the sites highlighted. POINT (
http://point.bioinformatics.tw)
(
19) is a functional database for prediction of the human protein–protein
interactome based on available orthologous interactome datasets
with the emphasis on extraction of mouse, fruit fly, worm and
yeast PPI datasets from DIP, followed by their conversion to
predicted human interactome. 3D-GENOMICS (
http://www.sbg.bio.ic.ac.uk/3dgenomics)
(
20) provides structural annotations for proteins from sequenced
genomes and in August 2003 included data for 93 proteomes. NCBI
Inferred Biomolecular Interactions Server (IBIS,
http://www.ncbi.nlm.nih.gov/Structure/ibis/ibis.cgi)
reports physical interactions observed in experimentally determined
structures for sequences homologous to the input amino acid
sequence, thus inferring interacting partners and binding sites.
However, none of the above resources provide single integrated
and searchable pool of experimental and modeled 3D structures
for all genomes for which at least one PPI is annotated. Recently
developed ProtInfo PPC server (
http://protinfo.compbio.washington.edu/ppc)
(
21) provides model structures for users supplied sequences,
but lacks the annotated database of 3D structures.
 |
FUTURE DEVELOPMENT
|
|---|
The major direction in the future development of GWIDD is expanding
the pool of available structures modeled by other modeling techniques,
such as docking by structural alignment (to be submitted) and
template-free docking by GRAMM methodology (
22–24). To
assess the applicability of these methods to the high-throughput,
genome-wide modeling, large-scale benchmarking is currently
underway. New sources of PPI will be incorporated as they become
available.
 |
FUNDING
|
|---|
Funding for open access charge: National Institutes of Health
(R01 GM074255
[GenBank]
).
Conflict of interest statement. None declared.
 |
ACKNOWLEDGEMENTS
|
|---|
Andrey Tovchigrechko and Tatiana Baronova made important contributions
to GWIDD project at the earlier stages of development.
 |
REFERENCES
|
|---|
- Russell RB, Alber F, Aloy P, Davis FP, Korkin D, Pichaud M, Topf M, Sali A. A structural perspective on protein–protein interactions. Curr. Opin. Struct. Biol. (2004) 14:313–324.[CrossRef][Web of Science][Medline]
- Vakser IA, Kundrotas P. Predicting 3D structures of protein-protein complexes. Curr. Pharm. Biotech. (2008) 9:57–66.[CrossRef]
- Lensink MF, Mendez R, Wodak SJ. Docking and scoring protein complexes: CAPRI 3rd edn. Proteins (2007) 69:704–718.[CrossRef][Web of Science][Medline]
- Aloy P, Pichaud M, Russell RB. Protein complexes: Structure prediction challenges for the 21st century Curr. Opin. Struct. Biol. (2005) 15:15–22.[CrossRef]
- Aloy P, Russell RB. Interrogating protein interaction networks through structural biology. Proc. Natl Acad. Sci. USA (2002) 99:5896–5901.[Abstract/Free Full Text]
- Kundrotas PJ, Alexov E. Predicting 3D structures of transient protein-protein complexes by homology. Bioch. Biophys. Acta. (2006) 1764:1498–1511.[Medline]
- Kundrotas PJ, Lensink MF, Alexov E. Homology-based modeling of 3D structures of protein-protein complexes using alignments of modified sequence profiles. Int. J. Biol. Macromol. (2008) 43:198–208.[CrossRef][Web of Science][Medline]
- Lu L, Lu H, Skolnick J. MULTIPROSPECTOR: An algorithm for the prediction of protein-protein interactions by multimeric threading. Proteins (2002) 49:350–364.[CrossRef][Web of Science][Medline]
- Gunther S, May P, Hoppe A, Frommel C, Preissner R. Docking without docking: ISEARCH - prediction of interactions using known interfaces. Proteins (2007) 69:839–844.[CrossRef][Web of Science][Medline]
- Launay G, Simonson T. Homology modelling of protein-protein complexes: A simple method and its possibilities and limitations. BMC Bioinformatics (2008) 9:427.[CrossRef][Medline]
- Alfarano C, Andrade CE, Anthony K, Bahroos N, Bajec M, Bantoft K, Betel D, Bobechko B, Boutilier K, Burgess Eea. The Biomolecular Interaction Network Database and related tools 2005 update. Nucleic Acids Res. (2005) 33:D418–D424.[Abstract/Free Full Text]
- Salwinski L, Miller CS, Smith AJ, Pettit FK, Bowie JU, Eisenberg D. The Database of Interacting Proteins: 2004 update. Nucleic Acids Res. (2004) 32:D449–D451.[Abstract/Free Full Text]
- Xenarios I, Rice DW, Salwinski L, Baron NK, Marcotte EM, Eisenberg D. DIP: The Database of Interacting Proteins. Nucleic Acids Res. (2000) 28:289–291.[Abstract/Free Full Text]
- Zanzoni A, Montecchi-Palazzi L, Quondam M, Ausiello G, Helmer-Citterich M, Cesareni G. MINT: A Molecular INTeraction database. FEBS Lett. (2002) 513:135–140.[CrossRef][Web of Science][Medline]
- Pagel P, Kovac S, Oesterheld M, Brauner B, Dunger-Kaltenbach I, Frishman G, Montrone C, Mark P, Stümpflen V, Mewes HW, et al. The MIPS mammalian protein–protein interaction database. Bioinformatics (2005) 21:832–834.[Abstract/Free Full Text]
- Petrey D, Xiang ZX, Tang CL, Xie L, Gimpelev M, Mitros T, Soto CS, Goldsmith-Fischman S, Kernytsky A, Schlessinger A, et al. Using multiple structure alignments, fast model building, and energetic analysis in fold recognition and homology modeling. Proteins (2003) 53:430–435.[CrossRef][Web of Science][Medline]
- Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ. Gapped BLAST and PSI-BLAST: A new generation of database programs. Nucleic Acids Res. (1997) 25:3389–3402.[Abstract/Free Full Text]
- Tarcea VG, Weymouth T, Ade A, Bookvich A, Gao J, Mahavisno V, Wright Z, Chapman A, Jayapandian M, Ozgur A, et al. Michigan molecular interactions r2: From interacting proteins to pathways. Nucleic Acids Res. (2009) 37:D642–D646.[Abstract/Free Full Text]
- Huang TW, Tien AC, Huang WS, Lee YCG, Peng CL, Huei-Hun Tseng HH, Kao CY, Huang CYF. POINT: A database for the prediction of protein–protein interactions based on the orthologous interactome. Bioinformatics (2004) 20:3273–3276.[Abstract/Free Full Text]
- Fleming K, Muller A, MacCallum RM, Sternberg MJE. 3D-GENOMICS: a database to compare structural and functional annotations of proteins between sequenced genomes. Nucleic Acids Res. (2004) 32:D245–D250.[Abstract/Free Full Text]
- Kittichotirat W, Guerquin M, Bumgarner RE, Samudrala R. Protinfo PPC: A web server for atomic level prediction of protein complexes. Nucleic Acids Res. (2009) 37:W519–W525.[Abstract/Free Full Text]
- Katchalski-Katzir E, Shariv I, Eisenstein M, Friesem AA, Aflalo C, Vakser IA. Molecular surface recognition: Determination of geometric fit between proteins and their ligands by correlation techniques. Proc. Natl Acad. Sci. USA (1992) 89:2195–2199.[Abstract/Free Full Text]
- Vakser IA, Matar OG, Lam CF. A systematic study of low-resolution recognition in protein-protein complexes. Proc. Natl Acad. Sci. USA (1999) 96:8477–8482.[Abstract/Free Full Text]
- Tovchigrechko A, Wells CA, Vakser IA. Docking of protein models. Protein Sci. (2002) 11:1888–1896.[CrossRef][Web of Science][Medline]

CiteULike
Connotea
Del.icio.us What's this?