Nucleic Acids Research Advance Access originally published online on June 26, 2008
Nucleic Acids Research 2008 36(14):e88; doi:10.1093/nar/gkn386
Nucleic Acids Research, 2008, Vol. 36, No. 14 e88
© 2008 The Author(s)
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
Methods Online
A protein–DNA docking benchmark
Bijvoet Center for Biomolecular Research, Science Faculty, Utrecht University, The Netherlands
*To whom correspondence should be addressed. Tel: +31 0 30 2533859; Fax: +31 0 30 2537623; Email: a.m.j.j.bonvin{at}uu.nl
Received February 6, 2008. Revised June 2, 2008. Accepted June 3, 2008.
| ABSTRACT |
|---|
We present a protein–DNA docking benchmark containing 47 unbound–unbound test cases, of which 13 are classified as easy, 22 as intermediate and 12 as difficult. The difficult cases show considerable structural rearrangement upon complex formation. DNA-specific features such as flipped-out bases and base modifications are included. The benchmark covers all major groups of DNA-binding proteins according to the classification of Luscombe et al., except for the zipper-type group. The variety of test cases makes this non-redundant benchmark a useful tool for the comparison and development of protein–DNA docking methods. The benchmark is freely available for download from the internet.
| INTRODUCTION |
|---|
Biomolecular docking has become a mature discipline within structural biology (1). Docking aims at predicting the structure of a complex given the 3D structures of its components. The field of protein–protein docking in particular has seen extensive progress over the last decade, as witnessed by recent results of CAPRI (Critical Assessment of Predicted Interactions), a community-wide blind docking experiment (2). For protein–DNA docking, however, progress lags behind. The scarcity of information for a proper identification of interaction surfaces on DNA, together with the inherent flexibility of DNA, has hampered the development of effective docking methods. The field of protein–DNA docking is, however, receiving increased attention, and efforts are being put into the development of docking methods that address the above-mentioned limitations (3). Considering the importance of biomolecular interactions in systems biology, gaining insight into the biochemistry of recognition and gene expression is highly relevant (4). New developments in protein–DNA docking approaches are therefore expected.
A set of well-defined test cases that form a common ground for validating and comparing the different docking methods would facilitate the development of effective protein–DNA docking methods. Such a benchmark should contain the native structures of both protein and DNA in their unbound form together with the reference structure of the complex.
We have constructed a benchmark of 47 protein–DNA test cases in a manner similar to that used for protein–protein docking (5). The benchmark covers all major groups of protein–DNA complexes according to the classification proposed by Luscombe et al. (6) except for the zipper-type group. It contains a variety of challenging systems in terms of the size of the interaction interface, the number of individual components present in the complex and the conformational changes that the unbound components undergo upon complex formation. Its diversity makes it a useful comparison tool for different docking methods, as their performance may vary depending on the type of complex. This benchmark should benefit the entire docking community and offer a starting point for the improvement of various algorithms.
| MATERIALS AND METHODS |
|---|
RCSB Protein Data Bank (PDB) query
A non-redundant benchmark was generated from structures deposited in the RCSB PDB (7). The PDB (as of September 2007) was queried for all entries containing X-ray crystallographic structures with a resolution better than 3.0 Å that contain both protein and DNA. Complexes containing DNA shorter than 8 bp, and complexes in which the protein carries mutations in the core and/or interface region, were removed.
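The selection criteria above can be sketched as a simple filter. This is only an illustration: the `ComplexEntry` record, its field names and the example identifiers are hypothetical and do not correspond to any PDB query interface.

```python
from dataclasses import dataclass


@dataclass
class ComplexEntry:
    """Hypothetical summary of one protein-DNA complex entry."""
    entry_id: str
    method: str                 # e.g. "X-RAY" or "NMR"
    resolution: float           # in Angstrom
    dna_length_bp: int
    has_core_or_interface_mutation: bool


def passes_complex_filter(entry, max_resolution=3.0, min_dna_bp=8):
    """Apply the criteria from the text to a single complex."""
    return (
        entry.method == "X-RAY"
        and entry.resolution < max_resolution      # "better than 3.0 A"
        and entry.dna_length_bp >= min_dna_bp      # drop DNA shorter than 8 bp
        and not entry.has_core_or_interface_mutation
    )


candidates = [
    ComplexEntry("demo1", "X-RAY", 2.1, 12, False),
    ComplexEntry("demo2", "X-RAY", 3.4, 12, False),   # resolution too low
    ComplexEntry("demo3", "X-RAY", 2.0, 6, False),    # DNA too short
    ComplexEntry("demo4", "X-RAY", 1.8, 10, True),    # mutated interface
]
selected = [c.entry_id for c in candidates if passes_complex_filter(c)]
```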
For the resulting complexes, the PDB was queried for unbound protein entries. Structures resolved using NMR or X-ray crystallography with a resolution better than 3.0 Å were retrieved. Structures with a sequence similarity larger than or equal to 90% were removed. Structures were regarded as redundant if the raw alignment score was positive, >80% of their sequences were aligned and >60% of the sequences were identical. Sequence alignments were performed using the Needleman–Wunsch algorithm as implemented in the LSQMAN software package (8) with a gap penalty of 5.
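As a rough illustration of the redundancy test, the sketch below runs a textbook Needleman–Wunsch global alignment with a linear gap penalty of 5 and then applies the three thresholds from the text. It does not reproduce LSQMAN: the match and mismatch scores (+1 and -1) and the choice to express the aligned and identical fractions relative to the shorter sequence are assumptions made for this example only.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=5):
    """Global alignment; returns (raw score, aligned pairs, identical pairs).

    match/mismatch are assumed values; only the gap penalty of 5
    comes from the text.
    """
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = -gap * i
    for j in range(1, m + 1):
        score[0][j] = -gap * j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(diag, score[i - 1][j] - gap, score[i][j - 1] - gap)

    # Trace back to count aligned (non-gap) and identical positions.
    i, j = n, m
    aligned = identical = 0
    while i > 0 and j > 0:
        diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
        if score[i][j] == diag:
            aligned += 1
            identical += 1 if a[i - 1] == b[j - 1] else 0
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] - gap:
            i -= 1
        else:
            j -= 1
    return score[n][m], aligned, identical


def is_redundant(seq_a, seq_b):
    """Positive raw score, >80% aligned and >60% identical (assumed denominators)."""
    if not seq_a or not seq_b:
        return False
    raw_score, aligned, identical = needleman_wunsch(seq_a, seq_b)
    shorter = min(len(seq_a), len(seq_b))
    return raw_score > 0 and aligned / shorter > 0.80 and identical / shorter > 0.60
```

With a real substitution matrix the raw scores would differ, so the thresholds here should be read as a demonstration of the logic only.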
Generation of unbound DNA models
Models for unbound DNA were generated using the DNA analysis and rebuilding program 3DNA (9) with the base-pair sequence of the DNA in the reference complex. The models were generated in canonical B-DNA conformation (fiber model 4) using the nucleotide building blocks as determined in the fiber diffraction studies of Chandrasekaran and Arnott (10). Structures with overhanging base-pairs were converted to all-paired structures by adding their Watson–Crick counterparts.
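The Watson–Crick pairing step can be illustrated with a few lines of Python. This sketch only derives the partner strand for a given sequence; it is not part of 3DNA, and the function name is hypothetical. Building the actual B-DNA coordinates is left to the modelling program.

```python
# Canonical Watson-Crick partners for DNA bases.
WATSON_CRICK = {"A": "T", "T": "A", "G": "C", "C": "G"}


def complementary_strand(sequence):
    """Return the partner strand, written 5'->3' (the reverse complement)."""
    return "".join(WATSON_CRICK[base] for base in reversed(sequence.upper()))
```

For example, `complementary_strand("GATTACA")` gives `"TGTAATC"`, so every base of an overhang receives its paired counterpart once the full-length sequence is used as input.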
Structure post-processing
The residue numbering of the bound and unbound components was matched to allow for easy comparison. The DNA was assigned one chain identifier and renumbered. Structures of unbound proteins that contain more than one chain were assigned a single chain identifier instead of being separated into their individual components; residues were renumbered to avoid overlap in numbering. Atom and residue names were matched to the topallhdg5.3.pro (11) and dna-rna_allatom.top topology files (12) naming for direct use in HADDOCK (13).
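A minimal sketch of the chain merging and renumbering step is given below. It is not the script used to build the benchmark; it assumes standard fixed-column PDB records (chain identifier in column 22, residue number in columns 23 to 26, insertion code in column 27), and the helper names are hypothetical.

```python
def merge_chains(pdb_lines, new_chain="A", start=1):
    """Assign one chain ID and renumber residues consecutively."""
    merged = []
    previous_key = None
    number = start - 1
    for line in pdb_lines:
        if line.startswith(("ATOM", "HETATM")):
            # A new residue starts when chain, number or insertion code changes.
            key = (line[21], line[22:27])
            if key != previous_key:
                previous_key = key
                number += 1
            # New chain ID, right-aligned residue number, blank insertion code.
            line = f"{line[:21]}{new_chain}{number:>4d} {line[27:]}"
        merged.append(line)
    return merged


def _atom(serial, name, resname, chain, resseq):
    """Build a minimal ATOM record with chain and residue columns in place."""
    return (f"ATOM  {serial:>5d} {name:<4s} {resname:>3s} {chain}{resseq:>4d}"
            f"    {0.0:8.3f}{0.0:8.3f}{0.0:8.3f}")


# Two chains that both start at residue 5, i.e. overlapping numbering.
example = [
    _atom(1, "N", "MET", "A", 5),
    _atom(2, "CA", "MET", "A", 5),
    _atom(3, "N", "GLY", "B", 5),
]
merged = merge_chains(example)
```

After merging, all three atoms carry chain identifier A and the residues are numbered 1 and 2, so the overlap in numbering between the original chains is gone.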
Analysis
The size of the interaction interface between protein and DNA is expressed in terms of the buried surface area (BSA, Table 1) of the DNA in the complex. The BSA was calculated using NACCESS (Hubbard, S. J., Thornton, J. M. 1993) with a probe radius of 1.4 Å. The conformational changes between the unbound and the bound states are expressed in terms of the root mean square deviation (RMSD) calculated using ProFit (Martin, A.C.R., http://www.bioinf.org.uk/software/profit/). These were calculated in three different ways:
- Conformational change of the protein–DNA interface was calculated by superimposition of all Cα and phosphate atoms at the interface. Residues belonging to the interface are identified as those having atoms within 5.0 Å intermolecular distance of one another (RMSD Inter., Table 1). The interface RMSD values were used to classify the test cases as easy, intermediate or difficult (see below).
- As the conformational change in the DNA tends to affect the complete molecule, the RMSD of the DNA was calculated by superimposition of all phosphate atoms (RMSD DNA, Table 1).
- Conformational changes in the protein, such as global domain reorientations and flexible segments not located at the interface, are represented by means of the RMSD calculated over all Cα atoms of the protein (RMSD Prot, Table 1).
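The interface definition and the RMSD measure described in the list above can be sketched as follows. This is an illustration only: the input format (residue label plus coordinate tuple) is a hypothetical choice, and the `rmsd` function assumes the two coordinate sets have already been superimposed and paired atom by atom, which is the part ProFit handles in the actual analysis.

```python
import math


def interface_residues(protein_atoms, dna_atoms, cutoff=5.0):
    """Residues with any protein-DNA atom pair within `cutoff` Angstrom.

    Each atom is given as (residue_label, (x, y, z)).
    """
    dna_atoms = list(dna_atoms)   # iterated once per protein atom
    protein_hits, dna_hits = set(), set()
    for p_res, p_xyz in protein_atoms:
        for d_res, d_xyz in dna_atoms:
            if math.dist(p_xyz, d_xyz) <= cutoff:
                protein_hits.add(p_res)
                dna_hits.add(d_res)
    return protein_hits, dna_hits


def rmsd(coords_a, coords_b):
    """RMSD over paired coordinates that are already superimposed."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and equal in length")
    total = sum(math.dist(a, b) ** 2 for a, b in zip(coords_a, coords_b))
    return math.sqrt(total / len(coords_a))
```

The double loop is quadratic in the number of atoms, which is acceptable for a single complex; a spatial grid or k-d tree would be the usual choice for larger sets.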
| COMPOSITION OF THE BENCHMARK |
|---|
The protein–DNA benchmark version 1.0 (Table 1) contains 47 test cases. For all test cases, the unbound structures of both protein and DNA are available. In addition, the reference complexes have been separated into their DNA and protein bound forms. This allows the performance of a docking method to be evaluated for bound–bound, bound–unbound and unbound–unbound cases. Although the reference structure is always from X-ray crystallography, the unbound proteins include both solution NMR and X-ray structures. The use of an ensemble of NMR structures as a starting point for docking provides an easy way for docking algorithms to sample additional conformational space. The benchmark contains members of all major structural groups described by Luscombe et al. (6) apart from the zipper-type group. These are: 16 helix–turn–helix (group 1), three zinc-coordinating (group 2), five other α-helix (group 4), two β-sheet (group 5), four β-hairpin/ribbon (group 6) and 17 enzyme (group 8) complexes. Each test case in the benchmark poses its own challenges for a docking algorithm. A common theme throughout the benchmark is conformational change in the protein, the DNA or both. This benchmark differs from its protein–protein counterpart in the omnipresence of conformational changes. To provide some structure in the test cases, we classified them as easy, intermediate or difficult. This classification is based on the interface RMSD values between the bound and unbound components of the complex:
- easy test case: interface RMSD between 0.0 Å and 2.0 Å
- intermediate test case: interface RMSD between 2.0 Å and 5.0 Å
- difficult test case: interface RMSD above 5.0 Å.
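The classification rule can be written down in a few lines. The text does not say how values lying exactly on 2.0 Å or 5.0 Å are assigned, so the boundary handling below is an assumption, and the function name is hypothetical.

```python
def classify_test_case(interface_rmsd):
    """Map an interface RMSD (in Angstrom) to a difficulty class."""
    if interface_rmsd < 0.0:
        raise ValueError("RMSD cannot be negative")
    if interface_rmsd < 2.0:        # assumed: exactly 2.0 counts as intermediate
        return "easy"
    if interface_rmsd <= 5.0:       # assumed: exactly 5.0 counts as intermediate
        return "intermediate"
    return "difficult"              # "above 5.0 A" in the text
```

For instance, an interface RMSD of 1.6 Å is classified as easy and 4.3 Å as intermediate.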
An easy test case
The individual components from this group of complexes do not significantly change the conformation of their interface upon binding. Conformational changes at the interface of the protein are mostly brought about by small rearrangements of flexible loops. This does not mean that the components can always be regarded as rigid. Conformational changes at the interface of the DNA often cause the DNA to bend and twist in the interface region (see DNA RMSD values in Table 1). A representative example from this group is the Papillomavirus replication initiation domain E-1 (PDB entry 1ksy, Figure 1A).
An intermediate test case
Unbound components of this group undergo more pronounced structural rearrangements in their interface upon complex formation. The conformational changes involve global and local domain rearrangements in the protein and global conformational change in the DNA. An example is the intron-encoded homing endonuclease I-PPOI complex (PDB entry 1a73, Figure 1B): the protein shows little conformational change upon binding, but the DNA is heavily kinked in its centre.
A difficult test case
In the difficult cases, the extent of structural rearrangement upon complex formation increases even further. In addition to the conformational changes occurring in the intermediate test cases, the difficult group contains complexes with features like structural transitions and major domain reorientations in the protein. An example is the proline utilization transcription activator (PDB entry 1zme, Figure 1C), a protein that has two DNA interaction domains linked together by a long, highly flexible loop; the dimerization interface connecting the two DNA interaction domains shows a loop-to-sheet transition upon DNA binding. In the PVUII endonuclease complex (PDB entry 1eyu, Figure 1D), the individual protein chains do not show much conformational change, but a hinge point connecting them facilitates a clamping motion upon binding. This global domain motion results in a large RMSD between the bound and unbound structures.
The benchmark also contains several structures with special features such as strand breaks (PDB entries 1g9z, 1o3t and 3bam) and flipped-out bases in the DNA (PDB entries 1diz, 1emh, 1vas and 7mht).
We constructed this benchmark as a test base to stimulate developments in the field of protein–DNA docking and will use it in particular for further developing our own protein–DNA docking approach (3). Ideally, the classification of easy, intermediate or difficult could have been based on docking results; at this stage, however, we chose to purely base it on conformational changes as measured by the RMSDs between bound and unbound form. Basing the classification on HADDOCK results would have introduced a bias not only toward the amount of conformational changes, but also toward our ability to predict protein–DNA interfaces since HADDOCK requires some kind of input to drive the docking process. We will of course proceed with evaluating our performance on this benchmark, but this is outside the scope of this article.
In conclusion, allowing for structural rearrangements in both protein and DNA during docking, while maintaining the helical character of DNA, is a major challenge in protein–DNA docking. The large variety of protein–DNA complexes in the benchmark should provide a valuable test set to evaluate and improve docking algorithms. Version 1.0 of the benchmark is available from the web site: http://haddock.chem.uu.nl/dna/benchmark.html
| ACKNOWLEDGEMENTS |
|---|
Financial support for this research and the Open Access publication charges for this article were provided by the European Community (FP6 STREP project ExtendNMR, contract no. LSHG-CT-2005-018988, FP6 I3 project EU-NMR, contract no. RII3-026145 and FP7 I3 project eNMR, contract no. 213010-e-NMR) and by a VICI grant from the Netherlands Organization for Scientific Research (NWO) to A.M.J.J.B. (grant no. 700.96.442).
Conflict of interest statement. None declared.
| REFERENCES |
|---|
1. van Dijk AD, Boelens R, Bonvin AM. Data-driven docking for the study of biomolecular complexes. FEBS J. (2005) 272:293–312.
2. Janin J. The targets of CAPRI rounds 6-12. Proteins (2007) 69:699–703.
3. van Dijk M, van Dijk AD, Hsu V, Boelens R, Bonvin AM. Information-driven protein-DNA docking using HADDOCK: it is a matter of flexibility. Nucleic Acids Res. (2006) 34:3317–3325.
4. Rhodes D, Schwabe JW, Chapman L, Fairall L. Towards an understanding of protein-DNA recognition. Phil. Trans. Roy. Soc. Lond. (1996) 351:501–509.
5. Mintseris J, Wiehe K, Pierce B, Anderson R, Chen R, Janin J, Weng Z. Protein-Protein Docking Benchmark 2.0: an update. Proteins (2005) 60:214–216.
6. Luscombe NM, Austin SE, Berman HM, Thornton JM. An overview of the structures of protein-DNA complexes. Genome Biol (2000) 1. e1.
7. Berman H, Henrick K, Nakamura H, Markley JL. The worldwide Protein Data Bank (wwPDB): ensuring a single, uniform archive of PDB data. Nucleic Acids Res. (2007) 35:301–303.
8. Sierk ML, Kleywegt GJ. Deja vu all over again: finding and analyzing protein structure similarities. Structure (2004) 12:2103–2111.
9. Lu XJ, Olson WK. 3DNA: a software package for the analysis, rebuilding and visualization of three-dimensional nucleic acid structures. Nucleic Acids Res. (2003) 31:5108–5121.
10. Chandrasekaran RA, Arnott S. The structures of DNA and RNA helices in oriented fibers. In: Landolt-Börnstein Numerical Data and Functional Relationships in Science and Technology, Saenger W, ed. (1989) Vol. VII/1b. Springer, Berlin. 31–170.
11. Linge JP, Williams MA, Spronk CA, Bonvin AM, Nilges M. Refinement of protein structures in explicit solvent. Proteins (2003) 50:496–506.
12. Brunger AT, Adams PD, Clore GM, DeLano WL, Gros P, Grosse-Kunstleve RW, Jiang JS, Kuszewski J, Nilges M, Pannu NS, et al. Crystallography & NMR system: a new software suite for macromolecular structure determination. Acta Crystallogr. (1998) 54:905–921.
13. Dominguez C, Boelens R, Bonvin AM. HADDOCK: a protein-protein docking approach based on biochemical or biophysical information. J. Am. Chem. Soc. (2003) 125:1731–1737.

interface RMSD < 5.0 Å) and difficult (interface RMSD above 5.0 Å) test cases from the protein–DNA benchmark. Easy test case: the Papillomavirus replication initiation domain E-1 (PDB id 1ksy) (interface RMSD = 1.6 Å) (A). Intermediate test case: the intron-encoded homing endonuclease I-PPOI complex (PDB id 1a73) (interface RMSD = 4.3 Å) (B). Difficult test cases: the proline utilization transcription activator (PDB id 1zme