Nucleic Acids Research Advance Access published online on April 9, 2008
Nucleic Acids Research, doi:10.1093/nar/gkn140
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Methods Online |
Detecting cis-regulatory binding sites for cooperatively binding proteins
1Department of Electrical Engineering, ESAT-SCD, Katholieke Universiteit Leuven, Kasteelpark Arenberg 10, 3001 Leuven, 2Department of Molecular and Cellular Interactions, VIB, Vrije Universiteit Brussel, Pleinlaan 2, 1050 Brussel and 3Department of Electrical Engineering, Vrije Universiteit Brussel, Pleinlaan 2, 1050 Brussel, Belgium
*To whom correspondence should be addressed. Tel: +32 16 328646; Email: Liesbeth.vanOeffelen{at}esat.kuleuven.be
Received January 15, 2008. Revised March 8, 2008. Accepted March 14, 2008.
| ABSTRACT |
|---|
|
|
|---|
Several methods are available to predict cis-regulatory modules in DNA based on position weight matrices. However, the performance of these methods generally depends on a number of additional parameters that cannot be derived from sequences and are difficult to estimate because they have no physical meaning. As the best way to detect cis-regulatory modules is the way in which the proteins recognize them, we developed a new scoring method that utilizes the underlying physical binding model. This method requires no additional parameter to account for multiple binding sites; and the only necessary parameters to model homotypic cooperative interactions are the distances between adjacent protein binding sites in basepairs, and the corresponding cooperative binding constants. The heterotypic cooperative binding model requires one more parameter per cooperatively binding protein, which is the concentration multiplied by the partition function of this protein. In a case study on the bacterial ferric uptake regulator, we show that our scoring method for homotypic cooperatively binding proteins significantly outperforms other PWM-based methods where biophysical cooperativity is not taken into account.
| INTRODUCTION |
|---|
|
|
|---|
Unraveling regulatory pathways is a key step toward understanding biological processes. A major problem with which biologists are often confronted is that they want to retrieve new binding sites for a known regulatory protein, while reducing the number of costly and time-consuming experiments. Therefore, they generally construct a PWM based on a set of known binding sequences such as those resulting from SELEX experiments. Then they score the putative promoter of each gene and validate the highest scoring genes in the wet lab by, e.g. mutagenesis in the predicted binding site followed by RT-PCR.
Several methods have been developed to score genes based on a PWM, depending on the interaction between the transcription factor and DNA. The first interaction mode studied was between a protein and a single binding site within a promoter (1). Later on, physically inspired adaptations were proposed to account for multiple binding sites (2) and cooperatively binding proteins (3), as well as more statistically inspired methods such as Cluster Buster (4) and MSCAN (5). However, the performance of these methods depends on a number of additional parameters that cannot be derived from sequences and are difficult to estimate as they have no physical meaning. In this article, we describe a new scoring method that takes multiple binding sites and cooperative binding into account by means of a minimum number of physical parameters. Therefore, we theoretically derive the binding probability within a putative promoter sequence. First, we consider the binding probability at a single binding site, then the influence of multiple binding sites and homotypic cooperative binding is studied (i.e. cooperative binding with the same protein). Subsequently, we apply our method to the homotypic cooperatively binding ferric uptake regulator (Fur) in Pseudomonas aeruginosa and show that taking cooperativity into account yields a significant performance enhancement. Finally, we also describe how the method can be extended to heterotypic cooperatively binding proteins and pre-bound complexes with a flexible dimerization domain.
| METHODS |
|---|
|
|
|---|
The single binding site model
Our aim is to score and rank genes based on the probability that they are regulated by a given protein. This probability can be estimated as the probability that at least one protein copy is bound within the putative promoter. If each promoter contains maximum one binding site, genes can be ranked based on the binding probability at the best scoring site within their promoter. The equilibrium probability of a site
|
| (1) |
|
| (2) |
|
| (3) |
|
| (4) |
|
| (5) |
the number of sites in the genome.
equals twice (two strands) the genome length, or in the special case of a homodimeric protein only once because of rotation symmetry.
Suppose, we are given a set of aligned sequences Xin for
, where it is known that each Xin is a preferred binding site for the considered DNA-binding protein. Based on the frequency matrix f(b, j) of these sequences and the genomic base frequencies p(b), a PWM can be defined as (1)
|
| (6) |
|
| (7) |
Multiple binding sites and homotypic cooperative binding
To take multiple binding sites into account when predicting regulation, Liu and Clarke calculated the probability
that at least one of the sites is occupied within the putative promoter of a gene (2):
|
| (8) |
To derive a correct formula for
, we followed a similar reasoning as used to obtain Equation (4) starting from Equation (1). Analogously to Equations (3) and (4), we find:
|
| (9) |
|
| (10) |
|
| (11) |
To solve the problem of the unknown parameters, we propose a different ranking strategy. Intuitively, the most straightforward approach would be to increase the protein concentration starting from zero and to see which sites are bound first. Therefore, instead of ranking genes based on
given a fixed protein concentration, we fix
to a threshold probability of 50% and rank genes based on the corresponding protein concentration. This seems biologically more relevant, since it tells us in which order proteins are switched on or off:
means that the gene should be in the middle between the on and the off state. Filling in
in Equation (9) yields a threshold in terms of
:
. If we denote the sum
by
and
by
, we can write
as
|
| (12) |
|
| (13) |
|
| (14) |
| RESULTS AND DISCUSSION |
|---|
|
|
|---|
To compare different prediction methods, we performed a case study on the Fur in P. aeruginosa. We tested the single binding site model, the multiple binding sites model and the homotypic cooperative binding model for several PWM's;. Moreover, we also evaluated the online available prediction methods PredictRegulon (8), Cluster Buster (4) and MSCAN (5). We could not make a comparison with the method of Liu and Clarke (2) or Granek and Clarke (3) since there are no data available on protein concentrations or the partition function.
The Fur protein
Fur is a conserved bacterial protein responsible for metal-dependent repression at the basis of the control exerted by iron on Fe-responsive genes. It is mainly studied in the human pathogen P. aeruginosa because the lack of this metal is a major environmental signal to trigger expression of important virulence factors (9).
The hypothetical Fur–DNA interaction model used in this article is shown in Figure 1. In this model, each monomer recognizes the same consensus sequence GATAATGAT(T/A). A second dimer can bind cooperatively 6 bp upstream or downstream from the first one (10), meaning that
can be neglected for every value of d except for d = 6.
|
Ranking genes
To rank the genes with the different methods, we used the 25 SELEX sites found by Ochsner and Vasil (11) as a set of aligned sequences and included their complements as Fur is a homodimer. The –200 to 0 region was chosen as the putative promoter.
We studied three different PWM's;: one obtained with the method developed by Djordjevic et al. (6); another for which the Pi's are maximum likelihood (ML) estimates as given in Equation (6); and also one in which Pi's are posterior mean estimates (PME's;) derived from a uniform prior [these terms are explained in reference (12)]:
|
| (15) |
Based on the ML and PME PWM's;, we scored each gene in the PA01 annotation table (13) using the single binding site model, the multiple binding sites model and homotypic cooperative binding model. The PWM of Djordjevic et al. (6) was only considered in combination with the single binding site model since it is determined up to an unknown constant factor µ, whereon the multiple binding sites and cooperative binding performances are strongly dependent.
We also tested three online available methods: PredictRegulon (8), Cluster Buster (4) and MSCAN (5). PredictRegulon uses a single binding site model with a different PWM, while the other two methods are based on statistical, instead of biophysical, multiple binding sites models. We set the model parameters equal to their default values, except for the minimum number of hits for MSCAN, which was chosen equal to 1.
Validation
We evaluated the different methods by means of the microarray analyses of Ochsner et al. (14) and Palma et al. (15). In the experiment of Ochsner et al. duplicate cultures were grown to stationary phase under iron-limiting or iron-replete conditions, while Palma et al. studied the early transcriptional response of exponentially growing Pseudomonas aeruginosa to iron. The differentially expressed genes are shown in the online supporting material.
Since Fur controls several other regulators, including the pyoverdine siderophore biosynthesis sigma factor PvdS (14), a certain number of iron-regulated genes will be regulated by Fur in an indirect way and no Fur-box will be found in the neighborhood of their promoters. As a consequence, only the high-scoring differentially expressed genes will be directly Fur-regulated and can be used as a true positive set; and reliable performance assessment can only be obtained for high-scoring genes. Therefore, unlike Granek and Clarke (3), we do not use ROC curves to evaluate our method, but we determine the number of true positives TP versus the number false positives FP for a limited number of false positives and evaluate the area under this curve. The higher this area, the better the method.
In Figure 2, the TP versus FP curves are plotted for PredictRegulon, Cluster Buster, MSCAN, the single binding site model that uses the PWM derived by Djordjevic et al. and the homotypic cooperative binding model for the PME PWM. The binding constant in the cooperative model was chosen as
with
kcal/mol; this choice will be explained later. Table 1 shows the areas under the FP versus TP curves with
for all the different methods.
|
|
Apparently, the performances of the single binding site and multiple binding sites models are comparable, while the cooperative binding model outperforms all the other methods as it concentrates more true positives at the beginning of the ranking. The corresponding gene ranking for the PME PWM can be found in the online supporting material. In fact, it is not surprising that the performances of the single binding site models and the biophysical multiple binding sites models are not significantly different. Binding energies for the best binding sites typically differ by several kJ/mol [i.e. the energy related to a non-covalent bond (16)]; and binding probabilities differ by a factor that depends exponentially on the difference in binding energy. Thus, without cooperative interactions, the binding probabilities at different binding sites typically differ by several orders of magnitude, and most of them are negligible, therefore.
Figure 3 shows the performance of the cooperative binding model for several values of
within a realistic range. The area under the FP versus TP curve is plotted for
and
to make sure that the trend does not depend too much on the considered number of false positives. From this figure, it can immediately be seen that the cooperative binding model explains the microarray data better than the multiple binding sites model: the curves reach a minimum for
. Furthermore, the trend corresponds well to our expectations. As long as the estimated
is smaller than the true value, we expect that the performance of the method increases with
. When the
becomes greater than the true value, we anticipate that the performance saturates because a sequence of twice the binding site length does not occur by chance; otherwise the performance would decrease. Only when
exceeds the binding energy of a single protein by a few orders of magnitude, the proteins will not be able to discriminate sites in the DNA well anymore since protein–protein recognition will dominate protein–DNA recognition. This results in a performance drop at
kcal/mol (this is not shown in Figure 3 for scaling reasons).
|
We chose
Extensions to more advanced binding models
Our method can be extended to heterotypic cooperatively binding proteins and pre-bound complexes with a flexible multimerization domain. In the first case, the order in which genes are up- or down-regulated depends on which protein concentration is varied and the concentrations of the proteins that bind cooperativily. Assume that the concentration of
changes under the considered conditions, and that the concentrations of the cooperatively binding proteins
approximately remain the same (this assumption is equivalent to the assumption of Granek and Clarke where
is the crucial protein). If
,
will first bind promoters that contain a binding site for
and vice versa. Note that this kind of regulatory mechanism may especially be important in eukaryotes where the same regulators have to switch on different sets of genes in different cell types.
To illustrate how the extension for heterotypic cooperatively binding proteins can be obtained, we consider the case of two cooperatively binding proteins
and
and derive the probability
that at least one of the sites is occupied by protein
. Again we find Equation (9) but now with
|
| (16) |
|
| (17) |
|
| (18) |
Before Equation (17) can be applied, we should first calculate the constant factors in the Px,i's and determine one additional parameter compared with the case of homotypic cooperative binding:
. In general, if there are more proteins involved in cooperative binding, one additional parameter
is required per added protein
. The partition function Zx can be determined by measuring the binding constant for one specific binding site and using Equation (5). The protein concentration
can be estimated based on the measurements of the average number of proteins per cell volume
and the average cell volume V:
|
| (19) |
|
| (20) |
Cooperative binding does not always have to deal with individual proteins that interact cooperatively upon DNA binding. It is also possible that proteins form pre-bound complexes and that a flexible multimerization domain allows for multiple distances between the binding sites of the individual proteins. To deal with such situations, we need a PWM and a relative binding constant
for each possible distance d. Then, we can determine Pi's for each PWM and scale them properly to make sure that the Pi's for the different distances will be determined up to the same constant factor.
| CONCLUSION |
|---|
|
|
|---|
We have introduced a new scoring method that utilizes the underlying physical binding model of protein–DNA and protein–protein recognition. This method requires a minimum number of physical parameters and detects cis-regulatory modules in almost the same way as they are recognized by the proteins. The more the parameters approach their true values, the more the detection method reflects physical reality. Therefore, a better performance can be obtained if the parameters are estimated or measured with a higher precision. The reliability of a prediction and the influence of each parameter on the performance can be estimated by testing several values of the parameters within a realistic range. For example, a certain distance of d basepairs may never occur between two important binding sites, and, therefore, the corresponding cooperative binding constant will not affect performance.
To obtain highly accurate predictions, we suggest that cooperative binding constants should be measured for relevant distances, as well as protein concentrations and partition functions in the case of heterotypic cooperative binding. Binding constants and partition functions can be derived from electrophoretic mobility shift analyses, and protein concentrations can be determined by measuring the mean number of proteins per cell volume and the mean cell volume. Furthermore, these measurements can provide clear insight into how binding energies and distances are related to gene regulation, and will allow further validation and development of biophysical prediction methods.
| ACKNOWLEDGEMENTS |
|---|
B.D.M is a full professor at the Katholieke Universiteit Leuven, Belgium. Y.M. is a professor at the Katholieke Universiteit Leuven, Belgium. P.C. is a professor at the Vrije Universiteit Brussel, Belgium. Research supported by: * Research Council KUL: GOA AMBioRICS, CoE EF/05/007 SymBioSys, several PhD/postdoc & fellow grants; * Flemish Government: – FWO: PhD/postdoc grants, projects G.0241.04 (Functional Genomics), G.0499.04 (Statistics), G.0232.05 (Cardiovascular), G.0318.05 (subfunctionalization), G.0553.06 (VitamineD), G.0302.07 (SVM/Kernel), research communities (ICCoS, ANMMM, MLDM); – IWT: PhD Grants, GBOU-McKnow-E (Knowledge management algorithms), GBOU-ANA (biosensors), TAD-BioScope-IT, Silicos; SBO-BioFrame; * Belgian Federal Science Policy Office: IUAP P6/25 (BioMaGNet, Bioinformatics and Modeling: from Genomes to Networks, 2007–2011); * EU-RTD: ERNSI: European Research Network on System Identification; FP6-NoE Biopattern; FP6-IP e-Tumours, FP6-MC-EST Bioptrain, FP6-STREP Strokemap.
Conflict of interest statement. None declared.
| REFERENCES |
|---|
|
|
|---|
- Stormo GD, Fields DS. Specificity, free energy and information content in protein-dna interactions. TIBS (1998) 23:109–113.[Medline]
- Liu X, Clarke ND. Rationalization of gene regulation by a eukaryotic transcription factor: Calculation of regulatory region occupancy from predicted binding affinities. J. Mol. Biol. (2002) 323:1–8.[CrossRef][ISI][Medline]
- Granek JA, Clarke ND. Explicit equilibrium modeling of transcription-factor binding and gene regulation. Genome Biol. (2005) 6:R87.[CrossRef][Medline]
- Frith MC, Li MC, Weng Z. Cluster-buster: Finding dense clusters of motifs in dna sequences. Nucleic Acids Res. (2003) 31:3666–3668.
[Abstract/Free Full Text] - Alkema WB, Johansson O, Lagergren J, Wasserman WW. Mscan: Identification of functional clusters of transcription factor binding sites. Nucleic Acids Res. (2004) 32:W195–W198.
[Abstract/Free Full Text] - Djordjevic M, Sengupta AM, Shraiman BI. A biophysical approach to transcription factor binding site discovery. Genome Res. (2003) 13:2381–2390.
[Abstract/Free Full Text] - Benos PV, Lapedes AS, Stormo GD. Is there a code for protein-dna recognition? probab(ilistical)ly. BioEssays (2002) 24:466–475.[CrossRef][ISI][Medline]
- Yellaboina S, Seshadri J, Kumar MS, Ranjan A. Predictregulon: A web server for the prediction of the regulatory protein binding sites and operons in prokaryote genomes. Nucleic Acids Res. (2004) 32:W318–W320.
[Abstract/Free Full Text] - Escolar L, Perez-Martin J, De Lorenzo V. Opening the iron box: Transcriptional metalloregulation by the fur protein. J. Bacteriol. (1999) 181:6223–6229.
[Free Full Text] - Pohl E, Haller JC, Mijovilovich A, Meyer-Klaucke W, Garman E, L VM. Architecture of a protein central to iron homeostasis: Crystal structure and spectroscopic analysis of the ferric uptake regulator. Mol. Microbiol. (2003) 47:903–915.[CrossRef][ISI][Medline]
- Ochsner UA, Vasil ML. Gene repression by the ferric uptake regulator in Pseudomonas aeruginosa: Cycle selection of iron-regulated genes. Proc. Natl Acad. Sci. USA (1996) 93:4409–4414.
[Abstract/Free Full Text] - Durbin R, Eddy S, Krogh A, Mitchison G. Biological Sequence Analysis (2001) Cambridge: Cambridge University Press.
- Winsor GL, Lo R, Sui SJ, Ung KS, Huang S, Cheng D, K CW, Hancock RE, Brinkman FS. Pseudomonas aeruginosa genome database and pseudocap: Facilitating community-based, continually updated, genome annotation. Nucleic Acids Res. (2005) 33:D338–D343.
[Abstract/Free Full Text] - Ochsner UA, Wilderman PJ, Vasil AI, Vasil ML. Genechip expression analysis of the iron starvation response in Pseudomonas aeruginosa: Identification of novel pyoverdine biosynthesis genes. Mol. Microbiol. (2002) 45:1277–1287.[CrossRef][ISI][Medline]
- Palma M, Worgall S, Quadri LEN. Transcriptome analysis of the Pseudomonas aeruginosa response to iron. Arch. Microbiol. (2003) 180:374–379.[CrossRef][ISI][Medline]
- Berg JM, Tymoczko JL, Stryer L. Biochemistry (2001) New York: W.H. Freeman and Company.
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||



Gcoop,6. The area under the TP versus FP curve is shown as a function of