Nucleic Acids Research Advance Access originally published online on May 25, 2007
Nucleic Acids Research 2007 35(11):e80; doi:10.1093/nar/gkm268
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Nucleic Acids Research, 2007, Vol. 35, No. 11 e80
© 2007 The Author(s)
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
Methods Online |
General transfer matrix formalism to calculate DNAproteindrug binding in gene regulation: application to OR operator of phage 
Institute of Bioorganic Chemistry, Belarus National Academy of Sciences, Street Kuprevich 5/2, 220141, Minsk, Belarus
*To whom correspondence should be addressed. Tel: +375 17 267 82 63; Fax: +375 17 267 86 47; Email: teif{at}iboch.bas-net.by
Received February 2, 2007. Revised April 9, 2007. Accepted April 9, 2007.
| ABSTRACT |
|---|
|
|
|---|
The transfer matrix methodology is proposed as a systematic tool for the statisticalmechanical description of DNAproteindrug binding involved in gene regulation. We show that a genetic system of several cis-regulatory modules is calculable using this method, considering explicitly the site-overlapping, competitive, cooperative binding of regulatory proteins, their multilayer assembly and DNA looping. In the methodological section, the matrix models are solved for the basic types of short- and long-range interactions between DNA-bound proteins, drugs and nucleosomes. We apply the matrix method to gene regulation at the OR operator of phage
. The transfer matrix formalism allowed the description of the
-switch at a single-nucleotide resolution, taking into account the effects of a range of inter-protein distances. Our calculations confirm previously established roles of the contact CICroRNAP interactions. Concerning long-range interactions, we show that while the DNA loop between the OR and OL operators is important at the lysogenic CI concentrations, the interference between the adjacent promoters PR and PRM becomes more important at small CI concentrations. A large change in the expression pattern may arise in this regime due to anticooperative interactions between DNA-bound RNA polymerases. The applicability of the matrix method to more complex systems is discussed. | INTRODUCTION |
|---|
|
|
|---|
Motivation
Gene regulation is governed by a number of biomolecules competing for DNA-binding sites, recognizing each other, assembling on the double helix, binding ligands on their backs, forming sophisticated DNA structures, etc. This picture is further complicated because DNA is tightly packed in vivo, and because proteins or drugs may link DNA segments separated by large distances along the sequence. Knowing the information about all molecular players and the rules of their interaction, Nature calculates the transcription level for each gene. Is this biological LEGO game solvable on a computer? Let us take this as a working hypothesis.
Statistical mechanics of gene regulation
It is now believed that most of the binding events involved in gene regulation are reversible and governed by the thermodynamic equilibrium (13). Nowadays, high-throughput microarray technology allows us to determine thousands of thermodynamical parameters from a single experiment (4). In addition, the bioinformatics sequence analysis methods provide a way to predict the proteinDNA-binding affinities (57). There is a growing understanding now that a statisticalmechanical methodology is required to predict gene regulation based on this large amount of data (814). Some methods consider just several predefined binding sites to find a solution for comparatively simple gene regulatory systems (8). For example, the Escherichia coli's lac-operator containing several binding sites for LacI and C-reactive proteins (CRPs) that may multimerize and assist DNA loop formation, can even be described analytically (9). The combinatorial regulation at a single eukaryotic enhancer is also a solvable task (15,16). More complex systems may be accessed with the help of different network approaches (17). However, once we identify all the important states (which is not a trivial task), a huge number of states, parameters and computation time make many interesting systems practically incalculable without special tricks.
Fortunately, although the binding events are very complex, everything is still centered on the DNA, which provides a one-dimensional template for protein binding. We show here that it is possible to describe it mathematically as a one-dimensional system even with DNA looping and multilayer protein assembly involved. One-dimensionality significantly simplifies theoretical description. Instead of jumping from node to node in the multidimensional reaction space, we screen all possible binding reactions, going in one direction along the DNA. This allows us not to skip seemingly unimportant states (e.g. non-specific binding), which gives the method more predictive power.
Lattice models for DNA-ligand binding
The principles of calculation of macromolecule binding to a one-dimensional lattice were formulated in the second part of the 20th century. The purpose of this field was initially to take into account some of the following features of DNAprotein binding: (i) binding site overlapping; (ii) competitions between different protein types, or different binding modes; (iii) site specificity determined by the DNA sequence; (iv) contact interactions between proteins bound to the DNA (e.g. when the protein is assembled from several subunits with sticky ends); (v) long-range interactions (e.g. the DNA conformational transitions, changes in the DNA charge density or topology).
Several methods of solving one-dimensional lattice models have been developed in the past, including the generating functions (18,19), the transfer matrix method (20,21), the combinatorial approaches (22,23) and other modifications (2426). In the case of non-site-specific binding, the problems of site overlapping, competitions and contact interactions may be solved analytically by any of these methods. The McGheevon Hippel (MvH) approach is probably most widely used for the description of typical DNAprotein and DNAdrug experiments (22). However, site-specificity requires calculations according to the real polymer sequence, which rules out any analytical solutions (22,23,26). Taking into account the long-range interactions between the proteins bound to the DNA, poses additional difficulties that cannot be easily resolved by the combinatorial approaches (23,26). For example, the recent GOMER algorithm allows treating long-range interactions between a protein and a DNA promoter, but not the long-range interactions between two DNA-bound proteins (11). The generating functions method is a more general tool that has been extensively tested for many kinds of one-dimensional problems (18,19). At first, this method seemed inapplicable for the case of long-range interactions (27). Later studies have showed that the generating functions method still allows treating long-range cooperativity, but it fails if more then one type of large protein exists in the system (28). On the other hand, the transfer matrix method allows treating site-specificity (20), long-range interactions (27) and multiple binding (29). Yet there are other basic binding features such as the multilayer assembly, DNA looping, nucleosome sliding, etc. for which none of these methods have been tested. It seems from the literature analysis that only the transfer matrix method is left as a potential approach to solve the whole complexity of DNAproteindrug lattice models in a unified systematical way. Up to now, there were no attempts to apply this method to the biophysical characterization of complex gene regulatory systems. On the other hand, a complementary field of mathematical analysis of DNA sequences now actively uses matrix methods. This provides an additional argument for choosing the transfer matrix formalism as a general systematic tool.
The legacies of Markov and Ising
At this point, we have to make several methodological comments. All models for DNA-ligand binding mentioned above belong to the class of the so-called Ising models. In his doctoral thesis in 1924, Ernest Ising was studying ferromagnetism and introduced the model of a linear chain of magnetic moments, which are only able to take two positions, up and down, and which are coupled by interactions between the nearest neighbors. Later the Ising model became popular in many fields of physics. Naturally, when a number of physicists moved to biophysics inspired by Schrödinger's definition of DNA as a one-dimensional aperiodic crystal, they brought the Ising model to the new field, in particular to the study of DNA melting and DNA-ligand binding (24). At that time, the field of bioinformatics did not yet exist.
When bioinformatics emerged later, it was mostly driven by the biologically inspired mathematicians who came with their own concepts such as the Markov chains. In the 1920s, a pioneer of cybernetics, Norbert Wiener, performed a first rigorous study of a continuous Markov process. This work and the later Kolmogorov's probability theory popularized the ideas of the 19th-century mathematician Andrei Markov, who studied the sequences of random variables in which the future variable is determined by the present variable but is independent of the way in which the present state arose from its predecessors. The Markov chains are now widely used in bioinformatics, in particular in the DNA sequence analysis (57, 3032).
Both the Markov chains and the Ising lattices may be formulated with the help of the transfer matrices containing the probabilities of transition of a system between different states. Consequently, the Ising model may be converted into the Markov model, and vice versa. The general transfer matrix formalism is evidently the ancestor of both the Markov chains and the Ising lattices. However, the differences between the transfer matrices employed in the biophysical and bioinformatical studies of DNAprotein interaction are not just historical. Bioinformatics is mostly interested in DNA sequence analysis, motif finding, etc. Therefore the different states in the Markov chains are either (A, T, G, C) or the occurrences of dinucleotides, trinucleotides, etc. or some other words composed from the nucleotide dictionary (57, 3032). On the other hand, in the biophysical models, the DNA sequence is fixed, and the different states are free or bound depending on the presence or absence of a protein at a given DNA site (1929). In the matrix models considered below, the states are also divided into free or bound, but these states are further subdivided into a number of microstates allowing us to treat complex models of DNAproteindrug interaction. Hopefully our work will help to join the efforts of biophysics and bioinformatics in the description of gene regulation.
In the current work, we provide the unified matrix formalism to calculate the multiprotein assembly on the DNA. We consider the most important experimental featuresthe site-specificity, competitions, multilayer binding, nucleosomes and DNA loopsand show that all these essential constituents of gene regulation are computable using the transfer matrix formalism. Then we test the method on gene regulation of phage
. The matrix formalism not only allowed a correct description of this well-documented system, but also suggested new biologically relevant predictions.
| GENERAL METHODOLOGY |
|---|
|
|
|---|
The idea of the transfer matrix method is to consider the DNA molecule as a 1D lattice of units, each unit being characterized by a matrix of statistical weights corresponding to all its possible states (20,29). The matrix of statistical weights is called the weight matrix or the transfer matrix. The weight matrix depends on the DNA sequence and the chosen model of DNAprotein and proteinprotein interactions. The partition function is given by the product of the matrices corresponding to all DNA units. The probabilities of the binding events may be calculated from the partition function. The general methodology consists of choosing the elementary DNA unit, enumerating all its possible states, constructing the corresponding transfer matrices, applying the boundary conditions and finally calculating the maps of binding or the binding curves.
Choosing the elementary unit
Let us consider the DNA molecule as a linear lattice of N units numbered by index n, n = 1 ... N (Figure 1A). In the case of independent ligand binding (non-interacting binding sites) it is convenient to choose the elementary unit coinciding with the binding site (29). However, when the binding is sequence-specific and the binding sites may overlap, one may use some physically distinguished units instead (the nucleotide, base pair, nucleosome, etc.). Throughout this article, we will assume that the elementary unit is the base pair.
|
States enumeration
We have to list all available states for each elementary DNA unit. This may be done in several ways. If we forget a state, this will lead to an error in the partition function. If we enumerate a state more than once, this will add unnecessary parameters and increase the computation time. Therefore, it is important to find the shortest complete number of states, R. The transfer matrix will then contain R x R elements corresponding to all possible combinations of states of a given unit and its nearest neighbor.
Transfer matrix construction
The transfer matrices Qn consist of the elements Qn (i, j) equal to statistical weights corresponding to the nth DNA unit being in state i followed by the next unit in state j. The physical meaning of each element of the weight matrix is the conditional probability of having the unit n in state i provided the next unit is in state j. Evidently only several combinations of states i and j are allowed. For example, although the binding sites might overlap, the bound proteins cannot overlap if they are in the same layer. The allowed states are characterized by statistical weights given as a combination of the concentrations and energetic parameters. The prohibited states are characterized by zero statistical weights.
Boundary conditions
The transfer matrices constructed at the previous stage are the regular matrices corresponding to the DNA units far from any obstacles. All regular matrices keep the same locations of zero elements. Close to the DNA ends or close to the physical obstacles (Figure 1B) the transfer matrices change according to the boundary conditions. Previous studies have showed that boundary conditions may set strong constraints on sequence-specific target location by proteins on DNA (3335). Our calculations confirm this conclusion. In general, protein hanging out from DNA ends may be either prohibited or allowed. In the former case, the transfer matrices corresponding to the DNA ends have more zero elements then the regular matrices. If protein hanging out from DNA ends is allowed, then the matrix mask is not changed, but the binding constants depend on the distance from the DNA end. The special boundary conditions may be also set inside the DNA segment.
Calculation of the binding probabilities
The final output of the calculations is either the equilibrium protein distribution along DNA (the map of binding), or the dependencies of the binding probabilities on protein concentrations (the binding curves). The intermediate results necessary for these calculations are the partition function and its derivatives. The standard expression for the partition function Z of a linear lattice of N units is given by Equation (1) (20).
| (1) |
If we deal with a homopolymer, it is possible to diagonalize the matrices and turn matrix multiplication into multiplication of the diagonal elements. However, in the case of gene regulation the DNA sequence specificity is a significant feature, and this trick will not work out. All our matrices are different and should be multiplied according to the sequence of the DNA units. The straightforward partition function calculation according to Equation (1) leads to extensive matrixmatrix manipulations. It is easier to replace it by the vector-matrix multiplication and calculate the partition function Z and its derivatives recursively according to Equations (2) and (3) (21).
| (2) |
| (3) |
Here X is any parameter explicitly entering at least one of the statistical weights. Since we have all elements of the matrix Qn in the analytical form, the matrix
Qn/
X may be found analytically. This saves computational time and decreases errors in the numerical calculations.
The probability of a given state is, as with all partition functions, the term in the partition corresponding to that state divided by the sum of all of the terms (24). Suppose we have a parameter X uniquely entering the statistical weight of a given state of a given DNA unit. Then differentiating the partition function by X will filter all configurations of the system, which contain a given DNA unit in a given state. If X enters the partition function linearly, then the probability of this state is equal to the corresponding derivative of the partition function, multiplied by X and divided by the partition function. In particular, the probability that the nth DNA unit is covered by the protein of type g is given by Equation (4):
|
| (4) |
Here Kng is the binding constant for a protein of type g binding to the DNA site starting at the nth unit. The whole set of cng values determines the complete map of protein binding to DNA. Having in mind that the proteins may also assemble on DNA in a multilayer fashion, the term map of binding may be generalized to a multidimensional plot giving the probabilities for all binding events.
The binding curve cg gives the average degree of protein g binding to DNA:
| (5) |
When protein binding to DNA is coupled to other reactions in the solution (e.g. protein dimerization, modification, activation, etc.), the equilibrium concentrations c0g should be found from the corresponding laws of mass action. In most cases, a small number of proteins bound to DNA site-specifically does not affect the bulk concentration of free proteins. On the other hand, the non-specific binding to DNA should be taken into account to correct the concentration of free proteins. In vivo, a large fraction of regulatory proteins binds DNA non-specifically (36). The non-specific binding may be viewed as one of the simple equilibriums to be solved before calculating the maps of binding. Once the effective concentrations of free proteins in the solution have been determined, the maps of protein binding to DNA may be calculated from the matrix method. Thus, the calculation of the protein arrangements on the DNA may be decoupled from the simple equilibrium calculations in solution. While calculating the maps of binding, we treat all binding events as sequence-specific, including the non-specific binding.
| CONSTRUCTION OF THE BASIC MATRIX MODELS |
|---|
|
|
|---|
The methodology described in the previous section allows construction of the weight matrices for different models of multimolecular interaction. Now we have to show that it is possible to split the complex gene regulatory processes into a number of well-defined physical events solvable using the transfer matrix formalism. Firstly, we extract the basic binding features that may be used as the building blocks for more complex interaction models. Figure 2 summarizes the models, which we solve in the current section using the transfer matrix formalism. The other models may be constructed either as derivatives or as combinations of the basic ones.
|
Sequence-specific binding
Let us first consider a single-protein sequence-specific binding to DNA (Figure 2A). The protein may be either bound or not. If the protein covers m DNA units upon binding, then each unit may be in RA = m + 1 states. Let us number the states of the DNA unit by index i. States i = 1, ... ,m correspond to the DNA unit being covered by the protein, depending on where the protein starts respectively to the given DNA unit, from the protein's left (i = 1) to its right (i = m) end. State i = m + 1 corresponds to the free DNA unit.
The allowed combinations of states of the nth unit and the (n + 1)th unit correspond to the following non-zero elements of the transfer matrix Qn(i, j): the protein starts at the nth DNA unit (i = 1, j = 2), the protein starts before the nth unit and covers the nth unit (i = 2 ... m 1, j = i + 1), the protein ends at the nth unit and is followed by a free unit (i = m, j = m + 1), the protein ends at the nth unit and is followed by the next protein (i = m, j = 1), and the nth unit is free from proteins (j = i = m + 1). The other elements of the transfer matrix are equal to zero.
We want to deal with the binding constants available from experiments. These values cannot be easily localized among the multiple protein contacts with individual DNA units. Therefore, we assign the whole energy of proteinDNA binding to the first protein contact with the DNA. In the case of a single-protein binding this corresponds to the matrix element Qn(1, 2). We set Qn(1, 2) = Kn x c0, where Kn is the binding constant for the frame of m DNA units starting at the nth unit, c0 is the molar protein concentration. All the rest non-zero matrix elements are units. Equation (6) shows an example of the transfer matrix constructed according to this algorithm for m = 3.
| (6) |
Similar weight matrices have been used in the early DNA-ligand studies (20). Unlike this simple example, below we will deal with large matrices that are not easy to represent in a journal page. The site-specificity of binding will be preserved in the same way in all our following models.
Competitive binding
Competitive binding (Figure 2B) is another basic feature, which is important for many experimental systems. Competitions of several types may be encountered in gene regulation. Two most commonly used models are the competition of different proteins and the competition of different modes of binding of the same protein. Although the biology is different, the transfer matrices for these two competitive models are described by the same mathematics.
Let the protein binding to DNA be characterized by f different complexes numbered by index g, g = 1, ... , f. Each proteinDNA complex of type g involves mg DNA units. Then each DNA unit may be in RB states given by Equation (7):
|
| (7) |
The detailed states enumeration for this model and the algorithm of transfer matrix construction is given in the Supplementary Data. Analogously to the single-protein matrix, we describe the binding of a protein of type g to a frame of mg DNA units starting at unit n, by the statistical weight Kng x c0g. Here Kng is the binding constant, and c0g is the bulk concentration of g-type protein. The number of protein types may be less then the number of types of the proteinDNA complexes. In the case of the competition between different binding modes of the same protein the concentrations c0g are the same.
Small drugs
Many anticancer and antimicrobial drugs exert their activity through direct DNA binding. Leaving aside the compounds that induce covalent DNA modifications, let us look here at the reversible minor groove binders. This is a wide class of drugs such as netropsin, distamycin and their derivatives, many of which bind DNA sequence-specifically (37). A number of potential drugs are now being developed based on synthetic polyamides, recognizing the regulatory protein-binding sites on DNA, and thus interfering with gene regulation. This field has opened several years ago and is already becoming pharmaceutically important (38). For example, it was shown recently that sequence-specific polyamides alleviate the transcription inhibition associated with long GAA·TTC repeats in Friedreich's ataxia (39).
The competition of proteins and small drugs may be better described by the allosteric competition model shown in Figure 2C rather then the simple competition in Figure 2B. The allosteric competition model allows overlapping of molecules bound to DNA. This may be realized if a small drug slides along the minor groove, while the protein slides along the major groove (40). The drug changes the DNA conformation, widening the minor groove and narrowing the major groove. In this case, the states enumeration of the basic competitive model (Table S1) will be changed. The description of the states should now indicate whether a unit is bound to a small drug in the minor groove and whether it is bound to a protein in the major groove. The change in the algorithm for transfer matrix construction is also straightforward. Those states that include the nearby binding of both the protein and the drug are charged an additional allosteric cooperativity constant.
Nucleosomes
In eukaryotes, many regulatory sites are covered by the nucleosomes. A nucleosome may free DNA in two ways. The first possibility is that the nucleosome dissociates from the DNA and goes to the solution. This situation may be described by the basic proteinprotein competition model (Figure 2B). The second possibility is that the nucleosome slides along the DNA to a new position without being completely dissociated (Figure 2D). The nucleosome binding is site-specific as well as the protein binding. It is known experimentally that there are specific DNA sequences that are more bendable and phased to favor nucleosome binding (41,42). The nucleosomes are probabilistically positioned in the chromatin according to the DNA sequence. Moving a nucleosome along the DNA requires some work. All nucleosomes are the same with respect to DNA binding, and therefore the assignment of the nucleosome-binding constants from the DNA sequence analysis is even simpler and more effective then in the general DNAprotein case.
When protein binding is strong enough to compensate for the nucleosome dissociation energy, it will result in the competition of a protein and a nucleosome octamer described by a standard competitive model [Equation (7)]. When protein binding is weak, the proteinnucleosome competition will result in the nucleosome sliding. In the latter case, the order and the total number of nucleosomes on the DNA is fixed, and the competitive model should be modified to keep track of all nucleosomes within a given DNA sequence. Due to a large number of states, such a model is hardly applicable to the whole-genome studies, but it may be important for short DNA segments containing a regulatory sequence covered by just several nucleosomes. In such a system, proteins help each other bind DNA by displacing nucleosomes; hence, their collaborative competitions may cause experimentally observable binding cooperativity (43). Apart from nucleosomes, the sliding/dissociation model may be also useful for a broad class of ring-like proteins that assemble on the double helix and slide along DNA, e.g. helicases.
Cooperative protein assembly
Let us consider the cooperative multiprotein binding within a single layer along DNA (Figure 2E). This is a conventional class of models, which includes both the short- and long-range interactions between the proteins bound to the DNA (44). The proteins may interact either through direct contacts or through the DNA. The binding of proteins on the backs of other proteins is not allowed here (see below for the multilayer binding).
Contact cooperativity
The contact cooperativity is easy to incorporate in the basic competitive matrix considered above. Since we have already defined the matrix elements for the proteinprotein contacts, we just have to set additional statistical weights for these elements. These values are usually called the contact cooperativity parameters. They are denoted in our formalism as w(0, g1, g2) = exp(
(0, g1, g2)/RT), where
(0, g1, g2) is the free energy of a contact between the proteins of types g1 and g2 bound to the adjacent DNA units, RT
0.6 kcal.
The f-mer assembly model
The simplest and most commonly used model of the contact proteinprotein interactions is known as the McGhee-von Hippel (MvH) cooperativity (22). This model corresponds to the situation when each of the bound proteins may interact with two of its nearest neighbors. A more general case is the multiprotein assembly on the DNA forming f-mer complexes with f ranging from two (dimer assembly) to infinity. Many proteins involved in gene regulation through f-mer assembly lay between the two extreme cases of the classical MvH contact cooperativity (which may lead to f-mers of infinite size, e.g. RecA assembly into large filaments along DNA), and the pairwise cooperativity (which leads to f-mers composed of two proteins, e.g. homodimerization). An example of the pairwise cooperativity is the phage
repressor CI. This protein binds DNA in a dimeric form. The CI dimers bound to adjacent DNA sites recognize each other due to direct proteinprotein contacts formed by their C-terminal domains. Each CI dimer may interact only with one nearest neighbor. Once a pair of dimers is formed, a third dimer binds DNA non-cooperatively (45,46). Such interactions may be described by the extended MvH model, where two different orientations of the bound proteins are taken into account (47,48). The f-mer assembly model provides a further generalization.
In the general case of f-mer assembly with asymmetrical interactions, we distinguish f different complex types, even if the proteins are identical. The first protein in f-mer, the second protein in f-mer, ... , the last protein in f-mer. Each protein-DNA complex (not each protein) should be assigned its contact cooperativity parameters for interactions with other complexes. The matrix formalism is thus very convenient for this model. The problem of the hetero-f-mer assembly is treated in the same way as the homo-f-mer assembly. In particular, for the MvH cooperativity we use a single complex type (f = 1) for each protein.
Long-range interactions
A further generalization of the cooperative models is to take into account the interactions beyond direct proteinprotein contacts. There are several possibilities to take this into account in the matrix formalism (27,49). Let us assume that the interaction of the proteins of types g1 and g2 depends on the distance l between them. Analogously to the contact interactions, we characterize it by the cooperativity constants w(l, g1, g2) = exp(
(l, g1, g2)/RT). The contact interactions model is thus a particular case of long-range interactions with l = 0. The states enumeration includes now Vg states for a DNA unit being between bound proteins, where Vg is the maximum length of g-type protein interaction. Three additional states correspond to the units belonging to the left and right free DNA ends, and the units between non-interacting isolated proteins (Supplementary Data has the detailed states enumeration). Each DNA unit may be in RE states:
| (8) |
How large is the interaction range in vivo? The DNA conformational transitions or changes in the DNA charge distribution induced by protein binding may propagate for up to several dozens of base pairs (50). On the other hand, the changes in DNA topology or DNA looping may cover very large distances. In the worst case, long-range interactions involve the whole DNA molecule. The problem with long-range interactions in the matrix formalism is that the number of states RE increases linearly with the maximum interaction length max(Vg). A straightforward calculation for max(Vg) = 300 takes several hours on a Pentium M 725 laptop. Special algorithms for sparse matrix handling allow accelerating the calculations.
Multilayer binding
Although DNA provides a one-dimensional template for protein binding and we are using the one-dimensional mathematics, the binding events are not confined to one dimension (Figure 2F). Proteins may form multilayer structures on the DNA. Once bound to the DNA, a protein by itself may provide a lattice for new binding events. Proteins may bind other proteins and small ligands such as metal ions and ATP. One important case belonging to this class of models is the piggy-back binding model which was initially formulated for the DNA-dependent ATPase of DNA gyrase (51). This model describes binding of small ATP riders to the backs of the DNA-bound proteins. Another important example of the vertical assembly is the multimerization of the proteins of comparable size with the possibility of formation of bridge-like structures. For example, a CI dimer bound to the DNA may bind another CI dimer on its back to form a tetramer (12).
Let each protein may form from zero to fff additional vertical complexes with other proteins or drugs. Then each DNA unit may be in RF states:
|
| (9) |
The algorithm for transfer matrix construction is analogous to the algorithm for the competitive binding model (Supplementary Data). The second layer of proteins is characterized by the binding constants in the same way as the convenient single layer proteinDNA binding.
DNA loops
DNA looping is an essential component of gene regulation (1,12). DNA loops in vivo may range from tens or hundreds of nucleotides in the case of the proximal promoters, to thousands in the case of the distant enhancers and to millions in the case of the chromosome structure maintenance elements. Our calculations indicate that the loops larger than 1000 units are practically incalculable in the frame of the long-range interactions model. Thus, we divide the loops into two computable classes. Small loops (Figure 2G) may be calculated taking into account the long-range interactions between all the units. Large loops cannot be calculated like this. However, the behavior of a very large loop depends mainly on the binding events at the segments brought together by the protein crosslinks, since the system forgets what happens deep inside the loop (Figure 2H). Between the small and large loops, stands a class of intermediate loops that are difficult to calculate in the frame of the matrix formalism. Fortunately, most of the loops involved in gene regulation belong to one of the two calculable classes of small or large loops. For example, one may encounter sophisticated loops of intermediate classes in single-stranded RNA folding (52), but hardly in gene regulatory systems of our interest where large transcription factors bind stiff double-stranded DNA.
Small loops
In many cases, gene regulatory events are limited to a relatively small unpacked DNA segment accessible for the regulatory proteins. In this case, only simple loops shown in Figure 2G may be formed. Let the DNA unit may be either inside the loop, or outside of the loop. The looped segment consists of the necklace (the two polymer segments brought together by protein crosslinks) and the loop itself (the polymer units between the starting and ending crosslinks, numbered according to their position in the loop). Let the largest loop consist of NL units, and the longest necklace has Nc crosslinks. Then for a situation shown in Figure 2G, a DNA unit may be in RG states:
|
| (10) |
The first crosslink starting a loop of length l is assigned an additional statistical weight w(l). For a loop much larger then the DNA persistence length, the statistical weight may be taken in the general form w(l) = const x l
(49). For smaller loop lengths we may either use the tabulated experimental values (12,56), or try to calculate protein-dependent DNA looping on the basis of a rigorous statistical mechanics (54). In the small loops model, we do not assign the loop length before calculations. Knowing the sequence and the binding constants, we may predict, which loop configuration is the most probable one. However, as mentioned above, this model poses large computational difficulties, which are a hard nut to crack for the loop lengths larger than several hundreds of units.
Large loops
There are several reasons for distinguishing large loops from the previous case. First, when the loop is much larger then the persistence length, the energy of its formation almost does not depend on the local bending, twisting, etc. as in the case of small loops. Second, due to the compact DNA packing in vivo, large loops are also not that sensitive to the entropic contributions since the large loop is not given the whole space it would require for an arbitrary set of conformations. The loop may be formed only at a specific predefined position, compartmentalized both in the 3D space of the chromatin and in the 1D space of the genome sequence. The last but not the least argument is that the class of large loops is actually abundant in gene regulation (55).
The large loops may be formed by different mechanisms: either by a single protein cross-linking two DNA segments, as in the case of LacI repressor of E. coli (Figure 2H, top), or by a multilayer assembly of several proteins creating a bridge between two DNA segments, as in the phage
CI repressor (Figure 2H, bottom). In both cases, the loop is formed by connecting the DNA regions that have specific affinity for the cross-linking proteins. The two DNA sequences are predefined for a potential loop closure. We may still consider this sandwich as a 1D system. Since we know where the complementary bottom and top segments start and end, the unit n belonging to the bottom DNA segment uniquely determines the unit n in the top DNA segment in front of the unit n (Figure 2H, bottom).
Let us construct the enumeration of states of a DNA unit based on the concept of multiple protein layers. The states of the DNA unit indicate whether a protein is bound at a given position in each layer. The crosslink between the two DNA segments is formed when proteins fill the corresponding vacant places in both layers. The state corresponding to the first crosslink is assigned an additional statistical weight characterizing the probability of the loop formation (wloop = exp(
Gloop/(RT)), where
Gloop is the energy of the DNA loop formation). The states corresponding to the protein of type g bound at the first layer are characterized by the binding constants Kng. The l-bp gap between the proteins g1 and g2 belonging to the same layer is assigned a statistical weight w(l, g1, g2) as in the long-range interactions model. The vertical contact between the proteins g1 and g2 belonging to the first and second layers is assigned a statistical weight w
(g1, g2). The contacts of the second layer proteins with the DNA are characterized by the binding constants Kng. When we have more then two protein layers, the description is analogous. The detailed states enumeration for the large loops model is given in Table S3 (Supplementary Data). A double-layer loop shown in Figure 2H, is characterized by RH states for a DNA unit:
|
| (11) |
The behavior of the DNA loops has been analyzed in detail for the case of non-specific protein binding (2,56,57). The general feature of the models G and H in Figure 2 is the following. For a large number of potential cross-linking sites, DNA looping occurs abruptly at a critical protein concentration. The loop is stable in a large interval of protein concentrations. At very high concentrations, DNA again unloops because the multiprotein structures are assembled at both DNA segments instead of joining two DNA segments. In the case of site-specific binding, the system bears these general non-specific features, but also provides a possibility of a delicate combinatorial control of gene regulation, as we will see in the next section.
Thus, we have solved all basic models in Figure 2, and showed how to construct the other models as modifications or combinations of the basic ones. Now let us perform calculations for a concrete genetic system.
CALCULATION OF GENE REGULATION AT OR OPERATOR OF PHAGE
|
|---|
|
|
|---|
The binding events at OR operator of bacteriophage
control the famous genetic switch from the lysogenic state (when
peacefully lives inside the infected E. coli) to the lytic state (when
duplicates itself in a large number of copies leading to the lysis of the host cell). Two regulatory proteins, the CI and Cro repressors, act at OR operator. In the lysogenic state, the cro gene coding the Cro protein is
95% suppressed, while the cI gene coding the CI protein is on. The CI protein aims to maintain its own expression and to switch off all other genes. CI domination determines that the phage is in the lysogenic state. When the host cell is damaged or irradiated, its SOS system activates RecA protein that stimulates self-cleavage of CI. This leads to induction of the lytic state of phage
. In the lack of CI, Cro dominates at OR, maintaining its own expression, switching on early lytic genes, and suppressing the cI gene (46).
The CI and Cro proteins homodimerize in solution due to C-terminal domains and bind DNA as dimers using a helix-turn-helix motif in the N-terminal domains. The dimers may adsorb on DNA non-specifically (36) or bind sequence-specific sites (45). The specific binding constants are orders of magnitude larger then the non-specific ones. The CI and Cro dimers cover 17 bp upon binding to DNA. The structure of OR operator is shown in Figure 3. The OR operator consists of three 17-bp specific binding sites for Cro and CI proteins, enumerated OR1, OR2 and OR3. Each site may bind Cro and CI with different affinities. The OR site overlaps with the promoters PRM and PR. The
subunit of the RNA polymerase (RNAP) binds to the 10 and 35 recognition regions at the promoter, making
35 bp inaccessible for binding by other proteins. RNAP binding to PRM starts transcription of CI protein (direction to the left from OR in Figure 3A). RNAP binding to PR starts transcription of Cro and other yearly lytic proteins (to the right from OR). PR overlaps with OR1 and OR2. Thus, binding of repressor proteins to any of OR1 and OR2 sites precludes RNAP binding at PR. PRM overlaps with OR3 and borders upon OR2. RNAP bound at PRM contacts with CI dimer bound at OR2. This contact activates transcription from PRM. The CroCro and CICI contacts are also energetically favorable (45).
|
The binding scheme shown in Figure 3A allows 40 distinguishable arrangements of three proteins, Cro, CI and RNAP among three binding sites OR1, OR2 and OR3 (45). Several years ago it seemed natural that this picture completely describes the regulatory events at OR operator (58,59). However, then it was shown that it should be further complicated to take into account the interaction of OR with another operator OL situated
2.4 kb from OR. OL may be linked to OR thought DNA looping due to bridging by CI proteins (Figure 3B). The OL operator consists of three binding sites OL1, OL2, OL3 similar to OR, and overlaps with promoter PL symmetrical to PR. Most of the binding energies for this system are known from the experiments. The energies of non-specific binding of Cro and CI are 4.2 and 4.1 kcal/mol correspondingly (36). Cro binds OR1, OR2 and OR3 sites with the energies 12.0, 10.8 and 13.4 kcal/mol (45). CI binds OR1, OR2 and OR3 sites with the energies 12.5, 10.5 and 9.5 kcal/mol correspondingly (45).
The cooperativity of repressor binding
Let us first look into the competitive cooperative binding of Cro and CI at the region of lambda-phage (
-phage) sequence 37 93038 030 containing the OR operator and the flanking regions including PRM and PR promoters (58). In order to simplify the system, in the first series of calculations (Figures 46![]()
) we are considering a mutant without the spacers between the OR1, OR2 and OR3 binding sites shown in Figure 3A. (A detailed treatment of the intact lambda (
) sequence will be provided later in Figure 7). On the other hand, a small 3-bp overlapping of OR1 site and PR promoter is surely critical for the behavior of the system, and we have to consider this in all calculations.
|
|
|
|
Figure 4 shows the maps of binding calculated for the set of energies chosen above and the concentrations characteristic to the lysogenic state of phage lambda (phage
): [CI] = 0.2 µM, [Cro] = 0.02 (60). Figure 4A shows how the system of two regulatory proteins, Cro and CI would behave in the absence of cooperativity. Since Cro-binding constants for OR2 and OR3 are larger then CI binding constants, Cro dominates at these sites when Cro and CI are at comparable concentrations. That is not what is observed in the experiments where Cro domination would mean the end of the lysogenic state.
Figure 4B shows the results of the calculations for the same system taking into account proteinprotein interactions in the MvH model (22). The proteins interact with their left and right neighbors symmetrically. The energies of CICI and CroCro interactions are set as 2.5 and 0.3 kcal/mol correspondingly. These values are in line with the experimental data (45). Taking into account the MvH cooperativity dramatically changes the non-cooperative map of binding, the OR1 and OR2 sites are now covered by CI proteins with almost the same probability, while only OR3 is left for Cro binding. This is already closer to the experimental situation in the lysogenic state of phage
. However, now the cooperativity is too crude: CI proteins not only bind OR1 and OR2, but also tend to occupy the adjacent promoter regions, which should be the targets for RNAP. A small but not negligible Cro presence at OR2 is also not compatible with the discrete nature of the lambda-switch (
-switch).
Experimentally, the interactions between CI dimers are not symmetric as in Figure 4B. CI dimers bound to adjacent DNA sites interact with each other due to a direct contact between C-terminal domains. There is only one recognition domain per dimer and hence, once a pair of interacting dimers is formed, the third dimer binds the adjacent site non-cooperatively (Figure 4C). On the other hand, CroCro interactions are weaker and do not show such asymmetry. Darling and coauthors (45) determine different energies of CroCro interaction for the proteins bound to OR1OR2 and OR2OR3 sites. However, most of the literature treats CroCro interactions as symmetric and does not provide any structural information for the asymmetry as in the case of CI dimers (46). The different cooperativity for different Cro-binding sites may just reflect the fact that the interactions are length-dependent (the OR1OR2 and OR2OR3 spacers are equal to 7 and 6 bp correspondingly). Since we cannot distinguish length dependence from asymmetry in the model with deleted spacers, in Figures 46![]()
we will assume that CroCro interactions are weak and symmetrical. In these calculations, we set the CroCro interactions equal to 0.3 kcal/mol based on the cooperativity value 0.9 kcal/mol for a complete saturation of three OR sites by Cro (45). Later in Figure 7, we will take care of the length-dependent characteristics of all proteinprotein interactions.
Thus, in the frame of the matrix formalism, we assign two types of complexes to CI binding (g = 1, 2) and one type to Cro binding (g = 3). The different complexes are indicated in Figure 4C. This situation corresponds to the following set of the contact cooperativity parameters: w02 = 0, w22 = 0, w33 = 1.7 (calculated from the experimental CroCro interaction energy 0.3 kcal/mol), w12 = 74.5 (calculated from the experimental CICI interaction energy 2.5 kcal/mol) (45). All the rest contact cooperativity parameters wij are units. According to Equation (7), the transfer matrix constructed for each DNA unit has R x R elements, R = 3 x 17 + 1 = 52.
Figure 4D shows the map of binding calculated according to the scheme in Figure 4C. The energies of interaction used in Figure 4D, are the same as in Figure 4B. We see that a subtle change in the model (the asymmetry in CICI interactions) leads to the distinguishable changes in the map of binding. Substantial efforts have been undertaken to construct a proper lattice model for cooperative interactions in this system (61,62). The treatment of such effects is quite natural in the matrix method. We construct the transfer matrix for a general situation and then just set unique weights for all conceivable proteinprotein contacts.
The combinatorial control of RNAP
Now let us add the RNAP to the system of
OR, Cro and CI. RNAP forms an additional complex, g = 4, which covers m4
35 bp. According to Equation (7), the system is now characterized by the transfer matrices with R = 3 x 17 + 35 + 1 = 87 states.
In all following calculations, we take [RNAP] = 3 µM, which corresponds to the lysogenic
state (63). The energies of RNAP binding to PRM and PR promoters are equal to 11.5l and 12.5 kcal/mol correspondingly (45).
The non-specific RNAP-binding constant differs from 102 to 107 depending on the ionic conditions (10,64). In our calculations, we set it equal to 1 x 103 M1, which is close to the experimental value of non-specific holoenzyme binding to double-stranded DNA at 0.01 M MgCl2, 0.2 M NaCl (64). This binding constant determines that
10% of RNAP is bound to DNA non-specifically at 3 µM concentration. However, our calculations indicate that the
-switch is quite robust concerning the choice of the non-specific binding constants. The pattern of site-specific binding is almost unaffected since the site-specific binding constants are more than five orders of magnitude larger then the non-specific ones. RNAP binding to DNA is required but is not enough to start transcription. Two other events are important as well: (i) the RNAP-binding site should contain a promoter, and (ii) the activator binding may be required to stimulate RNAP.
Early studies have showed that PRM is maximally stimulated (
10-fold) only when OR1 and OR2 are both occupied by CI dimers (58). The role of CI bound at OR1 in stimulating PRM is primarily to promote repressor binding to OR2 through cooperative CICI interaction (58). Thus, the activating action of CI on RNAP comes from a single RNAPCI contact. The RNAPCI contact does not alter the initial RNAP binding to DNA, but it induces the conformational changes in the DNARNAP complex helping the open complex formation (65). The open complex formation is one of the compulsory steps in transcription initiation. This multistep reaction may be characterized by a pseudo first-order equilibrium (66). It is this change in the rate of the open complex formation that we may roughly associate with the experimentally observed 10-fold PRM stimulation by CI. In the frame of our formalism, it is equivalent to setting w(0, RNAP, CI) = 10.
Figure 5 shows the maps of binding of CI, Cro and RNAP at OR operator and its flanking sequences, calculated for the set of parameters chosen above. Figure 5A and B corresponds to the situation when RNAP and CI do not interact with each other. Figure 5A is calculated for the typical lysogenic concentrations: [CI] = 0.2 µM, [Cro] = 0.02 µM. Under these conditions, CI repressors are always bound to OR1 and OR2 sites, and RNAP covers PRM promoter with
80% probability. The probability of RNAP binding to PR promoter is vanishingly small, as it should be expected for the lysogenic state. The cI gene is switched on by the PRM promoter, while the cro gene regulated by PR promoter is off. In this situation, CI provides a positive regulation of its own gene. The more we have CI proteins, the higher is the probability of RNAP binding to PRM, the higher is the level of cI expression and the more we have new CI proteins. At a higher Cro concentration (Figure 5B) RNAP binding to PRM is suppressed. A 20-fold increase in Cro concentration determines 8-fold decrease of RNAP binding to PRM, as seen from the comparison of Figure 5A and B.
Figure 5C and D is calculated taking into account interactions between RNAP and CI with w(0, RNAP, CI) = 10. A favorable RNAPCI contact allows these proteins to collaborate against Cro. Cro binding to DNA is almost completely suppressed at 0.02 µM (Figure 5C). Figure 5D corresponds to Cro = 1 µM. It shows that although increasing Cro concentration helps competing with RNAP, the probability of RNAP binding to PRM operator is still
30%. Thus, the CIRNAP contact makes the
switch less sensitive to the Cro influence.
The role of DNA looping
Now let us consider the possibility of the DNA loop formation between the operators OR and OL (Figure 3B). According to Table S3, we need additional parameters to construct the transfer matrix for the large loops model. We have to set the energies of CI, Cro and RNAP interaction with OL operator, the energy of the loop formation and the energies of the vertical CICI contacts. Three binding sites at OL, OL1, OL2, OL3 are symmetrical and almost energetically equivalent to the corresponding binding sites at OR operator. The structure of the promoters surrounding OL and OR operators differs. OL has only one promoter PL (symmetrical to PR) which binds RNAP with the energy 12.5 kcal/mol. There is no direct experimental data on the energy of the OLOR loop formation. We set the energy of the loop formation equal to
Gloop = 12 kcal/mol which results in wloop = 1 x 109. This value is close to the asymptotic value for length-dependent DNA loops studied for the constructs based on the lac-repressor system (50). The experimental data on the vertical CICI interactions in CI octamerization and DNA loop formation are not that clear as the values for the horizontal CICI interactions (67,68). We use here the estimate that the energy of the vertical CICI contact is equal to the energy of the horizontal CICI contact. The single-layer protein interactions are still described by the model in Figure 4C and the parameter set above. According to Equation (11), taking into account DNA looping between the OL and OR operators results in the weight matrices of R x R elements, with R = 518.
Figure 6 shows the maps of binding at OR operator calculated in the frame of the large loops model taking into account the possibility of OROL contact due to bridging by CI proteins. Figure 6A is calculated for the typical lysogenic concentrations ([CI] = 0.2 µM, [Cro] = 0.02 µM). The CI proteins occupy the OR1 and OR2 sites with almost 100% probability, while the probability of the OLOR loop formation is
70%, and the probability of RNAP binding at PRM promoter is
80%. When we add 1 µM Cro, the level of PRM expression is still above 40% (Figure 6B). Only a five-micromolar Cro concentration is enough to suppress RNAP binding to PRM (Figure 6C). This is much larger then the typical concentrations in vivo. Yet, this would suppress RNAP binding not only to PRM, but also to PR. Thus, we cannot switch
from lysogeny to lysis just playing with Cro concentrations. Even a large number of Cro proteins occasionally trapped inside the cell will not trigger the
-switch. Only the cleavage of CI repressors may act as a trigger.
These calculations provide an argument in support to the recent mutation studies, which show that cro gene might be unimportant for the lysogenic to lytic switch, and that Cro's primary role in induction is instead to mediate the weak repression of the early lytic promoters (69). We see that one of the functions of the OROL loop is to make the lysogenic state more stable by decreasing its dependence on Cro fluctuations. That is not the only function of the OLOR loop.
Figure 6D shows what happens if CI is overexpressed during the lysogenic
state. At high CI concentrations, CI occupies OR3 instead of RNAP. This is because the vertical CICI assembly may take place not only at OR1 and OR2 but also at OR3. An additional vertical contact between CI dimers bound to OR3 and OL3 sites may recruit a CI tetramer that strongly competes with RNAP at PRM. At lower concentrations CI occupies only OR1 and OR2 while OR3 is left for Cro/RNAP competition (Figure 6AC).
The situation shown in Figure 6D corresponds to the decrease of CI synthesis from PRM by the OLOR loop due to a too high CI concentration. Thus at high CI concentrations this protein implements a negative regulation of its gene cI. Taken together with the positive regulation at small CI concentrations (70), this provides a stable mechanism of the lysogeny maintenance: the concentration of the regulatory proteins always tends to be at the lysogenic level. Only external signals leading to the degradation of CI may switch the phage from the lysogeny to lysis, where PR promoter becomes active instead of PRM, and Cro proteins start to increase their own synthesis.
Figure 6 shows that the OLOR loop is formed only at high-enough lysogenic CI concentrations. The probability of the OLOR loop formation is sufficiently lower in the lytic state. For example, our calculations give just 0.1% to the probability of the loop formation at [CI] = 0.01 µM, [Cro] = 0.1 µM, [RNAP] = 3 µM. Thus at small CI concentrations in the lytic state the OLOR loop may be neglected. Still, the switch from the lysogenic to lytic
state should be discrete. What, if not the loop, contributes to this discreteness? In order to answer this question, in the next section we look at the fine structural details of the intact
OR.
The role of a range of inter-protein distances
Now let us perform calculations for the intact
OR sequence. Remember, that OR1 and OR2 are separated by a 6-bp spacer, while OR2 and OR3 are separated by a 7-bp spacer, the details that we have skipped in our previous calculations. Therefore, we now have to use the long-range interaction model for the CICI and Cro






