A computational genomics approach to the identification of gene networks
A computational genomics approach to the identification of gene networksAndreas Wagner*
The Santa Fe Institute, 1399 Hyde Park Road, Santa Fe, NM 87501, USA
Received June 5, 1997;Revised and Accepted August 4, 1997
ABSTRACT
To delineate the astronomical number of possible interactions of all genes in a genome is a task for which conventional experimental techniques are ill-suited. Sorely needed are rapid and inexpensive methods that identify candidates for interacting genes, candidates that can be further investigated by experiment. Such a method is introduced here for an important class of gene interactions, i.e., transcriptional regulation via transcription factors (TFs) that bind to specific enhancer or silencer sites. The method addresses the question: which of the genes in a genome are likely to be regulated by one or more TFs with known DNA binding specificity? It takes advantage of the fact that many TFs show cooperativity in transcriptional activation which manifests itself in closely spaced TF binding sites. Such `clusters' of binding sites are very unlikely to occur by chance alone, as opposed to individual sites, which are often abundant in the genome. Here, statistical information about binding site clusters in the genome, is complemented by information about (i) known biochemical functions of the TF, (ii) the structure of its binding site, and (iii) function of the genes near the cluster, to identify genes likely to be regulated by a given transcription factor. Several applications are illustrated with the genome of Saccharomyces cerevisiae, and four different DNA binding activities, SBF, MBF, a sub-class of bHLH proteins and NBF. The technique may aid in the discovery of interactions between genes of known function, and the assignment of biological functions to putative open reading frames.
INTRODUCTION
The ultimate challenge to molecular biology is to identify and fully characterize the complete network of interactions among genes and their products in an organism. In facing this challenge, the wealth of information created by genome sequencing efforts will be an invaluable resource. However, our ability to extract biologically important information about gene interactions from genome sequences is still quite limited. Most of the biological interpretation of genome sequences pertains to the number and types of genes in an organism. Sorely needed are novel approaches that permit the formulation of experimentally testable hypotheses about gene interactions from sequence data alone. The advantage of such approaches are clear. They could vastly improve efficacy of experiments by pointing out likely candidates for interacting genes.
In devising such tools, the fundamental question is: what types of gene interactions leave traces on the DNA, traces that could lead to the identification of interacting gene products. Maybe the prime candidate for such interactions is the transcriptional regulation of protein coding genes in eukaryotes. Here, transcription factors (TFs) bind enhancer sequences near the coding region of a gene, recruit a basal transcription machinery to the transcription initiation site, and activate the transcription of the gene (1 ). Alternatively, TFs can repress transcription of a gene by interfering with the basal transcription apparatus in various ways (2 ). The common theme is that the binding of TFs to specific, often short sequences on the DNA is necessary for transcriptional regulation. Undoubtedly the predominant mechanism regulating gene expression in eukaryotes, transcriptional regulation accounts for an enormous number of gene interactions. The availability of an efficient tool for the analysis of genes that are regulated by a given TF would thus permit analysis of a significant part of the global network of gene interactions. It would put cell biology a large step closer to its ultimate goal.
Naively, one might assume that it is sufficient to look for binding sites of specific TFs near a gene to identify candidate genes for regulation by the TF. This approach is standard practice on a small scale, and its extension to entire genomes is straightforward (3 ). However, for many known enhancer sites, it is also deeply problematic. For example, the minimally functional binding site of the heat shock transcription factor (4 ,5 ) occurs more than 106 times in the genome of Saccharomyces cerevisiae (unpublished observation). The promoters of most genes would contain one or more such binding sites, making any biological conclusions based on binding site occurrence meaningless. Is there a modification of this approach that would render it useful? It has long been recognized that most transcriptional regulators display (homotypic or heterotypic) cooperative interactions, either when binding DNA, or when activating transcription. Cooperativity is usually reflected in the occurrence of multiple closely spaced binding sites on the DNA (6 ). The approach introduced below takes advantage of the ubiquity of cooperative interactions to identify genes putatively regulated by given TFs. Its basic tenet is that groups (`clusters') of TF binding sites linked much more tightly than expected by chance alone, are probably relevant to the transcriptional regulation of a nearby gene. The central problem is to find a statistically sensible definition of a highly significant cluster of binding sites. It will be seen below that common plausibility arguments about the significance of binding site clusters can be quite misleading, if one takes the genome-wide distribution of binding sites into account. In only accepting the statistically most significant groups of binding sites, it is attempted to minimize the method's false positive rate. In addition, various sources of biological information are incorporated into the analysis, information that is likely to decrease this rate further. However, the price paid for such conservativism is that many genes regulated by a TF may not be detected. It is a price well worth paying, given that a conservative approach will generate candidate genes that seriously merit further experimental investigation.
A well known general problem in the analysis of DNA sequences is the enormous heterogeneity of sequence composition, which violates assumptions needed for most conventional statistical techniques (7 ,8 ). Any statistical approach to the analysis of DNA sequences will thus provide only a crude assessment of sequence properties. The method used here cannot altogether avoid the problems of sequence heterogeneity, but attempts to alleviate them by taking both global (genome-wide) and local sequence properties into account.
While the technique is applicable to any eukaryote, it is here illustrated with the genome of S.cerevisiae. The main reasons are that potential yeast promoter regions are in general short and located upstream of the coding region (9 ,10 ), and that the yeast genome does not contain many tandemly repeated sequences other than rDNA and CUP1 genes (11 ). Four different applications are illustrated with different yeast DNA binding proteins. They include, but are not limited to the identification of novel interactions among genes of known function, and the putative assignment of biological function (cell cycle regulation, etc.) to ORFs with unknown function. The particular choice of four DNA-binding proteins (out of the ~75 characterized to date) was motivated by (i) their well characterized DNA binding sites, (ii) the length of their binding sites (for methodological reasons discussed below), and (iii) the variety of applications that they can illustrate. Needless to say, all candidate gene interactions identified by the method have to be tested experimentally. However, while tentative, the results may aid in sifting through the astronomical number of possible gene interactions, and identify candidates worthy of experimental investigation.
STATISTICAL METHODS
This section illustrates the statistical techniques used to identify highly significant clusters of transcription factor binding sites which are then further analyzed using biological information about the respective transcription factors. The general approach has three steps. First, significant clusters of particular binding sites are detected by what is referred to as a `genome walk' analysis. Second, some of the clusters thus identified are eliminated from further consideration because of their location in the genome. Third, the statistical significance of the remaining clusters is reassessed on the basis of local sequence composition. By taking both global and local sequence properties into account, it is attempted to alleviate problems caused by compositional heterogeneity of DNA. Both the first and the third step critically depend on methods to estimate the probability of binding site occurrence on the DNA. These methods are therefore discussed first. Then, the three steps are explained in greater detail.
Estimates of the probability of site occurrence
What is the probability that a random oligonucleotide with compositional features similar to those of genomic DNA, and with the same length as the binding site of interest, matches that site? To ensure wide applicability of the technique, conventional consensus sequences are used here instead of position weight matrices (PWMs; 12 ,13 ) for binding sites, because very few transcription factors are sufficiently well characterized to allow construction of a PWM. When addressing the above question, one has to take into account that functional transcription factor binding sites (i) may occur in either orientation on the DNA, (ii) may have relaxed sequence requirements at some positions, as reflected by standard IUB nucleotide codes (14 ), (iii) in addition to such `ambiguous' positions, may show a substantial number of mismatches to their consensus binding site.
The relative frequency of a binding site S of length l (an l-word) in a DNA sequence of N nucleotides is denoted by pS, and determined by dividing the number of word occurrences NS in that sequence by the maximally possible number N - l + 1, i.e.,
1
Special cases are the mono- and dinucleotide frequencies pA, pC, pG, pT, pAA, . . . , pTT. The relative frequencies of a word with exactly k or at most k mismatches to a given word S of the same length are denoted as pSk, and pS[les]k, respectively, where pS = pS0. Obviously,
2
The corresponding statistical predictors of the probabilities of word occurrence will be denoted as ^pS, ^pSk and ^pS[les]k.
Global predictor based on site counts. Here, the predictor ^pS[les]k of site occurrence probability is the relative frequency ^pS[les]k, as determined by equations 1 and 2, for an admissible number of mismatches, k. Under the Poisson model of site distribution, where the probability of observing k sites in a DNA sequence of length N is given by
3
^pS = pS (given by equation 1) is a maximum likelihood estimator of the distribution parameter [lambda]. One has to count a large number of sites to ensure a narrow confidence interval for this [lambda] (15 ). Given that many transcription factor binding sites are longer than 10 bases (16 ), very large amounts of sequence may have to be analyzed to ensure a narrow confidence interval. To maximize site count, ^pS was not determined for each yeast chromosome separately, but for all 16 chromosomes together.
Prediction based on mononucleotide frequencies. For an oligonucleotide generated by independently and randomly selecting successive letters from an underlying alphabet, the predicted probability ^pS is simply the product of the letter frequencies, pA, . . . , pT. ^pS[les]k is calculated via equation 2. To calculate individual ^pSi's, one sums the respective probabilities over all i-tupels of positions where i mismatches can occur. For example, to calculate ^pS2 for the 8-word 5'-CACWANAA-3', one has to sum over
configurations of sites at which two mismatches can occur. To predict the probability of finding a word with mismatches at positions, say, 1 and 4, one calculates
(1 - pC)pApC(1 - pW)pA pA pA, where pW = pA + pT.
Prediction based on dinucleotide frequencies. In this case a DNA sequence is viewed as a sequence of letters generated as a first-order Markov chain (17 ). The probability of finding a particular word S, say 5'-CACTAA-3' is then predicted as
For words S containing positions with relaxed sequence requirements (W, N etc.), and k permissible mismatches to the consensus, all words were explicitly generated that fulfill the sequence requirements, and contain only letters A through T. Their respective probabilities were calculated using the above formula with observed mono- and dinucleotide frequencies, and added to obtain ^pS[les]k.
So far, for all three predictors, only the probability of encountering the word S, and not that of its equally functional reverse complement --S was given. For palindromic words, where S = --S, and for k = 0 allowed mismatches, the predicted probability of encountering the word or its reverse complement is simply ^pS itself, because whenever S occurs, S will occur as well. For non-palindromic words, and for k > 0, the situation is more complicated because there may be non-palindromic words, e.g., 5'-GAWTTC-3', that admit some palindromic matches, 5'-GAATTC-3', and some non-palindromic matches, 5'-GATTTC-3'. In such cases, the quantity ^pS + ^p--S will over-estimate word probability by as much as a factor of two, because it counts the palindromic word occurrences twice. However, because the binding sites to be analyzed below are either perfect palindromes, or contain features that prohibit palindromic matches, such as strong asymmetries, overestimation of site probability is not likely to be a problem here.
The next three sections list the principal steps of the statistical analysis carried out here.
Step 1. Identification of binding site clusters by genome walk analysis
The most simple, albeit problematic, null-hypothesis of binding site distribution is the Poisson approximation (equation 3). It can be violated for two reasons, the first of which is the structure of the sites themselves. Very short sites, long sites in which a large number of mismatches is allowed, or sites with a repetitive structure (e.g., 5'-GGGGG-3') will not follow a Poisson distribution even in random DNA with independently distributed nucleotides. However, this is not a problem for the sites studied here (see next section). The second reason for deviations from the Poisson approximation is compositional heterogeneity and the complex statistical structure of DNA. It is addressed in step 3 below. In step 1, however, statistically significant clusters of transcription factor binding sites are identified by testing site spacing against the null-hypothesis of a Poisson distribution.
Denote as X1, . . . , Xn the positions at which a site S or its reverse --S complement are encountered on the DNA. Further, define as X0 the beginning (5' end of the top strand) of the DNA sequence. The quantity
Di,j = Xj - Xi
denotes the distance between site Xj and Xi.
4
is the length of a stretch of DNA spanning exactly k words. It will be referred to as a k-cluster. Under the Poisson null-hypothesis equation 3, the distribution of the distance between successive words, Di, i+1, is exponential with density
[lambda]e-[lambda]z .
5
This is the probability distribution of the length of 2-clusters. More generally, the length of k-clusters follows a Pearson type III distribution with density
6
where [Gamma](k) = (k - 1)! is the gamma function. This is easily seen from the characteristic functions of equations 5 and 6 (18 ). The probability of observing a k-cluster of length less than x is
7
To assess whether the length, x, of an observed k-cluster, Di, i+k-1, is shorter than would be expected `by chance alone' under the null-hypothesis, and for a given significance level P, equation 7 is used to determine whether
Prob(Di,i+k-1 < x) < P.
8
The appropriate choice of P is discussed below.
The parameter [lambda] needed in the above statistical tests was estimated here via relative site frequencies in the genome. However, from each pair of overlapping sites only one site was (randomly) chosen, and included in the absolute site count NS + N--S. This was done because in general only one of two overlapping sites can be functional, i.e., occupied by a TF at any given time. In terms of the statistical analysis, it leads to more conservative significance tests, because very short and thus highly significant 2-clusters are eliminated. Starting at X0, the lengths of all k-clusters up to k = 11, i.e., D0,1, D0,2, . . . , D0,10, was determined. If for any of these k-clusters equation 8 was true, the cluster was retained for further analysis. This procedure was repeated for clusters starting at X1 (D1,2, D1,3, . . . , D1,11), X2, through Xn-10, hence the name `genome walk' analysis.
For all binding sites analyzed here, except those for the transcription factor MBF, a significance level of P = 0.001 was chosen, because of the large number of site counts, and thus the large number of significance tests to be carried out. For example, for a TF with a genomic site count of NS + N--S = 5000, there are ~500 non-overlapping 10-clusters, and thus 500 independent significance tests for 10-clusters. A value of P = 0.05 or P = 0.01 would lead to a high type I error probability. The particular choice of P is motivated by the counts observed for the binding sites studied here (103-104 per genome), such that P is of the order of the number of independent tests carried out for a given cluster size k.
Binding site counts and tests for Poisson distribution in S.cerevisiae genome and in random DNA
Site
Yeast genome
Random DNA
Mismatches allowed
No of sites
[chi]2 (df)
G (df)
[chi]2 (df)
G (df)
SBF
5'-CACGAAAA-3'
1
15331
26.40 (10)a
25.37 (10)a
3.78 (10)
4.32 (10)
MBF
5'-ACGCGT-3'
0
692
2.61 (6)
2.57 (6)
7.94 (7)
8.48 (7)
bHLH
5'-CACGTG-3'
0
953
7.28 (6)
7.29 (6)
3.01 (7)
3.21 (7)
NBF
5'-ATGTGAAAT-3'
1
5509
16.34 (9)
16.52 (9)
12.43 (9)
13.17 (9)
aSignificant at 0.005 < P < 0.001
Step 2. Elimination of some statistically significant clusters
Yeast transcriptional regulators function in general only when bound upstream of the coding region (9 ,10 ), with the possible exceptions of the transcription of Ty retrotransposons (19 ). Moreover, regulatory regions that lie interspersed among various genes and in enormous distances from the gene they regulate seem to be absent or infrequent in S.cerevisiae (9 ). Thus, statistically significant clusters were not considered further, if they (i) overlapped or were located inside exons, and (ii) if they occurred downstream of both adjacent open reading frames (ORFs).
Step 3. Analysis of remaining clusters based on local sequence composition
Estimating [lambda] via actual site counts in step 1 is necessary because global sequence composition is a poor predictor of site occurrence (20 ). However, local biases in sequence composition may affect the local probabilities of site occurrence, and thus the actual significance of the detected clusters. Thus, in the last step of the analysis, DNA mono- and dinucleotide composition was analyzed in each of the remaining clusters, or in a 500 bp window centered around the cluster, whichever was longer. Precisely those mono- and dinucleotides that occur in the binding sites will be overly frequent in small clusters. This is why a DNA segment larger than the actual cluster was used for small clusters. Two new estimates of [lambda], based on mono- and dinucleotide distributions in these regions were used to reassess the significance (equation 8) of the clusters remaining after step 2. In statistical terms, the underlying null hypothesis is that site distribution in the genome follows an inhomogeneous Poisson process, i.e., a Poisson process whose parameter [lambda] = [lambda](y) is a function of the location y in the genome (21 ). Higher order correlations among nucleotides were not taken into account for reasons of computational feasibility.
R-scan analysis
This statistical technique (20 ,22 ,23 ) can be used to assess on a global level whether words show a clumped distribution in genomic DNA. It uses only the extreme values of the distribution of Di,i+k (a k-scan in Karlin's terminology). Denote as mlk the lth smallest k+1-cluster, Di,i+k. R-scan analysis asks whether mlk is smaller than expected by chance alone under the Poisson null-hypothesis. The relevant formalism can be found in equation 5 of ref. 20 .
Goodness of fit tests for exponential distribution
Likelihood ratio and [chi]2 goodness of fit tests were carried out as described in ref. 24 (Ch. 17) to establish whether the lengths of Di,i+1 followed an exponential distribution. Estimates of [lambda] were based on global site counts. Williams' correction was applied to the likelihood ratio test (24 , p704).
RESULTS AND DISCUSSION
Several applications of the method introduced above are illustrated with different yeast transcriptional regulators. The first example concerns two transcriptional regulators, SBF and MBF (DSC1), known to regulate the expression of a large number of genes that are expressed in the late G1 phase of the cell cycle (25 ). Both factors are heterodimers that share a common subunit. However, their consensus DNA binding sequences differ (see Table 1 ), and they appear to regulate non-overlapping sets of genes (26 -28 ). SBF regulates the transcription of the HO endonuclease, the cyclins CLN1 and CLN2, and the putative cyclin HCS26 (29 ). MBF regulates a large number of DNA synthesis genes, the cyclins CLB5 and CLB6, the kinase SPK1, and the transcription factor SWI4 (25 ,28 ,30 ).
CONCLUSION AND OUTLOOK
The transcription factors studied here illustrate that despite a large number of transcription factor binding sites in a genome, the number of significant clusters of sites can be very small. Such clusters also show unexpected features, such as their preferred occurrence in non-coding regions. These features and the fact that the method presented here detects genes whose regulation by a given transcription factor was shown experimentally, indicate its usefulness. However, the critical question regarding the false positive rate of the method can only be decided by experimentally testing its predictions. This may be a challenging task, especially because presence or absence of a transcription factor alone may not be sufficient for regulation of a target gene. The availability of necessary cofactors may critically depend on the environment, or on the physiological state of a cell.
Many further applications of the method are conceivable, other than analyzing all characterized transcription factor binding sites in yeast. For example, the usefulness of the method can be considerably enhanced by not only considering homotypic cooperativity, but also heterotypic interactions at a promoter. That is, consider not only clusters of binding sites for one transcription factor, but also clusters of binding sites for different transcription factors. This extension of the method would require only a slight modification to the statistical approach. The method can also be applied to higher eukaryotes, where genomic sequences are now rapidly accumulating. Such an application will raise new challenges because of (i) the vastly larger genomes involved, (ii) the abundance of tandem repeats, (iii) the existence of regulatory regions interspersed between genes, and (iv) the often ill-defined location of coding regions. In these cases, existing complementary techniques (50 -52 ), e.g., techniques suitable to determine the location of likely promoter regions, will have to be used in conjunction with the method introduced here.
ACKNOWLEDGEMENTS
I would like to thank Bill Bruno, Patrik D'haeseleer, Catherine Macken and David Torney for invaluable discussions on the subject of this paper. The financial support of the Santa Fe Institute is gratefully acknowledged.
11 Olson, M.V. (1992) In Jones, E.W., Pringle, J.R. and Broach, J.R. (eds) The Molecular and Cellular Biology of the Yeast Saccharomyces. Vol. I. Cold Spring Harbor Laboratory Press, Cold Spring Harbor, New York.
34 Kaufman, P.D., Kobayashi, R. and Stillman, B. (1997) Genes Dev., 11, 345-357.MEDLINE Abstract
35 Sommers, C.H., Miller, E.J., Dujon, B., Prakash, S. and Prakash, L. (1995) J. Biol. Chem., 270, 4193-4196.MEDLINE Abstract
36 Johnson, P.F. and McKnight, S.L. (1989) Annu. Rev. Biochem., 58, 799-839.MEDLINE Abstract
37 Lamb, P. and McKnight, S.L. (1991) Trends Biochem. Sci., 16, 417-422.MEDLINE Abstract
38 Murre, C., McCaw, P.S. and Baltimore, D. (1989) Cell, 56, 777-783.MEDLINE Abstract
39 Johnston, M. and Carlson, M. (1992) In Jones, E.W., Pringle, J.R. and Broach, J.R. (eds) The Molecular and Cellular Biology of the Yeast Saccharomyces. Vol. II. Cold Spring Harbor Laboratory Press, Cold Spring Harbor, New York.
40 Dowell, S.J., Tsang, J.S.H. and Mellor, J. (1992) Nucleic Acids Res., 20, 4229-4236.MEDLINE Abstract
48 Paltauf, F., Kohlwein, S.D. and Henry, S.A. (1992) In Jones, E.W., Pringle, J.R. and Broach, J.R. (eds) The Molecular and Cellular Biology of the Yeast Saccharomyces. Vol. II. Cold Spring Harbor Laboratory Press, Cold Spring Harbor, New York.