Nucleic Acids Research Advance Access published online on October 15, 2008
Nucleic Acids Research, doi:10.1093/nar/gkn738
© 2008 The Author(s)
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
Evolutionary computation for discovery of composite transcription factor binding sites
Gary B. Fogel1,
V. William Porto1,
Gabor Varga2,
Ernst R. Dow2,
Andrew M. Craven2,
David M. Powers2,
Harry B. Harlow2,
Eric W. Su2,
Jude E. Onyia2 and
Chen Su2,*
1Natural Selection, Inc., 9330 Scranton Rd., Suite 150, San Diego, CA 92121 and 2Lilly Research Laboratories, Eli Lilly and Company, Indianapolis, IN 46285, USA
*To whom correspondence should be addressed. Tel: +1 317 277 9657; Fax: +1 317 276 6009; Email: chen_su{at}lilly.com
Received July 7, 2008. Revised September 5, 2008. Accepted October 2, 2008.
 |
ABSTRACT
|
|---|
Previous research demonstrated the use of evolutionary computation
for the discovery of transcription factor binding sites (TFBS)
in promoter regions upstream of coexpressed genes. However,
it remained unclear whether or not composite TFBS elements,
commonly found in higher organisms where two or more TFBSs form
functional complexes, could also be identified by using this
approach. Here, we present an important refinement of our previous
algorithm and test the identification of composite elements
using NFAT/AP-1 as an example. We demonstrate that by using
appropriate existing parameters such as window size, novel-scoring
methods such as central bonusing and methods of self-adaptation
to automatically adjust the variation operators during the evolutionary
search, TFBSs of different sizes and complexity can be identified
as top solutions. Some of these solutions have known experimental
relationships with NFAT/AP-1. We also indicate that even after
properly tuning the model parameters, the choice of the appropriate
window size has a significant effect on algorithm performance.
We believe that this improved algorithm will greatly augment
TFBS discovery.
 |
INTRODUCTION
|
|---|
Transcription factors (TFs) are key regulatory proteins that
are commonly associated with coexpressed genes. Following microarray
analysis, examination of the upstream regions of coexpressed
genes may lead to the identification of common features, such
as TF binding sites (TFBS) that facilitate coregulation (
1–3).
Computational approaches have been offered to assist in the
discovery of these elements (
4–6) including evolutionary
algorithms (
7–14). The low signal-to-noise ratio of the
generally short TFBS
n-mer sequences relative to the much longer
upstream region makes identification and prediction of TFBS
difficult. Improved modeling of the background sequence distribution
for upstream regions that do not contain TFBS can assist in
TFBS discovery by increasing the signal-to-noise ratio (
15).
For example, it is possible to make use of exhaustive calculation
to evaluate the frequency of occurrence of all possible
n-mers
to update a nucleotide probability matrix. The distribution
of
n-mers sampled from sets of genes known to be independently
regulated can be used as a basis for comparison to the entire
distribution of
n-mers sampled from a genome and any
n-mers
that are more overly abundant than the expectation are labeled
as putative TFBSs (
6,
16–20). This approach is guaranteed
to identify
n-mers with the highest
Z-scores (a length and composition
correct measure of similarity) if there are no substitutions
in the matching segments and if
n < 7 to afford computation
in reasonable time (
5). It is also possible to specify the
n-mer
as a probability matrix and utilize an iterative approach, such
as an expectation maximization (
21) or Gibbs sampling procedure
(
5,
22–27). These approaches can increase the
n-mer length
to
n > 8 nt but are also susceptible to local entrapment
during optimization. Improved models of the background sequence
distribution are also possible using methods, such as higher
order Markov models (
15) or discriminative seeding motif discovery
(
28), but these approaches are organism (or even gene cluster)
specific. Approaches for TFBS discovery that are not organism
specific, do not require exhaustive calculation, and report
back to the user-putative TFBS locations are required by the
community.
Previously, we developed an approach for TFBS discovery using evolutionary computation (EC) that met these conditions (29). Testing of the algorithm was with respect to single TFBS considered in isolation (Oct-1 and NF-
B). However it is known that composite binding sites exist in upstream regions, such as the binding site for NFAT/AP-1, a heterodimer form of a TF complex. In general, portions of TF complexes take the form of homodimers or heterodimers, with TFs binding to nonadjacent DNA sites due to DNA tertiary structure. It is therefore a technical challenge to predict the binding sites of TF complexes in silico, because the distance between individual TFBSs is not well understood and because different TFs can form complexes with different partners in different biological contexts. Despite these challenges, it is important to understand TF complexes and TFBSs, especially given that for most eukaryotes, TFs work in complexes and because the content of the TF complex is a reflection of the biological conditions of the cell.
 |
ALGORITHM
|
|---|
The code for EC from the previous effort (
29), for the discovery
of similar TFBS motif windows, one for every upstream
region, was adopted for TF complex discovery. The sequence information
contained in the grouping of these windows was used to calculate
a nucleotide likelihood matrix. Fitness was measured with respect
to a basis matrix where both a metric for similarity and a metric
for complexity were used. Additional details on the implementation
can be found in Ref. (
29). Revisions and additions to the previous
approach are detailed below and afford the possibility for composite
TFBS element discovery.
Complexity normalization
The previous approach to TFBS discovery made use of compositional complexity as a portion of the objective function. This remained unnormalized with respect to window size, and therefore maximum complexity was highly dependent on the choice of window size. The compositional complexity calculation was revised such that the maximum possible complexity score is calculated for the user-defined window size prior to EC. During evolution, each complexity score was rescaled between [0, 1] where the maximum possible complexity score was 1. This removed any potential bias in complexity relative to window size.
Ambiguous nucleotide assignment
Ambiguous nucleotide positions resulting from sequencing errors may exist in upstream sequences examined. We determined that these positions can have a significant and undesired effect on the way in which complexity is calculated when using the equation given in Ref. (30). Previously, any ambiguous positions were simply ignored during complexity and fitness calculation. Given this, Ns in the sequence windows could affect fitness by artificially increasing the complexity score for a solution beyond a theoretical maximum complexity sequence composed of nucleotides A, T, C, G. This was corrected by applying a score of 0.25 for any N under the assumption of possibly equal representation of A, T, C, G at that position (Figure 1). The adjustment can be observed in Figure 1C where the last position of sequence 3 is given as an N. Comparison of the scores for complexity and similarity under this revised condition relative to Figure 1A and B show that the revised scoring scheme separates a matched position (a C in position 8 of sequence 3, Figure 1A) from a mismatch (a T in position 8 of sequence 3, Figure 1D) from an N at that same position (Figure 1C) where there is still a possibility for either a C or a T. Any IUPAC symbols other than A, T, C, G or N remain ignored during calculation of fitness given that their occurrence in upstream regions is considered infrequent.

View larger version (17K):
[in this window]
[in a new window]
[Download PowerPoint slide]
|
Figure 1. (A–D) Complexity and similarity scoring improvement. (A) A set of sequences of identical sequence and length result correctly in a score of 1 for complexity and 1 for similarity. (B) The addition of a region of all N's in sequence 3 causes the complexity to remain the same at 1, while lowering the similarity to 0.67. (C) A score of 0.25 was used for any position that was an N, under the assumption that position truly represented an equal probability of A, T, C or G. This adjustment can be observed in (C) above where only one position (the last position of sequence 3) is an N. Comparison of the scores for (A) and (C) above demonstrates that this method now adjusts both complexity and similarity accordingly. The N position in solution (C) above might contain A, T, C or G. In the case where that position truly is a C, the resulting solution will score as given in (A) above. In the case where that position is A, T or C [shown as a T in (D) above], the solution receives a different and lower complexity and similarity score than where it is an N.
|
|
Bonus scoring for similarity
Within TFBSs it is generally accepted that core
sequences are conserved relative to flanking nucleotide positions
(
31). To incorporate this notion into our scoring method for
similarity, we added a central weighting bonus
to the similarity score. This bonus increases the overall fitness
of putative motifs that were similar in the core region rather
than at the ends. The weight of sequence similarity
was adjusted by position over a window following a Gaussian
distribution. Each column's similarity score was scaled according
to this arbitrarily chosen Gaussian distribution (
Figure 2).
We also incorporated an adjacent conservation
bonus for adjacent columns in the nucleotide likelihood matrix
with 100% conservation. A bonus of 1 was given for every two
adjacent columns that met this condition and this bonus was
applied only to the calculation of similarity (
Figure 3).

View larger version (7K):
[in this window]
[in a new window]
[Download PowerPoint slide]
|
Figure 2. Central weighting scheme for similarity scoring. For the example above, four hypothetical sequences of 8 nt each are shown. The score for their similarity is calculated and sums to 0.25. To this similarity score, a weight range is applied from [0.5, 1.0] where positions of similarity in the inner portion of the motif will essentially receive a bonus for their location relative to the outer positions. The original score is multiplied by the weight and the revised score is used (a sum of 0.3206) for similarity. Note that the weight distribution shown above is based on a Gaussian distribution and can be set to any range or distribution that the user desires.
|
|

View larger version (4K):
[in this window]
[in a new window]
[Download PowerPoint slide]
|
Figure 3. Adjacent conservation bonus. For the sequences here, a bonus of 3 is given for the three pairs (four columns) of adjacent A's that are 100% conserved. This dramatically increases the similarity score for this solution.
|
|
Self-adaptation
Self-adaptation of the number of variations performed at each
generation as well as the probability associated with each variation
operator was added to the EC. Self-adaptation is a method to
automatically evolve and tune the parameters associated with
the evolutionary algorithm concurrent with the main process
of evolution. In this regard, self-adaptation can be thought
of as a hierarchical or meta-level evolutionary
optimization process (
32). In cases where it is difficult to
assign proper weights for parameters by hand, self-adaptation
can save time and effort during parameter adjustment.
Automated clustering
A main deficiency of the previous approach for TFBS discovery was the requirement for hand-curation of the top solutions from the evolutionary algorithm into clusters of similar solutions. Clustering is time intensive but results in groups of similar putative motif sequences while simultaneously reducing redundancy of those solutions. We extended the previous approach with an automated clustering method (Figure 4). In order for two solutions to be grouped together in a cluster, we imposed a requirement that 90% of the sequences in each group must be identical. This threshold was set after repeated testing and analysis on the ability of the method to recapitulate hand-curated clusters for Oct-1 and NF-
B.
Parallelization
For parallelization, five Compaq Pentium III machines operating
with Linux RedHat 9 were used as a testing environment. Parallelization
was made using single master and multiple clients using a hub
and spoke architecture, where the master was the hub
and the clients were the spokes. No client processes were run
on the master. A user-specified number of client processes was
generated each with independent evolutionary optimization. After
the first 50 generations, the best results from each client
were sent back to the master for sorting and storage. After
subsequent 50 generation intervals, best client results were
sent to the master process where the results were added to the
best result list, the list was sorted and pruned to retain a
constant number of best results throughout the evolutionary
process. Upon completion of all client evolutionary optimizations,
the master process performed an automated clustering of results
and output the results to file for user interpretation.
Program implementation
For the experiments presented here, unless otherwise noted, a population of 15 parents and 30 offspring per parent was used with elitist selection. Evolution was allowed to proceed for 1000 generations using all of the above revisions and additions to the code, including central bonus weights, adjacent conservation and self-adaptation. Automated clustering was used to identify clusters of similar solutions. The window size was then varied starting from 25 and decreasing to 7 to determine if it was possible to find known solutions.
 |
RESULTS
|
|---|
Evaluation of code enhancement using Oct-1 and NF-
B
Complexity scoring
We tested the revised complexity scoring on both the Oct-1 and
NF-

B examples from the previous experimentation and determined
that for the case of Oct-1, the known Oct-1 TFBS solution dropped
from being the top scoring solution to solution #6. In the case
of NF-

B, however, the known NF-

B TFBS solution increased in
resulting score from solution #29 to solution #6 (
Figure 5).
Thus, the normalized complexity scoring relative to window sizes
and ambiguous nucleotides resulted in the algorithm to be equally
successful over both Oct-1 and NF-

B cases with known TFBS motifs
recovered in the top 10 solutions.

View larger version (21K):
[in this window]
[in a new window]
[Download PowerPoint slide]
|
Figure 5. Solutions for Oct-1 and NF- B after normalization of the complexity scoring to include ambiguous positions and to account for window sizes. Sequence information is provided with the COMPEL identifier (C00###) and start and end position indices for the motifs relative to the transcription start site.
|
|
Bonus scoring for similarity
To test the central bonusing method, two sequences were generated
with identical scores under the original scoring method: one
where the outer region was conserved and the inner region was
not, and another where the inner region was conserved and the
outer region was not. For initial tests, the weights for the
complexity and similarity scores were 0 and 1, respectively.
Sequence complexities were made to be identical so that only
the effects of similarity could be observed (
Figure 6).
To test the adjacency scoring method, two sets of sequences
were generated that scored identically under the original scoring
technique. One set of sequences contained no adjacency in the
conserved columns whereas the other set of sequences had four
adjacent columns that were conserved. The weights used for the
complexity and similarity scores were 0 and 1, respectively
(
Figure 7). The new scoring method indeed generated higher fitness
values to putative motifs with conserved core
regions.

View larger version (16K):
[in this window]
[in a new window]
[Download PowerPoint slide]
|
Figure 7. Test of the operation of the adjacency bonus method. Solutions 1 and 2 have equivalent scores when using the former method, but with the addition of the adjacency bonus, Solution 2 scores higher in terms of similarity and fitness.
|
|
We evaluated the worth of these modifications on the Oct-1 and
NF-

B examples used in Ref. (
29). After preliminary testing,
the central weights were scaled from 0.8 to 1.0. Using these
settings the known Oct-1 solution was reported as solution #6
(
Figure 8) and the top two NF-

B solutions were nearly identical
to the known NF-

B solution (
Figure 9). This suggests that similarity
scoring for core regions can drive the algorithm
to find more useful solutions.
Figures 10 and
11 demonstrate
that self-adaptation reduces the number of generations required
for the convergence of the population to the known best solutions
for the cases of Oct-1 and NF-

B. The output for the automated
clustering of the Oct-1 solutions is shown in
Figure 12. The
known Oct-1 solutions were found in the fourth cluster.

View larger version (10K):
[in this window]
[in a new window]
[Download PowerPoint slide]
|
Figure 10. Mean best fitness of evolutionary algorithm (a) without self-adaptation and (b) with self-adaptation for the Oct-1 example. Note that self-adaptation arrives at the best solution in approximately 125 generations, with a slight further increase in performance at generation 225 (b) whereas the evolutionary algorithm without self-adaptation rapidly becomes stuck in a local optimum until approximately generation 325.
|
|

View larger version (10K):
[in this window]
[in a new window]
[Download PowerPoint slide]
|
Figure 11. Mean best fitness of evolutionary algorithm (a) without self-adaptation and (b) with self-adaptation for the NF- B example. Note that self-adaptation arrives at the best solution in approximately 50 generations and that this solution is better than the best solution discovered without self-adaptation over 500 generations.
|
|


View larger version (54K):
[in this window]
[in a new window]
[Download PowerPoint slide]
|
Figure 12. Automated clustering of the Oct-1 solutions showing the known solutions in bold in cluster 4. Each cluster is identified by the number of sequences in the cluster, the solution numbers from which those sequences were derived, COMPEL identifier for the upstream sequences, position upstream of the transcription start site, and putative TFBS motif identified. In this case, the known Oct-1 solutions appear together in cluster 4, in addition to other 8mer windows that share some unspecific commonality, typically with a AAA repeat in the 3'-half of the sequence.
|
|
Composite element discovery: NFAT/AP-1
The nuclear factors of activated T-cells (NFAT) family of TFs
are known to be involved with the upregulation of T-cell genes
following antigenic stimulation (
33). Cooperation of NFAT with
activating protein 1 (AP-1) provides a model system for combinatorial
transcriptional regulation, as NFATp/c factors are known to
play a role in cytokine regulation during immune response. Most
of the experimental data for NFAT and AP-1 demonstrate that
they bind at closely juxtaposed sites.
The NFAT family [NFAT_Q6 in TRANSFAC® Professional version 11.4 (34)] is based on 26 NF-ATp/c binding sites with a resulting nucleotide distribution matrix 12 bp in length (Table 1). NFAT TFBSs are composite elements of an NFATp/c and an AP-1 binding site. The recognition of these NFAT TFBS elements has been investigated previously in the literature (35) providing a very useful means for the evaluation of algorithms designed for composite TFBS discovery.
View this table:
[in this window]
[in a new window]
|
Table 1. Nucleotide distribution matrix for the matrix V$NFAT_Q6 in TRANSFAC derived from 26 experimentally verified binding sites
|
|
The nucleotide distribution matrices for NFAT and AP-1 [
Figure 1,
Ref. (
35)] differ only with respect to AP-1. For NFAT, the core
was identified as GGAAA with a highly conserved
W just upstream of the core. For AP-1, the core was identified
as TGASTCA (
Table 2). Experimentally determined
NFATp/c binding sites are available from a variety of sources
in the literature. In addition, 13 known NFAT TFBSs specific
for activated T-cells, many of which are NFATp/AP-1 composite
elements have been identified [
Table 5, Ref. (
35)]. We used
this information to determine if these motifs could be identified
using our approach for TFBS discovery.
View this table:
[in this window]
[in a new window]
|
Table 2. Nucleotide distribution matrix for the matrix AP-1 from Ref. (13) derived from 47 experimentally verified binding sites with position relative to the transcription start site
|
|
To evaluate the performance of this Bioinspired Evolutionary
Algorithm for Genes with Linked Expression BEAGLE algorithm,
1000 nt of sequence information upstream of the transcription
start site from the following genes: human GM-CSF (NM_000758
[GenBank]
.2;
reverse complement), human IL-2 (NM_000586
[GenBank]
.2), human IL-4 (NM_000589
[GenBank]
.2;
reverse complement), human TNF member 2 (NM_000594
[GenBank]
.2; reverse
complement), mouse GM-CSF (X03020
[GenBank]
), mouse IL-2 (X14473
[GenBank]
), mouse
IL-4 (X05064
[GenBank]
) and mouse IL-5 (D14461
[GenBank]
) (
35). The weights for
similarity and complexity were varied over the ranges [0.2,
0.4] and [0.8, 0.6], respectively. Weights of 0.25 for similarity
and 0.75 for complexity provided the best solutions after repeated
testing.
Using a window size of 25, the top cluster of 26 resulting clusters contained 27 similar putative TFBS motifs, four of which could be identified as known NFAT binding motifs [motifs N2, N3, N13 and N38 from Table 1, Ref. (35)]. Despite the length of the window size relative to the smaller known TFBS, known motifs with conserved similarity as small as length 5 not only appeared in the top solution, but also were grouped automatically in the top cluster. The other 23 solutions in the top cluster had broad similarity in terms of repeated A's central to, or on the 5'-end of, the window. Window sizes as small as 13 nt could be used to recover the known NFAT binding sites. However, when the window size was made to be
12 positions, the known NFAT binding motifs appeared further down on the list of best clusters in the output from the evolutionary algorithm. With window sizes as small as 7 positions, NFAT binding sites were no longer observed but other, apparently well-conserved sequences, were discovered. These well-conserved short sequences were considered novel putative TFBSs.
For window sizes 25, 13, 11 and 7, a complete examination of clusters was made such that any output sequences containing known NFAT TFBS motif of GGAAA were identified (Tables 3–6

). Any pairs of putative TFBSs were determined as potential TFBS motifs for other transcription binding factors. Four of the known NFAT sites can be discovered with this method, two of which also contain known composite elements (Table 3), and the NFAT motif can be centralized to a smaller window of 13 for seven of the eight input sequences (Table 4, panel a). In some cases, the similarity over all 13 bp in these motifs is striking (e.g. mouse IL-2 relative to human GM-CSF). Four sets of similar sequences were discovered with a window size of 13 that were non-NFAT motifs (Table 4, panel b). In one case, this represented a complete match of 13 nt between human IL-2 and mouse IL-2 of TTGTCCACCACAA. In another case, 12 of 13 positions were conserved between mouse IL-4 and human IL-4 (CATTGGAAAWTTT).
View this table:
[in this window]
[in a new window]
|
Table 4. NFAT motifs discovered with window size 13 (Panel a). Non-NFAT motifs discovered when using a window size of 13 on the NFAT data set that have high pairwise similarity (Panel b)
|
|
View this table:
[in this window]
[in a new window]
|
Table 5. NFAT motifs discovered with window size 11 (Panel a). Non-NFAT motifs discovered when using a window size of 11 on the NFAT data set that have high pairwise similarity (Panel b)
|
|
View this table:
[in this window]
[in a new window]
|
Table 6. Non-NFAT motifs discovered when using a window size of 7 on the NFAT data set that have high pairwise similarity
|
|
With a window size of 11 the difficulty associated with discovering
the known NFAT solution for all eight sequences increased. The
motif GGAAA was discovered only in human TNF and GM-CSF sequences,
and mouse IL-4 (
Table 5, panel a). Other very similar pairs
of motifs of length 11 bp also exist (
Table 5, panel b). For
window length 7, no NFAT sequences were observed in the top
100 solutions, however other very well-conserved sequences of
7 bp were discovered. Many of these small motifs were conserved
at 100% between human and mouse. The motif AATKGCT appears to
be highly conserved and repeated in these sequences. The motif
CTGAGDV also appears to be highly conserved but not as repeated
and is found in all eight sequences known to harbor the NFAT/AP-1
complex.
MatInspector v5.0 (36) was used to determine if any of the conserved motifs were previously experimentally determined (Table 7). For the putative TFBS elements, we then searched the literature to determine any TFs that were known to have some relation to NFAT. Nine of these putative TFs had no known relation to NFAT. These included nuclear matrix protein 4 (NMP4)/Cas-interacting zinc-finger protein (CIZ), PAX2, c-Ets-1, E2F, Basonuclin, Atp1a1 regulatory element binding factor 6 (AREB6), Fork head-related activator-2 (FOXF2), Binding site for S8 type homeodomains and Sox-5. It remains unclear if any of these represented motifs share coregulation with NFAT, AP-1 or any other TF in these upstream regions.
However, HMGI(Y) high-mobility-group protein I (Y) is known
to be involved in the suppression of IL-4 transcription whereas
NFAT-1 is involved in the enhancement of IL-4 promoter activity
(
37). Myocyte enhancer factor 2 (MEF2) binding sites have recently
been found to be co-located with NFAT in the regulation of prion
protein gene (PRNP) and its normal product PrP(C) (
38). POU
factor Brn2 (BRN-2) has been identified as being upstream of
KCNN3 in a region of the human genome implicated in schizophrenia
by linkage (
39). NFAT and AP-1 TFBSs have also been discovered
upstream of KCNN3 (
39). MEF2 has the greatest number of references
in the literature in common with NFAT. NFAT appears to be a
key regulator of MEF2 in skeletal muscle, which in turn regulates
GLUT4 whole-body insulin action (
40). In addition, it appears
that both MEF2 and NFAT play key roles in the T-cell receptor-mediated
signal transduction pathways leading to IL-2 transcription (
41).
While calcineurin and NFAT were known to have roles in mediation
of the calcium signaling required for IL-2 transcription regulation,
MEF2 has only recently been implicated and may serve as a novel
target for development of immunosuppressants (
41). These and
other sources from the literature indicate that our method of
TFBS discovery can assist in not only the identification of
composite elements but also in the identification of neighboring
conserved elements that are experimentally verified as TFBSs.
 |
CONCLUSIONS
|
|---|
A previous approach for TFBS discovery making use of evolutionary
computation was extended with a variety of additional features.
These refinements include incorporation of complexity normalization,
ambiguous nucleotide assignment, bonuses in the scoring function
for similarity within different regions of the TF window considered
to be the core, self-adaptation of the evolutionary
parameters associated with the optimization method, a procedure
for automated clustering of evolved solutions and parallelization
of the entire evolutionary search. In this article, we have
evaluated the ability of this revised approach to discover composite
TFBS elements using NFAT/AP-1 as an example. The discovery of
composite TFBS elements was not possible with the prior approach.
Our results demonstrate that by using appropriate existing parameters such as window size and novel scoring methods such as central bonusing, TFBSs of different sizes and complexity can be identified as top solutions. Identification of NFAT/AP-1 elements was possible when using window sizes of 25-nt positions. However, the probability of identifying composite elements decreased with decreasing window length. While it was still possible to detect the NFAT element with window size 13, by window size 11 this became nearly impossible and by window size 7 other small motifs were discovered that had more similarity than any of the true NFAT/AP-1 composite sites. These results suggest that the choice of window size can directly influence the success of any approach for TFBS discovery including those that are capable of finding composite elements. Such a result is important not only with respect to the current approach but for any TFBS search methods that make use of window lengths and suggests that researchers should evaluate a range of window sizes for TFBS discovery. A problem with short TFBS motifs is that other, longer motifs of equal conservation may exist within the upstream regions, providing solutions that score higher than the known shorter truth. These other conserved regions may still be interesting, but only with the inclusion of larger sequence windows is there enough noise in the surrounding nucleotides to make the smaller NFAT element score similar to the larger putative TFBS motifs.
Given the knowledge that the appropriate window size is very likely unknown for each upstream region, we propose a dual approach for TFBS discovery. The first approach is to use a top-down approach with large window sizes to find putative composite elements in upstream regions and minimize the window size over time. The second approach is to use a bottom-up approach with small window sizes to find putative single TFBS motifs and then increase the window size over time, continually reseeding with the best results from the previous window size and iterating the evolutionary process. Any commonality between these two methods lends additional credence to the discovered motifs.
The COMPEL database (42) contains examples of experimentally verified composite TFBS motifs. In the future, we plan on reviewing this database to determine if there are any changes to the evolutionary algorithm strategy that can assist when specifically looking for composite elements. For instance, statistics on known motif–motif distances could be added to the fitness function. Distances between putative motifs that fall within the distribution of known samples could be given a bonus, further driving the evolutionary optimization toward reasonable solutions.
The approach to automated clustering drastically reduced the time required to comb through the output of the evolutionary algorithm and identify key similar motifs. However, it was also clear that the resulting clusters were in many cases still redundant. We made use of an initial threshold of 90% similarity over all sequences in each solution to all sequences in a cluster before that solution could be added to a cluster. The cutoff of 90% may be too stringent in some cases. Thus, we envision a subsequent version that iterates the clustering, first with a highly stringent threshold of 90%, followed by repeated rounds of merging clusters with ever lower thresholds to allow for similar cluster contents to be merged successfully, minimizing the number of clusters visible to the user and aiding in results interpretation.
An additional bonus could be incorporated that makes use of the database of known motifs from TRANSFAC® and composite elements from COMPEL to give bonuses for any window that contains sequences with strong similarity to known TFBSs. For instance, if the goal of a research project is to discover completely novel elements, then any previously experimentally determined TFBS motifs (either represented as specific sequences or statistical matrices) can be given a strong penalty. The resulting solutions will, in theory, contain a greater percentage of new motifs if those motifs exist upstream. If, however, the goal of the research project is to discover previously known elements, then large bonuses can be given to experimentally determined TFBS, driving the algorithm away from discovery of novel TFBS motifs, but potentially resulting in the discovery of novel composite elements or new TFBS motifs that neighbor known TFBS motifs. For any of these approaches to make use of the information in TRANSFAC® or COMPEL, there is a clear benefit to using the nucleotide distribution matrices that have been derived from as many phylogenetically diverse sequences as possible.
 |
FUNDING
|
|---|
Funding for open access charge: Eli Lilly and Co.
Conflict of interest statement. None declared.
 |
REFERENCES
|
|---|
- Brazma A, Vilo J. Gene expression data analysis. FEBS Lett. (2000) 480:17–24.[CrossRef][Web of Science][Medline]
- Bucher P. Regulatory elements and expression profiles. Curr. Opin. Struct. Biol. (1999) 9:400–407.[CrossRef][Web of Science][Medline]
- Zhang MQ. Large-scale gene expression data analysis: a new challenge to computational biologist. Genome Res. (1999) 9:681–688.[Abstract/Free Full Text]
- Ohler U, Niemann H. Identification and analysis of eukaryotic promoters: recent computational approaches. Trends Genet. (2001) 17:56–60.[CrossRef][Web of Science][Medline]
- Liu X, Brutlag DL, Liu JS. Bioprospector: discovering conserved DNA motifs in upstream regulatory regions of co-expressed genes. Pac. Symp. Biocomput. (2001) 6:127–138.
- Jensen LJ, Knudsen S. Automatic discovery of regulatory patterns in promoter regions based on whole cell expression data and functional annotation. Bioinformatics (2000) 16:326–333.[Abstract/Free Full Text]
- Congdon CB, Aman J, Nava G, Gaskins HR, Mattingly C. An evaluation of information content as a metric for the inference of putative conserved noncoding regions in DNA sequences using a genetic algorithms approach. IEEE/ACM Trans. Comput. Biol. Bioinf. (2008) 5:1–14.[CrossRef]
- Lones MA, Tyrell AM. Regulatory motif discovery using a population clustering evolutionary algorithm. IEEE/ACM Trans. Comput. Biol. Bioinf. (2007) 4:403–414.[CrossRef]
- Mauri G, Mosca R, Pavesi G. A GA approach to the definition of regulatory signals in genomic sequences. In: Proc. GECCO 2004. (2004) Springer, Berlin: LNCS 3102. 380–391.
- Carlos R-R. Finding DNA motifs using genetic algorithms. In: In: Proceedings of the Fifth Mexican International Conference on Artificial Intelligence (2006) New Jersey: IEEE, Piscataway.
- Paul TK, Iba H. Identification of weak motifs in multiple biological sequences using genetic algorithm. In: In: Proceedings of GECCO 2006 (2006) New York: ACM. 271–278.
- Wei Z, Jensen ST. GAME: detecting cis-regulatory elements using a genetic algorithm. Bioinformatics (2006) 22:1577–1584.[Abstract/Free Full Text]
- Chan T-M, Leung K-S, Lee K-H. TFBS identification based on genetic algorithm with combined representations and adaptive post-processing. Bioinformatics (2008) 24:341–349.[Abstract/Free Full Text]
- Kato M, Tsunoda T. MotifCombinator: a web-based tool to search for combinations of cis-regulatory motifs. BMC Bioinformatics (2007) 8:100.[CrossRef][Medline]
- Thijs G, Lescot M, Marchal K, Rombauts S, De Moor B, Rouzé P, Moreau Y. A higher-order background model improves detection of promoter regulatory elements by Gibbs sampling. Bioinformatics (2001) 17:1113–1122.[Abstract/Free Full Text]
- Atteson K. Calculating the exact probability of language-like patterns in biomolecular sequences. Proc. Int. Conf. Intell. Syst. Mol. Biol. (1998) 6:17–24.[Medline]
- Tompa M. An exact method for finding short motifs in sequences with application to the ribosome binding site problem. Proc. Int. Conf. Intell. Syst. Mol. Biol. (1999) 7:262–271.
- Brazma A, Jonassen I, Vilo J, Ukkonen E. Predicting gene regulatory elements in silico on a genomic scale. Genome Res. (1998) 8:1202–1215.[Abstract/Free Full Text]
- Vanet A, Marsan L, Labigne A, Sagot MF. Inferring regulatory elements from a whole genome. An analysis of Helicobacter pylori sigma(80) family of promoter signals. J. Mol. Biol. (2000) 297:335–353.[CrossRef][Web of Science][Medline]
- Kie
basa SM, Korbel JO, Beule D, Schuchhardt J, Herzel H. Combining frequency and positional information to predict transcription factor binding sites. Bioinformatics. (2001) 17:1019–1026.[Abstract/Free Full Text]
- Lawrence CE, Reilly AA. An expectation maximization (EM) algorithm for the identification and characterization of common sites in unaligned biopolymer sequences. Proteins (1990) 7:41–51.[CrossRef][Web of Science][Medline]
- Lawrence CE, Altschul SF, Boguski MS, Liu JS, Neuwald AF, Wootton JC. Detecting subtle sequence signals: a Gibbs sampling strategy for multiple alignment. Science (1993) 262:208–214.[Abstract/Free Full Text]
- Bailey T, Elkan C. Unsupervised learning of multiple motifs in biopolymers using expectation maximization. Mach. Learn. (1995) 21:51–80.
- Hughes JD, Estep PW, Tavazoie S, Church GM. Computational identification of cis-regulatory elements associated with groups of functionally related genes in Saccharomyces cerevisiae. J. Mol. Biol. (2000) 296:1205–1214.[CrossRef][Web of Science][Medline]
- Neuwald AF, Liu JS, Lawrence CE. Gibbs motif sampling: detection of bacterial outer membrane protein repeats. Protein Sci. (1995) 4:1618–1632.[Web of Science][Medline]
- Roth FP, Hughes JD, Estep PW, Church GM. Finding DNA regulatory motifs within unaligned noncoding sequence clustered by whole genome mRNA quantitation. Nat. Biotechol. (1998) 16:939–945.[CrossRef][Web of Science][Medline]
- Tharakaraman K, Mariño-Ramírez L, Sheetlin S, Landsman D, Spouge JL. Alignments anchored on genomic landmarks can aid in the identification of regulatory elements. Bioinformatics (2005) 21(Suppl. 1):i440–i448.[Abstract]
- Fauteux F, Blanchette M, Strömvik MV. Seeder: discriminative seeding DNA motif discovery. Bioinformatics (2008) 24:2303–2307.[Abstract/Free Full Text]
- Fogel GB, Weekes DG, Varga G, Dow ER, Harlow HB, Onyia JE, Su C. Discovery of sequence motifs related to coexpression of genes using evolutionary computation. Nucleic Acids Res. (2004) 32:3826–3835.[Abstract/Free Full Text]
- Wootten JC, Federhen S. Analysis of compositionally biased regions in sequence databases. Methods Enzymol. (1996) 266:554–571.[Web of Science][Medline]
- Fogel GB, Weekes DG, Varga G, Dow ER, Craven AM, Harlow HB, Su EW, Onyia JE, Su C. A statistical analysis of the TRANSFAC database. BioSystems (2005) 81:137–154.[CrossRef][Web of Science][Medline]
- Fogel DB. Evolutionary Computation: Toward a New Philosophy of Machine Intelligence (2006) 3rd. Piscataway, NJ: IEEE Press.
- Shaw JP, Utz PJ, Durand DB, Toole JJ, Emmel EA, Crabtree GR. Identification of a putative regulator of early T cell activation genes. Science (1988) 241:202–205.[Abstract/Free Full Text]
- Matys V, Fricke E, Geffers R, Gösling E, Haubrock M, Hehl R, Hornischer K, Karas D, Kel AE, Kel-Margoulis OV, et al. TRANSFAC®: transcriptional regulation, from pattern to profiles. Nucleic Acids Res. (2003) 31:374–378.[Abstract/Free Full Text]
- Kel A, Kel-Margoulis O, Babenko V, Wingender E. Recognition of NFATp/AP-1 composite elements within genes induced upon the activation of immune cells. J. Mol. Biol. (1999) 288:353–376.[CrossRef][Web of Science][Medline]
- Quandt K, Frech K, Karas H, Wingender E, Werner T. MatInd and MatInspector – new fast and versatile tools for detection of consensus matches in nucleotide sequence data. Nucleic Acids Res. (1995) 23:4878–4884.[Abstract/Free Full Text]
- Chuvpilo S, Schomberg C, Gerwig R, Heinfling A, Reeves R, Grummt F, Serfling E. Multiple closely-linked NFAT/octamer and HMG I(Y) binding sites are part of the interleukin-4 promoter. Nucleic Acids Res. (1993) 21:5694–5704.[Abstract/Free Full Text]
- Premzl M, Delbridge M, Gready JE, Wilson P, Johnson M, Davis J, Kuczek E, Marshall Graves JA. The prion protein gene: identifying regulatory signals using marsupial sequence. Gene (2005) 349:121–134.[CrossRef][Web of Science][Medline]
- Sun G, Tomita H, Shakkottai VG, Gargus JJ. Genomic organization and promoter analysis of human KCNN3 gene. J. Hum. Genet. (2001) 46:463–470.[CrossRef][Web of Science][Medline]
- McGee SL, Hargreaves M. Exercise and myocyte enhancer factor 2 regulation in human skeletal muscle. Diabetes (2004) 53:1208–1214.[Abstract/Free Full Text]
- Pan F, Ye Z, Cheng L, Liu JO. Myocyte enhancer factor 2 mediates calcium-dependent transcription of the interleukin-2 gene in T lymphocytes: a calcium signaling module that is distinct from but collaborates with the nuclear factor of activated T cells (NFAT). J. Biol. Chem. (2004) 279:14477–14480.[Abstract/Free Full Text]
- Kel-Margoulis OV, Romashchenko AG, Kolchanov NA, Wingender E, Kel AE. COMPEL: a database on composite regulatory elements providing combinatorial transcriptional regulation. Nucleic Acids Res. (2000) 28:311–315.[Abstract/Free Full Text]

CiteULike
Connotea
Del.icio.us What's this?