
© 1997 Oxford University Press 3248-3254

Enhancement of the nucleosomal pattern in sequences of lower complexity

Alex Bolshoy1,2,3,*, Kevin Shapiro3,4, Edward N. Trifonov3 and Ilya Ioshikhes3

1Center for Biological Sequence Analysis, Department of Chemistry, The Technical University of Denmark, DK-2800 Lyngby, Denmark, 2Department of Membranes Research and Biophysics, Genome Project and 3Department of Structural Biology, Weizmann Institute of Science, Rehovot 76100, Israel and 4Harvard University, Cambridge, MA 02138, USA

Received May 14, 1997; Revised and Accepted July 1, 1997

ABSTRACT

Intuitively, the complexity of a given DNA sequence is related to the number of various superimposed biological messages it contains. Here we assess the expectation that in nucleosome DNA sequences of lower linguistic complexity, the nucleosome DNA positioning pattern would be more pronounced than in those of higher linguistic complexity. The nucleosome DNA positioning pattern is one of the weakest (highly degenerate) sequence patterns. It has been extracted recently by specially designed multiple alignment procedures. We applied the most sensitive of these procedures to nearly equal subsets of a nucleosome database separated according to linguistic complexity. The pattern extracted from the subset of the simpler nucleosome sequences not only possesses all major attributes of the known nucleosomal pattern, but is substantially stronger with respect to amplitude in comparison with the total database. This result constitutes the first demonstration that a weak pattern can be significantly enhanced by selective treatment of a lower complexity subset of the sequence ensemble under consideration.

INTRODUCTION

In addition to packaging DNA into a dynamic chromatin structure, nucleosomes can, by virtue of their positions along the DNA sequence, play an essential role in gene repression and other regulatory schemes (1). Many mechanisms have been proposed for nucleosome positioning and most of them have found some experimental support (see 2 for a review). Most notably, it is clear that the DNA sequence itself is instrumental in determining nucleosome phasing; the positioning signal in the sequence, however, is very weak, difficult to extract and hence not fully understood.

This signal is believed to be related to the anisotropic deformability (bendability) of DNA wrapped around histone octamers (3-5), which is strongly influenced by sequence-specific interactions between neighboring nucleotides. Remarkably, certain dinucleotides, in particular the complementary pairs AA and TT, are known to display regular patterns of recurrence in nucleosomal sequences (3,6,7), the periods of which mirror the helical repeat of DNA in chromatin.

Although positional distributions of dinucleotides other than AA/TT also characterize the full nucleosomal pattern (8), the dinucleotides AA and TT are clearly the main contributors to the nucleosome positioning signal. In a recent study (7) we were able to discern the following major attributes of the DNA nucleosomal pattern: (i) both AA and TT dinucleotide positional distributions display detectable periodicity, with a period of 10.3 ± 0.2 bases; (ii) TT dinucleotides appear to be distributed symmetrically relative to AA dinucleotides of the same DNA strand, with the center of symmetry at the midpoint of the nucleosome core DNA; (iii) the phase shift between the AA and TT patterns is ~6 bp.

These features of the nucleosomal pattern were revealed through the application of several multiple alignment procedures to a database of nucleosomal sites. As we have shown previously (7), the main features of the extracted pattern do not depend on the algorithm used to derive them and, in particular, the pattern obtained by one of the applied techniques (we called it 'statistical multicycle consecutive multiple alignment') possesses practically all the features of the final refined pattern. Since the nature of DNA-protein recognition in the nucleosome and the strength of a corresponding sequence signal differ so substantially from other studied DNA-protein interactions, we used this procedure rather than other widely used methods of multiple alignment for pattern extraction (see 9 for a review).

One reason for the DNA nucleosomal pattern being weak and fuzzy is the necessity for chromatin unfolding during replication and transcription. To facilitate these processes, neither binding of the histone octamers nor the corresponding sequence signal should be very strong. The degeneracy of the weak nucleosome pattern, on the other hand, ensures minimal interference of the pattern with many other superimposed messages encoded in the DNA sequence (10). It is natural to assume that the greater the number of biological messages (i.e. patterns of various kinds) the nucleosome sequence carries, the fewer options remain free to serve the nucleosome positioning signal. In other words, in complex sequences the weak nucleosome pattern might be even weaker.

Experimental nucleosome sequence data are needed to reveal attributes of the nucleosomal pattern. Ideally, such a database would be built from the results of an experimental nucleosome mapping technique that precisely indicates the position of the nucleosome center along the DNA sequence. In reality, however, only a small proportion of the nucleosome DNA sequences available from the literature have been mapped with a high degree of accuracy with regard to a nucleosome center (±1 base, three possible positions of the center). For most, the uncertainty of mapping is higher, ranging up to ±55 nt (or 111 possible positions of the nucleosome DNA sequence midpoint). The most recent release of a database (11), referred to below as the Nucleosome Database, includes a total of 204 sequences, sorted in order of descending accuracy of their experimental mapping (7 and supplementary material; see also 11,12).

Of many biomolecular messages or codes in existence (13-15), only a few general codes are currently known (see 14,16 for reviews). The most 'complex' DNA sequences may carry an unknown large number of biological signals, with the same base utilized simultaneously in several functionally independent messages. Any of several existing criteria (13,17-19; see also 20,21 for reviews) can be adapted or applied directly for quantitative evaluation of the complexity and by extension its influence on the nucleosomal sequence signal. We have chosen in the present study to use the computationally simple linguistic complexity measure (13) for this purpose, though, again, other measures, for instance algorithmic complexity or the minimal description length principle (22), could also in theory be used to distinguish simple sequences from complex ones.

According to our calculations, nucleosome sequences indeed display an appreciable range of complexity values. Assuming that complexity is a reflection of the degree to which various messages are superimposed, one would expect that the nucleosomal pattern, present in all the nucleosomal sequences, should have a stronger expression in simpler sequences, i.e. those of lower complexity. In other words, it was our expectation that application of the multiple alignment routine (6,7,23) to a subset of the simpler sequences in the Nucleosome Database should produce a sharper and stronger pattern than that extracted from the database as a whole. The results of this study demonstrate that this expectation was met.

MATERIALS AND METHODS

Linguistic complexity of nucleosomal sequences

The linguistic complexity measure (13) exploits the major distinguishing feature between natural nucleotide sequences and uniformly random ones: the repetitiveness of the natural sequences, i.e. the frequent repetition, not necessarily a tandem one, of some oligonucleotides ('words'), while others are avoided. Thus, more complex sequences have richer oligonucleotide vocabularies, while repetitious sequences have relatively lower complexities. Complexity (C) can be directly calculated as the extent to which the maximal possible vocabulary (all word sizes considered) is utilized in a given stretch of sequence; vocabulary usage (Ui) for oligomers of a given size i can be defined as the ratio of the actual vocabulary size of a given sequence to the maximal possible vocabulary size for a sequence of that length. For example, U2 for the sequence ACGGTAAGCTGATTCCA = 16/16 = 1, as it contains all 16 possible different dinucleotides and is therefore maximally complex on a dinucleotide basis; for the sequence ACACACACACACACACA, however, U2 = 2/16 = 0.125, as it has a simple vocabulary of only two dinucleotides. For short oligomers maximally complex usage is defined as equal or closest to equal usage; in the first example, therefore, U1 = 17/17 = 1 (A is used five times, C, G and T four times each), while for the second sequence U1 = 9/17 (only A and C are used, and occurrences beyond the 5 + 4 allowed to two letters by the maximal vocabulary are not counted). For any fragment of length n complexity is calculated as

$$C = U_1 \cdot U_2 \cdot \ldots \cdot U_i \cdot \ldots \cdot U_{n-1} \qquad (1)$$

This value, C, provides a natural measure of sequence complexity in the convenient range 0 < C <= 1 for various sequences of a given length. To characterize a relatively long sequence, the values C could be calculated for all subsequences within that sequence of fixed length, or `window size'. The mean value of the complexity of all such fragments would then be taken as the complexity value for the longer sequence.
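The calculation described above can be illustrated with a minimal Python sketch. It is not the program used in this study; the function names are hypothetical, and the capping rule used for Ui (each word is counted at most as often as an equal-as-possible usage of all words would allow) is our reading of the worked examples, chosen because it reproduces U1 = 17/17, U1 = 9/17, U2 = 16/16 and U2 = 2/16 for the two example sequences.

```python
from collections import Counter


def vocabulary_usage(seq, i, alphabet_size=4):
    """Vocabulary usage U_i of `seq` for words of length i.

    Assumption: a word counts only up to its share under the most even
    possible usage of all alphabet_size**i words over the available positions.
    """
    positions = len(seq) - i + 1
    counts = sorted(
        Counter(seq[k:k + i] for k in range(positions)).values(), reverse=True
    )
    # Most even usage: `extra` words occur base + 1 times, the rest base times.
    base, extra = divmod(positions, alphabet_size ** i)
    used = 0
    for rank, count in enumerate(counts):
        cap = base + 1 if rank < extra else base
        used += min(count, cap)
    return used / positions


def complexity(fragment):
    """Equation 1: product of U_i over word sizes i = 1 .. n - 1."""
    c = 1.0
    for i in range(1, len(fragment)):
        c *= vocabulary_usage(fragment, i)
    return c


def mean_window_complexity(seq, window=17):
    """Mean complexity over all windows of fixed length along `seq`."""
    values = [complexity(seq[k:k + window]) for k in range(len(seq) - window + 1)]
    return sum(values) / len(values)


if __name__ == "__main__":
    for s in ("ACGGTAAGCTGATTCCA", "ACACACACACACACACA"):
        print(s, vocabulary_usage(s, 1), vocabulary_usage(s, 2), complexity(s))
```

For word sizes of two or more in a 17 base window the cap is one occurrence per word, so Ui reduces to the number of distinct words divided by the number of word positions.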

The program written to calculate complexity in the present analysis utilized a window size of 17 bases, although in principle any other window size could equally well have been used. In Figure 1 the smoothed distributions of the complexity values along two different sequences are presented. The upper curve reflects the distribution of the complexity values for sequence 18 from the database, the Saccharomyces cerevisiae 5S rRNA gene, mapped with an accuracy of ±4 bp with regard to the nucleosome center (24), whereas the lower curve corresponds to sequence 27, the frog Xenopus vitellogenin B1 promoter, with the same ±4 bp accuracy of mapping (25). The curves were smoothed by a running window of 50 bases for the purposes of clarity. The complexity values vary significantly along the sequences, but the averaged complexity value for the rRNA gene fragment is higher than the averaged complexity for the other fragment. Accordingly, the more complex sequence 18 belongs to the subset of complex sequences, while sequence 27 belongs to the subset of simpler sequences.


Figure 1. The distributions of the linguistic complexity values along the two different sequences are presented: sequence 18 from the database, the Saccharomyces cerevisiae 5S rRNA gene (24), and sequence 27, the frog Xenopus vitellogenin B1 promoter (25). The upper curve shows the distribution of complexity for sequence 18, whereas the lower curve corresponds to sequence 27. The curves are smoothed by a running window of 50 bases.

The programs for the calculation of complexity, mapping and multiple sequence alignment are available from the authors upon request.

Nucleosome DNA sequence database

During the last decade many nucleosome DNA sequences have been determined using various experimental approaches. The latest release of the compilation of nucleosome sequence data consists of 204 sequences from 18 eukaryotes and three viruses, along with source references (supplementary material to 7; see also 11,12). The data were taken from both in vitro and in vivo nucleosome mapping experiments. Each record in this database indicates the center of the mapped nucleosome in the sequence as well as the accuracy (published or evaluated) of the mapping. To allow extra space for the purposes of alignment, the records contain 400 bp long sequence fragments, with position 200 corresponding to the center of the nucleosome as determined in the particular experiment.

The multiple alignment algorithm

The principle of the algorithm is that the DNA sequences, represented by AA and TT dinucleotide matrices, are aligned consecutively, one at a time, with the pattern derived from alignments obtained at earlier steps. In other terms, at step K the shift of the Kth nucleosomal sequence is determined so as to maximize the sum of scores of pairwise alignments of the Kth sequence with the K - 1 sequences aligned at previous steps. The sequences are matched consecutively by their AA and TT dinucleotides to the two line matrix of AA and TT frequencies obtained for previous sequences, with gaps and deletions not allowed. The best match provides maximal correlation between the current matrix and the sequence. One can assume that the alignment obtained in one such cycle is a suboptimal one for the aligned K sequences. It depends, however, on the order in which the sequences are matched, so one cycle of the procedure cannot provide a final pattern. Consequently, the procedure should be repeated many times, choosing different sequence orders. The resulting matrices are then averaged. For a more detailed description of multicycle consecutive alignment see Ioshikhes et al. (7) and Bolshoy et al. (23).
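The procedure outlined above can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the published implementation (7,23): all function names are hypothetical, the match score is taken to be the plain co-occurrence count (dot product) between the AA/TT indicator tracks of the new sequence and the accumulated matrix, the permitted shift of each sequence is assumed to be bounded by its mapping accuracy, and the first sequence of each cycle is left unshifted.

```python
import random


def indicator(seq, dinuc):
    """1.0 at every position where `dinuc` starts, 0.0 elsewhere."""
    return [1.0 if seq[k:k + 2] == dinuc else 0.0 for k in range(len(seq) - 1)]


def window(track, center, half):
    """Positions center - half .. center + half of a track."""
    return track[center - half:center + half + 1]


def best_shift(profile, tracks, center, half, max_shift):
    """Ungapped shift in [-max_shift, max_shift] with the highest match score."""
    best, best_score = 0, float("-inf")
    for s in range(-max_shift, max_shift + 1):
        score = sum(
            p * x
            for row, track in zip(profile, tracks)
            for p, x in zip(row, window(track, center + s, half))
        )
        if score > best_score:
            best, best_score = s, score
    return best


def one_cycle(seqs, centers, accuracies, half, rng):
    """Align the sequences one at a time, in random order, to the running matrix."""
    order = list(range(len(seqs)))
    rng.shuffle(order)
    width = 2 * half + 1
    total = [[0.0] * width, [0.0] * width]  # row 0: AA counts, row 1: TT counts
    for n_aligned, idx in enumerate(order):
        tracks = [indicator(seqs[idx], "AA"), indicator(seqs[idx], "TT")]
        shift = 0
        if n_aligned > 0:
            shift = best_shift(total, tracks, centers[idx], half, accuracies[idx])
        for row, track in zip(total, tracks):
            for k, x in enumerate(window(track, centers[idx] + shift, half)):
                row[k] += x
    return [[v / len(seqs) for v in row] for row in total]


def multicycle(seqs, centers, accuracies, cycles=10, half=72, seed=0):
    """Average the frequency matrices obtained from several random orders."""
    rng = random.Random(seed)
    mean = [[0.0] * (2 * half + 1) for _ in range(2)]
    for _ in range(cycles):
        cycle_profile = one_cycle(seqs, centers, accuracies, half, rng)
        for mean_row, row in zip(mean, cycle_profile):
            for k, v in enumerate(row):
                mean_row[k] += v / cycles
    return mean
```

Because the running matrix holds summed counts, the dot product used here equals the sum of pairwise co-occurrence scores with all previously aligned sequences. A correlation coefficient, as mentioned in the text, could be substituted for the score without changing the structure of the loop.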

Spectral analysis of the dinucleotide distributions

After subtraction of the 'non-oscillating' background, the oscillating part was approximated by a sinusoid. Its amplitude was calculated from the corresponding coefficients of the Fourier spectrum (Kolker and Trifonov, submitted). Amplitudes calculated as described for all periods of interest form an amplitude spectrum, which can be used for evaluation of the presence and significance of the expected periodic component.
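A minimal Python sketch of such an amplitude spectrum is given below. It is an illustration only and not the program of Kolker and Trifonov: the function names are hypothetical, and the 'non-oscillating' background is assumed here to be simply the mean of the profile. The amplitude at a trial period P is taken from the cosine and sine Fourier coefficients of the background-subtracted profile.

```python
import math


def amplitude(profile, period):
    """Amplitude of the sinusoid of the given period fitted to `profile`.

    Assumption: the background is the profile mean.
    """
    n = len(profile)
    mean = sum(profile) / n
    omega = 2.0 * math.pi / period
    a = 2.0 / n * sum((y - mean) * math.cos(omega * x) for x, y in enumerate(profile))
    b = 2.0 / n * sum((y - mean) * math.sin(omega * x) for x, y in enumerate(profile))
    return math.hypot(a, b)


def amplitude_spectrum(profile, periods):
    """List of (period, amplitude) pairs for all periods of interest."""
    return [(p, amplitude(profile, p)) for p in periods]


if __name__ == "__main__":
    # Synthetic 145 base profile: flat background plus a weak 10.3 base oscillation.
    synthetic = [0.1 + 0.02 * math.cos(2.0 * math.pi * x / 10.3) for x in range(145)]
    periods = [8.0 + 0.05 * k for k in range(101)]
    peak_period, peak_amplitude = max(
        amplitude_spectrum(synthetic, periods), key=lambda pair: pair[1]
    )
    print(peak_period, peak_amplitude)
```

With only about 14 periods in a 145 base profile the spectral peak is broad, which is consistent with quoting periods to no better than about ±0.2 bases.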

RESULTS

High and low complexity subsets of the Nucleosome Database

The program written to calculate linguistic complexity was applied to each of the 204 database nucleosome sequences. The values were calculated in a running window of 17 bases sliding along the sequence in the range [200 - 72 - acc + 8, 200 + 72 + acc - 8], where position 200 is the center of the nucleosome, 72 corresponds to half the nucleosome size, acc is the accuracy (published or evaluated) of the mapping and 8 is half the window size. The complexity values along each sequence were averaged to obtain one value for every whole sequence; over the entire database these values ranged from a low of C = 0.23 to a high of C = 0.60. The mean complexity value for all sequences in the database was C = 0.46. Based on this value, the sequences were then divided into two subsets: 114 sequences with high complexities (complex sequences) in the range [0.47, 0.60] and 90 simple sequences with complexity values in the range [0.23, 0.46]. Both groups had statistically equivalent mean accuracies (23.8 ± 15.1 bp for the group of complex sequences; 21.7 ± 17.4 bp for the low complexity sequences). The various organisms (18 eukaryotes and three viruses) represented in the Nucleosome Database were distributed statistically evenly between the two subsets: SV40 data yielded 12 complex and 12 simple sequences; yeast, 53 complex and 31 simple sequences; Drosophila, 10 complex and 15 simple sequences. The subgroups are thus quite comparable and differ essentially only in their average complexity.

The complex and the simpler sequences were subsequently analyzed for the presence of AA and TT periodicity by applying the multiple alignment technique and the results were compared with the results for the database as a whole.

Periodically distributed AA and TT dinucleotides in complex and simple nucleosome sequences

Our original statistical multiple alignment procedure was applied separately to the whole Nucleosome Database and to the two subsets described above; the patterns were obtained both for AA dinucleotides and for TT dinucleotides. It is evident from these that the distribution of TT dinucleotides is essentially symmetrical with that of AA dinucleotides, with the center of symmetry at the midpoint of the nucleosome DNA sequence (7). As in the original study, the AA and TT distributions were combined by the rules of symmetry in order to improve the quality of the patterns, i.e. new AA and TT profiles were calculated: AAsym(x) = [AA(x) + TT(-x)]/2 and TTsym(x) = [TT(x) + AA(-x)]/2. The coordinate x here is counted from a new origin in the middle of the nucleosome DNA sequence (position 73 from the start of the 145 base nucleosome segment).
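The symmetrization rule can be written out as a short Python sketch. The function name is hypothetical; the profiles are assumed to be lists of equal odd length (e.g. 145 values) whose middle element corresponds to x = 0, so that the value at -x is obtained by reading the list in reverse.

```python
def symmetrize(aa, tt):
    """Return (AAsym, TTsym) with AAsym(x) = [AA(x) + TT(-x)]/2 and
    TTsym(x) = [TT(x) + AA(-x)]/2, where x = 0 is the middle element."""
    if len(aa) != len(tt):
        raise ValueError("AA and TT profiles must have the same length")
    aa_sym = [(a + t) / 2.0 for a, t in zip(aa, reversed(tt))]
    tt_sym = [(t + a) / 2.0 for t, a in zip(tt, reversed(aa))]
    return aa_sym, tt_sym


if __name__ == "__main__":
    aa_sym, tt_sym = symmetrize([0.0, 2.0, 4.0], [6.0, 0.0, 2.0])
    print(aa_sym, tt_sym)
```

By construction TTsym(x) = AAsym(-x), so the two symmetrized profiles are mirror images of each other and it suffices to display one of them.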

As the most important attribute of the nucleosomal pattern is its periodicity, spectral analysis was applied to all extracted profiles to clarify the periodic distribution of dinucleotides. In Figure 2 the AAsym profiles generated by this procedure and their amplitude spectra are shown.


Figure 2. The dinucleotide frequency patterns and their spectra for complementarily symmetrized AA and TT distributions generated by multiple alignment. (a) Calculated for the subset of 114 sequences of the nucleosome database with higher complexity values (complex sequences). (b) Calculated for the subset of 90 sequences of the nucleosome database with lower complexity values (simpler sequences). (c) Calculated for the whole set of 204 sequences of the nucleosome database.

In the symmetrized AA distribution obtained from the 114 complex sequences (Fig. 2a), neither the characteristic periodicity nor other typical features of the nucleosomal pattern are seen. The spectrum of this distribution shows no major peak in the neighborhood of 10 bp. There is a small peak at 10.25 bp with an amplitude of 0.0094, which is close to the background amplitude. Clearly, had we had only these complex sequences in the Nucleosome Database, we would not be able to extract any salient pattern from them. (This is not to say, however, that these sequences do not carry the nucleosome pattern at all.) The symmetrized pattern extracted from all 204 sequences of the Nucleosome Database (Fig. 2c), on the other hand, has a major period of 10.15 ± 0.2 with an amplitude of ~0.017. This, we then expect, should be due in large part to the added input of simpler nucleosome sequences. Indeed, the symmetrized pattern extracted from the 90 simple sequences (Fig. 2b) has evident periodicity and the spectrum shows a large major peak at 10.55 ± 0.2 with an amplitude of 0.020.

This tendency remains detectable when not only the symmetrized patterns but also the initial AA and TT profiles are compared separately (data not shown). The distribution of AA dinucleotides along the simple nucleosome sites resembles an enhanced version of the distribution calculated for the total database: maxima and minima are placed at virtually the same positions. Apparently, the simpler sequences carry all the features of the AA distribution for the total database, but with reduced noise. The amplitude of the period of 10.25 ± 0.2 in the AA dinucleotide spectrum of simpler sequences is 0.031. This is the highest value for amplitude of the nucleosome pattern observed in our analyses thus far (see discussion of this issue in 23).

DISCUSSION

The DNA nucleosomal signal is known to be very subtle. In the current study we have laid the basis for a novel method that can extract such a signal in an easier and more effective way. The broader problem implicit in this study is that of selecting the most representative subset of a whole database without prior knowledge about a hidden pattern the database contains.

We have formulated a specific prediction on the basis of several assumptions: (i) that the DNA nucleosomal positioning pattern, present in all nucleosome sequences in the database, is expressed more strongly when the 'background noise' of other, interfering messages is reduced; (ii) that sequence complexity and, specifically, the linguistic complexity measure is an appropriate tool to select DNA nucleosome sequences with minimal interfering informational context, since the complexity of a sequence presumably reflects the number of overlapping messages present within it; (iii) that the chosen multiple alignment technique is adequate for extracting the nucleosomal pattern, despite a reduced number of sequences in the ensemble.

The method of linguistic complexity analysis

The method was introduced by Trifonov (13). Equation 1 is taken from this article and provides a measure of the richness or non-repetitiveness of a given text. Among the results presented in that publication were: (i) that translated sequences have generally higher complexity as compared with non-translated ones; (ii) that translated sequences of eukaryotes are usually less complex than translated prokaryotic sequences. These observations illustrate the relevance of the linguistic complexity measure for analysis of nucleotide sequences.

Indeed, the higher complexity evinced in prokaryotic sequences might be expected a priori, as in both DNA and RNA they are essentially open for various interactions, i.e. they carry many messages, unlike eukaryotic sequences, which are excluded from many interactions by folding in chromatin and hnRNP structures. The finding that translated sequences are characteristically more complex than non-translated ones fulfils yet another natural expectation: whatever other overlapping biological messages are present, the coding sequences carry one extra message (the protein translation code) and should thus have greater complexity.

Another known technique, the method of local compositional complexity (LCC), has independently yielded very similar results: (i) eukaryotic DNA is 'simpler' than bacterial DNA; (ii) the mean LCC is different for introns and exons (19). We believe that a tentative division of the Nucleosome Database into complex and simpler subsets by the LCC measure should lead to very similar results. The comparison of different methods of measuring complexity, however, was not a goal of the study at hand.

Evaluation of possible artifacts

The Nucleosome Database consists of 204 different nucleosomal sequences available in the literature, with a wide range of experimental accuracies in mapping of the nucleosomes. The multiple alignment procedure works more effectively for more accurately mapped sequences; it might therefore be more effective for simpler sequences if the linguistic complexity measure happened to select the more accurately mapped sequences preferentially as the simpler ones. However, as we indicated in Results, the distributions of experimental error in the simpler and complex subsets are rather similar, thus excluding this possibility. Another possible artifact could be a preferential selection by the complexity criterion of the nucleosome sequences from certain species. This possibility is to be rejected as well, since various species are found to be distributed practically uniformly between the two subsets (see Results).

The Nucleosome Database consists of sequences that were taken from both in vitro and in vivo nucleosome mapping experiments. There are many cases known which demonstrate that DNA-histone interactions are sufficient to position the nucleosomes (e.g. 23,26). The DNA binding sites from in vivo experiments may, however, exhibit a rather weak sequence pattern, since nucleosome positioning in vivo may involve other factors absent in in vitro experiments. A possible predominance of in vitro data in the lower complexity half of the database might thus mimic the observed enhancement of the signal. Analysis of the Nucleosome Database shows that there are only 33 in vitro fragments among the total 204 sequences. Twenty of these in vitro sequences belong to the subset of simpler sequences, which is, obviously, a statistically insignificant excess. Thus this possible artifact can be rejected as well; the in vitro sites provide only a small part of the nucleosomal signal extracted from the subset of simpler sequences.

Another potential artifact may be due to the AA/TT content of the nucleosome sequences. The low complexity sequence subset may be biased towards a high AA/TT content, which makes the sequences simpler. As Table 1 demonstrates, the low complexity sequence subset is indeed somewhat biased towards a high AA/TT content. However, a high AA/TT content does not necessarily mean that the AA/TT periodicity should be stronger. The CC/GG content of the same subset is lower (Table 1) but the amplitude of the CC (GG) periodic component calculated as in (8) is higher for the subset of simpler sequences (data not shown). Evidently, it is not only the AA/TT components of the nucleosomal pattern that are expressed more strongly in the simpler sequences, and there is no correlation between composition and the respective amplitudes of the periodic components. Among the simpler sequences, 25 actually have an AA+TT content lower than average. We thus conclude that signal enhancement in sequences of lower complexity is not a reflection of AA/TT composition.

Table 1. The dinucleotide composition of the database
Dinucleotide   Total database        Lower complexity subset
               Average   SD          Average   SD
AA             0.103     0.004       0.122     0.006
AC             0.054     0.001       0.050     0.002
AG             0.063     0.002       0.058     0.003
AT             0.076     0.002       0.086     0.004
CA             0.068     0.002       0.062     0.002
CC             0.052     0.003       0.048     0.005
CG             0.028     0.002       0.022     0.002
CT             0.062     0.002       0.060     0.003
GA             0.060     0.002       0.053     0.003
GC             0.047     0.002       0.039     0.003
GG             0.052     0.003       0.044     0.005
GT             0.049     0.001       0.044     0.002
TA             0.064     0.002       0.079     0.004
TC             0.057     0.002       0.055     0.003
TG             0.066     0.002       0.057     0.002
TT             0.100     0.004       0.122     0.007

The analysis described presents only an indirect account of the influence of other (largely unknown) sequence patterns. Strictly speaking, our studies demonstrate only that the particular technique of pattern extraction used in this work gives better results on a certain subset of the database than on the whole set. It is possible that the simpler sequences do not in fact carry any stronger a pattern; rather, this subset could somehow be better suited to the multiple alignment routine, which might generate the periodic signal by itself. Against this possibility stand results of experiments performed in silico (23). In that work, the multicycle consecutive alignment technique was calibrated on a simulated system of random sequences with an introduced weak periodic signal; a periodic variation of occurrence of the dinucleotides AA and TT with the period 10.33 bases and approximately a half-period relative phase shift was taken as a signal imitating the nucleosome DNA pattern. These simulations showed that signal components with amplitudes as small as 0.02 could be successfully extracted from the model database. Thus there is fairly good evidence that the multicycle routine indeed extracts an existing signal and that the strength of the resultant pattern is not merely an artifact of the procedure.

Periods of the nucleosomal pattern and of individual sequences

Although the multiple alignment procedure has proved its efficiency when applied to model sets (23), to the full size database and to a subset of simpler sequences, it does not show the characteristic 10.3 base periodicity as a dominating feature in the set of sequences with high complexity. Does this mean that the complex sequences do not contribute to the final nucleosomal pattern? Due to the sophisticated iterative nature of the applied routine, there is no obvious way to estimate contributions of individual sequences to the final pattern. However, one can estimate the amplitudes of the AA and TT periodic distributions along individual sequences using the output value of the period, 10.3. These amplitudes were obtained with the same spectral analysis procedure as that used to extract the characteristics of the patterns shown in this paper. The simpler sequences have an average amplitude of 0.026, while the average for the complex sequences is 0.020 (see Table 2 for a more detailed description of the two distributions). These data show that although the fraction of periodic sequences for AA(TT) distributions is larger in the set of simpler sequences, the set of more complex sequences does, indeed, contribute to the overall pattern.

The value of 10.3 for the nucleosome DNA period is chosen for reasons discussed previously (7). Confusion may arise from the observed scatter of the values of periods obtained by spectral analysis: the final pattern has the major period at 10.3, while the intermediate patterns obtained by the consecutive alignment have major peaks at ~10.15 for AAs and ~10.5 for TTs (7); analysis of positional double autocorrelation functions for CCs and AAs reveals 10.0 and 10.1 bp respectively (8); the period of the symmetrized pattern of the simpler sequences is ~10.5, while before symmetrization it is 10.25. For the individual sequences values of period may vary over a wider range. Different methods of multiple alignment and different subsets of the sequences of the database thus reveal seemingly non-identical periods. However, they are all consistent with an averaged estimate of 10.3 ± 0.2 bp. To improve the accuracy of this estimate a much larger database is necessary.

Possible applications of the complexity selection approach

Many studies of biological patterns involve collections of functionally equivalent sequences (FES) from which the pattern must be extracted (for a review see 16). Usually the pattern is presented in the form of a profile, calculated with equal weight assigned to all of the sequences in the collection. It is clear, however, that the quality of the hidden degenerate pattern may be very different for different sequences in the set, which have varying signal-to-noise ratios and thus should not be treated equally in calculation of the profile. However, since neither signal nor noise is normally known in advance, appropriate weights can be ascribed to the sequences only a posteriori. The complexity discrimination approach proposed here allows a priori identification of the sequences more likely to have smaller signal-to-noise ratios, by calculating the complexities of all the sequences in the collection (e.g. by the measure of linguistic complexity, as in this work) and removing or giving less weight to those that are most complex. We believe that this approach will prove useful in the extraction of other degenerate biological signals, such as eukaryotic or prokaryotic promoters or various weakly expressed transcription factor binding sites. This study provides only a single illustration of a potentially important approach utilizing sequence complexity for the analysis of overlapping degenerate patterns. Further studies are necessary to fully realize this potential.

Table 2. Distribution of spectral amplitudes among individual sequences
Complexity   Average amplitude   SD      Min amplitude   Max amplitude   Amplitude >0.017   Amplitude >0.034
Low          0.026               0.023   0.001           0.122           60%                25%
High         0.020               0.016   0.0002          0.090           51%                16%

ACKNOWLEDGEMENTS

The authors are thankful to E.Kolker, who kindly provided an original program for spectral analysis, and to Dr E.Shpigelman for his version of a program for the linguistic complexity calculations. The authors would also like to express their gratitude to Drs S.Brunak and H.Herzel for their invaluable suggestions during the preparation of the manuscript. A.B. was supported by the Bat Sheva de Rothschild Fund for the Advancement of Science in Israel and the National Laboratory for Bioinformatics and DNA Sequencing of the Israel Council for Higher Education and is supported by the Danish National Research Foundation. I.I. was supported by a L.Bein WIS scholarship. K.S. received the Clarice D.Kaufmann Scholarship to the 28th Dr Bessie F.Lawrence International Summer Science Institute.

REFERENCES

1 Wolffe,A.P. (1994) Trends Biochem. Sci., 19, 240-244.

2 Simpson,R.T. (1991) Prog. Nucleic Acids Res., 40, 143-184.

3 Trifonov,E.N. and Sussman,J.L. (1980) Proc. Natl. Acad. Sci. USA, 77, 3816-3820.

4 Trifonov,E.N. (1980) Nucleic Acids Res., 8, 4041-4053.

5 Drew,H.R. and Travers,A.A. (1985) J. Mol. Biol., 186, 773-790.

6 Ioshikhes,I., Bolshoy,A. and Trifonov,E.N. (1992) J. Biomol. Struct. Dyn., 9, 1111-1117.

7 Ioshikhes,I., Bolshoy,A., Derenshteyn,K., Borodovsky,M. and Trifonov,E.N. (1996) J. Mol. Biol., 262, 129-139.

8 Bolshoy,A. (1995) Nature Struct. Biol., 2, 446-448.

9 Waterman,M.S. (ed.) (1989) Mathematical Methods for DNA Sequence Analysis. CRC Press, Boca Raton, FL.

10 Trifonov,E.N. (1989) Bull. Math. Biol., 51, 417-432.

11 Ioshikhes,I. and Trifonov,E.N. (1993) Nucleic Acids Res., 21, 4857-4859.

12 Ioshikhes,I. and Trifonov,E.N. (1994) EMBL accession on FTP server FTP.EBI.AC.UK, directory nucleosome_dna.

13 Trifonov,E.N. (1990) In Sarma,R.H. and Sarma,M.H. (eds), Structure and Methods, Vol. 1, Human Genome Initiative and DNA Recombination. Adenine Press, New York, NY, pp. 69-77.

14 Trifonov,E.N. (1991) In Lavery,R., Rivail,J.-L. and Smith,J. (eds), AIP Conference Proceedings 239: Advances in Biomolecular Simulations. American Institute of Physics, New York, NY.

15 Konopka,A.K. (1992) Comput. Chem., 16, 83-84.

16 Konopka,A.K. (1994) In Smith,D.W. (ed.), Biocomputing. Academic Press, San Diego, CA, pp.119-174.

17 Thiele,H. (1974) In Klix,F. (ed.), Organismische Informationsverarbeitung. Akademie-Verlag, Berlin, Germany.

18 Ebeling,W. and Jimenez-Montano,M.A. (1980) Math. Biosci., 52, 53-71.

19 Konopka,A.K. and Owens,J. (1990) In Bell,G.I. and Marr,T.G. (eds), Computers and DNA. Addison-Wesley Longman, Redwood City, CA, pp. 147-155.

20 Grassberger,P. (1989) Helv. Phys. Acta, 62, 489-511.

21 Wackerbauer,R., Witt,A., Atmanspacher,H., Kurths,J. and Scheingraber,H. (1994) Chaos, Solitons Fractals, 4, 133-173.

22 Rissanen,J. (1978) Automatica J. IFAC, 14, 465-471.

23 Bolshoy,A., Ioshikhes,I. and Trifonov,E.N. (1996) Comp. Applic. Biosci., 12, 383-389.

24 Buttinelli,M., Di Mauro,E. and Negri,R. (1993) Proc. Natl. Acad. Sci. USA, 90, 9315-9319.

25 Schild,C., Claret,F.-X., Wahli,W. and Wolffe,A.P. (1993) EMBO J., 12, 423-433.

26 Jackson,J.R. and Benyajati,C. (1993) Nucleic Acids Res., 21, 957-967.


* To whom correspondence should be addressed at: Center for Biological Sequence Analysis, Department of Chemistry, The Technical University of Denmark, Building 206, DK-2800 Lyngby, Denmark. Tel: +45 4545 2472; Fax: +45 4593 4808; Email: alex@cbs.dtu.dk