ABSTRACT
The potentiation and subsequent initiation of transcription are complex
biological phenomena. The region of attachment of the chromatin fiber to the
nuclear matrix, known as the matrix attachment region or scaffold attachment
region (MAR or SAR), are thought to be requisite for the transcriptional
regulation of the eukaryotic genome. As expressed sequences should be contained
in these regions, it becomes significant to answer the following question: can
these regions be identified from the primary sequence data alone and
subsequently used as markers for expressed sequences? This paper represents an
effort toward achieving this goal and describes a mathematical model for the
detection of MARs. The location of matrix associated regions has been linked to
a variety of sequence patterns. Consequently, a list of these patterns is
compiled and represented as a set of decision rules using an AND-OR formulation. The DNA sequence was then searched for the presence of
these patterns and a statistical significance was associated with the frequency
of occurrence of the various patterns. Subsequently, a mathematical potential
value,
MAR-Potential
, was assigned to a sequence region as the inverse proportion to the probability
that the observed pattern population occurred at random. Such a MAR detection
process was applied to the analysis of a variety of known MAR containing
sequences. Regions of matrix association predicted by the software essentially
correspond to those determined experimentally. The human T-cell receptor and the DNA sequence from the Drosophila bithorax region
were also analyzed. This demonstrates the usefulness of the approach described
as a means to direct experimental resources.
Recent studies have established that human somatic cell chromatin is organized
in loops that span ~50-100 kb (
1
). The points of attachment of these chromatin loops serve as specific sequence
landmarks as they anchor the DNA sequence to the fibers of chromosomal
scaffold. These sites of DNA attachment to the nuclear scaffold are termed
scaffold (metaphase) or matrix (interphase) attachment regions (SAR or MAR).
They are known to facilitate the expression of genes and may function as the
origins of replication.
The matrix or scaffold attachment regions are relatively short (100-1000 bp long) sequences that anchor the chromatin loops to the nuclear
matrix. MARs often include the origins of replication (ORI) and can possess a
concentrated area of transcription factor binding sites (
2
). Approximately 100 000 matrix attachment sites are believed to exist in the
mammalian nucleus of which ~30 000-40 000 serve as ORIs (
3
). MARs have been observed to flank the ends of genic domains encompassing
various transcriptional units. It has also been shown that MARs bring together
the transcriptionally active regions of chromatin such that the transcription
is initiated in the region of the chromosome that coincides with the surface of
nuclear matrix (
3
,
4
).
Matrix attachment regions have been categorized as constitutive (permanent) or
facultative (cell-type specific) (
2
). The constitutive MARs occur in all types of cells irrespective of the tissue
in which they are found. In contrast, the presence of a facultative MAR is
tissue specific and its use is governed by that tissue. MARs have been
experimentally defined for several gene loci, including the chicken lysozyme
gene (
5
), human interferon-[beta] gene (
6
), human [beta]-globin gene (
7
), chicken [alpha]-globin gene (
8
), p53 (
9
) and the human protamine gene cluster (
10
).
It is widely accepted that the next phase of the Human Genome Project will focus
on completing the transcript map. This will entail the mapping of the
transcribed sequences to the appropriate regions of the chromosomes. To help
identify these regions, some sequence identifiers, such as promoters, enhancers
and locus control regions (LCR) are typically used. One of the clearest
indicators of functional sequences are MARs. In light of the key role of MARs
in genetic processes, and their localization to functional chromatin domains, a
means to model these markers so that they could be placed on the map from
sequencing data was sought. The results of our studies to computationally
define MARs for experimental validation are presented.
MARs are polymorphic and appear to be distributed throughout the genome. There
is no known consensus sequence that is characteristic of a MAR. Biologists have
physically identified MARs and have tried to correlate their presence with the
occurrence of several DNA sequence motifs, including the ORI, curved and/or
kinked DNA. A description of some of the motifs that have been identified
within several MARs is as follows.
Origin of replication.
It is known that DNA replication is associated with the nuclear matrix. It has
also been demonstrated that nuclear matrix attachment sites, homeotic protein
recognition and binding sites and the origins of replication share the ATTA,
ATTTA and ATTTTA motifs. This implies that differential activation of origins
of replication (important for development) are regulated while part of the
nuclear matrix (
2
). ORI motifs
m
1
. . .
m
3
in Figure
1
have been used to formulate Rule 1 in Figure
2
.
Several other characteristics, some of which have not been included in the
current analysis, have been proposed for MARs. For example, MARs have been
shown to contain palindromic sequences, Z-DNA and DNase I hypersensitive sites (
2
). A few
Alu
elements have also been identified within MARs. Specifically,
Alu
s with a high AT content may interact with the nuclear matrix. With the
exception of
Alu
elements, whose role in matrix attachment is unclear (and their occurrence
limited to primates), the above elements cannot be easily fashioned into a
definitive consensus pattern. Moreover, these elements may not be appropriate
or necessary for the mathematical determination of MAR-potential. For example, many bacterial, viral and mammalian ORI sequences,
which are characteristic of MARs, are also palindromic. The occurrence of
palindromic sequences at sites of nuclear matrix attachment may thus reflect
the presence of origins of replication, which are already identified by Rule 1
in Figure
2
. DNase I hypersensitivity, while possibly a characteristic of MARs, is likely a
result of the interaction with the nuclear matrix and not its cause. Thus,
DNase I hypersensitivity may represent a useful method for identifying MARs
experimentally rather than computationally.
There is ample evidence to suggest that transcription factor binding sites,
promoter regions and other regulatory regions of the genome may be nuclear
matrix associated. While there are a myriad regulatory elements and promoter
sequences of known composition, these have not been utilized in the model
presented. The computational model proposed in this report will most likely
identify constitutive or class 2 MARs, although, when adjusted, it has been
successful in detecting facultative class 3 MARs (
10
).
Known consensus sequences for eukaryotic transcription factors and promoters can
be identified using algorithms such as SIGNAL SCAN (
14
) and PROMOTER SCAN (
15
). These programs look for singular patterns, and are thus not very useful for
determining the significance of the co-occurrence of many patterns. Similarly, `gene grammar' has been utilized (
16
) to capture relations between the promoters, introns and exons. While this
approach has its merits for detecting functional units within a sequence, its
application is limited to recognizing patterns where the relationship between
the component motifs is known
a priori
. A neural network can be utilized to recognize higher level patterns when such
a relationship between the component motifs is not formally defined but learned
by the computational agents. Such a system has been utilized by Grail where the
network was trained to detect genes ( >= 100 bases in size) in raw DNA sequence (
17
). However, the lack of an appropriate training set (there are very few MAR
regions experimentally mapped) makes neural network an impractical tool for
this problem. Thus, our approach based on assigning a statistical significance
to the co-occurrence of several MAR specific patterns represents a unique and, as we
demonstrate below, viable solution to the MAR detection problem.
As a first step toward the algorithmic detection of regions of probable matrix
association (MARs), an effective mathematical framework for representing such
patterns must be adopted. In our approach, the underlying architecture used to
represent patterns is based on an AND-OR Boolean decision tree. As shown in Figure
3
, such a tree represents a disjunction (OR) of the conjunctions (AND) on motifs
detected in the sequence. Thus, sequence level motifs serve as the lowest level
predicates used to detect the presence of a higher level pattern in the
sequence. Note that the lowest level predicates may be negated before being
used in the AND layer. In such an instance, the absence of motifs is sought to
satisfy the conditions for the occurrence of the higher level pattern.
Figure
As an example, consider a simpler instantiation of such an AND-OR decision tree. A rule to define the origin of DNA replication (
R
1
) can be based on an OR or the or operator applied to the three motifs
m
1
= ATTA,
m
2
= ATTTA and
m
3
= ATTTTA. The motif detectors bypass the AND layer in this case, and directly
feed into the OR layer.
R
1
=
m
1
or
m
2
or
m
3
1
Similarly, the requirement for co-occurrence of multiple motifs can be specified using the AND or the [and] operator. In the AND rule for multiple patterns, an additional
parameter is incorporated to constrain the allowable gap between the two co-occurring motifs. For example, the AT-richness (
R
6
) rule has been formulated as the occurrence of two hexanucleotide strings,
m
18
= WWWWWW (note: the IUPAC code W denotes an ambiguous base A or T), that are
separated by distance of 8-12 nt. Thus, the AT-richness rule can be written as:
{R sub 6} = {m sub {1 8}} {andsign from 8 to {1 2}} {m sub {1 8}}
2
Such a formulation uses an augmented AND operator, [sube][supe][part][clubs]i[hearts][supe] [clubs] <- [equivalent to] {[real] o -[Pi] [clubs] <- [=/=] {h i [hearts] h[Pi], to define the acceptable distance between the two motifs.
As depicted in Figure
1
, sequence motifs serve as predicates in the modeling of the MAR-detection rules. These predicates essentially represent the various
sequence motifs that have been known to occur in the vicinity of MARs. With the
motif indices thus defined, a set of rules for detecting higher level MAR
patterns were developed and shown in Figure
2
. This rule database represents the core set of rules needed to identify MARs in
a DNA sequence. The core set of MAR pattern rules are considerably simpler than
what can be represented by the general rule format (
18
,
19
). However, as the experimental determination of some additional regions of
matrix association continues, other related patterns are likely to emerge. The
more generalized framework for pattern representation will facilitate the
incorporation of new (and potentially complex) pattern rules.
Associated with each of the motifs is a probability of its
random occurrence. This is derived using the base composition of the sequence
being analyzed (
20
). For example, the probability of finding the motif ATTA in a sequence with composition {A,C,G,T} = (0.2, 0.2,
0.3, 0.3) is equal to (0.2
2
[middot]0.3
2
) = 3.6 * 10
-3
.
In order to calculate the random probability of occurrence of a rule, the motif
probabilities are multiplied across an AND layer when the motifs are
independent, and added across an OR layer when their occurrence is mutually
exclusive. Furthermore, assuming a Poisson density function for motif
occurrences, the probability of finding at least one motif within an acceptable
distance from the reference motif can be computed. Thus, the random occurrence
probabilities for rules
R
1
and
R
6
in Equations
1
and
2
will be given by:
Pr
(
R
1
) =
Pr
(
m
1
) +
Pr
(
m
2
) +
Pr
(
m
3
)
Pr
(
R
6
) =
Pr
(
m
18
) * {1 - exp[-(12 - 8 + 1) [middot]
Pr
(
m
18
)]}
3
=
Pr
(
m
18
) [middot] {1 - exp[-5 [middot]
Pr
(
m
18
)]}
4
As we demonstrate, these probability values for random occurrences of the
various rules can be used for assigning statistical significance to the set of
complex patterns that are detected in a given region of the sequence.
The task of detecting regions of matrix association is modeled as a problem of
hypothesis testing. Since matrix association is a property of a span of the
sequence, a sliding window algorithm was considered appropriate for detecting
MARs. The sliding window algorithm uses two parameters,
W
and [delta] to measure a local property of the DNA sequence. The statistic of
interest is measured in a window of size
W
centered at location
x
along that sequence. Successive window measurements are then made by sliding
the window in increments of [delta] nucleotides. If [delta] is small, linear interpolation can be used to join the individual
window statistics gathered at
x
,
x
+ [delta], . . . ,
x
+
k
[delta]. In this manner, a continuous distribution of the parameter of interest
is obtained as a function of
x
.
The null hypothesis,
H
0
, tested in each window is as follows,
H
0
: the frequency of the MAR patterns observed is not significantly different from
the frequency expected in a random
W
nt sequence of the same composition as the sequence being analyzed.
Thus, large deviation from the expected frequency of patterns in a window will
force the rejection of
H
0
, which in turn will imply the presence of a MAR. Under
H
0
, the cumulative probability,
p
, of observing a frequency vector with each of its components greater than or
equal to
f
i
is essentially the probability that the null hypothesis will be erroneously
rejected. In other words, a small value for
p
signifies that the observed event would be a rare occurrence under the null
hypothesis and hence qualify the window sample as a candidate for containing a
site for matrix attachment.
In order to quantify the significance of this deviation, the statistic measured
for each window is [-
log(p)
]. The value [-
log(p)
] is also referred to as MAR-potential, and denoted as [rho]. The value of [rho] is computed for both the forward and the reverse strands of the
DNA sequence and the average of the two values is considered to be the
potential at a given location. We now describe the mathematical model for
calculating the value of [rho].
Let us assume that we are searching for
k
distinct types of MAR patterns within a given window of the sequence.
Typically, these patterns are defined as rules
R
1
,
R
2
, . . . ,
R
k
. Using the probability formulation defined in Equation
3
, the random probability of the occurrence of the various rules is calculated.
Let these values be
p
1
,
p
2
, . . . ,
p
k
.
Next, a random vector of pattern frequencies,
F
, is constructed.
F
is a k
-
dimensional vector with components,
F
= {
x
1
,
x
2
, . . . ,
x
k
}, where each component
x
i
is a random variable representing the frequency of the pattern
R
i
in the
W
nt window. Furthermore, the component random variables
x
i
are assumed to be independently distributed Poisson processes, each with the
parameter [lambda]
i
. The joint probability of observing a frequency vector
F
obs
= {
f
1
,
f
2
, . . . ,
f
k
} purely by chance is given by Equation
5
.
{{P ( F} sub {o b s}} ) = {{roman PI} from {i = 1} to k} {{{{e x p} sup {- {roman lambda} i}} {{roman lambda} sub i} {"" sup {f i}}} over {{f sub
i} !}} ^ w h e r e ^ {lambda sub i} = {p sub i} cdot W
5
The steps required for computation of the
p
, the cumulative probability that the observation satisfies
H
0
, is given by Equation
6
below.
lpile {{p = P r {{( x} sub 1} >= {f sub 1} , {x sub 2} >= {f sub 2} , ... , {x
sub k} >= {f sub k} )} above {" " = {{P r ( x} sub 1} >= {f sub 1} ) cdot {P
sub r} {{( x} sub 2} >= {f sub 2} ) cdot ... cdot P r {{( x} sub k} >= {f sub
k} )} above {" " = {sum from {{x sub 1} = {f sub 1}} to inf} {{{exp sup {- {lambda sub 1}}} {lambda sub 1 sup {x sub 1}}} over {{x sub 1} !}} cdot
{sum from {{x sub 2} = {f sub 2}} to inf} {{{exp sup {- {lambda sub 2}}} {lambda sub 2 sup {x sub 2}}} over {{x sub 2} !}} cdot
... cdot {sum from {{x sub k} = {f sub k}} to inf} {{{exp sup {- {lambda sub k}}} {lambda sub k sup {x sub k}}} over {{x sub k} !}}}}
6
The
p
-value in Equation
6
is next utilized to compute the value of [rho] or the MAR-potential as given by Equation
7
below.
lpile {{rho = ln {{1 . 0} over p} = - ln ( p )} above {^ ^ = {sum from {i = 1} to k} {lambda sub i} + {sum
from {i = 1} to k} ln {f sub i} ! - {sum from {i = 1} to k} {f sub i} ln {lambda sub i}} above {^ ^ ^ ^ - {sum from {i = 1} to k} ln ( 1 + {{lambda sub i} over {{f sub i}
+ 1}} + {{lambda sub i sup 2} over {{{( f} sub i} + {{1 ) ( f} sub i} + 2 )}} +
... + {{lambda sub i sup t} over {{{( f} sub i} + {{1 ) ( f} sub i} + 2 ) ...
{{( f} sub i} + t )}} )}}
7
It is not surprising that a pattern's contribution to the overall MAR-potential is strongest when its observed frequency is high while its
probability of random occurrence is low. The infinite summation term in
Equation
7
quickly converges and thus can be adaptively calculated to the precision
desired. For small values of [lambda]
i
, the series may be truncated such that the last term satisfies the following
condition:
[ {{lambda sub i sup t} over {{{( f} sub i} + {{1 ) ( f} sub i} + 2 ) ... {{( f}
sub i} + t )}} ] <= epsilon
8
Depending upon the level of accuracy desired, the value of <-> [=/=][clubs]i[real]o[supe] is typically set to a small positive number. In
our implementation, <-> [=/=][clubs]i[real]o[supe] was set at 10
-4
.
After computing the values for [rho], the next task requires the interpretation of the statistical potential
values to infer the location of matrix attachment sites. The following provides
a discussion of the basis for selecting some key parameters that can influence
the determination of these sites.
Run length.
A true MAR will be characterized by a run-length of high potential values. In other words, if the average span of a
MAR is
M
span
bases, we should expect to see successive high potential values in an average
run length of:
{r sub L} = {{W - {M sub {s p a n}} - delta / 2} over delta}
9
Thus, if
W
= 1000 bp and [delta] =100 bp, and if
M
span
is assumed to be ~600 bp, an average run-length of three high potential values is expected.
The use of this parameter is illustrated in Figure
4
. The top panel shows the analysis of the sequence using a window size
W
= 1000 bp, while the lower panel shows the same analysis done with
W
= 2000 bp. In both these cases, however, the windows were stepped by [delta] = 100 bp. Using the formula in Equation
9
, the expected value for
r
L
is 3 in the first case and 13 in the second.
Figure
The
r
L
cut-offs specified above were used in establishing the locations of the
various MARs. Although, locations of MARs are only plotted on Figure
4
b, the values were identical in both cases. It may be noted that in general, the
larger window size tends to smooth the potential values and widen any peaks.
This is to be expected since a larger value of [lambda]
i
shifts the Poisson density function to the right and increases the probability
of observing a high
f
i
under the null hypothesis.
Figure
To visualize the locations of these MARs, the analysis window can be zoomed to
the area of interest. This is illustrated in Figure
5
c where the displayed region spans locations 360 kb through 460 kb. The process
is referred to as being intelligent since the normalization of the potential
values occurs only on the basis of the potential values present in the zoomed
region. That is, the saturation values for the entire sequence may be different
than that used in the viewing window.
The ability of the MAR prediction algorithm described above to identify MARs is
shown in Figure
6
. When physical evidence is available, as in the human-globin and PRM1 -> PRM2 -> TNP2 domains, the matrix attachment regions predicted by the
software closely match those established experimentally (
18
). This analysis also successfully identified the MAR at the 5' end of the human apolipoprotein B locus (
18
,
19
). Based on these analyses, a normalized MAR-potential value of 0.6-0.75 was considered to yield reasonable results. The higher value
is generally used for smaller window size.
Figure
The algorithm was then applied to identify candidate MARs in the
Drosophila melanogaster
bithorax region and the human T-cell receptor beta locus, for which the location of MARs are not known.
The results for the Drosophila bithorax region are shown in Figure
4
while Figure
5
shows the analysis for the human [beta] T cell receptor locus. Potential MARs spaced ~60-100 kb apart were identified, which is in accord with data
suggesting that MARs in somatic nuclei occur at intervals of 60 kb to >100 kb (
1
). This suggests that reasonable candidate regions of interest for physical
analysis were identified.
To the best of our knowledge, there is no data to define the spacing of MARs in
the Drosophila genome. It is reasonable to assume that the spacing would be
similar to that seen in other eukaryotic nuclei. The ~338 kb region queried in Figure
4
constitutes a region of the Drosophila genome that contains several homeotic
genes. This has been suggested to be rich in constitutive type MARs. The
regions identified by this analysis need to be verified experimentally, as they
may represent characteristic constitutive or class 2 MARs (
10
).
The method of analysis described above has successfully identified known MARs in
several well characterized loci. Furthermore, it has determined MAR candidates
in a reasonable fashion in large sequences where sites of attachment to the
nuclear matrix are unknown. This is despite the lack of a clear consensus
sequence for MARs. However, care should be taken when applying this approach,
as the results can vary when different combinations of rules are applied. This
may reflect the beginning of the mathematical resolution of the four classes of
MARs. The algorithm described should be readily applicable for the
identification of any functional element for which a consensus sequence is
unclear or unknown. The utility of this approach is immediately obvious for the
identification of regulatory regions of locus control, which may contain
multiple individual sequence motifs. Such regulatory motifs, which may occur at
random throughout the genome, would be in close association with other
regulatory and promoter motifs at regions of locus control, like the well
characterized locus control region of the multigenic [beta]-globin domain. Use of this algorithm in conjunction with a database
containing known regulatory motifs should enable the identification of
potential regions of locus control in large sequences, just as the current
analysis identified MARs.
A beta-version of the software described in this paper may be obtained by sending
an e-mail message to gbs@acm.org.



Intelligent zoom.
While processing a large sequence, caution must be exercised in the
interpretation of a graphical plot of the potential values. Specifically, if a
large number of bases are packed into a single pixel location, multiple peaks
will have the tendency to overlap one another. This could possibly leave one
with a false impression that the results are noisy.

REFERENCES
Return
