Logitlinear models for the prediction of splice sites in plant pre-mRNA sequences
Logitlinear models for the prediction of splice sites in plant pre-mRNA sequences
Jürgen
Kleffe*
,
Klaus
Hermann
,
Wolfgang
Vahrson
,
Burkhardt
Wittig
and
Volker
Brendel
1
Freie Universität Berlin, Institut für Molekularbiologie und Biochemie, Bereich Molekularbiologie und
Informatik, Arnimallee 22, 14195
Berlin
,
Germany
and
1
Department of Mathematics, Stanford University,
Stanford
, CA 94305,
USA
Received August 19, 1996;
Revised and Accepted October 16, 1996
ABSTRACT
Pre-mRNA splicing in plants, while generally similar to the processes in
vertebrates and yeast, is thought to involve plant specific
cis
-acting elements. Both monocot and dicot introns are typically strongly
enriched in U nucleotides, and AU- or U-rich segments are thought to be involved in intron recognition,
splice site selection, and splicing efficiency. We have applied logitlinear
models to find optimal combinations of splice site variables for the purpose of
separating true splice sites from a large excess of potential sites. It is
shown that plant splice site prediction from sequence inspection is greatly
improved when compositional contrast between exons and introns is considered in
addition to degree of matching to the splice site consensus (signal quality).
The best model involves subclassification of splice sites according to the
identity of the base immediately upstream of the GU and AG signals and gives
substantial performance gains compared with conventional profile methods.
INTRODUCTION
An important aspect of eukaryotic genome research is scanning functionally
unknown sequences for potential split genes. Because experimental capacity is
small compared with the rate of accumulation of genomic sequence data, the
majority of new sequences must be studied by statistical, computational
methods. Current approaches to finding genes and functional sites in DNA
sequences have recently been reviewed (
1
,
2
). Most algorithms involve identification of potential splice sites (search by
signal) as a preliminary step to the task of parsing the sequence into
consistently ordered, translatable exons. Ideally, one would like splice site
prediction to be based on the same signals that are recognized by the nuclear
splicing machinery. Practically, most methods are based on consensus motifs and
weight matrices scoring the degree of fit to some average signal pattern around
known splice sites in a learning set (e.g.,
3
-
6
). Neural network applications were introduced by Brunak, Engelbrecht and
Knudsen (
7
). The construction of decision trees based on categorical discriminant analysis
by Sirajuddin
et al
. (
8
) has recently promised improved donor site recognition.
The general features of splicing appear to be conserved throughout all
eukaryotes. The failure of accurate splicing of heterologous introns in
transformed plant cells suggested that particular features of plant introns are
essential for accurate pre-mRNA processing (for reviews, see
9
,
10
). In particular, several studies have demonstrated that U-rich segments within plant introns influence splice site selection (e.g.,
11
-
13
). Most recently, it was shown that the relative contrast in U and G+C content
between introns and their flanking exons correlates with splicing efficiency
(Carle-Urioste, Brendel and Walbot, submitted). Contrast-enhancing changes within either introns or exons improved splicing
efficiency. It was suggested that evaluation of compositional contrast could
improve prediction of splice sites.
Here we present a novel algorithm for the prediction of splice sites in higher
plants based upon the two variables of splice site signal strength and
compositional contrast. Our model seeks to incorporate a minimal set of local
sequence properties accessible to the nuclear splicing machinery, and, in
particular, does not explicitly consider reading frame and codon usage
information. In this way, analysis of false positive as well as false negative
predictions may point to missing variables (e.g., branch point consensus,
specific sequence motifs; Brendel, Kleffe and Walbot, manuscript in
preparation). On the other hand, for the practical task of identifying split
genes, incorporation of global coding potential assessment greatly reduces the
number of falsely predicted splice sites, as demonstrated in the recent
comprehensive study of splice site prediction in
Arabidopsis thaliana
by Hebsgaard
et al
. (
14
). From a statistical standpoint, our method is a standard application of
logitlinear models. We provide a rationale for the applicability of such models
for a wide variety of sequence analysis problems.
MATERIALS AND METHODS
Gene collections
Genomic sequences from
Zea mays
and
A.thaliana
were retrieved from GenBank and compiled into specifically annotated non-redundant databases (redundancy as a result of significant sequence
similarity was assessed as described in ref.
15
). Only completely sequenced genes were included, i.e. those genes for which all
introns between the start codon and the stop codon are available. For
Z.mays
, our database contains 46 genes that encode distinct proteins and comprise a
total of 250 exons and 204 introns (this database denoted GBEzm). Obvious
annotation errors in GenBank entries were corrected. For
Arabidopsis
, a database of 131 distinct genes was obtained with a total of 709 exons and
578 introns (GBEat). In this case, because many genes are available, we simply
excluded GenBank entries with likely erroneous annotation. Korning
et al
. (
16
) offer a detailed account of problems with GenBank entries and provide a cleaned
Arabidopsis
gene set of similar size.
Splice site collections and control sets
Databases for donor and acceptor sites and respective control sets were derived
from our gene collections in the following way. Identification of splice and
control sites was restricted to the pre-mRNA portions between the known start and stop codons of each gene. This
restriction accomplishes several goals. First, it limits the number of control
sites, which is already large even prior to scanning the flanking regions and
the opposite DNA strand. Second, correct identification of splice sites in this
restricted region is of practical interest, because there are now several
independent promising methods for predicting eukaryotic promoters and
terminators (
17
-
19
). These methods, if applied successfully, would similarly limit the search
space. Third, in terms of understanding the cellular splicing machinery, the
task is, of course, to distinguish true splice sites from non-sites in pre-mRNA, not in genomic DNA.
Our donor site and corresponding control sets consist of all GU dinucleotides
occurring within the prescribed sequence bounds and including 50 nt on each
side. For each such site, we record the signal sequence and the percent
nucleotide composition evaluated separately for the 50 nt in the 5' and 3' flanks. The donor signal sequence was chosen to cover the 3 nt 5' and the 4 nt 3' to the GU. The donor site data sets comprise 201
true sites plus 6305 control sites for
Z.mays
and 577 true sites plus 14 964 control sites for
Arabidopsis
. In three of the 204 maize introns the donor site consensus GU is replaced by
GC. These exceptional sites were not considered for the purposes of this study.
For acceptor sites and their controls, we selected all AG dinucleotides with 50
nt flanks and created corresponding data sets comprising a total of 204 true
sites plus 6290 control sites in maize and 577 true sites plus 15 712 control
sites in
Arabidopsis
. The acceptor signal sequence was defined to cover 13 nt to the left of AG and
2 nt to the right.
Logitlinear models for splice site recognition
As evident from the sizes of the data sets described above, for every true
splice site within a particular pre-mRNA there are on the order of 30 non-sites conserving the consensus GU for donors or AG for acceptors. We
wish to derive rules that distinguish the real sites from the bulk of non-sites on the basis of local sequence properties. From a practical
standpoint, any such rules would suffice if their general applicability could
be demonstrated on a set of new sequences not involved in the derivation of
these rules. Beyond this practical goal, we advocate a modeling approach that
also seeks to incorporate known and presumed properties of the biochemistry
underlying splice site recognition. In this way, the modeling results provide a
consistency check on possible factors contributing to splice site recognition
by evaluating the predictive usefulness of these factors individually and in
various combinations.
The
in vivo
mechanisms of splice site selection are undoubtedly quite complex and details
are still largely unknown. Thus, any modeling attempt will necessarily involve
considerable simplifications. Here, we consider splice site recognition in
terms of a classical dose-response assay. In this case, `dose' refers to measurable characteristics
of a site (degree of matching to the extended signal consensus, compositional
contrast between the upstream and downstream signal flanking regions, possibly
other features). The response is whether or not splicing occurs at that site
under standard conditions.
For a given site, let
P
denote the probability that splicing succeeds. Experimentally
P
could be measured in terms of splicing efficiency expressed as the proportion of
pre-mRNA transcripts that are spliced at that site. All other things being
equal, we assume that
P
depends only on the dose of measured sequence characteristics. For simplicity,
first assume that splice site sequence characteristics can be measured in terms
of a single variable,
x
, such that high positive values of
x
correspond with
P
-values close to 1 and low negative values of
x
correspond with
P
-values close to 0. It is convenient and customary in this case to utilize
a sigmoidal curve of the explicit form:
P ( x ) = {{e sup {alpha + x}} over {1 + {e sup {alpha + x}}}}
1
where
x
= -[alpha] is the dose that gives half-maximal response. More generally, various characteristics of
a site are measured simultaneously such that a specific site,
i
,
is represented by the vector
{x vec sub i} prime
= (1,
xi
1,
xi
2, ......,
xi
k) of observables, and, in the simplest generalization of
1
, [alpha] +
x
is replaced by the linear combination {x vec sub i} prime {beta vec}, where beta vec = ([alpha], [beta]
1
, [beta]
2
, ....., [beta]
k
)' is a set of parameters to be estimated. Parameter estimation for this
logitlinear model can be carried out in standard fashion as follows (e.g.,
20
,
21
). Let
Y
be the indicator variable, set to 1 for true splice sites and to 0 for non-sites. Let y vec be the vector of observed
Y
values, consisting of
AP
ones for true sites and
AN
zeros for false sites in a given training set (
AP
and
AN
stand for actual positives and actual negatives, a notation also used below).
Then the joint loglikelihood of the observations
y vec
is given by
L = L ( {beta vec} ) = {sum from {i = 1} to N} {left ( {{y sub i} ln P ( {x vec
sub i} prime {beta vec} ) + ( 1 - {y sub i} ) ln ( 1 - P ( {x vec sub i} prime {beta vec} )} right )}
2
where
N
=
AP
+
AN
is the total number of sites in the training set. The maximum likelihood
estimator for [beta] is found numerically as the unique solution to the system of non-linear equations X ( {y vec} - {P vec} ) = {0 vec}, where
X
is the (1 + k) *
N
matrix with columns equal to the vectors x vec sub i (
X
is assumed to be of full rank), and P vec is the vector of probabilities
P
({x vec sub i} prime {beta vec}). Many computer programs are available to obtain such parameter estimates
numerically, e.g., by application of the Newton-Raphson algorithm (
20
,
21
). We used corresponding functions from StatUnit by Tue Tjur (
22
) as implemented in our freely distributed DNASTAT package (
23
). Convergence was always achieved after a few steps, but the calculations were
lengthy due to the large number of observations and parameters.
A cautionary note concerning interpretation of the model is in order. Assuming
the validity of the model,
P
would truly be an estimate of splicing efficiency if the training data
consisted of repeated observations for each particular combination of site
characteristics. In other words, the indicator variable
Y
would have to be observed repeatedly for each site, and
P
is simply the estimated success probability for this binomial variable. Such
data are commonly not available. Instead, we sample over distinct sites in the
pre-mRNA and make only one observation per site (true splice site or non-site). The validity of this sampling scheme as an approximation to
the repeated observations sampling scheme rests on the continuity of the
P
function in its dependence on {x vec} prime {beta vec}. Thus, we replace sampling of repeated experiments on the same site by sampling
of other sites with similar characteristics x vec. Sites with scores {x vec} prime {beta vec} similar to those of many non-sites are considered poor splice sites; sites with exceptionally high
scores are considered efficient splice sites.
There is experimental evidence in support of these modeling assumptions for our
choice of independent variables. Thus, it is well documented that improved
matching of the splice site consensus increases splicing efficiency and that
base compositional changes in both introns and exons affect splice site
selection and splicing efficiency in the predicted manner (e.g.,
11
-
13
,
24
-
26
).
Selection of independent variables
In this paper we consider the prediction of splice sites on the basis of three
local sequence properties: (i) the signal sequence, (ii) the U content of 50 nt
sections upstream and downstream of the consensus donor GU or acceptor AG, and
(iii) the G+C content of these sections. The latter two properties were
quantified as the difference in percent nucleotide content and denoted by
X
U
and
X
GC
, respectively (Fig.
1
). These variables are clearly not independent but emphasize different aspects
of a potential splice site and its context.
Splice site prediction and evaluation of models
Given the above interpretation of
P
, prediction of splice sites may be based on the following simple rule:
decision:
true site if
P
>
c
decision:
non-site if
P
<=
c
5
where
c
is an appropriately chosen constant between 0 and 1. The problem of splice site
prediction then consists of optimal choices for
P
as a function of a set of observed site characteristics and for
c
given
P
in order to minimize prediction errors. For any particular function
P
and threshold
c
, the described classification splits a sample of splice sites
S
into the two sub-samples
S
>
c
and
S
3
c
of sites with score >
c
or <=
c
, respectively. In general, both sub-samples contain true sites and non-sites. The prediction method is better the more the distributions of
true sites and non-sites are biased towards true sites in
S
>
c
and non-sites in
S
3
c
.
Our evaluation of the performance of the various models follows the treatment of
Brunak
et al
. (
7
), using the notation of Snyder and Stormo (
5
). Thus, let the number of sites in
S
>
c
be predicted positives (
PP
), consisting of
TP
true positives (real sites) and
FP
false positives (non-sites). Let
S
3
c
consist of
FN
false negatives (real sites of low score) and
TN
true negatives (non-sites), adding up to a total of
PN
predicted negative sites. Let
AP
=
TP
+
FN
be the number of actual positives (true sites), and let
AN
=
FP
+
TN
be the number of actual negatives (non-sites). Then
Sn
=
TP
/
AP
measures the sensitivity of the method: what fraction of the real sites are
correctly predicted?
Sp
=
TP
/
PP
measures the specificity of the method: what fraction of the predicted
positives are real sites? An overall measure of the quality of the method is
given by the Kendall tau rank correlation coefficient for dichotomous
classifications (e.g.,
27
),
tau = {{( T P ) ( T N ) - ( F P ) ( F N )} over sqrt {( P P ) ( P N ) ( A P ) ( A N )}}
6
[tau] is 1 for a completely accurate prediction and -1 for a completely erroneous prediction.
High specificity is important for gene prediction programs based on signal
identification. Each false positive splice site prediction may generate a large
number of false exon-intron structures. However, high sensitivity may be even more important.
Each true site in
S
3
c
that is discarded by the gene prediction algorithm causes it to miss the true
parsing of the gene. We therefore focused in this study on site prediction
methods that keep the rate of false negative predictions very low while
minimizing the rate of false positive predictions. For each model we report
sensitivity, specificity, and [tau] for three choices of
c
(relative to the training set): (i) the highest response value
P
for non-sites giving
Sn
= 1 (predicting all true sites correctly), (ii) the highest response value
P
for non-sites giving
Sn
>= 0.95 (allowing 5% of real sites to be missed), and (iii) the smallest value
of
c
that maximizes [tau]. Note that there are always different choices of
c
giving the same predictor performance for the training sample. In each case the
response value close to an appropriately chosen true site value would work as
well. Restriction to response values of non-sites as candidates for the threshold
c
gives more reliable results, because the sample of non-sites is much larger than the sample of true sites.
When comparing the performance of different models it is convenient to employ
the monotonicity of
P
({x vec sub i} prime {beta vec}) and re-scale the
P
-values such that the decision threshold is at a fixed value. For example,
changing the constant [alpha] to [alpha] - ln[
c
/(1-
c
)] adjusts the decision threshold to 0.5.
RESULTS
Splice site profiles
All true splice sites were initially aligned with respect to the consensus GU or
AG, respectively, and the nucleotide distribution was determined for each
column of this alignment. Let
f
ib
denote the frequency of nucleotide
b
in position
i
, where
i
extends over all signal positions. The entropy values
I
(
i
) = -[Sigma]
b
f
ib
ln
f
ib
provide a means of defining the extent of the signal sequence (
28
).
I
(
i
) attains its minimal value 0 in the extreme case that one of the bases
b
in position
i
is completely conserved and its maximal value -ln 0.25 = 1.386 in the uniform case
f
ib
= 0.25. Selecting the region around the splice sites with significantly lower
than base level entropy, we defined the extent of the donor signal from -3 to +6 (negative integers extending into the exon, positive integers
extending into the intron) and the acceptor signal from -15 to +2 (negative integers extending into the intron, positive integers
extending into the exon). The donor site signal corresponds to the well known
region of complementarity to U1-snRNA, and the acceptor site signal contains the conserved U-rich tract upstream of position -4 (for recent reviews see
9
,
10
).
None of the 204 sequences in the maize acceptor site set has a G in position -3. If regarded as an absolute requirement, then the search algorithm for
potential acceptor sites would have to discard any GAG/.. triplet site, no
matter how well the other positions of the consensus are matched. A more
sanguine approach would consider the complete lack of G as reflecting the
relatively small sample size of the training set and allow for a small
proportion of GAG/.. sites for consideration. We opted for the latter
(particularly because in the
Arabidopsis
set of 578 sites there are three sequences with G in position -3) and replaced the observed frequencies in the profiles with
probabilities estimated from pseudocounts derived as in (
29
) based on a Bayesian principle with uniform prior distribution.
Signal site extent and the derived profile frequencies do not differ
significantly from previously published tabulations (
9
,
28
,
30
) and are therefore not shown here. The most frequent nucleotides in each
position yield the familiar (A or C)AG/
GU
AAGU donor site and U
11
GC
AG
/GU acceptor site consensuses. The control set profiles in each position closely
reflect the average genomic base composition, as expected, because the non-sites are sampled equally from exons and introns.
Compositional contrast
The average mononucleotide composition in windows of length 50 nt flanking both
sides of the donor GU and acceptor AG signals is displayed in Table
1
. The intron internal flanks display a U percentage ~13.5 and 16.5 points higher than that in the exons around the donor and
acceptor sites, respectively. The exon parts are 8-10.5 percentage points higher in G content, and to a lesser extent higher
in C and, for acceptor sites, also in A. The Wilcoxon signed-rank test for comparison of matched pairs was used for testing the
statistical significance of the content differences (
31
). The contrast in C, G and U content is statistically significant for both
donor and acceptor sites, whereas the A content differs significantly only for
acceptor sites. Note that we cannot similarly specify a typical compositional
contrast for the set of non-sites, because this set contains sites with flanking regions entirely
within exons, or entirely within introns, or involving exons and introns in
either order. In particular, the AG immediately upstream of the GU in the donor
site consensus gives rise to a fair proportion of non-acceptor sites with donor-like contrast. Similarly, the GU immediately downstream of AG in the
acceptor site consensus gives rise to non-donor sites with acceptor-like contrast.
Splice site prediction
We initially studied 12 different models for the prediction of splice sites.
Results for the maize data are shown in Table
2
; the
Arabidopsis
set gave similar results (not shown). For each model, the sensitivity,
specificity, and [tau] value were derived for different levels of the threshold
c
in equation
5
. The models are represented by their defining sets of variables given in column
one of Table
2
. For instance, the set of variables
W
s
,
W
n
,
X
U
,
X
GC
denotes model equation
3
, and the set
W
s
,
W
n
,
X
U
denotes model equation
3
without considering the variable
X
GC
, i.e., setting the parameter [mu] = 0.
.
Compositional contrast around maize and
Arabidopsis
splice sites.
Donor sites
Acceptor sites
exon
intron
[Delta]
intron
exon
[Delta]
Maize
U
21.1
34.5
-13.4
37.6
21.5
16.1
A
24.0
23.7
0.3
21.5
25.0
-3.5
G
26.9
18.9
8.0
19.2
29.5
-10.3
C
28.0
22.9
5.1
21.7
24.0
-2.3
Arabidopsis
U
27.3
41.0
-13.7
43.4
26.0
17.4
A
26.9
27.0
-0.1
24.5
29.0
-4.5
G
23.0
14.6
8.4
16.9
26.4
-9.5
C
22.8
17.4
5.4
15.2
18.6
-3.4
Displayed are the average percent base frequencies in 50 nucleotide flanks
upstream and downstream of the conserved donor GU and acceptor AG, respectively. For
convenience, the differences between upstream and downstream percentages are
given in columns 4 and 7.
Predictions are based on 201 true and 6305 false donor sites and on 204 true and
6290 false acceptor sites. FN, number of false negatives; FP, number of false
positives; Sn, sensitivity; Sp, specificity; tau, correlation coefficient,
equation
6
. The 12 different models are based on equations
3
and
4
of the text with only the indicated variables included. For example,
W
s
,
W
n
,
X
GC
refers to model equation
3
with [delta] = 0. Numbers that highlight performance differences between selected
models appear in bold face.
As a standard model for comparisons we chose prediction based on the usual profile scores,
W
sn
. Requiring 100% sensitivity, such that no true splice sites are overlooked, forces choice of the predictor
threshold
c
of equation
5
below the level of the weakest true site in the training set. This strict
requirement results in a very large number of false splice site predictions. In
the case of the standard model, falsely predicted splice sites outnumber the
true ones in a ratio of about fifteen to one. Allowing 5% of true sites to be
missed greatly reduces the number of false positive predictions. Specificity
improves about 4-fold to 23% for donor sites and ~2-fold to 12% for acceptor sites in the standard model (Table
2
).
Inclusion of the contrast variables
X
U
and
X
GC
significantly improves prediction quality. For example, for the standard model
at 95% sensitivity the number of false positive donor sites decreases from >600
to <300 when either
X
U
alone or both
X
U
and
X
GC
are considered in the model. Similarly, the number of false positive acceptor
sites drops from nearly 1400 to <700 upon inclusion of both contrast variables.
Replacing the
W
sn
variable by the separate terms
W
s
and
W
n
for the most part does not change the predictor performance very much. An
exception are the acceptor site models involving
X
U
, for which the
W
s
,
W
n
models are clearly superior to the models at the 95% sensitivity level (but not
in terms of comparing the [tau] values). A much stronger and consistent improvement is obtained with the
L
models involving the contrast variables. Specificity at the 95% sensitivity
level is up to 40% for donor sites and 32% for acceptor sites for the
L
,
X
U
,
X
GC
model.
It is noteworthy that, at the 100% sensitivity level, the gains obtained with
the
L
models compared with the standard profile models are at best slight when the
contrast variables are not considered. However, the improvements are
substantial for the full models including the contrast variables: donor site
specificity increases from 6% to 12% for the
L
model as a result of adding the variables
X
U
and
X
GC
compared with an increase from 6% to only 8% for the profile models, and
acceptor site specificity increases from 7% to 12% compared with an increase
for the profile models from 6% to 7%. A distinction between the
L
variable and the profile variables
W
sn
or
W
s
and
W
n
is that, for the former, weights in each signal position are derived
ab initio
during the training, whereas for the latter the positional weights are fixed as
log-frequencies and only the relative contribution of their overall sum per
site is estimated in the regression model. The greater flexibility of the
L
variable appears advantageous. Moreover, the large performance differences
between the models upon inclusion of the contrast variables suggests that
signal strength and compositional contrast do not contribute independently to
splice site recognition. This apparent correlation is obvious for acceptor
sites, because the U-rich section of the acceptor signal sequence clearly contributes to the U-content of the upstream flank, but it is less obvious for donor site
prediction.
To further investigate these issues, we categorized the splice sites according
to presence or absence of the most conserved signal residue apart from the
donor GU and acceptor AG. For donor sites this residue is G in position -1 immediately upstream of the GU. Eighty-six percent of the donor sites in the maize set and 79% in the
Arabidopsis
set conserve this G. For acceptor sites, signal position -3 immediately upstream of the AG is C in 77% of the maize introns and in
68% of the
Arabidopsis
introns. Table
3
and Table
0
display the predictor performance of the L,
X
U
,
X
GC
model trained separately on the respective subsets of splice sites. For maize
donor sites, great improvement is obtained at the 100% sensitivity level:
compared with the non-categorized model, the number of false positive predictions is reduced
nearly 3-fold. Improvement is also substantial for
Arabidopsis
donor sites at 100% sensitivity. A small but consistent improvement is evident
for acceptor sites in both species, and for all comparisons with the 95%
sensitivity and maximal [tau] criteria.
While 100% sensitivity is a desired goal for splice site prediction in gene
finding algorithms, this is met with considerable difficulty for methods based
entirely on local sequence characteristics. This is particularly evident for
Arabidopsis
acceptor site prediction. There are three sites in our training set which score
so low in all models that their inclusion forces acceptance of an exceedingly
large number of false sites (data not shown). One of these sites precedes a
nine base exon (exon 2 of
Atbfruct1
, GenBank accession number X74515) and accordingly displays atypical
compositional contrast as the downstream 50 base flank is essentially all
intron. The second site occurs in the second intron of the cytochrome c gene
(GenBank accession number M85253). This intron is very short (59 bases) and
only 13.6% U, resulting in poor profile and contrast scores. The third site
(intron 10 of GenBank accession number U05599) is one of only three
Arabidopsis
acceptor sites featuring G in position -3 and there are only two U nucleotides in the -15 to -5 region. Exclusion of these troublesome sites from the
training set restores normal levels of predictor performance (Table
4
), thus clearly identifying these sites as outliers.
Prediction of splice sites in maize pre-mRNAs upon subclassification of splice sites according to extended
consensus
Set
FP
Sp (%)
tau
FN
Sn (%)
FP
Sp (%)
tau
FN
Sn (%)
FP
Sp (%)
tau
5' GGU
361
32
0.49
8
95
166
50
0.64
53
69
24
83
0.73
HGU
148
16
0.40
1
97
119
19
0.42
5
83
11
69
0.75
combined
509
28
0.51
9
96
285
40
0.60
58
71
35
80
0.75
CAG
857
15
0.27
7
96
232
39
0.56
32
80
61
67
0.70
DAG
453
9
0.29
2
96
69
39
0.61
12
74
5
88
0.81
combined
1310
13
0.33
9
96
301
39
0.60
44
78
66
71
0.74
Results are shown for the
L
,
X
U
,
X
GC
model, trained separately on the indicated subsets of donor and acceptor site
samples. Notation is as in Table 2. Prediction criteria were set to 100%
sensitivity (columns 2-4), >= 95% sensitivity (columns 5-9), and maximal tau (columns 10-14). The set 5' GGU consists of 172 true and 1466 false donor
sites, all with G preceding the GU donor consensus. H denotes non-G. 5' HGU consists of 29 true and 4839 false donor sites. D denotes non-C. The set 3' CAG consists of 157 true and 1645 false acceptor
sites, and 3' DAG consists of 47 true and 4645 false acceptor sites. The combined sets
show improved prediction compared with the non-categorized models (cf. corresponding bold face entries in Table 2).
Prediction of splice sites in
Arabidopsis
pre-mRNAs
Set
FP
Sp (%)
tau
FN
Sn (%)
FP
Sp (%)
tau
FN
Sn (%)
FP
Sp (%)
tau
5' all
3594
14
0.32
28
95
819
40
0.60
101
82
205
70
0.75
5' GGU
1509
23
0.37
22
95
395
52
0.66
52
89
158
72
0.77
5' HGU
1011
11
0.31
5
96
362
24
0.47
31
74
38
70
0.72
combined
2520
19
0.39
27
95
757
42
0.61
83
86
196
72
0.77
3' all
4417
12
0.29
28
95
754
42
0.61
122
79
216
68
0.72
3' CAG
1855
17
0.27
19
95
323
53
0.67
79
80
102
75
0.75
3' DAG
2289
8
0.25
9
95
305
37
0.58
56
70
72
65
0.67
combined
4144
12
0.30
28
95
628
47
0.65
135
76
174
72
0.73
Results are shown for the
L
,
X
U
,
X
GC
model, trained separately on the indicated subsets of donor and acceptor site
samples. Notation is as in Table 2. Prediction criteria were set to 100%
sensitivity (columns 2-4), >= 95% sensitivity (columns 5-9), and maximal tau (columns 10-14). The 5' all set consists of 577 true and 14 964 false
donor sites. 5' GGU consists of 458 true and 3729 false donor sites, all with G
preceding the GU donor consensus. H denotes non-G. 5' HGU consists of 119 true and 11 235 false donor sites. The 3' all set consists of 574 true and 15 712 false acceptor
sites. Three poorly scoring true acceptors were excluded as discussed in the
text. D denotes non-C. The set 3' CAG consists of 387 true and 3240 false acceptor sites, and 3' DAG consists of 187 true and 12 472 false acceptor sites.
The rows labeled `combined' give the overall values for prediction based on the
subclassifications.
Interpretation of model parameters
The parameter values estimated in the wake of fitting the models to the training
sets should reflect the relative importance of the different variables and thus
may guide future modeling as well as interpretations in terms of the underlying
splicing process. We discuss these possibilities for the
Arabidopsis
donor site model with subclassification. The estimated model parameters are
given in Table
5
. Several observations stand out: (i) The constant term [alpha] is ~9-fold higher for the GGU sites than for the HGU sites.
Inserting into equation
4
we calculate a
P
-value of 0.97 for a perfectly matching GGU donor site
L
= 0 assuming no effect of compositional contrast
X
U
= X
GC
= 0
, compared with a
P
-value of only 0.59 for a perfectly matching HGU donor site. Thus, our
model predicts that splice site quality is considerably reduced if the
consensus G in position -1 is replaced. (ii) From Table
1
, the expected contrast values are -
X
U
[approx]
X
GC
[approx] 0.14. Inserting these values (with
L
remaining zero) results in
P
-values of 1.00 and 0.97 for GGU and HGU sites, respectively. Thus, the
model predicts that average or better compositional contrast can fully restore
splice site quality in the HGU class of sites. (iii) The most negative weights
occur in positions -2, 1 and 3 for GGU sites, and in positions 1-3 for HGU sites (positions counted relative to GU, cf. Fig.
1
). Note that the penalties in positions 1-3 are more severe for the HGU set compared with the GGU set, indicating
that further mismatches to the consensus are probably ill tolerated in the
former set. The importance of G at position 3 was also confirmed by Hebsgaard
et al
. (
14
) and likely reflects a requirement of hybridization by U1-RNA (
9
,
10
).
Validation tests for the splice site predictor
We tested in several ways how well the prediction rule
5
works to identify true splice sites in genomic DNA. Standard cross-validation techniques proved all models robust so that over-fitting during training could be ruled out (data not shown). For an
additional test, we compiled independent test sets consisting of five maize
genes (originally excluded from our training set as a result of missing
sequence information for one intron in each case) and for
Arabidopsis
comprising 65 sequences contained in the Korning
et al
. (
16
) set, but not in our training set. Splice sites in these sets were predicted
with the predictor threshold
c
set to the values derived in the training. As shown in Table
6
, the prediction quality on the test sets is entirely comparable with the
results obtained for the training sets (Table
3
and Table
4
). Prediction with the thresholds that maximize the [tau] correlation measure on the training sets is the least stable. However,
this is of little concern because in typical applications the threshold will be
set to the more stable extremes that give either high sensitivity or high
specificity. It is of particular significance to note that the
L
,
X
U
,
X
GC
model with subclassification outperforms the other models on the test sets as
it did on the training sets. Thus, the displayed performance values seem to
reflect the best possible accuracy for splice site prediction based on the
scope of models we investigated.
Parameters are given for the
L
,
X
U
,
X
GC
model (equation
4
) with subclassification. Parameters set to 0 correspond to the consensus donor site residues.
GGU/HGU and CAG/DAG denote the
L
,
X
U
,
X
GC
models with subclassification. The three levels of the predictor threshold were
set in accord with the respective training data (Tables 3 and 4). Notation is
as in Table 2. The maize test set consists of 35 true and 951 false donor sites
and 37 true and 933 false acceptor sites. The
Arabidopsis
test set consists of 277 true and 6101 false donor sites and 280 true and 6022
false acceptor sites.
DISCUSSION
Elucidation of the
trans
-acting factors and
cis
-acting elements involved in splice site selection and determination of
splicing efficiency in plant pre-mRNAs has been hampered by the absence of a plant
in vitro
splicing system. Similarities to yeast and vertebrates in terms of splice site
consensuses and the sequences of the small nuclear ribonucleic acids are
juxtaposed by plant specific features. Thus, plant introns generally lack a
polypyrimidine run upstream of the 3' splice site, a feature that is conserved in most yeast and vertebrate
introns, nor do they contain a characteristic branchpoint sequence. On the
other hand, plant introns are typically distinguished from the flanking exons
by a strong bias towards U bases (Table
1
), and this bias plays a role in accurate pre-mRNA processing (for reviews see
9
,
10
).
Here we pursued the challenge of predicting splice sites in plant pre-mRNA from local sequence inspection. A successful solution to this problem
would both suggest that our understanding of splice site recognition variables
is sound and also be of considerable practical importance in the context of
gene identification algorithms. Specifically, we attempted to derive rules that
would distinguish true donor and acceptor splice sites from the large excess of
alternative sites that minimally conserve only the characteristic GU and AG
consensuses, respectively. For each site we evaluated (i) the particular
sequence in the signal positions that are typically conserved in true splice
sites, and (ii) the compositional contrast between the flanking 50 bases
upstream and downstream of the sites (Fig.
1
). We explored logitlinear models to map linear combinations of the values of
these variables onto the [0,1] interval, where 1 indicates strong prediction of
a splice site and 0 indicates strong rejection of that hypothesis.
Profile methods
The strongest evidence that the considered variables truly determine splice site
selection would derive from a clear separation of the scores for true splice
sites from those of control non-sites in both training and test sets. The degree of separation is most
stringently assessed by how many false positive predictions are made when the
predictor is trained to 100% sensitivity, i.e., not to reject any true site.
For example, just defining the splice sites by the consensus GU and AG
dinucleotides leads to ~30 false positive predictions for every truly identified site. Our results
(Table
2
) show that discrimination based on the usual profile scores
W
sn
decrease the number of false positive predictions only ~2-fold. For every true donor site in the maize training set the model
would also accept an average of 15 false sites, and for every true acceptor
site it would include ~16 non-acceptor sites.
One hundred percent sensitivity is, of course, a very strict requirement, and
one that is easily foiled by a few exceptional sites in the training set,
including possible cases of erroneous splice site determinations or
annotations. Some such problems occurred with our original
Arabidopsis
acceptor site training set (Table
4
). But even allowing 5% of true sites to be overlooked by the predictor (i.e.,
training the predictor to 95% sensitivity), the profile method would still
accept an average of three or seven false sites for every correctly predicted
maize donor or acceptor site, respectively (Table
2
).
Improved methods
The above numbers demonstrate the need for improved splice site prediction
methods. We showed that including compositional contrast variables and modeling
the signal sites by the more general weights
L
(Fig.
1
) greatly enhanced the performance of the predictor: at 100% sensitivity, the
ratio of false positive sites to correctly identified sites decreased to ~7, and at 95% sensitivity, this ratio decreased to 2 or less (Table
2
).
The result that the logitlinear models involving
L
instead of the usual profile scores performed so much better was initially
surprising. Our interpretation is that the equal weighting of the different
signal positions in a common profile application is inadequate. For example, in
the maize donor sites the position immediately upstream of GU is 86% G.
However, in the non-sites exceeding the minimal profile score of the true sites this position
is only 33% G. Thus, the majority of these non-sites compensate for the lack of this G with a more favorable match to the
consensus in other positions. But the compensation in terms of scoring
apparently does not reflect splice site functionality.
As a refinement of the model, we therefore subclassified true splice sites and
control non-sites according to the presence or absence of the most strongly conserved
splice site bases apart from the characteristic GU and AG. This approach
succeeded in further lowering the numbers of false positive predictions, both
in the training sets (Table
3
and Table
4
) and in independent test sets (Table
6
). The improvements in splice site prediction are illustrated for a typical gene
in Figure
2
. The best model (
L
,
X
U
,
X
GC
with subclassification) is built into our
SplicePredictor
program.
Perspective
Despite the substantial improvements in splice site prediction with the refined
models, the success rate of the predictor is not entirely satisfactory. This
may reflect inherent limitations within the current framework. First, there are
other factors influencing splice site selection that have not been considered.
For example, recent studies have identified branchpoints in some plant pre-mRNAs (
32
,
33
). The exact sequence requirements for plant branchpoints are unclear, but, when
present, the branchpoint likely has a role in 3' splice site selection. There are also suggestions of specific intron and
exon recognition factors (
9
,
10
), which may actively contribute to true splice site selection or otherwise mask
high scoring non-sites by binding to pre-mRNA.
Also not considered in our current model are changes in the local site
attributes during the sequential process of splicing. For example, the splice
sites delineating very short exons would typically display poor compositional
contrast. Once one of the introns is removed, the remaining intron will be
flanked by the fusion of the short exon with its adjacent exon, which should
restore the typical compositional contrast. It is also possible that relatively
high scoring non-sites are locally suboptimal compared with nearby true sites and are
irrelevant because splice site selection is accomplished by scanning for the
locally best matching site. Moreover, interaction of adjacent splice sites in
either 5' to 3' (`intron definition') or 3' to 5' polarity (`exon definition';
34
) may indicate selection of splice sites in pairs rather than individually.
The above issues will have to be addressed by further modeling studies. The
logitlinear modeling approach provides a very flexible tool in this context
which is statistically very well understood and, as discussed, thoroughly
motivated for sequence signal recognition problems.
PROGRAM AVAILABILITY
The databases and profiles used in this study as well as the
SplicePredictor
program, which implements our current algorithm for splice site prediction in
plant genes, are available electronically from either J. Kleffe
(jkleffe@euler.grumed.fu-berlin.de) or V. Brendel (volker@gnomic.stanford.edu).
SplicePredictor
is also implemented as a Web service at http://gnomic.stanford.edu/
~
volker/ SplicePredictor.html.
ACKNOWLEDGEMENTS
We thank C. Burge, J. C. Carle-Urioste, R. Pincus, and V. Walbot for helpful discussions. V.B. was
supported in part by NIH grants 2R01HG00335-09 and 5R01GM10452-32. K.H was supported by Deutsche Forschungsgemeinschaft project KL
760/1-3 awarded to J.K.
4 Guigó, R., Knudsen, S., Drake, N. and Smith, T. (1992) J. Mol. Biol. 226, 141-157.
5 Snyder, E.E. and Stormo, G.D. (1993) Nucleic Acids Res. 21, 607-613.MEDLINE Abstract
6 Dong, S. and Searls, D.B. (1994) Genomics 23, 540-551.MEDLINE Abstract
7 Brunak, S., Engelbrecht, J. and Knudsen, S. (1991) J. Mol. Biol. 220, 49-65.MEDLINE Abstract
8 Sirajuddin, K. Nagashima, T. and Ono, K. (1995) Comp. Appl. Biol. Sci. 11, 349-359.
9 Filipowicz, W., Gniadkowski, M., Klahre, U. and Liu, H.-X. (1995) Pre-mRNA splicing in plants. In Lamond, A.I. (ed.) Pre-mRNA Processing. R.G. Landes Publishers, Georgetown, TX, pp. 65-77.
10 Luehrsen, K.R., Taha, S. and Walbot, V. (1994) Prog. Nucleic Acids Res. Mol. Biol. 47, 149-193.
11 Goodall, G.J. and Filipowicz, W. (1989) Cell 58, 473-483.MEDLINE Abstract
12 Lou, H., McCullough, A.J. and Schuler, M.A. (1993) Mol. Cell. Biol. 13, 4485-4493.MEDLINE Abstract
13 Luehrsen, K.R. and Walbot, V. (1994) Genes Dev. 8, 1117-1130.MEDLINE Abstract
14 Hebsgaard, S.M., Korning, P.G., Tolstrup, N., Engelbrecht, J., Rouzé, P. and Brunak, S. (1996) Nucleic Acids Res. 24, 3439-3452.MEDLINE Abstract
15 Brendel, V. (1992) Mathl. Comput. Modelling 16 (6/7), 37-43.
16 Korning, P.G., Hebsgaard, S.M., Rouzé, P. and Brunak, S. (1996) Nucleic Acids Res. 24, 316-320.MEDLINE Abstract
17 Kondrakhin, Y.V., Kel, A.E., Kolchanov, N.A., Romashchenko, A.G. and Milanesi, L. (1995) Comp. Appl. Biol. Sci. 11, 477-488.
20 Santner, T.J. and Duffy, D.E. (1989) The Statistical Analysis of Discrete Data. Springer-Verlag, New York.
21 Christensen, R. (1990) Log-linear Models. Springer-Verlag, New York.
22 Tue Tjur (1992) StatUnit: Turbo Pascal unit for statistical analysis. Institute of Mathematical Statistics, University of Copenhagen. http://www.math.ku.dk/~tuetjur.
23 Kleffe, J., Hermann, K., Gunia, W., Vahrson, W. and Wittig, B. (1995) Comp. Appl. Biol. Sci. 11, 449-455.
24 McCullough, A.J., Lou, H. and Schuler, M.A. (1993) Mol. Cell. Biol. 13, 1323-1331.MEDLINE Abstract
25 Lou, H., McCullough, A.J. and Schuler, M.A. (1993) Plant J. 3, 393-403.MEDLINE Abstract
26 Luehrsen, K.R. and Walbot, V. (1994) Plant Mol. Biol. 24, 449-463.MEDLINE Abstract
27 Kendall, M. and Gibbons, J.D. (1990) Rank Correlation Methods. 5th edition, Edward Arnold, London.
28 White, O., Soderlund, C., Shanmugan, P. and Fields, C. (1992) Plant Mol. Biol. 19, 1057-1064.MEDLINE Abstract
29 Tanner, M. and Wong, W.H. (1987) J. Am. Stat. Assoc. 82, 528-540.