Published online 6 September 2005
Article |
The Gumbel pre-factor k for gapped local alignment can be estimated from simulations of global alignment
National Center for Biotechnology Information, National Library of Medicine Bethesda MD 20894, USA
*To whom correspondence should be addressed. Tel: +301 402 9310; Fax: +301 480 2288; Email: spouge{at}ncbi.nlm.nih.gov
Received May 12, 2005. Revised July 13, 2005. Accepted August 12, 2005.
| ABSTRACT |
|---|
|
|
|---|
The optimal gapped local alignment score of two random sequences follows a Gumbel distribution. The Gumbel distribution has two parameters, the scale parameter
and the pre-factor k. Presently, the basic local alignment search tool (BLAST) programs (BLASTP (BLAST for proteins), PSI-BLAST, etc.) use all time-consuming computer simulations to determine the Gumbel parameters. Because the simulations must be done offline, BLAST users are restricted in their choice of alignment scoring schemes. The ultimate aim of this paper is to speed the simulations, to determine the Gumbel parameters online, and to remove the corresponding restrictions on BLAST users. Simulations for the scale parameter
can be as much as five times faster, if they use global instead of local alignment [R. Bundschuh (2002) J. Comput. Biol., 9, 243260]. Unfortunately, the acceleration does not extend in determining the Gumbel pre-factor k, because k has no known mathematical relationship to global alignment. This paper relates k to global alignment and exploits the relationship to show that for the BLASTP defaults, 10 000 realizations with sequences of average length 140 suffice to estimate both Gumbel parameters
and k within the errors required (
, 0.8%; k, 10%). For the BLASTP defaults, simulations for both Gumbel parameters now take less than 30 s on a 2.8 GHz Pentium 4 processor. | INTRODUCTION |
|---|
|
|
|---|
Local sequence alignment is an indispensable computational tool in modern molecular biology. It is frequently used to infer the functional, structural and evolutionary relationships of a novel protein or DNA sequence by finding similar sequences of known function in a database. Arguably, the most important sequence database search program available is BLAST (the Basic Local Alignment Search Tool) (1,2). Using a heuristic algorithm, BLAST implicitly performs a local alignment of a protein or DNA query against sequences in the corresponding database. The BLAST output then ranks each potential database match according to an E-value, which is derived from the corresponding local maximum score, given in bits. For each local maximum score y, the corresponding E-value Ey gives (under a random model) the expected number of false positives with a lower rank in the output. Thus, a small E-value indicates that the corresponding alignment is unlikely to occur by chance alone, whereas a large E-value indicates an unremarkable alignment. Without doubt, BLAST's E-values contribute substantially to its popularity.
Let us discuss the BLAST E-value Ey further here. (The Materials and Methods section also continues the discussion.) BLAST assumes a random model in which each unrelated pair of sequences A[1, m] = A1 ··· Am and B[1, n] = B1 ··· Bn consists of random letters chosen independently from a background distribution. BLASTP (BLAST for proteins), e.g. assumes that random proteins are composed of amino acids chosen independently from the Robinson and Robinson frequency distribution (3). BLAST also requires an input, a matrix s(Ai, Bj) for scoring matches between the letters Ai and Bj. BLASTP, e.g. uses the BLOSUM62 scoring matrix (4) as its default, offering as alternatives a few other PAM (5) and BLOSUM matrices. BLAST also enhances its detection of remote sequence similarities by using gapped sequence alignment. The cost of introducing a gap into an alignment is given by the gap penalty
(g), where g is the gap length. Practical gap penalties
are usually super-additive, i.e.
(g) +
(h)
(g + h), so the concatenation of optimal subsequence alignments has a score no less than the sum of their scores. (However, our theory is not restricted to super-additive gap penalties). Affine gap penalties
(g) = a + bg are typical in database searches. We refer to the letter distribution, the scoring matrix, and gap penalty collectively as BLAST parameters.
Throughout the paper, we assume a logarithmic regime (6) where the alignment scores of long random sequences have a negative expectation. In the logarithmic regime, the BLAST E-value Ey is approximately
![]() | (1) |
and pre-factor k.
For ungapped local alignment (i.e. the special case
(g) =
, which disallows gaps in the optimal local alignment), a rigorous theory furnishes analytic formulas for the Gumbel parameters
and k (7,8). For gapped local alignment, analytic results are scarce and usually come at a price: they depend on approximations whose accuracy in general is unknown (912). In the absence of a rigorous theory for gapped local alignment, computer simulations have confirmed the validity of Equation 1 (1316), and in the absence of formulas, they also have provided estimates of
and k (1619).
Because of the exponentiation in Equation 1, errors in
have a greater practical impact than errors in k. Thus, for use in BLAST,
must be known to within 14% relative error; k, to within 10% (20). Therefore, in statements about computational speed, the following implicitly assumes that the estimation of
and k is carried out to these accuracies, unless stated otherwise.
Presently, the BLAST program precomputes
and k offline, using the so-called island method (15,20). Because of the precomputation, users are given a narrow choice indeed of BLAST parameters. The choice of BLAST parameters would be much less restricted, if
and k could be computed online (in, say, less than 1 s) before searching a database with arbitrary BLAST parameters. Accordingly, much recent research has been directed toward speeding estimation of
and k.
With the ultimate aim of estimating
and k online, Bundschuh gave some interesting conjectures about
(21,22). He then applied them in global alignment simulations that estimated
as much as five faster than the island method. Later, we extended his conjectures, reducing the sequence length required to estimate
by almost a factor of 10 (23).
Despite their obvious promise, even with further improvements in speed and global alignment simulations will remain impractical for online estimation in BLAST, unless they can be made to estimate k as well. To remedy the problem, we relate k to global alignment and then exploit the relationship in simulations that estimate both
and k.
| MATERIALS AND METHODS |
|---|
|
|
|---|
Notation for global sequence alignment
We denote the non-negative integers by
+ = {0, 1, 2, 3,...}. Throughout the paper, the letters g, h, i, j, m, n and the letter y are the integers.
Consider a pair A = A1A2... and B = B1B2... of infinite sequences. The corresponding global alignment graph
is a directed and weighted lattice graph in two dimensions, as follows. The vertices of
are
, the non-negative two-dimensional integer lattice. Three sets of directed edges e come out of each vertex v = (i, j): northward, northeastward and eastward. One northeastward edge goes into (i + 1, j + 1) with weight s(Ai+1, Bj+1). For each g > 0, one eastward edge goes into (i + g, j) and one northward edge goes into (i,j + g); both are assigned the same weight
(g) < 0. For simplicity, we assume s(Ai, Bj) and
(g) are always integers, with greatest common divisor 1.
A directed path
= (v0, e1, v1, e2,...eh, vh) in
is a finite, alternating sequence of vertices and edges that starts and ends with a vertex. We say that the path
starts at v0 and ends at vh. For instance, each gapped alignment of the subsequences A[i + 1, m] = Ai+1...Am and B[j + 1, n] = Bj+1...Bn corresponds to exactly one directed path that starts at v0 = (i, j) and ends at vh = (m, n). The alignment's score is the path weight
, the sum of the weights W(ei) of the edges ei. By convention, any trivial path
= (v0) consisting of a single vertex has weight W
= 0.
Let
ij be the set of all paths
starting at v0 = (0, 0) and ending at vh = (i, j). Define the global score Sij = max{W
:
ij}. The paths
starting at v0 and ending at vh with weight W
= Sij are optimal global paths and correspond to optimal global alignments between A[1, i] and B[1, j]. The NeedlemanWunsch algorithm computes the global scores Sij (24).
Let
be the set of all paths
starting at v0 = (0,0). Define the global maximum M = max{W
:
}, which is also the maximum
of all global scores. Let
denote the number of vertices with global score y.
Define the lattice rectangle [0, n] = {0,1,...,n}. Our simulations involved a square subset [0,n]2 of
. In particular single subscripts connote quantities for the square: Mn = max{Sij:(i, j)
[0, n]2}, the square's global maximum; En = max{max0
i
nSin, max0
j
nSnj}, its edge maximum; and Nn (y) = #{(i, j)
[0, n]2:Sij = y}, the number of its vertices with global score y.
The formula for k from global alignment
We can show heuristically that k = limy
ky, where
![]() | (2) |
Equation 2 computes ky from three components: the scale parameter
, the probability P(M = y) of a global maximum y, and the expected number
N(y) of vertices with global score Sij = y. We now describe how our simulations determined the three components.
Numerical scheme for 
First, we estimated
from random global alignments (23). All simulations used to affine gap penalties
(g) = a + bg and the corresponding global alignment algorithms for computing Sij (25).
Recall the edge maximum En (defined at the end of the notation for global sequence alignment). As shown elsewhere (23), its cumulant generating function satisfies
![]() | (3) |
< 1. The root
of ß1(
) = 0 is our estimate for
.
To estimate
exp(
En) efficiently, we used Bundschuh's importance sampling methods (21), which apply if the gap penalty is affine. Briefly, importance sampling is a variance-reduction technique for simulating rare events. In global alignment simulations, e.g. a large edge maximum is a rare event. By simulating optimal subsequence pairs in hybrid alignment (a type of optimized Bayesian local alignment) (26), we ensured that our realizations frequently generated a large edge maximum En. Accordingly, we simulated a pair of sequences of some base length n = l. After correcting for biases induced by the importance sampling distribution, we estimated
exp(
El).
Equation 3 corresponds to an asymptotic equality with two free parameters to ß0 and ß1(
), which we estimated with robust regression. Robust regression was originally developed as an antidote to outliers (27), which badly skew least-square regression (2831). As noted elsewhere (23), however, robust regression is also remarkably suited for extracting asymptotic parameters like ß0 and ß1(
).
Robust regression requires the specification of an influence function, to quantify the influence of potential outliers on the regression result. Many influence functions exist (27), but the Andrews function with a = 1.339 [(27), p. 388; (29)] works well in asymptotic regression, because it ignores points that obviously lie outside the asymptotic regime (23).
Accordingly, we applied robust regression to Equation 3. To solve ß1(
) = 0, let
u be the scale parameter for ungapped local alignment, which can be determined analytically. Because 0
u, with repeated bisection of the interval [0,
u] yielded an estimate
for the root of the equation ß1(
) = 0. In practice, multiple roots did not occur.
Numerical scheme for k
Next, we estimated
(M = y) and
N(y). Importance sampling has already generated sequence-pairs of base length l for estimating
. The bias in importance sampling tends to yield large global scores Sij, ascending toward the global maximum M. To determine N(y), we needed to simulate and count all vertices with global scores Sij = y. Therefore, we extended the sequence pair beyond the base length l using random letters with the unbiased Robinson and Robinson frequencies. The global scores Sij beyond the base length l became progressively smaller, thereby permitting determination of N(y).
Given
> 0, we simulated a random number
of unbiased letters in each sequence, until we found some total length
such that
![]() | (4) |
> 0, if the edge maximum EL of the contributing 2L + 1 vertices satisfies Equation 4, it is probable that M = ML, because elongating the sequences is unlikely to increase the estimate of M. Similarly, the elongation does not increase the estimate of
N(y) much. After appropriate averaging, our simulations therefore yielded estimates
and
for
(M = y) and
N(y).
With the simulation estimates
,
and
in hand, we found that errors in
were negligible in practice. In contrast, the standard deviations sample (32) of
and
, denoted by sM and sN, were not.
We calculated an estimate
for ky by substituting
,
, and
into Equation 2. We estimated the error
in
from the equation
![]() | (5) |
.
Finally, we used robust regression to extract a summary estimate
from the estimates
for individual y. To begin with, consider a constant regression model
= 1
+ e, where
is a column vector consisting of the values
, 1 is a column vector whose elements are all 1, the constant
is the summary estimate
, and e is the column vector consisting of the errors
.
Our ultimate aim is to compute
rapidly, with as few realizations as possible. Unfortunately, for small numbers of realizations, the errors sM and sN are correlated with the corresponding estimates
and
. The correlations propagate to
, noticeably biasing the summary estimate
, with
(see Figure 1).
|
To avoid the bias, we applied the constant regression model
' = 1
' + e' to the errors
themselves. The elements of the column vector
' were the errors
, with errors in each
is taken to be a constant s derived though a standard formula [(27), p. 387], e' = 1s. Robust regression thus gave a constant estimate
of the errors
. We substituted the constant error estimate
back into the constant regression
= 1
+ e of
to derive a robust regression estimate
for k. Although somewhat ad hoc, the constant regression of the errors successfully reduced biases (see Figure 3).
|
Even for large simulations (e.g. 106 realizations), however, sampling of the event [M = y] was inadequate for many large y, with
(M = y) likely being underestimated. Although the corresponding average was unbiased (in theory, at least), we suspect that it had a distribution whose skewing increased with y. Consequently, for large y,
often slightly underestimated the true k, with improbable but substantial overestimations maintaining a correct expectation
(see Figure 2). The putative skewing also made the anticipated relation
(M = y)
e
(M = y + 1) fail for large y. To avoid skewing, we therefore restricted robust regression of
to the range [a, b] of y that minimized the function
![]() | (6) |
|
Software and Hardware
Computer code was written in C++ and compiled with the Microsoft® Visual C++® 6.0 compiler. The computer had a single Intel® Pentium® 4 2.8 GHz processor with 0.5 GB RAM and employed the Microsoft® Windows® 2000 operating system.
| RESULTS |
|---|
|
|
|---|
Tables 1 and 2 give estimates of the Gumbel parameters
and k for all online options of the BLASTP parameters. They therefore confirm that our simulations and our formulas for k produced correct results. Other figures show results for the BLASTP default parameters, namely, the Robinson and Robinson amino acid frequencies (3), the BLOSUM62 scoring matrix and the gap cost
(g) = 11 + g. Other BLAST parameters tested gave comparable results, unless indicated otherwise (data not shown).
|
|
Empirically, simulations using BLASTP default parameters needed a base length of l = 50 and a stringency
= 102 for the accuracies required for (
, 1%; k, 10%). For scoring matrices with more dominant diagonals than BLOSUM62, shorter base lengths sufficed, (e.g. for PAM30, l = 15 sufficed).
Figure 1 plots the estimates
with their standard error bars
against global score y, up to y = 25. Each point represents 30 000 realizations. The horizontal thick line represents the previous best estimate k
0.041 and the dotted line, the biased summary estimate
due to the positive correlation between
and
. Therefore Figure 1 motivated us to regress the errors in
, to produce a constant error estimate
, as described in the Materials and Methods.
Figure 2 plots the estimates
against global score y, up to y = 100. Each point represents 106 realizations. We obtained the estimate
and used it to estimate
. The range y
[0, 3] is not asymptotic, so the
do not approximate the true k very well. The range y
[4, 40] is asymptotic, and it is adequately sampled, so the
fluctuate randomly around the true k. The range y > 40 is also asymptotic, but it is not adequately sampled, so the
usually underestimate the true k. Figure 2 motivated us to regress only in the range [a, b] minimizing Equation 6, as described in the Materials and Methods.
Figure 3 plots the relative errors of the summary estimate
using
(with skewed error estimates
and those using
(with constant error estimate
against different numbers of realizations). All errors in
were computed relative to the approximation k
0.041. Each error plotted is the average of the absolute relative error for 20 independent simulations, each using the indicated number of realizations. White bars show the results for
; black bars, for
. For 10 000 realizations, the constant error estimate
reduces the relative errors dramatically. As the number of realizations increases, the difference in efficiency of estimation between
and
decreases. Figure 3 shows that 10 000 realizations estimated k with less than 10% relative error. The same 10 000 realizations also estimated
with less than 0.8% relative error (data not shown).
The simulations of Figure 3 estimated
from 10 000 realizations, in less than 30 s. For comparison, the same simulations could have estimated
in less than 7 s. For the PAM 30 matrix with
(g) = 9 + g, they estimated
and k in less than 4 s.
| DISCUSSION |
|---|
|
|
|---|
BLAST programs (BLASTP, PSI-BLAST, etc.) are restricted to specific scoring schemes, because time-consuming local alignment simulations for estimating the corresponding Gumbel parameters must be done offline. However, simulations of global alignment can estimate the Gumbel scale parameter
for local alignment (6). Some global alignment methods are as much as five times faster than the best local alignment methods (21,23), so global alignment has considerable potential for online estimation of the Gumbel parameter
. This paper surmounts an obstacle to online estimation by demonstrating that simulations of global alignment can determine the Gumbel pre-factor k. Table 2 displays the results of global alignment simulations over a wide range of BLAST parameters, all of which gave correct estimates of the corresponding k and supported the validity of our methods for computing k.
Global alignment simulation therefore appears a feasible method for estimating both Gumbel parameters,
and k. (The BLASTP default parameters provide a standard for quantifying speed, so the following results apply to the BLASTP defaults, unless stated otherwise.) With local alignment, estimates of
required 40 000 sequence-pairs of minimum length 600 (21); with our methods, 5000 sequence-pairs of maximum length 50 (23). In fact, our methods attained 1.3% accuracies in
with only 1000 sequence-pairs of maximum length 50. In our hands, k was more difficult to estimate than
, with 10% relative errors requiring 10 000 sequence-pairs of average length 140. In summary, the methods presented here for estimating the Gumbel parameters
and k represent at least a 3-fold improvement in speed over local alignments.
Online computation of the BLAST P-value requires more than the Gumbel parameters. It also requires an estimate of the finite-size effect (10,13,33,34). Global alignment (or some variant of it) can indeed produce the required estimate (manuscript in preparation). Without the finite-size estimate in hand, however, we were not strongly motivated to incorporate technical improvements or heuristics into our methods. Bundschuh, e.g. implemented a diagonal-cutting heuristic to remove irrelevant off-diagonal elements in the global alignment matrix (21); we did not. The heuristic could probably speed our computation by a further factor of at least three.
Online BLAST estimation of the Gumbel parameters is likely just a few years away.
APPENDIX
In the Appendix, we give a heuristic derivation of Equation 2.
Notation for local sequence alignment
For local alignment, consider a pair
and
of doubly-infinite sequences. Their local alignment graph
is a directed, weighted lattice graph in two dimensions, as follows. The vertices v of
are v = (i, j)
2, the entire two-dimensional integer lattice. In other respects, particularly with respect to the edges between its vertices,
has the same structure as the global alignment graph
.
We base the graph
on the entire two-dimensional integer lattice
2 because of our interest in the Gumbel distribution. In intuitive terms, the BLAST E-value Ey follows the Gumbel distribution, only if the local alignment does not see the ends of the sequences, so finite-size effects can be neglected (13,33).
Let
be the set of all paths
ending at vh = (i,j), regardless of their starting vertex. Define the local score
. The paths
ending at vk = (i,j) with local score
are optimal local paths corresponding to optimal local alignments matching subsequences of
and
up to and including the letters
and
.
Unlike the singly-infinite sequences A and B, the doubly-infinite sequences
and
correspond to the entire lattice
2. The lattice
2 is invariant under translation (i.e. it appears the same from each of its vertices). Thus, if
and
are sequences with independent random letters, the corresponding local scores
are stationary (i.e. their joint distribution is invariant under translation). Stationary scores carry a prime elsewhere (i.e.
) (35), which we drop here for brevity. For many purposes, translation invariance renders all vertices in
2 equivalent, so it usually suffices to define quantities below solely at the origin, (0,0). The definition at other vertices is usually left implicit.
If the sequences
and
were singly-infinite, the SmithWaterman algorithm could compute the corresponding local scores
(36). Although the algorithm is unable to compute
for
and
, a rigorous treatment shows that doubly-infinite sequences pose no essential difficulties in the logarithmic regime (35).
For efficiency, many simulations of random local alignments partition the vertices in
2 into islands (described below). To avoid technical nuisances, each vertex must belong to exactly one island, so we define the following strict total order on
2: (i', j')
(i, j), if and only if either i' + j' < i + j or else, i' + j' = i + j and j' < j.
Let us say that a vertex
belongs to the origin if (0,0) is the greatest vertex v0 = (i',j') (under the total order
) such that
, for some path
starting at v0 = (i',j') and ending at vh = (i, j). The island belonging to (0,0) is the set
of all vertices (i, j) belonging to (0, 0), and we say that (0, 0) owns the island. [Equation 12 below uses the translate
i,j of the set
00, where
i,j is the set of all vertices belonging to (i,j)].
By the following reasoning,
00 is empty if and only if
. First, if
, there is some path
' ending at (0, 0) with a positive score. If (0, 0) owned any vertex (i, j), there would be a path
starting at (0, 0) and ending at (i, j) with
. Then, the path concatenating
and
' would have a weight exceeding
, contrary to the definition of
. Thus, if (0,0) owns some vertex,
. Conversely, if
, then by deliberate construction, the definition of the total order
implies that (0,0) owns itself [because the weight of the trivial path containing only (0,0) is 0].
Accordingly, define the local maximum [implicitly, on the island
00 belonging to (0, 0)] as
, with the default
, if
00 is empty (i.e. if
). Let
denote the number of island vertices with local score y.
To connect our quantities explicitly to the Gumbel parameters, define
, the maximum local score in the lattice rectangle [0, m] x [0, n]. Let
y the density of islands yielding a local score
, or equivalently, the density of their owners in
2. Under certain conditions in the logarithmic regime,
, where as m, n
,
![]() | (7) |
Simulations indicate that to a good approximation, islands yielding a large local score
occur independently of each other (15). Therefore, Equation 7 asserts that
y
ke
y. In a Poisson approximation,
y represents the intensity of the Poisson process on
2 that generates the owners of islands yielding a local score
.
Because of translation invariance, the density
y equals the probability that any particular vertex in
2 [e.g. (0, 0)] owns an island yielding a local score
. In other words,
. Thus, the limit
![]() | (8) |
Path reversal identity
To determine k from global alignments, we first relate the global maximum M to the local scores
with a path reversal identity. Recall the global maximum M = max {W
:
}, where
is the set of all paths
in
starting at v0 = (0, 0). Recall also the local score
, where
is the set of all paths
in
2 ending at vh = (i, j).
It is believable that for any fixed (i, j)
2, each path in
with random edge-weights corresponds to a reversal of a path in
with the same random edge-weights. Thus, for every (i, j)
2,
, i.e. the local score and the global maximum have the same distribution. (Note: the equality is solely distributional. In any particular random instance, the local score
and global maximum score M are unlikely to be related.)
Because the distributional equality holds for every (i, j)
2, we drop the subscript ij on
and write
![]() | (9) |
A formal proof of Equation 9 can be found elsewhere (35).
The Poisson clumping heuristic
Consider the Poisson clumping heuristic (37)
![]() | (10) |
Equation 10 states that at any fixed vertex (i, j)
2, the probability that
is the density of vertices with a local score y. This density equals
, the density of islands where some local score is at least y, multiplied by
, the expected number
of island vertices (i,j) where the local score
, is given
.
Equation 10 can be demonstrated as follows. First,
![]() | (11) |
, then
. Equation 11 follows, because the event
contributes nothing to the expectation on the left.
Next, define the indicator
A = 1 if the event A occurs and
A = 0 otherwise. Then,
![]() | (12) |
The first equality is essentially the definition of
, which counts the number of vertices belonging to (0,0) with local score
. The second equality exploits the translation invariance of the probabilities associated with
. The third inequality merely notes that in the logarithmic regime, (0,0) must belong to some vertex (35). Equation 10 follows.
Our speculations
Based on the success of our simulation results, we speculate. First,
![]() | (13) |
In fact,
and
are likely to exist as a common finite limit, but Equation 13 suffices for present purposes.
Equation 13 can be justified intuitively, as follows. As y
, any vertices satisfying Sij = y become likely to cluster on a single island that has a large maximum local score. Thus, given M
y, the vertices with Sij = y have a comparable structure to vertices with
on the island belonging to (0, 0), given that the island satisfies
. In particular, given M
y, the number N(y) of vertices with Sij = y has a similar random behaviour to the number
of vertices with
, given
. Thus, the expectations approximate each other:
.
Though hardly a speculation, we assume that c = limy
e
y
(M
y) exists. Unfortunately, still there is no rigorous proof of the limit's existence.
The formula for k from global alignment
Equation 11 has an analog for global alignment, with a similar demonstration:
![]() | (14) |
Together, Equations 810, 13 and 14 yield
![]() | (15) |
Recall our assumption that s(Ai,Bj) and
(g) are always integers:
![]() | (16) |
Let
. From Equations 15 and 16, k = limy
ky, where ky is given by Equation 2.
| ACKNOWLEDGEMENTS |
|---|
We would like to acknowledge helpful discussions with Dr Ralf Bundschuh and Dr Stephen Altschul. This work was supported in whole by the Intramural Research Program of the National Library of Medicine at National Institutes of Health/DHHS. Funding to pay the Open Access publication charges for this article was provided by National Library of Medicine at National Institutes of Health/DHHS.
Conflict of interest statement. None declared.
| Footnotes |
|---|
The authors wish it to be known that, in their opinion, the first two authors should be regarded as joint First Authors.
| REFERENCES |
|---|
|
|
|---|
- Altschul, S.F., Gish, W., Miller, W., Myers, E.W., Lipman, D.J. (1990) Basic local alignment search tool J Mol. Biol., 215, 403410[CrossRef][Web of Science][Medline] .
- Altschul, S.F., Madden, T.L., Schaffer, A.A., Zhang, J., Zhang, Z., Miller, W., Lipman, D.J. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs Nucleic Acids Res., 25, 33893402
[Abstract/Free Full Text] . - Robinson, A.B. and Robinson, L.R. (1991) Distribution of glutamine and asparagine residues and their near neighbors in peptides and proteins Proc. Natl Acad. Sci. USA, 88, 88808884
[Abstract/Free Full Text] . - Henikoff, S. and Henikoff, J.G. (1992) Amino acid substitution matrices from protein blocks Proc. Natl Acad. Sci. USA, 89, 1091510919
[Abstract/Free Full Text] . - Dayhoff, M.O., Schwartz, R.M., Orcutt, B.C. Atlas of Protein Sequence and Structure, (1978) Silver Spring, MD National Biomedical Research Foundation Vol. 3, pp. 345352 .
- Arratia, R. and Waterman, M.S. (1994) A phase transition for the score in matching random sequences allowing deletions Ann Appl Probab., 4, 200225 .
- Dembo, A., Karlin, S., Zeitouni, O. (1994) Limit distributions of maximal non-aligned two-sequence segmental score Ann Probab., 22, 20222039 .
- Karlin, S. and Altschul, S.F. (1990) Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes Proc. Natl Acad. Sci. USA, 87, 22642268
[Abstract/Free Full Text] . - Mott, R. (1999) Local sequence alignments with monotonic gap penalties Bioinformatics, 15, 455462
[Abstract/Free Full Text] . - Mott, R. (2000) Accurate formula for P-values of gapped local sequence and profile alignments J Mol. Biol., 300, 649659[CrossRef][Web of Science][Medline] .
- Storey, J.D. and Siegmund, D. (2001) Approximate p-values for local sequence alignments: numerical studies J Comput. Biol., 8, 549556[CrossRef][Web of Science][Medline] .
- Siegmund, D. and Yakir, B. (2000) Approximate p-values for local sequence alignments Ann Stat., 28, 657680[CrossRef] .
- Altschul, S.F. and Gish, W. (1996) Local alignment statistics Methods Enzymol., 266, 460480[Web of Science][Medline] .
- Waterman, M.S. and Vingron, M. (1994) Rapid and accurate estimates of statistical significance for sequence data base searches Proc. Natl Acad. Sci. USA, 91, 46254628
[Abstract/Free Full Text] . - Olsen, R., Bundschuh, R., Hwa, T. (1999) Rapid assessment of extremal statistics for gapped local alignment Proc. Int. Conf. Intell Syst Mol Biol., 211222 .
- Mott, R. (1992) Maximum-Likelihood-Estimation of the Statistical Distribution of Smith-Waterman Local Sequence Similarity Scores B Math Biol., 54, 5975[CrossRef] .
- Smith, T.F., Waterman, M.S., Burks, C. (1985) The statistical distribution of nucleic acid similarities Nucleic Acids Res., 13, 645656
[Abstract/Free Full Text] . - Collins, J.F., Coulson, A.F., Lyall, A. (1988) The significance of protein sequence similarities Comput. Appl. Biosci., 4, 6771
[Abstract/Free Full Text] . - Mott, R. and Tribe, R. (1999) Approximate statistics of gapped alignments J Comput. Biol., 6, 91112[Web of Science][Medline] .
- Altschul, S.F., Bundschuh, R., Olsen, R., Hwa, T. (2001) The estimation of statistical parameters for local alignment score distributions Nucleic Acids Res., 29, 351361
[Abstract/Free Full Text] . - Bundschuh, R. (2002) Rapid significance estimation in local sequence alignment with gaps J Comput. Biol., 9, 243260[CrossRef][Web of Science][Medline] .
- Grossmann, S. and Yakir, B. (2004) Large deviations for global maxima of independent superadditive processes with negative drift and an application to optimal sequence alignments Bernoulli, 10, 829845 .
- Park, Y., Sheetlin, S., Spouge, J.L. (2005) Accelerated convergence and robust asymptotic regression of the Gumbel scale parameter for gapped sequence alignment Journal of Physics A: MATHEMATICAL AND GENERAL, 38, 97108 .
- Needleman, S.B. and Wunsch, C.D. (1970) A general method applicable to the search for similarities in the amino acid sequence of two proteins J Mol. Biol., 48, 443453[CrossRef][Web of Science][Medline] .
- Gotoh, O. (1982) An improved algorithm for matching biological sequences J Mol. Biol., 162, 705708[CrossRef][Web of Science][Medline] .
- Yu, Y.K. and Hwa, T. (2001) Statistical significance of probabilistic sequence alignment and related local hidden Markov models J Comput. Biol., 8, 249282[CrossRef][Web of Science][Medline] .
- Montgomery, D.C., Peck, E.A., Vining, G.G. Introduction to Linear Regression Analysis, (2001) NY John Wiley & Sons, Inc .
- Andrews, D.F., Bickel, P.J., Hampel, F.R., Huber, P.J., rogers, W.H., Tukey, K.W. Robust Estimates of Location: Survey and advances, (1972) Princenton, NJ Princenton University Press .
- Andrews, D.F. (1974) A robust method for multiple linear regression Technometrics, 16, 523531[CrossRef] .
- Huber, P.J. (1964) Robust estimation of a location parameter Ann. Math. Statist., 35, 73101 .
- Huber, P.J. (1973) Robust regression: Asymptotics, conjectures and Monte Carlo Ann Stat, 1, 799821 .
- Dwass, M. Probability and Statistics, (1970) NY W.A. Benjamin .
- Spouge, J.L. (2001) Finite-size corrections to Poisson approximations of rare events in renewal processes J Appl Probab., 38, 554569[CrossRef] .
- Spouge, J.L. (2005) Finite-Size Corrections to Poisson Approximations in General Renewal-Success Processes J Math Anal Appl., 301, 401418[CrossRef] .
- Spouge, J.L. (2004) Path Reversal and Islands in the Gapped Alignment of Random Sequences J Appl Probab., 41, 975983[CrossRef] .
- Smith, T.F. and Waterman, M.S. (1981) Identification of common molecular subsequences J. Mol. Biol., 147, 195197[CrossRef][Web of Science][Medline] .
- Aldous, D. Probability approximations via the Poisson clumping heuristic, (1989) 1st edn NY Springer-Verlag
.
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||


















