Skip Navigation


Nucleic Acids Research Advance Access originally published online on December 4, 2008
Nucleic Acids Research 2009 37(2):463-472; doi:10.1093/nar/gkn945
This Article
Right arrow Abstract Freely available
Right arrow Print PDF (134K) Freely available
Right arrow Screen PDF (149K) Freely available
Right arrow Supplementary Data
Right arrowOA All Versions of this Article:
37/2/463    most recent
gkn945v1
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrow Commercial Re-use Guidelines
for Open Access NAR Content
Google Scholar
Right arrow Articles by Lu, Y.
Right arrow Articles by Sze, S.-H.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Lu, Y.
Right arrow Articles by Sze, S.-H.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?

Nucleic Acids Research, 2009, Vol. 37, No. 2 463-472
© 2008 The Author(s)
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.


Computational Biology

Improving accuracy of multiple sequence alignment algorithms based on alignment of neighboring residues

Yue Lu1 and Sing-Hoi Sze1,2,*

1Department of Biochemistry and Biophysics, and 2Department of Computer Science, Texas A&M University, College Station, TX 77843, USA

*To whom correspondence should be addressed. Tel: 1 979 845 5009; Fax: 1 979 847 8578; Email: shsze{at}cs.tamu.edu

Received August 1, 2008. Revised October 21, 2008. Accepted November 7, 2008.


    ABSTRACT
 TOP
 ABSTRACT
 INTRODUCTION
 METHODS
 RESULTS
 DISCUSSION
 CONCLUSION
 SUPPLEMENTARY DATA
 FUNDING
 REFERENCES
 
While most of the recent improvements in multiple sequence alignment accuracy are due to better use of vertical information, which include the incorporation of consistency-based pairwise alignments and the use of profile alignments, we observe that it is possible to further improve accuracy by taking into account alignment of neighboring residues when aligning two residues, thus making better use of horizontal information. By modifying existing multiple alignment algorithms to make use of horizontal information, we show that this strategy is able to consistently improve over existing algorithms on a few sets of benchmark alignments that are commonly used to measure alignment accuracy, and the average improvements in accuracy can be as much as 1–3% on protein sequence alignment and 5–10% on DNA/RNA sequence alignment. Unlike previous algorithms, consistent average improvements can be obtained across all identity levels.


    INTRODUCTION
 TOP
 ABSTRACT
 INTRODUCTION
 METHODS
 RESULTS
 DISCUSSION
 CONCLUSION
 SUPPLEMENTARY DATA
 FUNDING
 REFERENCES
 
The construction of multiple sequence alignments is among the most important techniques to perform biological sequence analysis, with important applications to many areas of computational biology. The most popular strategy to construct multiple sequence alignments is by employing a progressive alignment algorithm, in which each sequence is treated initially as an alignment and the next two most similar alignments are repeatedly combined until a single multiple alignment is obtained (1–7). This is often followed by iterative refinements that improve the accuracy of the final alignment (3,4,6–8).

There are many recent efforts that lead to significant improvement of alignment accuracy, including the incorporation of consistency-based pairwise alignments that improve the quality of the initial pairwise alignments by aligning through other sequences to increase their agreement with the final multiple alignment (2,4–7), the use of maximal expected accuracy alignment (4–6), the incorporation of secondary structure predictions (7,9–11), the use of local structural information (12–14), and the incorporation of additional sequences from database search (9,10,15,16).

While most of these algorithms are able to significantly improve alignment accuracy by making better use of vertical information, either by incorporating consistency-based pairwise alignments or by using profiles in which each column of an alignment is modeled independently, we observe that most of these algorithms do not make use of horizontal information when constructing alignments, and it may be useful to take into account alignment of neighboring residues when aligning two residues.

There are a few previous approaches that use neighboring information to obtain significant performance improvements in other applications. Spang et al. (17) obtained a jumping alignment that is suitable for remote homology detection between a given sequence and a multiple alignment by aligning each position in the given sequence to one of the sequences in the multiple alignment while penalizing each vertical jump between horizontal moves. Panchenko et al. (18) used average conservation scores across spatial neighboring sites in the local structural environment to improve functional site prediction, while Capra and Singh (19) used conservation scores from neighboring residues to improve the prediction of functionally important residues in aligned sequences.

To incorporate horizontal information in alignments, we develop a window-based method that adjusts the pairwise score of a residue pair between two sequences (or a column pair between two sub-alignments) by incorporating the scores of neighboring residue pairs (or column pairs). This method can be applied to any multiple alignment algorithm that uses pairwise scores during the construction of a multiple alignment.

Since conserved residues in core regions tend to be clustered together (19,20), this strategy reduces the differences among neighboring scores within these regions, and can potentially lead to better gap placements by encouraging higher concentrations of consecutively aligned residues and more extensive grouping of consecutive indels, which is especially helpful when the similarity within a core region has large fluctuations. Figure 1 illustrates an example in which the strategy removes one incorrect long gap within an alignment that arises from a short fragment of sequence similarities which do not agree in secondary structure.


Figure 1
View larger version (5K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Figure 1. Illustration of the beginning portion of the alignment of sequences 1smvA and 4sbvC from PREFAB (3) by different algorithms. (a) Alignment by MUSCLE (3). (b) Alignment by our algorithm NRAlign that modifies MUSCLE, which agrees with the reference structural alignment in PREFAB, where SS is the secondary structure assignment from DSSP (21), with L denoting loop and E denoting extended strand.

 
We test the strategy by modifying existing multiple alignment algorithms to make use of horizontal information and show that consistent average improvements can be obtained for these algorithms on all sets of benchmark alignments that we have tested. By using a statistical test that pairs the alignments before and after algorithm modification, we show that highly statistically significant improvements are obtained not just in relative accuracy but also in paired accuracy. We also verify that better gap placements are achieved by comparing the distributions of gaps and the lengths of alignments before and after algorithm modification.


    METHODS
 TOP
 ABSTRACT
 INTRODUCTION
 METHODS
 RESULTS
 DISCUSSION
 CONCLUSION
 SUPPLEMENTARY DATA
 FUNDING
 REFERENCES
 
Incorporating horizontal information into pairwise scores
Given a residue (or column) at position x in the first sequence (or sub-alignment) s=s1···sm, a residue (or column) at position y in the second sequence (or sub-alignment) s' = s'1··· s'n}, and a parameter {omega}, define the window that includes at most {omega} positions to the left and to the right of (x, y) by the following set of valid offsets in the neighborhood of (x, y) (Figure 2):

Formula
 We use the following equation to incorporate the scores of the neighboring pairs at position (x + i,y + i) over all offsets i in N{omega}(x,y) into the score of the given pair (x, y):

Formula 1 (1)
where Sold is the original score, Snew is the adjusted score, and β is a parameter that specifies the weight of the neighboring scores during the adjustment. For each alignment of two sequences (or two sub-alignments), this step takes O({omega}l2) time, where l is the maximum sequence (or sub-alignment) length.


Figure 2
View larger version (2K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Figure 2. Illustration of the window on two sequences s and s' with {omega} = 2. (a) The offsets in N{omega}(x,y) = {–2,–1,1,2} are included. (b) Since y + 1 is the last position in s', only one position is used to the right of (x,y) and the offsets in N{omega}(x,y) = {–2,–1,1} are included.

 
We apply this strategy to TCoffee 5.31 (2) without using structural information, which is among the first multiple alignment algorithms that utilize consistency-based pairwise alignments, to MUSCLE 3.6 (3), which is among the most efficient multiple alignment algorithms that also have high accuracy, to ProbCons 1.10 (4), which is among the first multiple alignment algorithms that utilize the maximal expected accuracy alignment based on a pair-HMM model, and to MUMMALS 1.01 (5), which uses secondary structure information during pair-HMM training to further improve alignment accuracy. Except for MUMMALS, we test both the protein and DNA/RNA versions of each algorithm. For ProbCons, the DNA/RNA version ProbConsRNA was obtained from parameter training on BRAliBase II (22).

In each case, we evaluate the accuracy of each of the modified algorithms (called NRAlign) against each of the original algorithms while using the same parameter setting across different benchmark alignments for each modified algorithm (Table 1), with values of {omega} in the DNA/RNA version being three times as large as the protein version. Horizontal information is incorporated into each of the algorithms either during the computation of consistency-based pairwise alignments or during the progressive alignment step.


View this table:
[in this window]
[in a new window]

 
Table 1. Parameter settings for the modified version of each algorithm that uses horizontal information

 
Modification of TCoffee
The TCoffee algorithm consists of the following steps: construct a library of pairwise alignments from the input sequences by using global alignments from ClustalW (1) and local alignments from Lalign (23), assign a weight to each pair of aligned residues in the library according to sequence identity, apply library extension to all the weights in the library to obtain an extended library that utilizes consistency-based information by using a triplet approach, and perform progressive alignment according to a guide tree by aligning two groups of pre-aligned sequences using the average scores between column pairs in the extended library. In NRAlign, we apply Equation (1) to adjust the average extended library scores between column pairs before each progressive alignment step.

Modification of MUSCLE
The MUSCLE algorithm consists of the following steps: compute the k-mer distance for each pair of input sequences to produce an initial tree and perform progressive alignment according to the tree by utilizing log-expectation scores between two aligned columns to obtain an initial multiple alignment, re-estimate the tree using Kimura distances (24) computed from the multiple alignment and perform progressive alignment according to the new tree, then perform iterative refinements to obtain the final alignment. In NRAlign, we apply Equation (1) to adjust the log-expectation scores before each progressive alignment step.

Modification of ProbCons
The ProbCons algorithm consists of the following steps: compute the posterior probability matrix for each pair of input sequences according to the pair-HMM model, compute maximal expected accuracy alignment for each sequence pair by dynamic programming, re-estimate the match quality score matrix for each sequence pair by performing probabilistic consistency transformation, construct a guide tree according to the maximal expected accuracy alignments, perform progressive alignment according to the guide tree by using the transformed scores, and perform iterative refinements to obtain the final alignment. In NRAlign, we apply Equation (1) to adjust the match quality scores for each sequence pair before consistency transformation is performed.

Modification of MUMMALS
The MUMMALS algorithm consists of the following steps: compute the k-mer distance for each pair of input sequences to produce an initial tree and perform progressive alignment according to the tree to obtain an initial multiple alignment, re-estimate the tree using sequence identities computed from the multiple alignment, perform a two-stage progressive alignment in which highly similar sequences are first aligned by using weighted sum-of-pairs BLOSUM62 scores (25), and a representative is chosen from each pre-aligned group to perform progressive consistency-based multiple alignment based on transformed pairwise maximal expected accuracy alignments that are obtained from a pair-HMM model that also includes secondary structure states, then merge the pre-aligned groups according to the alignment of the representatives to obtain the final alignment. In NRAlign, we apply Equation (1) to adjust the scores between column pairs before each progressive alignment step.

Availability
NRAlign is available for download at http://faculty.cs.tamu.edu/shsze/nralign.


    RESULTS
 TOP
 ABSTRACT
 INTRODUCTION
 METHODS
 RESULTS
 DISCUSSION
 CONCLUSION
 SUPPLEMENTARY DATA
 FUNDING
 REFERENCES
 
Benchmark alignments
To evaluate the accuracy of NRAlign on multiple protein sequence alignment, we use benchmark multiple alignments from BAliBASE 3.0 (26), which contains manually refined structural alignments that are subdivided into five categories, from HOMSTRAD (27), which contains a collection of manually edited structure-based alignments, from PREFAB 4.0 (3), which contains structural alignments of two sequences and automatically generated alignments that are obtained from adding high scoring hits of the two sequences from database search, and from SABmark 1.65 (13), which contains alignments that are derived from the SCOP classification (28).

To evaluate the accuracy of NRAlign on multiple DNA/RNA sequence alignment, we use benchmark multiple alignments from BRAliBase II (22), which contains alignments of non-coding RNA sequences of Group II introns, 5S rRNA, SRP, tRNA and U5 splicesomal RNA from the Rfam database (29), and DNA PREFAB (30), which contains alignments of DNA sequences that are obtained from database search of protein sequences from PREFAB 4.0.

Two reference-dependent scores are used to evaluate the accuracy of each algorithm, including the sum-of-pairs score (SPS), which measures the percentage of residue pairs that are aligned correctly in the reference alignment, and the column score (CS), which measures the percentage of entire columns that are aligned correctly (31). For BAliBASE, PREFAB and DNA PREFAB, evaluations are made only on the core regions that are specified in the reference alignments. For PREFAB, SABmark and DNA PREFAB, the reference alignments are based on sequence pairs and the CS score is not used. For PREFAB and DNA PREFAB, the Q score (3) is computed on the original input sequence pair, which has the same meaning as the SPS score. For SABmark, reference alignments are specified for each sequence pair, and the fD score, which is a sensitivity score that has the same meaning as the SPS score, and an additional fM score, which is a specificity score that measures the percentage of residue pairs that are aligned correctly in the test alignment, are computed by averaging the scores over all sequence pairs for each multiple alignment (13).

In addition to reference-dependent scores, four reference-independent scores are used in the presence of known 3D structures to evaluate the structural agreement between aligned protein sequence pairs, including the Dali Z-score (32), which computes a structural similarity score as a weighted sum of similarities of intramolecular distances between residues in aligned columns normalized according to alignments of random structure pairs [see Equations 2–4 in (32)], the GDT_TS score (33,34), which computes the average of the maximum number of aligned residue pairs that can be superimposed within four different distance thresholds of 1, 2, 4 and 8 Å [see the Equation in (34)], and two LiveBench contact scores ContactA and ContactB (35,36), which compute an overlap score that is the lower of two contact scores, one for each structure, computed based on intramolecular distances between residues in aligned columns that are separated by at least five residues [see Equation 2 in (36)], with ContactA normalized for each residue and ContactB normalized over the entire contact map.

To compute the GDT_TS score, multiple superpositions of aligned residue pairs are needed that optimize the individual score components, and the software from (37) is used with the set of aligned residue pairs as input while omitting the final normalization step. Following the procedure in (5), each score is further weighted and normalized against the reverse alignment that represents a random model. For SABmark, three-dimensional coordinates are extracted from the given PDB files (38), and the scores for each multiple alignment are computed by averaging the scores over all sequence pairs.

For RNA sequence alignment, the structure conservation index (SCI) in (39) is used, which is a reference-independent score that computes the ratio of the consensus RNA folding minimum free energy of an alignment to the average of the RNA folding minimum free energy of each individual sequence in the alignment [see the Equation in (39)].

To evaluate whether the use of NRAlign leads to significant improvements, we use the Wilcoxon matched-pairs signed-ranks test (40) over subsets that are large enough with P = 0.05 as significance cutoff, in which the alignments before and after algorithm modification are paired to evaluate whether the improvements are consistent not just in relative accuracy but also in paired accuracy.

Accuracy of NRAlign on protein sequence alignment
Table 2 shows accuracy comparisons on full length protein sequences in BAliBASE 3.0. Among all the subsets that are large enough, NRAlign always performed at least as well as the original algorithm. Except for MUSCLE, the improvements of NRAlign were more statistically significant in the CS score when compared to the SPS score, and this is especially evident on TCoffee. The improvements in the CS score were >2% in references 1V2, 2, 3 and 4 over MUSCLE and in reference 1V2 over MUMMALS, >4% in reference 5 over MUMMALS, and >1% in the entire set over MUSCLE and MUMMALS.


View this table:
[in this window]
[in a new window]

 
Table 2. Average SPS and CS scores (in %) on full length protein sequences in BAliBASE 3.0

 
Table 3 shows accuracy comparisons on HOMSTRAD. Except for 70–100% protein sequence identity where the improvements of NRAlign were statistically significant only over MUMMALS, all improvements at other identity levels were statistically significant (except for 0–20% over MUMMALS). The improvements were especially statistically significant when the identity is moderately low (20–40%), while the overall improvements were highly statistically significant over all algorithms.


View this table:
[in this window]
[in a new window]

 
Table 3. Average SPS and CS scores (in %) on HOMSTRAD

 
Table 4 shows accuracy comparisons on PREFAB 4.0. When only the original input protein sequence pair are aligned, the accuracy improvement characteristics of NRAlign were similar to those of HOMSTRAD, except that the improvements of NRAlign were statistically significant also for 70–100% identity over ProbCons. The improvements were more statistically significant in this case than in the case when the full set of at most 50 protein sequences are aligned, although using the full set of sequences gives better accuracy for each of the original and modified algorithms on divergent sequences (0–40% identity).


View this table:
[in this window]
[in a new window]

 
Table 4. Average Q scores (in %) on PREFAB 4.0

 
Table 5 shows accuracy comparisons on the Twilight and Superfamily subsets of SABmark 1.65. Unlike previous algorithms that have improvements mostly on divergent protein sequences, the improvements of NRAlign were more statistically significant on the Superfamily subset than on the more divergent Twilight subset. Similar to the results in (5), there are strong correlations between the reference-dependent and reference-independent results, which indicate that the improvements are not only at the protein sequence level but also at the structural level.


View this table:
[in this window]
[in a new window]

 
Table 5. Average fD and fM scores and average normalized Dali Z-score, GDT_TS score, and ContactA and ContactB scores (in %) on the Twilight and Superfamily subsets of SABmark 1.65

 
When comparisons were made on the improvements among the different algorithms, we found that MUMMALS was the hardest to improve on HOMSTRAD and on PREFAB when using the original input protein sequence pair for moderate to low identity. ProbCons was the hardest to improve on BAliBASE, the easiest to improve on HOMSTRAD except for 70–100% identity and on PREFAB when using the original input protein sequence pair, while the improvements on TCoffee and MUSCLE varied across different benchmarks. This is in contrast with the better accuracy of ProbCons and MUMMALS over TCoffee and MUSCLE for moderate to low identity. The improvement characteristics were especially different on PREFAB depending on whether the original input protein sequence pair or the full set of protein sequences are aligned, when it was easier to improve on MUMMALS than on MUSCLE in the latter case.

Accuracy of NRAlign on DNA/RNA sequence alignment
Table 6 shows accuracy comparisons on Data-set 1 of BRAliBase II. The improvements of NRAlign were more statistically significant in the reference-independent SCI score when compared to the SPS and CS scores, where the improvements in the SCI score were >3% for moderate to low RNA sequence identity (0–75%) and were statistically significant for high RNA sequence identity (75–100%) over all algorithms. This is especially evident on TCoffee, where the improvements in the SCI score were >12% for moderate to low identity (0–75%) and >9% in the entire set. In the SPS and CS scores, except for 75–100% identity where the improvements of NRAlign were statistically significant only over ProbConsRNA in the CS score, all improvements at other identity levels were statistically significant (except for 55–75% over MUSCLE in the SPS score), with improvements of >3% over TCoffee.


View this table:
[in this window]
[in a new window]

 
Table 6. Average SPS, CS and SCI scores (in %) on Data-set 1 of BRAliBase II

 
Table 7 shows accuracy comparisons on the mdsa_all set of DNA PREFAB. Except for 70–100% DNA sequence identity over MUSCLE, all the improvements of NRAlign were statistically significant. The improvements were especially statistically significant for moderate to low identity (20–70%) and for the entire set, with improvements of >7% over MUSCLE and >4% for 40–70% identity over all algorithms.


View this table:
[in this window]
[in a new window]

 
Table 7. Average Q scores (in %) on the mdsa_all set of DNA PREFAB

 
When comparisons were made on the improvements among the different algorithms, we found that MUSCLE was the hardest to improve on BRAliBase and the easiest to improve on accuracy in DNA PREFAB, while TCoffee was the easiest to improve on BRAliBase. This is in contrast with the better accuracy of MUSCLE and ProbConsRNA over TCoffee.

Overall accuracy of NRAlign
In all the subsets that we have assessed, NRAlign always performed at least as well as the original algorithm (except for one case). The overall improvements were highly statistically significant in most cases, even when the average improvements in accuracy can sometimes be small, and the improvements were much more evident on DNA/RNA sequence alignment than on protein sequence alignment. Unlike previous algorithms that have improvements mostly on divergent sequences, consistent average improvements can be obtained across all identity levels, and it is not always the case that the most improvements were obtained on highly divergent sequences.

Supplementary Tables S1 to S6 show the percentage of cases in which each of the scores in Tables 27 respectively becomes better and worse on each set of benchmark alignments when comparing the results of NRAlign to the original algorithm. The percentage of cases that become better was almost always larger than the percentage of cases that become worse even when the identity is very high, and the degree of relative improvement was reflected by the corresponding P-value in Tables 27, with less cases becoming better and less cases becoming worse simultaneously as identity increases in most situations.


    DISCUSSION
 TOP
 ABSTRACT
 INTRODUCTION
 METHODS
 RESULTS
 DISCUSSION
 CONCLUSION
 SUPPLEMENTARY DATA
 FUNDING
 REFERENCES
 
Characteristics of alignments
Supplementary Tables S7 to S12 show that the number of gaps in an alignment (a string of consecutive indels within a sequence is counted as one gap), the average length of gaps and the length of the alignment had the tendency to become smaller, larger and smaller respectively when comparing the results of NRAlign to the original algorithm, with generally decreasing tendencies as we move from one category to the next in the above order as demonstrated by the P-values. This confirms that better gap placements are achieved to a larger extent through reducing the number of gaps. While each tendency to become smaller, larger and smaller respectively was almost always larger than the opposite tendency to become larger, smaller and larger respectively, each tendency to become smaller, larger and smaller respectively also diminished simultaneously with the opposite tendency to become larger, smaller and larger respectively as identity increases in most situations.

Pairwise alignment versus multiple alignment
The above results on PREFAB show that the improvements of NRAlign were more statistically significant on pairwise alignments. Since the reference alignments for SABmark are based on sequence pairs, we investigate this further by performing pairwise alignments over all protein sequence pairs instead of obtaining a single multiple alignment, and computing the scores for each multiple alignment by averaging the scores over all sequence pairs.

When compared to Table 5, Table 8 shows that the improvements in SABmark were more statistically significant when pairwise alignments are performed, and this is especially evident on the Superfamily subset, although obtaining a single multiple alignment gives better accuracy on both the Twilight and Superfamily subsets for each of the original and modified algorithms of ProbCons and MUMMALS, and on the Superfamily subset for each of the original and modified algorithms of MUSCLE.


View this table:
[in this window]
[in a new window]

 
Table 8. Average fD and fM scores and average normalized Dali Z-score, GDT_TS score, and ContactA and ContactB scores (in %) on the Twilight and Superfamily subsets of SABmark 1.65 when pairwise alignments are performed over all protein sequence pairs instead of obtaining a single multiple alignment

 
To further investigate the effect of the number of sequences on the accuracy of NRAlign, we group the results on HOMSTRAD according to the number of protein sequences in each alignment. Table 9 shows that except for TCoffee, the improvements on HOMSTRAD were more statistically significant when the number of sequences is small, and the differences are especially evident when comparing pairwise alignments to multiple alignments.


View this table:
[in this window]
[in a new window]

 
Table 9. Average SPS and CS scores (in %) on HOMSTRAD

 
Effect of parameters {omega} and β
While the same parameters {omega} and β are used for each modified algorithm across different benchmarks, we found that not only different algorithms have different preferences of {omega} and β, different benchmarks also have different preferences of {omega} and β even when the same algorithm is used. Table 10 shows that the effect of varying {omega} that specifies the maximum number of horizontal positions that are included to the left and to the right was much more pronounced than varying β that specifies the weight of the neighboring scores on HOMSTRAD and BRAliBase, and our chosen parameter setting was not the one that gives the best accuracy. It is possible to further improve accuracy significantly if another parameter setting is chosen that is different across benchmarks, even when no significant differences in accuracy were obtained with our chosen parameter setting.


View this table:
[in this window]
[in a new window]

 
Table 10. Average SPS and CS scores (in %) on HOMSTRAD and average SPS, CS and SCI scores (in %) on Data-set 1 of BRAliBase II by varying the parameter {omega} that specifies the maximum number of horizontal positions that are included to the left and to the right, and the parameter β that specifies the weight of the neighboring scores

 

    CONCLUSION
 TOP
 ABSTRACT
 INTRODUCTION
 METHODS
 RESULTS
 DISCUSSION
 CONCLUSION
 SUPPLEMENTARY DATA
 FUNDING
 REFERENCES
 
We have developed a strategy NRAlign that incorporates horizontal information in alignments and it proves to be useful in all situations. Unlike previous algorithms, consistent average improvements can be obtained that are mostly not dependent on the identity level, even for very high identity. Table 11 shows that NRAlign was at most a few times slower than TCoffee and MUSCLE, and was slightly slower than ProbCons and MUMMALS, which indicates that the window-based adjustment procedure takes up a small part of the computation time of ProbCons, and the use of a small {omega} = 1 does not add much to the computation time of MUMMALS.


View this table:
[in this window]
[in a new window]

 
Table 11. Computation time on HOMSTRAD and on Data-set 1 of BRAliBase II represented as a pair of the form average,maximum in seconds

 
To further improve accuracy, it may be useful to utilize different weights for neighboring scores that are at different distances from the given pair (x, y). In addition to using horizontal information from neighboring scores in sequences, it is also possible to utilize spatial neighboring information in the local structural environment when such information is available and combine the scores from both types of neighbors.


    SUPPLEMENTARY DATA
 TOP
 ABSTRACT
 INTRODUCTION
 METHODS
 RESULTS
 DISCUSSION
 CONCLUSION
 SUPPLEMENTARY DATA
 FUNDING
 REFERENCES
 
Supplementary Data are available at NAR Online.


    FUNDING
 TOP
 ABSTRACT
 INTRODUCTION
 METHODS
 RESULTS
 DISCUSSION
 CONCLUSION
 SUPPLEMENTARY DATA
 FUNDING
 REFERENCES
 
National Science Foundation (DBI-0624077). Funding for open access charge: National Science Foundation (DBI-0624077).

Conflict of interest statement. None declared.


    ACKNOWLEDGEMENTS
 
We thank the anonymous referees for invaluable comments that significantly improved the paper.


    REFERENCES
 TOP
 ABSTRACT
 INTRODUCTION
 METHODS
 RESULTS
 DISCUSSION
 CONCLUSION
 SUPPLEMENTARY DATA
 FUNDING
 REFERENCES
 

  1. Thompson JD, Higgins DG, Gibson TJ. CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position specific gap penalties and weight matrix choice. Nucleic Acids Res. (1994) 22:4673–4680.[Abstract/Free Full Text]

  2. Notredame C, Higgins DG, Heringa J. T-Coffee: a novel method for fast and accurate multiple sequence alignment. J. Mol. Biol. (2000) 302:205–217.[CrossRef][Web of Science][Medline]

  3. Edgar RC. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res. (2004) 32:1792–1797.[Abstract/Free Full Text]

  4. Do CB, Mahabhashyam MS, Brudno M, Batzoglou S. ProbCons: probabilistic consistency-based multiple sequence alignment. Genome Res. (2005) 15:330–340.[Abstract/Free Full Text]

  5. Pei J, Grishin NV. MUMMALS: multiple sequence alignment improved by using hidden Markov models with local structural information. Nucleic Acids Res. (2006) 34:4364–4374.[Abstract/Free Full Text]

  6. Roshan U, Livesay DR. Probalign: multiple sequence alignment using partition function posterior probabilities. Bioinformatics (2006) 22:2715–2721.[Abstract/Free Full Text]

  7. Katoh K, Toh H. Recent developments in the MAFFT multiple sequence alignment program. Brief. Bioinformatics (2008) 9:286–298.[Abstract/Free Full Text]

  8. Gotoh O. Significant improvement in accuracy of multiple protein sequence alignments by iterative refinement as assessed by reference to structural alignments. J. Mol. Biol. (1996) 264:823–838.[CrossRef][Web of Science][Medline]

  9. Zhou H, Zhou Y. SPEM: improving multiple sequence alignment with sequence profiles and predicted secondary structures. Bioinformatics (2005) 21:3615–3621.[Abstract/Free Full Text]

  10. Pei J, Grishin NV. PROMALS: towards accurate multiple sequence alignments of distantly related proteins. Bioinformatics (2007) 23:802–808.[Abstract/Free Full Text]

  11. Wilm A, Higgins DG, Notredame C. R-Coffee: a method for multiple alignment of non-coding RNA. Nucleic Acids Res. (2008) 36:e52.[Abstract/Free Full Text]

  12. O'Sullivan O, Suhre K, Abergel C, Higgins DG, Notredame C. 3DCoffee: combining protein sequences and structures within multiple sequence alignments. J. Mol. Biol. (2004) 340:385–395.[CrossRef][Web of Science][Medline]

  13. Van Walle I, Lasters I, Wyns L. Align-m—a new algorithm for multiple alignment of highly divergent sequences. Bioinformatics (2004) 20:1428–1435.[Abstract/Free Full Text]

  14. Pei J, Kim B-H, Grishin NV. PROMALS3D: a tool for multiple protein sequence and structure alignments. Nucleic Acids Res. (2008) 36:2295–2300.[Abstract/Free Full Text]

  15. Marti-Renom MA, Madhusudhan MS, Sali A. Alignment of protein sequences by their profiles. Protein Sci. (2004) 13:1071–1087.[CrossRef][Web of Science][Medline]

  16. Simossis VA, Kleinjung J, Heringa J. Homology-extended sequence alignment. Nucleic Acids Res. (2005) 33:816–824.[Abstract/Free Full Text]

  17. Spang R, Rehmsmeier M, Stoye J. A novel approach to remote homology detection: jumping alignments. J. Comput. Biol. (2002) 9:747–760.[CrossRef][Web of Science][Medline]

  18. Panchenko AR, Kondrashov F, Bryant S. Prediction of functional sites by analysis of sequence and structure conservation. Protein Sci. (2004) 13:884–892.[CrossRef][Web of Science][Medline]

  19. Capra JA, Singh M. Predicting functionally important residues from sequence conservation. Bioinformatics (2007) 23:1875–1882.[Abstract/Free Full Text]

  20. Bartlett GJ, Porter CT, Borkakoti N, Thornton JM. Analysis of catalytic residues in enzyme active sites. J. Mol. Biol. (2002) 324:105–121.[CrossRef][Web of Science][Medline]

  21. Kabsch W, Sander C. Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features. Biopolymers (1983) 22:2577–2637.[CrossRef][Web of Science][Medline]

  22. Gardner PP, Wilm A, Washietl S. A benchmark of multiple sequence alignment programs upon structural RNAs. Nucleic Acids Res. (2005) 33:2433–2439.[Abstract/Free Full Text]

  23. Huang X, Miller W. A time-efficient linear-space local similarity algorithm. Adv. Appl. Math. (1991) 12:337–357.[CrossRef]

  24. Kimura M. The Neutral Theory of Molecular Evolution (1983) Cambridge, U.K: Cambridge University Press.

  25. Henikoff S, Henikoff JG. Amino acid substitution matrices from protein blocks. Proc. Natl Acad. Sci. USA (1992) 89:10915–10919.[Abstract/Free Full Text]

  26. Thompson JD, Koehl P, Ripp R, Poch O. BAliBASE 3.0: latest developments of the multiple sequence alignment benchmark. Proteins (2005) 61:127–136.[CrossRef][Web of Science][Medline]

  27. Mizuguchi K, Deane CM, Blundell TL, Overington JP. HOMSTRAD: a database of protein structure alignments for homologous families. Protein Sci. (1998) 7:2469–2471.[Web of Science][Medline]

  28. Murzin AG, Brenner SE, Hubbard T, Chothia C. SCOP: a structural classification of proteins database for the investigation of sequences and structures. J. Mol. Biol. (1995) 247:536–540.[CrossRef][Web of Science][Medline]

  29. Griffiths-Jones S, Bateman A, Marshall M, Khanna A, Eddy SR. Rfam: an RNA family database. Nucleic Acids Res. (2003) 31:439–441.[Abstract/Free Full Text]

  30. Carroll H, Beckstead W, O'Connor T, Ebbert M, Clement M, Snell Q, McClellan D. DNA reference alignment benchmarks based on tertiary structure of encoded proteins. Bioinformatics (2007) 23:2648–2649.[Abstract/Free Full Text]

  31. Thompson JD, Plewniak F, Poch O. A comprehensive comparison of multiple sequence alignment programs. Nucleic Acids Res. (1999) 27:2682–2690.[Abstract/Free Full Text]

  32. Holm L, Sander C. Dictionary of recurrent domains in protein structures. Proteins (1998) 33:88–96.[CrossRef][Web of Science][Medline]

  33. Zemla A, Venclovas C., Moult J, Fidelis K. Processing and analysis of CASP3 protein structure predictions. Proteins (1999) (Suppl. 3):22–29.

  34. Venclovas C., Zemla A, Fidelis K, Moult J. Assessment of progress over the CASP experiments. Proteins (2003) 53:585–595.[CrossRef][Web of Science][Medline]

  35. Rychlewski L, Fischer D, Elofsson A. LiveBench-6: large-scale automated evaluation of protein structure prediction servers. Proteins (2003) 53:542–547.[CrossRef][Web of Science][Medline]

  36. Wallner B, Elofsson A. Can correct protein models be identified? Protein Sci. (2003) 12:1073–1086.[CrossRef][Web of Science][Medline]

  37. Zhang Y, Skolnick J. Scoring function for automated assessment of protein structure template quality. Proteins (2004) 57:702–710.[CrossRef][Web of Science][Medline]

  38. Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H, Shindyalov IN, Bourne PE. The Protein Data Bank. Nucleic Acids Res. (2000) 28:235–242.[Abstract/Free Full Text]

  39. Washietl S, Hofacker IL, Stadler PF. Fast and reliable prediction of noncoding RNAs. Proc. Natl Acad. Sci. USA (2005) 102:2454–2459.[Abstract/Free Full Text]

  40. Wilcoxon F. Probability tables for individual comparisons by ranking methods. Biometrics (1947) 3:119–122.[CrossRef][Web of Science][Medline]


Add to CiteULike CiteULike   Add to Connotea Connotea   Add to Del.icio.us Del.icio.us    What's this?



This Article
Right arrow Abstract Freely available
Right arrow Print PDF (134K) Freely available
Right arrow Screen PDF (149K) Freely available
Right arrow Supplementary Data
Right arrowOA All Versions of this Article:
37/2/463    most recent
gkn945v1
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrow Commercial Re-use Guidelines
for Open Access NAR Content
Google Scholar
Right arrow Articles by Lu, Y.
Right arrow Articles by Sze, S.-H.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Lu, Y.
Right arrow Articles by Sze, S.-H.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?