Nucleic Acids Research Advance Access originally published online on April 22, 2007
Nucleic Acids Research 2007 35(Web Server issue):W649-W652; doi:10.1093/nar/gkm227
Nucleic Acids Research, 2007, Vol. 35, No. suppl_2 W649-W652
© 2007 The Author(s)
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
PROMALS web server for accurate multiple protein sequence alignments
Jimin Pei1,*,
Bong-Hyun Kim2,
Ming Tang1 and
Nick V. Grishin1,2
1Howard Hughes Medical Institute and 2Department of Biochemistry, University of Texas Southwestern Medical Center, 6001 Forest Park Road, Dallas, Texas 75390-9050, USA
*To whom correspondence should be addressed. Tel: +214 645 5951; Fax: +214 645 5948; Email: jpei{at}chop.swmed.edu
Received January 31, 2007. Revised March 22, 2007. Accepted March 28, 2007.
ABSTRACT
Multiple sequence alignments are essential in homology inference,
structure modeling, functional prediction and phylogenetic analysis.
We developed a web server that constructs multiple protein sequence
alignments using PROMALS, a progressive method that improves
alignment quality by using additional homologs from PSI-BLAST
searches and secondary structure predictions from PSIPRED. PROMALS
shows higher alignment accuracy than other advanced methods,
such as MUMMALS, ProbCons, MAFFT and SPEM. The PROMALS web server
takes FASTA format protein sequences as input. The output includes
a colored alignment augmented with information about sequence
grouping, predicted secondary structures and positional conservation.
The PROMALS web server is available at:
http://prodata.swmed.edu/promals/
INTRODUCTION
The quality of multiple sequence alignments directly affects
their applications in similarity searches, structure modeling,
functional prediction and phylogenetic analysis. Preparing accurate
multiple alignments for distantly related proteins (e.g. sequence
identity below 20%) remains a difficult task. Fast accumulation
of database protein sequences also poses a demand to improve
alignment speed. Aligning all sequences together by dynamic
programming is not feasible for large numbers of sequences (1).
Progressive alignment methods reduce the problem of aligning
multiple sequences to making a limited number of pairwise alignments.
Although progressive methods can be fast, errors made at early
stages are not corrected. Classic progressive methods such as
ClustalW (2) can give reasonable results for similar sequences,
but fail to produce accurate alignments for divergent sequences (3).
In recent years, extensive research has been conducted to improve alignment quality for progressive methods. Refinement after progressive steps is an effective way of correcting alignment errors (4,5). Consistency-based alignment strategy (6) derives a better scoring function before the progressive alignment steps. ProbCons (7) introduced and MUMMALS (8) implemented a probabilistic treatment of consistency derived from pairwise alignment hidden Markov models. Additional information from protein structures and database homologs can lead to further improvement of alignment quality (5,9–11).
We developed PROMALS (12), a progressive method that combines recent advanced techniques to improve multiple alignment quality, especially for distantly related proteins. PROMALS integrates additional information from database searches and secondary structure predictions into a new hidden Markov model that aligns profiles. The alignment scoring function of PROMALS is based on probabilistic consistency among profile–profile comparisons. PROMALS has shown improved results as compared to other leading methods, such as SPEM (13), MUMMALS, ProbCons and MAFFT (12).
Here, we describe the PROMALS web server for multiple protein sequence alignments. In addition to alignment construction, this server outputs useful information about predicted secondary structures, sequence grouping and positional conservation for target sequences.
PROMALS MULTIPLE ALIGNMENT PROCEDURE
Being a progressive method, PROMALS sets the order of pairwise
alignments according to a tree built by a k-mer counting method (4).
To improve alignment speed, PROMALS has two alignment stages
for easy and difficult cases, as first implemented in our program
PCMA (14). In the first stage, highly similar sequences are
progressively aligned in a fast way with a weighted sum-of-pairs
measure of BLOSUM62 (15) scores. This procedure results in a
set of pre-aligned groups that are relatively divergent from
each other. In the second alignment stage, a representative
sequence is selected from each pre-aligned group. For each representative
sequence, PSI-BLAST (16) is used to identify homologs from the
sequence database UNIREF90 (17), and the PSI-BLAST profile (checkpoint
file) is used to predict secondary structures by PSIPRED (18).
For each pair of representatives, profiles are derived from
the PSI-BLAST alignments and PSIPRED secondary structure prediction,
and a matrix of posterior probabilities of matches between positions
is obtained by a profile–profile hidden Markov model (12).
These matrices are used to calculate the probabilistic
consistency scoring function, which is used to progressively
align the representative sequences. Then the pre-aligned groups
obtained in the first stage are merged into the alignment of the
representatives. Finally, gap placements in highly gapped regions
are refined to make the gap patterns more realistic. The alignment
accuracy results of PROMALS and several other methods on the SABmark (19)
and PREFAB 4.0 (4) benchmarks are shown in Table 1.
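The probabilistic consistency step described above can be illustrated with a minimal sketch. It uses the general form of the consistency transformation introduced by ProbCons (7): the match probability for positions in sequences x and y is re-estimated by averaging over paths through every sequence z in the set. The function names and the list-of-lists matrix representation are assumptions made for illustration, and the sketch uses uniform weights; the exact scoring used inside PROMALS may differ.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain-Python matrix product of a (n x m) and b (m x p)."""
    inner, cols = len(b), len(b[0])
    return [
        [sum(row[k] * b[k][j] for k in range(inner)) for j in range(cols)]
        for row in a
    ]


def relax_pair(p_xy: Matrix, intermediates: List[Tuple[Matrix, Matrix]]) -> Matrix:
    """One round of a ProbCons-style consistency transformation.

    p_xy[i][j] is the posterior probability that position i of x matches
    position j of y. Each item of `intermediates` is (p_xz, p_zy) for a
    third sequence z. The cases z = x and z = y each contribute p_xy
    itself, hence the factor of 2 below.
    """
    n_seqs = len(intermediates) + 2
    total = [[2.0 * value for value in row] for row in p_xy]
    for p_xz, p_zy in intermediates:
        through_z = matmul(p_xz, p_zy)
        for i, row in enumerate(through_z):
            for j, value in enumerate(row):
                total[i][j] += value
    return [[value / n_seqs for value in row] for row in total]


# Toy example: x and y have two positions each, z has one position.
p_xy = [[0.5, 0.0], [0.0, 0.5]]
p_xz = [[1.0], [0.0]]
p_zy = [[1.0, 0.0]]
relaxed = relax_pair(p_xy, [(p_xz, p_zy)])
```

In the toy example, the match between the first positions of x and y is supported by z and its score rises, while the unsupported match between the second positions is down-weighted.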
PROMALS WEB SERVER
The PROMALS web server is available at
http://prodata.swmed.edu/promals/ (Figure 1).

Figure 1. Front page of the PROMALS server. The main section allows the user to paste or upload sequences and enter an email address for the results. Options to modify alignment parameters, PSI-BLAST searches and output format are provided. A brief description of each option is available by clicking on the option's name. A document with detailed description of the server is provided. The stand-alone versions of PROMALS can be downloaded from this page.
Input
The user can paste protein sequences or upload a sequence file.
The sequences can be in FASTA format and identical sequence
names are not allowed. PROMALS also recognizes CLUSTAL format
alignments as input. If such an alignment is provided, it is
split into individual sequences and these sequences will be
re-aligned by PROMALS. The user can enter a name to identify
the submitted job. It is also recommended that the user provide
an email address to receive alignment results, as PROMALS can
take a considerable amount of time to finish for a large number
of divergent sequences, due to the time-consuming steps of running
PSI-BLAST searches and profile consistency measure. On a data
set of 1785 SCOP (20,21) domain pairs with up to 48 homologs
added (the average number of sequences is 41.6 per alignment),
the average CPU time of PROMALS is about half an hour under
default settings (12). The actual time to finish an alignment
job depends on factors such as the number of sequences and their
lengths, the diversity among the sequences, the numbers of homologs
found in database searches and the server load. It can take
several hours for the server to finish aligning a sequence set
with a large number of distantly related sequences (>50).
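Before submitting a job, it can be useful to check locally that the input satisfies the constraints described above. The following is a minimal sketch, not the server's own parser: it assumes that a sequence name is the first whitespace-delimited token after ">" (the server's actual naming rule is not stated here), and the function names are hypothetical.

```python
from typing import List, Tuple


def parse_fasta(text: str) -> List[Tuple[str, str]]:
    """Parse FASTA text into (name, sequence) pairs, in input order."""
    records: List[Tuple[str, str]] = []
    name = None
    chunks: List[str] = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            fields = line[1:].split()
            if not fields:
                raise ValueError("empty sequence name")
            # Assumption: the name is the first token of the header line.
            name = fields[0]
            chunks = []
        else:
            if name is None:
                raise ValueError("sequence data found before first '>' header")
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


def check_unique_names(records: List[Tuple[str, str]]) -> None:
    """Raise ValueError if two records share a name."""
    seen = set()
    for name, _ in records:
        if name in seen:
            raise ValueError(f"duplicate sequence name: {name}")
        seen.add(name)


records = parse_fasta(">seqA example\nMKVL\nLAGG\n>seqB\nMRVLAG\n")
check_unique_names(records)
```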
Alignment options
A number of alignment options are provided on the web page. One important parameter is the identity threshold that determines the partition between the fast alignment stage and the slow alignment stage, and thus balances alignment quality and speed. Lowering this threshold can cause more sequences to be aligned in the fast and less accurate way, resulting in fewer representative groups subject to the time- and memory-consuming steps of PSI-BLAST searches and profile consistency measure. This tradeoff generally leads to less computational time but lower alignment quality. If the number of pre-aligned groups is large (e.g. >100), PROMALS could run out of memory during the consistency measure step and generate an error message that reports the number of pre-aligned groups in the second alignment stage. In this case, the user can lower the identity threshold (default 0.6) so that the number of sequence groups subject to consistency measure is reduced. We also provide options for changing the weights of amino acid scoring and predicted secondary structure scoring. The default values were determined by large-scale testing on divergent SCOP superfamily domains (20,21). Several parameters for running PSI-BLAST and processing PSI-BLAST alignments (used for generating amino acid profiles) are also provided, such as the e-value cutoff, the number of PSI-BLAST iterations, the identity cutoff to remove divergent hits, and the number of homologs kept for profile calculation.
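The effect of the identity threshold on the number of pre-aligned groups can be illustrated with a small sketch. This is a simplification under stated assumptions: it groups sequences by single-linkage clustering on precomputed pairwise identities, whereas PROMALS forms its groups while following the guide tree. The function name and the toy identity values are hypothetical.

```python
from typing import Dict, List, Tuple


def count_groups(
    names: List[str],
    identity: Dict[Tuple[str, str], float],
    threshold: float,
) -> int:
    """Count groups when any pair with identity >= threshold is merged."""
    parent = {name: name for name in names}

    def find(name: str) -> str:
        # Union-find lookup with path halving.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for (a, b), ident in identity.items():
        if ident >= threshold:
            parent[find(a)] = find(b)
    return len({find(name) for name in names})


names = ["s1", "s2", "s3", "s4"]
identity = {
    ("s1", "s2"): 0.90,
    ("s1", "s3"): 0.45,
    ("s1", "s4"): 0.20,
    ("s2", "s3"): 0.50,
    ("s2", "s4"): 0.25,
    ("s3", "s4"): 0.30,
}
groups_default = count_groups(names, identity, 0.6)  # default threshold
groups_lowered = count_groups(names, identity, 0.4)  # lowered threshold
```

With these toy values, lowering the threshold from 0.6 to 0.4 merges one more sequence into an existing group, so fewer representatives would reach the slower second stage.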
Output of PROMALS results
The web server reports the resulting alignment in a standard CLUSTAL format. In addition, the server provides a colored alignment with information about sequence grouping, secondary structure predictions and positional conservation (Figure 2). Sequence grouping is reflected by the color of sequence names. Sequences with magenta names are representatives from pre-aligned groups. Sequences with black names immediately under a representative sequence belong to the same pre-aligned group as the representative sequence. For example, in Figure 2, Q7U096_MYCBO_208_376 and Q1TAV7_9MYCO_205_370 belong to the same pre-aligned group, and they are aligned in the fast alignment stage. Predicted secondary structures are shown for representative sequences (residues with red and blue fonts are predicted to be α-helices and β-strands, respectively). Above each alignment block, conserved positions are marked by their conservation indices (integer values from 0 to 9) calculated using our program AL2CO (22). The line beneath each alignment block shows consensus secondary structure predictions derived from predictions of individual representative sequences (h: α-helix; e: β-strand). Such a coloring and labeling scheme provides additional information about the PROMALS alignment, and is helpful for further sequence and structural analysis of the target sequences. In addition to the alignments, the server also provides links to the original input sequences and intermediate results of PSI-BLAST alignments and PSIPRED secondary structure predictions.
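As an illustration of how a consensus secondary structure line could be derived from the predictions of individual representatives, the sketch below applies a per-column majority vote. The voting rule, the `min_fraction` parameter, the use of "-" for alignment gaps and the "." placeholder are all assumptions; the rule actually used by the server is not described here.

```python
from collections import Counter
from typing import List


def consensus_ss(aligned_ss: List[str], min_fraction: float = 0.5) -> str:
    """Per-column consensus of aligned secondary structure strings.

    Each string uses 'h' (helix), 'e' (strand), any other letter for
    coil, and '-' for an alignment gap. A column is labeled 'h' or 'e'
    when that state exceeds `min_fraction` of the non-gap predictions;
    otherwise it is labeled '.'.
    """
    if not aligned_ss:
        return ""
    width = len(aligned_ss[0])
    if any(len(s) != width for s in aligned_ss):
        raise ValueError("all strings must have the alignment length")
    consensus = []
    for column in zip(*aligned_ss):
        states = [c for c in column if c != "-"]
        counts = Counter(states)
        label = "."
        for state in ("h", "e"):
            if states and counts[state] / len(states) > min_fraction:
                label = state
        consensus.append(label)
    return "".join(consensus)


predictions = ["hhhcee-", "hhccee-", "hcc-ecc"]
line = consensus_ss(predictions)
```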
ACKNOWLEDGEMENTS
The authors would like to thank Lisa Kinch and Hong Zhang for
testing the server and helpful comments. This work was supported
in part by NIH grant GM67165 to NVG. Funding to pay the Open
Access publication charges for this article was provided by
Howard Hughes Medical Institute.
Conflict of interest statement. None declared.
REFERENCES
1. Lipman DJ, Altschul SF, Kececioglu JD. A tool for multiple sequence alignment. Proc. Natl Acad. Sci. USA (1989) 86:4412–4415.
2. Thompson JD, Higgins DG, Gibson TJ. CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res (1994) 22:4673–4680.
3. Thompson JD, Plewniak F, Poch O. A comprehensive comparison of multiple sequence alignment programs. Nucleic Acids Res (1999) 27:2682–2690.
4. Edgar RC. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res (2004) 32:1792–1797.
5. Katoh K, Kuma K, Toh H, Miyata T. MAFFT version 5: improvement in accuracy of multiple sequence alignment. Nucleic Acids Res (2005) 33:511–518.
6. Notredame C, Higgins DG, Heringa J. T-Coffee: a novel method for fast and accurate multiple sequence alignment. J. Mol. Biol (2000) 302:205–217.
7. Do CB, Mahabhashyam MS, Brudno M, Batzoglou S. ProbCons: Probabilistic consistency-based multiple sequence alignment. Genome Res (2005) 15:330–340.
8. Pei J, Grishin NV. MUMMALS: multiple sequence alignment improved by using hidden Markov models with local structural information. Nucleic Acids Res (2006) 34:4364–4374.
9. Simossis VA, Heringa J. PRALINE: a multiple sequence alignment toolbox that integrates homology-extended and secondary structure information. Nucleic Acids Res (2005) 33:W289–W294.
10. Thompson JD, Plewniak F, Thierry J, Poch O. DbClustal: rapid and reliable global multiple alignments of protein sequences detected by database searches. Nucleic Acids Res (2000) 28:2919–2926.
11. O'Sullivan O, Suhre K, Abergel C, Higgins DG, Notredame C. 3DCoffee: combining protein sequences and structures within multiple sequence alignments. J. Mol. Biol (2004) 340:385–395.
12. Pei J, Grishin NV. PROMALS: towards accurate multiple sequence alignments of distantly related sequences. Bioinformatics (2007) doi: 10.1093/bioinformatics/btm017.
13. Zhou H, Zhou Y. SPEM: improving multiple sequence alignment with sequence profiles and predicted secondary structures. Bioinformatics (2005) 21:3615–3621.
14. Pei J, Sadreyev R, Grishin NV. PCMA: fast and accurate multiple sequence alignment based on profile consistency. Bioinformatics (2003) 19:427–428.
15. Henikoff S, Henikoff JG. Amino acid substitution matrices from protein blocks. Proc. Natl Acad. Sci. USA (1992) 89:10915–10919.
16. Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res (1997) 25:3389–3402.
17. Wu CH, Apweiler R, Bairoch A, Natale DA, Barker WC, Boeckmann B, Ferro S, Gasteiger E, Huang H, et al. The Universal Protein Resource (UniProt): an expanding universe of protein information. Nucleic Acids Res (2006) 34:D187–D191.
18. Jones DT. Protein secondary structure prediction based on position-specific scoring matrices. J. Mol. Biol (1999) 292:195–202.
19. Van Walle I, Lasters I, Wyns L. SABmark a benchmark for sequence alignment that covers the entire known fold space. Bioinformatics (2005) 21:1267–1268.
20. Chandonia JM, Hon G, Walker NS, Lo Conte L, Koehl P, Levitt M, Brenner SE. The ASTRAL Compendium in 2004. Nucleic Acids Res (2004) 32:D189–D192.
21. Murzin AG, Brenner SE, Hubbard T, Chothia C. SCOP: a structural classification of proteins database for the investigation of sequences and structures. J. Mol. Biol (1995) 247:536–540.
22. Pei J, Grishin NV. AL2CO: calculation of positional conservation in a protein sequence alignment. Bioinformatics (2001) 17:700–712.
23. Bateman A, Coin L, Durbin R, Finn RD, Hollich V, Griffiths-Jones S, Khanna A, Marshall M, Moxon S, et al. The Pfam protein families database. Nucleic Acids Res (2004) 32:D138–D141.
