

Nucleic Acids Research Advance Access originally published online on May 7, 2007
Nucleic Acids Research 2007 35(Web Server issue):W675-W677; doi:10.1093/nar/gkm267

Nucleic Acids Research, 2007, Vol. 35, No. suppl_2 W675-W677
© 2007 The Author(s)
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.


Articles

eProbalign: generation and manipulation of multiple sequence alignments using partition function posterior probabilities

Satish Chikkagoudar1, Usman Roshan1,* and Dennis Livesay2

1Department of Computer Science, New Jersey Institute of Technology and 2Department of Computer Science and Bioinformatics Research Center, University of North Carolina at Charlotte

*To whom correspondence should be addressed. Tel: +1-973-596-2872; Fax: +1-973-596-5777; Email: usman{at}cs.njit.edu

Received February 2, 2007. Revised March 29, 2007. Accepted April 8, 2007.


    ABSTRACT
 
Probalign computes maximal expected accuracy multiple sequence alignments from partition function posterior probabilities. To date, Probalign is among the very best scoring methods on the BAliBASE, HOMSTRAD and OXBENCH benchmarks. Here, we introduce eProbalign, an online implementation of the approach. The eProbalign web server also doubles as an online platform for post-alignment analysis. The core of the post-alignment functionality is the Probalign Alignment Viewer applet, which gives users a convenient means to manipulate alignments by posterior probabilities. The viewer can also be used to produce graphical and text versions of the output. The eProbalign web server and the underlying Probalign source code are freely accessible at http://probalign.njit.edu


    INTRODUCTION
 
Multiple sequence alignments are frequently employed for analyzing biomolecular sequences. Their application spans a wide range of problems such as phylogeny reconstruction, protein functional site detection, and protein and RNA structure prediction (1). The research literature abounds with programs and benchmarks for multiple sequence alignment, particularly for protein data. Traditionally, ClustalW (2) has been the most popular program for multiple sequence alignment, while BAliBASE (3) is likely the most commonly used benchmark of protein alignments.

MAFFT, Probcons and Probalign are among the recent programs with the highest accuracies on BAliBASE and other common benchmarks, namely HOMSTRAD (4) and OXBENCH (5). Both Probcons (6) and Probalign (7) compute maximal expected accuracy alignments using posterior probabilities. In Probcons, posterior probabilities are derived using an HMM whose parameters have been estimated via supervised learning on BAliBASE unaligned sequences. Probalign, which is largely based on the Probcons scheme, derives the posterior probabilities from the input data by implicitly examining suboptimal (sum-of-pairs) alignments using the partition function methodology for alignments (see (7) for a full description of the algorithm). Probalign alignments have been shown to have a statistically significant improvement over Probcons, MAFFT (9) and MUSCLE (8) on all three alignment benchmarks introduced above (7).

We present here eProbalign, a web server that automatically computes Probalign alignments; eProbalign also provides a convenient platform to visualize the alignment, generate images, and manipulate the output by average column posterior probabilities. The average column posterior probability (which is discussed further below) can be considered a measure of column reliability where columns with higher scores are more likely to be correct and perhaps biologically informative.


    INPUT PARAMETERS
 
eProbalign takes as input unaligned protein or nucleic acid sequences in FASTA format. eProbalign checks that the dataset conforms to the IUPAC one-letter abbreviations for nucleotides and amino acids. White space between residues/nucleotides in the sequences is stripped, and the cleaned sequences are passed on to the queuing system. The user can specify gap open, gap extension, and thermodynamic temperature parameters on the eProbalign input page (Figure 1). The input page provides a brief description of the parameters (help link) and links to the standalone Probalign code with publication and datasets.
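The input check described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the server's code: the function names `parse_fasta` and `validate`, the exact symbol sets accepted, and the conversion to upper case are all hypothetical choices made for this example.

```python
# Accepted symbol sets. Which IUPAC codes the server actually accepts is not
# stated in the text; the sets below are an assumption of this sketch.
IUPAC_NUCLEOTIDE = set("ACGTURYSWKMBDHVN")
IUPAC_AMINO_ACID = set("ACDEFGHIKLMNPQRSTVWYBZX")


def parse_fasta(text):
    """Return a list of (name, sequence) pairs with whitespace stripped."""
    records = []
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            name, chunks = line[1:].strip(), []
        elif name is None:
            raise ValueError("sequence data found before the first '>' header")
        else:
            # Remove white space between residues and normalize the case.
            chunks.append("".join(line.split()).upper())
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


def validate(records):
    """Return 'nucleotide' or 'protein'; raise ValueError on other symbols."""
    if not records:
        raise ValueError("no sequences found")
    symbols = set()
    for _, sequence in records:
        symbols.update(sequence)
    if symbols <= IUPAC_NUCLEOTIDE:
        return "nucleotide"
    if symbols <= IUPAC_AMINO_ACID:
        return "protein"
    bad = sorted(symbols - IUPAC_AMINO_ACID)
    raise ValueError("unexpected symbols: " + ", ".join(bad))
```

Calling `validate(parse_fasta(text))` on a pasted dataset would either name the sequence type or raise an error listing the offending characters, which is the point at which a cleaned dataset could be handed to a queue.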


Figure 1. eProbalign input page.

 
The three Probalign parameters on the input page are used for computing the partition function dynamic programming matrices from which the posterior probabilities are derived. This is the same as computing a set of (suboptimal) pairwise alignments (for every pair of sequences in the input) and then estimating pairwise posterior probabilities by simple counting. The thermodynamic temperature controls the extent to which suboptimal alignments are considered. For example, all possible suboptimal alignments would be considered at infinite temperature, whereas only the single best would be used at a temperature of zero. The affine gap parameters are used for the pairwise alignments. Subsequently, Probalign computes the maximal expected accuracy alignment from the posterior probabilities in the same way that Probcons does (6).
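The role of the temperature can be illustrated with a small partition function computation for one pair of sequences. The sketch below is a simplification and not the Probalign implementation: it uses a single linear gap penalty instead of the affine gap open and gap extension parameters, and the function name `posterior_matrix` and the match, mismatch and gap scores are hypothetical values chosen only for illustration. Each alignment with score S contributes a weight exp(S/T), and the posterior probability that two residues are aligned is the total weight of the alignments containing that pair divided by the total weight of all alignments.

```python
import math


def posterior_matrix(x, y, temperature, match=1.0, mismatch=-1.0, gap=-1.0):
    """Posterior probability that x[i] is aligned to y[j] (linear gap model)."""
    n, m = len(x), len(y)
    w_gap = math.exp(gap / temperature)

    def w_pair(a, b):
        return math.exp((match if a == b else mismatch) / temperature)

    # fwd[i][j]: total weight of all alignments of the prefixes x[:i], y[:j].
    fwd = [[0.0] * (m + 1) for _ in range(n + 1)]
    fwd[0][0] = 1.0
    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 and j == 0:
                continue
            total = 0.0
            if i > 0 and j > 0:
                total += fwd[i - 1][j - 1] * w_pair(x[i - 1], y[j - 1])
            if i > 0:
                total += fwd[i - 1][j] * w_gap
            if j > 0:
                total += fwd[i][j - 1] * w_gap
            fwd[i][j] = total

    # bwd[i][j]: total weight of all alignments of the suffixes x[i:], y[j:].
    bwd = [[0.0] * (m + 1) for _ in range(n + 1)]
    bwd[n][m] = 1.0
    for i in range(n, -1, -1):
        for j in range(m, -1, -1):
            if i == n and j == m:
                continue
            total = 0.0
            if i < n and j < m:
                total += w_pair(x[i], y[j]) * bwd[i + 1][j + 1]
            if i < n:
                total += w_gap * bwd[i + 1][j]
            if j < m:
                total += w_gap * bwd[i][j + 1]
            bwd[i][j] = total

    z = fwd[n][m]  # partition function over all alignments of x and y
    return [
        [fwd[i][j] * w_pair(x[i], y[j]) * bwd[i + 1][j + 1] / z for j in range(m)]
        for i in range(n)
    ]
```

At a very low temperature the probability mass concentrates on the best-scoring alignment, while at a very high temperature every alignment receives nearly the same weight, which mirrors the two limiting cases described above. Exponentials of this form can overflow for long sequences or very small temperatures, so a practical implementation would need to guard against that.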


    OUTPUT AND ALIGNMENT COLUMN RELIABILITY
 
The eProbalign output provides three options for viewing and analyzing the alignment (Figure 2). The alignment can be viewed in (i) FASTA text format, (ii) pdf graphical format, and (iii) the Probalign Alignment Viewer (PAV) applet (Figure 4). Each column of the alignment in the pdf file and in the applet is colored in a shade of red according to the average column posterior probability. Bright red indicates probability close to one whereas white indicates close to zero (see Figure 4 for an example on a real BAliBASE dataset).


Figure 2. eProbalign output page indicating results are done.

 
The average column posterior probability is defined as the sum of posterior probabilities of all pairwise residues in the column normalized by the number of comparisons (6). The top row of the alignment in the pdf and applet displays the average column posterior probabilities multiplied by ten and floored to the lower integer (Figure 4). For example, a score of 1 indicates that the probability is between 0.1 and 0.2.
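As a worked illustration of this definition, the following Python sketch computes the average for one column and the digit shown in the top row. The function names are hypothetical, and two details are assumptions because the text does not specify them: the input is taken to be the list of posterior probabilities for the residue pairs compared in the column, and an empty list is scored as zero.

```python
import math


def average_column_posterior(pair_probabilities):
    """Sum of pairwise posteriors in a column divided by the number of pairs."""
    if not pair_probabilities:
        return 0.0  # assumption: a column with no comparisons scores zero
    return sum(pair_probabilities) / len(pair_probabilities)


def column_digit(probability):
    """Score shown above the column: the probability times ten, rounded down."""
    return math.floor(probability * 10)
```

For a column of three residues there are three pairwise comparisons; with posteriors 0.9, 0.8 and 0.55 the average is 0.75 and the displayed digit is 7.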

The Probalign Alignment Viewer is a Java applet that provides basic manipulation of the alignment. Basic Java and browser requirements to use the applet are listed on the output page. With the applet the user can opt to view and save the alignment with column posterior probabilities above any specified threshold. This has the benefit of "cleaning up" the alignment by column posterior probabilities, which is unique to eProbalign. The applet also displays posterior probabilities of all columns in a separate window if desired (Figure 3) and provides options to switch between the gapped and ungapped versions of the alignment.
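The thresholding step can be pictured with a short Python sketch. It is not the applet's code: the function name `filter_columns` is hypothetical, and keeping columns whose score is greater than or equal to the threshold (rather than strictly greater) is an assumption of this example.

```python
def filter_columns(rows, column_scores, threshold):
    """Keep only alignment columns whose score reaches the threshold."""
    width = len(column_scores)
    if any(len(row) != width for row in rows):
        raise ValueError("every row must have one character per column score")
    # Assumption: a column is kept when its score is >= threshold.
    keep = [k for k, score in enumerate(column_scores) if score >= threshold]
    return ["".join(row[k] for k in keep) for row in rows]
```

With rows `AC-G` and `A-TG`, column scores 0.9, 0.2, 0.1 and 0.8, and a threshold of 0.5, only the first and last columns survive, leaving `AG` in both rows.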


Figure 3. Posterior probability of each column.

 

Figure 4. Probalign Alignment Viewer applet.

 

    SERVER IMPLEMENTATION
 
We implement a first-in/first-out queuing system that receives requests for Probalign alignments and processes them accordingly. At most, eProbalign will run two Probalign jobs at once, and it will periodically check the queue for new requests. Alignments that take longer than some defined time limit (10 hours at the time of writing of this paper) are stopped and the user is advised to download and run the standalone version. This time limit will be increased as the server hardware is upgraded.
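The queuing policy can be illustrated with a small simulation in Python. It is a sketch under stated assumptions rather than the server's implementation: the function name `simulate_queue` is hypothetical, all jobs are assumed to be waiting at time zero, and the simulation jumps from one job completion to the next instead of polling the queue periodically as the server does.

```python
import heapq
from collections import deque


def simulate_queue(jobs, max_running=2, time_limit=10.0):
    """Simulate a first-in/first-out queue with a cap on concurrent jobs.

    jobs is a list of (job_id, runtime_in_hours) in arrival order. Returns a
    dict mapping job_id to (status, end_time), where status is 'finished' or
    'stopped' (the job exceeded the time limit).
    """
    waiting = deque(jobs)
    running = []  # min-heap of (end_time, job_id, status)
    results = {}
    now = 0.0
    while waiting or running:
        # Start queued jobs, oldest first, while a slot is free.
        while waiting and len(running) < max_running:
            job_id, runtime = waiting.popleft()
            if runtime > time_limit:
                heapq.heappush(running, (now + time_limit, job_id, "stopped"))
            else:
                heapq.heappush(running, (now + runtime, job_id, "finished"))
        # Advance the clock to the next job that ends and free its slot.
        now, job_id, status = heapq.heappop(running)
        results[job_id] = (status, now)
    return results
```

For example, with jobs of 3, 12, 1 and 2 hours submitted in that order, the second job occupies one slot until it is stopped at the 10 hour limit, while the other three run one after another in the remaining slot.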


    SCALABILITY
 
Currently, eProbalign is installed on a dual-processor 2.8 GHz Intel Xeon machine with 2 GB of RAM. With this configuration, eProbalign can usually align datasets of up to 20 sequences within one minute. Most BAliBASE 3.0 datasets from RV11 and RV12 also finish within one minute. We have also tested large datasets (in number and length of sequences) from the BAliBASE RV30 and RV40 classes on eProbalign. BB30029 and BB30008 from RV30 contain 98 and 36 sequences with lengths from 431 to 852 and from 400 to 1155, respectively, and BB40002 from RV40 contains 55 sequences with lengths ranging from 58 to 1502. When the server was idle, eProbalign finished in about 20 minutes on BB30008, 55 minutes on BB30029, and 30 minutes on BB40002. Results may take longer when the server queue is full and multiple jobs are running simultaneously. However, the effect of parallel jobs will diminish as the server moves to a bigger machine in the near future.


    ACKNOWLEDGEMENTS
 
We thank system administrators Gedaliah Wolosh and David Perel, who have been helpful with technical issues related to the server. DRL is supported, in part, by NIH R01 GM073082-0181. Funding to pay the open access publication charges for this article was provided by startup funding to DRL from the Bioinformatics Research Center at UNC Charlotte.

Conflict of interest statement. None declared.


    REFERENCES
 

  1. Notredame C. Recent progresses in multiple sequence alignment: a survey. Pharmacogenomics (2002) 3:131–144.

  2. Thompson JD, Higgins DG, Gibson TJ. ClustalW: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position specific gap penalties, and weight matrix choice. Nucleic Acids Res (1994) 22:4673–4680.

  3. Thompson JD, Koehl P, Ripp R, Poch O. BAliBASE 3.0: latest developments of the multiple sequence alignment benchmark. Proteins (2005) 61:127–136.

  4. Mizuguchi K, Deane CM, Blundell TL, Overington JP. HOMSTRAD: a database of protein structure alignments for homologous families. Protein Science (1998) 7:2469–2471.

  5. Raghava GPS, Searle SMJ, Audley PC, Barber JD, Barton GJ. OXBench: a benchmark for evaluation of protein multiple sequence alignment accuracy. BMC Bioinformatics (2003) 4:47.

  6. Do CB, Mahabhashyam MSB, Brudno M, Batzoglou S. PROBCONS: probabilistic consistency based multiple sequence alignment. Genome Res (2005) 15:330–340.

  7. Roshan U, Livesay DR. Probalign: multiple sequence alignment using partition function posterior probabilities. Bioinformatics (2006) 22:2715–2721.

  8. Edgar RC. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res (2004) 32:1792–1797.

  9. Katoh K, Misawa K, Kuma K, Miyata T. MAFFT version 5: improvement in accuracy of multiple sequence alignment. Nucleic Acids Res (2005) 33:511–518.

