Nucleic Acids Research Advance Access originally published online on May 8, 2009
Nucleic Acids Research 2009 37(Web Server issue):W53-W56; doi:10.1093/nar/gkp301
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Nucleic Acids Research, 2009, Vol. 37, No. suppl_2 W53-W56
© 2009 The Author(s)
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
Articles |
SIB-BLAST: a web server for improved delineation of true and false positives in PSI-BLAST searches
1The Ohio State Biophysics Program, 2Department of Biochemistry, 3Department of Chemistry and 4Department of Physics, Ohio State University, Columbus, OH 43210-1117, USA
*To whom correspondence should be addressed. Tel: +1 614 688 3978; Fax: +1 614 292 7557; Email: bundschuh{at}mps.ohio-state.edu Correspondence may also be addressed to Michael K. Chan. Tel: +1 614 292 8375; Fax: +1 614 292 6773; Email: chan{at}chemistry.ohio-state.edu.
Received January 28, 2009. Revised April 6, 2009. Accepted April 15, 2009.
| ABSTRACT |
|---|
|
|
|---|
A SIB-BLAST web server (http://sib-blast.osc.edu) has been established for investigators to use the SimpleIsBeautiful (SIB) algorithm for sequence-based homology detection. SIB was developed to overcome the model corruption frequently observed in the later iterations of PSI-BLAST searches. The algorithm compares resultant hits from the second iteration to the final iteration of a PSI-BLAST search, calculates the figure of merit for each overlapped hit and re-ranks the hits according to their figure of merit. By validating hits generated from the last profile against hits from the first profile when the model is least corrupted, the true and false positives are better delineated, which in turn, improves the accuracy of iterative PSI-BLAST searches. Notably, this improvement to PSI-BLAST comes at minimal computational cost as SIB-BLAST utilizes existing results already produced in a PSI-BLAST search.
| INTRODUCTION |
|---|
|
|
|---|
Bioinformatics tools, in particular, those utilizing sequence comparison methods to search for homologs in order to obtain clues regarding functional and evolutionary relationships of uncharacterized proteins, have become an integral part of today's laboratory research. BLAST (1) is arguably the most ubiquitous sequence alignment tool for uncovering close homologs. PSI-BLAST (2,3), an extended version of BLAST, is similarly popular, and also more sensitive in detecting the harder-to-find distant homologs due to its iterative and profile-based search approach (4,5). However, in PSI-BLAST searches it is commonly observed that non-homologous proteins are incorporated into the profiles of the later iterations, leading to model corruption and meaningless results. For this reason, PSI-BLAST developers recommend users to confine their iterative search to no more than five or six rounds (3).
Numerous homology detection algorithms with different strategies to improve the discrimination of true and false positives have been developed. These include the state-of-the-art programs: SAM (6–10) and HMMER (7,11), which perform better than PSI-BLAST in remote homology detection (12), but are far less popular than PSI-BLAST due to their expensive computational costs. Thus, there is a trade-off between improved performance and computational efficiency. In light of these considerations, we have developed a novel algorithm, SimpleIsBeautiful (SIB) (13), that overcomes the model corruption problem in PSI-BLAST searches while at the same time, preserving PSI-BLAST's computational efficiency. By benchmarking resultant hits from the last iteration, where the algorithm has identified the most distant homologs, against resultant hits from the second iteration, when the profile is the least corrupted (since it is comprised mostly of close homologs), our SIB algorithm showed improved discrimination between true and false positives over standard PSI-BLAST searches. A direct performance comparison between the SIB algorithm and PSI-BLAST based on the same test set (Aravind dataset) (14) confirmed that SIB outperforms PSI-BLAST in terms of both specificity and sensitivity. Further performance comparison of SIB against another state-of-the-art sequence alignment algorithm SAM-T2K (8,10) using the same Aravind dataset revealed that SIB exhibits a comparable performance, but at a much lower computational costs.
The SIB algorithm has previously been made available as a downloadable awk script at http://bioserv.mps.ohio-state.edu/SimpleIsBeautiful. The script takes the outputs from a PSI-BLAST search (second iteration and final iteration), compares the two lists of hits and re-ranks the merged list of hits based on each hit's corresponding figure of merit (FOM) (13)—a numeric representation analogous to PSI-BLAST's E-value. Several requests have since been made for an implementation of a web interface of the SIB algorithm. In this article, we present the SIB-BLAST web server that performs PSI-BLAST searches, runs the latest version of the SIB algorithm, which features an improved FOM calculation, and returns a list of hits rank-ordered by their FOM to the users interactively as well as through email.
| METHODS |
|---|
|
|
|---|
Overview of the SIB algorithm
PSI-BLAST achieves its high sensitivity for remote homologs by performing multiple rounds of searches. At each iteration, PSI-BLAST attempts to build a better model of its original query sequence using homologous sequences found in previous rounds. Thus, the earlier iterations (namely iteration one and two) are highly specific, finding mostly close homologs—but not very sensitive. In later iterations, more weakly related homologs are found, which are then used to build models for the even more remote homologs, imparting PSI-BLAST with the sensitivity needed for detecting the more remote homologs. However, these weakened relationships and diminishing similarities between sequences increase the probability of incorporating a non-homologous sequence into the model, leading to model corruption and the false identification of non-homologous sequences as putative homologs.
As described in the initial report of this algorithm, the SIB algorithm is built on the rationale that the benefits of low model corruption from early rounds of PSI-BLAST and the high sensitivity from later rounds can both be reaped by combining results from these iterations. Thus, the hits from the final round can be validated against those of earlier rounds, which in turn, would lead to improved discrimination of true and false positives. Toward this end, SIB uses a mathematical formulation (13) that combines the reported E-values of each overlapped hit from the two PSI-BLAST iterations to calculate the FOM. This FOM serves the same role as the E-value reported by PSI-BLAST in the sense that it indicates the statistical significance of the hits relative to one another. For the FOM to be a true E-value, however, the underlying statistical independence of the two iterations assumption of the FOM formulation has to be fulfilled.
In our original FOM formulation, the E-values of the two iterations were converted to P-values. Under statistical independence, our assumption to combine P-values was to simply multiply the two P-values. This combined P-value was then converted to a FOM that was mathematically analogous to the E-value reported by PSI-BLAST. Recently, we have become aware that Bailey and Gribskov (15) have determined that the correct equation for calculating the combined P-value of two independent values should in fact be
|
|
The performance of the original SIB algorithm was directly verified using the same 103 sequences from the Aravind gold standard dataset (3,14) used by the PSI-BLAST developers. The evaluation results indicated that SIB exhibits higher specificity and sensitivity than that of PSI-BLAST for near identical computational time and a comparable performance to SAM-T2K using much less time. We have verified that the error versus coverage plot remains essentially unchanged after incorporating the new method of combining two P-values as expected since the new formula is still a monotonous function of P2Pf.
SIB-BLAST Web server
SIB-BLAST is comprised of the following steps: (i) performing a PSI-BLAST search of a query protein sequence against the non-redundant (NR) database; (ii) comparing resultant hits found in iteration 2 and the last iteration; (iii) calculating a FOM for each hit by combining its corresponding E-values at iteration 2 and at the final round; and (iv) re-ranking the merged list of hits according to their FOM.
The SIB-BLAST web server (Figure 1A) requests three inputs from the user: a protein sequence in FASTA format, the number of iterations of the PSI-BLAST search and the maximal number of target sequences reported in the PSI-BLAST search. Users are given the option to either paste a protein sequence or upload a file containing a protein sequence in FASTA format. The number of PSI-BLAST iterations to be performed is limited to be between 3 to 10 rounds, though users are advised to choose no more than five to six iterations as suggested by the PSI-BLAST developers (3). The number of target sequences reported by PSI-BLAST can be chosen as 1000, 2000, 5000, 10 000 or 20 000. This number is purposefully restricted to be higher than the PSI-BLAST default value to ensure that even weak hits are reported in the individual rounds as they might become significant once results from different rounds are combined.
|
Other than these three user-defined input parameters, the parameters used to conduct the search are preset. The NR database (updated weekly) is used for querying the sequence. The algorithm parameter for the Expect threshold, which reports the number of sequence matches, is set to 1000 instead of PSI-BLAST's default value of 10. This higher threshold is to ensure that those true but more distant homologs identified in the final iteration but having very large E-value in round two are reported in both lists of iterations, which are subjected to downstream SIB processing. All other algorithm parameters, such as the word size, the scoring matrix and PSI-BLAST threshold E-value of 0.002 for inclusion of matches in the profile for the next round are set to default values.
Users are asked to provide an email address to which their results will be forwarded. Alternatively, users can bookmark the result link to obtain the output interactively when it is available. A status page, which shows the progress of the SIB-BLAST job is displayed upon submission.
SIB-BLAST outputs (Figure 1B) a combined list of hits from the second iteration and the last iteration rank-ordered by their corresponding FOM. Based on an analysis on the FOM's coverage versus error curve on the Aravind test set (14) in our original study, it appeared that the incurrence of errors increases dramatically at a FOM of
10–8. We have mentioned this empirical FOM threshold of 10–8 in the Manual page as a point of reference for users to gauge the statistical significance of each hit. Users are cautioned to the use of this threshold as this value is expected to depend on the size of the database.
Each hit on the results page is shown with its GI number, which has a link to the protein annotation page at the NCBI's website that allows the user to obtain details of the hit to ascertain whether the hit is indeed a true homolog or not. Along with the GI number and FOM, the E-values from the second and the final iterations are provided for each hit. Users can view the PSI-BLAST pairwise alignment between the query and the hit sequence by clicking on the E-value link. Different output options sorted by E-value at iteration 2 or E-value at the last iteration and in HTML or text format are also available to the users. The results page for these different options are organized identically to the default results page.
Documentation and runtime
The SIB-BLAST manual page connected by a clickable link on the SIB-BLAST front page provides users with a brief overview of the SIB algorithm in addition to a detailed description for each input parameter and an explanation of the result page. A sample sequence is made available for users to test a trial run of the SIB-BLAST server.
As reflected in the algorithm's name, the beauty of our approach is in its simplicity—it requires minimal changes to the existing PSI-BLAST algorithm since it uses information already output by PSI-BLAST and the processing time to calculate the FOM for individual hits and re-sorting them are negligible. Thus, SIB-BLAST improves the search accuracy of PSI-BLAST without compromising its computational efficiency.
Future development
The current FOM provides a relative measure of the statistical significance of the resultant hits against one another. It is not an E-value, however, due to SIB's implicit assumption in the calculation of the combined P-value—that the hits identified in the second and last rounds of PSI-BLAST are independent. It would be more meaningful if it were possible to calculate the combined P-value under the condition of the hits being dependent. Then the FOM calculated would be the true E-value, and the value obtained would provide an accurate measure of the error probability. Bailey and Grundy's POP (product of P-values) algorithm, which calculates the product of the P-values of correlated variables may provide some suggestions as to how to achieve this as a direction for future study (16). To improve on the functionality of SIB-BLAST, it would also be useful to provide a multiple sequence alignment of the hits (as is being done now in PSI-BLAST) in future versions of SIB-BLAST.
| FUNDING |
|---|
|
|
|---|
Funding for open access charge: Ohio State University.
Conflict of interest statement. None declared.
| ACKNOWLEDGEMENTS |
|---|
NSF (DBI-0317335 to R.B., partial); OSU internal funds (to M.K.C.); OSU Presidential Fellowship (to M.M.L.).
| REFERENCES |
|---|
|
|
|---|
- Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. Basic local alignment search tool. J. Mol. Biol. (1990) 215:403–410.[CrossRef][Web of Science][Medline]
- Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. (1997) 25:3389–3402.
[Abstract/Free Full Text] - Schaffer AA, Aravind L, Madden TL, Shavirin S, Spouge JL, Wolf YI, Koonin EV, Altschul SF. Improving the accuracy of PSI-BLAST protein database searches with composition-based statistics and other refinements. Nucleic Acids Res. (2001) 29:2994–3005.
[Abstract/Free Full Text] - Gotoh O. Significant improvement in accuracy of multiple protein sequence alignments by iterative refinement as assessed by reference to structural alignments. J. Mol. Biol. (1996) 264:823–838.[CrossRef][Web of Science][Medline]
- Thompson JD, Plewniak F, Poch O. A comprehensive comparison of multiple sequence alignment programs. Nucleic Acids Res. (1999) 27:2682–2690.
[Abstract/Free Full Text] - Gribskov M, McLachlan AD, Eisenberg D. Profile analysis: detection of distantly related proteins. Proc. Natl Acad. Sci. USA (1987) 84:4355–4358.
[Abstract/Free Full Text] - Eddy SR. Hidden Markov models. Curr. Opin. Struct. Biol. (1996) 6:361–365.[CrossRef][Web of Science][Medline]
- Hughey R, Krogh A. Hidden Markov models for sequence analysis: extension and analysis of the basic method. Comput. Appl. Biosci. (1996) 12:95–107.
[Abstract/Free Full Text] - Grundy WN. Homology detection via family pairwise search. J. Comput. Biol. (1998) 5:479–491.[Web of Science][Medline]
- Karplus K, Barrett C, Hughey R. Hidden Markov models for detecting remote protein homologies. Bioinformatics (1998) 14:846–856.
[Abstract/Free Full Text] - Eddy SR. Multiple alignment using hidden Markov models. Proc. Int. Conf. Intell. Syst. Mol. Biol. (1995) 3:114–120.[Medline]
- Park J, Karplus K, Barrett C, Hughey R, Haussler D, Hubbard T, Chothia C. Sequence comparisons using multiple sequences detect three times as many remote homologues as pairwise methods. J. Mol. Biol. (1998) 284:1201–1210.[CrossRef][Web of Science][Medline]
- Lee MM, Chan MK, Bundschuh R. Simple is beautiful: a straightforward approach to improve the delineation of true and false positives in PSI-BLAST searches. Bioinformatics (2008) 24:1339–1343.
[Abstract/Free Full Text] - Aravind L, Koonin EV. Gleaning non-trivial structural, functional and evolutionary information about proteins by iterative database searches. J. Mol. Biol. (1999) 287:1023–1040.[CrossRef][Web of Science][Medline]
- Bailey TL, Gribskov M. Combining evidence using p-values: application to sequence homology searches. Bioinformatics (1998) 14:48–54.
[Abstract/Free Full Text] - Bailey TL, Grundy WM. Classifying proteins by family using the product of correlated p-values. In: Proceedings of the Third Annual International Conference on Computational Molecular Biology. (1999) ACM New York, NY, USA. 10–14.
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
