
Nucleic Acids Research 2006 34(Web Server issue):W6-W9; doi:10.1093/nar/gkl164

Published by Oxford University Press 2006
The online version of this article has been published under an open access model. Users are entitled to use, reproduce, disseminate, or display the open access version of this article for non-commercial purposes provided that: the original authorship is properly and fully attributed; the Journal and Oxford University Press are attributed as the original place of publication with the correct citation details given; if an article is subsequently reproduced or disseminated not in its entirety but only in part or as a derivative work this must be clearly indicated. For commercial re-use, please contact journals.permissions@oxfordjournals.org


Article

BLAST: improvements for better sequence analysis

Jian Ye, Scott McGinnis and Thomas L. Madden*

National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Building 38A, 8600 Rockville Pike, Bethesda, MD 20894, USA

*To whom correspondence should be addressed. Tel: +1 301 435 5991; Fax: +1 301 480 0814; Email: madden{at}ncbi.nlm.nih.gov

Received February 10, 2006. Revised February 22, 2006. Accepted March 20, 2006.


    ABSTRACT
 
Basic local alignment search tool (BLAST) is a sequence similarity search program. The National Center for Biotechnology Information (NCBI) maintains a BLAST server with a home page at http://www.ncbi.nlm.nih.gov/BLAST/. We report here on recent enhancements to the results produced by the BLAST server at the NCBI. These include features to highlight mismatches between similar sequences, show where the query was masked for low-complexity sequence, and integrate information about the database sequences from the NCBI Entrez system into the BLAST display. Changes to how the database sequences are fetched have also improved the speed of the report generator.


    INTRODUCTION
 
Basic local alignment search tool (BLAST) is a sequence similarity search program that can be used via a web interface or as a stand-alone tool to compare a user's query to a database of sequences (1,2). Several variants of BLAST compare all combinations of nucleotide or protein queries with nucleotide or protein databases. BLAST is a heuristic that finds short matches between two sequences and attempts to start alignments from these ‘hot spots’. In addition to performing alignments, BLAST provides statistical information about an alignment; this is the ‘expect’ value, or false-positive rate.
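The ‘hot spot’ idea can be illustrated with a minimal sketch. It assumes exact word matches and a purely exact-match extension; the function names and the word size used in the example are illustrative choices, and the production BLAST code uses scored extensions with a drop-off threshold rather than this simplified rule.

```python
from collections import defaultdict


def find_seeds(query: str, subject: str, word_size: int = 11):
    """Return (query_pos, subject_pos) pairs (0-based) where an exact word
    of length word_size occurs in both sequences."""
    # Index every word of the query by its starting position.
    index = defaultdict(list)
    for i in range(len(query) - word_size + 1):
        index[query[i:i + word_size]].append(i)
    # Scan the subject and look each of its words up in the query index.
    seeds = []
    for j in range(len(subject) - word_size + 1):
        for i in index.get(subject[j:j + word_size], ()):
            seeds.append((i, j))
    return seeds


def extend_exact(query: str, subject: str, qpos: int, spos: int, word_size: int):
    """Extend a seed in both directions while the bases are identical.

    Returns (query_start, subject_start, length). This is a simplification:
    it stops at the first mismatch instead of scoring the extension.
    """
    left = 0
    while (qpos - left > 0 and spos - left > 0
           and query[qpos - left - 1] == subject[spos - left - 1]):
        left += 1
    right = word_size
    while (qpos + right < len(query) and spos + right < len(subject)
           and query[qpos + right] == subject[spos + right]):
        right += 1
    return qpos - left, spos - left, left + right


if __name__ == "__main__":
    q, s = "TTACGTACGGA", "GGACGTACGTT"
    for qpos, spos in find_seeds(q, s, word_size=4):
        print(extend_exact(q, s, qpos, spos, 4))
```

Several seeds on the same diagonal extend to the same segment, so a practical implementation would deduplicate them before attempting a full alignment.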

The National Center for Biotechnology Information (NCBI) maintains a BLAST server with a homepage at http://www.ncbi.nlm.nih.gov/BLAST/. On the homepage the different BLAST searches are listed by type: nucleotide, protein, translated and genomes. The ‘Program Selection Guide’ (http://www.ncbi.nlm.nih.gov/blast/producttable.shtml) provides an introduction to the various programs and database options (3). When a query is submitted to the NCBI server, either as a sequence in FASTA format or as a sequence identifier, e.g. GenBank accession number, the search is sent to the BLAST server and a ‘Request Identifier’ (RID) is returned. The query and results are stored in a structured format for up to 24 h after an RID is issued. The RID identifies the search and allows the results to be viewed in several formats, which include the familiar BLAST report, a simplified ‘hit table’, XML and ASN.1 [(4) and http://www.ncbi.nlm.nih.gov/books/bv.fcgi?rid=handbook.chapter.610]. The number of outstanding jobs from one IP address is taken into account when queuing requests, as described at http://www.ncbi.nlm.nih.gov/BLAST/blast_FAQs.shtml#Queuetime, so that one user does not monopolize the entire service. Searches sent to the server are handled by a sophisticated queuing system that may spread the search over 10 to 20 machines, making the search much faster than if it were run on one machine. Queries and results are stored in an SQL database. More details are available at ftp://ftp.ncbi.nlm.nih.gov/blast/documents/blast-sc2004.pdf
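The submit-then-retrieve cycle built around the RID can be sketched as follows. The endpoint and the parameter names (`CMD`, `PROGRAM`, `DATABASE`, `QUERY`, `RID`, `FORMAT_TYPE`) are assumptions taken from NCBI's separately documented URL API rather than from this article, and the RID in the usage example is a made-up placeholder. The sketch only builds request URLs and parses a reply; it sends nothing over the network.

```python
import re
from typing import Optional
from urllib.parse import urlencode

# Assumed endpoint of the NCBI BLAST URL API; not stated in this article.
BLAST_URL = "https://blast.ncbi.nlm.nih.gov/Blast.cgi"


def submit_url(query: str, program: str = "blastn", database: str = "nt") -> str:
    """Build the URL that submits a search (FASTA text or an accession)."""
    params = {"CMD": "Put", "PROGRAM": program,
              "DATABASE": database, "QUERY": query}
    return BLAST_URL + "?" + urlencode(params)


def parse_rid(response_text: str) -> Optional[str]:
    """Extract the Request Identifier from the server reply, if present.

    Assumes the reply contains a line of the form 'RID = <identifier>'.
    """
    match = re.search(r"^\s*RID = (\S+)", response_text, re.MULTILINE)
    return match.group(1) if match else None


def results_url(rid: str, format_type: str = "Text") -> str:
    """Build the URL that retrieves stored results in a chosen format."""
    params = {"CMD": "Get", "RID": rid, "FORMAT_TYPE": format_type}
    return BLAST_URL + "?" + urlencode(params)


if __name__ == "__main__":
    reply = "    RID = ABC123XYZ\n    RTOE = 15\n"  # placeholder reply text
    rid = parse_rid(reply)
    print(submit_url("NM_002116"))
    print(results_url(rid, "XML"))
```

Because the results are stored under the RID, the same search can be re-requested in a different format (for example XML after first viewing the text report) without running it again, as long as the RID has not expired.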

We report here on new display features that we have implemented. These include highlighting mismatches between similar sequences, showing where the query was masked for low-complexity sequence and integrating information from the NCBI Entrez system (5) into the BLAST display. Additionally the new report generator has been optimized for databases with large sequences.

Custom definition lines
During the past five years many genomes have become searchable and the sequences in those databases are typically long contigs or chromosomes. Additionally many long nucleotide sequences have been added to the BLAST databases as a result of high-throughput genomic projects. Traditionally sequences in the BLAST database have been associated with only one descriptive phrase that is normally the same as the ‘definition’ in the GenBank flat file. This means that only very generic information is provided for matches to long database sequences, even though such a sequence might have annotations for many genes, coding regions (CDS) and other features. The top line of Figure 1 shows a database sequence definition and merely states that the sequence is part of human chromosome 6 and is about 48 million bases long. This reveals little about the region of the database sequence containing the match. To address this issue, we now provide feature information for BLAST alignments involving long database sequences (currently defined as larger than 200 kb).


 
Figure 1 Excerpt from a BLAST result showing custom-definition lines. The query was bases 241 through 480 of a human MHC A gene nucleotide sequence (NM_002116) in a search against the human genome. The top line of the figure is the traditional sequence definition. Custom definition lines are provided for both of the alignments shown and are relevant to the region matched (first alignment) or nearby regions (second alignment).

 
Two types of sequence features (CDS and rRNA) are currently supported but this could be expanded to other features. An example is shown in Figure 1 where a custom definition line is displayed for each of the two alignments. According to the custom definition lines the query matches a region inside the human major histocompatibility complex (MHC) A gene, as well as a region that is about 54 kb upstream of the MHC A gene and about 58 kb downstream of the MHC G gene, allowing one to quickly conclude that the query sequence is closely related to the human MHC. This feature is always enabled for reports at the NCBI BLAST web site.
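The logic behind a custom definition line can be sketched as an interval lookup: report the features that overlap the aligned region, or failing that the nearest feature on each side together with its distance. The sketch below is an illustration under stated assumptions, not the NCBI implementation; the `Feature` type, the output wording and the coordinates in the example are hypothetical, and strand orientation is ignored.

```python
from typing import List, NamedTuple, Optional


class Feature(NamedTuple):
    name: str
    start: int  # 1-based, inclusive
    end: int    # 1-based, inclusive


# Custom definition lines are produced only for long database sequences,
# "currently defined as larger than 200 kb" in the text.
LONG_SEQUENCE_THRESHOLD = 200_000


def custom_definition(features: List[Feature], hit_start: int, hit_end: int,
                      subject_length: int) -> Optional[str]:
    """Describe the features in or around the aligned subject region."""
    if subject_length <= LONG_SEQUENCE_THRESHOLD:
        return None  # short sequences keep the traditional definition only
    overlapping = [f for f in features
                   if f.start <= hit_end and f.end >= hit_start]
    if overlapping:
        names = ", ".join(f.name for f in overlapping)
        return "Features in this part of subject sequence: " + names
    parts = []
    before = [f for f in features if f.end < hit_start]
    after = [f for f in features if f.start > hit_end]
    if before:
        nearest = max(before, key=lambda f: f.end)
        parts.append(f"{hit_start - nearest.end} bp after {nearest.name}")
    if after:
        nearest = min(after, key=lambda f: f.start)
        parts.append(f"{nearest.start - hit_end} bp before {nearest.name}")
    if not parts:
        return None
    return "Features flanking this part of subject sequence: " + "; ".join(parts)


if __name__ == "__main__":
    annotated = [Feature("gene G", 1000, 2000), Feature("gene A", 120000, 123000)]
    print(custom_definition(annotated, 121000, 121240, 300000))
    print(custom_definition(annotated, 60000, 60240, 300000))
```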

New format options for easier sequence analysis
Frequently alignments are between very similar sequences and it is difficult to spot a few mismatches in the pairwise alignment. To address this issue we recently introduced a new format called ‘Pairwise with identities’, shown in Figure 2 on an alignment with 98% identity between the query and database sequence. A dot indicates identity between the database sequence and query at that position; mismatches are shown as the database sequence letter in place of the dot and colored red. In addition the word ‘Sbjct’ (on the left of the figure) is colored red if there is a mismatch on the line. Enable this option with the ‘Alignment View’ pull-down menu shown in Figure 3.
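The dot notation is straightforward to reproduce. The following minimal sketch assumes two already-aligned strings of equal length; the function names are illustrative, and the red coloring applied by the web report is left out.

```python
def identities_line(query_aln: str, subject_aln: str) -> str:
    """Render the subject row with '.' wherever it matches the query."""
    if len(query_aln) != len(subject_aln):
        raise ValueError("aligned sequences must have the same length")
    return "".join("." if q == s else s
                   for q, s in zip(query_aln, subject_aln))


def has_mismatch(query_aln: str, subject_aln: str) -> bool:
    """True if the line would need its 'Sbjct' label highlighted."""
    return any(q != s for q, s in zip(query_aln, subject_aln))


if __name__ == "__main__":
    query = "ACGTACGT"
    sbjct = "ACGAACGT"
    print("Query  " + query)
    print("Sbjct  " + identities_line(query, sbjct))
```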


 
Figure 2 Demonstration of new format options. FASTA sequence for the human cystic fibrosis trans-membrane conductance regulator sequence (NM_000492) was used as query for a BLASTN search against the nr database using default parameters. Three new display options are shown in this figure. The first is the ‘Pairwise with identities’ option. Nucleotide matches in the database sequence are shown as dots (‘.’), nucleotide mismatches in the database sequence (as well as the database sequence identification) are colored red. The second new option is the presentation of the CDS features, which is shown for both the query and database sequences above and below the BLAST alignment, respectively. The CDS feature annotated on the database sequence was retrieved from Entrez; the putative CDS feature on the query was produced automatically using the CDS of the database sequence as a guide. Mismatches for the amino acid sequence derived from the database sequence are colored pink. Finally the new masking option is shown (see text). Bases 175–181 of the query were masked for low-complexity during the search and are shown in lower-case gray letters.

 

 
Figure 3 Enabling new features on the BLAST format page. The red arrows point to new report features that may be enabled or modified from this page. The check-box highlighted by arrow 1 enables the CDS feature on a BLASTN or megaBLAST search. The two menus highlighted by arrow 2 change the default behavior for display of masked regions. The menu highlighted by arrow 3 changes how the alignments are displayed in the BLAST report.

 
The majority of BLAST searches at the NCBI web site are nucleotide queries against nucleotide databases (e.g. BLASTN). Many of these queries are mRNAs or match to sequences with annotated coding regions. The standard BLAST report does not show the amino acid sequence translated from the query or annotated on the database sequence, even though that may be of great interest to the user; furthermore figuring out the positions of the encoded amino acids on the corresponding nucleotide sequence can be challenging, especially if the coding region is long or involves multiple exons. We have introduced a new ‘CDS Feature’ to display such coding regions. With this option any pre-annotated CDS protein products on the query (if the query is an accession) or the database sequence are fetched from Entrez and shown with the residues aligned to the second base of a codon (Figure 2). For a user-submitted query in FASTA format a putative protein product is calculated using the coding frame of the database sequence as a guide. Mismatched amino acids for the database sequence can also be shown in color. Combined with the ‘Pairwise with identities’ option discussed above this format makes certain tasks easier, such as analysis for silent and replacement mutations. Owing to the overhead of fetching the CDS feature from Entrez this option is currently not the default. Enable this option by checking the ‘CDS feature’ box on the BLAST format page as shown in Figure 3.
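The placement of residues under the second base of each codon, and the distinction between silent and replacement changes, can be sketched as below. This assumes the standard genetic code, an ungapped forward-strand coding region and a known start offset; the function names are illustrative and the sketch does not handle multiple exons.

```python
# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def cds_track(nucleotides: str, cds_start: int = 0) -> str:
    """Return a line as long as the nucleotide row, with each amino acid
    written under the second base of its codon and spaces elsewhere.

    cds_start is the 0-based offset of the first base of the first codon.
    Codons containing ambiguous bases are shown as 'X'.
    """
    track = [" "] * len(nucleotides)
    for pos in range(cds_start, len(nucleotides) - 2, 3):
        codon = nucleotides[pos:pos + 3].upper()
        track[pos + 1] = CODON_TABLE.get(codon, "X")
    return "".join(track)


def classify_codon_change(query_codon: str, subject_codon: str) -> str:
    """Label a pair of aligned codons as identical, silent or replacement."""
    if query_codon.upper() == subject_codon.upper():
        return "identical"
    same_residue = (CODON_TABLE.get(query_codon.upper())
                    == CODON_TABLE.get(subject_codon.upper()))
    return "silent" if same_residue else "replacement"


if __name__ == "__main__":
    seq = "ATGGCCTAA"
    print(cds_track(seq))
    print(seq)
```

Printing the track directly above or below the nucleotide row reproduces the layout described in the text, with each residue centered on its codon.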

Low-complexity sequences are compositionally biased regions of amino acid or nucleotide sequence, which often result in artificially high scores in sequence similarity searches. Low-complexity filters, such as SEG (6) or DUST, mask these regions and prevent them from overly biasing the results. Traditionally BLAST has replaced the masked regions with Xs or Ns in the BLAST report. The BLAST formatter can now represent these regions in lower-case letters, making them distinct from the (upper-case) non-filtered regions (Figure 2). In addition the user may select from three colors (black, gray, red) to vary the emphasis on these regions (Figure 3). This new display option is now the default, showing the masked regions in gray lower-case.
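The difference between the two display conventions can be shown with a short sketch. It assumes the masked regions are supplied as 1-based inclusive intervals (as a filter such as DUST might report them); the function name and the example sequence are hypothetical, and coloring is omitted.

```python
from typing import Iterable, Tuple


def show_masked(query: str, masked: Iterable[Tuple[int, int]],
                style: str = "lowercase", mask_char: str = "N") -> str:
    """Render a query with its masked intervals made visible.

    style="lowercase" keeps the residues but lower-cases them (the newer
    display); style="replace" overwrites them with mask_char, e.g. 'N'
    for nucleotides or 'X' for proteins (the traditional display).
    """
    if style not in ("lowercase", "replace"):
        raise ValueError("style must be 'lowercase' or 'replace'")
    chars = list(query.upper())
    for start, end in masked:
        for i in range(start - 1, end):
            chars[i] = chars[i].lower() if style == "lowercase" else mask_char
    return "".join(chars)


if __name__ == "__main__":
    query = "ACGT" + "A" * 7 + "CGT"
    print(show_masked(query, [(5, 11)]))                   # lower-case display
    print(show_masked(query, [(5, 11)], style="replace"))  # traditional display
```

The lower-case form keeps the underlying residues readable, which the replacement form discards.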

General improvements to the BLAST web site
The BLAST graphical overview is a schematic representation of alignments matching the query sequence. It is useful for quickly localizing regions of interest in the query based on its similarity to other sequences in the database. To reduce the complexity of generating this graphical overview we have now implemented it as HTML tables that use a few small static images (GIFs). This design is more robust and also lends itself to future development of a graphical viewer for stand-alone and command-line client BLAST.

The new report generator has improved functionality to fetch part of a database sequence. This can be essential if the database sequence is long, such as a chromosome, and the alignment to be presented only involves a small fraction of the sequence. Previously the entire database sequence was fetched and much of that sequence was not used. This improved functionality has led to a dramatic decrease in formatting time for searches against genomes.
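The benefit of fetching only the aligned part can be seen in a minimal sketch. It assumes, purely for illustration, that the sequence is stored as one byte per base in a seekable file-like object; real BLAST databases use their own storage format, and the function name is hypothetical.

```python
import io
from typing import BinaryIO


def fetch_range(handle: BinaryIO, start: int, end: int) -> str:
    """Read bases start..end (1-based, inclusive) without loading the
    rest of the sequence.

    Assumes one byte per base and no line breaks in the stored data.
    """
    if start < 1 or end < start:
        raise ValueError("expected 1 <= start <= end")
    handle.seek(start - 1)              # jump straight to the first base
    return handle.read(end - start + 1).decode("ascii")


if __name__ == "__main__":
    # A tiny in-memory stand-in for a chromosome-sized sequence file.
    chromosome = io.BytesIO(b"ACGTACGTAC")
    print(fetch_range(chromosome, 3, 6))
```

With this access pattern the cost of formatting an alignment depends on the length of the aligned region rather than on the length of the database sequence.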

BLAST provides several different modes for viewing BLAST results. The query-anchored view gives a stacked view of database sequences aligned to the query with indication of insertions and mismatches (3). This provides an easy method to scan alignments and locate features such as SNPs and amino acid substitutions among a group of related sequences. Previously the query-anchored views were not fully supported for BLASTX and TBLASTX searches that involved translated sequences. The formatter now supports this format for all these programs. Use the ‘Alignment View’ pull-down to enable this option (Figure 3).

From the BLAST results it is now possible to select some or all of the database sequences and perform an Entrez query to fetch them. Checking the boxes in the alignment section selects the sequences to download, and clicking the ‘Get Selected sequences’ button takes the user to Entrez, where the sequences can be displayed in various formats (such as GenBank or FASTA) and saved to a file. The saved file can then be used as input to another program.

Future directions
We are currently redesigning the BLAST web pages to make them more effective tools. Some of the changes will be better-organized HTML that makes options apparent to the user, for example making it easier to limit a search or results to a particular organism or subset of the data available. Results will also be made more user-friendly by better organizing the output. Nearing completion is a utility to calculate distances between sequences in the BLAST results and present those as a tree. Finally we are also working on making it possible to save search or formatting strategies for future use.


    ACKNOWLEDGEMENTS
 
The authors would like to acknowledge Richa Agarwala, Stephen Altschul, Kevin Bealer, Christiam Camacho, Peter Cooper, George Coulouris, Susan Dombrowski, Mike Gertz, David Lipman, Wayne Matten, Yuri Merezhuk, Alexander Morgulis, Jim Ostell, Jason Papadopoulos, Yan Raytselis, Eric Sayers, Alejandro Schaffer, Tao Tao, David Wheeler and Irena Zaretskaya, as well as members of the C++ toolkit group at the NCBI, for their work that has made this Web site possible. This research was supported by the Intramural Research Program of the NIH, National Library of Medicine. Funding to pay the Open Access publication charges for this article was provided by the National Institutes of Health.

Conflict of interest statement. None declared.


    REFERENCES
 

  1. Altschul, S.F., Gish, W., Miller, W., Myers, E.W., Lipman, D.J. (1990) Basic local alignment search tool. J. Mol. Biol., 215, 403–410.

  2. Altschul, S.F., Madden, T.L., Schaffer, A.A., Zhang, J., Zhang, Z., Miller, W., Lipman, D.J. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res., 25, 3389–3402.

  3. McGinnis, S. and Madden, T.L. (2004) BLAST: at the core of a powerful and diverse set of sequence analysis tools. Nucleic Acids Res., 32, W20–W25.

  4. Madden, T.L. (2002) The BLAST sequence analysis tool. In McEntyre, J. (Ed.), The NCBI Handbook [Internet]. Bethesda, MD: National Library of Medicine (US), National Center for Biotechnology Information.

  5. Schuler, G.D., Epstein, J.A., Ohkawa, H., Kans, J.A. (1996) Entrez: molecular biology database and retrieval system. Meth. Enzymol., 266, 141–162.

  6. Wootton, J.C. and Federhen, S. (1996) Analysis of compositionally biased regions in sequence databases. Meth. Enzymol., 266, 554–571.





