BLAST Analysis II
[Chapter 16]
Tom Madden
Summary
The comparison of nucleotide or protein sequences from the same or different organisms is a very
powerful tool in molecular biology. By finding similarities between sequences, scientists can infer
the function of newly sequenced genes, predict new members of gene families, and explore
evolutionary relationships. Now that whole genomes are being sequenced, sequence similarity
searching can be used to predict the location and function of protein-coding and
transcription-regulation regions in genomic DNA.
Basic Local Alignment Search Tool (BLAST) (1, 2) is the tool most frequently used for calculating
sequence similarity. BLAST comes in variations for use with different query sequences against
different databases. All BLAST applications, as well as information on which BLAST program to
use and other help documentation, are listed on the BLAST homepage. This chapter will focus more
on how BLAST works, its output, and how both the output and program itself can be further
manipulated or customized, rather than on how to use BLAST or interpret BLAST results.
Introduction
The way most people use BLAST is to input a nucleotide or protein sequence as a query
against all (or a subset of) the public sequence databases, pasting the sequence into the
textbox on one of the BLAST Web pages. This sends the query over the Internet, the search
is performed on the NCBI databases and servers, and the results are posted back to the
person's browser in the chosen display format. However, many biotech companies, genome
scientists, and bioinformatics personnel may want to use stand-alone BLAST to query
their own, local databases or want to customize BLAST in some way to make it better suit
their needs. Stand-alone BLAST comes in two forms: executables that can be run from
the command line, and the Standalone WWW BLAST Server, which allows users to set up
their own in-house versions of the BLAST Web pages.
There are many different variations of BLAST available to use for different sequence
comparisons, e.g., a DNA query to a DNA database, a protein query to a protein database,
and a DNA query, translated in all six reading frames, to a protein sequence database. Other
adaptations of BLAST, such as PSI-BLAST (for iterative protein sequence similarity
searches using a position-specific score matrix) and RPS-BLAST (for searching for protein
domains in the Conserved Domains Database, Chapter 3) perform comparisons against
sequence profiles.
This chapter will first describe the BLAST architecture (how it works at the NCBI site)
and then go on to describe the various BLAST outputs. The best known of these outputs is
the default display from BLAST Web pages, the so-called traditional report. As well as
obtaining BLAST results in the traditional report, results can also be delivered in structured
output, such as a hit table (see below), XML, or ASN.1. The optimal choice of output format
depends upon the application. The final part of the chapter discusses stand-alone BLAST
and describes possibilities for customization. There are many interfaces to BLAST that are
often not exploited by users but can lead to more efficient and robust applications.
[Figure 1, caption fragment] ... ASN.1. The results can then be formatted by fetching the ASN.1 (fetch ASN.1) and fetching the sequences (fetch sequence) from
the BLAST databases. Because the execution of the search algorithm is decoupled from the formatting, the results can be
delivered in a variety of formats without re-running the search.
The BLAST Formatter, which sits on the BLAST server, can use the information in the
SeqAlign to retrieve the similar sequences found and display them in a variety of ways.
Thus, once a query has been completed, the results can be reformatted without having to
re-execute the search. This is possible because of the QBLAST system.
The traditional report is designed to be read by people rather than parsed by
a program. For example, the one-line descriptions are useful for people to get a quick
overview of their search results, but they are rarely complete descriptors because of limited
space. Also, for convenience, there are several pieces of information that are displayed in
both the one-line descriptions and alignments (for example, the E-values, scores, and
descriptions); therefore, the person viewing the search output does not need to move back
and forth between sections.
New features may be added to the report (e.g., links from sequence hits to Entrez Gene records;
Chapter 19), and each such addition changes the output format. These changes are easy
for people to pick up on and take advantage of, but they can break programs that parse this BLAST
output.
By default, a maximum of 500 sequence matches are displayed; this limit can be changed on the
advanced BLAST page with the Alignments option. Many components of the BLAST
results displayed via the Internet are hyperlinked to the same information at different
places in the page, to additional information including help documentation, and to the Entrez
sequence records of matched sequences. These records provide more information about the
sequence, including links to relevant research abstracts in PubMed.
The screening of many newly sequenced human Expressed Sequence Tags (ESTs) for
contamination by the Escherichia coli cloning vector is a good example of when it is
preferable to use the hit table output over the traditional report. In this case, a strict E-value
threshold (a very low cutoff) would be applied to differentiate between contaminating E. coli sequence
and the human sequence. Those human ESTs that find very strong, near-exact E. coli
sequence matches can be discarded without further examination. (Borderline cases may
require further examination by a scientist.)
For these purposes, the hit table output is more useful than the traditional report; it contains
only the information required in a more formal structure. The hit table output contains no
sequences or definition lines, but for each sequence matched, it lists the sequence identifier,
the start and stop points for stretches of sequence similarity (one-offset, i.e., numbered from 1), the
percent identity of the match, and the E-value.
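The screening workflow described above can be sketched in a few lines of Python. This is a minimal illustration, not NCBI code: it assumes the commonly used 12-column, tab-separated hit table layout (query id, subject id, percent identity, alignment length, mismatches, gap openings, query start, query end, subject start, subject end, E-value, bit score), and the sample rows, the identifiers, and the two cutoffs (max_evalue and min_identity) are hypothetical values chosen only for the example.

```python
from typing import List, NamedTuple, Set


class Hit(NamedTuple):
    """The hit table fields that the screening step needs."""
    query_id: str
    subject_id: str
    percent_identity: float
    q_start: int
    q_end: int
    s_start: int
    s_end: int
    evalue: float


def parse_hit_table(text: str) -> List[Hit]:
    """Parse tab-separated hit table lines, skipping comments and blank lines."""
    hits = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        f = line.split("\t")
        # Assumed column order: see the paragraph introducing this sketch.
        hits.append(Hit(f[0], f[1], float(f[2]),
                        int(f[6]), int(f[7]), int(f[8]), int(f[9]),
                        float(f[10])))
    return hits


def contaminated_queries(hits: List[Hit],
                         max_evalue: float = 1e-50,
                         min_identity: float = 98.0) -> Set[str]:
    """Return query ids with at least one very strong, near-exact match.

    Both cutoffs are hypothetical; a real screen would tune them.
    """
    return {h.query_id for h in hits
            if h.evalue <= max_evalue and h.percent_identity >= min_identity}


# Hand-made sample rows (not real BLAST output).
SAMPLE = "\n".join([
    "# Fields: query id, subject id, % identity, ...",
    "EST_0001\tvector_ecoli\t99.50\t400\t2\t0\t1\t400\t1201\t1600\t1e-180\t720",
    "EST_0002\tvector_ecoli\t85.00\t60\t9\t0\t10\t69\t300\t359\t0.002\t40.1",
])

hits = parse_hit_table(SAMPLE)
flagged = contaminated_queries(hits)
print(sorted(flagged))
```

Because every row has the same fixed fields, the filter never has to interpret free text, which is the property that makes the hit table preferable to the traditional report for this kind of automated screening.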
examples of structured output in which there are built-in checks for correct and complete
syntax and structure. (In the case of XML, for example, this is ensured by the necessity for
matching tags and the DTD.) For text reports, there is often no specification, but perhaps a
(incomplete) description of the file is written afterward.
ASN.1 Is Used by the BLAST Server
As well as the hit table and traditional report shown in HTML, BLAST results can also be
formatted in plain text, XML, and ASN.1 (Figure 7). Moreover, the format for a given
BLAST result can be changed without re-executing the search.
A change in BLAST format without re-executing the search is possible because when a
scientist looks at a Web page of BLAST results at NCBI, the HTML that makes that page
has been created from ASN.1 (Figure 7). Although the formatted results are requested from
the server, the information about the alignments is fetched from a disk in ASN.1, as are the
corresponding sequences from the BLAST databases (see Figure 1). The formatter on the
BLAST server then puts these results together as a BLAST report. The BLAST search itself
has been uncoupled from the way the result is formatted, thus allowing different output
formats from the same search. The strict internal validation of ASN.1 ensures that these
output formats can always be produced reliably.
Information about the Alignment Is Contained within a SeqAlign
SeqAlign is the ASN.1 object that contains the alignment information about the BLAST
search. The SeqAlign does not contain the actual sequence that was found in the match but
does contain the start, stop, and gap information, as well as scores, E-values, sequence
identifiers, and (DNA) strand information.
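The idea that an alignment record stores coordinates and identifiers but not the sequences themselves can be illustrated with a small Python sketch. Everything here is a simplified stand-in rather than the actual ASN.1 SeqAlign or the BLAST formatter: the AlignmentRecord class, the in-memory DATABASE dictionary, the identifiers, and the two formatting functions are hypothetical, and gap information is omitted for brevity.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class AlignmentRecord:
    """Simplified stand-in for a SeqAlign: coordinates and scores, no sequence."""
    query_id: str
    subject_id: str
    q_start: int      # zero-offset, inclusive
    q_stop: int       # zero-offset, inclusive
    s_start: int
    s_stop: int
    score: float
    evalue: float
    strand: str


# Hypothetical stand-in for a BLAST database, keyed by unique identifier.
DATABASE: Dict[str, str] = {"lcl|subj1": "ACGTACGTTTGACCA"}


def format_hit_line(rec: AlignmentRecord) -> str:
    """One tab-separated line with one-offset coordinates; no sequence needed."""
    return "\t".join([rec.query_id, rec.subject_id,
                      str(rec.q_start + 1), str(rec.q_stop + 1),
                      str(rec.s_start + 1), str(rec.s_stop + 1),
                      f"{rec.evalue:.0e}"])


def format_with_sequence(rec: AlignmentRecord, database: Dict[str, str]) -> str:
    """Fetch the matched region from the database only at formatting time."""
    subject = database[rec.subject_id]
    segment = subject[rec.s_start:rec.s_stop + 1]
    return f">{rec.subject_id} {rec.s_start + 1}-{rec.s_stop + 1}\n{segment}"


rec = AlignmentRecord("lcl|query1", "lcl|subj1", 0, 5, 4, 9, 12.0, 1e-3, "plus")

# The same stored record yields two different views without a new search.
table_view = format_hit_line(rec)
sequence_view = format_with_sequence(rec, DATABASE)
print(table_view)
print(sequence_view)
```

The lookup in format_with_sequence also shows why identifiers must be unique: the record carries only the identifier, so an ambiguous identifier would make it impossible to know which sequence to fetch.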
As mentioned above, the actual database sequences are fetched from the BLAST databases
when needed. This means that an identifier must uniquely identify a sequence in the
database. Furthermore, the query sequence cannot have the same identifier as any sequence
in the database unless the query sequence itself is in the database. If one is using stand-alone
BLAST with a custom database, it is possible to specify that every sequence is uniquely
identified by using the -o option with formatdb (the program that converts FASTA files to
BLAST database format). This also indexes the entries by identifier. Similarly, the -J option
in the (stand-alone) programs blastall, blastpgp, megablast, or rpsblast certifies that the query
does not use an identifier already in the database for a different sequence. If the -o and -J
options are not used, BLAST assigns unique identifiers (for that run) to all sequences and
hides these identifiers from the user.
Any BLAST database or FASTA file from the NCBI Web site that contains gi numbers
already satisfies the uniqueness criterion. Unique identifiers are normally a problem only
when custom databases are produced and care is not taken in assigning identifiers. The
identifier for a FASTA entry is the first token (meaning the letters up to the first space) after
the > sign on the definition line. The simplest case is to have a unique token (e.g., 1,
2, and so on), but it is possible to construct more complicated identifiers that might, for
example, describe the data source. For the FASTA identifiers to be reliably parsed, it is
necessary for them to follow a specific syntax (see Appendix 1).
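Before building a custom database, the uniqueness criterion can be checked with a short script. The following Python sketch is only an illustration of the rule stated above (the identifier is the first token after the > sign); the function names and the sample entries are hypothetical, and the sketch splits on any whitespace rather than only on a space.

```python
from collections import Counter
from typing import List


def fasta_identifiers(fasta_text: str) -> List[str]:
    """Return the first token after '>' for every definition line."""
    ids = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            tokens = line[1:].split()
            ids.append(tokens[0] if tokens else "")
    return ids


def duplicated_identifiers(fasta_text: str) -> List[str]:
    """Return identifiers that occur on more than one definition line."""
    counts = Counter(fasta_identifiers(fasta_text))
    return sorted(ident for ident, n in counts.items() if n > 1)


# Hand-made sample with one identifier accidentally reused.
SAMPLE = """>lcl|seq1 first custom entry
ACGT
>lcl|seq2 second custom entry
GGCC
>lcl|seq1 accidental reuse of an identifier
TTAA
"""

ids = fasta_identifiers(SAMPLE)
duplicates = duplicated_identifiers(SAMPLE)
print(ids)
print(duplicates)
```

An empty result from duplicated_identifiers is a necessary condition for the database to satisfy the uniqueness criterion; it does not check that the identifiers follow the syntax in Appendix 1.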
More information on the SeqAlign produced by BLAST is available online, including as a
downloadable PowerPoint presentation, and in the NCBI Toolkit Software Developer's
handbook.
XML
XML and ASN.1 are both structured languages and can express the same information;
therefore, it is possible to produce a SeqAlign in XML. Some users do not find the format of
the information in the SeqAlign to be convenient because it does not contain actual sequence
information, and when the sequence is fetched from the BLAST database, it is packed two or
four bases per byte. Typically, these users are familiar with the BLAST report and want
something similar but in a format that can be parsed reliably. The XML produced by BLAST
meets this need, containing the query and database sequences, sequence definition lines, the
start and stop points of the alignments (one-offset), as well as scores, E-values, and percent
identity. There is a public DTD for this XML output.
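Because the XML is structured, a generic XML parser is enough to read it reliably. The Python sketch below uses only the standard library and a hand-made fragment; the element names (Hit, Hit_id, Hsp, Hsp_evalue, and so on) follow the BLAST XML DTD as commonly distributed, but the fragment is heavily abbreviated, its values and definition line are invented for illustration, and the function name summarize_hsps is hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List

# Abbreviated, hand-made fragment (not real BLAST output).
SAMPLE_XML = """<BlastOutput>
  <BlastOutput_iterations>
    <Iteration>
      <Iteration_hits>
        <Hit>
          <Hit_id>gi|129295</Hit_id>
          <Hit_def>hypothetical example definition line</Hit_def>
          <Hit_hsps>
            <Hsp>
              <Hsp_evalue>2e-30</Hsp_evalue>
              <Hsp_query-from>1</Hsp_query-from>
              <Hsp_query-to>60</Hsp_query-to>
              <Hsp_hit-from>11</Hsp_hit-from>
              <Hsp_hit-to>70</Hsp_hit-to>
              <Hsp_identity>57</Hsp_identity>
              <Hsp_align-len>60</Hsp_align-len>
            </Hsp>
          </Hit_hsps>
        </Hit>
      </Iteration_hits>
    </Iteration>
  </BlastOutput_iterations>
</BlastOutput>
"""


def summarize_hsps(xml_text: str) -> List[Dict]:
    """Collect identifier, ranges, E-value, and percent identity per HSP."""
    root = ET.fromstring(xml_text)
    rows = []
    for hit in root.iter("Hit"):
        hit_id = hit.findtext("Hit_id")
        for hsp in hit.iter("Hsp"):
            identity = int(hsp.findtext("Hsp_identity"))
            align_len = int(hsp.findtext("Hsp_align-len"))
            rows.append({
                "hit_id": hit_id,
                "query_range": (int(hsp.findtext("Hsp_query-from")),
                                int(hsp.findtext("Hsp_query-to"))),
                "hit_range": (int(hsp.findtext("Hsp_hit-from")),
                              int(hsp.findtext("Hsp_hit-to"))),
                "evalue": float(hsp.findtext("Hsp_evalue")),
                # Percent identity derived from identities / alignment length.
                "percent_identity": 100.0 * identity / align_len,
            })
    return rows


rows = summarize_hsps(SAMPLE_XML)
for row in rows:
    print(row)
```

Looking fields up by element name, rather than by position in a text report, means the parser keeps working when new elements are added to the output.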
BLAST Code
The BLAST code is part of the NCBI Toolkit, which has many low-level functions to make
it platform independent; the Toolkit is supported under Linux and many varieties of UNIX,
NT, and MacOS. To use the Toolkit, developers should write a function Main, which is
called by the Toolkit main. The BLAST code is contained mostly in the tools directory
(see Appendix 2 for an example).
The BLAST code has a modular design. For example, the Application Programming
Interface (API) for retrieval from the BLAST databases is independent of the compute
engine. The compute engine is independent from the formatter; therefore, it is possible (as
mentioned above) to compute results once but view them in many different modes.
Readdb API
The readdb API can be used to easily extract information from the BLAST databases.
Among the data available are the date the database was produced, the title, the number of
letters, number of sequences, and the longest sequence. Also available are the sequence and
description of any entry. The latest version of the BLAST databases also contains a taxid (an
integer specifying some node of the NCBI taxonomy tree; see Chapter 4). Users are strongly
encouraged to use the readdb API rather than reading the files associated with the database,
because the files are subject to change. The API, on the other hand, will support the
newest version, and an attempt will be made to support older versions. See Appendix 2 for
an example of a simple program (db2fasta.c) that demonstrates the use of the readdb API.
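To make the summary quantities concrete (number of sequences, number of letters, longest sequence), the Python sketch below computes them from a small FASTA text. It is not the readdb API and does not read BLAST database files; the function names and sample records are hypothetical, and the code only illustrates what the reported values mean.

```python
from typing import Dict, List, Tuple


def read_fasta(text: str) -> List[Tuple[str, str]]:
    """Return (identifier, sequence) pairs from FASTA-formatted text."""
    records = []
    ident, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            if ident is not None:
                records.append((ident, "".join(chunks)))
            tokens = line[1:].split()
            ident = tokens[0] if tokens else ""
            chunks = []
        elif line and ident is not None:
            chunks.append(line)
    if ident is not None:
        records.append((ident, "".join(chunks)))
    return records


def database_summary(records: List[Tuple[str, str]]) -> Dict[str, int]:
    """Compute the summary values that a database report would list."""
    lengths = [len(seq) for _, seq in records]
    return {
        "number_of_sequences": len(records),
        "number_of_letters": sum(lengths),
        "longest_sequence": max(lengths, default=0),
    }


# Hand-made sample; the second record spans two lines.
SAMPLE = """>lcl|a first entry
ACGTACGT
>lcl|b second entry
ACGTAC
GTACGT
>lcl|c third entry
TTGA
"""

records = read_fasta(SAMPLE)
summary = database_summary(records)
print(summary)
```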
Performing a BLAST Search with C Function Calls
Only a few function calls are needed to perform a BLAST search. Appendix 3 shows an
excerpt from a demonstration program, doblast.c.
Formatting a SeqAlign
MySeqAlignPrint (called in the example in Appendix 3) is a simple function to print a view
of a SeqAlign (see Appendix 4).
Table 1
Database name                    Identifier syntax
GenBank                          gb|accession|locus
EMBL Data Library                emb|accession|locus
DDBJ, DNA Database of Japan      dbj|accession|locus
NBRF PIR                         pir||entry
Protein Research Foundation      prf||name
SWISS-PROT                       sp|accession|entry name
Brookhaven Protein Data Bank     pdb|entry|chain
Patents                          pat|country|number
GenInfo Backbone Id              bbs|number
General database identifier      gnl|database|identifier
NCBI Reference Sequence          ref|accession|locus
Local Sequence identifier        lcl|identifier
gnl allows databases not included in this list to use the same identifying syntax. This is used for sequences in the trace databases, e.g., gnl|ti|53185177. The combination of the second and third fields should be unique.
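The bar-delimited syntax in Table 1 is simple to split programmatically. The Python sketch below is a minimal illustration, with hypothetical function names, of separating the database tag from the remaining fields and of building the key that must be unique for a gnl identifier. It does not validate identifiers against the full syntax.

```python
from typing import List, Tuple


def split_identifier(identifier: str) -> Tuple[str, List[str]]:
    """Split 'tag|field|field' into the database tag and its fields."""
    fields = identifier.split("|")
    return fields[0], fields[1:]


def gnl_key(identifier: str) -> Tuple[str, str]:
    """Return (database, identifier) for a gnl entry; this pair must be unique."""
    tag, fields = split_identifier(identifier)
    if tag != "gnl" or len(fields) != 2:
        raise ValueError(f"not a gnl|database|identifier string: {identifier!r}")
    return fields[0], fields[1]


trace = split_identifier("gnl|ti|53185177")
genbank = split_identifier("gb|AAH06776.1|AAH0676")
pir = split_identifier("pir||entry")   # the accession field is empty
key = gnl_key("gnl|ti|53185177")

print(trace)
print(genbank)
print(pir)
print(key)
```

Note that an empty field, as in the pir and prf forms, is preserved by the split, so the position of each field stays meaningful.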
[Excerpt from db2fasta.c; parts of the listing are missing]
        return (1);
    }
    ...
        is_prot = TRUE;
    else
    ...
    fp = FileOpen("stdout", "w");
    ...
    bsp =
    ...
    rdfp = readdb_destruct(rdfp);
    return 0;
}
Note that:
1. Readdb_acc2fasta fetches the ordinal number (zero offset) of the record given a
FASTA identifier (e.g., gb|AAH06776.1|AAH0676).
Note also that Main is called, rather than main, and a call to GetArgs is used to get the
command-line arguments. db2fasta.c is contained in the tar archive
ftp://ftp.ncbi.nih.gov/blast/demo/blast_demo.tar.gz.
[3].floatvalue;

    /* Perform the actual search. */
    seqalign = BioseqBlastEngine(query_bsp, blast_program, blast_database, options,
                                 NULL, NULL, NULL);

    /* Do something with the SeqAlign... */
    MySeqAlignPrint(seqalign, outfp);

    /* Clean up. */
    seqalign = SeqAlignSetFree(seqalign);
    options = BLASTOptionDelete(options);
    sep = SeqEntryFree(sep);
    FileClose(infp);
    FileClose(outfp);
The BLASTOptionBlk structure contains a large number of members. The most useful ones
and a brief description for each are listed in Table 2.
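Table 2
Type (a)       Element           Description
Nlm_FloatHi    expect_value      Expect value (E-value) cutoff
Int2           wordsize          Word size used for the initial matches
Int2           penalty           Penalty for a nucleotide mismatch
Int2           reward            Reward for a nucleotide match
CharPtr        matrix            Name of the scoring matrix
Int4           gap_open          Cost to open a gap
Int4           gap_extend        Cost to extend a gap
CharPtr        filter_string     Query filtering options
Int4           hitlist_size      Number of database sequences for which hits are kept
Int2           number_of_cpus    Number of CPUs to use
(a) The types are given in terms of those in the NCBI Toolkit. Nlm_FloatHi is a double, Int2/Int4 are 2- or 4-byte integers, and CharPtr is just
char*.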
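    /* Loop reconstructed from the notes below; argument lists are inferred. */
    while (seqalign)
    {
        /* Identifiers of the query (index 0) and target (index 1). */
        query_id = SeqAlignId(seqalign, 0);
        SeqIdWrite(query_id, query_id_buf, PRINTID_FASTA_LONG, BUFFER_LEN);
        target_id = SeqAlignId(seqalign, 1);
        SeqIdWrite(target_id, target_id_buf, PRINTID_FASTA_LONG, BUFFER_LEN);
        fprintf(outfp, "%s:%ld-%ld\t%s:%ld-%ld\n",
                query_id_buf, (long) SeqAlignStart(seqalign, 0),
                (long) SeqAlignStop(seqalign, 0),
                target_id_buf, (long) SeqAlignStart(seqalign, 1),
                (long) SeqAlignStop(seqalign, 1));
        seqalign = seqalign->next;
    }
    return;
}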
Note that:
1. SeqAlignId gets the sequence identifier for the zero-th identifier (zero offset). This
is actually a C structure.
2. SeqIdWrite formats the information in query_id into a FASTA identifier (e.g., gi|129295)
and places it into query_id_buf.
3. SeqAlignStart and SeqAlignStop return the start and stop values of the zero-th and first
sequences (or first and second, counting from one).
All of this is done by high-level function calls, and it is not necessary to write low-level
function calls to parse the ASN.1.
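For readers who do not use the Toolkit, the traversal pattern of MySeqAlignPrint can be mirrored in Python. The sketch below is an analogue, not a binding to the NCBI Toolkit: the Alignment class and its fields are hypothetical, the identifiers and coordinates are invented, and only the two ideas taken from the listing are kept, namely following the next link until it is empty and writing one "id:start-stop" pair per sequence, separated by a tab.

```python
import io
from dataclasses import dataclass
from typing import Optional, TextIO


@dataclass
class Alignment:
    """Hypothetical stand-in for one node in a chain of alignments."""
    query_id: str
    target_id: str
    query_start: int
    query_stop: int
    target_start: int
    target_stop: int
    next: Optional["Alignment"] = None


def print_alignments(first: Optional[Alignment], out: TextIO) -> None:
    """Write one line per alignment, following the chain of next links."""
    node = first
    while node is not None:
        out.write(f"{node.query_id}:{node.query_start}-{node.query_stop}\t"
                  f"{node.target_id}:{node.target_start}-{node.target_stop}\n")
        node = node.next


# Two invented alignments between the same pair of sequences.
second = Alignment("lcl|query1", "gi|129295", 40, 90, 10, 60)
first = Alignment("lcl|query1", "gi|129295", 0, 30, 100, 130, next=second)

buffer = io.StringIO()
print_alignments(first, buffer)
report = buffer.getvalue()
print(report, end="")
```

The coordinates are written exactly as stored; whether to add one before printing, to convert zero-offset values to one-offset values, is a choice the caller has to make consistently.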
References
1. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. Basic Local Alignment Search Tool. J
Mol Biol 1990;215:403-410. [PubMed: 2231712]
2. Altschul SF, Madden TL, Schäffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ. Gapped BLAST
and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res
1997;25:3389-3402. [PubMed: 9254694]