BLAST Script

This document provides an overview of the BLAST and FASTA algorithms for comparing biological sequences and detecting similarities. It discusses how BLAST uses seeds of exact word matches to rapidly find local alignments between a query and database, before extending these hits to identify high-scoring segment pairs. It also describes how BLAST statistically analyzes matches to assess their significance based on Karlin-Altschul theory and Poisson distribution models of random sequences.

Uploaded by

Mousami Srivastava


Algorithms in Bioinformatics I, WS06/07, C. Dieterich

5 BLAST and FASTA

This lecture is based on the following, which are all recommended reading:
D.J. Lipman and W.R. Pearson. Rapid and sensitive protein similarity searches. Science 227:1435-1441 (1985).
W.R. Pearson and D.J. Lipman. Improved tools for biological sequence comparison. Proc. Natl. Acad. Sci. USA 85:2444-2448 (1988).
S.F. Altschul, W. Gish, W. Miller, E.W. Myers and D.J. Lipman. Basic local alignment search tool. J. Molecular Biology 215:403-410 (1990).
http://www.ncbi.nlm.nih.gov/BLAST/tutorial/Altschul-1.html
D. Gusfield. Algorithms on Strings, Trees and Sequences, pp. 376-381, 1997.
I. Eidhammer, I. Jonassen and W.R. Taylor. Protein Bioinformatics, pp. 34-64, 2004.

5.1 Introduction

Pairwise alignment is used to detect homologies between different protein or DNA sequences, e.g. as global or local alignments. This problem can be solved by dynamic programming in time proportional to the product of the lengths of the two sequences being compared. However, this is too slow for searching current databases, and in practice algorithms are used that run much faster, at the expense of possibly missing some significant hits due to the heuristics employed. Such algorithms are usually seed-and-extend approaches in which first small exact matches are found, which are then extended to obtain long inexact ones.

The following strategies are designed such that a large part of the computation is done in advance. The query is broken up into short substrings, and then those database entries are searched for that contain a certain number of these and similar substrings in the right order. For biological sequences of length $n$ there are $4^n$ (for the DNA alphabet $\Sigma = \{A, G, C, T\}$) or $20^n$ (for the amino acid alphabet) different strings. For $n$ not too large, it is possible to store all of them in a hash table. Since building such a table is very time-intensive, a certain constancy of the database entries is assumed. The large sequence databases are therefore split into two parts: a constant part, containing all the sequences that were used for the first table, and a dynamic part, containing all new entries.
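The hash-table idea can be illustrated with a minimal Python sketch. The function names `build_word_index` and `word_space` are hypothetical and chosen only for this illustration; real sequence databases use far more compact index structures.

```python
from collections import defaultdict

def build_word_index(seq, n):
    """Map every length-n substring of seq to the list of its start positions."""
    index = defaultdict(list)
    for pos in range(len(seq) - n + 1):
        index[seq[pos:pos + n]].append(pos)
    return index

def word_space(alphabet_size, n):
    """Number of different strings of length n over an alphabet of the given size."""
    return alphabet_size ** n

index = build_word_index("ACGTACGA", 3)
print(index["ACG"])        # start positions of the word ACG
print(word_space(4, 3))    # DNA words of length 3
print(word_space(20, 3))   # protein words of length 3
```

Once the index exists, looking up a query word costs one hash access, which is why the expensive table construction is done only for the constant part of the database.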


5.2 BLAST

BLAST, the Basic Local Alignment Search Tool (Altschul et al., 1990), is perhaps the most widely used bioinformatics tool ever written. It is an alignment heuristic that determines local alignments between a query and a database, approximating the result of the Smith-Waterman algorithm. BLAST consists of two components: a search algorithm and a computation of the statistical significance of solutions.

BLAST starts by locating pairs of substrings (so-called segment pairs or hits) in the two sequences that have a certain similarity score. The hits are the starting points for deriving HSPs (high-scoring segment pairs), locally optimal segment pairs that contain the hit. Extending an HSP further to the left or right would lead to a lower score.

5.2.1 BLAST terminology

Definition 5.1. Let $q$ be the query and $d$ the database. A segment is simply a substring $s$ of $q$ or $d$. A segment-pair $(s, t)$ (or hit) consists of two segments, one in $q$ and one in $d$, of the same length.

Example:

V A L L A R
P A M M A R

We think of $s$ and $t$ as being aligned without gaps and score this alignment using a substitution score matrix, e.g. BLOSUM or PAM in the case of protein sequences. The alignment score for $(s, t)$ is denoted by $\sigma(s, t)$.

Definition 5.2. A locally maximal segment pair (LMSP) is any segment pair $(s, t)$ whose score cannot be improved by shortening or extending the segment pair. A maximum segment pair (MSP) is any segment pair $(s, t)$ of maximal alignment score $\sigma(s, t)$. Given a cutoff score $S$, a segment pair $(s, t)$ is called a high-scoring segment pair (HSP) if it is locally maximal and $\sigma(s, t) \geq S$. Finally, a word is simply a short substring of fixed length $w$.

Given $S$, the goal of BLAST is to compute all HSPs.
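To make the definitions concrete, here is a small Python sketch that scores a segment pair without gaps and finds its best-scoring contiguous sub-pair. The scoring function `toy_score` (+2 for a match, -1 for a mismatch) is an assumption made purely for illustration; it is not BLOSUM or PAM, and the helper names are hypothetical.

```python
def segment_pair_score(s, t, score):
    """Ungapped alignment score of two equal-length segments."""
    if len(s) != len(t):
        raise ValueError("segments of a segment pair must have the same length")
    return sum(score(a, b) for a, b in zip(s, t))

def toy_score(a, b):
    # Hypothetical scoring: +2 for a match, -1 for a mismatch (not BLOSUM/PAM).
    return 2 if a == b else -1

def best_subsegment(s, t, score):
    """Return (best_score, start, end) of the best contiguous sub-pair (Kadane scan)."""
    best = (0, 0, 0)
    run_score, run_start = 0, 0
    for i, (a, b) in enumerate(zip(s, t)):
        if run_score <= 0:
            # A non-positive prefix never helps, so restart the run here.
            run_score, run_start = 0, i
        run_score += score(a, b)
        if run_score > best[0]:
            best = (run_score, run_start, i + 1)
    return best

print(segment_pair_score("VALLAR", "PAMMAR", toy_score))
print(best_subsegment("VALLAR", "PAMMAR", toy_score))
```

Under this toy scoring, the full pair VALLAR/PAMMAR is not locally maximal, because shortening it to the final AR/AR improves the score.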

5.2.2 The BLAST algorithm

Three parameters are given: a word size $w$, a word similarity threshold $T$ and a minimum cut-off score $S$. We are looking for segment pairs with a score of at least $S$ that contain at least one word pair of length $w$ with score at least $T$.

For protein sequences, BLAST operates as follows:

Preprocessing: First, all words of length $w$ in the query sequence $q$ are generated. Then a list of all $w$-mers over the alphabet that have similarity $> T$ to some word in the query sequence $q$ is generated.

Example: For the query sequence RQCSAGW, the words of length $w = 2$ and the 2-mers with a score $> T = 8$ using the BLOSUM62 matrix are:

word    2-mers with score > 8
RQ      RQ
QC      QC, RC, EC, NC, DC, KC, MC, SC
CS      CS, CA, CN, CD, CQ, CE, CG, CK, CT
SA      (none)
AG      AG
GW      GW, AW, RW, NW, DW, QW, EW, HW, KW, PW, SW, TW, WW
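The preprocessing step can be sketched in Python as follows. To keep the block self-contained, it uses a hypothetical four-letter scoring function `toy_score` instead of BLOSUM62, so its output is not the table above; the function name `neighborhood` is also an assumption. The sketch enumerates all candidate words by brute force and therefore does not achieve the list-proportional running time that a careful implementation can reach.

```python
from itertools import product

def neighborhood(query, w, threshold, alphabet, score):
    """For each word of length w in query, list all w-mers scoring > threshold against it."""
    result = {}
    for pos in range(len(query) - w + 1):
        word = query[pos:pos + w]
        neighbors = []
        for cand in product(alphabet, repeat=w):
            total = sum(score(a, b) for a, b in zip(word, cand))
            if total > threshold:
                neighbors.append("".join(cand))
        result[word] = neighbors
    return result

def toy_score(a, b):
    # Hypothetical DNA-like scores, not BLOSUM62:
    # match +5, transition (A<->G, C<->T) +1, transversion -4.
    if a == b:
        return 5
    transitions = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}
    return 1 if (a, b) in transitions else -4

words = neighborhood("ACG", 2, 5, "ACGT", toy_score)
for word, similar in words.items():
    print(word, similar)
```

Raising the threshold shrinks each list towards the query word itself, which is the trade-off between speed and sensitivity controlled by $T$.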

The search algorithm then consists of three steps:

1. Localization of the hits: The database sequence $d$ is scanned for all hits $t$ of $w$-mers $s$ in the list, and the position of each hit is saved.
2. Detection of hits: All pairs of hits are searched for that have a distance of at most $A$ (think of them as lying on the same diagonal in the matrix of the SW algorithm).
3. Extension to HSPs: Each such seed $(s, t)$ is extended in both directions until its score $\sigma(s, t)$ cannot be enlarged (LMSP). Then all best extensions that have score $\geq S$ are reported; these are the HSPs.

Originally the extension did not include gaps; the modern BLAST2 algorithm allows the insertion of gaps. In practice, $w = 3$ and $A = 40$ for proteins.

With care, the list of all words of length $w$ that have similarity $> T$ to some word in the query sequence $q$ can be produced in time proportional to the number of words in the list. These are placed in a keyword tree and then, for each word in the tree, all exact locations of the word in the database $d$ are detected in time linear in the length of $d$.
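The three steps can be sketched in Python under simplifying assumptions: only exact word matches are used as hits (no neighborhood words), hits are paired only with their nearest neighbor on the same diagonal, and the extension stops once the running score has dropped more than a hypothetical `x_drop` below the best score seen. All function names are illustrative and do not correspond to the actual BLAST implementation.

```python
from collections import defaultdict

def find_hits(query, db, w):
    """Step 1 (exact words only): all (i, j) with query[i:i+w] == db[j:j+w]."""
    index = defaultdict(list)
    for i in range(len(query) - w + 1):
        index[query[i:i + w]].append(i)
    return [(i, j) for j in range(len(db) - w + 1) for i in index.get(db[j:j + w], [])]

def two_hit_seeds(hits, w, max_dist):
    """Step 2: neighboring non-overlapping hits on one diagonal, at most max_dist apart."""
    by_diag = defaultdict(list)
    for i, j in hits:
        by_diag[j - i].append(i)
    seeds = []
    for diag, starts in by_diag.items():
        starts.sort()
        for a, b in zip(starts, starts[1:]):
            if w <= b - a <= max_dist:
                seeds.append((a, a + diag, b, b + diag))
    return seeds

def extend_one_side(query, db, i, j, step, score, x_drop):
    """Walk from (i, j) in direction step; return (best gain, number of letters kept)."""
    gain = best = best_len = n = 0
    while 0 <= i < len(query) and 0 <= j < len(db):
        gain += score(query[i], db[j])
        n += 1
        if gain > best:
            best, best_len = gain, n
        elif best - gain > x_drop:
            break  # score has faded too far, stop extending
        i += step
        j += step
    return best, best_len

def extend_hit(query, db, i, j, w, score, x_drop):
    """Step 3: ungapped extension; returns (score, query_start, query_end)."""
    core = sum(score(query[i + k], db[j + k]) for k in range(w))
    right, r_len = extend_one_side(query, db, i + w, j + w, 1, score, x_drop)
    left, l_len = extend_one_side(query, db, i - 1, j - 1, -1, score, x_drop)
    return core + left + right, i - l_len, i + w + r_len

def unit_score(a, b):
    # Hypothetical scoring: +1 for a match, -1 for a mismatch.
    return 1 if a == b else -1

query, db = "ACGTTGCA", "TTACGTAGCATT"
hits = find_hits(query, db, 3)
print(hits)
print(two_hit_seeds(hits, 3, 40))
print(extend_hit(query, db, 0, 2, 3, unit_score, 2))
```

In this example the extension bridges a single mismatch because the score recovers before the drop-off limit is reached; the resulting segment pair would be reported as an HSP only if its score is at least $S$.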


As an alternative to storing the words in a tree, a finite-state machine can be used, which Altschul et al. found to be the faster implementation. As BLAST does not allow indels at that stage, hit extension is very fast. Note that the use of seeds of length $w$ and the termination of extensions with fading scores are both steps that speed up the algorithm, but they also imply that BLAST is not guaranteed to find all HSPs (after all, it is a heuristic).

For DNA sequences, BLAST operates as follows:

The list of all words of length $w$ in the query sequence $q$ is generated. In practice, $w = 12$ for DNA.
The database $d$ is scanned for all hits of words in this list. BLAST uses a two-bit encoding for DNA. This saves space and also search time, as four bases are encoded per byte.

Note that the $T$ parameter dictates the speed and sensitivity of the search.
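A two-bit encoding can be sketched as follows. The particular assignment of codes to bases and the padding of a short final byte are assumptions for this illustration; the actual on-disk format used by BLAST is not described in the lecture.

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}  # hypothetical assignment of 2-bit codes
BASES = "ACGT"

def pack(seq):
    """Pack a DNA string into bytes, four bases per byte (first base in the high bits)."""
    out = bytearray()
    for start in range(0, len(seq), 4):
        chunk = seq[start:start + 4]
        byte = 0
        for base in chunk:
            byte = (byte << 2) | CODE[base]
        byte <<= 2 * (4 - len(chunk))  # pad a short final chunk with zero bits
        out.append(byte)
    return bytes(out)

def unpack(data, length):
    """Recover the first `length` bases from packed bytes."""
    bases = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            bases.append(BASES[(byte >> shift) & 3])
    return "".join(bases[:length])

packed = pack("ACGTAC")
print(len(packed), unpack(packed, 6))
```

Because the padding bits are indistinguishable from the code for A, the sequence length has to be stored separately, which is why `unpack` takes it as an argument.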

5.2.3 The BLAST family

There are a number of different variants of the BLAST program:

BLASTN: compares a DNA query sequence to a DNA sequence database; $q_{\mathrm{DNA}}$ vs. $s_{\mathrm{DNA}}$.
BLASTP: compares a protein query sequence to a protein sequence database; $q_{\mathrm{protein}}$ vs. $s_{\mathrm{protein}}$.
TBLASTN: compares a protein query sequence to a DNA sequence database (6-frame translation); $q_{\mathrm{protein}}$ vs. each of $s_{t_1(\mathrm{DNA})}$, $s_{t_2(\mathrm{DNA})}$, $s_{t_3(\mathrm{DNA})}$, $s_{t^c_1(\mathrm{DNA})}$, $s_{t^c_2(\mathrm{DNA})}$, $s_{t^c_3(\mathrm{DNA})}$.
BLASTX: compares a DNA query sequence (6-frame translation) to a protein sequence database; each of $q_{t_1(\mathrm{DNA})}$, $q_{t_2(\mathrm{DNA})}$, $q_{t_3(\mathrm{DNA})}$, $q_{t^c_1(\mathrm{DNA})}$, $q_{t^c_2(\mathrm{DNA})}$, $q_{t^c_3(\mathrm{DNA})}$ vs. $s_{\mathrm{protein}}$.
TBLASTX: compares a DNA query sequence (6-frame translation) to a DNA sequence database (6-frame translation); $q_{t_1(\mathrm{DNA})}$ vs. $s_{t_1(\mathrm{DNA})}$, ..., $q_{t^c_3(\mathrm{DNA})}$ vs. $s_{t^c_3(\mathrm{DNA})}$.
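The 6-frame translation used by TBLASTN, BLASTX and TBLASTX can be sketched in Python with the standard genetic code. The frame labels `t1`..`t3` and `tc1`..`tc3` mirror the notation above; how the frames of the complementary strand are numbered is an assumption here (they are taken from the start of the reverse complement), and trailing bases that do not fill a codon are simply dropped.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order; '*' marks a stop codon.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def translate(dna):
    """Translate complete codons of dna, starting at position 0."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))

def six_frames(dna):
    """Translations of the three forward frames and the three reverse-complement frames."""
    rc = dna.translate(COMPLEMENT)[::-1]
    frames = {}
    for k in range(3):
        frames[f"t{k + 1}"] = translate(dna[k:])
        frames[f"tc{k + 1}"] = translate(rc[k:])
    return frames

frames = six_frames("ATGGCCTAA")
for name, protein in frames.items():
    print(name, protein)
```

Each of the six protein strings is then compared like an ordinary protein sequence, which is why the translated searches cost several times as much as BLASTP on the same data.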


5.3 Statistical analysis

Once a local alignment has been computed, one needs to assess its statistical significance. This is done by the general approach of hypothesis testing.

5.3.1 Poisson distribution

The Karlin and Altschul theory for local alignments (without gaps) is based on Poisson and extreme value distributions. The details of that theory are beyond the scope of this lecture, but the basics are sketched in the following.

Definition 5.3. The Poisson distribution with parameter $v$ is given by

$$P(X = x) = \frac{v^x}{x!} e^{-v} \qquad (5.1)$$

Note that $v$ is the expected value as well as the variance. From this equation it follows that the probability that a variable $X$ will have a value of at least $x$ is

$$P(X \geq x) = 1 - \sum_{i=0}^{x-1} \frac{v^i}{i!} e^{-v} \qquad (5.2)$$
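Equations (5.1) and (5.2) translate directly into a few lines of Python. The function names are chosen for this illustration only.

```python
import math

def poisson_pmf(x, v):
    """P(X = x) for a Poisson distribution with parameter v, equation (5.1)."""
    return v ** x / math.factorial(x) * math.exp(-v)

def poisson_at_least(x, v):
    """P(X >= x), equation (5.2): one minus the probabilities of 0, ..., x-1."""
    return 1.0 - sum(poisson_pmf(i, v) for i in range(x))

v = 2.0
print(poisson_pmf(0, v))        # probability of seeing no event
print(poisson_at_least(1, v))   # probability of seeing at least one event
```

For $x = 1$ the sum in (5.2) has a single term, so $P(X \geq 1) = 1 - e^{-v}$.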

5.3.2 Statistical significance of an HSP

Given an HSP $(s, t)$ with score $\sigma(s, t)$, how significant is this match (i.e., local alignment)? To analyze how high a score is likely to arise by chance, a model of random sequences over the alphabet $\Sigma$ is needed. For proteins, the simplest model chooses the amino acid residues in a sequence independently, with specific background probabilities $p_a$ for the various residues $a \in \Sigma$. Given the scoring matrix $S(a, b)$, the expected score for aligning a random pair of amino acids is required to be negative:

$$E = \sum_{a,b} p_a p_b S(a, b) < 0$$
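As a worked example of this condition, the following sketch computes the expected score for a DNA scoring scheme of +1 for a match and -3 for a mismatch (the scheme shown as "blastn matrix:1 -3" in the BLAST output in Section 5.3.4). The uniform background probabilities of 0.25 per base are an assumption made for the illustration.

```python
def expected_score(probs, score):
    """Sum of p_a * p_b * S(a, b) over all pairs of letters."""
    return sum(probs[a] * probs[b] * score(a, b) for a in probs for b in probs)

def match_mismatch(a, b):
    # +1 for a match, -3 for a mismatch.
    return 1 if a == b else -3

uniform = {base: 0.25 for base in "ACGT"}  # assumed background probabilities
e = expected_score(uniform, match_mismatch)
print(e)
```

A random pair matches with probability 1/4, so the expected score is $0.25 \cdot 1 + 0.75 \cdot (-3) = -2$, which satisfies the requirement.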

Were this not the case, long alignments would tend to have high scores independently of whether the segments aligned were related, and the statistical theory would break down. Just as the sum of a large number of independent identically distributed (i.i.d.) random variables tends to a normal distribution, the maximum of a large number of i.i.d. random variables tends to an extreme value distribution, as we will see below. Assume that the lengths $m$ and $n$ of the query and database, respectively, are sufficiently large.


HSP scores are characterized by two parameters, $K$ and $\lambda$. The parameters $K$ and $\lambda$ depend on the background probabilities of the symbols and on the employed scoring matrix. $\lambda$ is the unique positive value for $y$ that satisfies the equation

$$\sum_{a,b} p_a p_b e^{S(a,b) y} = 1$$
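The equation has no closed-form solution in general, but it can be solved numerically. The sketch below uses bisection, which is a choice made here for simplicity and not necessarily what BLAST does; the +1/-3 scoring and uniform base frequencies are again assumptions. The trivial solution $y = 0$ is excluded by starting the search slightly above zero.

```python
import math

def solve_lambda(probs, score, hi=10.0, tol=1e-12):
    """Positive root y of sum_{a,b} p_a p_b exp(S(a,b) y) = 1, found by bisection."""
    def f(y):
        return sum(probs[a] * probs[b] * math.exp(score(a, b) * y)
                   for a in probs for b in probs) - 1.0
    lo = 1e-6  # f is slightly negative just above 0 when the expected score is negative
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if f(mid) < 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def match_mismatch(a, b):
    # +1 for a match, -3 for a mismatch.
    return 1 if a == b else -3

uniform = {base: 0.25 for base in "ACGT"}  # assumed background probabilities
lam = solve_lambda(uniform, match_mismatch)
print(round(lam, 3))
```

Under these assumptions the result is close to the value Lambda 1.37 reported at the end of the BLASTN output in Section 5.3.4.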

$K$ and $\lambda$ are scaling factors for the search space and for the scoring scheme, respectively. The number of random HSPs $(s, t)$ with $\sigma(s, t) \geq S$ can be described by a Poisson distribution with parameter $v = K m n e^{-\lambda S}$. The number of HSPs with score $\geq S$ that we expect to see due to chance is then the parameter $v$, also called the E-value:

$$E(S) = K m n e^{-\lambda S} \qquad (5.3)$$

Note that doubling the query or database length should double the number of HSPs attaining the given score. However, doubling the cutoff score to $2S$ should cause an exponential drop-off of HSPs attaining the given score, as each HSP will have to achieve the score $S$ twice in a row.

Hence, the probability of finding exactly $x$ HSPs with a score $\geq S$ is given by

$$P(X = x) = e^{-E} \frac{E^x}{x!}, \qquad (5.4)$$

where $E$ is the E-value for $S$. The probability of finding at least one HSP by chance is

$$P(S) = 1 - P(X = 0) = 1 - e^{-E}. \qquad (5.5)$$
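Equations (5.3) and (5.5) can be evaluated as follows. The numbers in the usage example are taken from the BLASTN output in Section 5.3.4 (effective query length 597, effective database length 70,298,539, K = 0.711, Lambda = 1.37, raw score 17); since the printed Lambda is rounded, the result only approximates the E-value that BLAST reports for that hit.

```python
import math

def e_value(raw_score, m, n, K, lam):
    """Expected number of chance HSPs with score >= raw_score, equation (5.3)."""
    return K * m * n * math.exp(-lam * raw_score)

def p_value(e):
    """Probability of at least one chance HSP, equation (5.5)."""
    return -math.expm1(-e)  # equals 1 - exp(-e), but stays accurate for tiny e

e = e_value(17, 597, 70298539, 0.711, 1.37)
print(e, p_value(e))
```

For very small E-values, $1 - e^{-E} \approx E$, so E-value and P-value are then nearly identical; they differ noticeably only when $E$ approaches or exceeds 1, as in this example.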

Thus we see that the probability distribution of the scores follows an extreme value distribution. BLAST reports E-values rather than P-values, as it is easier to interpret the difference between E-values than to interpret the difference between P-values.

We would like to hide the parameters $K$ and $\lambda$ to make it easier to compare results from different BLAST searches. For a given HSP $(s, t)$ we transform the raw score $S$ into a bit-score:

$$S' := \frac{\lambda S - \ln K}{\ln 2}. \qquad (5.6)$$

Such bit-scores can be compared between different BLAST searches, as the parameters of the given scoring systems are subsumed in them.


To determine the significance of a given bit-score $S'$, the only additional value required is the size of the search space. Since $S = (S' \ln 2 + \ln K)/\lambda$, we can express the E-value in terms of the bit-score as follows:

$$E = K m n e^{-\lambda S} = K m n e^{-(S' \ln 2 + \ln K)} = m n 2^{-S'}. \qquad (5.7)$$
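The conversion between raw scores and bit-scores can be checked numerically. The sketch below uses K = 0.711, Lambda = 1.37 and the effective search space 41968227783 from the BLASTN output in Section 5.3.4; because the printed Lambda is rounded, the computed bit-scores are close to, but not exactly, the reported 420 bits for raw score 212 and 34.2 bits for raw score 17.

```python
import math

def bit_score(raw_score, K, lam):
    """Bit-score S' from raw score S, equation (5.6)."""
    return (lam * raw_score - math.log(K)) / math.log(2)

def e_value_from_bits(bits, search_space):
    """E-value from a bit-score and the search space size m*n, equation (5.7)."""
    return search_space * 2.0 ** (-bits)

K, lam, search_space = 0.711, 1.37, 41968227783
for raw in (212, 17):
    bits = bit_score(raw, K, lam)
    print(raw, round(bits, 1), e_value_from_bits(bits, search_space))
```

Both routes to the E-value, via (5.3) with the raw score or via (5.7) with the bit-score, give the same number, which is the point of the derivation.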

5.3.3 A practical demonstration of the BLAST service provided through the NCBI

5.3.4 BLAST output

BLASTN 2.2.5 [Nov-16-2002]

Reference: Altschul, Stephen F., Thomas L. Madden, Alejandro A. Schaffer,
Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman (1997),
"Gapped BLAST and PSI-BLAST: a new generation of protein database search
programs", Nucleic Acids Res. 25:3389-3402.

Query= gi|27666311|ref|XM_213799.1| Rattus norvegicus similar to LD36129p
[Drosophila melanogaster] (LOC288717), mRNA
(615 letters)

Database: Mm.seq.uniq
87,543 sequences; 71,874,313 total letters

Searching..................................................done

                                                                    Score    E
Sequences producing significant alignments:                        (bits)  Value

gnl|UG|Mm#S1976993 Mus musculus 10, 11 days embryo whole body cD...   420   e-117
gnl|UG|Mm#S1660553 Mus musculus tuftelin interacting protein 11 ...   387   e-106
gnl|UG|Mm#S1422805 BB410669 Mus musculus cDNA, 3' end /clone=C43...    40   0.035
gnl|UG|Mm#S2728452 Mus musculus, similar to KIAA1345 protein, c...     38   0.14
gnl|UG|Mm#S2050066 Mus musculus, clone IMAGE:3594635, mRNA /cds=...    36   0.54
gnl|UG|Mm#S2610883 Mus musculus RIKEN cDNA 4933405A16 gene (4933...    36   0.54
gnl|UG|Mm#S2470935 4063-18 Mus musculus cDNA /gb=BI990266 /gi=17...    34   2.1

>gnl|UG|Mm#S1976993 Mus musculus 10, 11 days embryo whole body cDNA, RIKEN
full-length enriched library, clone:2810002G02:homolog to BK1048E9.3
(NOVEL PROTEIN), full insert sequence /cds=UNKNOWN /gb=AK012640
/gi=12849517 /ug=Mm.206666 /len=1075
Length = 1075

Score = 420 bits (212), Expect = e-117
Identities = 251/264 (95%)
Strand = Plus / Plus

Query: 2   ttcagtcagactgaggtttccgttctcacctccctgggtgtgacagtcctcagtgagaat 61
           ||||||||||||||||||||||||||||||||||| || ||||| |||||||||||||||
Sbjct: 436 ttcagtcagactgaggtttccgttctcacctccctcggcgtgactgtcctcagtgagaat 495

Query: 62  gaggaaggcaagcacagtgtccagagccagcccactgtcttctacatgccacactgtggg 121
           ||||||||||||| |||||||||| |||||||||||||||||||||||||||||||||||
Sbjct: 496 gaggaaggcaagcgcagtgtccagggccagcccactgtcttctacatgccacactgtggg 555

Score = 385 bits (194), Expect = e-106
Identities = 313/349 (89%), Gaps = 4/349 (1%)
Strand = Plus / Plus

Query: 265 gattctgaaaggcctggaggagttcccactgcctcaaactccccagtacaccgacacctt 324
           |||||||||||||||||||||| |||||||||| || |||||||||||||| ||||||||
Sbjct: 725 gattctgaaaggcctggaggaggtcccactgccgcagactccccagtacactgacacctt 784

Query: 325 caatgacacatctgtccactggttccctcttctgaagctggaaggcttatctgaggacct 384
            ||||| |||||||||||||||||||| ||  |||||||||||||  |  ||| ||||||
Sbjct: 785 taatgatacatctgtccactggttcccactattgaagctggaaggactgcctggggacct 844

>gnl|UG|Mm#S2470935 4063-18 Mus musculus cDNA /gb=BI990266 /gi=17961276
/ug=Mm.228667 /len=600
Length = 600

Score = 34.2 bits (17), Expect = 2.1
Identities = 17/17 (100%)


Strand = Plus / Minus

Query: 55  tgagaatgaggaaggca 71
           |||||||||||||||||
Sbjct: 229 tgagaatgaggaaggca 213

Database: Mm.seq.uniq
Posted date: Jan 15, 2003 5:12 PM
Number of letters in database: 71,874,313
Number of sequences in database: 87,543

Lambda     K        H
1.37       0.711    1.31

Gapped
Lambda     K        H
1.37       0.711    1.31

Matrix: blastn matrix:1 -3
Gap Penalties: Existence: 5, Extension: 2
Number of Hits to DB: 33,888
Number of Sequences: 87543
Number of extensions: 33888
Number of successful extensions: 2616
Number of sequences better than 10.0: 52
Number of HSPs better than 10.0 without gapping: 52
Number of HSPs successfully gapped in prelim test: 0
Number of HSPs that attempted gapping in prelim test: 2552
Number of HSPs gapped (non-prelim): 58
length of query: 615
length of database: 71,874,313
effective HSP length: 18
effective length of query: 597
effective length of database: 70,298,539
effective search space: 41968227783
effective search space used: 41968227783
T: 0
A: 0
X1: 6 (11.9 bits)
X2: 15 (29.7 bits)
S1: 12 (24.3 bits)
S2: 16 (32.2 bits)

5.4 FASTA

FASTA (pronounced "fast-ay")¹ is a heuristic for finding significant matches between a query string $q$ and a database string $d$. It is the older of the two heuristics introduced in the lecture. FASTA's general strategy is to find the most significant diagonals in the dot-plot or dynamic programming matrix. The performance of the algorithm is influenced by a word-size parameter $k$, usually 6 for DNA and 2 for amino acids.

The algorithm consists of four phases: hashing, 1st scoring, 2nd scoring, alignment.

5.4.1 Hashing

The first step of the algorithm is to determine all exact matches of length $k$ (word size) between the two sequences, called hot-spots.

¹ FASTA is an abbreviation of "fast all". The predecessor of FASTA was FASTP, which was only applicable to protein sequences.


To find these exact matches quickly, usually a hash (look-up) table is built that consists of all words of length $k$ that are contained in the query sequence. Exact matches are then found by looking up each word of length $k$ that is contained in the database. A hot-spot is given by $(i, j)$, where $i$ and $j$ are the locations (i.e., start positions) of an exact match of length $k$ in the query and database sequence, respectively.

Any such hot-spot $(i, j)$ lies on the diagonal $(i - j)$ of the dot-plot or dynamic programming matrix. Using this scheme, the main diagonal has number 0 ($i = j$), whereas the diagonals on one side of it have positive numbers ($i > j$) and those on the other side negative numbers ($i < j$).

A diagonal run is a set of hot-spots that lie in a consecutive sequence on the same diagonal. It corresponds to a gapless local alignment. A score is assigned to each diagonal run. This is done by giving a positive score to each match (using e.g. the PAM250 match score matrix in the case of proteins) and a negative score for gaps in the run; the latter scores decrease with increasing length of the gap. The algorithm then locates the ten best diagonal runs.
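The hashing phase can be sketched in Python as follows. Positions are 0-based, diagonals are numbered $i - j$, and ranking diagonals by their number of hot-spots is a simplification assumed here in place of the match and gap scoring described above. The function names are illustrative; the two example sequences are the ones used in the figure discussion at the end of Section 5.4.

```python
from collections import defaultdict

def hot_spots(query, db, k):
    """All (i, j) such that query[i:i+k] == db[j:j+k], found via a look-up table."""
    table = defaultdict(list)
    for i in range(len(query) - k + 1):
        table[query[i:i + k]].append(i)
    return [(i, j) for j in range(len(db) - k + 1) for i in table.get(db[j:j + k], [])]

def group_by_diagonal(spots):
    """Collect hot-spots per diagonal number i - j."""
    diagonals = defaultdict(list)
    for i, j in spots:
        diagonals[i - j].append((i, j))
    return {d: sorted(run) for d, run in diagonals.items()}

def best_diagonals(spots, top=10):
    """Rank diagonals by hot-spot count, a crude stand-in for the run score."""
    groups = group_by_diagonal(spots)
    return sorted(groups, key=lambda d: len(groups[d]), reverse=True)[:top]

spots = hot_spots("ACTGAC", "TACCGA", 2)
print(spots)
print(group_by_diagonal(spots))
print(best_diagonals(spots))
```

For ACTGAC and TACCGA with $k = 2$, the words AC and GA produce two hot-spots on the same diagonal, which together form a diagonal run.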

5.4.2 First and second scoring and alignment

Each of the ten diagonal runs with the highest scores is processed further. Within each of these runs, an optimal local alignment is computed using the match score substitution matrix. These alignments are called initial regions. The score of the best sub-alignment found in this phase is reported as init1.

The next step is to combine high-scoring sub-alignments into a single larger alignment, allowing the introduction of gaps into the alignment. The score of this alignment is reported as initn.

Finally, a banded Smith-Waterman dynamic program is used to produce an optimal local alignment along the best matched regions. The center of the band is determined by the region with the score init1, and the band has width 8 for ktup=2. The score of the resulting alignment is reported as opt.

In this way, FASTA determines a highest-scoring region, not all high-scoring alignments between two sequences. Hence, FASTA may miss instances of repeats or multiple domains shared by two proteins.
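The final phase can be illustrated with a minimal banded Smith-Waterman sketch that returns only the score. The parametrization by a center diagonal and a `half_width`, the linear gap penalty and the match/mismatch scores are assumptions chosen for brevity; FASTA itself uses a substitution matrix and its own band and gap settings.

```python
def banded_smith_waterman(q, d, center, half_width, match=1, mismatch=-1, gap=-2):
    """Best local alignment score using only cells (i, j) with |(i - j) - center| <= half_width."""
    best = 0
    prev = {}  # scores of row i-1, keyed by column j; cells outside the band count as 0
    for i in range(1, len(q) + 1):
        cur = {}
        lo = max(1, i - center - half_width)
        hi = min(len(d), i - center + half_width)
        for j in range(lo, hi + 1):
            sub = match if q[i - 1] == d[j - 1] else mismatch
            score = max(0,
                        prev.get(j - 1, 0) + sub,   # diagonal move (match or mismatch)
                        prev.get(j, 0) + gap,       # letter of q aligned to a gap
                        cur.get(j - 1, 0) + gap)    # letter of d aligned to a gap
            cur[j] = score
            best = max(best, score)
        prev = cur
    return best

q, d = "ACGTACGT", "ACGTTACGT"
print(banded_smith_waterman(q, d, center=0, half_width=0))  # main diagonal only
print(banded_smith_waterman(q, d, center=0, half_width=1))  # one diagonal on each side
```

Only cells inside the band are filled, so the work per query position is proportional to the band width rather than to the database sequence length. In the example, widening the band by one diagonal lets the alignment absorb the single inserted letter.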


After all sequences of the database have thus been searched, a statistical significance similar to the BLAST statistics is computed and reported.

The following matrix summarizes the principle of FASTA for the two sequences ACTGAC and TACCGA: The hot-spots for $k = 2$ are marked as pairs of black bullets, and a diagonal run is shaded in dark grey. An optimal sub-alignment in this case coincides with the diagonal run. The light grey shaded band of width 3 around the sub-alignment denotes the area in which the optimal local alignment is searched.

5.5 BLAST and FASTA

[Figure: two schematic dot-plots with axes Query and Database, panel (a) BLAST and panel (b) FASTA]

(a) In BLAST, individual seeds are found and then extended without indels. (b) In FASTA, individual seeds contained in the same diagonal are merged and the resulting segments are then connected using a banded Smith-Waterman alignment.
