An Introduction to BLAST

What BLAST Does

The Basic Local Alignment Search Tool (BLAST), according to the BLAST home page,

finds regions of local similarity between sequences. The program compares nucleotide or protein sequences to sequence databases and calculates the statistical significance of matches. BLAST can be used to infer functional and evolutionary relationships between sequences as well as help identify members of gene families.

BLAST is available online from the National Center for Biotechnology Information (NCBI) and can be used

How BLAST Works

BLAST makes use of an extensive database of DNA and protein sequences from many organisms, and this database grows daily. The user provides a DNA or protein sequence, and BLAST returns a list of sequences from the selected portion of its databases that match the query reasonably well. A description of how well two sequences match is referred to as an alignment of the two sequences. Alignments are scored (based on scoring parameters selected by the user) and presented in order, beginning with the "best match". The genius of the BLAST algorithm combines three essential ingredients

Here is output typical of a BLAST query:

>ref|NM_009586.1| Homo sapiens single-minded homolog 2 (Drosophila) (SIM2), transcript 
variant SIM2s, mRNA
Length=2823

 Score =  329 bits (178),  Expect = 6e-87
 Identities = 253/288 (87%), Gaps = 9/288 (3%)
 Strand=Plus/Plus

Query  6378  GTGGCTCACGCCTGTAATCCCAGCACTTTGGGAGGCCAAGGTGGGCAGATCAC-TGGAGG  6436
             ||||||||| |||||||||||||||||||||||||||||||||||| |||||| || |||
Sbjct  2541  GTGGCTCACACCTGTAATCCCAGCACTTTGGGAGGCCAAGGTGGGCGGATCACCTG-AGG  2599

Query  6437  TCAGGAGTTCGAAACCAGCCTGGCCAACATGGTGAAACCCCATCTCTACTAAAAATACAG  6496
             ||||||||| |  || |||||| |||||| | |||||||||||||| |||||||||||| 
Sbjct  2600  TCAGGAGTTTGCGACAAGCCTG-CCAACAAGCTGAAACCCCATCTCCACTAAAAATACAA  2658

Query  6497  AAATTAGCCGGTCATGGTGGTG-GACACCTGTAATCCCAGCTACTCAGGTGGCTAAGGCA  6555
             |||||||  || |||||||||| | ||||||||||||||||||||| || |||| ||  |
Sbjct  2659  AAATTAGTTGGGCATGGTGGTGAG-CACCTGTAATCCCAGCTACTCTGGAGGCTGAGATA  2717

Query  6556  GGAGAATCACTTCAGCCCGGGAGGTGGAGGTTGCAGTGAGCCAAGATCATACCACGGCAC  6615
             |||| ||||||| | |||||||||||||||||||||||||| ||||||| | ||| ||||
Sbjct  2718  GGAGGATCACTTGAACCCGGGAGGTGGAGGTTGCAGTGAGCTAAGATCACATCACTGCAC  2777

Query  6616  TCCAGCCTGGGTGACAG--TGAGACTGTGGCTCAAAAAAAAAAAAAAA  6661
             |||||||||||| ||||  |||||||||  ||||||||||||||||||
Sbjct  2778  TCCAGCCTGGGTAACAGAGTGAGACTGT--CTCAAAAAAAAAAAAAAA  2823
 
The alignment of two sequences is represented by placing the user-supplied query sequence (Query) above a sequence from the database (Sbjct). Where the sequences match exactly, a vertical bar (|) is placed between the two sequences to make the match stand out visually. A hyphen (-) marks a position where a gap has been inserted in one sequence so that the flanking regions match better. The numbers at either end of each row give the positions of the first and last base of that row within its own sequence. Summary statistics for the alignment appear at the beginning of the output. In the example above, 253 of the 288 aligned positions match exactly (87%), and 9 positions are gaps. At the remaining 26 positions there is a mismatch between the two sequences.
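The bookkeeping behind the Identities and Gaps lines can be sketched in a few lines of Python. This is only an illustration of how the columns of an alignment are classified, not code taken from BLAST; the function names `alignment_stats` and `match_line` are hypothetical, and the two rows are copied from the first block of the example output.

```python
GAP = "-"  # character used for a gap in the aligned rows


def alignment_stats(query_row, subject_row):
    """Classify every column of an alignment as identity, mismatch, or gap."""
    if len(query_row) != len(subject_row):
        raise ValueError("aligned rows must have the same length")
    stats = {"length": len(query_row), "identities": 0, "mismatches": 0, "gaps": 0}
    for q, s in zip(query_row, subject_row):
        if q == GAP or s == GAP:
            stats["gaps"] += 1          # a gap in either sequence
        elif q == s:
            stats["identities"] += 1    # exact match
        else:
            stats["mismatches"] += 1    # two different letters
    return stats


def match_line(query_row, subject_row):
    """Rebuild the middle line: '|' under exact matches, a space elsewhere."""
    return "".join(
        "|" if q == s and q != GAP else " "
        for q, s in zip(query_row, subject_row)
    )


if __name__ == "__main__":
    # First row of the example alignment (Query 6378-6436, Sbjct 2541-2599).
    query = "GTGGCTCACGCCTGTAATCCCAGCACTTTGGGAGGCCAAGGTGGGCAGATCAC-TGGAGG"
    sbjct = "GTGGCTCACACCTGTAATCCCAGCACTTTGGGAGGCCAAGGTGGGCGGATCACCTG-AGG"
    print(query)
    print(match_line(query, sbjct))
    print(sbjct)
    print(alignment_stats(query, sbjct))
```

For this first row of 60 columns the counts come to 56 identities, 2 mismatches, and 2 gaps. Summing the same counts over all five rows is what produces the totals reported in the header.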

Parameter Selection

The quality of an alignment between two sequences (called a score in BLAST) depends on a weighting scheme selected by the user. This weighting scheme can be thought of as assigning costs to "bad things" like

Finding the "best alignment" then becomes a search for the alignment with the smallest cost. Different weighting schemes (and methods for deriving weighting schemes) have been proposed for different applications and by different scientists. Weighting schemes can be used, for example, to take into account that

Many of the proposed weighting schemes are driven by the analysis of data from many sequences.
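The idea of scoring an alignment as a sum of costs, and of searching for the alignment with the smallest cost, can be made concrete with a small sketch. The code below assumes the simplest possible weighting scheme: one fixed cost per mismatch and one fixed cost per gap position. The names `alignment_cost`, `min_alignment_cost`, `mismatch_cost`, and `gap_cost` are hypothetical and chosen for illustration, and the default values are arbitrary. The search uses textbook dynamic programming over the whole of both sequences, which is much slower than, and not the same as, the heuristic local search that BLAST performs.

```python
def alignment_cost(row_a, row_b, mismatch_cost=1, gap_cost=2):
    """Total cost of one given alignment (rows already padded with '-')."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    cost = 0
    for a, b in zip(row_a, row_b):
        if a == "-" or b == "-":
            cost += gap_cost        # a gap position
        elif a != b:
            cost += mismatch_cost   # two different letters
    return cost


def min_alignment_cost(x, y, mismatch_cost=1, gap_cost=2):
    """Smallest cost over all end-to-end alignments of x and y."""
    # prev[j] = cheapest way to align the letters of x seen so far with y[:j]
    prev = [j * gap_cost for j in range(len(y) + 1)]
    for i in range(1, len(x) + 1):
        curr = [i * gap_cost]  # x[:i] aligned against nothing: all gaps
        for j in range(1, len(y) + 1):
            pair = prev[j - 1] + (0 if x[i - 1] == y[j - 1] else mismatch_cost)
            gap_in_y = prev[j] + gap_cost      # x[i-1] opposite a gap
            gap_in_x = curr[j - 1] + gap_cost  # y[j-1] opposite a gap
            curr.append(min(pair, gap_in_y, gap_in_x))
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    print(alignment_cost("AC-T", "ACGT"))     # cost of one particular alignment
    print(min_alignment_cost("ACT", "ACGT"))  # cheapest alignment overall
    # Changing the weights changes which alignment is "best":
    print(min_alignment_cost("A", "G", mismatch_cost=1, gap_cost=2))
    print(min_alignment_cost("A", "G", mismatch_cost=5, gap_cost=2))
```

With a cheap mismatch, the best way to align A with G is to pair them directly. With an expensive mismatch, it becomes cheaper to place each letter opposite a gap. This is the sense in which the choice of weighting scheme determines which alignment counts as best.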