The Basic Local Alignment Search Tool (BLAST), according to the BLAST home page,
finds regions of local similarity between sequences. The program compares nucleotide or protein sequences to sequence databases and calculates the statistical significance of matches. BLAST can be used to infer functional and evolutionary relationships between sequences as well as help identify members of gene families.
BLAST is available online from the National Center for Biotechnology Information (NCBI) and can be used
BLAST makes use of an extensive database of DNA and protein sequences from many organisms. This database is growing daily. The user of BLAST provides a DNA or protein sequence, and BLAST returns a list of sequences from the selected portion of its databases that match the provided sequence reasonably well. A description of how well two sequences match is referred to as an alignment of the two sequences. Alignments are scored (based on scoring parameters selected by the user) and presented in order beginning with the "best match". The genius of the BLAST algorithm combines three essential ingredients
Here is output typical of a BLAST query:
>ref|NM_009586.1| Homo sapiens single-minded homolog 2 (Drosophila) (SIM2), transcript variant SIM2s, mRNA
Length=2823

Score = 329 bits (178), Expect = 6e-87
Identities = 253/288 (87%), Gaps = 9/288 (3%)
Strand=Plus/Plus

Query  6378  GTGGCTCACGCCTGTAATCCCAGCACTTTGGGAGGCCAAGGTGGGCAGATCAC-TGGAGG  6436
             ||||||||| |||||||||||||||||||||||||||||||||||| |||||| || |||
Sbjct  2541  GTGGCTCACACCTGTAATCCCAGCACTTTGGGAGGCCAAGGTGGGCGGATCACCTG-AGG  2599

Query  6437  TCAGGAGTTCGAAACCAGCCTGGCCAACATGGTGAAACCCCATCTCTACTAAAAATACAG  6496
             ||||||||| |  || |||||| |||||| | |||||||||||||| ||||||||||||
Sbjct  2600  TCAGGAGTTTGCGACAAGCCTG-CCAACAAGCTGAAACCCCATCTCCACTAAAAATACAA  2658

Query  6497  AAATTAGCCGGTCATGGTGGTG-GACACCTGTAATCCCAGCTACTCAGGTGGCTAAGGCA  6555
             |||||||  || |||||||||| | ||||||||||||||||||||| || |||| ||  |
Sbjct  2659  AAATTAGTTGGGCATGGTGGTGAG-CACCTGTAATCCCAGCTACTCTGGAGGCTGAGATA  2717

Query  6556  GGAGAATCACTTCAGCCCGGGAGGTGGAGGTTGCAGTGAGCCAAGATCATACCACGGCAC  6615
             |||| ||||||| | |||||||||||||||||||||||||| ||||||| | ||| ||||
Sbjct  2718  GGAGGATCACTTGAACCCGGGAGGTGGAGGTTGCAGTGAGCTAAGATCACATCACTGCAC  2777

Query  6616  TCCAGCCTGGGTGACAG--TGAGACTGTGGCTCAAAAAAAAAAAAAAA  6661
             |||||||||||| ||||  |||||||||  ||||||||||||||||||
Sbjct  2778  TCCAGCCTGGGTAACAGAGTGAGACTGT--CTCAAAAAAAAAAAAAAA  2823

The alignment of two sequences is represented by placing the user-supplied query sequence above a sequence from the database. Where the sequences match exactly, a vertical line is placed between them to make the match stand out visually. A horizontal line (a dash) within a sequence marks a gap inserted into that sequence so that the flanking regions match better. The output begins with summary statistics for the alignment. In the example above, 253 of the 288 positions match exactly and 9 gap positions have been introduced. The remaining 26 positions are mismatches between the two sequences.
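The summary statistics can be recomputed from the two gapped rows of an alignment. The following is a minimal sketch in Python and is not part of BLAST itself: the function name `alignment_stats` is hypothetical, and the code assumes both rows have the same length and use `-` to mark a gap.

```python
def alignment_stats(query_row, sbjct_row):
    """Count identities, gap positions, and mismatches in a pairwise alignment.

    Both rows must have the same length and use '-' to mark a gap.
    """
    if len(query_row) != len(sbjct_row):
        raise ValueError("aligned rows must have equal length")
    identities = gaps = mismatches = 0
    for q, s in zip(query_row, sbjct_row):
        if q == "-" or s == "-":
            gaps += 1          # a gap in either row counts as one gap position
        elif q == s:
            identities += 1    # this is where BLAST draws a vertical line
        else:
            mismatches += 1
    return {
        "length": len(query_row),
        "identities": identities,
        "gaps": gaps,
        "mismatches": mismatches,
    }


# Last block of the example alignment (query 6616-6661, subject 2778-2823).
query_row = "TCCAGCCTGGGTGACAG--TGAGACTGTGGCTCAAAAAAAAAAAAAAA"
sbjct_row = "TCCAGCCTGGGTAACAGAGTGAGACTGT--CTCAAAAAAAAAAAAAAA"
stats = alignment_stats(query_row, sbjct_row)
print(stats)
```

For this block the function reports 43 identities, 4 gap positions, and 1 mismatch over 48 columns. Summing the same counts over all five blocks gives the 253/288 identities and 9/288 gaps shown in the header.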
The quality of an alignment between two sequences (called a score in BLAST) depends on a weighting scheme selected by the user. This weighting scheme can be thought of as assigning costs to "bad things" like
Finding the "best alignment" then becomes a search for the alignment with the smallest cost. Different weighting schemes (and methods for deriving weighting schemes) have been proposed for different applications and by different scientists. Weighting schemes can be used, for example, to take into account that
Many of the proposed weighting schemes are driven by the analysis of data from many sequences.
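To make the idea of a weighting scheme concrete, here is a minimal sketch in Python. The function names and the cost values (`mismatch_cost`, `gap_cost`) are illustrative assumptions, not BLAST's actual parameters. The exhaustive dynamic-programming search shown is the textbook method for aligning two whole sequences, not the faster heuristic that BLAST uses for local alignments.

```python
def alignment_cost(row_a, row_b, mismatch_cost=1, gap_cost=2):
    """Total cost of a given gapped alignment.

    Matches are free; each mismatch and each gap position is penalised.
    """
    cost = 0
    for a, b in zip(row_a, row_b):
        if a == "-" or b == "-":
            cost += gap_cost
        elif a != b:
            cost += mismatch_cost
    return cost


def min_alignment_cost(seq_a, seq_b, mismatch_cost=1, gap_cost=2):
    """Smallest cost over all end-to-end alignments of seq_a and seq_b."""
    # prev[j] is the best cost of aligning the processed prefix of seq_a
    # with seq_b[:j]; initially the prefix is empty, so only gaps are used.
    prev = [j * gap_cost for j in range(len(seq_b) + 1)]
    for i, a in enumerate(seq_a, start=1):
        curr = [i * gap_cost]
        for j, b in enumerate(seq_b, start=1):
            substitute = prev[j - 1] + (0 if a == b else mismatch_cost)
            gap_in_b = prev[j] + gap_cost      # a is aligned against a gap
            gap_in_a = curr[j - 1] + gap_cost  # b is aligned against a gap
            curr.append(min(substitute, gap_in_b, gap_in_a))
        prev = curr
    return prev[-1]


# The same pair of sequences under two different weighting schemes.
cheap_mismatch = min_alignment_cost("AC", "AG", mismatch_cost=1, gap_cost=2)
cheap_gap = min_alignment_cost("AC", "AG", mismatch_cost=5, gap_cost=1)
print(cheap_mismatch, cheap_gap)
```

With the default weights the best alignment of `AC` and `AG` accepts one mismatch (cost 1). When mismatches cost 5 and gaps cost 1, the cheapest alignment instead opens two gaps (cost 2), which shows how the choice of weighting scheme changes which alignment counts as "best".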