BLAST: A Case Study
Lecture 25
BLAST: Introduction
The Basic Local Alignment Search Tool, BLAST, is a fast
approach to finding similar strings of characters.
BLAST was developed to find sequences of nucleotides or
amino acids in a database that match a query sequence.
For example, searching the human genome for
AGCTTTTCTCTTCTGTCAACCCCACACGCCTTT
produces a list of sequences scored by similarity.
This system helps scientists find genetic homologues
across individuals and species.
Using BLAST
There are several interfaces to BLAST, and it often appears
as one component of a larger suite of informatics tools.
The National Center for Biotechnology Information (NCBI) hosts
the primary website and a server farm dedicated to BLAST.
From here, a user
• enters a query,
• selects a database,
• chooses a variant of BLAST to use, and
• sets program parameters.
Results appear in seconds.
BLAST Results
The NCBI BLAST tool returns results in several modes,
with information centered around similarity scores.
In addition to a list of matches, the tool returns
• a graphical view of the list that visualizes the alignments,
• a detailed textual view of each match, and
• a mapping of the matches to a visual representation of an
entire genome.
How BLAST Works (Stage 1)
The core BLAST algorithm has three distinct stages.
In the first stage, the system splits the query sequence into
constant-sized words.
Assuming the constant, W, is 4, the nucleotide query
AGCTTTTCTCTTCTGTCAACCCCACACGCCTTT
produces the words
AGCT GCTT CTTT … GCCT CCTT CTTT
BLAST scores these against every possible four-letter word
over the sequence alphabet.
The words whose similarity scores exceed a threshold move on
to later stages; the rest are discarded.
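The first stage can be sketched in a few lines of Python. This is a minimal illustration, not BLAST's actual implementation: the function names, the +1/-1 match/mismatch scores, and the threshold value are all assumptions made for the example.

```python
from itertools import product

ALPHABET = "ACGT"  # nucleotide alphabet

def split_into_words(query: str, w: int = 4) -> list[str]:
    """Return every overlapping length-w substring of the query, in order."""
    return [query[i:i + w] for i in range(len(query) - w + 1)]

def word_score(a: str, b: str, match: int = 1, mismatch: int = -1) -> int:
    """Sum per-position scores for two equal-length words (assumed scheme)."""
    return sum(match if x == y else mismatch for x, y in zip(a, b))

def neighborhood(word: str, threshold: int) -> list[str]:
    """All words over ALPHABET whose score against `word` exceeds `threshold`."""
    candidates = ("".join(p) for p in product(ALPHABET, repeat=len(word)))
    return [c for c in candidates if word_score(word, c) > threshold]

query = "AGCTTTTCTCTTCTGTCAACCCCACACGCCTTT"
words = split_into_words(query)
print(len(words), words[:3], words[-3:])
# 30 ['AGCT', 'GCTT', 'CTTT'] ['GCCT', 'CCTT', 'CTTT']

# With +1/-1 scoring and a threshold of 1, only the word itself and its
# twelve single-mismatch variants survive.
print(len(neighborhood("AGCT", threshold=1)))  # 13
```

Enumerating all 4^W candidate words is cheap for small W, which is one reason a short, fixed word size keeps this stage fast.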
Side Note: Similarity in BLAST
To score the similarity of two words, BLAST builds a table
based on edit distances.
For example, comparing AGCT to ACCC could give a score
of 1, whereas comparing it to GGCT would give 3.
However, some substitutions (due to mutation) are more
likely than others, especially in the case of amino acids.
BLAST accepts a scoring matrix for protein strings (e.g.,
the Point Accepted Mutation matrix PAM70).
For nucleotide strings, users can specify distinct scores for
matches and mismatches.
BLAST also includes procedures for identifying and
penalizing gaps.
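The slide does not say which scoring table produces the scores of 1 and 3 above. One assumed scheme that happens to reproduce them scores a match as +1, a transition (purine to purine, or pyrimidine to pyrimidine) as 0, and a transversion as -1. The sketch below is illustrative only; the names and score values are hypothetical.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def base_score(x: str, y: str) -> int:
    """Assumed scores: match +1, transition 0, transversion -1."""
    if x == y:
        return 1
    same_class = {x, y} <= PURINES or {x, y} <= PYRIMIDINES
    return 0 if same_class else -1

def word_score(a: str, b: str) -> int:
    """Score two equal-length nucleotide words position by position."""
    if len(a) != len(b):
        raise ValueError("words must have the same length")
    return sum(base_score(x, y) for x, y in zip(a, b))

print(word_score("AGCT", "ACCC"))  # 1
print(word_score("AGCT", "GGCT"))  # 3
```

Treating transitions more leniently than transversions is one way to encode the observation that some substitutions are more likely than others.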
How BLAST Works (Stages 2 and 3)
At this point, BLAST has built a set of W-length words whose
similarity scores exceed a user-provided threshold.
During the second stage, the system searches for all
occurrences of these words within the database.
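A common way to make this lookup fast is to index the database by word ahead of time. The following is a minimal sketch under that assumption; the toy database, its sequence names, and the function names are hypothetical and do not reflect BLAST's internal data structures.

```python
from collections import defaultdict

def build_index(database: dict[str, str], w: int) -> dict[str, list[tuple[str, int]]]:
    """Map each length-w word to the (sequence name, offset) pairs where it occurs."""
    index = defaultdict(list)
    for name, seq in database.items():
        for i in range(len(seq) - w + 1):
            index[seq[i:i + w]].append((name, i))
    return index

def find_hits(words: list[str], index) -> list[tuple[str, str, int]]:
    """Return (word, sequence name, offset) for every occurrence of every word."""
    hits = []
    for word in words:
        hits.extend((word, name, pos) for name, pos in index.get(word, []))
    return hits

# Toy database (hypothetical sequences).
database = {"seq1": "TTAGCTAGCT", "seq2": "CCCCGGCT"}
index = build_index(database, w=4)
hits = find_hits(["AGCT", "GGCT"], index)
print(hits)
# [('AGCT', 'seq1', 2), ('AGCT', 'seq1', 6), ('GGCT', 'seq2', 4)]
```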
In the third stage, BLAST extends each of these W-length
matches in both directions to obtain the final similarity score.
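The extension step can be illustrated with a simplified, ungapped version. This sketch assumes +1/-1 scoring and stops extending once the running score falls more than a fixed amount (`x_drop`, a hypothetical parameter name here) below the best score seen; it ignores gaps, which BLAST handles separately.

```python
def _extend(query, subject, qi, si, step, match, mismatch, x_drop):
    """Walk from (qi, si) in direction `step`; return (best gain, length at best)."""
    best = run = best_len = length = 0
    while 0 <= qi < len(query) and 0 <= si < len(subject):
        run += match if query[qi] == subject[si] else mismatch
        length += 1
        if run > best:
            best, best_len = run, length
        elif best - run > x_drop:
            break  # score has dropped too far; give up on this direction
        qi += step
        si += step
    return best, best_len

def extend_hit(query, subject, q_pos, s_pos, w, match=1, mismatch=-1, x_drop=2):
    """Extend a length-w seed without gaps.

    Returns (score, q_start, q_end), where query[q_start:q_end] is the
    extended match on the query side.
    """
    seed = sum(match if query[q_pos + k] == subject[s_pos + k] else mismatch
               for k in range(w))
    right, r_len = _extend(query, subject, q_pos + w, s_pos + w, 1,
                           match, mismatch, x_drop)
    left, l_len = _extend(query, subject, q_pos - 1, s_pos - 1, -1,
                          match, mismatch, x_drop)
    return seed + left + right, q_pos - l_len, q_pos + w + r_len

query = "GGAGCTTTA"
subject = "CCAGCTTTC"
score, start, end = extend_hit(query, subject, q_pos=2, s_pos=2, w=4)
print(score, query[start:end])  # 6 AGCTTT
```

Only the seeds found in the second stage are extended, so most of the database is never compared character by character.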
The system also calculates an E-value for the score: the
number of matches with at least that score expected by chance
in a database of that size.
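For ungapped local alignments, the E-value is commonly written in the Karlin-Altschul form E = K * m * n * exp(-lambda * S), where S is the score, m the query length, and n the database length. The sketch below assumes that form; the default values of `k` and `lam` are placeholders chosen for illustration, since the real constants depend on the scoring scheme.

```python
import math

def e_value(score: float, query_len: int, db_len: int,
            k: float = 0.1, lam: float = 0.3) -> float:
    """Expected number of chance matches scoring at least `score`.

    k and lam are placeholder values (assumptions), not BLAST defaults.
    """
    return k * query_len * db_len * math.exp(-lam * score)

# Higher scores are less likely to arise by chance, so E shrinks as S grows.
for s in (10, 20, 40):
    print(s, e_value(s, query_len=33, db_len=3_000_000_000))
```

An E-value well below 1 suggests that a match of that quality is unlikely to appear by chance in a database of that size.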
Knowledge and Search in BLAST
BLAST differs from many of the informatics tools that we
have considered in the course.
Essentially, it finds a sequence’s nearest neighbors within a
database with minimal concern for the content.
Unlike discovery or analysis tools, BLAST gathers
information and leaves the interpretation to the user.
However, like many discovery tools, BLAST relies on
domain knowledge to carry out heuristic search.
Knowledge: match/mismatch costs for amino acid and
nucleotide sequences
Heuristic search: an approximate scoring scheme that tells
BLAST where to look more closely
What Makes BLAST a Successful Tool?
Google Scholar identifies over 28,000 citations of the
original BLAST paper.
One of the key reasons for the system’s popularity is that it
addresses problems commonly encountered in biology:
• finding genetic homologues across organisms; and
• determining the source organism of a sequenced
genome (e.g., the Global Ocean Sampling Expedition).
Technical issues also contributed to BLAST’s success:
• it was much faster than competing software;
• it was distributed and maintained by the National
Institutes of Health;
• it has continually evolved to meet new challenges and to
integrate with new databases and other technologies.
BLAST: Summary
A key insight in BLAST was to iteratively refine a solution:
• find a reduced set of short words to use as a heuristic for
locating similar strings;
• find matches to those short words and extend them to
refine the candidate solution.
This strategy accounts for the speed advantage that BLAST has
over systems that perform exhaustive, exact comparisons.
The continued success of BLAST is attributable to
• the speed with which it can find sequence matches,
• its availability over the internet,
• its integration with other biological tools, and
• the fact that it addresses a specific need of biologists.