Sequence Analysis Using BLAST
Arun Krishnan
BioInformatics Institute
What is BLAST?
• Basic Local Alignment Search Tool
– Heuristic search algorithm
– Uses the statistical methods of Karlin and
Altschul (1990, 1993)
– Tailored for sequence similarity search
Components of BLAST
• Five main programs:
– blastp
• compares an amino acid query sequence against a
protein sequence database
– blastn
• compares a nucleotide query sequence against a
nucleotide sequence database
– blastx
• compares the six-frame conceptual translation products
of a nucleotide query sequence against a protein
sequence database
– tblastn
• compares a protein sequence against a nucleotide
sequence database dynamically translated in all six
reading frames
– tblastx
• compares the six-frame translations of a nucleotide
query sequence against the six-frame translations of a
nucleotide sequence database
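The translated programs (blastx, tblastn, tblastx) all rely on six-frame conceptual translation. The following is a minimal sketch of that step, assuming the standard genetic code and an unambiguous A/C/G/T input; the function names are illustrative and are not part of BLAST itself.

```python
# Standard genetic code, laid out in the conventional TCAG codon order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate complete codons only; unknown codons become 'X'."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frame_translations(seq):
    """Map frame labels (+1..+3, -1..-3) to conceptual translations."""
    seq = seq.upper()
    rc = reverse_complement(seq)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(seq[offset:])
        frames[f"-{offset + 1}"] = translate(rc[offset:])
    return frames
```

Under this sketch, blastx would compare each of the six query translations against a protein database, tblastn would do the same for each database sequence, and tblastx would do it on both sides.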
Search Strategy
• Fundamental unit is the High-scoring
Segment Pair (HSP)
• HSP
– Consists of two sequence fragments of arbitrary but equal
length whose alignment is locally maximal
– Alignment score meets or exceeds a threshold
• Each HSP consists of
– A segment from the query sequence and one from the
database
• Sensitivity and speed adjusted by
– W, T and X
• Selectivity adjusted by
– Cutoff score
Similarity Searching Approach
• Look for similar segments (HSPs) between the
query sequence and a database sequence
• Evaluate the statistical significance of any
matches found
• Report only matches satisfying the user-selected
threshold
• Statistical significance ascribed to a set of
HSPs may be higher than for a single HSP of
that set
• Only when the ascribed significance satisfies
the user-selected threshold (E parameter) will
the match be reported
How to find HSPs?
• Begins by identifying short words of length W in the
query sequence that
– Either match or satisfy some positive-valued threshold score T
when aligned with a word of the same length in a database
sequence
– T: neighborhood word score threshold
• These initial neighborhood hits act as seeds for
initiating searches to find longer HSPs containing
them
• Word hits are extended in both directions along each
sequence for as far as the cumulative alignment
score can be increased
• Extension stops when
– Cumulative alignment score falls off by “X” from its
maximum achieved value
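The seed-and-extend steps above can be sketched as follows. This is a toy illustration only: it uses a match/mismatch score of +5/-4 on a DNA alphabet instead of a real substitution matrix, it extends without gaps, and it treats "falls off by X" as a drop of at least X. All function names are hypothetical.

```python
from itertools import product

MATCH, MISMATCH = 5, -4   # assumed toy scores, not a real scoring matrix
ALPHABET = "ACGT"


def pair_score(a, b):
    return MATCH if a == b else MISMATCH


def neighborhood(word, t):
    """All words of the same length that score at least t against `word`."""
    return [
        "".join(cand)
        for cand in product(ALPHABET, repeat=len(word))
        if sum(pair_score(a, b) for a, b in zip(word, cand)) >= t
    ]


def find_seeds(query, subject, w, t):
    """Yield (query_pos, subject_pos) for every neighborhood word hit."""
    index = {}
    for j in range(len(subject) - w + 1):
        index.setdefault(subject[j:j + w], []).append(j)
    for i in range(len(query) - w + 1):
        for word in neighborhood(query[i:i + w], t):
            for j in index.get(word, []):
                yield i, j


def extend_hit(query, subject, q_pos, s_pos, w, x_drop):
    """Extend a word hit in both directions without gaps.

    Returns (query_start, subject_start, length, score) of the best segment.
    """
    score = sum(pair_score(query[q_pos + k], subject[s_pos + k]) for k in range(w))
    best, right = score, w
    k = w
    # Extend right until the score drops x_drop below the best seen so far.
    while q_pos + k < len(query) and s_pos + k < len(subject):
        score += pair_score(query[q_pos + k], subject[s_pos + k])
        k += 1
        if score > best:
            best, right = score, k
        elif best - score >= x_drop:
            break
    # Extend left, starting again from the best score reached so far.
    score, left, k = best, 0, 1
    while q_pos - k >= 0 and s_pos - k >= 0:
        score += pair_score(query[q_pos - k], subject[s_pos - k])
        if score > best:
            best, left = score, k
        elif best - score >= x_drop:
            break
        k += 1
    return q_pos - left, s_pos - left, left + right, best
```

Lowering T enlarges the neighborhood and so increases sensitivity at the cost of more seeds to extend; raising X lets extensions run through longer poor-scoring stretches before giving up.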
Scoring Schemes
• Default: BLOSUM62
– For blastp, blastx, tblastn and tblastx
• Several PAM matrices provided
– PAM40, PAM120 and PAM250
• BLOSUM62 is a good general-purpose matrix
• PAM120 recommended for general protein
similarity searches
• pam(1) can be used to produce PAM
matrices from 2 to 511
• Each matrix is most sensitive at finding
similarities at its particular PAM distance
• For more thorough searches
– Recommended to do three searches with PAM40,
PAM120 and PAM250
Scoring Schemes (continued)
• In blastn
– The M parameter sets the reward score for a pair of matching
residues (M > 0)
– The N parameter sets the penalty score for mismatching
residues (N < 0)
– The relative magnitudes of M and N determine the number of
nucleic acid PAMs for which they are most sensitive at
finding homologs
– Higher ratios of M:N correspond to increasing nucleic acid
PAMs
– An M:N ratio of 1 corresponds to 30 nucleic acid PAMs or
38 amino acid PAMs
– An M:N ratio greater than 3 doesn’t make statistical sense
– At higher than 40 nucleic acid PAMs or 50 amino acid
PAMs, it is recommended to perform comparisons at the amino
acid level
– Wordlength W = 11 restricts the program to finding
sequences that share at least an 11-mer stretch of 100%
identity
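The effect of the wordlength restriction can be illustrated with a small sketch. It assumes the M = 5, N = -4 reward and penalty values purely for illustration, and the helper names are hypothetical. It shows two sequences that are 95% identical yet share no exact 11-mer, so a search seeded with W = 11 would have nothing to extend.

```python
def ungapped_score(a, b, m=5, n=-4):
    """Score two equal-length sequences position by position."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(m if x == y else n for x, y in zip(a, b))


def shared_words(a, b, w=11):
    """Return the sorted exact words of length w present in both sequences."""
    words_a = {a[i:i + w] for i in range(len(a) - w + 1)}
    words_b = {b[j:j + w] for j in range(len(b) - w + 1)}
    return sorted(words_a & words_b)


# Two 20-mers that differ only at position 10 (0-based).
seq_a = "ACGTACGTACGTACGTACGT"
seq_b = seq_a[:10] + "T" + seq_a[11:]

score = ungapped_score(seq_a, seq_b)       # 19 matches, 1 mismatch
seeds_w11 = shared_words(seq_a, seq_b, 11)  # no exact 11-mer in common
seeds_w9 = shared_words(seq_a, seq_b, 9)    # shorter words still match
```

Reducing W makes such pairs visible again, at the price of many more chance word hits to process.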
P-Values & Alignment Scores
• Expectation and P-values depend on
– The scoring system employed
– The residue composition of the query sequence
– An assumed residue composition for a typical
database sequence
– Length of the query sequence
– Total length of the database
• Never compare alignment scores
obtained with different scoring matrices
– E.g., BLOSUM62 and PAM120
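The dependencies listed above are consistent with the standard Karlin-Altschul relation E = K·m·n·exp(-λS), which the slides do not spell out. In the sketch below, m is taken as the query length and n as the total database length, while λ and K stand for the parameters determined by the scoring system and residue composition. The numeric values used in the example calls are made up for illustration and are not the parameters of any real matrix.

```python
import math


def e_value(score, m, n, lam, k):
    """Expected number of chance HSPs scoring at least `score`."""
    return k * m * n * math.exp(-lam * score)


def p_value(e):
    """Probability of at least one chance HSP, given expectation e."""
    return -math.expm1(-e)   # equals 1 - exp(-e), computed stably


def bit_score(score, lam, k):
    """Normalized score that folds in the scoring-system parameters."""
    return (lam * score - math.log(k)) / math.log(2)


# Illustrative (assumed) parameters for one hypothetical scoring system.
LAM, K = math.log(2), 0.1
e_small_db = e_value(10, m=100, n=1000, lam=LAM, k=K)
e_large_db = e_value(10, m=100, n=2000, lam=LAM, k=K)
```

Because λ and K differ between matrices, the same raw score maps to different E-values under different scoring systems, which is why raw scores from, say, BLOSUM62 and PAM120 searches should not be compared directly. The sketch also shows that doubling the database length doubles the expectation for a fixed score.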
BLAST Output
• Consists of
– An introduction to the program
– A histogram of expectations
– A series of one-line descriptions of matching
database sequences
– The actual sequence alignment
– Parameters and statistics gathered during the search
Assignment
• Download and install the BLAST program
from ftp://ftp.ncbi.nih.gov/blast/
• Download the ecoli.nt.Z database from
ftp://ftp.ncbi.nih.gov/blast/db/
• Please see handout for your assignment.
Complete the assignment using the NCBI
BLAST webpage.
• Redo the “nucleotide” portion of your
assignment via the command line (using the
BLAST program that you have installed on
your computers).
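As a starting point for the command-line portion, the sketch below assembles a nucleotide search using the current NCBI BLAST+ tools (makeblastdb and blastn). This is an assumption: the slides may refer to an older BLAST release whose programs and options differ, so check the documentation of the version you installed. The file names ecoli.nt (the uncompressed database file), query.fa and results.txt are placeholders.

```python
import shutil
import subprocess


def blast_plus_commands(db_fasta, query_fasta, out_path, evalue=10.0):
    """Build (not run) the database-formatting and search commands."""
    make_db = ["makeblastdb", "-in", db_fasta, "-dbtype", "nucl"]
    search = [
        "blastn",
        "-query", query_fasta,
        "-db", db_fasta,          # makeblastdb names the database after its input by default
        "-evalue", str(evalue),   # the E threshold for reporting matches
        "-out", out_path,
    ]
    return make_db, search


def run_if_installed(commands):
    """Run the commands in order; return False if any program is missing."""
    if any(shutil.which(cmd[0]) is None for cmd in commands):
        return False
    for cmd in commands:
        subprocess.run(cmd, check=True)
    return True


# Placeholder file names; adjust to your own files before running.
make_db_cmd, search_cmd = blast_plus_commands("ecoli.nt", "query.fa", "results.txt")
```

Calling run_if_installed([make_db_cmd, search_cmd]) would then format the database and run the search, provided both programs are on the PATH and the input files exist.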