Sequence Analysis Using BLAST
Arun Krishnan
BioInformatics Institute

What is BLAST?
• Basic Local Alignment Search Tool
  – Heuristic search algorithm
  – Uses the statistical methods of Karlin and Altschul (1990, 1993)
  – Tailored for sequence similarity search

Components of BLAST
• Five main programs:
  – blastp
    • compares an amino acid query sequence against a protein sequence database
  – blastn
    • compares a nucleotide query sequence against a nucleotide sequence database
  – blastx
    • compares the six-frame conceptual translation products of a nucleotide query sequence against a protein sequence database
  – tblastn
    • compares a protein query sequence against a nucleotide sequence database dynamically translated in all six reading frames
  – tblastx
    • compares the six-frame translations of a nucleotide query sequence against the six-frame translations of a nucleotide sequence database

Search Strategy
• Fundamental unit is the High-scoring Segment Pair (HSP)
• HSP
  – Consists of 2 sequence fragments of arbitrary but equal length whose alignment is locally maximal
  – Alignment score meets or exceeds a threshold
• Each HSP consists of
  – A segment from the query sequence and one from the database
• Sensitivity and speed adjusted by
  – W, T and X
• Selectivity adjusted by
  – Cutoff score

Similarity Searching Approach
• Look for similar segments (HSPs) between the query sequence and a database sequence
• Evaluate the statistical significance of any matches found
• Report only matches satisfying the user-selected threshold
• Statistical significance ascribed to a set of HSPs may be higher than for a single HSP of that set
• Only when the ascribed significance satisfies the user-selected threshold (E parameter) will the match be reported

How to find HSPs?
• Begins with identifying short words of length W in the query sequence that
  – Match or satisfy some positive-valued threshold score T when aligned with a word of the same length in a database sequence
  – T: neighborhood word score threshold
• These initial neighborhood hits act as seeds for initiating searches to find longer HSPs containing them
• Word hits are extended in both directions along each sequence for as far as the cumulative alignment score can be increased
• Extension stopped when
  – Cumulative alignment score falls off by “X” from its maximum achieved value

Scoring Schemes
• Default: BLOSUM62
  – For blastp, blastx, tblastn and tblastx
• Several PAM matrices provided
  – PAM40, PAM120 and PAM250
• BLOSUM62 is a good general-purpose matrix
• PAM120 recommended for general protein similarity searches
• pam(1) can be used to produce PAM matrices from 2 to 511
• Each matrix is most sensitive at finding similarities at its particular PAM distance
• For more thorough searches
  – Recommended to do 3 searches with PAM40, PAM120 and PAM250

Scoring Schemes Contd…
• In blastn
  – M parameter: sets the reward score for a pair of matching residues. M > 0
  – N parameter: sets the penalty score for mismatching residues. N < 0
  – Relative magnitudes of M and N determine the number of nucleic acid PAMs for which they are most sensitive at finding homologs
  – Higher ratios of M:N correspond to increasing nucleic acid PAMs
  – M:N ratio of 1 corresponds to 30 nucleic acid PAMs or 38 amino acid PAMs
    » M:N > 3 doesn’t make statistical sense
  – At higher than 40 nucleic acid PAMs or 50 amino acid PAMs, recommended to perform comparisons at the amino acid level
  – Wordlength W = 11 restricts the program to finding sequences that share at least an 11-mer stretch of 100% identity

P-Values & Alignment Scores
• Expectation and P-values depend on
  – The scoring system employed
  – The residue composition of the query sequence
  – An assumed residue composition for a typical database sequence
  – Length of the query sequence
  – Total length of the database
• Never compare alignment scores obtained from differing matrices
  – E.g., BLOSUM62 and PAM120

BLAST OUTPUT
• Consists of
  – An introduction to the program
  – A histogram of expectations
  – A series of one-line descriptions of matching database sequences
  – The actual sequence alignments
  – Parameters and statistics gathered during the search

Assignment
• Download and install the BLAST program from ftp://ftp.ncbi.nih.gov/blast/
• Download the ecoli.nt.Z database from ftp://ftp.ncbi.nih.gov/blast/db/
• Please see the handout for your assignment. Complete the assignment using the NCBI BLAST webpage.
• Redo the “nucleotide” portion of your assignment via the command line (using the BLAST program that you have installed on your computers).
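
As a rough illustration of the seed-and-extend search described under “How to find HSPs?”, the Python sketch below finds ungapped HSPs between two nucleotide strings. It is a toy under stated assumptions, not the BLAST implementation. Seeds are exact words of length W (as in blastn), so the neighborhood threshold T used for protein searches is left out. The function names (index_words, extend, find_hsps) and the default values w=4, m=5, n=-4, x=10 are hypothetical choices made for this example only.

```python
from collections import defaultdict


def index_words(seq, w):
    """Map every length-w word in seq to the positions where it starts."""
    index = defaultdict(list)
    for i in range(len(seq) - w + 1):
        index[seq[i:i + w]].append(i)
    return index


def extend(query, subject, qi, si, step, m, n, x):
    """Extend an alignment from (qi, si) in direction step (+1 or -1).

    Returns (best, best_len): the highest cumulative score reached and the
    number of residues needed to reach it. Extension stops once the running
    score has fallen x or more below the best score seen so far.
    """
    score = best = 0
    length = best_len = 0
    while 0 <= qi < len(query) and 0 <= si < len(subject):
        score += m if query[qi] == subject[si] else n  # reward M, penalty N
        length += 1
        if score > best:
            best, best_len = score, length
        elif best - score >= x:
            break  # drop-off X reached
        qi += step
        si += step
    return best, best_len


def find_hsps(query, subject, w=4, m=5, n=-4, x=10):
    """Return a set of ungapped HSPs as (score, q_start, s_start, length)."""
    hsps = set()
    subject_index = index_words(subject, w)
    for qi in range(len(query) - w + 1):
        # Every exact word hit is a seed.
        for si in subject_index.get(query[qi:qi + w], ()):
            right, rlen = extend(query, subject, qi + w, si + w, +1, m, n, x)
            left, llen = extend(query, subject, qi - 1, si - 1, -1, m, n, x)
            score = w * m + left + right
            hsps.add((score, qi - llen, si - llen, llen + w + rlen))
    return hsps


if __name__ == "__main__":
    best = max(find_hsps("GGGACGTACGTCCC", "TTTACGTACGTAAA"))
    print(best)  # (40, 3, 3, 8): the shared ACGTACGT stretch
```

Raising w gives fewer seeds and a faster but less sensitive search, while raising x lets extensions run across longer mismatched stretches before giving up. This is the speed and sensitivity trade-off that W and X control in the slides above.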