Sequence Analysis Using BLAST
Arun Krishnan
BioInformatics Institute

What is BLAST?
• Basic Local Alignment Search Tool
  – Heuristic search algorithm
  – Uses the statistical methods of Karlin and Altschul (1990, 1993)
  – Tailored for sequence similarity search

Components of BLAST
• Five main programs:
  – blastp
    • compares an amino acid query sequence against a protein sequence database
  – blastn
    • compares a nucleotide query sequence against a nucleotide sequence database
  – blastx
    • compares the six-frame conceptual translation products of a nucleotide query sequence against a protein sequence database
  – tblastn
    • compares a protein query sequence against a nucleotide sequence database dynamically translated in all six reading frames
  – tblastx
    • compares the six-frame translations of a nucleotide query sequence against the six-frame translations of a nucleotide sequence database

Search Strategy
• Fundamental unit is the High-scoring Segment Pair (HSP)
• HSP
  – Consists of 2 sequence fragments of arbitrary but equal length whose alignment is locally maximal
  – Alignment score meets or exceeds a threshold
• Each HSP consists of
  – A segment from the query sequence and one from the database
• Sensitivity and speed adjusted by
  – W, T and X
• Selectivity adjusted by
  – Cutoff score

Similarity Searching Approach
• Look for similar segments (HSPs) between the query sequence and a database sequence
• Evaluate the statistical significance of any matches found
• Report only matches satisfying the user-selected threshold
• Statistical significance ascribed to a set of HSPs may be higher than for a single HSP of that set
• Only when the ascribed significance satisfies the user-selected threshold (E parameter) will the match be reported

How to find HSPs?
• Begins by identifying short words of length W in the query sequence that
  – Match, or score at least some positive-valued threshold T, when aligned with a word of the same length in a database sequence
  – T: neighborhood word score threshold
• These initial neighborhood hits act as seeds for initiating searches to find longer HSPs containing them
• Word hits are extended in both directions along each sequence for as far as the cumulative alignment score can be increased
• Extension is stopped when
  – Cumulative alignment score falls off by “X” from its maximum achieved value

Scoring Schemes
• Default: BLOSUM62
  – For blastp, blastx, tblastn and tblastx
• Several PAM matrices provided
  – PAM40, PAM120 and PAM250
• BLOSUM62 is a good general-purpose matrix
• PAM120 recommended for general protein similarity searches
• pam(1) can be used to produce PAM matrices from 2 to 511
• Each matrix is most sensitive at finding similarities at its particular PAM distance
• For more thorough searches
  – Recommended to do 3 searches with PAM40, PAM120 and PAM250

Scoring Schemes Contd…
• In blastn
  – M parameter: sets the reward score for a pair of matching residues. M > 0
  – N parameter: sets the penalty score for mismatching residues.
    N < 0
  – Relative magnitudes of M and N determine the number of nucleic acid PAMs for which they are most sensitive at finding homologs
  – Higher ratios of M:N correspond to increasing nucleic acid PAMs
  – M:N ratio of 1 corresponds to
    » 30 nucleic acid PAMs or 38 amino acid PAMs
    » M:N > 3 doesn’t make statistical sense
  – At higher than 40 nucleic acid PAMs or 50 amino acid PAMs, recommended to perform comparisons at the amino acid level
  – Wordlength W = 11 restricts the program to finding sequences that share at least an 11-mer stretch of 100% identity

P-Values & Alignment Scores
• Expectation and P-values depend on
  – The scoring system employed
  – The residue composition of the query sequence
  – An assumed residue composition for a typical database sequence
  – Length of the query sequence
  – Total length of the database
• Never compare alignment scores obtained from differing matrices
  – E.g., BLOSUM62 and PAM120

BLAST OUTPUT
• Consists of
  – An introduction to the program
  – A histogram of expectations
  – A series of one-line descriptions of matching database sequences
  – The actual sequence alignments
  – Parameters and statistics gathered during the search

Assignment
• Download and install the BLAST program from ftp://ftp.ncbi.nih.gov/blast/
• Download the ecoli.nt.Z database from ftp://ftp.ncbi.nih.gov/blast/db/
• Please see the handout for your assignment. Complete the assignment using the NCBI BLAST webpage.
• Redo the “nucleotide” portion of your assignment via the command line (using the BLAST program that you have installed on your computers).
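
Supplementary sketch: seed-and-extend
The search described under “How to find HSPs?” can be illustrated with a small Python sketch. It is a simplification under stated assumptions, not the BLAST implementation: seeds are exact word matches of length W (roughly the blastn case, with no neighborhood threshold T), extension is ungapped, and the function names and the values M = 5, N = -4, X = 20 are hypothetical choices made for illustration only.

```python
def find_seeds(query, subject, w):
    """Yield (query_pos, subject_pos) for every exact shared word of length w."""
    index = {}
    for i in range(len(query) - w + 1):
        index.setdefault(query[i:i + w], []).append(i)
    for j in range(len(subject) - w + 1):
        for i in index.get(subject[j:j + w], ()):
            yield i, j


def extend(query, subject, qi, sj, step, m, n, x):
    """Extend without gaps in one direction (step = +1 right, -1 left).

    Returns (best_gain, best_len): the highest cumulative score reached
    and the number of residues needed to reach it.
    """
    score = best = 0
    length = best_len = 0
    while 0 <= qi < len(query) and 0 <= sj < len(subject):
        score += m if query[qi] == subject[sj] else n
        length += 1
        if score > best:
            best, best_len = score, length
        elif best - score >= x:
            # Score has fallen off by X from its maximum: stop extending.
            break
        qi += step
        sj += step
    return best, best_len


def hsp_from_seed(query, subject, qi, sj, w, m=5, n=-4, x=20):
    """Grow a word hit at (qi, sj) into an ungapped HSP.

    Returns (score, query_start, query_end, subject_start), end exclusive.
    """
    seed_score = w * m  # an exact word hit scores M per matching residue
    right_gain, right_len = extend(query, subject, qi + w, sj + w, +1, m, n, x)
    left_gain, left_len = extend(query, subject, qi - 1, sj - 1, -1, m, n, x)
    score = seed_score + left_gain + right_gain
    return score, qi - left_len, qi + w + right_len, sj - left_len
```

Each direction is trimmed back to the point of maximum score, so mismatches trailing past the best position are not included in the reported segment. A smaller X stops extension sooner (faster, less sensitive), while a larger X lets the extension cross longer mismatched stretches.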