Basics of sequence analysis
Ch.6 and Ch.7

• Sequence acquisition
• Sequence data
• Reconstructing sequence
• Sequence alignment
• Alignment algorithms
• Database searching
• Uses of alignments

[Figure: http://upload.wikimedia.org/wikipedia/commons/c/cb/Sequencing.jpg]

ABI era
[Figure. Source: Wikipedia]

Scaling up by brute force
$3,000,000,000 genome
[Figures. Source: G. Church]

[Figure: A Genome Analyzer flowcell (left) and imaging region or ‘tile’ (right), with a magnified section showing a cluster. Source: Bioinformatics. 2009 Sep 1;25(17):2194-9. Epub 2009 Jun 23.]

[Figure: http://www.eurofinsdna.com. Source: Bioinformatics. 2009 Sep 1;25(17):2194-9. Epub 2009 Jun 23.]

Sequence data
[Figure labels: Name, Sequence, Confidence call]

Alignment
Read: CCCATCGCCCAGTTCCAGATCCCTTGCCTGATTAAAAATAC
Reference genome
[Figure: the read aligned to the genome reference under Hypothesis #1 and Hypothesis #2]

Quality
Quality, Q, is a function of the probability, P, that the sequencer called a wrong base:

$Q = -10 \log_{10}(P)$

• Q=10: 1 in 10 chance that the base was miscalled
• Q=20: 1 in 100 chance that the base was miscalled
• Q=30: 1 in 1000 chance that the base was miscalled
Q is estimated by the sequencing software.
[Figure: Hypothesis #1 and Hypothesis #2 again, first without quality values, then with the read at Q=30 under Hypothesis #1 and at Q=10 under Hypothesis #2]

A penalty scheme to account for different types of dissimilarities
Let’s stipulate that small gaps (indels) occur in bacterial genomes at 1 in 10K positions:

$\mathrm{Penalty}_{gap} = -10 \log_{10}(P_{gap}) = -10 \log_{10}(0.0001) = 40$

Let’s stipulate that SNPs occur in bacterial genomes at 1 in 1K positions, but, depending on Q, a sequencing error may be more likely:

$\mathrm{Pen}_{min} \equiv \min\{-10 \log_{10}(P_{miscall}),\ -10 \log_{10}(P_{SNP})\} \equiv \min\{Q,\ -10 \log_{10}(0.001)\} = \min\{Q, 30\}$

[Figure labels: Penalty_gap = 40; Penalty_SNP = 30; Penalty = 40; Q=40, Penalty = 70; Q=10, Penalty = 50]

How do we find our sequence in the first place?
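The quality and penalty definitions above can be sketched in a few lines of Python. This is only an illustration under the stipulated rates from the slides (gaps at 1 in 10K positions, SNPs at 1 in 1K); the function names are hypothetical and do not come from any sequencing package.

```python
import math

def phred_quality(p_error):
    """Q = -10 * log10(P), where P is the probability of a miscalled base."""
    return -10 * math.log10(p_error)

def gap_penalty(p_gap=1e-4):
    """Penalty for a small gap; the 1-in-10K default is the stipulated indel rate."""
    return -10 * math.log10(p_gap)

def mismatch_penalty(q, p_snp=1e-3):
    """min{Q, -10*log10(P_SNP)}: the cheaper of 'sequencing error' and 'true SNP'."""
    return min(q, -10 * math.log10(p_snp))

if __name__ == "__main__":
    for q in (40, 10):
        total = gap_penalty() + mismatch_penalty(q)
        print(f"Q={q}: one gap plus one mismatch costs about {total:.0f}")
```

If the figure totals of 70 (Q=40) and 50 (Q=10) are read as one gap plus one mismatch, which is an assumption about the lost figure, they correspond to 40 + min{40, 30} and 40 + min{10, 30}.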
Local alignment
Given a string P (“pattern”) of length m and a string T (“text”) of length n, find substrings a and b of P and T, respectively, having maximal global alignment score.

Event            Example      Score
Match            TTG / TTG    +15
Mismatch         TCG / TTG    -30
Gap in text      TCG / T-G    -40
Gap in pattern   T-G / TTG    -40

Smith-Waterman
[Six figure slides; images not preserved in the transcript]

Dynamic Programming
[Figure]

Indexing
[Two figure slides]

Dot Plots
Window: 2, Stringency: 1
[Two figure slides]

Dot matrix analysis of DNA sequence encoding λ cI repressor (vertical) and P22 c2 repressor
Window: 11; Stringency: 7

Analysis of the regions of low complexity
[Figure]

Calculation of an alignment score
[Figure]

Pairwise Alignment Examples (II)
A dispersed alignment without gaps may have a higher score than a more visually appealing alignment with gaps.

An alignment scoring system is required to evaluate how good an alignment is
• positive and negative values assigned
• gap creation and extension penalties
• positive score for identities
• some partial positive score for conservative substitutions
• global versus local alignment
• use of a substitution matrix

“Window location” by FASTA and BLAST
[Figure]

Two kinds of sequence alignment: global and local
• The global alignment algorithm of Needleman and Wunsch (1970).
• The local alignment algorithm of Smith and Waterman (1981).
• BLAST, a heuristic version of Smith-Waterman.

Should the result of the alignment include all amino acids or nucleotides, or just those that "match"?
• If yes, a global alignment is desired.
  – In a global alignment, the presence of mismatched elements is neutral: it doesn't affect the overall match score.
• If no, a local alignment is desired.
  – Local alignments are accomplished by including negative scores for "mismatched" positions, so scores get worse as we move away from the region of match.
  – Instead of starting the traceback with the highest value in the first row or column, start with the highest value in the entire matrix and stop when the score hits zero.

What is Database Search?
• Find a particular (usually short) sequence in a database of sequences (or one huge sequence).
• The problem is identical to local sequence alignment, but on a much larger scale.
• We must also have some idea of the significance of a database hit.
  – Databases always return some kind of hit; how much attention should be paid to the result?
• A similar problem is the global alignment of two large sequences.
• General idea: good alignments contain high-scoring regions.

Imperfect Alignment
• What is an imperfect alignment?
• Why imperfect alignment?
• The result may not be optimal.
• Finding the optimal alignment is usually too costly in terms of time and memory.

Database Search Methods
• Hash table based methods
  – FASTA family
    • FASTP, FASTA, TFASTA, FASTAX, FASTAY
  – BLAST family
    • BLASTP, BLASTN, TBLAST, BLASTX, BLAT, BLASTZ, MegaBLAST, PsiBLAST, PhiBLAST
  – Others
    • FLASH, PatternHunter, SSAHA, SENSEI, WABA, GLASS
• Suffix tree based methods
  – Mummer, AVID, Reputer, MGA, QUASAR

History of sequence searching
• 1970: NW
• 1981: SW
• 1985: FASTA
• 1990: BLAST

Hash Table
• K-gram = subsequence of length K
• A^K entries
  – A is the alphabet size
• Linear time construction
• Constant lookup time

FASTP
Lipman & Pearson, 1985

FASTP
• Three phase algorithm
  1. Find short good matches using k-grams
     – K = 1 or 2
  2. Find start and end positions for good matches
  3. Use DP to align good matches

FASTP: Phase 1 (2)
• Similar to dot plot
• Offsets range from 1-m to n-1
• Each offset is scored as
  – (# matches) - (# mismatches)
• Diagonals (offsets) with large scores show local similarities

FASTP: Phase 2
• The 5 best diagonal runs are found
• Rescore these 5 regions using PAM250.
  – Initial score
• Indels are not considered yet

FASTP: Phase 3
• Sort the aligned regions in descending order of score
• Optimize these alignments using Needleman-Wunsch
• Report the results

FASTA – Improvement Over FASTP
Pearson 1995

FASTA (1)
• Phase 2: Choose the 10 best diagonal runs instead of 5

FASTA (2)
• Phase 2.5
  – Eliminate diagonals that score less than some given threshold.
  – Combine matches to find longer matches. Joining incurs a join penalty similar to a gap penalty.

FASTA Variations
• TFASTAX and TFASTAY: protein query against a DNA library in all reading frames
• FASTAX, FASTAY: DNA query in all reading frames against a protein database

BLAST
Altschul, Gish, Miller, Myers, Lipman, 1990

BLAST (or BLASTP)
• BLAST – Basic Local Alignment Search Tool
• An approximation of Smith-Waterman
• Designed for database searches
  – Short query sequence against a long database sequence or a database of many sequences
• Sacrifices search sensitivity for speed

BLAST Algorithm (1)
• Eliminate low complexity regions from the query sequence.
  – Replace them with X (protein) or N (DNA)
• Hash table on the query sequence.
  – K = 3 for proteins
  – Example: query MCGPFILGTYC yields the k-grams MCG, CGP, ...

BLAST Algorithm (2)
• For each k-gram, find all k-grams that align with a score of at least cutoff T using BLOSUM62
  – Example: query PQGMCGPFILGTYC yields the k-grams PQG, QGM, ...
  – 20^k candidates
  – ~50 on average per k-gram
  – ~50n for the entire query
• Build the hash table
  – Example neighborhood of PQG with T = 13: PQG 18, PEG 15, PRG 14, PSG 13 (kept); PQA 12 (below T)

BLAST Algorithm (3)
• Sequentially scan the database and locate each k-gram in the hash table
• Each match is a seed for an ungapped alignment.

BLAST Algorithm (4)
• HSP (High Scoring Pair) = a match between a query word and the database
• Find a “hit”: two non-overlapping HSPs on a diagonal within distance A
• Extend the hit until the score falls more than a threshold value, X, below the best score found so far

BLAST Algorithm (5)
• Keep only the extended matches that have a score of at least S.
• Determine the statistical significance of the result

BLASTN
• BLAST for nucleic acids
• K = 11
• Exact match instead of neighborhood search.

BLAST Variations

Program   Query                       Target                      Type
BLASTP    Protein                     Protein                     Gapped
BLASTN    Nucleic acid                Nucleic acid                Gapped
BLASTX    Nucleic acid (translated)   Protein                     Gapped
TBLASTN   Protein                     Nucleic acid (translated)   Gapped
TBLASTX   Nucleic acid (translated)   Nucleic acid (translated)   Ungapped

Even More Variations
– PsiBLAST (iterative)
– BLAT, BLASTZ, MegaBLAST
– FLASH, PatternHunter, SSAHA, SENSEI, WABA, GLASS
– Main differences are
  • Seed choice (k, gapped seeds)
  • Additional data structures

Suffix Tree
• Tree structure that contains all suffixes of the input sequence
• Suffixes of TGAGTGCGA:
  – TGAGTGCGA
  – GAGTGCGA
  – AGTGCGA
  – GTGCGA
  – TGCGA
  – GCGA
  – CGA
  – GA
  – A

Suffix Tree Example
[Figure]

Suffix Tree Analysis
• O(n) space and construction time
  – 10n to 70n space usage reported
• O(m) search time for an m-letter sequence
• Good for
  – Small data
  – Exact matches

Suffix Array
• 5 bytes per letter
• O(m log n) search time
• Better space usage
• Slower search

Mummer
[Figure]

Other Sequence Comparison Tools
• Reputer, MGA, AVID
• QUASAR (suffix array)

Uses of sequence alignment
• Search databases
• Assess similarity, relatedness
• Identify structural variations (point, gross)
• Determine specificity of primers
• Evaluate complexity of a sequence
• Assemble sequence de novo
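
As a closing illustration, the local alignment recipe from the Smith-Waterman slides (negative scores for mismatches and gaps, traceback from the highest cell in the entire matrix, stop when the score hits zero) can be sketched as follows. This is a minimal sketch, not the course's implementation: it assumes a linear gap penalty and uses the match/mismatch/gap values from the slide table as defaults, and the function name smith_waterman is hypothetical.

```python
def smith_waterman(pattern, text, match=15, mismatch=-30, gap=-40):
    """Return (best score, aligned pattern part, aligned text part)."""
    m, n = len(pattern), len(text)
    # H[i][j] = best score of a local alignment ending at pattern[i-1], text[j-1]
    H = [[0] * (n + 1) for _ in range(m + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = match if pattern[i - 1] == text[j - 1] else mismatch
            H[i][j] = max(0,                    # start a fresh local alignment
                          H[i - 1][j - 1] + s,  # match or mismatch
                          H[i - 1][j] + gap,    # gap in text
                          H[i][j - 1] + gap)    # gap in pattern
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Traceback: start at the highest cell anywhere, stop when the score hits zero.
    i, j = best_pos
    a, b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        s = match if pattern[i - 1] == text[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + s:
            a.append(pattern[i - 1]); b.append(text[j - 1]); i -= 1; j -= 1
        elif H[i][j] == H[i - 1][j] + gap:
            a.append(pattern[i - 1]); b.append("-"); i -= 1
        else:
            a.append("-"); b.append(text[j - 1]); j -= 1
    return best, "".join(reversed(a)), "".join(reversed(b))

if __name__ == "__main__":
    print(smith_waterman("TTG", "ACTTGCA"))
```

With penalties this steep relative to the match reward, the best local alignment is usually an exact shared substring; softer mismatch and gap values would let the alignment extend through imperfect regions. The matrix costs O(mn) time and memory, which is why the database search methods above rely on hashing or suffix structures to find seeds first.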