Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Pairwise Sequence Analysis-I • • • • Analogous Concepts Domain Concepts Dot-matrix Analysis Pairwise Sequence Analysis – Why is it useful? – What are the underlying concepts? – What algorithms are used? – Limitations/Open questions Lecture 2 CS566 1 Concepts • Pairwise “everything” analysis – Humans are experts at pairwise comparison – “Peas of a pod” – Basis for pecking order in society – Basis for notion of fairness (“He got a better deal than I did!”) – Basis for grouping – Basis for sorting/ranking algorithms – Basis for search strategies Lecture 2 CS566 2 Concepts • Identity versus similarity – Brand name versus generic products • Coincidence versus plagiarism – Co-discoveries versus “Microsoft Explorer” • “What you see is not what you got” – Process not evident from product • “If it ain’t broke, don’t fix it” – Why messy code that works is left to remain messy Lecture 2 CS566 3 Domain Concepts • Sequences change over generations – More changes in DNA than in proteins (Why?) • Causes of change – Error rates of replication ~ 1 in a billion – Exchange • Legitimate: Recombination • Illegitimate: Lateral transfer (a la Napster) – Mutation • Exposure to chemicals Lecture 2 CS566 4 Dot-matrix Analysis A B L E W A S O N E E R A \ B \ L \ E \ \ W \ A \ S \ I E \ \ R \ E \ \ I S \ A \ W \ E \ \ L \ B \ A \ E O N E S A W E L B A \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ Lecture 2 CS566 \ \ \ \ 5 Why (Motivation) 1. Given sequence A, what are its properties? – Find any sequence(s) B similar to A, where B {Sequences with known properties} 2. Given sequences A & B, are they – – – Identical? Similar? Evolutionary related? 3. Given sequence A, is this a new discovery? 4. Given a set of sequences, can they be clustered into families of related sequences? Lecture 2 CS566 6 Components • Compare the two sequences and assign score • Evaluate the score 1. Use of a scoring scheme to assign a value to a candidate alignment 2. Finding the best alignment between two sequences 3. Use of a probability model to assess the significance of the similarity (coincidence or plagiarism?) Lecture 2 CS566 7 Concepts - Alignment • • • • Sequences may be aligned ‘locally’ (look for regions/subsequences that are similar) or ‘globally’ (align along entire length) The result of a local alignment may differ from that a global alignment (Why?) Potentially large number of alignments to be considered Each alignment is a path through a fully connected dot matrix graph, i.e., a subset of the set of all diagonal edges, connected by horizontal or vertical lines. Lecture 2 CS566 8 Concepts – Scoring scheme • Scores are based on amino-acid substitution matrices (log-odds ratio of observed versus expected random substitutions) – PAM (Percent amino acid substitution) matrices: Based on evolutionary model. E.g., PAM 1%, PAM 250%. – BLOSUM (Blocks substitution) matrices: Based on percent identity. E.g., Blosum50, Blosum62. Lecture 2 CS566 9 Concepts – Scoring scheme Identity based scoring (match XOR non-match) Isawa --- dancer Isawatapdancer Isawa ---- danc-er Isawapinkpanther Similarity based scoring (partial matching) Eyl dkv Lecture 2 CS566 10 Algorithms • Optimal/Exact solutions: – Take longer time – Typically used for comparing a small number of sequences – (Needleman-Wunsch; Smith-Waterman) • Heuristic: – Frequently close to the optimal solution – Rapid – Typically used for 1:n searches involving large number of sequences – (BLAST, FASTA) Lecture 2 CS566 11 Open Questions/Limitations Highly similar sequences or highly dissimilar sequences are easily found BUT • Gray area lies in weak similarity; probability distribution is continuous • Other forms of similarity (e. g., structural or species similarity) may be necessary reach a decision • Sequence similarity per se does not guarantee functional equivalence; small difference can be responsible for large difference in function Lecture 2 CS566 12