Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Multiple Alignment and Phylogenetic Trees Csc 487/687 Computing for Bioinformatics Alignment problem Given a set of sequences, produce a multiple alignment which corresponds as well as possible to the biological relationships between the corresponding bio-molecules For homologous proteins Two residues should be aligned (on top of each other) if they are homologous (evolved from the same residue in a common ancestor protein) if they are structurally equivalent Automatic approach Need a way of scoring alignments fitness function which for an alignment quantifies its “goodness” Need an algorithm for finding alignments with good scores Not all methods provide a scoring function for the final alignment! Analysis of fitness function One can test whether the alignments optimal under a given fitness function correspond well to the biological relationships between the sequences For example, if the structure of (some of) the proteins are known. Scoring Alignments In order to find an optimal alignment, we need to be able to measure how good an alignment is Sum of pairs (SP) method: in a column, score each pair of letters and total the scores. Pairs of gaps score 0. Total up scores for each column SP Method Example S S I I S K K E Using BLOSUM62 matrix, gap penalty -8 In column 1, we have pairs -8 - 8 + 4 = -12 -,S -,S S,S k(k-1)/2 pairs per column Align by use of dynamic programming Dynamic programming finds best alignment of k sequences with given scoring scheme For two sequences there are three different column types For three sequences there are seven different column types x means an amino acid, - a blank Sequence1 x - x x - - x Sequence2 x x - x - x Sequence3 x x x - x - x Use of dynamic programming Dynamic programming finds best alignment of k sequences given scoring scheme Algorithm for dynamic programming Analysis O(nk) entries to fill Each entry combines O(2k) other entries Costs O(k2) to calculate each SP score Overall cost is O(k2 2k nk), or exponential in the number of sequences! NP-complete General progressive alignment Algorithm. General progressive alignment. Progressive alignment of the sequences {s1, s2, . . . , sm} Var C current set of alignments begin C := ∅ for i := 1 to m do C := C union {{si }} end one alignment of each seq. for i := 1 to m − 1 do choose two alignments Ap,Aq from C; C := C − {Ap,Aq } Ar := align(Ap,Aq );C := C union {Ar } end C now contains the (single) final alignment end The Clustal Algorithm Three steps: 1 2 3 Compare all pairs of sequences to obtain a similarity matrix Based on the similarity matrix, make a guide tree relating all the sequences Perform progressive alignment where the order of the alignments is determined by the guide tree (A) 1 pairwise comparison 2 clustering/making tree (B) 3 Align according to tree Clustal - summary Does not use a score for the final alignment Each pairwise alignment is done using dynamic programming Heuristics are used - tailored to globular proteins Graphical version: ClustalX Phylogeny The basic principle is that the origin of similarity is common ancestry. The field of phylogeny has the goals of working out the relationships among species, populations, individuals, or genes. Usually expressed as a tree. Phylogeny The basic principle is that the origin of similarity is common ancestry. The field of phylogeny has the goals of working out the relationships among species, populations, individuals, or genes. Usually expressed as a tree. Phylogeny A statement of phylogeny among objects assumes homology and depends on classification. Phylogeny states a topology of the relationships based on classification according to similarity of one or more sets of characters, or on a model of evolutionary processes. Phylogeny It is rare for species relationships and ancestry to be directly observable. Evolutionary trees determined from genetic data are often based on inferences from the patterns of similarity, which are all that is observable among species living now.