I519 Introduction to Bioinformatics, 2012

Genome Comparison

Whole genome comparison/alignment
• Build better phylogenies
• Identify polymorphism
• Detect gene-level events
• Compare different assemblies of a single genome

Whole genome comparison
• Aligning whole genomes is a fundamentally different problem from aligning short sequences.
• Need to consider the presence of large-scale evolutionary events
  – Gene duplication & loss
  – Horizontal gene transfer
  – Repetitive sequences (repeats)
  – Gene rearrangement and inversion
• Pairwise and multiple genome comparison
  – Multiple genome alignment provides a basis for research into comparative genomics and the study of evolutionary dynamics.

Genome evolution
[Figure: Genome A and the effect of each event: point substitution, translocation, inversion, inversion and translocation, insertion, repeat (duplication)]

Basic algorithms: use anchoring as a heuristic to speed alignment
• Assumption: highly similar subsequences can be found quickly and are likely to be part of the correct global alignment.
• These local alignments are used to anchor a global alignment (alignment anchors), reducing the number of possible global alignments considered during a subsequent O(n^2) dynamic programming step.
• Select a single collinear set of alignment anchors
• Many tools have been developed

Rearrangement free or not
• Free of rearrangement
  – Assume the input sequences are free from significant rearrangements of sequence elements, selecting a single collinear set of alignment anchors
  – Pairwise: MUMmer, GLASS, AVID, and WABA align pairs of long sequences
  – Multiple alignment: MAVID, MLAGAN, and MGA
• Consider rearrangement
  – Shuffle-LAGAN (2003, the first genome comparison method described that explicitly deals with genome rearrangements)
  – MultiPipMaker (2003)
  – Mauve (2004, multiple)
  – Enredo and Pecan (2008)
  – GR-Aligner (2009, pairwise)

MUMmer method
• MUMmer combines suffix trees, the longest increasing subsequence (LIS), and Smith-Waterman (SW) alignment
• Maximal Unique Match (MUM) identification: identify the longest strings in Genome 1 that have exactly one identical match in Genome 2
  – Naïve method: O(N^2)
  – Using a suffix tree: O(N)
• Ordered MUM selection: identify the longest set of MUMs that occur in the same order in each of the genomes (using a variation of the well-known algorithm to find the LIS of a sequence of integers)
• Processing non-matched regions: classify non-matched regions as insertions, SNPs, or highly polymorphic regions

Suffix tree
• A suffix tree is a data structure that allows one to find, extremely efficiently, all distinct subsequences in a given sequence.
• Weiner (1973) and McCreight (1976) gave efficient (linear-time) algorithms to construct suffix trees.
• For the task of comparing two DNA sequences, suffix trees allow one to quickly find all subsequences shared by the two inputs.
• The genome alignment is then built upon this information.
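
The first two MUMmer steps described above (MUM identification and ordered MUM selection) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it uses the naïve quadratic search rather than a suffix tree, a plain O(n^2) dynamic-programming LIS rather than MUMmer's own variant, and the function names `find_mums`, `longest_increasing_subsequence`, and `ordered_mums` are hypothetical.

```python
def find_mums(a, b, min_len=1):
    """Naive search: return (start_in_a, start_in_b, length) for each MUM."""
    mums = []
    for i in range(len(a)):
        for j in range(len(b)):
            if a[i] != b[j]:
                continue
            # A match that can be extended to the left is not maximal.
            if i > 0 and j > 0 and a[i - 1] == b[j - 1]:
                continue
            # Extend to the right as far as the two sequences agree.
            k = 0
            while i + k < len(a) and j + k < len(b) and a[i + k] == b[j + k]:
                k += 1
            s = a[i:i + k]
            # Unique in a sequence: first and last occurrence coincide.
            if k >= min_len and a.find(s) == a.rfind(s) and b.find(s) == b.rfind(s):
                mums.append((i, j, k))
    return mums


def longest_increasing_subsequence(values):
    """O(n^2) dynamic programming; returns one LIS (several may exist)."""
    if not values:
        return []
    best_len = [1] * len(values)   # length of the best chain ending at i
    prev = [-1] * len(values)      # predecessor of i in that chain
    for i in range(len(values)):
        for j in range(i):
            # On ties, keep the later predecessor.
            if values[j] < values[i] and best_len[j] + 1 >= best_len[i]:
                best_len[i] = best_len[j] + 1
                prev[i] = j
    i = max(range(len(values)), key=best_len.__getitem__)
    chain = []
    while i != -1:
        chain.append(values[i])
        i = prev[i]
    return chain[::-1]


def ordered_mums(mums):
    """Keep a longest chain of MUMs that is increasing in both sequences."""
    mums = sorted(mums)  # order by position in the first sequence
    keep = set(longest_increasing_subsequence([j for _, j, _ in mums]))
    return [m for m in mums if m[1] in keep]


print(find_mums("ATCGTA", "ATCGAT"))
print(longest_increasing_subsequence([1, 3, 2, 4, 6, 7, 5]))
```

For the two toy strings the only MUM is `ATCG` at the start of both sequences; every other shared substring occurs more than once in at least one of them. The remaining step, aligning the regions between chained MUMs, is left out of this sketch.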
Suffix tree for finding MUMs
[Figure: suffix tree for the sequence “gaaccgacct”]
• An internal node is a repeated sequence in the original string
• A leaf is a unique suffix
• Every unique matching sequence is represented by an internal node with exactly two child nodes, such that the child nodes are leaf nodes from different genomes

A toy example
Suffixes of ATCGTA# (numbered 1-7) and ATCGAT$ (numbered 8-14):

  No.  Suffix       No.  Suffix
  1    ATCGTA#      8    ATCGAT$
  2    TCGTA#       9    TCGAT$
  3    CGTA#        10   CGAT$
  4    GTA#         11   GAT$
  5    TA#          12   AT$
  6    A#           13   T$
  7    #            14   $

The same suffixes in sorted order:

  No.  Suffix
  7    #
  14   $
  6    A#
  12   AT$
  8    ATCGAT$
  1    ATCGTA#
  10   CGAT$
  3    CGTA#
  11   GAT$
  4    GTA#
  13   T$
  5    TA#
  9    TCGAT$
  2    TCGTA#

[Figure: suffix tree built from the suffixes of both strings, with leaves labeled by suffix number]

Suffix tree & suffix array for string matching
• Preprocess text T, not pattern P
  – O(m) preprocessing time (m: the length of the text)
  – O(n+k) search time (n: the length of the pattern)
    • k is the number of occurrences of P in T
• Match pattern P against the tree starting at the root until
  – Case 1: P is completely matched
    • Every leaf below this match point is the starting location of P in T
  – Case 2: no match is possible
    • P does not occur in T

A toy example of string (pattern) matching
• T = xabxac
  – suffixes = {xabxac, abxac, bxac, xac, ac, c}
• Pattern P1: xa (Case 1: occurs at positions 1 and 4)
• Pattern P2: xb (Case 2: does not occur in T)
[Figure: suffix tree for xabxac with leaves labeled 1-6]

Suffix array
• Suffix array: a sorted list of the suffixes of a given string; the start positions are listed in the lexicographical (alphabetical) order of their suffixes
• Straightforward construction: O(m^2 log m), reduced to O(m log m) (utilizing partial sorts); m: the length of the text
• A suffix array enables binary search for any substring, e.g. CAD: O(n log m), reduced to O(n + log m) if the LCP (longest common prefix) is used; n: the length of the pattern
• A suffix array is more compact than a suffix tree

Example: suffix array of ABRACADABRA# (webglimpse.net/pubs/suffix.pdf)

  Start  Suffix
  11     #
  10     A#
  7      ABRA#
  0      ABRACADABRA#
  3      ACADABRA#
  5      ADABRA#
  8      BRA#
  1      BRACADABRA#
  4      CADABRA#
  6      DABRA#
  9      RA#
  2      RACADABRA#

Ordered MUM selection
[Figure: genomes G1 and G2 with MUMs labeled 1, 2, 3, 4, ... and A, B, C, D, ...]
MUMs: <1,A>, <2,C>, <3,B>, <4,D>
Possible selections:
  – <1,A>, <2,C>, <4,D>
  – <1,A>, <3,B>, <4,D>
Then process non-matched regions (by a dynamic programming algorithm)
See more at www.cs.rice.edu/~nakhleh/COMP571/GenomeAlignment.ppt

LIS algorithm
• The B positions are given by the sequence 1, 3, 2, 4, 6, 7, 5
• An LIS (longest increasing subsequence) is 1, 2, 4, 6, 7 (1, 3, 4, 6, 7 is equally long)
• The LIS problem can be solved by a dynamic programming algorithm

Mauve
• Mauve is a system for efficiently constructing multiple genome alignments in the presence of large-scale evolutionary events
• Identifies conserved genomic regions, rearrangements and inversions in conserved regions, and the exact sequence breakpoints of such rearrangements across multiple genomes.
• Also performs traditional multiple alignment of conserved regions to identify nucleotide substitutions and indels, using the progressive dynamic programming approach of CLUSTALW

Mauve's anchor selection algorithm
• Relaxed anchor selection: does not assume that the genomes under study are collinear
• Identifies and aligns regions of local collinearity called locally collinear blocks (LCBs)
  – Each LCB is a homologous region of sequence shared by two or more of the genomes under study
  – An LCB does not contain any rearrangements of homologous sequence (within the LCB)

Mauve algorithm
1. Find local alignments (multi-MUMs), using a seed-and-extend hashing method (time complexity O(G^2 n + G n log Gn), where G is the number of genomes and n the average genome length)
2. Use the multi-MUMs to calculate a phylogenetic guide tree.
3. Select a subset of the multi-MUMs to use as anchors; these anchors are partitioned into collinear groups called LCBs, using a greedy breakpoint elimination algorithm
4. Perform recursive anchoring to identify additional alignment anchors within and outside each LCB.
5. Perform a progressive alignment of each LCB using the guide tree.

Greedy breakpoint elimination in three genomes
Darling A C et al. Genome Res.
2004;14:1394-1403 ©2004 by Cold Spring Harbor Laboratory Press

An example of LCB identified among nine enterobacterial genomes
Darling A C et al. Genome Res. 2004;14:1394-1403

LCBs identified among concatenated chromosomes of the mouse, rat, and human genomes
Darling A C et al. Genome Res. 2004;14:1394-1403

Turnip vs cabbage: almost identical mtDNA gene sequences
• In the 1980s Jeffrey Palmer studied the evolution of plant organelles by comparing the mitochondrial genomes of cabbage and turnip (using physical mapping)
• 99%-99.9% similarity between genes
• Despite these surprisingly similar gene sequences, the two genomes differed in gene order
• This study helped pave the way to analyzing genome rearrangements in molecular evolution

Why we care about genome rearrangement
• Evolutionary and functional analysis
• Examples:
  – “Dynamics of Genome Rearrangement in Bacterial Populations”, using comparison of eight Yersinia (pathogenic bacteria) genomes. PLoS Genet 4(7): e1000128, 2008
  – Genome-wide DNA excision (Oxytricha trifallax destroys 95% of its germline genome during development, including the elimination of all transposon DNA, through an exaggerated process of genome rearrangement). Science, Vol. 324, no. 5929, pp. 935-938, 2009

“Transforming” cabbage into turnip

Reversals and breakpoints
• Before: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
• After reversing the segment 4..8: 1, 2, 3, -8, -7, -6, -5, -4, 9, 10
• The reversal introduced two breakpoints (disruptions in order).

Genome rearrangements
[Figure: mouse X chromosome and human X chromosome, both descended from an unknown ancestor ~75 million years ago]
• What are the similarity blocks and how to find them?
• What is the architecture of the ancestral genome?
• What is the evolutionary scenario for transforming one genome into the other?
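
The reversal in the "Reversals and breakpoints" example can be reproduced with a short Python sketch. It assumes the usual convention for signed permutations: the permutation is framed with 0 and n+1, and a breakpoint is any adjacent pair (x, y) with y ≠ x + 1. The helper names `apply_reversal` and `count_breakpoints` are hypothetical.

```python
def apply_reversal(perm, start, end):
    """Reverse perm[start:end] (0-based, end exclusive) and flip its signs."""
    return perm[:start] + [-x for x in reversed(perm[start:end])] + perm[end:]


def count_breakpoints(perm):
    """Count adjacent pairs (x, y) with y != x + 1 after framing with 0 and n+1."""
    framed = [0] + list(perm) + [len(perm) + 1]
    return sum(1 for x, y in zip(framed, framed[1:]) if y - x != 1)


before = list(range(1, 11))            # 1, 2, ..., 10
after = apply_reversal(before, 3, 8)   # reverse the block 4..8

print(after)
print(count_breakpoints(before), count_breakpoints(after))
```

The two breakpoints sit at the ends of the reversed block, between 3 and -8 and between -4 and 9; inside the block, consecutive elements such as -8, -7 still differ by +1, so they do not count.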
Comparative genomic architectures: mouse vs human genome
• Humans and mice have similar genomes, but their genes are ordered differently
• ~245 rearrangements
  – Reversals
  – Fusions
  – Fissions
  – Translocations

History of Chromosome X
Rat Consortium, Nature, 2004

GRIMM
• Real genome architectures are represented by signed permutations
• Efficient algorithms to sort signed permutations have been developed
• The GRIMM web server computes the reversal distances between signed permutations: http://nbcr.sdsc.edu/GRIMM/mgr.cgi
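
To make "reversal distance between signed permutations" concrete, here is a brute-force Python sketch. It is not one of the efficient algorithms mentioned above and is not how GRIMM works. It simply runs a breadth-first search over every possible reversal, so it is usable only for very short permutations. The names `all_reversals` and `reversal_distance` are hypothetical.

```python
from collections import deque


def all_reversals(perm):
    """Yield every signed permutation reachable from perm by one reversal."""
    n = len(perm)
    for i in range(n):
        for j in range(i + 1, n + 1):
            yield perm[:i] + tuple(-x for x in reversed(perm[i:j])) + perm[j:]


def reversal_distance(perm):
    """Minimum number of reversals that sort perm into (1, 2, ..., n)."""
    start = tuple(perm)
    target = tuple(range(1, len(start) + 1))
    dist = {start: 0}
    queue = deque([start])
    while queue:
        cur = queue.popleft()
        if cur == target:
            return dist[cur]
        for nxt in all_reversals(cur):
            if nxt not in dist:  # first visit in BFS is the shortest path
                dist[nxt] = dist[cur] + 1
                queue.append(nxt)
    raise ValueError("input is not a signed permutation of 1..n")


print(reversal_distance((1, -3, -2, 4)))  # one reversal of the block 2..3
print(reversal_distance((2, 1)))
```

The search space holds n!·2^n signed permutations, which is why practical tools rely on polynomial-time sorting algorithms instead. The second call shows the role of signs: swapping two elements that both keep a positive orientation takes three reversals, not one.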
