* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Download Lec-GenomeAllignment2010
Short interspersed nuclear elements (SINEs) wikipedia , lookup
Genomic imprinting wikipedia , lookup
Zinc finger nuclease wikipedia , lookup
Genetic engineering wikipedia , lookup
Gene desert wikipedia , lookup
Adeno-associated virus wikipedia , lookup
Designer baby wikipedia , lookup
Human genetic variation wikipedia , lookup
Copy-number variation wikipedia , lookup
Oncogenomics wikipedia , lookup
Artificial gene synthesis wikipedia , lookup
Mitochondrial DNA wikipedia , lookup
History of genetic engineering wikipedia , lookup
Genome (book) wikipedia , lookup
Segmental Duplication on the Human Y Chromosome wikipedia , lookup
Metagenomics wikipedia , lookup
Transposable element wikipedia , lookup
No-SCAR (Scarless Cas9 Assisted Recombineering) Genome Editing wikipedia , lookup
Public health genomics wikipedia , lookup
Site-specific recombinase technology wikipedia , lookup
Non-coding DNA wikipedia , lookup
Smith–Waterman algorithm wikipedia , lookup
Pathogenomics wikipedia , lookup
Helitron (biology) wikipedia , lookup
Sequence alignment wikipedia , lookup
Human genome wikipedia , lookup
Genomic library wikipedia , lookup
Minimal genome wikipedia , lookup
Whole genome sequencing wikipedia , lookup
Genome editing wikipedia , lookup
Multiple sequence alignment wikipedia , lookup
Genome Alignment Alignment Methods • Needleman-Wunsch (global) and SmithWaterman (local) use dynamic programming • Guaranteed to find an optimal alignment given a particular scoring function • Too computationally intensive for genome alignment, especially multiple genomes Genome Alignment • Depending on level of similarity, genome alignments may need to contend with rearrangements and large-scale duplications and deletions • Draft or partial genomes can both benefit from and confound alignment • Need to visualize results in summary form Genome Alignment • Pair-wise – Align two genomes – Example: MUMmer • Multiple or complex samples and a reference genome – All of one genome plus whatever parts match from the other genome(s) – Example: PIPs • Multiple alignment – All of all the genomes – Example: Mauve Some aligners http://mummer.sourceforge.net/ http://www.ebi.ac.uk/~bjp/pecan/ http://asap.ahabs.wisc.edu/mauve/index.php MUMmer Pecan Mauve MUMmer (Maximal Unique Match) http://mummer.sourceforge.net/ • Fast pair-wise comparison of draft or complete genomes using nucleotide or 6frame translated sequences • MUMmer 3.0 can find all 20-basepair or longer exact matches between a pair of 5megabase genomes in 13.7 seconds, using 78 MB of memory, on a 2.4 GHz Linux desktop computer Suffix Tree Delcher et al. Fast algorithms for large-scale genome alignment and comparison. Nucleic Acids Res. 2002 Jun 1;30(11):2478-83. Genome 2 MUMMER plot Genome 1 5 Campylobacter PROmer analysis Fouts et al. Major structural differences and novel potential virulence mechanisms from the genomes of multiple campylobacter species. PLoS Biol. 2005 Jan;3(1):e15. • One genome is used as the x-axis for all four pair-wise comparisons • X-shape characteristic of collinearity interrupted by inversions around the origin or terminus of replication • Loss of collinearity in more distant comparisons Human Gut metagenome Percent Identity Plot (PIP) of random shotgun reads to a complete Bifidobacterium genome and a good quality draft Methanobrevibacter genome Gill et al. Metagenomic analysis of the human distal gut microbiome. Science. 2006 Jun 2; 312(5778): 1355-9. Mauve Multiple Genome Aligner • Able to identify and align collinear regions of multiple genomes even in the presence of rearrangements • Find and extend seed matches • Group into locally collinear blocks • Align intervening regions Darling et al. Genome Res. 2004 Jul;14(7):1394-403. Progressive Mauve alignment of 12 E. coli genome Aaron Darling 2006 Ph.D. thesis, http://gel.ahabs.wisc.edu/~darling/ darling_thesis.pdf Figure 1. The difference between positional homology alignment and glocal alignment. Three example linear genomes are broken into genes labeled A,B,C,D, and R. R is a multicopy (repetitive) gene, with different copies labeled using numeric subscripts. Each copy of R is assumed to be identical in sequence, so that orthology/paralogy is unknowable from nucleotide substitution (as is often the case with mobile DNA repeat elements). Genes shifted downward in a given genome are inverted (reverse complement) relative to the reference genome. The positional homology alignment would ideally create two local alignment blocks where each block has exactly one alignment row for each genome. Only positionallyconserved copies of the repetitive gene family R become aligned to each other. The glocal alignment would ideally create four local alignment blocks wherein all copies of the repetitive gene family become aligned to each other. Progressive Genome Alignment similar to CLUSTAL (next week) with integrated synteny mapping and positional homology and anchored alignment Performance Metrics Accuracy – Proportion correct TN+TP/total TPR (Recall) – Proportion of predicted positives that are correct actual \ predicted negative positive TN FP FN TP TP/FP+TP Negative Sensitivity – Proportion of positives correctly predicted TP/FN+TP Specificity – Proportion of negatives correctly predicted TN/TN+FP Positive For nucleotide pairs, a TP is a pair aligned in both the calculated and correct alignments. A FP is a nucleotide pair in the calculated alignment that is absent from the correct alignment. Likewise, a FN is a pair in the correct alignment not present in the calculated alignment. Sensitivity Positive Predictive Value (PPV) We do not quantify True Negative (TN) alignments as the number of TN possibilities is extremely large, growing with the product of sequence lengths. ENCODE project • Goal = to identify all functional elements in the human genome • Margulies et al. 2007 reports results of the pilot project to analyze 1% of the genome using genome alignment to detect which regions of the sequence are evolutionarily constrained. • 4 aligners – – – – MAVID MLAGAN TBA PECAN • 23 mammalian species • 30 Mb; 44 regions Alignment Breakpoints Alignment Coverage • • • • • For example, vs. armadillo: MAVID 27.4% MLAGAN 42.4% TBA 41.2% PECAN 40.1% • 17.4% covered by all 4 aligners • Of which 66.1% are aligned identically Performance Metrics • Sensitivity – • coverage of protein coding regions and ancestral repeats • Specificity – • primate specific repeats (Alu) and periodicity of substitutions in protein coding regions