Comparative genomics for biological discovery
Lior Pachter
Dept. Mathematics, U.C. Berkeley
[email protected]
February 3, 2004

Comparative Genomics
From: Hardison RC (2003) Comparative Genomics. PLoS Biol 1(2): e58.
[Figure: images dated February 2001, December 2002, and Rat 2004. Picture credit: G. Bourque, P. Pevzner, G. Tesler and the Rat Genome Sequencing Consortium]

State of the Genomes (Jan 2004)
[Figure: table of genome assemblies with version labels (v0e, v0.1, v1, v2, v3, v3.1, v6, v34), sizes from 0.35 Gb to 3* Gb, and alignment status ("Aligned (multiple)", "Working on it", "As soon as released"); the assignment of values to species is not recoverable from the extracted text]
http://www.ncbi.nlm.nih.gov/Genbank/genbankstats.html

Outline
• VISTA/AVID tools for comparative genomics
• Related biological stories
• Human/Mouse/Rat
• Phylogenetic Shadowing
http://www-gsd.lbl.gov/vista
Processed ~11000 queries on-line, distributed > 560 copies of the program in 34 countries

VISTA/AVID package
• AVID: Program for global alignment of DNA fragments of any length
  N. Bray and L. Pachter, MAVID: Constrained Ancestral Alignment of Multiple Sequences, Genome Research, in press.
  N. Bray, I. Dubchak, L. Pachter, AVID: A Global Alignment Program, Genome Research, 13 (2003), p 97-102.
• VISTA: Visualization of alignment and various sequence features for any number of species
  C. Mayor, M. Brudno, J.R. Schwartz, A. Poliakov, E. M. Rubin, K. A. Frazer, L. Pachter and I. Dubchak, VISTA: Visualizing global DNA sequence alignments of arbitrary length, Bioinformatics, 16 (2000), p 1046-1047.

Aligning large genomic regions
• Long sequences lead to memory problems
• Speed becomes an issue
• Long alignments are very sensitive to parameters
• Draft sequences present a nontrivial problem
• Accuracy is difficult to measure and to achieve

References for other existing programs:
• Glass: Domino Tiling, Gene Recognition, and Mice. Pachter, L. Ph.D. Thesis, MIT (1999). Human and Mouse Gene Structure: Comparative Analysis and Application to Exon Prediction. Batzoglou, S., Pachter, L., Mesirov, J., Berger, B., Lander, E. Genome Research (2000).
• MUMmer: Delcher, A.L., Kasif S., Fleischmann, R.D., Peterson J., White, O. and Salzberg, S.L. Alignment of whole genomes. Nucleic Acids Research (1999).
• PipMaker: PipMaker: A Web Server for Aligning Two Genomic DNA Sequences. Scott Schwartz, Zheng Zhang, Kelly A. Frazer, Arian Smit, Cathy Riemer, John Bouck, Richard Gibbs, Ross Hardison, and Webb Miller. Genome Research (2000).
• DIALIGN: Multiple DNA and protein sequence alignment based on segment-to-segment comparison. B. Morgenstern, A. Dress and T. Werner, Proc. Natl. Acad. Sci. USA 93 (1996).

Variations on Sequence Alignment
• Global alignment: find the best OVERALL alignment.
• Local alignment: find ALL regions of similarity.
• Optimal local alignment: find the BEST region of similarity.

AVID: the alignment engine behind VISTA
Very fast global alignment of megabases of sequence. Provides details about ordered and oriented contigs, and accurate placement in the finished sequence. Full integration with repeat masking.
• ORDER and ORIENT
• FIND all common k-long words (k-mers)
• ALIGN k-mers scoring by local homology
• FIX k-mers with good local homology
• RECURSE with smaller k (shorter words)

Visualization
tggtaacattcaaattatg-----ttctcaaagtgagcatgaca-acttttttccatgg
|| | ||||  |  |  ||     || | | |    |||||| |  ||   |   | ||
tgatgacatctatttgctgtttcctttttagaaactgcatgagagcctggctagtaggg
• Window of length L is centered at a particular nucleotide in the base sequence
• Percent of identical nucleotides in L positions of the alignment is calculated and plotted
• Move to the next nucleotide

Finding conserved regions with percentage and length cutoffs
Conserved segments with percent identity X and length Y are regions in which every contiguous subsegment of length Y was at least X% identical to its paired sequence. These segments are merged to define the conserved regions.
Output:
11054 - 11156 = 103bp at 77.670%  NONCODING
13241 - 13453 = 213bp at 87.793%  EXON
14698 - 14822 = 125bp at 84.800%  EXON

VISTA Plot
[Figure: VISTA plot of the KIF gene showing conserved noncoding sequences; horizontal axis: human sequence (0kb to 10kb); vertical axis: % identity (50 to 100%)]

Multi-Species Comparative Analysis (mVISTA)
[Figure: mVISTA plot of the Apolipoprotein AI gene; panels human/macaque, human/pig, human/rabbit, human/mouse, human/rat, human/chicken, each on a 50-100% identity scale; a liver enhancer is marked]

Some results obtained with VISTA

J Mol Cell Cardiol 34, 1345-1356 (2002)
Myocardin: A Component of a Molecular Switch for Smooth Muscle Differentiation.
J. Chen, C. M. Kitchen, J. W. Streb and J. M. Miano
University of Oxford
VISTA used to solve the gene structures of rat and human myocardin.
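The window-based identity calculation and the percentage/length cutoffs described on the Visualization and conserved-regions slides can be sketched roughly as follows. This is a minimal illustration, not the VISTA implementation: the function names are hypothetical, coordinates are alignment columns rather than base-sequence positions, and merging every overlapping qualifying window is one assumed reading of "merged".

```python
def window_identity(a, b, start, length):
    """Percent of identical, non-gap columns in a window of the alignment."""
    pairs = zip(a[start:start + length], b[start:start + length])
    matches = sum(1 for x, y in pairs if x == y and x != "-")
    return 100.0 * matches / length


def conserved_regions(a, b, min_len, min_pct):
    """Merge all windows of min_len columns that reach min_pct identity.

    a and b are the two rows of a pairwise alignment (equal length,
    '-' for gaps). Returns (start, end, percent_identity) tuples with
    0-based, end-exclusive column coordinates.
    """
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    merged = []
    for start in range(len(a) - min_len + 1):
        if window_identity(a, b, start, min_len) >= min_pct:
            end = start + min_len
            if merged and start <= merged[-1][1]:
                merged[-1][1] = end          # extend the previous region
            else:
                merged.append([start, end])  # open a new region
    return [(s, e, window_identity(a, b, s, e - s)) for s, e in merged]


if __name__ == "__main__":
    # Toy alignment: ten identical columns followed by ten mismatches.
    seq_a = "A" * 10 + "C" * 10
    seq_b = "A" * 10 + "G" * 10
    for s, e, pct in conserved_regions(seq_a, seq_b, min_len=5, min_pct=75.0):
        # Report in the style of the slide: 1-based inclusive coordinates.
        print(f"{s + 1} - {e} = {e - s}bp at {pct:.3f}%")
```

The cutoffs (`min_len`, `min_pct`) correspond to Y and X on the slide; the values used in the toy call are arbitrary.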
Blood, 100, 3450-3456 (2002)
Deletion of the mouse α-globin regulatory element (HS 26) has an unexpectedly mild phenotype
E. Anguita, J. A. Sharpe, J. A. Sloane-Stanley, C. Tufarelli, D. R. Higgs, and W. G. Wood
University of Oxford

Genome Research 11, 78 (2001)
Human and Mouse α-Synuclein Genes: Comparative Genomic Sequence Analysis and Identification of a Novel Gene Regulatory Element
J. W. Touchman, et al.
NIH Intramural Sequencing Center, National Institutes of Health
Synuclein gene involved in Alzheimer's disease

EMBO reports 4:143 (2003)
The kangaroo genome. Leaps and bounds in comparative genomics
M. J. Wakefield and J. A. Marshall Graves
Research School of Biological Sciences, The Australian National University, Canberra, ACT 0200, Australia
'The kangaroo genome is a rich and unique resource for comparative genomics, a treasure trove of comparative genomics data'.
Phylogenetic footprinting of 3' untranslated region of the SLC16A2 gene

VISTA flavors
• VISTA – comparing DNA of multiple organisms
• for 3 species - analyzing cutoffs to define actively conserved non-coding sequences
• cVISTA - comparing two closely related species
• rVISTA – regulatory VISTA

Identifying non-coding sequences (CNSs) involved in transcriptional regulation

rVISTA - prediction of transcription factor binding sites
• Simultaneous searches of the major transcription factor binding site database (Transfac) and the use of global sequence alignment to sieve through the data
• Combination of database searches with comparative sequence analysis reduces the number of predicted transcription factor binding sites by several orders of magnitude

Regulatory VISTA (rVISTA)
1. Identify potential transcription factor binding sites for each sequence using library of matrices (TRANSFAC)
2. Identify aligned sites using AVID
3. Identify conserved sites using dynamic shifting window
Percentage of conserved sites of the total: 3-5%

[Figure: six-species alignment with predicted Ikaros-2 and NFAT binding sites marked; 20 bp dynamic shifting window, >80% ID]
Human   TGATTTCTCGGCAGCAAGGGAGGGCCCCATGACAAAGCCATTTGAAATCCCAGAAGCAATTTTCTACTTACGACCTCACTTTCTGTTGCTGTCTCTCCCTTCCCCTCTG
Mouse   TGATTTCTCGGCAGCCAGGGAGGGCCCCATGACGAAGCCACTCGAAATCCCAGAAGCAATTTTCTACTTACGACCTCACTTTCTGTTGCTCTCTCTTCCTCCCCCTCCA
Dog     TGATTTCTCGGCAGCAAGGGAGGGCCCCATGACGAAGCCATTTGAAATCCCAGAAGCGATTTTCTACCTACGACCTCACTTTCTGTTGCGCTCACTCCCTTCCCCTGCA
Rat     TGATTTCTCGGCAGCCAGGGAGGGCCCCATGACGAAGCCACTCGAAATCCCAGAAGCAATTTTCTACTTACGACCTCACTTTCTGTTGTTCTCTCTTCCTCCCCCTCCA
Cow     TGATTTCTCGGCAGCCAGGGAGGGCCCCATGACGAAGCCATTTGAAATCCCAGAAGCAATTTTCTACTTACGACCTCACTTTCTGTTGCGTTCTCTCCCTTCCCCTCCT
Rabbit  TGATTTCTCGGCAGCCAGGGAGGGCCCCACGAC-AAGCCATTCAAAATCCCAGAAGTGATTTTCTACTTACGACCTCACTTTCTGTTG----CTCTCTCCTTCCCTCCA

~1 Meg region, 5q31
                                                          Coding   Noncoding
Human interval: Transfac predictions for GATA sites          839       20654
Aligned with the same predicted site in the mouse seq.       450        2618
Aligned sites conserved at 80% / 24 bp dynamic window        303         731
Random DNA sequence of the same length: 29280

[Figure: IL 5 region with 2 experimentally verified GATA-3 sites; tracks: GATA-3 (28), GATA-3 Conserved (4)]
[Figure, panel A: Ik-2-All, Ik-2-Aligned, Ik-2-Conserved; panel B: AP-1-Conserved, NFAT-Conserved, GATA-3-Conserved; panel C: AP-1 and NFAT, each as All, Aligned, Conserved; all panels on a 50-100% identity scale]

Main features of AVID
• Alignments up to several megabases
• Works with finished and draft sequences
• Fast
• Accurate for close and distant organisms

Main features of VISTA
• Clear, configurable output
• Ability to visualize several global alignments on the same scale
• Available source code and WEB site

Large scale VISTA/AVID applications:
• Cardiovascular comparative genomics database http://pga.lbl.gov
• Berkeley Genome Pipeline – comparing the human and mouse genome http://pipeline.lbl.gov/
• Multiple whole genome comparisons using MAVID http://bio.math.berkeley.edu/genome/

Automatic computational system for comparative analysis of pairs of genomes
http://pipeline.lbl.gov
Alignments (all pair-wise combinations):
• Human Genome: (Golden Path Assembly)
• Mouse assemblies: Arachne, Phusion (2001), MGSC v3 (2002)
• Rat assemblies: November 2002, February 2003
• D. melanogaster vs D. pseudoobscura: February 2003

Main modules of the system
• Mapping and alignment of mouse contigs against the human genome
• Visualization
• Analysis of conservation

Tandem Local/Global Alignment Approach
• Finding a likely mapping for a contig
• Multi-step verification of potential regions by global alignment

Specificity test
The ratio of the number of bp on each human chromosome covered by alignments of the reversed mouse genome to the number of base pairs covered by the actual mouse genome.

Apolipoprotein(a) region. The expressed gene is confined to a subset of primates. Our method is the only one to predict that apo(a) has NO homology in the mouse.
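The specificity test above reduces to a per-chromosome ratio of covered base pairs, which might be computed along the following lines. This is a sketch under stated assumptions: alignments are given as half-open (start, end) intervals on each human chromosome, overlapping alignments are counted once, and the function names and input layout are hypothetical rather than taken from the pipeline.

```python
def covered_bp(intervals):
    """Bases covered by the union of half-open (start, end) intervals."""
    total = 0
    current_end = None
    for start, end in sorted(intervals):
        if current_end is None or start > current_end:
            total += end - start           # disjoint from what came before
            current_end = end
        elif end > current_end:
            total += end - current_end     # count only the new overhang
            current_end = end
    return total


def specificity_ratio(reversed_hits, actual_hits):
    """Ratio of reversed-genome coverage to actual-genome coverage.

    Both arguments map a chromosome name to a list of alignment
    intervals. Chromosomes with no actual coverage are skipped.
    """
    ratios = {}
    for chrom, actual in actual_hits.items():
        actual_cov = covered_bp(actual)
        if actual_cov == 0:
            continue
        ratios[chrom] = covered_bp(reversed_hits.get(chrom, [])) / actual_cov
    return ratios


if __name__ == "__main__":
    # Invented toy numbers, for illustration only.
    actual = {"chr1": [(0, 100), (50, 200)]}
    control = {"chr1": [(10, 12)]}
    print(specificity_ratio(control, actual))
```

A ratio near zero means the reversed (control) genome rarely aligns where the actual genome does, which is the behaviour the test is meant to check.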
VistaBrowser
Input your own sequence to align against the Reference Genomes: Human, Mouse, Rat, D. melanogaster

GenomeVISTA
Opossum BAC versus Human Genome

Examples of Results
• Understanding the structure of conservation
• Identification of putative functional sites
• Discovery of new genes
• Detection of contamination and misassemblies

Two assemblies are better than one

Identification of a New Apo Gene on Human 11q23
[Figure: VISTA view with labels Gene Name, Highly Conserved Region, Zoom In; genes ApoA4, ApoC3, ApoA1; New Gene (ApoA5)]
Pennacchio LA et al. Science. 2001, 294:169-73.

Finding regulatory regions
Muscle Specific Regulatory Region: human beta enolase intronic enhancer

Comparative analysis of genomic intervals containing important cardiovascular genes
http://pga.lbl.gov
http://pga.lbl.gov/cvcgd.html

Example of CVCGD entry
• Short annotation of the region
• Detailed annotation in AceDB format
• VISTA plot of the region
• multiVISTA plot of the region
• Alignment
• Conserved regions

Comparing the human, mouse and rat
• Design a computational scheme for multiple genome mapping (Construction of Homology Maps)
• Move from pair-wise to multiple DNA alignment (MAVID)
• Novel visualization and browsing techniques (KBROWSER)

MAVID architecture overview
[Figure: MAVID architecture, with labels "ML ancestor" and "AVID"]
Nicolas Bray
http://baboon.math.berkeley.edu/mavid/

Human-Mouse-Rat
Human: April 03
Mouse: Feb. 03
Rat: June 03
Homology map (Colin Dewey)
~500 HMR blocks
MAVID
Computer cluster
Conservation
Annotation
…
Result: 3-way alignment of human-mouse-rat
Foundation for further analysis
Can be browsed at http://hanuman.math.berkeley.edu/kbrowser/
[Figure: tree of Human, Mouse and Rat with labels th, tm, tr]

Identification of Rodent Hotspots
[Figure: Human/Mouse/Rat comparisons]

http://bio.math.berkeley.edu/slam/

SLAM components
• Splice site detector
  – VLMM
• Intron and intergenic regions
  – 2nd order Markov chain
  – independent geometric lengths
• Coding sequence
  – PHMM on protein level
  – generalized length distribution
• Conserved non-coding sequence
  – PHMM on DNA level

SLAM input and output
• Input:
  – Pair of syntenic sequences (FASTA).
• Output:
  – CDS and CNS predictions in both sequences.
  – Protein predictions.
  – Protein and CNS alignment.
[Figures: example input and example output]

Summary statistics
# of SLAM human/mouse genes: 29370
# of SLAM human/rat genes: 25427
# of SLAM genes identical in human, mouse, and rat: 3698
# of SLAM human/mouse/rat genes overlapping human RefSeq: 2478
% of SLAM human/mouse/rat genes with correct structure (out of genes overlapping human RefSeq): 36%
# of novel (not overlapping with human Ensembl, RefSeq, or Known genes) SLAM human/mouse/rat genes: 924
# of SLAM human/mouse/rat genes tested: 48 ortholog pairs (48 human, 48 rat)
% of SLAM human/mouse/rat genes verified: 73% (29 pairs verified in both human and rat, 6 verified only in rat)
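The 73% verification rate in the summary statistics appears consistent with the counts given alongside it, assuming that pairs verified in both species and pairs verified only in rat are both counted as verified out of the 48 tested ortholog pairs:

```latex
\frac{29 + 6}{48} = \frac{35}{48} \approx 0.729 \approx 73\%
```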
Example: LXR-α exon 3
[Figure: percent identity plot, 50-100% scale]
Human: chromosome 11
13 other primate sequences (~2kb each)
• Begin with a multi-FASTA file
• No phylogenetic tree
• No alignment
• No annotation

Nicolas Bray
http://baboon.math.berkeley.edu/mavid/

[Figures: non-conserved likelihood calculation; conserved likelihood calculation]

Example: LXR-α exon 3
[Figure: log(lik[fast]/lik[slow]), from -2.1 to 0.4, versus sequence position (bp), 0 to 1500, with a percent identity track (50-100%)]

Which primates should we sequence?
[Figure: primate phylogeny with a time axis in million years (80 to 0). Groups: New-world monkeys, Old-world monkeys, Hominoids. Taxa: Rodents, Lemurs, Lorises, Prosimians, Tarsioids, Cebuella, Callithrix, Callimico, Saguinus, Leontopithecus, Saimiri, Cebus, Aotus, Callicebus, Pithecia, Chiropotes, Cacajao, Alouatta, Lagothrix, Brachyteles, Ateles, Allenopithecus, Miopithecus, Erythrocebus, Chlorocebus, Cercopithecus, Macaca, Mandrillus, Cercocebus, Lophocebus, Papio, Theropithecus, Procolobus, Piliocolobus, Colobus, Semnopithecus, Kasi, Trachypithecus, Presbytis, Nasalis, Simias, Pygathrix, Rhinopithecus, Hylobates, Pongo, Gorilla, Pan, Homo]

k-MST problem
Given a phylogenetic tree on n leaves, and an integer k<n, find the subtree of maximum weight spanning k leaves. The clamped k-MST problem is to find the subtree of maximum weight spanning k leaves where one of the leaves is human.
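One way to approach the clamped k-MST problem is a greedy selection: start from the human leaf and repeatedly add the leaf that increases the weight of the spanning subtree the most. The sketch below illustrates that idea only; the slides do not say which algorithm was used, and the greedy choice, the tree encoding (each node mapped to its parent and branch length), the function names, and the branch lengths are all assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical toy tree: node -> (parent, branch length). "R" is the root.
TREE = {
    "human": ("A", 1.0),
    "chimp": ("A", 1.0),
    "B": ("A", 2.0),
    "baboon": ("B", 1.0),
    "macaque": ("B", 1.5),
    "A": ("R", 1.0),
    "mouse": ("R", 5.0),
}


def leaves_below(parent):
    """Return the leaf list and, for each node, the set of leaves under it."""
    internal = {p for p, _ in parent.values()}
    leaves = [n for n in parent if n not in internal]
    below = defaultdict(set)
    for leaf in leaves:
        node = leaf
        while node in parent:            # walk up until the root is reached
            below[node].add(leaf)
            node = parent[node][0]
    return leaves, below


def span_weight(parent, below, chosen):
    """Total branch length of the minimal subtree connecting the chosen leaves."""
    chosen = set(chosen)
    total = 0.0
    for node, (_, length) in parent.items():
        inside = len(below[node] & chosen)
        # A branch is used only if chosen leaves lie on both of its sides.
        if 0 < inside < len(chosen):
            total += length
    return total


def clamped_greedy(parent, k, clamp="human"):
    """Greedily pick k leaves, always including the clamped leaf."""
    leaves, below = leaves_below(parent)
    chosen = [clamp]
    while len(chosen) < k:
        candidates = [leaf for leaf in leaves if leaf not in chosen]
        best = max(candidates,
                   key=lambda leaf: span_weight(parent, below, chosen + [leaf]))
        chosen.append(best)
    return chosen, span_weight(parent, below, chosen)


if __name__ == "__main__":
    print(clamped_greedy(TREE, 3))
```

For small trees the result can be checked against exhaustive enumeration of all k-leaf subsets that contain the clamped leaf.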
Phylogenetic shadowing of the apo(a) promoter
[Figure: log(lik[fast]/lik[slow]), from -0.5 to 4.5, versus sequence position (bp), 250 to 2250; conserved and non-conserved regions marked; annotated features: TATA, HNF-1α, EXON]

Gel-shift assay to assess DNA-protein interactions
[Figure: gel-shift assay with nuclear extract, comparing conserved and non-conserved elements; bands labeled DNA-protein complex and unbound DNA]

Gel-shift analysis of conserved elements in the apo(a) promoter
[Figure: % oligonucleotide shifted (0 to 35) per oligonucleotide, for conserved elements C1-C9, C10.1, C10.2 and non-conserved elements N1-N7]

Summary and Conclusions - Phylogenetic Shadowing
• Alignment problem is tractable
• Trees can be constructed accurately
• Total tree weight is sufficient for distinguishing conserved from non-conserved regions
• Likelihood calculations are reliable because alignments are good
• Can decide a priori which organisms should be sequenced
• Annotation of primate-specific elements is possible
• Annotation of coding exons is accurate
• Annotation of regulatory elements is possible
• Sequencing is easier because comparative mapping and assembly techniques can be applied

Web sites
• MAVID alignment program http://bio.math.berkeley.edu/mavid/
• SLAM comparative gene prediction program http://bio.math.berkeley.edu/slam/mouse/
• VISTA http://www-gsd.lbl.gov/vista/
• KBROWSER http://hanuman.math.berkeley.edu/kbrowser/
• SHADOWER http://bonaire.lbl.gov/shadower/

Credits
(M)AVID Nicolas Bray VISTA Projects and PGA Michael Brudno Gaby Loots Eddy Rubin Olivier Couronne Chris Mayor Inna Dubchak Ivan Ovcharenko Homology Mapping Colin Dewey Evolutionary Hotspots Von Bing Yap KBROWSER Kushal Chakrabarti Phylogenetic Shadowing Dario Boffelli Jon McAuliffe Gene Finding Marina Alexandersson Colin Dewey Keith Lewis Ivan Ovcharenko Michael Jordan Eddy Rubin Simon Cawley Richard Gibbs Sourav Chatterji Jia Qian Wu Kelly Frazer Alexander Poliakov