I. Comparing genome sequences
Comparative Genomics

TGCAAACTCAAACTCTTTTGTTGTTCTTACTGTATCATTGCCCAGAATAT
TCTGCCTGTCTTTAGAGGCTAATACATTGATTAGTGAATTCCAATGGGCA
GAATCGTGATGCATTAAAGAGATGCTAATATTTTCACTGCTCCTCAATTT
CCCTGTTTCCAGGTTTGTTGTCCCAAAATAGTGACCATTTCATATGTATA

Function
What do we mean when we say a sequence is “functional?”
[Figure labels: TF binding; Histone mod]

Comparative genomics: leveraging evolution to identify function

TGCAAACTCAAACTCTTTTGTTGTTCTTACTGT
TGCAAAGTCAAACTCTTTTGTTGTACTTACTGT
TGCAACTACAAACTCTTTT-TTGTTCTTACTGT
TGCGACTACAAACTCTTTTGTTGTTCTTCCTGT
TGCGACTACA-ACTCTCTTGTTGTT-TTCCTGT
TACGAATA--GGACATTTGTTGCCCTTC-CTGT
TACGA-TACAGGCCCATTTGTTG--CTTCCTGT
T-CGA-TACAGGCTCATTTGTTG--CTTCCTGT

Overview
I. Comparing genome sequences
• Concepts and terminology
• Methods
  - Whole-genome alignments
  - Quantifying evolutionary conservation (PhastCons, PhyloP)
  - Identifying conserved elements
• Available datasets at UCSC
II. Comparative analyses of function
• Evolutionary dynamics of gene regulation
• Insights into regulatory variation within and across species

Goals of comparative genomics
• Infer the course of past evolution using statistical models of how sequences change over time
• Identify sequence elements evolving more slowly (or more rapidly) than expected
• Evaluate the precise degree of constraint on specific sites within genes, enhancers, etc.
• Predict the functional effects of nucleotide or amino acid mutations in constrained sequences

Vertebrate genomes available for comparative studies
[Figure labels: Vertebrates; Tetrapods; Mammals; Primates]

Distribution of evolutionary constraint in the human genome
4.2% of genome is putatively constrained
~1 million putative regulatory elements
Lindblad-Toh et al. Nature 478:476

Commonly used (and misused) terms
Mutation vs. Substitution
• Mutations occur in individuals, segregate in populations
• Substitutions are mutations that have become fixed
• Mutations = within species; substitutions = between species
Conservation vs. Constraint
• Conservation = an observation of sequence similarity
• Constraint = a hypothesis about the effect of purifying selection
Homology, Orthology and Paralogy
• Homologous sequences = derived from a common ancestor
• Orthologous sequences = homologous sequences separated by a speciation event (e.g., human HOXA and mouse Hoxa)
• Paralogous sequences = homologous sequences separated by gene duplication (e.g., human HOXA and human HOXB)

Basic premises in comparative sequence analysis
Most sequence differences among genomes are neutral
• Involve substitutions with minimal or no functional impact
• Fixed by random genetic drift
• Fixation rate is equal to mutation rate
• Genomes become more dissimilar with greater phylogenetic distance
Most mutations that affect function are eliminated by purifying selection
• Constrained elements have lower substitution rates than expected from the neutral rate
• Contingent on the effect of the mutation and degree of constraint on the function
• Manifests as sequence conservation, even among distant species
Beneficial mutations may be driven to fixation by positive selection
• May be detected as a “faster-than-neutral” substitution rate
• Expected to be rare

Phylogenies
Phylogenetic trees show two things:
• Evolutionary relationships among species or sequences: branching order
• Evolutionary distance (e.g., degree of similarity or divergence): branch length
[Figure labels: Branch; Internal node; Terminal node; Species tree; Gene tree]

Orthologs and paralogs in gene trees
[Figure labels: HMGCS1; HMGCS2; Human HMGCS1; Human HMGCS2; Duplication; Orthologs; Paralogs; 1:1 Orthologs; 1:2]
Capra et al. 2013

Ortholog assignments at Ensembl

Steps in sequence comparisons
Sequence alignment
• Global vs. local
• Whole-genome vs. genome segments (e.g., genes)
• Identify sites that are homologous (not necessarily identical)
Measure similarity and divergence of sequences
• Sequence similarity: level of conservation
• Rates of change among sequences: divergence
Infer degree of evolutionary constraint
• Are the sequences more conserved than expected from neutral evolution?

Rates of sequence change are estimated using models of the substitution process
Transition probabilities (rows: original base; columns: substituted base):

\[
\begin{array}{c|cccc}
  & A & T & C & G \\ \hline
A & 1-a_{AT}-a_{AC}-a_{AG} & a_{AT} & a_{AC} & a_{AG} \\
T & a_{TA} & 1-a_{TA}-a_{TC}-a_{TG} & a_{TC} & a_{TG} \\
C & a_{CA} & a_{CT} & 1-a_{CA}-a_{CT}-a_{CG} & a_{CG} \\
G & a_{GA} & a_{GT} & a_{GC} & 1-a_{GA}-a_{GT}-a_{GC}
\end{array}
\]

Substitution rates are calculated for each lineage in a sequence phylogeny
[Figure label: Phylogeny]

Conserved noncoding sequences identified by local reductions in substitution rate
[Plot: y-axis 0 to 5, series labeled “localneut”; x-axis: aligned position, 1 to 49]

Tools for quantifying evolutionary conservation across genomes
Alignment: Multiz
• Generates multiple species alignment relative to a base genome
• Constructed from pairwise alignment of individual genomes to reference
• 46-way and 100-way alignments to hg19; 30-way to mm9; 60-way to mm10

100-way Multiz alignment in hg19
Green = level of sequence similarity at each site

Conservation of synteny: “net” alignments
• Conservation of genome segments
• Order and orientation of genes and regulatory sequences
• Synteny is frequently conserved on megabase scales

Tools for quantifying evolutionary conservation across genomes
PhastCons
• Estimates the probability that a nucleotide belongs to a conserved element
• Sensitive to ‘runs’ of conserved sites; effective for identifying conserved blocks
• For hg19, elements are calculated at three phylogenetic scopes (Vertebrate, Placental Mammal, Primate)
PhyloP
• Measures conservation independently at individual positions
• Provides per-base conservation scores (-log p value under hypothesis of neutrality)
• Positive scores suggest constraint; negative scores suggest accelerated evolution

Identifying conserved elements: PhastCons
[Track labels: PhastCons scores; PhastCons elements]
lod: 882; Score: 694
• lod score: log probability under conserved model - log probability under neutral model
• Score: normalized lod score on 0-1000 scale
• Use scores to rank elements by estimated constraint

PhastCons elements estimated at 3 phylogenetic scopes
[Track labels: Primate; Placental; Vertebrate]
Level of conservation decays with increasing evolutionary distance

Functional variation in the genome
Changes to:
• Methylation patterns
• Transcription factor binding
• Histone modification states
• Gene expression levels
Cooper and Shendure, Nat Rev Genet 12:628

Extent of variation in the typical human genome
1000 Genomes Project Consortium

PhyloP: measuring basewise conservation
PhyloP scores
• Scores are calculated independently for each base
• Scores are -log P values under hypothesis of neutral evolution
• Positive scores = constraint
• Negative scores = acceleration

Per-site phyloP conservation scores
[Example values shown: 4.49; 1.77; -0.96]
Use PhastCons to identify conserved elements
Use phyloP to evaluate individual sites within elements

Accessing conservation data
Multiple genome alignments and conservation metrics are calculated independently for each reference genome
Orthologous region in mouse: 30-way multiz alignment
[Track labels: Conservation; Regulatory info (ENCODE)]

Conservation identifies critical binding sites in regulatory elements
Important binding sites and variants that affect function will be here

Gene expression and regulation also diverge over time
[Figure labels: Neutral; Constrained; Directional]
Romero et al., Nat Rev Genet. 13:505 (2012)

Nature 478: 343 (2011)
• Human
• Chimpanzee
• Bonobo
• Gorilla
• Orangutan
• Macaque
• Mouse
• Opossum
• Platypus
• Chicken
• Custom gene models based on Ensembl + RNA-seq
• 5,636 1:1 orthologs in amniotes
• 13,277 1:1 orthologs in primates
• Only constitutive exons

Gene expression diverges as species diverge
Conservation of gene expression varies across tissues

Issues in comparative functional genomics
• Input data are noisy: ChIP-seq and RNA-seq data are signal based and subject to considerable experimental variation
• Using comparable biological states within and across species (e.g., human liver vs. mouse liver) = variation across tissues?
• How do epigenetic states and gene expression diverge among individuals and across species (Neutral? Constrained?)
• Can we identify variants or substitutions that drive regulatory changes?
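
Worked example: conservation as an observation
The distinction drawn in Part I (conservation is an observation of sequence similarity; constraint is a hypothesis) can be made concrete with a toy calculation on the eight-row alignment shown in the slides. The sketch below computes, for each column, the fraction of rows that match the first row. Treating the first row as the reference, counting gaps as mismatches, and the names `ALIGNMENT` and `column_identity` are all assumptions made for illustration. This is a plain identity count, not PhastCons or phyloP, which fit phylogenetic substitution models.

```python
# Toy illustration only: per-column identity to an assumed reference row.
# This is NOT PhastCons or phyloP; it ignores the phylogeny entirely.

# The eight aligned rows shown in the slides. Treating the first row as the
# reference is an assumption made for this example.
ALIGNMENT = [
    "TGCAAACTCAAACTCTTTTGTTGTTCTTACTGT",
    "TGCAAAGTCAAACTCTTTTGTTGTACTTACTGT",
    "TGCAACTACAAACTCTTTT-TTGTTCTTACTGT",
    "TGCGACTACAAACTCTTTTGTTGTTCTTCCTGT",
    "TGCGACTACA-ACTCTCTTGTTGTT-TTCCTGT",
    "TACGAATA--GGACATTTGTTGCCCTTC-CTGT",
    "TACGA-TACAGGCCCATTTGTTG--CTTCCTGT",
    "T-CGA-TACAGGCTCATTTGTTG--CTTCCTGT",
]


def column_identity(alignment):
    """Return the fraction of rows matching the reference base, per column.

    The reference row itself is included in the count. Gaps ('-') in other
    rows count as mismatches. Columns where the reference has a gap get None.
    """
    reference = alignment[0]
    if any(len(row) != len(reference) for row in alignment):
        raise ValueError("all aligned sequences must have the same length")

    scores = []
    for i, ref_base in enumerate(reference):
        if ref_base == "-":
            scores.append(None)
            continue
        matches = sum(1 for row in alignment if row[i] == ref_base)
        scores.append(matches / len(alignment))
    return scores


scores = column_identity(ALIGNMENT)
```

Columns with values near 1.0 are conserved in the observational sense only. Whether they are constrained depends on how much change the neutral model would predict across the same species, which is the question PhastCons and phyloP address.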