Comparative Genomics
TGCAAACTCAAACTCTTTTGTTGTTCTTACTGTATCATTGCCCAGAATAT
TCTGCCTGTCTTTAGAGGCTAATACATTGATTAGTGAATTCCAATGGGCA
GAATCGTGATGCATTAAAGAGATGCTAATATTTTCACTGCTCCTCAATTT
CCCTGTTTCCAGGTTTGTTGTCCCAAAATAGTGACCATTTCATATGTATA
Function
Overview
I. Comparing genome sequences
• Concepts and terminology
• Methods
- Whole-genome alignments
- Quantifying evolutionary conservation (PhastCons, PhyloP)
- Identifying conserved elements
• Available datasets at UCSC
II. Comparative analyses of function
• Evolutionary dynamics of gene regulation
• Case studies
• Insights into regulatory variation within and across species
Goals of comparative genomics
• Infer the course of past evolution using statistical models of sequence evolution
• Identify sequence elements evolving more slowly or more rapidly than neutral
• Evaluate the precise degree of constraint on specific positions
• Predict the functional effects of nucleotide or amino acid mutations in constrained sequences
Functional variation in the genome
Changes to:
• Methylation patterns
• Transcription factor binding
• Histone modification states
• Gene expression levels
Cooper and Shendure, Nat Rev Genet 12:628 (2
Vertebrates
Tetrapods
Mammals
Primates
Vertebrate genomes available for comparative studies
Distribution of evolutionary constraint in the
human genome
4.2% of genome is putatively constrained
~1 million putative regulatory elements
Lindblad-Toh et al. Nature 478:476
Commonly used (and misused) terms
Mutation vs. Substitution
• Mutations occur in individuals, segregate in populations
• Substitutions are mutations that have become fixed
• Mutations = within species; substitutions = between species
Conservation vs. Constraint
• Conservation = an observation of sequence similarity
• Constraint = a hypothesis about the effect of purifying selection
Homology, Orthology and Paralogy
• Homologous sequences = derived from a common ancestor
• Orthologous sequences = homologous sequences separated by a speciation event
(e.g., human HOXA and mouse Hoxa)
• Paralogous sequences = homologous sequences separated by gene duplication
(e.g., human HOXA and human HOXB)
Basic premises in comparative sequence analysis
Most sequence differences among genomes are neutral
• Involve substitutions with minimal or no functional impact
• Fixed by random genetic drift
• Fixation rate is equal to mutation rate
• Genomes become more dissimilar with greater phylogenetic distance
Most mutations that affect function are eliminated by purifying selection
• Constrained elements have lower substitution rates than expected from the neutral rate
• Contingent on the effect of the mutation and degree of constraint on the function
• Manifests as sequence conservation, even among distant species
Beneficial mutations may be driven to fixation by positive selection
• May be detected as “faster-than-neutral” substitution rate
• Expected to be rare
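The statement that the neutral fixation rate equals the mutation rate follows from a short standard calculation. A sketch, assuming a diploid population of constant size $N$ and a neutral mutation rate $\mu$ per site per generation (both symbols are introduced here for illustration and do not appear in the slides):

$$
k \;=\; \underbrace{2N\mu}_{\text{new mutations per generation}} \times \underbrace{\frac{1}{2N}}_{\text{fixation probability of each}} \;=\; \mu
$$

Population size cancels, so under neutrality the substitution rate $k$ depends only on the mutation rate.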
Phylogenies
Phylogenetic trees show two things:
• Evolutionary relationships among species or sequences: branching order
• Evolutionary distance (e.g., degree of similarity or divergence): branch length
Tree components (figure labels): branch, internal node, terminal node
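To make "branch length as evolutionary distance" concrete, the sketch below sums branch lengths along the path joining two tips of a small rooted tree. The topology, the node names `primates` and `root`, and all branch lengths are invented for illustration; they are not taken from the slides.

```python
# Toy rooted tree: node -> (parent, length of the branch leading to the parent).
# Topology and branch lengths are hypothetical, for illustration only.
TREE = {
    "human": ("primates", 0.25),
    "chimp": ("primates", 0.25),
    "primates": ("root", 0.5),
    "mouse": ("root", 1.0),
    "root": (None, 0.0),
}


def path_to_root(node):
    """Return [(ancestor, cumulative distance from node)] up to the root."""
    path, dist = [(node, 0.0)], 0.0
    while TREE[node][0] is not None:
        parent, length = TREE[node]
        dist += length
        path.append((parent, dist))
        node = parent
    return path


def patristic_distance(a, b):
    """Sum of branch lengths on the path joining nodes a and b."""
    dist_from_a = dict(path_to_root(a))
    for ancestor, dist_from_b in path_to_root(b):
        if ancestor in dist_from_a:  # first shared ancestor is the most recent one
            return dist_from_a[ancestor] + dist_from_b
    raise ValueError("nodes are not in the same tree")


if __name__ == "__main__":
    print(patristic_distance("human", "chimp"))
    print(patristic_distance("human", "mouse"))
```

Branching order is encoded by the parent links, and branch length by the number stored with each link, which mirrors the two things a phylogenetic tree shows.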
Species tree
Gene tree
Orthologs and paralogs in gene trees
Figure (Capra et al. 2013): gene tree of HMGCS1 and HMGCS2, with labels for the duplication node, orthologs (1:1 and 1:2), and paralogs (human HMGCS1 and human HMGCS2)
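The ortholog/paralog distinction can be read off a gene tree by looking at the event at the most recent common ancestor of two genes: speciation means orthologs, duplication means paralogs. The sketch below applies that rule to a hand-built toy tree. The node names (`dup`, `spec1`, `spec2`), the event labels, and the inclusion of mouse genes are assumptions made for illustration; this is not the tree from Capra et al. 2013.

```python
# Toy gene tree: node -> (parent, event at that node). Leaves have event None.
# The structure is hypothetical: one ancient duplication, then a speciation
# in each duplicate lineage.
GENE_TREE = {
    "dup": (None, "duplication"),
    "spec1": ("dup", "speciation"),
    "spec2": ("dup", "speciation"),
    "human_HMGCS1": ("spec1", None),
    "mouse_Hmgcs1": ("spec1", None),
    "human_HMGCS2": ("spec2", None),
    "mouse_Hmgcs2": ("spec2", None),
}


def ancestors(node):
    """Return the node followed by all of its ancestors, nearest first."""
    chain = []
    while node is not None:
        chain.append(node)
        node = GENE_TREE[node][0]
    return chain


def relationship(gene_a, gene_b):
    """Classify two distinct genes by the event at their most recent common ancestor."""
    if gene_a == gene_b:
        raise ValueError("need two distinct genes")
    seen = set(ancestors(gene_a))
    for node in ancestors(gene_b):
        if node in seen:
            event = GENE_TREE[node][1]
            return "orthologs" if event == "speciation" else "paralogs"
    raise ValueError("genes are not in the same tree")


if __name__ == "__main__":
    print(relationship("human_HMGCS1", "mouse_Hmgcs1"))
    print(relationship("human_HMGCS1", "human_HMGCS2"))
```

Note that under this rule genes in different species can still be paralogs (here `human_HMGCS1` and `mouse_Hmgcs2`), because the duplication predates the speciation.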
Ortholog assignments at Ensembl
Steps in sequence comparisons
Sequence alignment
• Global vs. local
• Whole-genome vs. genome segments (e.g., genes)
• Identify sites that are homologous (not necessarily identical)
Measure similarity and divergence of sequences
• Sequence similarity – level of conservation
• Rates of change among sequences - divergence
Infer degree of evolutionary constraint
• Are the sequences more conserved than expected from neutral evolution?
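Once two sequences are aligned, similarity and divergence can be measured column by column. The sketch below computes percent identity, the raw proportion of differing sites, and a distance corrected for multiple hits. The Jukes-Cantor (JC69) correction is an assumption chosen for simplicity; the slides do not name a specific model, and the function name and example sequences are hypothetical.

```python
import math


def compare_aligned(seq_a, seq_b):
    """Summarize similarity and divergence for two pre-aligned sequences.

    Columns containing a gap ('-') in either sequence are skipped.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    matches = mismatches = 0
    for x, y in zip(seq_a.upper(), seq_b.upper()):
        if x == "-" or y == "-":
            continue
        if x == y:
            matches += 1
        else:
            mismatches += 1
    sites = matches + mismatches
    if sites == 0:
        raise ValueError("no ungapped columns to compare")
    p = mismatches / sites  # observed proportion of differing sites
    # JC69 correction (assumed model): d = -3/4 * ln(1 - 4p/3), undefined for p >= 3/4
    jc = math.inf if p >= 0.75 else -0.75 * math.log(1.0 - 4.0 * p / 3.0)
    return {"identity": 1.0 - p, "p_distance": p, "jc69_distance": jc}


if __name__ == "__main__":
    print(compare_aligned("ACGTACGT-A", "ACGTTCGTAA"))
```

The corrected distance is always at least as large as the raw proportion of differences, because some sites will have changed more than once.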
Rates of sequence change are estimated using models of the substitution process
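Transition probabilities:

$$
\begin{array}{c|cccc}
  & A & T & C & G \\
\hline
A & 1 - a_{AT} - a_{AC} - a_{AG} & a_{AT} & a_{AC} & a_{AG} \\
T & a_{TA} & 1 - a_{TA} - a_{TC} - a_{TG} & a_{TC} & a_{TG} \\
C & a_{CA} & a_{CT} & 1 - a_{CA} - a_{CT} - a_{CG} & a_{CG} \\
G & a_{GA} & a_{GT} & a_{GC} & 1 - a_{GA} - a_{GT} - a_{GC}
\end{array}
$$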
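The transition probability matrix above can be sketched in code as follows, reading each a_XY as the probability that base X is replaced by base Y in one step (an assumed reading of the slide's notation). The numeric rates are hypothetical and chosen only so that transitions (A↔G, C↔T) are more frequent than transversions.

```python
BASES = "ATCG"


def substitution_matrix(rates):
    """Build P[x][y] from off-diagonal probabilities a_XY.

    rates maps (from_base, to_base) -> probability for every pair with
    from_base != to_base. The diagonal is 1 minus the row's off-diagonal sum.
    """
    matrix = {}
    for x in BASES:
        row = {y: rates[(x, y)] for y in BASES if y != x}
        stay = 1.0 - sum(row.values())
        if stay < 0.0:
            raise ValueError(f"off-diagonal probabilities for {x} exceed 1")
        row[x] = stay
        matrix[x] = row
    return matrix


def step(dist, matrix):
    """Advance a base distribution by one step: new[y] = sum_x dist[x] * P[x][y]."""
    return {y: sum(dist[x] * matrix[x][y] for x in BASES) for y in BASES}


# Hypothetical rates: transitions 0.02, transversions 0.005 per step.
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}
RATES = {
    (x, y): (0.02 if (x, y) in TRANSITIONS else 0.005)
    for x in BASES for y in BASES if x != y
}

if __name__ == "__main__":
    P = substitution_matrix(RATES)
    start = {"A": 1.0, "T": 0.0, "C": 0.0, "G": 0.0}
    print(step(start, P))
```

Applying `step` repeatedly models change along a longer branch; each row of the matrix sums to 1 by construction.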

Substitution rates are calculated for each lineage in a sequence phylogeny
Conserved noncoding sequences identified by local reductions in substitution rate
Figure: plot with y-axis labeled "localneut" (0 to 5) against aligned position (1 to 49)
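One simple way to look for local reductions in substitution rate is a sliding window: average the per-site rate in each window, divide by the neutral rate, and flag windows that fall below a threshold. The sketch below does this; the window size, the threshold of 0.5, and the input rates are hypothetical choices, and this is not how PhastCons itself identifies elements.

```python
def windowed_relative_rate(site_rates, neutral_rate, window=5):
    """Mean substitution rate in each sliding window, relative to neutral."""
    if window < 1 or window > len(site_rates):
        raise ValueError("window must be between 1 and the number of sites")
    if neutral_rate <= 0:
        raise ValueError("neutral_rate must be positive")
    relative = []
    for start in range(len(site_rates) - window + 1):
        chunk = site_rates[start:start + window]
        relative.append(sum(chunk) / window / neutral_rate)
    return relative


def conserved_windows(site_rates, neutral_rate, window=5, threshold=0.5):
    """Start indices (0-based) of windows whose relative rate is below threshold."""
    relative = windowed_relative_rate(site_rates, neutral_rate, window)
    return [i for i, value in enumerate(relative) if value < threshold]


if __name__ == "__main__":
    # Hypothetical per-site rates: a slowly evolving block flanked by neutral sites.
    rates = [1.0] * 5 + [0.1] * 5 + [1.0] * 5
    print(conserved_windows(rates, neutral_rate=1.0))
```

A larger window smooths out noise from individual sites but blurs the boundaries of the conserved block.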
Tools for quantifying evolutionary conservation across
genomes
Alignment: Multiz
• Generates multiple species alignment relative to a base genome
• Constructed from pairwise alignment of individual genomes to reference
• 46-way and 100-way alignments to hg19; 30-way to mm9; 60-way to mm10
100-way Multiz alignment in hg19
Green = level of sequence similarity at each site
Conservation of synteny: “net” alignments
• Conservation of genome segments
• Order and orientation of genes and regulatory sequences
• Synteny is frequently conserved on megabase scales
Tools for quantifying evolutionary conservation across
genomes
PhastCons
• Estimates the probability that a nucleotide belongs to a conserved element
• Sensitive to ‘runs’ of conserved sites – effective for identifying conserved blocks
• For hg19, elements are calculated at three phylogenetic scopes
(Vertebrate, Placental Mammal, Primate)
PhyloP
• Measures conservation independently at individual positions
• Provides per-base conservation scores: (-log p value under hypothesis of neutrality)
• Positive scores suggest constraint; negative scores suggest accelerated evolution
Identifying conserved elements: PhastCons
PhastCons scores
PhastCons elements
lod: 882
Score: 694
lod score: log probability under conserved model – log probability under neutral model
Score: normalized lod score on 0-1000 scale
Use scores to rank elements by estimated constraint
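The lod score definition above (log probability under a conserved model minus log probability under a neutral model) can be illustrated with a deliberately simplified toy. PhastCons itself fits a phylogenetic hidden Markov model; the sketch below instead assumes that the number of substitutions observed at each aligned site is Poisson distributed, with a mean of `neutral_mean` under neutrality and `rho * neutral_mean` in a conserved element. The function names, the Poisson assumption, and the value `rho=0.3` are all illustrative assumptions.

```python
import math


def poisson_logpmf(k, mean):
    """Natural-log probability of observing k events under a Poisson(mean)."""
    return k * math.log(mean) - mean - math.lgamma(k + 1)


def toy_lod(sub_counts, neutral_mean, rho=0.3):
    """Toy log-odds score for a candidate element.

    sub_counts: substitutions observed at each site across the tree.
    neutral_mean: expected substitutions per site under neutrality.
    rho: assumed rate scaling (< 1) in conserved elements.
    """
    conserved_mean = rho * neutral_mean
    return sum(
        poisson_logpmf(k, conserved_mean) - poisson_logpmf(k, neutral_mean)
        for k in sub_counts
    )


if __name__ == "__main__":
    print(toy_lod([0] * 10, neutral_mean=2.0))  # no substitutions: favors conserved
    print(toy_lod([2] * 10, neutral_mean=2.0))  # neutral-like counts: favors neutral
```

Because per-site contributions add up, longer runs of slowly evolving sites accumulate higher scores, which is consistent with ranking elements by lod score.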
PhastCons elements estimated at 3 phylogenetic scopes
Primate
Placental
Vertebrate
Level of conservation decays with increasing
evolutionary distance
PhyloP: measuring basewise conservation
• Scores are calculated independently for each base
• Scores are -log P values under hypothesis of neutral evolution
• Positive scores = constraint
• Negative scores = acceleration
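The sign convention for per-base scores can be illustrated with a toy calculation. The sketch below assumes the number of substitutions at a site is Poisson distributed under neutrality and uses base-10 logs; both are simplifying assumptions, and phyloP's actual tests are computed on the phylogeny rather than on a simple count. Fewer substitutions than expected yield a positive score, more than expected a negative one.

```python
import math


def poisson_pmf(k, mean):
    """Probability of observing k events under a Poisson(mean)."""
    return math.exp(k * math.log(mean) - mean - math.lgamma(k + 1))


def toy_signed_score(observed, neutral_mean):
    """Signed -log10 p-value for one site under a neutral Poisson model.

    Positive: fewer substitutions than expected (suggests constraint).
    Negative: more substitutions than expected (suggests acceleration).
    """
    if observed <= neutral_mean:
        # p = P(K <= observed): how surprising is so few substitutions?
        p = sum(poisson_pmf(i, neutral_mean) for i in range(observed + 1))
        return -math.log10(p)
    # p = P(K >= observed): how surprising is so many substitutions?
    p = 1.0 - sum(poisson_pmf(i, neutral_mean) for i in range(observed))
    p = max(p, 1e-300)  # guard against rounding to zero for extreme counts
    return math.log10(p)


if __name__ == "__main__":
    for count in (0, 3, 10):
        print(count, toy_signed_score(count, neutral_mean=3.0))
```

Each site is scored on its own, so neighboring sites do not influence one another, in contrast to element-level scoring.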
Per-site phyloP conservation scores (example values shown in the figure: 4.49, 1.77, -0.96)
Use PhastCons to identify conserved elements
Use phyloP to evaluate individual sites within elements
Accessing conservation data
Multiple genome alignments and conservation metrics are
calculated independently for each reference genome
Orthologous region in mouse:
30-way multiz
alignment
Conservation
Regulatory info (ENCODE)
Conservation identifies critical binding sites
in regulatory elements
Important binding sites and variants that affect
function will be here
Comparative functional genomics
Patterns of selection on gene expression and regulation
Neutral
Constrained
Directional
Romero et al., Nat Rev Genet. 13:505 (2012)
Nature 478: 343 (2011)
• Human
• Chimpanzee
• Bonobo
• Gorilla
• Orangutan
• Macaque
• Mouse
• Opossum
• Platypus
• Chicken
• Custom gene models based on Ensembl + RNA-seq
• 5,636 1:1 orthologs in amniotes
• 13,277 1:1 orthologs in primates
• Only constitutive exons
Gene expression diverges as species diverge
Conservation of gene expression varies across tissues
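One common way to summarize expression divergence between two species is a rank correlation of expression levels across their 1:1 orthologs: lower correlation means greater divergence. The sketch below implements Spearman's correlation for data without ties. The expression values are made up for illustration and are not results from any study; the choice of Spearman's correlation is an assumption, not necessarily the method used in the work cited on the slides.

```python
def ranks(values):
    """Rank values from 1 (smallest) to n (largest); assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman(x, y):
    """Spearman rank correlation for equal-length lists without ties."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two lists of equal length >= 2")
    d_squared = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1.0 - 6.0 * d_squared / (n * (n * n - 1))


# Hypothetical expression levels for the same five 1:1 orthologs in one tissue.
EXPRESSION = {
    "human": [5.1, 0.3, 12.0, 7.7, 2.2],
    "chimp": [4.8, 0.5, 11.1, 8.0, 2.0],
    "mouse": [3.0, 6.5, 9.0, 1.2, 2.5],
}

if __name__ == "__main__":
    for other in ("chimp", "mouse"):
        print(other, spearman(EXPRESSION["human"], EXPRESSION[other]))
```

Repeating the comparison per tissue would show whether some tissues retain more similar expression across species than others.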
Issues in comparative functional genomics
• Input data are noisy: ChIP-seq and RNA-seq data are signal based and subject to considerable experimental variation
• Using comparable biological states within and across species (e.g., human liver vs. mouse liver) = variation across tissues?
• How do epigenetic states and gene expression diverge among individuals and across species (Neutral? Constrained?)
• Can we identify variants or substitutions that drive regulatory changes?