Orthology predictions for whole
mammalian genomes
Leo Goodstadt
MRC Functional Genomics Unit
Oxford University
“Evolution of Orthologues”
Selection pressures in orthologues and paralogs
“Gene Duplications”
Reproduction, immunity or chemosensation
“Gene birth in the human lineage”
Ongoing duplications underlie polymorphism
“Synonymous substitution rates”
Mutation and selection varies by chromosome size
Orthology is the key
How it started
We are “consumers” of
orthology / paralogy
Started off using Ensembl
predictions
Ensembl 1:1 covered 50% of
predicted mouse genes.
Ewan’s manual survey said
80%
1) General observations for all
mammalian genomes
Paralogues evolve fast (and are fun!)
2) Observations for whole clades of
species
[Figure: lineage-specific dN/dS (0.00 to 0.14) by species, for Drosophila, nematodes and amniotes]
3) Inparalogues define
lineage specific biology
Marsupial / Monodelphis biology revealed by
lineage specific genes
• Chemosensation (OR, V1R and V2R )
• Reproduction (Vomeronasal Receptors, lipocalins, β-microseminoprotein (12:1))
• Immunity (IG chains, butyrophilins, leukocyte IG-like receptors, T-cell receptor
chains and carcinoembryonic antigen-related cell adhesion molecules )
pancreatic RNAses
• Detoxification (hypoxanthine phosphoribosyltransferase homologues; nitrogen-poor diets)
• KRAB ZnFingers
4) Interesting stories in the
aggregate
5) Treasure trove in the details
Ongoing mouse inparalogue analysis: lots and lots of reproductive
genes
clade: #2
(ortholog_id = 17117 in panda)
159 mus genes
47 genes new to assembly 36
10 genes completely new to assembly 36
Interpro matches for this clade:
!!! Expansion mainly on chr5 and 14, although single (pseudogene?) versions on chr13 and chr16.
!!! Mouse DLG5 is: chr14:22,966,420-22,978,653 (expressed in testis: AK147699)
Table columns: gene identifier | order | chrm | exons | stop | length | description
Gene identifiers: MUS_GENE_21705, ENSMUSP00000086007, MUS_GENE_22420, ENSMUSP00000099126, MUS_GENE_19599, NCBIMUSP_83776567, MUS_GENE_23688, ENSMUSP00000094421, MUS_GENE_19774
Order: 6639, 6643, 6646, 6651, 6657
Rows (chrm; description; remaining values):
5; spermatogenesis associated glutamate (E)-rich protein 1, pseudogene 1; 4; 182
5; predicted gene, EG623898; 2; 72
5; spermatogenesis associated glutamate (E)-rich protein 1, pseudogene 1 (Speer1-ps1) on chromosome 5; 4; 157
5; predicted gene, EG623898; 2; 72
5; spermatogenesis associated glutamate (E)-rich protein 3
6) Candidates for evolutionary and
functional analyses
Secretoglobin Protein Family members: Androgen-binding proteins.
Emes et al. (2004) Genome Res.
Available Genomes and Divergences
Hedges, S.B. (2002) Nature Reviews Genetics 3, 838–849
How do we find function in the
genome?
• Nothing in Biology Makes Sense Except in the
Light of Evolution. Theodosius Dobzhansky
(1900-1975).
How to find the function in the
genome?
Similar Sequences(Genes / Genome regions)
Common Ancestry (homology)
Similar Structures / Folds
Similar Functions ?
How much of the genome is functional?
Compare with the mouse
Whole
Genome
ARs
Ancestral Repetitive (AR) sequence is
non-functional and has evenly
distributed conservation scores (red)
(symmetrical bell shaped due to biological
variation)
Whole Genome sequence contains
some functional sequence under
selection and thus has a small excess
of conserved sequence under
purifying selection
(asymmetrical)
Functional sequence
=Whole Genome - Ancestral
Repetitive
= 5%
N.B.
This is an estimate that doesn’t take into account sequence
•Turning over rapidly (not shared by mouse/human)
•Under positive (diversifying) selection
The human genome (euchromatic sequence)
Protein coding: 1.2%
UTR: 0.3%
Conserved non-coding (3.5% ?)
Neutral
Repeats
(Transposable elements, …)
~45%
Unknown
(old repetitive junk?)
Conserved non-coding material
• Transcription factor binding sites
• Enhancers, insulators and other
non-transcribed regulatory elements
• Alternative splicing signals
• Transfer RNAs, ribosomal RNAs
• Small RNAs (e.g. snoRNAs, microRNAs, siRNAs and piRNAs)
regulatory/gene silencing / RNA degradation
• MacroRNAs (e.g. Xist)
enzymatic? / chromosome inactivation
Functional parts of genes are highly conserved
How many protein coding genes?
• Walter Gilbert [1980s] 100k
• Antequera & Bird [1993] 70-80k
• John Quackenbush et al. (TIGR)
[2000] 120k
• Ewing & Green [2000] 30k
• Tetraodon analysis [2001] 35k
• Human Genome Project (public) [2001] ~ 31k
• Human Genome Project (Celera) [2001] 24-40k
• Mouse Genome Project (public) [2002] 25k -30k
• Lee Rowen [2003] 25,947
• Human Genome Project (finishing) 20-25k [2004]
• Current predictions [2008] 19-20k
Traditional Genome Orthology
Reciprocal BLAST best hits between longest
transcript of each gene (+ synteny)
Assumes:
• Protein similarity is proportional to
evolutionary distance (selection is
invariant!)
• Pairwise relationships adequately represent
the evolutionary tree
• No gene losses or missing predictions
• Alternative splicing can be ignored!
• No gene translocations after tandem
duplication
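The reciprocal-best-hit rule described above can be sketched in a few lines. This is a minimal illustration, not the Ensembl implementation: the score dictionaries stand in for parsed BLAST output, and the gene names (`hA`, `mB2`, ...) and scores are made up for the example.

```python
def best_hits(scores):
    """Map each query to its best-scoring subject.

    `scores` is {(query, subject): similarity score}, e.g. BLAST bit scores.
    """
    best = {}
    for (query, subject), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Return sorted (a, b) pairs where a's best hit is b and b's best hit is a."""
    best_ab = best_hits(a_vs_b)
    best_ba = best_hits(b_vs_a)
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)


# Hypothetical scores: mB and mB2 are mouse duplicates of human hB.
human_vs_mouse = {("hA", "mA"): 950, ("hA", "mB"): 400,
                  ("hB", "mB"): 700, ("hB", "mB2"): 690,
                  ("hC", "mA"): 300}
mouse_vs_human = {("mA", "hA"): 950, ("mB", "hB"): 700,
                  ("mB", "hA"): 400, ("mB2", "hB"): 690}

pairs = reciprocal_best_hits(human_vs_mouse, mouse_vs_human)
print(pairs)
```

In this toy input the duplicate `mB2` is dropped even though it scores almost as well as `mB`, which is the "only 1:1 orthologues" limitation discussed later.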
Orthology prediction methods
• Two genomes
– Reciprocal best blast hit
• Multiple genomes
– Clustering of
• reciprocal best hits
• protein similarities
Query
Blast hits
Reciprocal Blast Best Hits
Advantages:
• Fast, Well understood
• Works well for distant lineages
• Can correlate with protein structure (domains)
Disadvantages:
• Only provides 1:1 orthologues in the best case
• Can be difficult to reconcile with the species
tree
[Figure: reciprocal BLAST best hits between genes on a chromosome of species 1 and genes on a chromosome of species 2]
How to add duplicated genes? synteny
Ensembl compara in the past
• Local gene order tends to be conserved in
mammalian lineages
• Look for inparalogs locally even if the protein
distances don’t add up
(sequence error,
sampling error etc.)
[Figure: BLAST best hits in local regions]
Problems with relying only on synteny
Local homologs are often not inparalogs:
•Local rearrangements
•Missing predictions
(neighbouring orphans)
•Need sanity checking
Human and Mouse chromosomes:
•Extensive rearrangements only over larger regions
•Conservation of gene order in the short range
Olfactory Orthology from compara
Mouse chromosome 2
Rat chromosome 3
One to one
One to many
Many to one
Many to many
Olfactory Orthology
Mouse chromosome 2
Rat chromosome 3
One to one
One to many
Many to one
Many to many
Inparanoid
• Remm,M., Storm,C.E. and Sonnhammer,E.L.L.
(2001) Automatic clustering of orthologs and
in-paralogs from pairwise species
comparisons. J. Mol. Biol. 314, 1041–1052.
• Avoids multiple alignments and phylogenetic
methods for speed and to avoid errors
• Heuristics are implicitly phylogenetic
How Inparanoid works
1. Longest transcripts: pairwise alignment scores
2. Use cutoff: reciprocal best hits are orthologues (identify “main” orthologues)
3. Add lineage-specific duplicates (inparalogs) with confidences (identify “inparalog” candidates)
4. Resolve conflicts
5. Orthology
Confidence values for inparalogs
A
B
1. The most confident inparalog is one that is sequence-identical to the
main orthologue.
2. Maximum value = score_identical – score_orthologs
3. Confidence = (score_inparalog – score_orthologs) /
(score_identical – score_orthologs)
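The confidence formula above is easy to check numerically. The sketch below is only an illustration of that formula; the function name and the example scores are assumptions, and the reading of the three scores (inparalog vs. main orthologue, main orthologue vs. its counterpart in the other species, main orthologue vs. itself) is my interpretation of the slide.

```python
def inparalog_confidence(score_inparalog, score_orthologs, score_identical):
    """Scale an inparalog's score between the orthologue score (0) and identity (1).

    score_inparalog: similarity of the candidate inparalog to the main orthologue
    score_orthologs: similarity between the two main orthologues
    score_identical: score of the main orthologue aligned to itself
    """
    span = score_identical - score_orthologs
    if span <= 0:
        raise ValueError("score_identical must exceed score_orthologs")
    return (score_inparalog - score_orthologs) / span


# Hypothetical scores: self-score 1000, orthologue pair 600.
print(inparalog_confidence(1000, 600, 1000))  # identical copy
print(inparalog_confidence(800, 600, 1000))   # halfway between
print(inparalog_confidence(600, 600, 1000))   # no closer than the orthologue
```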
Resolving conflicts
1. Merge if orthologs already clustered in same group
2. Merge if two equally good best hits
3. Delete weaker group
4. Merge significantly overlapping
5. Divide overlapping
Why are there conflicts?
• Protein differences are a proxy for
evolutionary time
• Protein similarity scores approximate protein
differences (sequence, alignment, estimation errors)
• Pairwise scores can be used to (conceptually)
recover phylogenetic (tree) data
Alternatives: phylogenetic methods
• Inparanoid is great because it models
phylogeny implicitly
• Why not use phylogenetic methods directly?
• Multiple estimators of protein distance
4 pairwise scores used out of 30
Phylogenetic methods
• Iterative distance methods are very fast,
suitable for whole genome analyses (variants
on neighbor joining)
• Statistically consistent with evolutionary
models (can have explicit error model with
evolutionary distances, e.g. bionj)
• Inparanoid type consistency checking can be
carried out after phylogeny is predicted
Is protein similarity a good proxy
for evolutionary distance?
Advantages
• Does not saturate over long evolutionary distances
• Easy to align / predict genes (unlike non-coding
regions)
• Sometimes cDNA sequence is not available
Disadvantage
• Assumes constant evolutionary rate
• Assumes invariant selection
Use Silent Mutations as a genetic clock
• Redundant genetic code, e.g.
GCA, GCC, GCG, GCT → Alanine
• Third base of a codon “wobbles” without
changing the translated amino acid
• dS approximates neutral mutation rate
(without selection) in coding regions
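To make the idea of silent changes concrete, the sketch below builds the standard genetic code and counts codon differences between two aligned coding sequences as silent or amino-acid-changing. It is a raw count for illustration only, not an estimate of dS: real estimators such as codeml normalise by the number of synonymous sites and correct for multiple hits. The function name and the example sequences are made up.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def classify_differences(cds_a, cds_b):
    """Count differing codons as (synonymous, non_synonymous).

    Both inputs must be gap-free, in-frame coding sequences of equal length.
    """
    if len(cds_a) != len(cds_b) or len(cds_a) % 3:
        raise ValueError("sequences must be aligned, in frame and equal length")
    synonymous = non_synonymous = 0
    for i in range(0, len(cds_a), 3):
        codon_a, codon_b = cds_a[i:i + 3], cds_b[i:i + 3]
        if codon_a == codon_b:
            continue
        if CODON_TABLE[codon_a] == CODON_TABLE[codon_b]:
            synonymous += 1
        else:
            non_synonymous += 1
    return synonymous, non_synonymous


# All four GCN codons encode alanine.
print({codon: CODON_TABLE[codon] for codon in ("GCA", "GCC", "GCG", "GCT")})

# GCA -> GCG is silent; AAA (Lys) -> GAA (Glu) changes the protein.
print(classify_differences("ATGGCAGCCAAA", "ATGGCGGCCGAA"))
```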
dS as proxy for evolutionary distance
• Easier to align than Ancestral Repeats
• Not neutral sequence!!
• Genomic > 2x variation in dS
• Assumes most gene families are local due to
tandem duplication and share dS
• Assume (partial) gene conversions are
infrequent
dS Caveats
• Saturates at long evolutionary distances
(but less so than many think)
• Beware of GC / codon frequency biases
(use ML rather than heuristic methods)
• Multiple alignment / tree rather than pairwise
for best results
• Slow to estimate accurately
• Missing values (where dS saturates)
codeml dS accuracy at 400 codons
yn00 dS accuracy at 400 codons
Use all transcripts
PhyOP: transcript trees from dS
1. Whole genome alignment identifies
homologues
2. codeml for dS calculation
3. Ignore large dS
4. Hierarchical cluster
5. Fitch Margoliash modified to handle missing
values to give giant transcript tree
6. Heuristics based on lowest dS to select
1 “representative” transcript per gene
7. Map Gene tree to species tree
Fitch Margoliash
Minimize
$$\sum_{i<j} \frac{(d_{ij} - p_{ij})^2}{d_{ij}^2}$$
Where
• dij is the pairwise distance estimate
• pij is the distance between i and j on the tree
Assumes that the error is a fixed proportion of
the total distance
(Fitch and Margoliash, 1967)
Easily adapted for missing values
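A minimal sketch of the Fitch-Margoliash criterion with missing values, assuming the usual weighting by 1/d² (which matches the "error is a fixed proportion of the distance" assumption). Pairs without a usable estimate, for example where dS has saturated, are simply left out of the sum. The function name, the dictionary layout and the numbers are hypothetical, and this only scores a given tree; it does not search for one.

```python
import math


def fitch_margoliash_error(observed, tree_distances):
    """Sum of squared relative errors over pairs with a usable estimate.

    observed:       {(i, j): estimated distance d_ij, or None/NaN if missing}
    tree_distances: {(i, j): path length p_ij between i and j on the tree}
    """
    total = 0.0
    for pair, d in observed.items():
        if d is None or math.isnan(d) or d <= 0:
            continue  # missing or unusable estimate: contributes nothing
        p = tree_distances[pair]
        total += (d - p) ** 2 / d ** 2
    return total


# Hypothetical dS estimates; the b-c comparison is missing.
observed = {("a", "b"): 0.2, ("a", "c"): 0.5, ("b", "c"): None}
on_tree = {("a", "b"): 0.22, ("a", "c"): 0.45, ("b", "c"): 0.47}

error = fitch_margoliash_error(observed, on_tree)
print(round(error, 6))
```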
PhyOP pipeline Part 1
3 ways in which transcript trees map to genes
• Simple clades
only 1 transcript per gene in orthologous
relationship: most genes
• Unambiguous clades
Alternative transcripts are in the same
orthologous relationships
3 ways in which transcript trees map to genes
• Ambiguous clades
Alternative transcripts are in inconsistent
relationships (small proportion)
Where are most transcripts?
Assumption:
Most transcripts are not in any sort of
orthologous relationships: their conjugates
have not been predicted.
Reality
Most transcripts are in the same clade as their
alternative transcripts:
Because of shared exons, they are most similar
to their alternatively transcribed siblings.
How to choose between alternative transcripts?
• Use conserved exon boundaries: this excludes
exogenous sequence
• Use distance to its ortholog (not tree distance,
because these will be equal):
high dS means exogenous sequence and will
be excluded
With multiple partially overlapping clades, this
is more difficult
PhyOP pipeline Part 2
Example
Four alternative transcripts (1-4), 6 dog genes, 3
human genes
• Clade 1 transcripts
Doga1 Dogb1 Dogc1 Dogf1 Humanb1 Humanc1
• Clade 2 transcripts
Dogb2 Dogc2 Doge2 Dogf2 Humana2 Humanc2
• Clade 3 transcripts
Doga3 Dogb3 Dogd3 Doge3 Dogf3
Humana3 Humanb3 Humanc3
“Anointing” transcripts to keep:
Example
Circularity / boot-strapping problem:
The transcript in the other species which is used for “anointing”
might itself be discarded
• Doga1 is closer to Humanb1 than Doga3 to any human
transcript: Keep Doga1 discard Doga3
• Humanb3 is closer to Doge3 than Humanb1 to any dog
transcript: Keep Humanb3 discard Humanb1
• Oops. Now no Human transcript is close to Doga1.
• Clade 1 transcripts
Doga1 Dogb1 Dogc1 Dogf1 Humanb1 Humanc1
• Clade 2 transcripts
Dogb2 Dogc2 Doge2 Dogf2 Humana2 Humanc2
• Clade 3 transcripts
Doga3 Dogb3 Dogd3 Doge3 Dogf3 Humana3 Humanb3 Humanc3
How to avoid circularity
• Previously: Use mean distance to all other transcripts in the
other species. Close eyes. Hope problem goes away.
• Now:
1. Take all transcript pairs from all three clades starting with
the closest dS
2. “Anoint” both transcripts from the pair and throw away all
other transcripts
3. Ignore all pairs which involve discarded transcripts
4. Recurse
5. Complicated by trying to keep merged genes
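The greedy procedure above can be sketched as a single pass over pairs sorted by dS. This is an illustration under stated assumptions, not the PhyOP code: I assume "throw away all other transcripts" means the other transcripts of the same two genes, the recursion is written as a loop, merged genes are ignored, and the dS values are invented.

```python
def anoint_transcripts(pairs, gene_of):
    """Pick one representative transcript per gene, closest pairs first.

    pairs:   iterable of (dS, transcript_a, transcript_b) across species
    gene_of: {transcript: gene}
    Returns {gene: anointed transcript}.
    """
    anointed = {}
    for _, t_a, t_b in sorted(pairs):
        g_a, g_b = gene_of[t_a], gene_of[t_b]
        # A transcript is discarded once its gene has anointed a different one;
        # pairs involving a discarded transcript are ignored.
        if anointed.get(g_a, t_a) != t_a or anointed.get(g_b, t_b) != t_b:
            continue
        anointed[g_a] = t_a
        anointed[g_b] = t_b
    return anointed


gene_of = {"Doga1": "Doga", "Doga3": "Doga", "Doge3": "Doge",
           "Humana3": "Humana", "Humanb1": "Humanb", "Humanb3": "Humanb"}
# Hypothetical dS values for the circular case on the slide.
pairs = [(0.30, "Doga1", "Humanb1"),
         (0.25, "Doge3", "Humanb3"),
         (0.50, "Doga3", "Humana3")]

kept = anoint_transcripts(pairs, gene_of)
print(kept)
```

Because Humanb3 is anointed first, the Doga1/Humanb1 pair is skipped rather than leaving Doga1 anointed against a transcript that has been thrown away; Doga then falls back to its next-closest surviving pair.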
From transcripts to genes
dS for orthologues
dS distributions can be an
indication of orthologue quality
Dog vs. Human Genomes
Conservation of Gene Order in
Mouse / Rat ORs
[Figure: rat OR gene order (0 to 1600) plotted against mouse OR gene order (0 to 1200)]
How to improve on using dS?
• dS dates the history better, but fails for distant
homologs.
• dN works for distant homologs, but tends to be
subject to selective pressures.
Can we combine them?
• Full codon evolutionary model would account
for this automatically
• Use bootstrapping: if values -> random, no
longer informative
TreeBeST
Tree Building guided by Species Tree
http://treesoft.sourceforge.net/treebest.shtml
Heng Li
• Tree merge algorithm: merge several trees
that are built from the same alignment with
different models.
• Species-aware maximum likelihood:
use species phylogeny to correct errors
Maximize use of underlying data
5 tree types:
1. Synonymous distance NJ
2. Non-Synonymous distance NJ
3. P distance NJ
4. WAG maximum likelihood
5. HKY maximum likelihood
Each predicted from same data
Use bootstrap values to identify optimal
branches using context free grammar
Context Free Grammar in TreeBeST
Given a set of binary rooted trees with the same
leaf set V, reconstruct a binary rooted tree
such that:
• each branch of the resultant tree comes from
one of the given trees
• the resultant tree minimizes a certain
objective function
• additivity
• topological independence
Maximize use of underlying data
• Switch automatically between
– codon: dN, dS;
– nucleotide: HKY and
– protein: P-distance
depending on bootstrap
• Fix high probability errors by minimizing
distance to species topology
Slide from Heng Li
Trees reconciled optimally
Slide from Heng Li
Is TreeBeST more reliable?
Slide from Heng Li
Caveats
• Bootstrapping may not be the most effective
way to test the support for a particular tree
given the underlying data
• The underlying data are not the state of the
art but cannot use codon + ML for speed
• Limited by multiple alignment
• Reconciliation with species tree can mask real
gene losses/duplications
Alternative transcripts reveal
merged genes
• Ensembl includes merged genes
435 dog
346 human
Finding merged genes
What is the best way to deal with
alternative transcripts?
• Create virtual transcript
Virtual translation
What is the best way to deal with
alternative transcripts?
If two transcripts do not overlap and have
homology to each other, they may be
tandemly duplicated gene models merged in
error
Include both transcripts in pipeline
How to run orthology pipeline for
whole genomes
• Take all proteins and cDNA
• Make sure they correspond exactly: no stop
codons, no genomic mismatches
• All vs all blastall
• Protein-guided alignments of cDNA
• Create virtual translation peptide
• Run tree prediction. E.g. TreeBeST
• Reconcile with species tree to derive
orthology
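The second step (checking that each protein and its cDNA correspond exactly) is the kind of sanity check that is cheap to automate. The sketch below is one possible version, with assumptions: it uses the standard genetic code, tolerates a single terminal stop codon, and reports problems as plain strings. The function name and the test sequences are hypothetical.

```python
from itertools import product

BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def check_pair(protein, cdna):
    """Return a list of problems found when matching a cDNA to its protein."""
    cdna = cdna.upper()
    if set(cdna) - set(BASES):
        return ["cDNA contains characters other than A, C, G, T"]
    problems = []
    if len(cdna) % 3:
        problems.append("cDNA length is not a multiple of 3")
    usable = len(cdna) - len(cdna) % 3
    translated = "".join(CODON_TABLE[cdna[i:i + 3]] for i in range(0, usable, 3))
    if translated.endswith("*"):
        translated = translated[:-1]  # assumption: a terminal stop is acceptable
    if "*" in translated:
        problems.append("internal stop codon")
    if translated != protein.upper():
        problems.append("translation does not match protein")
    return problems


print(check_pair("MAK", "ATGGCAAAATAA"))  # clean pair
print(check_pair("MAK", "ATGTAAAAA"))     # premature stop
```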
Predicting orthology gets easier
with more genes/species
1. Phylogenetic methods improve in power with more data
2. Heuristic / pairwise methods decrease in power /
become more ambiguous with more data
Why is orthology prediction so
hard for mammals?
Because gene prediction is so hard
The human genome (euchromatic
sequence)
Protein coding: 1.2%
UTR: 0.3%
Conserved non-coding (3.5% ?)
Neutral
Repeats
(Transposable elements, …)
~45%
Unknown
(old repetitive junk?)
Signals in DNA are weak
• non-canonical splice sites
• promoters without TATA box
• introns/exons can have varying lengths
• ...
→ probabilistic models: Hidden Markov Models
Accuracy of ab-initio gene
prediction
• Nucleotide level:
– 90% sensitivity/90% selectivity
• Exon level:
– 70% sensitivity/50% selectivity
• Gene level:
– 40% sensitivity/30% selectivity
• False positives: difficult to refute
• False negatives: will be missed
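For readers unfamiliar with these measures, the sketch below computes them at the exon level. It assumes that "selectivity" here means the fraction of predictions that are correct, TP / (TP + FP), as is common in gene-prediction benchmarks, and that an exon only counts as correct if both boundaries match exactly. The coordinates are invented.

```python
def exon_level_accuracy(annotated, predicted):
    """Return (sensitivity, selectivity) for exact-match exon predictions.

    sensitivity = TP / (TP + FN): fraction of real exons that were found
    selectivity = TP / (TP + FP): fraction of predicted exons that are real
    """
    annotated, predicted = set(annotated), set(predicted)
    true_positives = len(annotated & predicted)
    sensitivity = true_positives / len(annotated) if annotated else 0.0
    selectivity = true_positives / len(predicted) if predicted else 0.0
    return sensitivity, selectivity


# Hypothetical (start, end) exon coordinates.
annotated = [(100, 200), (300, 400), (500, 600), (700, 800)]
predicted = [(100, 200), (300, 400), (505, 600), (900, 950)]

accuracy = exon_level_accuracy(annotated, predicted)
print(accuracy)
```

The third predicted exon overlaps a real one but misses its start by five bases, so under the exact-match assumption it counts as both a false positive and a false negative.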
Limitations of ab-initio models
• Limited to training set
• Limited to model (strange genes)
• Problems with long genes
• Small exons are difficult to find
• Terminal exons are difficult to find
– No splice signals, other signals variable
• e.g. Genscan
Comparative/homology methods
• Add extra data to locate genes
• Compare genome to known sequences (same or different organism)
– cDNAs
– ESTs
– Known protein sequences
– e.g. Genewise
• Compare genome to other genome
– e.g. TwinScan
Using cDNAs/ESTs
• cDNAs
+ Provide 3'UTR and 5'UTR
+ Provide full gene structure
– Expensive and thus rare
– Contamination with genomic DNA
• ESTs
+ Cheap and thus plentiful
– Highly redundant
– Of variable quality
– Not complete
• Both: biased towards highly expressed genes
Using cDNAs
[Figure: gene structure (5'UTR, exons, introns, 3'UTR) aligned to a cDNA sequence]
• Alignment between DNA sequences
– Introns and reading frames
Using known protein sequences
[Figure: gene structure (5'UTR, exons, introns, 3'UTR). Aligning a “cDNA” to the genome gives an implicit cDNA sequence and a predicted protein sequence, which is then aligned to the known protein sequence]
Using another genome sequence
• Add evidence to ab-initio model, e.g. TwinScan
[Figure: gene structure in Genome 1 (5'UTR, exons, introns, 3'UTR) with BLASTN results against Genome 2]
• Align gene models between orthologous regions, e.g. DoubleScan
[Figure: gene structures (5'UTR, exons, introns, 3'UTR) in Genome 1 and Genome 2]
Sweet spot for prediction by
homology
[Figure: sensitivity and specificity of ab-initio and homology-based prediction, plotted against similarity of known protein to target]
Guigo et al. (2000)
Rate analyses
Branches of gene trees scale
symmetrically
• Variations in branch length
[Figure: cumulative frequency of median distance to root (synonymous substitution rate, dS) for dana, dere, dmel, dsec, dsim and dyak; ideal world vs. real world]
What orthologous genes should
look like
Sequence conservation between mouse and human genes
Mouse genome paper Nature 420, 520-562
What orthologous genes should
look like
• Exons conserved between genomes
• UTRs partially conserved between genomes
CGSC (2004)
Gene validations using orthology
• Most genes have orthologues
• Almost all genes have mammalian homologs
• Exaptation of non-coding sequence is rare,
especially for constitutively expressed exons
• Conservation of exon-intron structure
(number and phase of exons)
• Conservation of length
• Conservation of domains
• Conservation of synteny
Look carefully at genes
• For example: small introns
Introns
Pseudogene?
Conservation of splice sites:
• Insertions / losses of introns are rare
• Phase never changes
• Aligned positions should nearly always match
allowing for alignment errors
• Valid mismatches may represent insertions
(outside of protein domains)
• Find retrogenes
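The intron checks above can be sketched from coding exon lengths alone. An intron's phase is the number of coding bases before it, modulo 3, so orthologous gene models should give the same list of phases even when exon lengths differ by whole codons. The function names and exon lengths below are hypothetical, and the comparison ignores alignment, so it is only a first-pass filter.

```python
def intron_phases(coding_exon_lengths):
    """Phase of each intron: coding bases upstream of it, modulo 3."""
    phases = []
    upstream = 0
    for length in coding_exon_lengths[:-1]:  # no intron after the last exon
        upstream += length
        phases.append(upstream % 3)
    return phases


def same_intron_structure(exons_a, exons_b):
    """True if both gene models have the same number and phases of introns."""
    return intron_phases(exons_a) == intron_phases(exons_b)


human = [120, 95, 211, 60]     # hypothetical coding exon lengths (bp)
opossum = [117, 95, 214, 60]   # one codon moved between exons 1 and 3
shifted = [120, 96, 210, 60]   # second intron has changed phase
retrogene = [486]              # intronless copy

print(intron_phases(human), intron_phases(opossum))
print(same_intron_structure(human, opossum))
print(same_intron_structure(human, shifted))
print(intron_phases(retrogene))
```

A model like `shifted` would be worth inspecting by hand, and an intronless match such as `retrogene` is a retrogene candidate to confirm by loss of synteny.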
Conservation of splice sites:
• Tandem duplication of non-coding may result
in the appearance of splice site conservation
• Check if sequence similarity is absolute
• Check coding potential
(Tandem duplicates are often fast evolving genes
under positive selection)
Retrogenes
• Loss of introns due to
retrotransposition can be confirmed by
loss of synteny (blastz)
• Not all retrogenes are non-functional
• Ancient ones are functional
• Recent retrogenes can be assumed to be
dead
Gene validations using orthology
Make sure orthology properties look
appropriate
Homo / Monodelphis 1:1 orthologues
dN/dS: 0.086
dS: 1.02
Amino acid sequence identity: 81.0%
Pairwise alignment coverage: 94.2%

                                   Homo sapiens   Monodelphis domestica
Number of exons                    9              9
Sequence length (codons)           471            445
Unspliced transcript length (bp)   27,241         25,365
G+C content at 4D sites            56.9%          48.7%
What can you do with orthologs?
Wait for part II