Phylogenomics
Using complete genomes to determine
the phylogeny of species
Bas E. Dutilh
Tree of life
• Bacteria
• Archaea
• Eukaryota
Evolution
• What we can see are
the present-day
species
• Offspring looks like its
parents
• Mutations
– Phenotype
– Genotype
• Nature selects:
survival of the fittest
Phenotype
• Which properties to compare?
• Watanabe's Ugly Duckling Theorem:
“All things have an infinite
number of features.
So any two things share an
infinite number of features.
Therefore two things cannot
be of the same kind because
they share more features
than they do with things of
a different kind.”
Evolution
Genotype
• Genome sequence is finite
and you do not have to
choose
• Genetic properties
– Word frequency
– Sequence (nt/aa)
– Gene content
– Gene order
Why sequence similarity works
• Every residue (nt/aa) is a
separate dimension
– Human: 3 billion nucleotides
• Most mutations are …
Sequences never converge
Evolution: mutation and selection
• Mutation is responsible for changes
• Selection is responsible for continuity
• The more differences, the more distantly
related two sequences are
• Contrary to structure or phenotype,
sequences do not converge
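The statement "the more differences, the more distantly related" can be made concrete with the simplest sequence distance, the fraction of differing positions (p-distance). This is a minimal sketch, assuming the two sequences are already aligned and that `-` marks a gap; the function name is illustrative and not from the slides.

```python
def p_distance(seq_a, seq_b):
    """Fraction of aligned, ungapped positions at which two sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    differences = 0
    for x, y in zip(seq_a, seq_b):
        if x == "-" or y == "-":
            continue  # skip columns with a gap in either sequence
        compared += 1
        differences += x != y
    if compared == 0:
        raise ValueError("no ungapped positions to compare")
    return differences / compared


print(p_distance("ACGT", "ACGA"))
```

The p-distance underestimates the true number of changes once sites have mutated more than once, which is the saturation problem mentioned later in the slides.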
Phylogenetics
• Inferring the evolution of a gene
• Distance matrix
• Hierarchical clustering
• Evaluate likelihood of
all possible trees
• Maximum likelihood
Substitution matrix
• Describes the rate at which one character in a
sequence changes to other character states
• BLOck SUbstitution Matrix (BLOSUM) is based on
observed substitutions in aligned protein blocks; e.g. for
BLOSUM62, sequences sharing ≥62% identity are clustered
and counted as one
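How observed substitutions turn into matrix scores can be sketched as a log-odds calculation: the observed frequency of a residue pair is compared with the frequency expected by chance. The sketch below assumes scores in half-bit units rounded to integers and takes a hypothetical list of aligned residue pairs as input; it leaves out the clustering and block extraction that a real BLOSUM matrix requires.

```python
import math
from collections import Counter


def log_odds_scores(pairs):
    """Log-odds scores (half-bits) from an iterable of aligned residue pairs."""
    pair_counts = Counter(tuple(sorted(pair)) for pair in pairs)
    total = sum(pair_counts.values())

    # Background frequency of each residue, counted over both pair members.
    residue_counts = Counter()
    for (a, b), n in pair_counts.items():
        residue_counts[a] += n
        residue_counts[b] += n
    background = {r: n / (2 * total) for r, n in residue_counts.items()}

    scores = {}
    for (a, b), n in pair_counts.items():
        observed = n / total
        # Unordered pairs of different residues can arise in two ways.
        expected = background[a] * background[b] * (1 if a == b else 2)
        scores[a, b] = round(2 * math.log2(observed / expected))
    return scores


toy_pairs = [("A", "A")] * 8 + [("A", "B")] + [("B", "B")]
print(log_odds_scores(toy_pairs))
```

Pairs seen more often than expected get positive scores, pairs seen less often get negative ones.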
Neighbour joining
Maximum likelihood
• Make all possible trees
• Calculate likelihood that
the alignment evolved in
this tree
Maximum likelihood tree
• Very computer intensive
• PhyML searches “around”
starting tree (e.g. NJ)
Maximum parsimony
• Parsimony is a special case of likelihood
• The tree with the smallest number of
mutations is the maximum parsimony tree
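Counting the smallest number of mutations on a given tree can be sketched with Fitch's algorithm, which the slides do not name; it is used here as one standard way to score a tree. The sketch assumes a rooted binary tree written as nested tuples of species names and an alignment stored as a dict of equal-length strings.

```python
def fitch(tree, states):
    """Return (possible ancestral states, minimum changes) for one site."""
    if isinstance(tree, str):
        return {states[tree]}, 0
    left, right = tree
    left_set, left_cost = fitch(left, states)
    right_set, right_cost = fitch(right, states)
    common = left_set & right_set
    if common:
        return common, left_cost + right_cost
    # No shared state: at least one mutation is needed at this node.
    return left_set | right_set, left_cost + right_cost + 1


def parsimony_score(tree, alignment):
    """Sum the minimum number of changes over all alignment columns."""
    length = len(next(iter(alignment.values())))
    total = 0
    for i in range(length):
        column = {species: seq[i] for species, seq in alignment.items()}
        total += fitch(tree, column)[1]
    return total


alignment = {"A": "AA", "B": "AC", "C": "GC", "D": "GC"}
print(parsimony_score((("A", "B"), ("C", "D")), alignment))
print(parsimony_score((("A", "C"), ("B", "D")), alignment))
```

Scoring every candidate tree this way and keeping the lowest score gives the maximum parsimony tree for small examples.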
SSU rRNA
• Present in all species
• Constant function
• Slowly evolving
Fox et al, Science 1980
SSU rRNA
• The phylogeny of SSU rRNA
revealed the three domains
• Representative of the
evolutionary history of species
• Bacteria
• Archaea
• Eukaryota
Olsen et al, J Bacteriol 1994
Different genes tell different stories
• Conflict between trees based on single genes
• Unrecognized paralogy
– Orthologs
– Paralogs
[Figure labels: ancestor, spec A, spec B, spec C]
• Horizontal gene transfer
• Mutation saturation, biases, divergent rates
Is a tree the right representation?
• Genomes are chimeras with genes from
different origins
– Endosymbiosis (mitochondrion, chloroplast)
– Horizontal gene transfer (many examples, often
adaptations to environment)
More data = more consistent trees
• Combine information from more genes to
average out these anomalies
• Complete genomes contain the maximum
phylogenetic information
Fungi
• Yeasts, filamentous
and dimorphic fungi
• Fungi are the eukaryotic clade with largest
number of completely sequenced genomes
• S. cerevisiae is a well studied model organism
• Much consensus about phylogeny
Consensus phylogeny (literature)
19 target nodes
Orthology
• Which genes to compare between species
– Homologs (originated “de novo”)
– Orthologs (originated at speciation)
• Orthology has higher resolution
– Pairwise orthology
– Cluster orthology
– Tree-based orthology
[Figure labels: ancestor, spec A, spec B, spec C]
Pairwise orthology (Inparanoid)
• Compare all proteins in species A to all
proteins in species B to find homologs
• Find bi-directional best hit
• All proteins closer than bi-directional best hit
are (in-) paralogs
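The bi-directional best hit step can be sketched as follows. The input format is an assumption: two dicts of similarity scores (for example bit scores from an all-against-all search), one for each search direction. The in-paralog step and everything else Inparanoid does is not reproduced here.

```python
def best_hits(scores):
    """Map every query protein to its highest-scoring hit."""
    return {query: max(hits, key=hits.get)
            for query, hits in scores.items() if hits}


def bidirectional_best_hits(a_vs_b, b_vs_a):
    """Pairs (a, b) where b is the best hit of a and a is the best hit of b."""
    best_ab = best_hits(a_vs_b)
    best_ba = best_hits(b_vs_a)
    return [(a, b) for a, b in best_ab.items() if best_ba.get(b) == a]


# Hypothetical scores for two proteins per species.
a_vs_b = {"a1": {"b1": 200, "b2": 50}, "a2": {"b1": 180, "b2": 40}}
b_vs_a = {"b1": {"a1": 200, "a2": 180}, "b2": {"a1": 50, "a2": 40}}
print(bidirectional_best_hits(a_vs_b, b_vs_a))
```

In this toy data only `a1` and `b1` pick each other, so they form the single bi-directional best hit.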
Cluster orthology (COG)
• First group in-paralogs in every species
• Find bi-directional best hits between in-paralogous groups
• Join in-paralogs to orthologous groups
– Link all pairs of in-paralogous groups
– Only if link is confirmed by third species (triangle)
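The triangle rule can be sketched as a filter on links: a link between two groups is kept only if both are also linked to the same group in a third species. The data layout is an assumption made for illustration: each in-paralogous group is a `(species, name)` tuple and each bi-directional best hit is a `frozenset` of two groups.

```python
def triangle_confirmed_links(links):
    """Keep links whose two groups share a linked group in a third species."""
    neighbours = {}
    for link in links:
        u, v = tuple(link)
        neighbours.setdefault(u, set()).add(v)
        neighbours.setdefault(v, set()).add(u)

    confirmed = set()
    for link in links:
        u, v = tuple(link)
        shared = neighbours[u] & neighbours[v]
        # The confirming group must come from a species other than u's or v's.
        if any(w[0] not in (u[0], v[0]) for w in shared):
            confirmed.add(link)
    return confirmed


g1, g2, g3, g4 = ("sp1", "x"), ("sp2", "x"), ("sp3", "x"), ("sp4", "y")
links = {frozenset({g1, g2}), frozenset({g2, g3}),
         frozenset({g1, g3}), frozenset({g3, g4})}
print(len(triangle_confirmed_links(links)))
```

Merging the confirmed links into connected groups would then give the orthologous groups; that merging step is left out to keep the sketch short.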
Tree based orthology
• Phylogenetic tree of
homologs
• Find gene duplication
nodes
• Two homologous
genes are orthologs
if last common
ancestor is not a
duplication node but
a speciation node
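The rule above can be sketched on a gene tree whose internal nodes are already labelled as speciation or duplication. The representation is an assumption: every internal node is a tuple `(event, left, right)` and every leaf is a gene name; how the labels are inferred is not covered.

```python
def genes(node):
    """Set of gene names below a node."""
    if isinstance(node, str):
        return {node}
    _, left, right = node
    return genes(left) | genes(right)


def relation(tree, gene_1, gene_2):
    """Classify two distinct genes by the event at their last common ancestor."""
    if gene_1 == gene_2 or not {gene_1, gene_2} <= genes(tree):
        raise ValueError("need two distinct genes that are both in the tree")
    node = tree
    while True:
        event, left, right = node
        for child in (left, right):
            if {gene_1, gene_2} <= genes(child):
                node = child  # both genes are deeper down; keep descending
                break
        else:
            # The genes split here, so this node is their last common ancestor.
            return "orthologs" if event == "speciation" else "paralogs"


tree = ("duplication",
        ("speciation", "A1", "B1"),
        ("speciation", "A2", "B2"))
print(relation(tree, "A1", "B1"), relation(tree, "A1", "B2"))
```

Genes `A1` and `B2` come from different species but are still paralogs, because the duplication predates the speciation.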
Gene content methods
• Presence/absence matrix (0/1)
      OG1   OG2   OG3   OG4   …
sp1   1     1     0     1     …
sp2   0     1     0     0     …
sp3   0     0     1     1     …
…     …     …     …     …     …
• Similarity: number of shared
orthologous groups
– Genomes that share few OGs are
distantly related
– Genomes that share many OGs
are closely related
but…
Genome size correction
• Large genomes have more genes, so they
also share more genes
• Divide number of shared genes by
– Average genome size
– Smallest of two genomes
– Weighted average genome size
[Figure: # shared genes (y-axis) against genome size (x-axis); label: P. chrysosporium]
Korbel et al, Trends Genet 2002
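The three corrections can be compared on a small example. This sketch assumes that genome size is measured as the number of orthologous groups in a genome, and the "weighted" formula is only one possible weighting, chosen here for illustration; the slides do not define it.

```python
import math


def normalisers(size_a, size_b):
    """Three candidate denominators for the number of shared genes."""
    return {
        "average": (size_a + size_b) / 2,
        "smallest": min(size_a, size_b),
        # Assumed weighting: sqrt(2) * a * b / sqrt(a^2 + b^2).
        # It equals the genome size when both genomes are equally large.
        "weighted": math.sqrt(2) * size_a * size_b / math.hypot(size_a, size_b),
    }


def corrected_similarity(ogs_a, ogs_b, method="weighted"):
    """Shared orthologous groups divided by the chosen genome size measure."""
    shared = len(ogs_a & ogs_b)
    return shared / normalisers(len(ogs_a), len(ogs_b))[method]


sp1 = {"OG1", "OG2", "OG4"}
sp2 = {"OG2"}
for method in ("average", "smallest", "weighted"):
    print(method, corrected_similarity(sp1, sp2, method))
```

Dividing by the smallest genome gives the highest similarity, dividing by the average the lowest, and the assumed weighting falls in between.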
Gene content methods
• Similarity: corrected number of
shared genes
• Distance: (1 – similarity)
\[ \mathrm{dist}(spA, spB) = 1 - \left( \frac{\#\,\text{shared OGs}(spA, spB)}{\text{weighted average size}(spA, spB)} \right) \]
d\s   sp1   sp2   sp3   sp4   …
sp1   0\1   0.8   0.6   0.8   …
sp2   0.2   0\1   0.1   0.9   …
sp3   0.4   0.9   0\1   0.7   …
sp4   0.2   0.1   0.3   0\1   …
…     …     …     …     …     …
• Neighbour joining
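Neighbour joining turns such a distance matrix into a tree by repeatedly joining the pair of nodes that minimises the Q criterion. The sketch below returns only the topology as nested tuples and omits branch lengths; the dict-of-dicts input and the function name are assumptions made for illustration.

```python
def neighbour_joining(labels, dist):
    """Tree topology (nested tuples) from distances given as dist[a][b]."""
    nodes = list(labels)
    d = {(a, b): dist[a][b] for a in nodes for b in nodes if a != b}
    while len(nodes) > 2:
        n = len(nodes)
        # Net divergence of every node from all others.
        r = {a: sum(d[a, b] for b in nodes if b != a) for a in nodes}
        candidates = [(a, b) for i, a in enumerate(nodes) for b in nodes[i + 1:]]
        # Q criterion: (n - 2) * d(a, b) - r(a) - r(b); the smallest value wins.
        a, b = min(candidates, key=lambda p: (n - 2) * d[p] - r[p[0]] - r[p[1]])
        new = (a, b)
        rest = [c for c in nodes if c != a and c != b]
        for c in rest:
            d[new, c] = d[c, new] = 0.5 * (d[a, c] + d[b, c] - d[a, b])
        nodes = rest + [new]
    return tuple(nodes)


# Distances that fit the tree ((A,B),(C,D)) exactly.
dist = {
    "A": {"B": 3, "C": 8, "D": 9},
    "B": {"A": 3, "C": 9, "D": 10},
    "C": {"A": 8, "B": 9, "D": 9},
    "D": {"A": 9, "B": 10, "C": 9},
}
print(neighbour_joining(["A", "B", "C", "D"], dist))
```

For gene content trees, `dist` would be filled with the values 1 – similarity for every pair of species.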
Gene content methods
• Dollo parsimony
– Gaining a complex character
(gene) is rare and happens once
– Losing it is relatively easy
– Minimize the number of gene
losses for maximum parsimony
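For a single gene and a fixed species tree, the Dollo rule can be sketched as follows: place the single gain at the smallest clade that contains every species with the gene, then count each gene-free subclade inside it as one loss. The tree is assumed to be nested tuples of species names, and the gene is assumed to be present in at least one species.

```python
def leaves(tree):
    """Set of species names below a node."""
    if isinstance(tree, str):
        return {tree}
    return set().union(*(leaves(child) for child in tree))


def dollo_losses(tree, present):
    """Minimum number of losses for one gene that was gained exactly once."""
    # Walk down to the smallest clade holding all species with the gene.
    node = tree
    while not isinstance(node, str):
        holders = [child for child in node if leaves(child) & present]
        if len(holders) != 1:
            break
        node = holders[0]

    def count(n):
        if not leaves(n) & present:
            return 1  # the whole clade lacks the gene: one loss on its branch
        if isinstance(n, str):
            return 0
        return sum(count(child) for child in n)

    return count(node)


species_tree = ((("A", "B"), "C"), "D")
print(dollo_losses(species_tree, {"A", "C"}))
print(dollo_losses(species_tree, {"A", "D"}))
```

Summing this count over all orthologous groups gives a score for the tree, and the tree with the fewest losses is preferred.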
Superalignment methods
• Multiple alignment
• Concatenate alignments (1:1:1)
• A missing gene in a certain
species (row) can be seen as a
gap in the alignment
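Concatenation with gaps for missing genes can be sketched in a few lines. The input is assumed to be a list of per-gene alignments, each a dict from species name to an aligned sequence of equal length; the names are illustrative.

```python
def superalignment(gene_alignments, species):
    """Concatenate per-gene alignments; a missing gene becomes a run of gaps."""
    rows = {sp: [] for sp in species}
    for alignment in gene_alignments:
        width = len(next(iter(alignment.values())))
        for sp in species:
            rows[sp].append(alignment.get(sp, "-" * width))
    return {sp: "".join(parts) for sp, parts in rows.items()}


gene_alignments = [
    {"A": "AC", "B": "AG", "C": "AC"},
    {"A": "TTT", "C": "TTA"},  # species B lacks this gene
]
for sp, row in superalignment(gene_alignments, ["A", "B", "C"]).items():
    print(sp, row)
```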
Superdistance methods
• Combine distance matrices
from separate gene families,
e.g. average
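Averaging distance matrices can be sketched as below. Each matrix is assumed to be a dict from a species pair (a `frozenset`) to a distance, and a pair is averaged only over the gene families in which both species occur.

```python
def superdistance(matrices):
    """Average each species-pair distance over the matrices that contain it."""
    totals = {}
    counts = {}
    for matrix in matrices:
        for pair, distance in matrix.items():
            totals[pair] = totals.get(pair, 0.0) + distance
            counts[pair] = counts.get(pair, 0) + 1
    return {pair: totals[pair] / counts[pair] for pair in totals}


ab, bc = frozenset({"A", "B"}), frozenset({"B", "C"})
family_1 = {ab: 0.2, bc: 0.5}
family_2 = {ab: 0.4}  # species C has no member in this family
print(superdistance([family_1, family_2]))
```

The combined matrix can then be passed to any distance-based tree method, such as neighbour joining.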
Supertree methods
• Make phylogenetic trees for all
gene families separately
• Matrix Representation using
Parsimony (MRP)
[Figure panels: 13 trees, 14 trees, 15 trees, 12 trees]
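The matrix step of Matrix Representation using Parsimony can be sketched as follows: every clade of every gene tree becomes one column, with 1 for species inside the clade, 0 for species in that tree but outside the clade, and ? for species missing from the tree. Trees are assumed to be rooted and written as nested tuples; the parsimony search on the resulting matrix is not shown.

```python
def leaves(tree):
    """Set of species names below a node."""
    if isinstance(tree, str):
        return {tree}
    return set().union(*(leaves(child) for child in tree))


def clades(tree):
    """Leaf sets of all internal nodes below the root."""
    found = []

    def walk(node, is_root):
        if isinstance(node, str):
            return
        if not is_root:
            found.append(leaves(node))
        for child in node:
            walk(child, False)

    walk(tree, True)
    return found


def mrp_matrix(trees, species):
    """One character string per species, one column per clade."""
    rows = {sp: "" for sp in species}
    for tree in trees:
        present = leaves(tree)
        for clade in clades(tree):
            for sp in species:
                if sp not in present:
                    rows[sp] += "?"
                else:
                    rows[sp] += "1" if sp in clade else "0"
    return rows


trees = [(("A", "B"), "C"), (("A", "D"), ("B", "C"))]
for sp, row in mrp_matrix(trees, ["A", "B", "C", "D"]).items():
    print(sp, row)
```

The supertree is the most parsimonious tree for this binary matrix, so gene trees with different species sets can be combined.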
Gene content vs. sequence based
Gene content supertrees
differ from sequence
based supertrees
Consensus phylogeny (literature)
19 target nodes
Gene content: 10.38
• Low-dimensional compared to genotype
• Intermediate between genotype and phenotype
– Main dichotomy between yeasts and filamentous
fungi, not Ascomycota and Basidiomycota
– Dimorphic Basidiomycota exclude filamentous
P. chrysosporium
Superalignment: 18.21
Supertree: 17.50
• Sequence based trees agree better with literature
• Literature is dominated by sequence based trees
Hyperthermophiles
Nanoarchaeum
Nanoarchaeota
Crenarchaeota
Euryarchaeota
Waters et al. PNAS 2003; Di Giulio, J Theor Biol 2006
• Gene content tree
Eury
Cren
Ciccarelli et al. Science 2006
Brochier et al. Genome Biol 2005
Assignment
www.cmbi.ru.nl/edu/seminars
• Make a gene content tree
• Compare with other phylogenetic trees
• Describe the differences
– Can you find literature that specifically studies
these species?
– What do you think is going on? Why are the trees
different?
• Write a paper about some of your most
interesting findings, include references