Download Evolutionary Analysis

Document related concepts

Complement component 4 wikipedia, lookup

Transcript
http://creativecommons.org/licenses/by-sa/2.0/
Lecture 4.2
1
Evolution
“To study history one must know in advance
that one is attempting something
fundamentally impossible, yet necessary and
highly important.”
Father Jacobus (Hesse's Magister Ludi)
Lecture 4.2
2
Some history is known
• Bacterial evolution
observed
• Manchester Moths
– light to dark
Lecture 4.2
3
Key Concepts
• Fundamentals of Systematics
• Appreciate that phylogenetic analysis allows you to
estimate or infer the evolutionary relationships
between sequences/organisms
• Learn how to better interpret trees
• Gain insight into the different phylogenetic methods
• Appreciate the need for new algorithms
• DNA and protein analysis - benefits and pitfalls of each
Lecture 4.2
4
First, some terminology…
• Systematics, an attempt to understand the
interrelationships of living things
• Taxonomy, the science of naming and classifying
organisms (evolutionary theory not necessarily
involved)
• Phylogenetics is the field of systematics that
focuses on evolutionary relationships between
organisms or genes/proteins (phylogeny).
• Cladistics is a particular method of hypothesizing
relationships among organisms/genes/proteins.
Lecture 4.2
5
Three basic assumptions of
cladistics
• Any group of organisms/
genes/proteins are related by
descent from a common ancestor
(fundamental tenant of
evolutionary theory)
• There is a bifurcating pattern of
cladogenesis
(most controversial assumption)
• Change in characteristics occurs
in lineages over time
(necessary for cladistics to work!)
Lecture 4.2
6
A phylogenetic tree
A node
Human
A clade
Mouse
Fly
taxon -- Any named group of organisms – evolutionary theory not
necessarily involved.
clade -- A monophyletic taxon (evolutionary theory utilized)
Lecture 4.2
7
A phylogenetic tree with branch lengths
A node
4
2
3
1
Human
A clade
Mouse
Fly
Branch length can be significant…
In this case it is and mouse is slightly more similar to fly
than human is to fly
(sum of branches 1+2+3 is less than sum of 1+2+4)
Lecture 4.2
8
Phylogenetic analysis
• Organismal relationships
• Gene/Protein relationships
Lecture 4.2
9
Organismal relationships
Lecture 4.2
10
Lecture 4.2
11
Lecture 4.2
12
Improving our understanding of organismal
relationships
Realization that rates of change are not constant
Lecture 4.2
13
Improving our understanding of organismal
relationships
Better appreciation for what sequences may be suitable
for analysis of different degrees of divergence
For the tree of life:
rRNA genes
Multiple genes
“Whole genome” datasets of genes
rRNA genes!
Lecture 4.2
14
Improving our understanding of organismal
relationships
Better sampling of all the species in our world
Lecture 4.2
15
Improving our understanding of organismal
relationships
Better sampling of all the species in our world
Amazing but true!
More bacteria in our bodies than
human cells!
More different types of bacterial
genes in our body then there are
human genes!
“The second human genome
project”
(David Relman)
Lecture 4.2
16
Gene/Protein Relationships
Lecture 4.2
Homolog, ortholog, paralog??
17
Homologs
Have common origins but may or may not have
common activity.
Homologous or not?: Often determined by
arbitrary threshold level of similarity determined
by alignment
Lecture 4.2
18
Homologs
…have common ancestry, but the way they are related can vary
(i.e. the reasons they have diverged into different sequences can
vary)
• orthologs - Homologs produced by speciation. They tend to have
similar function.
• paralogs - Homologs produced by gene duplication. They tend to have
differing functions.
• xenologs -- Homologs resulting from horizontal gene transfer between
two organisms.
Lecture 4.2
19
Orthologous or paralogous homologs
Early globin gene
Gene Duplication
-chain gene
mouse 
human 
Orthologs ()
ß-chain gene
cattle 
cattle ß
Paralogs (cattle)
human ß
mouse ß
Orthologs (ß)
Homologs
Orthologs – diverged after speciation – tend to have similar function
Paralogs – diverged after gene duplication – some functional divergence occurs
Therefore,
for linking similar genes between species, or performing
Lecture 4.2
“annotation transfer”, identify orthologs
20
True or False?
A1x is the ortholog in
species x of A1y?
A1x is a paralog of A2x?
A1x is a paralog of A2y?
Lecture 4.2
21
Identifying Gene/Protein Relationships
from Phylogenetic trees
• orthologs - Homologs produced by speciation. Gene phylogeny
matches organismal phylogeny.
• paralogs - Homologs produced by gene duplication. Multiple
copies of homologs in a given species.
• xenologs -- Homologs resulting from horizontal gene transfer
between two organisms. Gene phylogeny does not match
organismal phylogeny in a tree where most genes do match
organismal phylogeny well.
Lecture 4.2
22
Orthologs and Paralogs of the fly1 gene?
Known organismal phylogeny
Lecture 4.2
Chimpanzee
Chimpanzee
Human
Human
Mouse
Mouse
Fly
Fly1
Worm
Fly
Human
Chimpanzee
Human
Human
Mouse
Worm
Fly1
Fly1
Worm
23
Xenologs: Horizontal gene transfer
E. coli
Lecture 4.2
24
Gene Orthology: How to detect?
• Most common high throughput computational method: Identify
reciprocal best BLAST hits (EGO, COGs,…)
Example Problem:
• If making comparisons between human and bovine, for example, the
bovine gene dataset is still quite incomplete
• Therefore, current best hit may be a paralog now and the true ortholog
not yet sequenced
human
Lecture 4.2
cattle
mouse
cattle
25
Can we improve orthology analysis for linking
functionally similar genes?
• One solution: Phylogenetic analysis of all putative human-bovine
orthologs, using mouse as an outgroup
• Assumption:
- Mouse and Human gene datasets are more complete, with more true
orthologs identified
Expect (organismal phylogeny):
cattle
human
mouse
Reject:
mouse
human
cattle
Lecture 4.2
26
Blue genes are from the same species
PaAlgU is an ortholog of ?
PaAlgU is a paralog of ?
Lecture 4.2
27
2 Forms in 1 Species
+
+
Lecture 4.2
++
+
28
Slides from Jonathan Eisen
2 Forms in 1 Species - LGT
+
+
+
++
Both forms
maintained
Red and blue forms
diverge
+
Lecture 4.2
Gene present in
common ancestor
29
2 Forms in 1 Species - Gene Loss
+
+
++
+
Loss
Loss
Gene duplicated in common ancestor
++
Lecture 4.2
30
Unusual Distribution Pattern
+
Lecture 4.2
+
31
Unusual Distribution - LGT
+
+
Acquires new
type of gene
Gene originates
here
Lecture 4.2
32
Unusual Distribution - Gene Loss
+
+
Gene lost
here
Gene present in ancestor
Lecture 4.2
33
Unusual Distribution Evolutionary Rate Variation
-?
Gene too diverged to be found
+
+
Lecture 4.2
34
+/-
Unusual Distribution Incomplete Data
+
+/-
+
Gene present in ancestor
Lecture 4.2
35
Hope for the future
Better sampling of all the species in our world
2004: The dawn of
environmental genomics
sampling
Tyson et al (2004) Community structure
and metabolism through reconstruction of
microbial genomes from the environment.
Nature, 428, 37-43.
Venter et al (2004) Environmental
genome shotgun sequencing of the
Sargasso Sea. Science, 304, 66-74.
Lecture 4.2
36
“So….. how do we construct a phylogenetic tree??”
Lecture 4.2
37
Most common methods
• Parsimony
• Neighbor-joining
• Maximum Likelihood
Lecture 4.2
38
Parsimony
• “Shortest-way-from-A-to-B” method
• The tree implying the least number of changes in
character states (most parsimonious) is the best.
• Note:
– May get more than one tree
– No branch lengths
– Uses all character data
Lecture 4.2
39
Neighbor-joining
(and other distance matrix methods)
• “speedy-and-popular” method
• distance matrix constructed
• distance estimates the total branch length between
a given two species/genes/proteins
• Neighbor-joining approach: Pairing those
sequences that are the most alike and using that
pair to join to next closest sequence.
Lecture 4.2
40
Practical comparison of common
distance matrix methods: Some PHYLIP
and PAUP programs as an example
• Neighbor-joining: fast – not so good for highly divergent
sequences
• Fitch: Better but slower and result not that different (seeks
to maximize fit of pairwise distances)
• Kitsch: Assumes equal rate of evolution – can greatly bias
results so do not use!
• Minimum Evolution (PAUP): Similar to Fitch but fixes
location of internal verses external nodes when
maximizing fits
• Note: gap info not incorporated into analysis
Lecture 4.2
41
Maximum Likelihood
• “Inside-out” approach
• produces trees and then sees if the data could
generate that tree.
• gives an estimation of the likelihood of a
particular tree, given a certain model of nucleotide
substitution.
• Notes:
– All sequence info (including gaps) is used
– Based on a specific model of evolution – gives
probability
– Verrrrrrrrrrrry slow (unless topology of tree is known)
Lecture 4.2
42
How reliable is a result?
• Non-parametric bootstrapping
– analysis of a sample of (eg. 100 or 1000) randomly
perturbed data sets.
– perturbation: random resampling with replacement,
(some characters are represented more than once, some
appear once, and some are deleted)
– perturbed data analysed like real data
– number of times that each grouping of
species/genes/proteins appears in the resulting profile
of cladograms is taken as an index of relative support
for that grouping
Lecture 4.2
43
Bootstrapping
The number of times a
particular branch is formed
in the tree (out of the X
times the analysis is done)
can be used to estimate its
probability, which can be
indicated on a consensus tree
High bootstrap values don’t
mean that your tree is the
true tree!
Good alignment and evolutionary
assumptions are key
Lecture 4.2
44
Parametric Bootstrapping
Data are simulated
according to the
hypothesis being tested.
Lecture 4.2
45
Are we confident that the cow, mole
and hedgehog has one ancestor?
Lecture 4.2
46
Orthologs, paralogs and homologs
– one more test!
Early globin gene
Gene Duplication
-chain gene
mouse 
human 
cattle 
ß-chain gene
cattle ß
human ß
mouse ß
Orthologs – diverged after speciation – tend to have similar function
Paralogs – diverged after gene duplication – some functional divergence occurs
Therefore,
for linking similar genes between species, or performing
Lecture 4.2
“annotation transfer”, identify orthologs
47
Phylogenetics – More info
Li, Wen-Hsiung. 1997. Molecular evolution Sunderland,
Mass. Sinauer Associates.
- a good starting book, clearly describing the basis of
molecular evolution theory. It is a 1997 book, so is
starting to get a bit out of date.
Nei, Masatoshi & Kumar, Sudhir. 2000. Molecular
evolution and phylogenetics Oxford ; New York. Oxford
University Press.
- more recent, and by two very well respected
researchers in the field. A bit more in-depth than the
previous book, but very useful.
Lecture 4.2
48
Phylogenetic Tree Construction:
Examples of Common Software
PHYLIP
http://evolution.genetics.washington.edu/phylip.html
PAUP
http://paup.csit.fsu.edu/
MEGA 2.1
www.megasoftware.net/
TREEVIEW
http://taxonomy.zoology.gla.ac.uk/rod/treeview.html
Extensive list of software
http://evolution.genetics.washington.edu/phylip/software.html
Lecture 4.2
49
PhyloBLAST – a tool for analysis
Lecture 4.2
50
Challenges
How do we classify?
Lecture 4.2
51
Computational Challenges
• Need to incorporate more evolutionary theory into
the multiple sequence alignment and phylogenetic
algorithms used in phylogenetic analysis
• Phylogenetic analyses are computationally
intensive – great way to benchmark your CPU
speed!
Lecture 4.2
52
More Challenges
• Increasing the sampling of our genetic world
• More accurately differentiating orthologs, paralogs, and
horizontally acquired genes
• How frequent is gene loss, gene duplication, and
horizontal gene transfer in genome evolution?
• To what degree can we predict protein/gene function
using phylogenetic analysis?
Lecture 4.2
53
Evolution
“To study history one must know in advance
that one is attempting something
fundamentally impossible, yet necessary and
highly important.”
Father Jacobus (Hesse's Magister Ludi)
Lecture 4.2
54
Evolutionary theory
is evolving
Lecture 4.2
55