9/29/2015
BINF 3350, Genomics and Bioinformatics
Phylogenetic Analysis
Young-Rae Cho
Associate Professor
Department of Computer Science
Baylor University
Early Evolution Studies
▪ From Darwin until the 1960s
  Anatomical features were the dominant criteria used to derive evolutionary relationships between species
  Depended on relatively subjective observations
  Hard to prove
▪ Example
  Giant panda
DNA-Based Evolution Studies
▪ 1960s
  Emile Zuckerkandl and Linus Pauling introduced the reconstruction of evolutionary relationships from DNA sequences
▪ Example
  Human vs. Gorilla
▪ Since the 1960s
  DNA analysis has been the dominant approach to studying evolution
Old Debates
"From the point of hemoglobin structure, it appears that gorilla is just an abnormal human, or man an abnormal gorilla, and the two species form actually one continuous population."
  Emile Zuckerkandl, Classification and Human Evolution, 1963
"From any point of view other than that properly specified, that is of course nonsense. What the comparison really indicates is that hemoglobin is a bad choice and has nothing to tell us about attributes, or indeed tells us a lie."
  Gaylord Simpson, Science, 1964
BINF 3350, Phylogenetic Analysis
1. Structure of Phylogenetic Tree
2. Evolutionary Distance Measures
3. Phylogenetic Tree Reconstruction
Phylogenetic Tree
▪ Phylogenetics
  The study of evolutionary relatedness among species
▪ Phylogenetic Tree (Evolutionary Tree)
  Tree-structured diagram showing the inferred evolutionary relationships between a set of objects
  The objects are called taxa → individual genes, or species when orthologous genes are used
  Each node represents a taxon
  Each edge represents the evolutionary relationship between taxa
Types of Phylogenetic Trees (1)
▪ Rooted Tree vs. Unrooted Tree
  Rooted trees
    Root: the most ancient species
    External nodes (leaf nodes)
    - nodes with degree 1
    - existing species
    Internal nodes
    - nodes with degree > 1
    - hypothetical ancestral species (before speciation events)
  Unrooted trees
Types of Phylogenetic Trees (2)
▪ Cladogram vs. Additive Tree vs. Ultrametric Tree
  Cladogram
  - Defines tree topology only
  - Branch lengths carry no meaning
  Additive trees
  - Branch lengths are a measure of evolutionary divergence → evolutionary distance
  - Weighted trees
  Ultrametric trees
  - The vertical axis is a time scale
  - Rooted trees
Evolutionary Distance
▪ Evolutionary Path
  The path from the root to a leaf in a rooted tree
▪ Evolutionary Distance
  Sum of the edge weights along the shortest path between two leaf nodes in a weighted tree
▪ Example
  [Figure: additive unrooted tree with six numbered leaves (1 to 6) and weighted edges]
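The definition above can be sketched in a few lines of Python. The tree below is a hypothetical example (leaf names, internal node names, and weights are made up, not taken from the slide's figure), and `build_tree` and `evolutionary_distance` are illustrative helper names. Because a tree has exactly one path between any two nodes, a simple traversal is enough to sum the weights.

```python
from collections import defaultdict

def build_tree(edges):
    """Adjacency map for an undirected weighted tree from (u, v, weight) triples."""
    adj = defaultdict(dict)
    for u, v, w in edges:
        adj[u][v] = w
        adj[v][u] = w
    return adj

def evolutionary_distance(adj, start, goal):
    """Sum of edge weights on the unique path between two nodes of a tree."""
    stack = [(start, None, 0)]
    while stack:
        node, parent, dist = stack.pop()
        if node == goal:
            return dist
        for nxt, w in adj[node].items():
            if nxt != parent:  # never walk back along the edge we came from
                stack.append((nxt, node, dist + w))
    raise ValueError("nodes are not connected")

# Hypothetical unrooted additive tree: leaves A, B, C, D; internal nodes u, v.
edges = [("A", "u", 2), ("B", "u", 3), ("u", "v", 4), ("C", "v", 1), ("D", "v", 5)]
tree = build_tree(edges)
print(evolutionary_distance(tree, "A", "B"))  # path A-u-B: 2 + 3
print(evolutionary_distance(tree, "A", "D"))  # path A-u-v-D: 2 + 4 + 5
```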
Types of Phylogenetic Trees (3)
▪ Bifurcating vs. Multifurcating
  Bifurcating (or Dichotomous)
  - Each taxon at an internal node diverges into two separate descendant taxa
  - Fully resolved
  - What is the degree of the internal nodes?
  - How many nodes does a rooted bifurcating tree with N leaves have?
  - How many nodes does an unrooted bifurcating tree with N leaves have?
  Multifurcating (or Polytomous)
  - Each taxon diverges into more than two separate descendant taxa
  - Partially resolved
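The counting questions for bifurcating trees above can be checked with a short sketch. The function names are illustrative. The reasoning, stated as an assumption of the sketch, is the usual one: splitting a leaf into two children adds two nodes and one net leaf, so a rooted bifurcating tree with N leaves has N - 1 internal nodes; removing the root from such a tree gives the unrooted case.

```python
def rooted_bifurcating_counts(n_leaves):
    """Counts for a rooted, fully resolved tree with n_leaves >= 2."""
    internal = n_leaves - 1        # the root has degree 2, all other internal nodes degree 3
    nodes = n_leaves + internal    # 2N - 1
    return {"internal": internal, "nodes": nodes, "edges": nodes - 1}

def unrooted_bifurcating_counts(n_leaves):
    """Counts for an unrooted, fully resolved tree with n_leaves >= 3."""
    internal = n_leaves - 2        # every internal node has degree 3
    nodes = n_leaves + internal    # 2N - 2
    return {"internal": internal, "nodes": nodes, "edges": nodes - 1}

for n in (3, 4, 6):
    print(n, rooted_bifurcating_counts(n), unrooted_bifurcating_counts(n))
```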
Types of Phylogenetic Trees (4)
▪ Condensed Tree
Types of Phylogenetic Trees (5)
▪ Species Tree vs. Gene Tree
  Species tree
  - Evolutionary relationships between species
  Gene (or gene family) tree
  - Evolutionary relationships between homologous genes
  - Some branch points represent gene duplication events
  - Other branch points represent speciation events
Splitting
▪ Splitting
  Dividing a phylogenetic tree into a collection of subgroups by removing a branch
Consensus Tree
▪ Consensus Features
  Consistent features supported by phylogenetic trees from the same data
▪ Consensus Tree
  A tree with consensus features
2. Evolutionary Distance Measures
Evolutionary Distance Basics
▪ Evolutionary Distance
  Measured by the mutation rate between two species
  p-distance: the fraction of non-identical alignment positions (observed distance)
▪ Assumptions
  All sequences evolved at a constant mutation rate
  The sequences diverged only to a moderate degree, such that each position had at most one mutation
▪ Distance Correction
  The p-distance underestimates the number of mutations
  Predict the number of multiple mutations at the same position
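As a minimal sketch of the p-distance defined above, the function below counts mismatching columns of two aligned sequences. The name `p_distance` and the choice to skip columns containing a gap character (`-`) are assumptions of this sketch; other gap conventions exist.

```python
def p_distance(seq_a, seq_b):
    """Fraction of aligned, gap-free positions at which two sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    # Assumption: columns with a gap in either sequence are ignored.
    pairs = [(a, b) for a, b in zip(seq_a, seq_b) if a != "-" and b != "-"]
    if not pairs:
        raise ValueError("no comparable positions")
    mismatches = sum(1 for a, b in pairs if a != b)
    return mismatches / len(pairs)

print(p_distance("ACCA", "ATCG"))  # 2 mismatches over 4 sites
```

Because a site that mutated twice still counts as at most one mismatch, this observed value can only undercount the true number of substitutions, which is what the distance correction addresses.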
Types of Mutation (1)
▪ Why Do Mutations Occur?
  They are the result of errors during the DNA replication process
▪ Transition Mutations vs. Transversions
  Transition mutations: substitutions between nucleotides of the same structural class (purine ↔ purine, i.e. A ↔ G, or pyrimidine ↔ pyrimidine, i.e. C ↔ T)
  Transversions: substitutions between nucleotides of different structural classes (purine ↔ pyrimidine)
Types of Mutation (2)
▪ Transition/Transversion Ratio (R)
  In principle, R should be ½ if all mutations are equally likely (each nucleotide has one possible transition and two possible transversions)
  In practice, R » 1 (it varies)
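The classification and the ½ baseline above can be reproduced with a short sketch. The helper names are illustrative, and the example sequences in the test are made up. Enumerating all 12 ordered substitutions among A, C, G, T gives the transition-to-transversion count expected if every substitution were equally likely.

```python
from itertools import permutations

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def classify_substitution(a, b):
    """Return 'transition', 'transversion', or None when the bases are identical."""
    if a == b:
        return None
    pair = {a, b}
    if pair <= PURINES or pair <= PYRIMIDINES:
        return "transition"
    return "transversion"

def count_ts_tv(seq_a, seq_b):
    """Count transitions and transversions between two aligned DNA sequences."""
    ts = tv = 0
    for a, b in zip(seq_a, seq_b):
        if a not in "ACGT" or b not in "ACGT":
            continue  # skip gaps and ambiguity codes
        kind = classify_substitution(a, b)
        if kind == "transition":
            ts += 1
        elif kind == "transversion":
            tv += 1
    return ts, tv

# Baseline: all ordered substitutions among the four bases.
kinds = [classify_substitution(a, b) for a, b in permutations("ACGT", 2)]
n_ts, n_tv = kinds.count("transition"), kinds.count("transversion")
print(n_ts, n_tv, n_ts / n_tv)
```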
Types of Mutation (3)
▪ Synonymous vs. Nonsynonymous
  Synonymous mutations: substitutions that do not change the encoded amino acid
  Nonsynonymous mutations: substitutions that alter the encoded amino acid
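A hedged sketch of the distinction above: the code builds the standard genetic code (DNA codons, bases in T, C, A, G order) and labels a codon change as synonymous when both codons translate to the same amino acid. The function names are illustrative, and treating a stop-to-stop change as synonymous is an assumption of this sketch. The last function measures how often a change at the third codon position is silent.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons enumerated with the third base varying fastest.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

def classify_codon_change(before, after):
    """Label a codon substitution as 'synonymous' or 'nonsynonymous'."""
    if CODON_TABLE[before] == CODON_TABLE[after]:
        return "synonymous"
    return "nonsynonymous"

def third_position_synonymous_fraction():
    """Fraction of all single-base changes at codon position 3 that are synonymous."""
    total = same = 0
    for codon, aa in CODON_TABLE.items():
        for base in BASES:
            if base != codon[2]:
                total += 1
                same += CODON_TABLE[codon[:2] + base] == aa
    return same / total

print(classify_codon_change("CTT", "CTC"))  # Leu -> Leu
print(classify_codon_change("GAA", "GTA"))  # Glu -> Val
print(third_position_synonymous_fraction())
```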
Types of Mutation (4)
▪ Selective Pressure
  Positive selection: a mutation is likely to be kept if it confers an advantage on the organism
  Negative selection: a mutation is likely to be lost if it confers a disadvantage on the organism
  Many third-position mutations are synonymous
▪ Phenomenon of biased mutation pressure
Gene Duplication (1)
▪ Gene Duplication
  Process by which a gene is copied within the same genome
  Gives rise to paralogs
▪ Review of Definitions
  Homologs: similar sequence + common ancestor (by divergent evolution)
  • Orthologs: pairs of genes which have a common ancestor immediately before a speciation event
  • Paralogs: pairs of genes which have a common ancestor immediately before a gene duplication event
  Analogs: similar sequence + no common ancestor (by convergent evolution)
Gene Duplication (2)
▪ Phylogenetic Tree Reconstruction by Gene Duplication
Gene Loss (1)
▪ Pseudogenes
  Genes that have mutated so that they no longer give rise to proteins
  Non-functional genes
▪ Gene Loss
  Process by which a gene becomes a pseudogene
▪ Effects of Gene Loss on Phylogenetic Trees
Gene Loss (2)
▪ Reconciled Tree
  Tree that identifies the speciation and duplication events and the gene losses
Species Tree Reconstruction (1)
▪ Species Tree
  Reconstructed by using orthologous genes
  COG (Clusters of Orthologous Groups) database from NCBI
▪ Resolving the difficulties caused by gene duplications and gene losses
  Examine trees from many different orthologous genes, and find the evolutionary history supported by a majority of trees
  Restrict the analysis to the orthologous genes that appear not to have duplication events
Species Tree Reconstruction (2)
▪ Proportion of Orthologous Genes
Horizontal Gene Transfer
▪ Horizontal Gene Transfer (HGT)
  Also called Lateral Gene Transfer (LGT)
  Process by which a gene from one species is transferred to another species
Advanced Evolutionary Models
▪ p-Distance Model
  Poisson distance correction
▪ Jukes-Cantor Model
  Assumptions
  Formulas
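The formulas for this slide are not present in the transcript. As a hedged sketch, the code below uses the standard textbook forms of the two corrections, which may differ in notation from what the slide showed: the Poisson correction d = -ln(1 - p), and the Jukes-Cantor distance d = -(3/4) ln(1 - (4/3) p), where p is the observed p-distance. The Jukes-Cantor form assumes equal base frequencies and equal rates for all substitutions. The function names are illustrative.

```python
import math

def poisson_distance(p):
    """Poisson-corrected distance from an observed p-distance (0 <= p < 1)."""
    if not 0 <= p < 1:
        raise ValueError("p must be in [0, 1)")
    return -math.log(1 - p)

def jukes_cantor_distance(p):
    """Jukes-Cantor distance for nucleotides; defined only for 0 <= p < 0.75."""
    if not 0 <= p < 0.75:
        raise ValueError("p must be in [0, 0.75)")
    return -0.75 * math.log(1 - 4 * p / 3)

for p in (0.05, 0.3, 0.6):
    print(p, poisson_distance(p), jukes_cantor_distance(p))
```

Both corrected values are larger than p for any p > 0, and the gap widens as p grows, reflecting the multiple mutations that the raw p-distance cannot see.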
3. Phylogenetic Tree Reconstruction
Phylogenetic Tree Reconstruction Algorithms
▪ Phylogenetic Tree Reconstruction Algorithms

Algorithm          Algorithm Type     Tree Type
UPGMA              distance-based     ultrametric
F-M                distance-based     unrooted, additive
neighbor-joining   distance-based     unrooted, additive
max parsimony      character-based    rooted, additive/cladogram
max likelihood     character-based    (un)rooted, additive
UPGMA
▪ UPGMA
  Unweighted pair-group method using arithmetic averages
▪ Input
  Distance matrix D: n×n matrix of evolutionary distances for n species
▪ Output
  An ultrametric (rooted) tree for n species
▪ Process
(1) Merge the two closest nodes, x and y, to create one ancestor node z
(2) Set the height of z from the distance between x and y (half of that distance)
(3) Remove the rows and columns of x and y from D
(4) Insert a row and column for z, holding its average distances to the remaining nodes
(5) Repeat (1)~(4) until the root is reached
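The five steps above can be sketched as follows. This is a minimal illustration, not a reference implementation: the function name `upgma`, the nested-tuple tree representation, and the example distances are assumptions. For step (4) the sketch uses the cluster-size-weighted average, which is the usual UPGMA rule, and for step (2) it places z at half the merged distance.

```python
from itertools import combinations

def upgma(names, dist):
    """Build an ultrametric tree.

    names: list of leaf labels.
    dist:  dict mapping (a, b) label pairs to distances.
    Returns (root, height), where clusters are nested tuples of labels and
    height maps every cluster to its height above the leaves.
    """
    d = {frozenset(pair): value for pair, value in dist.items()}
    size = {n: 1 for n in names}        # active clusters and their leaf counts
    height = {n: 0.0 for n in names}
    while len(size) > 1:
        # (1) find the two closest active clusters
        x, y = min(combinations(size, 2), key=lambda p: d[frozenset(p)])
        z = (x, y)
        # (2) the ancestor sits at half the distance between x and y
        height[z] = d[frozenset((x, y))] / 2
        # (3) remove x and y from the active set
        nx, ny = size.pop(x), size.pop(y)
        # (4) distance from z to every remaining cluster: size-weighted average
        for k in size:
            d[frozenset((z, k))] = (nx * d[frozenset((x, k))]
                                    + ny * d[frozenset((y, k))]) / (nx + ny)
        size[z] = nx + ny
    root = next(iter(size))             # (5) the last remaining cluster is the root
    return root, height

# Hypothetical ultrametric distances among four species.
names = ["A", "B", "C", "D"]
dist = {("A", "B"): 2, ("A", "C"): 6, ("B", "C"): 6,
        ("A", "D"): 10, ("B", "D"): 10, ("C", "D"): 10}
root, height = upgma(names, dist)
print(root, height[root])
```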
Fitch-Margoliash Algorithm (1)
▪ 3-Leaf Tree
  A basic component of an unrooted, additive (weighted) tree for 3 species i, j, k joined at an internal node c
  Equations:
    $d_{ic} + d_{jc} = D_{ij}$
    $d_{ic} + d_{kc} = D_{ik}$
    $d_{jc} + d_{kc} = D_{jk}$
  Solution:
    $d_{ic} = (D_{ij} + D_{ik} - D_{jk}) / 2$
    $d_{jc} = (D_{ij} + D_{jk} - D_{ik}) / 2$
    $d_{kc} = (D_{ik} + D_{jk} - D_{ij}) / 2$
▪ n-Leaf Tree
  How many edges are there for n species?
  How many variables?
  How many equations?
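The three-leaf solution above translates directly into code. The function name and the example distances are illustrative; the check at the end confirms that the computed branch lengths add back up to the input distances.

```python
def three_leaf_branches(d_ij, d_ik, d_jk):
    """Branch lengths from leaves i, j, k to their shared internal node c."""
    d_ic = (d_ij + d_ik - d_jk) / 2
    d_jc = (d_ij + d_jk - d_ik) / 2
    d_kc = (d_ik + d_jk - d_ij) / 2
    return d_ic, d_jc, d_kc

# Hypothetical pairwise distances D_ij = 5, D_ik = 6, D_jk = 7.
d_ic, d_jc, d_kc = three_leaf_branches(5, 6, 7)
print(d_ic, d_jc, d_kc)
print(d_ic + d_jc, d_ic + d_kc, d_jc + d_kc)  # recovers the three input distances
```

For three leaves the system has three equations and three unknowns, so it is solved exactly. For larger n, an unrooted bifurcating tree has 2n - 3 edges (unknowns) but n(n - 1)/2 pairwise distances (equations), so an exact fit is generally not possible.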
Fitch-Margoliash Algorithm (2)
▪ Neighbors
  A pair of nodes that are separated by just one other node
▪ Input
  Distance matrix D: n×n matrix of evolutionary distances for n species
▪ Output
  An unrooted, additive (weighted) tree for n species
▪ Process
(1) Find the two closest nodes, x and y, and treat them as neighbors
(2) Calculate the distances from x and y to their ancestor z using the 3-leaf tree formula
(3) Remove the rows and columns of x and y from D
(4) Insert a row and column for z, with distances given by the 3-leaf tree formula
(5) Repeat (1)~(4) until the unrooted tree is complete
Maximum Parsimony Methods
▪ Main Concept
  Find the evolutionary history with the minimum number of mutations required to produce the observed sequences
  Consider mutations at each position of the sequences separately (character-based approach)
▪ Parsimony Score
  Sum of the weights of all edges in the phylogenetic tree (each weight is the number of mutations along that edge)
▪ Examples
  [Figure: two candidate trees over the sequences ACCA, ATCA, ATCG, and ACCG with per-edge mutation counts, one marked "less parsimonious" and the other "more parsimonious"]
Fitch Algorithm (1)
▪ Unweighted Parsimony Method
  Build a phylogenetic tree having the minimum score (i.e., maximum parsimony), given the sequences of the existing species under consideration
  Count mismatches for scoring
  Post-order traversal
▪ Input
  A tree with each leaf node labeled by the single nucleotide at a specific position of each sequence
▪ Output
  A rooted tree (cladogram) with labels on all internal nodes
Fitch Algorithm (2)
▪ Process
(1) Assign a set of candidate labels to each internal node by traversing from the leaf nodes to the root
    • If the label sets of the two child nodes u and w of a node v overlap, assign the common set to v
    • If not, assign the combined set to v
    $S_v = S_u \cap S_w$ if $S_u$ and $S_w$ overlap; $S_v = S_u \cup S_w$ otherwise
(2) Assign a label to each node by traversing from the root to the leaf nodes
    • For the root, choose one label arbitrarily from its candidates
    • For every other node, if its parent's label is among its candidates, assign the parent's label
    • Otherwise, choose one arbitrarily from its candidates
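The two passes above can be sketched for a single alignment position as follows. The representation is an assumption of this sketch: the input tree is a nested 2-tuple with one-letter strings at the leaves, and the output is a nested `(label, left, right)` tuple. Where the slide says "arbitrarily", the sketch picks the alphabetically smallest candidate so that the result is deterministic. The four-leaf tree is a made-up example.

```python
def fitch(tree):
    """Run both Fitch passes; return (labeled_tree, parsimony_score)."""
    score = 0

    def up(node):
        # Pass 1 (post-order): candidate sets, counting one mutation per union.
        nonlocal score
        if isinstance(node, str):
            return {"set": {node}, "children": None}
        left, right = up(node[0]), up(node[1])
        common = left["set"] & right["set"]
        if common:
            candidates = common
        else:
            candidates = left["set"] | right["set"]
            score += 1
        return {"set": candidates, "children": (left, right)}

    def down(info, parent_label):
        # Pass 2 (pre-order): keep the parent's label whenever it is a candidate.
        if parent_label in info["set"]:
            label = parent_label
        else:
            label = min(info["set"])  # stand-in for an arbitrary choice
        if info["children"] is None:
            return label
        left, right = info["children"]
        return (label, down(left, label), down(right, label))

    annotated = up(tree)
    return down(annotated, None), score

# Hypothetical tree ((A, C), (G, G)) for one alignment column.
labeled, score = fitch((("A", "C"), ("G", "G")))
print(labeled)
print(score)
```

For whole sequences, the same procedure would be run once per alignment column and the per-column scores summed.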
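Fitch Algorithm (3)
▪ Example
  [Figure: worked Fitch example on a tree with nucleotide-labeled leaves, showing candidate label sets such as {A,C}, {G}, and {A,C,G} at the internal nodes, followed by the final labels after the root-to-leaf pass]
▪ Parsimony score?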
Maximum Likelihood Method
▪ Maximum Likelihood Estimation (MLE)
▪ Maximum Likelihood Method for Evolutionary Tree Reconstruction
  Jukes-Cantor Model in Maximum Likelihood
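The slide gives no details, so the following is only a hedged illustration of the idea for the simplest possible case: two aligned sequences joined by one branch of length d under the Jukes-Cantor model. It assumes the standard Jukes-Cantor probabilities, with a site staying the same with probability 1/4 + (3/4)e^(-4d/3) and changing to one specific other base with probability 1/4 - (1/4)e^(-4d/3). The function names, the grid search, and the site counts are assumptions of the sketch. Real tree reconstruction evaluates this kind of likelihood over whole trees, not a single branch.

```python
import math

def jc_log_likelihood(d, n_same, n_diff):
    """Log-likelihood of branch length d for a pair of aligned sequences."""
    e = math.exp(-4 * d / 3)
    p_same = 0.25 + 0.75 * e   # site unchanged after distance d
    p_diff = 0.25 - 0.25 * e   # site changed to one specific other base
    # 0.25 is the equilibrium frequency of the base in the first sequence.
    return n_same * math.log(0.25 * p_same) + n_diff * math.log(0.25 * p_diff)

def ml_distance(n_same, n_diff, step=1e-4, d_max=3.0):
    """Branch length on a grid that maximizes the likelihood."""
    grid = [i * step for i in range(1, int(d_max / step))]
    return max(grid, key=lambda d: jc_log_likelihood(d, n_same, n_diff))

# Hypothetical data: 70 identical sites and 30 differing sites (p = 0.3).
d_hat = ml_distance(70, 30)
closed_form = -0.75 * math.log(1 - 4 * 0.3 / 3)
print(d_hat, closed_form)
```

In this two-sequence case the maximum likelihood estimate coincides with the Jukes-Cantor corrected distance, which is why the two printed values agree up to the grid resolution.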
Examples
▪ Reconstructed Phylogenetic Trees for 16S rRNA datasets
Evolutionary Tree Validation
▪ Bootstrapping Method
  Main concept
  • Sampling data objects uniformly with replacement to validate statistical estimates
  Application to evolutionary tree validation
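Applied to trees, the data objects being resampled are usually the alignment columns. The sketch below assumes that setup: it draws columns uniformly with replacement to build pseudo-alignments, recomputes a feature on each, and reports how often each outcome appears. To stay short it uses "which pair of taxa is closest by p-distance" as a stand-in feature instead of a full tree builder. All names and the toy alignment are hypothetical.

```python
import random
from collections import Counter
from itertools import combinations

def p_distance(a, b):
    """Fraction of differing positions between two equal-length sequences."""
    return sum(x != y for x, y in zip(a, b)) / len(a)

def closest_pair(alignment):
    """Stand-in for a tree feature: the pair of taxa with the smallest p-distance."""
    return min(combinations(sorted(alignment), 2),
               key=lambda pair: p_distance(alignment[pair[0]], alignment[pair[1]]))

def bootstrap_replicate(alignment, rng):
    """Resample alignment columns uniformly with replacement."""
    n_sites = len(next(iter(alignment.values())))
    cols = [rng.randrange(n_sites) for _ in range(n_sites)]
    return {name: "".join(seq[c] for c in cols) for name, seq in alignment.items()}

def bootstrap_support(alignment, feature, n_replicates=200, seed=0):
    """Fraction of replicates in which each feature value is observed."""
    rng = random.Random(seed)
    counts = Counter(feature(bootstrap_replicate(alignment, rng))
                     for _ in range(n_replicates))
    return {value: n / n_replicates for value, n in counts.items()}

# Toy alignment with made-up sequences.
alignment = {
    "taxon1": "ACCGTTAGCA",
    "taxon2": "ACCGTTAGCG",
    "taxon3": "ATCGATAGTA",
}
print(closest_pair(alignment))
print(bootstrap_support(alignment, closest_pair))
```

A feature recovered in most replicates is considered well supported by the data, while one that appears in only a small fraction is not.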