Introduction to Phylogenetic
Systematics
Mark Fishbein
Dept. Biological Sciences
Mississippi State University
13 October 2003
Which of these critters are most
closely related?
alligator
gila monster
purple gallinule
gopher tortoise
?
kingsnake
Phylogeny
• Branching history of evolutionary lineages
• New branches arise via speciation
• Speciation occurs when gene flow is
severed between populations
• Phylogenetic relationships depicted as a tree
© W. S. Judd, et al., Plant Systematics
Phylogenetic data
• Morphology
• Secondary chemistry
• Cytology
• Allele frequencies
• Protein sequences
• Restriction sites
• DNA sequences
(bracket on slide: “Molecular” data)
Molecular (genetic) data
• Proteins
– Serology (immunoassay)
– Isozymes (electrophoretic variants)
– Amino acid sequences
• DNA
– Structural (translocations, inversions,
duplications)
– Restriction sites
– DNA sequences
• Substitutions
• Insertions/Deletions
What are genes?
From Raven et al. (1999),
Biology of Plants
Genomes
• All of the genes within a cell are the
genome
• Genes located in the nucleus are the nuclear
genome
• Other genomes (organellar)
– Mitochondrion: mitochondrial genome
– Chloroplast: plastid genome
nucleus
chloroplast
mitochondrion
From Raven et al., 1999, Biology of Plants
Comparison of Genomes
                        Nuclear      Mitochondrial   Plastid
Size                    Large        Small           Small
Number                  Multiple     Single          Single
Shape of chromosomes    Linear       Circular        Circular
Ploidy                  Diploid      Haploid         Haploid
Inheritance             Biparental   Uniparental     Uniparental
Structural rearrangements
Inversion
Crossing over, duplication, and loss
From Freeman and Herron (1998), Evolutionary Analysis
Chemistry of Genes
• DNA
• Parallel strands linked together
• Linear array of units called nucleotides
– Phosphate
– Sugar: deoxyribose
– One of four bases
• Adenine (“A”)
• Cytosine (“C”)
• Guanine (“G”)
• Thymine (“T”)
From Raven et al. (1999),
Biology of Plants
DNA structure
• Paired strands are linked by bases
– A must bond with T
– G must bond with C
• Each link is composed of a purine and a
pyrimidine
– A & G are purines
– C & T are pyrimidines
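The pairing rules above can be sketched in a few lines of Python. This is only an illustration: the names `PAIR`, `complement`, and `reverse_complement` are made up for this example and are not from the slides.

```python
# Pairing rules from the slide: A bonds with T, G bonds with C.
PAIR = {"A": "T", "T": "A", "G": "C", "C": "G"}
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def complement(strand: str) -> str:
    """Return the base-by-base complement of a DNA strand."""
    return "".join(PAIR[base] for base in strand.upper())


def reverse_complement(strand: str) -> str:
    """Return the complementary strand read in the opposite direction."""
    return complement(strand)[::-1]


def is_purine_pyrimidine_link(a: str, b: str) -> bool:
    """True if one base is a purine and the other a pyrimidine."""
    return (a in PURINES and b in PYRIMIDINES) or (a in PYRIMIDINES and b in PURINES)


if __name__ == "__main__":
    print(complement("GTACGTTG"))
    print(reverse_complement("GTACGTTG"))
```

Every pair allowed by `PAIR` joins a purine with a pyrimidine, which is the point of the second bullet on the slide.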
DNA function
• DNA is code for making proteins (and a few other
molecules)
• Proteins form the structures of an organism and the
enzymes that catalyze the biochemical reactions
essential to its function
• DNA code is read and converted to protein in two
steps
– Transcription: DNA is copied to messenger RNA
– Translation: messenger RNA is template for protein
DNA code
• A gene is a code composed of a string of
nucleotide bases (A’s, C’s, G’s, T’s)
• A protein is composed of a string of amino
acids (there are 20)
• How does the DNA code get translated into
protein?
DNA code
• Each amino acid is coded for by at least one
triplet of nucleotide bases in DNA
• Each triplet is called a codon
• There are 64 possible codons (4 bases, 3
positions: 4^3 = 64)
From Raven et al. (1999), Biology of Plants
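The count of 64 codons can be checked by enumerating every triplet of the four bases. A minimal sketch; the variable names are illustrative only.

```python
from itertools import product

BASES = "ACGT"

# Every ordered triplet of bases: 4 choices at each of 3 positions.
codons = ["".join(triplet) for triplet in product(BASES, repeat=3)]

if __name__ == "__main__":
    print(len(codons))   # number of distinct codons
    print(codons[:4])    # first few codons in enumeration order
```

Because there are 64 codons but only 20 amino acids, most amino acids must be specified by more than one codon, which is why the slide says "at least one triplet".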
DNA functional classes
• Coding
– Proteins (exons)
– Ribosomes (RNA)
– Transfer RNA
• “Non-coding”
– Introns
– Spacers
From Raven et al. (1999),
Biology of Plants
Homology in Molecular
Systematics
• Assess orthology
• Align sequences
• Homology is often implicit (is this a good
thing?)
DNA Sequences and Homology
• Homology: similarity due to common
descent
• How do we assess homology of DNA
sequences?
• Levels of homology
– Locus
– Allele
– Nucleotide position
From W. P. Maddison (1997), Systematic Biology 46:527
Orthology vs. Paralogy
• DNA sequences that are at homologous loci are
orthologous
• DNA sequences that are similar due to duplication
but are at different loci are paralogous
• Orthology may be best detected with a
phylogenetic analysis of all sequences
From Martin & Burg (2002), Systematic Biology 51:578
Multiple Sequence Alignment
• Goal: create data matrix in which columns
are homologous positions
• Problem: sequences vary in length
• Why?
– Insertions
– Deletions
Simple Sequence Alignment
Taxon 1  GTACGTTG
Taxon 2  GTACGTTG
Taxon 3  GTACGTTG
Taxon 4  GTACATTG
Taxon 5  GTACATTG
Taxon 6  GTACATTG
DNA Sequence Data Matrix
      C1  C2  C3  C4  C5  C6  C7  C8
T1    G   T   A   C   G   T   T   G
T2    G   T   A   C   G   T   T   G
T3    G   T   A   C   G   T   T   G
T4    G   T   A   C   A   T   T   G
T5    G   T   A   C   A   T   T   G
T6    G   T   A   C   A   T   T   G
Slightly Less Simple Sequence Alignment
Unaligned:
Taxon 1  AGAGTGAC
Taxon 2  AGAGTGAC
Taxon 3  AGAGTGAC
Taxon 4  AGAGGAC
Taxon 5  AGAGGAC
Taxon 6  AGAGGAC
Aligned:
Taxon 1  AGAGTGAC
Taxon 2  AGAGTGAC
Taxon 3  AGAGTGAC
Taxon 4  AGAG-GAC
Taxon 5  AGAG-GAC
Taxon 6  AGAG-GAC
Alignment Gaps
• Gaps are inserted to maximize homology
across nucleotide positions
• Gaps are hypothesized indels
• Inserting a gap assumes that an indel event
is a better explanation of the differences
among sequences than nucleotide
substitution
Without a gap (3 substitutions, 0 indels):
Taxon 1  AGAGTGAC
Taxon 2  AGAGTGAC
Taxon 3  AGAGTGAC
Taxon 4  AGAGGAC
Taxon 5  AGAGGAC
Taxon 6  AGAGGAC
With a gap (0 substitutions, 1 indel):
Taxon 1  AGAGTGAC
Taxon 2  AGAGTGAC
Taxon 3  AGAGTGAC
Taxon 4  AGAG-GAC
Taxon 5  AGAG-GAC
Taxon 6  AGAG-GAC
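The trade-off between substitutions and indels can be tallied for a pair of rows. The sketch below is an illustration, not the method used on the slides. The function name `count_differences` is hypothetical. It compares rows column by column over their shared length, which is an assumption about how the slide counts the ungapped case, and it counts each contiguous run of gap characters as one indel event.

```python
def count_differences(row_a: str, row_b: str, gap: str = "-"):
    """Return (substitutions, indel_events) between two alignment rows.

    Assumption: rows are compared left-justified over their shared length,
    and a run of adjacent gap columns is treated as a single indel event.
    """
    substitutions = 0
    indels = 0
    in_gap = False
    for a, b in zip(row_a, row_b):
        if a == gap and b == gap:
            continue  # a column gapped in both rows says nothing about this pair
        if a == gap or b == gap:
            if not in_gap:
                indels += 1  # start of a new gap run
            in_gap = True
        else:
            in_gap = False
            if a != b:
                substitutions += 1
    return substitutions, indels


if __name__ == "__main__":
    print(count_differences("AGAGTGAC", "AGAGGAC"))    # ungapped rows
    print(count_differences("AGAGTGAC", "AGAG-GAC"))   # one gap inserted
```

Applied to Taxon 1 and Taxon 4, the ungapped rows give three substitutions and the gapped rows give one indel, matching the counts on the slide.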
Ambiguous Alignment with a Single-Base Indel
Unaligned:
Taxon 1  GGTCAG
Taxon 2  GGCCAA
Taxon 3  AGCTAA
Taxon 4  AGCAA
Taxon 5  AGCAA
Taxon 6  AGCAA
Alignment 1 (4 substitutions, 1 indel):
Taxon 1  GGTCAG
Taxon 2  GGCCAA
Taxon 3  AGCTAA
Taxon 4  AG-CAA
Taxon 5  AG-CAA
Taxon 6  AG-CAA
Alignment 2 (4 substitutions, 1 indel):
Taxon 1  GGTCAG
Taxon 2  GGCCAA
Taxon 3  AGCTAA
Taxon 4  AGC-AA
Taxon 5  AGC-AA
Taxon 6  AGC-AA
Gap Number and Length
• All else being equal, is it better to assume
fewer longer gaps, or more shorter gaps?
• In other words, what is more likely:
– For a new indel to occur?
– For an existing indel to lengthen?
• There is no general answer!
– Alternate alignments are explored
algorithmically
Alignment Algorithms
• Typically built up from pairwise alignments,
using assumed gap costs
• Problem: most algorithms require an initial
tree to define alignment order, which introduces bias
• Solution: simultaneous tree estimation and
alignment optimization
• Problems: costly, unjustifiable parameters
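A pairwise alignment under assumed gap costs can be sketched with the classic Needleman-Wunsch dynamic program. This is offered as an illustration of the building block, not as the algorithm any particular program uses. The mismatch cost of 1 and gap cost of 2 are arbitrary assumptions, and the function name is hypothetical.

```python
def needleman_wunsch(a: str, b: str, mismatch: int = 1, gap: int = 2):
    """Global pairwise alignment minimizing total cost.

    Costs are illustrative assumptions: 0 for a match, `mismatch` for a
    substitution, `gap` for each gapped column (linear gap cost).
    Returns (aligned_a, aligned_b, total_cost).
    """
    n, m = len(a), len(b)
    # cost[i][j] = minimum cost of aligning a[:i] with b[:j]
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i * gap
    for j in range(1, m + 1):
        cost[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = cost[i - 1][j - 1] + (0 if a[i - 1] == b[j - 1] else mismatch)
            cost[i][j] = min(diag, cost[i - 1][j] + gap, cost[i][j - 1] + gap)

    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and cost[i][j] == cost[i - 1][j - 1] + (
            0 if a[i - 1] == b[j - 1] else mismatch
        ):
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b)), cost[n][m]


if __name__ == "__main__":
    print(needleman_wunsch("AGAGTGAC", "AGAGGAC"))
```

With these costs a single gap (cost 2) is cheaper than three substitutions, so the gapped alignment is preferred. Raising the gap cost above 3 would reverse that preference, which is why the choice of gap costs matters.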
Clustal Alignment Algorithm
• Creates alignment based on penalties for
gap opening (number of gaps) and gap
extension (gap length)
• Multiple alignment built according to guide
tree determined by pairwise alignments
• Order of adding sequences determined by a
guide tree
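The distinction between gap opening and gap extension penalties can be illustrated by scoring the gaps in a single aligned row. The sketch below assumes one common convention (each gap run pays the opening penalty once plus the extension penalty per gapped position). Conventions differ between programs, and the penalty values and the name `gap_penalty` are hypothetical.

```python
import re


def gap_penalty(row: str, gap_open: float = 10.0, gap_extend: float = 1.0) -> float:
    """Total affine gap penalty for one aligned row.

    Assumed convention: each run of '-' costs gap_open + gap_extend * run_length.
    The default values are placeholders, not any program's defaults.
    """
    runs = re.findall(r"-+", row)
    return sum(gap_open + gap_extend * len(run) for run in runs)


if __name__ == "__main__":
    print(gap_penalty("AGAG-GAC"))   # one gap of length 1
    print(gap_penalty("AG--GAC"))    # one gap of length 2
    print(gap_penalty("A-G-GAC"))    # two gaps of length 1
```

When the opening penalty is much larger than the extension penalty, one gap of length 2 is cheaper than two gaps of length 1, so the alignment favors fewer, longer gaps.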
Clustal Alignment Algorithm
• Distance matrix calculated from pairwise comparisons
• Dendrogram calculated from distance matrix
• Alignment calculated for most similar pair of
sequences, based on alignment parameters
• Additional sequences are added according to
dendrogram, until all sequences are added
Tree-Based Alignment
• Simultaneous tree and alignment estimation
using parsimony
– TreeAlign
– MALIGN
• Implement similar gap opening/extension
costs
• These applications are very slow!
Alignment in the Future?
• Incorporate a more sophisticated understanding of
molecular evolution in parameterization
• For example, what are realistic values of gap
costs? Are they universal?
• Can phylogeny estimation proceed without
optimizing alignments?
– Likelihood-based methods can sum over all alignments
• Will require major contribution of biologists
Methods of tree estimation
• Character based
– Maximum parsimony (MP)
• Fewest character changes
– Maximum likelihood (ML)
• Highest probability of observing data, given a model
– Bayesian
• Similar to ML, but incorporates prior knowledge
• Distance based
– Minimum distance
• Shortest summed branch lengths
Major classes of data
Character-based
Bird        A G A G T
Alligator   A G G G T
Lizard      A C C G G
Snake       A C C G G
Tortoise    C C T C C
Distance-based
            Alligator  Lizard  Snake  Tortoise
Bird        0.20       0.60    0.60   1.00
Alligator              0.60    0.60   1.00
Lizard                         0.00   0.80
Snake                                 0.80
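The distance values follow from the character data as the proportion of sites that differ between each pair. A minimal sketch, assuming the Tortoise row reads CCTCC; the names `sequences`, `p_distance`, and `distances` are illustrative.

```python
from itertools import combinations

# Character matrix from the slide (Tortoise row read as CCTCC).
sequences = {
    "Bird": "AGAGT",
    "Alligator": "AGGGT",
    "Lizard": "ACCGG",
    "Snake": "ACCGG",
    "Tortoise": "CCTCC",
}


def p_distance(a: str, b: str) -> float:
    """Proportion of aligned sites at which two sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


# Upper triangle of the pairwise distance matrix.
distances = {
    (s, t): p_distance(sequences[s], sequences[t])
    for s, t in combinations(sequences, 2)
}

if __name__ == "__main__":
    for pair, d in distances.items():
        print(pair, f"{d:.2f}")
```

Converting characters to distances discards information about which sites differ: Lizard and Snake are both 0.80 from Tortoise, but the matrix no longer says which sites are responsible.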
Minimum Distance
            Alligator  Lizard  Snake  Tortoise
Bird        0.20       0.60    0.60   1.00
Alligator              0.60    0.60   1.00
Lizard                         0.00   0.80
Snake                                 0.80
Maximum Parsimony
            1 2 3 4 5
Bird        A G A G T
Alligator   A G G G T
Lizard      A C C G G
Snake       A C C G G
Tortoise    C C T C C
Tree labels: 1: A; 4: G; 2: C
3, 5 are slightly more complicated...
Parsimony Criterion
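$$L(\tau) = \sum_{k=1}^{B} \sum_{j=1}^{N} w_j \, \mathrm{diff}(x_{k'j}, x_{k''j})$$
L = tree length
τ = topology
k = branch
B = number of branches
j = character
N = number of characters
w_j = weight of character j
diff(x_{k'j}, x_{k''j}) = number of steps along branch k for character j, where k' and k'' are the nodes at the two ends of the branch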
Parsimonious Character
Reconstruction
• To evaluate the parsimony of a tree, each
character is optimized (then the sum is
computed)
• Several parsimony algorithms have been
developed that optimize character
reconstructions
• Algorithms differ in assumptions about
permissible transformations between
character states
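One such algorithm is Fitch optimization, which assumes unordered states with equal cost for any change. The sketch below applies it to the five-taxon example. The two topologies are hypothetical choices for illustration, not trees taken from the slides, and the Tortoise row is read as CCTCC.

```python
sequences = {
    "Bird": "AGAGT",
    "Alligator": "AGGGT",
    "Lizard": "ACCGG",
    "Snake": "ACCGG",
    "Tortoise": "CCTCC",
}


def fitch(node, states):
    """Fitch pass for one character on a binary tree of nested 2-tuples.

    Returns (possible_states_at_node, steps_in_subtree).
    """
    if isinstance(node, str):            # leaf: observed state, no steps
        return {states[node]}, 0
    left, right = node
    left_set, left_steps = fitch(left, states)
    right_set, right_steps = fitch(right, states)
    shared = left_set & right_set
    if shared:                           # children agree: no change needed here
        return shared, left_steps + right_steps
    return left_set | right_set, left_steps + right_steps + 1


def tree_length(tree, seqs):
    """Sum the minimum number of steps over all characters (equal weights)."""
    n_chars = len(next(iter(seqs.values())))
    return sum(
        fitch(tree, {taxon: s[j] for taxon, s in seqs.items()})[1]
        for j in range(n_chars)
    )


# Two hypothetical topologies to compare.
tree_a = ((("Bird", "Alligator"), ("Lizard", "Snake")), "Tortoise")
tree_b = ((("Bird", "Lizard"), ("Alligator", "Snake")), "Tortoise")

if __name__ == "__main__":
    print(tree_length(tree_a, sequences))
    print(tree_length(tree_b, sequences))
```

Under these assumptions the tree grouping Bird with Alligator and Lizard with Snake needs fewer steps than the alternative, so parsimony prefers it.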
Likelihood Criterion
$$L(\tau) = \prod_{j=1}^{N} l_j$$
L = tree likelihood
τ = topology
j = character (site)
N = number of characters (sites)
l_j = likelihood of site j
Site Likelihoods
• l_j, the probability of the nucleotides observed in each
sequence at site j, is the product of
probabilities along the branches of the tree
• The probability of change along a branch is a
function of
– The rate of substitution
– Branch length
• The site likelihood is summed over all possible
ancestral states
Substitution Model
• Many models have been proposed
• Elements are the rate of substitution of one
base for another, per site
• Rates are instantaneous (probability of
change in a short period of time)
• Rates may be allowed to vary among sites
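As a concrete case, the Jukes-Cantor (JC69) model assumes equal base frequencies and a single rate for all substitutions. The sketch below uses it to compute site likelihoods by summing over ancestral states at each internal node. The choice of JC69, the branch lengths of 0.1, the topology, and all function names are assumptions made for illustration.

```python
import math

BASES = "ACGT"


def jc69(i: str, j: str, t: float) -> float:
    """JC69 probability of base i becoming base j over branch length t
    (t in expected substitutions per site)."""
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if i == j else 0.25 - 0.25 * e


def partial(node, site):
    """P(observed tips below node | base at node), for each base.

    A leaf is a taxon name; an internal node is a 4-tuple
    (left_child, left_branch_length, right_child, right_branch_length).
    """
    if isinstance(node, str):
        return {b: 1.0 if b == site[node] else 0.0 for b in BASES}
    left, t_left, right, t_right = node
    pl, pr = partial(left, site), partial(right, site)
    return {
        b: sum(jc69(b, x, t_left) * pl[x] for x in BASES)
        * sum(jc69(b, y, t_right) * pr[y] for y in BASES)
        for b in BASES
    }


def site_likelihood(tree, site) -> float:
    """Sum over root states, weighting each by its frequency (1/4 under JC69)."""
    cond = partial(tree, site)
    return sum(0.25 * cond[b] for b in BASES)


def log_likelihood(tree, seqs) -> float:
    """Log of the product of site likelihoods, computed as a sum of logs."""
    n_sites = len(next(iter(seqs.values())))
    return sum(
        math.log(site_likelihood(tree, {t: s[j] for t, s in seqs.items()}))
        for j in range(n_sites)
    )


if __name__ == "__main__":
    seqs = {"Bird": "AGAGT", "Alligator": "AGGGT", "Lizard": "ACCGG",
            "Snake": "ACCGG", "Tortoise": "CCTCC"}
    archosaurs = ("Bird", 0.1, "Alligator", 0.1)   # hypothetical grouping
    squamates = ("Lizard", 0.1, "Snake", 0.1)      # hypothetical grouping
    tree = ((archosaurs, 0.1, squamates, 0.1), 0.1, "Tortoise", 0.1)
    print(log_likelihood(tree, seqs))
```

In a full analysis the branch lengths and model parameters would themselves be optimized for each candidate tree; here they are fixed only to keep the example short.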
Maximum Likelihood
            1 2 3 4 5
Bird        A G A G T
Alligator   A G G G T
Lizard      A C C G G
Snake       A C C G G
Tortoise    C C T C C
The tree is selected that maximizes the
likelihood of the observed sequences,
given a model of substitution
The Molecular Systematics
Revolution
• Dramatic increase in the size of data sets
– Characters, taxa
• Increased confidence in homology
assessment?
• Computational advances
– Technology
– Algorithms and software
Large Data Sets
• Pre-1990: up to ~25 taxa (rarely 100), 1
gene or up to 100 morphological characters
• 1998: 2538 rbcL sequences of green plants;
entire mtDNA sequences (15,000 bp) in
animals
• 2004: ??
The Large Data Set Headache
• Problem: the large number of sequences,
not the size of sequences
• Application of phylogenetic optimality
criteria requires evaluation of all possible
trees
• Algorithms guaranteed to find optimal
solutions have limited applicability
The Problem of Finding Optimal
Trees
• There are too many trees to evaluate!
• The number of possible topologies increases
very rapidly with the number of
taxa/samples
• There are (2m - 5)! / [2^(m-3) (m - 3)!] unrooted
trees, where m = number of taxa
Taxa    Trees
3       1
4       3
5       15
7       945
9       135,135
(Figure labels: “stars in the universe”, “atoms in the universe”)
From Hillis et al. (1996), Applications of Molecular Systematics
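The formula for the number of unrooted trees is easy to evaluate directly, which makes the growth rate concrete. A small sketch; the function name is illustrative.

```python
from math import factorial


def unrooted_trees(m: int) -> int:
    """Number of unrooted, fully bifurcating trees for m taxa:
    (2m - 5)! / [2^(m-3) * (m - 3)!]."""
    if m < 3:
        raise ValueError("formula applies to 3 or more taxa")
    return factorial(2 * m - 5) // (2 ** (m - 3) * factorial(m - 3))


if __name__ == "__main__":
    for m in (3, 4, 5, 7, 9, 20, 50):
        print(m, unrooted_trees(m))
```

Python integers have arbitrary precision, so the count stays exact even for taxon numbers where the result has dozens of digits.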
Heuristic Methods
• Starting trees followed by rearrangement
– Starting trees sample “tree space”
– Rearrangements search for local optima
• How to get starting trees?
• How to rearrange trees?
• These methods are prone to getting trapped
on local optima
Cutting-edge Methods
• Ratchet
– Temporarily “warps” the character space
• Annealing algorithms
– Accept suboptimal trees and gradual movement
towards optima
• Genetic algorithms
– Analogous to evolution by natural selection
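The idea of accepting suboptimal trees can be illustrated with the acceptance rule commonly used in simulated annealing (the Metropolis criterion). Whether the programs the slide has in mind use exactly this rule is an assumption. The function name and the use of tree length as the score are also illustrative.

```python
import math
import random


def accept(current_score: float, candidate_score: float,
           temperature: float, rng=random) -> bool:
    """Decide whether to move to a candidate tree.

    Scores are costs (for example, parsimony tree length): lower is better.
    A better or equal candidate is always accepted. A worse one is accepted
    with probability exp(-(increase in cost) / temperature).
    """
    if candidate_score <= current_score:
        return True
    probability = math.exp((current_score - candidate_score) / temperature)
    return rng.random() < probability


if __name__ == "__main__":
    random.seed(1)
    # At high temperature worse trees are often accepted; as the temperature
    # is lowered, the search settles toward an optimum.
    for temperature in (10.0, 1.0, 0.1):
        accepted = sum(accept(8, 10, temperature) for _ in range(1000))
        print(temperature, accepted)
```

Accepting occasional worse trees is what lets the search escape the local optima that trap simple rearrangement heuristics.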
Using Phylogenies
What genes are
involved in the origin
of novel traits (like
wings)?
Why are birds so different
from their closest
relatives?
Is the rate of molecular evolution
in birds especially high?