Genetics to Genomics
(From Basics to Buzzwords)
•
Genetics : Understanding the role of heritable material in
shaping organismal phenotypes
•
Genomes are more than collections of genes:
•
Chromosomes and episomes
•
Gene clusters within chromosomes
•
Genes and associated control elements
•
Complex Exon/Intron organization
•
Functional domains organized within coding regions
•
Functional domains positioned outside coding regions
The fundamental task of genomics is understanding what
information is important, and what is not.
•
Genomes result from the accumulation of changes over
time (evolution).
•
Therefore, an understanding of genomes must have a basis
in the understanding of how constituent domains, genes,
gene clusters and chromosomes evolve
•
This leads to an understanding of patterns of information
within and between genes and genomes.
A History of Genomic Data
•
Richard Roblin’s Ph.D. thesis in 1967 was the determination
of the identity of a single nucleotide (1 base is not a
sequence); it was the 5’ end of bacteriophage R17, a 3 kb
RNA phage; that base was a guanosine (pppGp…)
•
In 1970, the 12-bp cohesive ends of bacteriophage lambda
were determined by Ray Wu
•
In 1977, two methods were introduced for rapid DNA
sequencing (both won their proponents Nobel Prizes):
• The Maxam-Gilbert chemical degradation method
• The Sanger primer extension method
•
In 1977, the 5,386 base sequence of E. coli bacteriophage
φX-174 was published
•
In 1983, the 48,502 base pair sequence of bacteriophage λ
was published
•
In 1995 the 1,830,137 base pair sequence of the free-living
bacterium Haemophilus influenzae was published
•
In late 1996, the 12,052,000 base pair sequence of the yeast
Saccharomyces cerevisiae was published
•
In late 1998, the 97,000,000 bp sequence of the nematode
Caenorhabditis elegans was published
•
In 2000, a draft of the 3,000,000,000 base pair human
genome was completed
•
By early 2003, the genomes of nearly 100 species of
Bacteria and Archaea, and 10 species of eukaryotes, are
completely sequenced.
Mendel and Darwin:
More than two dead white guys?
•
What was “Blending Inheritance”?
•
How did they view mutations?
•
What was the influence of Aristotle?
•
What was the influence of Malthus?
•
What was the influence of Geologists?
•
Natural Variation → The result of genetic experiments played
out over long periods of time
•
Similarity & Difference : Provide clues to relative importance:
the results of (more or less) an infinite number of genetic
experiments
Pillars of molecular evolution
(How we make our models)
Empirical Data
•
Direct experimentation in laboratory environments
•
Direct experimentation in natural environments
•
Observation of natural variation within species
•
Observation of differences between species
Integrative Analysis
•
Mathematical Modeling
•
Cogitation
•
Extrapolation and integration
Classification of similarity
Criteria for Classification
• By what criteria are features similar?
•
By what criteria are features different?
•
What processes lead to similarity?
•
What processes lead to differences?
Types of Similarity
•
Homology : Identity by Descent
•
Orthology : Encoded functions are identical
•
Paralogy : Encoded functions are different
•
Convergence : Identity by State
•
Chance : Identity by State
Methods for Assessing Molecular Similarity
•
DNA-DNA Hybridization
•
Isozyme analysis and MLEE
•
Library overlap (SAB)
•
RFLP Analysis
•
DNA sequence divergence
Measuring Mutation Rates
Mutation rates
•
Luria-Delbrück Fluctuation tests
• Targets used in laboratory experiments
  • Phage resistance
  • Antibiotic resistance
  • lacZ
Lessons
•
Mutations occur almost at random
•
Probability matrix is organism-dependent
•
There are context effects
Substitution rates
•
A mutation is a lesion
•
A substitution is a variant allele observed in nature
•
Not all mutations become substitutions
Fate of Mutations
A. Mutations originate at particular frequencies
i Variable exposure to mutagens
ii Context for polymerase error
iii Likelihood of replication slippage for frameshift
B. However, lesions are repaired at different frequencies
i Different mismatch repair systems
ii Transcription coupled repair
C. Mutation not repaired, but has lethal effects
D. Mutation is disadvantageous and eventually lost
i Though not lost, mutation is infrequent in the population
ii Mutation is frequent in the population
E. Mutation becomes ubiquitous in the population (fixation)
i For a neutral mutation, P = 1/(pN), where p is the ploidy
ii Average time to fixation is T = 2pN generations
Random Genetic Drift
What is the probability that a variant allele becomes fixed?
• $P = \dfrac{1 - e^{-4 N_e s q}}{1 - e^{-4 N_e s}}$
• Consider the approximation $e^{-x} \approx 1 - x$ when x is small
• Consider a newly arisen allele; in a diploid population its frequency is q = 1/2N
• $P = \dfrac{1 - e^{-2 N_e s / N}}{1 - e^{-4 N_e s}}$
• s = 0 for a neutral mutation
• Therefore, in the limit s → 0, P = 1/2N; this should be intuitive
• If $N_e = N$, then $P = \dfrac{1 - e^{-2s}}{1 - e^{-4Ns}}$
• If s is small, then $P \approx \dfrac{2s}{1 - e^{-4Ns}}$
• For s > 0 and large N, $P \approx 2s$
• Neutral alleles go to fixation in t = 2pN generations
• For s ≠ 0, alleles fix in t = (2/|s|) ln(2N) generations
Definitions
q = Initial frequency of variant allele
s = selection coefficient
Ne = Effective population size
N = Actual population size
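The fixation probability above can be evaluated numerically. The following is a minimal sketch in Python, assuming the formula exactly as written on this slide; the function name `fixation_probability`, its defaults (Ne = N, q = 1/2N) and the example values are illustrative assumptions rather than part of the lecture.

```python
import math

def fixation_probability(s, N, Ne=None, q=None):
    """P = (1 - e^(-4*Ne*s*q)) / (1 - e^(-4*Ne*s)).

    Assumptions of this sketch: Ne defaults to N, and q defaults to 1/(2N),
    the frequency of a newly arisen allele in a diploid population.
    """
    Ne = N if Ne is None else Ne
    q = 1.0 / (2 * N) if q is None else q
    if s == 0:
        return q  # neutral limit: P equals the initial frequency
    # expm1(x) = e^x - 1; the minus signs of numerator and denominator cancel
    return math.expm1(-4 * Ne * s * q) / math.expm1(-4 * Ne * s)

p_neutral = fixation_probability(0.0, N=1000)   # equals 1/(2N)
p_large = fixation_probability(0.01, N=10**6)   # close to 2s
p_small = fixation_probability(0.01, N=10)      # drift matters more here
print(p_neutral, p_large, p_small)
```

For strongly negative values of 4·Ne·s the exponential overflows a float; this sketch does not guard against that case.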
Effectively Neutral Mutations
Are mutations always either beneficial or detrimental?
As we saw earlier, that depends on what phenotype one is
examining
Even more insidious, that depends on population size and
population structure
In small populations, it takes a mighty big change in fitness
(either positive or negative) to counter-act the stochastic
process of genetic drift. “Detrimental” mutations can sweep
a population even if they confer a disadvantageous
phenotype.
In larger populations, these same mutations could be
eliminated quickly, since genetic drift has a smaller impact.
This interaction between population size and the effect of a
mutation delineates a zone of fitness effects termed
“effectively neutral,” whereby the fitness impact is not
statistically different from zero. This is a function of the
population size.
This is why it is difficult to proclaim “conserved” sequences
as important and nonconserved sequences as unimportant.
“Conservation” (that is, the elimination of deleterious
mutations from the population) is a function of population
size.
It is also a function of population structure (subdivision,
migration, etc) and sexual exchange (obligate or infrequent)
Selectionism vs Neutralism
What is the significance of natural variation?
Selectionism
Selectionism argues that most variants have adaptive value,
and variation is maintained through a variety of mechanisms
•
Selection/mutation balance
•
Heterosis
•
Frequency dependence
•
Spatial and temporal heterogeneity
Neutralism
Neutralism argues that most variation is effectively neutral, and
reflects primarily genetic drift
The Poisson Distribution
•
Predicts the distribution of occurrences in a discrete
classification system
•
Derived from the binomial distribution and equal probability
of state
• $P_\mu(x) = \dfrac{\mu^x e^{-\mu}}{x!}$
• So, for µ = 1, $P(x) = \dfrac{1}{e \, x!}$
• So, the probability of zero occurrences is ~37%
• The probability of only 1 occurrence is ~37%
• The probability of 2 or more occurrences is ~26%
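As a quick numerical check, the sketch below evaluates the Poisson probabilities of zero, one, and two or more occurrences for µ = 1. The helper name `poisson_pmf` is an illustrative choice.

```python
import math

def poisson_pmf(x, mu):
    """Probability of exactly x occurrences when the mean is mu."""
    return mu ** x * math.exp(-mu) / math.factorial(x)

mu = 1.0
p_zero = poisson_pmf(0, mu)
p_one = poisson_pmf(1, mu)
p_two_or_more = 1.0 - p_zero - p_one  # complement of the first two classes
print(f"{p_zero:.3f} {p_one:.3f} {p_two_or_more:.3f}")  # prints 0.368 0.368 0.264
```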
When did they diverge?
ACTGTAGGAATCGC
 *  * *
AATGAAAGAATCGC
If the probability of mutation is $10^{-9}$ / bp / generation, how many
generations have these two sequences been diverging?
Naïve answer
•
Let p be the probability of a mutation arising
•
This can be (and has been) measured in the laboratory
• p = $10^{-9}$ / bp / generation
• For 14 bp, p = $1.4 \times 10^{-8}$ / generation
• Therefore 1 mutation arises, on average, every $7.14 \times 10^{7}$ generations
• Therefore 3 mutations arise, on average, in $2.14 \times 10^{8}$ generations
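The naïve arithmetic can be reproduced in a few lines. This is only a sketch of the calculation on this slide; the function names are illustrative, and the two sequences and the rate are the ones quoted above.

```python
def count_differences(seq1, seq2):
    """Number of aligned positions at which two equal-length sequences differ."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must have the same length")
    return sum(a != b for a, b in zip(seq1, seq2))

def naive_generations(n_diffs, length_bp, rate_per_bp_per_generation):
    """Generations needed, on average, for n_diffs mutations to arise."""
    return n_diffs / (length_bp * rate_per_bp_per_generation)

seq1 = "ACTGTAGGAATCGC"
seq2 = "AATGAAAGAATCGC"
diffs = count_differences(seq1, seq2)
gens = naive_generations(diffs, len(seq1), 1e-9)
print(diffs, f"{gens:.3g}")
```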
What is missing here?
When did they diverge - Part II
First, many substitutions go unnoticed
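[Figure: an ancestral sequence (ACTGAACGTAACGC) diverging along two lineages; the sites that change are tabulated below]

Position  Lineage 1  Lineage 2  Type
2         C          C→A        Single Substitution
5         A→C→T      A          Multiple Substitutions
7         C→G        C→A        Coincidental Substitutions
9         T→A        T→A        Parallel Substitutions
11        A→C→T      A→T        Convergent Substitution
14        C          C→T→C      Back Substitution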
Only the “Single Substitution” leads to differences that
accurately reflect the number of mutational events
Jukes and Cantor Model
• Probability of a base changing to any one specific other base during time t is set to be α
• Probability of a base being equal to its original state at time T = t is $P_1 = 1 - 3\alpha$
• At time T = 2t, the probability of the original state is: $P_2 = (1 - 3\alpha)P_1 + \alpha(1 - P_1)$
• This can be formulated as a first-order differential equation: $dP_t/dt = -4\alpha P_t + \alpha$
• This can be solved as $P_t = \tfrac{1}{4} + (P_0 - \tfrac{1}{4})e^{-4\alpha t}$
• Since $P_0 = 1$, $P_t = \tfrac{1}{4} + \tfrac{3}{4} e^{-4\alpha t}$
• If $P_0 = 0$, $P_t = \tfrac{1}{4} - \tfrac{1}{4} e^{-4\alpha t}$
• Notice that both equations converge to ¼ at equilibrium
• Under the Jukes-Cantor model, all bases have the same
frequency and interchange with equal likelihood
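The recurrence and its closed-form solution can be compared numerically. The sketch below assumes a hypothetical per-step rate α chosen only for illustration; the function names are likewise illustrative.

```python
import math

def jc_same_state(t, alpha, p0=1.0):
    """Closed form: P_t = 1/4 + (P_0 - 1/4) * exp(-4*alpha*t)."""
    return 0.25 + (p0 - 0.25) * math.exp(-4.0 * alpha * t)

def jc_iterated(steps, alpha, p0=1.0):
    """Apply P_next = (1 - 3*alpha)*P + alpha*(1 - P) once per time step."""
    p = p0
    for _ in range(steps):
        p = (1.0 - 3.0 * alpha) * p + alpha * (1.0 - p)
    return p

alpha = 1e-4  # hypothetical rate per time step, not taken from the lecture
closed = jc_same_state(1000, alpha)
stepped = jc_iterated(1000, alpha)
from_other_base = jc_same_state(1e6, alpha, p0=0.0)  # long time, started elsewhere
print(closed, stepped, from_other_base)
```

For small α the step-by-step recurrence and the exponential solution agree closely, and both starting conditions approach ¼ after enough time.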
Convergence of the Jukes & Cantor Model
[Figure: probability of having an 'A' (0.00 to 1.00) plotted against time (0 to 200 million years)]
Kimura’s Two-parameter Model
•
Separates transition probability from transversion probability
•
A transition substitution occurs with probability α
•
A transversion substitution occurs with probability β
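• Probability of identity over time is calculated as:
$X_t = \tfrac{1}{4} + \tfrac{1}{4}e^{-4\beta t} + \tfrac{1}{2}e^{-2(\alpha+\beta)t}$
• Probability of difference by transition is
$Y_t = \tfrac{1}{4} + \tfrac{1}{4}e^{-4\beta t} - \tfrac{1}{2}e^{-2(\alpha+\beta)t}$
• Probability of difference by each transversion is
$Z_t = \tfrac{1}{4} - \tfrac{1}{4}e^{-4\beta t}$
• Note that $X_t + Y_t + 2Z_t = 1$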
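The three Kimura probabilities can be computed directly, which also makes the identity X + Y + 2Z = 1 easy to confirm. This is a sketch with hypothetical rate values; setting α = β should recover the one-parameter Jukes-Cantor result.

```python
import math

def k2p_probabilities(t, alpha, beta):
    """Return (X, Y, Z): identity, transition difference, one transversion difference."""
    e_tv = math.exp(-4.0 * beta * t)
    e_mix = math.exp(-2.0 * (alpha + beta) * t)
    x = 0.25 + 0.25 * e_tv + 0.5 * e_mix
    y = 0.25 + 0.25 * e_tv - 0.5 * e_mix
    z = 0.25 - 0.25 * e_tv
    return x, y, z

# Hypothetical rates: transitions twice as frequent as each transversion
x, y, z = k2p_probabilities(t=100.0, alpha=2e-3, beta=1e-3)
print(x, y, z, x + y + 2 * z)
```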
Justification for the Kimura Model
Relative substitution rates in mammalian pseudogenes

Mutant        Original Nucleotide
Nucleotide    A              T              C              G
A             -              4.4 +/- 1.1    6.5 +/- 1.1    20.1 +/- 2.2
T             4.7 +/- 1.3    -              21.0 +/- 2.1   7.2 +/- 1.1
C             5.0 +/- 0.7    8.2 +/- 1.3    -              5.3 +/- 1.0
G             9.4 +/- 1.3    3.3 +/- 1.2    4.2 +/- 0.5    -

• Notice that transition rates are higher than individual
transversion rates
• Notice also that the rates of substitution are not symmetrical
Correcting for multiple substitutions
•
Let’s start with the Jukes & Cantor one-parameter model
• Probability of identity for a sequence in TWO lineages is
$P_i = \tfrac{1}{4} + \tfrac{3}{4} e^{-8\alpha t}$
• Probability of difference is $P_D = 1 - P_i$:
$P_D = \tfrac{3}{4}(1 - e^{-8\alpha t})$
or, $8\alpha t = -\ln(1 - \tfrac{4}{3}P_D)$
• Since t is unknown, we cannot estimate α. Instead, we
compute K, the number of substitutions per site
• For 2 lineages, $K = 2(3\alpha t)$
• So, $K = -\tfrac{3}{4}\ln(1 - \tfrac{4}{3}P)$, where P (= $P_D$) is the proportion of
differing nucleotides per site
• For a sequence of length L, the sampling variance is
$V(K) = \dfrac{P(1-P)}{L(1 - \tfrac{4}{3}P)^2}$
• For the Kimura model, let P be the proportion of sites differing by
transitions and Q be the proportion of sites differing by
transversions
• $K = \tfrac{1}{2}\ln(a) + \tfrac{1}{4}\ln(b)$, where
$a = 1/(1 - 2P - Q)$ and $b = 1/(1 - 2Q)$
• $V(K) = [a^2 P + c^2 Q - (aP + cQ)^2]/L$, where $c = (a+b)/2$
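Both corrections are short enough to sketch in code. The example below applies them to a 14-bp pair with 3 differences (one transition and two transversions); the function names are illustrative, and the formulas are the ones listed above.

```python
import math

def jukes_cantor_distance(p, length):
    """Return (K, V(K)) under the one-parameter model; p = proportion of differing sites."""
    w = 1.0 - 4.0 * p / 3.0
    if w <= 0:
        raise ValueError("p must be below 3/4 for the correction to be defined")
    k = -0.75 * math.log(w)
    var = p * (1.0 - p) / (length * w * w)
    return k, var

def kimura_distance(p_ts, q_tv, length):
    """Return (K, V(K)) under the two-parameter model.

    p_ts: proportion of sites differing by a transition
    q_tv: proportion of sites differing by a transversion
    """
    a = 1.0 / (1.0 - 2.0 * p_ts - q_tv)
    b = 1.0 / (1.0 - 2.0 * q_tv)
    c = (a + b) / 2.0
    k = 0.5 * math.log(a) + 0.25 * math.log(b)
    var = (a * a * p_ts + c * c * q_tv - (a * p_ts + c * q_tv) ** 2) / length
    return k, var

k_jc, v_jc = jukes_cantor_distance(3 / 14, 14)
k_k2, v_k2 = kimura_distance(1 / 14, 2 / 14, 14)
print(k_jc, v_jc, k_k2, v_k2)
```

With exactly twice as many transversions as transitions (Q = 2P), the two-parameter estimate coincides with the one-parameter estimate, which is what this example shows.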
Nucleotide Positions Comprise Two
Classes of ‘Sites’
• Alterations at Nonsynonymous Sites change the encoded
protein
• Alterations at Synonymous Sites do not change the
encoded protein
In early terminology:
• Synonymous Site = “Silent Site;” such a change was thought
to show no phenotype
• Nonsynonymous Site = “Replacement Site”
Variation in Nonsynonymous
Substitution Rates
•
Variation in purifying selection within genes
•
There is “domain structure” within proteins, which means
some regions will evolve more quickly than others, since
they serve less important roles
•
For example, a nucleotide binding domain may evolve
more slowly than a cytoplasmic loop
•
Variation among genes due to selection intensity
•
Some entire genes play more important roles, and
therefore changes are less well tolerated; e.g., genes
encoding histones evolve very slowly
•
Variation among genes due to differences in mutation rate
•
Variation in amino acid tolerance to substitution
•
Variation in lineage-specific rates
The Molecular Clock
Hypothesis that substitution rates are equivalent between
lineages. Tested using a “relative rate” test:
   +------------- C
   |
---|      +------ B
   |      |
   +------O
          |
          +------ A

• $K_{AC} = K_{OA} + K_{OC}$
• $K_{BC} = K_{OB} + K_{OC}$
• $K_{AB} = K_{OA} + K_{OB}$
Therefore:
$K_{OA} = (K_{AC} + K_{AB} - K_{BC})/2$
$K_{OB} = (K_{AB} + K_{BC} - K_{AC})/2$
$K_{OC} = (K_{AC} + K_{BC} - K_{AB})/2$
According to the Molecular Clock, $K_{OA} - K_{OB} = 0$
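The relative rate test reduces to three lines of arithmetic on the pairwise distances. The sketch below uses made-up distances purely for illustration; the function name `relative_rate` is an assumption of this example.

```python
def relative_rate(k_ab, k_ac, k_bc):
    """Split pairwise distances among A, B and outgroup C into branch lengths.

    Returns (K_OA, K_OB, K_OC, K_OA - K_OB); the last value is expected to be
    near zero if A and B evolve at the same rate.
    """
    k_oa = (k_ac + k_ab - k_bc) / 2.0
    k_ob = (k_ab + k_bc - k_ac) / 2.0
    k_oc = (k_ac + k_bc - k_ab) / 2.0
    return k_oa, k_ob, k_oc, k_oa - k_ob

# Hypothetical distances in which lineage A has changed more than lineage B
k_oa, k_ob, k_oc, diff = relative_rate(k_ab=0.25, k_ac=0.45, k_bc=0.40)
print(k_oa, k_ob, k_oc, diff)
```

Note that the difference K_OA - K_OB simplifies to K_AC - K_BC, so the test only needs the two distances to the outgroup.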
Natural Selection
Fitness (w) - A measure of relative ability of organisms to
survive and reproduce in a certain environment. A fitness of
1.0 is typically used as a baseline for comparison. A fitness
value lower than 1.0 indicates that an organism is less likely to
produce viable offspring. Fitness is a genotype by
environment interaction.
Selection coefficient (s) - A measure of how a particular
phenotypic trait alters fitness. Since fitness is measured as
w = 1 - s, positive selection coefficients denote detrimental
traits.
Malthus - Noted that more offspring are produced than can
survive.
Darwin - Postulated that fitness is heritable; that is, more fit
organisms produce more fit offspring.
Kinds of Selection
Purifying selection - The process by which substitutions
resulting in less fit organisms are removed from the population.
Heterosis - The phenomenon whereby heterozygotes have a
higher fitness than do homozygotes
Frequency-dependent selection - The phenomenon whereby
fitness of a genotype depends upon its frequency in the
population; typically less frequent genotypes are more fit. This
leads to the stable maintenance of polymorphism.
Diversifying Selection - The phenomenon whereby the
fitness conferred by a genotype is strongly dictated by the
environment, leading to the stable maintenance of
polymorphism.
Approaches for Constructing Dendrograms
Phenetics : Relationships are based on overall levels of
similarity. Common methods include:
•
UPGMA (unweighted pair-group method with arithmetic mean)
•
Transformed Distance and other variants
•
Fitch-Margoliash
•
Neighbor-joining
Cladistics : Relationships are based on the elucidation of shared,
derived characteristics. Common methods include:
•
Parsimony
•
Evolutionary Parsimony
•
Maximum Likelihood
UPGMA I : The Method
•
Form clusters starting with most closely related taxa
•
Average relationships with other taxa
•
Repeat
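The three steps above can be sketched as a short clustering loop. This is a minimal illustration rather than a reference implementation: the function name `cluster` and the flag `size_weighted` are assumptions of this sketch. With `size_weighted=True` each merged distance is the mean over all original taxon pairs (the usual UPGMA rule); with `size_weighted=False` the two merged rows are simply averaged, so later joins reflect shared branches. The matrix is the six-taxon divergence table used in this lecture.

```python
def cluster(dist, size_weighted=True):
    """Agglomerative clustering on a dict {frozenset({a, b}): distance}.

    Returns a list of (new_cluster_label, distance_at_which_it_formed).
    """
    d = dict(dist)
    sizes = {name: 1 for pair in d for name in pair}
    merges = []
    while len(sizes) > 1:
        pair = min(d, key=d.get)           # most closely related pair
        a, b = sorted(pair)
        height = d.pop(pair)
        na, nb = sizes.pop(a), sizes.pop(b)
        wa, wb = (na, nb) if size_weighted else (1, 1)
        new = f"({a},{b})"
        for other in sizes:                # average relationships with other taxa
            da = d.pop(frozenset((a, other)))
            db = d.pop(frozenset((b, other)))
            d[frozenset((new, other))] = (wa * da + wb * db) / (wa + wb)
        sizes[new] = na + nb
        merges.append((new, height))
    return merges

taxa = "ABCDEF"
upper = [  # upper triangle of the divergence matrix, row by row
    [.08, .19, .32, .28, .55],
    [.22, .26, .25, .62],
    [.31, .28, .59],
    [.14, .64],
    [.63],
]
dist = {frozenset((taxa[i], taxa[i + 1 + k])): v
        for i, row in enumerate(upper) for k, v in enumerate(row)}

upgma = cluster(dist, size_weighted=True)
simple = cluster(dist, size_weighted=False)
for label, height in simple:
    print(label, round(height, 5))
```

Both averaging rules give the same topology for this matrix; they differ only in the heights of the later joins (0.606 versus 0.61125 for the final join to F).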
Original Divergence
        A      B      C      D      E      F
A       -      .08    .19    .32    .28    .55
B              -      .22    .26    .25    .62
C                     -      .31    .28    .59
D                            -      .14    .64
E                                   -      .63

Round 1 : Group taxa A & B
        A,B    C      D      E      F
A,B     -      .205   .29    .265   .585
C              -      .31    .28    .59
D                     -      .14    .64
E                            -      .63

UPGMA II : The Tree
Round 2 : Group taxa D & E
        A,B    C      D,E     F
A,B     -      .205   .2775   .585
C              -      .295    .59
D,E                   -       .635

Round 3 : Group taxa C with (A,B) Cluster
           (A,B),C   D,E      F
(A,B),C    -         .28625   .5875
D,E                  -        .635

Round 4 : Group taxa [(A,B),C] Cluster with (D,E) cluster
                   ((A,B),C),(D,E)   F
((A,B),C),(D,E)    -                 0.61125**
**Note straight average of all taxa is 0.606; this value reflects shared branches

Round 5 : Dendrogram is complete
[Figure: dendrogram (((A,B),C),(D,E)),F drawn against a divergence axis from 0.0 to 0.6]
UPGMA III : Significance
So, the previous tree looked robust, but what about this one:
[Figure: dendrogram of taxa A to F on a divergence axis from 0.0 to 0.6, with one branch marked **]
We may be confident with saying A, B & C belong to one group
and D & E belong to another, but are we confident that A is
closer to B than it is to C? In other words, what is the
confidence in the marked (**) branch?
Other Distance Methods
Transformed Distance
• Transform distance first as $D_{ij}^{*} = (D_{ij} - D_{ir} - D_{jr})/2 + c$
•
Where r is a referent taxon and c allows for positive values
Fitch and Margoliash
•
Dij is again the observed distance and Eij is the tree distance
• Trees are chosen to minimize the following:
$s_{FM} = 100\left[\dfrac{2\sum_{i<j}\{(D_{ij} - E_{ij})/D_{ij}\}^2}{n^2 - n}\right]^{1/2}$
Minimum Evolution (Simplified as Neighbor Joining)
•
In an unrooted, bifurcating tree of n taxa, there are 2n - 3
possible branches; λi is the length of the ith branch
•
The sum of branches is L = Σ λi
•
Final tree minimizes L; this is not maximum parsimony since
this method is not affected by backward or parallel mutation
Neighborliness
•
Consider a tree with n > 3 taxa; assume taxa 1 & 2 are
neighbors and Dij is distance between taxa i & j
• Therefore $D_{12} + D_{ij} < D_{1i} + D_{2j}$ AND $D_{12} + D_{ij} < D_{1j} + D_{2i}$
• The best tree maximizes the number of cases for which this is true
Estimating Branch Lengths
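Consider 3 taxa in an unrooted tree
• $D_{AB} = x + y$
• $D_{AC} = x + z$
• $D_{BC} = y + z$
So, we can solve as
• $x = (D_{AB} + D_{AC} - D_{BC})/2$
• $y = (D_{AB} + D_{BC} - D_{AC})/2$
• $z = (D_{AC} + D_{BC} - D_{AB})/2$
[Figure: unrooted tree joining A, B and C by branches x, y and z; a second tree of taxa 1 to 5 with branches a to g]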
Now, consider more than three taxa
• Let's say taxa 1 & 2 were the first to cluster
•
These will correspond to taxa ‘A’ and ‘B’ in the three-taxa
case as above
•
Therefore, x=a and y=b
•
Collapse all other taxa into ‘C’, represented as the c/d/e
junctions on this phylogeny
• So, $D_{A,C} = D_{1,(3,4,5)} = (D_{1,3} + D_{1,4} + D_{1,5})/3$
•
Next we collapse other taxa so that 1,2=A, 4=B and 3,5=C
•
Repeat until all lengths are calculated
Parsimony Methods
Maximum Parsimony
•
Ancestral sequences are inferred from extant sequences,
and the tree requiring the minimum number of changes is
computed.
•
Branch lengths are computed a number of ways, often
correlated to the number of changes occurring along a
branch.
•
The likelihood of parallel and/or backwards events can be
adjusted depending on the data set.
Evolutionary Parsimony
•
Usually considers only Four taxa
•
Transition/transversion bias is computed to compute
quantities X, Y & Z for the three topologies.
•
If only one is significant, then this tree is chosen
Example of Parsimony
Taxon   Sequence
A       G C G G C G G A C C G G G
B       G C G A C A C T C C G G A
C       A C A T T G G A A A T A A
D       G C A T T A C T T A T A G

Types of Sites
• Invariant
• Variant
• Informative

Tree             Support
(A,B) , (C,D)    5
(A,C) , (B,D)    3
(A,D) , (B,C)    1
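The support values in this example can be recovered by counting informative sites, that is, sites where two taxa share one base and the other two share a different base. The sketch below is an illustration for exactly four taxa; the function name `topology_support` is an assumption of this example.

```python
from collections import Counter

def topology_support(a, b, c, d):
    """Count informative sites favouring each of the three 4-taxon topologies."""
    support = Counter()
    for w, x, y, z in zip(a, b, c, d):
        if w == x and y == z and w != y:
            support["(A,B),(C,D)"] += 1
        elif w == y and x == z and w != x:
            support["(A,C),(B,D)"] += 1
        elif w == z and x == y and w != x:
            support["(A,D),(B,C)"] += 1
        # all other patterns are invariant or uninformative
    return support

seq_a = "GCGGCGGACCGGG"
seq_b = "GCGACACTCCGGA"
seq_c = "ACATTGGAAATAA"
seq_d = "GCATTACTTATAG"
support = topology_support(seq_a, seq_b, seq_c, seq_d)
print(dict(support))
```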
Types of Trees
•
Unrooted
•
Outgroup Rooted
•
Midpoint Rooted
Testing Parsimony Trees
Cavender’s test
• For specific numbers of characters, calculates how many
characters worse a tree must be to be rejected
Chars    Steps     Chars    Steps     Chars    Steps
3-4      3         21-23    10        43-46    17
5-6      4         24-26    11        47-49    18
7-9      5         27-29    12        50       19
10-11    6         30-33    13        60       22
12-14    7         34-36    14        75       26
15-17    8         37-39    15        100      33
18-20    9         40-42    16
Felsenstein tests
• Work on trees with small numbers of taxa
• First : Tests to see if the number of steps supporting the
best tree is significantly lower than the number of steps
supporting the next-best tree (S = a-b).
• Second : The number of characters supporting the best tree
(C=a)

Chars     4     5      6      7     8     9-10   11-12  13
S(.05)    4     5      6      7     8     5      5      5
C(.05)    4     5      5      6     6     7      8      9

Chars     14    15-16  17-19  20    21    22-23  24-26  27-28
S(.05)    6     6      6      6     7     7      7      7
C(.05)    9     10     11     12    12    13     14     15
Maximum Likelihood Methods
Topology Generation
•
Nucleotide positions are considered separately under
models for DNA evolution.
•
Topologies are tested for their likelihood of generating the
resulting data set.
•
Likelihood is calculated as the sum of the logs of the likelihoods for
each variant site.
•
The tree with the highest likelihood is chosen.
Topology testing
•
Likelihood calculated for each tree as L = Σln(λi)
•
The log-likelihood test uses the differences in likelihood to
eliminate topologies with significantly lower likelihoods
•
All other trees are not significantly different; this is a
dendrogram “neighborhood” of equally good trees
•
Variant branches can be collapsed to yield a consensus tree
Trade-offs in Alignments
Many kinds of data must be weighed
• Homologous positions must be assigned
• Relative weighting of transitions and transversions in making
alignment
• Relative weighting of transitions and transversions in
assessing divergence
• Relative weight of assigning a gap
• Relative weight of increasing gap length
  • Different for protein coding & non-coding sequences

             Scheme A   Scheme B
Sequence 1   GT-AC      GT-AC
Sequence 2   G-CAG      GC-AG
Sequence 3   GTCAC      GTCAC
•
In both schemes, three events occur
•
In scheme A, there are two insertion/deletion events and
one nucleotide substitution
•
In scheme B, there is one insertion/deletion event and two
nucleotide substitutions
Testing Complex Trees : A Naïve Approach
•
Felsenstein’s and other tests work on small numbers of taxa
•
Therefore, we can test specific 4-taxon subsets to determine
which clades are robust
Imagine a complex phylogeny of taxa A to J; we could
analyze this as follows:
[Figure: phylogeny of taxa A, B, C, D, E, F, G, H, I and J]
1. Test (A,B) , (C,D) to determine if ‘C’ is
excluded from (A,B) clade
2. Test (D,E) , (F,G) to determine if ‘F’ is
excluded from (D,E) clade
3. Test (G,H) , (I,J) to determine if those are
robust clades
4. Test [(A,C) , (D,F)] AND [(A,B) , (E,F)] AND [(B,C) , (D,F)] to
determine if those clades are distinct.
5. But wait, that’s only 3 of the possible 9 combinations; should
we do all 9? What if 8 support and 1 doesn’t?
6. In test 1, does it matter that the ‘outgroup’ is ‘D’? Should we
try all 7 outgroups?
The 1% Inclusion Parsimony Approach
•
Examine all trees within 1% of the tree length of the most-parsimonious tree
•
Assign confidence in nodes according to what percentage of
trees include that clade
This is somewhat arbitrary in two ways:
(1) Why are trees within 1% of the most-parsimonious branch
length chosen?
(2) How do we interpret “confidence” values?
Testing Trees : Resampling Approaches
The Jackknife
•
Resample data points at random without replacement
• If resample size M = 50% of the sample size N, then the variance
of the distribution of the resampled parameter is equivalent
to the variance of the original parameter, since
$\sigma^2 = \tilde{\sigma}^2 \, \dfrac{M}{N - M}$
• Robust nodes appear in >95% of trees made with resampled
datasets; typically 100 - 10,000 trees are examined
The Bootstrap
•
Resample N-1 data points at random with replacement
•
Recalculate topology as above
•
Robust nodes appear in >95% of trees made with resampled
datasets; typically 100 - 10,000 trees are examined
Advantages : Method for assessing reliability is independent
of tree construction method
Drawback : Can be computationally intensive
The Problem With Parsimony
For three taxa there is only 1 unrooted tree and three possible
rooted trees: (A,B),C and (A,C),B and (B,C),A.
But these numbers grow fast. For N taxa there are:
• $(2N-3)! \,/\, [2^{N-2}(N-2)!]$ rooted trees and
• $(2N-5)! \,/\, [2^{N-3}(N-3)!]$ unrooted trees

N     Rooted
2     1
3     3
4     15
5     105
6     945
7     10,395
8     135,135
9     2,027,025
10    34,459,425
12    13,749,310,575
14    7,905,853,580,625
16    6,190,283,353,629,375
18    6,332,659,870,762,850,625
20    8.200 E+021
22    1.311 E+025
24    2.537 E+028
26    5.843 E+031
28    1.579 E+035
30    4.951 E+038
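The two counting formulas are easy to evaluate exactly with integer arithmetic. The following sketch uses illustrative function names and reproduces the first rows of the table.

```python
from math import factorial

def rooted_trees(n):
    """Number of rooted bifurcating trees for n >= 2 taxa."""
    return factorial(2 * n - 3) // (2 ** (n - 2) * factorial(n - 2))

def unrooted_trees(n):
    """Number of unrooted bifurcating trees for n >= 3 taxa."""
    return factorial(2 * n - 5) // (2 ** (n - 3) * factorial(n - 3))

for n in (2, 3, 4, 5, 10):
    print(n, rooted_trees(n))
```

An unrooted tree of N taxa can be rooted on any of its 2N - 3 branches, which is why the rooted count for N - 1 taxa equals the unrooted count for N.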
How Good Is It?
Like distance methods, parsimony will give you a tree,
although you may not get a “most-parsimonious” tree.
How good is it? Consider these two data sets:
Taxon    Data Set 1     Data Set 2
A        GGGCCAATTAA    GGGCCAATTAA
B        GGGCCAATGCC    GGGCCTTGGCC
C        CAATTTTGTCC    AAATTAATTCC
D        CAATTTTGGAA    AAATTTTGGAA
Steps    14             17
Chars    8 OF 11        5 OF 11
[Figure: four-taxon tree of A, B, C and D]
History of Classification
God (4500 BC): not-A groups
Noah (3500 BC): Cladistic characters
Plato (427-347 BC): Idealized Form
Aristotle (384-322 BC): Scala Naturae
Hans and Zacharias Janssen (1600): Microscope
Marcello Malpighi (1628-1694): Cellular organization
Robert Hooke (1635-1702): Cells
Anton van Leeuwenhoek (1632-1723): Describe bacteria
Carl von Linné (1707-1778): Systema Naturae
Otto F. Müller (1730-1784): 379 Animalcule descriptions
Antoine-Laurent de Jussieu (1748-1836): Major divisions of plants
Georges Cuvier (1769-1832): Major animal phyla
Christian Ehrenberg (1795-1876): Included bacteria in systematics
Georges-Louis Buffon (1707-1788): Not all species present at the Creation
Thomas Malthus (1766-1834): Exponential growth
Georges Cuvier (1769-1832): Catastrophes
Louis Agassiz (1807-1873): Serial Creation
James Hutton (1726-1797): Old Earth
Charles Lyell (1797-1875): Old Earth
Jean-Baptiste Lamarck (1744-1829): Inheritance of acquired characters
Charles Darwin (1809-1882): Natural Selection
Louis Pasteur (1822-1895): Microbial processes
Ernst Haeckel (1834-1919): Evolutionary classification
Edouard Chatton (1883-1947): Prokaryote/eukaryote dichotomy
Herbert Copeland (1902-1968): Reclassification
Robert Whittaker (1920-1980): “Modern” classification
Emil Zuckerkandl & Linus Pauling: Molecular clocks
Motoo Kimura and Tom Jukes: Neutral theory
Carl Woese and George Fox: Molecular phylogeny
Naoyuki Iwabe & Takashi Miyata: Rooting the tree of life via EF’s
Peter Gogarten: Rooting the tree of life via ATPases
Brian Golding & Radhey Gupta: Eukarya by Fusion
Phylogeny I
God (4500 BC): Heavens | Earth
Yet the “Heavens” have no defining characteristics
Noah (3500 BC): Animals | Plants
Potential Introduction of Hierarchy in Classification: Living Things → Animals | Plants
Phylogeny II
Aristotle (350 BC): Animal | Vegetable | Mineral
Classification, but lacking Hierarchy
Aristotle (350 BC): Air | Earth | Fire | Water
Classification, but lacking Hierarchy
Phylogeny III
Linnaeus (AD 1743)
Animalia Plantae
Completely Hierarchical
(even to non-living things!)
Animals
Vertebrates
Birds
Mammals
Infusoria
Invertebrates
Chatton (AD 1937)
Eukaryotes
Prokaryotes
Introduced Polarity,
or Time, Into Lines of
Phylogenetic Descent
Copeland (AD 1956)
Animalia
Plantae
Fungi
Protista
Monera
Whittaker (AD 1959)
Eukaryotes
Prokaryotes
Animalia Plantae Protista Fungi Monera
- Incorporation of Chatton’s distinction
Association Coefficients Between representative members of
the Three Primary Kingdoms
    Organism                    1    2    3    4    5    6    7    8    9   10   11   12   13
 1  S. cerevisiae               -  .29  .33  .05  .06  .08  .09  .11  .08  .11  .11  .08  .08
 2  Lemna minor               .29    -  .36  .10  .05  .06  .10  .09  .11  .10  .10  .13  .07
 3                            .33  .36    -  .06  .06  .07  .07  .09  .06  .10  .10  .09  .07
 4  Escherichia coli          .05  .10  .06    -  .24  .25  .28  .26  .21  .11  .12  .07  .12
 5  Chlorobium vibrioforme    .06  .05  .06  .24    -  .22  .22  .20  .19  .06  .07  .06  .09
 6  Bacillus firmus           .08  .06  .07  .25  .22    -  .34  .26  .20  .11  .13  .06  .12
 7  C. diphtheriae            .09  .10  .07  .28  .22  .34    -  .23  .21  .12  .12  .09  .10
 8  Aphanocapsa 6714          .11  .09  .09  .26  .20  .26  .23    -  .31  .11  .11  .10  .10
 9  Chloroplast (Lemna)       .08  .11  .06  .21  .19  .20  .21  .31    -  .14  .12  .10  .12
10  Methanobacterium th.      .11  .10  .10  .11  .06  .11  .12  .11  .14    -  .51  .25  .30
11  M. ruminantium            .11  .10  .10  .12  .07  .13  .12  .11  .12  .51    -  .25  .24
12  Methanobacterium sp.      .08  .13  .09  .07  .06  .06  .09  .10  .10  .25  .25    -  .32
13  Methanosarcina barkeri    .08  .07  .07  .12  .09  .12  .10  .10  .12  .30  .24  .32    -
Similarities determined by SAB analysis of ribosomal RNAs
Phylogeny VI
Woese (AD 1977)
Eukaryotes
Eubacteria
Archaebacteria
Woese, implied (AD 1977)
Eukaryotes
Archaebacteria
Eubacteria
Rooting the Tree of Life : Use of Duplicate Genes
Probability of Recovering the Specified Relationships
EF-G
EF-Tu
1.00
0.79
0.96
0.00
0.00
0.21
0.03
0.12
0.00
0.00
0.00
0.01
Eucarya
Eubacteria
Archaea
Outgroup gene
ATPase F1-
1.00
0.33
0.09
Archaea
Eubacteria
Eucarya
Outgroup gene
ATPase F1-
0.55
0.41
Archaea
Eucarya
Eubacteria
Outgroup gene
tRNA Met-E
0.50
Gene
tRNA Met-I
Phylogeny VII
Iwabe (AD 1989) & Gogarten (1989)
Bacteria
Archaea
Eukarya
Golding and Gupta (AD 1995)
Eubacteria
Eukaryotes
Mitochondrion
Cyanobacteria
Archaebacteria
Gene Families
• Most genes have homologues in closely-related taxa whose
products perform similar functions; these pairs of genes are
called orthologues
• In addition, many genes have homologues within the same
genome which perform some different function; these pairs
of genes are called paralogues
What Functions Do Paralogues Play?
• They may act on different substrates (e.g., an enzyme with a
different binding site, or a protein kinase with a different
target)
• They may act in different tissue types or at different
developmental stages (e.g., embryonically-expressed globin
genes)
• They may be regulated in response to differential
environmental conditions to perform the same job, but for
different reasons (e.g., nitrate reductase for reducing nitrate
as an electron sink, or reducing it to provide ammonia for
assimilation)
How Do Gene Families Arise?
The Classic Model
• A gene duplicates within the genome; typically, an unequal crossing-over event is invoked
• The additional copy is free to evolve a novel function, or
novel regulatory regime, since the original copy performs its
original function
• Both copies are then maintained by selection
Yet this scenario is not as rosy as it sounds….
Problems with the Classic Model
• Dosage compensation after duplication may select for
organisms that have eliminated the duplicated copy
• After duplication, there is no selection to prevent deleterious
mutations from eliminating gene function.
• If an advantageous mutation arises, it must have a
sufficiently large benefit that elimination of this newly-created
form by mutation and drift is counter selected; this is difficult
in small populations, especially those seen in eukaryotes
• After duplication and gain of advantageous function, gene
conversion may homogenize the two copies
• Duplication of a single gene may be insufficient to provide for
a novel phenotype; for example, a new signaling cascade
will require a new receptor, MAPK, MAPKK, MAPKKK, etc.
An Alternative Model
• Genes “duplicate” every time an organism reproduces.
• Consider a population (Pop’n A) where an entire pathway
experiences selection to perform an alternative function; in
Population B, the original function is maintained
• This function will likely never be achieved, since it would
require abandoning its original function, which may be
essential
• However, substantial headway may be gained in pursuing
the alternative form at the expense of the original form
• Admixture of population A and B produces heterozygotes
with an advantage; that is, both pathways are now found in
the same cell, leading to heterosis.
• This is an unstable state, since only 50% of the progeny of
heterozygotes are also heterozygotes.
• NOW, unequal crossing-over or other chromosome
gymnastics will allow for presence of both pathways in all
offspring.
• In this model, genes diverge UNDER SELECTION, prior to
reintroduction into the same cytoplasm (duplication).
Keeping Pathways Together I
• What prevents “mixing” of the genes of the two pathways via
gene-conversion and meiosis?
• R.A. Fisher (1930) proposed that natural selection could
maintain groups of cooperating advantageous genes; this
idea was extended by Botstein and Suskind to suggest that
this selection would lead to clustering of these genes on the
eukaryotic chromosome.
• Consider two loci, each with two alleles (A & a; B & b)
• Consider A works well with B, and a works well with b
• Therefore, the fitness of AB/AB cells and ab/ab cells would
be higher than heterozygotes, especially Ab/Ab or aB/aB.
• This would lead to APPARENT linkage disequilibrium
between loci A & B due to counterselection of the classes
of heterozygotes.
• This selection would lead to an increase in the ACTUAL
linkage disequilibrium (decrease in chromosomal distance)
so that heterozygote disruption of coadapted gene
complexes would be minimized.
• This model requires high-frequency recombination.
Keeping Pathways Together II
• The above model does not work for haploid organisms
with minimal amounts of recombination that could disrupt
coadapted gene complexes.
• Yet in these organisms, coadapted gene complexes are
found in very tight clusters (bacterial operons).
• How do genes attain such tight clustering, especially since
the primary mechanism for juxtaposition (deletion) would
likely remove important genes from the chromosome?
What genes are clustered?
• The bacterial operon allows for coregulation of genes, as
well as reducing their disruption by recombination.
• Yet coregulation is not a plausible influence for selection
for the origin of the gene cluster, since a very tiny
advantage would be conferred by adding only one gene to
a cluster at a time.
• Moreover, virtually none of the very important,
coordinately-regulated genes are found in operons
• In contrast, many operons encode peripheral metabolic
functions with lower selective value.
Keeping Pathways Together III
• Therefore, we must consider that the selection for the
ORIGIN of a feature may not be the same as the selection
for the MAINTENANCE of a feature.
• One advantage to a gene cluster is that it allows
mobilization of all of the genes responsible for a selectable
function or phenotype to be transferred in a single step;
transfer of one individual gene would not.
• After transfer of genes that are only moderately clustered,
intervening genes would be deleted, since only those
genes under selection would be maintained; this results in
a tight gene cluster or operon, which can be expressed by
a host promoter at the site of insertion.
• The operon exploits the capability of prokaryotes to direct
the synthesis of numerous proteins from a single transcript
• Transfer of operons among bacteria and from bacteria to
eukaryotes is a powerful mechanism for allowing
recipients to gain novel metabolic capabilities
Bacterial genes are organized into clusters
Both bacterial and eukaryotic genes can be
clustered via selection for proximity to cis-acting regulatory sequences
β-Globin Locus : Developmental regulation via proximity to LCR
Both bacterial and eukaryotic genes can be
clustered via selection for proximity to cis-acting genes