EVOLUTION OF GLOBINS
Evolution of Globins
Evolution of visual pigments and related molecules
Evolution of gene clusters
• Many genes occur as multigene families (e.g., actin, tubulin, globins, Hox)
– Inference is that they evolved from a common ancestor
– Families can be
• Clustered – nearby on chromosomes (α-globins, HoxA)
• Dispersed – on various chromosomes (actin, tubulin)
• Both – related clusters on different chromosomes (α,β-globins, HoxA,B,C,D)
– Members of clusters may show stage- or tissue-specific expression
• Implies means for coregulation as well as individual regulation
Evolution of gene clusters
• Multigene families (contd)
– Gene number tends to increase with evolutionary complexity
• Globin genes increase in number from primitive fish to humans
– Clusters evolve by duplication and divergence
• History of gene families can be traced by comparing sequences
– Molecular clock model holds that rate of change within a group is relatively constant
• Not totally accurate – check rat genome sequence paper
– Distance between related sequences combined with clock leads to inference about when duplication took place
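The last inference can be made concrete with a small calculation. The sketch below assumes a strict clock: after a duplication both copies accumulate changes, so a distance d between them at a per-lineage rate r corresponds to a time t = d / (2r). The function name and the numbers are illustrative assumptions, not values from the slides.

```python
def duplication_time(distance, rate):
    """Time since duplication under a strict molecular clock.

    distance: substitutions per site separating the two gene copies
    rate: substitutions per site per year along ONE lineage (assumed constant)
    """
    if rate <= 0:
        raise ValueError("rate must be positive")
    # Both copies diverge from the ancestral gene, so changes
    # accumulate along two branches: d = 2 * r * t.
    return distance / (2.0 * rate)


# Illustrative numbers only (hypothetical, not from the slides).
t = duplication_time(distance=0.2, rate=1e-9)
print(f"estimated duplication time: {t:.3g} years ago")
```

Because the clock is "not totally accurate", an estimate of this kind is only as good as the assumed rate.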
Classic phylogenetic studies of sequence
conservation: the globins
The globins are the best studied family in
terms of sequence conservation, partly
because they were one of the first families
for which multiple members were
sequenced, and partly because some of the
earliest protein structures (in fact, the
earliest) solved were globins. The classic
papers of Perutz, Kendrew and Watson
were the first to correlate sequence
conservation with aspects of protein
structure and function. They drew their
conclusion based on only a few aligned
sequences. Later globin studies, such as
that of Bashford, Chothia and Lesk,
expanded the analyses of globin sequence
conservation to include hundreds of
sequences.
Perutz, Kendrew & Watson J Mol Biol 13, 669 (1965)
Bashford, Chothia & Lesk J Mol Biol 196, 199 (1987)
Scapharca inaequivalvis
oxygenated hemoglobin
Conservation of functional residues
There were only 2 perfectly conserved residues among the 8 known globin structures at the time of the Bashford et al. study. These are residues critical in binding of heme and/or interaction with heme-bound oxygen. It will often be found that the best conserved residues in related proteins are those involved in critical aspects of the general function.
[Figure labels: Phe 43, heme, His 87]
Residues involved in more specific aspects of function may or may not be
conserved, depending upon the relationship between the proteins under
consideration. For example, residues involved in substrate specificity for
serine proteases may be conserved among orthologs, such as the
chymotrypsins, but not between paralogs, such as chymotrypsins and
trypsins.
Conservation at buried positions
• core residues, which are usually hydrophobic, often tolerate
conservative substitutions, i.e. to other hydrophobics
• overall core volume is well-conserved (Lim & Ptitsyn, 1970) though
individual core positions tolerate variation in volume
• this reflects what we know about packing and the effects of core
mutations on stability--thus sequence conservation is partly related to
maintaining a stable structure
[Figure: portion of alignment of prokaryotic and eukaryotic globins, and human hemoglobin beta chain. Labels: Y140, H156, buried. Color key: yellow = small neutral/polar; green = hydrophobic; red/pink = polar/acidic; blue = basic]
Conservation at solvent-exposed positions
• solvent-exposed (surface) positions are mutable and usually tolerate
mutation to many residue types including hydrophobics. Bashford et al.,
however, noted that for globins at least, some surface positions do not
tolerate large hydrophobics. Since polar-to-hydrophobic mutations on protein
surfaces do not reduce stability, this conservation could reflect constraints
on solubility. Indeed, it is clear that the overall polar character of the
surface is conserved for soluble, globular proteins, even though a certain
number of hydrophobics may be tolerated.
[Figure: globin alignment and human hemoglobin beta chain, showing examples of surface residues. Labels: Y140, H156. Color key: yellow = small neutral/polar; green = hydrophobic; red/pink = polar/acidic; blue = basic]
Conservation of loops and turns
• “Spacer” regions between secondary structures, such as loops and
turns, are often hypermutable and vary not only in sequence but in
length, tolerating insertion and deletion events. (Insertions and deletions
are much less often found within secondary structure elements. Why?)
[Figure: part of alignment of animal hemoglobin α and β chains; human α chain]
Are the α and β chains related to each other by paralogy or orthology?
Sequence identity and homology: poor coverage
The two proteins (hemoglobin and myoglobin) have the same fold, and both bind heme and oxygen in the same place: good independent structural/functional evidence for homology...
Yet alignments of their sequences reveal only 24% identity. There are also many examples of related globins and other proteins with much lower identity than this.
[Figure: 1MBO and 1HBB, hemoglobin and myoglobin]
Any reasonable sequence identity criterion, whether it is a flat percent cutoff or a length-dependent cutoff, will give incomplete coverage. In other words, it will fail to identify many distant but true relationships.
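A flat percent cutoff is easy to state in code, which also makes its weakness visible. The sketch below is a minimal illustration: the function names, the 30% default cutoff and the toy peptides are assumptions made for the example, not part of the slides.

```python
def percent_identity(seq_a, seq_b, gap="-"):
    """Percent identity over aligned columns where neither sequence has a gap."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    compared = matches = 0
    for a, b in zip(seq_a, seq_b):
        if a == gap or b == gap:
            continue  # skip gapped columns
        compared += 1
        matches += (a == b)
    return 100.0 * matches / compared if compared else 0.0


def passes_flat_cutoff(seq_a, seq_b, cutoff=30.0):
    """Flat percent cutoff; the 30% default is an arbitrary illustrative value."""
    return percent_identity(seq_a, seq_b) >= cutoff


# Two short toy peptides (hypothetical), already aligned.
print(round(percent_identity("YADLGRIKS", "YSDLGSEKE"), 1))
```

With a cutoff like this, a pair at 24% identity would be rejected even when structure and function argue for homology, which is the coverage problem described above.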
Evolutionary analysis: one step into the a priori prediction
[Figure: tree of Seq1 to Seq11 with a consensus sequence; substitutions labelled Synonymous and Non-synonymous]
Consensus: AAT GGC TCT TTT GAA AAA ... (slide labels: N G F F N K)
Seq2: AAC GGA TGT TTC GAG AAA ... (slide labels: N G C F E K)
Consensus, second panel: AAT GGC TGT TTT GAA AAA ... (slide labels: N G C F N K, with E marked)
[Figure: number of individuals plotted against number of mutations, with regions labelled Positive selection, Neutrally fixed and Purifying selection]
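The distinction between synonymous and non-synonymous changes can be checked mechanically by translating each codon. The sketch below uses the standard genetic code and the consensus and Seq2 codons printed on the slide; the helper names are assumptions, and the count is per differing codon rather than per nucleotide substitution.

```python
from itertools import product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, codons enumerated in TCAG order (TTT, TTC, TTA, ...).
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def codons(seq):
    """Split a nucleotide string into codons, ignoring spaces and any trailing partial codon."""
    seq = seq.replace(" ", "").upper()
    return [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]


def translate(seq):
    return "".join(CODON_TABLE[c] for c in codons(seq))


def classify_differences(ref, other):
    """Count differing codons that keep (synonymous) or change (non-synonymous) the amino acid."""
    syn = nonsyn = 0
    for a, b in zip(codons(ref), codons(other)):
        if a == b:
            continue
        if CODON_TABLE[a] == CODON_TABLE[b]:
            syn += 1
        else:
            nonsyn += 1
    return syn, nonsyn


consensus = "AAT GGC TCT TTT GAA AAA"
seq2 = "AAC GGA TGT TTC GAG AAA"
print(translate(consensus), translate(seq2))
print(classify_differences(consensus, seq2))
```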
Neutral evolution vs selection
Non-synonymous nucleotide substitution changes -> amino acid replacements -> protein function or structure
Neutral Theory of molecular evolution
[Figure: biological fitness (W) plotted against amino acid changes, with regions labelled Purifying selection, Neutrality and Positive selection]
Measuring the strength of selection
$$\omega = \frac{\text{Non-synonymous}\ (d_N)}{\text{Synonymous}\ (d_S)}, \qquad d_N = \frac{N}{n}, \qquad d_S = \frac{S}{s}$$
ω = 1: Neutrality
ω < 1: Purifying selection
ω > 1: Positive selection
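As a rough worked example of the ratio, the sketch below computes dN and dS as substitutions per site and classifies the result. It uses raw proportions with no correction for multiple hits, and the counts are hypothetical; both are simplifying assumptions.

```python
def omega(nonsyn_subs, nonsyn_sites, syn_subs, syn_sites):
    """Return (dN, dS, dN/dS) from raw counts of substitutions and sites."""
    d_n = nonsyn_subs / nonsyn_sites
    d_s = syn_subs / syn_sites
    if d_s == 0:
        raise ZeroDivisionError("dS is zero; the ratio is undefined")
    return d_n, d_s, d_n / d_s


def selection_regime(w, tol=1e-9):
    """Map the ratio onto the three regimes listed on the slide."""
    if abs(w - 1.0) <= tol:
        return "neutrality"
    return "purifying selection" if w < 1.0 else "positive selection"


# Hypothetical counts for illustration only.
d_n, d_s, w = omega(nonsyn_subs=5, nonsyn_sites=200, syn_subs=10, syn_sites=80)
print(d_n, d_s, w, selection_regime(w))
```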
Two ways of testing the functional importance of peptide regions
• Experimental (Functional Biologists)
– Serial deletions and random directed mutagenesis
• Predictive (Evolutionary Biologists)
– Evolutionary and structural analysis
Methods to detect adaptive evolution using DNA divergence data
Inputs: multiple alignment and tree
Sq1: ...ATGGGCGTC...
Sq2: ...ATGGACGTA...
Sq3: ...ATGGGAGAG...
Sq4: ...ATGAGCGTC...
A: Maximum-likelihood models
B: Kimura-based models
Method panels (labelled A1 to A4 and B1 to B3 on the slide):
• Models to detect adaptive evolution at single codon sites
• Models to detect adaptive evolution at specific lineages of the tree
• Parsimony method to detect selection at single sites
• Sliding-window based methods
[Figure: flowchart linking the alignment and the tree of Sq1 to Sq4 (branches a and b, nodes 1 to 6) to each method panel]
Different levels of protein’s function and evolution
• Intra-molecular co-evolution
• Inter-protein/gene co-evolution
• Co-evolution/interaction between two different biological systems
Tully and Fares (2006) Evol. Bioinf.
Covariation analysis
Substitution patterns at different positions in a sequence alignment are
not necessarily independent. This is sometimes referred to as
covariation or correlated evolution.
name   sequence
A      YADLGRIKS
B      YSDLGSEKE
C      IDDFGEIAA
D      IDDFGVIGT
For example, in the mini multiple alignment shown above, the identity of the residue at the 4th position is correlated to the identity of the residue at the 1st position.
A statistical perturbation analysis can be used to characterize this
covariation. An alignment of related sequences is “perturbed” by
only considering sequences at which, for example, the first position is
Y. The effect of this perturbation on the residue distribution observed
at other positions is then measured. If the distribution changes
significantly, covariation between sequence changes at the first site
and other sites in the alignment is inferred.
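The perturbation idea can be tried directly on the four-sequence mini alignment. In the sketch below the alignment is restricted to sequences with Y at position 1 and the residue distribution at other positions is compared before and after. Measuring the change with total variation distance is an assumption made for illustration; a real analysis would also need a significance test and far more sequences.

```python
from collections import Counter

# Mini alignment from the slide.
ALIGNMENT = {
    "A": "YADLGRIKS",
    "B": "YSDLGSEKE",
    "C": "IDDFGEIAA",
    "D": "IDDFGVIGT",
}


def column_frequencies(seqs, position):
    """Residue frequencies at a 1-based alignment position."""
    counts = Counter(s[position - 1] for s in seqs)
    total = sum(counts.values())
    return {res: n / total for res, n in counts.items()}


def perturb(seqs, position, residue):
    """Keep only sequences carrying `residue` at the 1-based `position`."""
    return [s for s in seqs if s[position - 1] == residue]


def distribution_shift(before, after):
    """Total variation distance between two residue distributions (0 = unchanged)."""
    residues = set(before) | set(after)
    return 0.5 * sum(abs(before.get(r, 0.0) - after.get(r, 0.0)) for r in residues)


seqs = list(ALIGNMENT.values())
subset = perturb(seqs, position=1, residue="Y")
for pos in (3, 4):
    before = column_frequencies(seqs, pos)
    after = column_frequencies(subset, pos)
    print(pos, before, after, distribution_shift(before, after))
```

Position 3 is invariant, so the perturbation leaves it unchanged, whereas position 4 shifts from an even L/F split to all L, which is the covariation with position 1 noted in the text.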
Covariation and hydrophobic core packing
The hydrophobic core residues in related
proteins tend to be covariant due to
constraints on core packing. One sees
compensatory volume changes at different
positions.
Davidson and coworkers found that for 266
aligned SH3 domain sequences, the
strongest covariation was observed for a
cluster of central hydrophobic residues.
For example, substitution of a smaller residue
(Ala->Gly) at 39 was strongly correlated to
substitution of a larger residue (Ile->Phe) at
50.
Hydrophobic core of SH3
domains, with most frequently
covarying residues shown in
yellow
S.M. Larson, A.A. DiNardo and A.R. Davidson, J Mol Biol 303, 433 (2000)
Some recent studies (Suel
et al) have suggested a
connection between
covarying clusters of
residues and transduction
of signals between distant
sites in proteins.
For example, G-protein
coupled receptors bind a
ligand on one side of a
membrane, and then
transduce that signal to the
other side through
conformational change.
Suel et al showed that
the main clusters of
covarying residues tended
to connect the ligand and
G-protein binding sites.
[Figure labels: ligand; covarying networks (brown); membrane; G-protein binding sites]
Suel et al. Nat Struct Biol 2003
A novel method to detect co-evolution in protein-coding genes
(Fares and Travers, Genetics 2006)
[Figure: example alignment built from repeated copies of four sequences: AAMWCGPCPNDEE, CAMCCGMCMNDEE, CAMDCGACANDEE, AAMMCGCCCNDEE]

Time-corrected substitution value for residues e and k in sequences i and j:
$$(\theta_{ek})_{ij} = B_{ek} \times \frac{1}{t_{ij}}$$
Mean value at sites A and B:
$$\bar{\theta}_A = \frac{1}{T}\sum_{S=1}^{T} (\theta_{ek})_S, \qquad \bar{\theta}_B = \frac{1}{T}\sum_{S=1}^{T} (\theta_{ek})_S$$
Variability at sites A and B:
$$\hat{D}_{ek} = \left[(\theta_{ek})_{ij} - \bar{\theta}_A\right]^2, \qquad \hat{D}_{ek} = \left[(\theta_{ek})_{ij} - \bar{\theta}_B\right]^2$$
$$\bar{D}_A = \frac{1}{T}\sum_{S=1}^{T}\left[(\theta_{ek})_S - \bar{\theta}_A\right]^2, \qquad \bar{D}_B = \frac{1}{T}\sum_{S=1}^{T}\left[(\theta_{ek})_S - \bar{\theta}_B\right]^2$$
Correlation between sites A and B (in each product the first bracket is evaluated at site A and the second at site B):
$$\rho_{AB} = \frac{\sum_{S=1}^{T}\left[(\hat{D}_{ek})_S - \bar{D}_A\right]\left[(\hat{D}_{ek})_S - \bar{D}_B\right]}{\sqrt{\sum_{S=1}^{T}\left[(\hat{D}_{ek})_S - \bar{D}_A\right]^2 \, \sum_{S=1}^{T}\left[(\hat{D}_{ek})_S - \bar{D}_B\right]^2}}$$
Testing the significance of the correlation coefficient
[Equations not fully recoverable: the surviving fragments refer to 1000 resampled values ρ_i, a sum over i = 1 to 1000 scaled by 1/1000, a Z statistic scaled by s(ρ), and the probability P(ρ_i > 0.95)]
[Figure labels: Clade 1 > 75%; Clade 2 > 75%; sequence alignment; 3D; tree]
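The quantities on this slide (a time-corrected value θ, its squared deviation from the site mean, and the correlation ρ of those deviations between two sites) can be sketched as follows. This is one reading of the slide's equations, not the published implementation; the substitution scores and divergence times are hypothetical inputs, with one entry per pairwise sequence comparison.

```python
import math


def theta(score, time):
    """theta = B_ek * (1 / t_ij): substitution score corrected by divergence time."""
    return score / time


def variabilities(thetas):
    """Squared deviation of each theta from the site mean (the D-hat values)."""
    mean_theta = sum(thetas) / len(thetas)
    return [(x - mean_theta) ** 2 for x in thetas]


def site_correlation(thetas_a, thetas_b):
    """Correlation of the D-hat values of two sites across pairwise comparisons."""
    d_a, d_b = variabilities(thetas_a), variabilities(thetas_b)
    mean_a = sum(d_a) / len(d_a)
    mean_b = sum(d_b) / len(d_b)
    num = sum((x - mean_a) * (y - mean_b) for x, y in zip(d_a, d_b))
    den = math.sqrt(sum((x - mean_a) ** 2 for x in d_a)
                    * sum((y - mean_b) ** 2 for y in d_b))
    return num / den if den else 0.0  # an invariant site gives no signal


# Hypothetical scores and divergence times for T = 4 pairwise comparisons.
times = [0.5, 1.0, 1.0, 2.0]
site_a = [theta(b, t) for b, t in zip([4, 1, -2, 6], times)]
site_b = [theta(b, t) for b, t in zip([8, 2, -4, 12], times)]
print(round(site_correlation(site_a, site_b), 3))
```

In this toy input the two sites vary in lockstep, so the coefficient is at its maximum; real alignments give intermediate values whose significance has to be assessed by resampling.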
Molecular co-evolution analyses: CAPS (Fares and McNally, Bioinformatics 2006)
Flow of information in CAPS
• Collate results from ‘re-sampling’ and ‘real’ data and sort by ρ
• Calculate probabilities of R-values applying the step-down permutational correction
• Identify groups of co-evolving pairs with P > 0.95
Example:
Re-sampling: ρ_1 = 0.1, ρ_2 = 0.15, ρ_3 = 0.35, ..., ρ_i = 0.40, ρ_{i+1} = 0.55, ..., ρ_{N-1} = 0.98, ρ_N = 0.99
Real: ρ_1 = 0.55, ρ_2 = 0.98
$$P(\rho \le 0.55) = \frac{i + 1}{N}$$
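The collate-and-rank step can be illustrated with a plain empirical rank. The sketch below places each real correlation among the resampled values and reports the fraction of collated values at or below it. It does not implement the step-down permutational correction named on the slide; the pair labels and the resampled distribution are hypothetical.

```python
def empirical_probability(real_value, resampled):
    """Rank of `real_value` among the collated (resampled + real) values, as a fraction."""
    collated = sorted(list(resampled) + [real_value])
    rank = sum(1 for v in collated if v <= real_value)
    return rank / len(collated)


def significant_pairs(real, resampled, threshold=0.95):
    """Return the real pairs whose empirical probability exceeds `threshold`."""
    return {pair: rho for pair, rho in real.items()
            if empirical_probability(rho, resampled) > threshold}


# Hypothetical resampled correlations; the two real values mirror the slide.
resampled = [i / 100 for i in range(1, 100)]   # 0.01 .. 0.99
real = {("A", "B"): 0.55, ("C", "D"): 0.98}
print({p: round(empirical_probability(r, resampled), 2) for p, r in real.items()})
print(significant_pairs(real, resampled))
```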
Comparative analysis of sensitivities
[Figure: sensitivity (true positives, 0 to 100) plotted against distance (0 to 1) for CAPS, MICK, Dependency and lnLCorr, with a panel labelled Divergence; summary panels show mean sensitivity against distance (0.1, 0.5, 1) and against number of sequences (10, 20, 30)]
Three-dimensional spheres to detect protein-protein interfaces
• Co-evolving amino acid sites
• Spheres of 4 Å radius
• Highly conserved sites at overlapping areas
• Co-evolving amino acids share properties of hydrophobicity and molecular weight
• Protein-protein interfaces could be predicted with greater accuracy
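One way to read the sphere idea is sketched below: draw a 4 Å sphere around each co-evolving site and report residues that fall inside at least two spheres. Treating "overlapping areas" this way is an interpretation rather than the authors' stated procedure, and the coordinates (one point per residue) are hypothetical.

```python
import math

RADIUS = 4.0  # angstroms, as on the slide


def sphere_members(center, coords, radius=RADIUS):
    """Residue ids whose coordinates lie within `radius` of `center`."""
    return {rid for rid, xyz in coords.items() if math.dist(center, xyz) <= radius}


def overlapping_sites(coevolving, coords, radius=RADIUS):
    """Residues falling inside the spheres of at least two co-evolving sites."""
    hits = {}
    for site in coevolving:
        for rid in sphere_members(coords[site], coords, radius):
            if rid != site:
                hits[rid] = hits.get(rid, 0) + 1
    return {rid for rid, n in hits.items() if n >= 2}


# Hypothetical residue coordinates (one point per residue, e.g. C-alpha).
coords = {
    10: (0.0, 0.0, 0.0),
    11: (3.0, 0.0, 0.0),
    12: (6.0, 0.0, 0.0),
    40: (20.0, 0.0, 0.0),
}
print(overlapping_sites(coevolving=[10, 12], coords=coords))
```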