CS790 – Introduction to Bioinformatics

Disulfide Bonds
▪ Two cysteines in close proximity can form a covalent bond between their sulfur atoms
▪ Called a disulfide bond, disulfide bridge, or dicysteine bond.
▪ Significantly stabilizes tertiary structure.
Protein Folding
Algorithms for Bioinformatics
1
Determining Protein Structure
▪ There are O(100,000) distinct proteins in the human proteome.
▪ 3D structures have been determined for 14,000 proteins, from all organisms
  • Includes duplicates with different ligands bound, etc.
▪ Coordinates are determined by X-ray crystallography
X-Ray Crystallography
[Figure: crystal; scale label ~0.5 mm]
• The crystal is a mosaic of millions of copies of the protein.
• As much as 70% is solvent (water)!
• May take months (and a “green thumb”) to grow.
X-Ray diffraction
▪ Image is averaged over:
  • Space (many copies)
  • Time (of the diffraction experiment)
Electron Density Maps
▪ Resolution is dependent on the quality/regularity of the crystal
▪ R-factor is a measure of “leftover” electron density
▪ Solvent fitting
▪ Refinement
The Protein Data Bank
▪ http://www.rcsb.org/pdb/
Columns: record type, atom serial number, atom name, residue name, chain, residue number, x, y, z (Å), occupancy, temperature factor, entry ID, line number
ATOM      1  N   ALA E   1      22.382  47.782 112.975  1.00 24.09      3APR 213
ATOM      2  CA  ALA E   1      22.957  47.648 111.613  1.00 22.40      3APR 214
ATOM      3  C   ALA E   1      23.572  46.251 111.545  1.00 21.32      3APR 215
ATOM      4  O   ALA E   1      23.948  45.688 112.603  1.00 21.54      3APR 216
ATOM      5  CB  ALA E   1      23.932  48.787 111.380  1.00 22.79      3APR 217
ATOM      6  N   GLY E   2      23.656  45.723 110.336  1.00 19.17      3APR 218
ATOM      7  CA  GLY E   2      24.216  44.393 110.087  1.00 17.35      3APR 219
ATOM      8  C   GLY E   2      25.653  44.308 110.579  1.00 16.49      3APR 220
ATOM      9  O   GLY E   2      26.258  45.296 110.994  1.00 15.35      3APR 221
ATOM     10  N   VAL E   3      26.213  43.110 110.521  1.00 16.21      3APR 222
ATOM     11  CA  VAL E   3      27.594  42.879 110.975  1.00 16.02      3APR 223
ATOM     12  C   VAL E   3      28.569  43.613 110.055  1.00 15.69      3APR 224
ATOM     13  O   VAL E   3      28.429  43.444 108.822  1.00 16.43      3APR 225
ATOM     14  CB  VAL E   3      27.834  41.363 110.979  1.00 16.66      3APR 226
ATOM     15  CG1 VAL E   3      29.259  41.013 111.404  1.00 17.35      3APR 227
ATOM     16  CG2 VAL E   3      26.811  40.649 111.850  1.00 17.03      3APR 228
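ATOM records like the ones on this slide are easy to read programmatically. The following is a minimal Python sketch, not a full PDB parser: it assumes the fields are separated by whitespace as they appear here (real PDB files use fixed column positions, so a robust reader should slice by column), and the names `Atom`, `parse_atom`, and `distance` are illustrative. It computes the distance between the alpha carbons of residues 1 and 2.

```python
import math
from typing import NamedTuple


class Atom(NamedTuple):
    serial: int
    name: str
    res_name: str
    chain: str
    res_seq: int
    x: float
    y: float
    z: float


def parse_atom(line: str) -> Atom:
    """Parse one ATOM record, assuming whitespace-separated fields."""
    f = line.split()
    if f[0] != "ATOM":
        raise ValueError("not an ATOM record")
    return Atom(int(f[1]), f[2], f[3], f[4], int(f[5]),
                float(f[6]), float(f[7]), float(f[8]))


def distance(a: Atom, b: Atom) -> float:
    """Euclidean distance between two atoms, in angstroms."""
    return math.dist((a.x, a.y, a.z), (b.x, b.y, b.z))


# The two CA records from the slide (ALA 1 and GLY 2).
records = [
    "ATOM 2 CA ALA E 1 22.957 47.648 111.613 1.00 22.40 3APR 214",
    "ATOM 7 CA GLY E 2 24.216 44.393 110.087 1.00 17.35 3APR 219",
]
ca1, ca2 = (parse_atom(r) for r in records)
ca_ca = distance(ca1, ca2)
print(f"CA-CA distance: {ca_ca:.2f} A")
```

Consecutive alpha carbons in a polypeptide sit about 3.8 Å apart, so this value is a handy sanity check that the coordinates were read in the right order.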
A Peek at Protein Function
▪ Serine proteases – cleave other proteins
  • Catalytic Triad: ASP, HIS, SER
Cleaving the peptide bond
Three Serine Proteases
▪ Chymotrypsin – Cleaves the peptide bond on the carboxyl side of aromatic (ring) residues (Trp, Phe, Tyr) and of the large hydrophobic residue Met.
▪ Trypsin – Cleaves after Lys (K) or Arg (R)
  • Positive charge
▪ Elastase – Cleaves after small residues: Gly, Ala, Ser, Cys
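The three specificities on this slide can be written down as a lookup table and applied to a sequence. This is a simplified Python sketch under stated assumptions: it cuts after every listed residue and ignores real-world exceptions (for example, trypsin cleaves poorly when the next residue is proline). The names `CLEAVAGE_RULES` and `digest` are illustrative.

```python
# Residues after which each protease cuts, as listed on the slide.
CLEAVAGE_RULES = {
    "chymotrypsin": set("WFYM"),
    "trypsin": set("KR"),
    "elastase": set("GASC"),
}


def digest(sequence: str, protease: str) -> list[str]:
    """Split a sequence after every residue the protease recognizes."""
    targets = CLEAVAGE_RULES[protease]
    fragments, start = [], 0
    for i, residue in enumerate(sequence):
        # Cut on the carboxyl side, but never after the last residue.
        if residue in targets and i < len(sequence) - 1:
            fragments.append(sequence[start:i + 1])
            start = i + 1
    fragments.append(sequence[start:])
    return fragments


seq = "CAENKLDHVRGPTCILFMTWYNDGP"
print(digest(seq, "trypsin"))
print(digest(seq, "chymotrypsin"))
```

Joining the fragments always gives back the input sequence, which is a quick way to check a rule set for off-by-one errors.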
Specificity Binding Pocket
The Protein Folding Problem
▪ Central question of molecular biology: “Given a particular sequence of amino acid residues (primary structure), what will the tertiary/quaternary structure of the resulting protein be?”
▪ Input: AAVIKYGCAL…
  Output: $\phi_1\psi_1, \phi_2\psi_2, \ldots$ = backbone conformation (no side chains yet)
Protein Folding – Biological perspective
▪ “Central dogma”: Sequence specifies structure
▪ Denature – to “unfold” a protein back to random coil configuration
  • β-mercaptoethanol – breaks disulfide bonds
  • Urea or guanidine hydrochloride – denaturant
  • Also heat or pH
▪ Anfinsen’s experiments
  • Denatured ribonuclease
  • Spontaneously regained enzymatic activity
  • Evidence that it re-folded to native conformation
Folding intermediates
▪ Levinthal’s paradox – Consider a 100 residue protein. If each residue can take only 3 positions, there are $3^{100} \approx 5 \times 10^{47}$ possible conformations.
  • If it takes $10^{-13}$ s to convert from 1 structure to another, exhaustive search would take $1.6 \times 10^{27}$ years!
▪ Folding must proceed by progressive stabilization of intermediates
  • Molten globules – most secondary structure formed, but much less compact than “native” conformation.
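The arithmetic behind Levinthal's paradox is easy to reproduce. A short Python check, using only the numbers quoted on the slide (the variable names are illustrative, and a year is taken as 365.25 days):

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600

residues = 100
states_per_residue = 3
time_per_step = 1e-13  # seconds to convert from one structure to another

# Number of conformations if every residue independently picks a state.
conformations = states_per_residue ** residues
# Time to visit each conformation once, converted to years.
years = conformations * time_per_step / SECONDS_PER_YEAR

print(f"conformations: {conformations:.1e}")
print(f"years for exhaustive search: {years:.1e}")
```

For comparison, the age of the universe is on the order of 10^10 years, so exhaustive search is out of the question by many orders of magnitude.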
Forces driving protein folding
▪ It is believed that hydrophobic collapse is a key driving force for protein folding
  • Hydrophobic core
  • Polar surface interacting with solvent
▪ Minimum volume (no cavities)
▪ Disulfide bond formation stabilizes the structure
▪ Hydrogen bonds
▪ Polar and electrostatic interactions
Folding help
▪ Proteins are, in fact, only marginally stable
  • Native state is typically only 5 to 10 kcal/mol more stable than the unfolded form
▪ Many proteins help in folding
  • Protein disulfide isomerase – catalyzes shuffling of disulfide bonds
  • Chaperones – break up aggregates and (in theory) unfold misfolded proteins
The Hydrophobic Core
▪ Hemoglobin A is the protein in red blood cells (erythrocytes) responsible for binding oxygen.
▪ The mutation E6V in the β chain places a hydrophobic Val on the surface of hemoglobin.
▪ The resulting “sticky patch” causes hemoglobin S to agglutinate (stick together) and form fibers which deform the red blood cell and do not carry oxygen efficiently.
▪ Sickle cell anemia was the first identified molecular disease.
Sickle Cell Anemia
Sequestering hydrophobic residues in the protein core protects proteins from hydrophobic agglutination.
Computational Problems in Protein Folding
▪ Two key questions:
  • Evaluation – how can we tell a correctly folded protein from an incorrectly folded protein?
    ▪ H-bonds, electrostatics, hydrophobic effect, etc.
    ▪ Derive a function, see how well it does on “real” proteins
  • Optimization – once we get an evaluation function, can we optimize it?
    ▪ Simulated annealing/Monte Carlo
    ▪ Evolutionary computing (EC)
    ▪ Heuristics
    ▪ We’ll talk more about these methods later…
Fold Optimization
▪ Simple lattice models (HP models)
  • Two types of residues: hydrophobic and polar
  • 2-D or 3-D lattice
  • The only force is hydrophobic collapse
  • Score = number of H–H contacts
Scoring Lattice Models
▪ H/P model scoring: count noncovalent hydrophobic interactions.
▪ Sometimes:
  • Penalize for buried polar or surface hydrophobic residues
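The basic H/P score can be sketched in a few lines. This Python example assumes a 2-D square lattice, takes the conformation as a list of (x, y) coordinates (one per residue), and counts H–H pairs that are lattice neighbours but not adjacent in the chain. The function name `hp_score` is illustrative, and the optional burial penalties mentioned on the slide are not included.

```python
def hp_score(sequence: str, coords: list[tuple[int, int]]) -> int:
    """Count H-H contacts between lattice neighbours that are not bonded."""
    position = {xy: i for i, xy in enumerate(coords)}
    if len(position) != len(coords):
        raise ValueError("conformation has a bump (two residues on one site)")
    contacts = 0
    for i, (x, y) in enumerate(coords):
        if sequence[i] != "H":
            continue
        # Look only right and up so each pair is counted once.
        for neighbour in ((x + 1, y), (x, y + 1)):
            j = position.get(neighbour)
            if j is not None and sequence[j] == "H" and abs(i - j) > 1:
                contacts += 1
    return contacts


# A four-residue chain folded into a unit square: residues 0 and 3 touch.
square = [(0, 0), (1, 0), (1, 1), (0, 1)]
print(hp_score("HPPH", square))
print(hp_score("HPHP", square))
```

Residues that are consecutive in the chain are always lattice neighbours, which is why the `abs(i - j) > 1` test is needed to count only noncovalent contacts.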
What can we do with lattice models?
▪ For smaller polypeptides, exhaustive search can be used
  • Looking at the “best” fold, even in such a simple model, can teach us interesting things about the protein folding process
▪ For larger chains, other optimization and search methods must be used
  • Greedy, branch and bound
  • Evolutionary computing, simulated annealing
  • Graph theoretical methods
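For very short chains the exhaustive search mentioned on this slide fits in a few lines. The Python sketch below assumes a 2-D square lattice and absolute moves (U, D, L, R), tries every move string, discards those with bumps, and keeps the fold with the most H–H contacts. The function names are illustrative. The number of move strings grows as 4^(n-1), so this is only practical for small n.

```python
from itertools import product

MOVES = {"U": (0, 1), "D": (0, -1), "L": (-1, 0), "R": (1, 0)}


def decode(moves):
    """Turn absolute moves into coordinates; return None on a bump."""
    x, y = 0, 0
    coords = [(x, y)]
    for m in moves:
        dx, dy = MOVES[m]
        x, y = x + dx, y + dy
        if (x, y) in coords:
            return None
        coords.append((x, y))
    return coords


def hh_contacts(sequence, coords):
    """Number of non-bonded H-H pairs on adjacent lattice sites."""
    site = {xy: i for i, xy in enumerate(coords)}
    total = 0
    for i, (x, y) in enumerate(coords):
        # Checking only right and up counts each pair once.
        for nb in ((x + 1, y), (x, y + 1)):
            j = site.get(nb)
            if j is not None and abs(i - j) > 1 and sequence[i] == sequence[j] == "H":
                total += 1
    return total


def best_fold(sequence):
    """Try every move string of length len(sequence) - 1 and keep the best."""
    best_moves, best_score = None, -1
    for moves in product("UDLR", repeat=len(sequence) - 1):
        coords = decode(moves)
        if coords is None:
            continue
        score = hh_contacts(sequence, coords)
        if score > best_score:
            best_moves, best_score = "".join(moves), score
    return best_moves, best_score


moves, score = best_fold("HPPHPPH")
print(moves, score)
```

Many folds tie for the best score (rotations and mirror images at the very least), so the move string returned depends on the enumeration order; only the score is meaningful here.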
Learning from Lattice Models
▪ The “hydrophobic zipper” effect: Ken Dill ~ 1997
Representing a lattice model
▪ Absolute directions
  • UURRDLDRRU
▪ Relative directions
  • LFRFRRLLFFL
  • Advantage: the immediate reversals possible in absolute directions (UD or RL) cannot occur
  • Only three directions: LRF
▪ What about bumps? LFRRR
  • Bad score
  • Use a better representation
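Decoding a relative-direction string means tracking a heading as well as a position. The Python sketch below assumes the chain starts at (0, 0) facing "up" (the slides do not fix an initial heading) and counts a bump whenever a move lands on a site that was already visited. The name `decode_relative` is illustrative.

```python
# Headings in clockwise order: up, right, down, left.
HEADINGS = [(0, 1), (1, 0), (0, -1), (-1, 0)]
# A left turn steps counter-clockwise, a right turn clockwise.
TURN = {"L": -1, "F": 0, "R": 1}


def decode_relative(moves: str) -> tuple[list[tuple[int, int]], int]:
    """Return the visited coordinates and the number of bumps."""
    heading = 0          # assumption: the chain initially faces "up"
    x, y = 0, 0
    coords = [(x, y)]
    bumps = 0
    for m in moves:
        heading = (heading + TURN[m]) % 4
        dx, dy = HEADINGS[heading]
        x, y = x + dx, y + dy
        if (x, y) in coords:
            bumps += 1
        coords.append((x, y))
    return coords, bumps


print(decode_relative("LFRRR")[1])
print(decode_relative("LFRFRRLLFFL")[1])
```

Three right turns in a row close a unit square, which is why LFRRR collides with the site reached by its first move even though no single step reverses direction.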
Preference-order representation
▪ Each position has two “preferences”
  • If it can’t have either of the two, it will take the “least favorite” path if possible
▪ Example: {LR},{FL},{RL},{FR},{RL},{RL},{FR},{RF}
▪ Can still cause bumps: {LF},{FR},{RL},{FL},{RL},{FL},{RF},{RL},{FL}
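One way to read the preference-order rule is sketched below in Python. The interpretation is an assumption, not something the slide spells out: at each step try the first preference, then the second, then the one remaining direction; if all three target sites are occupied, take the first preference anyway and count a bump. The chain is assumed to start at (0, 0) facing "up", and `decode_preferences` is an illustrative name.

```python
HEADINGS = [(0, 1), (1, 0), (0, -1), (-1, 0)]   # up, right, down, left
TURN = {"L": -1, "F": 0, "R": 1}


def decode_preferences(prefs):
    """Decode a list of two-letter preference strings into coordinates.

    Assumed rule: first preference, then second, then the remaining
    direction; if every target site is occupied, bump on the first choice.
    """
    heading, pos = 0, (0, 0)
    occupied = {pos}
    coords, bumps = [pos], 0
    for pref in prefs:
        last = (set("LFR") - set(pref)).pop()   # the "least favorite" move
        for turn in pref + last:
            h = (heading + TURN[turn]) % 4
            target = (pos[0] + HEADINGS[h][0], pos[1] + HEADINGS[h][1])
            if target not in occupied:
                break
        else:
            # All three sites ahead are taken: forced bump on first choice.
            h = (heading + TURN[pref[0]]) % 4
            target = (pos[0] + HEADINGS[h][0], pos[1] + HEADINGS[h][1])
            bumps += 1
        heading, pos = h, target
        occupied.add(pos)
        coords.append(pos)
    return coords, bumps


example = ["LR", "FL", "RL", "FR", "RL", "RL", "FR", "RF"]
bumpy = ["LF", "FR", "RL", "FL", "RL", "FL", "RF", "RL", "FL"]
print(decode_preferences(example)[1])
print(decode_preferences(bumpy)[1])
```

Under these assumptions the first example from the slide decodes without a collision, while the second walks into a dead end where all three neighbouring sites are occupied, which matches the slide's remark that this representation can still cause bumps.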
“Decoding” the representation
▪ The optimizer works on the representation, but to score, we have to “decode” it into a structure that lets us check for bumps and score.
▪ Example: How many bumps in: URDDLLDRURU?
▪ We can do it on graph paper
  • Start at 0,0
  • Fill in the graph
▪ In Perl we use a two-dimensional array
A two-dimensional array in Perl
use strict;
use warnings;
my $configuration = "URDDLLDRURU";
my $sequence      = "HPPHHPHPHHH";
my @grid;
# Mark every lattice site as empty
foreach my $i (0..100) {
    foreach my $j (0..100) {
        $grid[$i][$j] = "empty";
    }
}
# Start in the middle of the grid so moves left or down stay in range
my $x = 50;
my $y = 50;
my @moves    = split(//, $configuration);
my @residues = split(//, $sequence);
Setting up the grid
my $bumps = 0;
foreach my $move (@moves) {
    my $residue = shift(@residues);
    # Compare strings with "eq"; a single "=" would assign instead
    if    ($move eq "U") { $y++; }
    elsif ($move eq "R") { $x++; }
    elsif ($move eq "D") { $y--; }
    elsif ($move eq "L") { $x--; }
    if ($grid[$x][$y] ne "empty") {
        $bumps++;    # BUMP! this site is already occupied
    } else {
        $grid[$x][$y] = $residue;
    }
}
print "Bumps: $bumps\n";
More realistic models
▪ Higher resolution lattices (45° lattice, etc.)
▪ Off-lattice models
  • Local moves
  • Optimization/search methods and φ/ψ representations
    ▪ Greedy search
    ▪ Branch and bound
    ▪ EC, Monte Carlo, simulated annealing, etc.
The Other Half of the Picture
▪ Now that we have a more realistic off-lattice model, we need a better energy function to evaluate a conformation (fold).
▪ Theoretical force field:
  • $G = G_{\text{van der Waals}} + G_{\text{h-bonds}} + G_{\text{solvent}} + G_{\text{coulomb}}$
▪ Empirical force fields
  • Start with a database
  • Look at neighboring residues – similar to known protein folds?
Threading: Fold recognition
▪ Given:
  • Sequence: IVACIVSTEYDVMKAAR…
  • A database of molecular coordinates
▪ Map the sequence onto each fold
▪ Evaluate
  • Objective 1: improve scoring function
  • Objective 2: folding
Secondary Structure Prediction
AGVGTVPMTAYGNDIQYYGQVT…
A-VGIVPM-AYGQDIQY-GQVT…
AG-GIIP--AYGNELQ--GQVT…
AGVCTVPMTA---ELQYYG--T…
AGVGTVPMTAYGNDIQYYGQVT…
----hhhHHHHHHhhh--eeEE…
Secondary Structure Prediction
▪ Easier than folding
  • Current algorithms can predict secondary structure with 70-80% accuracy
▪ Chou, P.Y. & Fasman, G.D. (1974). Biochemistry, 13, 211-222.
  • Based on frequencies of occurrence of residues in helices and sheets
▪ PHD – Neural network based
  • Uses a multiple sequence alignment
  • Rost & Sander, Proteins, 1994, 19, 55-72
Chou-Fasman Parameters
Name           Abbrv  P(a)  P(b)  P(turn)  f(i)   f(i+1)  f(i+2)  f(i+3)
Alanine        A      142    83     66     0.060  0.076   0.035   0.058
Arginine       R       98    93     95     0.070  0.106   0.099   0.085
Aspartic Acid  D      101    54    146     0.147  0.110   0.179   0.081
Asparagine     N       67    89    156     0.161  0.083   0.191   0.091
Cysteine       C       70   119    119     0.149  0.050   0.117   0.128
Glutamic Acid  E      151    37     74     0.056  0.060   0.077   0.064
Glutamine      Q      111   110     98     0.074  0.098   0.037   0.098
Glycine        G       57    75    156     0.102  0.085   0.190   0.152
Histidine      H      100    87     95     0.140  0.047   0.093   0.054
Isoleucine     I      108   160     47     0.043  0.034   0.013   0.056
Leucine        L      121   130     59     0.061  0.025   0.036   0.070
Lysine         K      114    74    101     0.055  0.115   0.072   0.095
Methionine     M      145   105     60     0.068  0.082   0.014   0.055
Phenylalanine  F      113   138     60     0.059  0.041   0.065   0.065
Proline        P       57    55    152     0.102  0.301   0.034   0.068
Serine         S       77    75    143     0.120  0.139   0.125   0.106
Threonine      T       83   119     96     0.086  0.108   0.065   0.079
Tryptophan     W      108   137     96     0.077  0.013   0.064   0.167
Tyrosine       Y       69   147    114     0.082  0.065   0.114   0.125
Valine         V      106   170     50     0.062  0.048   0.028   0.053
Chou-Fasman Algorithm
▪ Identify α-helices
  • Find 4 out of 6 contiguous amino acids that have P(a) > 100
  • Extend the region until 4 contiguous amino acids with P(a) < 100 are found
  • Compute ΣP(a) and ΣP(b); if the region is >5 residues and ΣP(a) > ΣP(b), identify it as a helix
▪ Repeat for β-sheets [use P(b)]
▪ If an α and a β region overlap, the overlapping region is predicted according to ΣP(a) and ΣP(b)
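The helix steps can be sketched in Python as follows. This is a simplified reading of the slide, not the full published method: it returns only the first helix, extends in both directions until four contiguous residues with P(a) < 100 are met (or the sequence ends), and does not resolve α/β overlaps. The P(a) and P(b) values are copied from the Chou-Fasman parameter table in this deck; the function names are illustrative. The test sequence is the one used in the deck's worked example.

```python
# residue: (P(a), P(b)), copied from the parameter table in this deck.
PARAMS = {
    "A": (142, 83), "R": (98, 93), "D": (101, 54), "N": (67, 89),
    "C": (70, 119), "E": (151, 37), "Q": (111, 110), "G": (57, 75),
    "H": (100, 87), "I": (108, 160), "L": (121, 130), "K": (114, 74),
    "M": (145, 105), "F": (113, 138), "P": (57, 55), "S": (77, 75),
    "T": (83, 119), "W": (108, 137), "Y": (69, 147), "V": (106, 170),
}


def pa(res):
    return PARAMS[res][0]


def pb(res):
    return PARAMS[res][1]


def is_breaker(four):
    """True for a run of exactly four residues that all have P(a) < 100."""
    return len(four) == 4 and all(pa(r) < 100 for r in four)


def find_helix(seq):
    """Return (start, end) of the first predicted helix, or None."""
    for i in range(len(seq) - 5):
        window = seq[i:i + 6]
        # Nucleation: at least 4 of 6 residues with P(a) > 100.
        if sum(pa(r) > 100 for r in window) < 4:
            continue
        start, end = i, i + 6
        # Extend right until the next four residues all have P(a) < 100.
        while end < len(seq) and not is_breaker(seq[end:end + 4]):
            end += 1
        # Extend left in the same way.
        while start > 0 and not is_breaker(seq[max(0, start - 4):start]):
            start -= 1
        region = seq[start:end]
        if len(region) > 5 and sum(map(pa, region)) > sum(map(pb, region)):
            return start, end
    return None


seq = "CAENKLDHVRGPTCILFMTWYNDGP"
start, end = find_helix(seq)
region = seq[start:end]
print(region, sum(map(pa, region)), sum(map(pb, region)))
```

Note that histidine has P(a) = 100 exactly, so it counts neither as a helix former (> 100) nor as a breaker (< 100) under the strict inequalities used here.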
Chou-Fasman, cont’d
▪ Identify hairpin turns:
  • P(t) = f(i) of the residue × f(i+1) of the next residue × f(i+2) of the following residue × f(i+3) of the residue at position (i+3)
  • Predict a hairpin turn starting at positions where:
    ▪ P(t) > 0.000075
    ▪ The average P(turn) for the four residues > 100
    ▪ ΣP(a) < ΣP(turn) > ΣP(b) for the four residues
▪ Accuracy ≈ 60-65%
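The three turn conditions translate directly into code. The Python sketch below copies all seven parameters per residue from the Chou-Fasman parameter table in this deck and checks every four-residue window; comparing averages is equivalent to comparing sums because each window has four residues. The names `CF`, `turn_stats`, `is_turn`, and `predict_turns` are illustrative.

```python
from math import prod

# residue: (P(a), P(b), P(turn), f(i), f(i+1), f(i+2), f(i+3))
CF = {
    "A": (142, 83, 66, 0.060, 0.076, 0.035, 0.058),
    "R": (98, 93, 95, 0.070, 0.106, 0.099, 0.085),
    "D": (101, 54, 146, 0.147, 0.110, 0.179, 0.081),
    "N": (67, 89, 156, 0.161, 0.083, 0.191, 0.091),
    "C": (70, 119, 119, 0.149, 0.050, 0.117, 0.128),
    "E": (151, 37, 74, 0.056, 0.060, 0.077, 0.064),
    "Q": (111, 110, 98, 0.074, 0.098, 0.037, 0.098),
    "G": (57, 75, 156, 0.102, 0.085, 0.190, 0.152),
    "H": (100, 87, 95, 0.140, 0.047, 0.093, 0.054),
    "I": (108, 160, 47, 0.043, 0.034, 0.013, 0.056),
    "L": (121, 130, 59, 0.061, 0.025, 0.036, 0.070),
    "K": (114, 74, 101, 0.055, 0.115, 0.072, 0.095),
    "M": (145, 105, 60, 0.068, 0.082, 0.014, 0.055),
    "F": (113, 138, 60, 0.059, 0.041, 0.065, 0.065),
    "P": (57, 55, 152, 0.102, 0.301, 0.034, 0.068),
    "S": (77, 75, 143, 0.120, 0.139, 0.125, 0.106),
    "T": (83, 119, 96, 0.086, 0.108, 0.065, 0.079),
    "W": (108, 137, 96, 0.077, 0.013, 0.064, 0.167),
    "Y": (69, 147, 114, 0.082, 0.065, 0.114, 0.125),
    "V": (106, 170, 50, 0.062, 0.048, 0.028, 0.053),
}


def turn_stats(four):
    """Return P(t) and the average P(a), P(b), P(turn) of four residues."""
    # Residue k of the window contributes its f(i+k) value.
    p_t = prod(CF[r][3 + k] for k, r in enumerate(four))
    avg_a, avg_b, avg_turn = (sum(CF[r][col] for r in four) / 4
                              for col in (0, 1, 2))
    return p_t, avg_a, avg_b, avg_turn


def is_turn(four):
    p_t, avg_a, avg_b, avg_turn = turn_stats(four)
    return p_t > 0.000075 and avg_turn > 100 and avg_a < avg_turn > avg_b


def predict_turns(seq):
    """Start positions (0-based) of all predicted hairpin turns."""
    return [i for i in range(len(seq) - 3) if is_turn(seq[i:i + 4])]


p_t, avg_a, avg_b, avg_turn = turn_stats("VRGP")
print(f"P(t) = {p_t:.6f}, avg P(turn) = {avg_turn}")
print(is_turn("VRGP"))
```

For the window VRGP this reproduces the figures quoted in the deck's worked example (P(t) of about 0.000085 and an average P(turn) of 113.25).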
Chou-Fasman Example
▪ CAENKLDHVRGPTCILFMTWYNDGP
▪ CAENKL – Potential helix (every residue except C and N has P(a) > 100)
  • Residues with P(a) < 100: RNCGPSTY
  • Extend: When we reach RGPT, we must stop
  • CAENKLDHV: ΣP(a) = 972, ΣP(b) = 843
  • Declare alpha helix
▪ Identifying a hairpin turn
  • VRGP: P(t) = 0.000085
  • Average P(turn) = 113.25
  • Avg P(a) = 79.5, Avg P(b) = 98.25