Lecture 10 - Prediction, Engineering, Design of Protein Structures

Chapter 17
Prediction, Engineering and Design of
Protein Structures
Protein Engineering vs. Protein Design
• Protein Engineering: Mutating gene(s) to modify an existing
protein.
– Capability exists
– Many examples can be found
• Protein Design: Designing an entire protein from scratch to
serve a specific purpose.
– Unlikely until we can reliably predict folding from sequence
– Levinthal’s Paradox: Why we cannot test random combinations
– We can predict 2° structure, but prediction of 3° structure will require
a shortcut (e.g., energy considerations, kinetics, etc)
Prediction of Secondary Structure from Sequence
• PDBSum (EMBL-EBI): http://www.ebi.ac.uk/pdbsum/
• Jpred: http://www.compbio.dundee.ac.uk/www-jpred/
• PredictProtein: https://www.predictprotein.org/
• Either enter a FASTA sequence file or load a new/existing sequence
• Predictions are based on the propensity of certain AAs to form specific structures, or on
stereochemical considerations (compactness & hydrophobicity related to known tertiary structures),
but all rely on extensive analyses of sequences and the application of scoring
matrices
FASTA format:
versatile, compact with one header line
followed by a string of nucleotides or amino acids
in the single letter code
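The FASTA layout described above (one header line starting with ">", followed by sequence lines) can be read with a few lines of code. This is a minimal sketch, not a full parser: the function name `parse_fasta` and the example header are hypothetical, and the example sequence is only the C-terminal fragment of 3CLN that appears later in this lecture.

```python
from typing import Dict


def parse_fasta(text: str) -> Dict[str, str]:
    """Return {header: sequence} for FASTA-formatted text."""
    records: Dict[str, str] = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new record starts; the header is everything after ">"
            header = line[1:]
            records[header] = ""
        elif header is not None:
            # Sequence lines may wrap, so concatenate them
            records[header] += line
    return records


# Hypothetical header; the residues are a fragment of 3CLN chain A
example = """>example_fragment 3CLN:A C-terminal residues
VDEMIREANIDGDG
QVNYEEFVQMMTAK
"""

for name, seq in parse_fasta(example).items():
    print(name, len(seq), seq)
```

Wrapped sequence lines are joined, so the record above is returned as a single 28-residue string.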
Pairwise Alignment
• Potential relationships between proteins or nucleic acids can be
explored by comparing 2 or more sequences of amino acids or
nucleotides.
• Difficult to do visually.
• Computer algorithms help us by:
– Accelerating the comparison process
– Allowing for “gaps” or indels in sequences (i.e., insertions, deletions)
– Identifying substituted amino acids that are structurally or functionally
similar (e.g., D and E).
One way to do this is with BLAST (Basic Local Alignment Search Tool)
• Allows rapid sequence comparison of a query sequence against a database.
• The BLAST algorithm is fast, accurate, and web-accessible.
• BLAST lets user select from a variety of scoring matrices to evaluate
sequence relatedness.
Pevsner, Bioinformatics and Functional Genomics, 2009
NCBI key features: BLAST
BLAST is…
• Basic Local Alignment Search Tool
• NCBI's sequence similarity search tool
• supports analysis of DNA and protein databases
3CLN
BLAST
• BLAST allows the user to search a sequence (the query) against
millions of sequences in the NCBI database (the target).
• Global alignments (e.g., Needleman-Wunsch) would be time
consuming and computationally intensive for this amount of
data.
• BLAST is designed for local alignment, not global alignment.
• This allows for faster searches and can match subsets of proteins (e.g.,
domains).
Figure: C-terminal domain of CaM (from 3cln.pdb). Labels: E helix, F helix, Ca2+ (two sites), loop positions 1, 3, 5, 7, 9, 12.
BLAST Output from DB Search
• Graphic Summary includes conserved domains, when applicable.
BLAST Output from DB Search
• Graphic Summary includes distribution of BLAST hits.
• Color coded by bit score.
• Higher score is related to higher sequence identity.
Sequence Analyses: RNA
• Codons (3 RNA bases in sequence) determine each amino acid
that will build the protein expressed
• Many amino acids are encoded by more than 1 codon (change
in 3rd base).  Change of single base may not be significant.
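The degeneracy of the genetic code can be illustrated with a small lookup. The sketch below uses only a hand-picked subset of the standard codon table (the dictionary `CODON_TABLE` and the helper names are hypothetical), enough to show that a change in the 3rd base often leaves the amino acid unchanged.

```python
# Partial codon table (RNA codons -> three-letter amino acid codes).
# Only a few entries of the standard genetic code are included here.
CODON_TABLE = {
    "UUU": "Phe", "UUC": "Phe",
    "UUA": "Leu", "UUG": "Leu",
    "GAU": "Asp", "GAC": "Asp",
    "GAA": "Glu", "GAG": "Glu",
    "AUG": "Met", "UGG": "Trp",
}


def translate(rna: str) -> list:
    """Translate an RNA string codon by codon using the partial table."""
    return [CODON_TABLE[rna[i:i + 3]] for i in range(0, len(rna) - 2, 3)]


def is_synonymous(codon_a: str, codon_b: str) -> bool:
    """True if both codons encode the same amino acid."""
    return CODON_TABLE[codon_a] == CODON_TABLE[codon_b]


print(translate("AUGUUUGAC"))        # Met, Phe, Asp
print(is_synonymous("UUU", "UUC"))   # 3rd-base change, same amino acid
print(is_synonymous("GAU", "GAA"))   # 3rd-base change, Asp vs. Glu
```

Note that the last pair still differs only in the 3rd base but changes the amino acid, which is why the slide says a single base change "may not" be significant rather than "is not".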
Comparing protein sequences
• Comparing protein sequences usually more
informative than nucleotide sequences.
– Changing base at 3rd position in codon does not always
change AA (Ex: Both UUU and UUC encode for
phenylalanine)
– Different AAs may share similar chemical properties (Ex:
hydrophobic residues A, V, L, I)
– Relationships between related but mismatched AAs in
sequence analysis can be accounted for using scoring
systems (matrices).
– Protein sequence comparisons can ID sequence
homologies from proteins sharing a common ancestor as
far back as 1 × 10^9 years ago (vs. 600 × 10^6 for DNA).
Amino acids by similar biophysical properties
http://kimwootae.com.ne.kr/apbiology/chap2.htm
The aromatic amino acids (e.g., W, Y, F) have useful fluorescent properties.
Sequence Identity and Similarity
• Identity: How closely two sequences match one another.
– Unlike homology, identity can be measured quantitatively
• Similarity: Pairs of residues that are structurally or functionally
related (conservative substitutions).
>lcln|28245 3CLN:A|PDBID|CHAIN|SEQUENCE
Length=148
Score = 268 bits (684), Expect = 3e-97, Method: Compositional matrix adjust.
Identities = 130/148 (88%), Positives = 143/148 (97%), Gaps = 0/148 (0%)

Query  1    AEQLTEEQIAEFKEAFALFDKDGDGTITTKELGTVMRSLGQNPTEAELQDMINEVDADGN  60
            A+QLTEEQIAEFKEAF+LFDKDGDGTITTKELGTVMRSLGQNPTEAELQDMINEVDADGN
Sbjct  1    ADQLTEEQIAEFKEAFSLFDKDGDGTITTKELGTVMRSLGQNPTEAELQDMINEVDADGN  60

Query  61   GTIDFPEFLSLMARKMKEQDSEEELIEAFKVFDRDGNGLISAAELRHVMTNLGEKLTDDE  120
            GTIDFPEFL++MARKMK+ DSEEE+ EAF+VFD+DGNG ISAAELRHVMTNLGEKLTD+E
Sbjct  61   GTIDFPEFLTMMARKMKDTDSEEEIREAFRVFDKDGNGYISAAELRHVMTNLGEKLTDEE  120

Query  121  VDEMIREADIDGDGHINYEEFVRMMVSK  148
            VDEMIREA+IDGDG +NYEEFV+MM +K
Sbjct  121  VDEMIREANIDGDGQVNYEEFVQMMTAK  148

88% of the aligned positions have identical amino acids (Identities). This increases
to 97% (Positives) when you include positions where the amino acids are different, but
have similar properties.
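Counting identities is simple enough to check by hand or in code. The sketch below (the function name `count_identities` is hypothetical) applies it to the last block of the alignment on this slide, residues 121-148; "Positives" would additionally need a scoring matrix and are not computed here.

```python
def count_identities(query: str, subject: str):
    """Return (identical positions, alignment length) for two aligned strings."""
    if len(query) != len(subject):
        raise ValueError("aligned sequences must have the same length")
    identical = sum(1 for q, s in zip(query, subject) if q == s and q != "-")
    return identical, len(query)


# Residues 121-148 of the BLAST alignment shown on this slide
query = "VDEMIREADIDGDGHINYEEFVRMMVSK"
sbjct = "VDEMIREANIDGDGQVNYEEFVQMMTAK"

identical, length = count_identities(query, sbjct)
print(f"Identities = {identical}/{length} ({100 * identical / length:.0f}%)")
```

For this 28-residue block the count is 22 identical positions; the six differing positions are the ones shown as "+" or a blank in the BLAST midline.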
Pevsner, Bioinformatics and Functional Genomics, 2009
Sequence Homology
• Homology: Two sequences are homologous if they share a
common ancestor.
• No “degrees of homology”: only homologous or not
• Almost always share similar 3D structure
– Ex. myoglobin and beta globin
– Sequences can change significantly over time, but 3D
structure changes more slowly
Beta-globin sub-unit of adult
hemoglobin (2H35.pdb, in
blue), superimposed over
myoglobin (3RGK.pdb, in red).
These sequences probably
separated 600 million years
ago.
Pevsner, Bioinformatics and Functional Genomics, 2009
Percent Identity and Homology
• For an alignment of 70 amino acids, 40% sequence
identity is a reasonable threshold for homology.
• Above 20% (more than 70 amino acids) may indicate
homology.
• Below 20% probably indicates chance alignment.
Pevsner, Bioinformatics and Functional Genomics, 2009
Orthologs and Paralogs
• Orthologs: Homologous sequences in different species that
arose from a common ancestral gene during speciation.
– Ex. Humans and rats diverged around 80 million years ago, at which point
divergence of their myoglobin genes occurred.
– Orthologs frequently have similar biological functions.
• Human and rat myoglobin (oxygen transport)
• Human and rat CaM
• Paralogs: Homologous sequences that arose by a mechanism
such as gene duplication.
• Within same organism/species
• Ex. Myoglobin and beta globin are paralogs
– Have distinct but related functions.
Pevsner, Bioinformatics and Functional Genomics, 2009
Conservative Substitutions in Matrices
Scoring may also vary based on conserved substitutions of amino
acids: i.e., amino acids with similar properties will not lose as many
points as AAs with very different properties.
Basic AAs: K, R, H
Acidic AAs: D, E
Hydroxylated AAs: S, T
Hydrophobic AAs: G, A, V, L, I, M, F, P, W, Y
These relationships would be considered when calculating “Positives” in
BLAST alignment.
Pevsner, Bioinformatics and Functional Genomics, 2009
Dayhoff Model: Building a Scoring Matrix
• In 1978, Margaret Dayhoff provided one of the first models of a scoring matrix
• Model was based on rules by which evolutionary changes occur in proteins
• Catalogued 1000s of proteins, considered which specific amino acid
substitutions occurred when 2 homologous proteins were aligned
• Assumes substitution patterns in closely-related proteins can be
extrapolated to more distantly-related proteins
• An accepted point mutation (PAM) is an AA replacement accepted by
natural selection
• Based on observed mutations, not necessarily on related AA properties
• Probable mutations are rewarded, while unlikely mutations are penalized
• Scores for comparison of 2 residues (i, j) are based on the following equation:
s_{i,j} = 10 \times \log_{10}(q_{i,j} / p_i)
Here, q_{i,j} is the probability of an observed substitution (from the mutation probability
matrix), while p_i is the likelihood of observing the replacement AA (i) as a result of chance
(from the normalized frequency of AA table).
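As a worked example, the log-odds score can be computed for tryptophan. This sketch assumes the usual Dayhoff form, 10 × log10(q / p), rounded to the nearest integer; the inputs are the Trp values given in the PAM250 tables of this lecture (55% for Trp remaining Trp, and a normalized Trp frequency of 0.012). The function name is hypothetical.

```python
import math


def log_odds_score(q_ij: float, p_i: float) -> int:
    """Dayhoff-style log-odds score: 10 * log10(q_ij / p_i), rounded.

    q_ij: probability of the observed substitution (mutation probability matrix)
    p_i:  chance frequency of the replacement amino acid
    """
    return round(10 * math.log10(q_ij / p_i))


# Trp -> Trp: q = 0.55 (PAM250 probability matrix), p = 0.012 (frequency table)
print(log_odds_score(0.55, 0.012))  # 17
```

A positive score means the pairing is seen more often than chance would predict; a ratio below 1 gives a negative score.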
Pevsner, Bioinformatics and Functional Genomics, 2009
PAM250 Mutation Probability Matrix
Rows: original AA. Columns: replacement AA.
        A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V
Ala A  13   3   4   5   2   3   5  12   2   3   6   6   1   2   7   9   8   0   1   7
Arg R   6  17   4   4   1   5   4   5   5   2   4  18   1   1   5   6   5   2   1   4
Asn N   9   4   6   8   1   5   7  10   5   2   4  10   1   2   5   8   6   0   2   4
Asp D   9   3   7  11   1   6  11  10   4   2   3   8   1   1   4   7   6   0   1   4
Cys C   5   2   2   1  52   1   1   4   2   2   2   2   0   1   3   7   4   0   3   4
Gln Q   8   5   5   7   1  10   9   7   7   2   6  10   1   1   5   6   5   0   1   4
Glu E   9   3   6  10   1   7  12   9   4   2   4   8   1   1   4   7   5   0   1   4
Gly G  12   2   4   5   2   3   5  27   2   2   3   5   1   1   5   9   6   0   1   4
His H   6   6   6   6   2   7   6   5  15   2   5   8   1   3   5   6   4   1   3   5
Ile I   8   3   3   3   2   2   3   5   2  10  15   5   2   5   3   5   6   0   2   4
Leu L   6   2   2   2   1   3   2   4   2   6  34   4   3   6   3   4   4   1   2  15
Lys K   7   9   5   5   1   5   5   6   3   2   4  24   2   1   4   7   6   0   1  10
Met M   7   4   3   3   1   3   3   5   2   6  20   9   6   4   3   5   5   0   2   4
Phe F   4   1   2   1   1   1   1   3   2   5  13   2   2  32   2   3   3   1  15  10
Pro P  11   4   4   4   2   4   4   8   3   2   5   6   1   1  20   9   6   0   1   5
Ser S  11   4   5   5   3   3   5  11   3   3   4   8   1   2   6  10   8   1   2   5
Thr T  11   3   4   5   2   3   5   9   2   4   6   8   1   2   5   9  11   0   2   5
Trp W   2   7   2   1   1   1   1   2   2   1   6   4   1   4   1   4   2  55   3  72
Tyr Y   4   2   3   2   4   2   2   3   3   3   7   3   1  20   2   4   3   1  31   4
Val V   9   2   3   3   2   3   3   7   2   9  13   5   2   3   4   6   6   0   2  17
Think of these values as percentages (in this layout, each row sums to approximately 100).
For example, there is an 18% (0.18) probability of R being replaced by K.
This probability matrix needs to be converted into a scoring matrix.
http://www.icp.ucl.ac.be/~opperd/private/pam250.html
Normalized Frequencies of Amino Acids
Ala 0.096    Asn 0.042
Gly 0.090    Pro 0.041
Lys 0.085    Ile 0.035
Leu 0.085    His 0.034
Val 0.078    Arg 0.034
Thr 0.062    Gln 0.032
Ser 0.057    Tyr 0.030
Asp 0.053    Cys 0.025
Glu 0.053    Met 0.012
Phe 0.045    Trp 0.012
**How often a given amino acid appears in a protein (determined by empirical analyses)
http://www.icp.ucl.ac.be/~opperd/private/pam250.html
Purpose of PAM Matrices
• Derive a scoring system to determine
relatedness of 2 sequences.
• PAM mutation probability matrix must be
converted to a scoring matrix (log odds
matrix).
PAM250 Log-Odds Matrix
        C   S   T   P   A   G   N   D   E   Q   H   R   K   M   I   L   V   F   Y   W
Cys C  12
Ser S   0   2
Thr T  -2   1   3
Pro P  -3   1   0   6
Ala A  -2   1   1   1   2
Gly G  -3   1   0  -1   1   5
Asn N  -4   1   0  -1   0   0   2
Asp D  -5   0   0  -1   0   1   2   4
Glu E  -5   0   0  -1   0   0   1   3   4
Gln Q  -5  -1  -1   0   0  -1   1   2   2   4
His H  -3  -1  -1   0  -1  -2   2   1   1   3   6
Arg R  -4   0  -1   0  -2  -3   0  -1  -1   1   2   8
Lys K  -5   0   0  -1  -1  -2   1   0   0   1   0   3   5
Met M  -5  -2  -1  -2  -1  -3  -2  -3  -2  -1  -2   0   0   6
Ile I  -2  -1   0  -2  -1  -3  -2  -2  -2  -2  -2  -2  -2   2   5
Leu L  -8  -3  -2  -3  -2  -4  -3  -4  -3  -2  -2  -3  -3   4   2   8
Val V  -2  -1   0  -1   0  -1  -2  -2  -2  -2  -2  -2  -2   2   4   2   4
Phe F  -4  -3  -3  -5  -4  -5  -4  -6  -5  -5  -2  -4  -5   0   1   2  -1   9
Tyr Y   0  -3  -3  -5  -3  -5  -2  -4  -4  -4   0  -4  -4  -2  -1  -1  -2   7  10
Trp W  -8  -2  -5  -6  -6  -7  -4  -7  -7  -5  -3   2  -3  -4  -5  -2  -6   0   0  17
This is the PAM250 scoring matrix, calculated from the mutation probability matrix and the normalized amino acid frequencies.
http://www.icp.ucl.ac.be/~opperd/private/pam250.html
Pairwise Alignment and Homology
PAM Value    Distance (%)
80           50
100          60
200          75
250          85    <- Twilight zone
300          92
Think of the PAM value as the total number of mutations. This includes multiple mutations over
time at a single position.
Currently, we accept that once the percent distance reaches ~85%, homology is
indeterminate.
PAM250 works best for more distantly related protein sequences.
Seq1  AGDFWYGGDGEYLLV
Seq2  AGQFWYGGEGEKLLV
Seq3  AGEFWYGGEGEKLLV
http://www.icp.ucl.ac.be/~opperd/private/pam.html
Seq1 and Seq2 are separated by 3 PAM units,
while Seq1 and Seq3 are separated by 4
PAM units
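Counting the observed differences between these three sequences shows why PAM units and percent difference are not the same thing. The sketch below (hypothetical function name) counts mismatched positions only. One possible reading of the slide, stated here as an assumption, is that Seq3 arose from Seq2 by a further change at position 3 (D to Q to E), so the Seq1 to Seq3 path involves 4 mutations even though only 3 positions differ.

```python
def observed_differences(a: str, b: str) -> int:
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(1 for x, y in zip(a, b) if x != y)


seq1 = "AGDFWYGGDGEYLLV"
seq2 = "AGQFWYGGEGEKLLV"
seq3 = "AGEFWYGGEGEKLLV"

print(observed_differences(seq1, seq2))  # 3 differing positions
print(observed_differences(seq1, seq3))  # also 3 differing positions
print(observed_differences(seq2, seq3))  # 1 differing position
```

Repeated mutations at one site are invisible to a direct comparison, which is why the observed distance saturates as the PAM value grows.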
Practical Lessons from the Dayhoff Model
• Less mutable amino acids likely play more important structural and
functional roles
• Mutable amino acids fulfill functions that can be filled by other amino acids
with similar properties
• Common substitutions tend to require only a single nucleotide change in the
codon
• Amino acids that can be created from more than 1 codon are more likely to
be created as a substitute (See p. 63, textbook)
Changes to sequence that do
not alter structure and function
of the protein are likely to be more
tolerated in nature
Pevsner, Bioinformatics and Functional Genomics, 2009
BLOSUM62 Scoring Matrix
BLOck SUbstitution Matrix
• By Henikoff and Henikoff (1992)
• Default scoring matrix for pairwise alignment of
sequences using BLAST (local alignments)
• Based on empirical observations of distantly-related proteins organized into blocks
• In BLOSUM62, proteins are arranged in blocks sharing at least 62% identity
    A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V
A   4
R  -1   5
N  -2   0   6
D  -2  -2   1   6
C   0  -3  -3  -3   9
Q  -1   1   0   0  -3   5
E  -1   0   0   2  -4   2   5
G   0  -2   0  -1  -3  -2  -2   6
H  -2   0   1  -1  -3   0   0  -2   8
I  -1  -3  -3  -3  -1  -3  -3  -4  -3   4
L  -1  -2  -3  -4  -1  -2  -3  -4  -3   2   4
K  -1   2   0  -1  -1   1   1  -2  -1  -3  -2   5
M  -1  -2  -2  -3  -1   0  -2  -3  -2   1   2  -1   5
F  -2  -3  -3  -3  -2  -3  -3  -3  -1   0   0  -3   0   6
P  -1  -2  -2  -1  -3  -1  -1  -2  -2  -3  -3  -1  -2  -4   7
S   1  -1   1   0  -1   0   0   0  -1  -2  -2   0  -1  -2  -1   4
T   0  -1   0  -1  -1  -1  -1  -2  -2  -1  -1  -1  -1  -2  -1   1   5
W  -3  -3  -4  -4  -2  -2  -3  -2  -2  -3  -2  -3  -1   1  -4  -3  -2  11
Y  -2  -2  -2  -3  -2  -1  -2  -3   2  -1  -1  -2  -1   3  -3  -2  -2   2   7
V   0  -3  -3  -3  -1  -2  -2  -3  -3   3   1  -2   1  -1  -2  -2   0  -3  -1   4
Pevsner, Bioinformatics and Functional Genomics, 2009
General Trends in Scoring Matrices
Less divergent (e.g., human vs. chimp): BLOSUM90, PAM30
Intermediate: BLOSUM62, PAM120
More divergent (e.g., human vs. bacteria): BLOSUM45, PAM250
Choose a matrix that is consistent with the level of sequence
identity you are investigating. For example, if you are looking at/for
more closely related sequences, use BLOSUM90. If you are not
sure, use BLOSUM62.
Sequence Alignments: General Concepts
• Global Alignment: Tries to match the entire length of the
sequence.
• Local Alignment: Tries to find the longest section that
matches.
Both are examples of dynamic programming: precise but slow
Global Alignment
Input: two sequences over the same alphabet (either nucleotide or
amino acid sequences)
Output: The alignment of the sequences
Example:
• GADEGYFGPVILAADGEVA and GGAEGDYFGPAIAEGEVA
• A possible alignment might look like this:
Figure labels mark each difference as a mutation (mut), insertion (ins), or deletion (del):
-GADEG-YFGPVILAADGEVA
GGA-EGDYFGPAI--AEGEVA
Global Alignment – A Simple Scoring Scheme
Each position is scored independently:
• Match: +1
• Mismatch: -1
• Insertions or deletions (gaps): -2
The alignment score is the sum of the position scores
-GADEG-YFGPVILAADGEVA
GGA-EGDYFGPAI--AEGEVA
Global Alignment Score: (14 × (+1)) + (5 × (-2)) + (2 × (-1)) = 2
-----GADEG-YFGPVILAADGEVA---
DLGNVGA-EGDYFGPAI--AEGEVARPL
Global Alignment Score: (14 × (+1)) + (12 × (-2)) + (2 × (-1)) = -12
-----GADEG-YFGPVILAADGEVA---
dlgnvGA-EGDYFGPAI--AEGEVArpl
Local Alignment Score: (14 × (+1)) + (4 × (-2)) + (2 × (-1)) = 4
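The three scores on this slide can be reproduced with a short script. This is a sketch under the slide's simple scheme (+1 match, -1 mismatch, -2 gap); the function names are hypothetical, and the "local" score is modeled here simply by dropping the leading and trailing columns that contain a gap, which matches the lowercase (unaligned) ends shown on the slide.

```python
def score_alignment(top: str, bottom: str, match=1, mismatch=-1, gap=-2) -> int:
    """Sum the per-column scores of two aligned, equal-length strings."""
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have the same length")
    score = 0
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            score += gap
        elif a == b:
            score += match
        else:
            score += mismatch
    return score


def trim_end_gaps(top: str, bottom: str):
    """Drop leading/trailing columns in which either sequence has a gap."""
    start, end = 0, len(top)
    while start < end and "-" in (top[start], bottom[start]):
        start += 1
    while end > start and "-" in (top[end - 1], bottom[end - 1]):
        end -= 1
    return top[start:end], bottom[start:end]


short_top, short_bottom = "-GADEG-YFGPVILAADGEVA", "GGA-EGDYFGPAI--AEGEVA"
long_top = "-----GADEG-YFGPVILAADGEVA---"
long_bottom = "DLGNVGA-EGDYFGPAI--AEGEVARPL"

print(score_alignment(short_top, short_bottom))                     # 2
print(score_alignment(long_top, long_bottom))                       # -12
print(score_alignment(*trim_end_gaps(long_top, long_bottom)))       # 4
```

The end gaps are what drag the global score down to -12; ignoring them is the essential difference between the global and local view of the same pair.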
Matrices and Gap Costs
Query Length    Substitution Matrix    Gap Costs
<35             PAM-30                 (9,1)
35-50           PAM-70                 (10,1)
50-85           BLOSUM-80              (10,1)
>85             BLOSUM-62              (10,1)
The raw score of an alignment is the sum of the scores for aligning pairs of residues and
the scores for gaps.
Gapped BLAST and PSI-BLAST use "affine gap costs" which charge the score -a for the
existence of a gap, and the score -b for each residue in the gap. Thus a gap of k residues
receives a total score of -(a+bk); specifically, a gap of length 1 receives the score -(a+b).
Your total raw score for the alignment is reduced when you introduce gaps into the
query sequence.
Calculate the score in BLOSUM-62 for a gap with 7 residues…
http://www.ncbi.nlm.nih.gov/BLAST/blastcgihelp.shtml#Matrix/
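The affine gap formula -(a + bk) is easy to evaluate directly. The sketch below (hypothetical function name) uses the gap costs (10,1) listed for BLOSUM-62 in the table on this slide, so it is only as current as that table.

```python
def affine_gap_score(k: int, a: int, b: int) -> int:
    """Score of a gap of k residues with existence cost a and extension cost b."""
    if k < 1:
        raise ValueError("a gap must contain at least one residue")
    return -(a + b * k)


# Gap costs (10,1) as listed for BLOSUM-62 in the table above
print(affine_gap_score(1, a=10, b=1))  # -11
print(affine_gap_score(7, a=10, b=1))  # -17
```

Opening a gap is expensive and extending it is cheap, so one gap of 7 residues (-17) costs far less than seven separate gaps of 1 residue (7 × -11 = -77).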
Global Sequence Alignments
• Global Alignment: Entire sequence of each protein or DNA.
• Needleman and Wunsch (1970)
• Reduces problem to series of smaller alignments on a residue-by-residue basis.
• How this approach works:
1. Set up a matrix
2. Score the matrix
3. ID the optimal alignment
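The three steps can be sketched in code. This is a minimal Needleman-Wunsch implementation under the simple scheme used earlier in this lecture (+1 match, -1 mismatch, -2 per gap position, no affine gaps); it is an illustration, not the scoring used by any particular tool, and it returns only one of possibly several optimal alignments.

```python
def needleman_wunsch(a: str, b: str, match=1, mismatch=-1, gap=-2):
    """Global alignment; returns (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)

    def sub(i, j):
        # Substitution score for a[i-1] vs b[j-1]
        return match if a[i - 1] == b[j - 1] else mismatch

    # Step 1: set up the matrix; first row/column are all-gap prefixes
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = i * gap
    for j in range(1, m + 1):
        F[0][j] = j * gap

    # Step 2: score the matrix residue by residue
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i][j] = max(F[i - 1][j - 1] + sub(i, j),
                          F[i - 1][j] + gap,
                          F[i][j - 1] + gap)

    # Step 3: trace back from the bottom-right corner
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + sub(i, j):
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return F[n][m], "".join(reversed(top)), "".join(reversed(bottom))


score, top, bottom = needleman_wunsch("GADEGYFGPVILAADGEVA", "GGAEGDYFGPAIAEGEVA")
print(score)
print(top)
print(bottom)
```

Because the algorithm is exact, its score for the two example sequences is at least as high as the score of the hand-made alignment shown earlier.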
Local Sequence Alignment
• Local Alignment: Longest matching regions (subsets) between
2 sequences.
• Smith and Waterman Algorithm (1981)
• Scoring is similar to global alignment
1. Set up a matrix
2. Score the matrix
• No negative values allowed: If negative values are the only choices, then the cell
defaults to zero (0).
• Mismatches and gaps at ends score 0.
3. ID the optimal alignment
• More sensitive but much slower than heuristic methods
(FASTA, BLAST)
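The "no negative values" rule is the only change needed to turn the global scoring recurrence into a local one. The sketch below computes just the best local score (no traceback), again assuming the simple +1 / -1 / -2 scheme from this lecture rather than a substitution matrix.

```python
def smith_waterman_score(a: str, b: str, match=1, mismatch=-1, gap=-2) -> int:
    """Best local alignment score between a and b."""
    H = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    best = 0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Cells are floored at zero, so a poor region never
            # penalizes a good region that follows it
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            best = max(best, H[i][j])
    return best


print(smith_waterman_score("XXGATTACAYY", "ZZGATTACAWW"))  # 7
print(smith_waterman_score("GADEGYFGPVILAADGEVA", "DLGNVGAEGDYFGPAIAEGEVARPL"))
```

The optimal alignment is recovered by tracing back from the highest-scoring cell until a zero is reached, which is why unmatched ends cost nothing.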
Heuristic (word or k-tuple based) algorithms
• Uses initial query to make reasonable guesses about
sequence alignments, then evaluates those
considered “most likely”
• Alignment then extended until:
– One of the sequences ends
– Score falls below some threshold
• In BLAST, search depends on word size
KENFDKARFSGTWYAMAKKDPEG 50 RBP (query)
MKGLDIQKVAGTWYSLAMAASD. 44 lactoglobulin (hit)
<- extend    Hit!    extend ->
FASTA (Pearson and Lippman 1988)
• Combines Smith and Waterman algorithm with word
(k-tup) search → faster, heuristic approach
• Query sequence divided into small words (usually
k=2 for proteins)
– Words used to initially compare and match sequences
– If words located on same diagonal, surrounding region is
then selected for analysis
Seq 1: FYGKLHMEGD
Seq 2: FWGKLHMEGSNE
Seq 1 search words (k-tup = 2): FY, YG, GK, KL, LH, HM, ME, EG, GD
http://www.incogen.com/bioinfo_tutorials/Bioinfo-Lecture_2-pairwise-align.html
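The word search and the "same diagonal" test can be sketched as follows. This illustrates the idea only and is not the FASTA program itself; the function names are hypothetical. A diagonal is identified by the offset between the word's position in the target and in the query.

```python
from collections import Counter


def words(seq: str, k: int = 2) -> list:
    """All overlapping words (k-tuples) of length k in seq."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def diagonal_hits(query: str, target: str, k: int = 2) -> Counter:
    """Count shared words per diagonal (offset = target index - query index)."""
    positions = {}
    for i, w in enumerate(words(query, k)):
        positions.setdefault(w, []).append(i)
    hits = Counter()
    for j, w in enumerate(words(target, k)):
        for i in positions.get(w, []):
            hits[j - i] += 1
    return hits


seq1, seq2 = "FYGKLHMEGD", "FWGKLHMEGSNE"
print(words(seq1))
print(diagonal_hits(seq1, seq2))
```

For the two sequences on this slide, all six shared words (GK, KL, LH, HM, ME, EG) fall on diagonal 0, so that region would be selected for a more careful alignment.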
Statistical Measures of Algorithms
• Objective of alignment algorithms is to maximize sensitivity
and specificity of alignments.
• Sensitivity: Measure of how well algorithm correctly predicts
sequences that are related.
• Specificity: Measure of how well algorithm correctly predicts
sequences that are unrelated.
TP: Positive identified as positive
FP: Negative identified as positive
TN: Negative identified as negative
FN: Positive identified as negative
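The slide defines the four outcomes but the formulas themselves are not in the transcript. The sketch below uses the standard definitions, sensitivity = TP / (TP + FN) and specificity = TN / (TN + FP); the counts in the example are made up for illustration.

```python
def sensitivity(tp: int, fn: int) -> float:
    """Fraction of truly related sequences that were reported as related."""
    return tp / (tp + fn)


def specificity(tn: int, fp: int) -> float:
    """Fraction of truly unrelated sequences that were reported as unrelated."""
    return tn / (tn + fp)


# Hypothetical search result: 90 true hits found, 10 missed,
# 80 unrelated sequences correctly rejected, 20 reported in error
print(sensitivity(tp=90, fn=10))  # 0.9
print(specificity(tn=80, fp=20))  # 0.8
```

Loosening a score threshold typically raises sensitivity at the cost of specificity, which is the trade-off the alignment algorithms try to balance.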
Relationships between biological sequences
• Biological sequences tend to occur in families
– These may be related genes within an organism (paralogs) or
between species (orthologs)
– Presumably derived from common ancestor
• Nucleotides corresponding to coding regions are typically less well
conserved than proteins due to degeneracy of genetic code
– More difficult to align
Sequences evolve faster
than structures, but
homologous sequences
tend to retain similar
structure and function
(e.g., rat vs. human CaM)
Multiple sequence alignments
• Homology can be observed through multiple sequence
alignments (MSA)
• MSA: 3 or more protein (or nucleic acid) sequences that
are partially or completely aligned
• Homologous residues are aligned in columns across the
length of the sequences
1exr_A  -EQLTEEQIAEFKEAFALFDKDGDGTITTKELGTVMRSLGQNPTEAELQDMINEVDADGN 59
1N0Y_A  AEQLTEEQIAEFKEAFALFDKDGDGTITTKELGTVMRSLGQNPTEAELQDMINEVDADGN 60
3cln_   ----TEEQIAEFKEAFSLFDKDGDGTITTKELGTVMRSLGQNPTEAELQDMINEVDADGN 56
           :************:*******************************************

1exr_A  GTIDFPEFLSLMARKMKEQDSEEELIEAFKVFDRDGNGLISAAELRHVMTNLGEKLTDDE 119
1N0Y_A  GTIDFPEFLSLMARKMKEQDSEEELIEAFKVFDRDGNGLISAAELRHVMTNLGEKLTDDE 120
3cln_   GTIDFPEFLTMMARKMKDTDSEEEIREAFRVFDKDGNGYISAAELRHVMTNLGEKLTDEE 116
        *********::******: *****: ***:***:**** *******************:*

1exr_A  VDEMIREADIDGDGHINYEEFVRMMVS- 146
1N0Y_A  VDEMIREADIDGDGHINYEEFVRMMVSK 148
3cln_   VDEMIREANIDGDGQVNYEEFVQMMTA- 143
        ********:*****::******:**.:
Multiple sequence alignments
• MSAs are powerful because they can reveal
relationships between 2 sequences that can only be
observed by their relationships with a third sequence
Pairwise (Seq 1 vs. Seq 2):
Seq 1  AVGYDFGEKMLSGADDW
Seq 2  LVGERADLTGAEIDE
With a third sequence (Seq 3):
Seq 1  AVGYDFGEKMLSGA--DDW
Seq 3  LVGYDRADK-LTGAE-DD
Seq 2  LVG-ERAD--LTGAEIDE-
How are MSAs determined?
MSAs can be determined based on:
• Presence of highly-conserved residues such as cysteine
• Conserved motifs and domains
• Conserved features of protein secondary structure
• Regions showing consistent patterns of insertions or
deletions
C-terminal domain of
CaM (from 3cln.pdb)
Conserved 2° structure
(α-helices)
ClustalW Output for CD2 Protein
Legend for the numbered callouts on the ClustalW output:
1. Color coding indicates AA property class
2. * Indicates 100% conserved over entire alignment
3. : Conservative mutations
4. . Less conservative mutations
5. [blank] Gap or least conserved mutations
Statistical Analysis of PDB Data: Ca2+ vs. Pb2+
Figure: Ca2+ shown with pentagonal bipyramidal geometry; Pb2+ shown with holo- and hemi-directed geometries (M = metal, L = ligand).

Pb: Ligand Distribution (%)
SC O       61.0   (Asp 20.3, Glu 38.4, Asn 1.1, Gln 0.6, Thr 0.6)
HOH        20.3
S           7.3
Carbonyl    5.6
SC N        5.1
MC N        0.6

Ca: EF-Hand (%)
SC         65.3   (Asp 29.7, Glu 26.6, Asn 6.1, Ser 2.6, Thr 0.3, Gln 0.0)
Carbonyl   21.4
HOH        13.3

Ca: Non-EF-Hand (%)
SC         42.9   (Asp 24.5, Glu 10.4, Asn 4.3, Gln 1.3, Ser 1.3, Thr 1.0, Tyr 0.1)
HOH        33.1
Carbonyl   23.9

(Kirberger, Wang et al. 2008; Kirberger and Yang 2008; Glusker et al. 1998)
Develop Algorithms/Programs to Address Specific
Problems
• Identify calcium-binding proteins by matching patterns of
known calcium-binding sites in sequences.
Descriptive ID: Sequence Pattern
Prosite PS00018 (EF-Hand):
D-X-[DNS]-{ILVFYW}-[DENSTG]-[DNQGHRK]-{GP}-[LIVMC]-[DENQSTAGC]-X(2)-[DE]-[LIVMFYW]
Yang (Pattern 1), EFH Helix E:
X-{DNQ}-X-X-{GP}-{ENSPQ}-X-X-{DQRP}
Yang (Pattern 1), EFH Loop:
[DNS]-X-[DNS]-{ILVFYW}-[DENSTG]-[DNQGHRK]-{GP}-[LIVMC]-[DENQSTAGC]-X(2)-[ED]
Yang (Pattern 1), EFH Helix F:
[FLMYVIW]-X-X-{NPS}-{DNEQ}-X(3)
Yang (Pattern 2), YY00018:
X(1)-{DNQ}-X(2)-{GP}-{ENSPQ}-X(2)-{DQRP}-[DNS]-X(1)-[DNS]-{ILVFYW}-[DENSTG]-[DNQGHRK]-{GP}-[LIVMC]-[DENQSTAGC]-X(2)-[ED]-[FLMYVIW]-X(2)-{NPS}-{DNEQ}-X(3)
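Patterns written in this notation can be matched against a sequence by translating them into regular expressions. The converter below is a minimal sketch that handles only the elements used on this slide (a single residue, X, [allowed], {forbidden}, and a repeat count such as X(2)); the function name is hypothetical, and it is applied to the first 60 residues of the calmodulin query sequence shown earlier in this lecture.

```python
import re

# One pattern element: a residue, [allowed set] or {forbidden set},
# optionally followed by a repeat count such as (2)
_ELEMENT = re.compile(r"^(\[[A-Z]+\]|\{[A-Z]+\}|[A-Z])(?:\((\d+)\))?$")


def prosite_to_regex(pattern: str) -> str:
    """Translate a simple PROSITE-style pattern into a Python regex string."""
    parts = []
    for element in pattern.split("-"):
        m = _ELEMENT.match(element.strip())
        if m is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        body, count = m.groups()
        if body == "X":
            regex = "."                        # any residue
        elif body.startswith("{"):
            regex = "[^" + body[1:-1] + "]"    # any residue except these
        else:
            regex = body                       # literal residue or [set]
        if count:
            regex += "{" + count + "}"
        parts.append(regex)
    return "".join(parts)


PS00018 = ("D-X-[DNS]-{ILVFYW}-[DENSTG]-[DNQGHRK]-{GP}-[LIVMC]-"
           "[DENQSTAGC]-X(2)-[DE]-[LIVMFYW]")
cam_1_60 = "AEQLTEEQIAEFKEAFALFDKDGDGTITTKELGTVMRSLGQNPTEAELQDMINEVDADGN"

hit = re.search(prosite_to_regex(PS00018), cam_1_60)
print(hit.start() + 1, hit.group())  # 1-based start position and matched loop
```

The first match starts at residue 20 (DKDGDGTITTKEL), which corresponds to the first Ca2+-binding loop region of this sequence.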
Protein Engineering by Rational Design
1. Computer aided design – May include statistical & structural parameters
2. Site-directed mutagenesis – changing one or more nucleotides in plasmid to change AA in protein
3. Transformation – Alteration of bacterial cell through introduction of exogenous genetic material
4. Protein expression – Manufacturing the protein(s)
5. Protein purification – separate target protein from other biomolecules
6. Biochemical testing
Engineered Proteins: Therapy
• Abatacept: Fusion protein composed of the Fc region of the immunoglobulin
IgG1 fused to the extracellular domain of CTLA-4.
• Abatacept binds to the CD80 and CD86 molecules, and inhibits T cell activation
by blocking the signal from the antigen presenting cell. This prevents an immune response.
• Developed by Bristol-Myers Squibb and is licensed in the United States for the
treatment of rheumatoid arthritis.
Engineered Proteins: Research
1T6W
Ca.CD2 is a protein engineered by Dr. Jenny Yang’s research group at Georgia State
University.
Cell Adhesion Molecule CD2 was modified by insertion of a calcium binding site.
The binding site was observed to bind calcium selectively over other mono- and divalent biological metals, and to bind several other metals including lanthanum and
terbium, while still retaining the ability to bind its natural target molecule.
The objectives of this research were to see if a metal binding site could be
engineered into a small protein without significantly altering the protein, to study
an isolated calcium binding site, and to develop a model for the development of
proteins with specific functions.
Design of a calcium-binding protein with desired structure in a cell adhesion molecule, JACS, 2005.
Engineered Proteins: Research
GFP and other FPs, fused to other proteins, have found a variety of uses in cellular and
tissue imaging.
http://www.conncoll.edu/ccacad/zimmer/GFP-ww/prasher.html
Mutations to GFP produce different colors
The availability of different FP colors
has also enabled researchers to
develop methods to probe whether
two proteins are within a distance of
less than 10 nm of each other using
the phenomenon of Förster (or
fluorescence) resonance energy
transfer (FRET) (Förster 1948). FRET is
the distance- and orientation-dependent radiationless transfer of
excitation energy from a donor
fluorophore to an acceptor
chromophore.
http://zeiss-campus.magnet.fsu.edu/articles/probes/jellyfishfps.html
Protein Design Algorithms
• Two major classes:
• Exact algorithms (e.g., Dead-end elimination)
provide optimal solutions but have long run times
• Heuristic algorithms (e.g., Monte Carlo) have faster run
times but may not provide optimal solutions.
DEE Algorithm (Exact)
• DEE (dead-end elimination): Compares all possible side chain rotamers on a fixed protein
backbone and removes those that cannot be part of the global minimum energy conformation
(GMEC).
• DEE cannot guarantee convergence. If, after a certain number of iterations, DEE cannot
remove any more rotamers, then either rotamers have to be merged or another search
algorithm must be used to search the remaining search space. In such cases, the dead-end
elimination acts as a pre-filtering algorithm to reduce the search space.
https://www.cs.duke.edu/brd/papers/Proteins12/
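The pruning idea can be sketched on a toy problem. The code below assumes the classic single-rotamer elimination criterion (a rotamer r is removed if its best possible energy is still worse than the worst possible energy of some competitor t at the same position); that criterion, the function names, and all energy values are assumptions for illustration and are not taken from the slide or the linked paper.

```python
def pair(E_pair, i, r, j, s):
    """Pairwise energy between rotamer r at position i and rotamer s at position j."""
    return E_pair[(i, j)][r][s] if i < j else E_pair[(j, i)][s][r]


def dee_single_pass(E_self, E_pair, alive):
    """One elimination pass; returns the surviving rotamers per position."""
    new_alive = {i: set(rs) for i, rs in alive.items()}
    for i in alive:
        others = [j for j in alive if j != i]
        for r in list(new_alive[i]):
            # Best case for r: lowest pair energy with every other position
            best_r = E_self[i][r] + sum(
                min(pair(E_pair, i, r, j, s) for s in new_alive[j]) for j in others)
            for t in list(new_alive[i]):
                if t == r:
                    continue
                # Worst case for competitor t: highest pair energies
                worst_t = E_self[i][t] + sum(
                    max(pair(E_pair, i, t, j, s) for s in new_alive[j]) for j in others)
                if best_r > worst_t:
                    new_alive[i].discard(r)  # r cannot be in the GMEC
                    break
    return new_alive


def dee(E_self, E_pair):
    """Repeat elimination passes until no more rotamers can be removed."""
    alive = {i: set(range(len(e))) for i, e in enumerate(E_self)}
    while True:
        pruned = dee_single_pass(E_self, E_pair, alive)
        if pruned == alive:
            return alive
        alive = pruned


# Toy system: 2 positions, 2 rotamers each (hypothetical energies)
E_self = [[0.0, 5.0], [0.0, 1.0]]
E_pair = {(0, 1): [[0.0, 1.0], [0.5, 0.0]]}
print(dee(E_self, E_pair))  # surviving rotamers per position
```

In this toy case one rotamer survives at each position, so the GMEC is found directly; when several survive, the reduced space is handed to another search method, as the slide notes.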
Branch and Bound Algorithms (Exact)
• The protein design conformational space can be represented as a tree,
where the protein residues are ordered in an arbitrary way, and the tree
branches at each of the rotamers in a residue. Branch and bound
algorithms use this representation to efficiently explore the conformation
tree: At each branching, branch and bound algorithms bound the
conformation space and explore only the promising branches.
• Tests multiple conformational changes (global changes), retaining lowest
energy conformations. Can be very slow process.
Monte Carlo and Simulated Annealing Algorithm
(Heuristic)
• A starting structure is needed for a molecular dynamics calculation,
which is generated from all constraints for the molecular
structure, such as bond-lengths and bond-angles.
• This starting structure may be any conformation such as an extended
strand or an already folded protein.
• Starting at theoretical high temperatures (meaning energy put
into system), approximately 20 different random, simulated protein
folds are allowed to “cool” to lowest localized energies, to observe folding.
• These results are used for another set of iterations with different input
parameters, until energy can no longer be minimized (global energy
minimum is achieved).
Simons, JMB, 1997
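The "cooling" idea can be illustrated with a generic simulated annealing loop. This is a toy sketch, not the procedure of the cited paper: the conformation is a single integer, the energy function is a made-up parabola, and the acceptance rule is the standard Metropolis criterion (always accept downhill moves, accept uphill moves with probability exp(-ΔE/T)). All names and parameter values are hypothetical.

```python
import math
import random


def simulated_annealing(energy, start, neighbor, t_start=10.0, t_end=0.01,
                        cooling=0.95, steps_per_t=50, seed=0):
    """Return (best_state, best_energy) found while cooling from t_start to t_end."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    current, e_cur = start, energy(start)
    best, e_best = current, e_cur
    t = t_start
    while t > t_end:
        for _ in range(steps_per_t):
            cand = neighbor(current, rng)
            e_cand = energy(cand)
            delta = e_cand - e_cur
            # Metropolis criterion: uphill moves become rarer as t drops
            if delta <= 0 or rng.random() < math.exp(-delta / t):
                current, e_cur = cand, e_cand
                if e_cur < e_best:
                    best, e_best = current, e_cur
        t *= cooling  # lower the "temperature"
    return best, e_best


def toy_energy(x: int) -> int:
    """Hypothetical energy landscape with its minimum at x = 7."""
    return (x - 7) ** 2


def step(x: int, rng: random.Random) -> int:
    """Propose a neighboring state one unit away."""
    return x + rng.choice((-1, 1))


best, e_best = simulated_annealing(toy_energy, start=0, neighbor=step)
print(best, e_best)
```

At high temperature the search can climb out of local minima; as the temperature drops it settles into the lowest basin it has found, which is why the result is a good but not guaranteed optimum.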