Protein Structure
IST 444
Protein Chemistry Basics
• Proteins are polymers consisting of amino
acids linked by peptide bonds
• Each amino acid consists of:
– a central carbon atom
– an amino group NH2
– a carboxyl group COOH
– a side chain (R group)
• Differences in side chains distinguish
different amino acids.
[Figure: repeating backbone structure of an eight-residue peptide, with side chains labeled Asp (D), Arg (R), Val (V), Tyr (Y), Ile (I), His (H), Pro (P), Phe (F)]
Protein sequence: DRVYIHPF
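The one-letter sequence on this slide can be expanded back to three-letter codes with a lookup table. A minimal sketch in Python; the table covers only the eight residues shown on the slide, and the function name `expand` is illustrative.

```python
# Three-letter codes for the residues of the slide's example peptide only.
THREE_LETTER = {
    "D": "Asp", "R": "Arg", "V": "Val", "Y": "Tyr",
    "I": "Ile", "H": "His", "P": "Pro", "F": "Phe",
}

def expand(sequence):
    """Return the hyphen-joined three-letter form of a one-letter sequence."""
    return "-".join(THREE_LETTER[residue] for residue in sequence)

print(expand("DRVYIHPF"))
```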
Side Chains Determine Structure
• Hydrophobic residues stay inside, while hydrophilic residues stay close to water.
• Oppositely charged amino acids can form salt bridges.
• Polar amino acids can participate in hydrogen bonding.
Steps in Obtaining Protein Structure
Target selection
Obtain, characterize protein
Determine, refine, model the structure
Deposit in repository
Domain, Fold, Motif
• A protein chain can have several domains
– A domain is a discrete portion of a protein that can fold
independently and possess its own function
• The overall shape of a domain is called a fold.
There are only a few thousand possible folds.
• Sequence motif: highly conserved protein
subsequence
• Structure motif: highly conserved substructure
Protein Data Bank
Protein structures, solved using experimental techniques
Unique structural folds
Same structural folds
Different structural folds
Protein Structure Determination
• High-resolution structure determination
– X-ray crystallography (~1Å)
– Nuclear magnetic resonance (NMR) (~1-2.5Å)
• Low-resolution structure determination
– Cryo-EM (electron microscopy) ~10-15Å
X-ray crystallography
• most accurate
• An extremely pure protein sample is needed.
• The protein sample must form crystals that are relatively
large without flaws. Generally the biggest problem.
• Many proteins aren’t amenable to crystallization at all
(e.g., proteins that do their work inside of a cell
membrane).
• ~$100K per structure
Nuclear Magnetic Resonance
• Fairly accurate
• No need for crystals
• limited to small, soluble proteins only.
Protein Structure Visualization
• http://www.umass.edu/microbio/chime/top5.htm
• http://molvis.sdsc.edu/visres/
• Rasmol
• Chime
• Protein Explorer
• DeepView
• JmolJava
Secondary Structure Prediction
• Rules developed from PDB data
• Chou and Fasman (1974) developed an
algorithm based on the frequencies of amino
acids found in α-helices, β-sheets, and turns.
• Proline: occurs at turns, but not in α-helices.
• http://prowl.rockefeller.edu/aainfo/chou.htm
• Modern algorithms: use multiple sequence
alignments and achieve higher success rate
(about 70-75%)
Ramachandran Plot
a way to visualize
dihedral angles φ
(phi) against ψ (psi) of
amino acid residues
in protein structure.
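Each point on a Ramachandran plot is a pair of dihedral angles computed from four consecutive backbone atoms: φ from C(i-1), N, CA, C and ψ from N, CA, C, N(i+1). A minimal sketch of the dihedral calculation in plain Python; the function name and the test coordinates are illustrative, not taken from any structure file.

```python
import math

def dihedral(p0, p1, p2, p3):
    """Signed dihedral angle in degrees for four 3D points (0 = cis, 180 = trans)."""
    def sub(a, b):
        return tuple(x - y for x, y in zip(a, b))

    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    def cross(a, b):
        return (a[1] * b[2] - a[2] * b[1],
                a[2] * b[0] - a[0] * b[2],
                a[0] * b[1] - a[1] * b[0])

    b0, b1, b2 = sub(p1, p0), sub(p2, p1), sub(p3, p2)
    n1 = cross(b0, b1)          # normal of the plane through p0, p1, p2
    n2 = cross(b1, b2)          # normal of the plane through p1, p2, p3
    x = dot(n1, n2)
    y = math.sqrt(dot(b1, b1)) * dot(b0, n2)
    return math.degrees(math.atan2(y, x))

# Four points lying in one plane in a trans arrangement.
print(dihedral((1, 0, 0), (0, 0, 0), (0, 1, 0), (-1, 1, 0)))
```

Applying `dihedral` to every residue of a structure and plotting φ on the x-axis against ψ on the y-axis gives the plot described above.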
Chou Fasman 1974
• measured frequencies at which each amino acid
appeared in particular types of secondary structure in
a set of proteins of known structure
• assigns the amino acids three conformational
parameters based on the frequency at which they were
observed in alpha helices, beta sheets and beta turns
– P(a) = propensity to form alpha helices
– P(b) = propensity to form beta sheets
– P(turn) = propensity to form beta turns
• also assigns 4 turn parameters based on the frequency at
which they were observed in the first, second, third or
fourth position of a beta turn
– f(i) = probability of being in position 1
– f(i+1) = probability of being in position 2
– f(i+2) = probability of being in position 3
– f(i+3) = probability of being in position 4
A.A.            P(a)  P(b)  P(turn)   f(i)  f(i+1)  f(i+2)  f(i+3)
Alanine          142    83       66  0.060   0.076   0.035   0.058
Arginine          98    93       95  0.070   0.106   0.099   0.085
Asparagine        67    89      156  0.161   0.083   0.191   0.091
Aspartic acid    101    54      146  0.147   0.110   0.179   0.081
Cysteine          70   119      119  0.149   0.050   0.117   0.128
Glutamic acid    151    37       74  0.056   0.060   0.077   0.064
Glutamine        111   110       98  0.074   0.098   0.037   0.098
Glycine           57    75      156  0.102   0.085   0.190   0.152
Histidine        100    87       95  0.140   0.047   0.093   0.054
Isoleucine       108   160       47  0.043   0.034   0.013   0.056
Leucine          121   130       59  0.061   0.025   0.036   0.070
Lysine           114    74      101  0.055   0.115   0.072   0.095
Methionine       145   105       60  0.068   0.082   0.014   0.055
Phenylalanine    113   138       60  0.059   0.041   0.065   0.065
Proline           57    55      152  0.102   0.301   0.034   0.068
Serine            77    75      143  0.120   0.139   0.125   0.106
Threonine         83   119       96  0.086   0.108   0.065   0.079
Tryptophan       108   137       96  0.077   0.013   0.064   0.167
Tyrosine          69   147      114  0.082   0.065   0.114   0.125
Valine           106   170       50  0.062   0.048   0.028   0.053
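The P(a) and P(b) columns of the table can be used directly to score a stretch of sequence. The sketch below is a simplified illustration, not the full Chou-Fasman procedure: it averages propensities over a segment and flags six-residue windows in which at least four residues have P(a) above 100. The window size, the cutoff of 100, and the function names are assumptions made for this example.

```python
# P(a) and P(b) values copied from the table, keyed by one-letter code.
P_ALPHA = {"A": 142, "R": 98, "N": 67, "D": 101, "C": 70, "E": 151, "Q": 111,
           "G": 57, "H": 100, "I": 108, "L": 121, "K": 114, "M": 145, "F": 113,
           "P": 57, "S": 77, "T": 83, "W": 108, "Y": 69, "V": 106}
P_BETA = {"A": 83, "R": 93, "N": 89, "D": 54, "C": 119, "E": 37, "Q": 110,
          "G": 75, "H": 87, "I": 160, "L": 130, "K": 74, "M": 105, "F": 138,
          "P": 55, "S": 75, "T": 119, "W": 137, "Y": 147, "V": 170}

def mean_propensity(segment, table):
    """Average propensity of a segment under one of the tables above."""
    return sum(table[residue] for residue in segment) / len(segment)

def helix_nuclei(seq, window=6, min_formers=4, cutoff=100):
    """Start indices of windows with enough residues whose P(a) exceeds the cutoff."""
    starts = []
    for i in range(len(seq) - window + 1):
        formers = sum(1 for residue in seq[i:i + window] if P_ALPHA[residue] > cutoff)
        if formers >= min_formers:
            starts.append(i)
    return starts

print(helix_nuclei("GPAEMLKAGP"))
print(mean_propensity("VIY", P_ALPHA), mean_propensity("VIY", P_BETA))
```

A segment whose mean P(b) clearly exceeds its mean P(a), such as VIY here, is a better strand candidate than helix candidate under this scheme.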
Chou Fasman isn’t Perfect
• Accuracy = 50-85%, depending on the
protein
• http://npsapbil.ibcp.fr/NPSA/npsa_references.html
• Software and sites for protein predictions
GOR (Garnier, Osguthorpe and Robson)
• Another commonly used algorithm, uses a window of 17
amino acids to predict secondary structure
• rationale: experiments show each amino acid has a
significant effect on the conformation of amino acids up
to 8 positions in front or behind it.
• a collection of 25 proteins of known structure was
analyzed, and the frequency at which each amino acid
was found in helix, sheet, turn or coil within the 17
position window was determined
– this creates a 17 × 20 scoring matrix that is used to
calculate the most likely conformation of each amino
acid within the 17 a.a. window
• This window slides down the primary sequence, scoring
the most likely conformation for each amino acid based
on the neighboring amino acids.
• Accuracy is about 65%
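The sliding-window idea can be sketched as follows. For each residue, every state is scored by summing table entries over the 17 positions centred on that residue, and the highest-scoring state wins. The scoring table `TOY_INFO` is invented purely so the example runs (it pretends alanine favours helix and valine favours strand); it is not the published GOR matrix, and the function name is hypothetical.

```python
HALF_WINDOW = 8  # 8 residues on each side gives the 17-residue window

def gor_like_predict(seq, info):
    """info[state][(offset, residue)] is a score; missing entries count as 0."""
    prediction = []
    for j in range(len(seq)):
        best_state, best_score = None, float("-inf")
        for state in sorted(info):
            total = 0.0
            for offset in range(-HALF_WINDOW, HALF_WINDOW + 1):
                k = j + offset
                if 0 <= k < len(seq):  # the window is truncated at the chain ends
                    total += info[state].get((offset, seq[k]), 0.0)
            if total > best_score:
                best_state, best_score = state, total
        prediction.append(best_state)
    return "".join(prediction)

# Toy table: influence decays with distance from the central residue.
TOY_INFO = {
    "H": {(m, "A"): 1.0 / (1 + abs(m)) for m in range(-HALF_WINDOW, HALF_WINDOW + 1)},
    "E": {(m, "V"): 1.0 / (1 + abs(m)) for m in range(-HALF_WINDOW, HALF_WINDOW + 1)},
}

print(gor_like_predict("AAAAVVVV", TOY_INFO))
```

Each `info[state]` dictionary plays the role of one 17 × 20 matrix: 17 offsets by 20 residue types.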
Signal for a Coiled Region
• Gapped in multiple alignments
• Small polar residues
–Ala
–Gly (v. small so flexible)
–Ser
–Thr
• Prolines rarer in other kinds of secondary
structure
How to Find Patterns Mathematically
Hidden Markov Models
• Hidden Markov Models (HMMs) are a
more sophisticated form of profile
analysis.
• Rather than build a table of amino acid
frequencies at each position, they model
the transition from one amino acid to the
next.
• Pfam is built with HMMs.
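To make the idea of hidden states and transitions concrete, here is a toy two-state HMM decoded with the Viterbi algorithm. The states, alphabet, and all probabilities are invented for illustration; real profile HMMs such as those behind Pfam have match, insert, and delete states for every alignment column.

```python
import math

def viterbi(obs, states, start, trans, emit):
    """Most probable hidden-state path for an observed sequence (log space)."""
    score = {s: math.log(start[s]) + math.log(emit[s][obs[0]]) for s in states}
    backpointers = []
    for symbol in obs[1:]:
        new_score, pointers = {}, {}
        for s in states:
            prev = max(states, key=lambda p: score[p] + math.log(trans[p][s]))
            new_score[s] = (score[prev] + math.log(trans[prev][s])
                            + math.log(emit[s][symbol]))
            pointers[s] = prev
        score = new_score
        backpointers.append(pointers)
    # Trace back from the best final state.
    path = [max(states, key=lambda s: score[s])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    return "".join(reversed(path))

# Toy model: state H mostly emits A, state C mostly emits G (made-up numbers).
STATES = ("H", "C")
START = {"H": 0.5, "C": 0.5}
TRANS = {"H": {"H": 0.9, "C": 0.1}, "C": {"H": 0.1, "C": 0.9}}
EMIT = {"H": {"A": 0.8, "G": 0.2}, "C": {"A": 0.2, "G": 0.8}}

print(viterbi("AAAGGG", STATES, START, TRANS, EMIT))
```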
Hidden Markov Models
Sample ProDom Output
Discovery of new Motifs
• All of the tools discussed so far rely on a
database of existing domains/motifs
• How to discover new motifs
– Start with a set of related proteins
– Make a multiple alignment
– Build a pattern or profile
Depicting Structure
[Figure labels: Helix, Beta Sheet, Loop; PDB ID: 12as]
PDB New Fold Growth
Old fold
New fold
• Only a few thousand unique folds in nature
• 90% of new structures deposited to PDB in the
past three years have similar structural folds
• Secondary structure is context-dependent
• Elements may be predicted to ID topology
• Generally only 50% of a structure is alpha-helix or beta-sheet.
• Beta-strands have necessarily longer range
associations.
Secondary Structure
• Protein secondary structure takes one of
three forms:
– Alpha helix
– Beta pleated sheet
– Turn
• Secondary structure is predicted within a small
window
• Many different algorithms, not highly accurate
• Better predictions from a multiple alignment
Signals for Alpha Helices
• Amphipathic helices
interact with core and
solvent
– Characteristic
hydrophobicity profile
• Prolines disrupt the
middles of helices
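One way to put a number on an amphipathic hydrophobicity profile is the helical hydrophobic moment: each residue's hydrophobicity is treated as a vector pointing outward from the helix axis, with successive residues rotated by 100°, and the vectors are summed. The sketch below uses the Kyte-Doolittle hydropathy scale as an assumed choice of scale; the function name is illustrative.

```python
import math

# Kyte-Doolittle hydropathy values (assumed scale for this example).
KD = {"A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5, "Q": -3.5, "E": -3.5,
      "G": -0.4, "H": -3.2, "I": 4.5, "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8,
      "P": -1.6, "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2}

def hydrophobic_moment(seq, degrees_per_residue=100.0):
    """Mean hydrophobic moment per residue, assuming an ideal alpha helix."""
    x = y = 0.0
    for i, residue in enumerate(seq):
        angle = math.radians(i * degrees_per_residue)
        x += KD[residue] * math.cos(angle)
        y += KD[residue] * math.sin(angle)
    return math.hypot(x, y) / len(seq)

# A uniformly hydrophobic helix has almost no moment; a helix with one
# hydrophobic face and one charged face has a large one.
print(hydrophobic_moment("L" * 18))
print(hydrophobic_moment("LKKLLKKLLKKLLKKLLK"))
```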
Signals for beta strands
• Edge strands alternate
hydrophobic/hydrophilic
• Center strands all
hydrophobic
• Strands are extended so
few residues per core
span
Antiparallel Beta Sheet
Parallel Beta Sheet
Peptide chains have a directionality
conferred by their N-terminus and C-terminus.
β strands can be said to be
directional, indicated by an arrow
pointing toward the C-terminus.
Adjacent β strands can form hydrogen
bonds in antiparallel, parallel, or
mixed arrangements.
Antiparallel β strands alternate
directions so that the N-terminus of
one strand is adjacent to the C-terminus
of the next. This produces
the strongest inter-strand stability
because it allows the inter-strand
hydrogen bonds between carbonyls
and amines to be planar, which is
their preferred orientation.
Beta Sheet (Antiparallel)
R groups don’t form
these secondary
structures, but they can block
their formation.
The bonds forming
the structures are
from the amino and
carboxyl groups of
the amino acid
residues.
Signal for a Beta Strand
Creating Beta Sheets
• Large aromatic residues
(Tyr, Phe and Trp) and β-branched amino acids
(Thr, Val, Ile) are favored
to be found in β strands
in the middle of β sheets.
Interestingly, different
types of residues (such
as Pro) are likely to be
found in the edge strands
in β sheets
Protein Classification
• Family: homologous, same ancestor, high
sequence identity, similar structures
• Superfamily: distantly homologous, same
ancestor, sequence identity around 25%-30%,
similar structures.
• Fold: only shapes are similar, no homologous
relationship, low sequence identity.
• Protein classification databases: Pfam, SCOP,
CATH, FSSP
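The sequence-identity figures quoted for families and superfamilies refer to aligned sequences. A minimal sketch of a percent-identity calculation for two pre-aligned sequences; note that several conventions exist for the denominator, and this example (an assumption) counts only columns where neither sequence has a gap.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent identity over gap-free columns of two aligned sequences."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    pairs = [(a, b) for a, b in zip(aligned_a, aligned_b) if a != "-" and b != "-"]
    if not pairs:
        return 0.0
    matches = sum(1 for a, b in pairs if a == b)
    return 100.0 * matches / len(pairs)

print(percent_identity("ACDEFGHIKL", "ACDQFGHLKL"))
```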
Pfam
• http://www.sanger.ac.uk/Software/Pfam/
• Protein sequence classification database
• As of Pfam 24.0 (October 2009): 11,912
families
• Multiple sequence alignment for each
family, then modeled by an HMM
SCOP: Structural Classification of Proteins
http://scop.mrc-lmb.cam.ac.uk/scop/
Protein structure classification database, manually curated
110800 Domains, 38221 PDB entries
Class                            # folds   # superfamilies   # families
All alpha proteins                   284               507          871
All beta proteins                    174               354          742
Alpha and beta proteins (a/b)        147               244          803
Alpha and beta proteins (a+b)        376               552         1055
Multi-domain proteins                 66                66           89
Membrane and cell surface             58               110          123
Small proteins                        90               129          219
Total                               1195              1962         3902
SCOP
• Nearly all proteins have structural similarities with other
proteins and, in some of these cases, share a common
evolutionary origin.
• The SCOP database, created by manual inspection and
automated methods, aims to provide a detailed and
comprehensive description of the structural and
evolutionary relationships between all proteins whose
structure is known.
• SCOP provides a broad survey of all known protein
folds, detailed information about the close relatives of
any particular protein, and a framework for future
research and classification.
The Problem
• Protein functions are determined
by 3D structures
• ~ 30,000 protein structures
in PDB (Protein Data Bank)
• Experimental determination
of protein structures is time-consuming and expensive
• Many protein sequences
available
[Diagram labels: sequence, protein structure, function, medicine]
Protein Structure Prediction
• In theory, a protein structure can be solved
computationally
• A protein folds into the 3D structure that minimizes
its free energy
• The problem can be formulated as a search
problem for the minimum-energy conformation
– the search space is enormous
– the number of local minima increases exponentially with chain length
Computationally it is an exceedingly difficult problem
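A back-of-the-envelope calculation shows how quickly the search space grows. The sketch assumes, purely for illustration, that each residue can adopt three backbone conformations independently of its neighbours; the real number depends on the model used.

```python
def conformations(n_residues, states_per_residue=3):
    """Size of the search space if every residue independently picks one state."""
    return states_per_residue ** n_residues

for n in (10, 36, 100):
    print(n, "residues:", format(conformations(n), ".2e"), "conformations")
```

Even under this coarse assumption, a 100-residue chain has more than 10^47 conformations, so exhaustive enumeration is out of reach and heuristic search is needed.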
Who Cares?
• Long history: more than 30 years
• Listed as a “grand challenge” problem
• IBM’s Blue Gene
• Competitions: CASP (1994-2006)
• Useful for
– Drug design
– Function annotation
– Rational protein engineering
– Target selection
Observations
• Sequences determine structures
• Proteins fold into minimum energy state.
• Structures are more conserved than
sequences. Two proteins with 30% sequence identity
likely share the same fold.
What determines structures?
• Hydrogen bonds: essential in stabilizing the
basic secondary structures
• Hydrophobic effects: strongest determinants
of protein structures
• Van der Waals forces: stabilizing the
hydrophobic cores
• Electrostatic forces: oppositely charged side
chains form salt bridges
Protein Structure Prediction
• Stage 1: Backbone Prediction
– Ab initio folding
– Homology modeling
– Protein threading
• Stage 2: Loop Modeling
• Stage 3: Side-Chain Packing
• Stage 4: Structure Refinement
The picture is adapted from http://www.cs.ucdavis.edu/~koehl/ProModel/fillgap.html
State of The Art
• Ab initio folding (simulation-based method)
– 1998, Duan and Kollman: 36 residues, 1000 ns, 256 processors, 2 months
– Did not find the native structure
• Template-based (or knowledge-based) methods
– Homology modeling: sequence-sequence alignment,
works if sequence identity > 25%
– Protein threading: sequence-structure alignment, can
go beyond the 25% limit
Sample Structure Prediction
                 ....,....1....,....2....,....3....,....4....,....5....,....6
        AA      |MMSGAPSATQPATAETQHIADQVRSQLEEKYNKKFPVFKAVSFKSQVVAGTNYFIKVHVG|
        PHD sec |             HHHHHHHHHHHHHHHH       EEEEEEEEEEEEE EEEEEEEE  |
        Rel sec |999997899667599999999989997655877843368889999999233399999658|
detail: prH sec |000000000221289999999989998762011111000000000000000000000000|
        prE sec |000000000000000000000000000010000023578889989888536699999720|
        prL sec |999898889777600000000010001126888865311110000000363300000278|
subset: SUB sec |LLLLLLLLLLLLLHHHHHHHHHHHHHHHHLLLLL...EEEEEEEEEEE....EEEEEELL|
ACCESSIBILITY
3st:    P_3 acc |bbebbeeeeeebbeebbebbeebeeebeeeeeee eebebbebebbbbbb bbbbeb bb|
10st:   PHD acc |007006778670077007007706760777777737707007060000005000060500|
        Rel acc |103021343252044604644672424555547615444425212186671016926120|
subset: SUB acc |.......e..e..eeb.ebbeeb.e.beeeeeee.eebeb.e....bbbb...bb.b...|
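In this sample, the "SUB sec" row is consistent with keeping only positions whose reliability digit in "Rel sec" is at least 5, writing blanks (loop) as L and masking the rest with dots. The sketch below reproduces that row under two assumptions: the threshold of 5 is inferred from this one example, and the column positions of the H and E runs in "PHD sec" are reconstructed from the prH and prE rows, since the transcript lost the original spacing.

```python
# PHD sec row with assumed spacing: blank = loop, H = helix, E = strand.
PHD_SEC = " " * 13 + "H" * 16 + " " * 7 + "E" * 13 + " " + "E" * 8 + " " * 2
REL_SEC = "999997899667599999999989997655877843368889999999233399999658"

def reliable_subset(prediction, reliability, threshold=5):
    """Mask low-reliability positions with '.', and spell loop positions as 'L'."""
    out = []
    for state, digit in zip(prediction, reliability):
        if int(digit) < threshold:
            out.append(".")
        else:
            out.append("L" if state == " " else state)
    return "".join(out)

print(reliable_subset(PHD_SEC, REL_SEC))
```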
“Super-secondary” Structure
• Common structural motifs
– Membrane spanning (GCG= TransMem)
– Signal peptide (GCG= SPScan)
– Coiled coil (GCG= CoilScan)
– Helix-turn-helix (GCG = HTHScan)
Transmembrane Structures
Signal Peptide
Coiled Coil
Helix Turn Helix
Fig. 9.23
Finding Information in Protein Sequences
There Are Many Meaningful Protein Signals
• Predicting protein cleavage sites
• Predicting signal peptides
• Predicting transmembrane domains
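As a simple example of such a signal, candidate transmembrane helices can be flagged with a hydropathy window scan. The sketch uses the Kyte-Doolittle scale with a 19-residue window and a mean hydropathy cutoff of 1.6; these are commonly quoted settings, treated here as assumptions, and dedicated predictors are considerably more sophisticated.

```python
# Kyte-Doolittle hydropathy values.
KD = {"A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5, "Q": -3.5, "E": -3.5,
      "G": -0.4, "H": -3.2, "I": 4.5, "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8,
      "P": -1.6, "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2}

def tm_candidates(seq, window=19, threshold=1.6):
    """Start indices of windows whose mean hydropathy exceeds the threshold."""
    hits = []
    for start in range(len(seq) - window + 1):
        segment = seq[start:start + window]
        mean = sum(KD[residue] for residue in segment) / window
        if mean > threshold:
            hits.append(start)
    return hits

# Toy sequences: a leucine stretch flanked by charged residues, and a
# purely charged sequence with no hydrophobic stretch.
print(tm_candidates("KKKK" + "L" * 20 + "DDDD"))
print(tm_candidates("DEKR" * 6))
```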
Signal Peptides
• Proteins have intrinsic signals that govern
their transport and localization in the cell.
• Nobel Prize awarded to Günter Blobel in 1999 for
describing protein signaling.
• Proteins have to be transported either out
of the cell, or to the different
compartments - the organelles - within the
cell.
Signal Peptides
• Newly synthesized proteins have an
intrinsic signal that is essential for
governing them to and across the
membrane of the endoplasmic
reticulum, one of the cell’s organelles.
• How do large proteins traverse the
tightly sealed, lipid-containing,
membranes surrounding the
organelles?
Signal Peptides
• The signal consists of a peptide: a
sequence of amino acids in a particular
order that form an integral part of the
protein.
• Specific amino acid sequences
(topogenic signals) determine whether a
protein will pass through a membrane
into a particular organelle, become
integrated into the membrane, or be
exported out of the cell.
Signal Peptides
• Software exists that can predict the signal
peptide sequences.
• The SignalP World Wide Web server predicts
the presence and location of signal peptide
cleavage sites in amino acid sequences from
different organisms:
– Gram-positive prokaryotes
– Gram-negative prokaryotes
– Eukaryotes.
Signal Peptides
• The method incorporates a prediction of
cleavage sites and a signal peptide/non-signal
peptide prediction based on a combination of
several artificial neural networks.
• Artificial neural networks are collections of
mathematical models that emulate some of the
observed properties of biological nervous
systems and draw on the analogies of adaptive
biological learning.
Patterns in Unaligned Sequences
• Sometimes sequences may share just a
small common region
– common signal peptide
– new transcription factors
• MEME: San Diego Supercomputer Center
– http://www.sdsc.edu/MEME/meme/website/meme.html
• MEME finds motifs by expectation maximization (EM); the related Meta-MEME tool builds Hidden Markov Models from MEME motifs
Protein Secondary Structure
• CATH (Class, Architecture, Topology,
Homology)
http://www.biochem.ucl.ac.uk/dbbrowser/cath/
• SCOP (Structural Classification of Proteins): hierarchical database of protein folds
http://scop.mrc-lmb.cam.ac.uk/scop
• FSSP: fold classification using structure-structure alignment of proteins
http://www2.ebi.ac.uk/fssp/fssp.html
• TOPS: cartoon representation of topology
showing helices and strands
http://tops.ebi.ac.uk/tops/
Protein Sequence Hierarchy
SUPERFAMILY
FAMILY
DOMAIN
FOLD or MOTIF
Active SITE
RESIDUE
Protein families
• Proteins can be divided into families by:
– Sequence.
– Structure.
– Function.
• Secondary databases divide proteins into
families.
Protein families
• Types of secondary databases:
• “Curated” databases: Expert judgment of
each family (Prosite, Prints, Pfam).
• “Automated” databases: Constructed
automatically (Blocks, ProDom).
Prosite
• Characterization of protein families by
conserved motifs observed in multiple
sequence alignments of known homologues.
• Each family is defined by a single pattern.
• Motifs:
Prosite
• Each entry includes: Pattern and
sometimes also a profile.
• Pattern is a method for describing a
conserved sequence (consensus, profile).
• Sample entry
Prosite Structure
• Entries are divided into two files
– Pattern file: the pattern and all Swiss-Prot
matches.
– Documentation file: Details of the characterized
family, a description of the biological role of the
chosen motif, references.
Prosite
• Patterns are described using regular
expressions.
• Example:
W-x(9,11)-[FYV]-[FYW]-x(6,7)-[GSTNE]
• Regular expressions retain only conserved
or significant residue information
Prosite
multiple alignment
A A C T T G
A A G T C G
C A C T T C
consensus
A A C T T G
pattern
[AC]-A-[GC]-T-[TC]-[GC]
profile
     1     2     3     4     5
A    0.66  1     0     0     .
T    0     0     0     1     .
C    0.33  0     0.66  0     .
G    0     0     0.33  0     .
• Sensitivity: consensus < pattern < profile
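The three descriptions on this slide can all be derived from the same three-sequence alignment. A minimal sketch; the function names are illustrative, and letters inside brackets are emitted in alphabetical order, which is equivalent to the slide's ordering.

```python
from collections import Counter

ALIGNMENT = ["AACTTG", "AAGTCG", "CACTTC"]  # the alignment shown on the slide

def consensus(alignment):
    """Most frequent symbol in each column."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*alignment))

def pattern(alignment):
    """Every symbol observed in each column, in bracket notation."""
    elements = []
    for col in zip(*alignment):
        letters = "".join(sorted(set(col)))
        elements.append(letters if len(letters) == 1 else "[" + letters + "]")
    return "-".join(elements)

def profile(alignment):
    """Per-column symbol frequencies."""
    return [{symbol: count / len(col) for symbol, count in Counter(col).items()}
            for col in zip(*alignment)]

print(consensus(ALIGNMENT))
print(pattern(ALIGNMENT))
print(profile(ALIGNMENT)[0])
```

The consensus keeps one symbol per column, the pattern keeps the set of observed symbols, and the profile also keeps how often each was seen, which is why sensitivity increases in that order.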
Prosite Syntax
• The standard IUPAC one-letter codes.
• `x' : any amino acid.
• `[]' : residues allowed at the position.
• `{}' : residues forbidden at the position.
• `()' : repetition of a pattern element is indicated in
parentheses. x(n) or x(n,m) indicates the number or
range of repetitions.
• `-' : separates each pattern element.
• `<' : indicates an N-terminal restriction of the pattern.
• `>' : indicates a C-terminal restriction of the pattern.
• `.' : the period ends the pattern.
Prosite Syntax - Examples
• [AC]-x-V-x(4)-{ED}.
• [Ala or Cys]-any-Val-any-any-any-any-{any but
Glu or Asp}
• <A-x-[ST](2)-x(0,1)-V
• N-terminus-Ala-any-[Ser or Thr]-[Ser or Thr]-(any or none)-Val
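Because Prosite patterns are regular expressions in a different notation, they can be translated mechanically. The sketch below converts the subset of the syntax listed on these slides into a Python `re` pattern; the function name `prosite_to_regex` is hypothetical, and features not shown on the slides are not handled.

```python
import re

# One pattern element: x, a residue, [allowed], or {forbidden}, with optional (n) or (n,m).
ELEMENT = re.compile(r"(x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(?:,(\d+))?\))?")

def prosite_to_regex(pattern):
    """Translate a Prosite-style pattern into a Python regular expression."""
    pattern = pattern.rstrip(".")            # the period ends the pattern
    at_start = pattern.startswith("<")       # N-terminal restriction
    at_end = pattern.endswith(">")           # C-terminal restriction
    parts = []
    for element in pattern.strip("<>").split("-"):
        match = ELEMENT.fullmatch(element)
        if match is None:
            raise ValueError("unsupported pattern element: " + element)
        core, low, high = match.groups()
        if core == "x":
            core = "."                       # any amino acid
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"   # forbidden residues
        if low is not None:
            core += "{" + low + ("," + high if high else "") + "}"
        parts.append(core)
    return ("^" if at_start else "") + "".join(parts) + ("$" if at_end else "")

print(prosite_to_regex("[AC]-x-V-x(4)-{ED}."))
print(prosite_to_regex("<A-x-[ST](2)-x(0,1)-V"))
print(bool(re.search(prosite_to_regex("[AC]-x-V-x(4)-{ED}."), "MKACVAAAAKL")))
```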
Searching with Regular Expressions
• Ideally the pattern should only detect true
positives.
• Creating a regular expression that performs
well in database searches is a compromise
between sensitivity (few false negatives) and
specificity (few false positives).
• The fuzzier the pattern, the noisier its result, but
the greater the chances of finding distant
relatives
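The trade-off can be measured once the true family members are known. A minimal sketch that scores a pattern's database hits against a reference set; the identifiers are made up for the example.

```python
def evaluate(hits, true_members):
    """Return (sensitivity, precision) for a set of hits against known members."""
    true_positives = len(hits & true_members)
    false_positives = len(hits - true_members)
    false_negatives = len(true_members - hits)
    sensitivity = true_positives / (true_positives + false_negatives)
    precision = true_positives / (true_positives + false_positives)
    return sensitivity, precision

# Hypothetical IDs: a strict pattern misses members, a fuzzy one adds noise.
FAMILY = {"P1", "P2", "P3", "P4"}
print(evaluate({"P1", "P2"}, FAMILY))
print(evaluate({"P1", "P2", "P3", "P4", "X1", "X2"}, FAMILY))
```

The strict pattern returns only true members but finds half of them; the fuzzy pattern finds all of them at the cost of two false positives.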
Searching Prosite
• Input: protein sequence → Output: list of patterns
• Input: a pattern → Output: list of sequences
BLOCKS
• Blocks are multiply aligned un-gapped
segments corresponding to the most
highly conserved regions of proteins
Blocks
• Blocks are alignments 5-200 aa long.
• A family is characterized by a group of
blocks.
BLOCKS Construction
• Creation of BLOCKS by automatically
detecting the most highly conserved
regions of each protein family
• Blocks incorporates all known families
from the “curated” databases.
Searching Blocks
• Input: protein sequence → Output: list of blocks
• Input: a block → Output: list of sequences
InterPro
• Integrated resource of Protein Families
• Unifies a set of secondary databases
using same terminology.
• InterPro provides text and sequence
based searches.
Conclusions
• Secondary databases are useful for
characterizing protein sequences.
• Numerous databases describe protein
families.
• “Curated” databases do not include all
known families.
• Secondary databases are useful for
testing new user-defined motifs.