Protein structure prediction
Haixu Tang
School of Informatics
Basic operations in a cell
(Central Dogma)
• A gene is expressed in two steps
1) Transcription: RNA synthesis
2) Translation: Protein synthesis
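The two steps can be sketched in a few lines of Python. This is a minimal illustration, not part of the original slides: the function names are made up for the example, the input is assumed to be the coding strand of the DNA, and the codon table is the standard genetic code.

```python
# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def transcribe(coding_dna: str) -> str:
    """Step 1 (transcription): rewrite the coding strand as mRNA (T -> U)."""
    return coding_dna.upper().replace("T", "U")


def translate(mrna: str) -> str:
    """Step 2 (translation): read codons in frame until a stop codon."""
    protein = []
    for start in range(0, len(mrna) - 2, 3):
        codon = mrna[start:start + 3].replace("U", "T")
        residue = CODON_TABLE[codon]
        if residue == "*":  # stop codon ends the protein
            break
        protein.append(residue)
    return "".join(protein)


if __name__ == "__main__":
    mrna = transcribe("ATGGCCAAGTAA")
    print(mrna, translate(mrna))
```

The sketch ignores everything a real cell does around these steps (promoters, splicing, start-codon selection); it only shows the sequence-level mapping.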
Proteins are major functional
biomolecules in cells
• Examples of protein functions
– Catalysis:
Almost all chemical reactions in a living cell are catalyzed by protein enzymes.
Example: alcohol dehydrogenase oxidizes alcohols to aldehydes or ketones.
– Transport:
Some proteins transport various substances, such as oxygen and ions.
Example: haemoglobin carries oxygen.
– Information transfer:
For example, hormones.
Example: insulin controls the amount of sugar in the blood.
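Proteins are composed of amino acids
[Figure: general structure of an amino acid. A central carbon (C) is bonded to an amino group (NH3+), a carboxylic acid group (COO−), a hydrogen (H), and a side chain (R).]
Different side chains, R, determine the chemical properties of the 20 amino acids.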
20 Amino acids
Glycine (G)
Alanine (A)
Valine (V)
Isoleucine (I)
Leucine (L)
Proline (P)
Methionine (M)
Phenylalanine (F)
Tryptophan (W)
Asparagine (N)
Glutamine (Q)
Serine (S)
Threonine (T)
Tyrosine (Y)
Cysteine (C)
Lysine (K)
Arginine (R)
Histidine (H)
Aspartic acid (D)
Glutamic acid (E)
White: Hydrophobic, Green: Hydrophilic, Red: Acidic, Blue: Basic
Proteins are linear polymers of
amino acids
[Figure: amino acids (NH3+–CHR–COO−) condense, releasing H2O, to form peptide bonds: NH3+–CHR1–CO–NH–CHR2–CO–NH–CHR3–CO… A short example chain is drawn with one-letter residue codes, with the peptide bonds labeled.]
The amino acid sequence is called the primary structure
Each protein has a unique structure
Amino acid sequence:
NLKTEWPELVGKSVEE
AKKVILQDKPEAQIIVL
PVGTIVTMEYRIDRVR
LFVDKLDNIAEVPRVG
↓ folding
[Figure: folded three-dimensional structure]
Protein Structure Determination
• X-ray crystallography
– most accurate
– in vitro
– needs crystallized protein
– ~100K per structure
• Nuclear Magnetic Resonance
– fairly accurate
– in vivo, in solution
– no need for crystals
– limited to small proteins
Protein data bank
• http://www.rcsb.org/pdb/
• PDB files: atom coordinates, etc
• (1atn: actin/DNAse I complex)
ATOM   1  CA  ACE A  0   105.046  51.546  40.626  1.00 72.72   1ATN 263
ATOM   2  C   ACE A  0   105.314  50.822  41.951  1.00 72.72   1ATN 264
ATOM   3  O   ACE A  0   105.220  51.451  43.013  1.00 72.56   1ATN 265
ATOM   4  N   ASP A  1   105.665  49.507  41.867  1.00 71.64   1ATN 266
ATOM   5  CA  ASP A  1   105.992  48.589  42.982  1.00 70.20   1ATN 267
ATOM   6  C   ASP A  1   107.024  49.191  43.936  1.00 69.70   1ATN 268
ATOM   7  O   ASP A  1   106.927  49.088  45.163  1.00 69.14   1ATN 269
ATOM   8  CB  ASP A  1   106.533  47.248  42.410  1.00 70.66   1ATN 270
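As a rough illustration of how such ATOM records are read, the sketch below parses two of the records above and measures the distance between their alpha carbons. It assumes whitespace-separated fields, which holds for these lines; real PDB files are fixed-width, so a robust parser should slice by column. The names `Atom`, `parse_atom_line`, and `distance` are made up for this example.

```python
from typing import NamedTuple


class Atom(NamedTuple):
    serial: int
    name: str
    res_name: str
    chain: str
    res_seq: int
    x: float
    y: float
    z: float


def parse_atom_line(line: str) -> Atom:
    """Parse one ATOM record, assuming fields are separated by whitespace."""
    fields = line.split()
    if fields[0] != "ATOM":
        raise ValueError("not an ATOM record")
    return Atom(
        serial=int(fields[1]),
        name=fields[2],
        res_name=fields[3],
        chain=fields[4],
        res_seq=int(fields[5]),
        x=float(fields[6]),
        y=float(fields[7]),
        z=float(fields[8]),
    )


def distance(a: Atom, b: Atom) -> float:
    """Euclidean distance between two atoms, in the file's units (angstroms)."""
    return ((a.x - b.x) ** 2 + (a.y - b.y) ** 2 + (a.z - b.z) ** 2) ** 0.5


# Two alpha-carbon records copied from the 1ATN excerpt.
SAMPLE = [
    "ATOM 1 CA ACE A 0 105.046 51.546 40.626 1.00 72.72 1ATN 263",
    "ATOM 5 CA ASP A 1 105.992 48.589 42.982 1.00 70.20 1ATN 267",
]

if __name__ == "__main__":
    first, second = (parse_atom_line(line) for line in SAMPLE)
    print(f"CA-CA distance: {distance(first, second):.2f}")
```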
Visualizing protein structure
(PDB files)
Basic structural units of proteins:
Secondary structure
α-helix
β-sheet
Secondary structures, α-helix
and β-sheet, have regular
hydrogen-bonding patterns.
Three-dimensional structure of
proteins
Tertiary
structure
Quaternary structure
Hierarchical nature of protein structure
Primary structure (Amino acid sequence)
↓
Secondary structure (α-helix, β-sheet)
↓
Tertiary structure (Three-dimensional structure
formed by assembly of secondary structures)
↓
Quaternary structure (Structure formed by more
than one polypeptide chain)
Secondary Structure Prediction
• Given a protein sequence, secondary structure
prediction aims at predicting the state of each amino acid
as being either H (helix), E (extended=strand), or O
(other).
• The quality of secondary structure prediction is
measured with a “3-state accuracy” score, or Q3. Q3 is
the percent of residues that match “reality” (X-ray
structure).
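A minimal sketch of the Q3 computation, assuming the prediction and the reference are given as equal-length strings over the states H, E, and O. The function name `q3` is chosen for this example.

```python
def q3(predicted: str, observed: str) -> float:
    """Percent of residues whose predicted state matches the observed state."""
    if len(predicted) != len(observed):
        raise ValueError("prediction and reference must have the same length")
    if not observed:
        raise ValueError("empty sequence")
    matches = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * matches / len(observed)


if __name__ == "__main__":
    # 7 of 8 residues agree.
    print(q3("HHHEEOOO", "HHHEEEOO"))
```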
Early methods for Secondary Structure
Prediction
• Chou and Fasman
(Chou and Fasman. Prediction of protein conformation.
Biochemistry, 13: 211-245, 1974)
• GOR
(Garnier, Osguthorpe and Robson. Analysis of the accuracy
and implications of simple methods for predicting the
secondary structure of globular proteins. J. Mol. Biol., 120: 97-120, 1978)
Chou and Fasman
Amino Acid   α-Helix   β-Sheet   Turn
Favors α-helix
Ala          1.29      0.90      0.78
Cys          1.11      0.74      0.80
Leu          1.30      1.02      0.59
Met          1.47      0.97      0.39
Glu          1.44      0.75      1.00
Gln          1.27      0.80      0.97
His          1.22      1.08      0.69
Lys          1.23      0.77      0.96
Favors β-strand
Val          0.91      1.49      0.47
Ile          0.97      1.45      0.51
Phe          1.07      1.32      0.58
Tyr          0.72      1.25      1.05
Trp          0.99      1.14      0.75
Thr          0.82      1.21      1.03
Favors turn
Gly          0.56      0.92      1.64
Ser          0.82      0.95      1.33
Asp          1.04      0.72      1.41
Asn          0.90      0.76      1.23
Pro          0.52      0.64      1.91
No strong preference
Arg          0.96      0.99      0.88
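To show how such propensities can be turned into a prediction, here is a deliberately simplified sketch. It is not the full Chou and Fasman procedure, whose nucleation and extension rules are not given in these slides. It averages the helix and sheet propensities over a short window and picks the larger one if it exceeds 1.0. The window size and the 1.0 cutoff are assumptions made for this example.

```python
# (helix, sheet, turn) propensities from the table, keyed by one-letter code.
PROPENSITY = {
    "A": (1.29, 0.90, 0.78), "C": (1.11, 0.74, 0.80), "L": (1.30, 1.02, 0.59),
    "M": (1.47, 0.97, 0.39), "E": (1.44, 0.75, 1.00), "Q": (1.27, 0.80, 0.97),
    "H": (1.22, 1.08, 0.69), "K": (1.23, 0.77, 0.96), "V": (0.91, 1.49, 0.47),
    "I": (0.97, 1.45, 0.51), "F": (1.07, 1.32, 0.58), "Y": (0.72, 1.25, 1.05),
    "W": (0.99, 1.14, 0.75), "T": (0.82, 1.21, 1.03), "G": (0.56, 0.92, 1.64),
    "S": (0.82, 0.95, 1.33), "D": (1.04, 0.72, 1.41), "N": (0.90, 0.76, 1.23),
    "P": (0.52, 0.64, 1.91), "R": (0.96, 0.99, 0.88),
}


def predict(seq: str, half_window: int = 2) -> str:
    """Assign H, E, or O to each residue from windowed mean propensities."""
    states = []
    for i in range(len(seq)):
        lo = max(0, i - half_window)
        hi = min(len(seq), i + half_window + 1)
        window = seq[lo:hi]  # shorter near the sequence ends
        helix = sum(PROPENSITY[r][0] for r in window) / len(window)
        sheet = sum(PROPENSITY[r][1] for r in window) / len(window)
        if max(helix, sheet) <= 1.0:
            states.append("O")  # neither helix nor strand is favored
        elif helix >= sheet:
            states.append("H")
        else:
            states.append("E")
    return "".join(states)


if __name__ == "__main__":
    print(predict("MTYKLILNGKTKGETTTEAVDAATAEKVFQYANDNGVDGEWTYTE"))
```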
The GOR method
For each position j in the sequence, eight residues on either side are
considered (a window of 17 residues centered on j).
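The window idea can be sketched as follows. Each residue in the window contributes a score toward each state at position j, and the state with the largest total is predicted. In GOR these contributions are information values estimated from known structures; the tiny `TOY_INFO` table below is invented purely so the example runs, and the function names are likewise hypothetical.

```python
def window(seq: str, j: int, flank: int = 8):
    """(offset, residue) pairs for positions j-flank..j+flank that exist."""
    return [
        (m, seq[j + m])
        for m in range(-flank, flank + 1)
        if 0 <= j + m < len(seq)
    ]


def gor_score(seq: str, j: int, info: dict) -> float:
    """Sum the contributions of all window residues toward one state."""
    return sum(info.get((m, r), 0.0) for m, r in window(seq, j))


def predict_state(seq: str, j: int, info_by_state: dict) -> str:
    """Pick the state with the highest summed score at position j."""
    return max(info_by_state, key=lambda s: gor_score(seq, j, info_by_state[s]))


# Invented values, NOT real GOR parameters: keys are (offset, residue).
TOY_INFO = {
    "H": {(0, "A"): 0.5, (1, "L"): 0.2},
    "E": {(0, "V"): 0.6},
    "O": {},
}

if __name__ == "__main__":
    print(predict_state("ALKV", 0, TOY_INFO))
```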
Accuracy
• Both Chou and Fasman and GOR have been
assessed and their accuracy is estimated to be
Q3=60-65%.
Neural networks
The most successful methods for predicting secondary structure
are based on neural networks. The overall idea is that neural
networks can be trained to recognize amino acid patterns in
known secondary structure units, and to use these patterns to
distinguish between the different types of secondary structure.
Neural networks classify “input vectors” or “examples” into
categories (2 or more).
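One common way to build such an input vector, shown here as an assumption rather than as the encoding used by any particular method, is to one-hot encode a window of residues around the position being classified. The sketch uses a 17-residue window and 21 slots per position (20 amino acids plus one padding slot for positions past the sequence ends).

```python
ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids
PAD = len(ALPHABET)                # slot 20 marks "beyond the sequence end"


def encode_window(seq: str, center: int, flank: int = 8) -> list:
    """Flatten a one-hot encoded window into a single input vector."""
    vec = []
    for pos in range(center - flank, center + flank + 1):
        one_hot = [0] * (len(ALPHABET) + 1)
        if 0 <= pos < len(seq):
            # str.index raises ValueError for non-standard residue letters
            one_hot[ALPHABET.index(seq[pos])] = 1
        else:
            one_hot[PAD] = 1
        vec.extend(one_hot)
    return vec


if __name__ == "__main__":
    v = encode_window("MTYKLILNGK", 0)
    print(len(v), sum(v))
```

A network would take this vector as input and output one score per category (H, E, O) for the central residue.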
Protein 3D Structure Prediction
• In theory, a protein structure can be predicted
computationally
• A protein folds into a 3D structure that minimizes its
free energy
• The problem can be formulated as a search
problem for minimum energy
– the search space is enormous even for small proteins!
– the number of local minima increases exponentially with the size of
the protein
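A back-of-the-envelope calculation shows why exhaustive search is hopeless. The numbers are illustrative assumptions, not measurements: each residue is allowed only 3 backbone conformations, and a computer is assumed to evaluate 10^12 conformations per second.

```python
def conformation_count(n_residues: int, states_per_residue: int = 3) -> int:
    """Number of chain conformations if residues vary independently."""
    return states_per_residue ** n_residues


def exhaustive_search_years(
    n_residues: int,
    states_per_residue: int = 3,
    evals_per_second: float = 1e12,
) -> float:
    """Years needed to enumerate every conformation at the given rate."""
    seconds = conformation_count(n_residues, states_per_residue) / evals_per_second
    return seconds / (3600 * 24 * 365)


if __name__ == "__main__":
    for n in (10, 50, 100):
        print(n, conformation_count(n), f"{exhaustive_search_years(n):.3g} years")
```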
Computational Methods for
Protein 3D Structure Prediction
• Comparative modeling
– Protein threading – make structure prediction through
identification of “good” sequence-structure fit
– Homology modeling – identification of homologous
proteins through sequence alignment; structure
prediction through placing residues into “corresponding”
positions of homologous structure models
Protein Threading
• Find the “correct” sequence-structure
alignment between a target sequence and
its native-like fold in PDB
• Energy function – knowledge (or statistics)
based rather than physics based
– Should be able to distinguish correct structural
folds from incorrect structural folds
– Should be able to distinguish correct
sequence-fold alignment from incorrect
sequence-fold alignments
Protein Threading
• Structure database
• Fitness function
• Sequence-structure alignment algorithm
• Prediction reliability assessment
Protein Threading – structure database
• Build a template database
Protein Threading – fitness function
MTYKLILNGKTKGETTTEAVDAATAEKVFQYANDNGVDGEWTYTE
• E_p: how preferable it is to put two particular residues nearby
• E_s: how well a residue fits a structural environment
• E_g: alignment gap penalty
Find a sequence-structure alignment that minimizes the energy function
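A sketch of how the three terms might be combined for one candidate alignment. The plain unweighted sum E = E_p + E_s + E_g, the data layout, and every score value are assumptions made for this illustration; real threading programs use statistically derived tables and tuned weights.

```python
def threading_energy(seq, alignment, environments, contacts,
                     singleton, pairwise, gap_penalty=1.0):
    """Energy of one alignment.

    alignment[i] is the template position of residue i, or None if unaligned.
    contacts lists pairs of template positions that are close in space.
    """
    placed = {}  # template position -> residue placed there
    e_s = 0.0
    e_g = 0.0
    for residue, pos in zip(seq, alignment):
        if pos is None:
            e_g += gap_penalty  # residue left out of the template
        else:
            placed[pos] = residue
            e_s += singleton.get((residue, environments[pos]), 0.0)
    # template positions that received no residue also count as gaps
    e_g += gap_penalty * (len(environments) - len(placed))
    e_p = sum(
        pairwise.get(tuple(sorted((placed[a], placed[b]))), 0.0)
        for a, b in contacts
        if a in placed and b in placed
    )
    return e_p + e_s + e_g


# Invented toy scores, for illustration only.
ENVIRONMENTS = ["strand_buried", "helix_exposed"]
CONTACTS = [(0, 1)]
SINGLETON = {("V", "strand_buried"): -1.0, ("E", "helix_exposed"): -0.5}
PAIRWISE = {("E", "V"): 0.3}

if __name__ == "__main__":
    print(threading_energy("VE", [0, 1], ENVIRONMENTS, CONTACTS,
                           SINGLETON, PAIRWISE))
```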
Protein Threading
(sequence-structure alignment)
• Unlike sequence-sequence alignment where amino acids are
aligned, a sequence-structure alignment aligns amino acids with
structural environments
• A simple definition of structural environment
– secondary structure: alpha-helix, beta-strand, loop
– solvent accessibility: 0, 10, 20, …, 100% of accessibility
– each combination of secondary structure and solvent accessibility level
defines a structural environment
• E.g., (alpha-helix, 30%), (loop, 80%), …
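The definition above can be written down directly. In this sketch a measured accessibility is rounded to the nearest 10% level; that rounding rule and the names used are assumptions made for the example.

```python
SECONDARY = ("alpha-helix", "beta-strand", "loop")
ACCESS_LEVELS = tuple(range(0, 101, 10))  # 0, 10, ..., 100


def environment(secondary: str, accessibility_percent: float):
    """Map a position to its (secondary structure, accessibility level) pair."""
    if secondary not in SECONDARY:
        raise ValueError(f"unknown secondary structure: {secondary}")
    if not 0 <= accessibility_percent <= 100:
        raise ValueError("accessibility must be between 0 and 100")
    level = int(round(accessibility_percent / 10.0)) * 10
    return (secondary, level)


# Every combination of the 3 secondary structures and 11 levels.
ALL_ENVIRONMENTS = [(s, a) for s in SECONDARY for a in ACCESS_LEVELS]

if __name__ == "__main__":
    print(environment("alpha-helix", 32), len(ALL_ENVIRONMENTS))
```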
Protein Threading -- algorithm
• Threading algorithm – finds the sequence-structure
alignment that minimizes the fitness function
[Figure: a sequence threaded onto a fold; labels: sequence, fold, links]
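A minimal dynamic-programming sketch of such an alignment. It is a simplification, not the algorithm of any specific threading program: it keeps only the environment-fitness term E_s and a linear gap penalty E_g, and drops the pairwise term E_p so that ordinary global alignment applies. The score table in the usage example is invented.

```python
def thread(seq, environments, singleton, gap_penalty=1.0):
    """Align residues to template positions, minimizing E_s + E_g.

    Returns (energy, alignment), where alignment is a list of
    (sequence index, template position) pairs.
    """
    n, m = len(seq), len(environments)
    # best[i][j]: minimum energy aligning seq[:i] with environments[:j]
    best = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        best[i][0] = i * gap_penalty
    for j in range(1, m + 1):
        best[0][j] = j * gap_penalty

    def fit(i, j):
        return singleton.get((seq[i - 1], environments[j - 1]), 0.0)

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best[i][j] = min(
                best[i - 1][j - 1] + fit(i, j),  # place residue i at position j
                best[i - 1][j] + gap_penalty,    # skip residue i
                best[i][j - 1] + gap_penalty,    # skip template position j
            )

    # Trace back from the corner to recover the aligned pairs.
    alignment = []
    i, j = n, m
    while i > 0 and j > 0:
        if best[i][j] == best[i - 1][j - 1] + fit(i, j):
            alignment.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif best[i][j] == best[i - 1][j] + gap_penalty:
            i -= 1
        else:
            j -= 1
    alignment.reverse()
    return best[n][m], alignment


if __name__ == "__main__":
    toy_scores = {("V", "strand"): -1.0, ("E", "helix"): -0.5}  # invented
    print(thread("AVE", ["strand", "helix"], toy_scores))
```

Once pairwise contacts are included, the choice at one position depends on choices elsewhere in the fold, so this simple recurrence no longer suffices.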
CASP
• CASP = Critical Assessment of Structure
Prediction
• First held in 1994, every 2 years
afterwards
• Teams make structure predictions from
sequences alone
CASP
• Two categories of predictors
– Automated
• Automatic Servers, must complete analysis within
48 hours
• Shows what is possible through computer analysis
alone
– Non-automated
• Groups spend considerable time and effort on
each target
• Utilize computer techniques and human analysis
techniques
CASP
• CASP6, 2004
– 200 prediction teams from 24 countries
– Over 30,000 predictions for 64 protein targets
collected and evaluated
– Conference held afterwards to discuss results, with
many teams presenting individual results and
methodologies
– Helps to steer future work