COT 6930
HPC and Bioinformatics
Protein Structure Prediction
Xingquan Zhu
Dept. of Computer Science and Engineering
[Figure: the flow from DNA to phenotype and the associated databases. Genomic DNA databases (DNA); transcription to RNA (cDNA, ESTs, UniGene; gene expression databases); translation to protein (protein sequence databases, protein structure databases); phenotype]
Outline
 • Protein Structure
    • Why structure
    • How to predict protein structure
       • Experimental methods
       • Computational methods (predictive methods)
 • Protein Structure Prediction
    • Secondary structure prediction (2D)
       • Machine learning methods for protein secondary structure prediction
    • Tertiary structure prediction (3D)
       • Ab initio
       • Homology modeling
Proteins


Proteins play a crucial role in virtually all biological processes, with a
broad range of functions.
The activity of an enzyme or the function of a protein is governed by
its three-dimensional structure.
Protein Structure is Hierarchical
Protein Structure Video
http://www.youtube.com/watch?v=lijQ3a8yUYQ
Primary Structure: Sequence

The primary structure of a protein is the amino acid sequence
Protein Structure Prediction Problem
Protein structure prediction




Predict protein 3D structure from (amino acid) sequence
One step closer to useful biological knowledge
Sequence → secondary structure → 3D structure → function
Why Predict Structure?
Structure is more conserved than sequence
Structure determines function
Goals:
1. Predict structure from sequence
2. Predict function based on structure
3. Predict function based on sequence
(Figure label: molecular function)
Why predict structure: Structure is
more conserved than sequence
28% sequence identity
Why predict structure: Can Label
Proteins by Dominant Structure

SCOP: Structural Classification Of Proteins
Why predict structure: Large number of
proteins vs. relatively small number of folds

Small number of unique folds found in practice

90% of proteins fall into fewer than 1,000 folds; an estimated ~4,000 folds in total
PDB (http://www.rcsb.org/pdb/home/home.do): 48,878 structures as of 02/05/2008
Examples of Fold Classes
How to Predict Protein Structure

A related biological question: what are the factors that
determine a structure?



Energy
Kinematics
How can we determine structure?

Experimental methods
 • X-ray crystallography or NMR (nuclear magnetic resonance) spectroscopy
 • Limitations: protein size; X-ray crystallography requires crystallized proteins
Computational methods (predictive methods)
 • 2-D structure (secondary structure)
 • 3-D structure (tertiary structure)
Geometry of Protein Structure
[Figure: protein backbone geometry with the rotatable bonds marked]
Inter-atomic Forces

Covalent bond (short range, very strong)
 • Binds atoms into molecules / macromolecules
Hydrogen bond (short range, strong)
 • Binds two polar groups (hydrogen + electronegative atom)
Disulfide bond / bridge (short range, very strong)
 • Covalent bond between sulfhydryl (sulfur + hydrogen) groups
Hydrophobic / hydrophilic interaction (weak)
 • Hydrogen bonding w/ H2O in solution
Van der Waals interaction (very weak)
 • Nonspecific electrostatic attractive force
Types of Inter-atomic Forces
Quick Overview of Energy
Bond                        Strength (kcal/mole)
H-bonds                     3-7
Ionic bonds                 10
Hydrophobic interactions    1-2
Van der Waals interactions  1
Disulfide bridge            51
Protein Folding Animation


http://www.youtube.com/watch?v=fvBO3TqJ6FE
http://www.youtube.com/watch?v=swEc_sUVz5I
Two Related Problems in
Structure Prediction


Directly predicting protein structure from the
amino acid sequence has proved elusive
Two sub-problems


Secondary Structure Prediction
Tertiary Structure Prediction
Secondary Structure Prediction (2D)

For each residue in a protein structure, there are three possible states: α (α-helix), β (β-strand), and t (others).
[Figure: an amino acid sequence aligned with its secondary structure sequence]

As of 2000, the accuracy of secondary structure prediction methods is nearly 80%.

Secondary structure prediction can provide useful information to
improve other sequence and structure analysis methods, such as
sequence alignment and 3-D modeling.
http://bioinf.cs.ucl.ac.uk/psipred/psiform.html
PSSP: Protein Secondary
Structure Prediction

Three Generations
 • Based on statistical information of single amino acids
 • Based on local amino acid interactions (segments). Typically a segment contains 11-21 amino acids
 • Based on evolutionary information of the homologous sequences
Secondary Structure Preferences for
Amino Acids
The normalized frequencies for each conformation were calculated from the fraction of residues of each amino acid that occurred in that conformation, divided by this fraction for all residues.
Random occurrence of a particular amino acid in a conformation would give a value of unity. A value greater than unity indicates a preference for a particular type of secondary structure.
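The normalized frequency described above can be sketched in a few lines of Python. This is only an illustration: the function name `propensities`, the state letters H/E/C, and the toy sequence are assumptions made for the example, not data from the slides.

```python
from collections import Counter

def propensities(sequence, structure):
    """Normalized frequency of each amino acid in each conformation.

    value = (fraction of residues of this amino acid found in the state)
            / (fraction of all residues found in the state)
    """
    if len(sequence) != len(structure):
        raise ValueError("sequence and structure must have equal length")
    total = len(sequence)
    aa_counts = Counter(sequence)
    state_counts = Counter(structure)
    pair_counts = Counter(zip(sequence, structure))
    table = {}
    for (aa, state), n in pair_counts.items():
        frac_aa_in_state = n / aa_counts[aa]
        frac_all_in_state = state_counts[state] / total
        table[(aa, state)] = frac_aa_in_state / frac_all_in_state
    return table

# Toy data, made up for illustration: H = helix, E = strand, C = other.
seq = "AAAAGGVV"
sst = "HHHCCCEE"
table = propensities(seq, sst)
print(table[("A", "H")])  # greater than 1: A prefers helix in this toy data
```

A value above 1 for a pair such as (A, H) means the amino acid occurs in that conformation more often than an average residue does; a value below 1 means it occurs less often.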
Machine learning methods for Protein
Secondary Structure Prediction



Introduction to classification
Generalize protein secondary structure prediction
as a machine learning problem
Introduction to Neural Network
Classification and Classifiers

Given a database table DB with a set of attribute values and a special attribute C, called the class label.
Example:
A1   A2   A3   A4   C
1    1    m    g    Tumor
0    1    v    g    Normal
1    0    m    b    Normal
Classification and Classifiers

An algorithm is called a classification algorithm if it uses
the data to build a set of patterns
 • Decision rules or decision trees, etc.
 • Those patterns are structured in such a way that we can use them to classify unknown sets of objects (unknown records).
For that reason (because of the goal), the classification
algorithm is often simply called a classifier.
Classifier Example
Classification and Classifiers

Building a classifier consists of two phases: training and testing.
 • In both phases we use data (a training data set and a disjoint test data set) for which the class labels are known for ALL of the records.
 • Training: use the training data set to create patterns (rules, trees, or to train a neural network).
 • Testing: evaluate the created patterns with the test data, whose classification is known.
The measure of a trained classifier's accuracy is called
predictive accuracy.
Predictive Accuracy Evaluation
The main methods of predictive accuracy evaluation are:
 • Re-substitution (N ; N)
 • Holdout (2N/3 ; N/3)
 • x-fold cross-validation (N-N/x ; N/x)
 • Leave-one-out (N-1 ; 1)
where N is the number of instances in the dataset and each pair gives (training set size ; test set size).

The process of building and evaluating a classifier is also
called supervised learning or, more recently when dealing with
large databases, a classification method in data mining.
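The train/test sizes listed for these evaluation methods can be made concrete with a small sketch. The helper names `holdout_split`, `cross_validation_folds`, and `leave_one_out` are hypothetical, and the contiguous (unshuffled) fold layout is a simplifying assumption.

```python
def holdout_split(n):
    """Holdout: about 2N/3 instances for training and N/3 for testing."""
    n_test = n // 3
    return n - n_test, n_test

def cross_validation_folds(n, x):
    """Yield (train_indices, test_indices) for x-fold cross-validation."""
    indices = list(range(n))
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n // x + (1 if i < n % x else 0) for i in range(x)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size

def leave_one_out(n):
    """Leave-one-out is x-fold cross-validation with x = N."""
    return cross_validation_folds(n, n)

folds = list(cross_validation_folds(12, 4))
for train, test in folds:
    print(len(train), len(test))  # N - N/x training, N/x test instances
```

In practice the instances would usually be shuffled before splitting; that step is left out here to keep the index arithmetic visible.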
Classification Models: Different
Classifiers
Typical classification models
 • Decision Trees (ID3, C4.5)
 • Nearest Neighbors
 • Support Vector Machines
 • Neural Networks

Most of the best classifiers for PSSP are based on
the neural network model
Demonstration
How to generalize protein secondary structure
prediction as a machine learning problem?

Use a sliding window that moves along the amino acid sequence
 • Each window denotes an instance
 • Each amino acid inside the window denotes an attribute
 • The known secondary structure of the central amino acid is the class label
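A minimal sketch of the sliding-window encoding follows. The function name `sliding_windows`, the padding character `-` for positions beyond the sequence ends, and the toy sequence are assumptions; the slides do not say how the sequence ends are handled.

```python
def sliding_windows(sequence, structure, window=5, pad="-"):
    """Turn a labeled sequence into (attributes, class_label) instances.

    Each instance holds the residues in a window centered on one position;
    the class label is the known secondary structure of the central residue.
    """
    if window % 2 == 0:
        raise ValueError("window size must be odd so it has a central residue")
    if len(sequence) != len(structure):
        raise ValueError("sequence and structure must have equal length")
    half = window // 2
    padded = pad * half + sequence + pad * half  # assumed end handling
    instances = []
    for i, label in enumerate(structure):
        attributes = list(padded[i:i + window])
        instances.append((attributes, label))
    return instances

# Toy example, made up for illustration (H = helix, C = other).
seq = "MKVLA"
sst = "CHHHC"
examples = sliding_windows(seq, sst, window=3)
for attributes, label in examples:
    print(attributes, label)
```

Each residue yields exactly one instance, so a protein of length L contributes L training examples regardless of the window size.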
How to generalize protein secondary structure
prediction as a machine learning problem?

 • A set of “examples” is generated from sequences with known secondary structures
 • The examples form a training set
 • Build a neural network classifier
 • Apply the classifier to a sequence with unknown secondary structure
Introduction to Neural Network

What is an artificial Neural Network?

An extremely simplified model of the brain


Essentially a function approximator
Transforms inputs into outputs to the best of its ability
Introduction to Neural Network

Composed of many “neurons” that co-operate to
perform the desired function
How Do Neural Networks Work?

 • A neuron (perceptron) is a single-layer NN
 • The output of a neuron is a function of the weighted sum of the inputs plus a bias
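The neuron's computation can be written out directly. This is a sketch only: the weights, the bias, and the threshold activation used here are illustrative assumptions.

```python
def step(x):
    """Threshold activation: 1 if the input is non-negative, else 0."""
    return 1 if x >= 0 else 0

def neuron_output(inputs, weights, bias, activation):
    """Apply the activation to the weighted sum of the inputs plus the bias."""
    weighted_sum = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(weighted_sum)

# Hypothetical weights and bias, chosen only to show the arithmetic:
# 0.5*1 + (-0.2)*0 + 0.3*1 - 1.0 is negative, so the neuron outputs 0.
y = neuron_output([1, 0, 1], [0.5, -0.2, 0.3], -1.0, step)
print(y)
```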
Activation Function

Binary activation function
 • f(x) = 1 if x >= 0
 • f(x) = 0 otherwise
The most common sigmoid function used is the
logistic function
 • f(x) = 1/(1 + e^(-x))
The calculation of derivatives is important for neural
networks, and the logistic function has a very nice
derivative
 • f'(x) = f(x)(1 - f(x))
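The derivative identity for the logistic function can be checked numerically. The sketch below compares the closed form with a central finite difference; the helper `numerical_derivative` and the step size `h` are assumptions made for the check.

```python
import math

def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))

def logistic_derivative(x):
    """Closed form: f'(x) = f(x) * (1 - f(x))."""
    fx = logistic(x)
    return fx * (1.0 - fx)

def numerical_derivative(f, x, h=1e-6):
    """Central finite difference, used only to check the closed form."""
    return (f(x + h) - f(x - h)) / (2 * h)

for x in (-2.0, 0.0, 1.5):
    print(x, logistic_derivative(x), numerical_derivative(logistic, x))
```

The practical benefit is that the derivative reuses the already computed output f(x), so no extra exponential is needed when training.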
Where Do The Weights Come
From?


The weights in a neural network are the most
important factor in determining its function
Training is the act of presenting the network with
some sample data and modifying the weights to
better approximate the desired function

Supervised Training


Supplies the neural network with inputs and the desired
outputs
Response of the network to the inputs is measured

The weights are modified to reduce the difference between the
actual and desired outputs
Perceptron Example

Simplest neural network with the ability to learn
 • Made up of only input neurons and output neurons
 • Output neurons use a simple threshold activation function
 • In basic form, can only solve linear problems
    • Limited applications
Perceptron Example

Perceptron weight updating
 • If the output is not correct, the weights are adjusted according to the formula:
   w_new = w_old + α·(desired - output)·input, where α is the learning rate
Assume the given instance is {(1,0,1), 0}: inputs (1,0,1) with desired output 0
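One update step on the instance {(1,0,1), 0} can be traced in code. The starting weights, the bias, and the value of the multiplier in front of (desired - output), called `rate` here, are hypothetical, since the slide's figure is not available in the transcript. Updating the bias as if it were a weight on a constant input of 1 is also an assumption.

```python
def step(x):
    """Threshold activation: 1 if the input is non-negative, else 0."""
    return 1 if x >= 0 else 0

def perceptron_update(weights, bias, inputs, desired, rate):
    """Apply w_new = w_old + rate * (desired - output) * input once."""
    output = step(sum(w * x for w, x in zip(weights, inputs)) + bias)
    error = desired - output
    new_weights = [w + rate * error * x for w, x in zip(weights, inputs)]
    new_bias = bias + rate * error  # assumption: bias acts on a constant input 1
    return new_weights, new_bias, output

weights, bias = [0.5, 0.2, 0.8], 0.0  # hypothetical starting values
inputs, desired = (1, 0, 1), 0        # the instance {(1,0,1), 0}
new_w, new_b, out = perceptron_update(weights, bias, inputs, desired, rate=0.5)
print(out, new_w, new_b)
```

With these starting values the perceptron initially outputs 1 instead of 0, so the weights on the two active inputs are lowered, while the weight on the zero input is unchanged.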
Multi-Layer Feedforward NN

An extension of the perceptron
 • Multiple layers
    • The addition of one or more “hidden” layers in between the input and output layers
 • Activation function is not simply a threshold
    • Usually a sigmoid function
 • A general function approximator
    • Not limited to linear problems
Information flows in one direction
 • The outputs of one layer act as inputs to the next layer
Multi-Layer Feedforward NN
Example

XOR problem
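A single threshold unit cannot compute XOR, but one hidden layer is enough. The sketch below uses hand-picked weights and threshold units rather than a trained sigmoid network; both choices are assumptions made to keep the example checkable by hand.

```python
def step(x):
    """Threshold activation: 1 if the input is non-negative, else 0."""
    return 1 if x >= 0 else 0

def xor_network(x1, x2):
    """Two hidden units (OR and AND) feeding one output unit."""
    h_or = step(x1 + x2 - 0.5)       # fires if at least one input is 1
    h_and = step(x1 + x2 - 1.5)      # fires only if both inputs are 1
    return step(h_or - h_and - 0.5)  # OR but not AND, which is XOR

for a in (0, 1):
    for b in (0, 1):
        print(a, b, xor_network(a, b))
```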
Back-propagation

Searches for weight values that minimize the
total error of the network over the set of
training examples


Forward pass: Compute the outputs of all units in the
network, and the error of the output layers.
Backward pass: The network error is used for
updating the weights (credit assignment problem).
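The forward and backward passes can be sketched for a tiny 2-2-1 sigmoid network on the XOR data. This is an illustrative sketch, not the lecture's implementation: the starting weights, the learning rate, the squared-error measure, and the batch update (gradients summed over all examples before each step) are assumptions. The code only shows that the total error goes down; it does not claim that this short run fully learns XOR.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def forward(net, x):
    """Forward pass: hidden activations and the network output."""
    hidden = [sigmoid(sum(w * xi for w, xi in zip(ws, x)) + b)
              for ws, b in zip(net["w_hidden"], net["b_hidden"])]
    output = sigmoid(sum(w * h for w, h in zip(net["w_out"], hidden))
                     + net["b_out"])
    return hidden, output

def total_error(net, data):
    return 0.5 * sum((forward(net, x)[1] - t) ** 2 for x, t in data)

def train_step(net, data, rate):
    """One batch gradient-descent step using back-propagated errors."""
    g_w_hidden = [[0.0] * len(ws) for ws in net["w_hidden"]]
    g_b_hidden = [0.0] * len(net["b_hidden"])
    g_w_out = [0.0] * len(net["w_out"])
    g_b_out = 0.0
    for x, target in data:
        hidden, output = forward(net, x)
        # Backward pass: error signal at the output unit...
        delta_out = (output - target) * output * (1.0 - output)
        g_b_out += delta_out
        for j, h in enumerate(hidden):
            g_w_out[j] += delta_out * h
            # ...propagated back to hidden unit j through its output weight.
            delta_h = delta_out * net["w_out"][j] * h * (1.0 - h)
            g_b_hidden[j] += delta_h
            for i, xi in enumerate(x):
                g_w_hidden[j][i] += delta_h * xi
    net["w_out"] = [w - rate * g for w, g in zip(net["w_out"], g_w_out)]
    net["b_out"] -= rate * g_b_out
    net["b_hidden"] = [b - rate * g for b, g in zip(net["b_hidden"], g_b_hidden)]
    net["w_hidden"] = [[w - rate * g for w, g in zip(ws, gs)]
                       for ws, gs in zip(net["w_hidden"], g_w_hidden)]

xor_data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]
net = {"w_hidden": [[0.5, -0.4], [0.3, 0.8]], "b_hidden": [0.1, -0.2],
       "w_out": [0.7, -0.6], "b_out": 0.05}  # hypothetical starting weights
error_before = total_error(net, xor_data)
for _ in range(200):
    train_step(net, xor_data, rate=0.2)
error_after = total_error(net, xor_data)
print(error_before, error_after)
```

The hidden-unit error signal `delta_h` is where credit assignment happens: each hidden unit is blamed in proportion to the weight that connects it to the output.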
NN for Protein Secondary Structure Prediction
Ab initio Prediction

Sampling the global conformation space
 • Lattice models / Discrete-state models
 • Molecular Dynamics

Picking native conformations with an energy function
 • Solvation model: how protein interacts with water
 • Pair interactions between amino acids
Lattice String Folding

HP model: main modeled force is hydrophobic attraction
 • Amino acids are classified into two types: Hydrophobic (H) or Polar (P)
 • NP-hard on both the 2-D square and 3-D cubic lattices
 • Constant-factor approximation algorithms exist
 • Not so relevant biologically
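The HP model's scoring can be sketched for the 2-D square lattice. The convention used here, an energy of -1 for every pair of H residues that are lattice neighbors but not chain neighbors, is the commonly used one and is an assumption, since the slides do not state the scoring. The function name and the toy conformations are also illustrative.

```python
def hp_energy(sequence, coords):
    """Energy of an HP chain placed on the 2-D square lattice.

    Each pair of H residues that sit on adjacent lattice sites without
    being consecutive in the chain contributes -1 (assumed convention).
    """
    if len(sequence) != len(coords):
        raise ValueError("one lattice site is needed per residue")
    if len(set(coords)) != len(coords):
        raise ValueError("the chain must be self-avoiding")
    for (xa, ya), (xb, yb) in zip(coords, coords[1:]):
        if abs(xa - xb) + abs(ya - yb) != 1:
            raise ValueError("consecutive residues must be lattice neighbors")
    energy = 0
    for i in range(len(sequence)):
        for j in range(i + 2, len(sequence)):  # skip chain neighbors
            if sequence[i] == "H" and sequence[j] == "H":
                (xi, yi), (xj, yj) = coords[i], coords[j]
                if abs(xi - xj) + abs(yi - yj) == 1:
                    energy -= 1
    return energy

seq = "HPPH"
folded = [(0, 0), (1, 0), (1, 1), (0, 1)]    # the two H residues touch
extended = [(0, 0), (1, 0), (2, 0), (3, 0)]  # straight chain, no contacts
print(hp_energy(seq, folded), hp_energy(seq, extended))
```

Folding in this model means searching over self-avoiding walks for the conformation with the lowest energy, which is the search problem that is NP-hard.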
Lattice String Folding
Energy Minimization

Many forces act on a protein
 • Hydrophobic: inside of protein wants to avoid water
    • Hydrophobic molecules associate with each other in a water solvent as if the water molecules were repellent to them. It is like oil/water separation.
 • Packing: atoms can't be too close, nor too far away
    • van der Waals interactions
 • Bond angle/length constraints
 • Long distance, e.g.
    • Electrostatics & hydrogen bonds
    • Disulphide bonds
    • Salt bridges
Can calculate all of these forces, and minimize
 • Intractable in the general case, but can be useful
Molecular Dynamics (MD)
In molecular dynamics simulation, we simulate motions of atoms as a function of
time according to Newton’s equation of motion. The equations for a system
consisting of N atoms can be written as
$$ m_i \frac{d^2 \mathbf{r}_i(t)}{dt^2} = \mathbf{F}_i(t), \qquad i = 1, 2, \ldots, N. \tag{1} $$
Here, r_i and m_i represent the position and mass of atom i, and F_i(t) is the force on
atom i at time t. F_i(t) is given by
$$ \mathbf{F}_i = -\nabla_i V(\mathbf{r}_1, \mathbf{r}_2, \ldots, \mathbf{r}_N), \tag{2} $$
where V(r_1, r_2, …, r_N) is the potential energy of the system, which depends on the
positions of the N atoms in the system, and ∇_i is the gradient with respect to the position of atom i:
$$ \nabla_i = \frac{\partial}{\partial x_i}\mathbf{i} + \frac{\partial}{\partial y_i}\mathbf{j} + \frac{\partial}{\partial z_i}\mathbf{k}. \tag{3} $$
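To show how Newton's equation of motion is stepped forward in time, here is a minimal sketch for a single particle in one dimension. The velocity Verlet integrator, the harmonic potential V(x) = K x²/2, and all parameter values are assumptions chosen for illustration; the slides do not name an integration scheme.

```python
import math

def velocity_verlet(position, velocity, mass, force, dt, steps):
    """Integrate m * d2x/dt2 = F(x) with the velocity Verlet scheme."""
    trajectory = [(position, velocity)]
    f = force(position)
    for _ in range(steps):
        position = position + velocity * dt + 0.5 * (f / mass) * dt * dt
        f_new = force(position)
        velocity = velocity + 0.5 * (f + f_new) / mass * dt
        f = f_new
        trajectory.append((position, velocity))
    return trajectory

K = 1.0  # hypothetical spring constant

def potential(x):
    return 0.5 * K * x * x

def force(x):
    """F = -dV/dx, the one-dimensional analogue of equation (2)."""
    return -K * x

mass, dt, steps = 1.0, 0.01, 1000
traj = velocity_verlet(1.0, 0.0, mass, force, dt, steps)
x_end, v_end = traj[-1]
energy_start = potential(1.0)
energy_end = potential(x_end) + 0.5 * mass * v_end ** 2
print(x_end, math.cos(dt * steps), energy_start, energy_end)
```

For this potential the exact solution is x(t) = cos(t), so the final position and the nearly constant total energy give two quick checks on the integrator.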
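Energy Functions used in
Molecular Simulation
$$ V_{total} = \sum_{bonds} K_b (r - r_0)^2 + \sum_{angles} K_\theta (\theta - \theta_0)^2 + \sum_{dihedrals} K_\phi \left[ 1 + \cos(n\phi - \delta) \right] + \sum_{\substack{H\text{-}bonds \\ i,j\ pairs}} \left( \frac{C_{ij}}{r_{ij}^{12}} - \frac{D_{ij}}{r_{ij}^{10}} \right) + \sum_{\substack{van\ der\ Waals \\ i,j\ pairs}} \left( \frac{A_{ij}}{r_{ij}^{12}} - \frac{B_{ij}}{r_{ij}^{6}} \right) + \sum_{\substack{electrostatic \\ i,j\ pairs}} \frac{q_i q_j}{r_{ij}} $$
Terms, in order: bond stretching (bond length r), angle bending (angle Θ), dihedral (torsion angle Φ), H-bonding (O···H distance r), van der Waals (inter-atomic distance r), and electrostatic (distance r between charges + and -).
The sums over i, j pairs are the most time-demanding part.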
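The van der Waals (12-6) and electrostatic pair terms can be evaluated with a double loop over atom pairs, which also shows why they dominate the cost: the number of pairs grows as N². This is a sketch under simplifying assumptions: one shared pair of coefficients `a` and `b` is used for every pair instead of pair-specific A_ij and B_ij, units and any dielectric factor are ignored, and the coordinates and parameters are made up.

```python
import math

def vdw_energy(r, a, b):
    """12-6 van der Waals pair term: A / r^12 - B / r^6."""
    return a / r ** 12 - b / r ** 6

def electrostatic_energy(r, qi, qj):
    """Coulomb pair term q_i * q_j / r (unit constants omitted)."""
    return qi * qj / r

def nonbonded_energy(coords, charges, a, b):
    """Sum both pair terms over all i < j pairs: N * (N - 1) / 2 of them."""
    total = 0.0
    n = len(coords)
    for i in range(n):
        for j in range(i + 1, n):
            r = math.dist(coords[i], coords[j])
            total += vdw_energy(r, a, b)
            total += electrostatic_energy(r, charges[i], charges[j])
    return total

# Hypothetical three-atom system with made-up parameters.
coords = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
charges = [0.0, 0.0, 0.0]
print(nonbonded_energy(coords, charges, a=1.0, b=2.0))
```

With a = 1 and b = 2 the 12-6 term has its minimum of -1 at r = 1, which makes the toy numbers easy to verify by hand.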
Homology-based Prediction

Align query sequence with sequences of known structure,
usually >30% similar

Superimpose the aligned sequence onto the structure
template, according to the computed sequence alignment

Perform local refinement of the resulting structure in 3D
The number of unique structural folds
is small (possibly a few thousand)
90% of new structures submitted to PDB in the
past three years have similar folds in PDB
Homology-based Prediction
Raw model → Loop modeling → Side chain placement → Refinement
Homology-based Prediction