CSCE555 Bioinformatics

Lecture 18: Protein Bioinformatics and
Protein Secondary Structure Prediction
Meeting: MW 4:00PM-5:15PM SWGN2A21
Instructor: Dr. Jianjun Hu
Course page: http://www.scigen.org/csce555
University of South Carolina
Department of Computer Science and Engineering
2008
www.cse.sc.edu.
Outline
• Understanding Protein Structures
• Protein bioinformatics: what and why?
• Protein Secondary Structure Prediction: problem & algorithm
• Summary

Proteins
• Large organic compounds made of amino acids
• Proteins play a crucial role in virtually all biological processes, with a broad range of functions.
• The activity of an enzyme or the function of a protein is governed by its three-dimensional structure.

How Proteins Are Generated
[Figure label: folding]
Protein Bioinformatics

Analysis and prediction of protein
structures (Structural Bioinformatics)
◦ Protein Design: design a sequence that will
fold into a designated structure

Assist experimental biology in assigning
functions or suggesting functional
hypotheses for all known proteins.
Protein Bioinformatics
[Figure: DNA → (transcription) → RNA → (translation) → protein → phenotype, annotated with the associated resources: genomic DNA databases; cDNA, ESTs, UniGene; gene expression database; protein sequence databases; protein structure databases]
TOP 10 Most Wanted solutions in protein bioinformatics
1. Protein sequence alignment
2. Predicting protein features from sequence
3. Function prediction
4. Protein structure prediction
5. Membrane proteins
6. Functional site identification
7. Protein-protein interaction
8. Protein-small molecule interaction (Docking)
9. Protein design
10. Protein engineering
Why Protein Bioinformatics?
Function = Σ interactions
Disease mechanism, gene regulation, drug design…
Relevance of Protein Structure in the Post-Genome Era
[Figure: diagram relating sequence, structure, function, and medicine]
Protein Structure Example
[Figure: example protein structure with 2 chains; labeled elements: beta sheet, helix, loop]
Protein Structure Is Hierarchical
• Single peptide chain: sequence, local folding, long-range folding
• Multiple peptide chains: multimeric organization
How to Obtain Protein Structures
• Experimental methods (>50,000 structures)
◦ X-ray crystallography or NMR (nuclear magnetic resonance) spectroscopy
◦ Limitations: protein size; crystallography requires crystallized proteins
◦ Membrane proteins are difficult to crystallize
• Computational methods (predictive methods)
◦ 2-D structure (secondary structure)
◦ 3-D structure (tertiary structure)
◦ CASP competition: Critical Assessment of Techniques for Protein Structure Prediction
Protein Structure Prediction Problem
• Given the amino acid sequence of a protein, what is its shape in three-dimensional space?
◦ Sequence → secondary structure → 3D structure → function
Why Prediction Needed?
• The function of a protein is determined by its structure.
• Experimental methods to determine protein structure are time-consuming and expensive.
• There is a big gap between the number of available protein sequences and the number of known structures.

Growth of Protein Sequences and Structures
[Figure: growth of protein sequences and structures; annotations: "30000*X species" and "50,000 as of 2008". Data from http://www.dna.affrc.go.jp]
What determines structures: Inter-atomic Forces
• Covalent bond (short range, very strong)
◦ Binds atoms into molecules / macromolecules
• Hydrogen bond (short range, strong)
◦ Binds two polar groups (hydrogen + electronegative atom)
• Disulfide bond / bridge (short range, very strong)
◦ Covalent bond between sulfhydryl (sulfur + hydrogen) groups
• Hydrophobic / hydrophilic interaction (weak)
◦ Hydrogen bonding w/ H2O in solution
• Van der Waals interaction (very weak)
◦ Nonspecific electrostatic attractive force
• Electrostatic forces
◦ Oppositely charged side chains form salt bridges
Secondary Structure Prediction (2D)
• For each residue in a protein structure, there are three possible states: α (α-helix), β (β-strand), t (others).
[Figure: amino acid sequence aligned with its secondary structure sequence]
• Currently the accuracy of secondary structure prediction methods is about 80-82% (2006). The theoretical upper limit is about 90%, because secondary structure assignment in real proteins is uncertain to about 10%.
• Secondary structure prediction can provide useful information to improve other sequence and structure analysis methods, such as sequence alignment and 3-D modeling.
http://bioinf.cs.ucl.ac.uk/psipred/psiform.html
PSSP: Protein Secondary Structure Prediction
Three generations
• Based on statistical information of single amino acids
• Based on local amino acid interactions (segments). Typically a segment contains 11-21 amino acids
• Based on evolutionary information from homologous sequences
Formulate PSSP as a machine learning classification problem
• Use a sliding window to move along the amino acid sequence
◦ Each window denotes an instance
◦ Each amino acid inside the window denotes an attribute
◦ The known secondary structure of the central amino acid is the class label
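
A minimal sketch of the sliding-window formulation, assuming an odd window size and a "-" padding symbol at the sequence ends. The function name `sliding_windows`, the default window of 13, and the padding choice are illustrative assumptions, not taken from the slides.

```python
def sliding_windows(sequence, structure, window=13, pad="-"):
    """Build one (window, label) training instance per residue.

    The label is the known secondary structure of the central residue.
    Padding with `pad` at both ends is an assumption made here so that
    the first and last residues also get a full-size window.
    """
    if len(sequence) != len(structure):
        raise ValueError("sequence and structure must have the same length")
    if window % 2 == 0:
        raise ValueError("window size must be odd so there is a central residue")
    half = window // 2
    padded = pad * half + sequence + pad * half
    # padded[i:i + window] is centered on sequence[i]
    return [(padded[i:i + window], label) for i, label in enumerate(structure)]


if __name__ == "__main__":
    for attributes, label in sliding_windows("ALHEASG", "hhhhhoo", window=5):
        print(attributes, label)
```

Each instance then has `window` attributes (one per residue position) and a single class label, which is the shape a standard classifier expects.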
How to formulate protein secondary structure prediction as a machine learning problem?
• A set of “examples” is generated from sequences with known secondary structures
• The examples form a training set
• Build a neural network classifier
• Apply the classifier to a sequence with unknown secondary structure

Introduction to Neural Network
• What is an Artificial Neural Network?
◦ An extremely simplified model of the brain
◦ Essentially a function approximator
◦ Transforms inputs into outputs to the best of its ability
How Do Neural Networks Work?
• A neuron (perceptron) is a single-layer NN
• The output of a neuron is a function of the weighted sum of the inputs plus a bias
Activation Function
• Binary activation function
◦ f(x) = 1 if x >= 0
◦ f(x) = 0 otherwise
• The most common sigmoid function used is the logistic function
◦ f(x) = 1/(1 + e^(-x))
Multi-Layer Feedforward NN Example

XOR problem (a multi-layer network is capable of nonlinear classification)
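
As an illustration of why a hidden layer helps, here is a small sketch of a two-layer feedforward network that computes XOR using the binary activation function. The weights and biases are hand-picked for this example (hypothetical values, not from the slides): one hidden neuron acts as OR, the other as AND, and the output neuron fires for "OR but not AND".

```python
def step(x):
    """Binary activation function: 1 if x >= 0, otherwise 0."""
    return 1 if x >= 0 else 0


def neuron(inputs, weights, bias):
    """Output of one neuron: activation of the weighted sum plus a bias."""
    return step(sum(w * v for w, v in zip(weights, inputs)) + bias)


def xor_net(x1, x2):
    # Hidden layer (hand-picked, illustrative weights)
    h_or = neuron((x1, x2), (1, 1), -0.5)    # fires if x1 OR x2
    h_and = neuron((x1, x2), (1, 1), -1.5)   # fires if x1 AND x2
    # Output layer: OR but not AND
    return neuron((h_or, h_and), (1, -1), -0.5)


if __name__ == "__main__":
    for a in (0, 1):
        for b in (0, 1):
            print(a, b, xor_net(a, b))
```

No single neuron of this kind can compute XOR, because the two classes are not linearly separable in the input space.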
Where Do The Weights Come From?
• The weights in a neural network are the most important factor in determining its function
• Training is the act of presenting the network with some sample data and modifying the weights to better approximate the desired function (class labels)
◦ Supervised training:
- Supplies the neural network with inputs and the desired outputs
- Response of the network to the inputs is measured
- The weights are modified to reduce the difference between the actual and desired outputs
Training in Perceptron Neural Net
Training a perceptron: find the weights W that minimize the error function
$$E = \sum_{i=1}^{P} \left( F(X_i \cdot W) - t(X_i) \right)^2$$
where
• P: number of training data
• X_i: training vectors
• F(X_i \cdot W): output of the perceptron
• t(X_i): target value for X_i
Use steepest descent:
- compute the gradient: $$\nabla E = \left( \frac{\partial E}{\partial w_1}, \frac{\partial E}{\partial w_2}, \frac{\partial E}{\partial w_3}, \ldots, \frac{\partial E}{\partial w_N} \right)$$
- update the weight vector: $$W_{new} = W_{old} - \epsilon \nabla E$$ (ε: learning rate)
- iterate
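
A minimal sketch of steepest descent for a single perceptron, assuming F is the logistic function and that the bias is handled as an extra input fixed at 1. The AND data set, the learning rate, and the step count are illustrative choices, not values from the slides.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dot(x, w):
    return sum(xi * wi for xi, wi in zip(x, w))


def error(weights, data):
    """E = sum over training vectors of (F(X_i . W) - t(X_i))^2."""
    return sum((sigmoid(dot(x, weights)) - t) ** 2 for x, t in data)


def gradient(weights, data):
    """Gradient of E with respect to each weight (chain rule through the sigmoid)."""
    grad = [0.0] * len(weights)
    for x, t in data:
        out = sigmoid(dot(x, weights))
        common = 2.0 * (out - t) * out * (1.0 - out)
        for n, xn in enumerate(x):
            grad[n] += common * xn
    return grad


def train(data, n_weights, rate=0.5, steps=5000):
    """Iterate W_new = W_old - rate * grad(E), starting from zero weights."""
    weights = [0.0] * n_weights
    for _ in range(steps):
        grad = gradient(weights, data)
        weights = [w - rate * g for w, g in zip(weights, grad)]
    return weights


# Logical AND; the third input is a constant 1 that plays the role of the bias.
AND_DATA = [((0, 0, 1), 0), ((0, 1, 1), 0), ((1, 0, 1), 0), ((1, 1, 1), 1)]

if __name__ == "__main__":
    w = train(AND_DATA, 3)
    print("weights:", w, "error:", error(w, AND_DATA))
```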



Back-propagation algorithm
• For a multi-layer NN, the errors of the hidden layers are not known
• Searches for weight values that minimize the total error of the network over the set of training examples
◦ Forward pass: compute the outputs of all units in the network, and the error of the output layer.
◦ Backward pass: the network error is backpropagated to update the weights (credit assignment problem).
Feedforward Network Training by Backpropagation: Process Summary
• Select an architecture
• Randomly initialize weights
• While error is too large
◦ Select training pattern and feedforward to find actual network output
◦ Calculate errors and backpropagate error signals
◦ Adjust weights
• Evaluate performance using the test set
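
The process summary can be sketched as a short training loop. This is an illustrative sketch, not the lecture's implementation: it assumes one hidden layer of sigmoid units, a single sigmoid output, a bias modeled as a constant input of 1, and the standard delta-rule update (weight += rate × error signal × input). The names `init_weights` and `train`, the XOR data, and all hyperparameters are hypothetical. XOR has only four patterns, so no separate test set is used here.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def forward(x, hidden_w, output_w):
    """Feedforward pass: return hidden activations and the network output."""
    inputs = list(x) + [1.0]  # constant bias input
    hidden = [sigmoid(sum(w * v for w, v in zip(row, inputs))) for row in hidden_w]
    out = sigmoid(sum(w * v for w, v in zip(output_w, hidden + [1.0])))
    return hidden, out


def total_error(data, hidden_w, output_w):
    return sum((t - forward(x, hidden_w, output_w)[1]) ** 2 for x, t in data)


def init_weights(n_in, n_hidden, seed=0):
    """Randomly initialize weights (each row has one extra bias weight)."""
    rng = random.Random(seed)
    hidden_w = [[rng.uniform(-1, 1) for _ in range(n_in + 1)] for _ in range(n_hidden)]
    output_w = [rng.uniform(-1, 1) for _ in range(n_hidden + 1)]
    return hidden_w, output_w


def train(data, hidden_w, output_w, rate=0.5, tolerance=0.05, max_epochs=5000):
    """Update the weights in place; return the number of epochs run."""
    for epoch in range(max_epochs):
        if total_error(data, hidden_w, output_w) <= tolerance:  # error small enough
            return epoch
        for x, t in data:  # select a training pattern and feed it forward
            hidden, out = forward(x, hidden_w, output_w)
            # Error signals: output neuron first, then hidden neurons
            d_out = (t - out) * out * (1.0 - out)
            d_hid = [h * (1.0 - h) * d_out * output_w[j] for j, h in enumerate(hidden)]
            # Adjust weights
            for j, h in enumerate(hidden + [1.0]):
                output_w[j] += rate * d_out * h
            for j, row in enumerate(hidden_w):
                for k, v in enumerate(list(x) + [1.0]):
                    row[k] += rate * d_hid[j] * v
    return max_epochs


XOR_DATA = [((0.0, 0.0), 0.0), ((0.0, 1.0), 1.0), ((1.0, 0.0), 1.0), ((1.0, 1.0), 0.0)]

if __name__ == "__main__":
    hw, ow = init_weights(2, 3)
    print("error before:", total_error(XOR_DATA, hw, ow))
    print("epochs run:", train(XOR_DATA, hw, ow))
    print("error after:", total_error(XOR_DATA, hw, ow))
```

The `max_epochs` cap is a practical safeguard added here, since "while error is too large" alone does not guarantee termination.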
5/23/2017
Copyright G. A. Tagliarini, PhD
NN for Protein Secondary Structure Prediction
How to Encode Each Amino Acid?
20-bit binary sequence
• 10000000000000000000-----A
• 01000000000000000000-----R
• 00100000000000000000-----N
…
• 00000000000000000001-----V

Evaluation of Performance: Accuracy (Q3)
Amino acid sequence:
ALHEASGPSVILFGSDVTVPPASNAEQAK
Actual and predicted secondary structure, compared position by position:
hhhhhooooeeeeoooeeeooooohhhhh
ohhhooooeeeeoooooeeeooohhhhhh
Q3 = 22/29 = 76%
Q3 for random prediction is 33%.
Secondary structure assignment in real proteins is uncertain to about 10%; therefore, a “perfect” prediction would have Q3 = 90%.
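
The Q3 calculation can be sketched as the fraction of positions where two three-state strings agree. The function name `q3` is illustrative. The transcript does not make clear which of the two strings on the slide is the actual structure and which is the prediction; since Q3 only counts matching positions, the result is the same either way.

```python
def q3(predicted, actual):
    """Fraction of residues whose predicted state matches the actual state."""
    if len(predicted) != len(actual):
        raise ValueError("both strings must cover the same residues")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)


# The two secondary structure strings from the slide
FIRST = "hhhhhooooeeeeoooeeeooooohhhhh"
SECOND = "ohhhooooeeeeoooooeeeooohhhhhh"

if __name__ == "__main__":
    print(f"Q3 = {q3(FIRST, SECOND):.0%}")
```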
Performances (CASP)
CASP    Year   # of Targets   <Q3>   Group
CASP1   1994   6              63%    Rost and Sander
CASP2   1996   24             70%    Rost
CASP3   1998   18             75%    Jones
CASP4   2000   28             80%    Jones
Summary
• Protein bioinformatics is a very important area with many interesting problems
• Computational methods can have a big impact in medicine and molecular biology
• Protein secondary structure prediction algorithms are very strong

Slides Acknowledgements
• Jinbo Xu, University of Waterloo
• Xingquan Zhu

Why predict structure: Can Label Proteins
by Dominant Structure

Protein classification, Structural Blasting
Amino Acids
Side chain
Each amino acid is identified by its side chain,
which determines the properties of this amino acid.
Side Chain Properties
• Hydrophobic: V, L, I, M, F
• Hydrophilic: N, E, Q, H, K, R, D
• In-between: G, A, S, T, Y, W, C, P
• Positively charged: R, H, K
• Negatively charged: D, E
• Polar but not charged: N, Q, S, T
• Nonpolar: A, G, I, L, M, P, V
• Aromatic: F, W, Y
Hydrophobic amino acids tend to stay in the interior of a protein, while hydrophilic ones tend to stay on the exterior.
Oppositely charged amino acids can form salt bridges.
Polar amino acids can participate in hydrogen bonding.
Alpha Helix Examples
Beta Sheet Examples
Parallel beta sheet
Anti-parallel beta sheet
Calculate Outputs For Each Neuron Based On The Pattern
Feedforward: the output from neuron j for pattern p is O_pj, where
$$O_{pj}(net_j) = \frac{1}{1 + e^{-net_j}}$$
and
$$net_j = bias \cdot W_{bias} + \sum_k O_{pk} W_{jk}$$
k ranges over the input indices and W_jk is the weight on the connection from input k to neuron j.
[Figure: network diagram with inputs and outputs]
Calculate The Error Signal For Each Output Neuron
• The output neuron error signal δ_pj is given by
$$\delta_{pj} = (T_{pj} - O_{pj}) \, O_{pj} \, (1 - O_{pj})$$
• T_pj is the target value of output neuron j for pattern p
• O_pj is the actual output value of output neuron j for pattern p
Calculate The Error Signal For Each Hidden Neuron
• The hidden neuron error signal δ_pj is given by
$$\delta_{pj} = O_{pj} (1 - O_{pj}) \sum_k \delta_{pk} W_{kj}$$
where δ_pk is the error signal of a post-synaptic neuron k and W_kj is the weight of the connection from hidden neuron j to the post-synaptic neuron k
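
The error signals can be sanity-checked numerically. The sketch below assumes the error for one pattern is E = ½ Σ (T − O)², in which case −∂E/∂W_jk for a hidden neuron j equals its error signal times input k. The tiny 2-2-1 network and its weights are hypothetical values chosen only for this check; the last weight in each row is the bias weight.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


# Hypothetical fixed weights; the last entry of each row multiplies a constant 1 (bias).
HIDDEN_W = [[0.15, -0.20, 0.35], [0.25, 0.30, -0.10]]
OUTPUT_W = [[0.40, -0.45, 0.60]]


def forward(x, hidden_w, output_w):
    inputs = list(x) + [1.0]
    hidden = [sigmoid(sum(w * v for w, v in zip(row, inputs))) for row in hidden_w]
    outputs = [sigmoid(sum(w * v for w, v in zip(row, hidden + [1.0]))) for row in output_w]
    return hidden, outputs


def error_signals(x, targets, hidden_w, output_w):
    """Error signals for the output neurons and the hidden neurons."""
    hidden, outputs = forward(x, hidden_w, output_w)
    d_out = [(t - o) * o * (1.0 - o) for t, o in zip(targets, outputs)]
    d_hid = [h * (1.0 - h) * sum(d_out[k] * output_w[k][j] for k in range(len(d_out)))
             for j, h in enumerate(hidden)]
    return d_out, d_hid


def half_squared_error(x, targets, hidden_w, output_w):
    _, outputs = forward(x, hidden_w, output_w)
    return 0.5 * sum((t - o) ** 2 for t, o in zip(targets, outputs))


def numeric_slope(x, targets, j, k, eps=1e-6):
    """Central-difference estimate of dE/dW for hidden weight W[j][k]."""
    plus = [row[:] for row in HIDDEN_W]
    minus = [row[:] for row in HIDDEN_W]
    plus[j][k] += eps
    minus[j][k] -= eps
    return (half_squared_error(x, targets, plus, OUTPUT_W)
            - half_squared_error(x, targets, minus, OUTPUT_W)) / (2 * eps)


if __name__ == "__main__":
    x, targets = (1.0, 0.0), [1.0]
    d_out, d_hid = error_signals(x, targets, HIDDEN_W, OUTPUT_W)
    for j in range(2):
        for k, v in enumerate(list(x) + [1.0]):
            print(j, k, -d_hid[j] * v, numeric_slope(x, targets, j, k))
```

The two printed columns should agree closely, which is the sense in which the error signal assigns credit to each hidden weight.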