Self-organizing Map (SOM)
in Protein Folding Based
on HP Model
Xiang-Sun ZHANG
[email protected]
http://zhangroup.aporc.org
2 Dec. 2003 at NCSU
Motivation
We are all concerned with what we (OR researchers and algorithm designers) can do in bioinformatics.
What is the junction of operations research and bioinformatics?
Abstract
Many problems in bioinformatics can be formulated as large linear or nonlinear integer programming problems, or as combinatorial problems. These are NP-hard and cannot be solved efficiently by existing algorithms, so efficient approximate methods are needed.
As examples, a heuristic algorithm for SBH and a new SOM algorithm for solving the protein HP model are presented.
Other related research in our group is also introduced.
Problem areas in Bioinformatics
Human Genome Project
• Large molecule data in biology, such as DNA and protein
Genomics
• DNA sequencing
• Gene prediction
• Sequence alignment
Proteomics (50,000 entries on Google) / Protenomics (hundreds of entries on Google)
• Structure prediction
• Protein alignment
“Operations Research”
• Over 8 million entries on Google
DNA Sequencing
Target sequence: ACGTGATCGATCGAGTACGAGAGTCTA
(Figure: four copies of the target sequence, cut into overlapping pieces)
Two pieces of a target sequence with a longer overlap are preferably joined together. This requires that
• the average size of the pieces is as long as possible, and
• there are as many copies of the target sequence as possible.
A novel DNA sequencing technique, called
Sequencing By Hybridization (SBH), was
proposed as an alternative to the traditional
sequencing by gel electrophoresis.
SBH is based on the DNA chip (or DNA array).
A DNA chip contains all 4^k probes of length k (a probe is a short k-nucleotide fragment of DNA, also called a k-tuple).
Given a probe and a target DNA, the target will
bind (hybridize) to the probe if there is a
substring of the target which “fits” the probe.
DNA Sequencing
DNA array (DNA chip)
Example: the target AAATGCG contains five 3-tuples (AAA, AAT, ATG, TGC, GCG); a chip for k = 3 carries 4^3 = 64 3-tuples.
SBH uses the classical probing scheme: by hybridizing an (unknown) DNA fragment with the chip, all k-tuples contained in the target DNA (together called its spectrum) are determined.
SBH provides information about which k-tuples are present in the target DNA, but not about the positions of these k-tuples.
This raises the problem of how to reconstruct the target DNA from these data.
Because of technological limitations, k cannot yet be made very large (generally less than 30, which is already a big chip). This can lead to branching during sequence reconstruction and to multiple possible reconstructions.
In addition, two kinds of errors may occur: negative errors (k-tuples in the sequence that are not hybridized) and positive errors (hybridized probes that are not k-tuples of the sequence). Therefore, for larger DNA fragments, the problem of sequence reconstruction becomes rather complicated and hard to analyze.
In the case of error-free SBH with an ideal spectrum (one consisting of n-k+1 different k-tuples, where n is the length of the DNA fragment), the SBH reconstruction problem is known to be equivalent to finding an Eulerian path in a corresponding graph, and the algorithm can be implemented in linear time.
Positive and negative errors, together with repeated k-tuples in the DNA fragment, make the problem computationally difficult: it becomes strongly NP-hard.
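The error-free case can be illustrated with a minimal Python sketch. It assumes the usual construction in which every k-tuple is an edge from its (k-1)-prefix to its (k-1)-suffix, and it uses Hierholzer's algorithm to find the Eulerian path. The function names `spectrum` and `reconstruct` are hypothetical, and the sketch is not the implementation referred to in the talk.

```python
from collections import defaultdict

def spectrum(dna, k):
    """Return all k-tuples of dna, in order of occurrence."""
    return [dna[i:i + k] for i in range(len(dna) - k + 1)]

def reconstruct(kmers):
    """Rebuild a sequence from an error-free, ideal spectrum.

    Assumption: each k-tuple is an edge from its (k-1)-prefix to its
    (k-1)-suffix, so a reconstruction is an Eulerian path.
    """
    graph = defaultdict(list)
    balance = defaultdict(int)  # out-degree minus in-degree per vertex
    for kmer in kmers:
        u, v = kmer[:-1], kmer[1:]
        graph[u].append(v)
        balance[u] += 1
        balance[v] -= 1
    # An Eulerian path starts at the vertex with one surplus outgoing edge;
    # if there is none (a cycle), any vertex with an edge will do.
    start = next((v for v, b in balance.items() if b == 1), kmers[0][:-1])
    # Hierholzer's algorithm, iterative form: linear in the number of edges.
    stack, path = [start], []
    while stack:
        u = stack[-1]
        if graph[u]:
            stack.append(graph[u].pop())
        else:
            path.append(stack.pop())
    path.reverse()
    return path[0] + "".join(v[-1] for v in path[1:])

fragment = "ATACGAAGA"
print(reconstruct(spectrum(fragment, 3)))
```

For the fragment ATACGAAGA with k = 3, every vertex of the graph has a single outgoing edge, so the reconstruction is unique. With repeated k-tuples or hybridization errors, this simple procedure no longer applies.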
Sequencing by Hybridization
DNA fragment: ……ATACGAAGA……
Spectrum, ideal case: ATA, TAC, ACG, CGA, GAA, AAG, AGA
Spectrum, with errors: ATA, TAC, AGG, CGA, GAA, AAG, AGA
Errors: positive (misread) / negative (missing, repetition)
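As a small illustration of the two error types, the following Python sketch compares an ideal spectrum with an observed one. The function name `spectrum_errors` is hypothetical. In practice the ideal spectrum is unknown, which is what makes reconstruction with errors hard, so the comparison only clarifies the definitions.

```python
def spectrum_errors(ideal, observed):
    """Classify differences between an ideal and an observed spectrum.

    Positive errors: observed k-tuples that are not in the sequence.
    Negative errors: k-tuples of the sequence that were not observed.
    """
    ideal_set, observed_set = set(ideal), set(observed)
    positive = sorted(observed_set - ideal_set)
    negative = sorted(ideal_set - observed_set)
    return positive, negative

ideal = ["ATA", "TAC", "ACG", "CGA", "GAA", "AAG", "AGA"]
observed = ["ATA", "TAC", "AGG", "CGA", "GAA", "AAG", "AGA"]
positive, negative = spectrum_errors(ideal, observed)
print("positive:", positive, "negative:", negative)
```

For the two spectra of the fragment ATACGAAGA, AGG is a positive error and ACG is a negative error.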
1989, Pevzner: the SBH reconstruction problem is equivalent to finding an Eulerian path in a related graph.
1990, Fleischner: the algorithm can be implemented in linear time.
1991, Drmanac et al.: an algorithm for SBH with errors, under the assumption that only the first or last nucleotide in the data can be erroneous.
1993, Lipshutz: uses empirically derived rates of positive and negative errors and other assumptions. No convergence analysis.
1999, Blazewicz et al.: a branch and bound method for the case of only positive errors.
2000, Blazewicz et al.: a heuristic algorithm producing near-optimal solutions.
SBH Reconstruction Problem
Design efficient heuristic algorithms



Ji-Hong Zhang, Ling-Yun Wu and Xiang-Sun Zhang. A new
approach to the reconstruction of DNA sequencing by
hybridization. Bioinformatics, vol 19(1), pages 14-21, 2003.
Xiang-Sun Zhang, Ji-Hong Zhang and Ling-Yun Wu.
Combinatorial optimization problems in the positional DNA
sequencing by hybridization and its algorithms. System
Sciences and Mathematics, vol 3, 2002. (in Chinese)
Ling-Yun Wu, Ji-Hong Zhang and Xiang-Sun Zhang.
Application of neural networks in the reconstruction of DNA
sequencing by hybridization. In Proceedings of the 4th
ISORA, 2002.
Basic Observation
The spectrum corresponds to a graph: each k-tuple is a vertex, and two k-tuples that can be connected form an edge. The structure of the graph is represented by its adjacency matrix.
A reconstruction of the spectrum is a path in the graph. Information about all paths is contained in the powers of the adjacency matrix.
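A minimal Python sketch of this observation follows. It assumes that two k-tuples are "connected" when the last k-1 letters of the first equal the first k-1 letters of the second. The helper names are hypothetical. Entry (i, j) of the p-th matrix power counts walks of p edges from vertex i to vertex j; walks may revisit vertices, so this is only the raw information, and the criteria built on it in the cited papers are not reproduced here.

```python
def adjacency(kmers):
    """A[i][j] = 1 if kmers[j] can follow kmers[i] with an overlap of k-1."""
    return [[int(a[1:] == b[:-1]) for b in kmers] for a in kmers]

def mat_mul(a, b):
    """Multiply two square integer matrices given as lists of lists."""
    n = len(a)
    return [[sum(a[i][t] * b[t][j] for t in range(n)) for j in range(n)]
            for i in range(n)]

def mat_pow(a, p):
    """Return the p-th power of a square matrix (p >= 0)."""
    n = len(a)
    result = [[int(i == j) for j in range(n)] for i in range(n)]
    for _ in range(p):
        result = mat_mul(result, a)
    return result

kmers = ["ATA", "TAC", "ACG", "CGA", "GAA", "AAG", "AGA"]
walks = mat_pow(adjacency(kmers), len(kmers) - 1)
# Number of 6-edge walks from ATA to every other k-tuple.
print(dict(zip(kmers, walks[0])))
```

For this spectrum the only 6-edge walk starting at ATA ends at AGA, which matches the two ends of the fragment ATACGAAGA.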
Criteria are given that use the powers of the adjacency matrix to determine, in polynomial time O((n-k)^4), the most likely k-tuples at both ends and in the middle of all possible reconstructions of the target DNA.
A novel technique is proposed that transforms negative errors into positive errors. It enables us to handle both types of errors easily.
Protein Structure Prediction
Predict protein 3D structure from the (amino acid) sequence
Sequence → secondary structure → 3D structure → function
Protein Secondary Structure
α-helix (30-35%)
β-sheet / β-strand (20-25%)
Coil (40-50%), random coil
Loop
β-turn
3D Structure of Protein
Turn or coil
Alpha-helix
Beta-sheet
Loop and Turn
Protein 3D Structure Detection
X-ray diffraction
Expensive
Slow
Protein Structure Prediction
Prediction is possible because
• Sequence information uniquely determines 3D structure
• Sequence similarity (>50%) tends to imply structural similarity
Prediction is necessary because
• DNA sequence data » protein sequence data » structure data

Database                 1994      1997      2002.10
Sequence (Swiss-Prot)    40,000    68,000    114,033
Structure (PDB)          4,045     7,000     18,838
Three Methods of Protein Structure Prediction
Goal: find the best fit of sequence to 3D structure
Comparative (homology) modeling
  Construct a 3D model from alignment to protein sequences with known structure
Threading (fold recognition)
  Pick the best fit to sequences of known 2D / 3D structures (folds)
Ab initio / de novo methods
  Attempt to calculate 3D structure “from scratch”
  • Molecular dynamics
  • Energy minimization
  • Lattice models
Lattice Models
• Suppose that each amino acid occupies one
point in a space lattice
• It is called an Exact Model
HP Model (Simple Model)
• The twenty amino acids can be divided into two classes:
  Hydrophobic / non-polar (H)
  Hydrophilic / polar (P)
• Contacts between H points are favorable
(Figure legend: hydrophobic amino acid, hydrophilic amino acid, covalent bond, H-H contact)
• Goal: maximize the number of H-H contacts
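The objective of the HP model can be made concrete with a short Python sketch that counts H-H contacts of a given fold on the 2D square lattice. The function name `hh_contacts` and the coordinate representation are assumptions made for illustration. A contact is taken to be a pair of H residues on neighboring lattice points that are not consecutive in the chain.

```python
def hh_contacts(sequence, coords):
    """Count H-H contacts of an HP sequence folded on the 2D square lattice.

    sequence: string over {'H', 'P'}
    coords:   list of (x, y) lattice points, one per residue, in chain order
    """
    if len(sequence) != len(coords) or len(set(coords)) != len(coords):
        raise ValueError("fold must place every residue on a distinct point")
    for (x1, y1), (x2, y2) in zip(coords, coords[1:]):
        if abs(x1 - x2) + abs(y1 - y2) != 1:
            raise ValueError("consecutive residues must be lattice neighbors")
    position = {p: i for i, p in enumerate(coords)}
    contacts = 0
    for i, (x, y) in enumerate(coords):
        if sequence[i] != "H":
            continue
        # Looking only right and up counts every neighboring pair once.
        for neighbor in ((x + 1, y), (x, y + 1)):
            j = position.get(neighbor)
            if j is not None and sequence[j] == "H" and abs(i - j) > 1:
                contacts += 1
    return contacts

# The chain HPPH folded into a unit square brings its two H ends together.
print(hh_contacts("HPPH", [(0, 0), (1, 0), (1, 1), (0, 1)]))
```

Folding algorithms for the HP model search over self-avoiding placements `coords` to maximize this count.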
Basic Ideas
Each amino acid (neuron) in the primary sequence occupies one lattice point (city).
The distance between the two cities assigned to two neighboring neurons is forced to be 1, the covalent bond length between neighboring amino acids in a protein molecule.
Move the neurons to obtain more H-H contacts, i.e., put the emphasis on forming a hydrophobic core.
Main Observation
The problem is a Traveling Salesman Problem with an energy function, the number of H-H contacts, that is to be maximized.
Mathematical Model (in square lattice)
Let both the sequence length and the lattice size be n, and let $x_{ij} = 1/0$ according to whether or not the i-th acid takes the j-th lattice point. Let $N(j)$ be the neighboring set of point j, with $|N(j)| = 1/2/3$. Let $f(i) = 1/0$ for $i \in H/P$, and let the coordinates of point j be $Y_j$.
$$\max \sum_{j=1}^{n} \Big[ \sum_{i=1}^{n} f(i)\, x_{ij} \cdot \sum_{s \in N(j)} \sum_{i=1}^{n} f(i)\, x_{is} \Big]$$
subject to
$$\sum_{i=1}^{n} x_{ij} = 1, \quad j = 1, \dots, n$$
$$\sum_{j=1}^{n} x_{ij} = 1, \quad i = 1, \dots, n$$
$$\Big\| \sum_{j=1}^{n} x_{ij} Y_j - \sum_{j=1}^{n} x_{(i-1)j} Y_j \Big\| = 1, \quad i = 2, \dots, n$$
Complexity
NP-hard problem even in the case of the two-dimensional HP model
P. Crescenzi, et al. On the complexity of protein folding, Journal of Computational Biology, 5(3): 423-, 1998
Many local solutions
Genetic algorithms (GA), Monte Carlo (MC) and simulated annealing (SA) are time consuming
SOM Approach
Existing algorithm
• Motivated by the Self-Organizing Map for TSP
• Incorporation of HP information
• Compact lattice (the sequence exactly fills the lattice)
Example: a sequence of length 36 in a 6x6 lattice
New SOM Approach
Motivation
• Consider a lattice bigger than the sequence, so that shapes more flexible than the single rectangular shape are possible
• Equivalent to a PCTSP (Prize Collecting Traveling Salesman Problem): the salesman visits only part of the city set while meeting a prescribed target
Difficulty caused:
• Number of cities > number of neurons
PCTSP
A traveling salesman who gets a prize $f_k$ in every city k that he visits and pays a penalty $p_l$ for every city l that he fails to visit, and who travels between cities i and j at cost $c_{ij}$, wants to minimize the sum of his travel cost and net penalties, while including in his tour enough cities to collect a prescribed amount $f_0$ of prize money.
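The PCTSP objective can be sketched in a few lines of Python. The function name `pctsp_value`, the list-based data layout and the small instance are assumptions made for illustration. "Net penalties" is read here as the sum of penalties over the cities left out of the tour.

```python
def pctsp_value(tour, prize, penalty, cost, min_prize):
    """Evaluate a closed tour for the PCTSP.

    tour:      list of distinct visited city indices, in visiting order
    prize:     prize[k] collected in city k if it is visited
    penalty:   penalty[l] paid for city l if it is not visited
    cost:      cost[i][j] is the travel cost between cities i and j
    min_prize: prescribed amount of prize money that must be collected
    Returns travel cost plus penalties, or None if the tour is infeasible.
    """
    if sum(prize[c] for c in tour) < min_prize:
        return None
    visited = set(tour)
    # Close the tour by returning from the last city to the first one.
    travel = sum(cost[a][b] for a, b in zip(tour, tour[1:] + tour[:1]))
    skipped = sum(p for c, p in enumerate(penalty) if c not in visited)
    return travel + skipped

# Hypothetical three-city instance.
cost = [[0, 1, 4], [1, 0, 2], [4, 2, 0]]
prize = [5, 3, 2]
penalty = [0, 1, 10]
print(pctsp_value([0, 1], prize, penalty, cost, min_prize=6))
print(pctsp_value([0, 1, 2], prize, penalty, cost, min_prize=6))
```

In this instance, skipping city 2 saves travel cost but incurs a large penalty, so the full tour has the lower value.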
The new SOM model corresponds to the integer program:
$$\max \sum_{j=1}^{m} \Big[ \sum_{i=1}^{n} f(i)\, x_{ij} \cdot \sum_{s \in N(j)} \sum_{i=1}^{n} f(i)\, x_{is} \Big]$$
subject to
$$\sum_{i=1}^{n} x_{ij} + y_j = 1, \quad j = 1, \dots, m$$
$$\sum_{j=1}^{m} x_{ij} = 1, \quad i = 1, \dots, n$$
$$\Big\| \sum_{j=1}^{m} x_{ij} Y_j - \sum_{j=1}^{m} x_{(i-1)j} Y_j \Big\| = 1, \quad i = 2, \dots, n$$
where m > n and the total number of variables is (n+1)m.
New SOM Approach
Innovations
• Heuristic initialization to imitate a protein
• Learning sample set partition strategy
• Learning sample set reduction strategy
• Local search procedure to overcome the multi-mapping phenomenon
Numerical Results
1. Constructed HP
sequences
(Length of 17)
2. HP benchmark
(up to 36 amino
acids)
SOM Approach for 2D HP-Model
Xiang-Sun Zhang, Yong Wang, Zhong-Wei Zhan,
Ling-Yun Wu, Luonan Chen. A New SOM Approach for
2D HP-Model of Proteins' Structure Prediction.
Submitted to RECOMB04.
Yong Wang, Zhong-Wei Zhan, Ling-Yun Wu, Xiang-Sun Zhang. Improved Self-Organizing Map Algorithm for Protein Folding and its Realization. Submitted to J. of Systems Science and Mathematical Sciences. (in Chinese)
Main Improvements
Finds configurations with the global maximum number of H-H contacts in all the tests
Finds more optimal conformations
Fast: running time is linear in the sequence length
Unique Optimal Folding Problem
Which proteins in the two-dimensional HP model have a unique optimal (minimum energy) folding? (Brian Hayes, 1998)
Oswin Aichholzer proved that on the square lattice
• There are closed chains of monomers with this property for all even lengths.
• There are open monomer chains with this property for all lengths divisible by four.
Square Lattice and Triangular Lattice
Our Results
For any n = 18k (k a positive integer), there exists an n-node (open or closed) chain with at least $3^{O(n)}$ optimal foldings, all with isomorphic contact graphs of size n/2.
On the 2D triangular lattice, for any integer n > 19, there exist both closed and open chains of n nodes with a unique optimal folding.
Proteins With Unique Optimal Foldings
Zhen-Ping Li, Xiang-Sun Zhang, Luo-Nan Chen,
Protein with Unique Optimal Foldings on a Triangular
Lattice in the HP Model, Submitted to Journal of
Computational Biology.
Examples of Optimal Foldings
3D Protein Structure Alignment
Motivation
• Group proteins by structural similarity
• Determine the impact of individual residues on protein structure
• Identify distant homologues of protein families
• Predict the function of proteins with low sequence similarity
• Identify new folds / targets for x-ray crystallography
3D Protein Structure Alignment
Correspondence between atoms
• Pairwise sequence alignment
Locations of atoms
• Protein Data Bank (in PDB file)
  Bond angles / lengths
  X, Y, Z atom coordinates
Evaluation metric
• 6 degrees of freedom
  3 degrees of translation (A)
  3 degrees of rotation (R)
• Root Mean Square Deviation (RMSD)
$$\mathrm{RMSD} = \sqrt{\frac{\sum_{i} d_i^2}{n}}$$
  n = number of atoms
  $d_i$ = distance between corresponding atoms i
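The RMSD metric can be computed directly from two lists of corresponding atom coordinates, as in the Python sketch below. The function name `rmsd` is hypothetical. The sketch assumes the correspondence between atoms is already given and that both structures are already superimposed; it does not search for the translation or rotation.

```python
import math

def rmsd(atoms_a, atoms_b):
    """Root mean square deviation between corresponding 3D atom positions.

    atoms_a, atoms_b: equally long lists of (x, y, z) coordinates, where
    atoms_a[i] corresponds to atoms_b[i].
    """
    if len(atoms_a) != len(atoms_b) or not atoms_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    # d_i is the Euclidean distance between the i-th pair of atoms.
    squared = sum(math.dist(p, q) ** 2 for p, q in zip(atoms_a, atoms_b))
    return math.sqrt(squared / len(atoms_a))

first = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
second = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]
print(rmsd(first, second))
```

Here every atom of the second structure is shifted by one unit along z, so the deviation is one unit.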
Structure Alignment Problem
$$X_i^1 = (x_{i1}^1, x_{i2}^1, x_{i3}^1), \qquad X_j^2 = (x_{j1}^2, x_{j2}^2, x_{j3}^2)$$
Match two rigid bodies by rotating and moving them in 3D space
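Structure Alignment Problem
A nonlinear integer programming problem:
$$\min E(S, A, R) = \sum_{i=1}^{N_1} \sum_{j=1}^{N_2} s_{ij} \left\| A + R X_i^1 - X_j^2 \right\|^2 + \sum_{i=1}^{N_1} \lambda_i^{(1)} s_{i0} + \sum_{i=2}^{N_1} \big( \delta_i^{(1)} - \lambda_i^{(1)} \big) s_{i-1,0}\, s_{i0} + \sum_{j=1}^{N_2} \lambda_j^{(2)} s_{0j} + \sum_{j=2}^{N_2} \big( \delta_j^{(2)} - \lambda_j^{(2)} \big) s_{0,j-1}\, s_{0j}$$
subject to
$$\sum_{i=0}^{N_1} s_{ij} = 1, \quad j = 1, 2, \dots, N_2; \qquad \sum_{j=0}^{N_2} s_{ij} = 1, \quad i = 1, 2, \dots, N_1.$$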
Structure Alignment Problem
Luo-Nan Chen, Tian-Shou Zhou, Yun Tang,
Xiang-Sun Zhang. Structure of Alignment of
Protein by Mean Field Annealing. Submitted
to ICSB2003.
On-going Research
Protein structure prediction
• Algorithms for HP model
• Threading methods
Protein structure alignment
• Novel model for structure alignment
SBH reconstruction
• Algorithms for new pattern SBH methods
SNP (Single Nucleotide Polymorphism) and Haplotype analysis
Summary
Problems in bioinformatics are simple to describe but complicated to solve
Many problems in proteomics are deterministic in nature
• Combinatorial
• Continuous model
while many problems in genomics are stochastic in nature
Model a problem accurately, but solve it approximately