Download dimaio.icml03

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

Biochemistry wikipedia , lookup

Two-hybrid screening wikipedia , lookup

Photosynthetic reaction centre wikipedia , lookup

Protein–protein interaction wikipedia , lookup

Proteolysis wikipedia , lookup

Multi-state modeling of biomolecules wikipedia , lookup

Structural alignment wikipedia , lookup

Protein structure prediction wikipedia , lookup

Metalloprotein wikipedia , lookup

Transcript
Using Pictorial Structures to Identify
Proteins in X-ray Crystallographic
Electron Density Maps
Frank DiMaio
Jude Shavlik
George N. Phillips, Jr.
[email protected]
[email protected]
[email protected]
ICML Bioinformatics Workshop
21 August 2003
Task Overview

Given
• Electron density for a
region in a protein
• Protein’s topology
Find
• Atomic positions of
individual atoms in the
density map
Pictorial Structures
A pictorial structure
is…
a collection of image
parts
together with…
a deformable
conformation of these
parts
Pictorial Structures
Formally, a model consists of

Set of parts V={v1, …, vn}

Configuration L=(l1, …, ln)

Edges eij  E, connect neighboring parts vi, vj
v1
e13

Appearance parameters Ai for each part

Connection parameters Cij for each edge
e23
v3
– Explicit dependency between li, lj
– G=(V,E) forms a Markov Random Field
v2
e35
e34
v4
v5
e46
v6
Matching Algorithm Overview

Want configuration L of model Θ maximizing

P(L|I,Θ)
P(I|L,Θ) · P(L|Θ)
P(I|L,Θ) = Πi P(I|li,Θ) =
1
Z1
- Σi matchi(li)
e
P(L|Θ) = Π (v ,v )E P(li,lj|Cij) =
i

j
1
Z2
- Σ(vi,vj)E dij(li,lj)
e
Equivalent to minimizing
Σi matchi(li) + Σ(vi,vj)E dij(li,lj)
Linear-Time Matching Algorithm

A Dynamic Programming implementation runs in
quadratic time


Requires tree configuration of parts
Felzenszwalb & Huttenlocher (2000) developed
linear-time matching algorithm


Additional constraint on part-to-part cost function dij
Basic “Trick”: Parallelize minimization computation over
entire grid using a Generalized Distance Transform
Pictorial Structures for Map Interpretation
Basic Idea: Build pictorial structure that is able to
model all configurations of a molecule


Each part in “collection of parts” corresponds to an atom
Model has low-cost conformation
for low-energy states of the molecule
The Screw-Joint Model

Ideally, we would have
cost function = atomic energy

Problem: Impossible to represent atomic
energy function using pairwise potentials while
maintaining tree-structure

Solution: screw-joint model

Ignore non-bonded interactions

Edges correspond to covalent bonds

Allow free rotation around bonds
Screw-Joint Model Details

Each part’s configuration has six params (x,y,z,α,β,γ) with
 (x,y,z) is part’s position
αi
 α is part’s rotation (about bond connecting vi and vj)
 (β,γ) is part’s orientation
(βi,γi)
(βj,γj)
vi
vj
vj


(xj,yj,zj)
αj
vi
(xi,yi,zi)
(xij,yij,zij)
Part-to-part cost function dij based on child’s deviation from ideal
Matching cost function matchi based on 3x3x3 template match
Pictorial Structures for Map Interpretation

Ideally, we would …


Build pictorial structure for the entire protein
Run the matching algorithm to get best layout

However, computationally infeasible

Instead, we use two-phase algorithm that …
a)
b)
computes best backbone trace
computes best sidechain conformation
(current focus)
Sidechain Refinement

Assume we have a rough Cα trace of the protein

Next use pictorial structure matching to place sidechains

Walk along chain one residue at a time, placing individual atoms
Cα, ALA_82
Cα, MET_80
Cα, ARG_81
Cα, PRO_83
Sidechain Refinement
Given:



residue type
approximate Cα locations

Find: most likely location for sidechain atoms in the residue

Example Alanine
N
C-1
O
N
Cα
Cα-1 O-1
C
Cβ
N
O
O
N+1
Cα+1
Matching
algorithm
Learning Model Parameters
O
N
N
N
O
Cα
Cβ
Averaged 3D Template
N
N
C-1
O
Alanine Cα
C
Cα
Cβ
C
N+1
Cα
r = 1.51
θ = 118.4°
φ = -19.7°
Canonic Orientation
r = 1.53
θ = 0.0°
φ = -19.3°
Cβ
C
Averaged Bond Geometry
Soft Maximums

Sometimes we may get
an optimal match like
the one to the right

When this occurs, explore
the space of non-optimal
solutions via soft
maximums in DP

Basic Idea: Take a path
with probability inversely
proportional to its cost
ACTUAL
PREDICTED 1
Soft Maximums

Figure to the right
shows soft maximums

Red molecule eventually
found

Annealing increases
“softness” until legal
structure found
PREDICTED 2
ACTUAL

Legal structure may not
be “right”
PREDICTED 1
Results

Only sidechain refinement implemented & tested

Experimental Methodology


Assume Cα’s known to within 2Å

Trained on 1.7 Å resolution protein,
tested on 1.9 Å resolution protein

Templates built for ALA, VAL, TYR, LYS
Model Parameters

Grid spacing of 0.5 Å within diameter 10 Å sphere

Rotational discretization:

12 rotational steps

84 orientations
Sidechain Placement

Compared predicted vs.
actual location for 599 atoms
on testset protein
1


29.9% atoms within 0.5Å
% atoms
0.8
0.6
0.4
72.3% atoms within 1.0Å
0.2

93.0% atoms within 2.0Å
0
0
2
4
6
Accuracy (angstroms)

Recall 0.5Å grid spacing
8
Predictive Accuracy Task

We used DP matching score as
a predictor of amino acid type

Tested 49 ALA, LYS, TYR, VAL
residues

Highest scoring normalized
template determined type

61.2% accuracy
(majority classification = 33%)
The Good…

PREDICTED vs. ACTUAL
LYSINE
LYSINE
TYROSINE
VALINE
… and the Bad

PREDICTED vs. ACTUAL
LYSINE
VALINE
TYROSINE
ALANINE
Future Work

Implement & integrate backbone tracing algorithm, to
create complete two-tiered solution

Better strategies to handle illegal molecule configurations

perturbation of branches involved in collisions

more accurate representation of atomic energy function,
e.g. torsion angle

Better match function … make use of previous work?

More tests (larger training set, higher resolution)
Acknowledgements

NLM grant 1T15 LM007359-01

NLM grant 1R01 LM07050-01

NIH grant P50 GM64598.