Crystallization Image Analysis
on the World Community Grid
Christian A. Cumbaa and Igor Jurisica
Jurisica Lab, Division of Signaling Biology
Ontario Cancer Institute,
Toronto, Ontario
Why automate classification of
protein crystallization trial images?
[Figure: example trial image with a checklist of seven outcomes: clear, phase separation, precipitate, skin, crystal (marked), garbage, unsure]
• Hauptman-Woodward has 65,000,000 images.
– They want 65,000,000 outcomes.
Why automate classification of
protein crystallization trial images?
• Assist or replace human screening
• Speed the search phase in protein crystallization
• Improve throughput, consistency, objectivity
• Enable data mining and statistical optimization of the crystallization process
[Figure: example images labelled clear, precipitate, crystal]
Image classification
image (100,000s of numbers)
→ feature extraction → feature 1, feature 2, …, feature k (10s of numbers)
→ classification → clear, phase separation, precipitate, skin, crystal, garbage, unsure (7 numbers)
Truth data
• 96-study
– 96 proteins × 1536 images, hand-scored by 3 experts
– Presence/absence of 7 independent outcomes
• NESG & SGPP
– 15,000 images
– Hand-scored by 1 expert, same scoring system
• 50% unanimously-scored images
– 10 most interesting compound categories
[Figure labels: 96-study, NESG (crystals), SGPP (crystals)]
Feature set
12,375 features computed per image
– A few basic statistics
– 50 microcrystal features
– Euler number features, two variations
  1. 11 blur levels
  2. 11 blur levels × 4 thresholds
– Image “energy”
  • 11 blur levels
– 2925 Grey-Level Co-occurrence Matrix (GLCM) features
  • 3 different grey-level quantizations
  • 13 basic functions
  • 25 sample distances
  • ~100 directions
  – Computable from every point in the image
  – Distilled to max range, max mean, min mean
– ~9500 image-blob features
  • Radon & edge-detection
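To make the GLCM bullet concrete, here is a minimal plain-Python sketch of one co-occurrence feature. The equal-width quantization, the choice of `contrast` as the example statistic, and every function name are assumptions made for illustration; the slide does not name its 13 basic functions or describe its quantization scheme.

```python
from collections import Counter

def quantize(image, levels, max_value=255):
    """Map grey values in [0, max_value] onto integer bins 0..levels-1."""
    return [[min(v * levels // (max_value + 1), levels - 1) for v in row]
            for row in image]

def glcm(image, dx, dy):
    """Relative frequency of each grey-level pair at pixel offset (dx, dy)."""
    height, width = len(image), len(image[0])
    counts = Counter()
    for y in range(height):
        for x in range(width):
            x2, y2 = x + dx, y + dy
            if 0 <= x2 < width and 0 <= y2 < height:
                counts[(image[y][x], image[y2][x2])] += 1
    total = sum(counts.values())
    return {pair: n / total for pair, n in counts.items()}

def contrast(p):
    """One common GLCM statistic: expected squared grey-level difference."""
    return sum(prob * (i - j) ** 2 for (i, j), prob in p.items())

# Toy 4x4 image that is already quantized to 4 grey levels.
toy = [[0, 0, 1, 1],
       [0, 0, 1, 1],
       [0, 2, 2, 2],
       [2, 2, 3, 3]]

# The same statistic evaluated for a few directions at sample distance 1.
per_direction = {d: contrast(glcm(toy, *d)) for d in [(1, 0), (0, 1), (1, 1)]}

# One possible reading of the slide's count (an assumption): the directions
# are folded into three summaries (max range, max mean, min mean).
feature_count = 3 * 13 * 25 * 3
```

Under that reading, 3 quantizations × 13 functions × 25 distances × 3 summaries accounts for the 2925 GLCM features.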
Our image analysis problem
• Computing all 12,375 features takes >5 hours
for a single image
• We have 165,000 images in our training set
• Features must be evaluated for quality
• The best features (10s or low 100s) must be
computed for the remaining 65,000,000 images
Massive computing resources required!
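The scale of the problem can be checked with a few lines of arithmetic. This is a rough sketch that treats the slide's ">5 hours" as exactly 5 hours, so the results are lower bounds; the constant and function names are illustrative.

```python
HOURS_PER_IMAGE = 5            # lower bound; the slide says ">5 hours"
TRAINING_IMAGES = 165_000
ARCHIVE_IMAGES = 65_000_000
HOURS_PER_YEAR = 24 * 365

def cpu_years(images, hours_per_image):
    """CPU-years needed to process `images` on a single core."""
    return images * hours_per_image / HOURS_PER_YEAR

# Full 12,375-feature extraction on the training set alone.
training_cpu_years = cpu_years(TRAINING_IMAGES, HOURS_PER_IMAGE)

# Hypothetical cost of running the full feature set on the whole archive,
# which is why only the best features are computed for those images.
archive_cpu_years = cpu_years(ARCHIVE_IMAGES, HOURS_PER_IMAGE)

print(f"training set: at least {training_cpu_years:.0f} CPU-years")
print(f"archive, all features: at least {archive_cpu_years:.0f} CPU-years")
```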
Image analysis on the World
Community Grid
• http://www.worldcommunitygrid.org
– a global, distributed-computing platform for solving large scientific computing problems with human impact
– 377,627 volunteers contribute idle CPU time of 960,346 devices.
• Our project: Help Conquer Cancer*
– launched November 2007.
• HCC has two goals:
1. To survey a wide tract of image-feature space and identify image analysis algorithms and parameters (features) that best determine crystallization outcome.
2. To perform the necessary image analysis on Hauptman-Woodward’s archive of 65,000,000 crystallization trial images.
* fundraising slogan of the Ontario Cancer Institute and its parent organization.
Image analysis on the World
Community Grid
• HCC has two phases
– Phase I: calculate 12,375 features per image on high-priority images, including 165,441 hand-scored images.
  • November 2007-May 2008
  • analysis on hand-scored images completed January 2008
– Phase II: calculate the best features from Phase I on the backlog of HWI images
• Grid members have contributed 8,919 CPU-years so far to HCC, an average of 55 CPU-years per day.
Phase I: feature assessment
Measuring feature quality
• Treat as random variables:
– Image class
– Feature value
• Measure the mutual information between them (unit: bits)
  = entropy(class) + entropy(feature) - entropy(class, feature)
[Figure: chart of entropy (bits, axis from 0 to 1) per outcome (clear, phase separation, precipitate, skin, crystal, garbage, unsure), with labels "class entropy" and "feature entropy"]
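The mutual information formula on this slide can be sketched in a few lines of Python. The equal-width binning of feature values and the function names are assumptions; the slides do not say how continuous feature values were discretized.

```python
from collections import Counter
from math import log2

def entropy(values):
    """Shannon entropy in bits of a sequence of discrete values."""
    n = len(values)
    return -sum(c / n * log2(c / n) for c in Counter(values).values())

def mutual_information(classes, features):
    """entropy(class) + entropy(feature) - entropy(class, feature)."""
    joint = list(zip(classes, features))
    return entropy(classes) + entropy(features) - entropy(joint)

def discretize(values, bins):
    """Equal-width binning (an assumed scheme) of a continuous feature."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1.0   # avoid dividing by zero for constants
    return [min(int((v - lo) / width), bins - 1) for v in values]

# Toy data: a binary class label and one feature that separates it cleanly.
labels = ["clear"] * 4 + ["crystal"] * 4
feature = [0.10, 0.20, 0.15, 0.05, 0.90, 0.80, 0.95, 0.85]

informative = mutual_information(labels, discretize(feature, 2))
uninformative = mutual_information(labels, [0] * len(labels))
```

A feature that separates the two classes perfectly carries the full class entropy, while a constant feature carries none; features scoring near the top of this scale are the candidates worth keeping.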
Measuring feature quality
[Figure: plots for three classes: clear, precipitate (no crystal), other]
Information density: microcrystal counts parameter space
[Figure panels: Clear, Precipitate, Crystal]
Information density: GLCM maximum range parameter space
[Figure panels: Clear, Precipitate, Crystal]
Information density: Radon-Sobel soft sum parameter space
[Figure panels: Clear, Precipitate, Crystal]
Information density: Radon-Sobel blob metrics (means) parameter space
[Figure panels: Clear, Precipitate, Crystal]
Towards Phase II: image
classification
Building classifiers
• handpicked 74 features from peaks in the clear, precipitate and other mutual information plots
• two classification schemes
– three-way: clear, non-crystal precipitate, other
– ten-way: clear, phase separation, phase + precipitate, skin, phase + crystal, precip, precip + skin, precip + crystal, crystal, garbage
• naïve Bayes model
• leave-one-out cross-validation
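As an illustration of the last two bullets, the following sketch pairs a naïve Bayes classifier with leave-one-out cross-validation in plain Python. The Gaussian likelihood, the variance floor, the toy data and all names are assumptions; the slides do not specify which naïve Bayes variant was used.

```python
from collections import defaultdict
from math import log, pi

def fit(rows, labels):
    """Estimate a prior plus per-feature mean and variance for each class."""
    by_class = defaultdict(list)
    for row, label in zip(rows, labels):
        by_class[label].append(row)
    model = {}
    for label, members in by_class.items():
        n = len(members)
        columns = list(zip(*members))
        means = [sum(col) / n for col in columns]
        # Small floor keeps the variance positive for tiny classes.
        variances = [sum((v - m) ** 2 for v in col) / n + 1e-6
                     for col, m in zip(columns, means)]
        model[label] = (n / len(rows), means, variances)
    return model

def predict(model, row):
    """Pick the class with the highest log posterior under independence."""
    def log_posterior(label):
        prior, means, variances = model[label]
        return log(prior) + sum(
            -0.5 * log(2 * pi * var) - (v - m) ** 2 / (2 * var)
            for v, m, var in zip(row, means, variances))
    return max(model, key=log_posterior)

def leave_one_out(rows, labels):
    """Predict each sample from a model trained on all the other samples."""
    predictions = []
    for i in range(len(rows)):
        model = fit(rows[:i] + rows[i + 1:], labels[:i] + labels[i + 1:])
        predictions.append(predict(model, rows[i]))
    return predictions

# Toy two-feature data set, invented for this example.
rows = [[0.10, 1.0], [0.20, 1.1], [0.15, 0.9],
        [0.90, 0.2], [0.80, 0.1], [0.95, 0.3]]
labels = ["clear"] * 3 + ["crystal"] * 3
predictions = leave_one_out(rows, labels)
```

Comparing `predictions` with `labels` entry by entry gives the counts that fill a confusion matrix.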
Measuring classifier accuracy:
precision and recall
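[Figure: diagram of two overlapping sets, the true crystals and the images the classifier flags as “I think these are crystals”. The overlap is the true positives; crystals the classifier misses are false negatives; flagged images that are not crystals are false positives.]
recall = true positives / (true positives + false negatives)
precision = true positives / (true positives + false positives)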
Three-class distribution
Clear: 24.3%
Precipitate AND NOT crystal: 52.7%
Other: 23.0%

Confusion matrix (rows: true class; columns: machine says)
                         clear  non-crystal precipitate  other
clear                    27615                      817    617
non-crystal precipitate   1819                    45112  15928
other                     5109                     5258  17095
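Recall and precision for each class follow directly from the three-way confusion matrix. The sketch below assumes rows are true classes and columns are the machine's answers; that orientation is inferred from the row totals, which reproduce the three-class distribution. Function and variable names are illustrative.

```python
CLASSES = ["clear", "non-crystal precipitate", "other"]

# Counts from the three-way confusion matrix; rows are assumed to be the
# true class and columns the machine's answer.
CONFUSION = [
    [27615,   817,   617],
    [ 1819, 45112, 15928],
    [ 5109,  5258, 17095],
]

def recall(matrix, k):
    """Fraction of true class-k images that the machine labelled k."""
    return matrix[k][k] / sum(matrix[k])

def precision(matrix, k):
    """Fraction of images the machine labelled k that really are class k."""
    return matrix[k][k] / sum(row[k] for row in matrix)

scores = {name: (recall(CONFUSION, k), precision(CONFUSION, k))
          for k, name in enumerate(CLASSES)}

for name, (r, p) in scores.items():
    print(f"{name}: recall {r:.3f}, precision {p:.3f}")
```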
Recall & precision
10-class distribution
Clear: 33.83%
Phase separation: 7.00%
Phase separation + precipitate: 0.50%
Skin: 0.79%
Phase separation + crystal: 2.32%
Precipitate: 34.25%
Precipitate + skin: 4.95%
Precipitate + crystal: 7.53%
Crystal: 8.34%
Garbage: 0.55%
Confusion matrix (rows: true class; columns: machine says, numbered in the same order as the rows)
true class                     1     2     3     4     5      6     7     8     9    10
1  clear                   25585   227     1  1135     0    815     1     0    92  1193
2  phase separation         1446  2433    40   281   668    298    75   139   503    91
3  phase and precipitate       1    24    32     6    51     97    81   107    31     3
4  skin                      126    29     0   372     6     13     5     0   105    20
5  phase and crystal          74   268    37    85   511     75    88   292   551    10
6  precipitate               441  1972   494   617   553  16907  3440  4088   512   385
7  precipitate and skin       12   205    33   243   328    692  2008   395   305    29
8  precipitate and crystal    35   222    85   111   562   1063   611  2852   914     8
9  crystal                   888   345    56   586   649    219    90  1072  3129   129
10 garbage                    28     4     0    49     1     52     2     0    20   313
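A quick consistency check on the ten-way matrix, sketched in Python. It assumes rows are true classes and that the columns follow the same class order as the rows; both are inferences from the flattened slide, and the check on row totals is one way to test the first of them. Names are illustrative.

```python
CLASSES = ["clear", "phase separation", "phase and precipitate", "skin",
           "phase and crystal", "precipitate", "precipitate and skin",
           "precipitate and crystal", "crystal", "garbage"]

# Ten-way confusion matrix; rows assumed true class, columns machine says.
CONFUSION = [
    [25585,  227,   1, 1135,   0,   815,    1,    0,   92, 1193],
    [ 1446, 2433,  40,  281, 668,   298,   75,  139,  503,   91],
    [    1,   24,  32,    6,  51,    97,   81,  107,   31,    3],
    [  126,   29,   0,  372,   6,    13,    5,    0,  105,   20],
    [   74,  268,  37,   85, 511,    75,   88,  292,  551,   10],
    [  441, 1972, 494,  617, 553, 16907, 3440, 4088,  512,  385],
    [   12,  205,  33,  243, 328,   692, 2008,  395,  305,   29],
    [   35,  222,  85,  111, 562,  1063,  611, 2852,  914,    8],
    [  888,  345,  56,  586, 649,   219,   90, 1072, 3129,  129],
    [   28,    4,   0,   49,   1,    52,    2,    0,   20,  313],
]

# Percentages from the 10-class distribution slide, in the same class order.
STATED_PERCENT = [33.83, 7.00, 0.50, 0.79, 2.32, 34.25, 4.95, 7.53, 8.34, 0.55]

total = sum(sum(row) for row in CONFUSION)

# If rows are true classes, row totals should reproduce the distribution.
row_percent = [100 * sum(row) / total for row in CONFUSION]
largest_gap = max(abs(a - b) for a, b in zip(row_percent, STATED_PERCENT))

# Overall agreement between machine and truth: the diagonal over the total.
accuracy = sum(CONFUSION[k][k] for k in range(len(CLASSES))) / total
```

A small `largest_gap` supports reading the rows as true classes; `accuracy` is the share of images on the diagonal.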
Recall & precision
Acknowledgements
Hauptman-Woodward Medical Research Institute
George DeTitta, Joe Luft, Eddie Snell, Mike Malkowski, Angela Lauricella, Max Thayer, Raymond Nagel, Steve Potter, and the 96-study reviewers.
World Community Grid
Bill Bovermann, Viktors Berstis, Jonathan D. Armstrong, Tedi Hahn, Kevin Reed, Keith J. Uplinger, Nels Wadycki
IBM Deep Computing:
Jerry Heyman
Jurisica Lab:
Richard Lu
Funding from
NIH U54 GM074899
Genome Canada
IBM
NSERC
(and earlier work from)
NIH P50 GM62413
NSERC
CITO
All crystallization images were generated at the High-Throughput Screening lab at The Hauptman-Woodward Institute.