Crystallization Image Analysis on the World Community Grid
Christian A. Cumbaa and Igor Jurisica
Jurisica Lab, Division of Signaling Biology
Ontario Cancer Institute, Toronto, Ontario

Why automate classification of protein crystallization trial images?
[Figure: example images labelled clear, phase separation, precipitate, skin, crystal (marked X), garbage, unsure]
• Hauptman-Woodward has 65,000,000 images.
  – They want 65,000,000 outcomes.

Why automate classification of protein crystallization trial images?
• Assist or replace human screening
• Speed the search phase in protein crystallization
• Improve throughput, consistency, objectivity
• Enables data mining and statistical optimization of the crystallization process
[Figure: example images labelled clear, precipitate, crystal]

Image classification
image (100000s of numbers) → feature extraction → feature 1, feature 2, …, feature k (10s of numbers) → classification → clear, phase separation, precipitate, skin, crystal, garbage, unsure (7 numbers)

Truth data
• 96 study
  – 96 proteins × 1536 images hand-scored by 3 experts
  – Presence/absence of 7 independent outcomes
• NESG & SGPP 96-study
  – 15000 images
  – Hand-scored by 1 expert, same scoring system
• 50% unanimously-scored images
  – 10 most interesting compound categories
[Figure: NESG (crystals), SGPP (crystals)]

Feature set
12375 features computed per image
– A few basic statistics
– 50 microcrystal features
– Euler number features, two variations
  1. 11 blur levels
  2. 11 blur levels × 4 thresholds
– Image “energy”
  • 11 blur levels
– 2925 Grey-Level Cooccurrence Matrix features
  • 3 different grey-level quantizations
  • 13 basic functions
  • 25 sample distances
  • ~100 directions
    – Computable from every point in the image
    – Distilled to max range, max mean, min mean
– ~9500 image-blob features
  • Radon & edge-detection

Our image analysis problem
• Computing all 12,375 features takes >5 hours for a single image
• We have 165,000 images in our training set
• Features must be evaluated for quality
• The best features (10s or low 100s) must be computed for the remaining 65,000,000 images
Massive computing resources required!

Image analysis on the World Community Grid
• http://www.worldcommunitygrid.org
  – a global, distributed-computing platform for solving large scientific computing problems with human impact
  – 377,627 volunteers contribute idle CPU time of 960,346 devices.
• Our project: Help Conquer Cancer*
  – launched November 2007.
• HCC has two goals:
  1. To survey a wide tract of image-feature space and identify image analysis algorithms and parameters (features) that best determine crystallization outcome.
  2. To perform the necessary image analysis on Hauptman-Woodward’s archive of 65,000,000 crystallization trial images.
* fundraising slogan of the Ontario Cancer Institute and its parent organization.

Image analysis on the World Community Grid
• HCC has two phases
  – Phase I: calculate 12,375 features per image on high-priority images, including 165,441 hand-scored images.
    – November 2007-May 2008
    – analysis on hand-scored images completed January 2008
  – Phase II: calculate the best features from Phase I on the backlog of HWI images
• Grid members have contributed 8,919 CPU-years so far to HCC, an average of 55 CPU-years per day.
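
The scale of the problem can be checked with a back-of-envelope calculation. The sketch below uses only figures quoted on the slides (more than 5 hours per image, 165,441 hand-scored images, 65,000,000 archive images, 55 CPU-years per day) and assumes that one CPU-year means one CPU busy for 365 days. The names `cpu_years` and `HOURS_PER_CPU_YEAR` are illustrative, and because ">5 hours" is a lower bound on unspecified hardware, the results are rough estimates only.

```python
# Figures quoted on the slides.
HOURS_PER_IMAGE = 5             # lower bound: the slides say ">5 hours" for all 12,375 features
HAND_SCORED_IMAGES = 165_441    # Phase I hand-scored images
ARCHIVE_IMAGES = 65_000_000     # Hauptman-Woodward archive
GRID_CPU_YEARS_PER_DAY = 55     # average HCC contribution reported on the slides

# Assumption (not from the slides): one CPU-year is one CPU busy for 365 days.
HOURS_PER_CPU_YEAR = 24 * 365


def cpu_years(images, hours_per_image=HOURS_PER_IMAGE):
    """CPU-years needed to process `images` at a fixed cost per image."""
    return images * hours_per_image / HOURS_PER_CPU_YEAR


phase1_cpu_years = cpu_years(HAND_SCORED_IMAGES)
archive_cpu_years = cpu_years(ARCHIVE_IMAGES)
archive_days_on_grid = archive_cpu_years / GRID_CPU_YEARS_PER_DAY

print(f"Hand-scored set, all features: {phase1_cpu_years:,.0f} CPU-years")
print(f"Whole archive, all features:   {archive_cpu_years:,.0f} CPU-years")
print(f"Whole archive at grid rate:    {archive_days_on_grid:,.0f} days")
```

Under these assumptions the hand-scored set alone costs roughly 94 CPU-years, and running the full feature set over the whole archive would cost about 37,000 CPU-years, which is consistent with the plan to compute only the best features in Phase II.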
Phase I: feature assessment

Measuring feature quality
• Treat as random variables:
  – Image class
  – Feature value
• Measure the mutual information between them (unit: bits)
  mutual information = entropy(class) + entropy(feature) - entropy(class, feature)
[Figure: class entropy and feature entropy; axis "Entropy (bits)" from 0 to 1; classes clear, phase separation, precipitate, skin, crystal, garbage, unsure]

Measuring feature quality
[Figure: panels labelled clear, precipitate (no crystal), other]

Information density: microcrystal counts
[Figure: parameter space; panels Clear, Precipitate, Crystal]

Information density: GLCM maximum range
[Figure: parameter space; panels Clear, Precipitate, Crystal]

Information density: Radon-Sobel soft sum
[Figure: parameter space; panels Clear, Precipitate, Crystal]

Information density: Radon-Sobel blob metrics (means)
[Figure: parameter space; panels Clear, Precipitate, Crystal]

Towards Phase II: image classification

Building classifiers
• handpicked 74 features from peaks in the clear, precipitate and other mutual information plots
• two classification schemes
  – three-way: clear, non-crystal precipitate, other
  – ten-way: clear, phase separation, phase + precipitate, skin, phase + crystal, precip, precip + skin, precip + crystal, crystal, garbage
• naïve Bayes model
• leave-one-out cross-validation

Measuring classifier accuracy: precision and recall
[Figure: diagram relating the set of crystals to the set labelled “I think these are crystals”, marking true positives, false negatives, false positives, recall, and precision]

Three-class distribution
Clear                          24.3%
Precipitate AND NOT crystal    52.7%
Other                          23.0%

Confusion matrix (rows: true class; columns: machine says)
                           clear  non-crystal precipitate  other
clear                      27615                      817    617
non-crystal precipitate     1819                    45112  15928
other                       5109                     5258  17095

Recall & precision
[Figure]

10-class distribution
Clear                            33.83%
Phase separation                  7.00%
Phase separation + precipitate    0.50%
Skin                              0.79%
Phase separation + crystal        2.32%
Precipitate                      34.25%
Precipitate + skin                4.95%
Precipitate + crystal             7.53%
Crystal                           8.34%
Garbage                           0.55%
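
Two calculations from the slides above can be sketched in a few lines: the mutual-information score used to rate a feature, and per-class recall and precision read off the three-way confusion matrix. This is a minimal illustration, not the authors' implementation. It assumes feature values have already been binned into discrete levels (the slides do not say how), and it assumes rows of the matrix are true classes and columns are machine calls, an orientation inferred from the fact that the row totals reproduce the three-class distribution. The function names and the toy data are hypothetical.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy, in bits, of a sequence of discrete values."""
    counts = Counter(values)
    n = len(values)
    return -sum(c / n * math.log2(c / n) for c in counts.values())


def mutual_information(classes, feature_bins):
    """entropy(class) + entropy(feature) - entropy(class, feature), in bits."""
    joint = list(zip(classes, feature_bins))
    return entropy(classes) + entropy(feature_bins) - entropy(joint)


def recall_precision(matrix, labels):
    """Per-class (recall, precision); rows are true classes, columns are machine calls."""
    result = {}
    for i, label in enumerate(labels):
        true_pos = matrix[i][i]
        row_total = sum(matrix[i])                 # images truly in this class
        col_total = sum(row[i] for row in matrix)  # images the machine put in this class
        result[label] = (true_pos / row_total, true_pos / col_total)
    return result


# Toy data (hypothetical): a binned feature that separates two classes perfectly.
toy_classes = ["clear", "clear", "crystal", "crystal"]
toy_bins = [0, 0, 1, 1]
mi = mutual_information(toy_classes, toy_bins)
print(f"toy mutual information: {mi:.2f} bits")

# Counts copied from the three-way confusion matrix on the slides.
LABELS = ["clear", "non-crystal precipitate", "other"]
THREE_WAY = [
    [27615, 817, 617],
    [1819, 45112, 15928],
    [5109, 5258, 17095],
]
scores = recall_precision(THREE_WAY, LABELS)
for label, (recall, precision) in scores.items():
    print(f"{label}: recall {recall:.1%}, precision {precision:.1%}")
```

With this orientation, the clear class comes out at about 95% recall and about 80% precision. If the matrix were in fact transposed, the two numbers for each class would simply swap.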
Confusion matrix (rows: true class; columns: machine says)
                           clear  phase sep.  phase+precip  skin  phase+crystal  precip  precip+skin  precip+crystal  crystal  garbage
clear                      25585         227             1  1135              0     815            1               0       92     1193
phase separation            1446        2433            40   281            668     298           75             139      503       91
phase and precipitate          1          24            32     6             51      97           81             107       31        3
skin                         126          29             0   372              6      13            5               0      105       20
phase and crystal             74         268            37    85            511      75           88             292      551       10
precipitate                  441        1972           494   617            553   16907         3440            4088      512      385
precipitate and skin          12         205            33   243            328     692         2008             395      305       29
precipitate and crystal       35         222            85   111            562    1063          611            2852      914        8
crystal                      888         345            56   586            649     219           90            1072     3129      129
garbage                       28           4             0    49              1      52            2               0       20      313

Recall & precision
[Figure]

Acknowledgements
Hauptman-Woodward Medical Research Institute: George DeTitta, Joe Luft, Eddie Snell, Mike Malkowski, Angela Lauricella, Max Thayer, Raymond Nagel, Steve Potter, and the 96-study reviewers.
World Community Grid: Bill Bovermann, Viktors Berstis, Jonathan D. Armstrong, Tedi Hahn, Kevin Reed, Keith J. Uplinger, Nels Wadycki
IBM Deep Computing: Jerry Heyman
Jurisica Lab: Richard Lu
Funding from
  NIH U54 GM074899
  Genome Canada
  IBM
  NSERC
(and earlier work from)
  NIH P50 GM62413
  NSERC
  CITO
All crystallization images were generated at the High-Throughput Screening lab at The Hauptman-Woodward Institute.