Microarray Data Mining: Dealing with Challenges
Gregory Piatetsky-Shapiro, KDnuggets
BYU, Dec 8, 2005
© 2005 KDnuggets

DNA and Gene Expression
[Figure: cell → nucleus → chromosome → gene (DNA) → gene expression → gene (mRNA, single strand) → protein. Graphics courtesy of the National Human Genome Research Institute]

Gene Expression
- Cells are different because of differential gene expression.
- About 40% of human genes are expressed at one time.
- A gene is expressed by transcribing DNA exons into single-stranded mRNA.
- mRNA is later translated into a protein.
- Microarrays measure the level of mRNA expression.

Gene Expression Measurement
- mRNA expression represents dynamic aspects of the cell.
- mRNA expression can be measured with the latest technology.
- mRNA is isolated and labeled with a fluorescent protein.
- mRNA is hybridized to the target; the level of hybridization corresponds to light emission, which is measured with a laser.

Affymetrix DNA Microarrays
[Image courtesy of Affymetrix]

Some Probes Match
[Figure: gene sequence T A A T C G T A C G T T … aligned with a matching probe]

Other Probes Mismatch
[Figure: gene sequence A T T A G T A T G C T T … aligned with a mismatching probe]

Affymetrix Microarray Concept
1. mRNA segments tagged with fluorescent chemical
2. Matches with complementary probes
3.
Fluorescence measured with a laser
[Image courtesy of Affymetrix]

From Raw Image to Numbers
- Scanner → cleaning and processing software → expression values (enlarged section of raw image shown)

Gene              Value
D26528_at           193
D26561_cds1_at      -70
D26561_cds2_at      144
D26561_cds3_at       33
D26579_at           318
D26598_at          1764
D26599_at           153

Microarray Potential Applications
- New and better molecular diagnostics. Jan 11, 2005: the FDA approved the Roche Diagnostics AmpliChip, based on Affymetrix technology.
- New molecular targets for therapy: few new drugs so far, but a large pipeline.
- Improved treatment outcome, which partially depends on the genetic signature.
- Fundamental biological discovery: finding and refining biological pathways.
- Personalized medicine?!

Microarray Data Analysis Types
- Gene selection: find genes for therapeutic targets (new drugs)
- Classification (supervised learning): identify disease; predict outcome / select the best treatment
- Clustering (unsupervised): find new biological classes / refine existing ones
- Discovery of associations and pathways

Outline
- DNA and biology
- Microarray classification: best practices
- Gene set analysis
- Synthetic microarray data sets

Microarray Data Analysis Challenges
- Biological, process, and other variation
- The model needs to be explainable to biologists
- Few records (samples), usually < 100
- Many columns (genes), usually > 1,000
- This is very likely to produce false positives: "discoveries" due to random noise

Data Mining from Small Data: Example

Country   Favorite Food   Heart-Disease Rate
US        Wine            High
France    Wine            Low
England   Beer            High
Germany   Beer            Low

Data Mining from Small Data: Conclusion
- You can drink whatever you want.
- It is speaking English that causes heart disease.

Same Data, Different Results – Who is Right?
- Many examples (e.g.
CAMDA conferences) where researchers analyzed the same data and found different results and gene sets. Who is right?
- Good methodology is needed to avoid errors and get the best results.

Capturing Best Practices for Microarray Data Analysis
- Developed with SPSS and S. Ramaswamy (MIT / Whitehead Institute).
- Implemented as SPSS Microarray CATs (Clementine Application Templates).
- Also implemented using Weka and open-source software.
- "Capturing Best Practice for Microarray Gene Expression Data Analysis", G. Piatetsky-Shapiro, T. Khabaza, S. Ramaswamy, Proceedings of KDD-2003; www.kdnuggets.com/gpspubs/

Best Practices
- Capture the complete process, from raw data to final results
- Gene (feature) selection inside cross-validation
- Randomization testing
- Robust classification algorithms: simple methods give good results; advanced methods can be better
- Wrapper approach for best gene subset selection
- Use bagging to improve accuracy
- Remove or relabel mislabeled or poorly differentiated samples

Observations
- The simplest approaches are the most robust.
- Advanced approaches can be more accurate.
- A "small" increase in diagnostic accuracy (e.g.
90% to 98%) can greatly reduce the rate of false positives (5-fold).
[Plot: p(disease | test positive) vs. test accuracy (0–100%), for a disease rate of 0.01]

Microarray Classification: Desired Features
- Robust in the presence of false positives
- Stable under cross-validation
- Results understandable by biologists
- Returns a confidence/probability
- Fast enough

Popular Classification Methods
- Decision trees/rules: find the smallest gene sets, but not robust – poor performance
- Neural nets: work well for a reduced number of genes
- K-nearest neighbor: good results for a small number of genes, but no model
- Naïve Bayes: simple and robust, but ignores gene interactions
- Support vector machines (SVM): good accuracy, does its own gene selection, but hard to understand
- Specialized methods: D/S/A (Dudoit), …

Gene Reduction Improves Classification Accuracy – Why?
- Most learning algorithms look for non-linear combinations of features.
- They can easily find spurious combinations given few records and many genes – the "false positives" problem.
- Classification accuracy improves if we first reduce the number of genes by a linear method, e.g. T-values of the mean difference.

Feature Selection Approach
- Rank genes by a measure and select the top 100–200:
    T-test for mean difference:  T = (Avg1 − Avg2) / sqrt(σ1²/N1 + σ2²/N2)
    Signal-to-noise (S2N):       S2N = (Avg1 − Avg2) / (σ1 + σ2)
- Other methods …

Global Feature / Parameter Selection is Wrong
[Diagram: gene data + class data → data cleaning & preparation → feature/parameter selection on ALL data → train/test split → model building → evaluation]
- Because information is "leaked" via gene selection.
- Leads to overly "optimistic" classification results.
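The leakage argument above can be illustrated with a short sketch. This is not from the talk: it uses scikit-learn and purely synthetic noise data, and all names and sizes are illustrative. It contrasts gene ranking done inside each cross-validation fold with "global" selection done once on the full data.

```python
# Sketch (not from the talk): gene selection inside vs. outside
# cross-validation, using scikit-learn on synthetic noise data.
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 6000))    # ~100 samples, 6000 "genes", pure noise
y = rng.integers(0, 2, size=100)    # random two-class labels

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

# Correct: gene ranking (a t-type statistic) is re-done inside each fold,
# because the Pipeline refits the selector on each training split
pipe = Pipeline([
    ("select", SelectKBest(f_classif, k=50)),
    ("clf", LogisticRegression(max_iter=1000)),
])
honest = cross_val_score(pipe, X, y, cv=cv).mean()

# Wrong: "global" selection on ALL data before cross-validation,
# which leaks test-set information into the chosen genes
X_sel = SelectKBest(f_classif, k=50).fit_transform(X, y)
leaky = cross_val_score(LogisticRegression(max_iter=1000), X_sel, y, cv=cv).mean()

print(f"honest CV accuracy: {honest:.2f}, leaky estimate: {leaky:.2f}")
# On pure noise the honest estimate stays near chance (~0.5), while
# the leaky one is typically far higher -- the "optimistic" bias.
```

The same pattern applies to any ranking criterion (t-test, S2N): the selection step must see only the training split of each fold.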
Global Feature Selection Bias
- Example: data with 6,000 features, ~100 samples.
- Used 3 Weka algorithms: SVM, IB1, J48.
- The error estimate with global feature selection was 50% to 150% lower than with selection inside cross-validation – an overly "optimistic" error estimate.
[Chart: average % decrease in error from using global feature selection, for IB1, J48, and SVM, at 5, 10, 20, 30, and 50 best features]

Microarray Classification Process
[Diagram: gene data + class data → data cleaning & preparation → train/test split; feature/parameter selection and model building on the train data only; evaluation on the test data]

Evaluation
- Evaluation on one test dataset is not accurate (for small datasets).
- Evaluation, including feature/parameter selection, needs to be inside cross-validation (validation croisée).

Evaluation inside X-validation
[Diagrams for folds 1, 2, …, N: the gene data is split into a train portion and a final test portion; the full microarray classification process is run on the train portion, yielding final model i and final result i on final test i]

Which of the N Models is Your Final Model?
- None of them!
- The average error from models 1, …, N is the best estimate of the error of the process.
- You then apply the same process to the entire data to get your final and best model.

Iterative Wrapper Approach to Selecting the Best Gene Set
- A model with the top 100 genes is not optimal.
- Test models using the top 1, 2, 3, …, 10, 20, 30, 40, …, N genes with cross-validation.
- Gene selection – simple: an equal number of genes from each class; advanced: the best number from each class.
- For "randomized" algorithms (e.g.
neural nets), average 10 cross-validation runs.

Best Gene Set: One X-validation Run
[Chart: average error for 10-fold cross-validation (0–30%) vs. genes per class (1–40), for a single cross-validation run]

Best Gene Set: 10 X-validation Runs
- Select the gene set with the lowest combined error – good, but not optimal!
[Chart: average, high, and low error rates across all classes]

Multi-class Classification
- Simple: one model for all classes.
- Advanced: a separate model for each class.
[Chart: error rate vs. genes per class, for each class]

Example: Pediatric Brain Tumor Data
- 92 samples, 5 classes (MED, EPD, JPA, MGL, RHB), from the U. of Chicago Children's Hospital.
[Photomicrographs of tumors (400x)]

Single Neural Net

Class   1-Net Error Rate
MED       2.1%
MGL      17%
RHB      24%
EPD       9%
JPA      19%
*ALL*     8.3%

Bagging Improves Results
- Bagging, or simple voting of N different neural nets, improves the accuracy.

Bagging 100 Networks

Class   1-Net Error   Bag Error   Bag Avg Conf
MED       2.1%         2% (0)*      98%
MGL      17%          10%           83%
RHB      24%          11%           76%
EPD       9%           0            91%
JPA      19%           0            81%
*ALL*     8.3%         3% (2)*      92%

* Note: suspected error on one sample (labeled as MED but consistently classified as RHB).

Cross-validated Prediction Strength
[Figure: per-sample cross-validated prediction strength; MED shows strong predictions, some samples are poorly differentiated, and one is flagged "misclassified?"]
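The bagging scheme from the slides above (simple voting of N networks) can be sketched as follows. This is a minimal scikit-learn stand-in, not the SPSS/Clementine setup used in the talk, and the data is synthetic, shaped only loosely like the 92-sample, 5-class tumor set.

```python
# Sketch of bagging neural nets by voting, as on the slides above.
# scikit-learn stand-in with synthetic data; sizes are illustrative.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in for the 92-sample, 5-class tumor data
X, y = make_classification(n_samples=92, n_features=200, n_informative=20,
                           n_classes=5, n_clusters_per_class=1, random_state=0)

net = MLPClassifier(hidden_layer_sizes=(8,), max_iter=500, random_state=0)
bag = BaggingClassifier(net, n_estimators=10, random_state=0)  # vote of 10 nets

single = cross_val_score(net, X, y, cv=5).mean()   # one network
bagged = cross_val_score(bag, X, y, cv=5).mean()   # bagged vote

# Averaged class probabilities from the bag give a per-sample
# confidence, analogous to the "Bag Avg Conf" column above
bag.fit(X, y)
conf = bag.predict_proba(X).max(axis=1)
```

Samples that keep receiving high-confidence votes for the "wrong" class (like the MED sample consistently classified as RHB) are candidates for relabeling or removal.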
Outline
- DNA and biology
- Microarray classification: best practices
- Gene set analysis
- Synthetic microarray data sets

Gene Sets – Key to Power
- The number of genes is a problem for single-gene analysis.
- Analyzing gene sets increases statistical power.
- Group genes statistically (prototypes, correlations).
- Group genes using biological knowledge (pathways, Gene Ontology, medical text, …).

Gene Set Analysis: Gene Set Enrichment
- Mootha et al., Nature Genetics, July 2003.
- Question: are the genes involved in oxidative phosphorylation (~100 genes) related to diabetes?
- No gene in the OXPHOS group is individually significant (average ~20% decrease).

OXPHOS Gene Set Example
- The decrease is consistent for 89% of OXPHOS genes.
[Figure: mean expression of all genes (gray) and of OXPHOS genes (red) for people with DM2 (Type 2 diabetes mellitus) vs. normal]

Gene Sets from Gene Ontology (FatiGO)
- Cluster query – a subset of genes; reference – the remaining genes.
- http://bioinfo.cnio.es/docus/courses/FatiGO/

Comprehensive Microarray Test Suite
- Getting high accuracy is very important.
- Current data has too few samples; the "true" labels may be unknown.
- What methods are best for given data, and what is their estimated accuracy?
- Proposal: use realistic simulated microarray data to evaluate how current algorithms perform under different conditions.

Global Analysis of Microarray Data
- Analyzed the means of all genes across samples.
- Found: log(means > 1) follows an approximately normal distribution.
[Plot: log10(means) (blue) vs. normal quantiles (red)]
- Other regularities (not shown): log(SD) ≈ a·log(means) + b.

True Rank vs. T-value Rank
- Simulated 100 microarray datasets, with 6,000 genes and 30 samples (using R).
- The first 10 genes are up-regulated by an UP factor.

Future Research Directions
- Why is the gene distribution log-normal?
- Develop a synthetic microarray data generator.
- Which gene selection methods are most robust, and under what conditions?
- Evaluate algorithm accuracy for a given test set by creating similar synthetic sets with known answers.

Acknowledgements
- Eric Bremer, Children's Hospital (Chicago) & Northwestern U.
- Tom Khabaza, SPSS
- Sridhar Ramaswamy, MIT/Broad Institute
- Pablo Tamayo, MIT/Broad Institute

SIGKDD Explorations: Special Issue on Microarray Data Mining
www.acm.org/sigkdd/explorations/

Thank you!
Further resources:
- Data mining: www.KDnuggets.com
- Microarrays: www.KDnuggets.com/websites/microarray.html
- Data mining course (one semester): www.KDnuggets.com/dmcourse