Microarray Data Mining: Dealing with Challenges
Gregory Piatetsky-Shapiro, KDnuggets
BYU, Dec 8, 2005
© 2005 KDnuggets

DNA and Gene Expression
[Diagram: cell → nucleus → chromosome → gene (DNA) → gene expression → gene (mRNA, single strand) → protein. Graphics courtesy of the National Human Genome Research Institute]

Gene Expression
- Cells are different because of differential gene expression.
- About 40% of human genes are expressed at one time.
- A gene is expressed by transcribing DNA exons into single-stranded mRNA.
- mRNA is later translated into a protein.
- Microarrays measure the level of mRNA expression.

Gene Expression Measurement
- mRNA expression represents dynamic aspects of the cell.
- mRNA expression can be measured with the latest technology.
- mRNA is isolated and labeled with a fluorescent protein.
- mRNA is hybridized to the target; the level of hybridization corresponds to light emission, which is measured with a laser.

Affymetrix DNA Microarrays
[Image courtesy of Affymetrix]

Some Probes Match
[Diagram: a match probe hybridizing to its complementary gene sequence T A A T C G T A C G T T …]

Other Probes Mismatch
[Diagram: a mismatch probe failing to hybridize to the gene; probe sequence A T T A G T A T G C T T …]

Affymetrix Microarray Concept
1. mRNA segments tagged with fluorescent chemical
2. Matches with complementary probes
3.
Fluorescence measured with laser
[Image courtesy of Affymetrix]

From Raw Image to Numbers
Scanner → cleaning and processing software → expression values
[Enlarged section of raw image]

Gene            | Value
D26528_at       | 193
D26561_cds1_at  | -70
D26561_cds2_at  | 144
D26561_cds3_at  | 33
D26579_at       | 318
D26598_at       | 1764
D26599_at       | 153

Microarray Potential Applications
- New and better molecular diagnostics
  - Jan 11, 2005: FDA approved the Roche Diagnostics AmpliChip, based on Affymetrix technology
- New molecular targets for therapy
  - few new drugs, large pipeline, …
- Improved treatment outcome
  - partially depends on genetic signature
- Fundamental biological discovery
  - finding and refining biological pathways
- Personalized medicine?!

Microarray Data Analysis Types
- Gene selection
  - find genes for therapeutic targets (new drugs)
- Classification (supervised learning)
  - identify disease
  - predict outcome / select best treatment
- Clustering (unsupervised)
  - find new biological classes / refine existing ones
- Discovery of associations and pathways

Outline
- DNA and biology
- Microarray classification – best practices
- Gene set analysis
- Synthetic microarray data sets

Microarray Data Analysis Challenges
- Biological, process, and other variation
- The model needs to be explainable to biologists
- Few records (samples), usually < 100
- Many columns (genes), usually > 1,000
- This is very likely to result in false positives: "discoveries" due to random noise

Data Mining from Small Data: Example

Country  | Favorite Food | Heart-Disease Rate
US       | Wine          | High
France   | Wine          | Low
England  | Beer          | High
Germany  | Beer          | Low

Data Mining from Small Data: Conclusion
- You can drink whatever you want
- It is speaking English which causes heart disease

Same Data, Different Results – Who is
Right?
- Many examples (e.g. the CAMDA conferences) where researchers analyzed the same data and found different results and gene sets.
- Who is right?
- Good methodology is needed
  - for avoiding errors
  - for best results

Capturing Best Practices for Microarray Data Analysis
- Developed with SPSS and S. Ramaswamy (MIT / Whitehead Institute).
- Implemented as SPSS Microarray CATs (Clementine Application Templates).
- Also implemented using Weka and open-source software.
Capturing Best Practice for Microarray Gene Expression Data Analysis, G. Piatetsky-Shapiro, T. Khabaza, S. Ramaswamy, Proceedings of KDD-2003, www.kdnuggets.com/gpspubs/

Best Practices
- Capture the complete process, from raw data to final results
- Gene (feature) selection inside cross-validation
- Randomization testing
- Robust classification algorithms
  - simple methods give good results
  - advanced methods can be better
- Wrapper approach for best gene subset selection
- Use bagging to improve accuracy
- Remove/relabel mislabeled or poorly differentiated samples

Observations
- The simplest approaches are the most robust
- Advanced approaches can be more accurate
- A "small" increase in diagnostic accuracy (e.g.
90% to 98%) can greatly reduce the rate of false positives (5-fold): with a disease rate of 0.01, p(disease | test = yes) rises from about 8% at 90% accuracy to about 33% at 98% accuracy.
[Plot: p(disease | test = yes) vs. accuracy 0–100%, for disease rate = 0.01]

Microarray Classification: Desired Features
- Robust in the presence of false positives
- Stable under cross-validation
- Results understandable by biologists
- Returns a confidence/probability
- Fast enough

Popular Classification Methods
- Decision trees/rules
  - find the smallest gene sets, but not robust – poor performance
- Neural nets – work well for a reduced number of genes
- K-nearest neighbor – good results for a small number of genes, but no model
- Naïve Bayes – simple, robust, but ignores gene interactions
- Support vector machines (SVM)
  - good accuracy, does its own gene selection, but hard to understand
- Specialized methods: D/S/A (Dudoit), …

Gene Reduction Improves Classification Accuracy – Why?
- Most learning algorithms look for non-linear combinations of features
- They can easily find spurious combinations given few records and many genes – the "false positives" problem
- Classification accuracy improves if we first reduce the number of genes by a linear method
  - e.g. t-values of the mean difference

Feature Selection Approach
- Rank genes by a measure and select the top 100–200
- t-test for mean difference: T = (Avg1 − Avg2) / sqrt(σ1²/N1 + σ2²/N2)
- Signal-to-noise: S2N = (Avg1 − Avg2) / (σ1 + σ2)
- Other methods …

Global Feature / Parameter Selection Is Wrong
[Diagram: gene data + class data → data cleaning & preparation → feature/parameter selection → train data → model building → evaluation → test data]
Because information is "leaked" via gene selection.
Leads to overly "optimistic" classification results.
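The leakage argument above can be made concrete with a small simulation. The sketch below is not from the talk: it uses made-up pure-noise data, an illustrative |mean-difference| gene ranking, and a nearest-centroid classifier as a stand-in for the algorithms mentioned. Because the data contain no real signal, the true accuracy of any classifier is 50%; ranking genes on all samples before leave-one-out evaluation still yields a much higher estimate, while ranking inside each fold does not.

```python
# Sketch: why gene selection must happen INSIDE cross-validation.
# Pure-noise "microarray" data, so true accuracy is 50% by construction.
import random
import statistics

random.seed(0)
N_SAMPLES, N_GENES, TOP_K = 40, 1000, 10
labels = [0] * (N_SAMPLES // 2) + [1] * (N_SAMPLES // 2)
# data[i][g]: expression of gene g in sample i -- random noise, no real signal
data = [[random.gauss(0.0, 1.0) for _ in range(N_GENES)] for _ in range(N_SAMPLES)]

def top_genes(idx):
    """Rank genes by |mean difference| between classes, using only samples in idx."""
    def score(g):
        v0 = [data[i][g] for i in idx if labels[i] == 0]
        v1 = [data[i][g] for i in idx if labels[i] == 1]
        return abs(statistics.mean(v0) - statistics.mean(v1))
    return sorted(range(N_GENES), key=score, reverse=True)[:TOP_K]

def classify(i, train_idx, genes):
    """Nearest-centroid prediction for sample i on the selected genes."""
    def centroid(cls):
        members = [j for j in train_idx if labels[j] == cls]
        return [statistics.mean(data[j][g] for j in members) for g in genes]
    def dist(c):
        return sum((data[i][g] - c[k]) ** 2 for k, g in enumerate(genes))
    return 0 if dist(centroid(0)) <= dist(centroid(1)) else 1

def loo_accuracy(global_selection):
    """Leave-one-out accuracy, with gene selection done globally or per fold."""
    hits = 0
    all_idx = list(range(N_SAMPLES))
    genes_global = top_genes(all_idx)  # selection sees every label -- leaky!
    for i in all_idx:
        train = [j for j in all_idx if j != i]
        genes = genes_global if global_selection else top_genes(train)
        hits += classify(i, train, genes) == labels[i]
    return hits / N_SAMPLES

biased = loo_accuracy(global_selection=True)    # overly "optimistic"
honest = loo_accuracy(global_selection=False)   # close to the true 50%
print(f"global selection: {biased:.0%}, selection inside CV: {honest:.0%}")
```

With this setup the global-selection estimate should land well above the in-fold estimate, which is exactly the optimism the slides warn about.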
Global Feature Selection Bias
- Example: data with 6000 features, ~100 samples
- Used 3 Weka algorithms: SVM, IB1, J48
- The estimated error with global feature selection was 50% to 150% lower than with selection inside cross-validation – an overly "optimistic" error estimate
[Chart: average % decrease in estimated error from using global feature selection, for IB1, J48, and SVM with 5, 10, 20, 30, and 50 best features; y-axis roughly −50 to 200]

Microarray Classification Process
[Diagram: gene data + class data → data cleaning & preparation → train data (feature/parameter selection, model building) and test data (evaluation)]

Evaluation
- Evaluation on one test dataset is not accurate (for small datasets)
- Evaluation, including feature/parameter selection, needs to be inside cross-validation (validation croisée)

Evaluation inside X-validation (folds 1, 2, …, N)
[Diagram, repeated for each fold i: gene data → train on the remaining folds using the full microarray classification process → final test i → final model i → final result i]

Which of the N models is your final model?
- None of them!
- The average error from models 1, …, N is the best estimate of the error of the process
- You then use the same process on the entire data to get your final and best model

Iterative Wrapper Approach to Selecting the Best Gene Set
- A model with the top 100 genes is not optimal
- Test models using the 1, 2, 3, …, 10, 20, 30, 40, …, N top genes with cross-validation
- Gene selection:
  - simple: an equal number of genes from each class
  - advanced: the best number from each class
- For "randomized" algorithms (e.g.
neural nets), average 10 cross-validation runs

Best Gene Set: One X-validation Run
[Chart: average error for 10-fold cross-validation (0–30%) vs. genes per class (1–40), single cross-validation run]

Best Gene Set: 10 X-validation Runs
- Select the gene set with the lowest combined error
- good, but not optimal!
[Chart: average, high, and low error rates for all classes]

Multi-class Classification
- Simple: one model for all classes
- Advanced: a separate model for each class

Error Rates for Each Class
[Chart: error rate vs. genes per class, one curve per class]

Example: Pediatric Brain Tumor Data
- 92 samples, 5 classes (MED, EPD, JPA, MGL, RHB) from the U. of Chicago Children's Hospital
[Photomicrographs of tumors (400×)]

Single Neural Net

Class | 1-Net Error Rate
MED   | 2.1%
MGL   | 17%
RHB   | 24%
EPD   | 9%
JPA   | 19%
*ALL* | 8.3%

Bagging Improves Results
- Bagging, or simple voting of N different neural nets, improves the accuracy

Bagging 100 Networks

Class | 1-Net Error Rate | Bag Error Rate | Bag Avg Conf
MED   | 2.1%             | 2% (0)*        | 98%
MGL   | 17%              | 10%            | 83%
RHB   | 24%              | 11%            | 76%
EPD   | 9%               | 0              | 91%
JPA   | 19%              | 0              | 81%
*ALL* | 8.3%             | 3% (2)*        | 92%

- Note: suspected error on one sample (labeled as MED but consistently classified as RHB)

Cross-validated Prediction Strength
[Plot: cross-validated prediction strength for MED samples – mostly strong predictions, a few poorly differentiated, one possibly misclassified]

Outline
- DNA and biology
- Microarray classification – best practices
- Gene set analysis
- Synthetic microarray data sets

Gene Sets – Key to Power
- The number of genes is a problem for single-gene analysis
- Analyzing gene sets increases statistical power
- Group genes statistically (prototypes, correlations)
- Group genes using biological knowledge (pathways, Gene Ontology, medical text, …)

Gene Set Analysis
- Gene set enrichment: Mootha et al., Nature Genetics, July 2003
- Question: genes involved in oxidative phosphorylation (~100) vs. diabetes?
- No genes in the OXPHOS group are individually significant (avg. ~20% decrease)

OXPHOS Gene Set Example
- The decrease is consistent for 89% of OXPHOS genes
[Plot: mean expression of all genes (gray) and of OXPHOS genes (red) for people with DM2 (type 2 diabetes mellitus) vs. normal]

Gene Sets from Gene Ontology (FatiGO)
- Cluster query – a subset of genes; reference – the remaining genes
- http://bioinfo.cnio.es/docus/courses/FatiGO/

Comprehensive Microarray Test Suite
- Getting high accuracy is very important
- Current data has too few samples; the "true" labels may be unknown
- What methods are best for given data, and what is their estimated accuracy?
- Proposal: use realistic simulated microarray data to evaluate how current algorithms perform under different conditions

Global Analysis of Microarray Data
- Analyzed the means of all genes across samples
- Found: log(means), for means > 1, follows an approximately normal distribution
[Plot: log10(means) (blue) vs. normal quantiles (red)]
- Other regularities: log(SD) ≈ a·log(means) + b (plot not shown)

True Rank vs. T-value Rank
- Simulated 100 microarray datasets with 6000 genes and 30 samples (using R)
- The first 10 genes were up-regulated by an UP factor

Future Research Directions
- Why is the gene distribution log-normal?
- Develop a synthetic microarray data generator
- Which gene selection methods are most robust, and under what conditions?
- Evaluate algorithm accuracy for a given test set by creating similar synthetic sets with known answers

Acknowledgements
- Eric Bremer, Children's Hospital (Chicago) & Northwestern U.
- Tom Khabaza, SPSS
- Sridhar Ramaswamy, MIT/Broad Institute
- Pablo Tamayo, MIT/Broad Institute

SIGKDD Explorations Special Issue on Microarray Data Mining
www.acm.org/sigkdd/explorations/

Thank you!
Further resources:
- Data mining: www.KDnuggets.com
- Microarrays: www.KDnuggets.com/websites/microarray.html
- Data mining course (one semester): www.KDnuggets.com/dmcourse
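As a backup-slide style sketch, the "synthetic microarray data generator" proposed under Future Research Directions can be assembled from the regularities noted on the Global Analysis slide: roughly log-normal gene means, log(SD) roughly linear in log(mean), and (as in the simulation slide) the first 10 genes up-regulated by an UP factor in one class. All numeric parameters below (the log-normal parameters, the slope A and intercept B, UP factor = 2) are illustrative placeholders, not fitted values from the talk.

```python
# Toy synthetic-microarray generator following the regularities above.
# A, B and the log-normal parameters are made-up placeholders.
import random
import math

random.seed(1)
N_GENES, N_PER_CLASS, N_UP, UP_FACTOR = 6000, 15, 10, 2.0
A, B = 0.8, -0.5   # hypothetical coefficients for log10(SD) ~ A*log10(mean) + B

# Baseline mean per gene from a log-normal distribution: log10(mean) ~ N(2, 1)
means = [10 ** random.gauss(2.0, 1.0) for _ in range(N_GENES)]
# Per-gene SD from the assumed log-linear relation to the mean
sds = [10 ** (A * math.log10(m) + B) for m in means]

def sample(cls):
    """One synthetic array: class 1 is baseline; class 2 up-regulates the first N_UP genes."""
    row = []
    for g in range(N_GENES):
        m = means[g] * (UP_FACTOR if cls == 2 and g < N_UP else 1.0)
        row.append(random.gauss(m, sds[g]))
    return row

dataset = [(1, sample(1)) for _ in range(N_PER_CLASS)] + \
          [(2, sample(2)) for _ in range(N_PER_CLASS)]
print(len(dataset), "samples x", len(dataset[0][1]), "genes")
```

A suite of such generators, with the UP factor, noise model, and sample counts varied, would let gene-selection methods be scored against the known list of truly up-regulated genes, as the slides propose.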