Capturing Best Practice for Microarray Gene Expression Data Analysis
Gregory Piatetsky-Shapiro, Tom Khabaza, Sridhar Ramaswamy
Presented briefly by Joey Mudd

What is Microarray Data?
•Microarray devices measure RNA expression levels across thousands of genes per sample
•The data can be used for a variety of medical purposes: diagnosis, predicting treatment outcome, etc.
•The data produced are typically large and complex, which makes data mining a useful task

Standardizing the Data Mining Process
•CRISP-DM: the Cross-Industry Standard Process model for Data Mining
•CRISP-DM standardizes the steps of a data mining process using a high-level structure and common terminology
•Useful for describing best practice

Microarray Data Analysis Issues
•The typical number of records is small (<100) because samples are difficult to collect
•The typical number of attributes (genes) is large (many thousands)
•This imbalance can lead to false positives (correlation due to chance) and over-fitting
•The paper suggests reducing the number of genes examined (feature reduction)

Data Cleaning and Preparation
•Thresholding: clip values to an appropriate range (the authors used min 100, max 16,000 for Affymetrix arrays)
•Normalization: required for clustering (the authors normalized each gene to mean 0, stddev 1)
•Filtering: remove genes that do not vary enough across samples, e.g. those with MaxValue(G)-MinValue(G)<500 or MaxValue(G)/MinValue(G)<5

Feature Selection
•Because of the large number of attributes and small number of samples, feature selection is important
•Use statistical measures to determine the "best genes" for each class
•To avoid under-representing some classes, apply the heuristic of selecting an equal number of genes from each class

Building Classification Models
•For this data, decision trees work poorly and neural nets work well
•Feature reduction alone is not sufficient
•Test models using a varying number of genes from each class
•Five-fold cross-validation is sufficient; leave-one-out cross-validation is considered the most accurate

Case Study 1
•Leukemia data, 2 classes (AML, ALL), 38 training samples and 34 separate test samples
•Filter to reduce the number of genes, then select the top 100 by T-value
•Build neural net models; 10 genes turned out to be the best subset size
•97% accuracy (33/34 test records correctly classified)

Case Study 2
•Brain data, 5 classes, 42 samples (no separate test set)
•Same preprocessing as Case Study 1
•Select the top genes by the signal-to-noise measure, taking an equal number of genes per class
•Build neural net models; 12 genes per class (60 total) gave the best results
•The lowest average error rate was 15%

Case Study 3
•Cluster analysis, with the goal of discovering natural classes
•Leukemia data with 3 classes: ALL split into ALL-T and ALL-B
•Same preprocessing as before, plus value normalization for clustering
•Two clustering methods in the Clementine package were used; both discovered the natural classes in the data, to the authors' satisfaction

Conclusions
•The ideas presented could apply to other domains with a similar balance of attributes and samples (e.g. cheminformatics or drug design)
•Future work could evaluate cost-sensitive classification, which minimizes errors according to the cost they incur
•A principled methodology can lead to good results
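The thresholding, filtering, and normalization steps from the Data Cleaning and Preparation slide can be sketched as below. The numeric cutoffs (100/16,000 thresholds, the 500 difference and 5-fold ratio filters, mean-0/stddev-1 normalization) are the authors' values; the function name, parameter names, and genes-by-samples matrix layout are my assumptions.

```python
import numpy as np

def preprocess(expression, low=100, high=16000, min_diff=500, min_ratio=5):
    """Threshold, filter, and normalize a genes x samples matrix.

    Sketch of the paper's preparation steps; names are illustrative.
    """
    # Thresholding: clip expression values into [low, high].
    x = np.clip(np.asarray(expression, dtype=float), low, high)

    # Filtering: keep only genes that vary enough across samples,
    # i.e. Max - Min >= 500 and Max / Min >= 5.
    gmax, gmin = x.max(axis=1), x.min(axis=1)
    keep = (gmax - gmin >= min_diff) & (gmax / gmin >= min_ratio)
    x = x[keep]

    # Normalization (needed for clustering): each gene to mean 0, stddev 1.
    x = (x - x.mean(axis=1, keepdims=True)) / x.std(axis=1, keepdims=True)
    return x, keep
```

For example, a gene ranging from 120 to 130 across all samples would be dropped by the variation filter, while a gene swinging between the clipped extremes would be kept and standardized.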
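The equal-genes-per-class heuristic from the Feature Selection slide, combined with the signal-to-noise measure used in Case Study 2, might look like this sketch. The one-vs-rest formulation S2N = (mean_in - mean_out) / (sd_in + sd_out), the epsilon guard against zero deviations, and the function name are my assumptions.

```python
import numpy as np

def select_genes_per_class(x, labels, genes_per_class=12):
    """Pick an equal number of top-ranked genes for each class.

    x: genes x samples matrix; labels: 1-D array of class labels.
    Ranks genes per class by a one-vs-rest signal-to-noise score.
    """
    selected = set()
    for c in sorted(set(labels.tolist())):
        in_c = labels == c
        mu_in, mu_out = x[:, in_c].mean(axis=1), x[:, ~in_c].mean(axis=1)
        sd_in, sd_out = x[:, in_c].std(axis=1), x[:, ~in_c].std(axis=1)
        # Signal-to-noise: large when a gene is high in class c, low elsewhere.
        s2n = (mu_in - mu_out) / (sd_in + sd_out + 1e-12)
        top = np.argsort(s2n)[::-1][:genes_per_class]
        selected.update(top.tolist())
    return sorted(selected)
```

With 5 classes and `genes_per_class=12` this yields at most 60 genes, matching the best configuration reported in Case Study 2.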
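The leave-one-out cross-validation mentioned under Building Classification Models can be sketched as a generic loop. The paper built neural-net models; a nearest-centroid classifier stands in here purely to keep the sketch dependency-free, and the function name is mine.

```python
import numpy as np

def loo_accuracy(x, y):
    """Leave-one-out cross-validation over a samples x genes matrix x.

    Holds out each sample in turn, fits a nearest-centroid classifier
    (a stand-in for the paper's neural nets) on the rest, and returns
    the fraction of held-out samples classified correctly.
    """
    n = x.shape[0]
    correct = 0
    for i in range(n):
        mask = np.arange(n) != i          # train on everything but sample i
        classes = sorted(set(y[mask].tolist()))
        centroids = {c: x[mask][y[mask] == c].mean(axis=0) for c in classes}
        # Predict the class whose centroid is nearest to the held-out sample.
        pred = min(classes, key=lambda c: np.linalg.norm(x[i] - centroids[c]))
        correct += int(pred == y[i])
    return correct / n
```

With only tens of samples, as in these case studies, the loop is cheap to run, which is why leave-one-out is feasible and considered the most accurate estimate here.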