Knowledge Discovery in Microarray Gene Expression Data
Gregory Piatetsky-Shapiro, [email protected]
IMA 2002 Workshop on Data-driven Control and Optimization
Copyright © 2002 KDnuggets

Data Mining Methodology is Critical!
• CRISP-DM methodology
• Data Mining is a Continuous Process!
• Following Correct Methodology is Critical!

Overview
• Molecular Biology Overview
• Microarrays for Gene Expression
• Classification on Microarray Data
  - avoiding false positives
  - wrapper approach
• Microarrays for Modeling Dynamic Processes
  - finding causal networks and clusters

Biology and Cells
• All living organisms consist of cells.
• Humans have trillions of cells; yeast has one.
• Cells are of many different types (blood, skin, nerve), but all arose from a single cell (the fertilized egg).
• Each* cell contains a complete copy of the genome (the program for making the organism), encoded in DNA.

DNA
• DNA molecules are long double-stranded chains; 4 types of bases are attached to the backbone: adenine (A), guanine (G), cytosine (C), and thymine (T). A pairs with T, C with G.
• A gene is a segment of DNA that specifies how to make a protein.
• Human DNA has about 30,000-35,000 genes; rice has about 50,000-60,000, but shorter genes.

Exons and Introns: Data and Logic?
• Exons are coding DNA (translated into a protein); they make up only about 2% of the human genome.
• Introns are non-coding DNA, which provide structural integrity and regulatory (control) functions.
• Exons can be thought of as program data, while introns provide the program logic.
• Humans have much more control structure than rice.

Gene Expression
• Cells are different because of differential gene expression.
• About 40% of human genes are expressed at one time.
• A gene is expressed by transcribing DNA into single-stranded mRNA.
• mRNA is later translated into a protein.
• Microarrays measure the level of mRNA expression.

Molecular Biology Overview
[Figure: cell, nucleus, chromosome, gene (DNA), gene (mRNA, single strand), and protein. Graphics courtesy of the National Human Genome Research Institute.]

Gene Expression Measurement
• mRNA expression represents dynamic aspects of the cell.
• mRNA expression can be measured with the latest technology.
• mRNA is isolated and labeled with a fluorescent protein.
• mRNA is hybridized to the target; the level of hybridization corresponds to light emission, which is measured with a laser.

Gene Expression Microarrays
The main types of gene expression microarrays:
• Short oligonucleotide arrays (Affymetrix)
• cDNA or spotted arrays (Brown/Botstein)
• Long oligonucleotide arrays (Agilent Inkjet)
• Fiber-optic arrays
• ...

Affymetrix Microarrays
[Figure: raw image of a 1.28 cm array with 50 µm features.]
• ~10^7 oligonucleotides: half perfectly match the mRNA (PM), half have one mismatch (MM).
• Raw gene expression is the intensity difference: PM - MM.

Microarray Potential Applications
• Biological discovery
  - new and better molecular diagnostics
  - new molecular targets for therapy
  - finding and refining biological pathways
• Recent examples
  - molecular diagnosis of leukemia, breast cancer, ...
  - appropriate treatment for genetic signature
  - potential new drug targets

Microarray Data Analysis Types
• Gene Selection
  - find genes for therapeutic targets
  - avoid false positives (FDA approval?)
• Classification (Supervised)
  - identify disease
  - predict outcome / select best treatment
• Clustering (Unsupervised)
  - find new biological classes / refine existing ones
  - exploration

Microarray Data Mining Challenges
• Too few records (samples), usually < 100
• Too many columns (genes), usually > 1,000
• Too many columns are likely to lead to false positives
• For exploration, a large set of all relevant genes is desired
• For diagnostics or identification of therapeutic targets, the smallest set of genes is needed
• The model needs to be explainable to biologists

Data Preparation Issues (MAS-4)
• Thresholding: usually min 20, max 16,000
  - for older Affy chips (new Affy chips do not have negative values)
• Filtering: remove genes with insufficient variation
  - e.g. MaxVal - MinVal < 500 and MaxVal/MinVal < 5
  - biological reasons
  - feature reduction for algorithmic reasons
• For clustering, normalize each gene (sample) separately to Mean = 0, Std. Dev = 1

Classification
• Desired features:
  - robust in the presence of false positives
  - understandable
  - return confidence/probability
  - fast enough
• Simplest approaches are most robust
• Advanced approaches can be more accurate

FALSE POSITIVES PROBLEM
• Not enough records (samples), usually < 100
• Too many columns (genes), usually >> 1,000
• FALSE POSITIVES are very likely because of few records and many columns

Controlling False Positives
CD37 antigen   Class
178            1
105            1
4174           2
7133           2

Class   Avg      Std
1       2287.9   1452.4
2       4457.5   2010.3

Mean difference between classes: T-value = -3.25
Significance: p = 0.0007

Controlling False Positives with Randomization
CD37 antigen   Class   Randomized Class
178            1       2
105            1       1
4174           2       1
7133           2       2

• T-value with randomized classes = -1.1
• Randomization is less conservative
• Preserves inner structure of data

Controlling False Positives with Randomization, II
• Randomize the class labels 500 times (gene values 178, 105, 4174, 7133; Class 1, 1, 2, 2; Rand Class 2, 1, 1, 2)
• Bottom 1% T-value = -2.08
• Select potentially interesting genes at 1%

Controlling False Positives: SAM (Significance Analysis of Microarrays)
• Tusher, Tibshirani, and Chu, Significance analysis of microarrays …, PNAS, Apr 2001
• SAM software available from Tibshirani web site

Feature selection approach
• Rank genes by measure; select top 200-500
• T-test for mean difference: $T = \frac{Avg_1 - Avg_2}{\sqrt{\sigma_1^2/N_1 + \sigma_2^2/N_2}}$
• Signal to Noise (S2N): $S2N = \frac{Avg_1 - Avg_2}{\sigma_1 + \sigma_2}$
• Other: information-based, biological?
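
The two ranking measures from the Feature selection slide, and the label randomization from the Controlling False Positives slides, can be sketched in Python roughly as follows. This is a minimal illustration under stated assumptions, not the author's code: the function names, the toy expression values, and the use of the sample standard deviation are all hypothetical choices. Only the formulas, the 500 randomization rounds, and the 1% cutoff come from the slides.

```python
import math
import random
import statistics


def t_value(a, b):
    """T-value for the mean difference: (Avg1 - Avg2) / sqrt(s1^2/N1 + s2^2/N2)."""
    spread = statistics.variance(a) / len(a) + statistics.variance(b) / len(b)
    return (statistics.mean(a) - statistics.mean(b)) / math.sqrt(spread)


def s2n(a, b):
    """Signal to Noise: (Avg1 - Avg2) / (s1 + s2)."""
    return (statistics.mean(a) - statistics.mean(b)) / (
        statistics.stdev(a) + statistics.stdev(b)
    )


def split_by_class(values, labels):
    """Return the expression values of class 1 and class 2 as two lists."""
    class1 = [v for v, c in zip(values, labels) if c == 1]
    class2 = [v for v, c in zip(values, labels) if c == 2]
    return class1, class2


def randomized_cutoff(values, labels, rounds=500, fraction=0.01, seed=0):
    """Shuffle the class labels `rounds` times and return the T-value that
    marks the bottom `fraction` of the randomized T-values."""
    rng = random.Random(seed)
    shuffled = list(labels)
    null_t = []
    for _ in range(rounds):
        rng.shuffle(shuffled)
        null_t.append(t_value(*split_by_class(values, shuffled)))
    null_t.sort()
    return null_t[int(fraction * rounds)]


# Hypothetical expression values for one gene across 12 samples (assumption).
values = [178, 105, 310, 250, 190, 220, 4174, 7133, 3900, 5200, 6100, 4800]
labels = [1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2]

observed = t_value(*split_by_class(values, labels))
cutoff = randomized_cutoff(values, labels)
# A gene is flagged when its T-value is more extreme than the bottom 1%
# of the T-values obtained with randomized class labels.
interesting = observed < cutoff
print(f"T = {observed:.2f}, S2N = {s2n(*split_by_class(values, labels)):.2f}, "
      f"cutoff = {cutoff:.2f}, interesting = {interesting}")
```

Shuffling the labels keeps the expression values untouched, which is what "preserves inner structure of data" refers to. In practice the same cutoff would be computed across all genes before picking the top 200-500.
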
• Almost any method works well with good feature selection

Gene Reduction improves Classification
• Most learning algorithms look for non-linear combinations of features and can easily find many spurious combinations given a small number of records and a large number of genes
• Classification accuracy improves if we first reduce the number of genes by a linear method, e.g. T-values of mean difference
• Heuristic: select an equal number of genes from each class
• Then apply a favorite machine learning algorithm

Wrapper approach to select the best gene set
• Select the best 200 or so genes based on statistical measures
• Test models using 1, 2, 3, …, 10, 20, 30, 40, ... genes with cross-validation
• Select the gene set with the lowest average error
• Heuristically, at least 10 genes overall
[Figure: average error for 10-fold cross-validation (0% to 30%) versus genes per class (1, 2, 3, 4, 5, 10, 20, 30, 40).]

Popular Classification Methods
• Decision Trees/Rules
  - find smallest gene sets, but not robust to false positives
• Neural Nets: work well for a reduced number of genes
• K-nearest neighbor: robust for a small number of genes
• TreeNet from authors of CART and MARS
  - networks of simple trees; very robust against outliers
• Support Vector Machines (SVM)
  - good accuracy, does its own gene selection, but hard to understand
• ...

Microarrays: An Example
• Leukemia: Acute Lymphoblastic (ALL) vs Acute Myeloid (AML), Golub et al, Science, v.286, 1999
• 72 examples (38 train, 34 test), about 7,000 genes
• Well-studied (CAMDA-2000), good test example
[Figure: ALL and AML samples. Visually similar, but genetically very different.]

Results on the test data
• Genes selected and model trained on Train set ONLY!
• Best Clementine neural net model used 10 genes per class
• Evaluation on test data (34 samples) gives 1 or 2 errors (94-97% accuracy)
• Note: all methods give an error on sample 66, believed to be misclassified by a pathologist

Multi-class Data Analysis
• Brain data, Pomeroy et al 2002, Nature (415), Jan 2002
• 42 examples, about 7,000 genes, 5 classes
[Figure: photomicrographs of tumours (400x): a, MD (medulloblastoma) classic; b, MD desmoplastic; c, PNET; d, rhabdoid; e, glioblastoma. Analysis also used normal tissue, not shown.]

Modeling with TreeNet
• Build a model using the top 3 genes from each class
• Evaluate using cross-validation
• Results: 95% accuracy
  - 1 error on training data, 1 on test
[Figure: risk (0.0 to 0.5) versus number of trees (0 to 30).]

TreeNet results for multi-class data
Class    Learn Cases (Errors)   Test Cases (Errors)
MD       7 (0)                  3 (0)
MGlio    8 (0)                  2 (0)
Normal   3 (0)                  1 (0)
PNET     6 (1)                  2 (0)
Rhab     8 (0)                  2 (1)

• Average cross-validation accuracy over 95%
• Original authors had accuracy of about 85% using a nearest neighbor classifier.

Yeast SOM Clusters
Yeast Cell Cycle SOM. www.pnas.org/cgi/content/full/96/6/2907
(a) 6 × 5 SOM. The 828 genes that passed the variation filter were grouped into 30 clusters. Each cluster is represented by the centroid (average pattern) for genes in the cluster. Expression level of each gene was normalized to have mean = 0 and SD = 1 across time points. Expression levels are shown on the y-axis and time points on the x-axis. Error bars indicate the SD of average expression. n indicates the number of genes within each cluster. Note that multiple clusters exhibit periodic behavior and that adjacent clusters have similar behavior.
(b) Cluster 29 detail. Cluster 29 contains 76 genes exhibiting periodic behavior with peak expression in late G1.
The normalized expression pattern of the 30 genes nearest the centroid is shown.
(c) Centroids for SOM-derived clusters 29, 14, 1, and 5, corresponding to the G1, S, G2 and M phases of the cell cycle, are shown.

Yeast SOM Clusters
[Figure: yeast cell cycle SOM clusters.]

Discovery of causal processes
• A long-term goal of Systems Biology is to discover the causal processes among genes, proteins, and other molecules in cells
• Can this be done (in part) by using data from high-throughput experiments, such as microarrays?

A Model of Galactose Utilization (manually discovered)
• T. Ideker, et al., Science 292 (May 4, 2001) 929-934.

Bayesian Causal Network Structure
• P(GAL4)
• P(GAL2 | GAL4)
• P(Intracellular Galactose | GAL2)
• Each variable is independent of its distant causes given all of its direct causes.
• Thanks to Greg Cooper, U. Pitt

Bayesian Network Learned for Yeast
• Hartemink et al, Combining Location and Expression Data for Principled Discovery of Genetic Regulatory Network Models, PSB 2002, psb.stanford.edu/psb-online

Future directions for Microarray Analysis
• Algorithms optimized for small samples
• Integration with other data
  - biological networks
  - medical text
  - protein data
• Cost-sensitive classification algorithms
  - error cost depends on outcome (don't want to miss treatable cancer), treatment side effects, etc.
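
To make the cost-sensitive idea concrete, here is a small Python sketch of choosing the prediction with the lowest expected cost instead of the most probable class. The class names, probabilities, and cost values are hypothetical and chosen only for illustration; the slides do not specify any particular cost model.

```python
def min_cost_class(probs, cost):
    """Pick the prediction with the lowest expected cost.

    probs: {true_class: probability of that class for this sample}
    cost:  {(true_class, predicted_class): cost of that outcome}
    """
    classes = list(probs)
    expected = {
        pred: sum(probs[true] * cost[(true, pred)] for true in classes)
        for pred in classes
    }
    return min(expected, key=expected.get), expected


# Hypothetical classifier output for one sample (assumption).
probs = {"cancer": 0.2, "healthy": 0.8}

# Hypothetical costs: missing a treatable cancer is assumed to be ten times
# worse than an unnecessary follow-up; correct calls cost nothing.
cost = {
    ("cancer", "cancer"): 0,
    ("cancer", "healthy"): 10,
    ("healthy", "cancer"): 1,
    ("healthy", "healthy"): 0,
}

decision, expected = min_cost_class(probs, cost)
print(decision, expected)
```

With these assumed costs the less probable class is chosen, because the expected cost of missing a cancer outweighs the cost of a false alarm.
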

Integrate biological knowledge when analyzing microarray data
• (from Cheng Li, Harvard SPH)
• Right picture: Gene Ontology: tool for the unification of biology, Nature Genetics, 25, p25

GeneSpring Demo
• Yeast data
• Zoom all the way to bases
• Yeast Cycle: animation
• Color: expression strength

Acknowledgements
• Sridhar Ramaswamy, MIT Whitehead Institute
• Pablo Tamayo, MIT Whitehead Institute
• Greg Cooper, U. Pittsburgh
• Tom Khabaza, SPSS

Thank you!
• Further resources on Data Mining: www.KDnuggets.com
• Contact: Gregory Piatetsky-Shapiro: [email protected]