Mine Microarray Gene Expression Data, Predict Cancers

Gene Expression Data Analysis
Zhang Louxin
Dept. of Mathematics, National University of Singapore

Molecular Class Prediction
Several supervised learning methods are available:
• Neural networks
• Support vector machines
• Decision trees
• Other statistical methods

A Supervised Learning Method for Predicting a Binary Class
[Figure: positive and negative examples feed a learning step; the prediction step then answers Yes or No for a new item.]
A class is just a concept! In the learning step, the class is modelled as a mathematical object (a function of multiple variables, or a subspace in a high-dimensional space) that represents knowledge of the class.

Learning the class of tall men
The class is modelled as the half-space h >= 6'3".
Examples:
Name         Height   Class
Jordan       6'6      tall
Brown        5'9      short
Ewing        6'11     tall
Yao          7'6      tall
Iverson      6'0      short
O'Neal       7'1      tall
Stoudamire   5'11     short
Boykins      5'5      short

Support Vector Machines
A support vector machine finds a hyperplane that maximally separates data points into two classes in the feature space.
[Figure: the input space is embedded into the feature space using a kernel function.]

Molecular Class Prediction: Leukemia Case
Morphology does not distinguish leukemias very well. Golub et al. (Science, 1999) proposed a voting method for predicting acute lymphoblastic leukemia (ALL) and acute myeloid leukemia (AML) using gene expression fingerprinting. In that work, an Affymetrix DNA chip with 6817 genes was used for 72 ALL/AML samples.
[Figure courtesy of Golub]

The voting algorithm (Golub '99)
1. Select a subset of 2 × 25 genes that correlate highly with the ALL/AML distinction, based on 38 training samples. Correlation metric:
$$P(g) = \frac{\mu_1 - \mu_2}{d_1 + d_2}$$
where $\mu_1$ ($\mu_2$) is the mean expression level of g in AML (ALL) samples, and $d_1$ ($d_2$) is the within-class standard deviation of the expression of g in AML (ALL) samples.
2. Each selected gene casts a weighted vote for a new sample; the total of the weighted votes decides the winning class.

The voting method: separating samples by hyperplanes
Mathematically, the total of all the votes on a new sample X is
$$V = \sum_{i=1}^{50} v_i = \sum_{i=1}^{50} P(g_i)\,(x_i - b_i)$$
where $x_i$ is the expression level of $g_i$ in the new sample X and, in Golub et al., $b_i = (\mu_1 + \mu_2)/2$ is the midpoint of the two class means of $g_i$. If V > 0, X is classified as AML; otherwise, X is ALL. Since V is linear in the $x_i$, the decision boundary V = 0 is a hyperplane.

Decision Tree Learning
• Information-reduction learning method.
• Represents a class or concept as a logic sentence:
  IF (Outlook == Sunny) & (Humidity == Normal) THEN PlayTennis
• When to use decision trees:
  • Instances are describable by attribute-value pairs
  • The target function is discrete valued
  • The training data may be noisy
Examples: medical diagnosis, credit risk analysis

Textbook Example: PlayTennis
• Each internal node tests an attribute
• Each branch corresponds to an attribute value
• Each leaf assigns a classification
Outlook
  Sunny -> Humidity
    High -> No
    Normal -> Yes
  Overcast -> Yes
  Rain -> Wind
    Strong -> No
    Weak -> Yes

Remarks
• A decision tree is constructed by top-down induction.
• Short trees are preferred, as are trees with high-information-gain attributes near the root.
• Information is measured with entropy.

ALL vs AML: Decision Tree this time (Y. Sun, tech report, MIT)
• Single gene (zyxin), single-branch tree
  38/38 correct on training cases
  31/34 correct on test cases (3 errors)
  X95735_at <= 938: ALL
• Tree size up to 3 genes
  1 decision tree with 1 error
  7 decision trees with 2 errors
  7 decision trees with 3 errors

Gene Selection
Gene selection is critical in molecular class prediction, as the decision tree results show. Why?
• In a cellular process, only a relatively small set of genes is active.
• Mathematically, each gene is just a feature. The more weak features there are, the noisier the data, and more features also raise the risk of overfitting.
Research Problem: How to select genes?

Two Approaches
1. Gene selection is done first, and the selected genes are then used for learning, as in Golub et al.'s paper.
2. Gene selection and learning are done together, as in decision tree learning.
Does this make a difference in learning?
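The first approach (select genes, then learn) can be illustrated with a small sketch of the voting method. It is a minimal illustration under stated assumptions, not the implementation of Golub et al.: the function names, the toy expression values, and the use of the population standard deviation are all assumptions, and b_i is taken as the midpoint of the two class means, as in Golub et al.

```python
from statistics import mean, pstdev

def correlation_metric(aml_values, all_values):
    """P(g) = (mu1 - mu2) / (d1 + d2), with class 1 = AML and class 2 = ALL."""
    mu1, mu2 = mean(aml_values), mean(all_values)
    # Assumption: population standard deviation within each class.
    d1, d2 = pstdev(aml_values), pstdev(all_values)
    if d1 + d2 == 0:
        return 0.0  # a gene that is constant in both classes carries no vote
    return (mu1 - mu2) / (d1 + d2)

def train_voter(aml_samples, all_samples, genes_per_class=25):
    """Step 1: pick the genes with the most negative and most positive P(g).

    Each sample is a list of expression levels, one entry per gene.
    Returns (gene_index, P(g), b) triples, where b is the class-mean midpoint.
    """
    n_genes = len(aml_samples[0])
    k = min(genes_per_class, n_genes // 2)
    if k == 0:
        raise ValueError("need at least two genes")
    scored = []
    for g in range(n_genes):
        aml_vals = [s[g] for s in aml_samples]
        all_vals = [s[g] for s in all_samples]
        p = correlation_metric(aml_vals, all_vals)
        b = (mean(aml_vals) + mean(all_vals)) / 2
        scored.append((g, p, b))
    scored.sort(key=lambda t: t[1])
    return scored[:k] + scored[-k:]  # ALL-correlated genes + AML-correlated genes

def vote(voters, x):
    """Step 2: V = sum of P(g_i) * (x_i - b_i); V > 0 means AML, otherwise ALL."""
    v = sum(p * (x[g] - b) for g, p, b in voters)
    return ("AML" if v > 0 else "ALL"), v

# Hypothetical toy data: 3 AML and 3 ALL training samples over 4 genes.
aml_train = [[9.0, 1.0, 5.0, 2.0], [8.0, 2.0, 5.5, 1.0], [10.0, 1.5, 4.5, 3.0]]
all_train = [[2.0, 8.0, 5.0, 2.5], [1.0, 9.0, 4.0, 1.5], [3.0, 7.0, 6.0, 2.0]]

voters = train_voter(aml_train, all_train, genes_per_class=1)
label_a, v_a = vote(voters, [8.5, 2.0, 5.0, 2.0])  # classified as AML
label_b, v_b = vote(voters, [1.5, 8.5, 5.0, 2.0])  # classified as ALL
```

With `genes_per_class=25` and real training data, the same two functions would reproduce the 2 × 25 gene selection and the 50-term vote described in the slides.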
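The decision-tree remarks (information is measured with entropy, and high-information-gain attributes go near the root) can be made concrete with a short sketch. The slides do not include the training table, so the 14 rows below are an assumption: they are the standard PlayTennis examples from Mitchell's Machine Learning textbook, restricted to the three attributes that appear in the tree. The function names are illustrative.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(rows, attribute, target="PlayTennis"):
    """Entropy of the target minus the weighted entropy after splitting on attribute."""
    base = entropy([r[target] for r in rows])
    remainder = 0.0
    for value in {r[attribute] for r in rows}:
        subset = [r[target] for r in rows if r[attribute] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder

# Assumed data: (Outlook, Humidity, Wind, PlayTennis) from the textbook example.
raw = [
    ("Sunny", "High", "Weak", "No"), ("Sunny", "High", "Strong", "No"),
    ("Overcast", "High", "Weak", "Yes"), ("Rain", "High", "Weak", "Yes"),
    ("Rain", "Normal", "Weak", "Yes"), ("Rain", "Normal", "Strong", "No"),
    ("Overcast", "Normal", "Strong", "Yes"), ("Sunny", "High", "Weak", "No"),
    ("Sunny", "Normal", "Weak", "Yes"), ("Rain", "Normal", "Weak", "Yes"),
    ("Sunny", "Normal", "Strong", "Yes"), ("Overcast", "High", "Strong", "Yes"),
    ("Overcast", "Normal", "Weak", "Yes"), ("Rain", "High", "Strong", "No"),
]
keys = ("Outlook", "Humidity", "Wind", "PlayTennis")
rows = [dict(zip(keys, r)) for r in raw]

gains = {a: information_gain(rows, a) for a in ("Outlook", "Humidity", "Wind")}
root = max(gains, key=gains.get)  # attribute with the highest information gain
```

On this data Outlook has the largest gain, which is consistent with Outlook sitting at the root of the PlayTennis tree. Top-down induction would then repeat the same computation on each branch's subset of rows.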
[Figure: Discovery workflow. Boxes: Sample Data Sets, Correlation Test, Voter Selection, Predictors, New Data Sets, Classification Result. Correlation metric options: Pearson correlation coefficient, Euclidean distance, Bayesian coefficient. Predictor options: Support Vector Machines, Voting Method, Bayesian Method. Validation options: cross validation, no cross validation.]
