Data Mining in Microarray Analysis

Classification (Supervised Learning)
• Finding models (functions) that describe and distinguish classes or concepts for future prediction
• E.g., predict disease based on gene expression profiles
• Similar to prediction, but predicts an unknown or missing categorical value rather than a numerical value
• Presentation: decision tree, classification rule, neural network

Cluster Analysis (Unsupervised Learning)
• Class label is unknown: group data to form new classes, e.g., cluster genes to find distribution patterns
• Clustering is based on the principle of maximizing the intra-class similarity and minimizing the inter-class similarity
• E.g., group genes based on their gene expression profiles

Supervised vs. Unsupervised Learning

Supervised (Classification)              Unsupervised (Clustering)
• known number of classes                • unknown number of classes
• based on a training set                • no prior knowledge
• used to classify future observations   • used to understand (explore) data

[Figure: two scatter plots of debt against income, titled "Supervised Learning" and "Unsupervised Learning"; points are drawn with the markers *, o and +.]

Classification

[Diagram: Training Set (data with known classes) -> Classification Technique -> Classifier; Data with unknown classes -> Classifier -> Class Assignment]

Types of Classifiers

[Figure: scatter plots of debt against income illustrating a linear classifier and a non-linear classifier.]

Linear classifier: a*income + b*debt < t => No loan!
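The linear decision rule above can be written out as a short Python sketch. The function name `loan_decision`, the weights a = 1.0, b = -2.0, the threshold t = 0.0, and the label "Loan" for the other side of the boundary are all assumptions made for illustration; the slides only state the "No loan" side of the rule.

```python
def loan_decision(income, debt, a, b, t):
    """Apply the linear rule a*income + b*debt < t => "No loan".

    The label returned on the other side of the boundary ("Loan") is an
    assumption; the slides only name the "No loan" outcome.
    """
    score = a * income + b * debt
    return "No loan" if score < t else "Loan"


# Hypothetical weights and threshold, chosen only to make the example concrete.
A, B, T = 1.0, -2.0, 0.0

# 1.0*50 - 2.0*10 = 30, which is not below the threshold
high_income_low_debt = loan_decision(income=50, debt=10, a=A, b=B, t=T)

# 1.0*20 - 2.0*30 = -40, which is below the threshold
low_income_high_debt = loan_decision(income=20, debt=30, a=A, b=B, t=T)
```

In practice the weights and threshold would be learned from the training set rather than set by hand; a non-linear classifier replaces the straight boundary with a curved one.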
Predictive Modelling

Day  Outlook   Temperature  Humidity  Wind    Play Tennis
1    Sunny     Hot          High      Weak    No
2    Sunny     Hot          High      Strong  No
3    Overcast  Hot          High      Weak    Yes
4    Rain      Mild         High      Weak    Yes
5    Rain      Cool         Normal    Weak    Yes
6    Rain      Cool         Normal    Strong  No
7    Overcast  Cool         Normal    Strong  Yes
8    Sunny     Mild         High      Weak    No
9    Sunny     Cool         Normal    Weak    Yes
10   Rain      Mild         Normal    Weak    Yes
11   Sunny     Mild         Normal    Strong  Yes
12   Overcast  Mild         High      Strong  Yes
13   Overcast  Hot          Normal    Weak    Yes
14   Rain      Mild         High      Strong  No

• Predict categorical class labels
• Classify data (construct a model) based on the training set and the values (class labels) in a classifying attribute, then use the model to classify new data

Classification
• Task: determine which of a fixed set of classes an example belongs to
• Input: training set of examples annotated with class values
• Output: induced hypotheses (model / concept description / classifiers)

Learning: induce classifiers from training data
  Training Data -> Inductive Learning System -> Classifiers (Derived Hypotheses)

Prediction: use the hypothesis to classify any example described in the same manner
  Data to be classified -> Classifier -> Decision on class assignment

Decision Tree: Example

Training data: the Play Tennis table above. The induced tree:

Outlook
  Sunny: Humidity
    High: No
    Normal: Yes
  Overcast: Yes
  Rain: Wind
    Strong: No
    Weak: Yes

Classification: Relevant Gene Identification
• Goal: identify a subset of genes that distinguish between treatments, tissues, etc.
• Method:
  • Collect several samples grouped by treatment (e.g. diseased vs. healthy)
  • Use genes as "features"
  • Build a classifier to distinguish treatments

Gene Expression Example

ID   G1     G2     G3    G4
1    11.12  1.34   1.97  11.0
2    12.34  2.01   1.22  11.1
3    13.11  1.34   1.34  2.0
4    13.34  11.11  1.38  2.23
5    14.11  13.10  1.06  2.44
6    11.34  14.21  1.07  1.23
7    21.01  12.32  1.97  1.34
8    66.11  33.3   1.97  1.34
9    33.11  44.1   1.96  11.23
10   11.54  11.1   1.97  10.01
11   12.00  15.1   1.98  9.01
12   15.23  1.11   1.89  12.48
13   31.22  2.0    1.99  13.51
14   11.33  11.1   1.01  11.01
15   ...    ...    ...   ...

[A fifth column, Cancer, gives a Yes/No class label for each sample. Figure: decision tree over the gene expression values with root node G1 (branches <=22, >22), inner nodes G3 (branches <=52, >52) and G4 (branches <=12, >12), and Yes/No leaves.]

Problem: with a large number of genes (~10000), feature selection/reduction techniques are needed.
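One simple way to reduce the number of genes before building a classifier is to score each gene on how well it separates the two classes and keep only the top k. The sketch below uses a signal-to-noise style score (difference of class means divided by the sum of class standard deviations). This particular score, the function names, and the toy expression values are assumptions for illustration; the slides do not name a specific feature selection technique, and the toy values are not taken from the table above.

```python
from statistics import mean, pstdev


def signal_to_noise(values, labels):
    """Score one gene: |mean(pos) - mean(neg)| / (std(pos) + std(neg)).

    Assumes both classes contain at least one sample.
    """
    pos = [v for v, y in zip(values, labels) if y]
    neg = [v for v, y in zip(values, labels) if not y]
    diff = abs(mean(pos) - mean(neg))
    spread = pstdev(pos) + pstdev(neg)
    if spread == 0:
        # No within-class variation: perfect separation if the means differ.
        return float("inf") if diff > 0 else 0.0
    return diff / spread


def select_top_genes(expression, labels, k):
    """Return the names of the k genes with the highest score."""
    scores = {gene: signal_to_noise(vals, labels)
              for gene, vals in expression.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Hypothetical toy data: three genes measured on six samples.
toy_expression = {
    "gene_a": [1.0, 1.2, 0.9, 5.0, 5.3, 4.8],  # clearly differs by class
    "gene_b": [2.0, 2.1, 1.9, 2.0, 2.2, 1.8],  # about the same in both classes
    "gene_c": [3.0, 7.0, 5.0, 4.0, 6.0, 5.0],  # noisy, same class means
}
toy_labels = [False, False, False, True, True, True]  # True = diseased

top_genes = select_top_genes(toy_expression, toy_labels, k=1)
```

On the toy data only gene_a separates the classes, so it is the one retained. With ~10000 genes the same ranking would be computed once per gene, and the classifier would then be trained on the selected subset only.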