Data Mining in Microarray Analysis

Classification (Supervised Learning)
• Finding models (functions) that describe and distinguish classes or concepts for future prediction
• E.g., predict disease based on gene expression profiles
• Similar to prediction, but predicts an unknown or missing categorical value rather than a numerical value
• Presentation: decision tree, classification rule, neural network
Cluster analysis (Unsupervised Learning)
• Class label is unknown: group data to form new classes, e.g., cluster genes to find distribution patterns
• Clustering is based on the principle of maximizing intra-class similarity and minimizing inter-class similarity
• E.g., group genes based on their gene expression profiles
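As a rough illustration of grouping genes by expression profile, here is a minimal sketch using k-means. The slides do not name a clustering algorithm, so k-means is an assumption, as are the function name `kmeans`, the choice of seeding centroids with the first k profiles, and the expression values, which are hypothetical.

```python
from math import dist

def kmeans(points, k, iterations=20):
    """Plain k-means. The first k points seed the centroids
    (an arbitrary but deterministic choice made for this sketch)."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each profile joins its nearest centroid.
        labels = [min(range(k), key=lambda c: dist(p, centroids[c]))
                  for p in points]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members)
                                for col in zip(*members)]
    return labels, centroids

# Hypothetical profiles: one row per gene, one column per sample.
profiles = [
    (1.0, 1.2, 0.9),     # gene A, low expression
    (9.8, 10.1, 10.4),   # gene B, high expression
    (1.1, 0.8, 1.0),     # gene C, low expression
    (10.2, 9.9, 10.0),   # gene D, high expression
]
labels, centroids = kmeans(profiles, k=2)
print(labels)  # genes A and C share one cluster, B and D the other
```

Genes with similar profiles end up with the same cluster label, which is the "high intra-class similarity" idea; no class labels are used at any point.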
Supervised vs. Unsupervised Learning
Supervised (Classification)
• known number of classes
• based on a training set
• used to classify future observations
Unsupervised (Clustering)
• unknown number of classes
• no prior knowledge
• used to understand (explore) data
Supervised vs. Unsupervised Learning
[Figure: scatter plots of debt against income with points marked *, o and +. Panels: Supervised Learning, Unsupervised Learning]
Classification
[Figure: Training Set (data with known classes) -> Classification Technique -> Classifier; data with unknown classes -> Classifier -> Class Assignment]
Types of Classifiers
[Figure: scatter plots of debt against income with points marked * and o. Panels: Linear Classifier, Non-Linear Classifier]
Linear classifier rule: a*income + b*debt < t => No loan!
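The linear rule a*income + b*debt < t => No loan can be written directly as a function. This is only a sketch: the weights `a`, `b` and threshold `t` below are hypothetical values chosen for illustration, and returning "Loan" when the inequality does not hold is an assumption, since the slide only states the "No loan" side.

```python
def loan_decision(income, debt, a, b, t):
    """Apply the linear rule: a*income + b*debt < t means no loan."""
    score = a * income + b * debt
    # Assumption: anything on the other side of the line is approved.
    return "No loan" if score < t else "Loan"

# Hypothetical parameters: income counts for, debt counts against.
A, B, T = 1.0, -2.0, 10.0

print(loan_decision(income=50, debt=10, a=A, b=B, t=T))  # score = 30
print(loan_decision(income=20, debt=15, a=A, b=B, t=T))  # score = -10
```

The decision boundary is the straight line a*income + b*debt = t in the income/debt plane; a non-linear classifier would replace that line with a curve.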
Predictive Modelling:
Day  Outlook   Temperature  Humidity  Wind    Play Tennis
1    Sunny     Hot          High      Weak    No
2    Sunny     Hot          High      Strong  No
3    Overcast  Hot          High      Weak    Yes
4    Rain      Mild         High      Weak    Yes
5    Rain      Cool         Normal    Weak    Yes
6    Rain      Cool         Normal    Strong  No
7    Overcast  Cool         Normal    Strong  Yes
8    Sunny     Mild         High      Weak    No
9    Sunny     Cool         Normal    Weak    Yes
10   Rain      Mild         Normal    Weak    Yes
11   Sunny     Mild         Normal    Strong  Yes
12   Overcast  Mild         High      Strong  Yes
13   Overcast  Hot          Normal    Weak    Yes
14   Rain      Mild         High      Strong  No
• Predict categorical class labels
• Construct a model from the training set and the values (class labels) of a classifying attribute
• Use the model to classify new data
Classification
Task: determine which of a fixed set of classes an example belongs to
• Input: training set of examples annotated with class values
• Output: induced hypotheses (model / concept description / classifiers)
Learning: induce classifiers from training data
Training Data -> Inductive Learning System -> Classifiers (Derived Hypotheses)
Prediction: use the hypotheses to classify any example described in the same manner
Data to be classified -> Classifier -> Decision on class assignment
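The two phases, learning and prediction, can be sketched with a deliberately simple learner that maps each value of one attribute to its majority class. This one-attribute rule is an assumption chosen for brevity and is not the method of the slides; the function names `learn` and `predict` are hypothetical. The Outlook and Play Tennis values are taken from the 14-day table.

```python
from collections import Counter, defaultdict

def learn(training_set, attribute):
    """Learning phase: induce a rule that maps each value of
    `attribute` to the majority class seen in the training set."""
    counts = defaultdict(Counter)
    for example, label in training_set:
        counts[example[attribute]][label] += 1
    return {value: c.most_common(1)[0][0] for value, c in counts.items()}

def predict(classifier, example, attribute):
    """Prediction phase: classify an example described the same way."""
    return classifier[example[attribute]]

outlooks = ["Sunny", "Sunny", "Overcast", "Rain", "Rain", "Rain", "Overcast",
            "Sunny", "Sunny", "Rain", "Sunny", "Overcast", "Overcast", "Rain"]
plays = ["No", "No", "Yes", "Yes", "Yes", "No", "Yes",
         "No", "Yes", "Yes", "Yes", "Yes", "Yes", "No"]
training_set = [({"Outlook": o}, p) for o, p in zip(outlooks, plays)]

classifier = learn(training_set, "Outlook")
print(classifier)
print(predict(classifier, {"Outlook": "Overcast"}, "Outlook"))
```

The training set goes in once; the resulting classifier can then be applied to any number of new examples.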
Decision Tree: Example
Day  Outlook   Temperature  Humidity  Wind    Play Tennis
1    Sunny     Hot          High      Weak    No
2    Sunny     Hot          High      Strong  No
3    Overcast  Hot          High      Weak    Yes
4    Rain      Mild         High      Weak    Yes
5    Rain      Cool         Normal    Weak    Yes
6    Rain      Cool         Normal    Strong  No
7    Overcast  Cool         Normal    Strong  Yes
8    Sunny     Mild         High      Weak    No
9    Sunny     Cool         Normal    Weak    Yes
10   Rain      Mild         Normal    Weak    Yes
11   Sunny     Mild         Normal    Strong  Yes
12   Overcast  Mild         High      Strong  Yes
13   Overcast  Hot          Normal    Weak    Yes
14   Rain      Mild         High      Strong  No
Outlook
    Sunny -> Humidity
        High -> No
        Normal -> Yes
    Overcast -> Yes
    Rain -> Wind
        Strong -> No
        Weak -> Yes
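The Play Tennis decision tree can be read as nested conditionals: test Outlook first, then Humidity for sunny days or Wind for rainy days. The sketch below encodes that tree and checks it against the 14 training days from the table. The function name `play_tennis` is hypothetical; Temperature is omitted because the tree never tests it.

```python
def play_tennis(outlook, humidity, wind):
    """Walk the decision tree from the root (Outlook) to a leaf."""
    if outlook == "Sunny":
        return "Yes" if humidity == "Normal" else "No"
    if outlook == "Overcast":
        return "Yes"
    # Remaining branch: outlook == "Rain"
    return "Yes" if wind == "Weak" else "No"

# (Outlook, Humidity, Wind, Play Tennis) for days 1 to 14.
DAYS = [
    ("Sunny", "High", "Weak", "No"),
    ("Sunny", "High", "Strong", "No"),
    ("Overcast", "High", "Weak", "Yes"),
    ("Rain", "High", "Weak", "Yes"),
    ("Rain", "Normal", "Weak", "Yes"),
    ("Rain", "Normal", "Strong", "No"),
    ("Overcast", "Normal", "Strong", "Yes"),
    ("Sunny", "High", "Weak", "No"),
    ("Sunny", "Normal", "Weak", "Yes"),
    ("Rain", "Normal", "Weak", "Yes"),
    ("Sunny", "Normal", "Strong", "Yes"),
    ("Overcast", "High", "Strong", "Yes"),
    ("Overcast", "Normal", "Weak", "Yes"),
    ("Rain", "High", "Strong", "No"),
]

correct = sum(play_tennis(o, h, w) == label for o, h, w, label in DAYS)
print(f"{correct}/{len(DAYS)} training days classified correctly")  # 14/14
```

A perfect score on the training days only shows that the tree is consistent with the table; it says nothing about accuracy on unseen days.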
Classification: Relevant Gene Identification
• Goal: identify the subset of genes that distinguish between treatments, tissues, etc.
• Method
    • Collect several samples grouped by treatment (e.g., Diseased vs. Healthy)
    • Use genes as "features"
    • Build a classifier to distinguish treatments
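The method above (grouped samples, genes as features, a classifier on top) might look like the following sketch. A nearest-centroid classifier is used only because it is short; the slides do not prescribe a particular classifier. The function names, the two-gene samples and their values are all hypothetical.

```python
from math import dist

def fit_centroids(samples, labels):
    """Average the expression vectors of each treatment group."""
    groups = {}
    for x, y in zip(samples, labels):
        groups.setdefault(y, []).append(x)
    return {y: [sum(col) / len(rows) for col in zip(*rows)]
            for y, rows in groups.items()}

def classify(centroids, sample):
    """Assign a sample to the group whose centroid is closest."""
    return min(centroids, key=lambda y: dist(sample, centroids[y]))

# Hypothetical samples: each tuple holds expression levels of two genes.
samples = [(11.0, 1.5), (12.0, 2.0), (30.0, 12.0), (33.0, 14.0)]
labels = ["Healthy", "Healthy", "Diseased", "Diseased"]

centroids = fit_centroids(samples, labels)
print(classify(centroids, (31.0, 13.0)))
print(classify(centroids, (11.5, 1.0)))
```

Each gene contributes one coordinate of the feature vector, so adding genes means adding dimensions, not samples.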
Gene Expression Example
ID   G1     G2     G3    G4
1    11.12  1.34   1.97  11.0
2    12.34  2.01   1.22  11.1
3    13.11  1.34   1.34  2.0
4    13.34  11.11  1.38  2.23
5    14.11  13.10  1.06  2.44
6    11.34  14.21  1.07  1.23
7    21.01  12.32  1.97  1.34
8    66.11  33.3   1.97  1.34
9    33.11  44.1   1.96  11.23
10   11.54  11.1   1.97  10.01
11   12.00  15.1   1.98  9.01
12   15.23  1.11   1.89  12.48
13   31.22  2.0    1.99  13.51
14   11.33  11.1   1.01  11.01
Cancer: class label (Yes / No) recorded for each sample
[Figure: decision tree over the gene expression values. Tests: G1 (<=22 / >22), G3 (<=52 / >52), G4 (<=12 / >12); leaves: Yes / No]
15   …   …   …   …   …
Problem: with a large number of genes (~10000), feature selection/reduction techniques are needed
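One simple way to reduce the number of genes is a filter: score every gene by how well it separates the two classes and keep the top k. The sketch below uses a signal-to-noise score (difference of class means over the sum of class standard deviations). That score is one common choice and an assumption here, since the slides do not specify a selection method; the gene names, expression values and labels are hypothetical.

```python
from statistics import mean, stdev

def signal_to_noise(values, labels):
    """|mean(Yes) - mean(No)| / (stdev(Yes) + stdev(No)) for one gene.
    Assumes each class has at least two samples and nonzero spread."""
    pos = [v for v, y in zip(values, labels) if y == "Yes"]
    neg = [v for v, y in zip(values, labels) if y == "No"]
    return abs(mean(pos) - mean(neg)) / (stdev(pos) + stdev(neg))

def select_genes(expression, labels, k):
    """Rank genes by score and keep the k most discriminative ones."""
    scores = {g: signal_to_noise(v, labels) for g, v in expression.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]

# Hypothetical expression values: one list per gene, one entry per sample.
expression = {
    "gene_a": [1.0, 1.2, 0.9, 5.1, 4.8, 5.0],  # separates the classes
    "gene_b": [2.0, 2.1, 1.9, 2.0, 2.2, 1.8],  # same level in both classes
    "gene_c": [3.0, 7.0, 5.0, 4.0, 6.0, 8.0],  # small shift, large spread
}
labels = ["No", "No", "No", "Yes", "Yes", "Yes"]

print(select_genes(expression, labels, k=2))
```

With roughly 10000 genes the same loop applies unchanged; only the selected genes would then be passed to the classifier.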