Using SVM for Expression
Micro-array Data Mining
—— Data Mining Final Project
Chong Shou
Apr.17, 2007
Expression Micro-arrays
 Data:
 An n × m data matrix
 n: number of genes under investigation
 m = 79: number of conditions under which gene
expression levels are measured
 Expression level: expression under a
certain condition is compared to a
reference expression level; positive if up-regulated, negative if down-regulated
Expression Micro-arrays
 It is believed that genes working
together in certain biological
processes should have similar
expression patterns
 We should be able to find similar
expression patterns among genes having related
functions
 We could use this information to infer
(predict) the functions of unknown genes
Supervised Learning
 We use biological knowledge gathered
from experiments to assign selected
genes to a set of classes; these labeled
genes form our training set
 Training set: 2467 genes
 Test set: 3754 genes
GO (Gene Ontology)
 Use GO as functional annotation for
selected genes
 Classes:
 Respiration
 TCA cycle (biochemical process that produce
 Histone (protein that helps package DNA)
 Ribosome (protein complex that assembles amino
acids into proteins)
 Proteolysis (process that degrades proteins)
 Meiosis (process that produces reproductive cells)
Advantages using SVM
 Classical supervised learning method
 Able to use a variety of similarity
functions (kernels)
 Able to handle high-dimensional data:
each gene is described by 79 conditions
Details in SVM Parameters
 Four kernel functions
 Linear
 K(X,Y) = <X,Y> + 1
 Polynomial
 K(X,Y) = (<X,Y> + 1)^2
 K(X,Y) = (<X,Y> + 1)^3
 Radial
 K(X,Y) = exp(-σ||X - Y||^2)
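The four kernels listed above can be sketched in plain Python as follows. This is an illustrative sketch rather than the project's own code (the slides mention R); the function names, the toy vectors, and the default value of sigma are assumptions made here.

```python
import math

def dot(x, y):
    """Inner product <x, y> of two equal-length expression vectors."""
    return sum(a * b for a, b in zip(x, y))

def polynomial_kernel(x, y, degree=1):
    """K(x, y) = (<x, y> + 1)^degree; degree=1 gives the linear kernel."""
    return (dot(x, y) + 1.0) ** degree

def radial_kernel(x, y, sigma=1.0):
    """K(x, y) = exp(-sigma * ||x - y||^2); sigma=1.0 is an assumed default."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sigma * sq_dist)

# Two toy expression profiles (made-up numbers, 3 conditions instead of 79).
x = [1.0, 0.0, -1.0]
y = [0.5, 0.5, 0.0]
print(polynomial_kernel(x, y, 1))  # linear
print(polynomial_kernel(x, y, 2))  # quadratic
print(polynomial_kernel(x, y, 3))  # cubic
print(radial_kernel(x, y))         # radial
```

Treating the linear kernel as the degree-1 case of the polynomial kernel keeps the three dot-product kernels in one function.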
Kernel Matrix Modification
 Use the kernel function to calculate the kernel
matrix used in SVM training
 K_ij = K(X_i, X_j)
 Problem
 Positive samples: only a few genes belong
to any of the six classes
 Negative samples: most genes have no
specific classification
 With so few positive samples, the SVM tends to treat them as noise
and is prone to misclassify them
Kernel Matrix Modification (cont)
 Add to the diagonal of the kernel matrix a
constant whose magnitude depends on the
class of the data point, thus control the
Positive samples: K_ii = K_ii + λ(n+/N)
Negative samples: K_ii = K_ii + λ(n-/N)
λ = 0.1
n+: number of positive samples
n-: number of negative samples
N: total number of samples
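A minimal sketch of the diagonal modification described above, in plain Python. The function name `add_class_dependent_diagonal`, the +1/-1 label encoding, and the toy matrix are assumptions for illustration; only the update rule and λ = 0.1 come from the slide.

```python
def add_class_dependent_diagonal(kernel, labels, lam=0.1):
    """Return a copy of the kernel matrix with lam * (n_class / N) added to
    each diagonal entry, where n_class is the size of that sample's class.

    kernel: square list of lists, kernel[i][j] = K(X_i, X_j)
    labels: +1 for positive samples, -1 for negative samples (assumed encoding)
    """
    total = len(labels)
    n_pos = sum(1 for label in labels if label == 1)
    n_neg = total - n_pos
    modified = [row[:] for row in kernel]  # leave the input matrix untouched
    for i, label in enumerate(labels):
        n_class = n_pos if label == 1 else n_neg
        modified[i][i] += lam * n_class / total
    return modified

# Toy example: 1 positive and 3 negative samples, identity kernel matrix.
labels = [1, -1, -1, -1]
kernel = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
modified = add_class_dependent_diagonal(kernel, labels)
print([modified[i][i] for i in range(4)])
```

Only the diagonal changes; the off-diagonal entries K_ij stay as computed by the kernel function.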
Other Supervised Learning Methods
 Classification tree
 rpart
 moc
 k-Nearest Neighbors
Model Evaluation
 Confusion Matrix
 FP, FN, TP, TN
 Cost Function
 C_M = FP_M + 2*FN_M
 M: learning method
 FN has larger weight
 Savings Function
 S_M = C_N - C_M
 C_N: cost if all samples are labeled negative
 Models with large S_M are preferred
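The cost and savings functions above can be written out as a short Python sketch. The function names and the example counts are hypothetical and are not results from the project; the only derived fact is that labeling every sample negative gives FP = 0 and FN equal to the number of positive samples.

```python
def cost(false_positives, false_negatives):
    """C_M = FP_M + 2 * FN_M: a false negative counts twice."""
    return false_positives + 2 * false_negatives

def savings(false_positives, false_negatives, n_positive):
    """S_M = C_N - C_M, where C_N is the cost of labeling every sample negative.

    Labeling everything negative gives FP = 0 and FN = n_positive.
    """
    cost_all_negative = cost(0, n_positive)
    return cost_all_negative - cost(false_positives, false_negatives)

# Hypothetical confusion-matrix counts for one class with 10 positive genes.
print(cost(3, 2))         # cost of a method with 3 FP and 2 FN
print(savings(3, 2, 10))  # savings relative to the all-negative baseline
```

A method that does no better than labeling everything negative has savings of zero or less, which is why larger savings are preferred.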
 Use SVM on testing set to predict
gene function
 Compare the classification result with
Some Problems
 Large kernel matrix: 2467 ×
2467 for the training set, which requires a
long running time on a PC.
 “predict” function in R