Applications of Data Mining in
Microarray Data Analysis
Yen-Jen Oyang
Dept. of Computer Science and
Information Engineering
Observations and Challenges in
the Information Age
• A huge volume of information has been, and is being, digitized and stored in computers.
• Due to the volume of digitized information, effective exploitation of this information is beyond the capability of human beings without the aid of intelligent computer software.
An Example of Data Mining
• Given the data set shown on the next slide, can we figure out a set of rules that predicts the classes of objects?
Data Set

Data     Class    Data     Class    Data     Class
(15,33)  O        (18,28)  ×        (16,31)  O
(9,23)   ×        (15,35)  O        (9,32)   ×
(8,15)   ×        (17,34)  O        (11,38)  ×
(11,31)  O        (18,39)  ×        (13,34)  O
(13,37)  ×        (14,32)  O        (19,36)  ×
(18,32)  O        (25,18)  ×        (10,34)  ×
(16,38)  ×        (23,33)  ×        (15,30)  O
(12,33)  O        (21,28)  ×        (13,22)  ×
Distribution of the Data Set
[Figure: scatter plot of the data set. The ten "O" instances form a compact group (roughly 11 ≤ x ≤ 18, 30 ≤ y ≤ 35) surrounded by the fourteen "×" instances; x-axis ticks at 10, 15, 20 and a y-axis tick at 30.]
Rule Based on Observation
If $(x - 15)^2 + (y - 30)^2 \le 25$ and $y \ge 30$, then class = O;
else class = X.
Rule Generated by an RBF (Radial Basis Function) Network Based Learning Algorithm
Let
$$f_o(v) = \sum_{i=1}^{10} \frac{1}{\sqrt{2\pi}\,\sigma_{oi}}\, e^{-\frac{\|v - c_{oi}\|^2}{2\sigma_{oi}^2}}$$
and
$$f_x(v) = \sum_{j=1}^{14} \frac{1}{\sqrt{2\pi}\,\sigma_{xj}}\, e^{-\frac{\|v - c_{xj}\|^2}{2\sigma_{xj}^2}}.$$
If $f_o(v) \ge f_x(v)$, then prediction = "O". Otherwise prediction = "X".

The kernel centers and widths:

c_oi:  (15,33)  (11,31)  (18,32)  (12,33)  (15,35)  (17,34)  (14,32)  (16,31)  (13,34)  (15,30)
σ_oi:  1.723    2.745    2.327    1.794    1.973    2.045    1.794    1.794    1.794    2.027

c_xj:  (9,23)  (8,15)  (13,37)  (16,38)  (18,28)  (18,39)  (25,18)  (23,33)  (21,28)  (9,32)  (11,38)  (19,36)  (10,34)  (13,22)
σ_xj:  6.458   10.08   2.939    3.463    2.745    5.451    3.287    10.86    5.322    5.070   4.562    3.587    3.232    6.260
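A minimal sketch of this classifier in Python, plugging in the centers and widths tabulated above (the Gaussian normalization constant is taken from the reconstructed formula and should be treated as an assumption):

```python
import math

# Kernel centers and widths from the tables above.
O_CENTERS = [(15, 33), (11, 31), (18, 32), (12, 33), (15, 35),
             (17, 34), (14, 32), (16, 31), (13, 34), (15, 30)]
O_SIGMAS = [1.723, 2.745, 2.327, 1.794, 1.973,
            2.045, 1.794, 1.794, 1.794, 2.027]
X_CENTERS = [(9, 23), (8, 15), (13, 37), (16, 38), (18, 28),
             (18, 39), (25, 18), (23, 33), (21, 28), (9, 32),
             (11, 38), (19, 36), (10, 34), (13, 22)]
X_SIGMAS = [6.458, 10.08, 2.939, 3.463, 2.745, 5.451, 3.287,
            10.86, 5.322, 5.070, 4.562, 3.587, 3.232, 6.260]

def mixture(v, centers, sigmas):
    """Sum of spherical Gaussian kernels evaluated at point v."""
    return sum(
        math.exp(-math.dist(v, c) ** 2 / (2 * s * s)) / (math.sqrt(2 * math.pi) * s)
        for c, s in zip(centers, sigmas)
    )

def predict(v):
    # Predict "O" when the O-mixture density dominates the X-mixture density.
    return "O" if mixture(v, O_CENTERS, O_SIGMAS) >= mixture(v, X_CENTERS, X_SIGMAS) else "X"

print(predict((15, 32)))  # a point deep inside the "O" region -> "O"
```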
Identifying Boundary of Different
Classes of Objects
Boundary Identified
Data Mining /
Knowledge Discovery
• The main theme of data mining is to
discover unknown and implicit knowledge
in a large dataset.
• There are three main categories of data
mining algorithms:
• Classification;
• Clustering;
• Mining association rules / correlation analysis.
Data Classification
• In a data classification problem, each object is
described by a set of attribute values and each
object belongs to one of the predefined classes.
• The goal is to derive a set of rules that predicts
which class a new object should belong to, based
on a given set of training samples. Data
classification is also called supervised learning.
Instance-Based Learning
• In instance-based learning, we take the k nearest training samples of a new instance (v1, v2, …, vm) and assign the new instance to the class that has the most instances among those k nearest samples.
• Classifiers that adopt instance-based learning are commonly called KNN (k-nearest-neighbor) classifiers.
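A minimal KNN sketch along these lines (Euclidean distance assumed; a majority vote decides, with ties going to the class encountered first):

```python
import math
from collections import Counter

def knn_predict(query, training_samples, k=3):
    """training_samples: list of (point, class_label) pairs;
    points are tuples (v1, ..., vm)."""
    # Rank training samples by Euclidean distance to the query instance.
    ranked = sorted(training_samples, key=lambda s: math.dist(query, s[0]))
    # Majority vote among the k nearest samples.
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

train = [((15, 33), "O"), ((16, 31), "O"), ((18, 28), "X")]
print(knn_predict((16, 32), train, k=1))  # nearest sample is (16, 31) -> "O"
```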
Example of the KNN
• If a 1NN classifier is employed, then the test instance is predicted as "X".
• If a 3NN classifier is employed, then the test instance is predicted as "O".
Applications of Data
Classification in
Bioinformatics
• In microarray data analysis, data classification is employed to predict the class of a new sample based on existing samples with known classes.
• For example, the Leukemia data set contains 72 samples and 7129 genes:
• 25 Acute Myeloid Leukemia (AML) samples;
• 38 B-cell Acute Lymphoblastic Leukemia samples;
• 9 T-cell Acute Lymphoblastic Leukemia samples.
Model of Microarray Data Sets
• A microarray data set is modeled as an m × n matrix M whose rows correspond to Sample_1, …, Sample_m and whose columns correspond to Gene_1, …, Gene_n, with each entry $M(i, j) \in \mathbb{R}$ recording the expression level of Gene_j in Sample_i.
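Concretely, such a data set can be held as a plain real-valued matrix; a tiny illustration (Python/NumPy, values made up):

```python
import numpy as np

# Rows are samples, columns are genes; M[i, j] is the expression
# level of gene j in sample i (the numbers are illustrative only).
M = np.array([
    [2.1, 0.3, 5.7],   # Sample 1
    [1.9, 0.4, 6.1],   # Sample 2
    [0.2, 3.8, 1.0],   # Sample 3
])
m, n = M.shape         # m samples, n genes
print(m, n, M[0, 2])   # 3 3 5.7
```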
Alternative Data Classification
Algorithms
• Decision tree (C4.5 and C5.0);
• Instance-based learning (KNN);
• Naïve Bayesian classifier;
• Support vector machine (SVM);
• Novel approaches, including the RBF network based classifier that we have recently proposed.
Accuracy of Different
Classification Algorithms
Accuracy (%) of the classification algorithms:

Data set (# train, # test)   RBF     SVM     1NN     3NN
Satimage (4435, 2000)        92.30   91.30   89.35   90.6
Letter (15000, 5000)         97.12   97.98   95.26   95.46
Shuttle (43500, 14500)       99.94   99.92   99.91   99.92
Average                      96.45   96.40   94.84   95.33
Comparison of Execution Time (in seconds)

Task               Data set   RBF w/o data reduction   RBF w/ data reduction   SVM
Cross validation   Satimage   670                      265                     64622
Cross validation   Letter     2825                     1724                    386814
Cross validation   Shuttle    96795                    59.9                    467825
Make classifier    Satimage   5.91                     0.85                    21.66
Make classifier    Letter     17.05                    6.48                    282.05
Make classifier    Shuttle    1745                     0.69                    129.84
Test               Satimage   21.3                     7.4                     11.53
Test               Letter     128.6                    51.74                   94.91
Test               Shuttle    996.1                    5.85                    2.13
More Insights
                                                 Satimage   Letter   Shuttle
# of training samples in the original data set   4435       15000    43500
# of training samples after data reduction       1815       7794     627
% of training samples remaining                  40.92%     51.96%   1.44%
Classification accuracy after data reduction     92.15      96.18    99.32
# of support vectors identified by LIBSVM        1689       8931     287
Data Clustering
• Data clustering concerns how to group a set of objects based on the similarity of their attributes and/or their proximity in the vector space. Data clustering is also called unsupervised learning.
The Agglomerative
Hierarchical Clustering
Algorithms
• The agglomerative hierarchical clustering algorithms operate by maintaining a sorted list of inter-cluster distances.
• Initially, each data instance forms its own cluster.
• The clustering algorithm repetitively merges the two clusters with the minimum inter-cluster distance.
• Upon merging two clusters, the clustering algorithm computes the distances between the newly formed cluster and the remaining clusters, and updates the sorted list of inter-cluster distances accordingly (a sketch of the procedure follows the list of linkage options below).
• There are a number of ways to define the
inter-cluster distance:
• minimum distance (single-link);
• maximum distance (complete-link);
• average distance;
• mean distance.
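A naive sketch of the procedure described above (Python; it recomputes pairwise distances at every step instead of maintaining the sorted list, trading the speed of the real algorithm for brevity):

```python
import math
from itertools import combinations

def agglomerate(points, num_clusters, linkage="single"):
    """Merge clusters of 2-D points until num_clusters remain.
    linkage: "single" (minimum) or "complete" (maximum) distance."""
    clusters = [[p] for p in points]       # each instance starts as a cluster
    link = min if linkage == "single" else max
    while len(clusters) > num_clusters:
        # Pick the pair of clusters with the smallest inter-cluster distance.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: link(math.dist(a, b)
                                for a in clusters[ij[0]]
                                for b in clusters[ij[1]]),
        )
        clusters[i].extend(clusters.pop(j))  # merge (j > i, so pop is safe)
    return clusters

print(agglomerate([(0, 0), (0, 1), (5, 0), (5, 1)], 2, linkage="complete"))
# -> [[(0, 0), (0, 1)], [(5, 0), (5, 1)]]
```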
An Example of the
Agglomerative Hierarchical
Clustering Algorithm
• For the following data set, we will get different clustering results with the single-link and complete-link algorithms.
[Figure: six data points, labeled 1 through 6, plotted in the plane.]
Result of the Single-Link
algorithm
[Figure: the clusters found by the single-link algorithm on points 1-6, together with the corresponding dendrogram.]
Result of the Complete-Link
algorithm
[Figure: the clusters found by the complete-link algorithm on the same points, together with the corresponding dendrogram.]
Remarks
• Single-link and complete-link are the two most commonly used alternatives.
• Single-link suffers from the so-called chaining effect.
• On the other hand, complete-link also fails in some cases.
Example of the Chaining
Effect
[Figure: single-link result (10 clusters) vs. complete-link result (2 clusters).]
Effect of Bias towards
Spherical Clusters
[Figure: single-link result (2 clusters) vs. complete-link result (2 clusters), illustrating complete-link's bias towards spherical clusters.]
K-Means: A Partitional Data
Clustering Algorithm
• The k-means algorithm is probably the most
commonly used partitional clustering
algorithm.
• The k-means algorithm begins with
selecting k data instances as the means or
centers of k clusters.
• The k-means algorithm then executes the following loop until the convergence criterion is met:

repeat {
    assign every data instance to the closest cluster, based on the distance between the data instance and the center of each cluster;
    compute the new centers of the k clusters;
} until (the convergence criterion is met);
• A commonly used convergence criterion is
$$E = \sum_{C_i} \sum_{p \in C_i} \|p - m_i\|^2,$$
where $m_i$ is the center of cluster $C_i$; the loop terminates when E ceases to decrease.
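A minimal k-means sketch following this description (Python, 2-D points as tuples; it stops when the centers stop moving, at which point E can no longer decrease):

```python
import math
import random

def k_means(points, k, max_iters=100):
    """Naive k-means; returns (centers, clusters)."""
    centers = random.sample(points, k)     # k data instances as initial centers
    for _ in range(max_iters):
        # Assignment step: attach each instance to its closest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centers[i]))
            clusters[nearest].append(p)
        # Update step: each center becomes the mean of its cluster.
        new_centers = [
            tuple(sum(xs) / len(c) for xs in zip(*c)) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
        if new_centers == centers:         # converged: centers are stable
            break
        centers = new_centers
    return centers, clusters
```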
Illustration of the K-Means
Algorithm---(I)
[Figure: the data set with the three initial centers marked.]
Illustration of the K-Means
Algorithm---(II)
[Figure: the three cluster centers (marked ×) after the 1st iteration.]
Illustration of the K-Means
Algorithm---(III)
[Figure: the three cluster centers after the 2nd iteration.]
A Case in which the K-Means
Algorithm Fails
• The k-means algorithm may converge to a local optimum, as the following example demonstrates:
[Figure: an initial selection of centers for which k-means converges to a poor, locally optimal clustering.]
Remarks
• As the examples demonstrate, no clustering algorithm is universally superior to the others with respect to clustering quality.
Applications of Data
Clustering in Microarray Data
Analysis
• Data clustering has been employed in
microarray data analysis for
• identifying the genes with similar expressions;
• identifying the subtypes of samples.
Feature Selection in
Microarray Data Analysis
• In microarray data analysis, it is highly
desirable to identify those genes that are
correlated to the classes of samples.
• For example, in the Leukemia data set, there
are 7129 genes. We want to identify those
genes that lead to different disease types.
• Furthermore, inclusion of features that are not correlated to the classification decision may result in lower classification accuracy or poor clustering quality.
• For example, in the data set shown on the next slide, inclusion of the feature corresponding to the y-axis causes incorrect prediction of the marked test instance if a 3NN classifier is employed.
[Figure: "O"s and "X"s separated by the vertical line x = 10, with the test instance marked.]
• It is apparent that the "O"s and "X"s are separated by x = 10. If only the attribute corresponding to the x-axis were selected, then the 3NN classifier would predict the class of the test instance correctly.
Univariate Analysis in Feature
Selection
• In univariate analysis, the importance of each feature is determined by how objects of different classes are distributed along that particular axis.
• Let $v_1, v_2, \ldots, v_m$ and $v'_1, v'_2, \ldots, v'_n$ denote the feature values of the class-1 and class-2 objects, respectively.
• Assume that the feature values of both classes of objects follow a normal distribution.
• Then
$$T = \frac{\bar{v} - \bar{v}'}{\sqrt{\dfrac{(m-1)s^2 + (n-1)s'^2}{m+n-2}\left(\dfrac{1}{m}+\dfrac{1}{n}\right)}}$$
follows a t-distribution with degrees of freedom = (m + n − 2), where
$$\bar{v} = \frac{1}{m}\sum_{i=1}^{m} v_i, \qquad \bar{v}' = \frac{1}{n}\sum_{i=1}^{n} v'_i;$$
$$s^2 = \frac{1}{m-1}\sum_{i=1}^{m}(v_i - \bar{v})^2, \qquad s'^2 = \frac{1}{n-1}\sum_{i=1}^{n}(v'_i - \bar{v}')^2.$$
If the t statistic of a feature is lower (in absolute value) than a threshold, then the feature is deleted.
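A sketch of this univariate scoring in Python (the per-feature inputs are the expression values of one gene, split by class):

```python
import math

def t_statistic(values1, values2):
    """Pooled two-sample t statistic, per the formula above."""
    m, n = len(values1), len(values2)
    mean1 = sum(values1) / m
    mean2 = sum(values2) / n
    s1 = sum((v - mean1) ** 2 for v in values1) / (m - 1)   # s^2
    s2 = sum((v - mean2) ** 2 for v in values2) / (n - 1)   # s'^2
    pooled = ((m - 1) * s1 + (n - 1) * s2) / (m + n - 2)
    return (mean1 - mean2) / math.sqrt(pooled * (1 / m + 1 / n))

def select_features(genes, threshold):
    """genes: mapping gene_name -> (class1_values, class2_values).
    Keep a gene only if |T| reaches the threshold."""
    return [g for g, (a, b) in genes.items()
            if abs(t_statistic(a, b)) >= threshold]
```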
Multivariate Analysis
• Univariate analysis is not able to identify crucial features when no single feature separates the classes, even though a combination of features does.
• Therefore, multivariate analysis has been developed. However, most of the multivariate analysis algorithms that have been proposed suffer from high time complexity and may not be applicable to real-world problems.
Summary
• Data clustering and data classification have
been widely used in microarray data
analysis.
• Feature selection remains the most challenging of these issues today.