Download Mining Patterns from Protein Structures

Document related concepts
no text concepts found
Transcript
EECS 800 Research Seminar
Mining Biological Data
Instructor: Luke Huan
Fall, 2006
The UNIVERSITY of Kansas
Administrative
Student presentations start from next week
Send me the list of references that you use, if your choices are
different from what are listed at the class webpage
Speaker:
Allocate enough time for background.
20 minutes for general audience
20 minutes for people working in “your area”
<20 minutes for experts
Keep in mind of five “w” questions: What is it? Why is it
important? Who cares? Works or not? and So what?
11/01/2006
Semi-supervised learning
Mining Biological Data
KU EECS 800, Luke Huan, Fall’06
slide2
Administrative
Participants
Need to ask at least one question
11/01/2006
Semi-supervised learning
Mining Biological Data
KU EECS 800, Luke Huan, Fall’06
slide3
Overview
Semi-supervised learning
Semi-supervised classification
Semi-supervised clustering
Semi-supervised clustering
Search based methods
Cop K-mean
Seeded K-mean
Constrained K-mean
Similarity based methods
11/01/2006
Semi-supervised learning
Mining Biological Data
KU EECS 800, Luke Huan, Fall’06
slide4
Supervised Classification Example
.
.
.
.
11/01/2006
Semi-supervised learning
Mining Biological Data
KU EECS 800, Luke Huan, Fall’06
slide5
Supervised Classification Example
.
.
.
. ..
.. .
.. .
.
. .
. . .
. .
11/01/2006
Semi-supervised learning
Mining Biological Data
KU EECS 800, Luke Huan, Fall’06
slide6
Supervised Classification Example
.
.
.
. ..
.. .
.. .
.
. .
. . .
. .
11/01/2006
Semi-supervised learning
Mining Biological Data
KU EECS 800, Luke Huan, Fall’06
slide7
Unsupervised Clustering Example
.
.
.
.
. ..
.. .
.. .
.
. .
. . .
. .
11/01/2006
Semi-supervised learning
Mining Biological Data
KU EECS 800, Luke Huan, Fall’06
slide8
Unsupervised Clustering Example
.
.
.
. ..
.. .
.. .
.
. .
. . .
. .
11/01/2006
Semi-supervised learning
Mining Biological Data
KU EECS 800, Luke Huan, Fall’06
slide9
Semi-Supervised Learning
Combines labeled and unlabeled data during training to
improve performance:
Semi-supervised classification: Training on labeled data exploits
additional unlabeled data, frequently resulting in a more accurate
classifier.
Semi-supervised clustering: Uses small amount of labeled data to
aid and bias the clustering of unlabeled data.
11/01/2006
Semi-supervised learning
Mining Biological Data
KU EECS 800, Luke Huan, Fall’06
slide10
Semi-Supervised Classification
Example
.
.
.
. ..
.. .
.. .
.
. .
. . .
. .
11/01/2006
Semi-supervised learning
Mining Biological Data
KU EECS 800, Luke Huan, Fall’06
slide11
Semi-Supervised Classification
Example
.
.
.
. ..
.. .
.. .
.
. .
. . .
. .
11/01/2006
Semi-supervised learning
Mining Biological Data
KU EECS 800, Luke Huan, Fall’06
slide12
Semi-Supervised Classification
Algorithms:
Semisupervised EM [Ghahramani:NIPS94,Nigam:ML00].
Co-training [Blum:COLT98].
Transductive SVM’s [Vapnik:98,Joachims:ICML99].
Assumptions:
Known, fixed set of categories given in the labeled data.
Goal is to improve classification of examples into these known
categories.
11/01/2006
Semi-supervised learning
Mining Biological Data
KU EECS 800, Luke Huan, Fall’06
slide13
Semi-Supervised Clustering Example
.
.
.
. ..
.. .
.. .
.
. .
. . .
. .
11/01/2006
Semi-supervised learning
Mining Biological Data
KU EECS 800, Luke Huan, Fall’06
slide14
Semi-Supervised Clustering Example
.
.
. ..
.. .
.. .
.
. .
. . .
. .
11/01/2006
Semi-supervised learning
Mining Biological Data
KU EECS 800, Luke Huan, Fall’06
slide15
Second Semi-Supervised Clustering
Example
.
.
.
. ..
.. .
.. .
.
. .
. . .
. .
11/01/2006
Semi-supervised learning
Mining Biological Data
KU EECS 800, Luke Huan, Fall’06
slide16
Second Semi-Supervised Clustering
Example
.
.
. ..
.. .
.. .
.
. .
. . .
. .
11/01/2006
Semi-supervised learning
Mining Biological Data
KU EECS 800, Luke Huan, Fall’06
slide17
Semi-Supervised Clustering
Can group data using the categories in the initial labeled
data.
Can also extend and modify the existing set of categories
as needed to reflect other regularities in the data.
Can cluster a disjoint set of unlabeled data using the
labeled data as a “guide” to the type of clusters desired.
11/01/2006
Semi-supervised learning
Mining Biological Data
KU EECS 800, Luke Huan, Fall’06
slide18
Problem definition
Input:
A set of unlabeled objects
Some domain knowledge
Output:
A partitioning of the objects into clusters
Objective:
Maximum intra-cluster similarity
Minimum inter-cluster similarity
High consistency between the partitioning and the
domain knowledge
11/01/2006
Semi-supervised learning
Mining Biological Data
KU EECS 800, Luke Huan, Fall’06
slide19
What is Domain Knowledge?
Must-link and cannot-link
Class labels
Ontology
11/01/2006
Semi-supervised learning
Mining Biological Data
KU EECS 800, Luke Huan, Fall’06
slide20
Why semi-supervised clustering?
Why not clustering?
Could not incorporate prior knowledge into clustering process
Why not classification?
Sometimes there are insufficient labeled data.
Potential applications
Bioinformatics (gene and protein clustering)
Document hierarchy construction
News/email categorization
Image categorization
11/01/2006
Semi-supervised learning
Mining Biological Data
KU EECS 800, Luke Huan, Fall’06
slide21
Semi-Supervised Clustering
Approaches
Search-based Semi-Supervised Clustering
Alter the clustering algorithm using the constraints
Similarity-based Semi-Supervised Clustering
Alter the similarity measure based on the
constraints
Combination of both
11/01/2006
Semi-supervised learning
Mining Biological Data
KU EECS 800, Luke Huan, Fall’06
slide22
Search-Based Semi-Supervised
Clustering
Alter the clustering algorithm that searches for a good
partitioning by:
Modifying the objective function to give a reward for obeying
labels on the supervised data [Demeriz:ANNIE99].
Enforcing constraints (must-link, cannot-link) on the labeled data
during clustering [Wagstaff:ICML00, Wagstaff:ICML01].
Use the labeled data to initialize clusters in an iterative
refinement algorithm (kMeans, EM) [Basu:ICML02].
11/01/2006
Semi-supervised learning
Mining Biological Data
KU EECS 800, Luke Huan, Fall’06
slide23
Unsupervised KMeans Clustering
KMeans iteratively partitions a dataset into K clusters.
Algorithm:
Initialize K cluster centers {l } randomly. Repeat until
convergence:
K
l 1
Cluster Assignment Step: Assign each data point x to the
cluster Xl, such that L2 distance of x from  l (center of Xl) is
minimum
Center Re-estimation Step: Re-estimate each cluster center  l
as the mean of the points in that cluster
11/01/2006
Semi-supervised learning
Mining Biological Data
KU EECS 800, Luke Huan, Fall’06
slide24
KMeans Objective Function
Locally minimizes sum of squared distance between the
data points and their corresponding cluster centers:
 
K
l 1
xi X l
|| xi  l ||
2
Initialization of K cluster centers:
Totally random
Random perturbation from global mean
Heuristic
to
ensure
well-separated
11/01/2006
Semi-supervised learning
Mining Biological Data
KU EECS 800, Luke Huan, Fall’06
centers
etc.
slide25
K Means Example
11/01/2006
Semi-supervised learning
Mining Biological Data
KU EECS 800, Luke Huan, Fall’06
slide26
K Means Example
Randomly Initialize Means
x
x
11/01/2006
Semi-supervised learning
Mining Biological Data
KU EECS 800, Luke Huan, Fall’06
slide27
K Means Example
Assign Points to Clusters
x
x
11/01/2006
Semi-supervised learning
Mining Biological Data
KU EECS 800, Luke Huan, Fall’06
slide28
K Means Example
Re-estimate Means
x
x
11/01/2006
Semi-supervised learning
Mining Biological Data
KU EECS 800, Luke Huan, Fall’06
slide29
K Means Example
Re-assign Points to Clusters
x
x
11/01/2006
Semi-supervised learning
Mining Biological Data
KU EECS 800, Luke Huan, Fall’06
slide30
K Means Example
Re-estimate Means
x
x
11/01/2006
Semi-supervised learning
Mining Biological Data
KU EECS 800, Luke Huan, Fall’06
slide31
K Means Example
Re-assign Points to Clusters
x
x
11/01/2006
Semi-supervised learning
Mining Biological Data
KU EECS 800, Luke Huan, Fall’06
slide32
K Means Example
Re-estimate Means and Converge
x
x
11/01/2006
Semi-supervised learning
Mining Biological Data
KU EECS 800, Luke Huan, Fall’06
slide33
Semi-Supervised K-Means
Constraints (Must-link, Cannot-link)
COP K-Means
Partial label information is given
Seeded K-Means (Basu, ICML’02)
Constrained K-Means
11/01/2006
Semi-supervised learning
Mining Biological Data
KU EECS 800, Luke Huan, Fall’06
slide34
COP K-Means
COP K-Means is K-Means with must-link (must be in
same cluster) and cannot-link (cannot be in same cluster)
constraints on data points.
Initialization: Cluster centers are chosen randomly but no
must-link constraints that may be violated
Algorithm: During cluster assignment step in COP-KMeans, a point is assigned to its nearest cluster without
violating any of its constraints. If no such assignment
exists, abort.
Based on Wagstaff et al.: ICML01
11/01/2006
Semi-supervised learning
Mining Biological Data
KU EECS 800, Luke Huan, Fall’06
slide35
COPK-Means
K-MeansAlgorithm
Algorithm
COP
11/01/2006
Semi-supervised learning
Mining Biological Data
KU EECS 800, Luke Huan, Fall’06
slide36
Illustration
Determine
its label
Must-link
x
x
Assign to the red class
11/01/2006
Semi-supervised learning
Mining Biological Data
KU EECS 800, Luke Huan, Fall’06
slide37
Illustration
Determine
its label
x
x
Cannot-link
Assign to the red class
11/01/2006
Semi-supervised learning
Mining Biological Data
KU EECS 800, Luke Huan, Fall’06
slide38
Illustration
Determine
its label
Must-link
x
x
Cannot-link
The clustering algorithm fails
11/01/2006
Semi-supervised learning
Mining Biological Data
KU EECS 800, Luke Huan, Fall’06
slide39
Evaluation
Rand index: measures the agreement between two
partitions, P1 and P2, of the same data set D.
Each partition is viewed as a collection of n(n-1)/2
pairwise decisions, where n is the size of D.
a is the number of decisions where P1 and P2 put a pair
of objects into the same cluster
b is the number of decisions where two instances are
placed in different clusters in both partitions.
Total agreement can then be calculated using
Rand(P1; P2) = (a + b)/ (n (n -1)/2):
11/01/2006
Semi-supervised learning
Mining Biological Data
KU EECS 800, Luke Huan, Fall’06
slide40
Evaluation
11/01/2006
Semi-supervised learning
Mining Biological Data
KU EECS 800, Luke Huan, Fall’06
slide41
Semi-Supervised K-Means
Seeded K-Means:
Labeled data provided by user are used for initialization: initial
center for cluster i is the mean of the seed points having label i.
Seed points are only used for initialization, and not in subsequent
steps.
Constrained K-Means:
Labeled data provided by user are used to initialize K-Means
algorithm.
Cluster labels of seed data are kept unchanged in the cluster
assignment steps, and only the labels of the non-seed data are reestimated.
Based on Basu et al., ICML’02.
11/01/2006
Semi-supervised learning
Mining Biological Data
KU EECS 800, Luke Huan, Fall’06
slide42
Seeded K-Means
Use labeled data to find
the initial centroids and
then run K-Means.
The labels for seeded
points may change.
11/01/2006
Semi-supervised learning
Mining Biological Data
KU EECS 800, Luke Huan, Fall’06
slide43
Seeded K-Means Example
11/01/2006
Semi-supervised learning
Mining Biological Data
KU EECS 800, Luke Huan, Fall’06
slide44
Seeded K-Means Example
Initialize Means Using Labeled Data
x
11/01/2006
Semi-supervised learning
x
Mining Biological Data
KU EECS 800, Luke Huan, Fall’06
slide45
Seeded K-Means Example
Assign Points to Clusters
x
11/01/2006
Semi-supervised learning
x
Mining Biological Data
KU EECS 800, Luke Huan, Fall’06
slide46
Seeded K-Means Example
Re-estimate Means
x
11/01/2006
Semi-supervised learning
x
Mining Biological Data
KU EECS 800, Luke Huan, Fall’06
slide47
Seeded K-Means Example
Assign points to clusters and Converge
x
11/01/2006
Semi-supervised learning
x
Mining Biological Data
KU EECS 800, Luke Huan, Fall’06
the label is changed
slide48
Constrained K-Means
Use labeled data to find
the initial centroids and
then run K-Means.
The labels for seeded
points will not change.
11/01/2006
Semi-supervised learning
Mining Biological Data
KU EECS 800, Luke Huan, Fall’06
slide49
Constrained K-Means Example
11/01/2006
Semi-supervised learning
Mining Biological Data
KU EECS 800, Luke Huan, Fall’06
slide50
Constrained K-Means Example
Initialize Means Using Labeled Data
x
11/01/2006
Semi-supervised learning
x
Mining Biological Data
KU EECS 800, Luke Huan, Fall’06
slide51
Constrained K-Means Example
Assign Points to Clusters
x
11/01/2006
Semi-supervised learning
x
Mining Biological Data
KU EECS 800, Luke Huan, Fall’06
slide52
Constrained K-Means Example
Re-estimate Means and Converge
x
11/01/2006
Semi-supervised learning
x
Mining Biological Data
KU EECS 800, Luke Huan, Fall’06
slide53
Datasets
Data sets:
UCI Iris (3 classes; 150 instances)
CMU 20 Newsgroups (20 classes; 20,000 instances)
Yahoo! News (20 classes; 2,340 instances)
Data subsets created for experiments:
Small-20 newsgroup: random sample of 100 documents from
each newsgroup, created to study effect of datasize on
algorithms.
Different-3 newsgroup: 3 very different newsgroups
(alt.atheism, rec.sport.baseball, sci.space), created to study
effect of data separability on algorithms.
Same-3
newsgroup:
3
very
similar
newsgroups
(comp.graphics, comp.os.ms-windows, comp.windows.x).
11/01/2006
Semi-supervised learning
Mining Biological Data
KU EECS 800, Luke Huan, Fall’06
slide54
Evaluation
Mutual information
Objective function
11/01/2006
Semi-supervised learning
Mining Biological Data
KU EECS 800, Luke Huan, Fall’06
slide55
Results: MI and Seeding
Zero noise in seeds [Small-20 NewsGroup]
Semi-Supervised KMeans substantially better than unsupervised
KMeans
11/01/2006
Semi-supervised learning
Mining Biological Data
KU EECS 800, Luke Huan, Fall’06
slide56
Results: Objective function and Seeding
User-labeling consistent with KMeans assumptions
[Small-20
NewsGroup] Obj. function of data partition increases exponentially
with seed fraction
11/01/2006
Semi-supervised learning
Mining Biological Data
KU EECS 800, Luke Huan, Fall’06
slide57
Results: Objective Function and Seeding
User-labeling inconsistent with KMeans assumptions
[Yahoo! News] Objective function of constrained algorithms decreases
with seeding
11/01/2006
Semi-supervised learning
Mining Biological Data
KU EECS 800, Luke Huan, Fall’06
slide58
Similarity Based Methods
Questions: given a set of points and the class labels, can
we learn a distance matrix such that inter-cluster distance
are minimized and intra-cluster distance are maximized?
11/01/2006
Semi-supervised learning
Mining Biological Data
KU EECS 800, Luke Huan, Fall’06
slide59
Distance metric learning
Define a new distance measure of the form:
Linear transformation of the original data
11/01/2006
Semi-supervised learning
Mining Biological Data
KU EECS 800, Luke Huan, Fall’06
slide60
Distance metric learning
11/01/2006
Semi-supervised learning
Mining Biological Data
KU EECS 800, Luke Huan, Fall’06
slide61
Semi-Supervised Clustering Example
Similarity Based
11/01/2006
Semi-supervised learning
Mining Biological Data
KU EECS 800, Luke Huan, Fall’06
slide62
Semi-Supervised Clustering Example
Distances Transformed by Learned Metric
11/01/2006
Semi-supervised learning
Mining Biological Data
KU EECS 800, Luke Huan, Fall’06
slide63
Semi-Supervised Clustering Example
Clustering Result with Trained Metric
11/01/2006
Semi-supervised learning
Mining Biological Data
KU EECS 800, Luke Huan, Fall’06
slide64
Evaluation
Source:
11/01/2006
Semi-supervised learning
Mining
Biological
Data learning
E. Xing, et al.
Distance
metric
KU EECS 800, Luke Huan, Fall’06
slide65
Evaluation
Source: E. Xing, et al. Distance metric learning
11/01/2006
Semi-supervised learning
Mining Biological Data
KU EECS 800, Luke Huan, Fall’06
slide66
Additional Readings
Combining Similarity and Search-Based Semi-Supervised
Clustering “Comparing and Unifying Search-Based and
Similarity-Based Approaches to Semi-Supervised
Clustering”, Basu, et al.
Ontology based semi-supervised clustering “A framework
for ontology-driven subspace clustering”, Liu et al.
11/01/2006
Semi-supervised learning
Mining Biological Data
KU EECS 800, Luke Huan, Fall’06
slide67
Summary
Semi-supervised clustering
Its applications in Bioinformatics?
Not fully explored
11/01/2006
Semi-supervised learning
Mining Biological Data
KU EECS 800, Luke Huan, Fall’06
slide68
Reference
UT machine learning group
http://www.cs.utexas.edu/~ml/publication/unsupervised.html
Semi-supervised Clustering by Seeding
http://www.cs.utexas.edu/users/ml/papers/semi-icml-02.pdf
Constrained K-means clustering with background
knowledge
http://www.litech.org/~wkiri/Papers/wagstaff-kmeans-01.pdf
Some slides are from Jieping Ye at Arizona State
11/01/2006
Semi-supervised learning
Mining Biological Data
KU EECS 800, Luke Huan, Fall’06
slide69
Related documents