Databases and Data Mining
Lecture 3: Descriptive Data Mining
Peter van der Putten (putten_at_liacs.nl)
Course Outline
• Objective
– Understand the basics of data mining
– Gain understanding of the potential for applying it in the
bioinformatics domain
– Hands on experience
• Schedule
Date        Time           Room      Type
4-Nov-05    13.45 - 15.30  174       Lecture
18-Nov-05   13.45 - 15.30  413       Lecture
            15.45 - 17.30  306/308   Practical Assignments
25-Nov-05   13.45 - 15.30  413       Lecture
2-Dec-05    13.45 - 15.30  413       Lecture
            15.45 - 17.30  306/308   Practical Assignments
• Evaluation
– Practical assignment (2nd) plus take home exercise
• Website
– http://www.liacs.nl/~putten/edu/dbdm05/
Agenda Today:
Descriptive Data Mining
• Before Starting to Mine….
• Descriptive Data Mining
– Dimension Reduction & Projection
– Clustering
• Hierarchical clustering
• K-means
• Self organizing maps
– Association rules
• Frequent item sets
• Association Rules
• APRIORI
• Bio-informatics case: FSG for frequent subgraph discovery
Before starting to mine….
• Pima Indians Diabetes Data
  – X = body mass index
  – Y = age
Before starting to mine….
• Attribute Selection
– This example: InfoGain by Attribute
– Keep the most important ones
[Bar chart: InfoGain per attribute, x-axis from 0.00 to 0.20]
Diastolic blood pressure (mm Hg)
Diabetes pedigree function
Number of times pregnant
Triceps skin fold thickness (mm)
2-Hour serum insulin (mu U/ml)
Age (years)
Body mass index (weight in kg/(height in m)^2)
Plasma glucose concentration at 2 hours in an oral glucose tolerance test
Before starting to mine….
• Types of Attribute Selection
  – Univariate versus multivariate (subset selection)
    • The fact that attribute x is a strong univariate predictor does not necessarily mean it will add predictive power to a set of predictors already used by a model
  – Filter versus wrapper
    • Wrapper methods involve the subsequent learner (classifier or other)
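A minimal sketch of such a univariate attribute ranking, assuming the Pima Indians Diabetes data sits in a local CSV (the file name and "class" column are hypothetical) and using scikit-learn's mutual information score as a stand-in for Weka's InfoGain:

```python
# Sketch: univariate attribute ranking, analogous to the InfoGain ranking above.
# The CSV path and the "class" column name are assumptions for illustration.
import pandas as pd
from sklearn.feature_selection import mutual_info_classif

pima = pd.read_csv("pima_indians_diabetes.csv")   # hypothetical local copy
X = pima.drop(columns=["class"])                  # the 8 numeric attributes
y = pima["class"]                                 # diabetes yes/no

scores = mutual_info_classif(X, y, random_state=0)
for name, score in sorted(zip(X.columns, scores), key=lambda p: -p[1]):
    print(f"{score:.3f}  {name}")                 # keep the most important ones
```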
Dimension Reduction
• Projecting high dimensional data into a lower
dimension
– Principal Component Analysis
– Independent Component Analysis
– Fisher Mapping, Sammon's Mapping, etc.
– Multi Dimensional Scaling
• See Pattern Recognition Course (Duin)
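A minimal sketch of one such projection (PCA) with scikit-learn; the 8-dimensional matrix X here is random toy data standing in for a real dataset:

```python
# Sketch: project 8-dimensional data onto its first two principal components.
import numpy as np
from sklearn.decomposition import PCA

X = np.random.default_rng(0).normal(size=(100, 8))  # toy stand-in for real data

pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)                 # shape (100, 2): can now be plotted
print(pca.explained_variance_ratio_)        # variance captured per component
```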
Data Mining Tasks: Clustering
Clustering is the discovery of groups in a set of instances. Groups are different, instances in a group are similar. In 2 to 3 dimensional pattern space you could just visualise the data and leave the recognition to a human end user; in more than 3 dimensions this is not possible.
[Scatter plot: pattern space with axes f.e. weight and f.e. age]
Clustering Techniques
• Hierarchical algorithms
– Agglomerative
– Divisive
• Partition based clustering
– K-Means
– Self Organizing Maps / Kohonen Networks
• Probabilistic Model based
– Expectation Maximization / Mixture Models
Hierarchical clustering
• Agglomerative / Bottom up
  – Start with single-instance clusters
  – At each step, join the two closest clusters
  – Methods to compute the distance between clusters x and y: single linkage (distance between the closest points in clusters x and y), average linkage (average distance between all points), complete linkage (distance between the furthest points), centroid
  – Distance measure: Euclidean, correlation, etc.
• Divisive / Top Down
  – Start with all data in one cluster
  – Split into two clusters based on category utility
  – Proceed recursively on each subset
• Both methods produce a dendrogram
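A minimal sketch of agglomerative clustering with SciPy, using the average-linkage and correlation-distance options mentioned above; the data matrix is random toy data rather than the yeast expression data of the next example:

```python
# Sketch: agglomerative clustering with average linkage and correlation distance.
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster
from scipy.spatial.distance import pdist

X = np.random.default_rng(0).normal(size=(30, 10))  # e.g. 30 genes x 10 conditions

dist = pdist(X, metric="correlation")                # 1 - Pearson correlation
Z = linkage(dist, method="average")                  # 'single', 'complete',
                                                     # 'centroid' also possible
labels = fcluster(Z, t=4, criterion="maxclust")      # cut the tree into 4 clusters
dendrogram(Z)                                        # the resulting dendrogram
plt.show()
```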
Levels of Clustering
[Figure: dendrogram levels; agglomerative proceeds bottom-up, divisive top-down]
Dunham, 2003
Hierarchical Clustering Example
• Clustering Microarray Gene Expression Data
– Gene expression measured using microarrays, studied under a variety of conditions
– On budding yeast Saccharomyces cerevisiae
– Efficiently groups together genes of known similar function
• Data taken from: Cluster analysis and display of genome-wide expression patterns. Eisen, M., Spellman, P., Brown, P., and Botstein, D. (1998). PNAS, 95:14863-14868; Picture generated with J-Express Pro
Hierarchical Clustering Example
• Method
– Genes are the instances, samples the attributes!
– Agglomerative
– Distance measure = correlation
• Data taken from: Cluster analysis and display of genome-wide expression patterns. Eisen, M., Spellman, P., Brown, P., and Botstein, D. (1998). PNAS, 95:14863-14868; Picture generated with J-Express Pro
Simple Clustering: K-means
• Pick a number (k) of cluster centers (at random)
• Assign every item to its nearest cluster center
  – F.i. Euclidean distance
• Move each cluster center to the mean of its assigned items
• Repeat until convergence
  – Change in cluster assignments less than a threshold
• Cluster centers are sometimes called codes, and the k codes a codebook
KDnuggets
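A minimal sketch of this loop in plain NumPy; the data is random toy data and the parameters are illustrative only:

```python
# Sketch: k-means as described above — random codes, assign, move to means, repeat.
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    codes = X[rng.choice(len(X), size=k, replace=False)]   # random initial codebook
    assign = None
    for _ in range(n_iter):
        # Euclidean distance from every item to every code
        d = np.linalg.norm(X[:, None, :] - codes[None, :, :], axis=2)
        new_assign = d.argmin(axis=1)                       # nearest cluster center
        if assign is not None and np.array_equal(new_assign, assign):
            break                                           # assignments stopped changing
        assign = new_assign
        for j in range(k):                                  # move codes to cluster means
            members = X[assign == j]
            if len(members):
                codes[j] = members.mean(axis=0)
    return codes, assign

X = np.random.default_rng(1).normal(size=(200, 2))
codes, assign = kmeans(X, k=3)
```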
K-means example, step 1
Initially distribute codes randomly in pattern space
[Figure: codes k1, k2, k3 in X-Y pattern space]
KDnuggets

K-means example, step 2
Assign each point to the closest code
[Figure: codes k1, k2, k3 in X-Y pattern space]
KDnuggets

K-means example, step 3
Move each code to the mean of all its assigned points
[Figure: codes k1, k2, k3 in X-Y pattern space]
KDnuggets

K-means example, step 2
Repeat the process – reassign the data points to the codes
Q: Which points are reassigned?
[Figure: codes k1, k2, k3 in X-Y pattern space]
KDnuggets

K-means example
Repeat the process – reassign the data points to the codes
Q: Which points are reassigned?
[Figure: codes k1, k2, k3 in X-Y pattern space]
KDnuggets

K-means example
Re-compute cluster means
[Figure: codes k1, k2, k3 in X-Y pattern space]
KDnuggets

K-means example
Move cluster centers to cluster means
[Figure: codes k1, k2, k3 in X-Y pattern space]
KDnuggets
K-means clustering summary
Advantages
• Simple, understandable
• Items automatically assigned to clusters
Disadvantages
• Must pick the number of clusters beforehand
• All items are forced into a cluster
• Sensitive to outliers
Extensions
• Adaptive k-means
• K-medoids (based on the median instead of the mean)
  – 1, 2, 3, 4, 100 → average 22, median 3
Biological Example
• Clustering of yeast cell images
  – Two clusters are found
  – Left cluster: primarily cells with a thick capsule; right cluster: thin capsule
    • Caused by the media; a proxy for sick vs. healthy
Self Organizing Maps
(Kohonen Maps)
• Claim to fame
  – Simplified models of cortical maps in the brain
  – Things that are near in the outside world link to areas that are near in the cortex
  – For a variety of modalities: touch, motor, … up to echolocation
  – Nice visualization
• From a data mining perspective:
  – SOMs are simple extensions of k-means clustering
  – Codes are connected in a lattice
  – In each iteration, codes neighboring the winning code in the lattice are also allowed to move
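A minimal sketch of that idea in plain NumPy: a k-means-style codebook laid out on a lattice, where the winning code and its lattice neighbours move toward each presented instance. Grid size, learning rate and neighbourhood width are illustrative assumptions, not values from the slides:

```python
# Sketch: SOM training loop — find the best matching unit (BMU), then move the
# BMU and its lattice neighbours toward the input, with shrinking neighbourhood.
import numpy as np

def train_som(X, grid=(10, 10), n_iter=2000, lr=0.5, sigma=2.0, seed=0):
    rng = np.random.default_rng(seed)
    rows, cols = grid
    codes = rng.normal(size=(rows * cols, X.shape[1]))        # codebook vectors
    # lattice coordinates of each code, used for the neighbourhood function
    coords = np.array([(r, c) for r in range(rows) for c in range(cols)], float)
    for t in range(n_iter):
        x = X[rng.integers(len(X))]                           # pick a random instance
        winner = np.argmin(np.linalg.norm(codes - x, axis=1)) # best matching unit
        # Gaussian neighbourhood on the lattice, shrinking over time
        s = sigma * (1 - t / n_iter) + 0.1
        h = np.exp(-np.sum((coords - coords[winner]) ** 2, axis=1) / (2 * s * s))
        a = lr * (1 - t / n_iter)
        codes += a * h[:, None] * (x - codes)                 # move winner and neighbours
    return codes.reshape(rows, cols, -1)

X = np.random.default_rng(1).normal(size=(500, 2))            # toy Gaussian cloud
som = train_som(X)
```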
SOM
[Figures: a 10x10 SOM fitted to data drawn from a Gaussian distribution]
SOM example
Famous example:
Phonetic Typewriter
• The SOM lattice (below left) is trained on spoken letters; after convergence the codes are labeled
• Creates a 'phonotopic' map
• A spoken word creates a sequence of labels
Famous example:
Phonetic Typewriter
• Criticism
  – The topology-preserving property is not used, so why use SOMs and not, for instance, adaptive k-means?
    • K-means could also create a sequence
    • This is true for most SOM applications!
  – Is using clustering for classification optimal?
Bioinformatics Example
Clustering GPCRs
• Clustering G Protein Coupled Receptors (GPCRs)
[Samsanova et al, 2003, 2004]
• Important drug target, function often unknown
Bioinformatics Example
Clustering GPCRs
Association Rules Outline
• What are frequent item sets &
association rules?
• Quality measures
– support, confidence, lift
• How to find item sets efficiently?
– APRIORI
• How to generate association rules
from an item set?
• Biological examples
KDnuggets
Market Basket Example
Gene Expression Example

Market basket transactions:

  TID  Produce
  1    MILK, BREAD, EGGS
  2    BREAD, SUGAR
  3    BREAD, CEREAL
  4    MILK, BREAD, SUGAR
  5    MILK, CEREAL
  6    BREAD, CEREAL
  7    MILK, CEREAL
  8    MILK, BREAD, CEREAL, EGGS
  9    MILK, BREAD, CEREAL

• Frequent item set
  – {MILK, BREAD}: occurs 4 times
• Association rule
  – {MILK, BREAD} ⇒ {EGGS}
  – Frequency / importance = 2 ('Support')
  – Quality = 50% ('Confidence')

• What genes are expressed ('active') together?
  – Interaction / regulation
  – Similar function

Gene expression transactions:

  ID   Expressed Genes in Sample
  1    GENE1, GENE2, GENE5
  2    GENE1, GENE3, GENE5
  3    GENE2
  4    GENE8, GENE9
  5    GENE8, GENE9, GENE10
  6    GENE2, GENE8
  7    GENE9, GENE10
  8    GENE2
  9    GENE11
Association Rule Definitions
• Set of items: I = {I1, I2, …, Im}
• Transactions: D = {t1, t2, …, tn}, tj ⊆ I
• Itemset: {Ii1, Ii2, …, Iik} ⊆ I
• Support of an itemset: percentage of transactions which contain that itemset
• Large (frequent) itemset: itemset whose number of occurrences is above a threshold
Dunham, 2003
Frequent Item Set Example
I = {Beer, Bread, Jelly, Milk, PeanutButter}
Support of {Bread, PeanutButter} is 60%
Dunham, 2003
Association Rule Definitions
• Association Rule (AR): implication X ⇒ Y, where X, Y ⊆ I and X, Y are disjoint
• Support of AR (s) X ⇒ Y: percentage of transactions that contain X ∪ Y
• Confidence of AR (α) X ⇒ Y: ratio of the number of transactions that contain X ∪ Y to the number that contain X
Dunham, 2003
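A minimal sketch of these definitions in Python, computed on the market basket transactions from the earlier slide:

```python
# Sketch: support as a fraction of transactions, confidence as
# support(X ∪ Y) / support(X), on the market basket transactions above.
transactions = [
    {"MILK", "BREAD", "EGGS"}, {"BREAD", "SUGAR"}, {"BREAD", "CEREAL"},
    {"MILK", "BREAD", "SUGAR"}, {"MILK", "CEREAL"}, {"BREAD", "CEREAL"},
    {"MILK", "CEREAL"}, {"MILK", "BREAD", "CEREAL", "EGGS"},
    {"MILK", "BREAD", "CEREAL"},
]

def support(itemset):
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(X, Y):
    return support(X | Y) / support(X)

print(support({"MILK", "BREAD"}))                 # 4/9 ≈ 0.44
print(confidence({"MILK", "BREAD"}, {"EGGS"}))    # 2/4 = 0.50
```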
Association Rules Ex (cont’d)
Dunham, 2003
Association Rule Problem
• Given a set of items I = {I1, I2, …, Im} and a database of transactions D = {t1, t2, …, tn} where ti = {Ii1, Ii2, …, Iik} and Iij ∈ I, the Association Rule Problem is to identify all association rules X ⇒ Y with a minimum support and confidence.
• NOTE: The support of X ⇒ Y is the same as the support of X ∪ Y.
Dunham, 2003
Association Rules Example
• Q: Given frequent set {A,B,E}, what association rules have minsup = 2 and minconf = 50%?

Transactions:

  TID  List of items
  1    A, B, E
  2    B, D
  3    B, C
  4    A, B, D
  5    A, C
  6    B, C
  7    A, C
  8    A, B, C, E
  9    A, B, C

Qualify:
  A, B => E : conf = 2/4 = 50%
  A, E => B : conf = 2/2 = 100%
  B, E => A : conf = 2/2 = 100%
  E => A, B : conf = 2/2 = 100%

Don't qualify:
  A => B, E : conf = 2/6 = 33% < 50%
  B => A, E : conf = 2/7 = 28% < 50%
  __ => A, B, E : conf = 2/9 = 22% < 50%

KDnuggets
Solution Association Rule Problem
• First, find all frequent itemsets with sup ≥ minsup
  – Exhaustive search won't work
    • Assume we have a set of m items → 2^m subsets!
  – Exploit the subset property (APRIORI algorithm)
• For every frequent item set, derive rules with confidence ≥ minconf
KDnuggets
Finding itemsets: next level
• Apriori algorithm (Agrawal & Srikant)
• Idea: use one-item sets to generate two-item sets, two-item sets to generate three-item sets, …
  – Subset property: if (A B) is a frequent item set, then (A) and (B) have to be frequent item sets as well!
  – In general: if X is a frequent k-item set, then all (k-1)-item subsets of X are also frequent
  ⇒ Compute k-item sets by merging (k-1)-item sets
KDnuggets
An example
• Given: five three-item sets
  (A B C), (A B D), (A C D), (A C E), (B C D)
• Candidate four-item sets:
  (A B C D)   Q: OK?   A: Yes, because all 3-item subsets are frequent
  (A C D E)   Q: OK?   A: No, because (C D E) is not frequent
KDnuggets
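A minimal sketch of this level-wise APRIORI search in Python, run on the transactions from the earlier association-rules example; minsup is an absolute count here:

```python
# Sketch: Apriori — build frequent k-item sets by merging frequent (k-1)-item
# sets and prune candidates with the subset property.
from itertools import combinations

def apriori(transactions, minsup):
    items = {i for t in transactions for i in t}
    def frequent_only(candidates):
        return {c for c in candidates
                if sum(c <= t for t in transactions) >= minsup}
    level = frequent_only({frozenset([i]) for i in items})   # frequent 1-item sets
    all_frequent = set(level)
    while level:
        # merge pairs of frequent (k-1)-item sets into k-item candidates
        candidates = {a | b for a in level for b in level if len(a | b) == len(a) + 1}
        # prune: every (k-1)-item subset of a candidate must itself be frequent
        candidates = {c for c in candidates
                      if all(frozenset(s) in level for s in combinations(c, len(c) - 1))}
        level = frequent_only(candidates)
        all_frequent |= level
    return all_frequent

transactions = [set("ABE"), set("BD"), set("BC"), set("ABD"), set("AC"),
                set("BC"), set("AC"), set("ABCE"), set("ABC")]
for s in sorted(apriori(transactions, minsup=2), key=lambda s: (len(s), sorted(s))):
    print(set(s))
```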
From Frequent Itemsets to
Association Rules
• Q: Given frequent set {A,B,E}, what are possible association rules?
  – A => B, E
  – A, B => E
  – A, E => B
  – B => A, E
  – B, E => A
  – E => A, B
  – __ => A, B, E (empty rule), or true => A, B, E
KDnuggets
Example:
Generating Rules from an Itemset
• Frequent itemset from golf data:
  Humidity = Normal, Windy = False, Play = Yes (4)
• Seven potential rules (with confidence):
  If Humidity = Normal and Windy = False then Play = Yes              4/4
  If Humidity = Normal and Play = Yes then Windy = False              4/6
  If Windy = False and Play = Yes then Humidity = Normal              4/6
  If Humidity = Normal then Windy = False and Play = Yes              4/7
  If Windy = False then Humidity = Normal and Play = Yes              4/8
  If Play = Yes then Humidity = Normal and Windy = False              4/9
  If True then Humidity = Normal and Windy = False and Play = Yes     4/12
KDnuggets
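A minimal sketch of this rule-generation step in Python. It reuses the {A,B,E} transactions from the earlier example rather than the golf data, since the golf table itself is not included in the transcript:

```python
# Sketch: for one frequent itemset, enumerate every antecedent => consequent
# split and keep the rules whose confidence reaches minconf.
from itertools import combinations

def rules_from_itemset(itemset, transactions, minconf=0.5):
    count = lambda s: sum(s <= t for t in transactions)    # absolute support
    n_full = count(itemset)
    rules = []
    for r in range(len(itemset)):                          # r = antecedent size
        for ante in map(frozenset, combinations(itemset, r)):
            conf = n_full / count(ante)                    # count({}) = all transactions
            if conf >= minconf:
                rules.append((sorted(ante), sorted(itemset - ante), conf))
    return rules

transactions = [set("ABE"), set("BD"), set("BC"), set("ABD"), set("AC"),
                set("BC"), set("AC"), set("ABCE"), set("ABC")]
for ante, cons, conf in rules_from_itemset(frozenset("ABE"), transactions):
    print(ante, "=>", cons, f"conf={conf:.0%}")
```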
Example:
Generating Rules
• Rules with support > 1 and confidence = 100%:

      Association rule                                       Sup.  Conf.
  1   Humidity=Normal, Windy=False ⇒ Play=Yes                 4    100%
  2   Temperature=Cool ⇒ Humidity=Normal                      4    100%
  3   Outlook=Overcast ⇒ Play=Yes                             4    100%
  4   Temperature=Cool, Play=Yes ⇒ Humidity=Normal            3    100%
  ...                                                        ...    ...
  58  Outlook=Sunny, Temperature=Hot ⇒ Humidity=High          2    100%

• In total: 3 rules with support four, 5 with support three, and 50 with support two
KDnuggets
Weka associations: output
KDnuggets
Extensions and Challenges
• Extra quality measure: Lift
  – The lift of an association rule I ⇒ J is defined as:
    • lift = P(J|I) / P(J)
    • Note: P(I) = (support of I) / (no. of transactions)
    • Ratio of confidence to expected confidence
  – Interpretation:
    • If lift > 1, then I and J are positively correlated
    • If lift < 1, then I and J are negatively correlated
    • If lift = 1, then I and J are independent
• Other measures for interestingness
  – A ⇒ B, B ⇒ C, but not A ⇒ C
• Efficient algorithms
• Known problem
  – What to do with all these rules? How to exploit / make them useful / actionable?
KDnuggets
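A minimal sketch of the lift computation, again on the market basket transactions used earlier:

```python
# Sketch: lift(I => J) = confidence(I => J) / support(J) = P(J|I) / P(J).
transactions = [
    {"MILK", "BREAD", "EGGS"}, {"BREAD", "SUGAR"}, {"BREAD", "CEREAL"},
    {"MILK", "BREAD", "SUGAR"}, {"MILK", "CEREAL"}, {"BREAD", "CEREAL"},
    {"MILK", "CEREAL"}, {"MILK", "BREAD", "CEREAL", "EGGS"},
    {"MILK", "BREAD", "CEREAL"},
]

def support(s):
    return sum(s <= t for t in transactions) / len(transactions)

def lift(I, J):
    return support(I | J) / (support(I) * support(J))

# EGGS appears in 2 of 9 baskets overall, but in 2 of the 4 {MILK, BREAD}
# baskets, so {MILK, BREAD} => {EGGS} has lift 0.5 / (2/9) = 2.25 > 1.
print(lift({"MILK", "BREAD"}, {"EGGS"}))
```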
Biomedical Application
Head and Neck Cancer Example
1. ace27=0 fiveyr=alive 381 ⇒ tumorbefore=0 372   conf:(0.98)
2. gender=M ace27=0 467 ⇒ tumorbefore=0 455   conf:(0.97)
3. ace27=0 588 ⇒ tumorbefore=0 572   conf:(0.97)
4. tnm=T0N0M0 ace27=0 405 ⇒ tumorbefore=0 391   conf:(0.97)
5. loc=LOC7 tumorbefore=0 409 ⇒ tnm=T0N0M0 391   conf:(0.96)
6. loc=LOC7 442 ⇒ tnm=T0N0M0 422   conf:(0.95)
7. loc=LOC7 gender=M tumorbefore=0 374 ⇒ tnm=T0N0M0 357   conf:(0.95)
8. loc=LOC7 gender=M 406 ⇒ tnm=T0N0M0 387   conf:(0.95)
9. gender=M fiveyr=alive 633 ⇒ tumorbefore=0 595   conf:(0.94)
10. fiveyr=alive 778 ⇒ tumorbefore=0 726   conf:(0.93)
Bioinformatics Application
• The idea of association rules has been customized for bioinformatics applications
• In biology it is often interesting to find frequent
structures rather than items
– For instance protein or other chemical structures
• Solution: Mining Frequent Patterns
– FSG (Kuramochi and Karypis, ICDM 2001)
– gSpan (Yan and Han, ICDM 2002)
– CloseGraph (Yan and Han, KDD 2002)
FSG: Mining Frequent Patterns
FSG: Mining Frequent Patterns
FSG Algorithm
for finding frequent subgraphs
Frequent Subgraph Examples
AIDS Data
• Compounds are active, inactive or moderately active (CA, CI, CM)
Predictive Subgraphs
• The three most discriminating sub-structures for
the PTC, AIDS, and Anthrax datasets
FSG References
• Frequent Sub-structure Based Approaches for Classifying Chemical Compounds. Mukund Deshpande, Michihiro Kuramochi, and George Karypis. ICDM 2003.
• An Efficient Algorithm for Discovering Frequent Subgraphs. Michihiro Kuramochi and George Karypis. IEEE TKDE.
• Automated Approaches for Classifying Structures. Mukund Deshpande, Michihiro Kuramochi, and George Karypis. BIOKDD 2002.
• Discovering Frequent Geometric Subgraphs. Michihiro Kuramochi and George Karypis. ICDM 2002.
• Frequent Subgraph Discovery. Michihiro Kuramochi and George Karypis. 1st IEEE Conference on Data Mining (ICDM) 2001.
Recap
• Before Starting to Mine….
• Descriptive Data Mining
– Dimension Reduction & Projection
– Clustering
• Hierarchical clustering
• K-means
• Self organizing maps
– Association rules
• Frequent item sets
• Association Rules
• APRIORI
• Bio-informatics case: FSG for frequent subgraph discovery
• Next week
– Bioinformatics Data Mining Cases / Lab Session / Take Home
Exercise