Course Outline
Introduction
Data warehousing and OLAP
Data preprocessing for mining and warehousing
Concept description: characterization and
discrimination
Classification and prediction
Association analysis
Clustering analysis
Mining complex data and advanced mining
techniques
Trends and research issues
Data Mining and Warehousing: Session 7
Clustering Analysis
Clustering analysis
What is Clustering Analysis?
Clustering in Data Mining Applications
Handling Different Types of Variables
Major Clustering Techniques
Outlier Discovery
Problems and Challenges
What Is Clustering ?
Clustering is a process of partitioning a set of data (or objects)
into a set of meaningful sub-classes, called clusters.
May help users understand the natural grouping or
structure in a data set.
Cluster: a collection of data objects that are “similar” to one
another and thus can be treated collectively as one group.
Clustering: unsupervised classification: no predefined classes.
Used either as a stand-alone tool to get insight into data
distribution or as a preprocessing step for other algorithms.
What Is Good Clustering?
A good clustering method will produce high quality
clusters in which:
the intra-class (that is, intra-cluster) similarity is high.
the inter-class similarity is low.
The quality of a clustering result also depends on both the
similarity measure used by the method and its
implementation.
The quality of a clustering method is also measured by its
ability to discover some or all of the hidden patterns.
Requirements of Clustering in Data Mining
Scalability
Dealing with different types of attributes
Discovery of clusters with arbitrary shape
Able to deal with noise and outliers
Insensitive to order of input records
High dimensionality
Interpretability and usability.
Clustering analysis
What is Clustering Analysis?
Clustering in Data Mining Applications
Handling Different Types of Variables
Major Clustering Techniques
Outlier Discovery
Problems and Challenges
Applications of Clustering
Clustering has wide applications in
Pattern Recognition
Spatial Data Analysis:
– create thematic maps in GIS by clustering feature spaces
– detect spatial clusters and explain them in spatial data mining.
Image Processing
Economic Science (especially market research)
WWW:
– Document classification
– Cluster Weblog data to discover groups of similar access patterns
Examples of Clustering Applications
Marketing: Help marketers discover distinct groups in their
customer bases, and then use this knowledge to develop
targeted marketing programs.
Land use: Identification of areas of similar land use in an
earth observation database.
Insurance: Identifying groups of motor insurance policy
holders with a high average claim cost.
City-planning: Identifying groups of houses according to
their house type, value, and geographical location.
Clustering analysis
What is Clustering Analysis?
Clustering in Data Mining Applications
Handling Different Types of Variables
Major Clustering Techniques
Outlier Discovery
Problems and Challenges
Similarity and Dissimilarity Between Objects
Distances are normally used to measure the similarity or
dissimilarity between two data objects.
Some popular ones include: Minkowski distance:
d(i, j) = (|x_i1 - x_j1|^q + |x_i2 - x_j2|^q + ... + |x_ip - x_jp|^q)^(1/q)
where i = (x_i1, x_i2, ..., x_ip) and j = (x_j1, x_j2, ..., x_jp) are two p-dimensional
data objects, and q is a positive integer.
If q = 1, d is the Manhattan distance:
d(i, j) = |x_i1 - x_j1| + |x_i2 - x_j2| + ... + |x_ip - x_jp|
If q = 2, d is the Euclidean distance:
d(i, j) = sqrt(|x_i1 - x_j1|^2 + |x_i2 - x_j2|^2 + ... + |x_ip - x_jp|^2)
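For concreteness, a minimal Python sketch of these distances; the helper name and toy points are illustrative only, not part of the original slides.

def minkowski(x, y, q):
    """Minkowski distance between two p-dimensional points x and y."""
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1.0 / q)

i = (1.0, 2.0, 3.0)
j = (4.0, 0.0, 3.0)

print(minkowski(i, j, 1))   # Manhattan distance: |1-4| + |2-0| + |3-3| = 5.0
print(minkowski(i, j, 2))   # Euclidean distance: sqrt(9 + 4 + 0) ~= 3.61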
Measure Similarity
The definitions of distance functions are usually very
different for interval-scaled, boolean, categorical, ordinal
and ratio variables.
Values should be scaled (normalized to 0-1)
Weights should be associated with different variables based
on applications and data semantics.
It is hard to define “similar enough” or “good enough”; the answer is typically highly subjective.
Binary, Nominal, Continuous variables
Binary variable: d = 0 if x = y; d = 1 otherwise
Nominal variables: > 2 states, e.g., red, yellow, blue, green.
Simple matching: d(i, j) = (p - u) / p, where u: # of matches, p: total # of variables.
Also, one can use a large number of binary variables.
Continuous variables: d = |x - y| (see the sketch below)
Scaling and normalization
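A minimal Python sketch of these per-variable dissimilarities (simple matching for nominal variables and a min-max-scaled absolute difference for continuous ones); all names and toy values are illustrative.

def nominal_dissimilarity(i, j):
    """Simple matching: d(i, j) = (p - u) / p, where u is the number of
    variables on which objects i and j match and p is the total number."""
    p = len(i)
    u = sum(1 for a, b in zip(i, j) if a == b)
    return (p - u) / p

def continuous_dissimilarity(x, y, lo, hi):
    """|x - y| after min-max scaling the variable to the [0, 1] range."""
    return abs(x - y) / (hi - lo)

print(nominal_dissimilarity(("red", "small", "round"),
                            ("red", "large", "round")))                # 1/3
print(continuous_dissimilarity(30_000, 45_000, lo=20_000, hi=120_000))  # 0.15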
Clustering analysis
What is Clustering Analysis?
Clustering in Data Mining Applications
Handling Different Types of Variables
Major Clustering Techniques
Outlier Discovery
Problems and Challenges
Five Categories of Clustering Methods
Partitioning algorithms: Construct various partitions and
then evaluate them by some criterion.
Hierarchy algorithms: Create a hierarchical decomposition
of the set of data (or objects) using some criterion.
Density-based: based on connectivity and density functions
Grid-based: based on a multiple-level granularity structure
Model-based: A model is hypothesized for each of the
clusters, and the idea is to find the best fit of the data
to the given model.
Partitioning Algorithms: Basic Concept
Partitioning method: Construct a partition of a database D
of n objects into a set of k clusters
Given a k, find a partition of k clusters that optimizes the
chosen partitioning criterion.
Global optimal: exhaustively enumerate all partitions.
Heuristic methods: k-means and k-medoids algorithms.
k-means (MacQueen’67): Each cluster is represented by the center
of the cluster
k-medoids or PAM (Partition around medoids) (Kaufman &
Rousseeuw’87): Each cluster is represented by one of the objects in
the cluster.
The K-Means Clustering Method
Given k, the k-means algorithm is implemented in 4 steps:
Partition objects into k nonempty subsets
Compute seed points as the centroids of the clusters of
the current partition. The centroid is the center (mean
point) of the cluster.
Assign each object to the cluster with the nearest seed
point.
Go back to Step 2; stop when no new assignments are made (a Python sketch follows the figure below).
[Figure: 2-D scatter plots (axes 0-10) illustrating successive k-means iterations: an initial partition of the points, recomputation of the cluster centroids, and reassignment of each point to the nearest centroid until the assignment stabilizes.]
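A compact Python/numpy sketch of these steps, assuming Euclidean distance; the initialization scheme, toy data, and function name are illustrative, not part of the original slide.

import numpy as np

def k_means(X, k, max_iter=100, seed=0):
    """Plain k-means: assign each point to the nearest centroid, recompute
    centroids as cluster means, repeat until assignments stop changing."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]   # initial seed points
    labels = None
    for _ in range(max_iter):
        # assign each object to the cluster with the nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break                                              # no more new assignments
        labels = new_labels
        # recompute each centroid as the mean point of its cluster
        for c in range(k):
            if np.any(labels == c):
                centroids[c] = X[labels == c].mean(axis=0)
    return labels, centroids

X = np.array([[1.0, 1.0], [1.5, 2.0], [8.0, 8.0], [9.0, 9.5], [1.0, 0.5]])
labels, centroids = k_means(X, k=2)
print(labels, centroids)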
Comments on the K-Means Method
Strength of the k-means:
Relatively efficient: O(tkn), where n is # of objects, k is # of
clusters, and t is # of iterations. Normally, k, t << n.
Often terminates at a local optimum.
Weakness of the k-means:
Applicable only when mean is defined, then what about
categorical data?
Need to specify k, the number of clusters, in advance.
Unable to handle noisy data and outliers.
Not suitable to discover clusters with non-convex shapes.
The K-Medoids Clustering Method
Find representative objects, called medoids, in clusters
To achieve this goal, only a definition of distance between
any two objects is needed.
PAM (Partitioning Around Medoids, 1987)
starts from an initial set of medoids and iteratively
replaces one of the medoids by one of the non-medoids if
it improves the total distance of the resulting clustering.
PAM works effectively for small data sets, but does not
scale well for large data sets.
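A simplified Python sketch of the PAM swap idea; it naively re-evaluates the total distance after each candidate swap rather than using PAM's incremental cost computation, and all names and data are illustrative.

import numpy as np
from itertools import product

def total_cost(X, medoids):
    """Sum over all objects of the distance to the nearest medoid."""
    d = np.linalg.norm(X[:, None, :] - X[medoids][None, :, :], axis=2)
    return d.min(axis=1).sum()

def pam(X, k, seed=0):
    rng = np.random.default_rng(seed)
    medoids = list(rng.choice(len(X), size=k, replace=False))  # initial medoids
    improved = True
    while improved:
        improved = False
        # try replacing each medoid by each non-medoid object
        for m, o in product(range(k), range(len(X))):
            if o in medoids:
                continue
            candidate = medoids.copy()
            candidate[m] = o
            if total_cost(X, candidate) < total_cost(X, medoids):
                medoids, improved = candidate, True            # keep the improving swap
    return medoids

X = np.array([[1.0, 1.0], [1.5, 2.0], [8.0, 8.0], [9.0, 9.5], [1.0, 0.5]])
print(pam(X, k=2))   # indices of the chosen representative objects (medoids)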
Two Types of Hierarchical Clustering Algorithms
Agglomerative (bottom-up): merge clusters iteratively.
start by placing each object in its own cluster
merge these atomic clusters into larger and larger clusters
until all objects are in a single cluster.
Most hierarchical methods belong to this category. They
differ only in their definition of between-cluster similarity.
Divisive (top-down): split a cluster iteratively.
It does the reverse by starting with all objects in one cluster
and subdividing them into smaller pieces.
Divisive methods are less commonly available and have
rarely been applied.
Hierarchical Clustering
Use distance matrix as clustering criteria. This method
does not require the number of clusters k as an input, but
needs a termination condition.
[Figure: five objects a, b, c, d, e. Agglomerative clustering (AGNES) proceeds bottom-up over steps 0-4, merging a and b into ab, d and e into de, then de and c into cde, and finally ab and cde into abcde. Divisive clustering (DIANA) runs the same hierarchy in reverse, from abcde back to the single objects.]
More on Hierarchical Clustering Methods
between-cluster similarity
Minimal distance
Maximal distance
Center distance
Major weakness of agglomerative clustering methods:
do not scale well: time complexity of at least O(n²), where
n is the total number of objects
can never undo what was done previously.
Integration of hierarchical clustering with distance-based
methods: BIRCH, CURE.
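For illustration, the between-cluster distances above correspond roughly to the 'single', 'complete', and 'centroid' linkage options of SciPy's agglomerative clustering; a hedged sketch with toy data and an assumed cut threshold:

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[1.0, 1.0], [1.2, 1.1], [5.0, 5.0], [5.1, 4.9], [9.0, 0.5]])

# 'single'  ~ minimal distance, 'complete' ~ maximal distance,
# 'centroid' ~ center distance between clusters
Z = linkage(X, method='single')

# cut the dendrogram so that merges above distance 2.0 are not performed
labels = fcluster(Z, t=2.0, criterion='distance')
print(labels)   # three clusters: the two tight pairs and the isolated point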
Clustering analysis
What is Clustering Analysis?
Clustering in Data Mining Applications
Handling Different Types of Variables
Major Clustering Techniques
Outlier Discovery
Problems and Challenges
What Is Outlier Discovery?
What are outliers?
A set of objects that are considerably dissimilar from the
remainder of the data
Example: Sports: Michael Jordan, Wayne Gretzky, ...
Problem
Given: Data points
Find top n outlier points
Applications:
Credit card fraud detection
Telecom fraud detection
Customer segmentation
Medical analysis
Outlier Discovery Methods
Distance-based vs. statistics-based outlier analysis:
Most outlier analyses are univariate (single-variable) and
distribution-based (but how do we know the data follows a
normal or gamma distribution?)
We need multi-dimensional analysis without prior knowledge
of the data distribution.
Distance-based outlier:
An object O in a dataset T is a DB(p, D)-outlier if at least
a fraction p of the objects in T lie at a distance greater
than D from O.
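A brute-force Python sketch of this DB(p, D)-outlier test (O(n^2) distance computations; the function name, data, and thresholds are illustrative):

import numpy as np

def db_outliers(X, p, D):
    """Return indices of objects O such that at least a fraction p of the
    other objects lie at a distance greater than D from O."""
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    n = len(X)
    outliers = []
    for i in range(n):
        far = np.sum(dists[i] > D) / (n - 1)   # exclude O itself (distance 0)
        if far >= p:
            outliers.append(i)
    return outliers

X = np.array([[1.0, 1.0], [1.1, 0.9], [0.9, 1.2], [10.0, 10.0]])
print(db_outliers(X, p=0.9, D=3.0))   # -> [3], the isolated point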
Clustering analysis
What is Clustering Analysis?
Clustering in Data Mining Applications
Handling Different Types of Variables
Major Clustering Techniques
Outlier Discovery
Problems and Challenges
Problems and Challenges
Considerable progress has been made in scalable clustering
methods:
Partitioning: k-means, k-medoids, CLARANS
Hierarchical: BIRCH, CURE
Density-based: DBSCAN, CLIQUE, OPTICS
Grid-based: STING, WaveCluster.
Model-based: Autoclass, Denclue, Cobweb.
Current clustering techniques do not address all the
requirements adequately.
Constraint-based clustering analysis: constraints exist in
the data space (e.g., bridges and highways) or in user queries.
Data Mining and Data Warehousing
Introduction
Data warehousing and OLAP
Data preprocessing for mining and warehousing
Concept description: characterization and
discrimination
Classification and prediction
Association analysis
Clustering analysis
Mining complex data and advanced mining
techniques
Trends and research issues
Data Mining and Warehousing: Session 6
Association Analysis
Session 6: Association Analysis
What is association analysis?
Mining single-dimensional Boolean association
rules in transactional databases
Mining multi-level association rules
What Is Association Mining?
Association rule mining: finding association, correlation, or causal structures
among sets of items or objects in transaction databases,
relational databases, and other information repositories.
Applications: basket data analysis, cross-marketing, catalog design,
loss-leader analysis, clustering, classification, etc.
Examples. Rule form: “Body ⇒ Head [support, confidence]”.
buys(x, “diapers”) ⇒ buys(x, “beers”) [0.5%, 60%]
major(x, “CS”) ^ takes(x, “DB”) ⇒ grade(x, “A”) [1%, 75%]
Session 6: Association Analysis
What is association analysis?
Mining single-dimensional Boolean association
rules in transactional databases
Mining multi-level association rules
What Is an Association Rule?
Given
A database of customer transactions
Each transaction is a list of items (purchased by a
customer in a visit)
Find all rules that correlate the presence of one set of items
with that of another set of items
Example: 98% of people who purchase tires and auto
accessories also get automotive services done
Any number of items in the consequent/antecedent of rule
Possible to specify constraints on rules (e.g., find only rules
involving Home Laundry Appliances).
Application Examples
Market Basket Analysis
* ⇒ Maintenance Agreement
(What should the store do to boost Maintenance Agreement sales?)
Home Electronics ⇒ *
(What other products should the store stock up on if it has a sale on Home Electronics?)
Attached mailing in direct marketing
Detecting “ping-pong”ing of patients
transaction: patient
item: doctor/clinic visited by a patient
support of a rule: number of common patients
Rule Measures: Support and Confidence
Find all the rules X & Y ⇒ Z with minimum confidence and support
support, s: probability that a transaction contains {X, Y, Z}
confidence, c: conditional probability that a transaction having {X, Y} also contains Z.

[Figure: Venn diagram of customers who buy beer, customers who buy diapers, and customers who buy both.]

Transaction ID   Items Bought
2000             A, B, C
1000             A, C
4000             A, D
5000             B, E, F

With minimum support 50% and minimum confidence 50%, we have:
A ⇒ C (50%, 66.6%)
C ⇒ A (50%, 100%)
Mining Association Rules -- Example
Transaction ID   Items Bought
2000             A, B, C
1000             A, C
4000             A, D
5000             B, E, F

Min. support 50%, min. confidence 50%.

Frequent Itemset   Support
{A}                75%
{B}                50%
{C}                50%
{A, C}             50%

For rule A ⇒ C:
support = support({A, C}) = 50%
confidence = support({A, C}) / support({A}) = 66.6%
The Apriori principle:
Any subset of a frequent itemset must be frequent.
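A short Python sketch that reproduces these numbers on the four-transaction database above (function names are illustrative):

transactions = [
    {"A", "B", "C"},   # TID 2000
    {"A", "C"},        # TID 1000
    {"A", "D"},        # TID 4000
    {"B", "E", "F"},   # TID 5000
]

def support(itemset):
    """Fraction of transactions containing every item in the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(lhs, rhs):
    """Conditional probability that a transaction with lhs also contains rhs."""
    return support(lhs | rhs) / support(lhs)

print(support({"A", "C"}))        # 0.5     -> support of A => C is 50%
print(confidence({"A"}, {"C"}))   # 0.666.. -> confidence of A => C
print(confidence({"C"}, {"A"}))   # 1.0     -> confidence of C => A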
Mining Frequent Itemsets: the Key Step
Find the frequent itemsets: the sets of items that have
minimum support
A subset of a frequent itemset must also be a frequent
itemset, i.e., if {A, B} is a frequent itemset, both {A} and {B}
must be frequent itemsets.
Iteratively find frequent itemsets with cardinality from 1
to k (k-itemsets)
Use the frequent itemsets to generate association
rules.
The Apriori Algorithm
Ck: Candidate itemset of size k
Lk : frequent itemset of size k
L1 = {frequent items};
for (k = 1; Lk != ∅; k++) do begin
Ck+1 = candidates generated from Lk;
for each transaction t in database do
increment the count of all candidates in Ck+1
that are contained in t
Lk+1 = candidates in Ck+1 with min_support
end
return ∪k Lk;
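A compact Python rendering of this pseudocode, a sketch only: candidates are generated from all frequent items and pruned with the Apriori principle, which yields the same candidate sets as the usual Lk self-join; names and data are illustrative.

from itertools import combinations

def apriori(transactions, min_support):
    """Return every frequent itemset (frozenset) with its support count."""
    transactions = [frozenset(t) for t in transactions]
    # L1: frequent 1-itemsets
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    L = {s: c for s, c in counts.items() if c >= min_support}
    frequent, k = dict(L), 1
    while L:
        # C(k+1): (k+1)-sets all of whose k-subsets are frequent (Apriori pruning)
        items = sorted(set().union(*L))
        candidates = [frozenset(c) for c in combinations(items, k + 1)
                      if all(frozenset(s) in L for s in combinations(c, k))]
        # scan the database and count the surviving candidates
        counts = {c: sum(c <= t for t in transactions) for c in candidates}
        L = {s: c for s, c in counts.items() if c >= min_support}
        frequent.update(L)
        k += 1
    return frequent

db = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
print(apriori(db, min_support=2))   # includes frozenset({2, 3, 5}): 2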
The Apriori Algorithm -- Example
Database D (minimum support count = 2):
TID   Items
100   1 3 4
200   2 3 5
300   1 2 3 5
400   2 5

Scan D for counts of the candidate 1-itemsets C1:
{1}: 2, {2}: 3, {3}: 3, {4}: 1, {5}: 3
Frequent 1-itemsets L1:
{1}: 2, {2}: 3, {3}: 3, {5}: 3

Candidate 2-itemsets C2 (generated from L1):
{1 2}, {1 3}, {1 5}, {2 3}, {2 5}, {3 5}
Scan D for counts:
{1 2}: 1, {1 3}: 2, {1 5}: 1, {2 3}: 2, {2 5}: 3, {3 5}: 2
Frequent 2-itemsets L2:
{1 3}: 2, {2 3}: 2, {2 5}: 3, {3 5}: 2

Candidate 3-itemsets C3 (generated from L2): {2 3 5}
Scan D: frequent 3-itemsets L3: {2 3 5}: 2
Generating Association Rules
A Naive Algorithm
for each frequent itemset F do
for each subset c of F do
if ( support(F)/support(F-c) ≥ minconf ) then
output rule (F-c) ⇒ c,
with confidence = support(F)/support(F-c)
and support = support(F)
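A Python sketch of this naive rule generator; it assumes a dict of frequent itemsets with their support counts, such as the one returned by the Apriori sketch earlier (minconf and the toy counts are illustrative).

from itertools import combinations

def generate_rules(frequent, minconf):
    """frequent: dict mapping frozenset itemsets to support counts."""
    rules = []
    for F, supp_F in frequent.items():
        if len(F) < 2:
            continue
        for r in range(1, len(F)):
            for c in map(frozenset, combinations(F, r)):
                conf = supp_F / frequent[F - c]          # support(F)/support(F-c)
                if conf >= minconf:
                    rules.append((F - c, c, conf, supp_F))   # rule (F-c) => c
    return rules

# toy counts from the Apriori example above (4 transactions)
frequent = {frozenset({2}): 3, frozenset({3}): 3, frozenset({5}): 3,
            frozenset({2, 3}): 2, frozenset({2, 5}): 3, frozenset({3, 5}): 2,
            frozenset({2, 3, 5}): 2}
for lhs, rhs, conf, supp in generate_rules(frequent, minconf=0.7):
    print(set(lhs), "=>", set(rhs), f"conf={conf:.2f}", f"support={supp}/4")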
Session 6: Association Analysis
What is association analysis?
Mining single-dimensional Boolean association
rules in transactional databases
Mining multi-level association rules
Multiple-Level Association Rules
Items often form a hierarchy, e.g. food splits into bread and milk; milk into 2% and skim; bread into wheat and white; brand names such as Fraser (milk) and Sunset (bread) sit at the lowest level.
Items at the lower level are expected to have lower support.
Rules regarding itemsets at appropriate levels could be quite useful.
The transaction database can be encoded based on dimensions and levels:
TID   Items
T1    {111, 121, 211, 221}
T2    {111, 211, 222, 323}
T3    {112, 122, 221, 411}
T4    {111, 121}
T5    {111, 122, 211, 221, 413}
It is smart to explore shared multi-level mining (Han & Fu, VLDB’95).
Mining Multi-Level Associations
A top-down, progressive deepening approach:
First find high-level strong rules:
milk ⇒ bread [20%, 60%].
Then find their lower-level “weaker” rules:
2% milk ⇒ wheat bread [6%, 50%].
Variations of mining multiple-level association rules:
Level-crossed association rules:
2% milk ⇒ Wonder wheat bread
Association rules with multiple, alternative hierarchies:
2% milk ⇒ Wonder bread
Multi-Level Mining: Progressive Deepening
A top-down, progressive deepening approach:
First mine high-level frequent items:
milk (15%), bread (10%)
Then mine their lower-level “weaker” frequent itemsets:
2% milk (5%), wheat bread (4%)
Different min_support thresholds across levels
lead to different algorithms:
If adopting the same min_support across all levels,
then toss an item t if any of t’s ancestors is infrequent.
If adopting reduced min_support at lower levels,
then examine only those descendants whose ancestor’s support is
frequent/non-negligible.
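A hedged Python sketch of level-wise mining with reduced min_support at lower levels, using the digit-encoded items from the earlier slide: only items whose ancestor code was frequent at the previous level are examined (the encoding, thresholds, and names are illustrative).

def frequent_items_at_level(transactions, level, min_support, frequent_ancestors=None):
    """Count items truncated to `level` digits of their hierarchy code,
    examining only items whose ancestor code was frequent at level - 1."""
    counts = {}
    for t in transactions:
        seen = set()
        for code in t:
            prefix = code[:level]                  # e.g. '111' -> '11' at level 2
            if frequent_ancestors is not None and prefix[:-1] not in frequent_ancestors:
                continue                           # toss: ancestor is infrequent
            if prefix not in seen:                 # count each prefix once per transaction
                counts[prefix] = counts.get(prefix, 0) + 1
                seen.add(prefix)
    return {p for p, c in counts.items() if c >= min_support}

db = [["111", "121", "211", "221"], ["111", "211", "222", "323"],
      ["112", "122", "221", "411"], ["111", "121"],
      ["111", "122", "211", "221", "413"]]

level1 = frequent_items_at_level(db, 1, min_support=4)   # higher threshold at the high level
level2 = frequent_items_at_level(db, 2, min_support=3, frequent_ancestors=level1)
print(level1, level2)   # {'1', '2'}  {'11', '12', '21', '22'}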