Topic 5: Data Mining II
Classification and Prediction
The classification model assumes a set of predefined classes and aims to classify a large collection of tuples/samples to these classes.

[Figure: a small data set with attributes A, B and C, shown twice: once with its tuples labeled class X and class Y (classification), and once with the same tuples grouped into cluster X and cluster Y (clustering).]

Classification:
- predicts categorical class labels
- classifies data (constructs a model) based on the training set and the values (class labels) in a class attribute, and uses the model in classifying new data

Prediction:
- models continuous-valued functions, i.e., predicts unknown or missing values

Clustering is also called unsupervised classification; the aim is to divide the data tuples into non-predefined classes.

Typical applications:
- credit approval
- target marketing
- medical diagnosis
- treatment effectiveness analysis
Classification—A Two-Step Process
Model construction: describing a set of predetermined classes
- Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute
- The set of tuples used for model construction is the training set
- The model is represented as classification rules, decision trees, or mathematical formulae

Model usage: classifying future or unknown objects
- Estimate the accuracy of the model
  - The known label of each test sample is compared with the classified result from the model
  - The accuracy rate is the percentage of test set samples that are correctly classified by the model
  - The test set is independent of the training set; otherwise over-fitting will occur

Classification Process (1): Model Construction

The training data are fed to a classification algorithm, which constructs the classifier (model).

Training Data:

NAME  RANK            YEARS  TENURED
Mike  Assistant Prof  3      no
Mary  Assistant Prof  7      yes
Bill  Professor       2      yes
Jim   Associate Prof  7      yes
Dave  Assistant Prof  6      no
Anne  Associate Prof  3      no

Resulting classifier (model):

IF rank = ‘professor’ OR years > 6 THEN tenured = ‘yes’

Classification Process (2): Use the Model in Prediction

The classifier is applied to testing data (to estimate its accuracy) and then to unseen data.

Testing Data:

NAME     RANK            YEARS  TENURED
Tom      Assistant Prof  2      no
Merlisa  Associate Prof  7      no
George   Professor       5      yes
Joseph   Assistant Prof  7      yes

Unseen Data: (Jeff, Professor, 4) → Tenured? yes
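As a minimal sketch (not part of the original slides), the induced rule and the accuracy estimate can be expressed in a few lines of Python; the rule and the testing data are exactly those shown above:

    # A sketch of the model from the slide: IF rank = 'professor' OR years > 6 THEN tenured = 'yes'
    def classify(rank, years):
        return 'yes' if rank == 'Professor' or years > 6 else 'no'

    # Testing data from the slide: (name, rank, years, known label)
    testing = [('Tom', 'Assistant Prof', 2, 'no'),
               ('Merlisa', 'Associate Prof', 7, 'no'),
               ('George', 'Professor', 5, 'yes'),
               ('Joseph', 'Assistant Prof', 7, 'yes')]

    # Accuracy rate: the percentage of test set samples classified correctly
    correct = sum(1 for _, rank, years, label in testing
                  if classify(rank, years) == label)
    print(correct / len(testing))  # 0.75: Merlisa (7 years, not tenured) is misclassified

Note how the independent test set exposes the model's imperfection: the rule misclassifies Merlisa, so the estimated accuracy is 75%; the unseen tuple (Jeff, Professor, 4) is classified as tenured = 'yes'.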
Classification by Decision Tree Induction

Decision tree:
- A flow-chart-like tree structure
- An internal node denotes a test on an attribute
- A branch represents an outcome of the test
- Leaf nodes represent class labels or class distributions

Decision tree generation consists of two phases:
- Tree construction
  - At the start, all the training examples are at the root
  - Partition the examples recursively based on selected attributes
- Tree pruning
  - Identify and remove branches that reflect noise or outliers

Use of a decision tree: classify an unknown sample by testing the attribute values of the sample against the decision tree.
Training Dataset

This follows an example from Quinlan’s ID3. The class label is buys_computer; the other four attributes are the predictors.

age    income  student  credit_rating  buys_computer
<=30   high    no       fair           no
<=30   high    no       excellent      no
31…40  high    no       fair           yes
>40    medium  no       fair           yes
>40    low     yes      fair           yes
>40    low     yes      excellent      no
31…40  low     yes      excellent      yes
<=30   medium  no       fair           no
<=30   low     yes      fair           yes
>40    medium  yes      fair           yes
<=30   medium  yes      excellent      yes
31…40  medium  no       excellent      yes
31…40  high    yes      fair           yes
>40    medium  no       excellent      no
Output: A Decision Tree for “buys_computer”

[Figure: the induced decision tree. The root tests age?; the <=30 branch leads to the test student? (no → no, yes → yes); the 31…40 branch is the leaf yes; the >40 branch leads to the test credit_rating? (excellent → no, fair → yes).]

Classification rules:
IF age = “<=30” AND student = “no” THEN buys_computer = “no”
…
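As a small sketch (not from the slides), the tree above can be written directly as nested attribute tests; classifying an unknown sample is just evaluating these conditions top-down:

    # The decision tree for "buys_computer" from the slide, as nested tests
    def buys_computer(age, student, credit_rating):
        if age == '<=30':
            return 'yes' if student == 'yes' else 'no'          # student? branch
        elif age == '31...40':
            return 'yes'                                        # pure leaf
        else:  # age == '>40'
            return 'yes' if credit_rating == 'fair' else 'no'   # credit_rating? branch

    print(buys_computer('<=30', 'no', 'fair'))  # 'no', matching the first rule above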
How is the decision tree constructed?

Basic algorithm (a greedy algorithm):
- At the start, all the training examples are at the root. Attributes are categorical (if continuous-valued, they are discretized in advance).
- The attribute with the highest information gain is selected, and its values formulate the partitions. The examples are then partitioned, and the tree is constructed recursively by selecting the attribute with the highest information gain at the next level.
- The information gain of an attribute is a probabilistic measure that reflects the “least randomness” in the partitions.

Conditions for stopping the partitioning/recursion:
- All samples for a given node belong to the same class
- There are no remaining attributes for further partitioning (majority voting is then employed for classifying the leaf)
- There are no samples left
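The slides do not spell the measure out; for reference, ID3's information gain is the standard entropy-based definition, Gain(A) = Info(D) - sum over values v of A of (|Dv|/|D|) * Info(Dv). A minimal sketch on the 14 training tuples above (attribute names shortened):

    from math import log2
    from collections import Counter

    # The 14 training tuples from the slide ('buys' is the class label)
    data = [
        {'age': '<=30',    'income': 'high',   'student': 'no',  'credit': 'fair',      'buys': 'no'},
        {'age': '<=30',    'income': 'high',   'student': 'no',  'credit': 'excellent', 'buys': 'no'},
        {'age': '31...40', 'income': 'high',   'student': 'no',  'credit': 'fair',      'buys': 'yes'},
        {'age': '>40',     'income': 'medium', 'student': 'no',  'credit': 'fair',      'buys': 'yes'},
        {'age': '>40',     'income': 'low',    'student': 'yes', 'credit': 'fair',      'buys': 'yes'},
        {'age': '>40',     'income': 'low',    'student': 'yes', 'credit': 'excellent', 'buys': 'no'},
        {'age': '31...40', 'income': 'low',    'student': 'yes', 'credit': 'excellent', 'buys': 'yes'},
        {'age': '<=30',    'income': 'medium', 'student': 'no',  'credit': 'fair',      'buys': 'no'},
        {'age': '<=30',    'income': 'low',    'student': 'yes', 'credit': 'fair',      'buys': 'yes'},
        {'age': '>40',     'income': 'medium', 'student': 'yes', 'credit': 'fair',      'buys': 'yes'},
        {'age': '<=30',    'income': 'medium', 'student': 'yes', 'credit': 'excellent', 'buys': 'yes'},
        {'age': '31...40', 'income': 'medium', 'student': 'no',  'credit': 'excellent', 'buys': 'yes'},
        {'age': '31...40', 'income': 'high',   'student': 'yes', 'credit': 'fair',      'buys': 'yes'},
        {'age': '>40',     'income': 'medium', 'student': 'no',  'credit': 'excellent', 'buys': 'no'},
    ]

    def entropy(labels):
        # Info(D) = -sum_i p_i * log2(p_i) over the class proportions p_i
        n = len(labels)
        return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

    def info_gain(examples, attr):
        # Gain(A) = Info(D) - sum over values v of A of (|Dv|/|D|) * Info(Dv)
        labels = [e['buys'] for e in examples]
        gain = entropy(labels)
        for v in {e[attr] for e in examples}:
            part = [e['buys'] for e in examples if e[attr] == v]
            gain -= len(part) / len(examples) * entropy(part)
        return gain

    for a in ('age', 'income', 'student', 'credit'):
        print(a, round(info_gain(data, a), 3))
    # 'age' has the highest gain (about 0.247 bits), so it becomes the root test,
    # which is exactly the tree shown on the previous slide.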
Other Classification Methods

Bayesian classification:
- Assign to sample X the class label C such that P(C|X) is maximal
- The Naïve Bayes classifier uses Bayes’ theorem to estimate the probabilities, assuming the attributes are conditionally independent (a small sketch follows the list below)

Further methods:
- Classification by backpropagation (neural networks)
- Classification based on association rule mining
- k-nearest neighbor classifier
- Case-based reasoning
- Genetic algorithms
- Rough set approach
- Fuzzy set approaches
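Returning to the naive Bayes rule above: a minimal sketch (not from the slides) estimates the prior P(C) and each conditional P(x_a|C) by counting, and picks the class with the largest product. In practice one adds Laplace smoothing to avoid zero probabilities; the toy data here are assumed for illustration only.

    # Naive Bayes sketch: choose the class C maximizing
    # P(C|X) ~ P(C) * product over attributes a of P(x_a | C),
    # using the conditional independence assumption from the slide.
    def naive_bayes(train, labels, x):
        best_class, best_score = None, -1.0
        for c in set(labels):
            rows = [t for t, l in zip(train, labels) if l == c]
            score = len(rows) / len(train)                              # prior P(C)
            for a, v in x.items():
                score *= sum(1 for t in rows if t[a] == v) / len(rows)  # P(x_a = v | C)
            if score > best_score:
                best_class, best_score = c, score
        return best_class

    train = [{'age': '<=30', 'student': 'no'},  {'age': '<=30', 'student': 'yes'},
             {'age': '>40',  'student': 'no'},  {'age': '>40',  'student': 'yes'}]
    labels = ['no', 'yes', 'no', 'yes']
    print(naive_bayes(train, labels, {'age': '<=30', 'student': 'yes'}))  # 'yes'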
What is Cluster Analysis?

Cluster: a collection of data objects that are
- similar to one another within the same cluster
- dissimilar to the objects in other clusters

Cluster analysis: grouping a set of data objects into clusters. Clustering is unsupervised classification: there are no predefined classes.

Typical applications:
- as a stand-alone tool to get insight into the data distribution
- as a preprocessing step for other algorithms

General Applications of Clustering

- Pattern recognition
- Spatial data analysis
  - create thematic maps in GIS by clustering feature spaces
  - detect spatial clusters and explain them in spatial data mining
- Image processing
- Economic science (especially market research)
- WWW
  - document classification
  - clustering Weblog data to discover groups of similar access patterns
Requirements of Clustering

- Scalability
- Ability to deal with different types of attributes
- Discovery of clusters with arbitrary shape
- Minimal requirements for domain knowledge to determine the input parameters
- Ability to deal with noise and outliers
- Insensitivity to the order of the input records
- High dimensionality
- Incorporation of user-specified constraints
- Interpretability and usability

Clustering Visualized

Input of clustering: a large data set with n attributes (dimensions). Its tuples can be modeled as points in a high-dimensional space.

Output of clustering:
- A set of k clusters, where the distance between points in the same cluster is small and the distance between points in different clusters is large.
- A set of outliers, i.e., points which do not belong to any cluster; they form clusters of small cardinality and are thus of small interest.

[Figure: the data set with attributes A, B and C plotted as points, grouped into cluster X and cluster Y, with a few outliers.]
Major Clustering Approaches

- Partitioning algorithms: construct various partitions and then evaluate them by some criterion
- Hierarchical algorithms: create a hierarchical decomposition of the set of data (or objects) using some criterion
- Density-based: based on connectivity and density functions
- Grid-based: based on a multiple-level granularity structure
- Model-based: a model is hypothesized for each of the clusters, and the idea is to find the best fit of the data to each model

Partitioning Algorithms: Basic Concept

Partitioning method: construct a partition of a database D of n objects into a set of k clusters.

Given a k, find a partition of k clusters that optimizes the chosen partitioning criterion:
- Global optimum: exhaustively enumerate all partitions
- Heuristic methods: the k-means and k-medoids algorithms
  - k-means: each cluster is represented by the center of the cluster
  - k-medoids or PAM (Partitioning Around Medoids): each cluster is represented by one of the objects in the cluster
The K-Means Clustering Method

Given k, the k-means algorithm is implemented in 4 steps:
1. Partition the objects into k non-empty subsets.
2. Compute seed points as the centroids of the clusters of the current partition. The centroid is the center (mean point) of the cluster.
3. Assign each object to the cluster with the nearest seed point.
4. Go back to Step 2; stop when there are no more new assignments.

[Figure: an example run for k=2 on a small 2-dimensional data set, showing the seed points and the cluster assignments over successive iterations until convergence.]
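The four steps translate almost line by line into code. A minimal 2-D sketch (not from the slides; a random choice of initial seed points stands in for Step 1):

    import random

    def kmeans(points, k, seed=0):
        random.seed(seed)
        centroids = random.sample(points, k)       # Step 1: k initial seed points
        while True:
            # Step 3: assign each object to the cluster with the nearest seed point
            clusters = [[] for _ in range(k)]
            for x, y in points:
                i = min(range(k), key=lambda c: (x - centroids[c][0])**2 + (y - centroids[c][1])**2)
                clusters[i].append((x, y))
            # Step 2: recompute the seed points as the centroids (mean points)
            new = [(sum(x for x, _ in c) / len(c), sum(y for _, y in c) / len(c))
                   if c else centroids[i] for i, c in enumerate(clusters)]
            # Step 4: stop when the assignment (and hence the centroids) no longer changes
            if new == centroids:
                return clusters
            centroids = new

    pts = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
    print(kmeans(pts, 2))  # the points around (1,1) and those around (8,8)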
Comments on the K-Means Method

Strength:
- Relatively efficient: O(tkn), where n is the number of objects, k the number of clusters, and t the number of iterations. Normally, k, t << n.
- Often terminates at a local optimum. The global optimum may be found using techniques such as deterministic annealing and genetic algorithms.

Weakness:
- Applicable only when a mean is defined (what about categorical data?)
- Need to specify k, the number of clusters, in advance
- Unable to handle noisy data and outliers
- Not suitable for discovering clusters with non-convex shapes

PAM (Partitioning Around Medoids)

PAM uses real objects to represent the clusters: it starts from an initial set of medoids and iteratively replaces one of the medoids by one of the non-medoids if this improves the total distance of the resulting clustering.

1. Select k representative objects arbitrarily.
2. For each pair of a non-selected object h and a selected object i, calculate the total swapping cost TCih.
3. For each pair of i and h: if TCih < 0, i is replaced by h; then assign each non-selected object to the most similar representative object.
4. Repeat steps 2-3 until there is no change.

PAM Clustering: Total Swapping Cost TCih = Σj Cjih

[Figure: the four cases for the contribution Cjih of a non-selected object j when medoid i is swapped with non-medoid h; t denotes another current medoid:
- j is reassigned from i to h: Cjih = d(j, h) - d(j, i)
- j belongs to t and stays with t: Cjih = 0
- j is reassigned from i to t: Cjih = d(j, t) - d(j, i)
- j is reassigned from t to h: Cjih = d(j, h) - d(j, t)]
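A minimal PAM-style sketch (not from the slides): instead of enumerating the four Cjih cases, it computes each TCih directly as the change in total distance caused by the swap, which is equivalent by the definition above. It greedily applies the best improving swap rather than the first one, a simplification of the procedure in the slide.

    def d(p, q):
        # Euclidean distance between 2-D points
        return ((p[0] - q[0])**2 + (p[1] - q[1])**2) ** 0.5

    def total_distance(points, medoids):
        # Each object contributes its distance to the most similar medoid
        return sum(min(d(p, m) for m in medoids) for p in points)

    def pam(points, k):
        medoids = points[:k]                    # Step 1: arbitrary representatives
        while True:
            best = None
            for i in medoids:                   # Step 2: cost TCih of every swap (i, h)
                for h in (p for p in points if p not in medoids):
                    trial = [h if m == i else m for m in medoids]
                    tc = total_distance(points, trial) - total_distance(points, medoids)
                    if tc < 0 and (best is None or tc < best[0]):
                        best = (tc, trial)      # Step 3: keep the improving swap
            if best is None:                    # Step 4: no change, stop
                return medoids
            medoids = best[1]

    pts = [(1, 1), (2, 1), (1, 2), (9, 9), (8, 9), (9, 8), (25, 0)]
    print(pam(pts, 2))  # one medoid per dense group; the outlier (25, 0) never becomes one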
Hierarchical Clustering

Use the distance matrix as the clustering criterion. This method does not require the number of clusters k as an input, but it needs a termination condition.

[Figure: agglomerative (AGNES) vs. divisive (DIANA) hierarchical clustering of objects a, b, c, d, e. AGNES proceeds from singletons through {a,b}, {d,e}, {c,d,e} to {a,b,c,d,e} (steps 0 to 4); DIANA produces the same hierarchy in the opposite direction (steps 4 to 0).]

Hierarchical Clustering (cont’d)

Major weakness of agglomerative clustering methods:
- They do not scale well: the time complexity is at least O(n^2), where n is the number of total objects
- They can never undo what was done previously

Integration of hierarchical with distance-based clustering:
- BIRCH (1996): uses a CF-tree and incrementally adjusts the quality of sub-clusters
- CURE (1998): selects well-scattered points from the cluster and then shrinks them towards the center of the cluster by a specified fraction
- CHAMELEON (1999): hierarchical clustering using dynamic modeling
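A minimal agglomerative sketch in the spirit of AGNES (not from the slides; single-link distance is assumed here, though other linkage criteria are common): start from singleton clusters and repeatedly merge the two closest ones until a termination condition, here a target number of clusters, is met.

    def agglomerative(points, target, dist):
        # Start from singletons; merge the two closest clusters (single-link:
        # the closest pair of points across the two clusters) until `target`
        # clusters remain - the termination condition.
        clusters = [[p] for p in points]
        while len(clusters) > target:
            pairs = ((a, b) for a in range(len(clusters)) for b in range(a + 1, len(clusters)))
            a, b = min(pairs, key=lambda ab: min(dist(p, q)
                       for p in clusters[ab[0]] for q in clusters[ab[1]]))
            clusters[a] += clusters.pop(b)
        return clusters

    print(agglomerative([1, 2, 5, 6, 20], 2, lambda p, q: abs(p - q)))  # [[1, 2, 5, 6], [20]]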
Clustering Categorical Data (ROCK)

- Use links to measure similarity/proximity
- Links: the number of common neighbors of the two points
- Example: consider the point sets {1,2,3}, {1,2,4}, {1,2,5}, {1,3,4}, {1,3,5}, {1,4,5}, {2,3,4}, {2,3,5}, {2,4,5}, {3,4,5}; the link between {1,2,3} and {1,2,4} is 3 (they have 3 common neighbors)

Algorithm:
- Draw a random sample
- Cluster with links
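A minimal sketch of the link computation (not from the slides): here two points are taken to be neighbors when their Jaccard similarity is at least a threshold theta; theta = 0.5 is an assumed value, chosen because it reproduces the example above.

    def jaccard(a, b):
        return len(a & b) / len(a | b)

    def link_table(points, theta):
        # neighbors(p): the points whose Jaccard similarity with p is >= theta;
        # link(p, q) is then the number of common neighbors of p and q
        nbrs = [{j for j, q in enumerate(points) if i != j and jaccard(p, q) >= theta}
                for i, p in enumerate(points)]
        return lambda i, j: len(nbrs[i] & nbrs[j])

    pts = [frozenset(s) for s in ({1,2,3}, {1,2,4}, {1,2,5}, {1,3,4}, {1,3,5},
                                  {1,4,5}, {2,3,4}, {2,3,5}, {2,4,5}, {3,4,5})]
    link = link_table(pts, 0.5)
    print(link(0, 1))  # 3: {1,2,5}, {1,3,4} and {2,3,4} neighbor both {1,2,3} and {1,2,4}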
Density-Based Clustering Methods

Clustering based on density (a local cluster criterion), such as density-connected points.

Major features:
- Discover clusters of arbitrary shape
- Handle noise
- One scan
- Need density parameters as a termination condition

Several interesting studies:
- DBSCAN: Ester, et al. (KDD’96)
- OPTICS: Ankerst, et al. (SIGMOD’99)
- DENCLUE: Hinneburg & Keim (KDD’98)
- CLIQUE: Agrawal, et al. (SIGMOD’98)

DBSCAN: Density-Based Spatial Clustering of Applications with Noise

- Relies on a density-based notion of cluster: a cluster is defined as a maximal set of density-connected points
- Discovers clusters of arbitrary shape in spatial databases with noise

[Figure: core, border and outlier points of a cluster, for Eps = 1cm and MinPts = 5.]
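A minimal sketch of the point classification in the figure (not from the slides; full DBSCAN additionally expands clusters from the core points via density-connectivity):

    def dbscan_point_types(points, eps, min_pts, dist):
        # core: at least min_pts points (itself included) within distance eps;
        # border: not core, but within eps of some core point; otherwise outlier
        n = len(points)
        nbrs = [[j for j in range(n) if dist(points[i], points[j]) <= eps] for i in range(n)]
        core = {i for i in range(n) if len(nbrs[i]) >= min_pts}
        return ['core' if i in core
                else 'border' if any(j in core for j in nbrs[i])
                else 'outlier'
                for i in range(n)]

    d2 = lambda p, q: ((p[0] - q[0])**2 + (p[1] - q[1])**2) ** 0.5
    pts = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5), (10, 10)]
    print(dbscan_point_types(pts, eps=1.5, min_pts=5, dist=d2))
    # the five dense points are core points; (10, 10) is an outlier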
Sub-Space Clustering

If the number of attributes is large (a high-dimensional space), clustering can be meaningless, just like nearest neighbor search. This is because of the “dimensionality curse”: a point could be as close to its own cluster as to the other clusters. In such cases, clustering can be more meaningful in a subset of the dimensions, where the remaining dimensions are “noise” with respect to the specific cluster. Some clustering techniques (e.g., CLIQUE, PROCLUS) discover clusters in subsets of the full-dimensional space.

The Role of DB Research in Classification and Clustering

Classification is a classic problem in Statistics and Machine Learning. ML has mainly focused on improving the accuracy of classification. Database research, on the other hand, mainly focuses on improving the scalability of clustering/classification methods: we would like to cluster/classify data sets with millions of tuples and hundreds of attributes (dimensions) at reasonable speed. The next presentations will describe two clustering methods and a scalable decision-tree classifier.