OVERVIEW
DATA MINING
DATA MINING VESIT M.VIJAYALAKSHMI
1
Outline Of the Presentation
–Motivation & Introduction
–Data Mining Algorithms
–Teaching Plan
DATA MINING VESIT M.VIJAYALAKSHMI
2
Why Data Mining? Commercial
Viewpoint
• Lots of data is being collected and warehoused
– Web data, e-commerce
– purchases at department/grocery stores
– Bank/Credit Card transactions
• Computers have become cheaper and more
powerful
• Competitive Pressure is strong
– Provide better, customized services for an edge
(e.g. in Customer Relationship Management)
DATA MINING VESIT M.VIJAYALAKSHMI
3
Typical Decision Making
• Given a database of 100,000 names, which
persons are the least likely to default on their
credit cards?
• Which of my customers are likely to be the
most loyal?
• Which claims in insurance are potential frauds?
• Who may not pay back loans?
• Who are consistent players to bid for in IPL?
• Who can be potential customers for a new toy?
Data Mining helps extract such information
DATA MINING VESIT M.VIJAYALAKSHMI
4
Why Mine Data?
Scientific Viewpoint
• Data collected and stored at enormous speeds (GB/hour)
– remote sensors on a satellite
– telescopes scanning the skies
– microarrays generating gene expression data
– scientific simulations generating terabytes of data
• Traditional techniques infeasible for raw data
• Data mining may help scientists
– in classifying and segmenting data
– in Hypothesis Formation
DATA MINING VESIT M.VIJAYALAKSHMI
5
Mining Large Data Sets Motivation
• There is often information “hidden” in the
data that is not readily evident.
• Human analysts may take weeks to
discover useful information.
DATA MINING VESIT M.VIJAYALAKSHMI
6
Data Mining works with Warehouse Data
• Data Warehousing provides the Enterprise with a memory
• Data Mining provides the Enterprise with intelligence
DATA MINING VESIT M.VIJAYALAKSHMI
7
What Is Data Mining?
• Data mining (knowledge discovery in databases):
– Extraction of interesting (non-trivial, implicit, previously
unknown and potentially useful) information or patterns
from data in large databases
• Alternative names and their “inside stories”:
– Knowledge discovery (mining) in databases (KDD),
knowledge extraction, data/pattern analysis, data
archeology, data dredging, information harvesting,
business intelligence, etc.
• What is not data mining?
– (Deductive) query processing.
– Expert systems or small ML/statistical programs
DATA MINING VESIT M.VIJAYALAKSHMI
8
Potential Applications
• Market analysis and management
– target marketing, CRM, market basket
analysis, cross selling, market segmentation
• Risk analysis and management
– Forecasting, customer retention, quality
control, competitive analysis
• Fraud detection and management
• Text mining (news group, email, documents) and
Web analysis.
– Intelligent query answering
DATA MINING VESIT M.VIJAYALAKSHMI
9
Other Applications
Sports
• game statistics to gain competitive advantage
Astronomy
• JPL and the Palomar Observatory discovered
22 quasars with the help of data mining
Web analysis
• IBM Surf-Aid applies data mining algorithms
to Web access logs for market-related pages
to discover customer preference and behavior,
analyze the effectiveness of Web marketing,
improve Web site organization, etc.
DATA MINING VESIT M.VIJAYALAKSHMI
10
What makes data mining
possible?
• Advances in the following areas are making
data mining deployable:
– data warehousing
– better and more data (i.e., operational,
behavioral, and demographic)
– the emergence of easily deployed data mining
tools and
– the advent of new data mining techniques.
-- Gartner Group
DATA MINING VESIT M.VIJAYALAKSHMI
11
What is Not Data Mining
• Database
– Find all credit applicants with last name of Smith.
– Identify customers who have purchased more than
$10,000 in the last month.
– Find all customers who have purchased milk
• Data Mining
– Find all credit applicants who are poor credit risks.
(classification)
– Identify customers with similar buying habits.
(Clustering)
– Find all items which are frequently purchased with
milk. (association rules)
DATA MINING VESIT M.VIJAYALAKSHMI
12
Data Mining: On What Kind of Data?
• Relational databases
• Data warehouses
• Transactional databases
• Advanced DB and information repositories
– Object-oriented and object-relational databases
– Spatial databases
– Time-series data and temporal data
– Text databases and multimedia databases
– Heterogeneous and legacy databases
– WWW
DATA MINING VESIT M.VIJAYALAKSHMI
13
Data Mining Models And Tasks
DATA MINING VESIT M.VIJAYALAKSHMI
14
Are All the “Discovered” Patterns Interesting?
• A data mining system/query may generate thousands of
patterns, not all of which are interesting.
• Interestingness measures:
– A pattern is interesting if it is easily understood by humans,
valid on new or test data with some degree of certainty,
potentially useful, novel, or validates some hypothesis that a
user seeks to confirm
• Objective vs. subjective interestingness measures:
– Objective: based on statistics and structures of patterns, e.g.,
support, confidence, etc.
– Subjective: based on user’s belief in the data, e.g.,
unexpectedness, novelty, etc.
DATA MINING VESIT M.VIJAYALAKSHMI
15
Can We Find All and Only Interesting
Patterns?
• Find all the interesting patterns: Completeness
– Association vs. classification vs. clustering
• Search for only interesting patterns:
– First generate all the patterns and then filter
out the uninteresting ones.
– Generate only the interesting patterns.
DATA MINING VESIT M.VIJAYALAKSHMI
16
Data Mining vs. KDD
• Knowledge Discovery in Databases
(KDD): process of finding useful
information and patterns in data.
• Data Mining: Use of algorithms to
extract the information and patterns
derived by the KDD process.
DATA MINING VESIT M.VIJAYALAKSHMI
17
KDD Process
• Selection: Obtain data from various sources.
• Preprocessing: Cleanse data.
• Transformation: Convert to common format.
Transform to new format.
• Data Mining: Obtain desired results.
• Interpretation/Evaluation: Present results to user in
meaningful manner.
DATA MINING VESIT M.VIJAYALAKSHMI
18
Data Mining and Business Intelligence
Increasing potential to support business decisions (from the bottom layer up):
• Making Decisions (End User)
• Data Presentation: Visualization Techniques (Business Analyst)
• Data Mining: Information Discovery (Data Analyst)
• Data Exploration: Statistical Analysis, Querying and Reporting (Data Analyst)
• Data Warehouses / Data Marts: OLAP, MDA (DBA)
• Data Sources: Paper, Files, Information Providers, Database Systems, OLTP (DBA)
DATA MINING VESIT M.VIJAYALAKSHMI
19
Data Mining Development
•Relational Data Model
•SQL
•Association Rule Algorithms
•Data Warehousing
•Scalability Techniques
•Similarity Measures
•Hierarchical Clustering
•IR Systems
•Imprecise Queries
•Textual Data
•Web Search Engines
•Algorithm Design Techniques
•Algorithm Analysis
•Data Structures
•Bayes Theorem
•Regression Analysis
•EM Algorithm
•K-Means Clustering
•Time Series Analysis
•Neural Networks
•Decision Tree Algorithms
DATA MINING VESIT M.VIJAYALAKSHMI
20
Data Mining Issues
• Human Interaction
• Overfitting
• Outliers
• Interpretation
• Visualization
• Large Datasets
• High Dimensionality
• Multimedia Data
• Missing Data
• Irrelevant Data
• Noisy Data
• Changing Data
• Integration
• Application
DATA MINING VESIT M.VIJAYALAKSHMI
21
Social Implications of DM
• Privacy
• Profiling
• Unauthorized use
DATA MINING VESIT M.VIJAYALAKSHMI
22
Data Mining Metrics
• Usefulness
• Return on Investment (ROI)
• Accuracy
• Space/Time
DATA MINING VESIT M.VIJAYALAKSHMI
23
Data Mining Algorithms
1. Classification
2. Clustering
3. Association Mining
4. Web Mining
DATA MINING VESIT M.VIJAYALAKSHMI
24
Data Mining Tasks
• Prediction Methods
– Use some variables to predict unknown
or future values of other variables.
• Description Methods
– Find human-interpretable patterns that
describe the data.
DATA MINING VESIT M.VIJAYALAKSHMI
25
Data Mining Algorithms
• Classification [Predictive]
• Clustering [Descriptive]
• Association Rule Discovery [Descriptive]
• Sequential Pattern Discovery [Descriptive]
• Regression [Predictive]
• Deviation Detection [Predictive]
DATA MINING VESIT M.VIJAYALAKSHMI
26
Data Mining Algorithms
CLASSIFICATION
DATA MINING VESIT M.VIJAYALAKSHMI
27
Classification
Given old data about customers and payments, predict a
new applicant's loan eligibility.
[Figure: records of previous customers (Age, Salary, Profession, Location, Customer type) train a classifier; the learned decision tree (splitting on Salary > 5 K and Prof. = Exec to label applicants good or bad) is then applied to a new applicant's data.]
DATA MINING VESIT M.VIJAYALAKSHMI
28
Classification Problem
• Given a database D={t1,t2,…,tn} and a set of
classes C={C1,…,Cm}, the Classification
Problem is to define a mapping f: D → C
where each ti is assigned to one class.
• Actually divides D into equivalence classes.
• Prediction is similar, but may be viewed as
having infinite number of classes.
DATA MINING VESIT M.VIJAYALAKSHMI
29
Supervised vs. Unsupervised Learning
• Supervised learning (classification)
– Supervision: The training data (observations,
measurements, etc.) are accompanied by labels indicating
the class of the observations
– New data is classified based on the training set
• Unsupervised learning (clustering)
– The class labels of the training data are unknown
– Given a set of measurements, observations, etc. with the
aim of establishing the existence of classes or clusters in
the data
DATA MINING VESIT M.VIJAYALAKSHMI
30
Overview of Naive Bayes
The goal of Naive Bayes is to work out whether a new example
is in a class given that it has a certain combination of attribute
values. We work out the likelihood of the example being in
each class given the evidence (its attribute values), and take
the highest likelihood as the classification.
Bayes Rule (E = the event/evidence that has occurred, H = the hypothesis):
P[H | E] = P[E | H] · P[H] / P[E]
P[H] is called the prior probability (of the hypothesis).
P[H|E] is called the posterior probability (of the hypothesis
given the evidence)
DATA MINING VESIT M.VIJAYALAKSHMI
31 31
Worked Example 1
Take the following training data, from bank loan applicants:
ApplicantID  City   Children  Income  Status
1            Delhi  Many      Medium  DEFAULTS
2            Delhi  Many      Low     DEFAULTS
3            Delhi  Few       Medium  PAYS
4            Delhi  Few       High    PAYS

• P[City=Delhi | Status = DEFAULTS] = 2/2 = 1
• P[City=Delhi | Status = PAYS] = 2/2 = 1
• P[Children=Many | Status = DEFAULTS] = 2/2 = 1
• P[Children=Few | Status = DEFAULTS] = 0/2 = 0
• etc.
DATA MINING VESIT M.VIJAYALAKSHMI
32 32
Worked Example 1
Summarizing, we have the following probabilities:
Probability of...   ...given DEFAULTS   ...given PAYS
City=Delhi          2/2 = 1             2/2 = 1
Children=Few        0/2 = 0             2/2 = 1
Children=Many       2/2 = 1             0/2 = 0
Income=Low          1/2 = 0.5           0/2 = 0
Income=Medium       1/2 = 0.5           1/2 = 0.5
Income=High         0/2 = 0             1/2 = 0.5
and P[Status = DEFAULTS] = 2/4 = 0.5
P[Status = PAYS] = 2/4 = 0.5
The probability of (Income=Medium) given that the applicant DEFAULTs =
the number of applicants with Income=Medium who DEFAULT
divided by the number of applicants who DEFAULT
= 1/2 = 0.5
DATA MINING VESIT M.VIJAYALAKSHMI
33 33
Worked Example 1
Now, assume a new example is presented where
City=Delhi, Children=Many, and Income=Medium:
First, we estimate the likelihood that the example is a defaulter, given its attribute
values: P[H1|E] = P[E|H1].P[H1]
(denominator omitted*)
P[Status = DEFAULTS | Delhi,Many,Medium] =
P[Delhi|DEFAULTS] x P[Many|DEFAULTS] x P[Medium|DEFAULTS] x
P[DEFAULTS] =
1 x 1 x 0.5 x 0.5 = 0.25
Then we estimate the likelihood that the example is a payer, given its attributes:
P[H2|E] = P[E|H2].P[H2]
(denominator omitted*)
P[Status = PAYS | Delhi,Many,Medium] =
P[Delhi|PAYS] x P[Many|PAYS] x P[Medium|PAYS] x
P[PAYS] =
1 x 0 x 0.5 x 0.5 = 0
As the conditional likelihood of being a defaulter is higher (because 0.25 > 0), we
conclude that the new example is a defaulter.
DATA MINING VESIT M.VIJAYALAKSHMI
34 34
Worked Example 1
Now, assume a new example is presented where
City=Delhi, Children=Many, and Income=High:
First, we estimate the likelihood that the example is a defaulter, given its
attribute values:
P[Status = DEFAULTS | Delhi,Many,High] =
P[Delhi|DEFAULTS] x P[Many|DEFAULTS] x P[High|DEFAULTS] x
P[DEFAULTS] = 1 x 1 x 0 x 0.5 = 0
Then we estimate the likelihood that the example is a payer, given its
attributes:
P[Status = PAYS | Delhi,Many,High] =
P[Delhi|PAYS] x P[Many|PAYS] x P[High|PAYS] x
P[PAYS] =
1 x 0 x 0.5 x 0.5 = 0
As the conditional likelihood of being a defaulter is the same as that for being
a payer, we can come to no conclusion for this example.
DATA MINING VESIT M.VIJAYALAKSHMI
35 35
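The two worked cases above can be checked mechanically. The following is a minimal, illustrative Python sketch (not part of the original slides; function and variable names are my own) that computes P[E|H] · P[H] for each class over the four-applicant table, with the denominator P[E] omitted as in the slides.

# Minimal Naive Bayes scorer for the bank-loan worked example.
# Illustrative sketch only; names and structure are not from the slides.
train = [  # (City, Children, Income, Status)
    ("Delhi", "Many", "Medium", "DEFAULTS"),
    ("Delhi", "Many", "Low",    "DEFAULTS"),
    ("Delhi", "Few",  "Medium", "PAYS"),
    ("Delhi", "Few",  "High",   "PAYS"),
]

def score(example, status):
    """P[E|H] * P[H] for one class; the denominator P[E] is omitted."""
    rows = [r for r in train if r[3] == status]
    p = len(rows) / len(train)                       # prior P[H]
    for i, value in enumerate(example):              # conditionals P[attribute=value | H]
        p *= sum(1 for r in rows if r[i] == value) / len(rows)
    return p

for example in [("Delhi", "Many", "Medium"), ("Delhi", "Many", "High")]:
    print(example, {s: score(example, s) for s in ("DEFAULTS", "PAYS")})
# ('Delhi', 'Many', 'Medium') {'DEFAULTS': 0.25, 'PAYS': 0.0}  -> classified as DEFAULTS
# ('Delhi', 'Many', 'High')   {'DEFAULTS': 0.0, 'PAYS': 0.0}   -> no conclusion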
Weaknesses
• Naive Bayes assumes that variables are equally
important and that they are independent, which is
often not the case in practice.
• Naive Bayes is damaged by the inclusion of
redundant (strongly dependent) attributes.
• Sparse data: If some attribute values are not present
in the data, then a zero probability for P[E|H] might
exist. This would lead P[H|E] to be zero no matter
how high P[E|H] is for other attribute values. Small
positive values which estimate the so-called ‘prior
probabilities’ are often used to correct this.
DATA MINING VESIT M.VIJAYALAKSHMI
36 36
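As a hedged illustration of the correction mentioned in the last bullet, the snippet below applies add-one (Laplace) smoothing to the zero estimate from the worked example. It is an assumption-laden sketch, not a procedure given in the slides.

def smoothed_conditional(count_value_and_class, count_class, n_values, k=1):
    """Laplace-smoothed estimate of P[attribute=value | class].

    count_value_and_class: examples of the class having this attribute value
    count_class:           examples of the class
    n_values:              number of distinct values the attribute can take
    k:                     pseudo-count added to every value (k=1 is add-one smoothing)
    """
    return (count_value_and_class + k) / (count_class + k * n_values)

# P[Children=Few | DEFAULTS] was 0/2 = 0 in the worked example; with add-one
# smoothing over the two values {Few, Many} it becomes (0 + 1) / (2 + 2) = 0.25.
print(smoothed_conditional(0, 2, 2))   # 0.25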
Classification Using Decision
Trees
• Partitioning based: Divide search space
into rectangular regions.
• Tuple placed into class based on the region
within which it falls.
• DT approaches differ in how the tree is
built: DT Induction
• Internal nodes associated with attribute and
arcs with values for that attribute.
• Algorithms: ID3, C4.5, CART
DATA MINING VESIT M.VIJAYALAKSHMI
37
DT Issues
• Choosing Splitting Attributes
• Ordering of Splitting Attributes
• Splits
• Tree Structure
• Stopping Criteria
• Training Data
• Pruning
DATA MINING VESIT M.VIJAYALAKSHMI
38
DECISION TREES
• An internal node represents a test on an attribute.
• A branch represents an outcome of the test, e.g.,
Color=red.
• A leaf node represents a class label or class label
distribution.
• At each node, one attribute is chosen to split training
examples into distinct classes as much as possible
• A new case is classified by following a matching path
to a leaf node.
DATA MINING VESIT M.VIJAYALAKSHMI
39
Training Set
Outlook   Temperature  Humidity  Windy  Class
sunny     hot          high      false  N
sunny     hot          high      true   N
overcast  hot          high      false  P
rain      mild         high      false  P
rain      cool         normal    false  P
rain      cool         normal    true   N
overcast  cool         normal    true   P
sunny     mild         high      false  N
sunny     cool         normal    false  P
rain      mild         normal    false  P
sunny     mild         normal    true   P
overcast  mild         high      true   P
overcast  hot          normal    false  P
rain      mild         normal    true   N
DATA MINING VESIT M.VIJAYALAKSHMI
40
Example
[Decision tree learned from the training set]
Outlook = sunny    -> test Humidity: high -> N, normal -> P
Outlook = overcast -> P
Outlook = rain     -> test Windy: true -> N, false -> P
DATA MINING VESIT M.VIJAYALAKSHMI
41
Building Decision Tree
• Top-down tree construction
– At start, all training examples are at the root.
– Partition the examples recursively by choosing one
attribute each time.
• Bottom-up tree pruning
– Remove subtrees or branches, in a bottom-up manner,
to improve the estimated accuracy on new cases.
• Use of decision tree: Classifying an unknown sample
– Test the attribute values of the sample against the
decision tree
DATA MINING VESIT M.VIJAYALAKSHMI
42
Algorithm for Decision Tree Induction
• Basic algorithm (a greedy algorithm)
– Tree is constructed in a top-down recursive divide-and-conquer manner
– At start, all the training examples are at the root
– Attributes are categorical
– Examples are partitioned recursively based on selected
attributes
– Test attributes are selected on the basis of a heuristic or
statistical measure (e.g., information gain)
• Conditions for stopping partitioning
– All samples for a given node belong to the same class
– There are no remaining attributes for further partitioning –
majority voting is employed for classifying the leaf
– There are no samples left
DATA MINING VESIT M.VIJAYALAKSHMI
43
Choosing the Splitting Attribute
• At each node, available attributes are
evaluated on the basis of separating the
classes of the training examples. A
Goodness function is used for this purpose.
• Typical goodness functions:
– information gain (ID3/C4.5)
– information gain ratio
– gini index
DATA MINING VESIT M.VIJAYALAKSHMI
44
Which attribute to select?
DATA MINING VESIT M.VIJAYALAKSHMI
45
A criterion for attribute selection
• Which is the best attribute?
– The one which will result in the smallest tree
– Heuristic: choose the attribute that produces the
“purest” nodes
• Popular impurity criterion: information gain
– Information gain increases with the average
purity of the subsets that an attribute produces
• Strategy: choose attribute that results in greatest
information gain
DATA MINING VESIT M.VIJAYALAKSHMI
46
Information Gain (ID3/C4.5)
• Select the attribute with the highest information gain
• Assume there are two classes, P and N
– Let the set of examples S contain p elements of class P and n
elements of class N
– The amount of information, needed to decide if an arbitrary
example in S belongs to P or N is defined as
I(p, n) = - [p/(p+n)] log2[p/(p+n)] - [n/(p+n)] log2[n/(p+n)]
DATA MINING VESIT M.VIJAYALAKSHMI
47
Information Gain in Decision Tree
Induction
• Assume that using attribute A a set S will be
partitioned into sets {S1, S2 , …, Sv}
– If Si contains pi examples of P and ni examples of N, the
entropy, or the expected information needed to classify
objects in all subtrees Si is
E(A) = Σ (i = 1 to v) [ (pi + ni) / (p + n) ] · I(pi, ni)
• The encoding information that would be gained by
branching on A:
Gain(A) = I(p, n) - E(A)
DATA MINING VESIT M.VIJAYALAKSHMI
48
Example: attribute “Outlook”
• "Outlook" = "Sunny":
info([2,3]) = entropy(2/5, 3/5) = -2/5 log(2/5) - 3/5 log(3/5) = 0.971 bits
• "Outlook" = "Overcast":
info([4,0]) = entropy(1, 0) = -1 log(1) - 0 log(0) = 0 bits
(Note: 0 log(0) is normally not defined; it is taken as 0 here.)
• "Outlook" = "Rainy":
info([3,2]) = entropy(3/5, 2/5) = -3/5 log(3/5) - 2/5 log(2/5) = 0.971 bits
• Expected information for the attribute:
info([3,2],[4,0],[3,2]) = (5/14) × 0.971 + (4/14) × 0 + (5/14) × 0.971 = 0.693 bits
DATA MINING VESIT M.VIJAYALAKSHMI
49
Computing the information gain
• Information gain: information before splitting –
information after splitting
gain("Outlook") = info([9,5]) - info([2,3],[4,0],[3,2]) = 0.940 - 0.693 = 0.247 bits
• Information gain for attributes from weather data:
gain("Outlook") = 0.247 bits
gain("Temperature") = 0.029 bits
gain("Humidity") = 0.152 bits
gain("Windy") = 0.048 bits
DATA MINING VESIT M.VIJAYALAKSHMI
50
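The entropy and gain figures quoted above can be re-derived with a short script. The sketch below is illustrative only (the names info and gain are mine); it recomputes the four gains from the 14-example training set.

from math import log2
from collections import Counter

# The 14 training examples as (Outlook, Temperature, Humidity, Windy, Class).
data = [
    ("sunny", "hot", "high", "false", "N"), ("sunny", "hot", "high", "true", "N"),
    ("overcast", "hot", "high", "false", "P"), ("rain", "mild", "high", "false", "P"),
    ("rain", "cool", "normal", "false", "P"), ("rain", "cool", "normal", "true", "N"),
    ("overcast", "cool", "normal", "true", "P"), ("sunny", "mild", "high", "false", "N"),
    ("sunny", "cool", "normal", "false", "P"), ("rain", "mild", "normal", "false", "P"),
    ("sunny", "mild", "normal", "true", "P"), ("overcast", "mild", "high", "true", "P"),
    ("overcast", "hot", "normal", "false", "P"), ("rain", "mild", "normal", "true", "N"),
]

def info(rows):
    """I(p, n): entropy of the class distribution in a set of rows."""
    counts = Counter(r[-1] for r in rows)
    return -sum(c / len(rows) * log2(c / len(rows)) for c in counts.values())

def gain(rows, attr):
    """Gain(A) = I(p, n) - E(A) for the attribute at index attr."""
    expected = 0.0
    for value in set(r[attr] for r in rows):
        subset = [r for r in rows if r[attr] == value]
        expected += len(subset) / len(rows) * info(subset)   # contribution to E(A)
    return info(rows) - expected

for i, name in enumerate(["Outlook", "Temperature", "Humidity", "Windy"]):
    print(name, round(gain(data, i), 3))
# Outlook 0.247, Temperature 0.029, Humidity 0.152, Windy 0.048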
Continuing to split
gain("Temperature") = 0.571 bits
gain("Humidity") = 0.971 bits
gain("Windy") = 0.020 bits
DATA MINING VESIT M.VIJAYALAKSHMI
51
The final decision tree
• Note: not all leaves need to be pure; sometimes identical
instances have different classes
⇒ Splitting stops when data can't be split any further
DATA MINING VESIT M.VIJAYALAKSHMI
52
Avoid Overfitting in Classification
• The generated tree may overfit the training data
– Too many branches, some may reflect anomalies due to noise
or outliers
– The result is poor accuracy for unseen samples
• Two approaches to avoid overfitting
– Prepruning: Halt tree construction early—do not split a
node if this would result in the goodness measure falling
below a threshold
• Difficult to choose an appropriate threshold
– Postpruning: Remove branches from a “fully grown”
tree—get a sequence of progressively pruned trees
• Use a set of data different from the training data to decide
which is the “best pruned tree”
DATA MINING VESIT M.VIJAYALAKSHMI
53
Data Mining Algorithms
Clustering
Examples of Clustering Applications
• Marketing: Help marketers discover distinct groups in their
customer bases, and then use this knowledge to develop targeted
marketing programs
• Land use: Identification of areas of similar land use in an earth
observation database
• Insurance: Identifying groups of motor insurance policy holders
with a high average claim cost
• City-planning: Identifying groups of houses according to their
house type, value, and geographical location
• Earth-quake studies: Observed earth quake epicenters should
be clustered along continent faults
DATA MINING VESIT M.VIJAYALAKSHMI
56
Clustering vs. Classification
• No prior knowledge
– Number of clusters
– Meaning of clusters
– Cluster results are dynamic
• Unsupervised learning
DATA MINING VESIT M.VIJAYALAKSHMI
57
Clustering
Unsupervised learning:
Finds “natural” grouping of
instances given un-labeled data
DATA MINING VESIT M.VIJAYALAKSHMI
58
Clustering Methods
• Many different methods and algorithms:
– For numeric and/or symbolic data
– Deterministic vs. probabilistic
– Exclusive vs. overlapping
– Hierarchical vs. flat
– Top-down vs. bottom-up
DATA MINING VESIT M.VIJAYALAKSHMI
59
Clustering Issues
• Outlier handling
• Dynamic data
• Interpreting results
• Evaluating results
• Number of clusters
• Data to be used
• Scalability
DATA MINING VESIT M.VIJAYALAKSHMI
60
Clustering Evaluation
• Manual inspection
• Benchmarking on existing labels
• Cluster quality measures
– distance measures
– high similarity within a cluster, low across
clusters
DATA MINING VESIT M.VIJAYALAKSHMI
61
Measure the Quality of Clustering
• Dissimilarity/Similarity metric: Similarity is expressed in terms
of a distance function, which is typically metric: d(i, j)
• There is a separate “quality” function that measures the
“goodness” of a cluster.
• The definitions of distance functions are usually very different
for interval-scaled, boolean, categorical, ordinal and ratio
variables.
• Weights should be associated with different variables based on
applications and data semantics.
• It is hard to define “similar enough” or “good enough”
– the answer is typically highly subjective.
DATA MINING VESIT M.VIJAYALAKSHMI
62
Type of data in clustering analysis
• Interval-scaled variables:
• Binary variables:
• Nominal, ordinal, and ratio variables:
• Variables of mixed types:
DATA MINING VESIT M.VIJAYALAKSHMI
63
Similarity and Dissimilarity
Between Objects
• Distances are normally used to measure the similarity
or dissimilarity between two data objects
• Some popular ones include: Minkowski distance:
d(i, j) = [ |xi1 - xj1|^q + |xi2 - xj2|^q + … + |xip - xjp|^q ]^(1/q)
where i = (xi1, xi2, …, xip) and j = (xj1, xj2, …, xjp) are two p-dimensional
data objects, and q is a positive integer
• If q = 1, d is the Manhattan distance:
d(i, j) = |xi1 - xj1| + |xi2 - xj2| + … + |xip - xjp|
DATA MINING VESIT M.VIJAYALAKSHMI
64
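As a direct translation of the formulas above, a minimal distance helper might look like this (an illustrative sketch; the function name is mine):

def minkowski(x, y, q=2):
    """Minkowski distance between two p-dimensional points x and y.

    q = 1 gives the Manhattan distance, q = 2 the Euclidean distance.
    """
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1.0 / q)

print(minkowski((1, 2), (4, 6), q=1))   # Manhattan: 3 + 4 = 7.0
print(minkowski((1, 2), (4, 6), q=2))   # Euclidean: 5.0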
Clustering Problem
• Given a database D={t1,t2,…,tn} of tuples
and an integer value k, the Clustering
Problem is to define a mapping
f: D → {1,…,k} where each ti is assigned to
one cluster Kj, 1 <= j <= k.
• A Cluster, Kj, contains precisely those
tuples mapped to it.
• Unlike classification problem, clusters are
not known a priori.
DATA MINING VESIT M.VIJAYALAKSHMI
65
Types of Clustering
• Hierarchical – Nested set of clusters
created.
• Partitional – One set of clusters created.
• Incremental – Each element handled one at
a time.
• Simultaneous – All elements handled
together.
• Overlapping/Non-overlapping
DATA MINING VESIT M.VIJAYALAKSHMI
66
Clustering Approaches
• Hierarchical: Agglomerative, Divisive
• Partitional
• Categorical
• Large DB: Sampling, Compression
DATA MINING VESIT M.VIJAYALAKSHMI
67
Cluster Parameters
DATA MINING VESIT M.VIJAYALAKSHMI
68
Distance Between Clusters
• Single Link: smallest distance between points
• Complete Link: largest distance between points
• Average Link: average distance between points
• Centroid: distance between centroids
DATA MINING VESIT M.VIJAYALAKSHMI
69
Hierarchical Clustering
• Clusters are created in levels actually creating sets
of clusters at each level.
• Agglomerative
– Initially each item in its own cluster
– Iteratively clusters are merged together
– Bottom Up
• Divisive
– Initially all items in one cluster
– Large clusters are successively divided
– Top Down
DATA MINING VESIT M.VIJAYALAKSHMI
70
Hierarchical Clustering
• Use distance matrix as clustering criteria. This
method does not require the number of clusters k as an
input, but needs a termination condition
[Figure: five objects a, b, c, d, e. Agglomerative clustering (AGNES) runs from Step 0 to Step 4: a and b merge into ab, d and e merge into de, c joins de to form cde, and finally ab and cde merge into abcde. Divisive clustering (DIANA) performs the same steps in reverse order.]
DATA MINING VESIT M.VIJAYALAKSHMI
71
Dendrogram
• A tree data structure which
illustrates hierarchical
clustering techniques.
• Each level shows clusters for
that level.
– Leaf – individual clusters
– Root – one cluster
• A cluster at level i is the
union of its children clusters
at level i+1.
DATA MINING VESIT M.VIJAYALAKSHMI
72
A Dendrogram Shows How the Clusters are
Merged Hierarchically
Decompose data objects into several levels of nested
partitioning (tree of clusters), called a dendrogram.
A clustering of the data objects is obtained by cutting the
dendrogram at the desired level, then each connected
component forms a cluster.
DATA MINING VESIT M.VIJAYALAKSHMI
73
DIANA (Divisive Analysis)
• Implemented in statistical analysis packages, e.g., Splus
• Inverse order of AGNES
• Eventually each node forms a cluster on its own
DATA MINING VESIT M.VIJAYALAKSHMI
74
Partitional Clustering
• Nonhierarchical
• Creates clusters in one step as opposed to
several steps.
• Since only one set of clusters is output, the
user normally has to input the desired
number of clusters, k.
• Usually deals with static sets.
DATA MINING VESIT M.VIJAYALAKSHMI
75
K-Means
• Initial set of clusters randomly chosen.
• Iteratively, items are moved among sets of
clusters until the desired set is reached.
• High degree of similarity among elements
in a cluster is obtained.
• Given a cluster Ki={ti1,ti2,…,tim}, the cluster
mean is mi = (1/m)(ti1 + … + tim)
DATA MINING VESIT M.VIJAYALAKSHMI
76
K-Means Example
• Given: {2,4,10,12,3,20,30,11,25}, k=2
• Randomly assign means: m1=3,m2=4
• K1={2,3}, K2={4,10,12,20,30,11,25},
m1=2.5,m2=16
• K1={2,3,4},K2={10,12,20,30,11,25},
m1=3,m2=18
• K1={2,3,4,10},K2={12,20,30,11,25},
m1=4.75,m2=19.6
• K1={2,3,4,10,11,12},K2={20,30,25},
m1=7,m2=25
• Stop as the clusters with these means are the
same.
DATA MINING VESIT M.VIJAYALAKSHMI
77
The K-Means Clustering Method
• Given k, the k-means algorithm is implemented in
4 steps:
– Partition objects into k nonempty subsets
– Compute seed points as the centroids of the
clusters of the current partition. The centroid is
the center (mean point) of the cluster.
– Assign each object to the cluster with the
nearest seed point.
– Go back to Step 2, stop when no more new
assignment.
DATA MINING VESIT M.VIJAYALAKSHMI
78
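A minimal one-dimensional k-means sketch, seeded with the initial means 3 and 4 used in the example two slides back, reproduces the final clusters K1={2,3,4,10,11,12} and K2={20,25,30}. It is illustrative only; names and structure are my own.

def kmeans_1d(points, means, max_iter=100):
    """Tiny 1-D k-means: assign points to the nearest mean, then recompute the means."""
    for _ in range(max_iter):
        clusters = [[] for _ in means]
        for x in points:                                    # assignment step
            nearest = min(range(len(means)), key=lambda i: abs(x - means[i]))
            clusters[nearest].append(x)
        new_means = [sum(c) / len(c) if c else means[i]     # update step
                     for i, c in enumerate(clusters)]
        if new_means == means:                              # stop when nothing changes
            break
        means = new_means
    return clusters, means

clusters, means = kmeans_1d([2, 4, 10, 12, 3, 20, 30, 11, 25], [3.0, 4.0])
print(clusters, means)
# [[2, 4, 10, 12, 3, 11], [20, 30, 25]] [7.0, 25.0]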
Comments on the K-Means Method
• Strength
– Relatively efficient: O(tkn), where n is # objects, k is # clusters,
and t is # iterations. Normally, k, t << n.
– Often terminates at a local optimum. The global optimum may
be found using techniques such as: deterministic annealing and
genetic algorithms
• Weakness
– Applicable only when mean is defined, then what about
categorical data?
– Need to specify k, the number of clusters, in advance
– Unable to handle noisy data and outliers
– Not suitable to discover clusters with non-convex shapes
DATA MINING VESIT M.VIJAYALAKSHMI
79
The K-Medoids Clustering Method
• Find representative objects, called medoids, in clusters
• PAM (Partitioning Around Medoids)
– starts from an initial set of medoids and iteratively replaces
one of the medoids by one of the non-medoids if it improves
the total distance of the resulting clustering
– Handles outliers well.
– Ordering of input does not impact results.
– Does not scale well.
– Each cluster represented by one item, called the medoid.
– Initial set of k medoids randomly chosen.
• PAM works effectively for small data sets, but does not scale
well for large data sets
DATA MINING VESIT M.VIJAYALAKSHMI
80
PAM (Partitioning Around Medoids)
• PAM - Use real object to represent the cluster
– Select k representative objects arbitrarily
– For each pair of non-selected object h and selected object i,
calculate the total swapping cost TCih
– For each pair of i and h,
• If TCih < 0, i is replaced by h
• Then assign each non-selected object to the most similar
representative object
– repeat steps 2-3 until there is no change
DATA MINING VESIT M.VIJAYALAKSHMI
81
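The swap-based idea behind PAM can be sketched as follows. This is an illustrative, simplified implementation (it recomputes the full clustering cost rather than the incremental swapping cost TCih, and all names are mine), not the exact procedure from the slide.

def total_cost(points, medoids, dist):
    """Sum of each point's distance to its nearest medoid."""
    return sum(min(dist(p, m) for m in medoids) for p in points)

def pam(points, k, dist=lambda a, b: abs(a - b)):
    """Partitioning Around Medoids by greedy medoid/non-medoid swapping."""
    medoids = list(points[:k])                          # arbitrary initial medoids
    best = total_cost(points, medoids, dist)
    improved = True
    while improved:
        improved = False
        for i in range(k):                              # selected object i
            for h in points:                            # non-selected object h
                if h in medoids:
                    continue
                candidate = medoids[:i] + [h] + medoids[i + 1:]
                cost = total_cost(points, candidate, dist)
                if cost < best:                         # i.e. swapping cost TCih < 0
                    medoids, best, improved = candidate, cost, True
    # assign each point to its most similar medoid
    clusters = [[p for p in points if min(medoids, key=lambda x: dist(p, x)) == m]
                for m in medoids]
    return medoids, clusters

print(pam([2, 4, 10, 12, 3, 20, 30, 11, 25], k=2))   # two medoids and their clusters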
PAM
DATA MINING VESIT M.VIJAYALAKSHMI
82
DATA MINING
ASSOCIATION RULES
Example: Market Basket Data
• Items frequently purchased together:
Computer → Printer
• Uses:
– Placement
– Advertising
– Sales
– Coupons
• Objective: increase sales and reduce costs
• Called Market Basket Analysis, Shopping Cart
Analysis
DATA MINING VESIT M.VIJAYALAKSHMI
84
Transaction Data: Supermarket Data
• Market basket transactions:
t1: {bread, cheese, milk}
t2: {apple, jam, salt, ice-cream}
…
…
tn: {biscuit, jam, milk}
• Concepts:
– An item: an item/article in a basket
– I: the set of all items sold in the store
– A Transaction: items purchased in a basket; it may
have TID (transaction ID)
– A Transactional dataset: A set of transactions
DATA MINING VESIT M.VIJAYALAKSHMI
85
Transaction Data: A Set Of Documents
• A text document data set. Each document is
treated as a “bag” of keywords
doc1: Student, Teach, School
doc2: Student, School
doc3: Teach, School, City, Game
doc4: Baseball, Basketball
doc5: Basketball, Player, Spectator
doc6: Baseball, Coach, Game, Team
doc7: Basketball, Team, City, Game
DATA MINING VESIT M.VIJAYALAKSHMI
86
Association Rule Definitions
• Association Rule (AR): implication X → Y
where X, Y ⊆ I and X ∩ Y = ∅
• Support of AR (s) X → Y: Percentage of
transactions that contain X ∪ Y
• Confidence of AR (α) X → Y: Ratio of the
number of transactions that contain X ∪ Y
to the number that contain X
DATA MINING VESIT M.VIJAYALAKSHMI
87
Association Rule Problem
• Given a set of items I={I1,I2,…,Im} and a
database of transactions D={t1,t2, …, tn} where
ti={Ii1,Ii2, …, Iik} and Iij ∈ I, the Association
Rule Problem is to identify all association rules
X → Y with a minimum support and
confidence.
• Link Analysis
• NOTE: Support of X → Y is the same as the support
of X ∪ Y.
DATA MINING VESIT M.VIJAYALAKSHMI
88
Association Rule Mining Task
• Given a set of transactions T, the goal of
association rule mining is to find all rules having
– support ≥ minsup threshold
– confidence ≥ minconf threshold
• Brute-force approach:
– List all possible association rules
– Compute the support and confidence for each rule
– Prune rules that fail the minsup and minconf
thresholds
⇒ Computationally prohibitive!
DATA MINING VESIT M.VIJAYALAKSHMI
89
Example
• Transaction data:
t1: Butter, Cocoa, Milk
t2: Butter, Cheese
t3: Cheese, Boots
t4: Butter, Cocoa, Cheese
t5: Butter, Cocoa, Clothes, Cheese, Milk
t6: Cocoa, Clothes, Milk
t7: Cocoa, Milk, Clothes
• Assume: minsup = 30%, minconf = 80%
• An example frequent itemset:
{Cocoa, Clothes, Milk}  [sup = 3/7]
• Association rules from the itemset:
Clothes → Milk, Cocoa  [sup = 3/7, conf = 3/3]
…
Clothes, Cocoa → Milk  [sup = 3/7, conf = 3/3]
…
DATA MINING VESIT M.VIJAYALAKSHMI
90
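The support and confidence figures in this example follow directly from the definitions above; a small illustrative check in Python (names are mine):

transactions = [
    {"Butter", "Cocoa", "Milk"},
    {"Butter", "Cheese"},
    {"Cheese", "Boots"},
    {"Butter", "Cocoa", "Cheese"},
    {"Butter", "Cocoa", "Clothes", "Cheese", "Milk"},
    {"Cocoa", "Clothes", "Milk"},
    {"Cocoa", "Milk", "Clothes"},
]

def support(itemset):
    """Fraction of transactions containing every item of `itemset`."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(lhs, rhs):
    """support(lhs ∪ rhs) / support(lhs)."""
    return support(lhs | rhs) / support(lhs)

print(support({"Cocoa", "Clothes", "Milk"}))        # 3/7 ≈ 0.43, above minsup = 30%
print(confidence({"Clothes"}, {"Milk", "Cocoa"}))   # 3/3 = 1.0, above minconf = 80%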
Mining Association Rules
• Two-step approach:
1. Frequent Itemset Generation
– Generate all itemsets whose support ≥ minsup
2. Rule Generation
– Generate high confidence rules from each
frequent itemset, where each rule is a binary
partitioning of a frequent itemset
• Frequent itemset generation is still
computationally expensive
DATA MINING VESIT M.VIJAYALAKSHMI
91
Frequent Itemset Generation
• Brute-force approach:
– Each itemset in the lattice is a candidate frequent itemset
– Count the support of each candidate by scanning the
database
– Match each transaction against every candidate
– Complexity ~ O(NMw) => Expensive since M = 2^d !!!
(N = number of transactions, M = number of candidate itemsets, w = maximum transaction width)

TID  Items
1    Bread, Milk
2    Bread, Biscuit, FruitJuice, Eggs
3    Milk, Biscuit, FruitJuice, Coke
4    Bread, Milk, Biscuit, FruitJuice
5    Bread, Milk, Biscuit, Coke
DATA MINING VESIT M.VIJAYALAKSHMI
92
Reducing Number of Candidates
• Apriori principle:
– If an itemset is frequent, then all of its subsets must
also be frequent
• Apriori principle holds due to the following
property of the support measure:
∀ X, Y: (X ⊆ Y) ⇒ s(X) ≥ s(Y)
– Support of an itemset never exceeds the support of
its subsets
DATA MINING VESIT M.VIJAYALAKSHMI
93
Illustrating Apriori Principle
[Figure: the candidate itemset lattice over items A, B, C, D, E, from the empty set (null) at the top down to ABCDE at the bottom. Once an itemset is found to be infrequent, all of its supersets are pruned from the search space.]
DATA MINING VESIT M.VIJAYALAKSHMI
94
Illustrating Apriori Principle
Minimum Support = 3

Items (1-itemsets):
Item        Count
Bread       4
Coke        2
Milk        4
FruitJuice  3
Biscuit     4
Eggs        1

Pairs (2-itemsets) (no need to generate candidates involving Coke or Eggs):
Itemset               Count
{Bread,Milk}          3
{Bread,FruitJuice}    2
{Bread,Biscuit}       3
{Milk,FruitJuice}     2
{Milk,Biscuit}        3
{FruitJuice,Biscuit}  3

Triplets (3-itemsets):
Itemset                Count
{Bread,Milk,Biscuit}   3

If every subset is considered, 6C1 + 6C2 + 6C3 = 41 candidates;
with support-based pruning, 6 + 6 + 1 = 13.
DATA MINING VESIT M.VIJAYALAKSHMI
95
Apriori Algorithm
• Let k=1
• Generate frequent itemsets of length 1
• Repeat until no new frequent itemsets are
identified
– Generate length (k+1) candidate itemsets from
length k frequent itemsets
– Prune candidate itemsets containing subsets of
length k that are infrequent
– Count the support of each candidate by scanning the
DB
– Eliminate candidates that are infrequent, leaving
only those that are frequent
DATA MINING VESIT M.VIJAYALAKSHMI
96
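The pseudocode above maps onto a compact level-wise implementation. The sketch below is illustrative (function names are mine, and candidates are generated by simple unions rather than the usual prefix join); run on the dataset of the next slide it yields the same F1, F2 and F3.

from itertools import combinations

def apriori(transactions, minsup):
    """Level-wise frequent itemset mining (a simple Apriori sketch)."""
    n = len(transactions)
    support = lambda items: sum(items <= t for t in transactions) / n

    items = {i for t in transactions for i in t}
    frequent = {frozenset([i]) for i in items if support(frozenset([i])) >= minsup}
    all_frequent, k = set(frequent), 1
    while frequent:
        # join: length-(k+1) candidates from pairs of frequent k-itemsets
        candidates = {a | b for a in frequent for b in frequent if len(a | b) == k + 1}
        # prune: drop candidates having an infrequent length-k subset
        candidates = {c for c in candidates
                      if all(frozenset(s) in frequent for s in combinations(c, k))}
        # count support by scanning the transactions; keep the frequent ones
        frequent = {c for c in candidates if support(c) >= minsup}
        all_frequent |= frequent
        k += 1
    return all_frequent

T = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
for itemset in sorted(apriori(T, minsup=0.5), key=lambda s: (len(s), sorted(s))):
    print(sorted(itemset))   # {1}, {2}, {3}, {5}, {1,3}, {2,3}, {2,5}, {3,5}, {2,3,5}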
Example – Finding Frequent Itemsets
minsup = 0.5 (itemset:count)

Dataset T:
TID   Items
T100  1, 3, 4
T200  2, 3, 5
T300  1, 2, 3, 5
T400  2, 5

1. scan T → C1: {1}:2, {2}:3, {3}:3, {4}:1, {5}:3
   → F1: {1}:2, {2}:3, {3}:3, {5}:3
   → C2: {1,2}, {1,3}, {1,5}, {2,3}, {2,5}, {3,5}
2. scan T → C2: {1,2}:1, {1,3}:2, {1,5}:1, {2,3}:2, {2,5}:3, {3,5}:2
   → F2: {1,3}:2, {2,3}:2, {2,5}:3, {3,5}:2
   → C3: {2,3,5}
3. scan T → C3: {2,3,5}:2 → F3: {2,3,5}
DATA MINING VESIT M.VIJAYALAKSHMI
97
Apriori Adv/Disadv
• Advantages:
– Uses large itemset property.
– Easily parallelized
– Easy to implement.
• Disadvantages:
– Assumes transaction database is memory resident.
– Requires up to m database scans.
DATA MINING VESIT M.VIJAYALAKSHMI
98
Step 2: Generating Rules From Frequent
Itemsets
• Frequent itemsets → association rules
• One more step is needed to generate association
rules
• For each frequent itemset X,
For each proper nonempty subset A of X,
– Let B = X - A
– A → B is an association rule if
• Confidence(A → B) ≥ minconf,
support(A → B) = support(A ∪ B) = support(X)
confidence(A → B) = support(A ∪ B) / support(A)
DATA MINING VESIT M.VIJAYALAKSHMI
99
Generating Rules: An example
• Suppose {2,3,4} is frequent, with sup=50%
– Proper nonempty subsets: {2,3}, {2,4}, {3,4}, {2}, {3}, {4},
with sup=50%, 50%, 75%, 75%, 75%, 75% respectively
– These generate the following association rules:
• 2,3 → 4, confidence = 100%
• 2,4 → 3, confidence = 100%
• 3,4 → 2, confidence = 67%
• 2 → 3,4, confidence = 67%
• 3 → 2,4, confidence = 67%
• 4 → 2,3, confidence = 67%
• All rules have support = 50%
DATA MINING VESIT M.VIJAYALAKSHMI
100
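This rule-generation step can be sketched as follows, taking the supports quoted above as given (an illustrative sketch; names are mine):

from itertools import combinations

# Supports from the slide, as fractions of transactions.
support = {
    frozenset({2, 3, 4}): 0.50,
    frozenset({2, 3}): 0.50, frozenset({2, 4}): 0.50, frozenset({3, 4}): 0.75,
    frozenset({2}): 0.75, frozenset({3}): 0.75, frozenset({4}): 0.75,
}

def rules_from(itemset, minconf):
    """Yield every rule A -> B with A a proper nonempty subset of itemset."""
    itemset = frozenset(itemset)
    for r in range(1, len(itemset)):
        for antecedent in map(frozenset, combinations(itemset, r)):
            conf = support[itemset] / support[antecedent]
            if conf >= minconf:
                yield sorted(antecedent), sorted(itemset - antecedent), conf

for lhs, rhs, conf in rules_from({2, 3, 4}, minconf=0.0):
    print(f"{lhs} -> {rhs}  confidence = {conf:.0%}")
# e.g. [2, 3] -> [4] confidence = 100%, [3, 4] -> [2] confidence = 67%, ...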
Rule Generation
• Given a frequent itemset L, find all non-empty
subsets f ⊂ L such that f → L – f satisfies the
minimum confidence requirement
– If {A,B,C,D} is a frequent itemset, candidate rules:
ABC → D,  ABD → C,  ACD → B,  BCD → A,
AB → CD,  AC → BD,  AD → BC,  BC → AD,
BD → AC,  CD → AB,
A → BCD,  B → ACD,  C → ABD,  D → ABC
• If |L| = k, then there are 2^k – 2 candidate
association rules (ignoring L → ∅ and ∅ → L)
DATA MINING VESIT M.VIJAYALAKSHMI
101
Generating Rules
• To recap, in order to obtain A → B, we need
to have support(A ∪ B) and support(A)
• All the required information for confidence
computation has already been recorded in
itemset generation. No need to see the data T
any more.
• This step is not as time-consuming as frequent
itemsets generation.
DATA MINING VESIT M.VIJAYALAKSHMI
102
Rule Generation
• How to efficiently generate rules from
frequent itemsets?
– In general, confidence does not have an anti-monotone property:
c(ABC → D) can be larger or smaller than c(AB → D)
– But confidence of rules generated from the
same itemset has an anti-monotone property
– e.g., L = {A,B,C,D}:
c(ABC → D) ≥ c(AB → CD) ≥ c(A → BCD)
DATA MINING VESIT M.VIJAYALAKSHMI
103
Rule Generation for Apriori Algorithm
[Figure: the lattice of rules generated from the frequent itemset {A,B,C,D}, from ABCD=>{} at the top down to A=>BCD, B=>ACD, C=>ABD and D=>ABC at the bottom. If CD=>AB is found to be a low-confidence rule, the rules below it whose consequents contain AB (e.g., D=>ABC, C=>ABD) are pruned.]
DATA MINING VESIT M.VIJAYALAKSHMI
104
Rule Generation for Apriori Algorithm
• Candidate rule is generated by merging two rules
that share the same prefix in the rule consequent
• Join (CD=>AB, BD=>AC)
would produce the candidate
rule D => ABC
• Prune rule D=>ABC if its
subset AD=>BC does not have
high confidence
DATA MINING VESIT M.VIJAYALAKSHMI
105
APriori - Performance Bottlenecks
• The core of the Apriori algorithm:
– Use frequent (k – 1)-itemsets to generate candidate frequent
k-itemsets
– Use database scan and pattern matching to collect counts for
the candidate itemsets
• Bottleneck of Apriori: candidate generation
– Huge candidate sets:
• 10^4 frequent 1-itemsets will generate 10^7 candidate 2-itemsets
• To discover a frequent pattern of size 100, e.g., {a1, a2,
…, a100}, one needs to generate 2^100 ≈ 10^30 candidates.
– Multiple scans of database:
• Needs (n + 1) scans, where n is the length of the longest
pattern
DATA MINING VESIT M.VIJAYALAKSHMI
106
Mining Frequent Patterns
Without Candidate Generation
• Compress a large database into a compact, Frequent-Pattern tree (FP-tree) structure
– highly condensed, but complete for frequent pattern mining
– avoid costly database scans
• Develop an efficient, FP-tree-based frequent pattern
mining method
– A divide-and-conquer methodology: decompose mining
tasks into smaller ones
– Avoid candidate generation: sub-database test only!
DATA MINING VESIT M.VIJAYALAKSHMI
107
Construct FP-tree From A Transaction DB
min_support = 0.5

TID  Items bought                (ordered) frequent items
100  {f, a, c, d, g, i, m, p}    {f, c, a, m, p}
200  {a, b, c, f, l, m, o}       {f, c, a, b, m}
300  {b, f, h, j, o}             {f, b}
400  {b, c, k, s, p}             {c, b, p}
500  {a, f, c, e, l, p, m, n}    {f, c, a, m, p}

Steps:
1. Scan DB once, find frequent 1-itemset (single item pattern)
2. Order frequent items in frequency descending order
3. Scan DB again, construct FP-tree

Header Table (item : frequency, with head of node-links): f:4, c:4, a:3, b:3, m:3, p:3

[Figure: the resulting FP-tree rooted at {}. The main branch is f:4 -> c:3 -> a:3 -> m:2 -> p:2, with side branches b:1 -> m:1 under a:3 and b:1 under f:4; a second branch from the root is c:1 -> b:1 -> p:1.]
DATA MINING VESIT M.VIJAYALAKSHMI
108
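The three construction steps can be sketched in a few lines of Python: count item frequencies, re-order each transaction by descending frequency, and insert it into a prefix tree whose node counts are incremented along the way. This is an illustrative sketch only; it omits the header table's node-links, and items with equal counts may be ordered differently than on the slide, which changes the tree shape but not the patterns mined.

from collections import Counter

class FPNode:
    def __init__(self, item=None):
        self.item, self.count, self.children = item, 0, {}

def build_fp_tree(transactions, min_count):
    # Step 1: one scan to find the frequent items and their counts
    counts = Counter(i for t in transactions for i in t)
    frequent = {i: c for i, c in counts.items() if c >= min_count}
    root = FPNode()
    # Steps 2-3: second scan; insert each transaction's frequent items,
    # ordered by descending frequency, sharing common prefixes in the tree
    for t in transactions:
        ordered = sorted((i for i in t if i in frequent),
                         key=lambda i: (-frequent[i], i))
        node = root
        for item in ordered:
            node = node.children.setdefault(item, FPNode(item))
            node.count += 1
    return root, frequent

transactions = [list("facdgimp"), list("abcflmo"), list("bfhjo"),
                list("bcksp"), list("afcelpmn")]
root, header = build_fp_tree(transactions, min_count=3)   # min_support = 0.5 of 5
print(header)   # the six frequent items f, c, a, b, m, p with counts 4, 4, 3, 3, 3, 3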
Benefits of the FP-tree Structure
• Completeness:
– never breaks a long pattern of any transaction
– preserves complete information for frequent pattern mining
• Compactness
– reduce irrelevant information—infrequent items are gone
– frequency descending ordering: more frequent items are more
likely to be shared
– never be larger than the original database (not counting node-links and counts)
DATA MINING VESIT M.VIJAYALAKSHMI
109
Mining Frequent Patterns Using FP-tree
• General idea (divide-and-conquer)
– Recursively grow frequent pattern path using the FP-tree
• Method
– For each item, construct its conditional pattern-base, and
then its conditional FP-tree
– Repeat the process on each newly created conditional FP-tree
– Until the resulting FP-tree is empty, or it contains only one
path (single path will generate all the combinations of its sub-paths,
each of which is a frequent pattern)
DATA MINING VESIT M.VIJAYALAKSHMI
110
Major Steps to Mine FP-tree
1) Construct conditional pattern base for each node in
the FP-tree
2) Construct conditional FP-tree from each conditional
pattern-base
3) Recursively mine conditional FP-trees and grow
frequent patterns obtained so far
• If the conditional FP-tree contains a single path, simply
enumerate all the patterns
DATA MINING VESIT M.VIJAYALAKSHMI
111
Step 1: FP-tree to Conditional Pattern Base
• Starting at the frequent header table in the FP-tree
• Traverse the FP-tree by following the link of each frequent item
• Accumulate all of transformed prefix paths of that item to form
a conditional pattern base
[Figure: the FP-tree from the previous slide, with its header table (f:4, c:4, a:3, b:3, m:3, p:3). Following the node-links of each item yields its conditional pattern base.]

Conditional pattern bases:
item  cond. pattern base
c     f:3
a     fc:3
b     fca:1, f:1, c:1
m     fca:2, fcab:1
p     fcam:2, cb:1
DATA MINING VESIT M.VIJAYALAKSHMI
112
Step 2: Construct Conditional FP-tree
• For each pattern-base
– Accumulate the count for each item in the base
– Construct the FP-tree for the frequent items of the pattern
base
m-conditional pattern base: fca:2, fcab:1

[Figure: in the FP-tree (header table f:4, c:4, a:3, b:3, m:3, p:3), accumulating the counts along the m-conditional pattern base and keeping only the frequent items gives the m-conditional FP-tree: the single path {} -> f:3 -> c:3 -> a:3.]

All frequent patterns concerning m:
m, fm, cm, am, fcm, fam, cam, fcam
DATA MINING VESIT M.VIJAYALAKSHMI
113
Mining Frequent Patterns by Creating
Conditional Pattern-Bases
Item  Conditional pattern-base      Conditional FP-tree
p     {(fcam:2), (cb:1)}            {(c:3)}|p
m     {(fca:2), (fcab:1)}           {(f:3, c:3, a:3)}|m
b     {(fca:1), (f:1), (c:1)}       Empty
a     {(fc:3)}                      {(f:3, c:3)}|a
c     {(f:3)}                       {(f:3)}|c
f     Empty                         Empty
DATA MINING VESIT M.VIJAYALAKSHMI
114
Step 3: Recursively mine the conditional
FP-tree
Starting from the m-conditional FP-tree (the path {} -> f:3 -> c:3 -> a:3):
• Cond. pattern base of "am": (fc:3)  ->  am-conditional FP-tree: {} -> f:3 -> c:3
• Cond. pattern base of "cm": (f:3)   ->  cm-conditional FP-tree: {} -> f:3
• Cond. pattern base of "cam": (f:3)  ->  cam-conditional FP-tree: {} -> f:3
DATA MINING VESIT M.VIJAYALAKSHMI
115
Single FP-tree Path Generation
• Suppose an FP-tree T has a single path P
• The complete set of frequent patterns of T can be generated by
enumeration of all the combinations of the sub-paths of P
Example: the m-conditional FP-tree is the single path {} -> f:3 -> c:3 -> a:3, so all
frequent patterns concerning m are:
m, fm, cm, am, fcm, fam, cam, fcam
DATA MINING VESIT M.VIJAYALAKSHMI
116
Why Is Frequent Pattern Growth
Fast?
• Performance study shows
– FP-growth is an order of magnitude faster than Apriori,
and is also faster than tree-projection
• Reasoning
– No candidate generation, no candidate test
– Use compact data structure
– Eliminate repeated database scan
– Basic operation is counting and FP-tree building
DATA MINING VESIT M.VIJAYALAKSHMI
117