INTRODUCTION TO
DATA MINING
Pinakpani Pal
Electronics & Communication Sciences Unit
Indian Statistical Institute
[email protected]
Main Sources
• Data Mining Concepts and Techniques –Jiawei Han and
Micheline Kamber, 2007
• Handbook of Data Mining and Discovery- Willi Klosgen
and Jan M Zytkow, 2002
• Fast algorithms for mining association rules and
sequential patterns – R.Srikant, Ph.D. Thesis at the
University of Wisconsin-Madison, 1996.
• “Parallel & distributed association mining: a survey,” –
M. J. Zaki, IEEE Concurrency, 7(4), pp.14-25, 1999.
Introduction to Data Mining
2
Prelude
• Data Mining is a method of finding interesting
trends or patterns in large datasets.
• Data collection may be incomplete, heterogeneous
and historical.
• Since data volume is very large, efficiency and
scalability are two very important criteria for data
mining algorithms.
• Data Mining tools are expected to involve
minimal user intervention.
Introduction to Data Mining
3
Prelude
• Data mining deals with finding patterns in data that are
– user-defined (pre-defined by the user),
– interesting (judged with the help of an interestingness measure), or
– valid (validity criteria pre-defined).
• Discovered patterns help and guide the appropriate
authority in taking future decisions. So, Data
Mining is regarded as a tool for Decision Support.
Introduction to Data Mining
4
Data Mining Communities
• Statistics: Provides the background for the
algorithms.
• Artificial Intelligence: Provides the required
heuristics for machine learning / conceptual
clustering.
• Database: Provides the platform for storage and
retrieval of raw and summary data.
Introduction to Data Mining
5
Data Mining
Mining knowledge from Large amounts of Data.
Evolution:
• Data collection
• Database creation
• Data management
– Data storage
– Retrieval
– Transaction processing
Introduction to Data Mining
6
Data Mining
• Advanced data analysis: data warehousing and data mining
Introduction to Data Mining
7
Data Mining Components
Information Repository: single or multiple
heterogeneous data source
Data Server: storing or retrieving relevant data
Knowledge base: concept hierarchies, constraints,
thresholds, metadata
Pattern Extraction : characterization, discrimination,
association, classification, prediction, clustering,
various statistical analysis
Pattern Evaluation: interestingness measures
Introduction to Data Mining
8
Stages of the Data Mining Process
Misconception: Data mining systems can
autonomously dig out all of the valuable knowledge
from a given large database, without human
intervention.
Steps:
• [Data Collection]
– web crawling / warehousing
Introduction to Data Mining
9
Stages of the Data Mining Process
Steps (contd.):
• Data Preprocessing & Feature Extraction
– Data cleaning: elimination of erroneous and irrelevant
data
– Data Integration: from multiple sources
– Data selection / reduction: to accept only the interesting
attributes of the data according to the problem domain.
– Data transformation: normalization, aggregation
Introduction to Data Mining
10
Stages of the Data Mining Process
Steps (contd.):
• Pattern Extraction & Evaluation
– Identification of data mining primitives and interestingness measures is done at this stage.
• Visualization of data
– Making it easily understandable
• Evaluation of results
– Not every fact discovered by the software is useful to human beings!
Introduction to Data Mining
11
Data Preprocessing
Data Cleaning: Data may be incomplete, noisy and
inconsistent. Attempts are made to identify outliers
to smooth out noise, fill in missing values and
correct inconsistencies.
Introduction to Data Mining
12
Data Preprocessing
Data Integration: Data analysis may involve data
integration from different sources as in Data
Warehouse. The sources may include Databases,
Data cubes or flat files.
Introduction to Data Mining
13
Data Preprocessing
Data Reduction: Since both data volume and
attribute set may be too large, data reduction
becomes necessary. It includes activities like,
Removal of irrelevant and redundant attributes,
Data Compression and Aggregation or Generation
of Summary Data.
Introduction to Data Mining
14
Data Preprocessing
Transformation: Data need to be transformed or consolidated into forms suitable for mining. It may include activities like Generalization, Normalization (e.g. attribute values converted from absolute values to ranges), Construction of new attributes, etc.
Introduction to Data Mining
15
Patterns
• Descriptive – characterizing general properties of
the data
• Predictive – inference on the current data in order
to make predictions
• Discover:
– multiple kinds of patterns to accommodate different user
expectations (the user may specify hints to guide the search) / applications
– patterns at various granularities
Introduction to Data Mining
16
Frequent Patterns
Patterns that occur frequently in the data.
Types:
• Itemsets
• Subsequences
• Substructures (sub-graphs, sub-trees, sub-lattices)
Introduction to Data Mining
17
Discovery of Association Rules
The goal is to identify the features or items in a problem domain that tend to appear together. These features or items are said to be associated. The process is to find the set of all subsets of items or attributes that frequently occur in many database records or transactions and, additionally, to extract rules on how a subset of items influences the presence of another subset.
Introduction to Data Mining
18
Association Rule: Example
A user studying the buying habits of customers may
choose to mine association rules of the form:
P (X:customer,W) ^ Q (X,Y)  buys (X,Z)
[support=n%, confidence is m%]
Meta rules such as the following can be specified:
occupation(X, “student”) ^ age(X, “20...29”)  buys(X, “mobile”)
[1.4%, 70%]
Introduction to Data Mining
19
Association Rule: Single/Multi
Single-dimensional association rule:
buys(X, “computer”)  buys (X, “antivirus”)
[1.1%, 55%]
OR
“computer”  “antivirus” (A  B )
[1.1%, 55%]
Multi-dimensional association rule:
occupation(X, “student”) ^ age(X, “20...29”)  buys(X, “mobile”)
[1.4%, 70%]
Introduction to Data Mining
20
Metrics for Interestingness measures
Interestingness measures in knowledge discovery
help to identify the relevance of the patterns
discovered during the mining process.
Introduction to Data Mining
21
Interestingness measures
• Used to confine the number of uninteresting
patterns returned by the process.
• Based on the structure of patterns and statistics
underlying them.
• Associate a threshold which can be controlled by
the user
– patterns not meeting the threshold are not presented to
the user.
Introduction to Data Mining
22
Interestingness measures: objective
Objective measures of pattern interestingness:
• simplicity
• utility (support)
• certainty (confidence)
• novelty
Introduction to Data Mining
23
Interestingness measures: simplicity
Simplicity: a pattern's interestingness is based on its overall simplicity for human comprehension.
e.g. Rule length is a simplicity measure
Introduction to Data Mining
24
Interestingness measures: support
Utility (support): usefulness of a pattern
support(A ⇒ B) = P(A ∪ B)
The support for an association rule {A} ⇒ {B} is the % of all the transactions under analysis that contain this itemset.
Introduction to Data Mining
25
Interestingness measures: confidence
Certainty (confidence): Assesses the validity or trustworthiness of a pattern. Confidence is a certainty measure:
confidence(A ⇒ B) = P(B|A)
The confidence for an association rule {A} ⇒ {B} is the % of cases that follow the rule.
Association rules that satisfy both the confidence and support thresholds are referred to as strong association rules.
Introduction to Data Mining
26
Interestingness measures: novelty
Novelty: Patterns contributing new information to
the given pattern set are called novel patterns.
e.g: Data exception.
Removing redundant patterns is a strategy for
detecting novelty.
Introduction to Data Mining
27
Market Basket data analysis
Let, a transaction be defined as the variety of items
purchased by a customer in one visit, irrespective of
the quantity of each item purchased. The problem is
to find the items that a customer tends to buy
together.
Introduction to Data Mining
28
Market Basket data analysis
An association rule is an expression of the form
X ⇒ Y,
where X and Y are sets of items.
The intuitive meaning of the expression is, the
transactions that contain X tend to contain Y as
well. The inverse may not be true.
Since only presence or absence of items are considered and
not the quantity purchased, this type of rules are called
Binary Association Rules.
Introduction to Data Mining
29
Market Basket data analysis
Purpose is to study consumers’ purchase pattern in
departmental stores. Considering four possible
transactions,
1 - {Pen, Ink, Diary, Writing Pad}
2 - {Pen, Ink, Diary}
3 - {Pen, Diary}
4 - {Pen, Ink, Writing Pad}
Introduction to Data Mining
30
Market Basket data analysis
A possible Association Rule:
“Purchase of Pen implies the purchase of Ink or Diary”
{Pen} ⇒ {Ink} or {Pen} ⇒ {Diary}
Basically, the rule is of the form {LHS} ⇒ {RHS}, where both {LHS} and {RHS} are sets of items, called itemsets, and {LHS} ∩ {RHS} = ∅.
• {Pen, Ink} is a 2-itemset.
Introduction to Data Mining
31
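To make support and confidence concrete for such a rule, here is a small Python sketch (not part of the original slides; the helper names are ours) that computes both for {Pen} ⇒ {Ink} over the four transactions above.

# Support and confidence of an association rule over the four example transactions.
transactions = [
    {"Pen", "Ink", "Diary", "Writing Pad"},
    {"Pen", "Ink", "Diary"},
    {"Pen", "Diary"},
    {"Pen", "Ink", "Writing Pad"},
]

def support(itemset, db):
    """Fraction of transactions containing every item of `itemset`."""
    return sum(1 for t in db if itemset <= t) / len(db)

def confidence(lhs, rhs, db):
    """support(LHS ∪ RHS) / support(LHS)."""
    return support(lhs | rhs, db) / support(lhs, db)

print(support({"Pen", "Ink"}, transactions))        # 0.75
print(confidence({"Pen"}, {"Ink"}, transactions))   # 0.75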
Binary Association Rule Mining
Two Step Process
1. Find all frequent itemsets
– An itemset will be considered for mining rules if its support is above a threshold called minsup.
2. Generate strong association rules from frequent itemsets
– Acceptance of a rule is once again through a threshold called minconf.
Introduction to Data Mining
32
Finding Frequent Itemsets
If there are N items in a market basket and the
association is studied for all possible item
combinations, a total of 2^N combinations are to be checked.
Introduction to Data Mining
33
Finding Frequent Itemsets
All nonempty subsets of a frequent itemset must also
be frequent.
(anti-monotone property)
Apriori Algorithm
An itemset is frequent when its occurrence in the
total dataset exceeds the minsup.
If there exist N items, the algorithm attempts to compute frequent itemsets from 1-itemsets to N-itemsets.
Introduction to Data Mining
34
Apriori Algorithm
The algorithm has two steps,
1. Join step
2. Prune step
1. Join step: Here candidate k-itemsets are computed by joining the frequent (k-1)-itemsets.
2. Prune step: if a k-itemset fails to cross the minsup threshold, all the supersets of the concerned k-itemset are no longer considered for association rule discovery.
Introduction to Data Mining
35
Apriori Algorithm
• Let Lk be the set of frequent k-itemsets
• Let Ck be the set of candidate k-itemsets
Each member of this set has two fields – itemset and
support count.
Introduction to Data Mining
36
Apriori Algorithm
1. Let k ← 1
2. Generate L1, the frequent itemsets of length 1
3. If (Lk = ∅) OR (k = N) goto Step 7
4. k ← k + 1
5. Generate Lk, the frequent itemsets of length k, by Join and Prune
6. Goto Step 3.
7. Stop
Output: ∪k Lk
Introduction to Data Mining
37
Apriori Algorithm
Join()
forall (i, j) where i ∈ Lk-1 and j ∈ Lk-1, i ≠ j
  select all possible k-itemsets and insert into Ck
endfor
If L3 = {({1 2 3}, s123), ({1 2 4}, s124), ({1 3 4}, s134), ({1 3 5}, s135), ({2 3 4}, s234)}
then C4 = {({1 2 3 4}, s1234), ({1 3 4 5}, s1345)}
Introduction to Data Mining
38
Apriori Algorithm
Prune()
forall itemsets c ∈ Ck do
  forall (k-1)-subsets s of c do
    if (s ∉ Lk-1) then delete c from Ck
    endif
  endfor
endfor
L4 = {({1 2 3 4}, s1234)}
Lk ← {c ∈ Ck : support(c) ≥ minsup}
Introduction to Data Mining
39
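A minimal Python sketch of the level-wise procedure described above (join, prune, then support counting); it is an illustration only, and the function names and itemset representation (frozensets) are ours.

from itertools import combinations

def supp(c, transactions):
    """Fraction of transactions (sets of items) containing candidate itemset c."""
    return sum(1 for t in transactions if c <= t) / len(transactions)

def apriori(transactions, minsup):
    items = {frozenset([i]) for t in transactions for i in t}
    L = {c: s for c in items for s in [supp(c, transactions)] if s >= minsup}   # L1
    frequent, k = dict(L), 2
    while L:
        # Join step: union pairs of frequent (k-1)-itemsets that give a k-itemset
        cand = {a | b for a in L for b in L if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent
        cand = {c for c in cand
                if all(frozenset(s) in L for s in combinations(c, k - 1))}
        L = {c: s for c in cand for s in [supp(c, transactions)] if s >= minsup}  # Lk
        frequent.update(L)
        k += 1
    return frequent   # maps every frequent itemset to its support

Usage, for example: apriori([{"Pen", "Ink"}, {"Pen", "Diary"}, {"Pen", "Ink", "Diary"}], 0.5).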
Rule Generation
Rule generation only needs to ensure that the produced rules satisfy the minimum confidence threshold
– because rules are generated from frequent itemsets, they automatically satisfy the minimum support threshold
Given a frequent itemset li, find all non-empty proper subsets f ⊂ li such that f ⇒ (li − f) satisfies the minimum confidence requirement
• If |li| = k, then there are 2^k − 2 candidate association rules
Introduction to Data Mining
40
Rule Generation
Algorithm:
forall frequent itemsets lk with k ≥ 2 do
  call genrule(lk, lk)
endfor
Introduction to Data Mining
41
Rule Generation
genrule(lk, fm)
  F ← {(m-1)-itemsets fm-1 | fm-1 ⊂ fm}
  forall fm-1 ∈ F do
    conf ← sup(lk) / sup(fm-1)
    if (conf ≥ minconf)
      print rule “fm-1 ⇒ (lk − fm-1)”, with confidence conf and support sup(lk)
      if (m-1 > 1)
        call genrule(lk, fm-1)
      endif
    endif
  endfor
Introduction to Data Mining
42
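The sketch below generates rules from the `frequent` dictionary produced by the Apriori sketch earlier. It is ours, not the slides': it simply enumerates every non-empty proper subset as an antecedent, without the recursive pruning that genrule performs, but it applies the same confidence test sup(l)/sup(f) ≥ minconf.

from itertools import combinations

def gen_rules(frequent, minconf):
    """Generate rules f => (l - f) from each frequent itemset l with |l| >= 2.
    `frequent` maps frozenset itemsets to their supports; every subset of a
    frequent itemset is itself frequent, so frequent[f] is always defined."""
    rules = []
    for l, sup_l in frequent.items():
        if len(l) < 2:
            continue
        for r in range(1, len(l)):                       # all non-empty proper subsets
            for f in map(frozenset, combinations(l, r)):
                conf = sup_l / frequent[f]               # sup(l) / sup(f)
                if conf >= minconf:
                    rules.append((set(f), set(l - f), conf, sup_l))
    return rules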
Rule Generation
If {A,B,C,D} is a frequent itemset, candidate rules:
{ABC} ⇒ {D}, {ABD} ⇒ {C}, {ACD} ⇒ {B}, {BCD} ⇒ {A},
{AB} ⇒ {CD}, {AC} ⇒ {BD}, {AD} ⇒ {BC}, {BC} ⇒ {AD},
{BD} ⇒ {AC}, {CD} ⇒ {AB},
{A} ⇒ {BCD}, {B} ⇒ {ACD}, {C} ⇒ {ABD}, {D} ⇒ {ABC}
Introduction to Data Mining
43
Rule Generation
In general, confidence does not have an anti-monotone property:
c({ABC} ⇒ {D}) can be larger or smaller than c({AB} ⇒ {D}).
But the confidence of rules generated from the same itemset has an anti-monotone property
– Confidence is anti-monotone w.r.t. the number of items on the RHS of the rule
e.g., L = {A,B,C,D}:
c({ABC} ⇒ {D}) ≥ c({AB} ⇒ {CD}) ≥ c({A} ⇒ {BCD})
Introduction to Data Mining
44
Case Study
To find the Association among the species of trees
present in a forest.
The problem is to find a set of association rules
which would indicate the species of trees that
usually appear together and also whether a set of
species ensures the presence of another set of
species with a minimum degree of confidence
specified a priori.
Introduction to Data Mining
45
Data Collection
A forest area is divided into a number of transects.
A group of surveyors walks through each such transect to identify the different species of trees and their number of occurrences.
Introduction to Data Mining
46
Data
[Table: occurrence counts of each species (rows: species 1, 2, 3, …, 398) in each transect (columns: transects 1, 2, 3, …, 1008); each cell gives the number of trees of that species observed in that transect.]
Introduction to Data Mining
47
Converting the Data
[Table: the same species × transect table with each count replaced by 1 if the species is present in the transect and 0 if it is absent.]
Introduction to Data Mining
48
Drawbacks
Support and confidence used by Apriori allow a lot
of rules which are not necessarily interesting
Two options to extract interesting rules
• Using subjective knowledge
• Using objective measures (measures better than
confidence)
Introduction to Data Mining
49
Subjective approaches
• Visualization – users allowed to interactively
verify the discovered rules
• Template-based approach – filter out rules that do
not fit the user specified templates
• Subjective interestingness measure – filter out
rules that are obvious (bread  butter) and that are
non-actionable (do not lead to profits)
Introduction to Data Mining
50
Objective Measures
TID  A  B  C  D
 1   1  1  0  0
 2   0  0  1  0
 3   1  1  1  1
 4   1  0  0  0
 5   0  1  0  1
 6   1  1  0  0
 7   0  1  1  1
 8   1  0  1  1
 9   1  1  0  0
10   1  0  1  1
Support(A) = 0.7
Support(B) = 0.6
Support(C) = 0.5
Support(D) = 0.5
Support(AB) = 0.4
Support(CD) = 0.4
minsup = 0.3
How to infer: A ⇒ B or C ⇒ D?
Introduction to Data Mining
51
Dissociation
• Dissociation of an itemset is the % of transactions in which some, but not all, of its items are present (i.e. at least one item is present and at least one is absent).
Dissociation(AB) = 0.5
Dissociation(CD) = 0.2
• Extract frequent itemsets from a set of transactions
under high association but low dissociation.
Introduction to Data Mining
52
Togetherness
Let Si = subset of transactions containing the item i.
SA ∩ SB = subset of transactions containing both A
and B.
SA ∪ SB = subset of transactions containing either A or B (or both).
Togetherness(AB) = |SA ∩ SB| / |SA ∪ SB|
Similar to minsup, a threshold min_togetherness can
be defined to find frequent itemsets.
Introduction to Data Mining
53
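The following Python sketch (ours, not the slides'; the transaction list is a set-of-items encoding of the ten-row table on slide 51) computes support, dissociation and togetherness for AB and CD, showing why C ⇒ D is the preferable rule despite the equal supports.

transactions = [                       # the ten transactions of the example table
    {"A", "B"}, {"C"}, {"A", "B", "C", "D"}, {"A"}, {"B", "D"},
    {"A", "B"}, {"B", "C", "D"}, {"A", "C", "D"}, {"A", "B"}, {"A", "C", "D"},
]

def support(itemset, db):
    return sum(1 for t in db if itemset <= t) / len(db)

def dissociation(itemset, db):
    """Fraction of transactions containing some, but not all, items of the itemset."""
    return sum(1 for t in db if t & itemset and not itemset <= t) / len(db)

def togetherness(itemset, db):
    """|S_A ∩ S_B| / |S_A ∪ S_B| -- a Jaccard-style measure."""
    inter = sum(1 for t in db if itemset <= t)
    union = sum(1 for t in db if t & itemset)
    return inter / union

for pair in ({"A", "B"}, {"C", "D"}):
    print(pair, support(pair, transactions),
          dissociation(pair, transactions), togetherness(pair, transactions))
# {A,B}: support 0.4, dissociation 0.5, togetherness ~0.44
# {C,D}: support 0.4, dissociation 0.2, togetherness ~0.67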
Objective Measures
• Weka uses other objective measures
– Lift(A ⇒ B) = confidence(A ⇒ B)/support(B) = support(A ∪ B)/(support(A) × support(B))
– Leverage(A ⇒ B) = support(A ∪ B) − support(A) × support(B)
– Conviction(A ⇒ B) = support(A) × support(¬B) / support(A ∪ ¬B)
– conviction inverts the lift ratio and also computes support for the RHS not being true
Introduction to Data Mining
54
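A small sketch (ours) evaluating these three measures from the supports of the table on slide 51, again contrasting A ⇒ B with C ⇒ D.

def lift(sa, sb, sab):
    return sab / (sa * sb)

def leverage(sa, sb, sab):
    return sab - sa * sb

def conviction(sa, sb, sab):
    # support(A) * support(not B) / support(A and not B); undefined when confidence is 1
    return sa * (1 - sb) / (sa - sab)

# A => B: sup(A)=0.7, sup(B)=0.6, sup(AB)=0.4  -> lift 0.95, leverage -0.02, conviction 0.93
print(lift(0.7, 0.6, 0.4), leverage(0.7, 0.6, 0.4), conviction(0.7, 0.6, 0.4))
# C => D: sup(C)=0.5, sup(D)=0.5, sup(CD)=0.4  -> lift 1.6,  leverage  0.15, conviction 2.5
print(lift(0.5, 0.5, 0.4), leverage(0.5, 0.5, 0.4), conviction(0.5, 0.5, 0.4))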
Modifications of Apriori Algorithm
Reduce computation time:
• Hash based techniques
• Transaction reduction
• Sampling
• Dynamic itemset counting
Introduction to Data Mining
55
Frequent Pattern Mining Variations
• Type of value handled
• Levels of abstractions
• Number of data dimensions
• Kinds of patterns to be mined
• Completeness of patterns to be mined
• Kind of rules to be mined
Introduction to Data Mining
56
Type of Value Handled
Binary / Boolean
• Absence of items helps in improving the discovery of
association rules but does not directly contribute to rule
mining.
Quantitative
• In certain applications, absence of items may sometimes be as important as their presence.
• In medical applications, it has been found that both
presence and absence of symptoms need to be considered in
discovering association rules.
Introduction to Data Mining
57
Quantitative Association Rules
For numeric attributes like age, salary etc., binary association rule mining is not applicable. The attribute domain can be categorized following two basic approaches to the treatment of quantitative attributes:
• Static
• Dynamic
Introduction to Data Mining
58
Static Discretisation
Quantitative attributes are discretised using predefined concept hierarchies.
For example, the original numeric values of the attribute income may be replaced by interval labels
“0…10K”, “11…20K” …
and so on.
Introduction to Data Mining
59
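A minimal sketch of static discretisation, assuming 10K-wide income intervals like those on the slide; the bin list and function name are ours.

BINS = [(0, 10), (11, 20), (21, 30), (31, 40), (41, 50)]   # in thousands, as on the slide

def discretise_income(value_k):
    """Replace a numeric income (in thousands) by its predefined interval label."""
    for lo, hi in BINS:
        if lo <= value_k <= hi:
            return f"{lo}...{hi}K"
    return f">{BINS[-1][1]}K"

print(discretise_income(7))    # "0...10K"
print(discretise_income(18))   # "11...20K"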
Dynamic Discretisation
Quantitative attributes are discretised (clustered) into “bins” based on the distribution of the data.
After verification against the minsup and minconf thresholds, the following rules may be obtained:
age(x, 5) ⇒ studies(x, “in school”)
age(x, 6) ⇒ studies(x, “in school”)
⁞
age(x, 17) ⇒ studies(x, “in school”)
age(x, 18) ⇒ studies(x, “in school”)
Introduction to Data Mining
60
Dynamic Discretisation
• ARCS (Association Rule Clustering System), used for mining quantitative rules, may be used for classification with rules of the form
Aquant1 ∧ Aquant2 ∧ … ∧ Aquantn ⇒ Acat
where Aquant1, Aquant2, etc. are tests on numeric attribute ranges and Acat is the class label assigned after the training step.
Introduction to Data Mining
61
Dynamic Discretisation
Using ARCS (Association Rule Clustering System), a composite rule may be formed as
age(x, “5…18”) ⇒ studies(x, “in school”)
In a similar way, two-dimensional quantitative rules can also be formed:
age(x, “25…40”) ∧ income(x, “20K…40K”) ⇒ buys(x, “new car”)
Introduction to Data Mining
62
Levels of Abstractions
[Concept hierarchy over stationery items:
All → Pen, Writing Pad, Ink
Pen → Fountain, Dot
Writing Pad → Ruled, Blank
Ink → Bottle, Cartridge
with brands (Pilot, Parker, Oxford, Pioneer, Link, …) at the lowest level.]
Introduction to Data Mining
63
Multilevel Association Rule
Using
• Uniform minimum support
• Reduced minimum support at lower level
• Group based minimum support
Introduction to Data Mining
64
Rules over Taxonomies
• The items used for rule mining may not be at the same level. There can be an in-built taxonomy among the items. An example of a taxonomy applicable to market basket data:
Clothes → Outerwear, Shirts
Outerwear → Track Suits, Track Pants
Footwear → Shoes, Snickers
This taxonomy implies:
• Track Suits is-a Outerwear, Outerwear is-a Clothes, etc.
Introduction to Data Mining
65
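One common way to mine rules across the levels of such a taxonomy (not spelled out on the slide, but a standard approach for generalized association rules) is to extend every transaction with the ancestors of its items and then run ordinary Apriori. The sketch below assumes the slide's taxonomy; the PARENT map and function names are ours.

PARENT = {                      # taxonomy from the slide
    "Track Suits": "Outerwear", "Track Pants": "Outerwear",
    "Outerwear": "Clothes", "Shirts": "Clothes",
    "Shoes": "Footwear", "Snickers": "Footwear",
}

def ancestors(item):
    out = []
    while item in PARENT:
        item = PARENT[item]
        out.append(item)
    return out

def extend(transaction):
    """Add every ancestor of every item, so rules across taxonomy levels can be found."""
    extended = set(transaction)
    for item in transaction:
        extended.update(ancestors(item))
    return extended

print(extend({"Track Suits", "Snickers"}))
# {'Track Suits', 'Outerwear', 'Clothes', 'Snickers', 'Footwear'}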
Rules over Taxonomies
Application domain may need rules at different
levels of the taxonomy.
Trivial Rule:
If Ŷ is an ancestor of Y, then the rule Y ⇒ Ŷ is trivial.
Shoes ⇒ Footwear (a rule with 100% confidence), since
Footwear → Shoes, Snickers.
Introduction to Data Mining
66
Rules across Levels
• Rule OuterwearSnickers does not infer either
Track SuitsSnickers or Track PantsSnickers
So, a rule at a higher level does not infer the same
rule at the lower level of the taxonomy.
Clothes
Footwear
Outerwear
Shoes
Snickers
Track Suits
Introduction to Data Mining
Shirts
Track Pants
67
Rules across Levels
• Rule Track SuitsSnickers definitely infers the
rule OuterwearSnickers
So, a rule at a lower level definitely infers the same
rule at the higher level of the taxonomy.
Clothes
Footwear
Outerwear
Shoes
Snickers
Track Suits
Introduction to Data Mining
Shirts
Track Pants
68
Interest Measure
• To find rules whose support is more than R times
the expected value or whose confidence is more
than R times the expected value , for some user
specified constant R.
Introduction to Data Mining
69
Rule (with Taxonomies) Generation
Steps
1. Find frequent itemsets
2. Use frequent itemsets to generate the desired
rules.
3. Prune all uninteresting rules from this set.
Introduction to Data Mining
70
The Database
TID  Items
1    Shirts
2    Track Suits, Snickers
3    Track Pants, Snickers
4    Shoes
5    Shoes
6    Track Suits
minsup = 30%
minconf = 60%
Introduction to Data Mining
71
Frequent Itemset & Taxonomies
Itemsets                 Sup (out of 6)
{Track Suits}            2
{Outerwear}              3
{Clothes}                4
{Shoes}                  2
{Snickers}               2
{Footwear}               4
{Outerwear, Snickers}    2
{Clothes, Snickers}      2
{Outerwear, Footwear}    2
{Clothes, Footwear}      2
[Taxonomy: Clothes → Outerwear, Shirts; Outerwear → Track Suits, Track Pants; Footwear → Shoes, Snickers]
Introduction to Data Mining
72
Rules
Rule                      Sup%   Conf%
Outerwear ⇒ Snickers      33     66
Outerwear ⇒ Footwear      33     66
Snickers ⇒ Outerwear      33     100
Snickers ⇒ Clothes        33     100
Introduction to Data Mining
73
Rule under Item Constraints
Some applications may need association rules
under user specified constraints on items. When a
taxonomy is present, these constraints may be
specified using the taxonomy.
Introduction to Data Mining
74
Rule under Item Constraints
(Track Suits  Shoes)  (descendants(Clothes)  
ancestors(Snickers))
• A Boolean expression representing a constraint.
• Allow rules containing either, both Track Suits
and Shoes or Clothes or any descendant of Clothes
and do not contain Snickers or Footwear as its
ancestor.
Introduction to Data Mining
75
Rule under Item Constraints
Exploitation of the hierarchy does not stop the generation of association rules among items at the same level. Such association rules are therefore called Generalized Association Rules.
Introduction to Data Mining
76
Number of Data Dimensions
• Single Dimension
– Discrete Predicate:
buy(X, “Pen”) ⇒ buy(X, “Ink”)
• Multidimension
– Discrete Predicate:
age(X, “9..21”) ∧ occupation(X, “Student”) ⇒ buy(X, “Pen”)
– Multiple occurrence of Predicate:
age(X, “9..21”) ∧ occupation(X, “Student”) ∧ buy(X, “Pen”) ⇒ buy(X, “Ink”)
Introduction to Data Mining
77
Sequential Patterns
A sequential pattern always provides an order.
• In a market basket application, we are not interested in the set of items appearing within one transaction, but try to find an inter-transaction purchase pattern. So the transactions need to be ordered.
Introduction to Data Mining
78
Sequential Patterns
It is assumed that a customer can have only one
transaction at a given transaction time.
• An itemset (I) is a non-empty set of items (ij)
I = {i1 i2…in}
• A sequence (s) is an ordered list of itemsets or
events (ej).
s = {e1 e2…em} where ei occurs before ej (i<j)
Introduction to Data Mining
79
Sequential Patterns
A sequence is contained in another sequence if each
itemset in the first sequence is contained in some
itemset of the second sequence.
A sequence {(3) (4 5) (8)} is contained in another sequence {(7) (3 8) (9) (4 5 6) (8)}
since (3) ⊆ (3 8), (4 5) ⊆ (4 5 6) and (8) ⊆ (8).
A sequence {(3) (5)} is not contained in {(3 5)}, and vice versa.
Introduction to Data Mining
80
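A small Python sketch of this containment test (ours; sequences are represented as lists of sets), using a greedy left-to-right match of each itemset against a later itemset of the longer sequence.

def contains(big, small):
    """True if sequence `small` is contained in sequence `big`:
    each itemset of `small` is a subset of some itemset of `big`, in order."""
    i = 0
    for itemset in big:
        if i < len(small) and small[i] <= itemset:
            i += 1
    return i == len(small)

big   = [{7}, {3, 8}, {9}, {4, 5, 6}, {8}]
small = [{3}, {4, 5}, {8}]
print(contains(big, small))            # True
print(contains([{3, 5}], [{3}, {5}]))  # False: (3)(5) is not contained in (3 5)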
Sequential Patterns
• In a set of sequences, a sequence s is maximal if it is not contained in any other sequence.
• For a sequence to be frequent, it must cross the minimum support threshold.
• A frequent sequence is called a sequential pattern.
• A sequential pattern of length l is called an l-pattern.
Introduction to Data Mining
81
Discovery of Sequential Patterns
CustId  Date        Items
001     13/05/2012  30
001     14/05/2012  90
002     13/05/2012  10, 20
002     15/05/2012  30
002     16/05/2012  40, 60, 70
003     17/05/2012  30, 50, 70
004     13/05/2012  30
004     14/05/2012  40, 70
004     16/05/2012  90
005     13/05/2012  90

Sequence  Support
{(10)}    1
{(20)}    1
{(30)}    4
{(40)}    2
{(50)}    1
{(60)}    1
{(70)}    3
{(90)}    3
minsup = 25%
Introduction to Data Mining
82
Discovery of Sequential Patterns
• L1 = {{(30)}, {(40)}, {(70)}, {(90)}}
• Candidate 2-sequences C2 = {{(30) (30)}, {(30) (40)}, {(30) (70)}, {(30) (90)}, …, {(90) (90)}, {(30 40)}, …, {(70 90)}}

Sequence    Support     Sequence    Support
(10 20)     1           (30) (70)   2
(10) (30)   1           (30) (90)   2
(20) (30)   1           (40) (90)   1
(30) (40)   2           (70) (90)   1
(30) (60)   1           (40 70)     2
Introduction to Data Mining
83
Discovery of Sequential Patterns
• L2 = {{(30) (40)}, {(30) (70)}, {(30) (90)}, {(40 70)}}
• Candidate sequences C3 = {{(30) (30) (70)}, {(30) (30) (90)}, {(30) (40 70)}, …, {(40) (30) (70)}, {(40) (30) (90)}, {(40) (40 70)}, …, {(30) (40) (30) (70)}, {(30) (40) (30) (90)}, {(30) (40) (40 70)}, …, {(40 70) (40 70)}, …, {(30) (40 70 90)}}

Sequence      Support
(30) (40 70)  2
Introduction to Data Mining
84
Discovery of Sequential Patterns
CustId  Date        Items
001     13/05/2012  30
001     14/05/2012  90
002     13/05/2012  10, 20
002     15/05/2012  30
002     16/05/2012  40, 60, 70
003     17/05/2012  30, 50, 70
004     13/05/2012  30
004     14/05/2012  40, 70
004     16/05/2012  90
005     13/05/2012  90

CustId  Customer Sequence
1       (30) (90)
2       (10 20) (30) (40 60 70)
3       (30 50 70)
4       (30) (40 70) (90)
5       (90)

If minsup of any maximal sequence = 0.25 (say), then the acceptable sequential patterns are {(30) (90)} and {(30) (40 70)}.
Introduction to Data Mining
85
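The sketch below (ours; some dates in the original table were garbled and are assumed to be May 2012) shows the two phases illustrated above: grouping transactions into per-customer sequences ordered by date, then counting the support of a candidate sequence as the fraction of customers whose sequence contains it.

from collections import defaultdict

db = [  # (cust_id, date, items); ISO date strings sort chronologically
    (1, "2012-05-13", {30}), (1, "2012-05-14", {90}),
    (2, "2012-05-13", {10, 20}), (2, "2012-05-15", {30}), (2, "2012-05-16", {40, 60, 70}),
    (3, "2012-05-17", {30, 50, 70}),
    (4, "2012-05-13", {30}), (4, "2012-05-14", {40, 70}), (4, "2012-05-16", {90}),
    (5, "2012-05-13", {90}),
]

def customer_sequences(db):
    """Group transactions by customer and order them by date."""
    seqs = defaultdict(list)
    for cust, date, items in sorted(db, key=lambda r: (r[0], r[1])):
        seqs[cust].append(items)
    return seqs

def sequence_support(candidate, seqs):
    """Fraction of customers whose sequence contains the candidate sequence."""
    def contains(big, small):
        i = 0
        for itemset in big:
            if i < len(small) and small[i] <= itemset:
                i += 1
        return i == len(small)
    return sum(contains(s, candidate) for s in seqs.values()) / len(seqs)

seqs = customer_sequences(db)
print(sequence_support([{30}, {90}], seqs))      # 0.4 -> frequent at minsup = 25%
print(sequence_support([{30}, {40, 70}], seqs))  # 0.4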
Specification of Time Windows
• User may define a time window within which the
patterns are to be discovered.
• If a pattern is found without adequate support
within a time window but crosses minsup across
different time windows, it would not be considered
as a valid sequential pattern.
• This effort helps in studying seasonal purchase
patterns in case of market basket analysis.
Introduction to Data Mining
86
Sequential Patterns over Taxonomies
Similar to rule mining, the items under consideration may not be at the same level.
[Taxonomy: Clothes → Outerwear, Shirts; Outerwear → Track Suits, Track Pants; Footwear → Shoes, Snickers]
From the available transactions if a sequential pattern is found
as {(Track Suits) (Shoes)}, it would also support patterns
like, {(Outerwear)(Shoes)},{(Outerwear) (Footwear)} etc.
These are called generalized sequential patterns.
Introduction to Data Mining
87
Data Classification
• Classification is a method where the data instances
in a problem domain are distributed among
different pre-defined classes or concepts.
• Usually a data instance is placed in only one
class.
• For the purpose of classification, definite criteria /
rules are defined for the membership of each class.
Introduction to Data Mining
88
Data Classification
• Classification is usually done under the supervision of domain experts of the problem domain. So, the classification process involves supervised learning.
• Clustering, on the other hand, is the result of unsupervised learning. Here the class or concept label of each data instance or each cluster is not known. The number of such classes or concepts may be pre-defined intuitively.
Introduction to Data Mining
89
Data Classification
Classification process has two steps.
1. build the model from training data set
–
Learning a mapping function y = f(X) where y is the
associated class label for an instance X.
2. classify unknown data.
Introduction to Data Mining
90
Comparison of Classification
Methods
Properties for the comparison:
• Predictive Accuracy: Ability of a model to correctly predict the class label of a new data instance.
• Speed: Computational cost, in terms of time, required to build (train) the model and then to classify data.
Introduction to Data Mining
91
Comparison of Classification
Methods
Properties for the comparison:
• Robustness : Ability of a model to make correct
classification under noisy data or data with missing
values.
• Scalability: The response of a model, in the training and classification steps, to increases in data volume.
Introduction to Data Mining
92
Classification by Decision Tree
Induction
• A Decision Tree is a tree structure.
• Classification is done against a concept.
• Tree is formed by testing an attribute or attribute
combination in each node.
• Each branch of the tree is caused by an outcome of
this test.
• The leaf nodes represent the classes.
Introduction to Data Mining
93
Decision Tree Concept: Buy New
Car
INCOME
  ≤20K → MARITAL STATUS
    Single → YES
    Married → NO
  20-50K → AGE
    <40 → YES
    >40 → NO
  >50K → YES
Introduction to Data Mining
94
Decision Tree Induction Algorithm
1. Tree starts as a single node on which training
samples are tested.
2. If all the training samples are of the same class
the node becomes the leaf and it is labeled with
that class.
3. Running an attribute selection algorithm, an
attribute is chosen for tree generation (attribute
INCOME in the example).
Introduction to Data Mining
95
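The slide refers to an attribute selection algorithm without naming one; a common choice is information gain, sketched below in Python with a few hypothetical training samples for the buys-new-car concept (the data and function names are ours).

from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, attribute):
    """Reduction in class entropy obtained by partitioning `rows` on `attribute`."""
    total = entropy(labels)
    for value in set(r[attribute] for r in rows):
        subset = [l for r, l in zip(rows, labels) if r[attribute] == value]
        total -= len(subset) / len(rows) * entropy(subset)
    return total

rows = [   # hypothetical training samples
    {"income": "<=20K", "marital": "single"},  {"income": "<=20K", "marital": "married"},
    {"income": "20-50K", "marital": "married"}, {"income": ">50K", "marital": "married"},
]
labels = ["yes", "no", "yes", "yes"]
print(information_gain(rows, labels, "income"))   # ~0.31: income would be chosen first
print(information_gain(rows, labels, "marital"))  # ~0.12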
Decision Tree Induction Algorithm
4. A branch is created for each value of the chosen attribute and the samples are partitioned accordingly (three branches under INCOME).
5. Algorithm repeats steps 3 and 4 recursively to
form decision tree for the samples at each
partition. Once an attribute is considered in a
node, it is not considered in any of its descendent
nodes.
Introduction to Data Mining
96
Decision Tree Induction Algorithm
6. The recursive procedure stops when
i. all samples for each node belong to the same class according to the domain expert.
ii. there is no other attribute on which the samples can be further partitioned. Majority Voting may be employed here to convert a node to a leaf node and label it with the class that covers the majority of its samples.
iii. there are no tuples for a given branch.
Introduction to Data Mining
97
Tree Pruning
Tree pruning is done to avoid overfitting the data at different nodes. Statistical measures are used to identify and remove branches that are not reliable enough. This results in faster classification and better classification of unknown data.
• Prepruning
• Postpruning
Introduction to Data Mining
98
Prepruning
The tree generation process is stopped after every partitioning. As a result, all the new nodes generated become leaf nodes, with the membership of samples decided by Majority Voting. The goodness of the partitioning is then tested by measures like χ², information gain etc. If any result goes below a pre-specified threshold, further partitioning of the affected subset of samples is stopped.
Introduction to Data Mining
99
Prepruning
• High threshold would generate an over-simplified
tree and low threshold may cause hardly any
pruning.
Introduction to Data Mining
100
Postpruning
• Branches are removed from a fully grown tree. Here the expected error rate at each non-leaf node is computed as if its sub-tree were pruned. It is compared with the combined error rates along each of its branches, weighted by the proportion of participating samples. If the expected error rate is lower, the sub-tree is removed.
Introduction to Data Mining
101
Classification Rule Generation
Each path of a decision tree from the root to a leaf gives rise to an IF-THEN classification rule. From the decision tree in the example, rules may be formed as:
IF income≤20K AND marital-status=“married”
THEN buys-new-car=“no”
IF income>50K
THEN buys-new-car=“yes” etc.
Introduction to Data Mining
102
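A minimal sketch (ours) of this path-to-rule conversion; the decision tree of the example is encoded here as nested dictionaries, which is an assumption of the sketch rather than anything prescribed by the slides.

tree = {"income": {          # {attribute: {value: subtree-or-class-label}}
    "<=20K": {"marital-status": {"single": "yes", "married": "no"}},
    "20-50K": {"age": {"<40": "yes", ">40": "no"}},
    ">50K": "yes",
}}

def tree_to_rules(node, conditions=()):
    """Turn every root-to-leaf path into an IF-THEN classification rule."""
    if not isinstance(node, dict):                       # leaf: emit a rule
        lhs = " AND ".join(f"{a}={v}" for a, v in conditions)
        return [f"IF {lhs} THEN buys-new-car={node}"]
    (attribute, branches), = node.items()
    rules = []
    for value, subtree in branches.items():
        rules.extend(tree_to_rules(subtree, conditions + ((attribute, value),)))
    return rules

for rule in tree_to_rules(tree):
    print(rule)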
Classification Rule Generation
Either during Rule Generation or during Postpruning, the redundant paths are pruned. For example, if the following rules are found:
IF income≤20K AND marital-status=“married”
THEN buys-new-car=“no”
IF income≤20K AND marital-status=“widow”
THEN buys-new-car=“no”
Introduction to Data Mining
103
Classification Rule Generation
The two paths are pruned to one path:
IF income≤20K AND marital-status=(“married” OR “widow”)
THEN buys-new-car=“no”
Other well-known classification methods are Bayesian Classification, Classification by Backpropagation, k-Nearest Neighbor Classifiers, etc.
Introduction to Data Mining
104
Case Study: Dynamic Classification
Hierarchy
Classification of Archaeological data:
• A Classification Hierarchy is created over a back-end database to generate and update Association Rules. Continuous restructuring of the Classification Hierarchy is done as the database is updated.
• On arrival of a new instance, the system tries to place it in the existing hierarchy. If it fails to classify the instance, the instance is considered an Exception to the class found to be the closest.
Introduction to Data Mining
105
Case Study: Dynamic Classification
Hierarchy
Classification of Archaeological data:
• The system initiates restructuring when the number of Exceptions exceeds a predefined threshold value. Three important operations are used:
1. ADD: adds a new branch to the hierarchy.
2. FUSE: merges two or more classes into one.
3. BREAK: decomposes a class into two or more classes.
Introduction to Data Mining
106
Initial Transaction
• Universal attribute set:
A = {a0, a1, a2, a3, a4, a5, a6, b0, b1, b2, b3, b4, b5, b6}
Transactions:
I1 = {a0, a1, a2, a3, a4}
I2 = {a0, a1, a2, a5, a6}
I3 = {a0, b0, b1, b2}
I4 = {a0, b0, b3, b4}
I5 = {a0, b0, b5, b6}
Introduction to Data Mining
107
Initial Hierarchy
Exact match at leaf level classes
• 5 leaf classes
C0 {a0}
  C1 {a1,a2}
    C11 {a3,a4}
    C12 {a5,a6}
  C2 {b0}
    C21 {b1,b2}
    C22 {b3,b4}
    C23 {b5,b6}
Introduction to Data Mining
108
Add
I6 = {a0, a3, a4, b0, b1, b2, b3, b4}
Approximate match – up to intermediate level (exception)
C0 {a0}
  C1 {a1,a2}
    C11 {a3,a4}
    C12 {a5,a6}
  C2 {b0}
    C21 {b1,b2}
    C22 {b3,b4}
    C23 {b5,b6}
    C24 {b1,b2,b3,b4}   (new)
A large number of exceptions may generate a new class.
Introduction to Data Mining
109
Fuse
Before fuse:
C0 {a0}
  C1 {a1,a2}
    C11 … C1n
  C2 {a1,a2, a3,a4}
    C21 … C2m
After fuse:
C0 {a0}
  C1 {a1,a2}
    C11 … C1n
    C2 {a3,a4}
      C21 … C2m
Introduction to Data Mining
110
Fuse


• The fuse of two peer classes K1 and K2 K1A  K 2A
is not allowed if there exists any other peer class K3
A
A
with K 3  K 2


Introduction to Data Mining
111
Further Transaction
• Universal attribute set:
A = {a0, a1, a2, a3, a4, a5, a6, b0, b1, b2, b3, b4, b5, b6}
Transactions:
I7 = {a0, a3, a4, b0, b1, b2, b3, b4}
I8 = {a0, a5, a6, b0, b1, b2, b3, b4}
I9 = {a0, a3, a5, b0, b1, b2, b3, b4}
I10 = {a0, a3, a5, b0, b1, b2, b5}
I11 = {a0, a3, b0, b1, b2, b3, b4}
Introduction to Data Mining
112
Break
C0 {a0}
  C1 {a1,a2}
    C11 {a3,a4}
    C12 {a5,a6}
  C2 {b0}
    C21 {b1,b2}
    C22 {b3,b4}
    C23 {b5,b6}
    C24 {b1,b2,b3,b4}
      C41 {a3,a4}
      C42 {a5,a6}
Introduction to Data Mining
113
Cluster Analysis
• The process of partitioning a set of data objects
into groups of similar objects is called Clustering.
The objects belonging to same cluster are supposed
to be similar whereas those in different clusters
should be dissimilar under the same similarity
measure.
Introduction to Data Mining
114
Cluster Analysis
• A good clustering algorithm should have the
following properties :
• Scalability
• Ability to handle different data types
• Insensitivity to the order of input records
• Working under minimum intervention
• Constraint based clustering
• Accept high dimensionality
Introduction to Data Mining
115
Clustering Algorithms
• Partitioning Method: Given n objects or data instances, a partitioning method constructs k partitions where k ≤ n. Each group/partition must have at least one object. Each object must belong to only one group (this may not hold for a fuzzy partitioning algorithm).
Introduction to Data Mining
116
k-Means Algorithm or a
Centroid-based Technique
Accepts an input parameter k and partitions n
objects into k clusters where intra-cluster similarity
is high and inter-cluster similarity is low. Similarity
is measured with respect to the mean value of the
objects in a cluster, called the centroid of the
cluster.
Introduction to Data Mining
117
Centroid-based Technique
1. arbitrarily choose k objects out of n as initial
cluster centers;
2. assign or reassign each object to a cluster where it
is most similar, with respect to the mean value;
3. re-compute the cluster means;
4. repeat steps 2 and 3 until there is no further
change or there is an exit condition.
Introduction to Data Mining
118
Centroid-based Technique
k-means is an iterative algorithm that works on the convergence of a squared-error criterion of the form
E = Σ_{i=1..k} Σ_{p ∈ Ci} |p − mi|²
where E is the sum of square-error for all objects, p is a given object and mi is the centroid of the cluster Ci.
Introduction to Data Mining
119
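A minimal Python sketch of the four steps on the previous slide (ours; point coordinates and function names are illustrative), iterating assignment and mean re-computation until the centroids stop changing.

import random

def dist2(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean(cluster):
    return tuple(sum(x) / len(cluster) for x in zip(*cluster))

def kmeans(points, k, iters=100):
    centroids = random.sample(points, k)                 # step 1: arbitrary initial centers
    clusters = []
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:                                 # step 2: assign to nearest mean
            i = min(range(k), key=lambda i: dist2(p, centroids[i]))
            clusters[i].append(p)
        new = [mean(c) if c else centroids[i] for i, c in enumerate(clusters)]
        if new == centroids:                             # step 4: stop when nothing changes
            break
        centroids = new                                  # step 3: recompute the means
    return centroids, clusters

points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
print(kmeans(points, 2))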
k-Medoids Algorithm
k-means algorithm is sensitive to outliers where a
very large value may distort the distribution of data
among clusters. In order to overcome it, instead of
the mean a medoid is used as the reference point of
a cluster. A medoid is the most centrally located
object in a cluster.
Introduction to Data Mining
120
k-Medoids Algorithm
1. arbitrarily choose k objects out of n as initial
medoids;
2. assign each remaining object to the cluster with
the nearest medoid;
3. randomly select a non-medoid object, Orandom ;
Introduction to Data Mining
121
k-Medoids Algorithm
4. compute the total cost S of swapping Oj with
Orandom (the cost function calculates the difference
in square-error value if a current medoid is
replaced by a nonmedoid object);
5. if S<0 then swap Oj with Orandom to form new set
of k-medoids (the total cost of swapping is the sum
of costs incurred by all nonmedoid objects);
Introduction to Data Mining
122
k-Medoids Algorithm
6. repeat steps 2 to 5 until no change.
• To judge the quality of replacement of Oj by Orandom, each non-medoid object p is examined for the following four cases.
• If p currently belongs to the cluster of Oj, Oj is replaced by Orandom, and p is closest to Oi where i ≠ j, then reassign p to Oi.
• If p currently belongs to the cluster of Oj, Oj is replaced by Orandom, and p is closest to Orandom, then reassign p to Orandom.
Introduction to Data Mining
123
k-Medoids Algorithm
• If p currently belongs to the cluster of Oi, where i ≠ j, Oj is replaced by Orandom, and p is still closest to Oi, then the assignment of p does not change.
• If p currently belongs to the cluster of Oi, where i ≠ j, Oj is replaced by Orandom, and p is closest to Orandom, then reassign p to Orandom.
Introduction to Data Mining
124
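A minimal sketch of the swap-cost test described above (the helper names and the toy points are ours): compute the total distance of every point to its nearest medoid before and after replacing Oj with Orandom, and accept the swap when the change S is negative.

def total_cost(points, medoids, dist):
    """Sum of distances from every point to its nearest medoid."""
    return sum(min(dist(p, m) for m in medoids) for p in points)

def swap_gain(points, medoids, o_j, o_random, dist):
    """Cost change S if medoid o_j is replaced by non-medoid o_random (S < 0 => swap)."""
    new_medoids = [o_random if m == o_j else m for m in medoids]
    return total_cost(points, new_medoids, dist) - total_cost(points, medoids, dist)

dist = lambda p, q: sum((a - b) ** 2 for a, b in zip(p, q))
points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
# both initial medoids sit in one cluster; swapping (2,1) for (8,8) gives S < 0, so swap
print(swap_gain(points, [(1, 1), (2, 1)], (2, 1), (8, 8), dist))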
Parallel Association Rule Mining
Algorithms
Challenges include:
• synchronization and communication minimization
• disk I/O minimization
• workload balancing
Introduction to Data Mining
125
Parallel Association Rule Mining
Algorithms
Strategies are,
• Distributed vs. shared memory architecture – SM needs more synchronization (e.g. by locking), whereas for DM message passing incurs higher communication overhead.
• Data vs. task parallelism.
• Static vs. dynamic parallelism.
Introduction to Data Mining
126
Sources & References
1. Jiawei Han and Micheline Kamber, “Data Mining
Concepts and Techniques”, 2007
2. Willi Klosgen and Jan M Zytkow, “Handbook of Data
Mining and Discovery”, 2002
3. R.Srikant, “Fast algorithms for mining association rules
and sequential patterns”, Ph.D. Thesis at the University of
Wisconsin-Madison, 1996.
4. R.Agrawal, T.Imielimski & A.Swami, “Mining association
rules between sets of items in large databases,” Proc. ACM
SIGMOD, pp.207-216, 1993.
Introduction to Data Mining
127
Sources & References
5. R.Agrawal & R.Srikant, “Fast algorithms for mining
association rules,” Proc. International Conference for Very
Large databases, 1994.
6. J.S.Park, M.S.Chen & P.S.Yu, “An effective hash based
algorithm for mining association rules,” Proc. ACM
SIGMOD,1995.
7. R.Srikant, Q.Vu & R.Agrawal, “Mining association rules
with item constraints,” Proc. International Conference on
Knowledge Discovery in Databases, 1997.
Introduction to Data Mining
128
Sources & References
8. K.Ali, S.Manganaris & R.Srikant, “Partial classification
using association rules,” Proc. International Conference on
Knowledge Discovery in Databases, 1997.
9. S Pal and A Bagchi, “Association against Dissociation:
some pragmatic considerations for Frequent Itemset
generation under Fixed and Variable Thresholds,” ACM
SigKDD Explorations, Vol.7, Issue 2, Dec.2005, pp. 151-159.
Introduction to Data Mining
129
Sources & References
10. S Ray and A Bagchi, “Rule Generation by Boolean
Minimization – Experience with Coronary Bifurcation
Stenting in Angioplasty,” ReTIS 2006.
11. S.Maitra & A.Bagchi, “Dynamic restructuring of
classification hierarchy towards data mining,” Proc.
International Conference on Management of Data, 1998.
12. T.G.Dietterich & R.S.Michalski, “Discovering patterns in
sequences of events,” Artificial Intelligence, vol.25, pp.187-232, 1985.
Introduction to Data Mining
130
Sources & References
13. R.Agrawal & R.Srikant, “Mining sequential patterns”
Proc. IEEE International Conference on Data Engineering,
1995.
14. R.Srikant & R.Agrawal, “Mining sequential patterns:
generalizations and performance improvements,” Proc.
International Conference on Extending Database
Technology, 1996.
15. M.J.Zaki, “Parallel & distributed association mining: a
survey,” IEEE Concurrency, 7(4), pp.14-25, 1999.
Introduction to Data Mining
131
Research Challenges
Areas:
• Query Language
• Architecture
• Text Mining
• Multimedia Mining
• Spatial / Temporal Analysis
• Graph-Mining
Introduction to Data Mining
132
THANK YOU