Intelligent Systems
Exploratory pattern discovery
Geoff Webb
http://www.csse.monash.edu.au/~webb
Outline
• Tutorial covers
  • Data Mining
  • Exploratory Pattern Discovery
  • Association rules
  • Interestingness (objective functions)
  • False discoveries
  • Limitations of minimum support
  • K-most interesting pattern discovery
  • Itemset discovery
  • Contrast rule discovery
  • Impact rules
Part 1:
Data Mining
Data mining
• Data mining seeks to discover unanticipated knowledge from data
• Exponential growth in the quantity of stored data gives urgency to the pursuit of practical analytic approaches that address
  • large volumes of data
  • low-quality data
  • post-hoc analysis
  • loosely defined analytical objectives
So what’s the big deal?
• Don't statistics identify patterns in data?
• Conventional statistics do not address
  • searching quintillions of potential correlations, eg:
    • market basket data: 2^100,000
    • US phone calls: 2^100,000,000
    • human genome: 2^3,000,000,000
  • selecting the most interesting from millions of correlations
Example: Should we stock vitamins?
• Major national retailer with detailed records of customer purchasing behaviour
• Considering deleting a low-volume product line
• Does the data provide evidence of an indirect contribution to the bottom line?
Example: Steel rolling mill
• Complex control problem for an expensive production process influenced by input materials, desired output and state of equipment
• Currently uses an imperfect model
• Objective: use data to identify circumstances in which the model is deficient
[Photo courtesy G.C. Goodwin, S. Graebe and M. Salgado, Control System Design, Prentice Hall, 2000.]
Example: Synchrotron x-ray data analysis
• Synchrotron x-ray scatter patterns reflect the micro-structure of the material analysed.
• Can x-ray scatter plots be used for cancer diagnosis?
[Figure: x-ray scatter patterns of normal and malignant tissue]
A growth area
• The sum of human data stored doubles every 7 years
• Data mining is critical to commerce
  • fraud detection
  • information retrieval
• and to science
  • bioinformatics
  • mass data analysis
Large unmet demand for good PhDs!
Beyond statistics
• Data mining goes beyond the traditional realm of statistics by encompassing
  • problem formulation
  • interactions between the business process and the analytic process
  • knowledge management
  • data manipulation
[Figure: analytics at the intersection of data, business processes and other knowledge sources]
Generating models
• The core of the data mining process is generating models from data
  • eg neural networks, support vector machines, decision trees
• Most research concentrates on this aspect
• Surrounding activities are also very important
  • defining the analytic task
  • sourcing data
  • preprocessing data
  • identifying appropriate forms of model
  • identifying appropriate techniques for generating models
  • interpreting models
  • applying models
Part 2:
Exploratory Pattern Discovery
The perils of model selection
• Many data mining techniques seek to identify a single model that best fits the observed data.
• In many applications, many models will fit the data (almost) equally well:
  bruises=f & gill-attachment=f & gill-spacing=c & ring-number=o → poisonous [Coverage=0.406 (3296); Support=0.388 (3152); Confidence=0.956]
  bruises=f & gill-spacing=c & veil-color=w & ring-number=o → poisonous [Coverage=0.406 (3296); Support=0.388 (3152); Confidence=0.956]
Perils of model selection (cont.)
• Data mining systems often make arbitrary choices
  • without warning
• A system may have no basis on which to select models, but an expert often will
  • ease / cost of operationalisation
  • comprehensibility / compatibility with existing knowledge and beliefs
  • social / legal / ethical / political acceptability
Exploratory pattern discovery
• Exploratory pattern discovery seeks all patterns that satisfy user-defined constraints
• The user can select from these patterns
  • can use criteria that might be infeasible to quantify
Patterns
• Rules:
  • <antecedent> → <consequent>
• Itemsets
  • <condition1> & <condition2> & …
• Sequences
  • <event1>, <event2>, …
• Structures
Rules
• <antecedent> → <consequent>
• IF <antecedent> THEN <consequent>
  • IF temp > 36.8 AND pulse > 120 THEN call doctor
• Antecedent
  = condition
  = left hand side, LHS
  = the conditions under which the rule holds / applies
• Consequent
  = conclusion
  = right hand side, RHS
  = action to perform or conclusion to reach
Theoretical foundations
• Substantial bodies of theory in Formal Logic, Computational Logic, and Artificial Intelligence can be brought to bear to utilise rules once they are inferred.
• If the antecedent entails the consequent and the antecedent is known (believed), then the consequent can be concluded.
• Can be extended to a probabilistic basis.
• Supports complex reasoning.
• Modular knowledge representation.
  • can capture knowledge nuggets
Rule discovery as search
• Rule discovery can be viewed as search through a space of expressible rules.
• The rule space (search space / description space) can be partially ordered on generality.
• A → C is a generalisation of B → C iff B entails A (A must be true if B is true)
  • a proper generalisation iff A does not also entail B
• If A → C is a generalisation of B → C, then B → C is a specialisation of A → C.
• Eg. IF age > 30 THEN X is a generalisation of
  • IF age > 31 THEN X
  • IF age > 30 AND gender = male THEN X
Generalization lattice for antecedents
[Figure: the generalisation lattice over antecedents drawn from {A, B, C, D}: the empty antecedent {} at the top, the singletons {A} … {D} below it, and so on down through every subset to {A,B,C,D} at the bottom]
Search tree for antecedents
[Figure: the same subsets of {A, B, C, D} arranged as a search tree, in which each antecedent is generated by exactly one path from {}]
Search tree with consequent propagation
[Figure: the search tree for antecedents with consequent propagation: each antecedent node carries the set of candidate consequents passed down from its parent, eg antecedent {A} with candidates {B,C,D}, antecedent {A,B} with candidates {C,D}, down to antecedent {A,B,C,D} with candidates {}]
Propositional rule discovery
• Antecedent and consequent are propositions
• Often restricted to antecedent and consequent both being conjunctions of Boolean terms
  • IF temp > 36.8 AND pulse > 120 THEN blood pressure > 140 AND condition = critical
Rule discovery is inherently intractable
• If
  • there are n propositions,
  • antecedents can be any set of propositions, and
  • consequents are a single proposition
  then
  • the size of the search space ≈ n × 2^n
• It is essential to use powerful pruning techniques to limit the search space
Part 3:
Association rules
Association rule discovery
• Developed for market basket analysis
  • a basket is a collection of products purchased in a single transaction
  • an itemset is a set of products
  • all baskets are itemsets
  • market basket analysis seeks to identify products that are associated with each other
    • diapers and beer
• Can generalize to itemset = any conjunction of Boolean terms
Transaction and tabular data
• Transaction data
  • each record is a set of items involved in a single transaction
  • eg. market basket, web site traversal, amino acids in a protein
• Tabular data
  • each record consists of a vector of values for the predefined attributes or fields
  • eg. a patient's signs and symptoms, employee details, the amino acids at each site in a protein
• While association rules were developed for transaction data, they generalise directly to attribute-value data
Support and confidence
• F(X) = proportion of records that satisfy condition X
• Coverage(A → C) = F(A)
• Support(A → C) = F(A & C)
• Confidence(A → C) = Support(A → C) / Coverage(A → C)
  • the maximum likelihood estimate of P(C | A)
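
To make the definitions concrete, here is a minimal Python sketch (mine, not the tutorial's; the function names are assumptions) computing coverage, support and confidence over transactions represented as sets of items:

```python
def coverage(transactions, antecedent):
    """F(A): proportion of records that satisfy the antecedent."""
    return sum(antecedent <= t for t in transactions) / len(transactions)

def support(transactions, antecedent, consequent):
    """F(A & C): proportion of records satisfying antecedent and consequent."""
    return sum(antecedent <= t and consequent <= t
               for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """support / coverage: the maximum likelihood estimate of P(C | A)."""
    return (support(transactions, antecedent, consequent)
            / coverage(transactions, antecedent))

transactions = [{'bread', 'honey'}, {'bread', 'butter'}, {'milk'}]
print(confidence(transactions, {'bread'}, {'honey'}))  # 0.5
```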
Frequent itemsets
• An itemset is frequent if its cover equals or exceeds a user-defined minimum
• Downward closure
  • frequency is anti-monotone
  • if an itemset I is not frequent then no specialisation of I is frequent
Association rules
• Antecedent and consequent are frequent itemsets
• An association rule indicates that the presence of the antecedent increases the probability that the consequent will be present
  • bread & butter → honey
Association rule discovery
• Requires a minimum support constraint
• Finds all rules that satisfy minimum support together with other user-specified constraints such as minimum confidence
• Example: 1000 transactions, 100 bread, 100 honey, 50 bread & honey
  • support(bread → honey) = 50/1000 = 0.05
  • confidence(bread → honey) = 50/100 = 0.50
The frequent itemset approach
• Find all frequent itemsets
• Generate all association rules therefrom
• Assumes
• a minimum support constraint
• sparse data
Finding frequent itemsets
• Once frequent itemsets are found, rule generation is straightforward
• Research has concentrated on efficient frequent itemset generation
The Apriori algorithm
Apriori(T, ε)
  L1 ← frequent 1-itemsets relative to T
  k ← 2
  while Lk−1 ≠ ∅
    Ck ← Generate(Lk−1)
    for t ∈ T
      for c ∈ Subsets(Ck, t)
        count[c]++
    Lk ← { c ∈ Ck | count[c] ≥ ε }
    k++
  return ∪k Lk

TRANSACTIONS: a,b,c; a,b,d; a,d

PROCESS, ε = 2
  L1 = {{a},{b},{d}}
  C2 = {{a,b},{a,d},{b,d}}
  L2 = {{a,b},{a,d}}
  C3 = {{a,b,d}}
  L3 = {}
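
The pseudocode translates directly into Python. The sketch below is an illustrative implementation, not the tutorial's own code; it applies the Apriori prune at candidate-generation time (so {a,b,d} is discarded without being counted) and reproduces L1 and L2 from the trace above:

```python
from itertools import combinations

def apriori(transactions, min_count):
    """Sketch of Apriori: returns {frozenset(itemset): support count}."""
    frequent = {}
    level = list({frozenset([i]) for t in transactions for i in t})
    k = 1
    while level:
        counts = {c: 0 for c in level}
        for t in transactions:              # one pass over the data per level
            for c in level:
                if c <= t:
                    counts[c] += 1
        survivors = [c for c in level if counts[c] >= min_count]
        frequent.update((c, counts[c]) for c in survivors)
        k += 1
        survivor_set = set(survivors)
        candidates = {a | b for a in survivors for b in survivors
                      if len(a | b) == k}
        # Apriori prune: keep a candidate only if all (k-1)-subsets are frequent
        level = [c for c in candidates
                 if all(frozenset(s) in survivor_set
                        for s in combinations(c, k - 1))]
    return frequent

ts = [{'a', 'b', 'c'}, {'a', 'b', 'd'}, {'a', 'd'}]
print(apriori(ts, 2))
# {a}:3, {b}:2, {d}:2, {a,b}:2, {a,d}:2 -- matching L1 and L2 in the trace
```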
Closed itemsets
• In practice many itemsets cover exactly the same records
  • eg pregnant; pregnant & woman
• A closed itemset is the most specific itemset that covers a particular set of records
• It is more efficient to find all closed frequent itemsets than all frequent itemsets
• Can generate all association rules from closed itemsets
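
The closure of an itemset can be computed directly as the intersection of all transactions that contain it. A minimal sketch (the helper name is mine; it assumes the itemset covers at least one transaction):

```python
def closure(transactions, itemset):
    """Most specific itemset covering the same records: intersect its cover."""
    covering = [t for t in transactions if itemset <= t]
    result = set(covering[0])
    for t in covering[1:]:
        result &= t
    return frozenset(result)

ts = [{'pregnant', 'woman'}, {'pregnant', 'woman', 'oedema'}, {'woman'}]
print(closure(ts, {'pregnant'}))   # frozenset({'pregnant', 'woman'})
# an itemset I is closed iff closure(ts, I) == I
```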
Closed Itemsets Example
Full set of itemsets for gill-size=n, gill-color=b & spore-print-color=w
gill-size=n [Coverage=2512]
spore-print-color=w [Coverage=2388]
gill-size=n & spore-print-color=w [Coverage=1824]
gill-color=b [Coverage=1728]
gill-color=b & spore-print-color=w [Coverage=1728]
gill-size=n & gill-color=b [Coverage=1728]
gill-size=n & gill-color=b & spore-print-color=w [Coverage=1728]
Closed itemsets
gill-size=n [Coverage=2512]
spore-print-color=w [Coverage=2388]
gill-size=n & spore-print-color=w [Coverage=1824]
gill-size=n & gill-color=b & spore-print-color=w [Coverage=1728]
Part 4:
Interestingness (objective functions)
Interestingness (Objective Functions)
• Need some means of selecting the most (potentially) interesting patterns
• Many different measures of interestingness may be relevant
• Most measures relate to the degree to which the antecedent and consequent are interdependent
  • eg P(A & C) − P(A) P(C)
Interestingness measures: lift
• lift = confidence / (cover(consequent) / n)
  • the proportional increase in confidence in the context of the antecedent
• Example: 1000 transactions, 100 bread, 100 honey, 50 bread & honey
  • confidence(bread → honey) = 0.50
  • lift(bread → honey) = 5.00
M-estimates
• Problem: many rules with low support will have unrealistically high confidence and lift
• Example: 1000 records, 500 females, 1 age>=90, 1 female & age>=90
  • confidence(age>=90 → female) = 1.00
  • lift(age>=90 → female) = 2.00
• The m-estimate is a Bayesian estimate of the true confidence and lift
  • biases confidence toward the prior
  • confidence estimate = (support + m × prior) / (coverage + m)
  • lift estimate = confidence estimate / prior
  • eg confidence estimate = (1 + 2 × 0.5) / (1 + 2) = 0.667
    lift estimate = 0.667 / 0.500 = 1.333
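
The calculation is a one-liner; this small sketch (with names of my choosing) reproduces the worked example, using m = 2 and a prior of 0.5:

```python
def m_confidence(support_count, coverage_count, prior, m=2):
    """(support + m * prior) / (coverage + m): confidence shrunk toward the prior."""
    return (support_count + m * prior) / (coverage_count + m)

conf = m_confidence(support_count=1, coverage_count=1, prior=0.5)
print(round(conf, 3))        # 0.667, as in the example above
print(round(conf / 0.5, 3))  # lift estimate: 1.333
```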
Interestingness measures: leverage
• leverage = support − (cover(antecedent) × cover(consequent) / n)
  • the absolute increase over the number of cases expected if antecedent and consequent were independent; in proportions, P(A & C) − P(A) P(C)
• Also known as interest
• Example: 1000 transactions, 100 bread, 100 honey, 50 bread & honey
  • confidence(bread → honey) = 0.50
  • lift(bread → honey) = 5.00
  • leverage(bread → honey) = 0.04
• Example 2: 1000 transactions, 10 batteries, 5 vodka, 1 batteries & vodka
  • lift(batteries → vodka) = 20.00
  • leverage(batteries → vodka) = 0.0009
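
A brief sketch (function names are mine) that reproduces both examples and shows why leverage demotes the batteries-and-vodka rule despite its large lift:

```python
def lift(n, a, c, s):
    """(s/a) / (c/n): confidence divided by the consequent's base rate."""
    return (s / a) / (c / n)

def leverage(n, a, c, s):
    """Support minus the support expected under independence."""
    return s / n - (a / n) * (c / n)

# bread -> honey: n=1000, 100 bread, 100 honey, 50 joint
print(lift(1000, 100, 100, 50), leverage(1000, 100, 100, 50))  # 5.0, ~0.04
# batteries -> vodka: n=1000, 10 batteries, 5 vodka, 1 joint
print(lift(1000, 10, 5, 1), leverage(1000, 10, 5, 1))          # 20.0, ~0.00095
```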
Spurious rules
• If condition X is unrelated to conditions A and B,
  • confidence(A & X → B) ≈ confidence(A → B)
  • lift(A & X → B) ≈ lift(A → B)
  • eg pregnant & AI researcher → oedema
• One core rule can result in many spurious rules
• If the problem is ignored, the majority of rules can be spurious!
Need to test up the generalization lattice
[Figure: the generalisation lattice with consequent propagation as before, illustrating that each rule must be tested against the rules formed from the generalisations of its antecedent higher up the lattice]
Minimum Improvement
• The improvement of rule X → Y [conf = c] = min({c − k | Z ⊂ X, Z → Y [conf = k]})
• A minimum improvement constraint can eliminate many spurious rules
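
A minimal sketch of the computation (an assumed helper): given a rule's confidence and the confidences of all its proper generalisations, improvement is the smallest margin over any of them:

```python
def improvement(rule_conf, generalisation_confs):
    """Smallest confidence gain of the rule over any proper generalisation."""
    return min(rule_conf - k for k in generalisation_confs)

# pregnant & female -> oedema gains nothing over pregnant -> oedema,
# so any minimum improvement constraint > 0 prunes it:
print(improvement(0.2, [0.2, 0.05]))  # 0.0
```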
Non redundant rules
• ∀x, y, z, s, c: x → y [conf = 1.0] ∧ x → z [supp = s, conf = c] ⊨ x, y → z [supp = s, conf = c]
  Eg pregnant → oedema [supp = 0.1, conf = 0.2] ⊨ pregnant, female → oedema [supp = 0.1, conf = 0.2] (since pregnant → female [conf = 1.0])
• A rule X → Y [supp = s, conf = c] is redundant iff ∃x ∈ X: X\{x} → Y [supp = s, conf = c] or ∃y ∈ Y: X → Y\{y} [supp = s, conf = c]
  Eg, pregnant, female → oedema
• Closed itemset approaches lead to efficient generation of non-redundant rules, because a rule is non-redundant iff all immediate specialisations are closed itemsets.
• Note, redundant rules have improvement of 0.0.
Effect of each filter on the number of rules

dataset              none   non-redundant    %    improvement > 0    %
bms webview           170        170        100         155         91
covtype               998        815         82         143         14
ipums.la.99           973        959         99         481         49
kddcup98              995        992        100         939         94
letter-recognition    541        524         97         421         78
mush                  891        469         53         128         14
retail                590        590        100         519         88
shuttle               666        595         89         312         47
splice-junction       748        727         97         699         93
ticdata-2000          996        996        100         988         99
Part 5:
False discoveries
False discoveries
• Massive search leads to a high risk of false discoveries
  • eg 100 observations, two independent events each occurring with 0.5 probability:
    • the probability of perfect correlation is 7.8 × 10^-31
  • if there are 1000 events then there are 2^1000 ≈ 1.07 × 10^301 antecedent–consequent pairs
• What constitutes a false discovery depends upon the analytic objective
• Usually should include rules where
  • antecedent and consequent are independent
  • antecedent and consequent are independent given a generalisation of the antecedent
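
The first figure is easy to verify: two independent events with probability 0.5 each agree on a single observation with probability 0.25 + 0.25 = 0.5, so perfect agreement across 100 observations has probability 0.5^100:

```python
print(0.5 ** 100)  # 7.888609052210118e-31, the 7.8e-31 quoted above (truncated)
```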
Testing independence
• Cannot perform a simple test of independence because of the multiple comparisons problem
  • such tests were used previously (eg Webb, Butler & Newlands, 2003) as a statistically unsound filter
Standard statistical correction
• Bonferroni
  • to maintain experimentwise risk ≤ α for n tests
  • use critical value = α / n
• Holm procedure
  • to maintain experimentwise risk ≤ α for n tests with p values ordered from lowest to highest, p1 … pn
  • accept the tests corresponding to p1 … pk, where k is the highest value such that ∀ 1 ≤ i ≤ k: pi ≤ α / (n − i + 1)
  • eg with α = 0.05 and n = 4:
    p values:        0.0100, 0.0200, 0.0400, 0.0400
    critical values: 0.0125, 0.0167, 0.0250, 0.0500
    → accept, reject, reject, reject (0.0200 > 0.0167, and the procedure stops at the first failure)
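
A minimal sketch of the step-down rule (an assumed helper, not the tutorial's code), consistent with the worked example:

```python
def holm(p_values, alpha=0.05):
    """Holm step-down: accept the i-th smallest p while p_i <= alpha/(n-i+1)."""
    n = len(p_values)
    accepted = 0
    for i, p in enumerate(sorted(p_values), start=1):
        if p <= alpha / (n - i + 1):
            accepted = i
        else:
            break                  # stop at the first failure
    return accepted                # number of ordered tests accepted

print(holm([0.0100, 0.0200, 0.0400, 0.0400]))  # 1: only the smallest p passes
```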
Direct adjustment
• I used to think "cannot perform a simple adjustment such as Bonferroni or Holm because rule spaces are so large, eg 2^1000 (> 1.0 × 10^301)
  • would result in unacceptable type-2 error
  • eg critical value = 5.0 × 10^-303"
• However, search is often restricted to small antecedents (eg ≤ 4), resulting in Bonferroni-adjusted critical values of magnitude 10^-10 … 10^-20
• With such adjustments, many rules can often be found
• Cannot order p values to apply the Holm procedure
Discovery as hypothesis generation
• Important to trade off the risks of both type-1 and type-2 errors
• Perhaps best viewed as hypothesis generation, recognising that 'discovered' patterns require independent assessment
Hypothesis testing: proposal
• Why not automate such assessment?
[Figure: the data are split into an exploratory set and a holdout set. Exploratory pattern discovery on the exploratory data yields candidate patterns (a small set is preferable, to limit type-2 error). Statistical evaluation on the holdout data, using any hypothesis test with a Holm adjustment, yields sound patterns.]
Direct adjustment vs Holdout
Direct adjustment
• All data used for exploration and evaluation
• Bonferroni adjustment
• Larger adjustment
• Adjustment alters with the size of the search space

Holdout
• Half the data used for each of exploration and evaluation
• Holm procedure
• Smaller adjustment
• Adjustment alters with the number of rules found
Case study: Ten widely used data sets
Name             Description                               Records   Attribute-values
BMS webview      products viewed at a commercial website    59,601        497
covtype          forest cover data                         581,012        125
ipums.la.99      Los Angeles census data                    88,443      1,874
kddcup98         charity donors                             52,256     19,662
letter-recog'n   digital image recognition                  20,000         74
mush             identification of poisonous mushrooms       8,124        127
retail           retail market basket data                  88,162     16,470
shuttle          records of space shuttle flight data       58,000         34
splice-junction  DNA sequence records                        3,177        243
ticdata-2000     insurance risk assessment                   5,822        689
Detecting spurious rules
• Assuming interest only in positive associations
  • P(C | A) > P(C)
• For any rule A → C, we want to assess whether it has higher confidence than all of its generalisations
  • eg, is confidence(pregnant & female → B) >
    • confidence(pregnant → B)?
    • confidence(female → B)?
    • confidence(true → B)?
Detecting spurious rules (cont)
• Perform one-tailed Fisher exact tests with respect to each generalisation
• Reject the rule if any test does not exceed the critical value
  • no need to adjust for multiple comparisons with respect to the multiple tests for a single rule
• Use the Holm adjustment for strict control of type-1 error
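
A sketch of a single such test, assuming SciPy is available (the helper and the counts are hypothetical). It asks whether the consequent is significantly more frequent among the records the rule covers than among the remaining records covered by one of its generalisations:

```python
from scipy.stats import fisher_exact

def rule_vs_generalisation(sup_rule, cov_rule, sup_gen, cov_gen):
    """Support/coverage counts for A & X -> C (rule) and A -> C (generalisation)."""
    table = [[sup_rule, cov_rule - sup_rule],           # records with A & X
             [sup_gen - sup_rule,                       # records with A but not X
              (cov_gen - cov_rule) - (sup_gen - sup_rule)]]
    _, p = fisher_exact(table, alternative='greater')
    return p

# hypothetical counts: the rule covers 50 records (45 with C);
# its generalisation covers 200 records (120 with C)
print(rule_vs_generalisation(45, 50, 120, 200))         # very small p
```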
Spurious rules case study: high support & confidence non-redundant rules

Name                Records   Attribute-values   # Rules   # Accepted    %
bms webview          59,601        497            22,135      1,747      8
covtype             581,012        125            10,018          0      0
ipums.la.99          88,443      1,874             9,857        288      3
kddcup98             52,256     19,662             9,863         40     <1
letter-recognition   20,000         74             7,978        952     12
mush                  8,124        127             8,957      1,266     14
retail               88,162     16,470            11,656         97      1
shuttle              58,000         34             9,760        876      9
splice-junction       3,177        243             8,937        132      1
ticdata-2000          5,822        689            10,438         30     <1
KDDCUP98: 99.5% of rules rejected
The following 40 rules passed holdout evaluation
…
ETH12<=0 → HC15<=0 [Coverage=0.987 (25786); Support=0.946 (24722); Confidence=0.959; Lift=1.00]
…
The following 9843 rules failed holdout evaluation, adjusted critical value = 5.09E-06
…
NOEXCH=0 & ETH12<=0 → HC15<=0 [Coverage=0.984 (25703); Support=0.943 (24644); Confidence=0.959; Lift=1.00]
…
NOEXCH=0 & ETH12<=0 & MDMAUD_F=X → HC15<=0 [Coverage=0.981 (25629); Support=0.940 (24573); Confidence=0.959; Lift=1.00]
…
NOEXCH=0 & ETH12<=0 & ADATE_2>=9706 & MDMAUD_R=X → HC15<=0 [Coverage=0.981 (25623); Support=0.940 (24567); Confidence=0.959; Lift=1.00]
…
Comparison of direct adjustment and holdout tests on artificial data
[Figure: false discoveries, experimentwise error and true discoveries for direct adjustment versus holdout evaluation; averages over 100 runs, 84 true rules at antecedent size 4]
Comparison on real data
[Figure: two charts, for Retail and Letter Recognition, plotting the number of rules found by direct adjustment and by holdout evaluation against search space size (1.36 × 10^8 up to 4.56 × 10^26 for Retail; 2.33 × 10^3 up to 1.47 × 10^9 for Letter Recognition)]
Part 6:
Limitations of minimum support
Limitations of minimum support
• Discontinuity in the 'interestingness' function
• The vodka and caviar problem
  • some high-value associations are infrequent
• Feast or famine
  • minimum support is a crude control mechanism
  • often results in too few or too many associations
• Cannot handle dense data
• Cannot prune the search space using constraints on the relationship between antecedent and consequent
  • eg confidence
• Minimum support may not be relevant
  • cannot be sufficiently low to capture all valid rules
  • cannot be sufficiently high to exclude all spurious rules
Very low support rules can be significant
Data file: Brijs retail.itl [50% sample]
44081 cases / 44081 holdout cases / 16470 items
The following 5 rules passed holdout evaluation
168 & 4685 → 1 [Coverage=0.000 (3); Support=0.000 (3); Confidence estimate=0.601; Lift estimate=192.06]
168 & 3021 → 1 [Coverage=0.000 (3); Support=0.000 (3); Confidence estimate=0.601; Lift estimate=192.06]
1476 & 4685 → 1 [Coverage=0.000 (2); Support=0.000 (2); Confidence estimate=0.502; Lift estimate=160.21]
168 & 783 → 1 [Coverage=0.000 (4); Support=0.000 (3); Confidence estimate=0.501; Lift estimate=160.05]
3021 & 4685 → 1 [Coverage=0.000 (4); Support=0.000 (3); Confidence estimate=0.501; Lift estimate=160.05]
Very high support rules can be spurious
Data file: covtype.data
581012 cases / 125 values
ST15=0 → ST07=0 [Coverage=1.000 (581009); Support=1.000 (580904); Confidence=1.000; Lift=1.00]
ST07=0 → ST15=0 [Coverage=1.000 (580907); Support=1.000 (580904); Confidence=1.000; Lift=1.00]
ST15=0 → ST36=0 [Coverage=1.000 (581009); Support=1.000 (580890); Confidence=1.000; Lift=1.00]
ST36=0 → ST15=0 [Coverage=1.000 (580893); Support=1.000 (580890); Confidence=1.000; Lift=1.00]
ST15=0 → ST08=0 [Coverage=1.000 (581009); Support=1.000 (580830); Confidence=1.000; Lift=1.00]
ST08=0 → ST15=0 [Coverage=1.000 (580833); Support=1.000 (580830); Confidence=1.000; Lift=1.00]
… 197,183,686 such rules have the highest support
Roles of constraints
1. Select the most relevant patterns
   • patterns that are likely to be interesting
2. Control the number of patterns that the user must consider
3. Make computation feasible
Minimum support can get overloaded!
Part 7:
K-most interesting pattern discovery
K-most interesting pattern discovery
• Find the k patterns that maximise a measure of interest within other constraints that the user may specify
  • removes the need for a minimum support constraint
  • efficient with dense data
  • empowers the user to use a relevant measure of interest
  • user specifies the number of patterns to be returned
  • does not require either monotone or anti-monotone constraints
• Relies on efficient search
  • must be able to retain all data in memory
  • constraints must sufficiently constrain the search space
Part 8:
Itemset discovery
Itemset discovery
• In some contexts it is the collection of correlated variables that is of interest, and the rule structure is superfluous.
• If A is associated with B then B must be associated with A (in the sense of the presence of the antecedent increasing the probability of the presence of the consequent).
• Discovering interesting itemsets is an area that has been little explored.
Part 9:
Contrast discovery
Contrast sets (emerging patterns)
• Sometimes it is interesting to identify differences between contrasting groups
  • eg: how do purchasing patterns differ on weekends from weekdays?
• Contrast sets find sets of conditions that differ significantly between groups:
  ∃ij: P(cset | Gi) ≠ P(cset | Gj)
  with the size of a contrast measured by max over i, j of |support(cset, Gi) − support(cset, Gj)|
Contrast sets (cont.)
• Different analytic objective to association rules
  • more directed
  • focus on differences between groups instead of associations between variables
• Different to classification rules
  • not discriminative
  • no attempt to distinguish all individuals of each group
  • find all contrasts rather than sufficient discriminators
Can be discovered by existing techniques!
• Contrast / emerging pattern discovery is strictly equivalent to standard exploratory rule discovery with the consequent restricted to the group variable:
  ∃ij: P(cset | Gi) ≠ P(cset | Gj)  ⟺  ∃ij: P(Gi | cset) ≠ P(Gj | cset)
Part 10:
Impact rules
Impact rules (quantitative association rules)
• Most rule discovery techniques require that numeric variables be discretised.
• This often loses important information.
• Impact rules associate an antecedent with a distribution on a numeric variable.
• The user specifies what makes a distribution interesting
  • eg largest mean, smallest standard deviation, …
• The system finds rules that maximise the measure of interest within other user-specified constraints
Impact rule discovery example
LengthOfStay: mean = 10.6; min = -6; max = 1687; sum = 367781

COUNTRYOFBIRTH=1100 -> LengthOfStay: Coverage=0.054 (1861); Mean=22.2; Min=-4; Max=1687; Sum=41314; Impact=21612.4

ADMITDay=Wednesday -> LengthOfStay: Coverage=0.159 (5518); Mean=13.3; Min=0; Max=1548; Sum=73389; Impact=15307.6
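
A rough reconstruction of the impact figure (my reading, not a definition given in the slides): impact is the total of the target values over the covered records, minus the total those records would contribute if they were merely average; the small discrepancy from the published figure likely comes from rounding in the reported overall mean:

```python
def impact(covered_sum, covered_count, overall_mean):
    """Excess of the covered records' total over the total expected at the mean."""
    return covered_sum - covered_count * overall_mean

# COUNTRYOFBIRTH=1100: sum 41314 over 1861 records, overall mean 10.6
print(impact(41314, 1861, 10.6))  # 21587.4, close to the reported 21612.4
```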
Summary
• Exploratory pattern discovery empowers the user to select the patterns that are most useful
• Rules provide a modular and powerful knowledge representation formalism
• Association rules discover associations between qualitative variables that are frequent
• K-optimal rules discover associations between qualitative variables that optimise a measure of interest
• Impact rules discover associations between qualitative and quantitative variables
• Contrasts discover differences in distributions over variables between different groups
• If you mine for patterns without appropriate statistical evaluation, expect to find fool's gold!