Mika Klemettinen and Pirjo Moen
University of Helsinki/Dept of CS
Autumn 2001
Course on Data Mining (581550-4)
[Course timeline: Intro/Ass. Rules (24./26.10.), Episodes, Clustering, Text Mining, KDD Process, Appl./Summary, Home Exam; remaining dates 30.10., 7.11., 14.11., 21.11., 28.11.]
Today 26.10.2001
• Summary:
  – Course organization
  – What is data mining?
• Today's subject:
  – Association rules
• Next week's program:
  – Lecture: Episodes
  – Exercise: Associations
  – Seminar: Associations
Course Organization
Lectures, Exercises, Exam
• 12 lectures: 24.10.-30.11.2001
  – Wed 14-16, Fri 12-14 (A217)
    • Wed: normal lecture
    • Fri: seminar-like lecture (except for 26.10.)
• 5 exercise sessions: 1.11.-29.11.2001
  – Thu 12-14 (A318)
• Home exam:
  – Given: 28.11., due back: 21.12.2001
• Language:
  – Lecturing language is Finnish
  – Slides and material are in English
Course Organization
Group Work
• Groups for seminar (and exercise) work:
  – 10 groups of 3 persons, 2 groups/lecture
  – Dates are agreed at the beginning of the course
  – Articles are given on the previous week's Wed
• Seminar presentations:
  – Presentation material (around 3-5 printed pages) due before the seminar starts:
    • Can be either an HTML page or a printable document in PostScript/PDF format
  – 30 minutes of presentation
  – 5-15 minutes of discussion
  – Active participation
Course Organization / Groups
• Group presentation time allocation:
  – Fri 2.11.: Group 1, Group 2 (associations)
  – Fri 9.11.: Group 3, Group 4 (episodes)
  – Fri 16.11.: Group 5, Group 6 (text mining)
  – Fri 23.11.: Group 7, Group 8 (clustering)
  – Fri 30.11.: Group 9, Group 10 (KDD process)
Course Organization
Course Evaluation
• Passing the course: min 30 points
  – home exam: min 13 points (max 30 points)
  – exercises/experiments: min 8 points (max 20 points)
    • at least 3 returned and reported experiments
  – group presentation: min 4 points (max 10 points)
• Remember also the other requirements:
  – Attending the lectures (5/7)
  – Attending the seminars (4/5)
  – Attending the exercises (4/5)
Course Organization
Course Material
• Lecture slides
• Original articles
• Seminar presentations
• Book: "Data Mining: Concepts and Techniques" by Jiawei Han and Micheline Kamber, Morgan Kaufmann Publishers, August 2000. 550 pages. ISBN 1-55860-489-8
• Remember to check the course website and folder for the material!
Summary: What is Data Mining?
• Ultimately:
  – "Extraction of interesting (non-trivial, implicit, previously unknown, potentially useful) information or patterns from data in large databases"
• Often just:
  – "Tell something interesting about this data", "Describe this data"
⇒ Exploratory, semi-automatic data analysis on large data sets
Summary: What is Data Mining?
• Data mining: semi-automatic discovery of interesting patterns from large data sets
• Knowledge discovery is a process:
  – Preprocessing
  – Data mining
  – Postprocessing
• Many different things can be mined, used, or utilized:
  – Databases (relational, object-oriented, spatial, WWW, …)
  – Knowledge (characterization, clustering, association, …)
  – Techniques (machine learning, statistics, visualization, …)
  – Applications (retail, telecom, Web mining, log analysis, …)
Summary: Typical KDD Process
[Diagram: the typical KDD process. (1) Preprocessing: time-based selection from the operational database turns raw data into cleaned, verified, focused input data. (2) Data mining turns the input data into results. (3) Postprocessing: evaluation of interestingness selects the usable patterns for utilization.]
Association Rules
Basics
Examples
Generation
Multi-level Rules
Constraints
Market Basket Analysis
• Analysis of customer buying habits by finding
associations and correlations between the different
items that customers place in their "shopping basket"
  – Customer 1: milk, eggs, sugar, bread
  – Customer 2: milk, eggs, cereal, bread
  – Customer 3: eggs, sugar
Market Basket Analysis
• Given:
• A database of customer transactions (e.g., shopping baskets),
where each transaction is a set of items (e.g., products)
• Find:
• Groups of items which are frequently purchased together
Market Basket Analysis
• Extract information on purchasing behavior
  – "IF buys beer and sausage, THEN also buys mustard with high probability"
• Actionable information: can suggest...
  – New store layouts and product assortments
  – Which products to put on promotion
• The MBA approach is applicable whenever a customer purchases multiple things in proximity
  – Credit cards
  – Services of telecommunication companies
  – Banking services
  – Medical treatments
Market Basket Analysis
• Useful:
"On Thursdays, grocery store consumers often
purchase diapers and beer together."
• Trivial:
"Customers who purchase maintenance agreements
are very likely to purchase large appliances."
• Inexplicable/unexpected:
"When a new hardware store opens, one of the most sold items is toilet rings."
Association Rules: Basics
• Association rule mining:
– Finding frequent patterns, associations, correlations, or causal
structures among sets of items or objects in transaction databases,
relational databases, and other information repositories.
• Comprehensibility: Simple to understand
• Utilizability: Provide actionable information
• Efficiency: Efficient discovery algorithms exist
• Applications:
  – Market basket data analysis, cross-marketing, catalog design, loss-leader analysis, clustering, classification, etc.
Association Rules: Basics
• Typical representation formats for association rules:
  – diapers ⇒ beer [0.5%, 60%]
  – buys:diapers ⇒ buys:beer [0.5%, 60%]
  – "IF buys diapers, THEN buys beer in 60% of the cases. Diapers and beer are bought together in 0.5% of the rows in the database."
• Other representations (used in Han's book):
  – buys(x, "diapers") ⇒ buys(x, "beer") [0.5%, 60%]
  – major(x, "CS") ^ takes(x, "DB") ⇒ grade(x, "A") [1%, 75%]
Association Rules: Basics
diapers  beer [0.5%, 60%]
1
2
3
4
"IF buys diapers,
THEN buys beer
in 60% of the cases
in 0.5% of the rows"
1 Antecedent, left-hand side (LHS), body
2 Consequent, right-hand side (RHS), head
3 Support, frequency ("in how big part of the data the things
in left- and right-hand sides occur together")
4 Confidence, strength ("if the left-hand side occurs, how
likely the right-hand side occurs")
Association Rules: Basics
• Support: denotes the frequency of the rule within transactions.
    support(A ⇒ B [s, c]) = p(A ∪ B) = support({A, B})
• Confidence: denotes the percentage of transactions containing A which also contain B.
    confidence(A ⇒ B [s, c]) = p(B|A) = p(A ∪ B) / p(A) = support({A, B}) / support({A})
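These definitions translate directly into code. A minimal sketch in Python (the basket data below is illustrative, not from the course material):

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(lhs, rhs, transactions):
    """support(LHS and RHS together) divided by support(LHS)."""
    return support(lhs | rhs, transactions) / support(lhs, transactions)

baskets = [{"diapers", "beer", "milk"},
           {"diapers", "beer"},
           {"diapers", "bread"},
           {"milk", "bread"}]

s = support({"diapers", "beer"}, baskets)       # 2 of 4 baskets -> 0.5
c = confidence({"diapers"}, {"beer"}, baskets)  # 2 of 3 diaper baskets
```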
Association Rules: Basics
• Minimum support:
  – High ⇒ few frequent itemsets ⇒ few valid rules which occur very often
  – Low ⇒ many valid rules which occur rarely
• Minimum confidence:
  – High ⇒ few rules, but all "almost logically true"
  – Low ⇒ many rules, many of them very "uncertain"
• Typical values: min. support = 2-10 %, min. confidence = 70-90 %
Association Rules: Basics
• Transaction:
  – Relational format:        Compact format:
      <Tid, item>               <Tid, itemset>
      <1, item1>                <1, {item1, item2}>
      <1, item2>                <2, {item3}>
      <2, item3>
• Item vs. itemset: a single element vs. a set of items
• Support of an itemset I: the number of transactions containing I
• Minimum support σ: threshold for support
• Frequent itemset: an itemset with support ≥ σ
Association Rules: Basics
• Given: (1) a database of transactions, (2) each transaction is a list of items bought (purchased by a customer in a visit)

  Transaction ID | Items Bought
  100            | A, B, C
  200            | A, C
  400            | A, D
  500            | B, E, F

  Frequent itemset  | Support
  {A}               | 3 or 75%
  {B} and {C}       | 2 or 50%
  {D}, {E} and {F}  | 1 or 25%
  {A, C}            | 2 or 50%
  Other item pairs  | max 25%

• Find: all rules with minimum support and confidence
• If min. support is 50% and min. confidence is 50%, then
  A ⇒ C [50%, 66.6%], C ⇒ A [50%, 100%]
Association Rule Generation
• Association rule mining is a two-step process:
STEP 1: Find the frequent itemsets: the sets of
items that have minimum support.
– The so-called Apriori trick: every subset of a frequent itemset must also be a frequent itemset:
  • i.e., if {A, B} is a frequent itemset, both {A} and {B} must be frequent itemsets
– Iteratively find frequent itemsets of sizes 1 to k (k-itemsets)
STEP 2: Use the frequent itemsets to generate
association rules.
Frequent Sets with Apriori
• Join step: Ck is generated by joining Lk-1 with itself
• Prune step: any (k-1)-itemset that is not frequent cannot be a subset of a frequent k-itemset
• Pseudo-code:
  Ck: candidate itemsets of size k; Lk: frequent itemsets of size k

  L1 = {frequent items};
  for (k = 1; Lk != ∅; k++) do begin
      Ck+1 = {candidates generated from Lk};
      for each transaction t in database do
          increment the count of all candidates in Ck+1
          that are contained in t
      Lk+1 = {candidates in Ck+1 with min_support}
  end
  return ∪k Lk;
Apriori Candidate Generation
• The Apriori principle:
Any subset of a frequent itemset must be frequent
• L3={abc, abd, acd, ace, bcd}
• Self-joining: L3*L3
– abcd from abc and abd
– acde from acd and ace
• Pruning:
– acde is removed because ade is not in L3
• C4={abcd}
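The join and prune steps above can be sketched as a standalone function; running it on the slide's L3 reproduces C4 = {abcd}:

```python
from itertools import combinations

def apriori_gen(Lk, k):
    """Join Lk with itself, then prune any candidate that has an
    infrequent k-subset (the Apriori principle)."""
    candidates = set()
    items = list(Lk)
    for i in range(len(items)):
        for j in range(i + 1, len(items)):
            union = items[i] | items[j]
            if len(union) == k + 1 and all(
                    frozenset(s) in Lk for s in combinations(union, k)):
                candidates.add(union)
    return candidates

L3 = {frozenset(s) for s in ("abc", "abd", "acd", "ace", "bcd")}
C4 = apriori_gen(L3, 3)  # acde is pruned because ade is not in L3
```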
Apriori Example (1/6)
Database D:
  TID | Items
  100 | 1 3 4
  200 | 2 3 5
  300 | 1 2 3 5
  400 | 2 5

Scan D → C1:
  itemset | sup.
  {1}     | 2
  {2}     | 3
  {3}     | 3
  {4}     | 1
  {5}     | 3

L1:
  itemset | sup.
  {1}     | 2
  {2}     | 3
  {3}     | 3
  {5}     | 3
Apriori Example (2/6)
C2 (candidates joined from L1):
  {1 2}, {1 3}, {1 5}, {2 3}, {2 5}, {3 5}

Scan D → C2 with counts:
  itemset | sup
  {1 2}   | 1
  {1 3}   | 2
  {1 5}   | 1
  {2 3}   | 2
  {2 5}   | 3
  {3 5}   | 2

L2:
  itemset | sup
  {1 3}   | 2
  {2 3}   | 2
  {2 5}   | 3
  {3 5}   | 2
Apriori Example (3/6)
C3:
  itemset
  {2 3 5}

Scan D → L3:
  itemset | sup
  {2 3 5} | 2
Apriori Example (4/6)
[Figure: the search space of database D: the lattice of all itemsets over {1, 2, 3, 4, 5}, from the single items up to 12345.]
Apriori Example (5/6)
[Figure: the Apriori trick on level 1: since item 4 is infrequent, every itemset containing 4 is pruned from the lattice.]
Apriori Example (6/6)
[Figure: the Apriori trick on level 2: supersets of the infrequent pairs {1 2} and {1 5} are pruned, leaving {2 3 5} as the only candidate 3-itemset.]
Is Apriori Fast Enough?
• The core of the Apriori algorithm:
  – Use frequent (k-1)-itemsets to generate candidate frequent k-itemsets
  – Use database scans and pattern matching to collect counts for the candidate itemsets
• The bottleneck of Apriori: candidate generation
  – Huge candidate sets:
    • 10^4 frequent 1-itemsets will generate 10^7 candidate 2-itemsets
    • To discover a frequent pattern of size 100, e.g., {a1, a2, …, a100}, one needs to generate 2^100 ≈ 10^30 candidates.
  – Multiple scans of the database:
    • Needs (n + 1) scans, where n is the length of the longest pattern
Is Apriori Fast Enough?
• In practice:
– For basic Apriori approach the number of attributes in the row is
usually much more critical than the number of transaction rows
– For example:
• 50 attributes each having 1-3 values, 100.000 rows (not very bad)
• 50 attributes each having 10-100 values, 100.000 rows (quite bad)
• 10.000 attributes each having 5-10 values, 100 rows (very bad...)
– Notice:
• One attribute might have several different values
• Association rule algorithms typically treat every attribute-value pair as one attribute (2 attributes with 5 values each => "10 attributes")
• There are some ways to overcome the problem...
Improving Apriori Performance
• Hash-based itemset counting:
– A k-itemset whose corresponding hashing bucket count is below
the threshold cannot be frequent
• Transaction reduction:
– A transaction that does not contain any frequent k-itemset is
useless in subsequent scans
• Partitioning:
– Any itemset that is potentially frequent in DB must be frequent in
at least one of the partitions of DB
• Sampling:
– Mining on a subset of given data, lower support threshold + a
method to determine the completeness
Association Rules from Itemsets
• Pseudo-code:
  for every frequent itemset l
      generate all nonempty proper subsets s of l
      for every such subset s
          output the rule "s ⇒ (l - s)" if support(l)/support(s) ≥ min_conf,
          where min_conf is the minimum confidence threshold
• E.g.: frequent set l = {abc}, subsets s = {a, b, c, ab, ac, bc}
  – a ⇒ b, a ⇒ c, b ⇒ c
  – a ⇒ bc, b ⇒ ac, c ⇒ ab
  – ab ⇒ c, ac ⇒ b, bc ⇒ a
Association Rule Generation
• Rule 1 to remember:
– Generating frequent sets is slow (especially itemsets of size 2)
– Generating association rules from frequent itemsets is fast
• Rule 2 to remember:
– For itemset generation, support threshold is used
– For association rules, confidence threshold is used
• What happens in reality: how long does it take to create frequent sets and association rules?
  – Let's take small real-life examples…
  – Experiments were made on a Citum 4/275 Alpha server with 512 MB of main memory & Red Hat Linux release 5.0 (kernel 2.0.30)
Performance Example (1/4)
Network Management System
[Figure: a mobile network as the alarm source: MSCs in the switched network; BSCs and BTSs in the access network, which produces the alarms.]

MSC = mobile station controller
BSC = base station controller
BTS = base station transceiver
Performance Example (2/4)
• Telecom data containing alarms:
  1234 EL1 PCM 940926082623 A1 ALARMTEXT..
  (alarm number, alarming network element, alarm type, date and time, alarm severity class)
• Example data 1:
– 43 478 alarms (26.9.94 - 5.10.94; ~ 10 days)
– 2 234 different types of alarms, 23 attributes, 5503 different values
• Example data 2:
– 73 679 alarms (1.2.95 - 22.3.95; ~ 7 weeks)
– 287 different types of alarms, 19 attributes, 3411 different values
Performance Example (3/4)
[Figures: mining results for data set 1 (~10 days) and data set 2 (~7 weeks).]

Example rule:
alarm_number=1234, alarm_type=PCM ⇒ alarm_severity=A1 [2%, 45%]
Performance Example (4/4)
• Example results for data 1:
  – Frequency threshold: 0.1 (lowest possible with this data)
  – Candidate sets: 109 719 (time: 12.02 s)
  – Frequent sets: 79 311 (time: 64 855.73 s)
  – Rules: 3 750 000 (time: 860.60 s)
• Example results for data 2:
  – Frequency threshold: 0.1 (lowest possible with this data)
  – Candidate sets: 43 600 (time: 1.70 s)
  – Frequent sets: 13 321 (time: 10 478.93 s)
  – Rules: 509 075 (time: 143.35 s)
Selecting the Interesting Rules?
• Usually the result set is very big; one must select the interesting rules based on:
  – Objective measures, two popular ones being:
    • support; and
    • confidence
  – Subjective measures (Silberschatz & Tuzhilin, KDD'95): a rule (pattern) is interesting if
    • it is unexpected (surprising to the user); and/or
    • it is actionable (the user can do something with it)
• These issues will be discussed with KDD processes
Boolean vs. Quantitative Rules
• Boolean vs. quantitative association rules (based on the
types of values handled)
– Boolean: rule concerns associations between the presence or absence of items (e.g. "buys A" or "does not buy A")
    buys=SQLServer, buys=DMBook ⇒ buys=DBMiner [2%, 60%]
    buys(x, "SQLServer") ^ buys(x, "DMBook") ⇒ buys(x, "DBMiner") [0.2%, 60%]
– Quantitative: rule concerns associations between quantitative items or attributes
    age=30..39, income=42..48K ⇒ buys=PC [1%, 75%]
    age(x, "30..39") ^ income(x, "42..48K") ⇒ buys(x, "PC") [1%, 75%]
Quantitative Rules
• Quantitative attributes: e.g., age, income, height, weight
• Categorical attributes: e.g., color of car

  CID | height | weight | income
  1   | 168    | 75,4   | 30,5
  2   | 175    | 80,0   | 20,3
  3   | 174    | 70,3   | 25,8
  4   | 170    | 65,2   | 27,0

Problem: too many distinct values for quantitative attributes
Solution: transform quantitative attributes into categorical ones via discretization ⇒ more about this in the seminar!
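The discretization step can be sketched as follows; the bin edges are illustrative, not prescribed by the course material:

```python
def discretize(value, bins):
    """Map a numeric value to a categorical interval label."""
    for low, high in bins:
        if low <= value <= high:
            return f"{low}..{high}"
    return "other"

age_bins = [(20, 29), (30, 39), (40, 49)]  # illustrative cut points
label = discretize(34, age_bins)  # falls in the "30..39" bin
```

After this mapping, an age of 34 behaves like the single categorical item "30..39" in rules such as age=30..39 ⇒ buys=PC.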
Single- vs. Multi-dimensional Rules
• Single-dimensional vs. multi-dimensional associations
– Single-dimensional: Items or attributes in the rule refer to only
one dimension (e.g., to "buys")
Beer, Chips  Bread [0.4%, 52%]
buys(x, "Beer") ^ buys(x, "Chips") buys(x, "Bread") [0.4%,
52%]
– Multi-dimensional: Items or attributes in the rule refer to two or
more dimensions (e.g., "buys", "time_of_transaction",
"customer_category")
In the following example: nationality, age, income
Multi-dimensional Rules
  CID | nationality | age | income
  1   | Italian     | 50  | low
  2   | French      | 40  | high
  3   | French      | 30  | high
  4   | Italian     | 50  | medium
  5   | Italian     | 45  | high
  6   | French      | 35  | high

RULES:
  nationality = French ⇒ income = high [50%, 100%]
  income = high ⇒ nationality = French [50%, 75%]
  age = 50 ⇒ nationality = Italian [33%, 100%]
Single- vs. Multi-level Rules
• Single-level vs. multi-level associations
– Single-level: Associations between items or attributes from the
same level of abstraction (i.e., from the same level of hierarchy)
Beer, Chips  Bread [0.4%, 52%]
– Multi-level: Associations between items or attributes from
different levels of abstraction (i.e, from different levels of
hierarchy)
    Beer:Karjala, Chips:Estrella:Barbeque ⇒ Bread [0.1%, 74%]
More about multi-level association rules on the next slides…
Multi-level Association Rules
• It is difficult to find interesting patterns at too primitive a level
  – high support ⇒ too few rules
  – low support ⇒ too many rules, most of them uninteresting
• Approach: reason at suitable level of abstraction
• A common form of background knowledge is that an
attribute may be generalized or specialized according
to a hierarchy of concepts
• Multi-level association rules: rules which combine
associations with hierarchy of concepts
Multi-level Association Rules
• Items often form hierarchies (e.g., Food → bread, milk; milk → skim, 2%; 2% milk → Fraser, Sunset)
• Items at the lower level are expected to have lower support
• Rules regarding itemsets at appropriate levels could be quite useful
• A transaction database can be encoded based on dimensions and levels
Multi-level Association Rules
Hierarchy with one code digit per level (e.g., 121 = milk, 2%, Fraser):
  Food
    milk (1): skim (1), 2% (2); brands: Fraser (1), Sunset (2)
    bread (2): wheat (1), white (2)

Encoded transaction database:
  TID | Items
  T1  | {111, 121, 211, 221}
  T2  | {111, 211, 222, 323}
  T3  | {112, 122, 221, 411}
  T4  | {111, 121}
  T5  | {111, 122, 211, 221, 413}
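With this digit-per-level encoding, generalizing an item is just truncating its code, and level-wise support becomes a prefix count. A sketch using the encoded transactions from the slide, interpreting the leading digit 1 as milk per "121 = milk, 2%, Fraser":

```python
def generalize(code, level):
    """Encoded item: one digit per hierarchy level ('121' = milk, 2%, Fraser).
    Generalizing to `level` keeps that many leading digits."""
    return code[:level]

def support_count(transactions, prefix):
    """Transactions containing any item whose encoding starts with `prefix`."""
    return sum(1 for t in transactions
               if any(item.startswith(prefix) for item in t))

T = [{"111", "121", "211", "221"},
     {"111", "211", "222", "323"},
     {"112", "122", "221", "411"},
     {"111", "121"},
     {"111", "122", "211", "221", "413"}]

milk_support = support_count(T, generalize("121", 1))  # any milk item
fraser_2pct = support_count(T, "121")                  # the exact item
```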
Multi-level Association Rules
• A top-down, progressive deepening approach:
  – First find high-level strong rules:
      milk ⇒ bread [20%, 60%]
  – Then find their lower-level "weaker" rules:
      2% milk ⇒ wheat bread [6%, 50%]
• Variations of mining multi-level association rules:
  – Level-crossed association rules:
      milk ⇒ wheat bread
  – Association rules with multiple, alternative hierarchies:
      milk ⇒ Wonder bread
Multi-level Association Rules
• Generalizing/specializing values of attributes…
  – ...from specialized to general: the support of rules increases (new rules may become valid)
  – ...from general to specialized: the support of rules decreases (rules may become invalid as their support falls under the threshold)
• Too low a level => too many, too primitive rules
    Pepsi light 0.5l bottle ⇒ Taffel Barbeque Chips 200g
• Too high a level => uninteresting rules
    Food ⇒ Clothes
Redundancy Filtering
• Some rules may be redundant due to "ancestor"
relationships between items
• Example (milk has 4 subclasses):
  – milk ⇒ wheat bread [support = 8%, confidence = 70%]
  – 2% milk ⇒ wheat bread [support = 2%, confidence = 72%]
• We say the first rule is an ancestor of the second rule
• A rule is redundant if its support is close to the "expected" value, based on the rule's ancestor
  – Above, the second rule could be redundant
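The redundancy check can be sketched numerically; the uniform-split expectation below is my simplification of the idea, not the exact test from the literature:

```python
def expected_support(ancestor_support, n_subclasses):
    """If the subclasses split the ancestor's support roughly uniformly,
    a specialized rule is expected to carry about 1/n of it."""
    return ancestor_support / n_subclasses

# milk => wheat bread has support 8%; milk has 4 subclasses
exp = expected_support(0.08, 4)
# observed support of "2% milk => wheat bread" from the slide
observed = 0.02
redundant = abs(observed - exp) < 0.005  # close to expectation -> redundant
```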
Uniform vs. Reduced Support
• Uniform support: the same minimum support for all levels
  + One minimum support threshold. No need to examine itemsets containing any item whose ancestors do not have minimum support.
  – Lower-level items do not occur as frequently. If the support threshold is
    • too high ⇒ miss low-level associations
    • too low ⇒ generate too many high-level associations
• Reduced support: reduced minimum support at lower levels
Uniform Support
Multi-level mining with uniform support:
  Level 1 (min_sup = 5%): Milk [support = 10%]
  Level 2 (min_sup = 5%): 2% Milk [support = 6%], Skim Milk [support = 4%]
Reduced Support
Multi-level mining with reduced support:
  Level 1 (min_sup = 5%): Milk [support = 10%]
  Level 2 (min_sup = 3%): 2% Milk [support = 6%], Skim Milk [support = 4%]
Progressive Deepening
• A top-down, progressive deepening approach:
  – First mine high-level frequent items:
      milk (15%), bread (10%)
  – Then mine their lower-level "weaker" frequent itemsets:
      2% milk (5%), wheat bread (4%)
• Different min_support thresholds across the levels lead to different algorithms:
  – If adopting the same min_support across all levels,
    then do not examine an itemset t if any of t's ancestors is infrequent
  – If adopting reduced min_support at lower levels,
    then examine only those descendants whose ancestor's support is frequent/non-negligible
Constraint-Based Mining
• Interactive, exploratory mining of gigabytes of data?
  – Could it be real? Yes, by making good use of constraints!
• What kinds of constraints can be used in mining?
  – Knowledge type constraint: classification, association, etc.
  – Data constraint: SQL-like queries
    • Find product pairs sold together in Vancouver in Dec. '98
  – Dimension/level constraints:
    • In relevance to region, price, brand, customer category
  – Interestingness constraints:
    • Strong rules (min_support ≥ 3%, min_confidence ≥ 60%)
  – Rule constraints (see the next slides)
Rule Constraints
• Two kinds of rule constraints:
  – Rule form constraints: meta-rule guided mining
    • Metarule: P(X, Y) ^ Q(X, W) ⇒ takes(X, "database systems")
    • Matching rule: age(X, "30..39") ^ income(X, "41K..60K") ⇒ takes(X, "database systems")
  – Rule content constraints: constraint-based query optimization (Ng, et al., SIGMOD'98)
    • sum(LHS) < 100 ^ min(LHS) > 20 ^ count(LHS) > 3 ^ sum(RHS) > 1000
Rule Constraints
• 1-variable vs. 2-variable constraints (Lakshmanan, et al., SIGMOD'99):
  – 1-var: a constraint confining only one side (L/R) of the rule, e.g.,
    • sum(LHS) < 100 ^ min(LHS) > 20 ^ count(LHS) > 3 ^ sum(RHS) > 1000
  – 2-var: a constraint confining both sides (L and R), e.g.,
    • sum(LHS) < min(RHS) ^ max(RHS) < 5 * sum(LHS)
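These constraints are plain predicates over the item values (e.g., prices) on each side of a rule. A sketch with illustrative numbers:

```python
def one_var_ok(lhs, rhs):
    """1-var example from the slide: each clause confines a single side."""
    return (sum(lhs) < 100 and min(lhs) > 20
            and len(lhs) > 3 and sum(rhs) > 1000)

def two_var_ok(lhs, rhs):
    """2-var example: the clauses relate the two sides to each other."""
    return sum(lhs) < min(rhs) and max(rhs) < 5 * sum(lhs)

ok1 = one_var_ok([21, 22, 23, 24], [600, 700])  # 90<100, 21>20, 4>3, 1300>1000
ok2 = two_var_ok([10, 15], [30, 40])            # 25<30, 40<125
```

In constraint-based mining such predicates are pushed into the search itself rather than applied afterwards, pruning candidates early.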
Summary
• Association rule mining:
– Probably the most significant contribution from the database
community in KDD
  – A rather simple concept, but the "thinking" behind it gives a basis for extensions and other methods
– A large number of papers have been published
• Many interesting issues have been explored
• Interesting research directions:
– Association analysis in other types of data: spatial data,
multimedia data, time series data, etc.
References (1/5)
• R. Agarwal, C. Aggarwal, and V. V. V. Prasad. A tree projection algorithm for generation of frequent itemsets. Journal of Parallel and Distributed Computing (Special Issue on High Performance Data Mining), 2000.
• R. Agrawal, T. Imielinski, and A. Swami. Mining association rules between sets of items in large databases. SIGMOD'93, 207-216, Washington, D.C.
• R. Agrawal and R. Srikant. Fast algorithms for mining association rules. VLDB'94, 487-499, Santiago, Chile.
• R. Agrawal and R. Srikant. Mining sequential patterns. ICDE'95, 3-14, Taipei, Taiwan.
• R. J. Bayardo. Efficiently mining long patterns from databases. SIGMOD'98, 85-93, Seattle, Washington.
• S. Brin, R. Motwani, and C. Silverstein. Beyond market basket: Generalizing association rules to correlations. SIGMOD'97, 265-276, Tucson, Arizona.
• S. Brin, R. Motwani, J. D. Ullman, and S. Tsur. Dynamic itemset counting and implication rules for market basket analysis. SIGMOD'97, 255-264, Tucson, Arizona, May 1997.
• K. Beyer and R. Ramakrishnan. Bottom-up computation of sparse and iceberg cubes. SIGMOD'99, 359-370, Philadelphia, PA, June 1999.
• D. W. Cheung, J. Han, V. Ng, and C. Y. Wong. Maintenance of discovered association rules in large databases: An incremental updating technique. ICDE'96, 106-114, New Orleans, LA.
• M. Fang, N. Shivakumar, H. Garcia-Molina, R. Motwani, and J. D. Ullman. Computing iceberg queries efficiently. VLDB'98, 299-310, New York, NY, Aug. 1998.
References (2/5)
• G. Grahne, L. Lakshmanan, and X. Wang. Efficient mining of constrained correlated sets. ICDE'00, 512-521, San Diego, CA, Feb. 2000.
• Y. Fu and J. Han. Meta-rule-guided mining of association rules in relational databases. KDOOD'95, 39-46, Singapore, Dec. 1995.
• T. Fukuda, Y. Morimoto, S. Morishita, and T. Tokuyama. Data mining using two-dimensional optimized association rules: Scheme, algorithms, and visualization. SIGMOD'96, 13-23, Montreal, Canada.
• E.-H. Han, G. Karypis, and V. Kumar. Scalable parallel data mining for association rules. SIGMOD'97, 277-288, Tucson, Arizona.
• J. Han, G. Dong, and Y. Yin. Efficient mining of partial periodic patterns in time series database. ICDE'99, Sydney, Australia.
• J. Han and Y. Fu. Discovery of multiple-level association rules from large databases. VLDB'95, 420-431, Zurich, Switzerland.
• J. Han, J. Pei, and Y. Yin. Mining frequent patterns without candidate generation. SIGMOD'00, 1-12, Dallas, TX, May 2000.
• T. Imielinski and H. Mannila. A database perspective on knowledge discovery. Communications of the ACM, 39:58-64, 1996.
• M. Kamber, J. Han, and J. Y. Chiang. Metarule-guided mining of multi-dimensional association rules using data cubes. KDD'97, 207-210, Newport Beach, California.
• M. Klemettinen, H. Mannila, P. Ronkainen, H. Toivonen, and A. I. Verkamo. Finding interesting rules from large sets of discovered association rules. CIKM'94, 401-408, Gaithersburg, Maryland.
References (3/5)
• F. Korn, A. Labrinidis, Y. Kotidis, and C. Faloutsos. Ratio rules: A new paradigm for fast, quantifiable data mining. VLDB'98, 582-593, New York, NY.
• B. Lent, A. Swami, and J. Widom. Clustering association rules. ICDE'97, 220-231, Birmingham, England.
• H. Lu, J. Han, and L. Feng. Stock movement and n-dimensional inter-transaction association rules. SIGMOD Workshop on Research Issues on Data Mining and Knowledge Discovery (DMKD'98), 12:1-12:7, Seattle, Washington.
• H. Mannila, H. Toivonen, and A. I. Verkamo. Efficient algorithms for discovering association rules. KDD'94, 181-192, Seattle, WA, July 1994.
• H. Mannila, H. Toivonen, and A. I. Verkamo. Discovery of frequent episodes in event sequences. Data Mining and Knowledge Discovery, 1:259-289, 1997.
• R. Meo, G. Psaila, and S. Ceri. A new SQL-like operator for mining association rules. VLDB'96, 122-133, Bombay, India.
• R. J. Miller and Y. Yang. Association rules over interval data. SIGMOD'97, 452-461, Tucson, Arizona.
• R. Ng, L. V. S. Lakshmanan, J. Han, and A. Pang. Exploratory mining and pruning optimizations of constrained associations rules. SIGMOD'98, 13-24, Seattle, Washington.
• N. Pasquier, Y. Bastide, R. Taouil, and L. Lakhal. Discovering frequent closed itemsets for association rules. ICDT'99, 398-416, Jerusalem, Israel, Jan. 1999.
References (4/5)
J.S. Park, M.S. Chen, and P.S. Yu. An effective hash-based algorithm for mining association rules.
SIGMOD'95, 175-186, San Jose, CA, May 1995.
J. Pei, J. Han, and R. Mao. CLOSET: An efficient algorithm for mining frequent closed itemsets.
DMKD'00, 11-20, Dallas, TX, May 2000.
J. Pei and J. Han. Can we push more constraints into frequent pattern mining? KDD'00, Boston,
MA, Aug. 2000.
G. Piatetsky-Shapiro. Discovery, analysis, and presentation of strong rules. In G. Piatetsky-Shapiro
and W. J. Frawley, editors, Knowledge Discovery in Databases, 229-238. AAAI/MIT Press, 1991.
B. Ozden, S. Ramaswamy, and A. Silberschatz. Cyclic association rules. ICDE'98, 412-421,
Orlando, FL.
S. Ramaswamy, S. Mahajan, and A. Silberschatz. On the discovery of interesting patterns in
association rules. VLDB'98, 368-379, New York, NY.
S. Sarawagi, S. Thomas, and R. Agrawal. Integrating association rule mining with relational
database systems: Alternatives and implications. SIGMOD'98, 343-354, Seattle, WA.
A. Savasere, E. Omiecinski, and S. Navathe. An efficient algorithm for mining association rules in
large databases. VLDB'95, 432-443, Zurich, Switzerland.
A. Savasere, E. Omiecinski, and S. Navathe. Mining for strong negative associations in a large
database of customer transactions. ICDE'98, 494-502, Orlando, FL, Feb. 1998.
References (5/5)
C. Silverstein, S. Brin, R. Motwani, and J. Ullman. Scalable techniques for mining causal structures.
VLDB'98, 594-605, New York, NY.
R. Srikant and R. Agrawal. Mining generalized association rules. VLDB'95, 407-419, Zurich,
Switzerland, Sept. 1995.
R. Srikant and R. Agrawal. Mining quantitative association rules in large relational tables.
SIGMOD'96, 1-12, Montreal, Canada.
R. Srikant, Q. Vu, and R. Agrawal. Mining association rules with item constraints. KDD'97, 67-73,
Newport Beach, California.
H. Toivonen. Sampling large databases for association rules. VLDB'96, 134-145, Bombay, India,
Sept. 1996.
D. Tsur, J. D. Ullman, S. Abiteboul, C. Clifton, R. Motwani, and S. Nestorov. Query flocks: A
generalization of association-rule mining. SIGMOD'98, 1-12, Seattle, Washington.
K. Yoda, T. Fukuda, Y. Morimoto, S. Morishita, and T. Tokuyama. Computing optimized rectilinear
regions for association rules. KDD'97, 96-103, Newport Beach, CA, Aug. 1997.
M. J. Zaki, S. Parthasarathy, M. Ogihara, and W. Li. Parallel algorithms for discovery of association
rules. Data Mining and Knowledge Discovery, 1:343-374, 1997.
M. Zaki. Generating non-redundant association rules. KDD'00, Boston, MA, Aug. 2000.
O. R. Zaiane, J. Han, and H. Zhu. Mining Recurrent Items in Multimedia with Progressive
Resolution Refinement. ICDE'00, 461-470, San Diego, CA, Feb. 2000.
Course Organization
Next Week
•
Lecture 31.10.: Episodes and recurrent
patterns
– Mika gives the lecture
•
Exercise 1.11.: Associations
– Pirjo takes care of you! :-)
•
Seminar 2.11.: Associations
– Pirjo gives the lecture
– 2 group presentations
Seminar Presentations
• Seminar presentations:
– Articles are given on previous
week's Wed
– Presentation material (around
3-5 printed pages) due before
the seminar starts:
• can be either an HTML
page or a printable
document in
PostScript/PDF format
– 30 minutes of presentation
– 5-15 minutes of discussion
– Active participation
Seminar Presentations
• Seminar presentations:
– Try to understand the
"message" in the article
– Try to present the basic ideas
as clearly as possible, use
examples
– Do not present detailed
mathematics or algorithms
– Test: do you understand your
own presentation?
– In the presentation, use
PowerPoint or conventional
slides
Seminar Presentations/Groups 1-2
Quantitative Rules
R. Srikant, R. Agrawal: "Mining
Quantitative Association Rules
in Large Relational Tables",
Proc. of the ACM-SIGMOD 1996
MINERULE
Rosa Meo, Giuseppe Psaila,
Stefano Ceri: "A New SQL-like
Operator for Mining Association
Rules". VLDB 1996: 122-133
Introduction to Data Mining (DM)
Thank you for
your attention and
have a nice weekend!
Thanks to Jiawei Han from Simon Fraser University for his slides
which greatly helped in preparing this lecture! Also thanks to Fosca
Giannotti and Dino Pedreschi from Pisa for their slides.