Download Data Mining Using SQL Queries

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts
no text concepts found
Transcript
Association Rule Mining
Data Mining and Knowledge Discovery
Prof. Carolina Ruiz and Weiyang Lin
Department of Computer Science
Worcester Polytechnic Institute
Sample Applications


Sample Commercial Applications

Market basket analysis

cross-marketing

attached mailing

store layout, catalog design

customer segmentation based on buying patterns

…
Sample Scientific Applications

Genetic analysis

Analysis of medical data

…
Transactions and Assoc. Rules

Association Rule:
Transaction
Id
1
2
3
4
Purchased
Items
{a, b, c}
{a, d}
{a, c}
{b, e, f}
a→c
 support: 50% = P(a & c)
percentage of transactions that contain both a and c

confidence: 66% = P(c | a)
percentage of transactions that contain c among
those transactions that contain a
Association Rules - Intuition


Given a set of transactions where each transaction
is a set of items
Find all rules X → Y that relate the presence of
one set of items X with another set of items Y



Example: 98% of people who purchase diapers and
baby food also buy beer
A rule may have any number of items in the antecedent
and in the consequent
Possible to specify constraints on rules
Mining Association Rules
Problem Statement
Given:

a set of transactions (each transaction is a set of items)

user-specified minimum support

user-specified minimum confidence
Find:

all association rules that have support and confidence
greater than or equal to the user-specified minimum
support and minimum confidence
Naïve Procedure to mine rules


List all the subsets of the
set of items
For each subset




Split the subset into two
parts (one for the
antecedent and one for the
consequent of the rule
Compute the support of the
rule
Compute the confidence of
the rule
IF support and confidence
are no lower than userspecified min. support and
confident THEN output the
rule
Complexity: Let n be the number
of items. The number of rules
naively considered is:
n
i-1
n
i
S[(i ) * S(k )]
i=2
k=1
n
n
i
= S[(i ) * (2 -2) ]
i=2
n
=3 –2
(n+1)
+1
The Apriori Algorithm
1. Find all frequent itemsets: sets of items whose
support is greater than or equal to the userspecified minimum support.
2. Generate the desired rules: if {a, b, c, d} and {a,
b} are frequent itemsets, then compute the ratio
conf (a & b → c & d) = P(c & d | a & b)
= P( a & b & c & d)/P(a & b)
= support({a, b, c, d})/support({a, b}).
If conf >= mincoff, then add rule a & b → c & d
The Apriori Algorithm — Example
slide taken from J. Han & M. Kamber’s Data Mining book
Min. supp = 50%, I.e. min support count = 2
Database D
TID
100
200
300
400
itemset sup.
C1
{1}
2
{2}
3
Scan D
{3}
3
{4}
1
{5}
3
Items
134
235
1235
25
C2 itemset sup
L2 itemset sup
2
2
3
2
{1
{1
{1
{2
{2
{3
C3 itemset
{2 3 5}
Scan D
{1 3}
{2 3}
{2 5}
{3 5}
2}
3}
5}
3}
5}
5}
1
2
1
2
3
2
L1 itemset sup.
{1}
{2}
{3}
{5}
2
3
3
3
C2 itemset
{1 2}
Scan D
L3 itemset sup
{2 3 5} 2
{1
{1
{2
{2
{3
3}
5}
3}
5}
5}
Apriori Principle

Key observation:
Every subset of a frequent itemset is also a
frequent itemset
Or equivalently,
 The support of an itemset is greater than
or equal to the support of any superset of
the itemset

Apriori - Compute Frequent Itemsets
Making multiple passes over the data
for pass k
{candidate generation: Ck := Lk-1 joined with Lk-1 ;
support counting in Ck;
Lk := All candidates in Ck with minimum support;
}
terminate when Lk==  or Ck+1== 
Frequent-Itemsets =  k Lk
Lk - Set of frequent itemsets of size k. (those with minsup)
Ck - Set of candidate itemsets of size k.
(potentially frequent itemsets)
Apriori – Generating rules
For each frequent itemset:
- Generate the desired rules: if {a, b, c, d} and {a,
b} are frequent itemsets, then compute the ratio
conf (a & b → c & d)
= support({a, b, c, d})/support({a, b}).
If conf >= mincoff, then add rule a & b → c & d