Data Mining:
Concepts and Techniques
(3rd ed.)
— Chapter 6 —
Jiawei Han, Micheline Kamber, and Jian Pei
University of Illinois at Urbana-Champaign &
Simon Fraser University
©2013 Han, Kamber & Pei. All rights reserved.
Data preprocessing is now complete:
• Data cleaning
• Data integration
• Data reduction
• Data transformation

Next step: mining
April 30, 2017
Data Mining: Concepts and Techniques
Chapter 6: Mining Frequent Patterns, Association and Correlations: Basic Concepts and Methods
• Basic Concepts
• Frequent Itemset Mining Methods
• Which Patterns Are Interesting? Pattern Evaluation Methods
• Summary
What Is Frequent Pattern Analysis?

• Frequent pattern: a pattern (a set of items, subsequences, substructures, etc.) that occurs frequently in a data set
• Frequent pattern mining searches for recurring relationships in a given data set
• First proposed by Agrawal, Imielinski, and Swami [AIS93] in the context of frequent itemsets and association rule mining
• Motivation: finding inherent regularities in data
  - What products are often purchased together?
  - What are the subsequent purchases after buying a PC?
  - What kinds of DNA are sensitive to this new drug?
  - Can we automatically classify web documents?
• Applications: basket data analysis, cross-marketing, catalog design, sale campaign analysis, Web log (click stream) analysis, and DNA sequence analysis
Market Basket Analysis: A Motivating Example
To understand the basic concepts of mining frequent patterns and associations, let's look at the earliest form of frequent pattern mining for association rules.
- A typical example of frequent itemset mining is market basket analysis. This process analyzes customer buying habits by finding associations between the different items that customers place in their "shopping baskets".
- Market basket analysis may be performed on the retail data of customer transactions at your store. You can then use the results to plan marketing or advertising strategies, or to design a new catalog.
What Is an Association Rule?
• Association rule mining: finding frequent patterns, associations, correlations, or causal structures among sets of items in transaction databases
• Goal: understand customer buying habits by finding associations and correlations between the different items that customers place in their "shopping basket"
• The ideas come from market basket analysis (MBA)
• Applications: basket data analysis, cross-marketing, catalog design, loss-leader analysis, web log analysis, fraud detection

What Is an Association Rule?
• Rule form: Antecedent → Consequent [support, confidence]
  - Support and confidence are user-defined measures of interestingness
• Examples
  - buys(x, "computer") → buys(x, "financial management software") [0.5%, 60%]
  - age(x, "30..39") ∧ income(x, "42..48K") → buys(x, "car") [1%, 75%]

Association Rule: Basic Concepts
• Given:
  (1) a database of transactions
  (2) each transaction is a list of items (purchased by a customer in a visit)
• Find: all rules that correlate the presence of one set of items with that of another set of items

Basic Rule Measures: Support and Confidence

• Support of the rule A ⇒ B: the frequency of the rule within all transactions in the database, i.e., the probability that a transaction contains both A and B.
  - A high value means that the rule covers a large part of the database.
  - support(A ⇒ B) = count(A ∪ B) / N = P(A ∪ B), where N is the total number of transactions
• Confidence of the rule A ⇒ B: the percentage of transactions containing A that also contain B, i.e., the probability that a transaction containing A also contains B.
  - It is an estimate of the conditional probability P(B | A).
  - confidence(A ⇒ B) = P(B | A) = P(A ∪ B) / P(A) = support(A ∪ B) / support(A)
• Here A ∪ B is the union of the two itemsets, so P(A ∪ B) is the probability that a transaction contains every item of A and every item of B.
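
The two measures above can be sketched in a few lines of Python. This is a minimal illustration, not part of the original slides: the transaction list and the function names `support` and `confidence` are assumptions made for the example.

```python
# Hypothetical toy transactions (not from the slides): one set of items per visit.
transactions = [
    {"computer", "antivirus"},
    {"computer", "antivirus", "printer"},
    {"computer"},
    {"printer"},
    {"antivirus", "printer"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Estimate P(consequent | antecedent) as support(A ∪ B) / support(A).

    Raises ZeroDivisionError if the antecedent never occurs.
    """
    a = set(antecedent)
    both = a | set(consequent)
    return support(both, transactions) / support(a, transactions)

s = support({"computer", "antivirus"}, transactions)
c = confidence({"computer"}, {"antivirus"}, transactions)
print(f"computer -> antivirus: support = {s:.2f}, confidence = {c:.2f}")
```

In this toy data, 2 of 5 transactions contain both items and 3 of 5 contain a computer, so the rule has support 2/5 and confidence 2/3.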
Why Use Support and Confidence?

• Support
  - Support is an important measure because a rule that has very low support may occur simply by chance.
  - A low-support rule is also likely to be uninteresting from a business perspective, because it may not be profitable to promote items that customers seldom buy together.
  - For these reasons, support is often used to eliminate uninteresting rules.
• Confidence
  - Confidence measures the reliability of the inference made by a rule.
  - For a given rule X → Y, the higher the confidence, the more likely it is for Y to be present in transactions that contain X. Confidence also provides an estimate of the conditional probability of Y given X.
• Example
  - Rule 1: Computer → Antivirus-software [support = 2%, confidence = 60%]
  - A support of 2% for Rule 1 means that in 2% of all the transactions under analysis, a computer and antivirus software are purchased together.
  - A confidence of 60% means that 60% of the customers who purchased a computer also bought the software.
Formulation of the Association Rule Problem
The association rule mining problem can be formally stated as follows:
• Association rule discovery: given a set of transactions T, find all the rules having support ≥ minsup and confidence ≥ minconf, where minsup and minconf are the corresponding support and confidence thresholds.
• If an itemset is infrequent (its support is below minsup), then all rules derived from it can be pruned immediately, without computing their confidence values.
• Rules that satisfy both a minimum support threshold (min_sup) and a minimum confidence threshold (min_conf) are called strong.
• Association rules are considered interesting if they satisfy both a minimum support threshold and a minimum confidence threshold.
Formulation of the Association Rule Problem

The total number of possible rules R that can be extracted from a data set containing d items is:
R = 3^d - 2^{d+1} + 1
Example:
Q: Given the frequent itemset {A, B, E}, how many association rules are possible?
A: With d = 3, R = 3^3 - 2^{3+1} + 1 = 27 - 16 + 1 = 12 rules.
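
The count can be checked by brute force. The sketch below is an illustration added for this transcript, with hypothetical helper names: it lists every rule X → Y in which X and Y are non-empty, disjoint subsets of the items, then compares the total with the formula.

```python
from itertools import combinations

def nonempty_subsets(items):
    """Yield every non-empty subset of `items` as a frozenset."""
    items = sorted(items)
    for k in range(1, len(items) + 1):
        for combo in combinations(items, k):
            yield frozenset(combo)

def enumerate_rules(items):
    """All rules X -> Y with X and Y non-empty, disjoint subsets of `items`."""
    items = frozenset(items)
    rules = []
    for x in nonempty_subsets(items):
        # The consequent is drawn only from items not already in the antecedent.
        for y in nonempty_subsets(items - x):
            rules.append((x, y))
    return rules

d = 3
rules = enumerate_rules({"A", "B", "E"})
formula = 3**d - 2**(d + 1) + 1

for x, y in rules:
    print(sorted(x), "->", sorted(y))
print(len(rules), "rules enumerated; formula gives", formula)
```

For {A, B, E} there are 9 rules with a single-item antecedent and 3 with a two-item antecedent, 12 in total.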
Tid   Items bought
10    Beer, Nuts, Diaper
20    Beer, Coffee, Diaper
30    Beer, Diaper, Eggs
40    Nuts, Eggs, Milk
50    Nuts, Coffee, Diaper, Eggs, Milk

Example rules:
• Beer → Diaper
• Nuts, Eggs → Milk
• ...
Formulation of the Association Rule Problem
A common strategy adopted by many association rule mining algorithms is to decompose the problem into two major subtasks:
1. Frequent itemset generation, whose objective is to find all the itemsets that satisfy the minsup threshold. These itemsets are called frequent itemsets.
2. Rule generation, whose objective is to extract all the high-confidence rules from the frequent itemsets found in the previous step. These rules are called strong rules.
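
The two subtasks can be sketched as two separate functions. This is a deliberately naive brute-force illustration, not an efficient mining algorithm such as Apriori. The function names and the thresholds of 50% are assumptions chosen for the example, and the transactions are the five Beer/Nuts/Diaper baskets from the earlier table.

```python
from itertools import combinations

# The five transactions from the Beer/Nuts/Diaper table.
transactions = [
    {"Beer", "Nuts", "Diaper"},
    {"Beer", "Coffee", "Diaper"},
    {"Beer", "Diaper", "Eggs"},
    {"Nuts", "Eggs", "Milk"},
    {"Nuts", "Coffee", "Diaper", "Eggs", "Milk"},
]

def frequent_itemsets(transactions, minsup):
    """Step 1: test every candidate itemset; keep those with support >= minsup."""
    n = len(transactions)
    items = sorted(set().union(*transactions))
    frequent = {}  # itemset -> support count
    for k in range(1, len(items) + 1):
        for combo in combinations(items, k):
            count = sum(1 for t in transactions if set(combo) <= t)
            if count / n >= minsup:
                frequent[frozenset(combo)] = count
    return frequent

def strong_rules(frequent, n, minconf):
    """Step 2: split each frequent itemset into X -> Y; keep confident rules."""
    rules = []
    for itemset, count in frequent.items():
        for k in range(1, len(itemset)):
            for x in combinations(sorted(itemset), k):
                x = frozenset(x)
                # Every subset of a frequent itemset is frequent, so x is present.
                conf = count / frequent[x]
                if conf >= minconf:
                    rules.append((x, itemset - x, count / n, conf))
    return rules

frequent = frequent_itemsets(transactions, minsup=0.5)
rules = strong_rules(frequent, len(transactions), minconf=0.5)
for x, y, sup, conf in rules:
    print(sorted(x), "->", sorted(y), f"[{sup:.0%}, {conf:.0%}]")
```

Splitting the work this way means confidence is computed only for itemsets that already passed the support test, which is the pruning idea behind the decomposition.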
Basic Concepts: Frequent Patterns

Tid   Items bought
10    Beer, Nuts, Diaper
20    Beer, Coffee, Diaper
30    Beer, Diaper, Eggs
40    Nuts, Eggs, Milk
50    Nuts, Coffee, Diaper, Eggs, Milk

[Figure: Venn diagram of customers who buy beer, customers who buy diapers, and customers who buy both]

• Itemset: a set of one or more items
• k-itemset: X = {x1, …, xk}
• (Absolute) support, or support count, of X: the frequency or number of occurrences of itemset X
• (Relative) support, s: the fraction of transactions that contain X (i.e., the probability that a transaction contains X)
• An itemset X is frequent if X's support is no less than a minsup threshold
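
To make the difference between absolute and relative support concrete, the sketch below tallies support counts for the 1-itemsets and 2-itemsets of the table above by walking through each transaction once. The function name, the `max_k` limit, and the 50% threshold are assumptions made for this illustration.

```python
from collections import Counter
from itertools import combinations

# Transactions from the table above (Tid 10 to 50).
transactions = [
    {"Beer", "Nuts", "Diaper"},
    {"Beer", "Coffee", "Diaper"},
    {"Beer", "Diaper", "Eggs"},
    {"Nuts", "Eggs", "Milk"},
    {"Nuts", "Coffee", "Diaper", "Eggs", "Milk"},
]

def support_counts(transactions, max_k=2):
    """Absolute support: number of transactions containing each itemset of size <= max_k."""
    counts = Counter()
    for basket in transactions:
        for k in range(1, max_k + 1):
            # Each k-subset of a basket is an itemset that this basket contains.
            for combo in combinations(sorted(basket), k):
                counts[frozenset(combo)] += 1
    return counts

counts = support_counts(transactions)
n = len(transactions)
minsup = 0.5  # assumed relative threshold

# Relative support is the absolute count divided by the number of transactions.
frequent = {x: c for x, c in counts.items() if c / n >= minsup}
for x, c in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(x), "count =", c, "relative =", c / n)
```

With five transactions, a 50% threshold corresponds to a support count of at least 3, so Diaper (count 4) and {Beer, Diaper} (count 3) qualify while Milk (count 2) does not.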
Basic Concepts: Association Rules

Tid   Items bought
10    Beer, Nuts, Ice
20    Beer, Coffee, Ice
30    Beer, Ice, Eggs
40    Nuts, Eggs, Milk
50    Nuts, Coffee, Ice, Eggs, Milk

[Figure: Venn diagram of customers who buy beer, customers who buy Ice, and customers who buy both]

• Find all the rules X → Y with minimum support and confidence
  - support, s: probability that a transaction contains X ∪ Y
  - confidence, c: conditional probability that a transaction having X also contains Y
• Let minsup = 50%, minconf = 50%
  - Frequent patterns: Beer:3, Nuts:3, Ice:4, Eggs:3, {Beer, Ice}:3
• Association rules (many more!):
  - Beer → Ice (60%, 100%)
  - Ice → Beer (60%, 75%)
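
The percentages for the two rules follow directly from the support counts listed above (Beer: 3, Ice: 4, {Beer, Ice}: 3, out of 5 transactions). As a worked check, writing s for support and c for confidence:

```latex
\begin{align*}
s(\text{Beer} \rightarrow \text{Ice}) = s(\text{Ice} \rightarrow \text{Beer})
  &= \frac{\text{count}(\{\text{Beer}, \text{Ice}\})}{N} = \frac{3}{5} = 60\% \\
c(\text{Beer} \rightarrow \text{Ice})
  &= \frac{\text{count}(\{\text{Beer}, \text{Ice}\})}{\text{count}(\{\text{Beer}\})} = \frac{3}{3} = 100\% \\
c(\text{Ice} \rightarrow \text{Beer})
  &= \frac{\text{count}(\{\text{Beer}, \text{Ice}\})}{\text{count}(\{\text{Ice}\})} = \frac{3}{4} = 75\%
\end{align*}
```

Both rules share the same support because they are built from the same itemset; only the confidence depends on which side is the antecedent.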
Exercise 1. Creating basic association rules manually.
The "database" below has four transactions. What association rules can be found in this set if the minimum support (i.e., coverage) is 60% and the minimum confidence (i.e., accuracy) is 80%?

Trans_id   Itemlist
T1         K, A, D, B
T2         D, A, C, E, B
T3         C, A, B, E
T4         B, A, D