Unsupervised Learning - Association Analysis
Section 1 Introduction
Section 2 Market Basket Analysis
Section 3 Using Association Node
Section 4 Understanding Association Rules
Section 5 Association Analysis for Non-Binary Variables
Section 6 Disassociation Analysis
Section 7 Sequential Association Analysis
Section 8 Case Study
Appendix 1 Apriori Algorithm
Appendix 2 SAS Code Used in Example 3
Appendix 3 Data Set "Income.txt"
Appendix 4 SAS Code for Disassociation Analysis
Appendix 5 Sample SAS Code for Case Study
Appendix 6 References
Section 1 Introduction
One significant advance in data mining at the end of the 20th century was the emergence of
"association rules analysis" as a popular tool for mining very large commercial databases
(say, more than 10^4 variables and more than 10^8 observations). Association mining attempts
to construct simple "rules" (descriptive statistics) that describe regions of relatively high
density in a very large commercial database. When all variables in the database are binary,
association rules analysis is also referred to as "market basket analysis". For example,
consider the sales database of an on-line bookstore, where the objects represent customers
and the attributes represent authors and/or books. The rules to be discovered are the sets of
books most frequently bought together by the customers. An example could be: "15% of the
people who buy Dorian Pyle's Data Preparation for Data Mining also buy Data Mining
Techniques by Berry and Linoff." Retail stores can use the knowledge discovered from the
analysis for shelf placement, cross-marketing, catalog design, consumer segmentation, and so
on. Although association analysis has been applied directly to the retail industry, it can be
applied to other industries as well. For example, it has been used to predict faults in
telecommunication networks.
In this session, we discuss the theoretical foundation of market basket analysis in
Section 2. In Section 3, we use a small commercial banking data set to illustrate how to use
the Association node in Enterprise Miner to obtain association rules. Since association
analysis typically produces a very large number of rules, understanding these rules poses a
challenging data analysis task; we address this issue in Section 4. In Section 5, we extend
the use of association analysis to a data set with some non-binary variables. We address
Disassociation Analysis and Sequential Association Analysis in Sections 6 and 7,
respectively. We conclude this session with a case study showing how the miner can identify
interesting rules found in an association exercise.
Section 2 Market Basket Analysis
Suppose X = (X1, X2, ..., Xp) is a set of variables in a given database. The goal in
association analysis is to find a collection of X-values {v_l | l = 1, 2, ..., L} such that
each member of {v_l | l = 1, 2, ..., L} is relatively frequent in the database. This goal is
not attainable by brute force. Suppose, for instance, that there are 10^8 cases and 10^4
features and each feature has 10 values; there are then more than 10^13 different value
combinations, and it is impossibly difficult ("NP-hard", to be precise) to search through all
possible values. Even if we could search the entire database for all possible rules, the
value of Pr(v_l) is typically very small, so its estimate (the fraction of all observations
for which X = v_l) is unreliable. Thus, we need to modify our goal.
One possible modification leads to "market basket analysis". Market basket analysis was
first introduced by Agrawal et al. (1993). It is a very simple but useful unsupervised data
mining tool for finding rule-based patterns. In market basket analysis, we convert each
variable in the database to binary-valued dummy variables. The number of dummy variables is

K = ∑_{j=1}^{p} S_j ,

where S_j is the number of distinct values attainable by X_j. Each dummy variable Z_k is
assigned the value Z_k = 1 if the variable with which it is associated takes on the
corresponding value to which Z_k is assigned, and Z_k = 0 otherwise. The goal is then to
find a subset of the integers κ ⊂ {1, 2, ..., K} such that

Pr[ ∩_{k∈κ} (Z_k = 1) ] = Pr[ ∏_{k∈κ} Z_k = 1 ]          (1)
is large. The subset κ is called an item set, and the number of variables Z_k in this item
set is called the size of the item set (the size should be smaller than the number of
variables in the database). The estimated value for (1) is taken to be the fraction of
observations in the database for which the item set occurs:

P̂r[ ∏_{k∈κ} (Z_k = 1) ] = (1/N) ∑_{i=1}^{N} ∏_{k∈κ} z_{ik} ,          (2)
where z_{ik} is the value of Z_k for the i-th case. The estimated probability (2) of the item
set is called the support S(κ) of the item set. In market basket analysis, one typically
specifies a lower bound s for this support, and the analysis seeks all item sets whose support
in the database exceeds this lower bound, i.e., {κ_l | S(κ_l) > s}. Discovering all item sets
with support greater than the pre-specified lower bound in a very large database is quite a
challenging task. The search space is exponential in the number of database attributes, and
with millions of database objects the problem of I/O minimization becomes paramount. Many
algorithms, such as the Apriori algorithm (Agrawal et al., 1995a, 1995b), have been developed
for finding the item sets in very large transaction databases. Since this book focuses on how
to apply association analysis to solve real business problems, readers who are interested in
the development of association analysis algorithms are referred to Dunham (2003).
The Apriori algorithm (Appendix 1) represents one of the major advances in data mining
technology. It decomposes the problem of mining association rules into two parts. The first
part is identifying all frequent item sets, i.e., all item sets with frequency greater than a
given threshold. Each frequent item set is then partitioned into two disjoint subsets. For
example, if (X1, X2, X3, X4) is an item set, the algorithm may partition it into the two
disjoint subsets (X1, X2) and (X3, X4) and write the association rule as
(X1, X2) ⇒ (X3, X4). The first item set, (X1, X2), is called the "antecedent" and the second
item set, (X3, X4), is called the "consequent". The association rule is then characterized by
several properties based on the prevalence (support) of the antecedent and consequent item
sets in the database.
Association rules can be written in the form θ ⇒ ϕ (left-hand side implies right-hand
side), where θ (the left-hand side) is an item set and ϕ (the right-hand side) is another
item set.

Support (Prevalence): The support of the rule θ ⇒ ϕ, denoted S(θ ⇒ ϕ), is the fraction of
observations containing both the antecedent and the consequent, i.e.,

S(θ ⇒ ϕ) = fr(θ ∩ ϕ)/N = Pr(θ ∩ ϕ).

It can be viewed as the probability of simultaneously observing both item sets in a random
basket.

Confidence (Predictability): The confidence of the rule θ ⇒ ϕ, denoted C(θ ⇒ ϕ), is the
support of the rule divided by the support of its antecedent, i.e.,

C(θ ⇒ ϕ) = fr(θ ∩ ϕ)/fr(θ) = Pr(θ ∩ ϕ)/Pr(θ) = Pr(ϕ | θ).

It can be viewed as the conditional probability, Pr(ϕ | θ), that the right-hand side is
present in the basket given that the left-hand side is in the basket.

Lift: The lift of the rule θ ⇒ ϕ, denoted L(θ ⇒ ϕ), is the confidence of the rule divided
by the support of the consequent, i.e.,

L(θ ⇒ ϕ) = C(θ ⇒ ϕ)/S(ϕ) = Pr(ϕ | θ)/Pr(ϕ).

It can be viewed as the ratio of the likelihood of finding the right-hand side in a basket
that contains the left-hand side to the likelihood of finding the right-hand side in a
random basket. If the lift of the rule θ ⇒ ϕ is two, then a customer having θ is twice as
likely to have ϕ as a randomly chosen customer.
To illustrate how these definitions work, we use the following example.
Example 1 Suppose a supermarket sells five products, and the items bought in each of ten
transactions are as shown in Table 1.
Table 1
Transaction    Items Bought
001            Bread
002            Bread, Jelly, Juice, Milk
003            Bread, Juice, Beer
004            Juice
005            Jelly, Juice, Milk
006            Bread, Jelly, Juice
007            Bread, Juice, Milk
008            Jelly, Juice, Beer
009            Bread, Milk
010            Jelly, Juice, Beer
(a) Create five dummy variables to represent these transactions in Table 1.
(b) List all item sets with size less than or equal to 3.
(c) Calculate the support for each item set found in (b).
(d) Find the support, confidence, and lift for rule “Bread => Juice”.
(e) Find the support, confidence, and lift for rule “Jelly => Juice”.
<Solutions>:
(a)
Transaction    X1 (Bread)    X2 (Jelly)    X3 (Juice)    X4 (Milk)    X5 (Beer)
001            1             0             0             0            0
002            1             1             1             1            0
003            1             0             1             0            1
004            0             0             1             0            0
005            0             1             1             1            0
006            1             1             1             0            0
007            1             0             1             1            0
008            0             1             1             0            1
009            1             0             0             1            0
010            0             1             1             0            1
A "1" in the table indicates that the item was purchased in the transaction and a "0"
indicates that it was not, where X1, X2, X3, X4, and X5 represent Bread, Jelly, Juice,
Milk, and Beer, respectively.
(b) and (c) The supports for item sets of size one, two, and three are as follows:

Item Set        Support        Item Set                 Support
Bread           0.6            Juice, Beer              0.3
Jelly           0.5            Milk, Beer               0
Juice           0.8            Bread, Jelly, Juice      0.2
Milk            0.4            Bread, Jelly, Milk       0.1
Beer            0.3            Bread, Jelly, Beer       0
Bread, Jelly    0.2            Bread, Juice, Milk       0.2
Bread, Juice    0.4            Bread, Juice, Beer       0.1
Bread, Milk     0.3            Bread, Milk, Beer        0
Bread, Beer     0.1            Jelly, Juice, Milk       0.2
Jelly, Juice    0.5            Jelly, Juice, Beer       0.2
Jelly, Milk     0.2            Jelly, Milk, Beer        0
Jelly, Beer     0.1            Juice, Milk, Beer        0
Juice, Milk     0.3
(d) The support, confidence, and lift for the rule "Bread => Juice" are 0.4, 0.67, and 0.83,
    respectively. Thus, customers who have bought bread are slightly less likely (lift = 0.83)
    to buy juice than a randomly chosen customer.
(e) The support, confidence, and lift for the rule "Jelly => Juice" are 0.5, 1.0, and 1.25,
    respectively. Thus, customers who have bought jelly are 1.25 times as likely to buy juice
    as a randomly chosen customer.
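For readers who want to verify these numbers outside Enterprise Miner, here is a minimal SAS
sketch (the data set and variable names are illustrative) that keys in the ten transactions of
Table 1 as dummy variables and reproduces the support, confidence, and lift of the rule
"Bread => Juice":

/* Ten transactions from Table 1: one row per basket, one 0/1 column per item */
data baskets;
   input trans bread jelly juice milk beer;
   datalines;
1  1 0 0 0 0
2  1 1 1 1 0
3  1 0 1 0 1
4  0 0 1 0 0
5  0 1 1 1 0
6  1 1 1 0 0
7  1 0 1 1 0
8  0 1 1 0 1
9  1 0 0 1 0
10 0 1 1 0 1
;
run;

proc sql;
   select mean(bread*juice)                           as support     /* 0.4  */,
          mean(bread*juice)/mean(bread)               as confidence  /* 0.67 */,
          (mean(bread*juice)/mean(bread))/mean(juice) as lift        /* 0.83 */
   from baskets;
quit;

Replacing bread with jelly in the query gives the values for part (e).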
Section 3 Using Association Node
There are many commercial products such as Intelligent Miner from IBM Corporation
and Clementine from SPSS Inc that can be used to perform association analysis. Here,
we only use the Association node in Enterprise Miner to illustrate how to perform
association analysis in Example 2. The data used in Example 2 is from a financial
services company that offers thirteen banking products/services. The descriptions of these
products/services are given in Table 2.
Table 2 Variable Descriptions for Data Used in Example 2
Variable    Description
ATM         Automated Teller Machine Debit Card
AUTO        Automobile Installment Loan
CCRD        Credit Card
CD          Certificate of Deposit
CKCRD       Check/Debit Card
CKING       Checking Account
HMEQLC      Home Equity Line of Credit
IRA         Individual Retirement Account
MMDA        Money Market Deposit Account
MTG         Mortgage
PLOAN       Personal/Consumer Installment Loan
SVG         Savings Account
TRUST       Personal Trust Account
Example 2 This example shows how to use the Association node in Enterprise Miner. The data
set, BANK.SAS7BDAT, contains a customer account number (ACCOUNT), which serves as an ID
variable; the banking product/service acquired (SERVICE), coded as shown in Table 2; and the
order in which the services were acquired (VISIT).
Two nodes, Data Source node and Association node, are needed to perform association
analysis in Enterprise Miner.
Data Source:
After selecting the SAS data set, BANK.SAS7BDAT, from a SAS library, we need to
make the following changes:
Complete Diagram:
Data Source Wizard – Step 5 of 6 Column Metadata:
• Set the model role of variable ACCOUNT to “ID”
• Set the model role of variable SERVICE to “Target”
• Set the model role of variable VISIT to “Sequence”
Data Source Wizard – Step 6 of 6 Data Source Attributes:
• Set the data role to “Transaction”
Association Node:
Set the Use to “No” for the variable VISIT since we want to perform Market Basket
Analysis.
Note: Both “Association” (Market Basket Analysis) and “Sequences” analysis are
available in the “Association” node.
• Association: The node will perform the "Association" analysis discussed in this section.
• Sequences: The node will perform "Sequential Association Analysis" if the data set contains
  a variable with the model role "Sequence". Sequence analysis takes into account the order in
  which the items are purchased in calculating the associations.
The following screen shot shows the Property Panel of the Association node.
Options in the Property Panel include:
• Minimum Confidence Level, which specifies the minimum confidence level required to generate
  a rule. The default level is 10%.
• Support Type, which specifies whether the analysis should use the Support Count or Support
  Percentage property. The default setting is Percent.
• Support Count, which specifies a minimum level of support to claim that items are associated
  (that is, they occur together in the database). The default count is 2.
• Support Percentage, which specifies a minimum level of support to claim that items are
  associated (that is, they occur together in the database). The default frequency is 5%.
• Maximum Items, which determines the maximum size of the item set to be considered. For
  example, the default of four items indicates that a maximum of 4 items will be included in a
  single association rule.
Note: If you are interested in associations involving fairly rare products, you should
consider reducing the support count or percentage when you run the Association node. If
you obtain too many rules to be practically useful, you should consider raising the
minimum support count or percentage as one possible alternative.
Some rules (sorted by lift) found with the default settings are shown in the next screen
shot. (View => Rules => Rules Table)
The first rule found in the analysis is "CKCRD => CKING & CCRD" (Check/Debit Card => Checking
Account and Credit Card). The lift is 3.33. This means that a customer who has a check/debit
card is 3.33 times as likely to have both a checking account and a credit card with the bank
as a customer chosen at random. The support is 5.58%. This means that 5.58% of the customers
have all three products: a check/debit card, a checking account, and a credit card. The
confidence is 49.39%. This means that about half of all customers with a check/debit card also
have both a checking account and a credit card with the bank.
Suppose that we are interested in studying the behavior of customers with a check/debit card.
"CKCRD => CKING" shows that all customers who have a check/debit card also have a checking
account (confidence = 100%). However, only about half of them also have a credit card
(CKCRD => CCRD). Since a credit card is more profitable than a checking account, and these
customers are 3.19 times as likely to have a credit card as a randomly chosen customer (the
lift of CKCRD => CCRD is 3.19), we can design a special program to target this group of
customers.
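If you want to list every rule involving the check/debit card for this kind of targeting
exercise, one option is to query the rules table from a SAS Code node placed after the
Association node. The sketch below assumes the same rules export (&em_import_rules, with
columns rule, conf, support, and lift) that the Appendix 2 code uses:

/* List every rule whose description mentions CKCRD, highest lift first */
proc sql;
   select rule, conf, support, lift
   from &em_import_rules
   where index(rule, 'CKCRD') > 0
   order by lift desc;
quit;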
Section 4 Understanding Association Rules
The focus of the association analysis exercise is to identify rules that are unexpected and
potentially useful among the thousands (or millions) of association rules found in a typical
association analysis. Several cautions apply. First, a rule that has very low support will not
be discovered. Although thousands of rules are discovered in a typical association exercise,
many more rules remain undetected because they have either low support or low confidence, and
it is very likely that some of these undetected rules are useful. For example, a
high-confidence rule such as "caviar implies vodka" will not be uncovered because of the low
sales volume of caviar (i.e., low support). Second, a rule that has very high confidence is
not necessarily interesting. For example, the rule "pregnancy implies female" has confidence
close to 1 (subject to data quality), but this rule is not interesting at all because it is
not unexpected. Third, none of these three measures (support, confidence, and lift) alone is a
good indicator of the usefulness of an association rule. For example, rules such as "peanut
butter implies milk", "egg implies milk", and "bread implies milk" have high confidence and
support because it is hard to find a shopping basket that contains no milk; however, these
high-confidence rules are of very little value because they are already known to the data
owner. Fourth, a rule with very high lift is very likely to be interesting if it also has high
support.
Although questions such as "Which rules are potentially useful to the data owner?" and "Which
rules are already well known to the data owner?" cannot be answered by the computer today,
purely statistical criteria of interestingness based on measures of association can be very
useful in ranking these association rules. We address these criteria next.
Table 3
            ϕ = Yes           ϕ = No             Total
θ = Yes     fr(θ ∩ ϕ)         fr(θ ∩ ~ϕ)         fr(θ)
θ = No      fr(~θ ∩ ϕ)        fr(~θ ∩ ~ϕ)        fr(~θ)
Total       fr(ϕ)             fr(~ϕ)             N
Any association rule θ ⇒ ϕ can be arranged into a 2x2 contingency table (as shown in Table 3),
where the tilde symbol before a set indicates its complement. Many measures of association in
a 2x2 contingency table, such as the chi-square statistic and the j-measure (Hand, 2000), can
be used to measure the strength of the association rules found. The j-measure is defined as

J(θ ⇒ ϕ) = P(θ) [ P(ϕ | θ) log( P(ϕ | θ) / P(ϕ) ) + (1 − P(ϕ | θ)) log( (1 − P(ϕ | θ)) / (1 − P(ϕ)) ) ],
where P(ϕ | θ) is the confidence of the rule θ ⇒ ϕ, and P(θ) and P(ϕ) are the supports of the
item sets θ and ϕ, respectively. Although the j-measure is not available in software packages
such as Enterprise Miner, it is not difficult to compute as long as the package reports the
support, confidence, and lift of each rule found. Besides the chi-square statistic and the
j-measure, we can also use SQL to query the output database generated by the software package
to identify rules that satisfy certain conditions. We use Example 3 to demonstrate how to
compute j-measures and perform SQL queries using the SAS Code node in Enterprise Miner after
performing association analysis.
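For a single rule this is easy to do in a short data step. The sketch below (a stand-alone
illustration; the full SAS Code node version used in Example 3 is in Appendix 2) plugs in the
support, confidence, and lift reported earlier for the rule "CKCRD => CKING & CCRD" and
recovers P(θ), P(ϕ), and the j-measure from them:

/* j-measure for one rule from its reported support, confidence, and lift.   */
/* conf = P(phi|theta) and support = P(theta and phi), so                    */
/* P(theta) = support/conf and P(phi) = conf/lift.                           */
data jm_one_rule;
   support = 0.05581;      /* S(CKCRD => CKING & CCRD) */
   conf    = 0.4939;       /* C(CKCRD => CKING & CCRD) */
   lift    = 3.33;         /* L(CKCRD => CKING & CCRD) */
   p_theta = support / conf;
   p_phi   = conf / lift;
   jm = p_theta * ( conf*log(conf/p_phi) + (1 - conf)*log((1 - conf)/(1 - p_phi)) );
   put jm=;                /* prints roughly 0.037, matching Example 3(a) below */
run;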
Example 3
(a) Compute the j-measure for the rules found in Example 2 and display the five rules
with highest j-measure.
(b) Find all rules that satisfy the following three conditions:
1. "Checking Account (CKING)" is the consequent.
2. The confidence is greater than 0.8.
3. The support is greater than 0.2.
<Solutions>:
(a) SAS code can be found in Appendix 2 and the rules with the five highest j-measures are in
the following table.

Rule                        J-Measure   Confidence (%)   Support (%)   Lift   Expected Confidence (%)
CKCRD ==> CKING & CCRD      0.037306    49.39            5.581         3.33   14.85
CKCRD ==> CCRD              0.035425    49.39            5.581         3.19   15.48
CKING & CKCRD ==> CCRD      0.035425    49.39            5.581         3.19   15.48
CKING & CCRD ==> CKCRD      0.034485    37.57            5.581         3.33   11.30
CCRD ==> CKCRD              0.032364    36.05            5.581         3.19   11.30
(b) SAS code can be found in Appendix 2 and the rules found are in the following table.

Rule                     Confidence   Support   Lift
SVG & ATM ==> CKING      0.97         0.25      1.13
ATM ==> CKING            0.94         0.36      1.10
SVG ==> CKING            0.88         0.54      1.02
CD ==> CKING             0.86         0.21      1.00
Another popular measure of the strength of an association rule is the chi-square statistic.
We use the following example to illustrate how to use chi-square statistics.
Example 4 The data in Table 4 is the corresponding 2x2 contingency table for the rule
“Savings Account => Checking Account”.
Table 4
                           Checking Account
                           No        Yes       Total
Savings Account   No       500       3,500     4,000
                  Yes      1,000     5,000     6,000
Total                      1,500     8,500     10,000
(a) Find the support, confidence, and lift of the association rule “(saving) ⇒ (checking)”.
(b) Do “Savings Account” and “Checking Account” appear to be independent based on
the chi-square statistics?
<Solution>:
(a)
S(savings ⇒ checking) = fr(savings and checking)/N = 5000/10000 = 0.5.

C(savings ⇒ checking) = S(savings ⇒ checking)/S(savings) = 0.5/0.6 = 0.83.

L(savings ⇒ checking) = C(savings ⇒ checking)/S(checking) = 0.83/0.85 = 0.98. Thus, people who
own a savings account are slightly less likely to have a checking account than a randomly
chosen customer.
(b)
Table 5 Frequencies under the Independence Assumption
                           Checking Account
                           No        Yes       Total
Savings Account   No       600       3,400     4,000
                  Yes      900       5,100     6,000
Total                      1,500     8,500     10,000
The chi-square statistic for independence is

χ² = (500 − 600)²/600 + (3500 − 3400)²/3400 + (1000 − 900)²/900 + (5000 − 5100)²/5100
   = 16.67 + 2.94 + 11.11 + 1.96
   = 32.68.

Since 32.68 is much greater than the critical value χ²(1, 0.05) = 3.841, we can reject the
null hypothesis at α = 0.05. This means that "checking account" and "savings account" are
significantly associated. Since the p-value of this test, 1.09 x 10^-8, is very small, the
evidence of an association between the savings and checking accounts is very strong.
Note:
1. High confidence and support do not imply cause and effect, and the rule is not necessarily
   interesting; the two items might not even be related. In this example, however, the
   checking and savings accounts are strongly related.
2. Based on these measures, "Savings Account => Checking Account" might be considered a strong
   rule. However, customers without a savings account are even more likely to have a checking
   account (C(not savings ⇒ checking) = 0.875). This means the savings and checking accounts
   are in fact negatively associated.
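If you prefer not to compute the statistic by hand, a short SAS program along the following
lines (the data set and variable names are illustrative; the cell counts are those of Table 4)
reproduces the same chi-square value and its p-value:

/* Chi-square test of independence for the 2x2 table in Table 4 */
data accounts;
   input savings $ checking $ count;
   datalines;
No  No   500
No  Yes 3500
Yes No  1000
Yes Yes 5000
;
run;

proc freq data=accounts;
   weight count;                               /* cell counts as frequencies */
   tables savings*checking / chisq expected;   /* chi-square = 32.68         */
run;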
Section 5 Association Analysis for Non-Binary Variables
The Association node can also be used to perform association analysis if the data set contains
continuous and categorical variables. The data set income.txt, taken from Hastie et al.
(2002), is used to illustrate how to perform association analysis on non-binary variables
using the Association node. A detailed explanation of this data set can be found in
Appendix 3.
Example 5 Perform Association Analysis on “income” data set
Step 1: Variable List and Description are in Appendix 3
Step 2: Data Preparation for Association Analysis
Since the data set is not in a form that the Association node can use directly, we must
prepare it first. Although there are many ways to prepare the data, depending on the purpose
of the analysis, we use the following steps for illustration purposes:
• Remove observations with missing values;
• Cut each ordinal variable at its median and code it with two dummy variables;
• Use k dummy variables to represent each categorical variable with k categories.
The prepared data set is called "income_recode"; a minimal sketch of this kind of recoding is
shown below.
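As a rough illustration of the recoding (the libref, item labels, and the handling of missing
values here are assumptions based on the codes listed in Appendix 3), a data step of the
following kind turns each respondent into ID/item transaction rows of the form the Association
node expects:

/* Sketch: recode one ordinal and one nominal variable from the income data */
/* into ID/item transaction rows (income_recode-style layout).              */
data income_recode (keep=id item);
   set mylib.income;                         /* assumed libref and member name   */
   if income = . or gender = . then delete;  /* drop records with missing values */
   id = _n_;
   length item $40;
   /* ordinal variable cut near its median (code 6 = $30,000 to $39,999) */
   if income < 6 then item = 'Income < $30,000';
   else               item = 'Income >= $30,000';
   output;
   /* nominal variable: one item per category */
   if gender = 1 then item = 'Male';
   else               item = 'Female';
   output;
run;

The actual preparation handles all fourteen variables in the same way and removes observations
with missing values in any of them.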
There are 123,208 rows in the recoded transaction data set; observations in "income.txt" with
missing values have been removed. The distribution of the recoded data set is shown in the
following table.
Table 6 Distribution of the Recoded Income

Variable Category    Variable Content (Target Variable)    Frequency    Percent
Gender               Female                                 4,918        3.992%
Gender               Male                                   4,075        3.307%
Income               Income < $30,000                       4,722        3.833%
Income               Income >= $30,000                      4,271        3.466%
Marry Status         Divorced or separated                  875          0.710%
Marry Status         Living together, not married           668          0.542%
Marry Status         Married                                3,334        2.706%
Marry Status         Single, never married                  3,654        2.966%
Marry Status         Widowed                                302          0.245%
Age                  34 and Younger                         5,256        4.266%
Age                  35 and Over                            3,737        3.033%
Education            College Graduated                      2,490        2.021%
Education            Non-College Graduated                  6,417        5.208%
Occupation           Clerical/Service Worker                1,062        0.862%
Occupation           Factory Worker/Laborer/Driver          767          0.623%
Occupation           Homemaker                              650          0.528%
Occupation           Military                               272          0.221%
Occupation           Professional/Managerial                2,820        2.289%
Occupation           Retired                                690          0.560%
Occupation           Sales Worker                           770          0.625%
Occupation           Student, HS or College                 1,489        1.209%
Occupation           Unemployed                             337          0.274%
Years in Bay Area    Less Than Ten Years                    8,080        6.558%
Dual Income          Dual Income                            2,211        1.795%
Dual Income          Dual Income Status: Not Married        5,438        4.414%
Dual Income          Single Income                          1,344        1.091%
# of Family          Four or More Family Members            2,664        2.162%
# of Family          Three or Less Family Members           5,954        4.832%
# of Kids            Kids in Family                         3,269        2.653%
# of Kids            No Kids in Family                      5,724        4.646%
Household Status     Live with Parents/Family               1,827        1.483%
Household Status     Own Home                               3,256        2.643%
Household Status     Rent Home                              3,670        2.979%
Home Type            Live in Apartment                      2,373        1.926%
Home Type            Live in Condominium                    655          0.532%
Home Type            Live in House                          5,073        4.117%
Home Type            Live in Mobile Home                    151          0.123%
Home Type            Other Home Type                        384          0.312%
Ethnic               American Indian                        150          0.122%
Ethnic               Asian                                  477          0.387%
Ethnic               Black                                  910          0.739%
Ethnic               East Indian                            18           0.015%
Ethnic               Hispanic                               1,231        0.999%
Ethnic               Other                                  225          0.183%
Ethnic               Pacific Islander                       103          0.084%
Ethnic               White                                  5,811        4.716%
Language             Speak English at Home                  7,794        6.326%
Language             Speak Other Language at Home           261          0.212%
Language             Speak Spanish at Home                  579          0.470%
Step 3: Association Analysis with Association node
We use the default options in the Association Node and set the “ID” as ID variable and
“Condition” as the Target variable to obtain 89154 rules. The “Statistics Plot”, “Statistics
Line Plot”, “Rule Matrix”, and “Output” can be found in the next screen shot.
The Statistics Line Plot graphs the lift, expected confidence, confidence, and support for
each of the rules by rule index number. Consider the rule A ⇒ B. Recall that the
• Support of A ⇒ B is the probability that a customer has both A and B.
• Confidence of A ⇒ B is the probability that a customer has B given that the customer has A.
• Expected confidence of A ⇒ B is the probability that a customer has B.
• Lift of A ⇒ B is a measure of the strength of the association. If the lift is 2 for the rule
  A => B, then a customer having A is twice as likely to have B as a customer chosen at
  random. Lift is thus the confidence divided by the expected confidence.
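For instance, for the first rule in Section 3, "CKCRD => CKING & CCRD", the lift of 3.33 is
simply the confidence of 49.39% divided by the expected confidence of 14.85%.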
To view the descriptions of the rules, select View => Rules => Rule Description.
The rule matrix plots the rules based on the items on the left side of the rule and the items
on the right side of the rule. The points are colored based on the confidence of the rules.
For example, the rules with the highest confidence are in the column indicated by the
cursor in the picture above. Using the ActiveX feature of the graph, you discover that
these rules all have “Living with Parents/Family” on the right-hand side of the rule.
To view the link graph, select View => Rules => Link Graph.
The link graph displays association results by using nodes and links. The size and color
of a node indicate the transaction counts in the Rules data set. Larger nodes have greater
counts than smaller nodes. The color and thickness of a link indicate the confidence level of
a rule; the thicker the link, the higher the confidence of the rule.
Suppose you are particularly interested in those associations that involve “College
Graduate”. One way to accomplish that visually in the link graph is to select those nodes
whose label contains “College Graduate” and then show only those links involving the
selected nodes.
1. Right-click in the Link Graph window and then click Select….
2. In the Selection Dialog window, change the options to select the nodes where the
label contains College Graduate as shown below:
3. Select OK.
4. In the link graph, the nodes with College Graduate are now selected. Right-click in
the link graph and deselect Show all links.
Step 4: Understanding the results
First, it is not possible to find any one-item set with support less than 0.73%, because all
such one-item sets are excluded from the analysis. (Note: There are 8,993 customers; the
minimum confidence for a one-item set is 10%, 10% of the customers is 899, and
899/123,208 ≈ 0.73%.) For example, we cannot find any association rule that has "Speak Spanish
at Home" on either the right-hand side or the left-hand side, because the support for "Speak
Spanish at Home" is only 0.47%. However, there might be some interesting high-lift rules
involving "Speak Spanish at Home". If we are indeed interested in finding rules containing
relatively rare events such as "Speak Spanish at Home", we can do so either by combining
categories or by oversampling. Second, no single measure (support, confidence, lift,
j-measure, or chi-square statistic) can by itself be used to rank these rules. For example, a
rule with very high confidence might have a very low j-measure if its confidence is close
to 1. Thus, we need SQL to search for the rules in which we are interested and multiple
measures to understand them. Table 7 shows five rules with the highest j-measure. These rules
typically also have high lift and support.
Table 7 High j-Measure and High Lift Rules
Some interesting rules with “College Graduated” or “Non-College Graduated” in the
right hand side have been identified by using SQL.
Section 6 Disassociation Analysis
Sometimes we might want to conduct "negative association analysis". For example, we might want
to know the behavior of customers without a money market account. The Association node can
also be used to perform disassociation analysis, except that the data need to be modified
first. The SAS code in Appendix 4 converts the data set used in association analysis into a
form suitable for disassociation analysis. We can use the following steps, adding one SAS Code
node between the Data Source node and the Association node, to perform disassociation analysis
on "SVG", "CKING", and "MMDA".
Complete Diagram:
Step 1: Add the Data Source as described in Section 3
Step 2: Add a SAS Code Node
• "Training SAS Code":
  o Select the "Program" tab and copy the SAS code in Appendix 4 into the program window.
  o Make the following changes:
    - %let values='SVG','CKING','MMDA';
    - %let in=&EM_IMPORT_TRANSACTION;
    - %let out=&EM_EXPORT_TRANSACTION;
Step 3: Add “Association” node. Keep the defaults and run the diagram.
The rules found now include negations of the original variables. Selected rules are shown
in the following figure.
Notice that the confidence for the rule "~CKING & SVG => ~MMDA" is 95.12%. This means that a
large portion of customers who have a savings account but no checking account tend not to have
a money market account. This is an interesting finding because a money market account is much
more profitable for the bank than a checking account.
Section 7 Sequential Association Analysis
Sometimes the data include an additional dimension: the time associated with each transaction.
For example, a customer might rent the Star Wars series over several transactions. The
customer first rents "Star Wars", then "Empire Strikes Back", and then "Return of the Jedi".
The customer rents these three videos at three different times and possibly rents other videos
in between. Unless we study the data sequence at the customer level, we will not be able to
discover this pattern. If we are interested in the data sequence not only at the transaction
level but also at the customer level, then we need to consider this additional temporal
dimension. Mining sequential data patterns was initially motivated by applications in the
retailing industry, including attached mailing, add-on sales, and customer satisfaction.
However, the results can apply to many scientific and business domains as well. For instance,
in the medical domain, a data sequence may correspond to the symptoms or diseases of a
patient, with a transaction corresponding to the symptoms exhibited or diseases diagnosed
during a visit to the doctor. The patterns discovered using these data could be used in
disease research to help identify symptoms or diseases that precede certain diseases. Elements
of a sequential pattern need not be simple items. "Fitted sheet and flat sheet and pillow
cases", followed by "comforter", followed by "drapes and ruffles" is an example of a
sequential pattern in which the elements are sets of items.
There are many algorithms such as AprioriAll and SPADE (Zaki, 1998) that can be used
to obtain sequence patterns. Since we focus here on how to use the existing software to
perform sequential association analysis, readers interested in algorithm development are
referred to Dunham (2003).
Terminology
• Length: The length of a sequence is the number of item sets in the sequence. A sequence of
  length k is called a k-sequence. The sequence formed by the concatenation of two sequences
  x and y is denoted x.y.
• Support: The support for an item set is defined as the fraction of customers who bought the
  items in this item set in a single transaction; the count is therefore at the customer level
  rather than the transaction level. Under this definition, an item set and a 1-sequence have
  the same support.
• Litemset: An item set with support greater than a given "minimum support" is called a large
  item set, or litemset. Note that each item set in a large sequence must have minimum
  support. Hence, any large sequence must be a list of litemsets.
Example 6
Consider a database with the customer sequences shown below. The customer sequences are in
transformed form, where each transaction has been replaced by the set of litemsets contained
in the transaction and the litemsets have been replaced by integers. The minimum support has
been specified to be 40% (i.e., 2 customer sequences).
Customer sequences: {1 5 2 3 4, 1 3 4 3 5, 1 2 3 4, 1 3 5, 4 5}
Find the maximal large sequences.
<Solutions>: The maximal large sequences are the three sequences 1 2 3 4, 1 3 5, and 4 5, each
with support 0.4.
Example 7 To perform sequential association analysis with the Association node, the
data needs to have one “ID” variable, one “SEQUENCE” variable, and one “TARGET”
variable. In this example, we use the same data set, “BANK”.
We summarize the options available in the Association node to perform “sequence
association” analysis, as follows:
The options in the Sequence Panel enable you to specify the following properties:
• Chain Count: This option sets the maximum number of items to include in a sequence. The
default value is 3, and the maximum value that you can specify is 10. To change the number of
items in the longest chain, enter a new value in the entry field. If the number of items
specified is larger than what is found in the data, then the chain length in the results will
be smaller than the value that was specified.
• Consolidate Time: This option enables one to collapse sequence time differences. For
example, a customer goes to the store at 7:00 AM to buy eggs and bacon (the unit of measure is
an hour). She returns to the store at 1:00 PM to buy several other items. She makes an additional
visit at 9:00 PM to buy cold medicine and a box of tissues. She returns to the store two days later
at noon to buy orange juice, chicken noodle soup, and more cold medicine. In this example, the
sequence variable (VISIT) has 4 entries. If you want to perform sequence discovery on a daily
basis, then you would need to enter 24 hours in the Consolidate time differences < entry field.
This would enable multiple visits on the same day to be consolidated as a single visit.
• Maximum Transaction Duration: By default, all possible sequences are identified. If you want
to restrict the analysis to a particular time window, you can specify a maximum transaction
duration, say 3 months. This means that a printer bought more than three months after a PC
purchase does not constitute a sequence.
• Support Type, which specifies whether the sequence analysis should use the Support
Count or Support Percentage property. The default setting is Percent.
• Support Count, which specifies the minimum frequency required to include a
sequence in the sequence analysis when the Sequence Support Type is set to Count. If a
sequence has a count less than the specified value, that sequence is excluded from the
output. The default setting is 2.
• Support Percentage, which specifies the minimum level of support to include the
sequence in the analysis when the Sequence Support Type is set to Percentage. If a
sequence has a frequency that is less than the specified percentage of the total number of
transactions then that sequence is excluded from the output. The default percentage is
2%. Permissible values are real numbers between 0 and 100.
Since understanding the results from “sequence association analysis” and “association
analysis” is similar, we will not present the results from using the default settings in the
Association node here.
Section 8 Case Study
Identifying interesting rules is the most important part of association analysis since there
are thousands of rules generated in a typical association analysis. In this case study, we
introduce two ways to guide the rules finding process using the INCOME data set
discussed in Section 5.
Decision Trees in Finding Rules: Suppose you want to identify rules that have one (or more)
specific item(s) on the right-hand side; a decision tree can guide the search for rules that
have this item (or these items) as the consequent. For example, we can use "income" as the
target variable to find rules that lead to "high income" or "low income" using the following
steps.
Complete Diagram:
Data Source: Select the data set “INCOME”
SAS Code Node: Add "high_income=(income > 6)" in the Scoring Code to split the variable
INCOME into two categories, "$40,000 or more" (income codes 7-9) and "less than $40,000"
(income codes 1-6).
Metadata Node: Set the newly created variable as the target variable and use "supervised
learning" techniques such as decision trees to discover rules that lead to "high income" or
"low income".
Tree Node: Since we do not need to build an optimal model, we do not need to split the data
into training and validation sets; this means that we do not need to use the Data Partition
node. The only change is to set the "Depth" property to 4, which restricts the maximum number
of rules to 16 (= 2^4). Since the number of rules is significantly reduced, we can look at all
of the rules carefully and identify some that might be interesting. For example, the following
rule, "IF "Education" IS ONE OF "COLLEGE GRADUATE", "GRAD STUDY" AND "Marry Status" IS ONE OF
"MARRIED" and "LIVING TOGETHER, Not Married" AND "Household Status" EQUALS "Rent" or "Living
with Parent" AND "Occupation" is "Professional/Managerial" THEN 65.4% of customers have income
greater than $40,000", is very interesting. The support, confidence, and lift for this rule
are 0.0179, 0.654 (= 161/246), and 1.86 (= (161/246)/(3161/8993)), respectively. It should be
noted that this rule is a generalized association rule that cannot be found by most simple
association analysis packages such as the Association node in Enterprise Miner.
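As a quick check, these three measures can be reproduced directly from the counts used above
(161 of the 246 cases in the leaf have high income, with 3161/8993 being the overall share of
high-income customers used as the baseline in the lift calculation):

/* Support, confidence, and lift of the tree-derived rule from its leaf counts */
data tree_rule;
   support    = 161/8993;                  /* about 0.0179 */
   confidence = 161/246;                   /* about 0.654  */
   lift       = confidence / (3161/8993);  /* about 1.86   */
   put support= confidence= lift=;
run;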
Convert “Unsupervised Data Mining” problems to “Supervised Data Mining”
Problems: An idea suggested by Hastie et al. (2001) can be used to identify rules of
interest if the miner does not have any specific items in mind to be found as the right
hand side of these rules. In this illustration, we assume that we want to compare the
original data set with a simulated data set that has a uniform distribution. We can
generate a simple random sample with uniform distribution and set the target to be 0 for
this newly generated data set and set the target to be 1 for the original data set. The
sample SAS code to perform this data preparation process is in Appendix 5.
Since the data are now ready for supervised data mining, we can use supervised techniques such
as decision trees to identify rules that might be interesting. Because the maximum number of
rules depends on the depth of the tree, the number of rules found is much smaller than in a
typical association analysis; for example, a binary tree of depth eight can have at most 256
(= 2^8) rules, far fewer than the thousands of rules found in a typical association analysis.
Also, the rules found by trees can have multiple categories of a given variable as one item in
the item set. We therefore also refer to these rules as generalized association rules.
Since we already converted this problem into a “supervised data mining” problem, all
techniques that can be used to find an optimal tree can be used here. We will discuss the
details of finding an optimal decision tree model in the next session and will only present
selected interesting rules here.
The first rule is a rule with high lift and very low support. Since this support is very
small, most association analyses will not be able to find this rule (or this rule will be
hidden in thousands of rules). The four items of this rule are “Speak Spanish at Home”,
“Home Type is either House or Apartment”, “Hispanic”, and “Two or less Kids at
Home”. Several association rules can be derived using these four items. One interesting
rule with three items derived is “Hispanic” and “Two or less kids at home” => “Speak
Spanish at Home”. The support, confidence, and lift for this rule are 0.043, 0.378, and
5.87, respectively. Due to the low support, this rule was not found in Example 5.
Another rule with four items derived is “Home or Apartment”, “Hispanic” and “Two or
less kids at home” => “Speak Spanish at Home”. The support, confidence, and lift for
this rule are 0.037, 0.347, and 5.400, respectively. Another four-item rule found by trees
is “White”, “Mobile Home or Others”, “Two or less kids at Home”, and “Speak English
at home”. One association based on these four items is “White”, “Two or less kids”, and
“Live in mobile Home or Others” => “Speak English at Home”. This rule has 0.0406,
0.793, and 1.228 as its support, confidence, and lift, respectively.
The last rule presented is “Speak English at Home”, “Live in House”, “Three or More
Kids” => “Live in Bay Area for more than ten years”. This rule has very low support
(0.0226). However, the confidence (0.646) is very high and the lift (1.12) is moderate.
Appendix 1 Apriori Algorithm
The Apriori algorithm (Agrawal et al., 1995) for finding all frequent item sets is given
below:
procedure AprioriAlg()
begin
   L1 := {frequent 1-itemsets};
   for ( k := 2; Lk-1 ≠ ∅; k++ ) do {
      Ck := apriori-gen(Lk-1);          // new candidates
      for all transactions t in the dataset do {
         for all candidates c ∈ Ck contained in t do
            c.count++;
      }
      Lk := { c ∈ Ck | c.count >= min-support };
   }
   Answer := ∪k Lk;
end
The code makes multiple passes over the database. In the first pass, the algorithm simply
counts item occurrences to determine the frequent single-item sets (i.e., single-item sets
with frequency greater than the threshold). A subsequent pass, say pass k, consists of two
phases. First, the frequent item sets Lk-1 (the set of all frequent (k-1)-item sets) found in
the (k-1)th pass are used to generate the candidate item sets Ck, using the apriori-gen()
function. Next, it deletes all item sets from the join result that have some (k-1)-subset that
is not in Lk-1 yielding Ck. The algorithm now scans the database. For each transaction, it
determines which of the candidates in Ck are contained in the transaction using a hashtree data structure and increments the count of those candidates. At the end of the pass,
Ck is examined to determine which of the candidates are frequent, yielding Lk . The
algorithm terminates when Lk becomes empty.
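As a small illustration of the candidate-generation step (an example constructed here, not
taken from the text): if L2 = { {A,B}, {A,C}, {B,C}, {B,D} }, the join phase of apriori-gen()
produces the candidates {A,B,C} and {B,C,D}; the prune phase then deletes {B,C,D} because its
subset {C,D} is not in L2, so C3 = { {A,B,C} }, and only {A,B,C} is counted against the
database in the third pass.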
Appendix 2 SAS Code for Example 3
/* Example 3 Part (a) */
options nodate nocenter pageno=1 pagesize=54 linesize=80 nonumber;
data work.jmeasure; set &em_import_rules;
conf=conf/100;
support=support/100;
if conf = 1 then jm = .; /* j-measure is undefined if conf = 0 or 1 */
else jm=(support/conf)*(conf*log(lift)+(1-conf)*log((1-conf)/(1-conf/lift)));
label jm = "J-Measure";
label support = "Support";
label conf = "Confidence";
run;
proc sort data=work.jmeasure;
by descending jm;
run;
proc sql inobs=5;
select rule, jm, conf, support, lift, exp_conf
from work.jmeasure;
quit;
/* Part (b) */
options linesize=132;
proc sql;
select rule, conf, support, lift
from work.jmeasure
where (_RHAND = 'CKING' and conf > 0.8 and support > 0.2);
quit;
Appendix 3 Data Set “Income”
A total of N=9409 questionnaires containing 502 questions were filled out by shopping
mall customers in the San Francisco Bay area (Impact Resources, Inc., Columbus, OH
(1987)).
The dataset “income” is an extract from this survey. It consists of 14
demographic attributes. The dataset is a good mixture of categorical and continuous
variables with a lot of missing data.
The variable list is, as follows:
Alphabetic List of Variables and Attributes

#    Variable       Type    Len    Format     Informat
4    Age            Num     8      BEST12.    BEST32.
5    Education      Num     8      BEST12.    BEST32.
2    Gender         Num     8      BEST12.    BEST32.
1    Income         Num     8      BEST12.    BEST32.
7    Length         Num     8      BEST12.    BEST32.
3    Married        Num     8      BEST12.    BEST32.
6    Occupation     Num     8      BEST12.    BEST32.
8    dual           Num     8      BEST12.    BEST32.
13   ethnic         Num     8      BEST12.    BEST32.
12   home_type      Num     8      BEST12.    BEST32.
11   household      Num     8      BEST12.    BEST32.
14   language       Num     8      BEST12.    BEST32.
9    n_family       Num     8      BEST12.    BEST32.
10   n_kids         Num     8      BEST12.    BEST32.
The attribute information is as follows:

1 ANNUAL INCOME OF HOUSEHOLD (PERSONAL INCOME IF SINGLE)
1. Less than $10,000
2. $10,000 to $14,999
3. $15,000 to $19,999
4. $20,000 to $24,999
5. $25,000 to $29,999
6. $30,000 to $39,999
7. $40,000 to $49,999
8. $50,000 to $74,999
9. $75,000 or more
2 SEX
1. Male
2. Female
3 MARITAL STATUS
1. Married
2. Living together, not married
3. Divorced or separated
4. Widowed
5. Single, never married
4 AGE
1. 14 thru 17
2. 18 thru 24
3. 25 thru 34
4. 35 thru 44
5. 45 thru 54
6. 55 thru 64
7. 65 and Over
5 EDUCATION
1. Grade 8 or less
2. Grades 9 to 11
3. Graduated high school
4. 1 to 3 years of college
5. College graduate
6. Grad Study
6 OCCUPATION
1. Professional/Managerial
2. Sales Worker
3. Factory Worker/Laborer/Driver
4. Clerical/Service Worker
5. Homemaker
6. Student, HS or College
7. Military
8. Retired
9. Unemployed
7 HOW LONG HAVE YOU LIVED IN THE SAN FRAN./OAKLAND/SAN JOSE
AREA?
1. Less than one year
2. One to three years
3. Four to six years
4. Seven to ten years
5. More than ten years
8 DUAL INCOMES (IF MARRIED)
1. Not Married
2. Yes
3. No
9 PERSONS IN YOUR HOUSEHOLD
1. One
2. Two
3. Three
4. Four
5. Five
6. Six
7. Seven
8. Eight
9. Nine or more
10 PERSONS IN HOUSEHOLD UNDER 18
0. None
1. One
2. Two
3. Three
4. Four
5. Five
6. Six
7. Seven
8. Eight
9. Nine or more
11 HOUSEHOLDER STATUS
1. Own
2. Rent
3. Live with Parents/Family
12 TYPE OF HOME
1. House
2. Condominium
3. Apartment
4. Mobile Home
5. Other
13 ETHNIC CLASSIFICATION
1. American Indian
2. Asian
3. Black
4. East Indian
5. Hispanic
6. Pacific Islander
7. White
8. Other
14 WHAT LANGUAGE IS SPOKEN MOST OFTEN IN YOUR HOME?
1. English
2. Spanish
3. Other
There are 8,993 observations in this data set; they are obtained from the original data set of
9,409 instances by removing the observations in which the response (Annual Income) is missing
(coded ".").
Appendix 4 SAS Code for Disassociation Analysis
%let values='SVG','CKING','MMDA';
%let in=&EM_IMPORT_TRANSACTION;
%let out=&EM_EXPORT_TRANSACTION;
proc sql;
   create table v56767c as
   select distinct %em_target from &in;

   create table r57304x as
   select distinct %em_id as %em_id,
          a.%em_target,
          '~'||a.%em_target as notvalue
   from &in, v56767c as a;

   create table &out as
   select b.%em_id, coalesce(a.%em_target, b.notvalue) as %em_target
   from &in as a right join r57304x as b
      on a.%em_id=b.%em_id and a.%em_target=b.%em_target
   where a.%em_target ~= '' or b.%em_target in (&values);
quit;
quit;
proc datasets library=work nolist;
delete r57304x v56767c;
quit;
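To see what this transformation does, consider a customer whose only listed service is SVG:
under the settings above, that customer contributes the rows SVG, ~CKING, and ~MMDA to the
output transaction table, while services outside the selected list ("SVG", "CKING", "MMDA")
appear only when the customer actually holds them.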
Appendix 5 Sample SAS Code for Case Study
/* Prepare a data set suitable for decision tree analysis */
data &mylib.income_tree (keep = income gender marry age
education occupation length dual n_family n_kids
household home_type ethnic language);
set &mylib.income;
array aa(14);
do i = 1 to 14;
aa(i) = uniform(0);
end;
gender = int(aa(1)*2)+1;
income = int(aa(2)*9)+1;
marry = int(aa(3)*5)+1;
age = int(aa(4)*7)+1;
education = int(aa(5)*6)+1;
occupation = int(aa(6)*9)+1;
length = int(aa(7)*5)+1;
n_family = int(aa(8)*9)+1;
n_kids = int(aa(9)*10);
household = int(aa(10)*3)+1;
home_type = int(aa(11)*5)+1;
ethnic = int(aa(12)*8)+1;
language = int(aa(13)*3)+1;
dual = int(aa(14)*3)+1;
run;
data &mylib.incomecombine;
set &mylib.income (in=ina)
&mylib.income_tree (in=inb);
if ina then target = 1;
else if inb then target = 0;
run;
Appendix 6 References
Agrawal, R., Imielinski, T. and Swami, A. (1993) "Mining Association Rules between Sets of
Items in Large Databases", Pages 207-216, Proceedings of the ACM SIGMOD Conference on
Management of Data, ACM Press: New York.

Agrawal, R., Mannila, H., Srikant, R., Toivonen, H. and Verkamo, A. I. (1995a) "Fast Discovery
of Association Rules," in Advances in Knowledge Discovery and Data Mining, AAAI/MIT Press,
Cambridge, MA.

Agrawal, R., Mannila, H. and Srikant, R. (1995b) "Mining Sequential Patterns," Pages 3-14,
Proceedings of the IEEE International Conference on Data Engineering.

Dunham, Margaret H. (2003) Data Mining – Introductory and Advanced Topics (Chapter 6),
Prentice Hall: Upper Saddle River, New Jersey.

Hand, David, Mannila, Heikki, and Smyth, Padhraic (2001) Principles of Data Mining
(Chapter 13), MIT Press, Cambridge, Massachusetts.

SAS Institute (2000) Enterprise Miner: Applying Data Mining Techniques – Course Notes
(Chapter 7), SAS Institute, Cary, NC.

Hastie, Trevor, Tibshirani, Robert, and Friedman, Jerome (2002) The Elements of Statistical
Learning – Data Mining, Inference, and Prediction (Chapter 14).

Zaki, M. J. (1998) "Efficient Enumeration of Frequent Sequences", Pages 68-75, Proceedings of
the ACM CIKM Conference.
Exercise:
Problem 1
For any given rule A=>B
(a) Prove that one can compute the j-measure from three quantities: support, confidence, and
lift.
(b) Prove that one can compute the chi-square statistic for independence from four quantities:
support, confidence, lift, and the sample size.
Problem 2
(a) Write a SAS program to convert “income.txt” into a SAS data set that can be used in
Association Node. You can use the median to split each ordinal variable into two
dummy variables and use each category as one dummy variable for each nominal
variable.
(b) Perform an association analysis with Association node and the following conditions:
• Find rules with 5 or less items
• The support for the rule is 10%
• The confidence used to generate the rules is 10%
(c) Use a SAS code node to compute the j-measure combined with PROC SQL to
identify five interesting rules.
Problem 3
Can you find a rule "A => B" that has S(A ⇒ B) = 0.8, C(A ⇒ B) = 0.75, and L(A ⇒ B) = 1.5?
Please explain.
Problem 4
Implement the Apriori and AprioriAll algorithms.
Problem 5
One weakness of the Apriori algorithm is that it assumes that the database can reside
completely in memory. Perform research to find algorithms that do not make this assumption.
Problem 6
Perform research to find newer association analysis algorithms, such as incremental updating
approaches that identify association rules without re-running the algorithm from the beginning
each time.