ASSOCIATION RULE MINING USING SAS E-MINER
Anshu Bharadwaj
I.A.S.R.I., Library Avenue, Pusa, New Delhi-110012
[email protected]
1. Introduction
Association rule mining, one of the most important and well-researched techniques of data
mining, was first introduced in 1993 (Agrawal, 1993) and is used to identify relationships
among a set of items in a database. It aims to extract interesting correlations, frequent patterns,
associations or causal structures among sets of items in transaction databases or other data
repositories. These relationships are not based on inherent properties of the data themselves (as
in the case of functional dependencies), but rather on the co-occurrence of the data items.
Association rules are widely used in areas such as telecommunication networks, market
and risk management, and inventory control. They are mainly used to analyse
transactional data. In management, they help to increase the effectiveness and/or reduce
the cost associated with advertising, marketing, inventory, stock location on the floor, etc.
Association rules also assist in other applications such as prediction, by identifying
which events occur before a particular set of events. An association rule may be one of the
following types: Boolean, spatial, temporal, generalised, quantitative, interval and multiple
min-support, or a mix of them.
An association rule (Agrawal, 1993; Cheung, 1996) expresses an association among the attributes of a
transactional database. Let $D$ be a transaction database and $I = \{I_1, I_2, \ldots, I_m\}$ be a set of $m$
distinct items (attributes) of $D$, where each transaction (record) $T$ has a unique identifier and
contains a set of items such that $T \subseteq I$. A transaction $T$ is said to contain a set of items $A$ if and only if $A \subseteq T$.
An association rule is an implication of the form $A \Rightarrow B$, where $A, B \subset I$ are sets of items called
itemsets, and $A \cap B = \emptyset$. Here, $A$ is called the antecedent and $B$ the consequent. The rule $A \Rightarrow B$ holds in
the transaction database $D$ with support $s$, where $s$ is the ratio (in percent) of the records that contain
$A \cup B$ (i.e. both $A$ and $B$) to the total number of records in the database. This is taken to be the
probability $P(A \cup B)$. The rule $A \Rightarrow B$ has confidence $c$ in $D$, where $c$ is the ratio (in percent) of the
number of records that contain $A \cup B$ to the number of records that contain $A$. This is taken to be
the conditional probability $P(B \mid A)$.
The goal of association rule mining is to find the association rules in a given database that satisfy
predefined minimum support and confidence. The problem is usually decomposed into two subproblems (Agrawal, 1994):
• Find all sets of items (frequent itemsets) that occur with a frequency greater than or equal to the
user-specified threshold support, say s.
• Generate rules from the frequent itemsets that have confidence greater than or
equal to the user-specified threshold confidence, say c.
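The two subproblems can be sketched in a few lines of Python. This is a brute-force illustration, not the algorithm used by SAS Enterprise Miner or the Apriori procedure of the cited papers; the function names, the toy transactions and the thresholds are all assumptions made for the example.

```python
from itertools import combinations

def frequent_itemsets(transactions, min_support):
    """Subproblem 1: map every itemset with support >= min_support to its support."""
    n = len(transactions)
    items = sorted(set().union(*transactions))
    frequent = {}
    for size in range(1, len(items) + 1):
        found = False
        for candidate in combinations(items, size):
            cset = frozenset(candidate)
            support = sum(cset <= t for t in transactions) / n
            if support >= min_support:
                frequent[cset] = support
                found = True
        if not found:
            break  # no frequent itemset of this size, so none of a larger size
    return frequent

def generate_rules(frequent, min_confidence):
    """Subproblem 2: split each frequent itemset into antecedent A and consequent B."""
    rules = []
    for itemset, support in frequent.items():
        for size in range(1, len(itemset)):
            for antecedent in combinations(sorted(itemset), size):
                a = frozenset(antecedent)
                # Every subset of a frequent itemset is itself frequent,
                # so its support is already in the dictionary.
                confidence = support / frequent[a]
                if confidence >= min_confidence:
                    rules.append((a, itemset - a, support, confidence))
    return rules

# Hypothetical toy transactions (not taken from any data set in this paper).
transactions = [
    {"cracker", "heineken"},
    {"cracker", "heineken", "olives"},
    {"heineken", "herring"},
    {"olives", "herring"},
    {"cracker", "heineken", "herring"},
]

freq = frequent_itemsets(transactions, min_support=0.4)
rules = generate_rules(freq, min_confidence=0.7)
for a, b, s, c in rules:
    print(sorted(a), "=>", sorted(b), f"support={s:.2f}", f"confidence={c:.2f}")
```

Enumerating every candidate itemset is only practical for a handful of items; the point of the sketch is the separation into a support-filtering step and a confidence-filtering step.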
2. Evaluation methods for Association Rule Mining
To select interesting rules from the set of all possible rules, constraints on various measures of
significance and interest can be used. The best-known constraints are minimum thresholds on
support and confidence. Because the database is large and users care only about
frequently purchased items, thresholds of support and confidence are usually predefined by users
to drop rules that are not interesting or useful. The two thresholds are called minimal
support and minimal confidence, respectively. Other measures of interestingness for
association rule mining include lift, conviction and succinctness.
2.1 Support
The support supp(X) of an itemset X is defined as the proportion of transactions in the data set
which contain the itemset. In the example database, the itemset {milk,bread,butter} has a support
of 1 / 5 = 0.2 since it occurs in 20% of all transactions (1 out of 5 transactions).
2.2 Confidence
The confidence of a rule is defined as:
$$\mathrm{conf}(X \Rightarrow Y) = \frac{\mathrm{supp}(X \cup Y)}{\mathrm{supp}(X)}$$
For example, the rule $\{\text{milk}, \text{bread}\} \Rightarrow \{\text{butter}\}$ has a confidence of 0.2 / 0.4 = 0.5 in the
database, which means that for 50% of the transactions containing milk and bread the rule is
correct. Confidence can be interpreted as an estimate of the probability P(Y | X), the probability
of finding the RHS of the rule in transactions under the condition that these transactions also
contain the LHS.
2.3 Lift
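The lift of a rule is defined as:
$$\mathrm{lift}(X \Rightarrow Y) = \frac{\mathrm{supp}(X \cup Y)}{\mathrm{supp}(X) \times \mathrm{supp}(Y)}$$
or the ratio of the observed support to that expected if X and Y were independent. For a rule
such as $\{\text{milk}, \text{bread}\} \Rightarrow \{\text{butter}\}$, the lift is
$\mathrm{supp}(\{\text{milk}, \text{bread}, \text{butter}\}) / (\mathrm{supp}(\{\text{milk}, \text{bread}\}) \times \mathrm{supp}(\{\text{butter}\}))$.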
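The three measures of this section can be computed directly from their definitions. The sketch below is illustrative only: the function names are assumptions, and the five-transaction database is hypothetical, chosen so that it reproduces the support (0.2) and confidence (0.5) quoted above. The lift it produces follows from that hypothetical data and is not a figure taken from this paper.

```python
def support(itemset, transactions):
    """Proportion of transactions that contain every item of the itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(lhs, rhs, transactions):
    """supp(X u Y) / supp(X): estimate of P(Y | X)."""
    return support(set(lhs) | set(rhs), transactions) / support(lhs, transactions)

def lift(lhs, rhs, transactions):
    """supp(X u Y) / (supp(X) * supp(Y)), written as confidence / supp(Y)."""
    return confidence(lhs, rhs, transactions) / support(rhs, transactions)

# Hypothetical database of five transactions (an assumption for illustration).
database = [
    {"milk", "bread"},
    {"butter"},
    {"beer"},
    {"milk", "bread", "butter"},
    {"bread"},
]

print(support({"milk", "bread", "butter"}, database))       # 0.2
print(confidence({"milk", "bread"}, {"butter"}, database))  # 0.5
print(lift({"milk", "bread"}, {"butter"}, database))
```

A lift above one for this toy data means that butter appears more often in baskets with milk and bread than it does in the database as a whole.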
3. Illustration
Consider the following scenario. A store wants to examine its customer base and to understand
which of its products tend to be purchased together. It has chosen to conduct a market-basket
analysis of a sample of its customer base. This information might help the store make decisions such
as when to distribute coupons, when to put a product on sale, or how to present items in store
displays. To perform the association analysis, follow the steps below. The ASSOCS data set lists the
grocery products that were purchased by 1,001 customers. Twenty possible items are represented:
Table 1. Selected Variables in the ASSOCS Data Set
Code        Product
apples      apples
artichok    artichokes
avocado     avocado
baguette    baguettes
bordeaux    wine
bourbon     bourbon
chicken     chicken
coke        cola
corned_b    corned beef
cracker     cracker
ham         ham
heineken    beer
herring     fish
ice_crea    ice cream
olives      olives
peppers     peppers
sardines    sardines
soda        soda
steak       steak
turkey      turkey
Seven items were purchased by each of 1,001 customers, which yields 7,007 rows in the data set.
Each row of the data set represents a customer-product combination. In most data sets, not all
customers have the same number of products.
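The long layout described above, with one row per customer-product combination, can be turned into one basket per customer with a simple grouping step. The following Python sketch uses a few hypothetical rows rather than the real SAMPSIO.ASSOCS data, and the name `to_baskets` is an assumption made for the example.

```python
from collections import defaultdict

# Hypothetical rows in the same long layout: one (customer, product) pair per row.
rows = [
    (0, "heineken"), (0, "cracker"), (0, "olives"),
    (1, "herring"), (1, "olives"),
    (2, "heineken"), (2, "cracker"),
]

def to_baskets(rows):
    """Group customer-product rows into one set of products per customer."""
    baskets = defaultdict(set)
    for customer, product in rows:
        baskets[customer].add(product)
    return dict(baskets)

baskets = to_baskets(rows)
for customer, products in sorted(baskets.items()):
    print(customer, sorted(products))

# Row count for the layout described in the text: 1,001 customers x 7 items each.
print(1001 * 7)  # 7007
```

Each basket plays the role of one transaction T in the definitions of Section 1, with the customer identifier as its unique key.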
3.1 Create a Process Flow Diagram
1. Create a data source ASSOCS by using the SAS sample data set called
SAMPSIO.ASSOCS.
2. Select the ASSOCS data set from the SAMPSIO library.
3. Click the Variables tab.
Using either the Basic or the Advanced Metadata Advisor, assign the following roles to the
variables:
4. Set the model role for CUSTOMER to Id.
5. Set the model role for PRODUCT to Target.
6. Set the model role for TIME to Rejected.
Note: TIME is a variable that identifies the sequence in which the products were purchased. In
this example, all of the products were purchased at the same time, so the order relates only to the
order in which they are priced at the register. When order is taken into account, association
analysis is known as sequence analysis. Sequence analysis is not demonstrated here.
7. Close and save changes to the Input Data Source node.
8. Assign the data source the role of Transaction in the Data Source Attributes window of
the Data Source Wizard and save SAMPSIO.ASSOCS.
9. Add the data source SAMPSIO.ASSOCS to your diagram workspace.
10. Add an Association node to the diagram workspace and connect it to the data source
ASSOCS.
11. Change the Maximum Items property to 2.
12. Run the Association node.
After the node runs successfully, open the Results window.
Support (%) is the percentage of customers who have all the items that are involved in the
rule. For example, 36.56% of the 1,001 customers purchased crackers and beer (rule 1), and 25.57%
purchased olives and herring (rule 7). Consider the Confidence (%) column above.
Confidence (%) represents the percentage of customers who have the right-hand side (RHS) item
among those who have the left-hand side (LHS) item. For example, of the customers who
purchased crackers, 75% purchased beer (rule 2). Of the customers who purchased beer,
however, only 61% purchased crackers (rule 1). Lift, in the context of association rules, is the
ratio of the confidence of a rule to the expected confidence of the rule, assuming that the RHS is
independent of the LHS.
Consequently, lift is a measure of association between the LHS and RHS of the rule. Values that
are greater than one represent positive association between the LHS and RHS. Values that are
equal to one represent independence. Values that are less than one represent a negative
association between the LHS and RHS. Click the LIFT column with the right mouse button and
select
The lift for rule 1 indicates that a customer who buys peppers and avocados is about 5.67 times
as likely to purchase sardines and apples as a customer taken at random. Support (%) for this
rule, unfortunately, is very low (8.99%), indicating that the event in which all four products are
purchased together is a relatively rare occurrence.
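As a worked check of how support, confidence and lift relate, the figures quoted earlier for crackers and beer (support 36.56%, confidences 75% and 61%) can be combined by hand. The Python sketch below is arithmetic only, under the assumption that the expected confidence of a rule equals the support of its RHS. The variable names are illustrative, the quoted confidences are rounded, and the resulting lift is therefore approximate and is not a value read from the Association node output.

```python
# Figures quoted in the text for the crackers/beer pair.
support_both = 0.3656          # customers who bought both crackers and beer
conf_cracker_to_beer = 0.75    # rule 2: crackers => beer
conf_beer_to_cracker = 0.61    # rule 1: beer => crackers

# confidence(A => B) = support(A and B) / support(A), so support(A) can be backed out.
support_cracker = support_both / conf_cracker_to_beer
support_beer = support_both / conf_beer_to_cracker

# Lift = confidence / expected confidence, where the expected confidence
# under independence is the support of the RHS.
lift_cracker_to_beer = conf_cracker_to_beer / support_beer
lift_beer_to_cracker = conf_beer_to_cracker / support_cracker

print(f"support(crackers) ~ {support_cracker:.3f}")
print(f"support(beer)     ~ {support_beer:.3f}")
print(f"lift (both directions) ~ {lift_cracker_to_beer:.2f}, {lift_beer_to_cracker:.2f}")
```

The two confidences differ because the LHS supports differ, but the lift is the same in both directions, which is why lift is read as a symmetric measure of association between the two sides of a rule.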