
ASSOCIATION RULE MINING USING SAS E-MINER

Anshu Bharadwaj
I.A.S.R.I., Library Avenue, Pusa, New Delhi-110012
[email protected]

1. Introduction

Association rule mining, one of the most important and well-researched techniques of data mining, was first introduced in 1993 (Agrawal, 1993) and is used to identify relationships among a set of items in a database. It aims to extract interesting correlations, frequent patterns, associations or causal structures among sets of items in transaction databases or other data repositories. These relationships are not based on inherent properties of the data themselves (as in the case of functional dependencies), but rather on the co-occurrence of the data items.

Association rules are widely used in areas such as telecommunication networks, market and risk management, and inventory control. They are mainly used to analyse transactional data. In management they help to increase the effectiveness and/or reduce the cost of advertising, marketing, inventory, stock location on the floor etc. Association rules also assist in other applications such as prediction, by identifying which events occur before a particular set of events. An association rule may be one of the following types: Boolean, Spatial, Temporal, Generalised, Quantitative, Interval and Multiple Min-Support Association etc., or a mix of them.

An association rule (Agrawal, 1993; Cheung, 1996) gives the association among the attributes in a transactional database. Let $D$ be a transaction database and $I = \{I_1, I_2, \ldots, I_m\}$ be a set of $m$ distinct items (attributes) of $D$, where each transaction (record) $T$ has a unique identifier and contains a set of items such that $T \subseteq I$. A transaction $T$ is said to contain a set of items $A$ if and only if $A \subseteq T$. An association rule is an implication of the form $A \Rightarrow B$, where $A, B \subset I$ are sets of items called itemsets, and $A \cap B = \emptyset$. Here, $A$ is called the antecedent and $B$ the consequent.
The rule $A \Rightarrow B$ holds in the transaction database $D$ with support $s$, where $s$ is the ratio (in percent) of the number of records that contain $A \cup B$ (i.e. both A and B) to the total number of records in the database. This is taken to be the probability $P(A \cup B)$. The rule $A \Rightarrow B$ has confidence $c$ in $D$, where $c$ is the ratio (in percent) of the number of records that contain $A \cup B$ to the number of records that contain $A$. This is taken to be the conditional probability $P(B \mid A)$.

Association rule mining finds the association rules in a given database that satisfy a predefined minimum support and minimum confidence. The problem is usually decomposed into two subproblems (Agrawal, 1994):

• Find all sets of items that occur with a frequency greater than or equal to the user-specified threshold support, say s.
• Generate rules from these frequent itemsets that have confidence greater than or equal to the user-specified threshold confidence, say c.

2. Evaluation methods for Association Rule Mining

To select interesting rules from the set of all possible rules, constraints on various measures of significance and interest can be used. The best-known constraints are minimum thresholds on support and confidence. Since the database is large and users are concerned only with frequently purchased items, thresholds of support and confidence are usually predefined by users to drop rules that are not interesting or useful. The two thresholds are called minimal support and minimal confidence respectively. A few more measures of interestingness for association rule mining are Lift, Conviction and Succinctness.

2.1 Support

The support $\mathrm{supp}(X)$ of an itemset $X$ is defined as the proportion of transactions in the data set which contain the itemset. In the example database, the itemset {milk, bread, butter} has a support of 1 / 5 = 0.2 since it occurs in 20% of all transactions (1 out of 5 transactions).
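The two subproblems above can be sketched in a few lines of Python. This is a minimal brute-force illustration, not the algorithm used by SAS Enterprise Miner and not an efficient method such as Apriori. The toy transactions, the function names and the thresholds are all assumptions made for illustration; they are not the example database referred to in the text.

```python
from itertools import combinations

# Hypothetical toy transactions (assumed for illustration only).
transactions = [
    {"milk", "bread"},
    {"milk", "bread", "butter"},
    {"bread", "butter"},
    {"milk"},
    {"bread"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = frozenset(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def frequent_itemsets(transactions, min_support):
    """Subproblem 1: every itemset with support >= min_support (brute force)."""
    items = sorted(set().union(*transactions))
    result = {}
    for size in range(1, len(items) + 1):
        for candidate in combinations(items, size):
            s = support(candidate, transactions)
            if s >= min_support:
                result[frozenset(candidate)] = s
    return result

def generate_rules(frequent, min_confidence):
    """Subproblem 2: rules A => B built from frequent itemsets."""
    rules = []
    for itemset, s in frequent.items():
        for size in range(1, len(itemset)):
            for antecedent in combinations(sorted(itemset), size):
                a = frozenset(antecedent)
                # Every subset of a frequent itemset is itself frequent,
                # so frequent[a] is always available here.
                confidence = s / frequent[a]
                if confidence >= min_confidence:
                    rules.append((a, itemset - a, s, confidence))
    return rules

frequent = frequent_itemsets(transactions, min_support=0.4)
rules = generate_rules(frequent, min_confidence=0.6)

for a, b, s, c in rules:
    print(sorted(a), "=>", sorted(b), f"support={s:.2f} confidence={c:.2f}")
```

Enumerating all candidate itemsets is exponential in the number of items, so this sketch is only practical for very small examples; it is meant to show how the support threshold prunes itemsets before any rule is generated.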
2.2 Confidence

The confidence of a rule is defined as:

$$\mathrm{conf}(X \Rightarrow Y) = \frac{\mathrm{supp}(X \cup Y)}{\mathrm{supp}(X)}$$

For example, the rule $\{\text{milk}, \text{bread}\} \Rightarrow \{\text{butter}\}$ has a confidence of 0.2 / 0.4 = 0.5 in the database, which means that for 50% of the transactions containing milk and bread the rule is correct. Confidence can be interpreted as an estimate of the probability $P(Y \mid X)$, the probability of finding the RHS of the rule in transactions under the condition that these transactions also contain the LHS.

2.3 Lift

The lift of a rule is defined as:

$$\mathrm{lift}(X \Rightarrow Y) = \frac{\mathrm{supp}(X \cup Y)}{\mathrm{supp}(X) \times \mathrm{supp}(Y)}$$

or the ratio of the observed support to that expected if X and Y were independent. For the rule $\{\text{milk}, \text{bread}\} \Rightarrow \{\text{butter}\}$, the lift is $0.2 / (0.4 \times \mathrm{supp}(\{\text{butter}\}))$.

3. Illustration

Consider the following scenario. A store wants to examine its customer base and to understand which of its products tend to be purchased together. It has chosen to conduct a market-basket analysis of a sample of its customer base. This information might help the store make decisions such as when to distribute coupons, when to put a product on sale, or how to present items in store displays.

The ASSOCS data set lists the grocery products that are purchased by 1,001 customers. Twenty possible items are represented:

Table 1. Selected Variables in the ASSOCS Data Set

Code        Product
apples      apples
artichok    artichokes
avocado     avocado
baguette    baguettes
bordeaux    wine
bourbon     bourbon
chicken     chicken
coke        cola
corned_b    corned beef
cracker     cracker
ham         ham
heineken    beer
herring     fish
ice_crea    ice cream
olives      olives
peppers     peppers
sardines    sardines
soda        soda
steak       steak
turkey      turkey

Seven items were purchased by each of 1,001 customers, which yields 7,007 rows in the data set. Each row of the data set represents a customer-product combination. In most data sets, not all customers have the same number of products.

To perform the association analysis, follow these steps.

3.1 Create a Process Flow Diagram

1. Create a data source ASSOCS by using the SAS sample data set called SAMPSIO.ASSOCS.
2. Select the ASSOCS data set from the SAMPSIO library.
3. Click the Variables tab. Using either the Basic or the Advanced Metadata Advisor, assign the following roles to the variables:
4. Set the model role for CUSTOMER to Id.
5. Set the model role for PRODUCT to Target.
6. Set the model role for TIME to Rejected.
   Note: TIME is a variable that identifies the sequence in which the products were purchased. In this example, all of the products were purchased at the same time, so the order relates only to the order in which they are priced at the register. When order is taken into account, association analysis is known as sequence analysis. Sequence analysis is not demonstrated here.
7. Close and save changes to the Input Data Source node.
8. Assign the data source the role of Transaction in the Data Source Attributes window of the Data Source Wizard and save SAMPSIO.ASSOCS.
9. Add the data source SAMPSIO.ASSOCS to your diagram workspace.
10. Add an Association node to the diagram workspace and connect it to the data source ASSOCS.
11. Change the Maximum Items property to 2.
12. Run the Association node. After the node runs successfully, open the Results window.

Support (%) is the percentage of customers who have all the items that are involved in the rule. For example, 36.56% of the 1,001 customers purchased crackers and beer (rule 1), and 25.57% purchased olives and herring (rule 7).

Consider the Confidence (%) column. Confidence (%) represents the percentage of customers who have the right-hand side (RHS) item among those who have the left-hand side (LHS) item. For example, of the customers who purchased crackers, 75% purchased beer (rule 2). Of the customers who purchased beer, however, only 61% purchased crackers (rule 1).
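To make the Support (%) and Confidence (%) columns concrete, the following Python sketch computes both for two-item rules from rows in the same customer-product layout as ASSOCS. It is only an illustration of the definitions, not a reproduction of what the Association node does internally. The rows are made-up values (they are not taken from SAMPSIO.ASSOCS), and the function name `pair_rules` is hypothetical.

```python
from collections import defaultdict
from itertools import permutations

# Hypothetical (customer, product) rows, one row per customer-product
# combination; values are invented for illustration.
rows = [
    (1, "cracker"), (1, "heineken"), (1, "olives"),
    (2, "cracker"), (2, "heineken"),
    (3, "heineken"), (3, "herring"),
    (4, "cracker"), (4, "olives"),
]

def pair_rules(rows):
    """Support (%) and confidence (%) for every rule LHS => RHS with one item per side."""
    # Group the long-format rows into one basket (set of products) per customer.
    baskets = defaultdict(set)
    for customer, product in rows:
        baskets[customer].add(product)
    n_customers = len(baskets)

    # Count customers per single item and per ordered pair of items.
    count = defaultdict(int)
    for basket in baskets.values():
        for item in basket:
            count[(item,)] += 1
        for lhs, rhs in permutations(sorted(basket), 2):
            count[(lhs, rhs)] += 1

    rules = {}
    for key, c in count.items():
        if len(key) == 2:
            lhs, rhs = key
            rules[(lhs, rhs)] = {
                "support_pct": 100.0 * c / n_customers,
                "confidence_pct": 100.0 * c / count[(lhs,)],
            }
    return rules

rules = pair_rules(rows)
for (lhs, rhs), m in sorted(rules.items()):
    print(f"{lhs} => {rhs}: support={m['support_pct']:.2f}% "
          f"confidence={m['confidence_pct']:.2f}%")
```

As in the crackers and beer example, a rule and its reverse share the same support but generally have different confidence, because the denominator is the number of customers who bought the LHS item.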
Lift, in the context of association rules, is the ratio of the confidence of a rule to the confidence it would have if the RHS were independent of the LHS (that is, to the support of the RHS). Consequently, lift is a measure of association between the LHS and RHS of the rule. Values that are greater than one represent a positive association between the LHS and RHS. Values that are equal to one represent independence. Values that are less than one represent a negative association between the LHS and RHS.

Click the LIFT column with the right mouse button and select

The lift for rule 1 indicates that a customer who buys peppers and avocados is about 5.67 times as likely to purchase sardines and apples as a customer taken at random. Support (%) for this rule, unfortunately, is very low (8.99%), indicating that the event in which all four products are purchased together is a relatively rare occurrence.
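The interpretation of lift described above can be expressed as a small Python helper. This is a sketch under stated assumptions: the function names `lift` and `describe` and the numeric inputs are hypothetical, and the supports are given as fractions rather than percentages.

```python
def lift(support_both, support_lhs, support_rhs):
    """Lift of LHS => RHS: confidence of the rule divided by the support of the RHS."""
    confidence = support_both / support_lhs
    return confidence / support_rhs

def describe(lift_value, tol=1e-9):
    """Map a lift value to the interpretation given in the text."""
    if abs(lift_value - 1.0) <= tol:
        return "independent"
    if lift_value > 1.0:
        return "positive association"
    return "negative association"

# Hypothetical supports (fractions of all customers), for illustration only.
value = lift(support_both=0.2, support_lhs=0.4, support_rhs=0.25)
print(value, describe(value))
```

A tolerance is used for the "equal to one" case because supports computed from counts are floating-point values and rarely compare exactly equal.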
