Unsupervised Learning - Association Analysis

Section 1 Introduction
Section 2 Market Basket Analysis
Section 3 Using Association Node
Section 4 Understanding Association Rules
Section 5 Association Analysis for Non-Binary Variables
Section 6 Disassociation Analysis
Section 7 Sequential Association Analysis
Section 8 Case Study
Appendix 1 Apriori Algorithm
Appendix 2 SAS Code for Example 3
Appendix 3 Data Set "Income"
Appendix 4 SAS Code for Disassociation Analysis
Appendix 5 Sample SAS Code for Case Study
Appendix 6 References

Section 1 Introduction

One significant advance in data mining at the end of the 20th century is the emergence of "association rules analysis" as a popular tool for mining very large commercial databases (say, more than 10^4 variables and more than 10^8 observations). Association mining attempts to construct simple "rules" (descriptive statistics) that describe regions of relatively high density in such a database. When all variables in the database are binary, association rules analysis is also referred to as "market basket analysis". For example, consider the sales database of an on-line bookstore, where the objects represent customers and the attributes represent authors and/or books. The rules to be discovered are the sets of books most frequently bought together by the customers. An example could be: "15% of the people who buy Dorian Pyle's Data Preparation for Data Mining also buy Data Mining Techniques by Berry and Linoff." Retail stores can use the knowledge discovered by the analysis for better shelf placement, cross marketing, catalog design, consumer segmentation, and so on. Although association analysis originated in the retail industry, it can be applied in other industries as well; for example, it has been used to predict faults in telecommunication networks.

In this session, we discuss the theoretical foundation of market basket analysis in Section 2. In Section 3, we use a small commercial banking data set to illustrate how to use the Association node in Enterprise Miner to obtain association rules. Since association analysis typically produces a very large number of rules, understanding these rules poses a challenging data analysis task; we address this issue in Section 4. In Section 5, we extend association analysis to a data set with non-binary variables. We address disassociation analysis and sequential association analysis in Sections 6 and 7, respectively. We conclude the session with a case study showing how the miner can identify interesting rules in an association exercise.

Section 2 Market Basket Analysis

Suppose X = (X1, X2, ..., Xp) is the set of variables in a given database. The goal of association analysis is to find a collection of X-values {v_l | l = 1, 2, ..., L} such that each member of the collection occurs relatively frequently in the database. Stated this way, the goal is unattainable. Suppose, for instance, that there are 13 features and each feature can take 10 values; then there are 10^13 different value combinations, and searching through all of them is computationally infeasible ("NP-hard", to be precise). Even if we could search the entire database for all possible value combinations, each probability Pr(v_l) is typically very small, so its estimate, the fraction of all observations for which X = v_l, is unreliable. Thus, we need to modify our goal.
One possible modification leads to "market basket analysis". Market basket analysis was first introduced by Agrawal et al. (1993). It is a simple but useful unsupervised data mining tool for finding rule-based patterns. In market basket analysis, we convert each variable in the database to binary-valued dummy variables. The number of dummy variables is

   K = Σ_{j=1}^{p} S_j,

where S_j is the number of distinct values attainable by X_j. Each dummy variable is assigned the value Z_k = 1 if the variable with which it is associated takes on the corresponding value, and Z_k = 0 otherwise. The goal is then to find a subset of the integers κ ⊂ {1, 2, ..., K} such that

   Pr[ ∩_{k∈κ} (Z_k = 1) ] = Pr[ Π_{k∈κ} Z_k = 1 ]    (1)

is large. The subset κ is called an item set, and the number of variables Z_k in the item set is called its size (the size is at most the number of variables in the database). The estimate of (1) is taken to be the fraction of observations in the database for which the item set occurs:

   Pr^[ Π_{k∈κ} Z_k = 1 ] = (1/N) Σ_{i=1}^{N} Π_{k∈κ} z_{ik},    (2)

where z_{ik} is the value of Z_k for the i-th case. The estimated probability (2) of an item set is called the support, S(κ), of the item set. In market basket analysis, one typically specifies a lower bound s for the support, and the analysis seeks all item sets whose support in the database exceeds this bound, i.e., {κ_l | S(κ_l) > s}.

Discovering all item sets whose support exceeds a pre-specified lower bound in a very large database is a challenging task. The search space is exponential in the number of database attributes, and with millions of database objects the problem of I/O minimization becomes paramount. Many algorithms, such as the Apriori algorithm (Agrawal et al., 1995a, 1995b), have been developed for finding the item sets in very large transaction databases. Since this book focuses on how to apply association analysis to solve real business problems, readers interested in the development of association analysis algorithms are referred to references such as Dunham (2003).

The Apriori algorithm (Appendix 1) represents one of the major advances in data mining technology. It decomposes the problem of mining association rules into two parts. The first part identifies all frequent item sets, i.e., all item sets with frequency greater than a given threshold. Each frequent item set is then partitioned into two disjoint subsets. For example, if (X1, X2, X3, X4) is a frequent item set, the algorithm may partition it into the disjoint subsets (X1, X2) and (X3, X4) and write the association rule as (X1, X2) => (X3, X4). The first item set, (X1, X2), is called the "antecedent" and the second item set, (X3, X4), is called the "consequent". The association rule is characterized by several properties based on the prevalence (support) of the antecedent and consequent item sets in the database.

Association rules are written in the form θ ⇒ φ (the left-hand side implies the right-hand side), where θ (the left-hand side) is an item set and φ (the right-hand side) is another item set.

Support (Prevalence): The support of the rule θ ⇒ φ, denoted S(θ ⇒ φ), is the fraction of the observations containing both the antecedent and the consequent, i.e.,

   S(θ ⇒ φ) = fr(θ ∩ φ)/N = Pr(θ ∩ φ).

It can be viewed as the probability of simultaneously observing both item sets in a randomly chosen basket.
Confidence (Predictability): The confidence of the rule θ ⇒ φ, denoted C(θ ⇒ φ), is the support of the rule divided by the support of its antecedent, i.e.,

   C(θ ⇒ φ) = fr(θ ∩ φ)/fr(θ) = Pr(θ ∩ φ)/Pr(θ) = Pr(φ | θ).

It can be viewed as the conditional probability, Pr(φ | θ), that the right-hand side is present in a basket given that the left-hand side is.

Lift: The lift of the rule θ ⇒ φ, denoted L(θ ⇒ φ), is the confidence of the rule divided by the support of its consequent, i.e.,

   L(θ ⇒ φ) = C(θ ⇒ φ)/S(φ) = Pr(φ | θ)/Pr(φ).

It can be viewed as the ratio of the likelihood of finding the right-hand side in a basket that contains the left-hand side to the likelihood of finding the right-hand side in a random basket. If the lift of the rule θ ⇒ φ is two, then a customer having θ is twice as likely to have φ as a randomly chosen customer.

To illustrate how these definitions work, we use the following example.

Example 1
Suppose a supermarket sells five products; the items bought in each of ten transactions are shown in Table 1.

Table 1
Transaction   Items Bought
001           Bread
002           Bread, Jelly, Juice, Milk
003           Bread, Juice, Beer
004           Juice
005           Jelly, Juice, Milk
006           Bread, Jelly, Juice
007           Bread, Juice, Milk
008           Jelly, Juice, Beer
009           Bread, Milk
010           Jelly, Juice, Beer

(a) Create five dummy variables to represent the transactions in Table 1.
(b) List the item sets of size less than or equal to 3.
(c) Calculate the support of each item set found in (b).
(d) Find the support, confidence, and lift for the rule "Bread => Juice".
(e) Find the support, confidence, and lift for the rule "Jelly => Juice".

<Solutions>:
(a)
Transaction   X1 (Bread)   X2 (Jelly)   X3 (Juice)   X4 (Milk)   X5 (Beer)
001           1            0            0            0           0
002           1            1            1            1           0
003           1            0            1            0           1
004           0            0            1            0           0
005           0            1            1            1           0
006           1            1            1            0           0
007           1            0            1            1           0
008           0            1            1            0           1
009           1            0            0            1           0
010           0            1            1            0           1

A "1" in the table indicates that the item was purchased in the transaction and a "0" indicates that it was not, where X1, X2, X3, X4, and X5 represent Bread, Jelly, Juice, Milk, and Beer, respectively.

(b) and (c) The supports for item sets of size one, two, and three are as follows:

Item Set        Support     Item Set                 Support
Bread           0.6         Juice, Beer              0.3
Jelly           0.5         Milk, Beer               0
Juice           0.8         Bread, Jelly, Juice      0.2
Milk            0.4         Bread, Jelly, Milk       0.1
Beer            0.3         Bread, Jelly, Beer       0
Bread, Jelly    0.2         Bread, Juice, Milk       0.2
Bread, Juice    0.4         Bread, Juice, Beer       0.1
Bread, Milk     0.3         Bread, Milk, Beer        0
Bread, Beer     0.1         Jelly, Juice, Milk       0.2
Jelly, Juice    0.5         Jelly, Juice, Beer       0.2
Jelly, Milk     0.2         Jelly, Milk, Beer        0
Jelly, Beer     0.1         Juice, Milk, Beer        0
Juice, Milk     0.3

(d) The support, confidence, and lift for the rule "Bread => Juice" are 0.4, 0.67, and 0.83, respectively. Since the lift is below one (0.83), customers who have already bought bread are less likely than average to buy juice.

(e) The support, confidence, and lift for the rule "Jelly => Juice" are 0.5, 1.0, and 1.25, respectively. Thus, customers who have already bought jelly are 1.25 times as likely as average to buy juice.
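These calculations are easy to reproduce directly. The following minimal SAS sketch (added for illustration; the data set and variable names are our own, not from the original text) recomputes part (e) from the dummy-variable table in (a):

data work.basket;
   input transaction $ bread jelly juice milk beer;
   datalines;
001 1 0 0 0 0
002 1 1 1 1 0
003 1 0 1 0 1
004 0 0 1 0 0
005 0 1 1 1 0
006 1 1 1 0 0
007 1 0 1 1 0
008 0 1 1 0 1
009 1 0 0 1 0
010 0 1 1 0 1
;
run;

/* The mean of a 0/1 variable is a relative frequency, so the support,
   confidence, and lift of Jelly => Juice are simple ratios of means. */
proc sql;
   select mean(jelly*juice)                   as support,
          mean(jelly*juice) / mean(jelly)     as confidence,
          calculated confidence / mean(juice) as lift
   from work.basket;
quit;

Running this prints support 0.5, confidence 1.0, and lift 1.25, in agreement with part (e).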
Section 3 Using Association Node

There are many commercial products, such as Intelligent Miner from IBM Corporation and Clementine from SPSS Inc., that can be used to perform association analysis. Here we use only the Association node in Enterprise Miner, illustrated with Example 2. The data used in Example 2 are from a financial services company that offers thirteen banking products/services. The descriptions of these products/services are given in Table 2.

Table 2 Variable Descriptions for Data Used in Example 2
Variable   Description
ATM        Automated Teller Machine Debit Card
AUTO       Automobile Installment Loan
CCRD       Credit Card
CD         Certificate of Deposit
CKCRD      Check/Debit Card
CKING      Checking Account
HMEQLC     Home Equity Line of Credit
IRA        Individual Retirement Account
MMDA       Money Market Deposit Account
MTG        Mortgage
PLOAN      Personal/Consumer Installment Loan
SVG        Savings Account
TRUST      Personal Trust Account

Example 2
This example shows how to use the Association node in Enterprise Miner. The data set BANK.SAS7BDAT is in transaction format with three variables: ACCOUNT, the customer account number, which serves as an ID variable; SERVICE, the product/service held, taking the values listed in Table 2; and VISIT, the order in which the services were opened. Two nodes, a Data Source node and an Association node, are needed to perform association analysis in Enterprise Miner.

Complete Diagram: Data Source node => Association node.

Data Source: After selecting the SAS data set BANK.SAS7BDAT from a SAS library, we need to make the following changes.

Data Source Wizard - Step 5 of 6, Column Metadata:
• Set the model role of variable ACCOUNT to "ID"
• Set the model role of variable SERVICE to "Target"
• Set the model role of variable VISIT to "Sequence"

Data Source Wizard - Step 6 of 6, Data Source Attributes:
• Set the data role to "Transaction"

Association Node: Set Use to "No" for the variable VISIT, since we want to perform market basket analysis.

Note: Both "Association" (market basket) analysis and "Sequence" analysis are available in the Association node.
• Association: The node performs the "association" analysis discussed in this section.
• Sequences: The node performs "sequential association analysis" if the data set contains a variable with the model role "Sequence". Sequence analysis takes into account the order in which items are purchased when calculating the associations.

The Property Panel of the Association node includes the following options:
• Minimum Confidence Level, which specifies the minimum confidence required to generate a rule. The default level is 10%.
• Support Type, which specifies whether the analysis uses the Support Count or the Support Percentage property. The default setting is Percent.
• Support Count, which specifies a minimum level of support to claim that items are associated (that is, that they occur together in the database) when Support Type is Count. The default count is 2.
• Support Percentage, which specifies a minimum level of support to claim that items are associated when Support Type is Percent. The default percentage is 5%.
• Maximum Items, which determines the maximum size of the item sets to be considered. For example, the default of four indicates that at most 4 items will appear in a single association rule.

Note: If you are interested in associations involving fairly rare products, consider reducing the support count or percentage before you run the Association node. Conversely, if you obtain too many rules to be practically useful, raising the minimum support count or percentage is one possible remedy. A sketch of the expected input data layout follows.
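For reference, here is a hypothetical sketch (the account numbers and rows are invented for illustration) of the transaction-format layout the node expects: one row per account/service pair, with VISIT recording the order in which the services were opened.

data work.bank_sample;
   input account $ service $ visit;
   datalines;
500026 CKING 1
500026 SVG   2
500026 ATM   3
500075 CKING 1
500075 MMDA  2
500075 CCRD  3
;
run;

With the metadata roles set as above, the Association node treats each distinct ACCOUNT as one basket and each SERVICE value as an item in that basket.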
Some rules (sorted by lift) found with the default settings are shown in the Rules Table (View => Rules => Rules Table). The first rule found in the analysis is "CKCRD ==> CKING & CCRD" (Check/Debit Card => Checking Account and Credit Card). Its lift is 3.33: a customer who has a check/debit card is 3.33 times as likely to have both a checking account and a credit card with the bank as a customer chosen at random. Its support is 5.58%: 5.58% of the customers hold all three of "check/debit card", "checking account", and "credit card" with the bank. Its confidence is 49.39%: about half of all customers with a check/debit card also have both the checking account and the credit card with the bank.

Suppose that we are interested in studying the behavior of customers with a check/debit card. "CKCRD ==> CKING" shows that all customers who have a check/debit card also have a checking account (confidence = 100%). However, only about half of them also have a credit card ("CKCRD ==> CCRD", confidence = 49.39%). Since a credit card is more profitable than a checking account, and these customers are 3.19 times as likely as a randomly chosen customer to hold a credit card (the lift of "CKCRD ==> CCRD"), we can design a special program to target this group of customers.

Section 4 Understanding Association Rules

The focus of the association analysis exercise is to identify rules that are unexpected and potentially useful among the thousands (or millions) of association rules found in a typical application. Several points should be kept in mind.

First, a rule with very low support will not be discovered. Although thousands of rules are reported in a typical association exercise, many more rules go undetected because they have either low support or low confidence, and some of these undetected rules may be very useful. For example, a high-confidence rule such as "caviar implies vodka" will not be uncovered because of the low sales volume of caviar (i.e., low support).

Second, a rule with very high confidence is not necessarily interesting. For example, the rule "pregnancy implies female" has confidence close to 1 (any deviation from 1 reflecting only data quality), but it is not interesting at all because it is not unexpected.

Third, none of the three measures (support, confidence, and lift) alone is a good indicator of the usefulness of an association rule. For example, rules such as "peanut butter implies milk", "egg implies milk", and "bread implies milk" have high confidence and support because it is hard to find a shopping basket that contains no milk. These high-confidence rules are nevertheless of little value because they are already known to the data owner.

Fourth, a rule with very high lift is likely to be interesting if it also has high support.

Although questions such as "Which rules are potentially useful to the data owner?" and "Which rules are already well known to the data owner?" cannot be answered by the computer today, purely statistical criteria of interestingness, based on measures of association, can be very useful for ranking association rules. We address these criteria next.

Any association rule θ ⇒ φ can be arranged into a 2x2 contingency table, as shown in Table 3. The tilde (~) before a set denotes its complement.

Table 3
           φ Yes          φ No             Total
θ Yes      fr(θ ∩ φ)      fr(θ ∩ ~φ)       fr(θ)
θ No       fr(~θ ∩ φ)     fr(~θ ∩ ~φ)      fr(~θ)
Total      fr(φ)          fr(~φ)           N

Many probability-based measures of association in a 2x2 contingency table, such as the chi-square statistic and the j-measure (Hand, 2000), can be used to measure the strength of the association rules found.
The j-measure is defined as

   J(θ ⇒ φ) = P(θ) [ P(φ|θ) log( P(φ|θ) / P(φ) ) + (1 − P(φ|θ)) log( (1 − P(φ|θ)) / (1 − P(φ)) ) ],

where P(φ|θ) is the confidence of the rule θ ⇒ φ, and P(θ) and P(φ) are the supports of the item sets θ and φ, respectively. Although the j-measure is not available in software packages such as Enterprise Miner, it is not difficult to compute as long as the package reports the support, confidence, and lift of each rule found. Besides the chi-square statistic and the j-measure, we can also use SQL to query the output data set generated by the software in order to identify rules that satisfy given conditions. Example 3 demonstrates how to compute j-measures and perform an SQL query with the SAS Code node in Enterprise Miner after performing association analysis.

Example 3
(a) Compute the j-measure for the rules found in Example 2 and display the five rules with the highest j-measure.
(b) Find all rules that satisfy the following three conditions:
1. "Checking Account (CKING)" is the consequent.
2. The confidence is greater than 0.8.
3. The support is greater than 0.2.

<Solutions>:
(a) The SAS code can be found in Appendix 2; the five rules with the highest j-measures are in the following table.

Rule                       J-Measure   Confidence (%)   Support (%)   Lift   Expected Confidence (%)
CKCRD ==> CKING & CCRD     0.037306    49.39            5.581         3.33   14.85
CKCRD ==> CCRD             0.035425    49.39            5.581         3.19   15.48
CKING & CKCRD ==> CCRD     0.035425    49.39            5.581         3.19   15.48
CKING & CCRD ==> CKCRD     0.034485    37.57            5.581         3.33   11.30
CCRD ==> CKCRD             0.032364    36.05            5.581         3.19   11.30

(b) The SAS code can be found in Appendix 2; the rules found are in the following table.

Rule                       Confidence   Support   Lift
SVG & ATM ==> CKING        0.97         0.25      1.13
ATM ==> CKING              0.94         0.36      1.10
SVG ==> CKING              0.88         0.54      1.02
CD ==> CKING               0.86         0.21      1.00

Another popular measure of the strength of an association rule is the chi-square statistic. We use the following example to illustrate its use.

Example 4
The data in Table 4 form the 2x2 contingency table for the rule "Savings Account => Checking Account".

Table 4
                        Checking: No   Checking: Yes   Total
Savings Account: No     500            3,500           4,000
Savings Account: Yes    1,000          5,000           6,000
Total                   1,500          8,500           10,000

(a) Find the support, confidence, and lift of the association rule "(savings) ⇒ (checking)".
(b) Do "Savings Account" and "Checking Account" appear to be independent based on the chi-square statistic?

<Solutions>:
(a)
S(savings ⇒ checking) = fr(savings and checking)/N = 5000/10000 = 0.5.
C(savings ⇒ checking) = S(savings ⇒ checking)/S(savings) = 0.5/0.6 = 0.83.
L(savings ⇒ checking) = C(savings ⇒ checking)/S(checking) = (0.5/0.6)/0.85 = 0.98.
Since the lift is slightly below one, people who already own savings accounts are slightly less likely than average to have checking accounts.

(b)
Table 5 Frequencies under the Independence Assumption
                        Checking: No   Checking: Yes   Total
Savings Account: No     600            3,400           4,000
Savings Account: Yes    900            5,100           6,000
Total                   1,500          8,500           10,000

The chi-square statistic for independence is

   χ² = (500 − 600)²/600 + (3500 − 3400)²/3400 + (1000 − 900)²/900 + (5000 − 5100)²/5100
      = 16.67 + 2.94 + 11.11 + 1.96
      = 32.68.

Since 32.68 is much greater than the critical value χ²(1, 0.05) = 3.841, we reject the null hypothesis at α = 0.05. This means that "checking account" and "savings account" are significantly associated.
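The hand computation can be checked with PROC FREQ. A minimal sketch (added for illustration; the data set and variable names are our own) reproduces the test from the cell counts in Table 4:

data work.accounts;
   input savings $ checking $ count;
   datalines;
No  No   500
No  Yes 3500
Yes No  1000
Yes Yes 5000
;
run;

proc freq data=work.accounts;
   tables savings*checking / chisq;   /* chi-square test of independence */
   weight count;                      /* each row stands for 'count' customers */
run;

PROC FREQ reports the same chi-square statistic, 32.68, together with its p-value.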
Since the p-value of this test, 1.09 x 10^-8, is very small, the evidence of association between the savings and checking accounts is very strong.

Note:
1. High confidence and support do not imply cause and effect. The rule is not necessarily interesting, and the two items might not even be related. In this example, however, the checking and savings accounts are strongly related.
2. Based on these measures, "Savings Account => Checking Account" might be considered a strong rule. However, customers without a savings account are even more likely to have a checking account: C(not savings ⇒ checking) = 3500/4000 = 0.875. This means the savings and checking accounts are in fact negatively associated.

Section 5 Association Analysis for Non-Binary Variables

The Association node can also be used to perform association analysis when the data set contains continuous and categorical variables. The data set income.txt, taken from Hastie et al. (2002), is used to illustrate how to perform association analysis on non-binary variables with the Association node. A detailed description of this data set can be found in Appendix 3.

Example 5 Perform association analysis on the "income" data set.

Step 1: The variable list and descriptions are in Appendix 3.

Step 2: Data preparation for association analysis. Since the raw data are not ready to be used by the Association node, we must first prepare them. Although there are many ways to prepare the data depending on the purpose of the analysis, we use the following steps for illustration:
• Remove observations with missing values;
• Cut each ordinal variable at its median and code it with two dummy variables;
• Use k dummy variables to represent each categorical variable with k categories.
A hypothetical sketch of this recoding appears below.
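The following minimal SAS sketch (added for illustration; it assumes the numeric codes listed in Appendix 3 and, to stay short, recodes only three of the fourteen variables) shows one way to produce the transaction-format recoding:

data work.income_recode(keep=id item);
   set work.income;                      /* raw survey data, coded as in Appendix 3 */
   length item $40;
   if nmiss(income, gender, age) > 0 then delete;   /* drop missing values */
   id = _n_;
   /* ordinal variable cut at its median: codes 1-5 are below $30,000 */
   if income <= 5 then item = 'Income < $30,000';
   else                item = 'Income >= $30,000';
   output;
   /* nominal variable: one dummy item per category */
   if gender = 1 then item = 'Male';
   else               item = 'Female';
   output;
   /* ordinal variable: codes 1-3 are ages 34 and younger */
   if age <= 3 then item = '34 and Younger';
   else             item = '35 and Over';
   output;
run;

Each household then contributes one row per recoded item, which is the transaction format the Association node expects.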
The prepared data set is called "income_recode". It contains 123,208 rows in transaction format, one row per person-item pair; the count is below 8,993 x 14 because some values in income.txt are missing. The distribution of the recoded data set is shown in Table 6.

Table 6 Distribution of the Recoded Income Data
Variable Category   Variable Content (Target Variable)   Frequency   Percent
Gender              Female                               4,918       3.992%
Gender              Male                                 4,075       3.307%
Income              Income < $30,000                     4,722       3.833%
Income              Income >= $30,000                    4,271       3.466%
Marry Status        Divorced or separated                875         0.710%
Marry Status        Living together, not married         668         0.542%
Marry Status        Married                              3,334       2.706%
Marry Status        Single, never married                3,654       2.966%
Marry Status        Widowed                              302         0.245%
Age                 34 and Younger                       5,256       4.266%
Age                 35 and Over                          3,737       3.033%
Education           College Graduate                     2,490       2.021%
Education           Non-College Graduate                 6,417       5.208%
Occupation          Clerical/Service Worker              1,062       0.862%
Occupation          Factory Worker/Laborer/Driver        767         0.623%
Occupation          Homemaker                            650         0.528%
Occupation          Military                             272         0.221%
Occupation          Professional/Managerial              2,820       2.289%
Occupation          Retired                              690         0.560%
Occupation          Sales Worker                         770         0.625%
Occupation          Student, HS or College               1,489       1.209%
Occupation          Unemployed                           337         0.274%
Years in Bay Area   Less Than Ten Years                  8,080       6.558%
Dual Income         Dual Income                          2,211       1.795%
Dual Income         Dual Income Status: Not Married      5,438       4.414%
Dual Income         Single Income                        1,344       1.091%
# of Family         Four or More Family Members          2,664       2.162%
# of Family         Three or Less Family Members         5,954       4.832%
# of Kids           Kids in Family                       3,269       2.653%
# of Kids           No Kids in Family                    5,724       4.646%
Household Status    Live with Parents/Family             1,827       1.483%
Household Status    Own Home                             3,256       2.643%
Household Status    Rent Home                            3,670       2.979%
Home Type           Live in Apartment                    2,373       1.926%
Home Type           Live in Condominium                  655         0.532%
Home Type           Live in House                        5,073       4.117%
Home Type           Live in Mobile Home                  151         0.123%
Home Type           Other Home Type                      384         0.312%
Ethnic              American Indian                      150         0.122%
Ethnic              Asian                                477         0.387%
Ethnic              Black                                910         0.739%
Ethnic              East Indian                          18          0.015%
Ethnic              Hispanic                             1,231       0.999%
Ethnic              Other                                225         0.183%
Ethnic              Pacific Islander                     103         0.084%
Ethnic              White                                5,811       4.716%
Language            Speak English at Home                7,794       6.326%
Language            Speak Other Language at Home         261         0.212%
Language            Speak Spanish at Home                579         0.470%

Step 3: Association analysis with the Association node. We use the default options in the Association node, set "ID" as the ID variable and "Condition" as the target variable, and obtain 89,154 rules. The Statistics Plot, Statistics Line Plot, Rule Matrix, and Output windows summarize the results.

The Statistics Line Plot graphs the lift, expected confidence, confidence, and support of each rule by rule index number. Consider the rule A ⇒ B. Recall that the
• Support of A ⇒ B is the probability that a customer has both A and B.
• Confidence of A ⇒ B is the probability that a customer has B given that the customer has A.
• Expected confidence of A ⇒ B is the probability that a customer has B.
• Lift of A ⇒ B is a measure of the strength of the association: the confidence divided by the expected confidence. If the lift is 2 for the rule A ⇒ B, then a customer having A is twice as likely to have B as a customer chosen at random.

To view the descriptions of the rules, select View => Rules => Rule description.

The Rule Matrix plots the rules with the items on the left-hand side of each rule on one axis and the items on the right-hand side on the other. The points are colored according to the confidence of the rules; for example, the rules with the highest confidence appear in the column indicated by the cursor in the screen shot.
Using the ActiveX features of the graph, you can discover that these rules all have "Live with Parents/Family" on the right-hand side. To view the link graph, select View => Rules => Link Graph.

The link graph displays association results using nodes and links. The size and color of a node indicate the transaction counts in the Rules data set: larger nodes have greater counts. The color and thickness of a link indicate the confidence of a rule: the thicker the link, the higher the confidence.

Suppose you are particularly interested in the associations that involve "College Graduate". One way to examine them visually in the link graph is to select the nodes whose label contains "College Graduate" and then show only the links involving the selected nodes:
1. Right-click in the Link Graph window and then click Select....
2. In the Selection Dialog window, change the options to select the nodes where the label contains "College Graduate".
3. Select OK.
4. In the link graph, the nodes with "College Graduate" are now selected. Right-click in the link graph and deselect Show all links.

Step 4: Understanding the results.

First, no one-item set with support less than 0.73% can appear in any rule, because such item sets are excluded from the analysis. (There are 8,993 customers; under the default settings a one-item set must cover at least 10% of the customers, i.e., 899 customers, and 899/123,208 ≈ 0.73% of the rows.) For example, no association rule with "Speak Spanish at Home" on either side can be found, because the support for "Speak Spanish at Home" is only 0.47%, even though there might be interesting high-lift rules involving it. If we are indeed interested in finding rules containing relatively rare events such as "Speak Spanish at Home", we can do so either by combining categories or by oversampling.

Second, no single measure (support, confidence, lift, j-measure, or chi-square statistic) alone can be used to rank the rules. For example, a rule whose confidence is close to 1 can nevertheless have a very low j-measure. Thus, we need SQL to search for the rules in which we are interested, and multiple measures to understand them. Table 7 shows five rules with the highest j-measure; these rules typically also have high lift and support.

Table 7 High j-Measure and High Lift Rules

Some interesting rules with "College Graduate" or "Non-College Graduate" on the right-hand side have also been identified using SQL.

Section 6 Disassociation Analysis

Sometimes we might want to conduct "negative association analysis". For example, we might want to know the behavior of customers without a money market account. The Association node can be used to perform disassociation analysis as well, except that the data need to be modified first. The SAS code in Appendix 4 converts a data set used for association analysis into one suitable for disassociation analysis. We can use the following steps, adding one SAS Code node between the Data Source node and the Association node, to perform disassociation analysis on "SVG", "CKING", and "MMDA".
Complete Diagram: Data Source node => SAS Code node => Association node.

Step 1: Add the data source as described in Section 3.

Step 2: Add a SAS Code node. Under Training SAS Code, select the Program tab, copy the SAS code in Appendix 4 into the program window, and make the following changes:
   %let values='SVG','CKING','MMDA';
   %let in=&EM_IMPORT_TRANSACTION;
   %let out=&EM_EXPORT_TRANSACTION;

Step 3: Add the Association node, keep the defaults, and run the diagram.

The rules found now include negations of the original variables. Notice that the confidence of the rule "~CKING & SVG ==> ~MMDA" is 95.12%: the overwhelming majority of customers who have a savings account but no checking account also have no money market account. This is an interesting finding, because a money market account is much more profitable for the bank than a checking account. An illustration of the data transformation follows.
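To see what the Appendix 4 code does, consider a hypothetical account that holds only CKING and SVG. Monitored products that the account lacks are added as negated items, while unmonitored products the account lacks are simply left out:

Before: CKING, SVG
After:  CKING, SVG, ~MMDA

An account holding none of the three monitored products would gain all three negated items ~SVG, ~CKING, and ~MMDA, in addition to whatever other services it already has.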
Section 7 Sequential Association Analysis

Sometimes the data include an additional dimension: the time associated with each transaction. For example, a customer might rent the Star Wars videos over several transactions: first "Star Wars", then "Empire Strikes Back", and then "Return of the Jedi", in three different visits, possibly renting other videos in between. Unless we study the data sequence at the customer level, we will not discover this pattern. If we are interested not only in the data at the transaction level but also at the customer level, we need to take this additional temporal dimension into account.

Mining sequential patterns was initially motivated by applications in the retailing industry, including attached mailings, add-on sales, and customer satisfaction, but the results apply to many scientific and business domains as well. For instance, in the medical domain, a data sequence may correspond to the symptoms or diseases of a patient, with a transaction corresponding to the symptoms exhibited or diseases diagnosed during one visit to the doctor. The patterns discovered could be used in disease research to help identify symptoms or diseases that precede certain other diseases.

Elements of a sequential pattern need not be single items. "Fitted sheet and flat sheet and pillow cases", followed by "comforter", followed by "drapes and ruffles" is an example of a sequential pattern in which the elements are sets of items. There are many algorithms, such as AprioriAll and SPADE (Zaki, 1998), that can be used to obtain sequential patterns. Since we focus here on how to use existing software to perform sequential association analysis, readers interested in algorithm development are referred to Dunham (2003).

Terminology
• Length: The length of a sequence is the number of item sets in the sequence. A sequence of length k is called a k-sequence. The sequence formed by the concatenation of two sequences x and y is denoted x.y.
• Support: The support for an item set is the fraction of customers who bought all the items of the item set in a single transaction; the count is thus at the transaction level rather than the customer level. Under this definition, an item set and a 1-sequence have the same support.
• Litemset: An item set with support greater than a given minimum support is called a large item set, or litemset. Each item set in a large sequence must have minimum support; hence, any large sequence is a list of litemsets.

Example 6
Consider a database with the customer sequences shown below. The customer sequences are in transformed form, where each transaction has been replaced by the set of litemsets contained in it and the litemsets have been replaced by integers. The minimum support is specified as 40% (i.e., 2 customer sequences).

Customer sequences: {1 5 2 3 4}, {1 3 4 3 5}, {1 2 3 4}, {1 3 5}, {4 5}

Find the maximal large sequences of litemsets.

<Solutions>: The maximal large sequences are 1 2 3 4, 1 3 5, and 4 5, each with support 0.4.

Example 7
To perform sequential association analysis with the Association node, the data need one "ID" variable, one "Sequence" variable, and one "Target" variable. In this example, we use the same data set, BANK. The options available in the Association node's Sequence panel are as follows:

• Chain Count: the maximum number of items to include in a sequence. The default value is 3 and the maximum is 10. To change the number of items in the longest chain, enter a new value in the entry field. If the specified number is larger than what is found in the data, the chain length in the results will be smaller than the specified value.
• Consolidate Time: enables you to collapse small sequence time differences. For example, suppose the sequence variable is measured in hours: a customer goes to the store at 7:00 AM to buy eggs and bacon, returns at 1:00 PM to buy several other items, makes an additional visit at 9:00 PM to buy cold medicine and a box of tissues, and returns two days later at noon to buy orange juice, chicken noodle soup, and more cold medicine. The sequence variable (VISIT) then has 4 entries. If you want to perform sequence discovery on a daily basis, enter 24 hours in the "Consolidate time differences <" field, so that multiple visits on the same day are consolidated into a single visit.
• Maximum Transaction Duration: by default, all possible sequences are identified. If you want to restrict sequences to a particular time window, you can specify a maximum transaction duration of, say, 3 months; a printer bought more than three months after a PC purchase then does not constitute a sequence.
• Support Type: specifies whether the sequence analysis uses the Support Count or Support Percentage property. The default setting is Percent.
• Support Count: the minimum frequency required to include a sequence in the analysis when Support Type is Count. Sequences with lower counts are excluded from the output. The default setting is 2.
• Support Percentage: the minimum level of support required to include a sequence when Support Type is Percentage. If a sequence has a frequency less than the specified percentage of the total number of transactions, it is excluded from the output. The default percentage is 2%; permissible values are real numbers between 0 and 100.

Since interpreting the results of sequential association analysis parallels the interpretation of ordinary association analysis, we do not present the results of the default settings here. A sketch of how sequence support can be computed directly follows.
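The idea behind sequence support can also be reproduced directly with a self-join. The following minimal sketch (added for illustration; it assumes the BANK data are in work.bank with the variables and uppercase service codes used in Example 2) counts the customers supporting the 2-sequence CKING ==> SVG:

proc sql;
   /* customers who opened CKING on an earlier visit than SVG */
   select count(distinct a.account) as n_sequence
   from work.bank as a, work.bank as b
   where a.account = b.account
     and a.service = 'CKING'
     and b.service = 'SVG'
     and a.visit < b.visit;

   /* total number of customers */
   select count(distinct account) as n_customers
   from work.bank;
quit;

The ratio of the two counts estimates the support of the sequence.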
Section 8 Case Study

Identifying interesting rules is the most important part of association analysis, since a typical analysis generates thousands of rules. In this case study, we introduce two ways to guide the rule-finding process, using the INCOME data set discussed in Section 5.

Decision Trees in Finding Rules: Suppose you want to identify rules with one or more specific items on the right-hand side. A decision tree can guide the search for such rules. For example, we can use "income" as the target variable to find rules that lead to "high income" or "low income" with the following steps.

Complete Diagram: Data Source node => SAS Code node => Metadata node => Tree node.

Data Source: Select the data set INCOME.

SAS Code Node: Add "high_income = (income > 6);" to the score code to split the variable INCOME into two categories, "$40,000 or more" and "less than $40,000".

Metadata Node: Set the newly created variable as the target variable, so that supervised learning techniques such as decision trees can be used to discover rules that lead to high income or low income.

Tree Node: Since we do not need to build an optimal model, we do not need to split the data into training and validation sets; the Data Partition node can be omitted. The only change is to set the Depth to 4, which restricts the maximum number of rules to 16 (= 2^4). Since the number of rules is so much smaller, we can examine all of them carefully and identify those that might be interesting. For example, the following rule is very interesting:

IF "Education" IS ONE OF "College Graduate", "Grad Study"
AND "Marry Status" IS ONE OF "Married", "Living together, not married"
AND "Household Status" IS ONE OF "Rent Home", "Live with Parents/Family"
AND "Occupation" EQUALS "Professional/Managerial"
THEN 65.4% of customers have income greater than $40,000.

The support, confidence, and lift for this rule are 161/8993 = 0.0179, 161/246 = 0.654, and (161/246)/(3161/8993) = 1.86, respectively. It should be noted that this is a generalized association rule that cannot be found by most simple association analysis packages, including the Association node in Enterprise Miner.

Converting an "Unsupervised Data Mining" Problem to a "Supervised Data Mining" Problem: An idea suggested by Hastie et al. (2001) can be used to identify rules of interest when the miner has no specific items in mind for the right-hand side. We compare the original data set with a simulated data set drawn from a uniform distribution: generate a simple random sample with uniformly distributed values, set the target to 0 for the simulated data, and set the target to 1 for the original data. Sample SAS code for this data preparation is in Appendix 5. Once the data are prepared, supervised data mining techniques such as decision trees can be used to identify rules that might be interesting. Since the maximum number of rules depends on the depth of the tree, far fewer rules are found than in a typical association analysis; for example, a binary tree of depth eight has at most 256 leaves, far fewer than the thousands of rules a typical association analysis produces. Rules found by trees can also treat multiple categories of a given variable as one item in an item set; we again refer to such rules as generalized association rules.
Since we have converted the problem into a supervised data mining problem, any technique for finding an optimal tree can be used here. We will discuss finding an optimal decision tree model in detail in the next session; here we present only selected interesting rules.

The first rule has high lift and very low support. Because the support is so small, most association analyses will not find this rule (or it will be hidden among thousands of others). The four items of this rule are "Speak Spanish at Home", "Home Type is either House or Apartment", "Hispanic", and "Two or Less Kids at Home". Several association rules can be derived from these four items. One interesting three-item rule is "Hispanic" and "Two or less kids at home" => "Speak Spanish at Home"; its support, confidence, and lift are 0.043, 0.378, and 5.87, respectively. Because of the low support, this rule was not found in Example 5. A four-item rule is "House or Apartment", "Hispanic", and "Two or less kids at home" => "Speak Spanish at Home", with support, confidence, and lift of 0.037, 0.347, and 5.400, respectively.

Another four-item rule found by the tree involves "White", "Mobile Home or Others", "Two or Less Kids at Home", and "Speak English at Home". One association based on these four items is "White", "Two or less kids", and "Live in Mobile Home or Others" => "Speak English at Home"; it has support 0.0406, confidence 0.793, and lift 1.228.

The last rule presented is "Speak English at Home", "Live in House", "Three or More Kids" => "Live in Bay Area for More Than Ten Years". This rule has very low support (0.0226), but the confidence (0.646) is high and the lift (1.12) is moderate.

Appendix 1 Apriori Algorithm

The Apriori algorithm (Agrawal et al., 1995a) for finding all frequent item sets is given below:

procedure AprioriAlg()
begin
   L1 := {frequent 1-itemsets};
   for ( k := 2; Lk-1 is not empty; k++ ) do {
      Ck := apriori-gen(Lk-1);      // generate new candidates from Lk-1
      for all transactions t in the dataset do {
         for all candidates c in Ck contained in t do
            c.count++;
      }
      Lk := { c in Ck | c.count >= min-support };
   }
   Answer := the union of all Lk;
end

The algorithm makes multiple passes over the database. In the first pass, it simply counts item occurrences to determine the frequent single-item sets (i.e., single-item sets with frequency greater than the threshold). A subsequent pass, say pass k, consists of two phases. First, the frequent item sets Lk-1 (the set of all frequent (k-1)-item sets) found in pass k-1 are used to generate the candidate item sets Ck with the apriori-gen() function: Lk-1 is joined with itself, and every item set in the join result that has some (k-1)-subset not in Lk-1 is deleted, yielding Ck. The algorithm then scans the database. For each transaction, it determines which of the candidates in Ck are contained in the transaction, using a hash-tree data structure, and increments the counts of those candidates. At the end of the pass, Ck is examined to determine which of the candidates are frequent, yielding Lk. The algorithm terminates when Lk becomes empty.
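To make the passes concrete, here is a short worked trace (added for illustration) of the algorithm on the Table 1 transactions of Example 1, with minimum support 0.3:

Pass 1: L1 = {Bread (0.6), Jelly (0.5), Juice (0.8), Milk (0.4), Beer (0.3)}.
Pass 2: C2 consists of all ten pairs of L1 items; counting their supports gives L2 = {Bread, Juice} (0.4), {Bread, Milk} (0.3), {Jelly, Juice} (0.5), {Juice, Milk} (0.3), {Juice, Beer} (0.3).
Pass 3: joining L2 with itself proposes the candidates {Bread, Juice, Milk} and {Juice, Milk, Beer}; the latter is pruned because its subset {Milk, Beer} is not in L2. The surviving candidate {Bread, Juice, Milk} has support 0.2 < 0.3, so L3 is empty and the algorithm stops with L1 and L2 as the frequent item sets.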
Appendix 2 SAS Code for Example 3

/* Example 3 Part (a) */
options nodate nocenter pageno=1 pagesize=54 linesize=80 nonumber;

data work.jmeasure;
   set &em_import_rules;
   conf    = conf/100;
   support = support/100;
   if conf = 1 then jm = .;   /* the j-measure is undefined when conf = 1 */
   else jm = (support/conf)*(conf*log(lift)
              + (1-conf)*log((1-conf)/(1-conf/lift)));
   label jm      = "J-Measure";
   label support = "Support";
   label conf    = "Confidence";
run;

proc sort data=work.jmeasure;
   by descending jm;
run;

proc sql inobs=5;
   select rule, jm, conf, support, lift, exp_conf
   from work.jmeasure;
quit;

/* Part (b) */
options linesize=132;

proc sql;
   select rule, conf, support, lift
   from work.jmeasure
   where (_RHAND = 'CKING' and conf > 0.8 and support > 0.2);
quit;

Appendix 3 Data Set "Income"

A total of N = 9,409 questionnaires containing 502 questions were filled out by shopping mall customers in the San Francisco Bay Area (Impact Resources, Inc., Columbus, OH, 1987). The data set "income" is an extract from this survey consisting of 14 demographic attributes. It is a good mixture of categorical and continuous variables with a lot of missing data. The variable list is as follows:

Alphabetic List of Variables and Attributes
#    Variable     Type   Len   Format    Informat
4    Age          Num    8     BEST12.   BEST32.
5    Education    Num    8     BEST12.   BEST32.
2    Gender       Num    8     BEST12.   BEST32.
1    Income       Num    8     BEST12.   BEST32.
7    Length       Num    8     BEST12.   BEST32.
3    Married      Num    8     BEST12.   BEST32.
6    Occupation   Num    8     BEST12.   BEST32.
8    dual         Num    8     BEST12.   BEST32.
13   ethnic       Num    8     BEST12.   BEST32.
12   home_type    Num    8     BEST12.   BEST32.
11   household    Num    8     BEST12.   BEST32.
14   language     Num    8     BEST12.   BEST32.
9    n_family     Num    8     BEST12.   BEST32.
10   n_kids       Num    8     BEST12.   BEST32.

The attribute information is as follows:

1 ANNUAL INCOME OF HOUSEHOLD (PERSONAL INCOME IF SINGLE)
   1. Less than $10,000
   2. $10,000 to $14,999
   3. $15,000 to $19,999
   4. $20,000 to $24,999
   5. $25,000 to $29,999
   6. $30,000 to $39,999
   7. $40,000 to $49,999
   8. $50,000 to $74,999
   9. $75,000 or more

2 SEX
   1. Male
   2. Female

3 MARITAL STATUS
   1. Married
   2. Living together, not married
   3. Divorced or separated
   4. Widowed
   5. Single, never married

4 AGE
   1. 14 thru 17
   2. 18 thru 24
   3. 25 thru 34
   4. 35 thru 44
   5. 45 thru 54
   6. 55 thru 64
   7. 65 and Over

5 EDUCATION
   1. Grade 8 or less
   2. Grades 9 to 11
   3. Graduated high school
   4. 1 to 3 years of college
   5. College graduate
   6. Grad Study

6 OCCUPATION
   1. Professional/Managerial
   2. Sales Worker
   3. Factory Worker/Laborer/Driver
   4. Clerical/Service Worker
   5. Homemaker
   6. Student, HS or College
   7. Military
   8. Retired
   9. Unemployed

7 HOW LONG HAVE YOU LIVED IN THE SAN FRAN./OAKLAND/SAN JOSE AREA?
   1. Less than one year
   2. One to three years
   3. Four to six years
   4. Seven to ten years
   5. More than ten years

8 DUAL INCOMES (IF MARRIED)
   1. Not Married
   2. Yes
   3. No

9 PERSONS IN YOUR HOUSEHOLD
   1. One
   2. Two
   3. Three
   4. Four
   5. Five
   6. Six
   7. Seven
   8. Eight
   9. Nine or more

10 PERSONS IN HOUSEHOLD UNDER 18
   0. None
   1. One
   2. Two
   3. Three
   4. Four
   5. Five
   6. Six
   7. Seven
   8. Eight
   9. Nine or more

11 HOUSEHOLDER STATUS
   1. Own
   2. Rent
   3. Live with Parents/Family

12 TYPE OF HOME
   1. House
   2. Condominium
   3. Apartment
   4. Mobile Home
   5. Other

13 ETHNIC CLASSIFICATION
   1. American Indian
   2. Asian
   3. Black
   4. East Indian
   5. Hispanic
   6. Pacific Islander
   7. White
   8. Other

14 WHAT LANGUAGE IS SPOKEN MOST OFTEN IN YOUR HOME?
   1. English
   2. Spanish
   3. Other
There are 8,993 observations in this data set, obtained from the original 9,409 instances by removing those with the response (Annual Income) missing (coded ".").

Appendix 4 SAS Code for Disassociation Analysis

%let values='SVG','CKING','MMDA';
%let in=&EM_IMPORT_TRANSACTION;
%let out=&EM_EXPORT_TRANSACTION;

proc sql;
   create table v56767c as
      select distinct %em_target
      from &in;

   create table r57304x as
      select distinct %em_id as %em_id,
             a.%em_target,
             '~'||a.%em_target as notvalue
      from &in, v56767c as a;

   create table &out as
      select b.%em_id,
             coalesce(a.%em_target, b.notvalue) as %em_target
      from &in as a
      right join r57304x as b
         on a.%em_id = b.%em_id and a.%em_target = b.%em_target
      where a.%em_target ~= '' or b.%em_target in (&values);
quit;

proc datasets library=work nolist;
   delete r57304x v56767c;
quit;

Appendix 5 Sample SAS Code for Case Study

/* Prepare a data set suitable for decision tree analysis */
data &mylib.income_tree (keep = income gender marry age education occupation
                                length dual n_family n_kids household
                                home_type ethnic language);
   set &mylib.income;
   array aa(14);
   do i = 1 to 14;
      aa(i) = uniform(0);
   end;
   gender     = int(aa(1)*2)+1;
   income     = int(aa(2)*9)+1;
   marry      = int(aa(3)*5)+1;
   age        = int(aa(4)*7)+1;
   education  = int(aa(5)*6)+1;
   occupation = int(aa(6)*9)+1;
   length     = int(aa(7)*5)+1;
   n_family   = int(aa(8)*9)+1;
   n_kids     = int(aa(9)*10);
   household  = int(aa(10)*3)+1;
   home_type  = int(aa(11)*5)+1;
   ethnic     = int(aa(12)*8)+1;
   language   = int(aa(13)*3)+1;
   dual       = int(aa(14)*3)+1;
run;

data &mylib.incomecombine;
   set &mylib.income (in=ina) &mylib.income_tree (in=inb);
   if ina then target = 1;
   else if inb then target = 0;
run;

Appendix 6 References

Agrawal, R., Imielinski, T. and Swami, A. (1993) "Mining Association Rules between Sets of Items in Large Databases", pp. 207-216, Proceedings of the ACM SIGMOD Conference on Management of Data, ACM Press: New York.

Agrawal, R., Mannila, H., Srikant, R., Toivonen, H. and Verkamo, A. I. (1995a) "Fast Discovery of Association Rules", in Advances in Knowledge Discovery and Data Mining, AAAI/MIT Press, Cambridge, MA.

Agrawal, R. and Srikant, R. (1995b) "Mining Sequential Patterns", pp. 3-14, Proceedings of the IEEE International Conference on Data Engineering.

Dunham, Margaret H. (2003) Data Mining: Introductory and Advanced Topics (Chapter 6), Prentice Hall: Upper Saddle River, New Jersey.

Hand, D., Mannila, H. and Smyth, P. (2001) Principles of Data Mining (Chapter 13), MIT Press, Cambridge, Massachusetts.

SAS Institute (2000) Enterprise Miner: Applying Data Mining Techniques - Course Notes (Chapter 7), SAS Institute, Cary, NC.

Hastie, T., Tibshirani, R. and Friedman, J. (2002) The Elements of Statistical Learning: Data Mining, Inference, and Prediction (Chapter 14), Springer, New York.

Zaki, M. J. (1998) "Efficient Enumeration of Frequent Sequences", pp. 68-75, Proceedings of the ACM CIKM Conference.

Exercise:

Problem 1
For any given rule A => B:
(a) Prove that the j-measure can be computed from three quantities: support, confidence, and lift.
(b) Prove that the chi-square statistic for independence can be computed from four quantities: support, confidence, lift, and the sample size.

Problem 2
(a) Write a SAS program to convert "income.txt" into a SAS data set that can be used by the Association node.
You can use the median to split each ordinal variable into two dummy variables, and use one dummy variable per category for each nominal variable.
(b) Perform an association analysis with the Association node under the following conditions:
• Find rules with 5 or fewer items;
• The minimum support used to generate the rules is 10%;
• The minimum confidence used to generate the rules is 10%.
(c) Use a SAS Code node to compute the j-measure, combined with PROC SQL, to identify five interesting rules.

Problem 3
Can you find a rule "A => B" that has S(A ⇒ B) = 0.8, C(A ⇒ B) = 0.75, and L(A ⇒ B) = 1.5? Please explain.

Problem 4
Implement the Apriori and AprioriAll algorithms.

Problem 5
One weakness of the Apriori algorithm is that it assumes the database can reside completely in memory. Research algorithms that do not make this assumption.

Problem 6
Research newer association analysis algorithms, such as incremental updating approaches that identify association rules without re-running the algorithm from the beginning each time the database changes.