Relational Association Rule Mining in Market Basket using the RoloDex Model with pTree

Arijit Chatterjee, Mohammad Hossain, Arjun G. Roy, William Perrizo
Department of Computer Science, North Dakota State University, Fargo, North Dakota 58102
{arijit.chatterjee, mohammad.hossain, arjun.roy, william.perrizo}@ndsu.edu

Abstract

In this paper we are concerned with how the RoloDex model can be used to find relational association rules between different entities in Market Basket research using pTrees. The significance of association rules is measured via support and confidence, which are used to identify the rules in particular transactions. In this paper, however, we extrapolate that notion by extending it to multiple entities and multiple relations using the RoloDex model, a fairly new concept introduced in this paper. We structure the paper by initially providing background on the use of support and confidence in Market Basket Analysis, then introducing the RoloDex model and pTrees, and finally showing how the RoloDex model can be used in Market Basket research with pTrees to find multiple relationships between multiple entities.

1. INTRODUCTION

Since its introduction in 1993 by Agarwal et al. [1][2], association rule mining has continuously received a great deal of attention from the database research community. Association Rule Mining (ARM) is the data-mining process of finding interesting association and/or correlation relationships among large sets of data items. The original motivation for discovering association rules came from the need to analyze supermarket transactions in what is known as Market Basket Research (MBR), where analysts are interested in examining customer shopping patterns in terms of the purchased products. Market basket databases consist of a large number of transactional records.
In addition to the transaction identifier, each record lists all the items bought by a customer during a single visit to the store. Knowledge workers are typically interested in finding out which groups of items are consistently purchased together. Such knowledge could be useful in many business decision-making processes, such as adjusting store layouts (e.g., placing products optimally with respect to each other), running promotions, designing catalogs, and identifying potential customer segments as targets for marketing campaigns.

[Footnote 1: We acknowledge financial support for this research from a Department of Energy Award (award # DEFG52-08NA28921).]

1.1 Association Rules

Association rules [1][2][3][4][5] provide information in the form of "if-then" statements. These rules are computed from the data and, unlike the rules of logic, they are probabilistic in nature. In association analysis, the antecedent (the "if" part of the rule) and the consequent (the "then" part of the rule) are sets of items, referred to as itemsets, that are disjoint (i.e., they have no item in common). In addition to the antecedent and the consequent, an association rule usually has statistical interest measures that express the degree of certainty in the rule. Two ubiquitously used measures are support and confidence. The support of an itemset is the number of transactions that include all the items in the itemset. The support of an association rule is simply the support of the union of the items in the antecedent and the consequent. It can be expressed either as an absolute number or as a percentage of the total number of transactions in the database. In statistical terms, it expresses the statistical significance of a rule. The confidence of an association rule is defined as the ratio of the number of transactions containing all the items in both the antecedent and the consequent of the rule (i.e.,
the support of the rule) to the number of transactions that include all the items in the antecedent only (i.e., the support of the antecedent). Statistically, this measure expresses the strength of a rule. Alternatively, one can think of support as the probability that a randomly selected transaction from the database will contain all the items in the antecedent and the consequent, and of confidence as the conditional probability that a randomly selected transaction will include all the items in the consequent given that it includes all the items in the antecedent.

1.2 Formal Problem Statement

Formally, let I be a set of items defined in an item space [3][4][6]. A set of items S = {i1, ..., ik} contained in I is referred to as an itemset (or a k-itemset if S contains k items). A transaction over I is defined as a pair T = (tid, ilist), with tid being the transaction identifier and ilist an itemset over I. A transaction T = (tid, ilist) is said to support an itemset S in I if S is a subset of T's ilist. A transaction database D over I is defined as a set of transactions over I. For every itemset S, the support of S in D counts the transaction identifiers of all transactions in D that support S (i.e., contain S in their ilists):

support(S, D) = |{tid | (tid, ilist) in D, S ⊆ ilist}|.

An itemset is said to be frequent if its support is greater than or equal to a given absolute minimum support threshold minsupp, where 0 <= minsupp <= |D|. An itemset not yet known to be frequent or infrequent is referred to as a candidate frequent itemset. Generally speaking, ARM is defined as a three-step process: (1) choosing the right set of items/level of detail, (2) finding all frequent patterns that occur at least as frequently as a predetermined minimum support threshold, and (3) generating, from the frequent patterns, strong association rules that satisfy the minimum confidence threshold.
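As a concrete illustration of the Section 1.2 definitions, the following sketch computes itemset support and enumerates the frequent itemsets over a toy transaction database. The data, the item names, and the `support` helper are our own illustrative assumptions, not from the paper.

```python
from itertools import combinations

# Toy transaction database D over item space I (illustrative data only)
D = [(1, ["bread", "milk"]),
     (2, ["bread", "eggs"]),
     (3, ["bread", "milk", "eggs"]),
     (4, ["milk"])]

I = {"bread", "milk", "eggs"}

def support(S, D):
    """support(S, D) = |{tid | (tid, ilist) in D, S a subset of ilist}|"""
    S = set(S)
    return sum(1 for tid, ilist in D if S <= set(ilist))

# All frequent itemsets for an absolute threshold minsupp = 2
minsupp = 2
frequent = [set(c)
            for k in range(1, len(I) + 1)
            for c in combinations(sorted(I), k)
            if support(c, D) >= minsupp]
```

Here, for instance, {bread, milk} has support 2 and is frequent at minsupp = 2, while {milk, eggs} (support 1) is not.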
However, it is worth noting that a few ARM approaches do not strictly adhere to this three-step format.

1.3 Rule Generation

The support [7] of an association rule A -> C in D, support(A->C, D), is the support of A union C in D. An association rule is called frequent if its support exceeds the given minsupp. The confidence [8] of an association rule A -> C in D, confidence(A->C, D), is the conditional probability of having C contained in a transaction, given that A is contained in the same transaction: P(C|A), i.e.,

confidence(A->C, D) = support(A->C, D) / support(A, D).

A rule is confident if its confidence exceeds a given minimum confidence threshold minconf, where 0 <= minconf <= 1. So, given a set of items I and a transaction database D over I, we are concerned with generating the collection of strong rules in D with respect to minsupp and minconf.

2. ROLODEX MODEL AND pTREE

2.1 RoloDex Model

A RoloDex is a rotating file device used to store the information of individuals. It holds specially designed index cards, and the user stores the contact information of individuals on these cards. We extend the notion of the RoloDex to association rule mining: the axes of the RoloDex represent entities, and the index cards represent the relationships. This view may facilitate research into the data mining of multiple relationships. In each RoloDex we can store the relationship between multiple entities and then compare with other RoloDexes storing similar information. While association rule mining is mainly concerned with mining the relationship between two entities, the goal in using the RoloDex model is to extend ARM to the data mining of multiple relationships. The benefit of the RoloDex model over the DataCube model and the relational model is that the DataCube model is not as flexible and, since it contains more nulls, it is not as easy to isolate particular relationships.
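The rule-generation step of Section 1.3 can be sketched directly from these definitions. The helper names and toy data below are our own illustrative assumptions, not from the paper.

```python
# Sketch of the strong-rule test: A -> C is strong when
# support(A union C, D) >= minsupp and confidence(A->C, D) >= minconf.
def support(S, D):
    S = set(S)
    return sum(1 for _, ilist in D if S <= set(ilist))

def is_strong(A, C, D, minsupp, minconf):
    s_rule = support(set(A) | set(C), D)   # support(A->C, D)
    conf = s_rule / support(A, D)          # P(C|A)
    return s_rule >= minsupp and conf >= minconf

# Toy transaction database (illustrative)
D = [(1, ["bread", "milk"]), (2, ["bread", "eggs"]),
     (3, ["bread", "milk", "eggs"]), (4, ["milk"])]
```

With this data, {bread} -> {milk} has support 2 and confidence 2/3, so it is strong at minsupp = 2 and minconf = 0.5 but rejected once minconf is raised to 0.8.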
The relational model suffers from a lack of flexibility as well, and from the lack of a pictorial representation of the data.

2.2 pTree Algorithm

Tremendous volumes of data cause the cardinality problem for conventional transaction-based ARM algorithms. For fast and efficient data processing, we transform the data into pTrees [18], a loss-less, compressed, and data-mining-ready vertical data structure. pTrees are used for fast computation [20] of counts and for masking specific phenomena. This vertical data representation consists of set structures representing the data column-by-column rather than row-by-row (as in horizontal relational data). Predicate trees are one choice of vertical data representation, and can be used for data mining instead of the more common sets of relational records. This data structure has been successfully applied in data mining applications ranging from classification and clustering with K-nearest neighbors, to classification with decision tree induction, to association rule mining [19][20][21][22][23]. A basic pTree represents one attribute bit that is reorganized into a tree structure by recursive sub-division, recording the predicate truth value for each division. Each level of the tree contains truth bits that represent sub-trees and can then be used for phenomena masking and fast computation of counts. This construction is continued recursively down each tree path until downward closure is reached. For example, if the predicate is "purely 1 bits", downward closure is reached when purity is reached (either purely 1 bits or purely 0 bits). In this case, a tree branch is terminated when a subdivision is reached that is entirely pure (which may or may not be at the leaf level). These basic pTrees and their complements are combined using Boolean algebra operators such as AND (&), OR (|), and NOT (') to produce mask pTrees for individual values, individual tuples, value intervals, tuple rectangles, or any other attribute pattern.
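The Boolean behavior described above can be sketched with plain bit lists. This is a simplified, uncompressed stand-in for the tree structure; the `BitColumn` class is our own illustrative construction, not the authors' PTree API.

```python
class BitColumn:
    """One attribute-bit column (uncompressed stand-in for a basic pTree)."""
    def __init__(self, bits):
        self.bits = list(bits)                 # one bit per tuple (row)

    def __and__(self, other):                  # AND (&)
        return BitColumn(a & b for a, b in zip(self.bits, other.bits))

    def __or__(self, other):                   # OR (|)
        return BitColumn(a | b for a, b in zip(self.bits, other.bits))

    def complement(self):                      # NOT: the complement pTree
        return BitColumn(1 - b for b in self.bits)

    def root_count(self):                      # occurrence count of the pattern
        return sum(self.bits)

# Two attribute-bit columns over five tuples
p1 = BitColumn([1, 0, 1, 1, 0])
p2 = BitColumn([1, 1, 0, 1, 0])
mask = p1 & p2                                 # mask: rows where both bits are 1
```

A real pTree would additionally compress pure runs of bits into tree nodes; here the root count is simply the number of 1 bits in the column.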
The root count of any pTree indicates the occurrence count of that pattern. The pTree data structure thus provides a means of counting patterns in an efficient, highly scalable manner.

2.3 RoloDex Model Illustration Using a Movie Example

Fig 1.0 is an example illustration of the RoloDex model in which there is a movie entity and a customer entity, and the relationship between the customer and the movie is captured through customer-rates-movie cards. The customers have rated movieIDs 4 and 1 with ratings on the corresponding axes of the RoloDex, as shown by the dotted lines.

[Fig 1.0: Customer Rating Movie Card diagram]

The customer-rates-movie-as-5 card can then be constructed from the above cards as shown in Fig 2.0.

[Fig 2.0: Customer Rating Movie 5 Card diagram]

If we want to study the customers who have rated movies only as 5 against other entities, then we will consider only the above customer rating card.

3. ROLODEX Model in MBR Using pTrees

3.1 2-Entity, 2-Relationship Multi-Relationships

In Market Basket Analysis we consider the buying patterns of customers, and we define the two major entities we are concerned with: Customers and Items. One relationship is Buy(Customer, Item), and the other relationship we have introduced is Rating(Customer, Item). The idea is that if a customer often buys an item, the customer is likely to give that item a fairly positive rating. Our rating system is defined on a scale from 0 to 5, where 0 means the customer is never going to buy the item and 5 means the customer is definitely going to buy it. In Fig 3.0 we show the relationship between the entities Customers (C) and Items (I) using two RoloDexes: the first RoloDex has the customers and the items as its two axes with Buy as the relationship containing the customer-item cards, while the second RoloDex also has the customers and items as its axes and uses Rating as the relationship containing the Rating-5 customer-item cards. The idea is to fill each index card with a true (1) or false (0) value according to whether the relationship condition is satisfied; the 1-counts of each card's pTrees are pre-computed along the axes.

[Fig 3.0: 2-Entity 2-Relationship diagram, with pre-computed BpTree and R5pTree 1-counts]

In the diagram, the customer with custID 3 buys the item with itemID 4 and rates that item 5; similarly, the customer with custID 5 buys the item with itemID 1 and rates it 5. The 1-counts of the pre-computed Buy RoloDex are 3 2 1 2 column-wise and 2 1 3 2 row-wise, and the 1-counts of the pre-computed Rating RoloDex are 1 2 3 1 column-wise and 0 1 1 2 row-wise. We are interested in whether, for i ∈ I, we can generate strong rules for the customers. We made the assumption above that if a customer rates an item as 5, the customer is likely to buy that item. We need to find the support and the confidence of this rule, and if they exceed minsupp and minconf, we can recommend that the customer buy the item. So,

{c | rate(i,c) = 5} → {c | buy(c,i) = yes}.
In terms of pTrees we define the support and the confidence as follows:

Confidence = count(R5pTree_i & BpTree_i) / count(R5pTree_i)
Support = count(R5pTree_i) / size(R5pTree_i)

The overall schema for the RoloDex model is as follows:

size(Customer) = size(R5pTree_i) = size(BpTree_i) = 4
size(Itemset) = size(R5pTree_c) = size(BpTree_c) = 4

If minconf = minsupp = 0.2, then for itemset i = 1:

R5pTree_1 = 0001
BpTree_1 = 1001
Confidence = count(0001 & 1001) / count(0001) = 1
Support = 1/4 = 0.25

Both of these values are greater than minsupp and minconf, so this is a strong rule. We can now include variations in the rules, such as checking whether the generated rules are still strong if we choose a rating of 4 instead of 5. Moreover, instead of choosing a singleton {i} from the itemset, we might also be interested in types of itemsets, such as grocery items, food items, and so on. Data mining the results is much faster because the computation consists mainly of binary AND and OR operations. Moreover, since we process the data vertically in the form of pTrees, we can extract the information from the RoloDex in a single pass.

The rule whose strength we calculated in the previous example can be considered an expected mine, as it is expected that a person who rates an item as 5 will buy that item. Let us take some further examples in which we mine other relationships and might be interested in finding other items the customers could be interested in. In the following diagram, Fig 4.0, there are two entities E and F, and two types of RoloDex cards, one in each of the two RoloDexes S(E,F) and R(E,F).

[Fig 4.0: Entity Relationship Model diagram]

The following are some of the relations and how they can be derived using pTrees.
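The worked numbers above can be replayed with plain bit vectors. This is a sketch only, using uncompressed bit lists in place of the actual pTree AND.

```python
# R5pTree_1 and BpTree_1 from the worked example, as bit lists over 4 customers
R5 = [0, 0, 0, 1]             # customers rating item 1 as 5 (0001)
B  = [1, 0, 0, 1]             # customers buying item 1 (1001)

def count(bits):
    return sum(bits)

both = [r & b for r, b in zip(R5, B)]          # R5pTree_1 & BpTree_1

confidence = count(both) / count(R5)           # 1/1 = 1
support = count(R5) / len(R5)                  # 1/4 = 0.25

minconf = minsupp = 0.2
strong = confidence >= minconf and support >= minsupp
```

Both measures clear the 0.2 thresholds, reproducing the paper's conclusion that the rule is strong.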
1. Given e ∈ E, if R(e,f) then S(e,f):
Confidence = count(R_e & S_e) / count(R_e)
Support = count(R_e) / size(R_e)

2. If R(e,f) for all e ∈ A, then S(e,f) for all e ∈ B, where A and B are different entity subsets:
Confidence = count((&_{e∈A} R_e) & (&_{e∈B} S_e)) / count(&_{e∈A} R_e)
Support = count(&_{e∈A} R_e) / size(&_{e∈A} R_e)

3. If R(e,f) for all e ∈ A, then S(e,f) for some e ∈ B:
Confidence = count((&_{e∈A} R_e) & (OR_{e∈B} S_e)) / count(&_{e∈A} R_e)
Support = count(&_{e∈A} R_e) / size(&_{e∈A} R_e)

From the RoloDex model we have obtained the relationships among various entities, and we can also calculate the support and the confidence of the various association rules, determining whether they are strong from the chosen minsupp and minconf values.

3.2 2-Entity, 3-Relationship Multi-Relationships

The usage of 2-entity, 2-relationship multi-relations can be extended to 2-entity, 3-relationship multi-relations. We take an example where a customer C rates (relationship) an item I, buys (relationship) the item I, and uses (relationship) the item I frequently. Fig 4.0 is a RoloDex model representation of this example.

[Fig 4.0: 2-Entity 3-Relationship diagram, with cards B(I,C), U(I,C) and R1(C,I)]

A customer C who has rated item I as 1 and buys item I will frequently use item I. For i ∈ I,

{c | R1(c,i) = y & B(c,i) = y} → {c | U(c,i) = y}

Confidence = count(R1pTree_i & BpTree_i & UpTree_i) / count(R1pTree_i & BpTree_i)
Support = count(R1pTree_i & BpTree_i) / size(R1pTree_i & BpTree_i)

So the confidence and the support for the rules can be calculated using pTrees in the RoloDex model, extending to three relations between two entities.

4. LIMITATIONS

In this paper we have not used real-life data, and we have not compared whether using the RoloDex model over other models such as the DataCube model gains any accuracy in results, in terms of either prediction or speed. We mentioned in the paper that by using pTrees we gain a computational benefit because we are basically doing binary operations, but we have not yet provided evidence supporting this claim. This research is still in its early phases and lacks evaluation and testing.

5. CONCLUSIONS

The concept of using the RoloDex model in Market Basket research with pTrees is a new line of research first presented in this paper. We plan to extend the concept to n entities and n relationships and see how we can provide better and more accurate results using pTrees. We also plan to extend this concept not only to Market Basket research but also to other areas of ARM where we would like to study the support and confidence of various association rules and gather important results.

6. REFERENCES

[1] R. Agarwal, A. Arning, T. Bollinger, M. Mehta, J. Shafer and R. Srikant. The Quest Data Mining System. In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, August 1996.
[2] R. Agarwal, T. Imielinski and A. Swami. Mining association rules between sets of items in large databases. In Proceedings of the ACM SIGMOD International Conference on Management of Data, pages 207-216, May 1993.
[3] R. Agarwal, T. Imielinski and A. Swami. Database mining: a performance perspective. IEEE Transactions on Knowledge and Data Engineering, 5:914-925, 1993.
[4] R. Agarwal, H. Mannila, R. Srikant, H. Toivonen and A.I. Verkamo. Fast discovery of association rules. In Fayyad et al., pages 307-328, 1996.
[5] R. Agarwal and R. Srikant. Fast algorithms for mining association rules in large databases. In Proceedings of the 20th International Conference on Very Large Databases, pages 487-499, September 1994.
[6] Imad Mohamad Rahal and William Perrizo. A vertical extensible ARM framework for the scalable mining of association rules. Pages 5-25.
[7] Michael Steinbach, Pang-Ning Tan, Hui Xiong and Vipin Kumar. Generalizing the notion of support. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.
[8] Michael Steinbach and Vipin Kumar. Generalizing the notion of confidence.
In Fifth IEEE International Conference on Data Mining, 27-30 Nov. 2005.
[9] B. Goethals. Survey of frequent pattern mining. http://www.cs.helsinki.fi/u/goethals [accessed 30 December 2004]
[10] N.G. Das. Statistical Methods. M. Das and Co., 2001 edition.
[11] Notes on statistics. http://www.netmba.com/statistics/covariance/
[12] E.-H. Han, G. Karypis, and V. Kumar. TR# 97-068: Min-Apriori: An algorithm for finding association rules in data with continuous attributes. Technical report, Department of Computer Science, University of Minnesota, Minneapolis, MN, 1997.
[13] Jiawei Han and Micheline Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann Publishers Inc., San Francisco, CA, 2000.
[14] M. J. Zaki and M. Ogihara. Theoretical foundations of association rules. In DMKD 98, pages 7:1-7:8, 1998.
[15] J. Han and Y. Fu. Discovery of multiple-level association rules from large databases. In Proceedings of the 21st International Conference on Very Large Databases, pages 420-431, September 1995.
[16] James W. Demmel. Applied Numerical Linear Algebra. Society for Industrial and Applied Mathematics, Philadelphia, PA, 1997.
[17] C. Yang, U.M. Fayyad and P.S. Bradley. Efficient discovery of error-tolerant frequent itemsets in high dimensions. In KDD 2001, pages 194-203, 2001.
[18] PTree Application Programming Interface Documentation, North Dakota State University. http://midas.cs.ndsu.nodak.edu/~datasurg/pTree/
[19] Q. Ding, M. Khan, A. Roy, and W. Perrizo. The PTree algebra. In Proceedings of the ACM Symposium on Applied Computing, pages 426-431, 2002.
[20] A. Perera and W. Perrizo. Parameter optimized, vertical, nearest neighbor vote and boundary based classification. CATA, 2007.
[21] A. Perera, T. Abidin, G. Hamer and W. Perrizo. Vertical set square distance based clustering without prior knowledge of K. 14th International Conference on Intelligent and Adaptive Systems and Software Engineering (IASSE 05), Toronto, Canada, 2004.
[22] W. Perrizo, G. Wettstein, A. Perera and T. Lu. The universality of nearest neighbor sets in classification and prediction. Software Engineering and Data Engineering, 2009.
[23] Y. Wang, T. Lu and W. Perrizo. A novel combinatorial score for feature selection with PTree in DNA microarray data analysis. Software Engineering and Data Engineering, 2010.