Maximum Likelihood Function used to calculate confidence of
Association rules in Market Baskets
Arijit Chatterjee
Department of Computer Science
North Dakota State University
Fargo, ND 58102, USA
[email protected]
701-540-3804
Dr. William Perrizo
Department of Computer Science
North Dakota State University
Fargo, ND 58102, USA
[email protected]
701-231-7248
Abstract
In this paper we are concerned with different ways of calculating the strength of Association
Rules in Market Basket data. The significance of Association Rules is measured via two
measures, support and confidence, and these measures are used to determine strong rules. In the
realm of Market Basket Research these measures can be used to find the strength of rules of the
form, “When a customer buys items A & B, the customer also buys item C”. The first portion of
this paper illustrates the use of the method of Maximum Likelihood for Point Estimation and
shows how the maximum likelihood estimator can also be used for predicting the confidence of
an association rule. The second portion of the paper describes, with examples, how the maximum
likelihood function can be used for calculating the collective confidence of association rules.
Keywords: Association Rules, Maximum Likelihood estimator, Market Basket Research.
1. INTRODUCTION
Since its introduction in 1993 by Agrawal et al. [1][2], association rule mining has continuously
received a great deal of attention from the database research community. Association Rule Mining
(ARM) is the data-mining process of finding interesting association and/or correlation
relationships among large sets of data items. The original motivation for discovering association
rules comes from the need to analyze supermarket transactions in what is known as Market
Basket Research (MBR), where analysts are interested in examining customer shopping patterns in
terms of the purchased products. The market basket databases consist of a large number of
transactional records. In addition to the transaction identifier, each record lists all the items
bought by a customer during a single visit to the store. Knowledge workers are typically interested
in finding out which groups of items are consistently purchased together. Such knowledge can be
useful in many business decision-making processes, such as adjusting store layouts (like placing
products optimally with respect to each other), running promotions, designing catalogs and
identifying potential customer segments as targets for marketing campaigns.
1.1 Association Rules
Association rules [1][2][3][4][5] provide information in the form of “if-then” statements. These
rules are computed from the data and unlike the rules of logic they are probabilistic in nature. In
association analysis, the antecedent (or the “if” part of the rule) and the consequent (or the “then”
part of the rule) are sets of items referred to as item sets that are disjoint (i.e. do not have any item
in common). In addition to the antecedent and the consequent, an association rule usually has
statistical interest measures that express the degree of certainty in the rule. Two ubiquitously used
measures are support and confidence. The support of an item set is the number of transactions that
include all the items in the item set. The support of an association rule is simply the support of the
union of items in the antecedent and the consequent. It can be expressed either as an
absolute number or as a percentage of the total number of transactions in the database. In
statistical terms, this expresses the statistical significance of a rule. The confidence of an
association rule is defined as the ratio of the number of transactions containing all the items in the
antecedent as well as the consequent of the rule (i.e. support of the rule) over the number of
transactions that include all the items in the antecedent only (i.e. the support of the antecedent).
Statistically, this measure expresses the statistical strength of a rule. Alternatively, one can think
of support as the probability that a randomly selected transaction from the database will contain all
the items in the antecedent and the consequent, and of confidence as the conditional probability
that a randomly selected transaction will include all the items in the consequent given that the
transaction includes all the items in the antecedent. In this paper we will illustrate that the
maximum likelihood function can also be used to determine the confidence of an association rule.
1.2 Formal Problem Statement
Formally, let I be a set of items defined in an item space [3][4][6]. A set of items S = {i1, …, ik}
belonging to I is referred to as an item set (or a k-item set if S contains k items). Any transaction over
I is defined as a couple T = (tid, ilist) with tid being the transaction identifier and ilist an item set
over I. A transaction T = (tid, ilist) is said to support an item set S in I if S is a subset of T’s ilist.
A transaction database D over I is defined as a set of transactions over I.
For every item set S, the support of S in D counts the number of transaction identifiers of all
transactions in D that support S (i.e. contain S in their ilists): support(S, D) = |{tid | (tid, ilist) in D, S
being a subset of ilist}|. An item set is said to be frequent if its support is greater than or equal to
a given absolute minimum support threshold, minsupp, where 0 <= minsupp <= |D|. An item set
which is not known to be frequent or infrequent is referred to as a candidate frequent item set.
Generally speaking, ARM is defined as a three-step process: (1) choosing the right set of
items/level of detail, (2) finding all frequent patterns which occur at least as frequently as a
predetermined minimum support threshold and (3) generating strong association rules from the
frequent patterns which must satisfy the minimum confidence threshold. However, it is worth
noting that a few ARM approaches do not strictly adhere to this three-step format.
1.3 Rule Generation
The support [7][26] of an association rule A->C in D, support(A->C, D), is the support of A union
C in D. An association rule is called frequent if its support exceeds the given minsupp. The
confidence [8] of an association rule A->C in D, written as confidence(A->C, D), is the
conditional probability of having C contained in a transaction, given that A is contained in the
same transaction. Mathematically, this is denoted as
P(C|A) = confidence(A->C, D) := support(A->C, D) / support(A, D).
A rule is confident if its confidence exceeds a given minimal confidence threshold, minconf,
where 0 <= minconf <= 1. So, given a set of items I and a transactional database D over I, we are
concerned with generating the collection of strong rules in D with respect to minsupp and minconf.
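As a concrete illustration of these definitions (not part of the original paper; the toy transaction database, thresholds and function names below are assumptions made purely for illustration), a short Python sketch can compute support(A->C, D) and confidence(A->C, D) and test a rule against minsupp and minconf:

def support(itemset, transactions):
    """support(S, D): number of transactions in D that contain every item of S."""
    return sum(1 for t in transactions if itemset.issubset(t))

def confidence(antecedent, consequent, transactions):
    """confidence(A->C, D) = support(A union C, D) / support(A, D)."""
    return support(antecedent | consequent, transactions) / support(antecedent, transactions)

def is_strong(antecedent, consequent, transactions, minsupp, minconf):
    """A rule is strong if it meets both the minsupp and minconf thresholds."""
    return (support(antecedent | consequent, transactions) >= minsupp and
            confidence(antecedent, consequent, transactions) >= minconf)

# Hypothetical transaction database D over items {A, B, C, D}
D = [{"A", "B", "C"}, {"A", "B"}, {"A", "C"}, {"B", "C"}, {"A", "B", "C", "D"}]

A, C = {"A", "B"}, {"C"}
print(support(A | C, D))                              # 2
print(confidence(A, C, D))                            # 2/3, about 0.667
print(is_strong(A, C, D, minsupp=2, minconf=0.5))     # True

With this toy database, the rule {A, B} -> {C} has support 2 and confidence 2/3, so it would be considered strong for minsupp = 2 and minconf = 0.5.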
2. Method of Maximum Likelihood
Since a sample is only a part of a population, the features of the former will generally differ from
those of the latter. The question that naturally arises is then: what can be said about the properties
of the population from knowledge of the properties of the sample? Although a satisfactory answer
to this question may not be found in all cases, in the case of random sampling it can be answered
with the help of probability theory. In sampling theory, we are primarily concerned with this very
question. The process of going from known sample information to an unknown population is
called statistical inference [18][19][27].
Suppose we have a population X which is characterized by a single parameter θ (or by a set of
parameters). The basic problem of sampling theory stated above usually presents itself in one
of two forms: (a) some feature of the population X may be completely unknown to the
experimenter, and he may want to make a guess about this feature (which is labeled θ) solely on
the basis of a random sample from the population; (b) some information of a tentative nature
regarding θ may be available to the experimenter, and he may want to see whether that information
is tenable in the light of the random sample taken from the population X. The first type of problem
is the problem of estimation [25][30], and in this paper we are primarily concerned with one
particular strategy of estimation known as maximum likelihood estimation.
The method of maximum likelihood is a convenient method for finding a good estimator. Consider
f(x1, x2, …, xn | θ), the joint probability density or probability mass of the sample observations. For
fixed θ, it may be looked upon as a function of the sample observations, and then it gives their
probability density function or probability mass function. But when x1, x2, …, xn are given, it may
also be looked upon as a function of θ, called the likelihood function of θ and denoted by L(θ | x).
The principle of maximum likelihood consists in taking that value of θ as the estimate of θ for which
L(θ | x) is a maximum. Thus if θ* is the maximum-likelihood estimate (m.l.e.) of θ, then by
definition
L(θ* | x) = max_θ L(θ | x) ……………………..(i)
(The maximum likelihood estimator is θ* when looked upon as a function of the random variables
X1, X2, …, Xn.)
In many cases it will be convenient to deal with log L(θ | x) rather than L(θ | x), and since
log L(θ | x) attains its highest value for the same value of θ as L(θ | x) does, θ* is such that
log L(θ* | x) = max_θ log L(θ | x).
This θ* will in many cases be obtainable by differentiating log L(θ | x) partially with respect
to θ and solving the two conditions given below:
∂ log L / ∂θ = 0 …………………………………………………….(ii)
∂² log L / ∂θ² < 0 when θ = θ* ………………………………………..(iii)
But one must make sure that the value obtained by solving (ii), which gives a local maximum of
L(θ | x), also gives the absolute (global) maximum, as required by (iii). Indeed, the
derivative may not exist at θ = θ*, and then this method will fail.
The shape of the log-likelihood function is important in a conceptual way. If the log-likelihood
function is relatively flat, one can make the interpretation that several (perhaps many) values of θ
are nearly equally likely. They are relatively alike; this is quantified as the sampling variance or
standard error. On the other hand, if the log-likelihood function is fairly peaked near its maximum
point, this indicates that some values of θ are relatively very likely compared to others. There is
some considerable degree of certainty implied, and this is reflected in small sampling variances and
standard errors, and narrow confidence intervals. So, the value of the log-likelihood function at its
maximum point is important, as well as the shape of the function near this maximum point.
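To make the role of curvature concrete, the following short Python sketch (an illustration added here, not taken from the paper; the sample sizes are hypothetical) evaluates the second derivative of a binomial log-likelihood at its maximum for two samples with the same 25% success proportion but different sizes. The larger sample gives a much more sharply peaked log-likelihood and therefore a smaller approximate standard error:

# Illustrative sketch: curvature of a binomial log-likelihood at its maximum.
# Two hypothetical samples with the same proportion of successes (25%) but different sizes.
import math

def log_likelihood(p, x, n):
    """log L(p | x successes in n trials), dropping the constant log nCx term."""
    return x * math.log(p) + (n - x) * math.log(1 - p)

def curvature_at_max(x, n, h=1e-4):
    """Numerical second derivative of log L at p_hat = x / n (central difference)."""
    p_hat = x / n
    return (log_likelihood(p_hat + h, x, n) - 2 * log_likelihood(p_hat, x, n)
            + log_likelihood(p_hat - h, x, n)) / h ** 2

for x, n in [(5, 20), (500, 2000)]:
    c = curvature_at_max(x, n)
    se = math.sqrt(-1 / c)   # large-sample standard error implied by the curvature
    print(f"n={n}: d2(log L)/dp2 at p_hat = {c:.1f}, approx. standard error = {se:.4f}")

For n = 20 the curvature is roughly -107 (standard error about 0.097), while for n = 2000 it is roughly -10667 (standard error about 0.0097), illustrating the flat-versus-peaked distinction described above.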
2.1 Using the Maximum Likelihood Function to calculate the confidence of an Association
Rule
The probability distribution of a random variable is a statement specifying the set of its possible
values together with their respective probabilities. When a random experiment is theoretically
assumed to serve as a model, the probabilities can be given as a function of the random variable.
The probability distribution concerned is then generally known as a theoretical distribution.
In this paper, however, we are only concerned with a particular form of distribution known as the
Binomial Distribution and how we can apply this distribution to the realm of market basket
research. To understand binomial distributions and binomial probability, let us first focus on
understanding binomial experiments.
A binomial experiment (a sequence of Bernoulli trials) is a statistical experiment that has the
following properties:
• The experiment consists of n repeated trials.
• Each trial can result in just two possible outcomes. We call one of these outcomes a success and the other a failure.
• The probability of success, denoted by p, is the same on every trial.
• The trials are independent; that is, the outcome of one trial does not affect the outcome of other trials.
The Binomial Distribution is a discrete probability distribution defined by the p.m.f
(probability mass function)
P(X = x) = f(x) = nCx p^x (1-p)^(n-x) · I(x = 0, 1, 2, …, n),
where p denotes the probability of success and I(·) is an indicator function.
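As a quick numerical illustration of this p.m.f (the values below are arbitrary and not from the paper), the probability of exactly x successes in n trials can be evaluated directly:

# Evaluate the binomial p.m.f. P(X = x) = nCx * p^x * (1-p)^(n-x) for illustrative values.
from math import comb

def binom_pmf(x, n, p):
    """Binomial p.m.f.; the indicator I(.) is handled by the range check."""
    if not 0 <= x <= n:
        return 0.0
    return comb(n, x) * p ** x * (1 - p) ** (n - x)

print(binom_pmf(3, 10, 0.25))   # probability of exactly 3 successes in 10 trials with p = 0.25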
We will take a real-life example from market basket data and show how we can apply the
binomial distribution. Let us consider a supermarket database having 100,000 point-of-sale
transactions, out of which 4,000 include both items A and B, and 1,000 of these also include item C.
The corresponding association rule “If A and B are purchased then C is purchased in the same trip”
then has a support of 1% and a confidence of 25%. We will formulate the same problem
using the binomial distribution and show that the confidence of the association rule can
be calculated using the method of Maximum Likelihood.
Let p be the probability that when items A and B are purchased, item C is also purchased. The
100,000 point-of-sale transactions can each be considered one of n repeated trials. Each of
these transactions can result either in item C being purchased whenever items A & B
are bought together, or not; each transaction therefore has two possible outcomes. The
probability of item C being bought whenever items A & B are bought is the same in
each of these transactions. Each of these transactions is independent, and the outcome of one
transaction does not affect the others. Since all the conditions of a binomial experiment are satisfied
by the real-life market basket data, we formulate the p.m.f as follows:
4000 of the transactions have items A and B and out of those 1000 include item C.
So the probability that item C occurs 1000 times in 4000 trials is
f(x) = 4000C1000 p^1000 (1-p)^(4000-1000),
where p is the probability of success of the occurrence of item C.
Since this is a single association rule, the likelihood function is given by
L = 4000C1000 p^1000 (1-p)^(4000-1000)
  = 4000C1000 p^1000 (1-p)^3000
Since L is maximized when log L is maximized, taking logarithms on both sides:
log L = log(4000C1000) + 1000 log p + 3000 log(1-p)
Differentiating both sides with respect to p:
∂ log L / ∂p = 1000/p − 3000/(1−p)
The maximum likelihood estimator p̂ is therefore obtained by setting the above derivative equal to 0:
1000/p̂ − 3000/(1−p̂) = 0
Solving for p̂,
p̂ = 0.25
In order to verify that p̂ corresponds to a maximum, we take the second derivative of the
log-likelihood function. Starting from
∂ log L / ∂p = 1000/p − 3000/(1−p)
and differentiating both sides with respect to p, we have:
∂² log L / ∂p² = −1000/p² − 3000/(1−p)²
Evaluating at p̂ = 0.25:
∂² log L / ∂p² |_(p̂ = 0.25) = −1000/(0.25)² − 3000/(1−0.25)² < 0 ………………….(iv)
So the likelihood function attains its maximum at p̂ = 0.25. In fact, the expression in (iv) is
negative for any value of p ∈ (0, 1).
In this way the maximum likelihood function can be used to calculate the confidence of an
Association Rule.
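The closed-form result p̂ = 0.25 can also be checked numerically. The sketch below (an added illustration, not the paper's procedure) maximizes log L = 1000 log p + 3000 log(1−p) over a grid of p values and recovers the rule's confidence:

# Numerical check: the binomial log-likelihood for 1000 "C purchased" outcomes
# out of 4000 "A and B purchased" transactions peaks at p = 1000/4000 = 0.25.
import math

x, n = 1000, 4000

def log_likelihood(p):
    # The constant log(4000C1000) term is dropped; it does not affect the argmax.
    return x * math.log(p) + (n - x) * math.log(1 - p)

grid = [i / 10000 for i in range(1, 10000)]      # p values in (0, 1)
p_hat = max(grid, key=log_likelihood)

print("numerical p_hat =", p_hat)                # 0.25
print("closed form x/n =", x / n)                # 0.25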
3. Using the Maximum Likelihood Function to calculate the collective Confidence of Association Rules
In most models there is more than one parameter [25]. In general, let there be K parameters
θ1, θ2, θ3, …, θK. Based on a specific model we can construct the log-likelihood,
loge(L(θ1, θ2, θ3, …, θK | data)) = loge(L)
and the K likelihood equations,
∂ log L / ∂θ1 = 0
∂ log L / ∂θ2 = 0
.
.
.
∂ log L / ∂θK = 0
The solutions of these equations give the MLEs θ̂1, θ̂2, θ̂3, …, θ̂K.
The MLEs are almost always unique; in particular this is true of multinomial-based models [21][25].
In principle, loge(L(θ1, θ2, θ3, …, θK | data)) = loge(L) defines a “surface” in K-dimensional space,
and ideas of curvature still apply (as mathematical constructs) [23][25]. However, plotting is hard
for more than 2 parameters.
Sampling variances and covariances of the MLEs are computed from the log-likelihood,
loge(L(θ1, θ2, θ3, …, θK | data)), based on its curvature at the maximum [21][24][25].
We extend this concept of MLE to market basket research: when a collection of association rules is
involved in a database D, the maximum likelihood function can be used to calculate the collective
confidence of these independent rules.
“When items A & B are purchased then item C is purchased in the same trip” is an association
rule. Let us assume that out of n transactions in which A & B are purchased together, x of those
transactions also contain item C. If p is the probability of item C occurring, then the likelihood
function of this rule can be expressed as follows:
L1 = nCx p^x (1-p)^(n-x), where x = 0, 1, 2, …, n …………………………..(v)
Say we have another independent rule which states that “When items D & E are purchased
then item C is also purchased in the same trip”. If out of m transactions in which items D & E are
purchased together, y of those transactions also contain C, then the likelihood function of this
association rule can be expressed as follows:
L2 = mCy p^y (1-p)^(m-y), where y = 0, 1, 2, …, m ..........................................(vi)
Based on both rules, the collective confidence of the two association rules can be
thought of as the maximum likelihood estimate of the probability of buying item C whenever a
transaction happens. The overall likelihood function L is given by the product of these likelihood
functions:
L = L1 · L2
  = nCx p^x (1-p)^(n-x) · mCy p^y (1-p)^(m-y)
  = (nCx mCy) · p^(x+y) · (1-p)^[(m+n)-(x+y)]
Since L is maximized when log L is maximized, taking logarithms on both sides:
log L = log(nCx mCy) + (x+y) log p + [(m+n)−(x+y)] log(1−p)
Differentiating both sides with respect to p:
∂ log L / ∂p = (x+y)/p − [(m+n)−(x+y)]/(1−p)
The maximum likelihood estimator p̂ is therefore obtained by solving:
(x+y)/p̂ − [(m+n)−(x+y)]/(1−p̂) = 0
Solving,
p̂ = (x+y)/(m+n)
In order to verify that p̂ corresponds to a maximum, we take the second derivative of the
log-likelihood function. Starting from
∂ log L / ∂p = (x+y)/p − [(m+n)−(x+y)]/(1−p)
and differentiating both sides with respect to p, we have:
∂² log L / ∂p² = −(x+y)/p² − [(m+n)−(x+y)]/(1−p)²
Evaluating at p̂ = (x+y)/(m+n):
∂² log L / ∂p² |_(p̂ = (x+y)/(m+n)) = −(x+y)/[(x+y)/(m+n)]² − [(m+n)−(x+y)]/(1−[(x+y)/(m+n)])² < 0 …….(vii)
So the likelihood function attains its maximum at p̂ = (x+y)/(m+n). In fact, the expression in (vii)
is negative for all (x, y) and all (m, n) > 0.
The collective confidence of selecting item C from the association rules is thus found to be (x+y) /
(m+n).
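The same result can be checked numerically for two rules. In the sketch below (the counts n, x, m and y are hypothetical, not data from the paper), pooling the counts gives the collective estimate p̂ = (x+y)/(m+n), and a grid search over the pooled log-likelihood agrees:

# Collective confidence of two independent rules with the same consequent C.
# Rule 1: x of n transactions containing A & B also contain C.
# Rule 2: y of m transactions containing D & E also contain C.
import math

n, x = 4000, 1000     # hypothetical counts for rule 1
m, y = 2500, 500      # hypothetical counts for rule 2

def log_likelihood(p):
    # log(L1 * L2), dropping the constant binomial-coefficient terms.
    return (x + y) * math.log(p) + ((m + n) - (x + y)) * math.log(1 - p)

grid = [i / 100000 for i in range(1, 100000)]
p_hat_numeric = max(grid, key=log_likelihood)
p_hat_closed = (x + y) / (m + n)

print("closed form (x+y)/(m+n) =", p_hat_closed)     # 1500 / 6500, about 0.2308
print("numerical maximum       =", p_hat_numeric)    # agrees to the grid resolution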
4. LIMITATIONS
The confidence of an association rule can be calculated in a simpler way if we simply know the
number of transactions containing the antecedent and, out of those, the number that also contain the
consequent. The method of Maximum Likelihood generates the same result but takes a rather more
complicated mathematical approach. Moreover, the paper does not illustrate the approach of
calculating the collective confidence of an item with real-life data.
5. CONCLUSIONS
This paper takes a different approach to calculating the confidence of association rules and,
moreover, tries to estimate the extent of association between two rules in predicting the
occurrence of a particular item. The method for calculating the collective confidence of
a particular item using the maximum likelihood function can be extended to n association rules,
and the solution will maximize the likelihood of the occurrence of the consequent item in the
database. A further extension of this paper would be to find the maximum likelihood estimator
across different association rules when the probabilities of occurrence of the consequents differ.
6. ACKNOWLEDGEMENTS
I would like to extend my sincere acknowledgements to the members of the DATASURG group at
North Dakota State University, Fargo, North Dakota for providing me with their valuable
suggestions. I would also like to extend my sincere gratitude and respect to my PhD research
advisor Dr. William Perrizo.
REFERENCES
[1] R. Agrawal, A. Arning, T. Bollinger, M. Mehta, J. Shafer and R. Srikant. The Quest Data
Mining System. In Proceedings of the Second International Conference on Knowledge Discovery
and Data Mining, August 1996.
[2] R. Agrawal, T. Imielinski and A. Swami. Mining association rules between sets of items in
large databases. In Proceedings of the ACM SIGMOD International Conference on the
Management of Data, pages 207-216, May 1993.
[3] R. Agrawal, T. Imielinski and A. Swami. Database mining: a performance perspective. IEEE
Transactions on Knowledge and Data Engineering, 5:914-925, 1993.
[4] R. Agrawal, H. Mannila, R. Srikant, H. Toivonen and A.I. Verkamo. Fast discovery of
association rules. In Fayyad et al., pages 307-328, 1996.
[5] R. Agrawal and R. Srikant. Fast algorithms for mining association rules in large databases. In
Proceedings of the 20th International Conference on Very Large Databases, pages 487-499,
September 1994.
[6] Imad Mohamad Rahal and William Perrizo. A Vertical Extensible ARM Framework for the
Scalable Mining of Association Rules. Pages 5-25.
[7] Michael Steinbach, Pang-Ning Tan, Hui Xiong and Vipin Kumar. Generalizing the Notion of
Support. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge
Discovery and Data Mining.
[8] Michael Steinbach and Vipin Kumar. Generalizing the Notion of Confidence. In Proceedings
of the Fifth IEEE International Conference on Data Mining, 27-30 Nov. 2005.
[9] Goethals, B. Survey of frequent pattern mining. N.d. Internet.
http://www.cs.helsinki.fi/u/goethals [30 December 2004]
[10] N.G. Das. Statistical Methods. M. Das and Co., 2001 edition.
[11] Notes on Statistics from N.d. Internet.
http://www.netmba.com/statistics/covariance/
[12] E.-H. Han, G. Karypis, and V. Kumar. Tr# 97-068: Min-apriori: An algorithm for finding
association rules in data with continuous attributes. Technical report, Department of Computer
Science, University of Minnesota, Minneapolis, MN, 1997.
[13] Jiawei Han , Micheline Kamber, Data mining: concepts and techniques, Morgan Kaufmann
Publishers Inc., San Francisco, CA, 2000
[14] M. J. Zaki and M. Ogihara. Theoretical foundations of association rules. In DMKD 98, pages
7:1--7:8, 1998.
[15] J. Han and Y.Fu . Discovery of multiple level association rules from large databases. In
Proceedings of the 21st International Conference on very large databases, pages 420-431,
September 1995.
[16] James W. Demmel, Applied numerical linear algebra, Society for Industrial and Applied
Mathematics, Philadelphia, PA, 1997.
[17] C. Yang, U.M. Fayyad and P.S. Bradley. Efficient discovery of error-tolerant frequent item
sets in high dimensions. In KDD 2001, pages 194-203, 2001.
[18] Notes on Statistics from N.d. Internet.
http://www.weibull.com/AccelTestWeb/mle_maximum_likelihood_parameter_estimation.htm
[19] Notes on Statistics from N.d. Internet.
http://www.autonlab.org/tutorials/mle.html
[20] Notes on Statistics from N.d. Internet.
http://cnx.org/content/m11446/latest/
[21] Efron, B. and Hinkley, D.V. (1978). “Assessing the accuracy of the maximum likelihood
estimator: Observed versus expected Fisher information”. Biometrika, 65, 457-482.
[22] Feder, P.I. (1968). “On the distribution of the log likelihood ratio test statistic when the true
parameter is near the boundaries of the hypothesis regions”. Annals of Mathematical Statistics, 39,
2044-2055.
[23] Lehmann, E.L. and Casella, G. (2001). Theory of Point Estimation. Springer, New York.
[24] Self, S.G. and Liang, K.Y. (1987). “Asymptotic properties of maximum likelihood estimators
and likelihood ratio tests under nonstandard conditions”. Journal of the American Statistical
Association, 82, 605-610.
[25] Notes on Statistics from N.d. Internet.
http://mercury.bio.uaf.edu/courses/wlf625/readings/MLEstimation.PDF
[26] Brin, S., Motwani, R. and Silverstein, C., ”Beyond Market Baskets: Generalizing Association
Rules to Correlations,” Proc. ACM SIGMOD Conf., pp. 265-276, May 1997.
[27] Cheung, D., Han, J., Ng, V., Fu, A. and Fu, Y. (1996), A fast distributed algorithm for
mining association rules, in `Proc. of 1996 Int'l. Conf. on Parallel and Distributed Information
Systems', Miami Beach, Florida, pp. 31 - 44.
[28] Chuang, K., Chen, M., Yang, W., Progressive Sampling for Association Rules Based on
Sampling Error Estimation, Lecture Notes in Computer Science, Volume 3518, Jun 2005,
Pages 505 - 515
[29] Cristofor, L., Simovici, D., Generating an informative cover for association rules. In Proc.
of the IEEE International Conference on Data Mining, 2002.
[30] Das, A., Ng, W.-K., and Woon, Y.-K. 2001. Rapid association rule mining. In Proceedings
of the tenth international conference on Information and knowledge management.
ACM Press, 474-481.