Maximum Likelihood Function used to calculate confidence of
Association rules in Market Baskets
Arijit Chatterjee
Department of Computer Science
North Dakota State University
Fargo, ND 58102, USA
[email protected]
701-540-3804
Dr. William Perrizo
Department of Computer Science
North Dakota State University
Fargo, ND 58102, USA
[email protected]
701-231-7248
Abstract
In this paper we are concerned with different ways of calculating the strength of Association
Rules in Market Basket data. The significance of Association Rules is measured via two
measures, support and confidence, and these measures are used to determine strong rules. In the
realm of Market Basket Research these measures can be used to find the strength of rules of the
form, “When a customer buys items A & B, the customer also buys item C”. The first portion of
this paper illustrates the use of the method of Maximum Likelihood for Point Estimation and
shows how the maximum likelihood estimator can also be used for predicting the confidence of
an association rule. The second portion of the paper describes, with examples, how the maximum
likelihood function can be used for calculating the collective confidence of association rules.
Keywords: Association Rules, Maximum Likelihood estimator, Market Basket Research.
1. INTRODUCTION
Since its introduction in 1993 by Agrawal et al. [1][2], association rule mining has continuously
received a great deal of attention from the database research community. Association Rule Mining
(ARM) is the data-mining process of finding interesting association and/or correlation
relationships among large sets of data items. The original motivation for discovering association
rules comes from the need to analyze supermarket transactions in what is known as Market
Basket Research (MBR), where analysts are interested in examining customer shopping patterns in
terms of the purchased products. The market basket databases consist of a large number of
transactional records. In addition to the transaction identifier, each record lists all the items
bought by a customer during a single visit to the store. Knowledge workers are typically interested
in finding out which groups of items are consistently purchased together. Such knowledge can be
useful in many business decision-making processes, such as adjusting store layouts (like placing
products optimally with respect to each other), running promotions, designing catalogs and
identifying potential customer segments as targets for marketing campaigns.
1.1 Association Rules
Association rules [1][2][3][4][5] provide information in the form of “if-then” statements. These
rules are computed from the data and unlike the rules of logic they are probabilistic in nature. In
association analysis, the antecedent (or the “if” part of the rule) and the consequent (or the “then”
part of the rule) are sets of items referred to as item sets that are disjoint (i.e. do not have any item
in common). In addition to the antecedent and the consequent, an association rule usually has
statistical interest measures that express the degree of certainty in the rule. Two ubiquitously used
measures are support and confidence. The support of an item set is the number of transactions that
include all the items in the item set. The support of an association rule is simply the support of the
union of items in the antecedent and the consequent. It can be expressed either as an
absolute number or as a percentage of the total number of transactions in the database. In
statistical terms, this expresses the statistical significance of a rule. The confidence of an
association rule is defined as the ratio of the number of transactions containing all the items in the
antecedent as well as the consequent of the rule (i.e. support of the rule) over the number of
transactions that include all the items in the antecedent only (i.e. the support of the antecedent).
Statistically, this measure expresses the statistical strength of a rule. Alternatively, one can think
of support as the probability that a randomly selected transaction from the database will contain all
the items in the antecedent and the consequent, and of confidence as the conditional probability
that a randomly selected transaction will include all the items in the consequent given that the
transaction includes all the items in the antecedent. In this paper we will illustrate that the
maximum likelihood function can also be used to determine the confidence of an association rule.
1.2 Formal Problem Statement
Formally, let I be a set of items defined in an item space [3][4][6]. A set of items S = {i1, …, ik}
belonging to I is referred to as an item set (or a k-item set if S contains k items). Any transaction over
I is defined as a couple T = (tid, ilist) with tid being the transaction identifier and ilist an item set
over I. A transaction T = (tid, ilist) is said to support an item set S in I if S is a subset of T’s ilist.
A transaction database D over I is defined as a set of transactions over I.
For every item set S, the support of S in D counts the number of transaction identifiers of all
transactions in D that support S (i.e. contain S in their ilists): support(S, D) = |{tid | (tid, ilist) in D, S
being a subset of ilist}|. An item set is said to be frequent if its support is greater than or equal to
a given absolute minimum support threshold, minsupp, where 0 <= minsupp <= |D|. An item set
which is not known to be frequent or infrequent is referred to as a candidate frequent item set.
Generally speaking, ARM is defined as a three-step process: (1) choosing the right set of
items/level of detail, (2) finding all frequent patterns which occur at least as frequently as a
predetermined minimum support threshold and (3) generating strong association rules from the
frequent patterns which must satisfy the minimum confidence threshold. However, it is worth
noting that a few ARM approaches do not strictly adhere to this three-step format.
1.3 Rule Generation
The support [7][26] of an association rule A->C in D, support(A->C, D), is the support of A union
C in D. An association rule is called frequent if its support exceeds the given minsupp. The
confidence [8] of an association rule A->C in D, written as confidence(A->C, D), is the
conditional probability of having C contained in a transaction, given that A is contained in the
same transaction. Mathematically, this is denoted as
P(C|A) = confidence(A->C, D) := support(A->C, D) / support(A, D).
A rule is confident if its confidence exceeds a given minimal confidence threshold, minconf,
where 0 <= minconf <= 1. So, given a set of items I and a transactional database D over I, we are
concerned with generating the collection of strong rules in D with respect to minsupp and minconf.
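As a concrete illustration of these definitions (not part of the original paper; the toy transaction database, thresholds and function names below are assumptions made purely for illustration), a short Python sketch can compute support(A->C, D) and confidence(A->C, D) and test a rule against minsupp and minconf:

def support(itemset, transactions):
    """support(S, D): number of transactions in D that contain every item of S."""
    return sum(1 for t in transactions if itemset.issubset(t))

def confidence(antecedent, consequent, transactions):
    """confidence(A->C, D) = support(A union C, D) / support(A, D)."""
    return support(antecedent | consequent, transactions) / support(antecedent, transactions)

def is_strong(antecedent, consequent, transactions, minsupp, minconf):
    """A rule is strong if it meets both the minsupp and minconf thresholds."""
    return (support(antecedent | consequent, transactions) >= minsupp and
            confidence(antecedent, consequent, transactions) >= minconf)

# Hypothetical transaction database D over items {A, B, C, D}
D = [{"A", "B", "C"}, {"A", "B"}, {"A", "C"}, {"B", "C"}, {"A", "B", "C", "D"}]

A, C = {"A", "B"}, {"C"}
print(support(A | C, D))                              # 2
print(confidence(A, C, D))                            # 2/3, about 0.667
print(is_strong(A, C, D, minsupp=2, minconf=0.5))     # True

With this toy database, the rule {A, B} -> {C} has support 2 and confidence 2/3, so it would be considered strong for minsupp = 2 and minconf = 0.5.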
2. Method of Maximum Likelihood
Since a sample is only a part of a population, the features of the former will generally differ from
those of the latter. The question that naturally arises is then: what can be said about the properties
of the population from knowledge of the properties of the sample? Although a satisfactory answer
to this question may not be found in all cases, in the case of random sampling it can be answered
with the help of probability theory. In sampling theory, we are primarily concerned with this very
question. The process of going from known sample information to an unknown population is
called statistical inference [18][19][27].
Suppose we have a population X which is characterized by a single parameter θ (or by a set of
parameters). The basic problem of sampling theory stated above usually presents itself in one
of two forms: (a) some feature of the population X may be completely unknown to the
experimenter, and he may want to make a guess about this feature (which is labeled θ) solely on
the basis of a random sample from the population; (b) some information of a tentative nature
regarding θ may be available to the experimenter, and he may want to see whether that information
is tenable in the light of the random sample taken from the population X. The first type of problem
is the problem of estimation [25][30], and in this paper we are primarily concerned with one
particular strategy of estimation known as maximum likelihood estimation.
The method of maximum likelihood is a convenient method for finding a good estimator. Consider
f(x1, x2, …, xn | θ), the joint probability density or probability mass of the sample observations. For
fixed θ, it may be looked upon as a function of the sample observations, and then it gives their
probability density function or probability mass function. But when x1, x2, …, xn are given, it may
also be looked upon as a function of θ, called the likelihood function of θ and denoted by L(θ | x).
The principle of maximum likelihood consists in taking that value of θ as the estimate of θ for which
L(θ | x) is a maximum. Thus if θ* is the maximum-likelihood estimate (m.l.e.) of θ, then by
definition
L(θ* | x) = max_θ L(θ | x) ……………………..(i)
(The maximum likelihood estimator is θ* when looked upon as a function of the random variables
X1, X2, …, Xn.)
In many cases it will be convenient to deal with log L(θ | x) rather than L(θ | x), and since
log L(θ | x) attains its highest value for the same value of θ as L(θ | x) does, θ* is such that
log L(θ* | x) = max_θ log L(θ | x).
This θ* will in many cases be obtainable by differentiating log L(θ | x) partially with respect
to θ and solving the two conditions given below:
∂ log L / ∂θ = 0 …………………………………………………….(ii)
∂² log L / ∂θ² < 0 when θ = θ* ………………………………………..(iii)
But one must make sure that the value obtained by solving (ii), which gives a local maximum of
L(θ | x), also gives the absolute (global) maximum, as required by (iii). Indeed, the
derivative may not exist at θ = θ*, and then this method will fail.
The shape of the log-likelihood function is important in a conceptual way. If the log-likelihood
function is relatively flat, one can make the interpretation that several (perhaps many) values of θ
are nearly equally likely. They are relatively alike; this is quantified as the sampling variance or
standard error. On the other hand, if the log-likelihood function is fairly peaked near its maximum
point, this indicates that some values of θ are relatively very likely compared to others. There is
some considerable degree of certainty implied, and this is reflected in small sampling variances and
standard errors, and narrow confidence intervals. So, the value of the log-likelihood function at its
maximum point is important, as well as the shape of the function near this maximum point.
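To make the role of curvature concrete, the following short Python sketch (an illustration added here, not taken from the paper; the sample sizes are hypothetical) evaluates the second derivative of a binomial log-likelihood at its maximum for two samples with the same 25% success proportion but different sizes. The larger sample gives a much more sharply peaked log-likelihood and therefore a smaller approximate standard error:

# Illustrative sketch: curvature of a binomial log-likelihood at its maximum.
# Two hypothetical samples with the same proportion of successes (25%) but different sizes.
import math

def log_likelihood(p, x, n):
    """log L(p | x successes in n trials), dropping the constant log nCx term."""
    return x * math.log(p) + (n - x) * math.log(1 - p)

def curvature_at_max(x, n, h=1e-4):
    """Numerical second derivative of log L at p_hat = x / n (central difference)."""
    p_hat = x / n
    return (log_likelihood(p_hat + h, x, n) - 2 * log_likelihood(p_hat, x, n)
            + log_likelihood(p_hat - h, x, n)) / h ** 2

for x, n in [(5, 20), (500, 2000)]:
    c = curvature_at_max(x, n)
    se = math.sqrt(-1 / c)   # large-sample standard error implied by the curvature
    print(f"n={n}: d2(log L)/dp2 at p_hat = {c:.1f}, approx. standard error = {se:.4f}")

For n = 20 the curvature is roughly -107 (standard error about 0.097), while for n = 2000 it is roughly -10667 (standard error about 0.0097), illustrating the flat-versus-peaked distinction described above.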
2.1 Using the Maximum Likelihood Function to calculate the confidence of an Association
Rule
The probability distribution of a random variable is a statement specifying the set of its possible
values together with their respective probabilities. When a random experiment is theoretically
assumed to serve as a model, the probabilities can be given as a function of the random variable.
The probability distribution concerned is then generally known as a theoretical distribution.
In this paper, however, we are only concerned with a particular form of distribution known as the
Binomial Distribution and how we can apply this distribution to the realm of market basket
research. To understand binomial distributions and binomial probability, let us first focus on
understanding binomial experiments.
A binomial experiment (a sequence of Bernoulli trials) is a statistical experiment that has the
following properties:
• The experiment consists of n repeated trials.
• Each trial can result in just two possible outcomes. We call one of these outcomes a success and the other a failure.
• The probability of success, denoted by p, is the same on every trial.
• The trials are independent; that is, the outcome of one trial does not affect the outcome of other trials.
The Binomial Distribution is a discrete probability distribution defined by the p.m.f
(probability mass function)
P(X = x) = f(x) = nCx p^x (1-p)^(n-x) · I(x = 0, 1, 2, …, n),
where p denotes the probability of success and I(·) is an indicator function.
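As a quick numerical illustration of this p.m.f (the values below are arbitrary and not from the paper), the probability of exactly x successes in n trials can be evaluated directly:

# Evaluate the binomial p.m.f. P(X = x) = nCx * p^x * (1-p)^(n-x) for illustrative values.
from math import comb

def binom_pmf(x, n, p):
    """Binomial p.m.f.; the indicator I(.) is handled by the range check."""
    if not 0 <= x <= n:
        return 0.0
    return comb(n, x) * p ** x * (1 - p) ** (n - x)

print(binom_pmf(3, 10, 0.25))   # probability of exactly 3 successes in 10 trials with p = 0.25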
We will take a real-life example from market basket data and show how we can apply the
binomial distribution. Let us consider a supermarket database having 100,000 point-of-sale
transactions, out of which 4,000 include both items A and B, and 1,000 of these also include item C.
The corresponding association rule “If A and B are purchased then C is purchased in the same trip”
then has a support of 1% and a confidence of 25%. We will formulate the same problem
using the binomial distribution and show that the confidence of the association rule can
be calculated using the method of Maximum Likelihood.
Let p be the probability that when items A and B are purchased, item C is also purchased. The
100,000 point-of-sale transactions can each be considered one of n repeated trials. Each of
these transactions can result either in item C being purchased whenever items A & B
are bought together, or not; each transaction therefore has two possible outcomes. The
probability of item C being bought whenever items A & B are bought is the same in
each of these transactions. Each of these transactions is independent, and the outcome of one
transaction does not affect the others. Since all the conditions of a binomial experiment are satisfied
by the real-life market basket data, we formulate the p.m.f as follows:
4000 of the transactions have items A and B and out of those 1000 include item C.
So the probability that item C occurs 1000 times in 4000 trials is
f(x) = 4000C1000 p^1000 (1-p)^(4000-1000),
where p is the probability of success of the occurrence of item C.
Since this is a single association rule, the likelihood function is given by
L = 4000C1000 p^1000 (1-p)^(4000-1000)
  = 4000C1000 p^1000 (1-p)^3000
Since L is maximized when log L is maximized, taking logarithms on both sides:
log L = log(4000C1000) + 1000 log p + 3000 log(1-p)
Differentiating both sides with respect to p:
∂ log L / ∂p = 1000/p − 3000/(1−p)
The maximum likelihood estimator p̂ is therefore obtained by setting the above derivative equal to 0:
1000/p̂ − 3000/(1−p̂) = 0
Solving for p̂,
p̂ = 0.25
In order to verify that p̂ corresponds to a maximum, we take the second derivative of the
log-likelihood function. Starting from
∂ log L / ∂p = 1000/p − 3000/(1−p)
and differentiating both sides with respect to p, we have:
∂² log L / ∂p² = −1000/p² − 3000/(1−p)²
Evaluating at p̂ = 0.25:
∂² log L / ∂p² |_(p̂ = 0.25) = −1000/(0.25)² − 3000/(1−0.25)² < 0 ………………….(iv)
So the likelihood function attains its maximum at p̂ = 0.25. In fact, the expression in (iv) is
negative for any value of p ∈ (0, 1).
In this way the maximum likelihood function can be used to calculate the confidence of an
Association Rule.
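The closed-form result p̂ = 0.25 can also be checked numerically. The sketch below (an added illustration, not the paper's procedure) maximizes log L = 1000 log p + 3000 log(1−p) over a grid of p values and recovers the rule's confidence:

# Numerical check: the binomial log-likelihood for 1000 "C purchased" outcomes
# out of 4000 "A and B purchased" transactions peaks at p = 1000/4000 = 0.25.
import math

x, n = 1000, 4000

def log_likelihood(p):
    # The constant log(4000C1000) term is dropped; it does not affect the argmax.
    return x * math.log(p) + (n - x) * math.log(1 - p)

grid = [i / 10000 for i in range(1, 10000)]      # p values in (0, 1)
p_hat = max(grid, key=log_likelihood)

print("numerical p_hat =", p_hat)                # 0.25
print("closed form x/n =", x / n)                # 0.25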
3. Using the Maximum Likelihood Function to calculate the collective Confidence of Association Rules
In most models there is more than one parameter [25]. In general, let there be K parameters
θ1, θ2, θ3, …, θK. Based on a specific model we can construct the log-likelihood,
loge(L(θ1, θ2, θ3, …, θK | data)) = loge(L)
and the K likelihood equations,
∂ log L / ∂θ1 = 0
∂ log L / ∂θ2 = 0
.
.
.
∂ log L / ∂θK = 0
The solutions of these equations give the MLEs θ̂1, θ̂2, θ̂3, …, θ̂K.
The MLEs are almost always unique; in particular this is true of multinomial-based models [21][25].
In principle, loge(L(θ1, θ2, θ3, …, θK | data)) = loge(L) defines a “surface” in K-dimensional space,
and ideas of curvature still apply (as mathematical constructs) [23][25]. However, plotting is hard
for more than 2 parameters.
Sampling variances and covariances of the MLEs are computed from the log-likelihood,
loge(L(θ1, θ2, θ3, …, θK | data)), based on its curvature at the maximum [21][24][25].
We extend this concept of MLE to market basket research: when a collection of association rules is
involved in a database D, the maximum likelihood function can be used to calculate the collective
confidence of these independent rules.
“When items A & B are purchased then item C is purchased in the same trip” is an association
rule. Let us assume that out of n transactions in which A & B are purchased together, x of those
transactions also contain item C. If p is the probability of item C occurring, then the likelihood
function of this rule can be expressed as follows:
L1 = nCx p^x (1-p)^(n-x), where x = 0, 1, 2, …, n …………………………..(v)
Say we have another independent rule which states that “When items D & E are purchased
then item C is also purchased in the same trip”. If out of m transactions in which items D & E are
purchased together, y of those transactions also contain C, then the likelihood function of this
association rule can be expressed as follows:
L2 = mCy p^y (1-p)^(m-y), where y = 0, 1, 2, …, m ..........................................(vi)
Based on both rules, the collective confidence of the two association rules can be
thought of as the maximum likelihood estimate of the probability of buying item C whenever a
transaction happens. The overall likelihood function L is given by the product of these likelihood
functions:
L = L1 · L2
  = nCx p^x (1-p)^(n-x) · mCy p^y (1-p)^(m-y)
  = (nCx mCy) · p^(x+y) · (1-p)^[(m+n)-(x+y)]
Since L is maximized when log L is maximized, taking logarithms on both sides:
log L = log(nCx mCy) + (x+y) log p + [(m+n)−(x+y)] log(1−p)
Differentiating both sides with respect to p:
∂ log L / ∂p = (x+y)/p − [(m+n)−(x+y)]/(1−p)
The maximum likelihood estimator p̂ is therefore obtained by solving:
(x+y)/p̂ − [(m+n)−(x+y)]/(1−p̂) = 0
Solving,
p̂ = (x+y)/(m+n)
In order to verify that p̂ corresponds to a maximum, we take the second derivative of the
log-likelihood function. Starting from
∂ log L / ∂p = (x+y)/p − [(m+n)−(x+y)]/(1−p)
and differentiating both sides with respect to p, we have:
∂² log L / ∂p² = −(x+y)/p² − [(m+n)−(x+y)]/(1−p)²
Evaluating at p̂ = (x+y)/(m+n):
∂² log L / ∂p² |_(p̂ = (x+y)/(m+n)) = −(x+y)/[(x+y)/(m+n)]² − [(m+n)−(x+y)]/(1−[(x+y)/(m+n)])² < 0 …….(vii)
So the likelihood function attains its maximum at p̂ = (x+y)/(m+n). In fact, the expression in (vii)
is negative for all (x, y) and all (m, n) > 0.
The collective confidence of selecting item C from the association rules is thus found to be (x+y) /
(m+n).
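The same result can be checked numerically for two rules. In the sketch below (the counts n, x, m and y are hypothetical, not data from the paper), pooling the counts gives the collective estimate p̂ = (x+y)/(m+n), and a grid search over the pooled log-likelihood agrees:

# Collective confidence of two independent rules with the same consequent C.
# Rule 1: x of n transactions containing A & B also contain C.
# Rule 2: y of m transactions containing D & E also contain C.
import math

n, x = 4000, 1000     # hypothetical counts for rule 1
m, y = 2500, 500      # hypothetical counts for rule 2

def log_likelihood(p):
    # log(L1 * L2), dropping the constant binomial-coefficient terms.
    return (x + y) * math.log(p) + ((m + n) - (x + y)) * math.log(1 - p)

grid = [i / 100000 for i in range(1, 100000)]
p_hat_numeric = max(grid, key=log_likelihood)
p_hat_closed = (x + y) / (m + n)

print("closed form (x+y)/(m+n) =", p_hat_closed)     # 1500 / 6500, about 0.2308
print("numerical maximum       =", p_hat_numeric)    # agrees to the grid resolution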
4. LIMITATIONS
The confidence of an association rule can be calculated in a simpler way if we simply know the
number of transactions containing the antecedent and, out of those, the number that also contain the
consequent. The method of Maximum Likelihood generates the same result but takes a rather more
complicated mathematical approach. Moreover, the paper does not illustrate the approach of
calculating the collective confidence of an item with real-life data.
5. CONCLUSIONS
This paper takes a different approach to calculating the confidence of association rules and,
moreover, tries to estimate the extent of association between two rules in predicting the
occurrence of a particular item. The method for calculating the collective confidence of
a particular item using the maximum likelihood function can be extended to n association rules,
and the solution will maximize the likelihood of the occurrence of the consequent item in the
database. A further extension of this paper would be to find the maximum likelihood estimator
across different association rules when the probabilities of occurrence of the consequents differ.
6. ACKNOWLEDGEMENTS
I would like to extend my sincere acknowledgements to the members of the DATASURG group at
North Dakota State University, Fargo, North Dakota for providing me with their valuable
suggestions. I would also like to extend my sincere gratitude and respect to my PhD research
advisor Dr. William Perrizo.
REFERENCES
[1] R. Agrawal, A. Arning, T. Bollinger, M. Mehta, J. Shafer and R. Srikant. The Quest Data
Mining System. In Proceedings of the Second International Conference on Knowledge Discovery
and Data Mining, August 1996.
[2] R. Agrawal, T. Imielinski and A. Swami. Mining association rules between sets of items in
large databases. In Proceedings of the ACM SIGMOD International Conference on the
Management of Data, pages 207-216, May 1993.
[3] R. Agrawal, T. Imielinski and A. Swami. Database mining: a performance perspective. IEEE
Transactions on Knowledge and Data Engineering, 5:914-925, 1993.
[4] R. Agrawal, H. Mannila, R. Srikant, H. Toivonen and A.I. Verkamo. Fast discovery of
association rules. In Fayyad et al., pages 307-328, 1996.
[5] R. Agrawal and R. Srikant. Fast algorithms for mining association rules in large databases. In
Proceedings of the 20th International Conference on Very Large Databases, pages 487-499,
September 1994.
[6] Imad Mohamad Rahal and William Perrizo. A Vertical Extensible ARM Framework for the
Scalable Mining of Association Rules. Pages 5-25.
[7] Michael Steinbach, Pang-Ning Tan, Hui Xiong and Vipin Kumar. Generalizing the Notion of
Support. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge
Discovery and Data Mining.
[8] Michael Steinbach and Vipin Kumar. Generalizing the Notion of Confidence. In Proceedings
of the Fifth IEEE International Conference on Data Mining, 27-30 Nov. 2005.
[9] Goethals, B. Survey of frequent pattern mining. N.d. Internet.
http://www.cs.helsinki.fi/u/goethals [30 December 2004]
[10] N.G. Das. Statistical Methods. M. Das and Co., 2001 edition.
[11] Notes on Statistics from N.d. Internet.
http://www.netmba.com/statistics/covariance/
[12] E.-H. Han, G. Karypis, and V. Kumar. Tr# 97-068: Min-apriori: An algorithm for finding
association rules in data with continuous attributes. Technical report, Department of Computer
Science, University of Minnesota, Minneapolis, MN, 1997.
[13] Jiawei Han , Micheline Kamber, Data mining: concepts and techniques, Morgan Kaufmann
Publishers Inc., San Francisco, CA, 2000
[14] M. J. Zaki and M. Ogihara. Theoretical foundations of association rules. In DMKD 98, pages
7:1--7:8, 1998.
[15] J. Han and Y.Fu . Discovery of multiple level association rules from large databases. In
Proceedings of the 21st International Conference on very large databases, pages 420-431,
September 1995.
[16] James W. Demmel, Applied numerical linear algebra, Society for Industrial and Applied
Mathematics, Philadelphia, PA, 1997.
[17] C. Yang, U.M. Fayyad and P.S. Bradley. Efficient discovery of error-tolerant frequent item
sets in high dimensions. In KDD 2001, pages 194-203, 2001.
[18] Notes on Statistics from N.d. Internet.
http://www.weibull.com/AccelTestWeb/mle_maximum_likelihood_parameter_estimation.htm
[19] Notes on Statistics from N.d. Internet.
http://www.autonlab.org/tutorials/mle.html
[20] Notes on Statistics from N.d. Internet.
http://cnx.org/content/m11446/latest/
[21] Efron, B. and Hinkley, D.V. (1978). “Assessing the accuracy of the maximum likelihood
estimator: Observed versus expected Fisher information”. Biometrika, 65, 457-482.
[22] Feder, P.I. (1968). “On the distribution of the log likelihood ratio test statistic when the true
parameter is near the boundaries of the hypothesis regions”. Annals of Mathematical Statistics, 39,
2044-2055.
[23] Lehmann, E.L. and Casella, G. (2001). Theory of Point Estimation. Springer, New York.
[24] Self, S.G. and Liang, K.Y. (1987). “Asymptotic properties of maximum likelihood estimators
and likelihood ratio tests under nonstandard conditions”. Journal of the American Statistical
Association, 82, 605-610.
[25] Notes on Statistics from N.d. Internet.
http://mercury.bio.uaf.edu/courses/wlf625/readings/MLEstimation.PDF
[26] Brin, S., Motwani, R. and Silverstein, C., ”Beyond Market Baskets: Generalizing Association
Rules to Correlations,” Proc. ACM SIGMOD Conf., pp. 265-276, May 1997.
[27] Cheung, D., Han, J., Ng, V., Fu, A. and Fu, Y. (1996), A fast distributed algorithm for
mining association rules, in `Proc. of 1996 Int'l. Conf. on Parallel and Distributed Information
Systems', Miami Beach, Florida, pp. 31 - 44.
[28] Chuang, K., Chen, M., Yang, W., Progressive Sampling for Association Rules Based on
Sampling Error Estimation, Lecture Notes in Computer Science, Volume 3518, Jun 2005,
Pages 505 - 515
[29] Cristofor, L., Simovici, D., Generating an informative cover for association rules. In Proc.
of the IEEE International Conference on Data Mining, 2002.
[30] Das, A., Ng, W.-K., and Woon, Y.-K. 2001. Rapid association rule mining. In Proceedings
of the tenth international conference on Information and knowledge management.
ACM Press, 474-481.