Relational Association Rule Mining in Market Basket using the RoloDex Model with pTree
Arijit Chatterjee, Mohammad Hossain, Arjun G. Roy, William Perrizo
Department of Computer Science
North Dakota State University
Fargo, North Dakota 58102
{arijit.chatterjee, mohammad.hossain, arjun.roy, william.perrizo}@ndsu.edu
Abstract
In this paper¹ we are concerned with how the RoloDex Model can be used to find relational association rules between different entities in Market Basket research using pTrees. The significance of association rules is measured via support and confidence, which are used to identify the rules in particular transactions. In this paper, however, we extrapolate that notion, extending it to multiple entities and multiple relationships using the RoloDex model, a fairly new concept introduced in this paper. We structure the paper by initially providing background on the notions of Support and Confidence in Market Basket Analysis, then introducing the concepts of the RoloDex Model and pTrees, and finally showing how the RoloDex model can be used in Market Basket research with pTrees to find multiple relationships between multiple entities.
1. INTRODUCTION
Since its introduction in 1993 by Agarwal et al. [1][2], association rule mining has continuously received a great deal of attention from the database research community. Association Rule Mining (ARM) is the data-mining process of finding interesting association and/or correlation relationships among large sets of data items. The original motivation for discovering association rules comes from the need to analyze supermarket transactions in what is known as Market Basket Research (MBR), where analysts are interested in examining customer shopping patterns in terms of the purchased products. Market basket databases consist of a large number of transactional records. In addition to the transaction identifier, each record lists all the items bought by a customer during a single visit to the store. Knowledge workers are typically interested in finding out which groups of items are consistently purchased together. Such knowledge could be useful in many business decision-making processes, such as adjusting store layouts (e.g., placing products optimally with respect to each other), running promotions, designing catalogs and identifying potential customer segments as targets for marketing campaigns.
¹ We acknowledge financial support for this research came from a Department of Energy Award (award # DE-FG52-08NA28921).
1.1 Association Rules
Association rules [1][2][3][4][5] provide information in
the form of “if-then” statements. These rules are
computed from the data and unlike the rules of logic they
are probabilistic in nature. In association analysis, the
antecedent (or the “if” part of the rule) and the consequent
(or the “then” part of the rule) are sets of items referred to
as item sets that are disjoint (i.e. do not have any item in
common). In addition to the antecedent and the
consequent, an association rule usually has statistical
interest measures that express the degree of certainty in
the rule. Two ubiquitously used measures are support and
confidence. The support of an item set is the number of
transactions that include all the items in the item set. The
support of an association rule is simply the support of the
union of items in the antecedent as well as in the
consequent. It can be either expressed as an absolute
number or as a percentage out of the total number of
transactions in the database. In statistical terms, this
expresses the statistical significance of a rule. The
confidence of an association is defined as the ratio of the
number of transactions containing all the items in the
antecedent as well as the consequent of the rule (i.e.
support of the rule) over the number of transactions that
include all the items in the antecedent only (i.e. the
support of the antecedent). Statistically, this measure
expresses the statistical strength of a rule. Alternatively,
one can think of support as the probability that a
randomly selected transaction from the database will
contain all the items in the antecedent and the consequent,
and of confidence as the conditional probability that a
randomly selected transaction will include all the items in
the consequent given that the transaction includes all the
items in the antecedent.
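These two measures can be illustrated with a minimal sketch in Python (the transaction data below is hypothetical, chosen only for illustration):

```python
# Toy transaction database: each transaction is the set of items
# bought in a single store visit (hypothetical data for illustration).
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]

def support(itemset, db):
    """Fraction of transactions containing every item in `itemset`."""
    return sum(itemset <= t for t in db) / len(db)

def confidence(antecedent, consequent, db):
    """Conditional probability of the consequent given the antecedent."""
    return support(antecedent | consequent, db) / support(antecedent, db)

# Rule {diapers} -> {beer}: support of the union over support of the antecedent.
print(round(support({"diapers", "beer"}, transactions), 2))      # 0.6
print(round(confidence({"diapers"}, {"beer"}, transactions), 2))  # 0.75
```

Here the rule holds in 3 of 5 transactions (support 0.6), and in 3 of the 4 transactions containing diapers (confidence 0.75).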
1.2 Formal Problem Statement
Formally, let I be a set of items defined in an item space [3][4][6]. A set of items S = {i1, ..., ik} belonging to I is referred to as an itemset (or a k-itemset if S contains k items). A transaction over I is defined as a couple T = (tid, ilist), with tid being the transaction identifier and ilist an itemset over I. A transaction T = (tid, ilist) is said to support an itemset S in I if S is a subset of T's ilist. A transaction database D over I is defined as a set of transactions over I.
For every itemset S, the support of S in D counts the transactions in D that support S (i.e., contain S in their ilists): support(S,D) = |{tid | (tid, ilist) in D, S a subset of ilist}|. An itemset is said to be frequent if its support is greater than or equal to a given absolute minimum support threshold minsupp, where 0 <= minsupp <= |D|. An itemset which is not known to be frequent or infrequent is referred to as a candidate frequent itemset.
Generally speaking, ARM is defined as a three-step process: (1) choosing the right set of items/level of detail, (2) finding all frequent patterns, which occur at least as frequently as a pre-determined minimum support threshold, and (3) generating strong association rules from the frequent patterns, which must satisfy the minimum confidence threshold. However, it is worth noting that a few ARM approaches do not strictly adhere to this three-step format.
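Step (2) can be sketched as a naive level-wise search in Python. This is only a brute-force illustration of what "frequent" means, not the Apriori candidate-pruning algorithm; the toy database and minsupp value are assumptions for the example.

```python
from itertools import combinations

def frequent_itemsets(db, minsupp):
    """Return every itemset whose absolute support (number of
    supporting transactions) is at least minsupp, level by level."""
    items = sorted(set().union(*db))
    frequent = {}
    k = 1
    while True:
        found_any = False
        for cand in combinations(items, k):
            s = sum(set(cand) <= t for t in db)  # absolute support
            if s >= minsupp:
                frequent[frozenset(cand)] = s
                found_any = True
        if not found_any:  # no frequent k-itemsets => stop growing k
            break
        k += 1
    return frequent

db = [{"bread", "milk"}, {"bread", "beer"}, {"bread", "milk", "beer"}]
fi = frequent_itemsets(db, 2)
```

With minsupp = 2 this yields the singletons bread (3), milk (2), beer (2) and the pairs {bread, milk} and {bread, beer}, while {milk, beer} (support 1) is pruned.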
1.3 Rule Generation
The support [7] of an association rule A->C in D, support(A->C, D), is the support of A union C in D. An association rule is called frequent if its support exceeds the given minsupp. The confidence [8] of an association rule A->C in D, confidence(A->C, D), is the conditional probability of having C contained in a transaction, given that A is contained in the same transaction: P(C|A), i.e. confidence(A->C, D) = support(A->C, D) / support(A, D). A rule is confident if its confidence exceeds a given minimal confidence threshold, minconf, where 0 <= minconf <= 1. So, given a set of items I and a transactional database D over I, we are concerned with generating the collection of strong rules in D with respect to minsupp and minconf.
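The definitions above can be sketched as a brute-force strong-rule generator (an illustrative sketch, not an optimized ARM implementation: itemset enumeration is exhaustive and only suitable for toy data, and the database here is an assumption):

```python
from itertools import combinations

def strong_rules(db, minsupp, minconf):
    """Generate rules A -> C with absolute support(A union C) >= minsupp
    and confidence >= minconf, by splitting each frequent itemset."""
    items = set().union(*db)
    rules = []
    for k in range(2, len(items) + 1):          # itemsets of size >= 2
        for cand in combinations(sorted(items), k):
            s = frozenset(cand)
            supp_s = sum(s <= t for t in db)    # support of A union C
            if supp_s < minsupp:
                continue
            for r in range(1, len(s)):          # every antecedent split
                for ante in combinations(sorted(s), r):
                    a = frozenset(ante)
                    supp_a = sum(a <= t for t in db)
                    conf = supp_s / supp_a
                    if conf >= minconf:
                        rules.append((set(a), set(s - a), conf))
    return rules

db = [{"bread", "milk"}, {"bread", "beer"}, {"bread", "milk", "beer"}]
rules = strong_rules(db, 2, 1.0)
```

With minsupp = 2 and minconf = 1.0 this produces exactly the two rules {beer} -> {bread} and {milk} -> {bread}, since bread accompanies every beer and every milk purchase in the toy data.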
2. ROLODEX MODEL AND pTREE
2.1 RoloDex Model
A RoloDex is a rotating file device that holds specially designed index cards on which the user stores the contact information of individuals. We extend the notion of the RoloDex to Association Rule Mining: the axes of the RoloDex represent entities, and the index cards represent the relationships between them. This view may facilitate research into the data mining of multiple relationships. In each RoloDex we can store the relationship between multiple entities and then compare it against other RoloDexes storing similar information. While association rule mining is mainly concerned with mining the relationship between two entities, the goal of the RoloDex model is to extend ARM to the data mining of multiple relationships. The benefit of using the RoloDex model over the DataCube model and the Relational model is that the DataCube model is not as flexible and, since it contains more nulls, it is not as easy to isolate particular relationships; the Relational model suffers from the same lack of flexibility and also lacks a pictorial representation of the data.
2.2 pTree Algorithm
Tremendous volumes of data cause a cardinality problem for conventional transaction-based ARM algorithms. For fast and efficient data processing, we transform the data into pTrees [18], a loss-less, compressed, data-mining-ready vertical data structure.
pTrees are used for fast computation of counts [20] and for masking specific phenomena. This vertical data representation consists of set structures representing the data column-by-column rather than row-by-row (horizontal relational data). Predicate trees are one choice of vertical data representation, and can be used for data mining instead of the more common sets of relational records. This data structure has been successfully applied in data mining applications ranging from Classification and Clustering with K-Nearest Neighbor, to Classification with Decision Tree Induction, to Association Rule Mining [19][20][21][22][23]. A basic pTree represents one attribute bit that is reorganized into a tree structure by recursive sub-division, while recording the predicate truth value for each division. Each level of the tree contains truth-bits that represent sub-trees and can then be used for phenomena masking and fast computation of counts. This construction is continued recursively down each tree path until downward closure is reached. For example, if the predicate is "purely 1 bits", downward closure is reached when purity is reached (either purely 1 bits or purely 0 bits); in this case, a tree branch is terminated when a sub-division is reached that is entirely pure (which may or may not be at the leaf level). These basic pTrees and their complements are combined using Boolean algebra operators such as AND (&), OR (|) and NOT (') to produce mask pTrees for individual values, individual tuples, value intervals, tuple rectangles, or any other attribute pattern. The root count of any pTree indicates the occurrence count of that pattern. The pTree data structure thus provides a means of counting patterns in an efficient, highly scalable manner.
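The leaf-level Boolean algebra and root-count operations can be sketched with plain integers as bit vectors. This is a simplified stand-in: real pTrees add the recursive purity compression described above, which this sketch omits, and the bit columns here are hypothetical.

```python
# Vertical bit-column sketch: each attribute bit column is stored as a
# Python int used as a bit vector (a stand-in for a compressed pTree).

def root_count(bits):
    """Occurrence count of the pattern the mask represents."""
    return bin(bits).count("1")

# Hypothetical 8-transaction columns: bit t is set iff transaction t
# contains the item.
bread = 0b1011_0110
milk  = 0b1101_0100

# ANDing the columns masks co-occurrence; the root count of the result
# is the number of transactions containing both items.
both = bread & milk
print(root_count(both))  # 3
```

All counting reduces to bitwise AND/OR plus a population count, which is what makes the vertical representation fast.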
2.3 RoloDex Model Illustration: a Movie Example
The following figure, Fig 1.0, illustrates the RoloDex model with a movie entity and a customer entity, where the relationship between the customer and the movie is captured by the customer-rates-movie cards. The customers have rated movieIDs 4 and 1 with the ratings shown on the corresponding axes of the RoloDex, as indicated by the dotted lines.
[Fig 1.0 Customer Rating Movie Card diagram: the customer-rates-movie card, holding each customer's rating (0-5) of each movie.]
So the customer-rating-5 movie card can be constructed from the above cards as follows.
[Fig 2.0 Customer Rating Movie 5 Card diagram: the customer-rates-movie-as-5 card, holding a 1 where a customer rated the movie 5 and a 0 otherwise, with pre-computed R5pTree counts.]
If we want to study the customers who have rated movies only as 5 against other entities, then we consider only the above customer rating card.
3. ROLODEX MODEL IN MBR USING pTREES
3.1 2-Entity 2-Relationship Multi-Relationships
In Market Basket Analysis we consider the buying patterns of customers, and we define the two major entities we are concerned with: Customers and Items. One relationship is Buy(Customer, Item), and the other relationship we have introduced is Rating(Customer, Item). The idea is that if a customer often buys an item, the customer is likely to give that item a fairly positive rating. Our rating system is defined on a scale from 0 to 5, where 0 means the customer is never going to buy the item and 5 means the customer is definitely going to buy the item. In the diagram Fig 3.0 we show the relationship between the entities Customers (C) and Items (I) using two RoloDexes: the first RoloDex has the customers and the items as its two axes and Buy as the relationship containing the customer-item cards; the second RoloDex also has the customers and the items as its axes and uses Rating as the relationship containing the Rating5 customer-item cards. The idea is to fill each index card with a true (1) or a false (0) value according to whether it satisfies the relationship condition.
[Fig 3.0 2-Entity 2-Relationship diagram: the Buy card B(C,I) and the Rating-5 card R5(C,I) over the Customer and Item axes, with pre-computed BpTree and R5pTree 1-counts.]
In the above diagram, the customer with custID 3 buys the item with itemID 4 and rates that item 5; similarly, the customer with custID 5 buys the item with itemID 1 and rates that item 5. The 1-counts of the pre-computed Buy RoloDex are 3 2 1 2 column-wise and 2 1 3 2 row-wise, and the 1-counts of the pre-computed Rating RoloDex are 1 2 3 1 column-wise and 0 1 1 2 row-wise. We are interested in whether, for i in I, we can generate strong rules for the customers. We made the assumption above that if a customer rates an item as 5, there is a good chance the customer will buy the item. We need to find the support and the confidence of this rule, and if they exceed minsupp and minconf then we can recommend that the customer buy that item. So, {c | rate(i,c)=5} => {c | buy(c,i)=yes}.
In the notation of pTrees we define the Support and the Confidence as follows:
Confidence = count(R5pTree_i & BpTree_i) / count(R5pTree_i)
Support = count(R5pTree_i) / size(R5pTree_i)
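These two formulas can be checked directly on bit vectors. The sketch below uses the item-1 bit patterns from the worked example in this section (R5pTree = 0001, BpTree = 1001) with a Customer axis of size 4; `count` stands for the pTree root count.

```python
def count(bits):
    """Root count: number of 1-bits in the mask."""
    return bin(bits).count("1")

SIZE = 4  # four customers on the Customer axis

# Bit patterns for item 1 from the paper's example:
R5 = 0b0001  # customers who rated item 1 as 5
B  = 0b1001  # customers who bought item 1

confidence = count(R5 & B) / count(R5)  # 1/1 = 1.0
support    = count(R5) / SIZE           # 1/4 = 0.25
print(confidence, support)              # 1.0 0.25
```

The single customer who rated item 1 as 5 also bought it, so the rule's confidence is 1 while its support is 0.25.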
The overall schema for this RoloDex model is as follows:
size(Customer) = size(R5pTree_i) = size(BpTree_i) = 4
size(Itemset) = size(R5pTree_c) = size(BpTree_c) = 4
If minconf = minsupp = 0.2, then for itemset {1}:
R5pTree_1 = 0001
BpTree_1 = 1001
Confidence = count(0001 & 1001) / count(0001) = 1/1 = 1
So, Confidence = 1 and Support = 1/4 = 0.25. Both of these values are greater than minsupp and minconf, so this is a strong rule. We can now include variations in the rules, such as choosing a rating of 4 instead of 5 and checking whether the generated rules are strong or not.
Moreover, instead of choosing a singleton {i} from the itemset, we might also be interested in types of itemsets, such as grocery items, food items and so on. It is a lot faster to data-mine the results, as the computation consists mainly of binary AND and OR operations. Furthermore, since we process the data vertically in the form of pTrees, we can extract the information from the RoloDex in a single pass. The rule whose strength we calculated in the previous example can be considered an expected one, as it is expected that a person who rates an item as 5 will buy the item.
Let us take some further examples where we try to mine other relationships and might be interested in finding other items the customers could be interested in. In the following diagram, Fig 4.0, there are two entities E and F, and there are two types of RoloDex cards in the two RoloDexes S(E,F) and R(E,F).
[Fig 4.0 Entity Relationship Model diagram: the cards S(E,F) and R(E,F) over the axes E and F.]
The following are some of the relations and how they can be derived using pTrees.
1. Given e in E, if R(e,f) then S(e,f):
Confidence = count(R_e & S_e) / count(R_e)
Support = count(R_e) / size(R_e)
2. If R(e,f) for all e in A, then S(e,f) for all e in B, where A and B are different entity subsets:
Confidence = count((AND_{e in A} R_e) & (AND_{e in B} S_e)) / count(AND_{e in A} R_e)
Support = count(AND_{e in A} R_e) / size(AND_{e in A} R_e)
3. If R(e,f) for all e in A, then S(e,f) for some e in B:
Confidence = count((AND_{e in A} R_e) & (OR_{e in B} S_e)) / count(AND_{e in A} R_e)
Support = count(AND_{e in A} R_e) / size(AND_{e in A} R_e)
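The three patterns can be sketched with per-entity bit masks over the F axis. Everything here is hypothetical for illustration: the masks, the entity subsets A and B, and the axis size; AND_{e in A} becomes a `reduce` over `&`.

```python
from functools import reduce
from operator import and_, or_

def count(bits):
    """Root count: number of 1-bits in the mask."""
    return bin(bits).count("1")

# Hypothetical rows of the R(E,F) and S(E,F) cards, one mask per
# E-entity, over an F axis of size 4.
R = {1: 0b1010, 2: 0b1011, 3: 0b0110}
S = {1: 0b1000, 2: 0b1100, 3: 0b1110}
A, B = {1, 2}, {3}
SIZE = 4

# Pattern 1: given e in E, if R(e,f) then S(e,f).
e = 1
conf1 = count(R[e] & S[e]) / count(R[e])

# Pattern 2: if R(e,f) for all e in A, then S(e,f) for all e in B.
andRA = reduce(and_, (R[e] for e in A))
andSB = reduce(and_, (S[e] for e in B))
conf2 = count(andRA & andSB) / count(andRA)

# Pattern 3: if R(e,f) for all e in A, then S(e,f) for some e in B.
orSB = reduce(or_, (S[e] for e in B))
conf3 = count(andRA & orSB) / count(andRA)

supp = count(andRA) / SIZE
```

Each rule again reduces to bitwise combinations of card rows followed by a root count, so the same vertical machinery covers all three patterns.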
From the RoloDex model we have obtained the relationships among various entities, and we can also calculate the Support and the Confidence of the various association rules, determining whether they are strong with respect to the chosen minsupp and minconf values.
3.2 2-Entity 3-Relationship Multi-Relationships
The usage of 2-entity 2-relationship multi-relations can be extended to 2-entity 3-relationship multi-relations. We take an example where a customer C rates (relationship) an item I, buys (relationship) the item I, and uses (relationship) the item I frequently. The following diagram is a RoloDex Model representation:
[Fig 5.0 2-Entity 3-Relationship diagram: the Buy card B(I,C), the Use card U(I,C) and the Rating-1 card R1(C,I) over the Customer and Item axes.]
For a customer C who has rated item I as 1 and bought item I, the customer will frequently use item I:
For i in I, {c | R1(c,i)=yes & B(c,i)=yes} => {c | U(c,i)=yes}
Confidence = count(R1pTree_i & BpTree_i & UpTree_i) / count(R1pTree_i & BpTree_i)
Support = count(R1pTree_i & BpTree_i) / size(R1pTree_i & BpTree_i)
So the Confidence and the Support for the rules can be calculated using pTrees in the RoloDex model, extending to three relations between two entities.
4. LIMITATIONS
In this paper we have not used real-life data, and we have not compared whether using the RoloDex model over other models, such as the DataCube model, gains any accuracy in results in terms of either prediction or speed. We stated that by using pTrees we gain a computational benefit, as we are basically doing binary operations, but we have not provided evidence for this claim. This research is still in its early phases and lacks evaluation and testing.
5. CONCLUSIONS
The concept of using the RoloDex model in Market Basket research with pTrees is new research first presented in this paper. We plan to extend the concept to n entities and n relationships and to see how we can provide better and more accurate results using pTrees. We also plan to extend this concept beyond Market Basket research to other areas of ARM, where we would like to study the Support and Confidence of various association rules and gather important results.
6. REFERENCES
[1] R. Agarwal, A. Arning, T. Bollinger, M. Mehta, J. Shafer and R. Srikant. The Quest Data Mining System. In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, August 1996.
[2] R. Agarwal, T. Imielinski and A. Swami. Mining association rules between sets of items in large databases. In Proceedings of the ACM SIGMOD International Conference on the Management of Data, pages 207-216, May 1993.
[3] R. Agarwal, T. Imielinski and A. Swami. Database Mining: a performance perspective. IEEE Transactions on Knowledge and Data Engineering, 5:914-925, 1993.
[4] R. Agarwal, H. Mannila, R. Srikant, H. Toivonen and A. I. Verkamo. Fast Discovery of Association Rules. In Fayyad et al., pages 307-328, 1996.
[5] R. Agarwal and R. Srikant. Fast algorithms for mining association rules in large databases. In Proceedings of the 20th International Conference on Very Large Databases, pages 487-499, September 1994.
[6] Imad Mohamad Rahal and William Perrizo. A Vertical Extensible ARM Framework for the Scalable Mining of Association Rules. Pages 5-25.
[7] Michael Steinbach, Pang-Ning Tan, Hui Xiong and Vipin Kumar. Generalizing the Notion of Support. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.
[8] Michael Steinbach and Vipin Kumar. Generalizing the Notion of Confidence. Fifth IEEE International Conference on Data Mining, 27-30 Nov. 2005.
[9] B. Goethals. Survey of frequent pattern mining. N.d. Internet. http://www.cs.helsinki.fi/u/goethals [30 December 2004].
[10] N. G. Das. A Book on Statistical Methods. 2001 Edition. Publisher M. Das and Co.
[11] Notes on Statistics. N.d. Internet. http://www.netmba.com/statistics/covariance/
[12] E.-H. Han, G. Karypis, and V. Kumar. TR# 97-068: Min-apriori: An algorithm for finding association rules in data with continuous attributes. Technical report, Department of Computer Science, University of Minnesota, Minneapolis, MN, 1997.
[13] Jiawei Han and Micheline Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann Publishers Inc., San Francisco, CA, 2000.
[14] M. J. Zaki and M. Ogihara. Theoretical foundations of association rules. In DMKD 98, pages 7:1-7:8, 1998.
[15] J. Han and Y. Fu. Discovery of multiple level association rules from large databases. In Proceedings of the 21st International Conference on Very Large Databases, pages 420-431, September 1995.
[16] James W. Demmel. Applied Numerical Linear Algebra. Society for Industrial and Applied Mathematics, Philadelphia, PA, 1997.
[17] C. Yang, U. M. Fayyad and P. S. Bradley. Efficient discovery of error-tolerant frequent itemsets in high dimensions. In KDD 2001, pages 194-203, 2001.
[18] PTree Application Programming Interface Documentation, North Dakota State University. http://midas.cs.ndsu.nodak.edu/~datasurg/pTree/
[19] Q. Ding, M. Khan, A. Roy, and W. Perrizo. The PTree Algebra. Proceedings of the ACM Symposium on Applied Computing, pp. 426-431, 2002.
[20] A. Perera and W. Perrizo. Parameter Optimized, Vertical, Nearest Neighbor Vote and Boundary Based Classification. CATA, 2007.
[21] A. Perera, T. Abidin, G. Hamer and W. Perrizo. Vertical Set Square Distance Based Clustering without Prior Knowledge of K. 14th International Conference on Intelligent and Adaptive Systems and Software Engineering (IASSE 05), Toronto, Canada, 2004.
[22] W. Perrizo, G. Wettstein, A. Perera and T. Lu. The Universality of Nearest Neighbor Sets in Classification and Prediction. Software Engineering and Data Engineering, 2009.
[23] Y. Wang, T. Lu and W. Perrizo. A Novel Combinatorial Score for Feature Selection with PTree in DNA Microarray Data Analysis. Software Engineering and Data Engineering, 2010.