Overview
Frequent item set mining
Christian Borgelt∗
Frequent item set mining is one of the best known and most popular data mining
methods. Originally developed for market basket analysis, it is used nowadays
for almost any task that requires discovering regularities between (nominal) variables. This paper provides an overview of the foundations of frequent item set
mining, starting from a definition of the basic notions and the core task. It continues by discussing how the search space is structured to avoid redundant search,
how it is pruned with the a priori property, and how the output is reduced by
confining it to closed or maximal item sets or generators. In addition, it reviews
some of the most important algorithmic techniques and data structures that were
developed to make the search for frequent item sets as efficient as possible. © 2012 Wiley Periodicals, Inc.
How to cite this article:
WIREs Data Mining Knowl Discov 2012, 2: 437–456 doi: 10.1002/widm.1074
INTRODUCTION
It is hardly an exaggeration to say that the popular research area of data mining was started by
the tasks of frequent item set mining and association rule induction. At the very least, these tasks have
a strong and long-standing tradition in data mining
and knowledge discovery in databases, and triggered
an abundance of publications in data mining conferences and journals. The huge research efforts devoted
to these tasks had considerable impact and led to a variety of sophisticated and efficient algorithms to find
frequent item sets. Among the best-known methods
are Apriori,1,2 Eclat,3–5 FP-Growth (Frequent Pattern
Growth),6–9 and LCM (Linear time Closed item set
Miner)10–12 ; but there is also an abundance of alternatives. A curious historical aspect is that researchers in
the area of neurobiology came very close to frequent
item set mining as early as 1978 with the accretion
algorithm,13 thus preceding Apriori by as much as 15
years.
This paper surveys some of the most important
ideas, algorithmic concepts, and data structures in
this area. The material is structured as follows: Basic Notions introduces the basic notions such as item
base, transaction, and support; formally defines the
frequent item set mining problem; shows how the
search space can be structured to avoid redundant
search; and reviews how the output can be reduced
by confining it to closed or maximal item sets or generators. Item Set Enumeration derives the general top-down search scheme for item set enumeration from
the fundamental properties of the support measure,
resulting in breadth-first and depth-first search, with
the subproblem and item order providing further distinctions. Database Representations reviews different data structures by which the initial as well as
conditional transaction databases can be represented
and how these are processed in the search. Advanced
Techniques collects several advanced techniques that
have been developed to make the search maximally
efficient, including perfect extension pruning, conditional item reordering, the k-items machine, and special output schemes. Intersecting Transactions briefly
surveys intersecting transactions as an alternative to
item set enumeration for finding closed (and maximal)
item sets, which can be preferable in the presence of
(very) many items. Extensions discusses selected extensions of the basic approaches, such as association
rule induction, alternatives to item set support, association rule and item set ranking and filtering methods,
and fault-tolerant item sets. Finally, Summary summarizes this survey.
∗Correspondence to: [email protected]
European Centre for Soft Computing, Edificio de Investigación, Calle Gonzalo Gutiérrez Quirós s/n, Mieres, Asturias, Spain.
E-mail: [email protected], [email protected]; WWW: http://www.borgelt.net/, http://www.softcomputing.es/
BASIC NOTIONS
Frequent item set mining is a data analysis method
that was originally developed for market basket
analysis. It aims at finding regularities in the shopping behavior of the customers of supermarkets, mail-order companies, and online shops. In particular, it
tries to identify sets of products that are frequently
bought together. Once identified, such sets of associated products may be used to optimize the organization of the offered products on the shelves of a
supermarket or the pages of a mail-order catalog or
Web shop, or may give hints which products may
conveniently be bundled. However, frequent item set
mining may be used for a much wider variety of tasks,
which share that one is interested in finding regularities between (nominal) variables in a given data set.
Problem Definition
Formally, frequent item set mining is the following
task: we are given a set B = {i1 , . . ., in } of items,
called the item base, and a database T = (t1 , . . ., tm )
of transactions. An item may, for example, represent
a product. In this case, the item base represents the
set of all products offered by a supermarket. The term
item set refers to any subset of the item base B. Each
transaction is an item set and may represent, in the
supermarket setting, a set of products that has been
bought by a customer. As several customers may have
bought the same set of products, the total of all transactions must be represented as a vector (as above) or
as a multiset. Alternatively, each transaction may be
enhanced by a transaction identifier (tid). Note that
the item base B is usually not given explicitly, but only
implicitly as the union of all transactions, that is, B =
∪k∈{1,...,m} tk .
The cover KT (I) = {k ∈ {1, . . ., m} | I ⊆ tk }
of an item set I ⊆ B indicates the transactions it is
contained in. The support sT (I) of I is the number of
these transactions and hence sT (I) = |KT (I)|. Given a
user-specified minimum support smin ∈ N, an item set
I is called frequent (in T) iff sT (I) ≥ smin . The goal of
frequent item set mining is to find all item sets I ⊆ B
that are frequent in the database T and thus, in the
supermarket setting, to identify all sets of products
that are frequently bought together. Note that frequent item set mining may be defined equivalently
based on the (relative) frequency σ T (I) = sT (I)/m
of an item set I and a corresponding lower bound
σ min .
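To make these notions concrete, here is a minimal Python sketch (an illustration of the definitions, not code from the paper; the transaction database is hypothetical) that computes covers and supports and enumerates the frequent item sets by brute force:

```python
from itertools import combinations

def cover(I, T):
    """K_T(I): the indices of all transactions that contain item set I."""
    return [k for k, t in enumerate(T) if I <= t]

def support(I, T):
    """s_T(I) = |K_T(I)|."""
    return len(cover(I, T))

def frequent_item_sets_naive(T, smin):
    """Enumerate the power set of the item base and filter by support.
    Feasible only for tiny item bases; it merely illustrates the task."""
    B = sorted(set().union(*T))   # item base: union of all transactions
    return {frozenset(c): support(set(c), T)
            for r in range(1, len(B) + 1)
            for c in combinations(B, r)
            if support(set(c), T) >= smin}

# a small hypothetical database over the item base {a, b, c, d, e}
T = [set(t) for t in ("ad", "abd", "bcd", "bc", "abde",
                      "bce", "bd", "ae", "abd", "cd")]
print(frequent_item_sets_naive(T, smin=3))
```

The sections that follow show how the exponential candidate space of this brute-force scheme is avoided in practice.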
As an illustration, Figure 1 shows a simple transaction database with 10 transactions over the item
base B = {a, b, c, d, e}. With a minimum support
of smin = 3, a total of 16 frequent item sets can be
found in this database, which are shown, together
with their support values, in the table on the right in
Figure 1. Note that the empty set is often discarded (not reported) because it is trivially contained in all transactions and thus not informative.

Figure 1 | (a) A simple example database with 10 transactions (market baskets, shopping carts) over the item base B = {a, b, c, d, e} and (b) the frequent item sets that can be found in it if the minimum support is chosen to be smin = 3 (the numbers state the support of these item sets).
Search Space and Support Properties
It is immediately clear that simply generating every
candidate item set in the power set 2^B, determining
its support, and filtering out the infrequent sets is
computationally infeasible, because even small supermarkets usually offer thousands of different products.
To make the search efficient, one exploits a fairly obvious property of item set support, namely that it is
antimonotone: ∀I ⊆ J ⊆ B: sT (I) ≥ sT (J), regardless
of the transaction database T. In other words, if an
item set is extended (if another item is added to it),
its support cannot increase. Together with the user-specified minimum support, we immediately obtain
the Apriori property1,2 : ∀I ⊆ J ⊆ B: sT (I) < smin
⇒ sT (J) < smin , that is, no superset of an infrequent item set can be frequent. It follows that the set
FT (smin ) of item sets that are frequent in a database T
w.r.t. minimum support smin is downward closed:
∀I ∈ FT (smin ) : J ⊆ I ⇒ J ∈ FT (smin ).
As a consequence, the search space is naturally
structured according to the subset relationships between item sets, which form a partial order on 2^B.
This partial order can be nicely depicted as a Hasse
diagram, which is essentially a graph, in which each
item set I ⊆ B forms a node and there is an edge
between the nodes for two sets I, J with I ⊂ J iff there is no K with I ⊂ K ⊂ J (that is, iff J contains exactly one more item than I). As an example, Figure 2(a) shows
a Hasse diagram for the partial order of 2^B for B =
{a, b, c, d, e}, the item base underlying Figure 1.
Due to the Apriori property (or the fact that
the set of frequent item sets is downward closed),
the frequent item sets are dense at the top of such a
Hasse diagram (see Figure 2b). Thus frequent item
sets are naturally found by a top-down search in this
structure.
Figure 2 | (a) Hasse diagram for the partial order induced by ⊆ on 2^{a,b,c,d,e} and (b) frequent item sets for the database shown in Figure 1 and smin = 3.
Figure 3 | Schematic illustration of (a) maximal and (b) closed item sets, demonstrating their relation.
Closed and Maximal Item Sets and
Generators
An annoying problem in frequent item set mining is
that the number of frequent item sets is often huge
and thus the output can easily exceed the size of the
transaction database to mine. To mitigate this problem, several restrictions of the set of frequent item sets
have been suggested. A frequent item set I ∈ FT (smin )
is called
• a maximal (frequent) item set14–18 iff ∀J ⊃ I: sT(J) < smin;
• a closed (frequent) item set19–24 iff ∀J ⊃ I: sT(J) < sT(I);
• a (frequent) generator19,25–29 iff ∀J ⊂ I: sT(J) < sT(I).
Due to the contraposition of the Apriori property, that is, ∀I ⊆ J ⊆ B : sT (J) ≥ smin ⇒ sT (I) ≥ smin ,
the set of all frequent item sets can easily be recovered from the set MT(smin) of maximal item sets as FT(smin) = ⋃_{I ∈ MT(smin)} 2^I. However, for the support of nonmaximal item sets it only preserves a lower bound: ∀I ∈ FT(smin): sT(I) ≥ max{sT(J) | J ∈ MT(smin) ∧ J ⊇ I}. As an intuitive illustration,
the schematic Hasse diagram in Figure 3(a) shows the
maximal item sets as red dots, whereas all frequent
item sets are shown as a blue region.
The set CT(smin) of all closed item sets, however, also preserves knowledge of all support values according to ∀I ∈ FT(smin): sT(I) = max{sT(J) | J ∈ CT(smin) ∧ J ⊇ I}, because any frequent item set
is either closed or possesses a uniquely determined
closed superset. Note that maximal item sets are
obviously also closed, but not vice versa. The exact relationship of closed and maximal item sets is: CT(smin) = ⋃_{s ∈ {smin, smin+1, ..., m}} MT(s), where m is the
total number of transactions. This relationship is illustrated schematically in Figure 3(b).
As an illustration of the huge savings that can
result from restricting the output to closed or even
maximal item sets, Figure 4 shows the number of
frequent, closed, and maximal item sets for a common
benchmark data set (BMS-Webview-1, see Ref 30;
note the logarithmic scale).
Figure 4 | The number of frequent, closed, and maximal item sets on a common benchmark data set (BMS-Webview-1).
Figure 5 | (a) A subset tree that results from assigning a unique parent to each item set and (b) a prefix tree in which sibling nodes with the same prefix are merged.
The set GT (smin ) of generators (or free item sets)
has the convenient property that it is downward
closed, that is, ∀I ∈ GT (smin ) : J ⊆ I ⇒ J ∈ GT (smin ).
Unfortunately, though, GT (smin ) does not preserve
knowledge which item sets are frequent. However,
because every generator possesses (like any frequent
item set) a uniquely determined closed superset with
the same support, one may report with each generator the difference to this closed superset. In this way,
one preserves the same knowledge as with the set
of closed item sets. Note, however, that |GT (smin )| ≥
|CT (smin )|, although recovering the support values
from a generator-based output may be somewhat easier or more efficient than recovering them from the
set of closed item sets. Note also that generators are
often induced alone (i.e., without reporting the difference to their closed supersets), because they are seen
as more useful features for classification purposes, as
they contain fewer items than closed sets.
Alternative lossless compressions of the set of
frequent item sets, which can reduce the output even
more, are, for example, nonderivable item sets31 and
closed nonderivable item sets.32 However, the better
the compression, the more complex it usually becomes
to derive the support values of item sets that are not
directly reported.
ITEM SET ENUMERATION
Due to the Apriori property, most frequent item set
mining algorithms search top-down in (the Hasse diagram representing) the partial order ⊆ on 2^B, thus
enumerating the (frequent) item sets. A notable alternative (intersecting transactions) is discussed in Intersecting Transactions.
Organizing the Search
Searching the partial order ⊆ on 2^B top-down means
growing item sets from the empty set or from single
items toward the item base B. However, a naive implementation of this scheme leads to redundant search
because the same item set can be constructed multiple
times by adding its items in different orders. To eliminate this redundancy, the Hasse diagram representing
the partial order is reduced to a tree by assigning a
unique parent set π (I) to every item set I ⊆ B. This
is achieved by choosing an arbitrary, but fixed order
of the items and setting π (I) = I − {max (I)}. As an
example, Figure 5(a) shows the resulting item subset tree for B = {a, b, c, d, e} and the item order
a < b < c < d < e. Clearly, this tree is a prefix tree because sibling sets differ only in their last item. Thus it
is often convenient to combine these siblings into one
node as shown in Figure 5(b). This structure is used
in the following to explain and illustrate the different
search schemes.
Breadth-First/Levelwise Search
The most common ways to traverse the nodes of a tree
are breadth-first and depth-first. The former is used
in the Apriori algorithm,1,2 which derives its name
from the Apriori property. It is most conveniently implemented with a data structure representing a prefix
tree such as the one shown in Figure 5(a). This tree is
built level by level. A new level is added by creating a
child node for every frequent item set on the current
level. From these child nodes all item sets are deleted
that possess an infrequent subset (a priori pruning).
Then the transaction database is accessed to count the
support of the remaining candidate item sets. This
is usually done by traversing the transactions and
constructing, with a doubly recursive procedure, all
subsets of the size that corresponds to the depth of the
new tree level. Afterward, all item sets that are found
to be infrequent are eliminated (a posteriori pruning).
Technical and optimization details can be found, for
example, in Refs 33–37.
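The levelwise scheme can be rendered compactly in Python (an illustrative sketch of the technique, not the original implementation; it works on plain sets and omits the prefix-tree data structure and the optimizations of Refs 33–37):

```python
from itertools import combinations

def apriori(T, smin):
    """Levelwise (breadth-first) search with a priori and
    a posteriori pruning; T is a list of transactions (sets)."""
    level = {frozenset([i]) for i in set().union(*T)}
    frequent = {}
    while level:
        # a posteriori pruning: count supports, keep frequent candidates
        counts = {c: sum(1 for t in T if c <= t) for c in level}
        current = {c: s for c, s in counts.items() if s >= smin}
        frequent.update(current)
        # next level: join sets differing in one item, then apply
        # a priori pruning (every subset must be frequent)
        level = {a | b for a, b in combinations(current, 2)
                 if len(a | b) == len(a) + 1
                 and all((a | b) - {i} in current for i in a | b)}
    return frequent
```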
As an example, Figure 6 shows two steps of
the Apriori algorithm for the transaction database
shown in Figure 1, namely, adding the second and
the third level (item sets with two and three items).
A posteriori pruning is indicated by red crosses, and a priori pruning by blue ones.

Figure 6 | Two steps of the Apriori algorithm: (a) adding the second level; (b) adding the third level (blue: a priori pruning, red: a posteriori pruning for smin = 3).
It should be noted, though, that nowadays
the Apriori algorithm is mainly of historical interest as one of the first frequent item set mining and
association rule induction algorithms, because its performance (in terms of both speed and memory consumption) usually cannot compete with that of state-of-the-art depth-first approaches.
Depth-First Search
Although a depth-first version of Apriori has been
suggested,38 depth-first search is usually known
from algorithms such as Eclat,3–5 FP-Growth,6–9
LCM,10–12 and many others. The general approach
can be seen as a simple divide-and-conquer scheme.
For a chosen item i, the problem to find all frequent
item sets is split into two subproblems: (1) find all
frequent item sets containing i and (2) find all frequent item sets not containing i. Each subproblem
is then further divided based on another item j: find
all frequent item sets containing (1.1) both i and j,
(1.2) i, but not j, (2.1) j, but not i, (2.2) neither i
nor j, and so on. This division scheme is illustrated in Figure 7.

Figure 7 | Illustration of the divide-and-conquer approach to find frequent item sets: (a) first split, (b) second split; blue: split item/prefix, green: first subproblem (include split item), red: second subproblem (exclude split item).
All subproblems occurring in this recursion can
be defined by a conditional transaction database and
a prefix. The prefix is a set of items that has to be
added to all frequent item sets that are discovered
in the conditional transaction database. Formally, all
subproblems are pairs S = (C, P), where C is a conditional database and P ⊆ B is a prefix. The initial
problem, with which the recursion is started, is S =
(T, ∅), where T is the given transaction database and
the prefix is empty.
A subproblem S0 = (C0 , P0 ) is processed as follows: choose an item i ∈ B0 , where B0 is the set of
items occurring in C0 . This choice is, in principle,
arbitrary, but often follows some predefined order
of the items. If sC0 ({i}) ≥ smin , then report the item
set P0 ∪ {i} as frequent with the support sC0 ({i}), and
form the subproblem S1 = (C1 , P1 ) with P1 = P0 ∪ {i}.
The conditional database C1 comprises all transactions in C0 that contain the item i, but with the item
i removed. This also implies that transactions that
contain no other item than i are entirely removed:
no empty transactions are ever kept. If C1 is not
empty, process S1 recursively. In any case (i.e., regardless of whether sC0 ({i}) ≥ smin or not), form the
subproblem S2 = (C2 , P2 ), where P2 = P0 . The conditional database C2 comprises all transactions in
C0 (including those that do not contain the item i),
but again with the item i (and resulting empty transactions) removed. If C2 is not empty, process S2
recursively.
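The following Python sketch (my own rendering under the assumption of a simple horizontal representation; the representations discussed later merely change how C1 and C2 are built) implements exactly this subproblem processing:

```python
def mine(C, P, smin, report):
    """Process subproblem S = (C, P): C is a conditional transaction
    database (list of item sets), P the prefix to prepend to all
    frequent item sets that are found in C."""
    if not C:
        return
    i = min(set().union(*C))                # choose a split item
    s = sum(1 for t in C if i in t)         # support of i in C
    if s >= smin:
        report(P | {i}, s)
        # S1: transactions containing i, with i (and empties) removed
        C1 = [t - {i} for t in C if i in t and t != {i}]
        mine(C1, P | {i}, smin, report)
    # S2: all transactions, with i (and resulting empties) removed
    C2 = [t - {i} for t in C if t != {i}]
    mine(C2, P, smin, report)

T = [set(t) for t in ("ad", "abd", "bcd", "bc", "abde")]
mine(T, set(), 2, lambda I, s: print(sorted(I), s))
```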
Note that in this search scheme, due to the
conditional transaction databases, one needs only to
count the support of individual items (or singleton
sets). As an illustration, Figure 8 shows the (top levels of) a subproblem tree resulting from this divide-and-conquer search. The indices of the transaction
databases T indicate which (split) items have been
included (no bar) or excluded (bar).
Figure 8 | (Top levels of) the subproblem tree of the divide-and-conquer scheme (with a globally fixed item order).
Order of the Subproblems
Whereas it is irrelevant in which order the child nodes
are traversed in a breadth-first/levelwise scheme (since
the whole next level is needed before the support
counting can start), the depth-first search scheme allows for a choice whether the subproblem S1 or the
subproblem S2 is processed first. In the former case,
the item sets are considered in lexicographic order
(w.r.t. the chosen item order), in the latter case this
order is reversed. At first sight, this does not seem
to make much of a difference. However, depending
on how conditional transaction databases are represented, processing subproblem S2 first can be advantageous, because then one may be able to use (part
of) the memory storing the conditional database C0
to store the conditional databases C1 and C2 .
Intuitively, this can be understood as follows: C2
is essentially C0 , only that the split item is eliminated
or ignored. Hence it may not require extra memory,
as C0 (properly viewed) may be used instead (examples are provided in Vertical Representation and
Hybrid Representations). C1 is never larger than C2
as it is a subset of the transactions in C2 , namely,
those that contain the split item in C0 . Therefore, one
may reuse C2 ’s (and thus C0 ’s) memory for representing and processing C1 . Hence, processing subproblem
S2 before S1 can lead to effectively constant memory requirements (all computations are carried out on
the memory storing the initial transaction database).
This is demonstrated by LCM10–12 and the so-called
top-down version of FP-Growth.39 However, if S1 is
solved before S2 , representing the transaction selection needed for S1 may need extra memory, because
the transaction database without this selection (i.e.,
all transactions) needs to be maintained for later solving S2 .
The traversal order can also be relevant if the
output is to be confined to closed or maximal item
sets, namely, if this restriction is achieved by a repository of already found closed item sets, which is
queried for a superset with the same support: in this
case S1 must be processed before S2. Analogously, if one filters for generators with a repository that is queried for a subset with the same support, S2 must be processed before S1.

Order of the Items
Apart from the order in which the subproblems are
processed, the order in which the items are used to
split the subproblems (or the order used to assign
unique parents to single out a prefix tree from the
Hasse diagram, see Figure 5) can have a considerable
impact on the time needed for the search. Experimentally, it was determined very early that it is usually
best to process the items in the order of increasing
frequency or, even better, in the order of increasing
size sum of the transactions they are contained in. The
reason for this behavior is that the average size of the
conditional transaction databases tends to be smaller
if the items are processed in this order. This observation holds for basically all database representations
(see Vertical Representation for more details).
Note, however, that the order of the items influences only the search time, not the result of the algorithms. To emphasize this fact, in the following some
algorithms will be illustrated with a default alphabetic order of the items, others with items reordered
according to their frequency in the given transaction
database.
DATABASE REPRESENTATIONS
Algorithms that enumerate (frequent) item sets with
the divide-and-conquer scheme outlined in Depth-First Search (such as Eclat, FP-growth, LCM, and so
on) differ in how conditional transaction databases
are represented: horizontally (Horizontal Representation), vertically (Vertical Representation), or in a
hybrid fashion (Hybrid Representations), which combines a horizontal and a vertical representation. With
the general algorithmic scheme of subproblem splits
(see Depth-First Search), all that is needed in this section to obtain a concrete algorithm is to specify how
the conditional transaction databases are constructed
that are needed for the two subproblems.
Horizontal Representation
In a horizontal representation, a transaction database
is stored as a list (or array) of transactions, each of
which lists the items contained in it. This is the most
obvious form of storing a transaction database, which
was used in the problem definition in Problem Definition and for the example in Figure 1. The Apriori algorithm as well as some transaction intersection
approaches naturally use this representation. However, it can also be used in a divide-and-conquer/
depth-first algorithm, as is demonstrated by the SaM
(Split and Merge) algorithm.40
The SaM algorithm requires some preprocessing
of the transaction database, which usually includes
reordering the items according to their frequency in
the transaction database. The steps of this preprocessing are illustrated for a simple example database in
Figure 9, together with the final data structure, which
is a simple array of transactions. Note, however, that
equal transactions are merged and their multiplicity is kept in a counter.

Figure 9 | Preprocessing a transaction database: (a) original form; (b) item frequencies; (c) transactions with sorted items and infrequent items eliminated (smin = 3); (d) lexicographically sorted reduced transactions; (e) data structure used in the SaM (Split and Merge) algorithm.
How this list is processed in a subproblem split
is demonstrated in Figure 10: The left part shows splitting off the transactions starting with the split item e
(first subproblem; note that due to the lexicographic
sorting done in the preprocessing all of these transactions are consecutive). The right part shows how
the split-off transactions, with the split item removed,
are merged with the rest of the transaction list (second subproblem). The merge operation is essentially a
single phase of the mergesort sorting algorithm, with
the only difference that it combines equal transactions
(or transaction suffixes). It also ensures that the transactions (suffixes) are lexicographically sorted for the next subproblem split.

Figure 10 | The basic operations of the SaM (Split and Merge) algorithm: split (left; first subproblem, include split item) and merge (right; second subproblem, exclude split item).
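The two operations can be sketched in Python as follows (illustrative; transactions are assumed to be weight–suffix pairs kept in lexicographic order, with the split item being the first item of the current item order):

```python
def sam_split(db, item):
    """Split step: the transactions starting with the split item are
    consecutive at the front of the lexicographically sorted array."""
    s = sum(w for w, t in db if t and t[0] == item)        # item support
    head = [(w, t[1:]) for w, t in db
            if t and t[0] == item and len(t) > 1]          # nonempty suffixes
    rest = [(w, t) for w, t in db if not t or t[0] != item]
    return head, rest, s

def sam_merge(head, rest):
    """Merge step: one pass of mergesort that combines equal
    transaction suffixes by adding their weights."""
    out, i, j = [], 0, 0
    while i < len(head) and j < len(rest):
        if head[i][1] < rest[j][1]:
            out.append(head[i]); i += 1
        elif head[i][1] > rest[j][1]:
            out.append(rest[j]); j += 1
        else:                                # equal: combine weights
            out.append((head[i][0] + rest[j][0], head[i][1]))
            i += 1; j += 1
    return out + head[i:] + rest[j:]
```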
Vertical Representation
In a vertical representation of a transaction database,
the items are first referred to with a list (or array)
and for each item the transactions containing it are
listed. As a consequence, a vertical scheme essentially represents the item covers KT ({i}) for all i ∈ B.
As an example, Figure 11(b) shows the vertical representation of the example transaction database of
Figure 11(a) (the second row states the support values of the items).

Figure 11 | Split into subproblems in the Eclat algorithm: (a) transaction database; (b) vertical representation with split item in blue; (c) conditional transaction database after intersection with split item list (first subproblem: include split item); (d) transaction index list of split item removed (second subproblem: exclude split item).
A vertical representation is used in the Eclat
algorithm3 and its variants. The conditional transaction database for the first subproblem is constructed
by intersecting the list of transaction indices for the
split item with the lists of transaction indices for
the other items. This is illustrated in Figure 11(b),
in which the transaction index list for the split item
is highlighted in blue, and Figure 11(c), which shows
the result of the intersections and thus the conditional
transaction database for the first subproblem. Constructing the conditional transaction database for the
second subproblem is particularly simple: one merely
has to delete the transaction index list for the split
item, see Figure 11(d).
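A minimal Python rendering of this scheme (illustrative; `tidlists` maps each item to the set of indices of the transactions containing it, and the loop formulation makes the second subproblem implicit by simply moving on to the next item):

```python
def eclat(tidlists, P, smin, report):
    """Recursive Eclat sketch on a vertical representation."""
    # process the items in the order of increasing support
    items = sorted(tidlists, key=lambda i: len(tidlists[i]))
    for k, i in enumerate(items):
        if len(tidlists[i]) < smin:
            continue                         # infrequent: skip item
        report(P | {i}, len(tidlists[i]))
        # first subproblem: intersect the split item's index list
        # with the lists of all items that follow it
        cond = {j: tidlists[j] & tidlists[i] for j in items[k + 1:]}
        cond = {j: s for j, s in cond.items() if s}
        if cond:
            eclat(cond, P | {i}, smin, report)
        # second subproblem: drop item i and continue the loop

# example: vertical representation of five transactions
tidlists = {"a": {0, 1, 4}, "b": {1, 2, 3, 4},
            "c": {2, 3}, "d": {0, 1, 2, 4}}
eclat(tidlists, set(), 2, lambda I, s: print(sorted(I), s))
```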
The execution time of the Eclat algorithm depends mainly on the length of the transaction index
lists: the shorter these lists, the faster the algorithm.
This explains why it is advantageous to process the
items in the order of increasing frequency (see Order
of the Items): if a split item has a low frequency, it has
a short transaction index list. Intersecting this list with
the transaction index lists of other items cannot yield
a list that is longer than the list of the split item (this
is another form of the Apriori property). Therefore, if
the first split item has a low support, the transaction
index lists that have to be processed in the recursion
for the first problem have a low average length. Hence
one should strive to use items with a low support as
early as possible, that is, when there are still many
other items, so that the lists of many items are reduced
by the intersection. Only later, when fewer items remain (in the branches for the second subproblem),
items with higher support, which yield longer intersection lists, are processed, thus reducing the number of long lists. In this way the average length of
the transaction lists is reduced. This insight, applied
to the general average size of conditional transaction
databases, can be transferred to other algorithms as
well.
Especially, if the transaction database to mine is
dense (that is, if the average transaction size (number
of items) is large relative to the size of the item base),
it can happen that intersecting two transaction index
lists removes fewer transaction indices than it keeps.
This observation led to the idea of representing a conditional transaction database not by covers, but by
so-called diffsets (short for difference sets).4 Diffsets
are defined as follows: ∀I ⊆ B : ∀a ∈ B − I : DT (a | I) =
KT (I) − KT (I ∪ {a}). In other words, DT (a | I) contains
the indices of the transactions that contain I but not
a. With diffsets, the support of direct supersets of I
can be computed as ∀I ⊆ B : ∀a ∈ B − I : sT (I ∪ {a}) =
sT (I) − |DT (a | I)|, and the diffsets for the next level
(subproblems) can be computed with the help of
∀I ⊆ B : ∀a, b ∈ B − I, a ≠ b : DT (b | I ∪ {a}) = DT (b | I) − DT (a | I).
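These three formulas translate directly into code; the following Python fragment (an illustration with hypothetical function names, not library code) shows the initialization from the item covers and the two update rules:

```python
def initial_diffsets(covers, m):
    """D_T(a | ∅) = {0,...,m-1} − K_T({a}), one diffset per item."""
    all_tids = set(range(m))
    return {a: all_tids - cov for a, cov in covers.items()}

def child_support(parent_support, diff_a):
    """s_T(I ∪ {a}) = s_T(I) − |D_T(a | I)|."""
    return parent_support - len(diff_a)

def child_diffset(diff_b, diff_a):
    """D_T(b | I ∪ {a}) = D_T(b | I) − D_T(a | I)."""
    return diff_b - diff_a          # set difference of index sets
```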
As an alternative, Eclat can be improved for
dense transaction databases by transferring certain elements of the FP-Growth algorithm. This extension,
which uses lists of transaction ranges instead of plain
lists of transaction indices, is described in the following section after the FP-Growth algorithm has been
discussed.
Hybrid Representations
Although algorithms that use a purely horizontal or
purely vertical transaction database representation
are attractive (because they are, to some degree, conceptually simpler), they are often outperformed by algorithms that use a hybrid data structure, exploiting
elements of both vertical and horizontal representations. The simplest form of such a hybrid structure
can be found in the LCM algorithm10–12 : it employs a
purely vertical and a purely horizontal representation
in parallel. Its processing scheme is best understood
by viewing it as a variant of the Eclat algorithm, in
which the intersection of transaction index lists is replaced by a scheme called occurrence deliver. This
scheme accesses the horizontal representation to fill
the transaction index lists as illustrated in Figure 12
for the same subproblem split used to illustrate the
Eclat algorithm in Figure 11. The transaction index
list of the split item is traversed and for each index
the corresponding transaction is retrieved. The item
list of this transaction is then traversed (up to the split
item) and the transaction index is added to the lists
of the items that are encountered in the transaction.

Figure 12 | Occurrence deliver scheme used by LCM (Linear time Closed item set Miner) to find the conditional transaction database for the first subproblem (needs a horizontal representation in parallel).
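In Python, the occurrence deliver step may be sketched as follows (illustrative; the transactions are assumed to be sorted according to the item order, so that the scan of each transaction can stop at the split item):

```python
def occurrence_deliver(split_tids, transactions, rank, split_item):
    """Fill the conditional tid-lists for the first subproblem with one
    linear scan over the transactions in the split item's cover;
    rank maps each item to its position in the chosen item order."""
    cond = {}
    for tid in split_tids:
        for item in transactions[tid]:
            if rank[item] >= rank[split_item]:
                break                   # reached the split item: stop
            cond.setdefault(item, []).append(tid)
    return cond
```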
LCM has the advantage that to construct the
conditional transaction database (for the first subproblem) it only reads memory linearly and stores
the transaction indices through direct access. In contrast to this, the intersection procedure of the standard
Eclat algorithm needs if-then-else statements, which
are difficult to predict by modern processors (see Ref
41 for details). However, LCM has the disadvantage
that it is more difficult to eliminate infrequent and
other removable items; even though their vertical representations can be eliminated, removing them also
from the horizontal representation requires a special
projection operation. Nevertheless, with state-of-the-art optimizations, LCM is one of the fastest frequent
item set mining algorithms: it won (in version 211 )
the second Frequent Item Set Mining Implementations (FIMI) Workshop competition42 and version 312
is even faster.
Note that due to its hybrid transaction database
representation, LCM can be implemented in such a
way that only an amount of memory linear in the size
of the transaction database is needed: if the second
subproblem is processed before the first, the transaction index lists can be reused for the conditional
transaction databases.
A more sophisticated hybrid structure is used by
the FP-Growth algorithm,6–9 namely, a so-called FP-Tree (Frequent Pattern Tree), which combines a horizontal and a vertical representation. The core idea is
to represent a transaction database by a prefix tree,
thus combining transactions with the same prefix. At
the same time an FP-Tree keeps track of the transactions an item is contained in by linking the prefix
tree nodes referring to the same item into a list. This
structure is enhanced by a header table, each entry of
which refers to one item and contains the head of the
item list as well as the item support.
An FP-Tree is best explained by how it is
constructed from a transaction database, which is
illustrated in Figure 13 for a simple example, shown in Figure 13(a).

Figure 13 | Building an FP-tree (Frequent Pattern Tree): (a) original transaction database; (b) lexicographically sorted transactions (infrequent items eliminated, other items sorted according to descending frequency); (c) resulting FP-tree structure.

This transaction database is first preprocessed in a similar manner as for the SaM algorithm (see Figure 10) by sorting the items according to
their frequency and then the transactions lexicographically. However, in contrast to the SaM algorithm, for
an FP-Tree the items should be ordered according to
descending frequency in the transactions. This transforms the transaction database into the form shown
in Figure 13(b). From this database, the FP-Tree can
be built directly by representing it as a prefix tree, see
Figure 13(c). Note that sorting the items in the order
of descending frequency is essential for obtaining a
compact tree (even though it cannot be guaranteed
that the tree needs less memory than a simple horizontal representation due to the overhead for the
support and the pointers). Note also that the prefix tree is essentially a (compressed) horizontal representation, whereas the links between the branches
(shown as dashed lines in Figure 13) are a vertical
representation.
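The construction can be sketched in Python as follows (an illustration of the structure, not the authors' implementation; the class and function names are my own):

```python
from collections import Counter

class FPNode:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}      # horizontal part: prefix tree links
        self.link = None        # vertical part: next node of same item

def build_fp_tree(transactions, smin):
    """Drop infrequent items, sort the rest by descending frequency,
    and insert each transaction into the prefix tree; nodes of the
    same item are chained into the header table's item list."""
    freq = Counter(i for t in transactions for i in t)
    root, header = FPNode(None, None), {}
    for t in transactions:
        items = sorted((i for i in t if freq[i] >= smin),
                       key=lambda i: (-freq[i], i))
        node = root
        for i in items:
            if i not in node.children:
                child = FPNode(i, node)
                child.link = header.get(i)   # prepend to the item list
                header[i] = child
                node.children[i] = child
            node = node.children[i]
            node.count += 1
    return root, header
```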
An FP-Tree is processed in the subproblem split
as follows: the rightmost item is chosen as the split
item and its list is traversed, see Figure 14(a). This selects all transactions that contain the split item. From
nodes on the split item list, the parent pointers of
the prefix tree are followed to recover the rest of the
transactions containing the split item. To build the
FP-Tree for the conditional transaction database, the
encountered nodes are copied and linked. Note, however, that the support values are derived from the support values in the nodes for the split item: only these
state in how many transactions the split item is contained. After the split item list has been fully traversed,
the created FP-Tree is detached (see Figure 14b) and
processed recursively (first subproblem). Note, however, that for an implementation it may be easier to
extract the transactions one by one, see Figure 14(d)
and to insert them into a new (and initially empty) FPTree than to copy nodes. Of course, in this approach
no full horizontal database representation is created,
but each extracted transaction is immediately inserted
into a new tree and then discarded again, which can
be achieved with a single transaction buffer. For the
second subproblem, the nodes on the list for the
split item are simply discarded (or ignored; note that
an explicit deletion is not necessary) as shown in
Figure 14(c).

Figure 14 | Subproblem split in the FP-growth (Frequent Pattern Growth) algorithm: (a) projecting an FP-tree (Frequent Pattern Tree) (to item e); (b) detached projection (FP-tree of conditional transaction database); (c) remaining FP-tree after split item level is removed; (d) conditional transaction database in horizontal representation.
The advantages of the FP-Growth algorithm are
that its data structure allows for an easy removal of
infrequent and other eliminated items and that it reduces, for some data sets, the amount of memory
needed to store the transaction database (due to the
combination of transactions with the same prefix). Its
disadvantages are that the FP-Tree structure is comparatively complex and that the overhead for the support values and the pointers in the nodes can also have
the effect that an FP-Tree is larger than a purely horizontal representation of the same database. Nevertheless, in a state-of-the-art implementation FP-Growth
is usually one of the fastest frequent item set mining
algorithms, as exemplified by the fact that the version
of Ref 7 won the first FIMI Workshop competition.43
An even faster version, which represents the FP-Tree
very cleverly with the help of only two integer arrays
was presented in Ref 8.
Variants of FP-Growth include a version that
uses Patricia trees to improve the compression44 and
a compressed coding of a standard FP-Tree to process
particularly large data sets.45
As already mentioned in Vertical Representation, certain aspects of the FP-Growth algorithm can
be transferred to Eclat to improve the performance
on dense data sets. The core idea is that Eclat also
allows one to combine transaction prefixes if one uses transaction range lists instead of plain transaction index lists. This is illustrated in Figure 15 for the same example database used to illustrate FP-Growth (a range is represented as start−end:support). Clearly, using ranges can reduce the length of the lists considerably, especially for dense data sets, and can thus reduce the processing time, even though intersecting transaction range lists is more complex than intersecting plain transaction index lists.

Figure 15 | Eclat with transaction ranges: (a) lexicographically sorted transactions; (b) transaction range lists.
ADVANCED TECHNIQUES
The basic algorithms as defined above by a general divide-and-conquer/depth-first approach and a
representation form and processing scheme for conditional transaction databases can be enhanced by
various optimizations to increase their speed. The
following sections list the most important and most
effective ones. Note, however, that none of these can
change the fact that the asymptotic time complexity
of frequent item set mining is essentially linear in the
number of item sets, and thus potentially exponential
in the number of items.
Reducing the Transaction Database
One of the simplest and most straightforward optimizations consists in eliminating all infrequent items
from the initial transaction database. This not only
removes items that cannot possibly be elements of frequent item sets (due to the Apriori property), but also
increases the chances to find equal transactions in the
database, which can be combined keeping their multiplicity in a counter. We already saw this technique for
the SaM algorithm (see Figure 10) and it is implicit in
the construction of an FP-Tree (see Figure 13), but it
can also be applied for Eclat-style algorithms (including LCM). This makes the support counting slightly
more complex in these algorithms, because it requires
an additional array holding the transaction weights
(multiplicities), but it can significantly improve speed
for certain data sets and minimum support values and
thus is worth the effort.
Perfect Extension Pruning
The basic divide-and-conquer/depth-first scheme can
easily be improved with so-called perfect extension
pruning: an item i ∉ I is called a perfect extension
of an item set I (or a parent equivalent item or a
full support item), iff I and I ∪ {i} have the same support. Perfect extensions have the following properties:
(1) if an item i is a perfect extension of an item set
I, then it is also a perfect extension of any item set
J ⊇ I as long as i ∉ J and (2) if K is the set of all
perfect extensions of an item set I, then all sets I ∪ J
with J ∈ 2^K (power set of K) have the same support
as I. These properties can be exploited by collecting
in the recursion not only prefix items, but also, in
a third element of a subproblem description, perfect
extension items. These items are also removed from
the conditional databases and are only used when an
item set is reported to generate all supersets of the
reported set that have the same support.
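A Python sketch of these two steps (illustrative; transactions are plain sets, and multiplicities are ignored for brevity, so the support of the current prefix is simply the number of transactions in its conditional database):

```python
from collections import Counter
from itertools import combinations

def split_perfect_extensions(C):
    """Items contained in every transaction of the conditional
    database C are perfect extensions of the current prefix; remove
    them from C and collect them separately."""
    freq = Counter(i for t in C for i in t)
    perfect = {i for i, s in freq.items() if s == len(C)}
    return perfect, [t - perfect for t in C if t - perfect]

def report_with_extensions(I, s, perfect, report):
    """Report I together with I ∪ J for every J in the power set of
    the collected perfect extensions; all have the same support."""
    for r in range(len(perfect) + 1):
        for J in combinations(perfect, r):
            report(I | set(J), s)
```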
The correctness of this approach can also be understood by realizing that if a perfect extension item
is chosen as the split item, the conditional transaction databases C1 and C2 are identical (since there
are no transactions in C0 that do not contain the
split item). The only difference between the subproblems is the prefix, which contains the split item for
the first subproblem, whereas it is missing in the second subproblem. As a consequence, the frequent item
sets reported in the solutions of the two subproblems
are essentially the same, only that the item sets reported for the first subproblem contain the split item,
whereas those of the second subproblem do not. Or
formally: let i be the split item and S(Sk) the frequent
item sets in the solution of the kth subproblem, k =
1, 2. Then I ∈ S(S2 ) ⇔ I ∪ {i} ∈ S(S1 ).
Note that closed item sets (see Closed and Maximal Item Sets and Generators) may be defined as item
sets that do not possess a perfect extension. However,
note also that using perfect extension pruning is not
sufficient to restrict the output to closed item sets (see
Closed and Maximal Item Set Filtering for the additional requirements).
Few Items
One of the core issues of making an item set enumeration approach efficient is that in the recursion the
set of items in the conditional transaction databases
is reduced. This can make it possible to combine
(projected) transactions that differed initially in some
item, but became equal w.r.t. the reduced item base.
We have seen a direct example of this in the illustration of the SaM algorithm (see Figure 10) and
the FP-growth algorithm does essentially the same.
Figure 16 | Illustration of a 4-items machine: (a) table of highest set bits; (b) empty 4-items machine (no transactions); (c) after inserting the transactions of Figure 1 (without e); (d) after propagating the transaction lists left-/downward.
However, the computational overhead for this transaction combination procedure can be considerably
reduced with a special data structure if the set of remaining items becomes sufficiently small (as it will
necessarily be at some point in the recursion).
The core idea is to represent a certain number k
of the most frequent items in a horizontal representation as bit arrays, in which each bit stands for an item.
A set bit indicates that the item is contained in a transaction, a cleared bit that it is missing. Provided that
k is sufficiently small, we can then represent all possible (projected) transactions with k items as indices of
an integer array, which simply holds the multiplicities
of these transactions. Together with transaction lists
holding the transactions containing an item (in a similar style as for the LCM algorithm), we arrive at what
will be called a k-items machine in the following. This
data structure was proposed in Ref 12 for the LCM
algorithm (version 3). However, the same technique
can equally well be used with other algorithms, including SaM, Eclat (with plain transaction index lists
as well as with transaction ranges), FP-Growth, and
many others.
To illustrate this method, Figure 16 shows the
elements of a 4-items machine as well as different
states. Figure 16(b) shows the main components of
the 4-items machine (in an empty state, i.e., without
any transactions). The top array records the multiplicities of transactions, which are identified by their
index (see Figure 16a for the correspondence of bit
Volume 2, November/December 2012
patterns and item sets). The bottom arrays record,
for each item, the transactions it is contained in (in
a bit representation). Note that these data structures
explain why k needs to be small for a k-items machine
to be feasible: the multiplicities array at the top has
2k elements, which is also the total (maximum) size
of the transaction lists.
Bit-represented transactions are entered into a
k-items machine by simply indexing the transaction
weights array to check whether an equal transaction
is already contained: if the corresponding weight is
zero, the transaction is not yet represented, otherwise
it is. The principle underlying this aspect is essentially
that of binsort. If a transaction is not yet represented,
the array shown in Figure 16(a) is used to determine
the highest set bit (and the item corresponding to it;
in the array denoted as i.b, where i is the item and b
the corresponding bit index). Then the transaction is
appended to the list for this item. Executing this procedure for all transactions in the database of Figure 1
(without item e) yields Figure 16(c).
However, after this step only the list for the
highest item (in this case d) lists all the transactions
containing it, because a transaction has been added
only to the list for its highest item. To complete the
assignment, the transaction lists are traversed from
the highest item downward, the bit corresponding to
the list is masked out, and the (projected) transactions
are, if necessary, assigned to the list for the next highest bit. If the transaction is already contained in this
list (determined from the entry in the weights arrays
as above), only the transaction weight/multiplicity is
updated. This finally yields Figure 16(d), in which
new list entries created in the second step are shown
in blue and which can then be processed recursively
in the usual way.
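A rough Python sketch of the insertion and propagation steps for k = 4 (my own illustration only; an actual implementation, see Ref 12, works on bit-encoded transactions and plain integer arrays):

```python
def k_items_machine(transactions, rank, k=4):
    """weights[p] holds the multiplicity of the transaction with bit
    pattern p; lists[b] chains the distinct patterns whose highest
    set bit is b (a binsort, then left-/downward propagation)."""
    weights = [0] * (1 << k)
    lists = [[] for _ in range(k)]
    for t in transactions:                  # insertion step
        p = 0
        for i in t:
            p |= 1 << rank[i]               # bit-encode the transaction
        if p == 0:
            continue
        if weights[p] == 0:                 # not yet represented
            lists[p.bit_length() - 1].append(p)
        weights[p] += 1
    for b in range(k - 1, 0, -1):           # propagation step
        for p in lists[b]:
            q = p & ~(1 << b)               # mask out this list's bit
            if q:
                if weights[q] == 0:
                    lists[q.bit_length() - 1].append(q)
                weights[q] += weights[p]    # update the multiplicity
    return weights, lists
```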
Conditional Item Reordering
Although the differences are often fairly small, in certain transaction databases the frequency order of the
items in the conditional transaction databases can differ considerably from the global frequency order. Because the order in which split items are chosen is,
in principle, arbitrary, it can be beneficial to reorder
the items to reduce the average size of the conditional transaction databases in the deeper levels of
the recursion. This is one of the core features underlying the speed of the FP-growth implementations of
Refs 7, 9. However, the same idea can be used (and
even much more easily) with an Eclat-based scheme
(although it is slightly more difficult to implement for
the LCM variant). It should be noted, though, that
the cost of reordering can also slow down the search
unnecessarily if the frequency orders do not differ
much. It depends heavily on the data set whether item
reordering is advantageous or not.
Generating the Output
In the described divide-and-conquer/depth-first
scheme the item sets are enumerated in lexicographic
order (if the first subproblem is processed before the
second) or in reverse lexicographic order (if the second is processed before the first). This can be exploited to optimize the output, by avoiding the need to
generate the textual description of each item set from
scratch. Due to the lexicographic order, it will often
be the case that the next item set to be reported shares
a prefix with the last one reported. The output, and
with it the whole search, can be made considerably
faster by using an output buffer for the textual description of an item set. It is then recorded which part
is still valid and this part is reused. That this should
be relevant may sound surprising, but the benchmark
results of Ref 41 clearly indicate that a well-designed
output routine can be orders of magnitude faster than
a basic one and thus can decide which implementation
wins a speed competition.
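The prefix-reuse idea can be sketched in Python as follows (illustrative; the real gain shows in low-level implementations, where the buffer is a plain character array that is only partially rewritten):

```python
import sys

class OutputBuffer:
    """Reuse the textual prefix shared with the previously reported
    item set; only the new suffix is formatted from scratch."""
    def __init__(self, out=sys.stdout):
        self.out, self.items, self.ends, self.buf = out, [], [], ""

    def report(self, itemset, support):
        items = sorted(itemset)
        p = 0                      # length of the still-valid prefix
        while (p < len(items) and p < len(self.items)
               and items[p] == self.items[p]):
            p += 1
        self.buf = self.buf[:self.ends[p - 1]] if p else ""
        self.ends = self.ends[:p]
        for i in items[p:]:        # render only the changed suffix
            self.buf += (" " if self.buf else "") + str(i)
            self.ends.append(len(self.buf))
        self.items = items
        self.out.write(f"{self.buf} ({support})\n")
```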
Closed and Maximal Item Set Filtering
All techniques discussed up to now are essentially
geared to find all frequent item sets. If one wants
to confine the output to closed or maximal item sets,
additional filtering procedures are needed. The most
straightforward is perfect extension pruning: if an
item set has a perfect extension, it cannot be closed
(and thus also not maximal). However, not all perfect
extensions are easy to detect. Only if the perfect extension item is in the conditional transaction database
associated with the item set to check will it be detected immediately in the recursion, where it can then be used
without effort to suppress the output of the item set
in question.
However, if the path to the current subproblem
contains calls for the second type of subproblem, there
can be perfect extension items that will not be checked
in the recursion, namely among those that were eliminated in these calls (eliminated items). To find such
perfect extensions, essentially two approaches are employed: the first refers to the definition of a perfect extension, goes back to the original transaction database
and checks whether there is an eliminated item that
is contained in all originals of the projected transactions of the current conditional transaction database.
This check is best carried out with a horizontal transaction representation by intersecting the set of eliminated items incrementally with the transactions. The
advantage of such a scheme is that if the set becomes
empty, there is no perfect extension and no further intersections are necessary. If, on the other hand, items
remain after all relevant transactions have been intersected, there is a perfect extension; in fact, the remaining items are exactly the perfect extensions among the eliminated items.
This approach is used in LCM,10–12 where it is particularly fast, especially due to a bit representation of
the k most frequent items.
An alternative to a direct test of the defining
condition is to maintain a repository of already found
closed item sets, which is queried for a superset with
the same support. (Note that this requires that the
first subproblem is processed before the second one.)
If there is such a superset, there is a perfect extension
among the eliminated items and the current item set
cannot be closed.
It is important to note that a repository approach is competitive only if it does not rely on a single global repository, because the number of item sets accumulating in such a repository in the course of the recursive search
severely impedes the efficiency of looking up a found
item set. The solution is to create, alongside the conditional transaction databases, conditional repositories, which are filtered in the same way by the prefix/split item. This approach works extremely well in
the FP-Growth implementations of Refs 7, 9, where
the repository is laid out as an FP-Tree, which makes
the retrieval very efficient. However, a simple top-down prefix tree may be used as well.
Additional pruning techniques for closed item
sets as well as special algorithms have been suggested,
for example, in Refs 19, 20, 22, 23, 37, 46–48.
Since maximal item sets are also closed, they allow
for basically the same or at least analogous filtering
techniques as closed item sets. Some additional techniques, as well as other specialized algorithms, can be
found, for instance, in Refs 14–15, 49, 50.
INTERSECTING TRANSACTIONS
The general search schemes reviewed in the preceding
sections all enumerate candidate item sets and then
prune infrequent candidates. However, there exists an
alternative, which intersects (sets of) transactions to
find the closed (frequent) item sets. Different variants
of this approach were proposed in Refs 24, 51, 52
and it was combined with an item set enumeration
scheme in Ref 53.
The fundamental idea of methods based on intersecting transactions is that closed item sets cannot
only be defined as in Closed and Maximal Item Sets
and Generators (no superset has the same support),
but also as follows: an item set I ⊆ B is (frequent and) closed if sT(I) = |KT(I)| ≥ smin ∧ I = ⋂_{k ∈ KT(I)} tk. In
other words, an item set is closed if it is equal to
the intersection of all transactions that contain it (its
cover). This definition is obviously equivalent to the
one given in Closed and Maximal Item Sets and Generators: if an item set is a proper subset of the intersection of the transactions it is contained in, there exists
a superset (especially the intersection of the containing transactions itself) that has the same cover and
thus the same support. If, however, an item set is
equal to the intersection of the transactions containing it, adding any item will remove at least one transaction from its cover and will thus reduce the item set
support.
Intersection approaches can nicely be justified in
a formal way by analyzing the Galois connection between the set of all possible item sets 2B and the set of
all possible sets of transaction indices 2{1,...,m} (where
m is the number of transactions),54 as it was emphasized and explored in detail in Ref 55: the Galois connection gives rise to a bijective mapping between the
closed item sets and closed sets of transaction indices.
The intersection approach is implemented in the
Carpenter algorithm24 by enumerating sets of transactions (or, equivalently, sets of transaction indices)
and intersecting them. This is done with basically the
same divide-and-conquer scheme as for the item set
enumeration approaches, only that it is applied to
transactions (i.e., items and transactions exchange
their meaning, cf., Ref 55). Technically, the task to
enumerate all transaction index sets is split into two
subtasks: (1) enumerate all transaction index sets that
contain the index 1 and (2) enumerate all transaction
index sets that do not contain the index 1. These subtasks are then further divided w.r.t. the transaction index 2: enumerate all transaction index sets containing
(1.1) both indices 1 and 2, (1.2) index 1, but not index
2, (2.1) index 2, but not index 1, (2.2) neither index
1 nor index 2, and so on.
Formally, all subproblems occurring in the recursion can be described by triples S = (I, K, ℓ). Here K ⊆ {1, . . . , m} is a set of transaction indices, I = ⋂_{k ∈ K} tk, that is, I is the item set that results from intersecting the transactions referred to by K, and ℓ is a transaction index, namely, the index of the next transaction to consider. The initial problem, with which the recursion is started, is S = (B, ∅, 1), where B is the item base, no transactions have been intersected yet, and transaction 1 is the next to process.
A subproblem S0 = (I0, K0, ℓ0) is processed as follows: form the intersection I1 = I0 ∩ t_{ℓ0}. If I1 = ∅, do nothing (return from recursion). If |K0| + 1 ≥ smin, and there is no transaction tj with j ∈ {1, . . . , m} − K0 such that I1 ⊆ tj, report I1 with support sT(I1) = |K0| + 1. If ℓ0 < m, form the subproblem S1 = (I1, K1, ℓ1) with K1 = K0 ∪ {ℓ0} and ℓ1 = ℓ0 + 1 and the subproblem S2 = (I2, K2, ℓ2) with I2 = I0, K2 = K0 and ℓ2 = ℓ0 + 1 and process them recursively. For the necessary optimizations to make this approach efficient,
see Refs 24, 52, 56.
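A direct Python rendering of this recursion (illustrative and unoptimized; the closedness test scans all remaining transactions, which the cited optimizations avoid, and the exclude branch is processed here even when the intersection is empty; indices are 0-based):

```python
def carpenter(T, smin):
    """Enumerate transaction index sets and intersect the
    corresponding transactions; report an intersection if no
    transaction outside the index set contains it (closedness)."""
    m, found = len(T), {}

    def process(I, K, l):
        if l >= m:
            return
        I1 = I & T[l]
        if I1:
            if (len(K) + 1 >= smin and
                    not any(I1 <= T[j] for j in range(m)
                            if j not in K and j != l)):
                found[frozenset(I1)] = len(K) + 1
            process(I1, K | {l}, l + 1)   # include transaction l
        process(I, K, l + 1)              # exclude transaction l

    process(set().union(*T), set(), 0)
    return found

# few transactions, many items: the setting where this pays off
T = [set(t) for t in ("abcdef", "abcef", "abdf", "acdef")]
print(carpenter(T, smin=2))
```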
An alternative to transaction set enumeration
is a scheme that maintains a repository of all closed
item sets, which is updated by intersecting it with the
next transaction (incremental approach).51,56 To justify this approach formally, we consider the set of all closed frequent item sets for $s_{\min} = 1$, that is, the set $C(T) = \{I \subseteq B \mid \exists S \subseteq T : S \neq \emptyset \wedge I = \bigcap_{t \in S} t\}$. This set satisfies the following simple recursive relation: (1) $C(\emptyset) = \emptyset$, (2) $C(T \cup \{t\}) = C(T) \cup \{t\} \cup \{I \mid \exists s \in C(T) : I = s \cap t\}$. As a consequence, we can start the
procedure with an empty set of closed item sets and
then process the transactions one by one, each time
updating the set of closed item sets by adding the
new transaction itself and the additional closed item
sets that result from intersecting it with the already
known closed item sets. In addition, the support of
already known closed item sets may have to be updated. Details of an efficient data structure and updating scheme can be found in Ref 56.
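This recursive relation translates almost directly into code. The following sketch (our simplified illustration; the efficient prefix-tree data structure of Ref 56 is considerably more refined) maintains a dictionary that maps each known closed item set to its support:

```python
def update_closed(closed, transaction):
    """Add one transaction to a repository of closed item sets.

    closed: dict mapping frozenset (a closed item set) -> its support
            w.r.t. the transactions processed so far
    """
    t = frozenset(transaction)
    candidates = {t: 1}                  # the new transaction itself
    for c, supp in closed.items():
        i = c & t
        if i:                            # non-empty intersections are closed
            # the support of i so far is the largest support of any
            # closed superset of i; the new transaction adds one more
            candidates[i] = max(candidates.get(i, 0), supp + 1)
    for i, supp in candidates.items():
        closed[i] = max(closed.get(i, 0), supp)

closed = {}
for t in [{"a", "b", "c"}, {"a", "b", "d"}, {"a", "c", "d"}]:
    update_closed(closed, t)
# e.g. closed[frozenset({"a"})] == 3 and closed[frozenset({"a","b"})] == 2
```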
It should be noted that the performance of the
intersection approaches is usually not competitive for
standard data sets (like supermarket data). However,
they can be the method of choice for data sets with
few transactions and (very) many items as they occur, for instance, in gene expression analysis56,57 or
text mining. On such data, transaction intersection
approaches can outperform item set enumeration by
orders of magnitude.
EXTENSIONS
Frequent item set mining approaches have been extended in various ways and considerable efforts have
been made to reduce the output to the relevant patterns. This section is certainly far from complete and
only strives to give a flavor of some of the many
possibilities.
Association Rules
Although historically preceding frequent item set mining, association rules58 have to be considered an extension. The reason is that association rule induction
is a two-step process: in the first step the frequent
item sets are found, from which association rules are
generated in the second step. The idea is simply to
split a frequent item set into two disjoint subsets, the
union of which is the frequent item set (2-partition).
One of the subsets is used as the antecedent of a rule,
the other as its consequent. This rule is then evaluated by computing its so-called confidence, which for
a rule X → Y and a given transaction database T is
defined as cT (X → Y) = sT (X ∪ Y)/sT (X). Intuitively,
the confidence estimates the conditional probability
of the consequent given the antecedent of the rule.
Rules are then filtered with a user-specified minimum
confidence cmin : only rules reaching or exceeding this
threshold are reported.
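As an illustration (our sketch, not a specific published implementation), rules with a single item in the consequent can be generated and filtered like this, relying on the fact that a complete frequent item set miner also reports every subset of a frequent item set:

```python
def rules_from_frequent(freq, cmin):
    """Generate association rules X -> {y} from frequent item sets.

    freq: dict mapping frozenset -> support; contains all frequent
          item sets, hence also every antecedent that is needed
    cmin: minimum confidence
    """
    rules = []
    for itemset, supp in freq.items():
        if len(itemset) < 2:
            continue
        for y in itemset:
            antecedent = itemset - {y}
            conf = supp / freq[antecedent]   # c(X -> Y) = s(X u Y) / s(X)
            if conf >= cmin:
                rules.append((antecedent, y, conf))
    return rules
```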
Extensions of standard association rule induction include, among many others, the incorporation
of taxonomies for the items,59 quantitative association rules,60 and fuzzy association rules,61 which use
fuzzy sets over continuous domains as items.
A popular way to rank association rules is the lift,62 $l_T(X \to Y) = \frac{c_T(X \to Y)}{c_T(\emptyset \to Y)} = \frac{c_T(X \to Y)}{s_T(Y)/m}$, which measures how much the relative frequency of Y is increased if the transactions are restricted to those that contain X. Alternatives include leverage,63 $\lambda_T(X \to Y) = s_T(X \cup Y) - s_T(X) s_T(Y)/m$, which states how much more often X and Y occur together than expected under independence, and conviction,62 $\gamma_T(X \to Y) = \frac{1 - c_T(\emptyset \to Y)}{1 - c_T(X \to Y)} = \frac{1 - s_T(Y)/m}{1 - c_T(X \to Y)}$, which measures how much
more often the rule would be incorrect if X and Y
occurred independently. Overviews of ranking and
selection measures can be found in Refs 64–66. In
general, any measure for the dependence of two binary variables (X is contained in a transaction or not,
Y is contained in a transaction or not) is applicable. Approaches based on statistical methods to select
the best k patterns have been proposed, for example,
in Refs 67, 68. Generally, the task of selecting relevant rules from the abundance produced by unfiltered mining has become a strong focus of current research.
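Computed from the supports involved and the total number of transactions m, these measures reduce to a few arithmetic operations, as the following sketch shows (function and parameter names are our own):

```python
def rule_measures(s_xy, s_x, s_y, m):
    """Confidence, lift, leverage and conviction of a rule X -> Y.

    s_xy: support of X u Y, s_x: support of X, s_y: support of Y,
    m: total number of transactions
    """
    conf = s_xy / s_x                       # estimate of P(Y | X)
    lift = conf / (s_y / m)                 # confidence vs. base rate of Y
    leverage = s_xy - s_x * s_y / m         # observed minus expected count
    conviction = ((1 - s_y / m) / (1 - conf)
                  if conf < 1 else float("inf"))
    return conf, lift, leverage, conviction
```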
Cover Similarity
Support-based frequent item set mining has the disadvantage that the support does not say much about
the actual strength of association of the items in the
set: a set of items may be frequent simply because
its elements are frequent and thus their frequent co-occurrence can be expected by chance. As a consequence, the (usually few) interesting item sets drown
in a sea of irrelevant ones.
One of several approaches to improve this situation is to replace the support by a different antimonotone measure. Such a measure can, for instance, be
obtained by generalizing measures for the similarity
of sets or binary vectors69 to more than two arguments and applying them to the covers of the items in
a set.70
As an example, consider the Jaccard index,71
which for two sets A and B is defined as J(A, B) =
|A ∩ B|/|A ∪ B|. Obviously, J(A, B) is 1 if the sets coincide (i.e., A = B) and 0 if they are disjoint (i.e., A ∩ B
= ∅). To generalize this measure to more than two sets
(here: item covers), one defines the carrier $L_T(I)$ of an item set I as $L_T(I) = \{k \in \{1, \dots, m\} \mid I \cap t_k \neq \emptyset\} = \bigcup_{i \in I} K_T(\{i\})$. The extent $r_T(I)$ of an item set I w.r.t. a transaction database T is the size of its carrier, that is, $r_T(I) = |L_T(I)|$. Together with the notions of cover and support (see Problem Definition), the generalized Jaccard index of an item set I is then defined as its support divided by its extent, that is, as $J_T(I) = \frac{s_T(I)}{r_T(I)} = \frac{|\bigcap_{i \in I} K_T(\{i\})|}{|\bigcup_{i \in I} K_T(\{i\})|}$. It is easy to show that $J_T(I)$ is antimonotone.
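In code, the generalized Jaccard index is simply the size ratio of the intersection and the union of the item covers, as in this minimal sketch (our illustration; I is assumed to be non-empty):

```python
def generalized_jaccard(I, transactions):
    """J_T(I): |intersection of item covers| / |union of item covers|."""
    covers = [{k for k, t in enumerate(transactions) if i in t}
              for i in I]
    support = len(set.intersection(*covers))  # transactions with all items
    extent = len(set.union(*covers))          # transactions with any item
    return support / extent if extent else 0.0

# Items that occur mostly together score close to 1.
ts = [{"a", "b"}, {"a", "b"}, {"a"}, {"c"}]
assert generalized_jaccard({"a", "b"}, ts) == 2 / 3
```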
Depending on the application, Jaccard item set
mining can yield better and more informative results
than simple support-based mining, but should always be combined with an additional filter for the
support.
Notable other modifications of the support criterion include a size-decreasing minimum
support72,73 (i.e., the larger an item set, the lower the
support threshold) and using the area of the binary
matrix tile corresponding to an item set as a selection
criterion,74 which is analogous to a size-dependent
support.
Item Set Ranking and Selection
To reduce the number of reported item sets, additional measures to rank and filter them can be employed. A straightforward approach simply compares
the support of an item set I to its expected support under independence, for example, as $e_T(I) = s_T(I) / \prod_{i \in I} s_T(i)$. However, this has the disadvantage that adding an independent item j to a well-scoring set still yields a high value, as in this case $e_T(I \cup \{j\}) = s_T(j) s_T(I) / (s_T(j) \prod_{i \in I} s_T(i)) = e_T(I)$. Better approaches
rely on measures for association rule evaluation and
ranking (see Association Rules): form all possible association rules that can be created from a given item
set I (or only those with a single item in the consequent) and aggregate (average, take minimum or
maximum) their evaluations to obtain an evaluation
for the item set I.
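A sketch of this aggregation idea (our illustration; both the rule measure and the aggregate are application dependent), here using the lift of all rules with a single item in the consequent:

```python
def itemset_score(itemset, freq, m, agg=min):
    """Aggregate the lift of all rules (itemset - {y}) -> {y}.

    itemset: frozenset with at least two items
    freq:    dict mapping frozenset -> support, containing all subsets
    m:       total number of transactions
    """
    lifts = []
    for y in itemset:
        antecedent = itemset - {y}
        conf = freq[itemset] / freq[antecedent]
        lifts.append(conf / (freq[frozenset({y})] / m))
    return agg(lifts)
```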
However, any such measure-based evaluation
suffers from the multiple testing problem, through which
one loses control of the significance level of statistical tests: in a large number of tests some positive results are to be expected simply by chance, which can
lead to many false discoveries. Common approaches
to deal with this problem are to apply the Bonferroni correction75,76 or the Holm–Bonferroni method77 in the search,68,78 to mine only part of the data and statistically validate the results on a hold-out subset,68 and to use randomization approaches,79,80 which create surrogate data sets that implicitly encode the null hypothesis.
An extended problem is the selection of so-called pattern sets, for example, as sets of (binary)
features for classification purposes. In this case, not
individual item sets, but sets of such patterns are desired, for example, a (small) pattern set that covers the
data well or exhibits little overlap between its member patterns (low redundancy). To find such pattern
sets, various approaches have been devised, for example, finding pattern sets with which the data can be
compressed well81,82 or pattern sets in which all patterns contribute to partitioning the data.83 A general
framework for this task, which has become known as constraint-based pattern set mining, has been suggested in Ref 84. An alternative, statistics-based reduction of the output in the spirit of closed item sets is given by self-sufficient item sets85: item sets whose support is within expectation are removed.
Fault-Tolerant Item Sets
In standard frequent item set mining only transactions that contain all of the items in a given set are
counted as supporting this set. In contrast to this, in
fault-tolerant (or approximate) item set mining transactions that contain only a subset of the items can still
support an item set, though possibly to a lesser degree
than transactions containing all items. To cope with
missing items in the transaction data to analyze, several fault-tolerant item set mining approaches have
been proposed (for an overview see, e.g., Ref 86).
They can be categorized roughly into three classes:
(1) error-based, (2) density-based, and (3) cost-based
approaches.
Error-based approaches: Examples of errorbased approaches are Refs 87 and 88. In the former,
the standard support measure is replaced by a fault-tolerant support, which allows for a maximum number of missing items in the supporting transactions,
thus ensuring that the measure is still antimonotone.
The search algorithm itself is derived from the Apriori algorithm (see Breadth-First/Levelwise Search). In
Ref 88, constraints are placed on the number of missing items as well as on the number of (supporting)
transactions that do not contain an item in the set.
Hence, it is related to the tile-finding approach in
Ref 89. However, it uses an enumeration search
scheme that traverses sublattices of items and transactions, thus ensuring a complete search, whereas
Ref 89 relies on a heuristic scheme.
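A fault-tolerant support in the spirit of Ref 87, which allows at most a fixed number of missing items, can be sketched in a few lines (our illustration, not the published algorithm):

```python
def fault_tolerant_support(I, transactions, max_missing):
    """Count transactions that miss at most max_missing items of I.

    With max_missing = 0 this is the standard support; adding an item
    to I can only increase the number of missing items per transaction,
    so the measure remains antimonotone.
    """
    return sum(1 for t in transactions
               if len(set(I) - set(t)) <= max_missing)
```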
Density-based approaches: Rather than fixing
a maximum number of missing items, density-based
approaches allow a certain fraction of the items in a
set to be missing from the transactions, thus requiring
the corresponding binary matrix tile to have a minimum density. This means that for larger item sets
more items are allowed to be missing than for smaller
item sets. As a consequence, the measure is no longer
antimonotone if the density requirement is to be fulfilled by each individual transaction. To overcome
this, Ref 90 requires only that the average density
over all supporting transactions exceeds a user-specified threshold, whereas Ref 91 defines a recursive
measure for the density of an item set. The approach
in Ref 92 requires both items and transactions to satisfy a density constraint and defines a corresponding
fault-tolerant support that allows for efficient mining.
Cost-based approaches: In error- or density-based approaches, all transactions that satisfy the constraints contribute equally to the support of an item
set, regardless of how many items of the set they contain. In contrast to this, cost-based approaches define
the support contribution of transactions in proportion
to the number of missing items. In Refs 40, 93, this
is achieved by means of user-provided item-specific
costs or penalties, with which missing items can be
inserted. These costs are combined with each other
and with the initial transaction weight of 1 with the
help of a t-norm. In addition, a minimum weight for
a transaction can be specified, by which the number
of insertions can be limited.
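A minimal sketch of this weighting scheme (our illustration, using the product t-norm to combine penalties; Refs 40, 93 describe the actual algorithms and admit other t-norms):

```python
def transaction_weight(I, t, penalty, wmin=0.0, max_insertions=None):
    """Weight with which transaction t supports item set I.

    penalty: dict mapping each item to a factor in (0, 1] that is
             charged for inserting the item into a transaction
    The penalties of the missing items are combined with the initial
    transaction weight 1 by the product t-norm T(a, b) = a * b.
    """
    missing = set(I) - set(t)
    if max_insertions is not None and len(missing) > max_insertions:
        return 0.0
    w = 1.0
    for item in missing:
        w *= penalty[item]
    return w if w >= wmin else 0.0

def ft_support(I, transactions, penalty, wmin=0.5):
    """Cost-based fault-tolerant support: sum of transaction weights."""
    return sum(transaction_weight(I, t, penalty, wmin)
               for t in transactions)
```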
Note that the cost-based approaches can be
made to contain the error-based approaches as a limiting or extreme case, as one may set the cost/penalty
of inserting an item in such a way that the transaction weight is not reduced. In this case, limiting the
number of insertions obviously has the same effect as
allowing for a maximum number of missing items.
Related to fault-tolerant item set mining—but
nevertheless fundamentally different—is the case of
uncertain transactional data. In such data, each item is
endowed with a transaction-specific weight or probability, which is meant to indicate the degree or chance
with which it is a member of the transaction. Approaches to this problem can be found, for example,
in Refs 94–96. The problem of determining the (expected) support of an item set is best treated (in the
case of independent occurrences of items in the transactions) by simple sampling and then applying standard frequent item set mining97 or by using a normal
distribution approximation.98
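A minimal sketch of the sampling idea (our illustration of the approach of Ref 97, reduced to estimating the expected support of a single item set under independent item occurrences):

```python
import random

def expected_support(I, uncertain_db, n_samples=1000, seed=0):
    """Monte Carlo estimate of the expected support of item set I.

    uncertain_db: list of transactions, each a dict item -> probability
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_samples):
        # sample one deterministic database and count the transactions
        # that contain every item of I
        for t in uncertain_db:
            if all(i in t and rng.random() < t[i] for i in I):
                hits += 1
    return hits / n_samples

udb = [{"a": 0.9, "b": 0.5}, {"a": 0.4}, {"a": 0.7, "b": 1.0}]
print(expected_support({"a"}, udb))   # close to 0.9 + 0.4 + 0.7 = 2.0
```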
SUMMARY
Frequent item set mining has been a fruitful and intensely researched topic, which has produced remarkable results. The currently fastest frequent item set
mining algorithms are the Eclat-variant LCM and
FP-Growth, provided they are equipped with state-of-the-art optimizations, and there seems to be very little
room left for speed improvements. Lasting challenges
of frequent item set mining are to find better ways
to filter the produced frequent item sets and association rules (or produce fewer in the first place), as even
with the methods discussed above (such as closed and
maximal item sets), the really interesting patterns still
run the risk of drowning in a sea of irrelevant ones.
Additional filtering with quality measures or statistical tests improves the situation, but still leaves a lot
of room for improvement.
REFERENCES
1. Agrawal R, Srikant R. Fast algorithms for mining association rules. In: Proceedings of the 20th International
Conference on Very Large Databases (VLDB 1994;
Santiago de Chile, Chile). San Mateo, CA: Morgan Kaufmann; 1994, 487–499.
2. Agrawal R, Mannila H, Srikant R, Toivonen H,
Verkamo A. Fast discovery of association rules. In:
Advances in Knowledge Discovery and Data Mining.
Cambridge, MA: AAAI Press/MIT Press; 1996, 307–
328.
3. Zaki MJ, Parthasarathy S, Ogihara M, Li W. New
algorithms for fast discovery of association rules. In:
Proceedings of the 3rd International Conference on
Knowledge Discovery and Data Mining (KDD 1997;
Newport Beach, CA). Menlo Park, CA: AAAI Press;
1997, 283–296.
4. Zaki MJ, Gouda K. Fast vertical mining using diffsets.
In: Proceedings of the 9th ACM International Conference on Knowledge Discovery and Data Mining (KDD
2003; Washington, DC). New York: ACM Press; 2003,
326–335.
5. Schmidt-Thieme L. Algorithmic features of Eclat. In:
Proceedings of the Workshop Frequent Item Set Mining Implementations (FIMI 2004; Brighton, UK).
Aachen, Germany: CEUR Workshop Proceedings 126;
2004.
6. Han J, Pei J, Yin Y. Mining frequent patterns without candidate generation. In: Proceedings of the 19th
ACM International Conference on Management of
Data (SIGMOD 2000; Dallas, TX). New York, NY:
ACM Press; 2000, 1–12.
7. Grahne G, Zhu J. Efficiently using prefix-trees in mining frequent itemsets. In: Proceedings of the Workshop Frequent Item Set Mining Implementations (FIMI
2003; Melbourne, FL). Aachen, Germany: CEUR
Workshop Proceedings 90; 2003.
8. Rácz B. Nonordfp: an FP-growth variation without
rebuilding the FP-tree. In: Proceedings of the 2nd
International Workshop on Frequent Itemset Mining Implementations (FIMI 2004; Brighton, UK).
Aachen, Germany: CEUR Workshop Proceedings 126;
2004.
9. Grahne G, Zhu J. Reducing the main memory consumptions of Fpmax∗ and FPclose. In: Proceedings of
the Workshop Frequent Item Set Mining Implementations (FIMI 2004; Brighton, UK). Aachen, Germany:
CEUR Workshop Proceedings 126; 2004.
10. Uno T, Asai T, Uchida Y, Arimura H. LCM: an efficient
algorithm for enumerating frequent closed item sets.
In: Proceedings of the Workshop on Frequent Item Set
Mining Implementations (FIMI 2003; Melbourne, FL).
Aachen, Germany: CEUR Workshop Proceedings
90; 2003.
11. Uno T, Kiyomi M, Arimura H. LCM ver. 2: efficient
mining algorithms for frequent/closed/maximal itemsets. In: Proceedings of the Workshop Frequent Item
Set Mining Implementations (FIMI 2004; Brighton,
UK). Aachen, Germany: CEUR Workshop Proceedings
126; 2004.
12. Uno T, Kiyomi M, Arimura H. LCM ver. 3: collaboration of array, bitmap and prefix tree for frequent
itemset mining. In: Proceedings of the 1st Open Source
Data Mining on Frequent Pattern Mining Implementations (OSDM 2005; Chicago, IL). New York, NY:
ACM Press; 2005, 77–86.
13. Gerstein GL, Perkel DH, Subramanian KN. Identification of functionally related neural assemblies. Brain
Res 1978, 140:43–62.
14. Bayardo RJ. Efficiently mining long patterns from
databases. In: Proceedings of the ACM International
Conference Management of Data (SIGMOD 1998;
Seattle, WA). New York, NY: ACM Press; 1998, 85–
93.
15. Lin D-I, Kedem ZM. Pincer-search: a new algorithm
for discovering the maximum frequent set. In: Proceedings of the 6th International Conference on Extending Database Technology (EDBT 1998; Valencia,
Spain). Heidelberg, Germany: Springer-Verlag; 1998,
103–119.
16. Agrawal RC, Aggarwal CC, Prasad VVV. Depth first
generation of long patterns. In: Proceedings of the 6th
ACM International Conference on Knowledge Discovery and Data Mining (KDD 2000; Boston, MA). New
York, NY: ACM Press; 2000, 108–118.
17. Aggarwal CC. Towards long pattern generation in
dense databases. SIGKDD Explor 2001, 3:20–26.
18. Burdick D, Calimlim M, Gehrke J. MAFIA: a maximal
frequent itemset algorithm for transactional databases.
In: Proceedings of the 17th International Conference
on Data Engineering (ICDE 2001; Heidelberg, Germany). Piscataway, NJ: IEEE Press; 2001, 443–452.
19. Pasquier N, Bastide Y, Taouil R, Lakhal L. Discovering frequent closed itemsets for association rules. In:
Proceedings of the 7th International Conference on
Database Theory (ICDT 1999; Jerusalem, Israel). London, United Kingdom: Springer-Verlag; 1999, 398–
416.
20. Bastide Y, Taouil R, Pasquier N, Stumme G, Lakhal
L. Mining frequent patterns with counting inference.
SIGKDD Explor 2002, 2:66–75.
21. Zaki MJ. Generating non-redundant association rules.
In: Proceedings of the 6th ACM International Conference on Knowledge Discovery and Data Mining (KDD
2000, Boston, MA). New York, NY: ACM Press; 2000,
34–43.
22. Cristofor D, Cristofor L, Simovici D. Galois connection
and data mining. J Univ Comput Sci 2000, 6:60–73.
23. Pei J, Han J, Mao R. Closet: an efficient algorithm for
mining frequent closed itemsets. In: Proceedings of the
SIGMOD International Workshop on Data Mining
and Knowledge Discovery (DMKD 2000; Dallas, TX).
ACM Press, New York, NY; 2000, 21–30.
24. Pan F, Cong G, Tung AKH, Yang J, Zaki MJ. Carpenter: finding closed patterns in long biological datasets.
In: Proceedings of the 9th ACM International Conference on Knowledge Discovery and Data Mining (KDD
2003; Washington, DC). New York, NY: ACM Press;
2003, 637–642.
25. Bastide Y, Pasquier N, Taouil R, Stumme G, Lakhal
L. Mining minimal non-redundant association rules
using frequent closed itemsets. In: Proceedings of the
1st International Conference on Computational Logic
(CL 2000; London, UK). London, United Kingdom:
Springer-Verlag, 2000, 972–986.
26. Bykowski A, Rigotti C. A condensed representation
to find frequent patterns. In: Proceedings of the 20th
ACM Symposium on Principles of Database Systems
(PODS 2001; Santa Barbara, CA). New York, NY:
ACM Press; 2001, 267–273.
27. Kryszkiewicz M, Gajek M. Concise representation of
frequent patterns based on generalized disjunction-free generators. In: Proceedings of the 6th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD 2002; Taipei, Taiwan). New York, NY:
Springer-Verlag; 2002, 159–171.
28. Liu G, Li J, Wong L, Hsu W. Positive borders or negative borders: how to make lossless generators based
representations concise. In: Proceedings of the 6th
SIAM International Conference on Data Mining (SDM
2006; Bethesda, MD). Philadelphia, PA: Society for Industrial and Applied Mathematics (SIAM); 2006, 469–473.
29. Liu G, Li J, Wong L. A new concise representation of frequent itemsets using generators and a positive border. J Knowl Inf Syst 2008, 17:35–56.
30. Kohavi R, Bradley CE, Frasca B, Mason L, Zheng Z. KDD-Cup 2000 organizers' report: peeling the onion. SIGKDD Explor 2000, 2:86–93.
31. Calders T, Goethals B. Mining all non-derivable frequent itemsets. In: Proceedings of the 6th European Conference on Principles of Data Mining and Knowledge Discovery (PKDD 2002; Helsinki, Finland). Berlin, Germany: Springer; 2002, 74–85.
32. Muhonen J, Toivonen H. Closed non-derivable itemsets. In: Proceedings of the 10th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD 2006; Berlin, Germany). Berlin, Germany: Springer; 2006, 601–608.
33. Bodon F. A fast apriori implementation. In: Proceedings of the Workshop on Frequent Item Set Mining Implementations (FIMI 2003; Melbourne, FL). Aachen, Germany: CEUR Workshop Proceedings 90; 2003.
34. Borgelt C. Efficient implementations of apriori and Eclat. In: Proceedings of the Workshop on Frequent Item Set Mining Implementations (FIMI 2003; Melbourne, FL). Aachen, Germany: CEUR Workshop Proceedings 90; 2003.
35. Bodon F. Surprising results of trie-based FIM algorithms. In: Proceedings of the 2nd Workshop on Frequent Item Set Mining Implementations (FIMI 2004; Brighton, UK). Aachen, Germany: CEUR Workshop Proceedings 126; 2004.
36. Borgelt C. Recursion pruning for the apriori algorithm. In: Proceedings of the 2nd Workshop on Frequent Item Set Mining Implementations (FIMI 2004; Brighton, UK). Aachen, Germany: CEUR Workshop Proceedings 126; 2004.
37. Bodon F, Schmidt-Thieme L. The relation of closed itemset mining, complete pruning strategies and item ordering in apriori-based FIM algorithms. In: Proceedings of the 9th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD 2005; Porto, Portugal). Berlin, Germany: Springer-Verlag; 2005.
38. Kosters WA, Pijls W. Apriori: a depth first implementation. In: Proceedings of the Workshop on Frequent Item Set Mining Implementations (FIMI 2003; Melbourne, FL). Aachen, Germany: CEUR Workshop Proceedings 90; 2003.
39. Wang K, Tang L, Han J, Liu J. Top-down FP-growth for association rule mining. In: Proceedings of the 6th Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining (PAKDD 2002; Taipei, Taiwan). London, United Kingdom: Springer-Verlag; 2002, 334–340.
40. Borgelt C, Wang X. SaM: a split and merge algorithm for fuzzy frequent item set mining. In: Proceedings of the 13th International Fuzzy Systems Association World Congress and 6th Conference of
European Society for Fuzzy Logic and Technology
(IFSA/EUSFLAT’09; Lisbon, Portugal). Lisbon, Portugal: IFSA/EUSFLAT Organization Committee; 2009,
968–973.
41. Rácz B, Bodon F, Schmidt-Thieme L. Benchmarking
frequent itemset mining algorithms: from measurement
to analysis. In: Proceedings of the 1st Open Source
Data Mining on Frequent Pattern Mining Implementations (OSDM 2005, Chicago, IL). New York, NY:
ACM Press; 2005, 36–45.
42. Bayardo R, Goethals B, Zaki MJ, eds. In: Proceedings of the 2nd Workshop Frequent Item Set Mining
Implementations (FIMI 2004; Brighton, UK). Aachen,
Germany: CEUR Workshop Proceedings 126; 2004.
43. Goethals B, Zaki MJ, eds. In: Proceedings of the
Workshop Frequent Item Set Mining Implementations (FIMI 2003; Melbourne, FL). Aachen, Germany:
CEUR Workshop Proceedings 90; 2003.
44. Pietracaprina A, Zandolin D. Mining frequent itemsets
using patricia tries. In: Proceedings of the Workshop
on Frequent Item Set Mining Implementations (FIMI
2003; Melbourne, FL). Aachen, Germany: CEUR
Workshop Proceedings 90; 2003.
45. Schlegel B, Gemulla R, Lehner W. Memory-efficient
frequent-itemset mining. In: Proceedings of the 14th
International Conference on Extending Database
Technology (EDBT 2011; Uppsala, Sweden). New
York, NY: ACM Press; 2011, 461–472.
46. Zaki MJ, Hsiao C-J. CHARM: an efficient algorithm
for closed itemset mining. In: Proceedings of the 2nd
SIAM International Conference on Data Mining (SDM
2002; Arlington, VA). Philadelphia, PA: Society for Industrial and Applied Mathematics (SIAM); 2002, 457–
473.
47. Wang J, Han J, Pei J. Closet+: searching for the
best strategies for mining frequent closed itemsets. In:
Proceedings of the ACM International Conference on
Knowledge Discovery and Data Mining (KDD 2003;
Washington, DC). New York, NY: ACM Press; 2003.
48. Lucchese C, Orlando S, Perego R. DCI closed: a fast
and memory efficient algorithm to mine frequent closed
itemsets. In: Proceedings of the 2nd Workshop Frequent Item Set Mining Implementations (FIMI 2004;
Brighton, UK). Aachen, Germany: CEUR Workshop
Proceedings 126; 2004.
49. Gouda K, Zaki MJ. Efficiently mining maximal frequent itemsets. In: Proceedings of the 1st IEEE International Conference on Data Mining (ICDM 2001;
San Jose, CA). Piscataway, NJ: IEEE Press; 2001, 163–
170.
50. Burdick D, Calimlim M, Flannick J, Gehrke J, Yiu
T. MAFIA: a performance study of mining maximal
frequent itemsets. In: Proceedings of the Workshop
on Frequent Item Set Mining Implementations (FIMI
2003; Melbourne, FL). Aachen, Germany: CEUR
Workshop Proceedings 90; 2003.
51. Mielikäinen T. Intersecting data to closed sets with
constraints. In: Proceedings of the Workshop Frequent
Item Set Mining Implementations (FIMI 2003; Melbourne, FL). Aachen, Germany: CEUR Workshop Proceedings 90; 2003.
52. Cong G, Tan KI, Tung AKH, Pan F. Mining frequent
closed patterns in microarray data. In: Proceedings of
the 4th IEEE International Conference on Data Mining (ICDM 2004; Brighton, UK). Piscataway, NJ: IEEE
Press; 2004, 363–366.
53. Pan F, Tung AKH, Cong G, Xu X. Cobbler: combining column and row enumeration for closed pattern
discovery. In: Proceedings of the 16th International
Conference on Scientific and Statistical Database Management (SSDBM 2004; Santorini Island, Greece). Piscataway, NJ: IEEE Press; 2004, 21–30.
54. Ganter B, Wille R. Formal Concept Analysis: Mathematical Foundations. Berlin: Springer, 1999.
55. Rioult F, Boulicaut J-F, Crémilleux B, Besson J. Using
transposition for pattern discovery from microarray
data. In: Proceedings of the 8th ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery (DMKD 2003; San Diego, CA). New
York, NY: ACM Press; 2003, 73–79.
56. Borgelt C, Yang X, Nogales-Cadenas R, CarmonaSaez P, Pascual-Montano A. Finding closed frequent
item sets by intersecting transactions. In: Proceedings
of the 14th International Conference on Extending
Database Technology (EDBT 2011; Uppsala, Sweden).
New York, NY: ACM Press; 2011, 367–376.
57. Creighton C, Hanash S. Mining gene expression
databases for association rules. Bioinformatics 2003,
19:79–86.
58. Agrawal R, Imielienski T, Swami A. Mining association rules between sets of items in large databases. In:
Proceedings of the ACM International Conference on
Management of Data (SIGMOD 1993; Washington,
DC). New York, NY: ACM Press; 1993, 207–216.
59. Srikant R, Agrawal R. Mining generalized association rules. In: Proceedings of the 21st International
Conference on Very Large Databases (VLDB 1995;
Zurich, Switzerland). San Mateo, CA: Morgan Kaufmann; 1995, 407–419.
60. Srikant R, Agrawal R. Mining quantitative association rules in large relational tables. In: Proceedings of
the ACM International Conference on Management of
Data (SIGMOD 1996; Montreal, Canada). New York,
NY: ACM Press; 1996, 1–12.
61. Kuok C, Fu A, Wong M. Mining fuzzy association rules
in databases. SIGMOD Rec 1998, 27:41–46.
62. Brin S, Motwani R, Ullman JD, Tsur S. Dynamic itemset counting and implication rules for market basket data. In: Proceedings of the ACM International
Conference on Management of Data (SIGMOD 1997; Tucson, AZ). New York, NY: ACM Press; 1997, 265–276.
63. Piatetsky-Shapiro G. Discovery, analysis, and presentation of strong rules. In: Piatetsky-Shapiro G, Frawley WJ, eds. Knowledge Discovery in Databases. Palo Alto, CA: AAAI Press; 1991, 229–248.
64. Tan P-N, Kumar V, Srivastava J. Selecting the right interestingness measure for association patterns. In: Proceedings of the 8th ACM International Conference on Knowledge Discovery and Data Mining (KDD 2002; Edmonton, Canada). New York, NY: ACM Press; 2002, 32–41.
65. Tan P-N, Kumar V, Srivastava J. Selecting the right objective measure for association analysis. Inf Syst 2004, 29:293–313.
66. Geng L, Hamilton HJ. Interestingness measures for data mining: a survey. ACM Comput Surv (CSUR) 2006, 38:Article 9.
67. Webb GI, Zhang S. K-optimal rule discovery. Data Min Knowl Discov 2005, 10:39–79.
68. Webb GI. Discovering significant patterns. Mach Learn 2007, 68:1–33.
69. Choi S-S, Cha S-H, Tappert CC. A survey of binary similarity and distance measures. J Syst Cybern Inf 2010, 8:43–48.
70. Segond M, Borgelt C. Item set mining based on cover similarity. In: Proceedings of the 15th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD 2011; Shenzhen, China). Berlin, Germany: Springer-Verlag; 2011, LNCS 6635:493–505.
71. Jaccard P. Étude comparative de la distribution florale dans une portion des Alpes et du Jura. Bulletin de la Société Vaudoise des Sciences Naturelles 1901, 37:547–579.
72. Seno M, Karypis G. LPMiner: an algorithm for finding frequent itemsets using length decreasing support constraint. In: Proceedings of the 1st IEEE International Conference on Data Mining (ICDM 2001; San Jose, CA). Piscataway, NJ: IEEE Press; 2001, 505–512.
73. Wang J, Karypis G. BAMBOO: accelerating closed itemset mining by deeply pushing the length-decreasing support constraint. In: Proceedings of the SIAM International Conference on Data Mining (SDM 2004; Disneyworld, FL). Philadelphia, PA: Society for Industrial and Applied Mathematics; 2004, 432–436.
74. Geerts F, Goethals B, Mielikäinen T. Tiling databases. In: Proceedings of the 7th International Conference on Discovery Science (DS 2004; Padova, Italy). Berlin, Germany: Springer; 2004, 278–289.
75. Bonferroni CE. Il calcolo delle assicurazioni su gruppi di teste. Studi in Onore del Professore Salvatore Ortu Carboni 1935, 13–60.
76. Abdi H. Bonferroni and Sidák corrections for multiple comparisons. In: Salkind NJ, ed. Encyclopedia of Measurement and Statistics. Thousand Oaks, CA: Sage Publications; 2007.
77. Holm S. A simple sequentially rejective multiple test procedure. Scand J Stat 1979, 6:65–70.
78. Webb GI. Layered critical values: a powerful direct adjustment approach to discovering significant patterns. Mach Learn 2008, 71:307–323.
79. Megiddo N, Srikant R. Discovering predictive association rules. In: Proceedings of the 4th International Conference on Knowledge Discovery and Data Mining (KDD 1998; New York, NY). Menlo Park, CA: AAAI Press; 1998, 274–278.
80. Gionis A, Mannila H, Mielikäinen T, Tsaparas P. Assessing data mining results via swap randomization. In: Proceedings of the 12th ACM International Conference on Knowledge Discovery and Data Mining (KDD 2006; Philadelphia, PA). New York, NY: ACM Press; 2006, 167–176.
81. Siebes A, Vreeken J, van Leeuwen M. Item sets that compress. In: Proceedings of the SIAM International Conference on Data Mining (SDM 2006; Bethesda, MD). Philadelphia, PA: Society for Industrial and Applied Mathematics; 2006, 393–404.
82. Vreeken J, van Leeuwen M, Siebes A. Krimp: mining itemsets that compress. Data Min Knowl Discov 2011, 23:169–214.
83. Bringmann B, Zimmermann A. The chosen few: on
identifying valuable patterns. In: Proceedings of the
7th IEEE International Conference on Data Mining
(ICDM 2007; Omaha, NE). Piscataway, NJ: IEEE
Press; 2007, 63–72.
84. De Raedt L, Zimmermann A. Constraint-based pattern
set mining. In: Proceedings of the 7th IEEE International Conference on Data Mining (ICDM 2007; Omaha, NE). Piscataway, NJ: IEEE Press; 2007, 237–248.
85. Webb GI. Self-sufficient itemsets: an approach to
screening potentially interesting associations between
items. ACM Trans Knowl Discov Data (TKDD) 2010,
4:Article 3.
86. Cheng H, Yu PS, Han J. Approximate frequent itemset
mining in the presence of random noise. In: Maimon O,
Rokach L, eds. Soft Computing for Knowledge Discovery and Data Mining. Vol. IV. Berlin: Springer; 2008,
363–389.
87. Pei J, Tung AKH, Han J. Fault-tolerant frequent pattern mining: problems and challenges. In: Proceedings
of the ACM SIGMOD Workshop on Research Issues
in Data Mining and Knowledge Discovery (DMKD
2001; Santa Barbara, CA). New York, NY: ACM Press;
2001.
88. Besson J, Robardet C, Boulicaut J-F. Mining a new
fault-tolerant pattern type as an alternative to formal
concept discovery. In: Proceedings of the International
Conference on Computational Science (ICCS 2006;
Reading, United Kingdom). Berlin, Germany: SpringerVerlag; 2006, 144–157.
89. Gionis A, Mannila H, Seppänen JK. Geometric and
combinatorial tiles in 0-1 data. In: Proceedings of the
8th European Conference on Principles and Practice
of Knowledge Discovery in Databases (PKDD 2004;
Pisa, Italy). Berlin, Germany: Springer-Verlag; 2004,
LNAI 3202:173–184.
90. Yang C, Fayyad U, Bradley PS. Efficient discovery of
error-tolerant frequent itemsets in high dimensions.
In: Proceedings of the 7th ACM International Conference on Knowledge Discovery and Data Mining (KDD
2001; San Francisco, CA). New York, NY: ACM Press;
2001, 194–203.
91. Seppänen JK, Mannila H. Dense itemsets. In: Proceedings of the 10th ACM International Conference on Knowledge Discovery and Data Mining (KDD 2004; Seattle, WA). New York, NY: ACM Press; 2004, 683–688.
92. Liu J, Paulsen S, Sun X, Wang W, Nobel A, Prins J. Mining approximate frequent itemsets in the presence of noise: algorithm and analysis. In: Proceedings of the 6th SIAM Conference on Data Mining (SDM 2006; Bethesda, MD). Philadelphia, PA: Society for Industrial and Applied Mathematics (SIAM); 2006, 405–416.
93. Wang X, Borgelt C, Kruse R. Mining fuzzy frequent item sets. In: Proceedings of the 11th International Fuzzy Systems Association World Congress (IFSA 2005; Beijing, China). Beijing, China/Heidelberg, Germany: Tsinghua University Press/Springer-Verlag; 2005, 528–533.
94. Chui C-K, Kao B, Hung E. Mining frequent itemsets from uncertain data. In: Proceedings of the 11th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD 2007; Nanjing, China). Berlin, Germany: Springer-Verlag; 2007, 47–58.
95. Leung CK-S, Carmichael CL, Hao B. Efficient mining of frequent patterns from uncertain data. In: Proceedings of the 7th IEEE International Conference on Data Mining Workshops (ICDMW 2007; Omaha, NE). Piscataway, NJ: IEEE Press; 2007, 489–494.
96. Aggarwal CC, Lin Y, Wang J, Wang J. Frequent pattern mining with uncertain data. In: Proceedings of the 15th ACM International Conference on Knowledge Discovery and Data Mining (KDD 2009; Paris, France). New York, NY: ACM Press; 2009, 29–38.
97. Calders T, Garboni C, Goethals B. Efficient pattern mining of uncertain data with sampling. In: Proceedings of the 14th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD 2010; Hyderabad, India). Berlin, Germany: Springer-Verlag; 2010, I:480–487.
98. Calders T, Garboni C, Goethals B. Approximation of frequentness probability of itemsets in uncertain data. In: Proceedings of the IEEE International Conference on Data Mining (ICDM 2010; Sydney, Australia). Piscataway, NJ: IEEE Press; 2010, 749–754.