The Apriori Algorithm - Institute for Mathematical Sciences

March 30, 2005
9:7
WSPC/Lecture Notes Series: 9in x 6in
40
heg05a
M. Hegland
Here we can apply the previous theorem to get a lower bound for $|C_k|$:
$|C_k| \geq b^{(k)}(m_k, \ldots, m_s, s + p - 1)$.
This, however, contradicts the upper bound we obtained previously,
and so we must have $|C_{k+p}| \leq b^{(k+p)}(m_j, \ldots, m_r)$.
As a simple consequence one also gets tightness:
Corollary 17: For any $m$ and $k$ there exists a $C_k$ with
$|C_k| = m = b^{(k+p)}(m_k, \ldots, m_{s+1})$ such that
$|C_{k+p}| = b^{(k+p)}(m_k, \ldots, m_{s+1})$.
Proof: Let $C_k$ consist of the first $m$ $k$-itemsets in the colexicographic
ordering.
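The construction used in the proof can be made concrete with a short sketch. The function name `first_colex` and the choice of labelling items by the integers 0, 1, 2, ... are assumptions made for illustration only. Colexicographic order is taken here to mean that two itemsets are compared by their largest element first, then by the next largest, and so on.

```python
from itertools import combinations
from math import comb

def first_colex(m, k):
    """Return the first m k-itemsets in colexicographic order.

    Items are labelled 0, 1, 2, ... (an assumption for this sketch).
    Each itemset is returned as an increasing tuple.
    """
    # All k-subsets of {0, ..., n-1} precede any subset containing n,
    # so it suffices to pick the smallest n with comb(n, k) >= m.
    n = k
    while comb(n, k) < m:
        n += 1
    # Colex order: sort by the reversed tuple (largest element first).
    itemsets = sorted(combinations(range(n), k), key=lambda c: c[::-1])
    return itemsets[:m]

print(first_colex(5, 2))
# [(0, 1), (0, 2), (1, 2), (0, 3), (1, 3)]
```

Note that the first `comb(n, k)` itemsets in this order are exactly the k-subsets of the first n items, which is why such sets are natural extremal examples.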
In practice one would know not only the size but also the contents of any
$C_k$, and from that one can get a much better bound than the one provided
by the theory. A consequence of the theorem is that for $L_k$ with
$|L_k| \leq \binom{m_k}{k}$ one has $|C_{k+p}| \leq \binom{m_k}{k+p}$.
In particular, one has $C_{k+p} = \emptyset$ for $k > m_k - p$, since the
binomial coefficient vanishes when $k + p > m_k$.
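A small numerical check of this consequence, read as "$|L_k| \leq \binom{m_k}{k}$ implies $|C_{k+p}| \leq \binom{m_k}{k+p}$", may help. The sketch below assumes that $C_{k+p}$ denotes the $(k+p)$-itemsets all of whose $k$-subsets lie in $L_k$; the helper name `candidates` and the values $m_k = 5$, $k = 2$ are hypothetical choices for illustration. It uses the extremal case where $L_k$ contains every $k$-subset of $m_k$ items, so the bound is attained.

```python
from itertools import combinations
from math import comb

def candidates(frequent, k, size, items):
    """All `size`-itemsets over `items` whose every k-subset is in `frequent`."""
    frequent = set(frequent)
    return [c for c in combinations(items, size)
            if all(s in frequent for s in combinations(c, k))]

m_k, k = 5, 2                        # illustrative values, not from the text
items = range(m_k)
L_k = list(combinations(items, k))   # all 2-subsets of 5 items: comb(5, 2) = 10

for p in range(1, 5):
    C = candidates(L_k, k, k + p, items)
    # Number of candidates found versus the binomial bound comb(m_k, k + p).
    print(p, len(C), comb(m_k, k + p))
```

For $p = 4$ the candidate set is empty, in line with the binomial coefficient being zero once $k + p$ exceeds $m_k$.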
4. Extensions
4.1. Apriori Tid
One variant of the apriori algorithm discussed above computes supports
of itemsets by doing intersections of columns. Some of these intersections
are repeated over time and, in particular, entries of the Boolean matrix
are revisited which have no impact on the support. The Apriori TID [?]
algorithm provides a solution to some of these problems. For computing
the supports for larger itemsets it does not revisit the original table but
transforms the table as it goes along. The new columns correspond to the
candidate itemsets. In this way each new candidate itemset only requires
the intersection of two old ones.
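The idea of carrying one column per candidate itemset, so that each new candidate costs a single intersection of two old columns, can be sketched as follows. This is a minimal illustration and not the published Apriori TID algorithm: the function name `frequent_by_columns`, the representation of a Boolean column as the set of row indices where it is true, and the sample transactions are all assumptions. The sketch also omits the subset-based pruning of candidates, which affects efficiency but not the result.

```python
def frequent_by_columns(transactions, min_support):
    """Level-wise frequent itemset search keeping one column per itemset.

    A column is stored as the set of row indices (tids) containing the
    itemset. Returns a dict mapping itemset tuples to their support.
    """
    # Level 1: one column per single item, built from the original table.
    columns = {}
    for tid, row in enumerate(transactions):
        for item in row:
            columns.setdefault((item,), set()).add(tid)
    columns = {s: t for s, t in columns.items() if len(t) >= min_support}
    frequent = dict(columns)

    # Higher levels: the original table is never revisited.
    while columns:
        new_columns = {}
        keys = sorted(columns)
        for i, a in enumerate(keys):
            for b in keys[i + 1:]:
                if a[:-1] != b[:-1]:
                    continue  # join only itemsets sharing all but the last item
                tids = columns[a] & columns[b]  # one intersection of two old columns
                if len(tids) >= min_support:
                    new_columns[a + (b[-1],)] = tids
        frequent.update(new_columns)
        columns = new_columns
    return {s: len(t) for s, t in frequent.items()}

# Hypothetical data: four rows, minimal support of 2 rows (50 percent).
rows = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
result = frequent_by_columns(rows, 2)
print(result[(2, 3, 5)])   # 2
```

Storing columns as tid sets rather than dense Boolean vectors is a design choice of this sketch; the intersection plays the role of the elementwise AND of two matrix columns.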
The following example, adapted from [?], demonstrates how this works. The first row shows the itemsets from $C_k$.
The minimal support is 50 percent or 2 rows. The initial matrix of the tid