SA-IFIM: Incrementally Mining Frequent
Itemsets in Update Distorted Databases⋆
Jinlong Wang, Congfu Xu⋆⋆ , Hongwei Dan, and Yunhe Pan
Institute of Artificial Intelligence, Zhejiang University
Hangzhou, 310027, China
[email protected] [email protected]
[email protected] [email protected]
Abstract. Maintaining privacy in frequent itemset mining has attracted considerable attention. In most of this work only distorted data are available, which complicates the data-mining process. In particular, when a distorted database is updated dynamically, mining frequent itemsets incrementally is nontrivial because recomputing the support counts of itemsets incurs a high counting overhead. This paper investigates this problem and develops an efficient algorithm, SA-IFIM, for incrementally mining frequent itemsets in update distorted databases. The algorithm stores some additional information during the earlier mining process to support efficient incremental computation. In addition, by introducing the supporting aggregate and representing it with a bit vector, the transaction database is transformed into a machine-oriented model that allows fast support computation. Performance studies show the efficiency of our algorithm.
1 Introduction
Privacy has recently become one of the prime concerns in data mining. To avoid compromising privacy, most work applies distortion or randomization techniques to the original data set, and only the disguised data are shared for data mining [1–3].
Mining frequent itemset models from distorted databases with reconstruction methods is much more expensive than mining the original data sets directly [2]. In [3, 4], basic formulas from set theory are used to eliminate these counting overheads. In reality, however, the database in many applications is dynamic: transactions are added and deleted over time. Such changes may invalidate some existing frequent itemsets and introduce new ones, so incremental algorithms [5, 6] were proposed to address the problem. However, applying these incremental algorithms directly to an update distorted database is inefficient because of the high counting overhead of recomputing the support of itemsets. Although [7] has proposed an algorithm for incremental updating, its efficiency still cannot meet practical requirements.
⋆ Supported by the Natural Science Foundation of China (No. 60402010), Zhejiang Provincial Natural Science Foundation of China (Y105250) and the Science-Technology Program of Zhejiang Province of China (No. 2004C31098).
⋆⋆ Congfu Xu is the corresponding author.
This paper investigates the problem of incremental frequent itemset mining in update distorted databases. We first develop an efficient incremental updating computation method that quickly reconstructs an itemset's support by using additional information stored during the earlier mining process. Then a new concept, the supporting aggregate (SA), is introduced and represented with a bit vector. In this way, the transaction database is transformed into a machine-oriented model that allows fast support computation. Finally, an efficient algorithm, SA-IFIM (Supporting Aggregate based Incremental Frequent Itemset Mining in update distorted databases), is presented to describe the process. Performance studies show the efficiency of our algorithm.
The remainder of this paper is organized as follows. Section 2 presents the SA-IFIM algorithm step by step. The performance studies are reported in Section 3. Finally, Section 4 concludes the paper.
2 The SA-IFIM Algorithm
In this section, the SA-IFIM algorithm is introduced step by step. Before mining, each data set is distorted using the method of EMASK [3]. In the following, we first describe the preliminaries of incremental frequent itemset mining. We then investigate the essence of the updating technique and use set theory, together with additional information recorded during the earlier mining, for quick updating computation. Next, we introduce the supporting aggregate and represent it with a bit vector to transform the database into a machine-oriented model that speeds up computation. Finally, the SA-IFIM algorithm is summarized.
2.1 Preliminaries
In this subsection, some preliminaries of incremental frequent itemset mining are presented, summarizing the formal description in [5, 6].
Let $D$ be a set of transactions and $I = \{i_1, i_2, \ldots, i_m\}$ a set of distinct literals (items). For a dynamic database, old transactions $\Delta^-$ are deleted from the database $D$ and new transactions $\Delta^+$ are added. Naturally, $\Delta^- \subseteq D$. Denote the updated database by $D'$, so that $D' = (D - \Delta^-) \cup \Delta^+$, and the unchanged transactions by $D^- = D - \Delta^-$. Let $Fp$ denote the frequent itemsets in the original database $D$, and $Fp_k$ the frequent k-itemsets. The problem of incremental mining is to find the frequent itemsets (denoted by $Fp'$) in $D'$, given $\Delta^-$, $D^-$, $\Delta^+$, and the mining result $Fp$, with respect to the same user-specified minimum support $s$. Furthermore, the incremental approach needs to take advantage of previously obtained information to avoid rerunning the mining algorithm on the whole database when the database is updated. For clarity, we present $s$ as a relative support value, but $\delta_c^+$, $\delta_c^-$, $\sigma_c$, and $\sigma_c'$ as absolute support counts of an itemset $c$ in $\Delta^+$, $\Delta^-$, $D$, and $D'$, respectively. Let $\delta_c$ be the change in the support count of itemset $c$. Then $\delta_c = \delta_c^+ - \delta_c^-$ and $\sigma_c' = \sigma_c + \delta_c^+ - \delta_c^-$.
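The update rule for support counts can be checked on a small example. The following is a minimal sketch on undistorted toy data; the transactions and the helper name `support_count` are illustrative assumptions and do not come from the paper.

```python
def support_count(transactions, itemset):
    """Absolute support: how many transactions contain every item of `itemset`."""
    items = set(itemset)
    return sum(1 for t in transactions if items <= set(t))


# Hypothetical toy data, not taken from the paper.
D = [{"a", "b"}, {"a", "c"}, {"a", "b", "c"}, {"b"}]
delta_minus = [D[0]]                          # deleted transactions, a subset of D
delta_plus = [{"a", "b"}, {"a", "b", "c"}]    # newly added transactions
D_unchanged = D[1:]                           # D^- = D - delta_minus
D_updated = D_unchanged + delta_plus          # D' = D^- together with delta_plus

c = {"a", "b"}
sigma = support_count(D, c)
# sigma' = sigma + delta^+ - delta^-, computed without rescanning D'
sigma_updated = sigma + support_count(delta_plus, c) - support_count(delta_minus, c)
```

Here the incrementally updated count agrees with a direct count over `D_updated`, which is the property the incremental approach relies on.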
2.2 Efficient incremental computation
In a dynamically updated environment, the important aspects of mining are how to handle the itemsets that are frequent in $D$ (recorded in $Fp$), and how to add the itemsets that are non-frequent in $D$ (not in $Fp$) but frequent in $D'$. In the following, for simplicity, $|\cdot|$ denotes the number of tuples in a transaction database.
1. For the frequent itemsets in $Fp$, determine which are still frequent and which have become non-frequent in the updated database $D'$.
Lemma 1. If $c \in Fp$ (i.e., $\sigma_c \ge |D| \times s$) and $\delta_c \ge (|\Delta^+| - |\Delta^-|) \times s$, then $c \in Fp'$.
Proof. $\sigma_c' = \sigma_c + \delta_c^+ - \delta_c^- \ge |D| \times s + |\Delta^+| \times s - |\Delta^-| \times s = (|D| + |\Delta^+| - |\Delta^-|) \times s = |D'| \times s$. □
Property 1. When $c \in Fp$ and $\delta_c < (|\Delta^+| - |\Delta^-|) \times s$, then $c \in Fp'$ if and only if $\sigma_c' \ge |D'| \times s$.
2. For the itemsets that are non-frequent in $D$, mine the frequent itemsets in the changed database $\Delta^+ - \Delta^-$ and recompute their support counts by scanning $D^-$.
Lemma 2. If $c \notin Fp$ and $\delta_c < (|\Delta^+| - |\Delta^-|) \times s$, then $c \notin Fp'$.
Proof. Analogous to the proof of Lemma 1. □
Property 2. When $c \notin Fp$ and $\delta_c \ge (|\Delta^+| - |\Delta^-|) \times s$, then $c \in Fp'$ if and only if $\sigma_c' \ge |D'| \times s$.
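The two lemmas and two properties amount to a four-way case analysis. The sketch below is one way to express it; the function names and return labels are hypothetical, and the itemset's membership in the old result and its support change are assumed to be known already.

```python
def classify(in_fp, delta_c, n_plus, n_minus, s):
    """Decide what the lemmas say about an itemset after an update.

    in_fp   -- True if the itemset was frequent in the original database
    delta_c -- change of its support count (count in added minus count in deleted)
    n_plus, n_minus -- number of added and deleted transactions
    s       -- relative minimum support
    """
    threshold = (n_plus - n_minus) * s
    if in_fp and delta_c >= threshold:
        return "frequent"        # Lemma 1: stays frequent, no further check needed
    if not in_fp and delta_c < threshold:
        return "non-frequent"    # Lemma 2: cannot become frequent
    return "check"               # Property 1 or 2: the updated support must be tested


def is_frequent_after_update(sigma_updated, n_d, n_plus, n_minus, s):
    """Test of Properties 1 and 2: updated support against |D'| * s."""
    return sigma_updated >= (n_d + n_plus - n_minus) * s
```

Only itemsets labeled "check" require their updated support count, which is where the reconstruction cost in a distorted database arises.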
Under the symbol-specific distortion framework of [3], each '1' in the original database is flipped with probability $(1-p)$ and each '0' with probability $(1-q)$. In incremental frequent itemset mining, the goal is to mine frequent itemsets from the distorted databases using the information obtained during the earlier mining process. To test the condition of Property 2 for an itemset not in $Fp$, we need to reconstruct the itemset's support in the unchanged database $D^-$ by scanning its distorted version $D^{-*}$. Not only the distorted support of the itemset itself, but also other counts related to it must be tracked. This makes the support computation in Property 2 both difficult and critically important in incremental mining, and it is nontrivial to apply traditional incremental algorithms directly. To address the problem, an efficient incremental updating operation is first developed that computes with the supports in the distorted database; a further method that improves the efficiency of support computation is presented in Section 2.3.
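The symbol-specific flipping can be illustrated on a single Boolean transaction row. This is a minimal sketch that assumes each bit is distorted independently; the function name `distort_row` and the explicit random generator argument are illustrative choices, not part of EMASK [3].

```python
import random


def distort_row(row, p, q, rng):
    """Distort one 0/1 transaction row.

    A 1 is kept with probability p (flipped with probability 1 - p);
    a 0 is kept with probability q (flipped with probability 1 - q).
    """
    out = []
    for bit in row:
        keep_probability = p if bit == 1 else q
        out.append(bit if rng.random() < keep_probability else 1 - bit)
    return out


# Example with the parameter values used in the experiments (p = 0.5, q = 0.97).
rng = random.Random(0)
distorted = distort_row([1, 0, 0, 1, 0], 0.5, 0.97, rng)
```

With `p = q = 1` the row is returned unchanged, and with `p = q = 0` every bit is flipped, which gives two quick sanity checks for the sketch.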
In distorted databases, computing the supports of frequent itemsets is tedious. Motivated by [3], a similar support computation method is used in incremental mining. With this method, computing an itemset's support requires the support counts of all its subsets in the distorted database. However, saving the support counts of all itemsets is impractical: it greatly increases the cost and degrades indexing efficiency. Thus, in incremental mining, when the frequent itemsets and their support counts are recorded, the corresponding counts in each distorted database are registered at the same time. In this way, for a k-itemset not in $Fp$, since all its subsets are frequent in the database, we can use the existing support counts in each distorted database to reconstruct its support in the updated database quickly. Thus, the efficiency is improved.
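For the simplest case of a single item, the reconstruction follows directly from the flipping probabilities $(1-p)$ and $(1-q)$ described in this subsection: if an item occurs in $c$ of $n$ original transactions, its expected count in the distorted database is $p \cdot c + (1-q)(n-c)$, which can be solved for $c$. The sketch below covers only this 1-itemset case and is not the full procedure of [3]; the function name is hypothetical.

```python
def reconstruct_item_support(distorted_count, n, p, q):
    """Estimate the original support count of a single item.

    distorted_count -- number of 1s observed for the item in the distorted data
    n               -- number of transactions
    p, q            -- probabilities of keeping a 1 and a 0, respectively

    Solves  distorted_count = p * c + (1 - q) * (n - c)  for c.
    """
    denominator = p + q - 1
    if denominator == 0:
        raise ValueError("p + q = 1 makes the distortion non-invertible")
    return (distorted_count - (1 - q) * n) / denominator


# An item with 200 true occurrences in 1000 transactions is expected to show
# 0.5 * 200 + 0.03 * 800 = 124 occurrences after distortion with p=0.5, q=0.97.
estimate = reconstruct_item_support(124, 1000, 0.5, 0.97)
```

For longer itemsets the same idea needs the distorted counts of the related bit combinations, which is why the text stresses recording the counts in each distorted database alongside the frequent itemsets.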
2.3 Supporting aggregate and database transformation
To improve efficiency, we introduce the concept of the supporting aggregate and represent it with a bit vector. By means of elementary supporting aggregates based on bit vectors, the database is transformed into a machine-oriented data model, which makes the computation of itemsets' supports more efficient.
In the following, for a transaction database $D$, let $U$ denote a set of objects (the universe) that serve as unique identifiers of the transactions. For simplicity, we do not distinguish between $U$ and the transactions themselves. For an itemset $A \subseteq I$, a transaction $u \in U$ is said to contain $A$ if $A \subseteq u$.
Definition 1. Supporting aggregate (SA). For an attribute itemset $A \subseteq I$, its supporting aggregate is $S(A) = \{u \in U \mid A \subseteq u\}$, that is, the aggregate composed of the transactions that include the attribute itemset $A$. In general, $S(A) \subseteq U$. The supporting aggregate of a single attribute item is called an elementary supporting aggregate (ESA).
Using ESAs, the original transaction database is vertically inverted and transformed into an attribute-transaction list. From the ESAs, the SA of an itemset can be obtained quickly by set intersection, and the itemset's support can be computed efficiently. To further improve processing speed, each SA (ESA) is represented as a binary vector of $|U|$ dimensions, denoted BV-SA (BV-ESA), where $|U|$ is the number of transactions in $U$. If an itemset's SA contains the i-th transaction, the i-th dimension of its binary vector is set to 1; otherwise it is set to 0. With this representation, the support count of each attribute item can be computed efficiently.
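The bit-vector representation can be sketched with Python integers used as bit vectors, where bit i stands for the i-th transaction. The helper names below are hypothetical, and the authors' C++ implementation may organize the vectors differently.

```python
def build_bv_esa(transactions):
    """Vertical inversion: map each item to an int whose i-th bit is 1
    if and only if transaction i contains the item (its BV-ESA)."""
    bv_esa = {}
    for i, transaction in enumerate(transactions):
        for item in transaction:
            bv_esa[item] = bv_esa.get(item, 0) | (1 << i)
    return bv_esa


def bv_sa(bv_esa, itemset, n_transactions):
    """BV-SA of an itemset: bitwise AND of the BV-ESAs of its items."""
    result = (1 << n_transactions) - 1      # S(empty set) = U, all bits set
    for item in itemset:
        result &= bv_esa.get(item, 0)       # set intersection as a bitwise AND
    return result


def support(bit_vector):
    """Support count = number of 1 bits in the vector."""
    return bin(bit_vector).count("1")


# Hypothetical toy data.
transactions = [{"a", "b"}, {"a", "c"}, {"a", "b", "c"}, {"b"}]
esa = build_bv_esa(transactions)
ab = bv_sa(esa, {"a", "b"}, len(transactions))
```

In this toy data `ab` has bits 0 and 2 set, so the support of {a, b} is 2; set intersection has become a single AND per item followed by a bit count.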
With the vertical database representation, in which each row is the BV-ESA of one attribute, non-frequent attribute items can be removed progressively owing to the downward closure property [8], which efficiently reduces the size of the data set. On the other hand, the whole set of BV-ESAs sometimes cannot be loaded into memory because of memory constraints. Our approach addresses this scalability problem by horizontally partitioning the transaction data set into subsets, each composed of part of the objects (transactions), and then loading them partition by partition. With this method the partitions are disjoint from each other, which makes the approach suitable for parallel and distributed processing. Furthermore, in practice, an optimized memory swap strategy can be adopted to reduce the I/O cost.
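Because the partitions are disjoint, an itemset's support over the whole data set is the sum of its supports in the individual partitions. The following minimal sketch illustrates this with in-memory lists; the function names and the partition size are assumptions, and a real implementation would read each partition from disk.

```python
def partitions(transactions, size):
    """Yield consecutive, disjoint horizontal partitions of at most `size` rows."""
    for start in range(0, len(transactions), size):
        yield transactions[start:start + size]


def support_by_partition(transactions, itemset, size):
    """Sum per-partition supports, building bit vectors for one partition at a time."""
    total = 0
    for part in partitions(transactions, size):
        vector = (1 << len(part)) - 1            # all transactions of this partition
        for item in itemset:
            item_bits = 0
            for i, transaction in enumerate(part):
                if item in transaction:
                    item_bits |= 1 << i          # BV-ESA restricted to the partition
            vector &= item_bits
        total += bin(vector).count("1")
    return total


# Hypothetical toy data, processed two transactions at a time.
data = [{"a", "b"}, {"a", "c"}, {"a", "b", "c"}, {"b"}, {"a", "b"}]
count = support_by_partition(data, {"a", "b"}, size=2)
```

The result does not depend on the partition size, so the size can be chosen purely to fit the available memory.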
2.4 The process of SA-IFIM algorithm
In this subsection, the SA-IFIM algorithm is summarized as Algorithm 1. When the distorted data sets $D^{-*}$, $\Delta^{-*}$ and $\Delta^{+*}$ are first scanned, they are transformed partition by partition into the corresponding vertical bit vector representations $BV(D^{-*})$, $BV(\Delta^{-*})$ and $BV(\Delta^{+*})$, which are saved to hard disk. From these representations, the frequent k-itemsets $Fp_k$ are obtained level by level. Following the candidate set generation-and-test approach, the candidate frequent k-itemsets ($C_k$) are generated from the frequent (k-1)-itemsets ($Fp_{k-1}$).
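The generation of candidates from the previous level can be sketched in the style of the Apriori candidate generation of [8]: join pairs of frequent (k-1)-itemsets and prune any candidate that has a non-frequent (k-1)-subset. The function name is hypothetical, and the paper does not spell out its exact join procedure.

```python
from itertools import combinations


def generate_candidates(fp_prev, k):
    """Generate candidate k-itemsets from frequent (k-1)-itemsets.

    fp_prev -- set of frozensets, each of size k - 1
    """
    candidates = set()
    for a in fp_prev:
        for b in fp_prev:
            union = a | b
            if len(union) != k:
                continue                         # a and b do not differ in exactly one item
            # Downward closure: every (k-1)-subset must itself be frequent.
            if all(frozenset(sub) in fp_prev for sub in combinations(union, k - 1)):
                candidates.add(union)
    return candidates


# Hypothetical frequent 2-itemsets.
fp2 = {frozenset("ab"), frozenset("ac"), frozenset("bc"), frozenset("bd")}
c3 = generate_candidates(fp2, 3)
```

In this example only {a, b, c} survives: {a, b, d} and {b, c, d} are pruned because {a, d} and {c, d} are not frequent.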
Algorithm 1: Algorithm SA-IFIM
Input: $D^{-*}$, $\Delta^{+*}$, $\Delta^{-*}$, $Fp$ (frequent itemsets and their support counts in $D$), $Fp^*$ (the frequent itemsets of $Fp$ and their corresponding support counts in $D^*$), minimum support $s$, and distortion parameters $p$, $q$ as in EMASK [3].
Output: $Fp'$ (frequent itemsets and their support counts in $D'$)
Method: As shown in Fig. 1. In the algorithm, we use some temporary files to store the support counts in the distorted database for efficiency.
Fig. 1. SA-IFIM algorithm diagram.
3 Performance Evaluation
This section reports comprehensive experiments comparing SA-IFIM with EMASK, using the implementation provided by its authors [9]. For a better performance evaluation, we also implemented the algorithm IFIM (similar to IPPFIM [7]). All programs were coded in C++ using Cygwin with gcc 2.9.5. The experiments were run on a 3 GHz Pentium 4 processor with 1 GB of memory. SA-IFIM and IFIM yield the same itemsets as EMASK for the same data set and the same minimum support.
Our experiments were performed on synthetic data sets produced by the IBM synthetic market-basket data generator [8]. In the following, D denotes the number of transactions, T the average size of the transactions, I the average size of the maximal potentially large itemsets, and N the number of items; we set N = 1000. Our method does not require $|\Delta^+|$ and $|\Delta^-|$ to be equal, but for simplicity and without loss of generality we let $|d| = |\Delta^+| = |\Delta^-|$. For clarity, TxIyDmdn denotes an original database together with an update database that share the parameters T = x and I = y and differ only in the number of transactions: $|D| = m$ in the original database and $|d| = n$ in the update database.
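The TxIyDmdn naming scheme can be read mechanically. The sketch below parses concrete names such as T25I4D100Kd10K; the function name is hypothetical, and it assumes that the suffix K stands for a factor of 1000. Parameterized names such as T25I4D100Kdm are deliberately rejected.

```python
import re

# T<x>I<y>D<m>[K]d<n>[K], for example "T25I4D100Kd10K"
_NAME_PATTERN = re.compile(r"T(\d+)I(\d+)D(\d+)(K?)d(\d+)(K?)")


def parse_dataset_name(name):
    """Split a concrete data set name into its four parameters."""
    match = _NAME_PATTERN.fullmatch(name)
    if match is None:
        raise ValueError(f"not a concrete TxIyDmdn name: {name!r}")
    t, i, d, d_suffix, u, u_suffix = match.groups()

    def scale(suffix):
        return 1000 if suffix == "K" else 1      # assumption: K means thousand

    return {
        "T": int(t),                             # average transaction size
        "I": int(i),                             # average size of maximal potentially large itemsets
        "D": int(d) * scale(d_suffix),           # transactions in the original database
        "d": int(u) * scale(u_suffix),           # transactions in the update database
    }


sparse = parse_dataset_name("T25I4D100Kd10K")
```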
In the following, we use the distorted benchmark data sets as the input databases to the algorithms. The distortion parameters are the same as in EMASK [3], with p = 0.5 and q = 0.97. In the experiments, for a fair comparison of the algorithms and to reflect scalability requirements, SA-IFIM is run with only 5K transactions loaded into main memory at a time.
3.1 Different support analysis
Fig. 2 compares the relative performance of SA-IFIM, IFIM and EMASK on two data sets, T25I4D100Kd10K (sparse) and T40I10D100Kd10K (dense), for various minimum supports. As shown in Fig. 2, SA-IFIM yields a prominent performance improvement. Specifically, on the sparse data set (T25I4D100Kd10K), IFIM is close to EMASK, and SA-IFIM is orders of magnitude faster than both; on the dense data set (T40I10D100Kd10K), IFIM is faster than EMASK, but SA-IFIM also outperforms IFIM, and the margin grows as the minimum support decreases.
3.2 Effect of the update size
Experiments were run on two data sets, T25I4D100Kdm and T40I10D100Kdm, and the results are shown in Fig. 3. As expected, when the same number of transactions are deleted and added, the time for rerunning EMASK remains constant, whereas that of IFIM increases sharply and quickly surpasses EMASK. In Fig. 3, the execution time of SA-IFIM is much less than that of EMASK, and SA-IFIM still significantly outperforms EMASK even when the update size is very large.
Fig. 2. Extensive analysis for different support: (a) T25I4D100Kd10K; (b) T40I10D100Kd10K.
Fig. 3. Different updating tuples analysis: (a) T25I4D100Kdm (s = 0.6%); (b) T40I10D100Kdm (s = 1.25%).
3.3 Scale up performance
Finally, to assess the scalability of SA-IFIM, two experiments, T25I4Dmd(m/10) at s = 0.6% and T40I10Dmd(m/10) at s = 1.25%, were conducted to examine the scale-up performance as the size of the mined data set grows. The scale-up results for the two data sets are shown in Fig. 4, which illustrates the impact of $|D|$ and $|d|$ on SA-IFIM and EMASK. In the experiments, the size of the update database is 10% of the original database, and the size m of the transaction database was increased from 100K to 1000K. As shown in Fig. 4, EMASK is very sensitive to the number of updated tuples whereas SA-IFIM is not, and the execution time of SA-IFIM increases linearly with the database size. This shows that the algorithm scales well and can be applied to very large databases.
Fig. 4. Scale up performance analysis: (a) T25I4Dmd(m/10) (s = 0.6%); (b) T40I10Dmd(m/10) (s = 1.25%).
4 Conclusions
In this paper, we explore the issue of frequent itemset mining in a dynamically updated distorted database environment. We first develop an efficient incremental updating computation method that quickly reconstructs an itemset's support. Through the introduction of the supporting aggregate represented with a bit vector, the databases are transformed into representations that are more accessible to and more easily processed by a computer, so that support counts can be computed efficiently. The experiments show that SA-IFIM significantly outperforms rerunning EMASK on the whole updated database, and that it also has an advantage over incremental algorithms based only on EMASK.
References
1. Agrawal, R., and Srikant, R.: Privacy-preserving data mining. In: Proceedings of SIGMOD. (2000) 439-450
2. Rizvi, S., and Haritsa, J.: Maintaining data privacy in association rule mining. In: Proceedings of VLDB. (2002) 682-693
3. Agrawal, S., Krishnan, V., and Haritsa, J.: On addressing efficiency concerns in privacy-preserving mining. In: Proceedings of DASFAA. (2004) 113-124
4. Xu, C., Wang, J., Dan, H., and Pan, Y.: An improved EMASK algorithm for privacy-preserving frequent pattern mining. In: Proceedings of CIS. (2005) 752-757
5. Cheung, D., Han, J., Ng, V., and Wong, C.: Maintenance of discovered association rules in large databases: An incremental updating technique. In: Proceedings of ICDE. (1996) 104-114
6. Cheung, D., Lee, S., and Kao, B.: A general incremental technique for updating discovered association rules. In: Proceedings of DASFAA. (1997) 106-114
7. Wang, J., Xu, C., and Pan, Y.: An Incremental Algorithm for Mining Privacy-Preserving Frequent Itemsets. In: Proceedings of ICMLC. (2006)
8. Agrawal, R., and Srikant, R.: Fast algorithms for mining association rules. In: Proceedings of VLDB. (1994) 487-499
9. http://dsl.serc.iisc.ernet.in/projects/software/software.html.