An Incremental Algorithm for Mining Privacy-Preserving Frequent Itemsets
Jinlong Wang, Congfu Xu∗, Yunhe Pan
Institute of Artificial Intelligence, Zhejiang University
Hangzhou, 310027, China
[email protected]
[email protected]
[email protected]
Abstract
Privacy preserving data mining is a novel research direction in data mining and statistical databases, where data mining algorithms are analyzed for the side-effects they incur in data privacy. There have been many studies on efficient discovery of frequent itemsets in privacy preserving data mining. However, it is nontrivial to maintain such discovered frequent itemsets, because a database may be updated and such updates may turn frequent itemsets into infrequent ones. In this paper, an incremental updating algorithm, IPPFIM, is proposed for efficient maintenance of discovered frequent itemsets when new transaction data are added to a transaction database in a privacy preserving setting. The algorithm makes use of previous mining results to cut down the cost of finding new frequent itemsets in an updated database. The performance evaluation shows the efficiency of this method.
Keywords: Data mining, privacy-preserving, incremental
1 Introduction
With the development of computer hardware and software and the rapid computerization of business, the amount of data available for analysis has grown exponentially in many areas such as the retail industry, financial forecasting, decision support and intrusion detection. When the scale of data manipulation, exploration and inferencing went beyond human capacities, a new technique named data mining emerged. The term data mining refers to the nontrivial extraction of valid, implicit, potentially useful and ultimately understandable information in large databases with the help of the ubiquitous modern computing devices [1, 2]. During the past decade, many successful applications of data mining have been reported from varied sectors such as marketing, finance, banking, manufacturing and telecommunication. As a valuable technique, data mining is flourishing; meanwhile, serious concerns have arisen over individual privacy in data collection, processing and mining [3], and as a result preserving privacy has become a prime concern in the field of data mining. The conventional wisdom held that data mining, with its promise to efficiently discover valuable, non-obvious information from large databases, is particularly vulnerable to misuse [4]. [5] predicted the making of a conflict between data mining and privacy. The reality is that data mining results rarely violate privacy. The objective of data mining is to generalize across populations, rather than reveal information about individuals. The hitch is that data mining works by evaluating individual data that is subject to privacy concerns. So, the true problem is not data mining, but the way data mining is done. [6] and [7] proposed the concept of privacy preserving data mining (PPDM), aimed at alleviating the conflict between data mining and privacy. As a novel research direction in data mining and statistical databases [8], PPDM has begun to receive attention and is being investigated by many researchers [6, 7, 9, 10, 11, 12, 13, 14, 15].
∗ Corresponding author: Congfu Xu
PPDM is defined as “getting valid data mining results without learning the underlying data values” [9]. PPDM encompasses the dual goal of meeting privacy requirements and providing valid data mining results [10]. As the first introduction of PPDM, the idea of [6] is to perturb individual data values. Through the perturbation, the original data is hidden and only the randomized values are revealed, but the statistical characteristics are kept, so that accurate models can be developed without access to the precise information in individual data records. An alternative approach [7] in PPDM is to build a data mining model from the local data sets of various participating sites without revealing the individual records of one site to the other participating sites. This method is usually based on SMC (Secure Multiparty Computation) [16].
Although the security definition of the randomization model is much weaker than the one in the SMC model, the randomization model aims to protect the (exact) actual data value, and it achieves higher efficiency than the SMC model, whose performance becomes undesirable when the number of participants is large. Because of this, the randomization method is currently widely applied to privacy preserving data mining [12, 13, 14, 15].
Randomization methods address the issue of privacy preservation by perturbing the data and reconstructing the distributions at an aggregate level in order to perform mining. The papers [12, 13, 15] all consider randomization techniques in privacy preserving frequent itemset mining. These algorithms try to extract the itemsets without directly accessing the original data and attempt to guarantee that the mining process does not obtain sufficient information to reconstruct the original data. However, in reality, data changes from time to time. The mined itemsets can reveal development trends: as the database grows dynamically, some new frequent itemsets can appear and some old frequent itemsets can disappear, so incremental mining is of paramount importance. When the database changes, mining the updated database again from scratch will not meet the requirements of real-time response, so an efficient updating algorithm must be devised to update, maintain and manage the mined knowledge. In this paper, an efficient updating technique is applied to privacy preserving frequent itemset mining, and an incremental algorithm called IPPFIM (Incremental Privacy-Preserving Frequent Itemset Mining) is proposed to improve efficiency when the database is incrementally updated.
The remainder of this paper is organized as follows. The
related concepts are described in Section 2. In Section 3,
an efficient incremental privacy preserving frequent itemset
mining algorithm IPPFIM is presented. The performance
study of IPPFIM is reported in Section 4, which shows the
efficiency of this method. Finally, Section 5 concludes this
paper.
2 Preliminaries
2.1 Frequent Itemset Mining
As a key stage in many data mining applications, including the discovery of association rules, strong rules, correlations, sequential rules, episodes, multidimensional itemsets, and many other important discovery tasks, the frequent itemset mining problem has received a great deal of attention since its introduction in 1993 by Agrawal et al. [17].
Let I = {i_1, i_2, ..., i_m} be a set of distinct literals, usually called items. Let D, the task-relevant data, be a set of database transactions where each transaction has a unique identifier, called TID, and contains a set of items. For a set of items A ⊆ I, a transaction T is said to contain A if A ⊆ T. A set of items is referred to as an itemset. An itemset that contains k items is a k-itemset.
Definition 1 Suppose A ⊆ I is an itemset. The set of frequent itemsets in a given database D with respect to a frequency threshold min_sup is F(D, min_sup) = {A ⊆ I | sup(A, D) ≥ min_sup}, where sup(A, D), the support of an itemset A, is the relative frequency of A in the transaction database D, and min_sup is the minimum support defined by the user, min_sup ∈ (0, 1).
The problem of mining frequent itemsets is to find all itemsets whose support is no less than s = min_sup, a user-specified minimum support threshold. It usually makes use of the downward closure of frequent itemsets: all subsets of a frequent itemset are frequent, and all supersets of an infrequent itemset are infrequent. These two properties are usually applied to prune elements of the itemset lattice.
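The definitions above can be illustrated with a minimal Python sketch. It is not taken from the paper: the function names support and frequent_itemsets are hypothetical, and the level-wise search is a plain Apriori-style loop in the spirit of [17], written only to show how the downward closure property prunes candidates before their support is counted.

```python
from itertools import combinations


def support(itemset, transactions):
    """Relative frequency of `itemset` among `transactions` (sup(A, D))."""
    itemset = frozenset(itemset)
    return sum(1 for t in transactions if itemset <= t) / len(transactions)


def frequent_itemsets(transactions, min_sup):
    """Level-wise search for all itemsets with support >= min_sup."""
    items = sorted({i for t in transactions for i in t})
    level = [frozenset([i]) for i in items
             if support([i], transactions) >= min_sup]
    result = {}
    k = 1
    while level:
        for a in level:
            result[a] = support(a, transactions)
        prev = set(level)
        # Join frequent k-itemsets into (k+1)-item candidates.
        candidates = {a | b for a in level for b in level if len(a | b) == k + 1}
        # Downward closure: drop a candidate if any k-subset is infrequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in prev for s in combinations(c, k))}
        level = [c for c in candidates if support(c, transactions) >= min_sup]
        k += 1
    return result


transactions = [frozenset("ABC"), frozenset("AB"), frozenset("AC"), frozenset("B")]
found = frequent_itemsets(transactions, 0.5)
```

In this toy database {B, C} has support 0.25, so the candidate {A, B, C} is discarded by the pruning step without being counted.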
2.2 Privacy Preserving Frequent Itemset Mining
In [13], Rizvi and Haritsa presented a scheme called MASK (Mining Associations with Secrecy Konstraints), based on a simple probabilistic distortion of user data that employs random numbers generated from a pre-defined distribution function. To address the runtime efficiency issue in MASK, [15] made changes to both the distortion process and the mining process of MASK, presenting a new algorithm, EMASK (Efficient MASK). EMASK achieves its runtime efficiency by generalizing the distortion process to perform symbol-specific distortion, by choosing the distortion parameters appropriately, and by applying a variety of set-theoretic optimizations in the reconstruction process.
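A minimal Python sketch of symbol-specific distortion follows. It is an illustration under stated assumptions, not the code of [13] or [15]: a transaction is assumed to be a 0/1 vector over the items, a 1 is assumed to be kept with probability p and a 0 with probability q (otherwise the bit is flipped), and the function name distort_transaction is hypothetical.

```python
import random


def distort_transaction(bits, p, q, rng):
    """Return a distorted copy of a 0/1 transaction vector.

    Assumed model: a 1 is kept with probability p and a 0 is kept with
    probability q; a bit that is not kept is flipped.
    """
    out = []
    for b in bits:
        keep = p if b == 1 else q
        out.append(b if rng.random() < keep else 1 - b)
    return out


rng = random.Random(0)
original = [1, 0, 1, 1, 0]
distorted = distort_transaction(original, 0.5, 0.97, rng)
```

With p = q the sketch reduces to a single flipping probability for every bit; choosing different values for 1s and 0s is what "symbol-specific" refers to here.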
By using randomization to distort the original database, frequent itemsets can be mined in a privacy preserving manner. The definition is as follows.
Definition 2 Let D be a transactional database and D∗ be a distorted database produced from D in order to preserve individual privacy in D. It is this distorted database D∗ that is eventually supplied to the data miner, along with a description of the distortion procedure. The data miner mines the distorted database D∗ to estimate, by virtue of the distortion procedure, the frequent itemsets whose support count satisfies the minimum support in the original database D.
Figure 1 illustrates the process.
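For a single item, the estimation step in Definition 2 can be sketched as follows. This is an illustrative derivation under an assumed distortion model (a 1 is kept with probability p, a 0 is kept with probability q), not the reconstruction code of [13] or [15]; the function names are hypothetical. Under that model the expected number of 1s in D∗ is p·c + (1 − q)·(n − c), where c is the true count and n the number of transactions, and solving for c gives the estimator below.

```python
def expected_distorted_count(true_count, n, p, q):
    """Expected number of 1s for an item in the distorted database."""
    return p * true_count + (1 - q) * (n - true_count)


def reconstruct_count(distorted_count, n, p, q):
    """Estimate the true count of 1s from the count seen in D*."""
    if abs(p + q - 1) < 1e-12:
        # With p + q = 1 the distorted bits carry no information.
        raise ValueError("p + q must differ from 1")
    return (distorted_count - (1 - q) * n) / (p + q - 1)


# An item present in 300 of 1000 transactions, distorted with p=0.5, q=0.97.
observed = expected_distorted_count(300, 1000, 0.5, 0.97)
estimate = reconstruct_count(observed, 1000, 0.5, 0.97)
```

On real distorted data the observed count fluctuates around its expectation, so the result is an estimate of the true support rather than its exact value.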
Figure 1. Privacy preserving frequent itemset mining process.
Table 1. Incremental Mining Relationship Table
X ∈ Fp_d    X ∈ Fp    X ∈ Fp′
Yes         Yes       Yes
Yes         No        Visit D to compute
No          Yes       Visit d to compute
No          No        No
3 Efficient Incremental Privacy Preserving Frequent Itemset Mining
In reality, data changes from time to time in many areas, including the retail industry and the finance sector. The itemsets mined in these applications can reveal development trends. When the transaction database grows dynamically over time, some new frequent itemsets can appear and some old frequent itemsets can disappear, which makes incremental mining a very important mining method. When the database grows, mining the updated database (the original database together with the update database) again from scratch will not meet the requirements of real-time response. In these dynamic databases, the knowledge already acquired can facilitate successive discovery processes, so an efficient updating algorithm must be devised to update, maintain and manage such knowledge.
Definition 3 Let D be the original database, s the minimum support threshold, and d the incremental database in which new transactions or new customers are added to D. D∗ and d∗ denote the databases after distortion. Fp denotes the frequent itemsets in the original database D and Fp_k the frequent k-itemsets in D; Fp′ is the set of frequent itemsets in D ∪ d and Fp′_k the frequent k-itemsets in D ∪ d.
Property 1 Suppose c ∈ Fp_k. Then c ∉ Fp′_k ⇔ sup(c, D ∪ d) < s.
Property 2 Suppose c ∉ Fp_k. Then c ∈ Fp′_k ⇔ sup(c, D ∪ d) ≥ s.
In Table 1, we summarize the relationship of the frequent itemsets in an incremental update environment (let Fp_d denote the frequent itemsets in the incremental database d). From Table 1, we can identify the two key problems in incremental mining:
1. For the frequent itemsets Fp in D, find those that are no longer frequent and those that remain frequent.
2. For the infrequent itemsets in D, compute the support of the frequent itemsets in Fp_d by visiting D∗.
Property 3 For c ∈ Fp, when c ∈ Fp_d, c ∈ Fp′.
Property 4 For c ∈ Fp, when c ∉ Fp_d, c ∈ Fp′ ⇔ sup(c, D ∪ d) ≥ s.
Property 5 For c ∉ Fp, when c ∉ Fp_d, c ∉ Fp′.
Property 6 For c ∉ Fp, when c ∈ Fp_d, c ∈ Fp′ ⇔ sup(c, D ∪ d) ≥ s.
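The case analysis in Properties 3 to 6 can be sketched in Python as below. This is a simplified illustration, not the IPPFIM algorithm itself: it works on plain support counts and ignores distortion and reconstruction, and the names update_frequent, count_in_D and count_in_d are hypothetical. The two callbacks stand for the only places where a database has to be visited.

```python
def update_frequent(fp, fp_d, count_in_D, count_in_d, size_D, size_d, s):
    """Return the frequent itemsets of D ∪ d with their support counts.

    fp and fp_d map each frequent itemset of D and of d to its count;
    count_in_D and count_in_d look up the count of an itemset that is not
    frequent in the respective database.
    """
    total = size_D + size_d
    fp_new = {}
    # Property 5: itemsets frequent in neither D nor d are never examined.
    for c in set(fp) | set(fp_d):
        if c in fp and c in fp_d:
            count = fp[c] + fp_d[c]          # Property 3: both counts known
        elif c in fp:
            count = fp[c] + count_in_d(c)    # Property 4: visit d
        else:
            count = count_in_D(c) + fp_d[c]  # Property 6: visit D
        if count / total >= s:
            fp_new[c] = count
    return fp_new


A, B, C = frozenset("A"), frozenset("B"), frozenset("C")
fp = {A: 50, B: 11}      # frequent in D, |D| = 100, s = 0.1
fp_d = {A: 10, C: 5}     # frequent in d, |d| = 20
fp_new = update_frequent(fp, fp_d,
                         count_in_D=lambda c: {C: 8}[c],
                         count_in_d=lambda c: {B: 0}[c],
                         size_D=100, size_d=20, s=0.1)
```

In the example, B drops out of the frequent set (11 of 120 transactions), C enters it (13 of 120), and A stays frequent without any database visit.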
In this paper, an efficient incremental privacy preserving frequent itemset mining algorithm is proposed to address the updating problem in PPDM when new transactions are appended. In the algorithm, we make use of the distortion technique and some of the optimizations described in [15].
The reconstruction procedure is costly: a k-itemset may be distorted into any of 2^k combinations, so accurately reconstructing the support of the k-itemset requires the counts of all 2^k combinations in the distorted database. Using the basic formula from set theory, the supports of all combinations of the itemset can be computed efficiently [15]. During mining, the support of a frequent itemset can be read from the file Fp, but for an infrequent itemset that is not in Fp a new scan is needed to recompute the support, which reduces efficiency. To overcome this, we propose a minor modification: when mining, the support of each frequent itemset in the distorted database is recorded along with its support in the original database, which makes incremental mining more convenient and more efficient. For example, suppose itemsets {A} and {B} are frequent but {AB} is infrequent. If incremental mining needs the support of {AB} in the original database D, and the support counts of {A} and {B} in the distorted database D∗ have been recorded, the support of {AB} can be computed quickly through the basic formula from set theory [15] without scanning the database D∗. The efficient incremental privacy preserving frequent itemset mining algorithm IPPFIM is shown in Algorithm 1, where the input Fp_D∗ denotes the frequent itemsets Fp together with their supports in the distorted database D∗.
Algorithm 1: An efficient incremental privacy preserving frequent itemset mining algorithm IPPFIM.
The objective of this algorithm is to perform privacy preserving frequent itemset mining in an incremental updating environment.
Input: D∗, d∗, Fp, Fp_D∗, minimum support s, distortion parameters p, q.
Output: Fp′ (the frequent itemsets and their supports in D ∪ d)
Method: As Fig. 2
Figure 2. An efficient incremental privacy preserving frequent itemset mining.
When computing 1-itemsets, itemset X is obtained by scanning d∗. When computing k-itemsets (k > 1), X is drawn from C_{k+1}, which is generated from Fp′_k. In incremental mining, we use the aforesaid Properties 3-6.
1. When sup(X, d) ≥ s, itemset X is frequent in the incremental database d.
(a) Steps 1 and 2 make use of Property 3 and put the frequent itemset X into Fp′.
(b) Steps 6-10 make use of Property 6: the support of X in the original database is computed and, if the condition is satisfied, X is put into Fp′.
2. When sup(X, d) < s, itemset X is infrequent in the incremental database d.
(a) By Property 5, X is an infrequent itemset in the whole database if X ∉ Fp.
(b) Steps 3-5 make use of Property 4: when sup(X, D ∪ d) ≥ s, itemset X is put into Fp′.
4 Experimental Results
In order to assess the performance of IPPFIM, experiments are conducted to compare its performance with that of EMASK [15], an efficient privacy preserving frequent itemset mining algorithm. Both algorithms are implemented using Cygwin with gcc 2.9.5. Our target platform is a Pentium4 1.6GHz processor with 384MB memory, using a Maxtor IDE disk (7200 rpm, 80GB). The operating system is Windows XP with Service Pack 2.
Benchmark data sets. The performance tests are performed on a synthetic database benchmark, publicly available from the IBM synthetic market-basket data generator [18]. The data sets produced by the IBM data generator mimic the transactions in a retailing environment. The meanings of the data set parameters are shown in Table 2. In the following experiments, we use the data set T25.I4.D1M.N1K as the original database.
Table 2. Parameters meanings
D    Number of transactions
T    Average size of the transactions
I    Average size of the maximal potentially large itemsets
N    Number of items
Performance Testing. We compare the performance of IPPFIM against EMASK for different data sets, as shown in Fig. 3 and Fig. 4. In the experiments, every database is distorted with parameters p=0.5, q=0.97, as in EMASK [15].
The first experiment is done on the original database T25.I4.D1M.N1K with an update database T25.I4.D25K.N1K. The execution time and performance ratio (execution time ratio) for various minimum support thresholds are evaluated in Fig. 3. Fig. 3(a) shows that the execution time of IPPFIM is much less than that of EMASK. When the minimum support is decreased, the execution time of both algorithms increases, but at different rates. As Fig. 3(b) shows, for the same data sets, the smaller the minimum support is, the larger the execution time ratio (EMASK/IPPFIM) is. (The fluctuation in Fig. 3(b) is related to the synthetic database; the same holds for Fig. 4(b).) For small support, IPPFIM is 10 to 20 times faster than EMASK. For larger support, it is less costly to re-run the mining algorithm on the updated database, since the number of large itemsets is relatively smaller. For example, when the support increases from 0.75% to 1.0%, the execution time ratio decreases from 10.24 to 3.
In general, the larger the increment is, the longer it takes to do the update operations, and the gain in speed-up diminishes. Fig. 4 shows the execution time and ratio of the two methods on the data set T25.I4.D1M.N1K with updates of 5K, 10K, 25K, 50K, 100K and 250K for the same minimum support. As Fig. 4 shows, for the same support IPPFIM improves the efficiency. When the update size increases, the speed-up ratio decreases. For example, when the incremental database increases from 50K to 250K, the execution time ratio decreases from 15.98 to 4.74. However, because IPPFIM adopts the incremental updating technique, it only needs to compute a fraction of the newly generated candidate itemsets, which saves the time of scanning the original database compared with re-running EMASK on the updated database; consequently, the performance of IPPFIM will not fall below that of EMASK.
5 Conclusion
In this paper, we have presented an efficient incremental update algorithm, IPPFIM, for the maintenance of frequent itemsets discovered by privacy preserving data mining when a set of new transactions is added to the transaction database. Our algorithm strives to reduce the I/O requirements for updating the frequent itemsets. IPPFIM uses the information available from a previous mining run to reduce the amount of work needed to remove itemsets that are no longer frequent in the updated database, and to add new itemsets that were not frequent in the old transactions but are frequent in the updated database. The experiments on data sets show that IPPFIM achieves better performance than re-running EMASK (an efficient privacy preserving frequent itemset mining algorithm) over the whole set of transactions. Extending IPPFIM with strategies for mining evolving data streams under privacy preservation is the subject of our current study.
Acknowledgements
This paper was supported by the Natural Science Foundation of China (No. 60402010), the Zhejiang Provincial Natural Science Foundation of China (Y105250) and the Science-Technology Program of Zhejiang Province of China (No. 2004C31098).
Figure 3. Performance Comparison with Different Minimum Support. (a) Execution Time; (b) Execution Time Ratio.
Figure 4. Performance Comparison with Different Incremental Database. (a) Execution Time; (b) Execution Time Ratio.
References
[1] U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy. Advances in Knowledge Discovery and Data Mining. AAAI/MIT Press, 1996.
[2] J. Han, M. Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann Publishers, August 2000.
[3] The Economist. The end of Privacy. May 1st, 1999. pp. 15.
[4] C. Clifton and D. Marks. Security and privacy implications of data mining. ACM SIGMOD Workshop on Research Issues on Data Mining and Knowledge Discovery, 1996. pp. 15-19.
[5] K. Thearling. Data mining and privacy: a conflict in making. DS, November 1998.
[6] R. Agrawal and R. Srikant. Privacy-preserving data mining. SIGMOD 2000. pp. 439-450.
[7] Y. Lindell and B. Pinkas. Privacy preserving data mining. Crypto 2000. pp. 36-54.
[8] N. R. Adam and J. C. Wortmann. Security Control Methods for Statistical Databases: A Comparison Study. ACM Comput. Surv. 21(4), 1989. pp. 515-556.
[9] C. Clifton, M. Kantarcioglu, and J. Vaidya. Defining Privacy For Data Mining. Proc. of the National Science Foundation Workshop on Next Generation Data Mining, 2002. pp. 126-133.
[10] S. R. M. Oliveira and Osmar R. Zaiane. Toward Standardization in Privacy-Preserving Data Mining. DMSSP 2004 (In conjunction with SIGKDD 2004).
[11] D. Agrawal and C. Aggarwal. On the Design and Quantification of Privacy Preserving Data Mining Algorithms. PODS 2001. pp. 247-255.
[12] A. Evfimievski, R. Srikant, R. Agrawal and J. Gehrke. Privacy preserving mining of association rules. SIGKDD 2002. pp. 217-228.
[13] S. Rizvi and J. Haritsa. Maintaining data privacy in association rule mining. VLDB 2002. pp. 682-693.
[14] W. Du and Z. Zhan. Using Randomized Response Techniques for Privacy-Preserving Data Mining. SIGKDD 2003. pp. 505-510.
[15] S. Agrawal and J. Haritsa. On addressing efficiency concerns in privacy-preserving mining. DASFAA 2004. pp. 113-124.
[16] A. C. Yao. Protocols for secure computations (Extended Abstract). FOCS 1982. pp. 160-164.
[17] R. Agrawal, T. Imielinski and A. N. Swami. Mining association rules between sets of items in large databases. SIGMOD 1993. pp. 207-216.
[18] R. Agrawal and R. Srikant. Fast Algorithms for Mining Association Rules. VLDB 1994. pp. 487-499.