Reconstruction-Based Association Rule Hiding
Devineni Venkata Ramana & Dr. Hima Sekhar MIC College of Technology
Kanchikacherla
Presented by

M. Vamsi Krishna
CSE Final Year
MIC College of Technology, Kanchikacherla
Email: [email protected]
Phone: 9298408594

S. Pradeep
CSE Final Year
MIC College of Technology, Kanchikacherla
Email: [email protected]
Phone: 9298655770
ABSTRACT
As large repositories of data contain confidential rules that must be protected before being published, association rule hiding has become one of the important privacy preserving data mining problems. Compared with traditional data modification methods, data reconstruction is a promising but not yet sufficiently investigated method, inspired by the inverse frequent set mining problem. This research focuses on further investigating reconstruction-based techniques for association rule hiding. An FP-tree-based method has been proposed for inverse frequent set mining, which can be used in the proposed reconstruction-based framework.
1. MOTIVATION
Data mining extracts novel and useful knowledge from large repositories of data and has become an effective means of analysis and decision making in corporations. Sharing data for data mining can bring many advantages for research and business collaboration; however, large repositories of data contain private data and sensitive rules that must be protected before being published. Motivated by the conflicting requirements of data sharing, privacy preservation and knowledge discovery, privacy preserving data mining (PPDM) has become a research hotspot in the data mining and database security fields.
Two problems are addressed in PPDM: one is the protection of private data; the other is the protection of sensitive rules (knowledge) contained in the data. The former concerns how to obtain normal mining results when private data cannot be accessed accurately; the latter concerns how to protect sensitive rules contained in the data from being discovered, while non-sensitive rules can still be mined normally. The latter problem is called knowledge hiding in database (KHD), which is the opposite of knowledge discovery in database (KDD). Concretely, the KHD problem can be described as follows:
Given a data set D to be released, a set of rules R mined from D, and a set of sensitive rules Rh ⊆ R to be hidden, how can we get a new data set D', such that the rules in Rh cannot be mined from D', while as many as possible of the rules in R−Rh can still be mined.
Typically, when D is a transaction database and R is the set of association rules mined from D with minimum support threshold MST and minimum confidence threshold MCT, the KHD problem becomes the association rule hiding problem. Clifton provided a well-designed scenario which clearly shows the importance of the association rule hiding problem: by providing the original unaltered database to an external party, some strategic association rules that are crucial to the data owner are disclosed, with serious adverse effects. The sensitive association rule hiding problem is very common in collaborative association rule mining projects, in which one company may decide to disclose only part of the knowledge contained in its data and hide strategic knowledge represented by sensitive rules. These sensitive rules must be protected before the data is shared. Besides, by hiding some association rules, data owners can prevent rule-based malicious inferences used for unwarranted purposes, e.g. uncovering private data.
The first concept proposed to settle the association rule hiding problem was "data sanitization". Its main idea is to select some transactions from the original database to modify (delete or add items) through some heuristics. It was also proved that optimal sanitization is an NP-hard problem. Since then, many approaches have been proposed within the data sanitization framework. Association rule hiding based on the data sanitization framework operates simply. However, data sanitization techniques obviously cannot control the hiding effects on confidential rules: the hiding effects can only be validated after sanitization. In other words, they suffer from the weakness of not providing a way of fine-tuning the generation of the released database. Moreover, data sanitization can produce a lot of I/O operations, which greatly increases the time cost, especially when the original database includes a large number of transactions.
Different from the data sanitization framework, a novel framework has been proposed that can be regarded as a "knowledge sanitization" approach, inspired by the inverse frequent set mining problem. This framework first performs sanitization on an itemset lattice, called a knowledge base, from which association rules can be derived. The itemset lattice is defined as the partially ordered set of all itemsets generated from the given transactions. Then a reconstruction procedure reconstructs a new released dataset from the sanitized itemset lattice. In a word, this approach conceals the sensitive rules by sanitizing the itemset lattice rather than the original dataset.
Compared with the original dataset, the itemset lattice is an intermediate product that is closer to the association rules. In this way, one can easily control the availability of rules that can be mined from the original dataset and control the hiding effects directly. However, as a rudimentary work, the proposed approach is still incomplete and limited in the following two aspects:
1) It does not give concrete guidance on how to sanitize the itemset lattice according to the sensitive association rules.
2) The feasibility of the data reconstruction process is restricted by whether the knowledge sanitization process can produce an itemset lattice with a consistent support value configuration. But, in fact, the proposed knowledge sanitization process cannot guarantee that one can always find a consistent one within polynomial time.
As this is a promising but not yet sufficiently investigated framework, this research focuses on further investigating effective knowledge sanitization and data reconstruction based techniques for association rule hiding. These techniques are referred to as reconstruction-based association rule hiding, in contrast to data modification techniques. In particular, an FP-tree-based method was proposed for inverse frequent set mining, which can be used in the reconstruction-based framework. The aim is to provide an easily controllable and robust association rule hiding mechanism in the privacy preserving data sharing context.
2. RELATED WORK
The problem of association rule hiding was first probed in the Knowledge and Data Exchange Workshop. Since then, many approaches have been proposed. Roughly, they fall into two groups: data sanitization / data modification approaches (data modification for short) and knowledge sanitization / data reconstruction (data reconstruction) approaches.
1) Data Modification Approaches
Data modification methods hide sensitive association rules by directly
modifying original data. Most of the early methods belong to this track.
According to the modification means used, they can be further classified into two subcategories: data-distortion techniques and data-blocking techniques.
Data-distortion is based on data perturbation or data transformation: if we consider the transaction database as a two-dimensional binary matrix, the procedure changes a selected set of 1-values to 0-values (deleting items) or 0-values to 1-values (adding items). Its aim is to reduce the support or confidence of the sensitive rules below the user-predefined security threshold. Early data distortion techniques adopt simple heuristic-based sanitization strategies like Algo1a/Algo1b, Algo2a/Algo2b/Algo2c, Naive/MinFIA/MaxFIA/IGA, RRA/RA, and SWA. Different heuristics determine different selection strategies for which transactions are to be sanitized and which items are to be the victims, the two core issues affecting the hiding effects of these algorithms. Subsequent techniques like WSDA/PDA and border-based methods advanced the simple heuristics to greedy (locally optimal) strategies that try to select the modifications with minimal side effects on data utility. Further, integer-programming techniques based on global optimization were proposed; one formulation minimizes the number of sanitized transactions, while another minimizes the number of sanitized items. From the simple heuristics to the greedy strategies, and then to the integer-programming techniques, the solutions move gradually closer to the optimum at the cost of increasing computational complexity. Compared with the aforementioned techniques, the sanitization-matrix approach is a special distortion technique that constructs the new released database by multiplying the original database by a sanitization matrix.
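A minimal sketch of the distortion idea described above, assuming the database is held as a list of item-to-0/1 dictionaries (the two-dimensional matrix view); the greedy victim-selection heuristics of the named algorithms are replaced here by a naive first-come pass, so this illustrates only the mechanism, not any particular algorithm:

```python
def distort(matrix, sensitive_itemset, victim, mst):
    """Naively delete `victim` (flip 1 -> 0) from supporting transactions
    until support(sensitive_itemset) drops below the threshold `mst`."""
    n = len(matrix)
    for row in matrix:
        supporters = sum(1 for r in matrix
                         if all(r[i] for i in sensitive_itemset))
        if supporters / n < mst:
            break                      # sensitive itemset is now infrequent
        if all(row[i] for i in sensitive_itemset):
            row[victim] = 0            # delete the victim item from this row
    return matrix

# Four transactions over items A, B; hide itemset {A, B} under MST = 0.5.
D = [{"A": 1, "B": 1}, {"A": 1, "B": 1}, {"A": 1, "B": 1}, {"A": 0, "B": 1}]
distort(D, ["A", "B"], "B", 0.5)
print(sum(1 for r in D if r["A"] and r["B"]))  # 1 supporting row remains
```

A real algorithm would rank transactions and victim items to minimize side effects on non-sensitive rules; this pass simply stops as soon as the threshold is crossed.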
Data-blocking is another data modification approach for association rule hiding. Instead of distorting the data (altering part of it to false values), the blocking approach replaces certain data items with a question mark "?". The introduction of this special unknown value brings uncertainty to the data, making the support and confidence of an association rule become two uncertain intervals. At the beginning, the lower bounds of the intervals equal the upper bounds. As the number of "?" marks in the data increases, the lower and upper bounds gradually separate and the uncertainty of the rules grows accordingly. When either the lower bound of a rule's support interval or that of its confidence interval falls below the security threshold, the rule is deemed to be concealed.
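The interval arithmetic behind blocking can be sketched directly. Here transactions are dictionaries mapping items to 1, 0, or the unknown marker "?"; the function name and representation are ours, not from any cited algorithm:

```python
def support_interval(itemset, matrix):
    """Support bounds for `itemset` when some cells are blocked with '?'.
    Lower bound: every item is certainly present (value 1).
    Upper bound: every item is possibly present (value 1 or '?')."""
    n = len(matrix)
    lo = sum(1 for row in matrix if all(row[i] == 1 for i in itemset))
    hi = sum(1 for row in matrix if all(row[i] in (1, "?") for i in itemset))
    return lo / n, hi / n

D = [
    {"A": 1, "B": 1},
    {"A": 1, "B": "?"},   # one blocked cell widens the interval
    {"A": 0, "B": 1},
    {"A": 1, "B": 1},
]
print(support_interval(["A", "B"], D))  # (0.5, 0.75)
```

With no "?" cells the two bounds coincide; each additional blocked cell in a potentially supporting transaction pushes the upper bound away from the lower one, exactly the widening effect described above.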
2) Data Reconstruction Approaches
Data reconstruction methods put the original data aside and start by sanitizing the so-called "knowledge base". The new released data is then reconstructed from the sanitized knowledge base. The first work in this direction gave a coarse Constraint-based Inverse Itemset Lattice Mining procedure (CIILM) for hiding sensitive frequent itemsets. The main differences from the present work are: 1) it aims at hiding frequent itemsets, while this work addresses hiding association rules; 2) its data reconstruction is based on the itemset lattice, while this work's is based on the FP-tree.
Another dimension along which to classify existing algorithms is whether they hide rules or hide large (frequent) itemsets. Part of the existing work above chooses to hide association rules, while the rest chooses to hide large itemsets. Relatively speaking, hiding rules is more complicated than hiding itemsets. For clarity (see Table 1), existing algorithms in the literature are classified into six different parts according to the three different hiding approaches and whether the algorithm hides association rules or large itemsets. From the table, we can see that the field of data reconstruction for hiding rules is blank. In addition, the related inverse frequent set mining problem, inferring the original data from given frequent itemsets, is an emerging topic in privacy preserving data sharing. Mielikainen first proposed this problem and showed that finding a dataset compatible with a given collection of frequent itemsets is NP-complete. Since then, several methods have been proposed. Settlement of the inverse frequent set mining problem will give strong support to reconstruction-based association rule hiding.
3. PROBLEM FORMULATION
Let I = {I1, I2, ..., Im} be a set of items. Any X ⊆ I is called an itemset; an itemset consisting of k elements is called a k-itemset. Let D = {T1, T2, ..., Tn} be a set of transactions, where each transaction Ti (i ∈ [1..n]) is an itemset. The support count of an itemset X ⊆ I in a transaction database D, denoted supp_count(X) or |X|, is the number of transactions containing X. The support of an itemset X ⊆ I in D, denoted support(X), is defined as the percentage of transactions in D containing X: support(X) = |X|/|D| (where |D| is the number of transactions in D). X is a frequent itemset if X's support is no less than a predefined minimum support threshold MST.
An association rule is an implication of the form X ⇒ Y, where X ⊂ I, Y ⊂ I and X ∩ Y = ∅. We say the rule X ⇒ Y holds in the database D with confidence c if |X ∪ Y|/|X| ≥ c, and that it has support s if |X ∪ Y|/|D| ≥ s. Note that while the support is a measure of the frequency of a rule, the confidence is a measure of the strength of the relation between the sets of items.
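The definitions above translate directly into code; a small self-contained sketch over a toy transaction database (the data and names are illustrative only):

```python
def support_count(itemset, transactions):
    """|X|: number of transactions containing every item of `itemset`."""
    x = set(itemset)
    return sum(1 for t in transactions if x <= set(t))

def support(itemset, transactions):
    """support(X) = |X| / |D|."""
    return support_count(itemset, transactions) / len(transactions)

def confidence(X, Y, transactions):
    """Confidence of X => Y, i.e. |X ∪ Y| / |X| as support counts."""
    joint = support_count(set(X) | set(Y), transactions)
    return joint / support_count(X, transactions)

D = [{"A", "B", "C"}, {"A", "B"}, {"A", "C"},
     {"B", "C"}, {"A", "B", "C"}, {"B"}]
print(support({"A", "B"}, D))        # 3/6 = 0.5
print(confidence({"B"}, {"A"}, D))   # 3/5 = 0.6
```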
The well-known association rule mining problem aims to find all significant association rules. A rule is significant if its support and confidence are no less than the user-specified minimum support threshold (MST) and minimum confidence threshold (MCT), respectively. To find the significant rules, an association rule mining algorithm first finds all the frequent itemsets and then derives the association rules from them. Conversely, the association rule hiding problem aims to prevent some of these rules, which we refer to as "sensitive rules", from being mined. The association rule hiding problem considered in this paper can be formulated as follows:
Given a transaction database D, a minimum support threshold MST, a minimum confidence threshold MCT, a set of significant association rules R mined from D, and a set of sensitive rules Rh ⊆ R to be hidden, find a new database D' such that exactly the rules in R−Rh can be mined from D' under the same MST and MCT. "Exactly" means that no normal rules in R−Rh are falsely hidden, and no extra artificial rules (also called "ghost" rules, i.e. non-frequent rules that become frequent) are falsely generated during the rule hiding process.
The related inverse frequent set mining problem is defined as follows: given a set of items I = {I1, I2, ..., Im}, a minimum support threshold MST, and the set of all frequent itemsets F = {f1, f2, ..., fn} with the support set S = {support(f1), support(f2), ..., support(fn)} discovered from a real database D, find a database D' satisfying the following constraints:
1) D' is over the same set of items I;
2) from D', we can discover exactly the same set of frequent itemsets F with the same support set S under the same minimum support threshold MST.
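For small instances, constraint 2 can be checked mechanically. The brute-force miner below is only for illustration (it enumerates every candidate itemset, so it is exponential in the number of items); all names are ours:

```python
from itertools import combinations

def frequent_itemsets(transactions, mst):
    """Brute-force frequent itemset mining; fine for tiny examples only."""
    items = sorted({i for t in transactions for i in t})
    n = len(transactions)
    found = {}
    for k in range(1, len(items) + 1):
        for combo in combinations(items, k):
            count = sum(1 for t in transactions if set(combo) <= set(t))
            if count / n >= mst:
                found[frozenset(combo)] = count / n
    return found

def compatible(d_prime, F, S, mst):
    """True iff D' yields exactly the itemsets F with supports S under MST."""
    target = {frozenset(f): s for f, s in zip(F, S)}
    return frequent_itemsets(d_prime, mst) == target

D_prime = [{"A", "B"}, {"A"}, {"A", "B"}, {"B"}]
F, S = [{"A"}, {"B"}, {"A", "B"}], [0.75, 0.75, 0.5]
print(compatible(D_prime, F, S, 0.5))  # True
```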
4. PROPOSED SOLUTION
4.1 Framework
The framework of this approach is depicted in Figure 1. The whole approach is divided into three phases: frequent itemset mining, the sanitization algorithm, and FP-tree-based inverse frequent set mining.
Figure 1: Framework of this approach
The first phase uses a frequent itemset mining algorithm to generate all frequent itemsets with their supports and support counts (FS for short in the figure) from the original database D. The second phase performs the sanitization algorithm over FS, which involves selecting the hiding strategy and identifying the sensitive frequent itemsets according to the sensitive association rules. In the best case, the sanitization algorithm ensures that from the sanitized set of frequent itemsets with supports and support counts (FS' for short in the figure) we can get exactly the set of non-sensitive rules, with no normal rules lost and no ghost rules generated. The third phase generates the released database D' from FS' by using an inverse frequent set mining algorithm. In this framework, we plan to adopt an inverse frequent set mining algorithm based on the FP-tree, which comprises the following two steps:
1) The algorithm tries to "guess" an FP-tree that satisfies all the frequent itemsets and their support counts in FS'. We call such an FP-tree a compatible FP-tree, meaning that from this FP-tree we can mine the same set of frequent itemsets with the same support counts as FS'.
2) Generate a corresponding database D' directly from the compatible FP-tree by outspreading all the paths of the tree.
The idea of the FP-tree-based inverse frequent set mining in the third phase comes from the fact that the FP-tree is a highly compact structure which stores the complete information of a transaction database relevant to frequent itemset mining. Thus we can regard the FP-tree as an intermediate product between a database and its corresponding frequent itemsets. Intuitively, the FP-tree reduces the gap between a database and its frequent itemsets, which makes the transformation from given frequent itemsets to a database smoother, more natural and easier. Roughly, the procedure of FP-tree-based inverse frequent set mining in the third phase can be seen as the reverse of the FP-tree-based frequent itemset mining method.
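Step 2 (outspreading) is the mechanical part. If a compatible FP-tree is represented by its root-to-leaf item paths, each annotated with the number of transactions that end exactly at that path's last node, the database falls out directly. The tree below is a made-up example, not the one from Figure 2:

```python
def outspread(paths):
    """Expand FP-tree paths into transactions: a path whose end node
    accounts for `count` transactions yields `count` copies of its items."""
    database = []
    for items, count in paths:
        database.extend(set(items) for _ in range(count))
    return database

# Hypothetical compatible FP-tree, given as (path, exact end-count) pairs.
tree_paths = [(("A", "B", "C"), 2), (("A", "B"), 1), (("B", "C"), 1)]
for t in outspread(tree_paths):
    print(sorted(t))
```

Because an internal node's support count is the sum of its own end-count and its children's counts, this path representation carries the same information as the annotated tree; guessing the tree (step 1) is the hard part, and the reverse traversal shown here is cheap.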
4.2 Example
To illustrate the proposed approach to the association rule hiding problem and validate its feasibility, let us consider an example (see Figure 2).
In Figure 2, we are given I = {A, B, C, D, E}, an original database D = {T1, T2, T3, T4, T5, T6}, minimum support count threshold σ = 4, minimum support threshold MST = 66%, and minimum confidence threshold MCT = 75%. All frequent itemsets, their support counts and their supports obtained from D are listed in FS (top middle of Figure 2). All significant association rules obtained from the frequent itemsets in FS are shown in table R (top right of Figure 2). Let us suppose B ⇒ A is a sensitive rule that needs hiding.
First, instead of performing sanitization on the original database, we perform the sanitization algorithm on FS by deleting the sensitive frequent itemsets B:4 and AB:4. Here, we choose to hide a sensitive rule by reducing the support of its corresponding large itemsets. Furthermore, we adopt a thorough hiding strategy, meaning that the large itemset the sensitive rule corresponds to is completely hidden, with its support reduced to zero. After the sanitization we get the frequent itemset set FS', from which we can obtain exactly the set of association rules R−Rh (with no normal rules lost and no ghost rules generated).
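A sketch of this sanitization step, assuming FS is held as a dict from frozenset to support count (the counts below are illustrative, not the actual Figure 2 values). Thorough hiding drops every itemset containing the rule's generating itemset X ∪ Y and, as in the example where both B:4 and AB:4 are deleted, the antecedent itemset itself:

```python
def sanitize_fs(fs, X, Y):
    """Thoroughly hide the rule X => Y: remove the antecedent itemset X and
    every frequent itemset containing the generating itemset X ∪ Y."""
    gen = frozenset(X) | frozenset(Y)
    ant = frozenset(X)
    return {s: c for s, c in fs.items() if not (gen <= s or s == ant)}

# Illustrative frequent itemsets with support counts (not Figure 2's values).
FS = {frozenset({"A"}): 5, frozenset({"B"}): 4, frozenset({"C"}): 4,
      frozenset({"A", "B"}): 4, frozenset({"A", "C"}): 4}
FS_prime = sanitize_fs(FS, {"B"}, {"A"})   # hide B => A
print(sorted("".join(sorted(s)) for s in FS_prime))  # ['A', 'AC', 'C']
```

This is only one possible strategy; a sound sanitization algorithm (Section 5.2) would also adjust the support counts of the remaining itemsets so that FS' stays consistent for the reconstruction phase.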
Then, the FP-tree-based inverse frequent set mining algorithm generates a released database D' (middle left of Figure 2) from the frequent itemsets and their corresponding support counts in FS' via an FP-tree. Clearly, the FP-tree is compatible with FS', which means that from the FP-tree we can obtain exactly the same FS'. The released database D' is obtained by outspreading the paths of the FP-tree one by one. As the FP-tree is composed of only frequent items, the initial D' obtained directly from the FP-tree is also made up of frequent items, with no infrequent items (in this example, "E" is an infrequent item). By scattering the infrequent item "E" into at most 3 transactions of D' (since σ = 4), we get a series of compatible and releasable databases, shown as D1', D2', ..., Dp', ..., Dq' at the bottom. By "compatible", we mean that from each database we can get exactly the set of non-sensitive rules R−Rh under the same MST and MCT. By "releasable", we mean that each database is secure, containing no sensitive rules.
4.3 Discussion
This subsection discusses how the suggested solution differs from and improves on existing approaches to the problem. The discussion concerns the two major phases of the framework: the sanitization algorithm and the inverse frequent set mining algorithm.
1) Sanitization algorithm
First, compared with the early popular data sanitization algorithms, the sanitization algorithm here is performed over the set of frequent itemsets with support counts, not over the original data. The set of frequent itemsets is much closer to the set of association rules than the data, which gives the database owner more direct, visible and intuitive control over the rule set. That is, by performing sanitization directly at the knowledge level of the data, one can control the discovered knowledge more handily. Second, compared with the recently emerging knowledge sanitization algorithm, this sanitization algorithm aims at hiding sensitive association rules, while theirs aims at hiding sensitive itemsets for simplicity. Usually, hiding sensitive rules is a more general, familiar and intuitive requirement than hiding sensitive itemsets. Another difference is that their sanitization algorithm operates on the whole itemset space, while this one operates only on the small set of frequent itemsets, which greatly reduces the sanitization cost.
2) Inverse frequent set mining algorithm
The FP-tree-based inverse frequent set mining algorithm can work more efficiently than the constraint-based inverse itemset lattice mining algorithm and other inverse frequent set mining algorithms. The main reason is that this algorithm deals with only the frequent items at the beginning, followed by the infrequent items (it generates a database with only frequent items first, then scatters the infrequent items), which significantly reduces the search space, whereas others consider frequent and infrequent items as a whole, making the search space very large. More importantly, by dealing with frequent items and infrequent items separately, this inverse mining algorithm can easily output a large number of databases for release, instead of finding only one database for release. To summarize the discussion, this solution can provide the user with a knowledge-level window for performing sanitization handily, and can efficiently generate a number of secure, sharable databases.
5. CURRENT PROGRESS
5.1 Work to Date
An FP-tree-based method for inverse frequent set mining has been developed that can be used in the third phase of the framework of the association rule hiding approach (Figure 1). A further work on inverse frequent set mining based on the FP-tree has recently been accepted by the Journal of Software (Ruanjian Xuebao, ISSN 1000-9825, JOS for short).
The first work shows effort towards the NP-complete problem of inverse frequent set mining. It gives a feasible and efficient algorithm to find a set of databases that agree with the given frequent itemsets discovered from a real database. Specifically, the algorithm provides a good heuristic search strategy to rapidly find an FP-tree satisfying the given frequent set constraints, which in turn leads to rapidly finding a set of compatible databases. Compared with previous "generate-and-test" methods, this method is a zero-traceback algorithm, without rollback operations, which saves huge computational costs. Furthermore, the algorithm can find a set of compatible databases (usually a large number of them) instead of only one compatible database, as in previous methods. The number of databases the algorithm can find is also carefully probed.
The work in the Journal of Software proposes a more mature and well-designed FP-tree-based method for inverse frequent set mining, after expanding the definition of the inverse frequent set mining problem and exploring three practical applications of it. First, the method divides the target constraints into sub-constraints and each time solves a sub linear-constraint problem. After some iterations, it finds an FP-tree that exactly satisfies the whole set of given constraints. Then, based on the FP-tree, it generates a temporary database TempD that involves only frequent items. The target datasets are obtained by scattering infrequent items into TempD. Theoretical analysis and experiments show that the method is correct and effective. This work differs from the earlier work mainly in that the generation of the FP-tree in JOS is based on solving constraint equations, so that the support counts of the frequent itemsets obtained from the FP-tree are exactly equal to the given ones, while the generation of the FP-tree in the earlier work is based on a heuristic strategy, so the support counts obtained from the FP-tree are not always equal to the given ones.
5.2 Future Work
Future work includes three aspects. First, develop and design a sound sanitization algorithm performed on the set of frequent itemsets with support counts. The input of the algorithm is a set of frequent itemsets with support counts FS discovered from a real database, a set of association rules R derived from FS, and a subset of sensitive rules Rh. The output of the algorithm is a set of frequent itemsets with support counts from which we can derive exactly the set of rules R−Rh. The algorithm should take the following into consideration: 1) ideally, the support and confidence of the rules in R−Rh should remain unchanged as much as possible; 2) the algorithm should be able to select appropriate hiding strategies according to the different kinds of correlations among the rules in R and Rh, which has only been considered preliminarily in recent work; and 3) it should provide a security mechanism preventing rule-based reasoning, that is, it should deal with the case in which sensitive rules can be inferred from non-sensitive rules.
Second, investigate how to restrict the number of transactions in the newly released database. Current work on FP-tree-based inverse frequent set mining does not restrict or control the number of transactions in the newly generated database. As an important characteristic of a transaction database, the number of transactions directly affects the support of a rule. The number of transactions that the current algorithms output is not known in advance, making the support of a rule uncertain even though the support count of the corresponding itemset is certain. Third, develop an integrated secure association rule mining tool which can simultaneously conceal (protect) both private data and the sensitive association rules contained in the data. In the privacy preserving data sharing context, both the sensitive data and the rules contained in the data need hiding (we call these sensitive data hiding in database, DHD, and sensitive rule hiding in database, KHD). However, DHD and KHD techniques are currently always investigated separately, and there is still a lack of a tool integrating both. Development of such a tool is significant and imperative under this situation.
5.3 Expected Contributions
The expected contributions of this dissertation are:
1) It will provide an efficient inverse frequent set mining method which can be used in different privacy preserving data sharing contexts.
2) It will provide an effective association rule hiding method.
3) It will provide database security administrators with a credible association rule mining tool protecting both the private data and the confidential rules contained in the data.
6. EXPERIMENTAL EVALUATION PLAN
The proposed approach will be evaluated on the BMS-POS, BMS-WebView-1, and BMS-WebView-2 datasets according to: 1) hiding effects; 2) data utility; 3) time performance. The hiding effects can be evaluated using three metrics: ① Hiding Failure Ratio; ② Lost Rules Ratio; ③ Ghost Rules Ratio. Each metric evaluates one side effect produced in the hiding process (see Figure 3).
In Figure 3, R is the set of rules mined from the original database D; R' is the set of rules mined from the new database D'; Rh is the set of sensitive rules to be hidden; ~Rh is the set of non-sensitive rules; ① represents the sensitive rules not hidden successfully; ② represents the non-sensitive rules that are falsely hidden (we say these rules are lost); ③ represents the newly generated spurious rules occurring in D' but not in D. Accordingly:
The Hiding Failure Ratio, defined as Rh(D')/Rh(D), is the percentage of sensitive rules still present in D' relative to the total sensitive rules in D, where Rh(X) is the number of sensitive rules in database X.
The Lost Rules Ratio, defined as (~Rh(D) − ~Rh(D'))/~Rh(D), is the percentage of non-sensitive rules falsely hidden relative to the total non-sensitive rules in D, where ~Rh(X) is the number of non-sensitive rules in database X.
The Ghost Rules Ratio, defined as (|R'| − |R ∩ R'|)/|R'|, is the percentage of ghost rules in D' relative to the total rules in D'.
Finally, data utility can be measured by the variations in support and confidence between the old and new sets of association rules.
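Given the three rule sets as plain Python sets, the metrics above are one-liners; the function name and the toy rule labels are ours:

```python
def hiding_metrics(R, R_prime, Rh):
    """Hiding Failure, Lost Rules and Ghost Rules ratios from Section 6.
    R: rules mined from D; R_prime: rules mined from D'; Rh: sensitive rules
    (assumed to be a subset of R, so Rh(D) = |Rh|)."""
    non_sensitive = R - Rh                                # ~Rh
    hiding_failure = len(Rh & R_prime) / len(Rh)          # Rh(D') / Rh(D)
    lost = len(non_sensitive - R_prime) / len(non_sensitive)
    ghost = len(R_prime - R) / len(R_prime)               # (|R'|-|R ∩ R'|)/|R'|
    return hiding_failure, lost, ghost

R = {"r1", "r2", "r3", "r4", "r5"}
Rh = {"r1"}
R_prime = {"r2", "r3", "r4", "r6"}      # r1 hidden, r5 lost, r6 is a ghost
print(hiding_metrics(R, R_prime, Rh))   # (0.0, 0.25, 0.25)
```

An ideal hiding run scores (0, 0, 0): every sensitive rule concealed, no non-sensitive rule lost, no ghost rule introduced.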
7. CONCLUSION
The proposed model will fill the blank in the new reconstruction-based research track, and is expected to perform well according to the evaluation metrics, including hiding effects, data utility, and time performance.
REFERENCES:
[1] Atallah, M., Bertino, E., Elmagarmid, A., Ibrahim, M., and Verykios, V.S. Disclosure limitation of sensitive rules. In: Scheuermann, P., ed. Proc. of the IEEE Knowledge and Data Exchange Workshop (KDEX'99). IEEE Computer Society, 1999. 45-52.
[2] Calders, T. Computational complexity of itemset frequency satisfiability. In: Proc. of the 23rd ACM PODS. ACM Press, 2004. 143-154.
[3] Chen, X. and Orlowska, M. A further study on inverse frequent set mining. In: Proc. of the 1st Int'l Conf. on Advanced Data Mining and Applications (ADMA'05). LNCS 3584, Springer-Verlag, 2005. 753-760.
[4] Chen, X., Orlowska, M., and Li, X. A new framework for privacy preserving data sharing. In: Proc. of the 4th IEEE ICDM Workshop: Privacy and Security Aspects of Data Mining. IEEE Computer Society, 2004. 47-56.