Reconstruction-Based Association Rule Hiding

Devineni Venkata Ramana & Dr. Hima Sekhar
MIC College of Technology, Kanchikacherla

Presented by:
M. Vamsi Krishna, CSE final year, MIC College of Technology, Kanchikacherla, Email: [email protected], Phone: 9298408594
S. Pradeep, CSE final year, MIC College of Technology, Kanchikacherla, Email: [email protected], Phone: 9298655770

ABSTRACT
As large repositories of data contain confidential rules that must be protected before being published, association rule hiding has become one of the important privacy preserving data mining problems. Compared with traditional data modification methods, data reconstruction is a new and promising, but not yet sufficiently investigated, method inspired by the inverse frequent set mining problem. This research focuses on further investigating reconstruction-based techniques for association rule hiding. An FP-tree-based method has been proposed for inverse frequent set mining, which can be used in the proposed reconstruction-based framework.

1. MOTIVATION
Data mining extracts novel and useful knowledge from large repositories of data and has become an effective means of analysis and decision making in corporations. Sharing data for data mining can bring many advantages for research and business collaboration; however, large repositories of data contain private data and sensitive rules that must be protected before being published. Motivated by the conflicting requirements of data sharing, privacy preservation and knowledge discovery, privacy preserving data mining (PPDM) has become a research hotspot in the data mining and database security fields. Two problems are addressed in PPDM: one is the protection of private data; the other is the protection of sensitive rules (knowledge) contained in the data.
The former settles how to obtain normal mining results when private data cannot be accessed accurately; the latter settles how to protect sensitive rules contained in the data from being discovered, while non-sensitive rules can still be mined normally. The latter problem is called knowledge hiding in database (KHD), which is the opposite of knowledge discovery in database (KDD). Concretely, the problem of KHD can be described as follows: given a data set D to be released, a set of rules R mined from D, and a set of sensitive rules Rh ⊆ R to be hidden, how can we obtain a new data set D', such that the rules in Rh cannot be mined from D', while as many as possible of the rules in R-Rh can still be mined? Typically, when D is a transaction database and R is the set of association rules mined from D with minimum support threshold MST and minimum confidence threshold MCT, the KHD problem becomes the association rule hiding problem. Clifton provided a well-designed scenario that clearly shows the importance of the association rule hiding problem. In the scenario, by providing the original unaltered database to an external party, some strategic association rules that are crucial to the data owner are disclosed, with serious adverse effects. The sensitive association rule hiding problem is very common in collaborative association rule mining projects, in which one company may decide to disclose only part of the knowledge contained in its data and hide strategic knowledge represented by sensitive rules. These sensitive rules must be protected before the data is shared. Besides, by hiding some association rules, data owners can prevent rule-based vicious inferences used for unwarrantable purposes, e.g. uncovering private data. To settle the association rule hiding problem, the first proposed concept was "data sanitization".
Its main idea is to select some transactions from the original database to modify (delete or add items) through some heuristics. It has also been proved that optimal sanitization is an NP-hard problem. After that, many approaches were proposed within the data sanitization framework. Association rule hiding based on the data sanitization framework operates simply. However, data sanitization techniques obviously cannot control the hiding effects on confidential rules: the hiding effects can only be validated after sanitization. In other words, they suffer from the weakness of not providing a way of fine-tuning the generation of the released database. Moreover, data sanitization can produce a lot of I/O operations, which greatly increase the time cost, especially when the original database includes a large number of transactions. Different from the data sanitization framework, a novel framework has been proposed that can be regarded as a "knowledge sanitization" approach, inspired by the inverse frequent set mining problem. This framework first performs sanitization on an itemset lattice, called a knowledge base, from which association rules can be derived. The itemset lattice is defined as the partial order of all subset itemsets generated from the given transactions. Then a reconstruction procedure reconstructs a new released dataset from the sanitized itemset lattice. In a word, this approach conceals the sensitive rules by sanitizing the itemset lattice rather than sanitizing the original dataset. Compared with the original dataset, the itemset lattice is an intermediate product that is closer to the association rules. In this way, one can easily control the availability of the rules that can be mined from the released dataset and control the hiding effects directly.
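To make the knowledge-sanitization idea concrete, the following minimal sketch hides a sensitive itemset by deleting it, together with its supersets, from the mined frequent-itemset table, leaving the transactions themselves untouched. The itemsets, counts, and the `sanitize` helper are invented for illustration and are not taken from the cited work:

```python
# Toy knowledge-level sanitization: hide a sensitive itemset by removing
# it (and every superset of it) from the frequent-itemset table, without
# touching the original transactions.
def sanitize(freq_itemsets, sensitive):
    """Drop the sensitive itemset and all of its supersets."""
    s = frozenset(sensitive)
    return {X: c for X, c in freq_itemsets.items() if not s <= X}

# itemset -> support count (values invented for illustration)
FS = {
    frozenset({"A"}): 5,
    frozenset({"B"}): 4,
    frozenset({"A", "B"}): 4,
    frozenset({"C"}): 4,
}
FS_prime = sanitize(FS, {"A", "B"})
print(sorted("".join(sorted(X)) for X in FS_prime))  # ['A', 'B', 'C']
```

Dropping the supersets as well keeps the remaining table downward-closed, so no rule involving the hidden itemset can be derived from what is left.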
However, as a rudimentary work, that approach is still incomplete and limited in the following two aspects: 1) it does not give concrete guidance on how to sanitize the itemset lattice according to the sensitive association rules; 2) the feasibility of the data reconstruction process is restricted by whether the knowledge sanitization process can produce an itemset lattice with a consistent support value configuration, and, in fact, the proposed knowledge sanitization process cannot guarantee that a consistent one can always be found within polynomial time. As a new and promising, but not sufficiently investigated, framework, this research focuses on further investigating effective knowledge sanitization and data reconstruction based techniques for association rule hiding. These techniques are referred to as reconstruction-based association rule hiding, in contrast to data modification techniques. In particular, an FP-tree-based method was proposed for inverse frequent set mining, which can be used in the reconstruction-based framework. The aim is to provide an easily controllable and robust association rule hiding mechanism for the privacy preserving data sharing context.

2. RELATED WORK
The problem of association rule hiding was first probed at the IEEE Knowledge and Data Exchange Workshop. After that, many approaches were proposed. Roughly, they fall into two groups: data sanitization approaches based on data modification (data modification for short) and knowledge sanitization approaches based on data reconstruction (data reconstruction for short). 1) Data Modification Approaches. Data modification methods hide sensitive association rules by directly modifying the original data. Most of the early methods belong to this track. According to the modification means, they can be further classified into two subcategories: Data-Distortion techniques and Data-Blocking techniques.
Data-Distortion is based on data perturbation or data transformation: if we consider the transaction database as a two-dimensional binary matrix, the procedure is to change a selected set of 1-values to 0-values (delete items) or 0-values to 1-values (add items). Its aim is to reduce the support or confidence of the sensitive rules below the user-predefined security threshold. Early data distortion techniques adopt simple heuristic-based sanitization strategies like Algo1a/Algo1b/Algo2a/Algo2b/Algo2c, Naive/MinFIA/MaxFIA/IGA, RRA/RA, and SWA. Different heuristics lead to different strategies for selecting which transactions are to be sanitized and which items are to be the victims; these are the two core issues affecting the hiding effects of the algorithms. Subsequent techniques like WSDA/PDA and Border-Based advanced the simple heuristics to greedy (locally optimal) strategies that try to select the modifications with minimal side effects on data utility. Further, integer-programming techniques based on global optimization were proposed; one formulation aims to minimize the number of sanitized transactions, while another aims to minimize the number of sanitized items. From the simple heuristics to the greedy strategies, and then to the integer-programming techniques, the solutions move gradually closer to the optimum at the cost of increasing computational complexity. Compared with the aforementioned techniques, Sanitization-Matrix is a special distortion technique: it constructs the new released database by multiplying the original database by a sanitization matrix. Data-Blocking is another data modification approach for association rule hiding. Instead of distorting the data (altering part of it to false values), the blocking approach replaces certain data items with a question mark "?".
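Replacing items with "?" makes an itemset's support uncertain: it now lies between a lower bound (transactions that certainly contain the itemset) and an upper bound (transactions that possibly contain it). The following toy model, invented here and not taken from any specific blocking algorithm, represents each transaction as a pair of known items and blocked items:

```python
# Toy model of blocking: each transaction is (known_items, blocked_items),
# where blocked_items were replaced by "?". The support of an itemset
# then becomes an interval [lo, hi] rather than a single value.
def support_interval(itemset, db):
    X = set(itemset)
    lo = sum(1 for known, unk in db if X <= known)        # certainly present
    hi = sum(1 for known, unk in db if X <= known | unk)  # possibly present
    n = len(db)
    return lo / n, hi / n

D = [
    ({"A", "B"}, set()),   # A and B both certain
    ({"A"}, {"B"}),        # B blocked with "?"
    ({"B"}, set()),
    (set(), {"A", "B"}),   # both blocked
]
print(support_interval({"A", "B"}, D))  # (0.25, 0.75)
```

With no "?" marks the two bounds coincide; as more items are blocked, the interval widens, which is exactly the uncertainty the blocking approach exploits.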
The introduction of this special unknown value brings uncertainty to the data, making the support and confidence of an association rule become two uncertain intervals, respectively. At the beginning, the lower bounds of the intervals equal the upper bounds. As the number of "?" marks in the data increases, the lower and upper bounds begin to separate gradually, and the uncertainty of the rules grows accordingly. When the lower bound of either a rule's support interval or its confidence interval falls below the security threshold, the rule is deemed to be concealed. 2) Data Reconstruction Approaches. Data reconstruction methods put the original data aside and start by sanitizing the so-called "knowledge base". The new released data is then reconstructed from the sanitized knowledge base. The first work depicting this idea gave a coarse Constraint-based Inverse Itemset Lattice Mining procedure (CIILM) for hiding sensitive frequent itemsets. The main differences from the present work are: 1) that method aims at hiding frequent itemsets, while this one addresses hiding association rules; 2) its data reconstruction is based on the itemset lattice, while this one is based on the FP-tree. Another dimension along which existing algorithms can be classified is whether they hide rules or hide large (frequent) itemsets. Part of the existing work above chooses to hide association rules, while the rest chooses to hide large itemsets. Relatively speaking, hiding rules is more complicated than hiding itemsets. For clarity (see Table 1), the existing algorithms in the literature are classified into six different parts according to the three different hiding approaches and whether the algorithm hides association rules or large itemsets. From the table, we can see that the field of data reconstruction for hiding rules is blank. In addition, the related inverse frequent set mining problem, inferring the original data from given frequent itemsets, is an emerging topic in privacy preserving data sharing. Mielikainen first proposed this problem.
He showed that finding a dataset compatible with a given collection of frequent itemsets is NP-complete. After that, several methods were proposed. A settlement of the inverse frequent set mining problem would give strong support to reconstruction-based association rule hiding.

3. PROBLEM FORMULATION
Let I = {I1, I2, ..., Im} be a set of items. Any X ⊆ I is called an itemset. Further, an itemset consisting of k elements is called a k-itemset. Let D = {T1, T2, ..., Tn} be a set of transactions, where each transaction Ti (i ∈ [1..n]) is an itemset. The support count of an itemset X ⊆ I in a transaction database D, denoted supp_count(X) or |X|, is the number of transactions containing X. The support of an itemset X ⊆ I in D, denoted support(X), is defined as the percentage of transactions in D containing X: support(X) = |X|/|D|, where |D| is the number of transactions in D. X is a frequent itemset if X's support is no less than a predefined minimum support threshold MST. An association rule is an implication of the form X ⇒ Y, where X ⊂ I, Y ⊂ I and X ∩ Y = ∅. We say the rule X ⇒ Y holds in the database D with confidence c if |X ∪ Y|/|X| ≥ c. We also say the rule X ⇒ Y has support s if |X ∪ Y|/|D| ≥ s. Note that while the support is a measure of the frequency of a rule, the confidence is a measure of the strength of the relation between the sets of items. The well-known association rule mining problem aims to find all significant association rules. A rule is significant if its support and confidence are no less than the user-specified minimum support threshold (MST) and minimum confidence threshold (MCT), respectively. To find the significant rules, an association rule mining algorithm first finds all the frequent itemsets and then derives the association rules from them. On the contrary, the association rule hiding problem aims to prevent some of these rules, which we refer to as "sensitive rules", from being mined.
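The definitions above translate directly into code; the toy database below is invented for illustration:

```python
# Direct implementation of the support and confidence definitions above.
def supp_count(X, db):
    """|X|: number of transactions containing every item of X."""
    X = set(X)
    return sum(1 for t in db if X <= t)

def support(X, db):
    """support(X) = |X| / |D|."""
    return supp_count(X, db) / len(db)

def confidence(X, Y, db):
    """conf(X => Y) = |X ∪ Y| / |X|."""
    return supp_count(set(X) | set(Y), db) / supp_count(X, db)

D = [{"A", "B"}, {"A", "B", "C"}, {"A", "C"}, {"B", "C"}]
print(support({"A", "B"}, D))       # 2/4 = 0.5
print(confidence({"B"}, {"A"}, D))  # |AB|/|B| = 2/3
```

With MST = 0.5 and MCT = 0.75, for example, the rule B ⇒ A above would be frequent but not significant, since its confidence 2/3 falls below MCT.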
The association rule hiding problem addressed in this paper can be formulated as follows: given a transaction database D, a minimum support threshold MST, a minimum confidence threshold MCT, a set of significant association rules R mined from D and a set of sensitive rules Rh ⊆ R to be hidden, find a new database D', such that the rules in R-Rh, and only those, can be mined from D' under the same MST and MCT. This if-and-only-if condition means that no normal rules in R-Rh are falsely hidden, and no extra artificial rules (also called "ghost" rules, meaning non-frequent rules that become frequent) are falsely generated during the rule hiding process. The related inverse frequent set mining problem is defined as follows: given a set of items I = {I1, I2, ..., Im}, a minimum support threshold MST, and the set of all frequent itemsets F = {f1, f2, ..., fn} with the support set S = {support(f1), support(f2), ..., support(fn)} discovered from a real database D, find a database D' satisfying the following constraints: 1) D' is over the same set of items I; 2) from D' we can discover exactly the same set of frequent itemsets F with the same support set S under the same minimum support threshold MST.

4. PROPOSED SOLUTION
4.1 Framework
The framework of this approach is depicted in Figure 1. The whole approach is divided into three phases: frequent set mining, the sanitization algorithm, and FP-tree-based inverse frequent set mining.

Figure 1. Framework of this approach

The first phase uses a frequent itemset mining algorithm to generate all frequent itemsets with their supports and support counts (FS for short in the figure) from the original database D. The second phase performs the sanitization algorithm over FS, which involves selecting the hiding strategy and identifying sensitive frequent itemsets according to the sensitive association rules.
In the best case, the sanitization algorithm ensures that from the sanitized set of frequent itemsets with supports and support counts (FS' for short in the figure) we can derive exactly the set of non-sensitive rules, with no normal rules lost and no ghost rules generated. The third phase generates the released database D' from FS' by using an inverse frequent set mining algorithm. In this framework, we plan to adopt an inverse frequent set mining algorithm based on the FP-tree, which comprises the following two steps: 1) the algorithm tries to "guess" an FP-tree that satisfies all the frequent itemsets and their support counts in FS'; we call such an FP-tree a compatible FP-tree, meaning that from it we can mine the same set of frequent itemsets with the same support counts as FS'. 2) Generate a corresponding database D' directly from the compatible FP-tree by outspreading all the paths of the tree. The idea of the FP-tree-based inverse frequent set mining in the third phase comes from the fact that the FP-tree is a highly compact structure that stores the complete information of a transaction database relevant to frequent itemset mining. Thus we can regard the FP-tree as an intermediate product between a database and its corresponding frequent itemsets. Intuitively, the FP-tree reduces the gap between a database and its frequent itemsets, which makes the transformation from given frequent itemsets to a database smoother, more natural and easier. Roughly, the procedure of the FP-tree-based inverse frequent set mining in the third phase can be seen as the reverse of the FP-tree-based frequent itemset mining method.

4.2 Example
To illustrate the proposed approach to the association rule hiding problem and validate its feasibility, let us consider an example (see Figure 2).
In Figure 2, we are given I = {A, B, C, D, E}, an original database D = {T1, T2, T3, T4, T5, T6}, a minimum support count threshold σ = 4, a minimum support threshold MST = 66%, and a minimum confidence threshold MCT = 75%. All frequent itemsets, their support counts and their supports obtained from D are listed in FS (top middle of Figure 2). All significant association rules obtained from the frequent itemsets in FS are shown in table R (top right of Figure 2). Let us suppose B ⇒ A is a sensitive rule that needs hiding. First, instead of performing sanitization on the original database, we perform the sanitization algorithm on FS by deleting the sensitive frequent itemsets B:4 and AB:4. Here, we choose to hide a sensitive rule by reducing the support of its corresponding large itemsets. Furthermore, we adopt a thorough hiding strategy, meaning that the large itemset the sensitive rule corresponds to is completely hidden and its support is reduced to zero. So after the sanitization we get the frequent itemset set FS', from which we can obtain exactly the set of association rules R-Rh (with no normal rules lost and no ghost rules generated). Then the FP-tree-based inverse frequent set mining algorithm generates a released database D' (middle left of Figure 2) from the frequent itemsets and their corresponding support counts in FS' via an FP-tree. Clearly, the FP-tree is compatible with FS', which means that from the FP-tree we can obtain exactly the same FS'. The released database D' is obtained by outspreading the paths of the FP-tree one by one. As the FP-tree is composed of only frequent items, the initial D' obtained directly from the FP-tree is also made up of frequent items, with no infrequent items (in this example, "E" is an infrequent item). By scattering the infrequent item "E" into at most 3 (i.e., σ-1 with σ = 4) transactions of D', we get a series of compatible and releasable databases, shown as D1', D2', ..., Dp', ..., Dq', ... at the bottom.
By "compatible", we mean that from each database we can get exactly the set of non-sensitive rules R-Rh under the same MST and MCT. By "releasable", we mean that each database is secure, containing no sensitive rules. Details about this procedure can be found in the referenced work.

4.3 Discussion
This subsection discusses how the suggested solution is different, new, and better compared with existing approaches to the problem. The discussion centers on the two major phases of this framework: the sanitization algorithm and the inverse frequent set mining algorithm. 1) Sanitization algorithm. First, compared with the early popular data sanitization algorithms, this sanitization algorithm is performed over the set of frequent itemsets with support counts, not on the original data. The set of frequent itemsets is much closer to the set of association rules than the data is, which gives the database owner more direct, visible and intuitive control over the rule set. That is, by performing sanitization directly on the knowledge level of the data, one can control the discovered knowledge more handily. Second, compared with the recently emerging knowledge sanitization algorithm, this sanitization algorithm aims at hiding sensitive association rules, while that one aims at hiding sensitive itemsets for simplicity. Usually, hiding sensitive rules is a more general, familiar and intuitive requirement than hiding sensitive itemsets. Another difference is that the other sanitization algorithm operates on the whole itemset space, while this one operates only on the small set of frequent itemsets, which greatly reduces the sanitization cost. 2) Inverse frequent set mining algorithm. The FP-tree-based inverse frequent set mining algorithm can work more efficiently than the constraint-based inverse itemset lattice mining algorithm and other inverse frequent set mining algorithms.
The main reason is that this algorithm deals only with frequent items at the beginning, followed by the infrequent items (it generates a database with only frequent items first, then scatters in the infrequent items), which significantly reduces the search space, while others consider frequent and infrequent items as a whole, making the search space very large. More importantly, by dealing with frequent items and infrequent items separately, this inverse mining algorithm can easily output a large number of databases for release instead of finding only one. To summarize the discussion, this solution provides the user with a knowledge-level window through which to perform sanitization handily, and can efficiently generate a number of secure, sharable databases.

5. CURRENT PROGRESS
5.1 Work to Date
An FP-tree-based method for inverse frequent set mining has been developed that can be used in the third phase of the framework of the association rule hiding approach (Figure 1). A further work on inverse frequent set mining based on the FP-tree has recently been accepted by the Journal of Software (Ruanjian Xuebao, ISSN 1000-9825, JOS for short). The work describes an effort towards the NP-complete problem of inverse frequent set mining. It gives a feasible and efficient algorithm to find a set of databases that agree with the given frequent itemsets discovered from a real database. Specifically, the algorithm provides a good heuristic search strategy to rapidly find an FP-tree satisfying the given frequent set constraints, leading to rapidly finding a set of compatible databases. Compared with previous "generate-and-test" methods, this method is a zero-traceback algorithm without rollback operations, which saves huge computational costs. Furthermore, the algorithm can find a set of compatible databases (usually a large number of them) instead of only one compatible database as in previous methods. The number of databases the algorithm can find is also carefully probed.
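The two-stage idea, building a database over frequent items first and then scattering the infrequent items, can be sketched as follows. The (path, count) representation of the FP-tree and all data are simplifications invented for illustration, not the actual algorithm:

```python
import itertools

# Stage 1: "outspread" a compatible FP-tree, represented here simply as
# (path, count) pairs, into a database over frequent items only.
def outspread(paths):
    db = []
    for path, count in paths:
        db.extend(set(path) for _ in range(count))  # one transaction per unit of count
    return db

# Stage 2: scatter an infrequent item into sigma-1 transactions, so its
# support count stays below the minimum in every released database.
def scatter(db, item, sigma):
    for idx in itertools.combinations(range(len(db)), sigma - 1):
        new_db = [set(t) for t in db]
        for i in idx:
            new_db[i].add(item)
        yield new_db

tree_paths = [(("B", "A"), 4), (("C",), 2)]   # hypothetical compatible tree
base = outspread(tree_paths)                   # 6 transactions, frequent items only
releases = list(scatter(base, "E", sigma=4))   # many releasable databases
print(len(base), len(releases))                # 6 20
```

Because the infrequent item is placed in only σ-1 transactions at a time, it remains infrequent in every release, and the combinatorics naturally yield many releasable databases rather than a single one.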
The work in the Journal of Software proposes a more mature and well-designed FP-tree-based method for inverse frequent set mining, after expanding the definition of the inverse frequent set mining problem and exploring three practical applications of it. First, the method divides the target constraints into sub-constraints, and each time it solves a sub linear-constraint problem. After some iterations, it finds an FP-tree that exactly satisfies the whole set of given constraints. Then, based on the FP-tree, it generates a temporary database TempD that involves only frequent items. The target datasets are obtained by scattering infrequent items into TempD. Theoretical analysis and experiments show that the method is correct and effective. This work differs from the earlier work mainly in that the generation of the FP-tree in the JOS work is based on solving constraint equations, so that the support counts of the frequent itemsets obtained from the FP-tree are exactly equal to the given ones, while the generation of the FP-tree in the earlier work is based on a heuristic strategy, so that the support counts obtained from the FP-tree are not always equal to the given ones.

5.2 Future Work
The further work includes three aspects. First, develop and design a sound sanitization algorithm performed on the set of frequent itemsets with support counts. The input of the algorithm is: a set of frequent itemsets with support counts FS discovered from a real database, a set of association rules R derived from FS, and a subset of sensitive rules Rh. The output of the algorithm is a set of frequent itemsets with support counts from which we can derive exactly the set of rules R-Rh.
The algorithm itself should take the following considerations into account: 1) ideally, the support and confidence of the rules in R-Rh should remain unchanged as much as possible; 2) the algorithm should be able to select appropriate hiding strategies according to the different kinds of correlations among the rules in R and Rh, which has only been considered preliminarily in recent work; and 3) it should provide a security mechanism preventing rule-based reasoning, that is, it should deal with the case in which sensitive rules can be inferred from non-sensitive rules. Second, investigate how to restrict the number of transactions in the new released database. Current work on FP-tree-based inverse frequent set mining does not restrict or control the number of transactions in the newly generated database. As an important characteristic of a transaction database, the number of transactions directly affects the support of a rule. The number of transactions that the current algorithms output is not known in advance, making the support of a rule uncertain even though the support count of the corresponding itemset is certain. Third, develop an integrated secure association rule mining tool which can conceal (protect) private data and sensitive association rules contained in the data simultaneously. In the privacy preserving data sharing context, both the sensitive data and the rules contained in the data need hiding (we call the former sensitive data hiding in database, DHD, and the latter sensitive rule hiding in database, KHD). However, DHD and KHD techniques are currently always investigated separately, and there is still a lack of a tool integrating both. The development of such a tool is significant and imperative in this situation.

5.3 Expected Contributions
The expected contributions of this dissertation are: 1) It will provide an efficient inverse frequent set mining method which can be used in different privacy preserving data sharing contexts.
2) It will provide an effective association rule hiding method. 3) It will provide database security administrators with a creditable association rule mining tool protecting both the private data and the confidential rules contained in the data.

6. EXPERIMENTAL EVALUATION PLAN
The proposed approach will be evaluated on the BMS-POS, BMS-WebView-1, and BMS-WebView-2 datasets according to: 1) hiding effects; 2) data utility; 3) time performance. The hiding effects can be evaluated using three metrics: ① Hiding Failure Ratio; ② Lost Rules Ratio; ③ Ghost Rules Ratio. Each metric evaluates one side effect produced in the hiding process (see Figure 3). In Figure 3, R is the set of rules mined from the original database D; R' is the set of rules mined from the new database D'; Rh is the set of sensitive rules to be hidden; ~Rh is the set of non-sensitive rules; ① represents the sensitive rules not hidden successfully; ② represents the non-sensitive rules that are falsely hidden (we say these rules are lost); ③ represents the newly generated spurious rules occurring in D' but not in D. Accordingly, the Hiding Failure Ratio, defined as Rh(D')/Rh(D), is the percentage of sensitive rules remaining in D' relative to the total sensitive rules in D, where Rh(X) is the number of sensitive rules in database X. The Lost Rules Ratio, defined as (~Rh(D) − ~Rh(D'))/~Rh(D), is the percentage of non-sensitive rules falsely hidden relative to the total non-sensitive rules in D, where ~Rh(X) is the number of non-sensitive rules in database X. The Ghost Rules Ratio, defined as (|R'| − |R ∩ R'|)/|R'|, is the percentage of ghost rules in D' relative to the total rules in D'. Finally, data utility can be measured as the variation of support and confidence between the old and new sets of association rules.

7. CONCLUSION
The proposed model will fill in the new reconstruction-based research track and is expected to work well according to the evaluation metrics, including hiding effects, data utility, and time performance.
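The three side-effect metrics of the evaluation plan can be sketched as set computations over rule identifiers; the rule labels below are invented for illustration:

```python
# The three side-effect metrics, with rules represented as opaque labels.
def side_effects(R, R_new, Rh):
    non_sensitive = R - Rh
    hiding_failure = len(Rh & R_new) / len(Rh)              # sensitive rules surviving in D'
    lost = len(non_sensitive - R_new) / len(non_sensitive)  # non-sensitive rules falsely hidden
    ghost = len(R_new - R) / len(R_new)                     # spurious rules new in D'
    return hiding_failure, lost, ghost

R     = {"r1", "r2", "r3", "r4"}   # rules mined from D
Rh    = {"r1"}                     # sensitive rules to hide
R_new = {"r2", "r3", "r5"}         # rules mined from D'
# hiding failure 0.0 (r1 gone), lost 1/3 (r4 missing), ghost 1/3 (r5 appeared)
print(side_effects(R, R_new, Rh))
```

An ideal hiding process drives all three ratios to zero: every sensitive rule disappears, no non-sensitive rule is lost, and no ghost rule appears.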
REFERENCES
[1] Atallah, M., Bertino, E., Elmagarmid, A., Ibrahim, M., and Verykios, V.S. Disclosure limitation of sensitive rules. In: Scheuermann, P., ed. Proc. of the IEEE Knowledge and Data Exchange Workshop (KDEX'99). IEEE Computer Society, 1999. 45-52.
[2] Calders, T. Computational complexity of itemset frequency satisfiability. In: Proc. of the 23rd ACM PODS. ACM Press, 2004. 143-154.
[3] Chen, X. and Orlowska, M. A further study on inverse frequent set mining. In: Proc. of the 1st Int'l Conf. on Advanced Data Mining and Applications (ADMA'05). LNCS 3584, Springer-Verlag, 2005. 753-760.
[4] Chen, X., Orlowska, M., and Li, X. A new framework for privacy preserving data sharing. In: Proc. of the 4th IEEE ICDM Workshop: Privacy and Security Aspects of Data Mining. IEEE Computer Society, 2004. 47-56.