Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Privacy-preserving rule mining Outline A brief introduction to association rule mining Privacy preserving rule mining Single party Perturbation - publishing Encryption - outsourcing Distributed multiparty Cryptographic protocols – collaborative mining Hiding sensitive rules Association Rule Mining Transactional datasets A transaction t = {a,b,c,…} a,b,c,… are called items The length of transaction = # of items Transaction length can vary Equivalent representation The set of all items : I A transaction t can be transformed to a boolean vector in length of |I|. Association Rule Mining Rule mining Goal: find the frequent itemset Some itemset, e.g.{a,b,c}, appears frequently, higher than certain support. Rules can be derived from the itemset: AB {a,b,c} is frequent, then abc, abc, … Metrics Support = # of occurrences of itemset/ total # of transactions i.e., prob (A,B) Confidence = # of occurrences of itemset/# of occurrences of left (of the rule) I.e. the conditional prob: Pr (B|A) Example E.g. a,b appears 100 times together, while abc appears 50 times together in total 5000 transactions Support of abc = 50/5000 = 0.01 Confidence of abc = 50/100 = 0.5 Algorithms Apriori observation: if itemset A is a part of B, then support(A) >= support (B) Steps in finding frequent itemsets: Starting from single-item set, pruning the itemsets that have support < threshold When we have a set of k-itemsets, we expand it to k+1-itemsets, and check their supports. Using data structures like hash tree to speed up the counting process Algorithm Generating rules with confidence threshold Confidence (AB) = P(B|A) = support(AB)/support(A) Centralized approaches Two types of methods (Categorical data) Perturbation Encryption Perturbation Papers 1."Privacy Preserving Mining of Association Rules," SIGKDD 2002. 2."Maintaining data privacy in association rule mining”, VLDB 02 3.A framework for high-accuracy privacypreserving mining, ICDE05 Basic ideas Consider a transaction as a boolean bit vector Perturb each bit with certain method Paper 1: randomly select j items from t, then for rest of all items, with prob p to be selected Paper 2: each bit has the prob p to be original, 1-p to be flipped Paper 3: unify the methods with perturbation matrix The key is… After you perturb the data, you should still be able to find the supported rules correctly. The accuracy is traded off by the intensity of perturbation (p) Methods discovering the original support Paper 1 using the correlation between “partial support” to find the original support Concept of partial support #{t T |# ( A t ) l} sup ( A) N T l Prob of the length change of matched parts m pk [l l ' ] p[# (t ' A) l ' |# (t A) l ] The size of t : m, the size of itemset A: k Some results Let si be supi(A) and si’ be supi’ (A) 1. 2. The matrix P and D are defined with only p[ll’] From 1, we can estimate the original support From 2, we can estimate the reliability (variance) of the support Estimation (which is related to perturbation rate p) Privacy Given an itemset A in perturbed transaction t’ What is the probability of an item a, really in the itemset A, i.e., p[a t | A t ' ] p[a t | A t ' ] Tradeoff between utility and privacy Lowest discoverable support: distinguishable from zero (consider the variance of support estimation) Encryption method Paper: Security in Outsourcing of Association Rule Mining, VLDB2007 Substitution encryption 1-1 substitution a1, b2, … 1-n substitution a{1,10}, b{2,11,12},… Problem: 1-1 substitution is weak Arbitrary 1-n substitution does not work Cannot recover original rules from the rules from the substituted items. The basic idea Fake items Original n items, additional m fake items Define “admissible 1-n mapping” Arbitrary 1-n mapping may result in irreversible results E.g., a{1,2}, b{2}, c{3} If we find frequent itemset {1,2,3} in the substituted set, {ac} or {abc}, which one is the right original itemset? Admissible 1-n mapping For each mapping, there should be at least one unique substitute item in the mapped result, which does not appear in other mapping E.g., a{1,2}, b{2}, c{3} breaks the definition while a{1,2}, b{2,4}, c{3} is admissible Recovering rules When we use admissible mappings We are able to reverse the discovered rules on substituted set. E.g., if we find {1,2,4} is a frequent set check all mappings: {1,2} a, {2,4} b {1,2,4} {ab} cost Additional cost Generating item mapping Generating transaction transformation significant Cost of rule mining Both the # of items and the average length of transaction is increased, thus the total cost will increase Features of encryption method Rules can be accurately recovered A tradeoff between cost and privacy Privacy is better preserved with more fake items More fake items will result in higher additional cost. The VLDB07 paper weaknesses No strong proof on security does not mention any attacks – there are possibly attacks on this scheme Distributed datasets Perturbation works, but existing encryption protocols do not work might need some protocols to make the encryption method applicable Horizontally or vertically partitioned Paper1: "Privacy-Preserving Distributed Mining of Association Rules on Horizontally Partitioned Data," Paper2: "Privacy preserving association rule mining in vertically partitioned data," Shared features: using the cryptographic methods to construct protocols Hiding sensitive rules When publishing data for rule mining, the rule itself can be sensitive too. Basic methods Decrease the support Support = # of occurrences of itemset/ total # of transactions Decrease the confident Confidence = # of occurrences of itemset/# of occurrences of left(rule) discussion on rule hiding Need sufficient amount of computational cost at the data owner side You should know what rules are sensitive in advance! So only necessary for the case you have to share the data When hiding sensitive rules, other rules might be damaged Summary Two methods for single party rule mining Perturbation and encryption Distributed multiparty can use protocols and perturbation method Hiding sensitive rules is also important in some cases