Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Security in Outsourced Association Rule Mining Agenda Introduction Approximate randomized technique Encryption Summary and future work Introduction Data mining in company know about the past activities of their customers make strategic decisions Types of data mining Association rules mining Clustering Classification Association rules “X => Y” If a transaction contains itemset X, the transaction will probably contain itemset Y Support: number of supporting transactions Confidence: proportion of transactions containing X which also contains Y Performing data mining Build application Buy software Development cost? Time? Fit requirements? Maintenance? Outsource Concerns in outsourcing Output Execution Assurance Correctness Security Company Data Miner DB Privacy of records Information of the company Approximate randomized technique Approximate solution Privacy Preserving Mining of Association Rules SIGKDD 2002 Authors: Alexandre Evfimievski, Ramakrishnan Srikant, Rakesh Agrawal, Johannes Gehrke Problem formulation Let the set of transactions be T = {t1, t2, … tN} Transform T to T’ = {t’1, t’2, … t’N} Mine in T’ Privacy breaches Itemset A cause a privacy breach of level p if for some item a in A P[a in ti|A in t’i] >= p Select-a-size randomization For each transaction ti in T m = length of ti Select (non-uniformly) randomly an integer j from [0, m] Copy uniformly at random j items in ti to t’i Consider every item a not in ti, add a to t’i with a given probability pm Run on real data Privacy breach of level <= 50% P[a in ti|A in t’i] <= 50% Accuracy = # true positive / (# found itemsets) Set 1 Itemset True True Size Itemset Positive 1 65 65 2 228 212 3 22 18 False False Accuracy Drops Positive 0 16 4 0 28 5 100% 88% 78% Accuracy Set 2: Itemset True True Size Itemset Positive 1 266 254 2 217 195 3 48 43 False False Accuracy Drops Positive 12 22 5 31 45 26 89% 81% 62% Problems Estimated counts of large itemsets varies "beer and diaper" story Lower accuracy of association rules customers who buy diapers tend also to buy beer hard to believe some strange rules Expensive to make wrong decision Supermarket: layout design Health center: identify new disease Security concerns Individual transaction is protected Private association rules can be estimated by other parties Adversary actions may be based on found association rules Encryption Problem formulation Let the set of transactions be T = {t1, t2, … tN} I is the entire set of items All ti is a subset of I Transform T to T’ = {t’1, t’2, … t’N} A third party mines in T’ and gets AR’ Transform AR’ to AR Architecture Association Rules Association Rules Mappings DB Transformer DB Encryption To protect a message, simple encryption can be applied Association rule encryption “GOOD DOG” can be encrypted as “PLLX XLP” 752 => 891? Milk => Bread Transaction encryption <8, 69, 153, 756>? <Cheese, Fork, Ice-cream, Clock> Simple scheme Encryption For every transaction ti For every item x in ti Add f(x) to t’i where f is a bi-jective function Decryption For every association rule ri For every item y in r Replace y by f-1(y) Problems with simple encryption They are easy to crack “PLLX XLP” 26P3 Association rules combinations, with at least one vowel # Bread > # Car # association rules, # large itemsets are disclosed Solution Use a more complex scheme Fake items Probability to make a correct guess of a single mapping = 1 / |I| Randomly add some fake items to each transaction Decrease the above probability to 1 / (|I| + |F|) One-to-n Mapping Originally, we are “one-to-one” mapping One item One item A1 B2 C3 We form “one-to-n” mapping A 1, 4, 5 B2 C 3, 5 Greatly increase the number of possible mapping of an item |I|+|F|C1 + |I|+|F|C2 + … |I|+|F|C|F| Example transformation T= {A} {B} {C} {A, B} {A, C} {B, C} {A, B, C} T’ = A 1, 4, 5 B2 C 3, 5 {1, 4, 5} {2} {3, 5} {1, 2, 4, 5} {1, 3, 5} {2, 3, 5} {1, 2, 3, 4, 5} Limitation on the mapping f For any item x, there does not exist items y1, y2, …, yk (x ≠ y1 ≠ … ≠ yk ) Such that f(x) subset in f(y1) U f(y2) U…f(yk) Consider an example A 1, 2 B 2, 3 C 3, 4 AC 1, 2, 3, 4 ABC 1, 2, 3, 4 Limitation on the mapping f For any item x f(x) – Ui != x, i in I f(i) != empty Every item must map to something unique Mapping generation – Item Extend Initialize every item to map to something unique I’ For every item x in IE Randomly pick some mappings Extend each mapping by x Example run A1 B2 C3 IE = {4, 5} Considering item 4 A1 B2 C3 Pick A A 1, 4 B2 C3 Considering item 5 A1 B2 C3 Pick A, C A 1, 4, 5 B2 C 3, 5 Item Extend Every item must map to something unique A 1, 4, 5 Say 1 is unique to f(A) B2 C 3, 5 suppT(A) = suppT’(1) For a transaction t without item A Add a subset of unique mapping set to t’ with some probability {1, 4} is unique mapping set in f(A) {}, {1}, {4}, {1, 4} may be added Fake items again Now, every item in t’i must be in some mappings Randomly add some fake items in |F| to each transaction Mapping f: I -> |I’| U |IE| U |F| |I’|: core “unique” items |IE|: expanding items |F|: fake items Basic transformation framework For each transaction t For each item x in t For item i in I - t Add f(x) to t’ Add randomly subset of unique mapping set of f(i) to t’ For item f in F Toss a biased coin for each item, add f to t’ if head (probability should be difference) Recovering association rules Given an encrypted rule in AR’ If there exists i1, i2, …, im in I Uk=1m f(ik) = X And there exists j1, j2, …, jn in I r’: X => Y Uk=1n f(jk) = XUY r: {i1, i2, … im} => {j1, j2, …, jn} – {i1, i2, … im} is a rule in AR Otherwise, the rule is not correct Example Given 1 => 4 (rejected) 2 => 1, 5 (rejected) 2 => 1, 3, 5 (rejected) 2 => 1, 3, 4, 5 (B => AC) 2, 3, 5 => 1, 4 (BC => A) 2, 3, 5 => BC 1, 2, 3, 4, 5 => ABC Mapping f A 1, 4, 5 B2 C 3, 5 Correctness Proposition For any item x, y, f is transformation mapping suppT(x) = suppT’(f(x)) suppT(xUy) = suppT’(f(x) U f(y)) For any itemset X, Y, F is the transformation mapping suppT(X) = suppT’(F(X)) suppT(XUY) = suppT’(F(X) U F(Y)) No false drops and false positives Summary Generation of mappings Transformation of transactions One-to-n mappings Item Extend Mapping f(x) Subsets of unique mapping set Fake items Recovering association rules Reverse mappings and filtering Test run # Items = 1k, |T| = 1k Without transformation One rule Time: 8s Item Extend 147 rules Total times: 26s Mappings generation and transformation: 219ms Future Work Define parameters to the problem Size of |IE| Size of |F| Give a clear measure of security Give a clear measure of overhead Correctness of association rules Query execution proof Result verification The End Choosing probability Uniform distribution or any fixed distribution give patterns which may be easily identified Random probability distribution {}: 70%, {1}: 5%, {4}: 15%, {1, 4}: 20% Storage: need additional storage Back Algorithm for transformation Transformation is the most costly process Execution time linear to database size |T| Should be as fast as possible Optimization Mapping Retrieval For an item x, use a hash table to retrieve the mapping, h(x) Adding fake items First randomly (according to the probability of adding items) determine the number of items to add Randomly pick in the set (non-uniform distribution) Gives a much shorter runtime in average Choice of mapped items Acceptable as long as it is not easy to identify I’, IE, F One way is to use random permutation of first |I| + |IE| + |F| natural numbers 1 2 … |I|+|IE|+|F| * (1+ δ) First |I| numbers are mapped to |I’| Next |IE| numbers are IE Cut and paste randomization One case of select-a-size randomization The way to perform selection of j Given an integer Km > 0 Randomly choose j in [0, Km] If (j > m) Set j = m Overall input parameters Km pm Effects on support Support of A in T’ A in t, without replaced A’ in t, randomly add A Support of AB in T’ AB in t, without replaced A and B AB’ in t, randomly add B A’B in t, randomly add A A’B’ in t, randomly add A and B Estimating original support Support of A in T, x Support of A in T’, y x * P(A remains in original transaction) + (|DB| - x) * pm = y Support of AB in T Support of AB in T’ Support of AB’, A’B in T’ Support of A’B’ in T’ Apriori property Suppose m = 2 for all t in T |T| = 10, |I| = {A, B} pm= 0, j = 1, Support of B in T’ suppT’ (B)= 0 E(suppT(B)) = 0 suppT’ (A)= 10 suppT’ (AB)= 0 E(suppT(AB)) = suppT’ (A) * 1 = 10 Apriori property An expected large itemset may have an expected small sub-set But generally the support of subsets are not too small Instead of using the support threshold to filter all small candidates, use a smaller value Apriori algorithm Generate candidate sets Scan database for counts Recover the predicted support Discard candidates with support smaller than <= candidate limit Save for output candidates with support >= support threshold Apriori_gen(remaining candidate) Candidate limit A high value A small value Increase numbers of false drops Poor correctness Increase number of candidate sets High running time Experiment Support threshold: smin estimated s.d.: δ smin – δ is found to be a good value Other applications Outsourced transaction database (secure) storage Outsourced association rule mining using data stream Secure distributed association rule mining with third party miner Outsourced database with association rule mining service Association Rules Association Rules Mappings Transformer Transactions Query DB