Maintaining Data Privacy in Association Rule Mining
VLDB 2002
Authors: Shariq J. Rizvi, Jayant R. Haritsa
Speaker: Minghua ZHANG
Oct. 11, 2002

Content
- Background
- Problem framework
- MASK --- distortion part
- MASK --- mining part
- Performance
- Conclusion

Background
In data mining, the accuracy of the input data is very important for obtaining valuable mining results. However, in real life there are many reasons that lead to inaccurate data. One example is that users deliberately provide wrong information to protect their privacy (age, income, illness, etc.).
Problem: how to protect user privacy while obtaining accurate mining results at the same time?

Background (cont'd)
Privacy and accuracy are contradictory in nature, so a compromise is more feasible: satisfactory (not 100%) privacy together with satisfactory (not 100%) accuracy. This paper studies the problem in the context of mining association rules.

Overview of the Paper
The authors propose a scheme, MASK (Mining Associations with Secrecy Konstraints). Major idea of MASK:
- Apply a simple probabilistic distortion to the original data. The distortion can be done at the user's machine.
- The miner tries to find accurate mining results, given two inputs: the distorted data, and a description of the distortion procedure.

Problem Framework
Database model:
- Each customer transaction is a record in the database.
- A record is a fixed-length sequence of 1's and 0's. E.g., for market-basket data: the length of the record is the total number of items sold by the market; a 1 means the corresponding item was bought in the transaction, a 0 means it was not.
- The database can therefore be regarded as a two-dimensional boolean matrix.

Problem Framework (cont'd)
The matrix is very sparse. Why not use itemlists instead? Because the data will be distorted, and after distortion it will not be as sparse as the original (true) data.
Mining objective: find frequent itemsets, i.e., itemsets whose appearance (support) in the database exceeds a threshold.
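The database model above can be sketched in a few lines of code. This is an illustrative example only (the item names and helper function are invented for illustration, not taken from the paper):

```python
# A minimal sketch of the boolean-matrix database model:
# rows are customer transactions, columns are items.
transactions = [
    {"bread", "milk"},          # transaction 1
    {"milk", "eggs", "beer"},   # transaction 2
    {"bread"},                  # transaction 3
]
items = sorted({item for t in transactions for item in t})  # column order

# Each record is a fixed-length 0/1 vector over the full item universe.
matrix = [[1 if item in t else 0 for item in items] for t in transactions]

# Support of an itemset = fraction of records containing all its items.
def support(itemset, matrix, items):
    cols = [items.index(i) for i in itemset]
    hits = sum(all(row[c] for c in cols) for row in matrix)
    return hits / len(matrix)

print(support({"bread"}, matrix, items))  # 2 of 3 records contain bread
```

Note how sparse the matrix becomes once the item universe is large (1000 items in the paper's synthetic dataset) while a typical transaction contains only a handful of items.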
MASK --- Distortion Part
Distortion procedure:
- Represent a customer record by a random vector.
- Original record: X = {Xi}, where Xi = 0 or 1.
- Distorted record: Y = {Yi}, where Yi = 0 or 1, with
  Yi = Xi with probability p, and Yi = 1 - Xi with probability 1 - p.

Quantifying Privacy
Privacy metric: the probability of reconstructing the true data. With what probability can a given 1 or 0 in the true matrix be reconstructed? Consider each individual item and calculate its reconstruction probability:
- Let si = Prob(a random customer C bought the i-th item) = the true support of item i.
- The probability of correctly reconstructing a '1' in a random item i is:
  R1(p, si) = si*p^2 / (si*p + (1-si)*(1-p)) + si*(1-p)^2 / (si*(1-p) + (1-si)*p)

Reconstruction Probability
Reconstruction probability of a '1' averaged across all items:
  R1(p) = (SUMi si*R1(p, si)) / (SUMi si)
Let s0 be the average support of an item. Replacing si by s0, we get:
  R1(p) = s0*p^2 / (s0*p + (1-s0)*(1-p)) + s0*(1-p)^2 / (s0*(1-p) + (1-s0)*p)

Reconstruction Probability (cont'd)
Relationship between R1(p) and p, s0. Observations:
- R1(p) is high when p is near 0 or 1, and lowest when p = 0.5.
- The curves become flatter as s0 decreases.

Privacy Measure
- The reconstruction probability of a '0', R0(p), is likewise a function of p and s0.
- The total reconstruction probability is R(p) = a*R1(p) + (1-a)*R0(p), where a is a weight parameter.
- Privacy: P(p) = (1 - R(p)) x 100.

Privacy Measure (cont'd)
Privacy vs. p (plot of P(p) for s0 = 0.01). Observations:
- For a given value of s0, the curve shape is fixed; the value of a determines the absolute value of privacy.
- The privacy is nearly constant over a large range of p. This provides flexibility in choosing a p that minimizes the error in the later mining part.
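The distortion procedure and the R1 formula above can be sketched as follows (a minimal illustration of the slides' formulas; the function names are my own):

```python
import random

def distort(record, p, rng=random):
    """MASK distortion: keep each bit with probability p,
    flip it with probability 1 - p."""
    return [x if rng.random() < p else 1 - x for x in record]

def r1(p, s):
    """Probability of correctly reconstructing a '1' for an item
    with true support s, as given on the slides."""
    return (s * p**2 / (s * p + (1 - s) * (1 - p))
            + s * (1 - p)**2 / (s * (1 - p) + (1 - s) * p))

# R1(p) is lowest at p = 0.5 (where it equals s) and grows as p
# approaches 0 or 1 -- i.e., less distortion means less privacy.
for p in (0.5, 0.7, 0.9):
    print(p, round(r1(p, 0.01), 4))
```

For sparse data (small s0), the curve is quite flat around p = 0.5, which is what gives the flexibility mentioned above in choosing p.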
MASK --- Mining Part
How to estimate the accurate supports of itemsets from a distorted database? Remember that the miner knows the value of p.
- Estimating 1-itemset supports
- Estimating n-itemset supports
- The whole mining process

Estimating 1-itemset Supports
Symbols:
- T: the original (true) matrix; D: the distorted matrix; i: a random item.
- C1T and C0T: the number of 1's and 0's in column i of T.
- C1D and C0D: the number of 1's and 0's in column i of D.
From the distortion method, in expectation:
- C1D = C1T*p + C0T*(1-p)
- C0D = C0T*p + C1T*(1-p)
Let CT = [C1T, C0T], CD = [C1D, C0D], and M = [[p, 1-p], [1-p, p]]. Then CD = M*CT, so CT = M^-1*CD.

Estimating n-itemset Supports
Still use CT = M^-1*CD to estimate supports. Define:
- CkT: the number of records in T whose bits for the given itemset have the binary form of k. E.g., for a 3-itemset containing the first 3 items, CT has 2^3 = 8 rows, and C3T is the number of records in T of form {0,1,1,...}.
- Mi,j = Prob(a record of form j in T becomes form i in D). E.g., M7,3 = p^2*(1-p), the probability of C011T -> C111D (one bit flipped, two bits kept).

Mining Process
Similar to the Apriori algorithm, with differences:
- E.g., when counting supports of 2-itemsets, Apriori only needs to count the number of records that have value '1' for both items, i.e., of form "11". MASK has to keep track of all 4 combinations (00, 01, 10, 11) for the corresponding items, because C(2^n - 1)T is estimated from C0D, C1D, ..., C(2^n - 1)D.
- Hence MASK requires more time and space than Apriori.
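The n-itemset estimation step above can be sketched with NumPy. This is an illustrative reconstruction, not the paper's implementation: M[i, j] = p^(n-d) * (1-p)^d, where d is the Hamming distance between the bit patterns i and j (so M[7,3] = p^2*(1-p), matching the example above).

```python
import numpy as np

def distortion_matrix(n, p):
    """Build the 2^n x 2^n matrix M for an n-itemset:
    M[i, j] = Prob(true pattern j is distorted into pattern i)."""
    size = 2 ** n
    M = np.empty((size, size))
    for i in range(size):
        for j in range(size):
            d = bin(i ^ j).count("1")        # number of bits that must flip
            M[i, j] = p ** (n - d) * (1 - p) ** d
    return M

def estimate_true_counts(cd, p):
    """cd[k] = number of distorted records whose bits for the itemset
    form the binary pattern k. Returns the estimated true counts CT
    by solving CD = M * CT."""
    n = int(np.log2(len(cd)))
    M = distortion_matrix(n, p)
    return np.linalg.solve(M, np.asarray(cd, dtype=float))

# Sanity check: with p = 1 (no distortion), M is the identity,
# so the estimated counts equal the observed counts exactly.
print(estimate_true_counts([10, 5, 3, 2], p=1.0))  # [10. 5. 3. 2.]
```

The estimated support of the itemset itself is then the last entry (pattern 11...1) divided by the database size; the other 2^n - 1 entries are the bookkeeping overhead that makes MASK costlier than plain Apriori.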
Some optimizations (omitted).

Performance
Data sets:
- Synthetic database: 1,000,000 records; 1000 items; s0 = 0.01.
- Real dataset: click-stream data of a retailer web site; 600,000 records; about 500 items; s0 = 0.005.

Performance (cont'd)
Error metrics:
- Right class, wrong support: for infrequent itemsets the error doesn't matter; for frequent itemsets it is measured by the support error.
- Wrong class: identity error, split into false positives (+) and false negatives (-).

Performance (cont'd)
Parameters:
- sup = 0.25%, 0.5%
- p = 0.9, 0.7
- a = 1: only concerned with the privacy of 1's.
- r = 0%, 10%: coverage may be more important than precision, so a smaller support threshold is used to mine the distorted database. Support used to mine D = sup x (1 - r).

Performance (cont'd)
Synthetic dataset, Experiment 1: p = 0.9 (85% privacy), sup = 0.25%
(columns: Level, |F|, support error, false negatives (-), false positives (+))

r = 0%:                          r = 10%:
Level  |F|   SupErr  -     +     Level  |F|   SupErr  -     +
1      689   3.31    1.16  1.16  1      689   3.37    0.73  3.19
2      2648  3.58    4.49  5.14  2      2648  3.73    0.19  19.68
3      1990  1.71    4.57  2.16  3      1990  1.76    0     28.09
4      1418  1.28    3.67  0.22  4      1418  1.29    0     25.81
5      730   1.27    5.89  0     5      730   1.32    0     16.44
6      212   1.36    4.25  5.19  6      212   1.37    0     25.47
7      35    1.40    0     0     7      35    1.40    0     51.43
8      3     0.99    0     0     8      3     0.99    0     66.67

Performance (cont'd)
Synthetic dataset, Experiment 2: p = 0.9 (85% privacy), sup = 0.5%

r = 0%:                          r = 10%:
Level  |F|   SupErr  -     +     Level  |F|   SupErr  -     +
1      560   2.60    1.25  0.89  1      560   2.66    0.18  4.29
2      470   2.13    5.53  4.89  2      470   2.21    0     44.89
3      326   1.22    3.07  0.31  3      326   1.26    0     42.64
4      208   1.34    1.44  0.48  4      208   1.35    0     51.44
5      125   1.81    0     0     5      125   1.81    0     22.4
6      43    2.62    0     0     6      43    2.62    0     18.60
7      10    3.44    10    0     7      10    3.47    0     10
8      1     4.50    0     0     8      1     4.50    0     0

Performance (cont'd)
Synthetic dataset, Experiment 3: p = 0.7 (96% privacy), sup = 0.25%, r = 10%

Level  |F|   SupErr  -      +
1      689   10.16   2.61   7.84
2      2648  25.23   19.52  630.93
3      1990  26.93   42.86  172.71
4      1418  29.14   65.94  0.35
5      730   28.47   79.32  0
6      212   36.25   84.91  0
7      35    51.37   85.71  0
8      3     -       100    0

Performance (cont'd)
Real database, Experiment 1: p = 0.9 (89% privacy), sup = 0.25%

r = 0%:                          r = 10%:
Level  |F|   SupErr  -      +    Level  |F|   SupErr  -     +
1      249   5.89    4.02   2.81 1      249   6.12    1.2   0.40
2      239   3.87    6.69   7.11 2      239   4.04    1.26  23.43
3      73    2.60    10.96  9.59 3      73    2.93    0     45.21
4      4     1.41    0      25.0 4      4     1.41    0     75

Performance (cont'd)
Real database, Experiment 2: p = 0.9 (89% privacy), sup = 0.5%

r = 0%:                          r = 10%:
Level  |F|   SupErr  -     +     Level  |F|   SupErr  -     +
1      150   4.23    0.67  4.67  1      150   4.27    0     8
2      45    2.42    2.22  4.44  2      45    2.56    0     37.77
3      6     1.07    0     16.66 3      6     1.07    0     66.66

Performance (cont'd)
Real database, Experiment 3: p = 0.7 (97% privacy), sup = 0.25%, r = 10%

Level  |F|   SupErr  -      +
1      249   18.96   7.23   15.66
2      239   33.59   20.08  1907.53
3      73    32.87   30.14  2308.22
4      4     7.55    50     400

Performance (cont'd)
Summary:
- Good privacy and good accuracy can be achieved at the same time by careful selection of p.
- In the experiments, p around 0.9 is the best choice: a smaller p leads to much error in the mining results, while a larger p reduces privacy greatly.

Conclusion
This paper studies the problem of achieving satisfactory privacy and accuracy simultaneously for association rule mining. A probabilistic distortion of the true data is proposed. Privacy is measured by a formula that is a function of p and s0.

Conclusion (cont'd)
A mining process is put forward to estimate the real supports from the distorted database. Experimental results show that there is a small window of p (near 0.9) that achieves good accuracy (90%+) and privacy (80%+) at the same time.

Related Works
On preventing sensitive rules from being inferred by the miner (output privacy):
- Y. Saygin, V. Verykios and C. Clifton, "Using Unknowns to Prevent Discovery of Association Rules", ACM SIGMOD Record, vol. 30, no. 4, 2001.
- M. Atallah, E. Bertino, A. Elmagarmid, M. Ibrahim and V. Verykios, "Disclosure Limitation of Sensitive Rules", Proc. of IEEE Knowledge and Data Engineering Exchange Workshop, Nov. 1999.

Related Works (cont'd)
On input data privacy in distributed databases:
- J. Vaidya and C. Clifton, "Privacy Preserving Association Rule Mining in Vertically Partitioned Data", KDD 2002.
- M. Kantarcioglu and C. Clifton, "Privacy-preserving Distributed Mining of Association Rules on Horizontally Partitioned Data", Proc. of ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery, 2002.

Related Works (cont'd)
Privacy-preserving mining in the context of classification rules:
- D. Agrawal and C. Aggarwal, "On the Design and Quantification of Privacy Preserving Data Mining Algorithms", PODS 2001.
A recent paper that also appeared in 2002:
- A. Evfimievski, R. Srikant, R. Agrawal and J. Gehrke, "Privacy Preserving Mining of Association Rules", KDD 2002.

More Information
Distortion procedure, restated: Yi = Xi XOR ri', where ri' is the complement of ri, and ri is a random variable with density f(r) = bernoulli(p) (0 <= p <= 1).

More Information (cont'd)
Reconstruction error bounds (1-itemsets):
- With probability PE(m, p, e*(2p-1)/2) x PE(n, p, e*(2p-1)/2), the error is less than e, where
  PE(n, p, e) = SUM(r = np-e to np+e) C(n, r) * p^r * (1-p)^(n-r)
  (the symbol e stands in for the tolerance whose original Greek letter was lost in extraction)
- n: the real support count of the item; m: dbsize - n.

Derivation of the reconstruction probability of a '1' in a random item i:
- si = the true support of item i = Pr(a random customer C bought the i-th item); Xi = the original entry for item i; Yi = the distorted entry for item i.
- The probability of correctly reconstructing a '1' in a random item i is:
  R1(p, si) = Pr{Yi=1 | Xi=1} x Pr{Xi=1 | Yi=1} + Pr{Yi=0 | Xi=1} x Pr{Xi=1 | Yi=0}
            = si*p^2 / (si*p + (1-si)*(1-p)) + si*(1-p)^2 / (si*(1-p) + (1-si)*p)
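The derivation above can be checked numerically: conditional on Xi = 1, the expected posterior probability that Xi = 1 given the observed Yi should match the closed form for R1. A small Monte Carlo sketch (illustrative; function names and parameter values are my own):

```python
import random

def simulate_r1(p, s, trials=200_000, rng=random.Random(0)):
    """Empirical estimate of R1(p, s): draw records, distort them, and
    average the Bayes posterior Pr{X=1 | Y} over the records with X = 1."""
    total, ones = 0.0, 0
    post1 = s * p / (s * p + (1 - s) * (1 - p))          # Pr{X=1 | Y=1}
    post0 = s * (1 - p) / (s * (1 - p) + (1 - s) * p)    # Pr{X=1 | Y=0}
    for _ in range(trials):
        x = 1 if rng.random() < s else 0
        y = x if rng.random() < p else 1 - x             # MASK distortion
        if x == 1:
            ones += 1
            total += post1 if y == 1 else post0
    return total / ones

def r1_closed_form(p, s):
    return (s * p**2 / (s * p + (1 - s) * (1 - p))
            + s * (1 - p)**2 / (s * (1 - p) + (1 - s) * p))

print(round(simulate_r1(0.9, 0.1), 3), round(r1_closed_form(0.9, 0.1), 3))
```

Expanding the expectation analytically gives p*Pr{X=1|Y=1} + (1-p)*Pr{X=1|Y=0}, which is exactly the two-term closed form on the slide.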