Privacy-preserving Data Mining: Current Research and Trends

Stan Matwin
School of Information Technology and Engineering, University of Ottawa, Canada
[email protected]

A Few Words about Our Research
• The University of Ottawa / Université d'Ottawa is the largest bilingual university in Canada
• Text Analysis and Machine Learning group
• Learning with prior knowledge
• Applied research and technology transfer
  – profiles of digital game players: active learning, co-training, instance selection
  – classification of medical abstracts
• Privacy-aware data mining

Plan
• What is (data) privacy?
• Privacy and data mining
• Privacy-preserving data mining: main approaches
  – anonymization
  – obfuscation
  – cryptographic hiding
• Challenges
  – a definition of privacy
  – mining mobile data
  – mining medical data

Privacy
• Freedom to be left alone
• … the ability to control who knows what about us [Westin; Moor] (cf. database views)
• Jeopardized by the Internet ("greased" data)
• A moral obligation for the community to find solutions

Privacy and Data Mining
• Individual privacy
  – nobody should know more about any entity after the data mining than they did before
  – approaches: data obfuscation, value swapping
• Organization privacy
  – protect knowledge about a collection of entities
  – individual entity values are not disclosed to all parties
• The results alone need not violate privacy

Basic Ideas
• Camouflage: obfuscation
• Hiding: k-anonymity

Why Naïve Anonymization Does Not Work
• The [Sweeney 01] experiment: purchased the voter registration list for Cambridge, MA (54,805 people)
• 69% of the people were unique on postal code and birth date; 87% US-wide with all three attributes (postal code, birth date, gender)
• Also, we do not know with what other data sources we may do joins in the future
• Solution: k-anonymization

Randomization Approach: Overview
• Alice's age 30 becomes 65 (30 + 35); her salary 70K becomes 20K
• Example: the records (30 | 70K | ...) and (50 | 40K | ...) pass through the randomizer and emerge as (65 | 20K | ...) and (25 | 60K | ...)
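The randomization step above can be sketched in a few lines; a minimal illustration assuming uniform noise in [-50, 50] (the slides do not specify the noise distribution, so this choice is an assumption):

```python
import random

def randomize(values, lo=-50, hi=50):
    # Publish value + noise, where the noise is drawn from a *known*
    # distribution Y (uniform here; the choice of Y is an assumption).
    return [v + random.randint(lo, hi) for v in values]

ages = [30, 50, 23, 68]
noisy = randomize(ages)

# Individual values are hidden, but each disclosed value stays within a
# known distance of the original, so the distribution remains estimable.
for orig, pub in zip(ages, noisy):
    assert abs(pub - orig) <= 50
```

Because the distribution of the noise is public, the miner can later estimate the distribution of the original values from the noisy ones, which is exactly the reconstruction problem discussed next.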
• The miner then reconstructs the distribution of age and the distribution of salary from the randomized records, and the classification algorithm is trained on the reconstructed distributions to produce the model

Reconstruction Problem
• Original values x1, x2, ..., xn are drawn from a probability distribution X (unknown)
• To hide these values, we add y1, y2, ..., yn drawn from a known distribution Y
• Given
  – x1 + y1, x2 + y2, ..., xn + yn
  – the probability distribution of Y
  estimate the probability distribution of X

Intuition (Reconstructing a Single Point)
• Use Bayes' rule for density functions
[Figure: given a randomized value V on the age axis, the original distribution for Age yields a probabilistic estimate of the original value of V]

Works Well
[Chart: number of people vs. age; the reconstructed distribution closely tracks the original, while the randomized one is much flatter]

Recap: Why Is Privacy Preserved?
• Individual values cannot be reconstructed accurately
• Only distributions can be reconstructed

Classification
• Naïve Bayes: assumes independence between attributes
• Decision tree: correlations are weakened by randomization, not destroyed

Decision Tree Example

Age | Salary | Repeat visitor?
----|--------|----------------
23  | 50K    | Repeat
17  | 30K    | Repeat
43  | 40K    | Repeat
68  | 50K    | Single
32  | 70K    | Single
20  | 20K    | Repeat

The induced tree: if Age < 25, predict Repeat; otherwise test Salary < 50K (yes: Repeat; no: Single).

Decision Tree Experiments
[Chart: accuracy (50 to 100) of the original, randomized, and reconstructed models on test functions Fn 1 to Fn 5, at 100% randomization level]
• 100% privacy: the attribute cannot be estimated (with 95% confidence) any closer than the entire range of the attribute

Issues
• For very high privacy, discretization will lead to a poor model
• Gaussian noise provides more privacy at higher confidence levels
• In fact, the randomization can be de-randomized using an advanced control-theory approach [Kargupta 2003]

Association Rule Mining Algorithm [Agrawal et al. 1993]

1. L1 = {large 1-itemsets}
2. for (k = 2; Lk−1 ≠ ∅; k++) do begin
3.   Ck = apriori-gen(Lk−1)
4.   for all candidates c ∈ Ck do begin
5.     compute c.count
6.   end
7.   Lk = {c ∈ Ck | c.count ≥ min-sup}
8. end
9. return L = ∪k Lk

c.count is the frequency count for a given itemset.
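The pseudocode above can be sketched in Python; a minimal in-memory version in which apriori-gen is the usual join-and-prune step (the transaction data is invented for illustration):

```python
from itertools import combinations

def apriori(transactions, min_sup):
    """Sketch of the Apriori algorithm [Agrawal et al. 1993]."""
    def count(itemset):
        # c.count: number of transactions containing the itemset
        return sum(1 for t in transactions if itemset <= t)

    items = {i for t in transactions for i in t}
    L = {}
    # L1 = large 1-itemsets
    Lk = {frozenset([i]) for i in items if count(frozenset([i])) >= min_sup}
    k = 1
    while Lk:
        L[k] = Lk
        # apriori-gen: join Lk with itself, then prune candidates that
        # have an infrequent k-subset
        cands = {a | b for a in Lk for b in Lk if len(a | b) == k + 1}
        cands = {c for c in cands
                 if all(frozenset(s) in Lk for s in combinations(c, k))}
        Lk = {c for c in cands if count(c) >= min_sup}
        k += 1
    return L  # L = union over k of Lk

txns = [frozenset(t) for t in [{'a','b'}, {'a','b','c'}, {'a','c'}, {'b','c'}]]
freq = apriori(txns, min_sup=2)
```

With these four transactions, every single item and every pair is frequent at min-sup = 2, but {a, b, c} appears only once and is pruned away, illustrating the level-wise structure of the algorithm.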
Key issue: to compute the frequency count, we need to access attributes that belong to different parties.

An Example
• c.count is the scalar (dot) product of the parties' attribute vectors
• Let A denote Alice's attribute vector and B denote Bob's attribute vector, e.g. A = (1, 1, 1, 1, 1) and B = (0, 1, 1, 1, 0)
• If AB is a candidate frequent itemset, then c.count = A · B = 3
• How can this computation be conducted across parties without compromising each party's data privacy?

Homomorphic Encryption [Paillier 1999]
• The privacy-preserving protocols are based on homomorphic encryption
• Specifically, we use the following additive homomorphism property:
  e(m1) × e(m2) × … × e(mn) = e(m1 + m2 + … + mn)
  where e is an encryption function, mi is the data to be encrypted, and e(mi) ≠ 0

Digital Envelope [Chaum 85]
• A digital envelope is a random number (or a set of random numbers) known only to the owner of the private data: the value V is hidden as V + R

The Objective
• Privacy, correctness, efficiency
• Solution: homomorphic encryption + digital envelopes

Frequency Count Protocol
• Assume Alice's attribute vector is A and Bob's attribute vector is B; each vector contains N elements
• Ai: the ith element of A; Bi: the ith element of B
• One of the parties is randomly chosen as the key generator, e.g. Alice, who generates a key pair (e, d) and an integer X > N; e and X are shared with Bob
• e(.) denotes encryption and d(.) denotes decryption

Step 1 (Alice): form the digital envelopes
  A1 + R1 × X, A2 + R2 × X, …, AN + RN × X
where R1, R2, …, RN is a set of random integers generated by Alice.

Step 2 (Alice): encrypt the envelopes, obtaining
  e(A1 + R1 × X), e(A2 + R2 × X), …, e(AN + RN × X)
and send them to Bob.

Step 3 (Bob): compute
  W1 = e(A1 + R1 × X) × B1
  W2 = e(A2 + R2 × X) × B2
  …
  WN = e(AN + RN × X) × BN
so that Bi = 0 ⇒ Wi = 0, and Bi = 1 ⇒ Wi = e(Ai + Ri × X).

Step 4 (Bob): multiply all the Wi whose Bi are not equal to 0.
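The additive homomorphism property above can be demonstrated with a toy Paillier instance; a sketch with deliberately tiny primes, for illustration only (real deployments use keys of 1024+ bits; this is not a secure implementation):

```python
from math import gcd
import random

# Toy Paillier key generation (tiny primes, illustration only).
p, q = 293, 433
n, n2 = p * q, (p * q) ** 2
g = n + 1
lam = (p - 1) * (q - 1) // gcd(p - 1, q - 1)   # lcm(p-1, q-1)

def L(u):
    return (u - 1) // n

mu = pow(L(pow(g, lam, n2)), -1, n)            # modular inverse, Python 3.8+

def e(m):
    # Encryption: c = g^m * r^n mod n^2, with r random and coprime to n.
    r = random.randrange(2, n)
    while gcd(r, n) != 1:
        r = random.randrange(2, n)
    return (pow(g, m, n2) * pow(r, n, n2)) % n2

def d(c):
    # Decryption: m = L(c^lam mod n^2) * mu mod n.
    return (L(pow(c, lam, n2)) * mu) % n

# e(m1) * e(m2) = e(m1 + m2): multiplying ciphertexts adds plaintexts.
assert d((e(7) * e(35)) % n2) == 7 + 35
```

Note that a Paillier ciphertext is never 0 modulo n², which is why the protocol can safely use Wi = 0 as the marker for Bi = 0.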
In other words, Bob computes the product of all non-zero Wi, i.e., W = ∏ Wi over those i with Wi ≠ 0. Renumbering so that the j non-zero terms come first,

  W = W1 × W2 × … × Wj
    = [e(A1 + R1 × X) × B1] × [e(A2 + R2 × X) × B2] × … × [e(Aj + Rj × X) × Bj]
    = [e(A1 + R1 × X) × 1] × [e(A2 + R2 × X) × 1] × … × [e(Aj + Rj × X) × 1]
    = e(A1 + R1 × X) × e(A2 + R2 × X) × … × e(Aj + Rj × X)
    = e(A1 + A2 + … + Aj + (R1 + R2 + … + Rj) × X)

where the last step uses the additive homomorphism property.

Step 5 (Bob): generate a random integer R′ and compute
  W′ = W × e(R′ × X) = e(A1 + A2 + … + Aj + (R1 + R2 + … + Rj + R′) × X)
(again by the homomorphism property), then send W′ to Alice.

The Final Step
• Alice decrypts W′ and reduces modulo X:
  c.count = d(e(A1 + A2 + … + Aj + (R1 + R2 + … + Rj + R′) × X)) mod X
• Since A1 + A2 + … + Aj ≤ N < X, and ((R1 + R2 + … + Rj + R′) × X) mod X = 0, she obtains A1 + A2 + … + Aj, the sum over exactly those Ai whose corresponding Bi equal 1, which is c.count = A · B

Privacy Analysis
• Goal: Bob never sees Alice's data values
  – all the information that Bob obtains from Alice is e(A1 + R1 × X), e(A2 + R2 × X), …, e(AN + RN × X)
  – since Bob does not know the decryption key d, he cannot recover Alice's original data values
• Goal: Alice never sees Bob's data values
  – the information that Alice obtains from Bob is W′ = e(A1 + A2 + … + Aj + (R1 + R2 + … + Rj + R′) × X), taken over those i with Bi = 1
  – Alice computes d(W′) mod X; she obtains only the frequency count and cannot learn Bob's original data values

Complexity Analysis
• Communication cost: α(N + 1), where N is the total number of transactions (the number of elements in each attribute vector) and α is the number of bits for each encrypted element; linear in the number of transactions
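The whole frequency count protocol can be sketched end to end; this reuses the same toy Paillier scheme as the e()/d() of the slides (tiny primes, illustration only) together with the five-element vectors from the example slide:

```python
from math import gcd
import random

# Toy Paillier as e()/d() (tiny primes, not secure). Alice holds lam, mu
# (the private key d); Bob needs only n, n2, g (the public key e) and X.
p, q = 293, 433
n, n2 = p * q, (p * q) ** 2
g = n + 1
lam = (p - 1) * (q - 1) // gcd(p - 1, q - 1)
L = lambda u: (u - 1) // n
mu = pow(L(pow(g, lam, n2)), -1, n)

def e(m):
    r = random.randrange(2, n)
    while gcd(r, n) != 1:
        r = random.randrange(2, n)
    return (pow(g, m, n2) * pow(r, n, n2)) % n2

def d(c):
    return (L(pow(c, lam, n2)) * mu) % n

A = [1, 1, 1, 1, 1]          # Alice's attribute vector (example slide)
B = [0, 1, 1, 1, 0]          # Bob's attribute vector
N = len(A)
X = N + 1                    # Alice's public integer, X > N

# Steps 1-2 (Alice): digital envelopes A_i + R_i*X, encrypted and sent.
R = [random.randrange(1, 100) for _ in range(N)]
sent = [e(A[i] + R[i] * X) for i in range(N)]

# Steps 3-5 (Bob): W_i = e(A_i + R_i*X) * B_i; multiply the non-zero W_i,
# then blind the product with a fresh envelope e(R'*X).
W = 1
for i in range(N):
    if B[i] == 1:
        W = (W * sent[i]) % n2
Rp = random.randrange(1, 100)
Wp = (W * e(Rp * X)) % n2

# Final step (Alice): decrypt and reduce modulo X; all R terms vanish.
c_count = d(Wp) % X
assert c_count == sum(a * b for a, b in zip(A, B))
```

The modular reduction works here because the blinded plaintext stays well below n for these parameters; in a real deployment the sizes of X and the random Ri would have to be chosen so that the envelope sum cannot wrap around n.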
Complexity Analysis (continued)
• Computational cost: 10N + 20 + g, where N is the total number of transactions and g is the computational cost of generating a key pair; again linear in the number of transactions

Other Privacy-Oriented Protocols
• Multi-party frequency count protocol [Zhan et al. 2005 (a)]
• Multi-party summation protocol [Zhan et al. 2005 (f)]
• Multi-party comparison protocol [Zhan et al. 2006 (a)]
• Multi-party sorting protocol [Zhan et al. 2006 (a)]

What about the Results of DM?
• Can DM results reveal personal information? In some cases, yes [Atzori et al. 05]
• Suppose an association rule is found:
  a1 ∧ a2 ∧ a3 ⇒ a4 [sup = 80, conf = 98.7%]
  This means sup({a1, a2, a3, a4}) = 80, and then
  sup({a1, a2, a3}) = sup({a1, a2, a3, a4}) / conf = 80 / 0.987 ≈ 81.05,
  so sup({a1, a2, a3}) = 81
• Therefore a1 ∧ a2 ∧ a3 ∧ ¬a4 has support 81 − 80 = 1, and identifies one person!
• Atzori et al. propose an approach called k-anonymous patterns and an algorithm (inference channels) which detects violations of k-anonymity
• The algorithm is computationally expensive
• We have a new approach which embeds k-anonymity into the concept-lattice association rule algorithm [Zaki, Ogihara 98]

Conclusion
• An important problem, and a challenge for the field
• Lots of creative work, but a lack of systematic approach
• Medical data is particularly sensitive, but it also makes de-identification easier: genotype-phenotype inferences, location-visit patterns, family structures, etc. [Malin 2005]
• There is no operational, agreed-upon definition of privacy: perhaps inspiration can be found in economics?