Towards Publishing Recommendation Data With Predictive Anonymization
Chih-Cheng Chang†, Brian Thompson†, Hui Wang‡, Danfeng Yao†
ACM Symposium on Information, Computer, and Communications Security (ASIACCS 2010)
April 13, 2010

Outline
• Introduction
• Privacy in recommender systems
• Predictive Anonymization
• Experimental results
• Conclusions and future work

Motivation
• There is an inevitable trend towards data sharing: medical records, social networks, web search data, online shopping and ads.
• Databases contain sensitive information.
• There is a growing need to protect privacy.

Privacy in Relational Databases
Identifiers (Name) appear alongside sensitive information (Disease):

Name           Age  Gender  Zip Code  Disease
Joe Smith      52   Male    08901     Cancer
John Doe       24   Male    08904     ---------
Mary Johnson   45   Female  08854     Asthma
Janie McJonno  59   Female  08904     Cancer
Johnny Walker  76   Male    08854     Diabetes

Privacy in Relational Databases: "Pseudo-identifiers"
Replacing names with pseudonyms is not enough:

Name         Age  Gender  Zip Code  Disease
Person 0001  52   Male    08901     Cancer
Person 0002  24   Male    08904     ---------
Person 0003  45   Female  08854     Asthma
Person 0004  59   Female  08904     Cancer
Person 0005  76   Male    08854     Diabetes

87% of the U.S. population can be uniquely identified by DOB, gender, and zip code! [S00]

Approaches to Achieving Privacy
1. Statistical databases
   • Only aggregate queries are allowed (e.g., "What is the average salary?").
   • Differential privacy [Dinur-Nissim '03, Dwork '06]: adaptively add random noise to each output so that the querier cannot determine whether a given user is in the database.
   • Quality degrades over multiple queries.
2. Publishing of anonymized databases
   • No restriction on how the data is utilized, which is good for complex data-mining applications.
   • But how do we address the privacy concerns?
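The noise-adding idea behind differential privacy can be sketched as follows. This is a minimal illustration, not the mechanism from any paper cited here: the salary table, the clamping bounds, and the function name are all hypothetical.

```python
import math
import random

def noisy_average(values, lower, upper, epsilon):
    """Answer an average query with Laplace noise calibrated to the
    query's sensitivity, so the answer barely depends on whether any
    single user's record is present (the differential-privacy idea).
    Values are clamped to [lower, upper], so the sensitivity of the
    average over n records is (upper - lower) / n."""
    clamped = [min(max(v, lower), upper) for v in values]
    true_avg = sum(clamped) / len(clamped)
    scale = (upper - lower) / len(clamped) / epsilon
    # Sample Laplace(0, scale) by inverting its CDF.
    u = random.random() - 0.5
    noise = -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))
    return true_avg + noise

# Hypothetical salary table; smaller epsilon means more noise.
salaries = [52000, 48000, 61000, 75000, 39000]
print(noisy_average(salaries, lower=0, upper=200000, epsilon=0.5))
```

Each answered query spends part of the privacy budget, which is why answer quality degrades over multiple queries, as noted above.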
Anonymization of Databases
Techniques:
• Perturbation: replace values with nearby ones.

  Name          Age
  Joe Smith     52 → 53
  John Doe      24 → 26
  Mary Johnson  45 → 42

• Swapping: exchange attribute values (e.g., ages) between records.
• Generalization: replace values with coarser categories.

  Name          Age
  Joe Smith     52 → 50s
  John Doe      24 → 20s
  Mary Johnson  45 → 40s

Def. A database entry is k-anonymous if ≥ k-1 other entries match it identically on the insensitive attributes. [SS98]

The Generalization Approach

Name         Age        Gender  Zip Code       Disease
Person 0001  32 → <50   Male    08901 → 089**  Cancer
Person 0002  24 → <50   Male    08904 → 089**  ---------
Person 0003  59 → >50   Female  08854 → 088**  Asthma
Person 0004  45 → <50   Male    08904 → 089**  Diabetes
Person 0005  76 → >50   Female  08854 → 088**  Cancer
Person 0006  61 → >50   Female  08854 → 088**  AIDS

Recommender Systems
• Users register for the service.
• After buying a good, they submit a rating for it.
• They then get recommendations based on their own and others' ratings.

Recommender Systems: the Netflix Challenge
(figure: a sparse user-movie ratings matrix over movies such as Alien, Batman, Closer, Dogma, Evita, X-rated, and Gladiator, with most cells unknown)

The Netflix Challenge: "anonymized" Netflix data is released to the public.
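Before returning to the Netflix story: the generalization step and the k-anonymity definition above can be sketched in a few lines. The record layout, the decade age bands, and the 3-digit zip prefixes are illustrative assumptions modeled on the slide's tables.

```python
from collections import Counter

def generalize(record):
    """Coarsen quasi-identifiers as on the slide: exact ages become
    decade bands ("50s") and zip codes keep only a 3-digit prefix."""
    return (f"{(record['age'] // 10) * 10}s",
            record["gender"],
            record["zip"][:3] + "**")

def is_k_anonymous(records, k):
    """k-anonymity: every generalized quasi-identifier combination must
    be shared by at least k records, i.e., each record matches >= k-1
    others identically on the insensitive attributes."""
    counts = Counter(generalize(r) for r in records)
    return all(c >= k for c in counts.values())

table = [
    {"age": 52, "gender": "Male",   "zip": "08901"},
    {"age": 24, "gender": "Male",   "zip": "08904"},
    {"age": 45, "gender": "Female", "zip": "08854"},
    {"age": 59, "gender": "Female", "zip": "08904"},
]
print(generalize(table[0]))        # ('50s', 'Male', '089**')
print(is_k_anonymous(table, 2))    # False: every group here has size 1
```

In practice the generalization hierarchy is chosen so that the check passes for the desired k while distorting the data as little as possible.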
$1 million prize for the best movie-rating prediction algorithm.

Question: Is privacy really protected?

Privacy in Recommender Systems
Narayanan and Shmatikov [NS08] exploited external information to re-identify users in the released Netflix Challenge dataset. Privacy breach!

News Timeline
Oct. 2006  Netflix Challenge announced
May 2008   Narayanan and Shmatikov publish their attack
Aug. 2009  Plans announced for Netflix Challenge 2
Dec. 2009  Netflix users file lawsuits
Mar. 2010  Challenge 2 plans canceled due to privacy concerns (and an FTC investigation)

How can we enable sharing of recommendation data without compromising users' privacy?

Challenges in Anonymizing Recommender Systems
• All data may be considered "sensitive" by users.
• All data could be used as quasi-identifiers.
• Data sparsity helps re-identification attacks and makes anonymization difficult. [NS08]
• Scalability: the Netflix matrix has 8.5 billion cells!

Attack Models
We represent the recommendation database as a labeled bipartite graph: users on one side, movies on the other, with edges labeled by ratings.
(figure: users Ben and Tim and pseudonymous nodes 0001-0004 linked to Star Wars, Godfather, English Patient, and Pretty in Pink, illustrating the two attacks)
• Structure-based attack: the adversary knows which movies the target has rated.
• Label-based attack: the adversary also knows the ratings themselves.

Privacy Models
• Node re-identification privacy: it should not be possible to re-identify individuals.
• Link existence privacy: it should not be possible to infer whether a user has seen a particular movie.
Our approach, Predictive Anonymization, provides these notions of privacy against both the structure-based and the label-based attacks.

Predictive Anonymization
Our solution takes a 3-step approach:
1. Use predictive padding to reduce sparsity.
2. Cluster users into groups of size k.
3. Perform homogenization by assigning all users in each group the same ratings.
This achieves k-anonymity!

Predictive Anonymization: padding

           Alien  Batman  Closer  Dogma  Evita  X-rated  Gladiator
User 0001  3      4       5       3      2      1        4
User 0002  3      3       5       2      3      1        5
User 0003  2      2       3       5      2      3        3
User 0004  3      2       3       4      1      3        2

• We want to cluster users, but there is not enough information due to data sparsity.
• Solution: fill the empty cells with predicted values.
• Cluster users based on similar tastes, not necessarily on similar lists of rated movies.

Predictive Anonymization: homogenization
The final step, homogenization, can be done in one of several ways. We describe two methods, "padded" and "pure" homogenization.

"Padded Homogenization"
(figure: clusters {0001, 0002} and {0003, 0004} connected to Star Wars, Godfather, English Patient, and Pretty in Pink, with averaged edge labels)
• All edges are added to the recommendation graph.
• Each cluster is averaged using the padded data.
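The three steps above can be sketched end to end. This is a minimal illustration assuming NumPy: the tiny ratings matrix, the fixed SVD rank, the greedy nearest-neighbor grouping, and the function name are all hypothetical simplifications, not the paper's actual padding and clustering algorithms.

```python
import numpy as np

def predictive_anonymize(R, k, rank=2):
    """R: users x movies matrix with np.nan for unrated cells.
    Step 1 (predictive padding): fill missing cells via a low-rank
    SVD approximation of the mean-filled matrix.
    Step 2 (clustering): greedily group users into clusters of size k
    by distance between padded rating vectors.
    Step 3 (homogenization): give every user in a cluster the cluster's
    average padded ratings ("padded homogenization")."""
    filled = np.where(np.isnan(R), np.nanmean(R), R)
    U, s, Vt = np.linalg.svd(filled, full_matrices=False)
    padded = (U[:, :rank] * s[:rank]) @ Vt[:rank, :]          # step 1

    unassigned = list(range(R.shape[0]))
    out = np.empty_like(padded)
    while unassigned:                                         # step 2
        seed = unassigned.pop(0)
        nearest = sorted(unassigned,
                         key=lambda j: np.linalg.norm(padded[j] - padded[seed]))
        cluster = [seed] + nearest[:k - 1]
        unassigned = [j for j in unassigned if j not in cluster]
        out[cluster] = padded[cluster].mean(axis=0)           # step 3
    return out

# Hypothetical 4-user, 3-movie example with k = 2:
R = np.array([[5, np.nan, 1],
              [4, np.nan, 2],
              [1, 5, np.nan],
              [2, 4, np.nan]])
anon = predictive_anonymize(R, k=2)
```

After this pipeline every published rating vector is shared by at least k users, which is what defeats node re-identification.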
"Pure Homogenization"
(figure: the same clusters, but with only the necessary edges to Star Wars, Godfather, English Patient, and Pretty in Pink, labeled by averages of the original ratings)
• Only necessary edges are added to the graph.
• Each cluster is averaged using the original data.

Experiments
• Performed on the Netflix Challenge dataset: 480,189 users, 17,770 movies, and more than 100 million ratings.
• Singular value decomposition (SVD) is used for padding and prediction.
• We compute the root mean squared error (RMSE) for a test set of 1 million ratings on the original and the anonymized data:

  RMSE = sqrt( (1/n) Σ (predicted − actual)² )

Analysis: Prediction Accuracy

Experiment Series             RMSE
Original Data                 0.95185
Padded Anonymization (k=5)    0.95970
Padded Anonymization (k=50)   0.95871
Pure Anonymization (k=5)      2.36947
Pure Anonymization (k=50)     2.37710

• Padded Anonymization preserves prediction accuracy.
• However, sparsity is eliminated, which affects the utility of the published dataset for data-mining applications.

Summary
Naive anonymization, padded Predictive Anonymization, and pure Predictive Anonymization are compared along two utility criteria (prediction accuracy; support for complex data mining) and two privacy criteria (node re-identification privacy; link existence privacy).

Conclusions
• We have formalized privacy and attack models for recommender systems.
• Our solutions show that privacy-preserving publishing of anonymized recommendation data is feasible.
• More work is required to find a practical solution that satisfies real-world privacy and utility goals.

Future Work
• Investigate the use of differential privacy-like guarantees for recommendation databases.
• Analyze how to protect against more complex attacks with greater background knowledge.
• Evaluate the utility of anonymized recommendation data for advanced data-mining applications.

Thank you!