Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Privacy Cognizant Information Systems Rakesh Agrawal IBM Almaden Research Center joint work with Srikant, Kiernan & Xu GOAL ƒ Build information systems that ƒ protect the privacy and ownership of information ƒ do not impede the flow of information Drivers Policies and Legislations – U.S. and international regulations – Legal proceedings against businesses Consumer Concerns – Consumer privacy apprehensions continue to plague the Web … these fears will hold back roughly $15 billion in e-Commerce revenue.” Forrester Research, 2001 – Most consumers are “privacy pragmatists.” Westin Surveys Moral Imperative – The right to privacy: the most cherished of human freedom -- Warren & Brandeis, 1890 Outline Privacy Preserving Data Mining – How to have you cake and mine it too! – Assuming no trusted third party. Hippocratic Databases – What I may see or hear in the course of treatment … I will keep to myself. - Hippocratic Oath, 8 (circa 400 BC) Other related topics – Privacy Preserving Information Integration – Watermarking & fingerprinting Data Mining and Privacy The primary task in data mining: – development of models about aggregated data. Can we still develop accurate models, while protecting the privacy of individual records? Alice Randomization Overview 35 95,000 J.S. Bach painting nasa 35 95,000 J.S. Bach painting nasa Bob 45 60,000 B. Spears baseball cnn 45 60,000 B. Spears baseball cnn Chris 42 85,000 B. Marley, camping, microsoft 42 85,000 B. Marley camping microsoft Recommendation Service Alice Randomization Overview 35 95,000 J.S. Bach painting nasa 35 95,000 J.S. Bach painting nasa Bob 45 60,000 B. Spears baseball cnn 45 60,000 B. Spears baseball cnn Chris 42 85,000 B. Marley, camping, microsoft Recommendation Service Mining Algorithm 42 85,000 B. Marley camping microsoft Data Mining Model Alice Randomization Overview 50 65,000 Metallica painting nasa 35 95,000 J.S. Bach painting nasa Bob 45 60,000 B. Spears baseball cnn 38 90,000 B. Spears soccer fox Chris 42 85,000 B. Marley, camping, microsoft 32 55,000 B. Marley camping linuxware Recommendation Service Alice Randomization Overview 50 65,000 Metallica painting nasa 35 95,000 J.S. Bach painting nasa Bob 45 60,000 B. Spears baseball cnn 38 90,000 B. Spears soccer fox Chris 42 85,000 B. Marley, camping, microsoft Recommendation Service Recovery Mining Algorithm 32 55,000 B. Marley camping linuxware Data Mining Model Inducing Classifiers over Privacy Preserved Numeric Data Alice’s age Alice’s salary John’s age 30 | 25K | … 30 become s 65 (30+35) 50 | 40K | … Randomizer Randomizer 65 | 50K | … 35 | 60K | … Reconstruct Age Distribution Reconstruct Salary Distribution Decision Tree Algorithm Model Reconstruction Problem Original values x1, x2, ..., xn – from probability distribution X (unknown) To hide these values, we use y1, y2, ..., yn – from probability distribution Y Given – x1+y1, x2+y2, ..., xn+yn – the probability distribution of Y Estimate the probability distribution of X. Reconstruction Algorithm fX0 := Uniform distribution j := 0 repeat n 1 fY (( xi yi ) a ) f Xj (a ) fXj+1(a) := n Bayes’ Rule j i 1 fY (( xi yi ) a ) f X (a ) j := j+1 until (stopping criterion met) (R. Agrawal & R. Srikant, SIGMOD 2000) Converges to maximum likelihood estimate. – D. Agrawal & C.C. Aggarwal, PODS 2001. Works Well 1000 Original 800 600 Randomized 400 Reconstructed 0 60 200 20 Number of People 1200 Age Decision Tree Example Age 23 17 43 68 32 20 Salary 50K 30K 40K 50K 70K 20K Repeat Visitor? Repeat Repeat Repeat Single Single Repeat Age < 25 No Yes Salary < 50K Repeat Yes Repeat No Single Accuracy vs. Randomization Fn 3 100 Accuracy 90 80 Original 70 Randomized Reconstructed 60 50 40 10 20 40 60 80 100 Randomization Level 150 200 Hippocratic Databases Hippocratic Oath, 8 (circa 400 BC) – What I may see or hear in the course of treatment … I will keep to myself. What if the database systems were to embrace the Hippocratic Oath? R. Agrawal, J. Kiernan, R. Srikant, Y. Xu Hippocratic Databases. VLDB 2002.. The Ten Principles Driven by current privacy legislation. – US (FIPA, 1974), Europe (OECD , 1980), Canada (1995), Australia (2000), Japan (2003) Principles: – Collection Group: Purpose Specification, Consent, Limited Collection – Use Group: Limited Use, Limited Disclosure, Limited Retention, Accuracy – Security & Openness Group: Safety, Openness, Compliance Collection Group 1. Purpose Specification – For personal information stored in the database, the purposes for which the information has been collected shall be associated with that information. 2. Consent – The purposes associated with personal information shall have consent of the donor (person whose information is being stored). 3. Limited Collection – The information collected shall be limited to the minimum necessary for accomplishing the specified purposes. Use Group 4. Limited Use – The database shall run only those queries that are consistent with the purposes for which the information has been collected. 5. Limited Disclosure – Personal information shall not be communicated outside the database for purposes other than those for which there is consent from the donor of the information. Use Group (2) 6. Limited Retention – Personal information shall be retained only as long as necessary for the fulfillment of the purposes for which it has been collected. 7. Accuracy – Personal information stored in the database shall be accurate and up-to-date. Security & Openness Group 8. Safety – Personal information shall be protected by security safeguards against theft and other misappropriations. 9. Openness – A donor shall be able to access all information about the donor stored in the database. 10. Compliance – A donor shall be able to verify compliance with the above principles. Similarly, the database shall be able to address a challenge concerning compliance. Architecture Privacy Policy Privacy Metadata Creator Privacy Metadata Data Collection Queries Other Privacy Constraint Validator Attribute Access Control Data Collection Analyzer Data Accuracy Analyzer Query Intrusion Detector Data Retention Manager Audit Info Audit Info Audit Trail Store Record Access Control Encryption Support Decision-Making Across Sovereign Data Repositories • Separate databases due to statutory, competitive, or security reasons. Selective, minimal sharing on need-to-know basis. Minimal Necessary Sharing R a u Example: Among those who took a particular drug, how many had adverse reaction and their DNA contains a specific sequence? Researchers must not learn anything beyond counts. v Algorithms for computing joins and join counts while revealing minimal additional information. u x RS R must not know that S has b & y S must not know that R has a & x RS u v S b v y Count (R S) R & S do not learn anything except that the result is 2. Watermarking Relational Databases Goal: Deter data theft and assert ownership of pirated copies. – 1. Specify secret key 2. Specify table/attributes to be marked 2. Specify table/attributes which should contain marks Very unlikely to occur by chance. Hard to find => hard to destroy (robust against malicious attacks). Watermark Existing watermarking techniques developed for Insertion multimedia: images, sound, text, … are not 3. Pseudo applicable to database tables. – – – Choose secret key Examples: Life Sciences, Electronic Parts. Watermark – Intentionally introduced pattern in the data. – – 1. Rows in a table are unordered. Rows can be inserted, updated, deleted. Attributes can be added, dropped. randomly select a subset of the rows for marking Function of secret key and attribute Watermark can be detected using only a subset values New algorithm for watermarking database tables. – of the rows and attributes of a table. – Robust against updates,incrementally updatable. Database Watermark Detection 3. Identify marked rows/attributes, compare marks 4. Confirm with expected presence or absence of mark values the Requires watermark neither original unmarked data nor the Suspicious watermark Database Closing Thoughts Solutions to complex problems such as privacy require a mix of legislations, societal norms, market forces & technology By advancing technology, we can change the mix and improve the overall quality of the solution References M. Bawa, R. Bayardo, R. Agrawal. Privacy-preserving indexing of Documents on the Network. 29th Int'l Conf. on Very Large Databases (VLDB), Berlin, Sept. 2003. R. Agrawal, A. Evfimievski, R. Srikant. Information Sharing Across Private Databases. ACM Int’l Conf. On Management of Data (SIGMOD), San Diego, California, June 2003. A. Evfimievski, J. Gehrke, R. Srikant. Liming Privacy Breaches in Privacy Preserving Data Mining. PODS, San Diego, California, June 2003. R. Agrawal, J. Kiernan, R. Srikant, Y. Xu. An Xpath Based Preference Language for P3P. 12th Int'l World Wide Web Conf. (WWW), Budapest, Hungary, May 2003. R. Agrawal, J. Kiernan, R. Srikant, Y. Xu. Implementing P3P Using Database Technology. 19th Int'l Conf.on Data Engineering(ICDE), Bangalore, India, March 2003. R. Agrawal, J. Kiernan, R. Srikant, Y. Xu. Server Centric P3P. W3C Workshop on the Future of P3P, Dulles, Virginia, Nov. 2002. R. Agrawal, J. Kiernan, R. Srikant, Y. Xu. Hippocratic Databases. 28th Int'l Conf. on Very Large Databases (VLDB), Hong Kong, August 2002. R. Agrawal, J. Kiernan. Watermarking Relational Databases. 28th Int'l Conf. on Very Large Databases (VLDB), Hong Kong, August 2002. Expanded version in VLDB Journal 2003. A. Evfimievski, R. Srikant, R. Agrawal, J. Gehrke. Mining Association Rules Over Privacy Preserving Data. 8th Int'l Conf. on Knowledge Discovery in Databases and Data Mining (KDD), Edmonton, Canada, July 2002. R. Agrawal, R. Srikant. Privacy Preserving Data Mining. ACM Int’l Conf. On Management of Data (SIGMOD), Dallas, Texas, May 2000.