* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Download Document
Survey
Document related concepts
Transcript
Hippocratic Data Management Rakesh Agrawal IBM Almaden Research Center Thesis We need information systems that – respect the privacy of data they manage AND – do not impede the useful flow of information. It is feasible to reconcile the apparent contradiction Outline Why Privacy in Data Systems Some Technology Directions Some Challenging Problems Drivers for Privacy Privacy Surveys: – 17% privacy fundamentalists, 56% pragmatic majority, 27% marginally concerned (Understanding net users' attitude about online privacy, April 99) – 83% would stop doing business with a company if it misused customer information (Privacy on and off the Internet: What consumers want, Nov. 2001) Govt. legislations & guidelines: – – – – – – Fair Information Practices Act (US, 1974) OECD Guidelines (Europe, 1980) Canadian Standards Association’s Model Code (1995) Australian Privacy Amendment (2000) Japan: proposed legislation (2003) HIPAA, GLB, Recent U.S. Federal & State Initiatives Privacy Violations Accidents: – Kaiser, GlobalHealthrax Lax security: – Massachusetts govt. Ethically questionable behavior: – Lotus & Equifax, Lexis-Nexis, Medical Marketing Service, Boston University, CVS & Giant Food Illegal: – Toysmart Assertion Enterprises lack tools and technologies for managing private data and enforcing privacy policies. Founding Tenets of Current Database Systems Ullman, “Principles of Database and Knowledgebase Systems” Fundamental: – Manage persistent data. – Access a large amount of data efficiently. Desirable: – Support for data model, high-level languages, transaction management, access control, and resiliency. Similar list in other database textbooks. Statistical & Secure Databases Statistical Databases – Provide statistical information (sum, count, etc.) without compromising sensitive information about individuals, [AW89] Multilevel Secure Databases – Multilevel relations, e.g., records tagged “secret”, “confidential”, or “unclassified”, e.g. [JS91] Need to protect privacy in transactional databases that support daily operations. – Cannot restrict queries to statistical queries. – Cannot tag all the records “top secret”. Our Research Directions Privacy Preserving Data Mining Hippocratic Databases Data Mining and Privacy The primary task in data mining: development of models about aggregated data. Can we develop accurate models without access to precise information in individual data records? R. Agrawal, R. Srikant. Privacy Preserving Data Mining. ACM Int’l Conf. On Management of Data (SIGMOD), May 2000. Privacy Preserving Data Mining 30 | 25K | … 50 | 40K | … Randomizer Randomizer 65 | 50K | … 35 | 60K | … Reconstruct Age Distribution Reconstruct Salary Distribution Data Mining Algorithm Model Reconstruction Problem Original values x1, x2, ..., xn – from probability distribution X To hide these values, we use y1, y2, ..., yn – from probability distribution Y Given – x1+y1, x2+y2, ..., xn+yn – the probability distribution of Y Estimate the probability distribution of X. Intuition (Reconstruct single point) Use Bayes' rule for density functions 1 0V A g e 9 0 O r i g i n a l d i s t r i b u t i o n f o r A g e P r o b a b i l i s t i c e s t i m a t e o f o r i g i n a l v a l u e o f V Intuition (Reconstruct single point) Use Bayes' rule for density functions 1 0V A g e 9 0 O r i g i n a l D i s t r i b u t i o n f o r A g e P r o b a b i l i s t i c e s t i m a t e o f o r i g i n a l v a l u e o f V Reconstruction: Intuition Combine estimates of where a point came from for all the points: – yields estimate of original distribution. 10 Age 90 Reconstruction Algorithm fX0 := Uniform distribution j := 0 repeat n 1 fY (( xi yi ) a ) f Xj (a ) fXj+1(a) := n Bayes’ Rule j i 1 fY (( xi yi ) a ) f X (a ) j := j+1 until (stopping criterion met) Converges to maximum likelihood estimate. – D. Agrawal & C.C. Aggarwal, PODS 2001. Works Well 1000 Original 800 600 Randomized 400 Reconstructed 0 60 200 20 Number of People 1200 Age Classification Naïve Bayes – Assumes independence between attributes. Decision Tree – Correlations are weakened by randomization. Experimental Methodology Compare accuracy against – Original: unperturbed data without randomization. – Randomized: perturbed data but without making any corrections for randomization. Test data not randomized. Synthetic data benchmark from [AGI+92]. Training set of 100,000 records, split equally between the two classes. Decision Tree Experiments 100% Randomization Lev el 100 Accuracy 90 Original 80 Randomized 70 Reconstructed 60 50 Fn 1 Fn 2 Fn 3 Fn 4 Fn 5 Accuracy vs. Randomization Fn 3 100 Accuracy 90 80 Original 70 Randomized Reconstructed 60 50 40 10 20 40 60 80 100 Randomization Level 150 200 So far… Question: Can we develop accurate models without access to precise information in individual data records? Answer: yes, by randomization. – for numerical attributes, classification How about Association Rules? Associations Recap A transaction t is a set of items (e.g. books) All transactions form a set T of transactions Any itemset A has support s in T if # t T | A t s supp A T Itemset A is frequent if s smin Task: Find all frequent itemsets The Problem How to randomize transactions so that – we can find frequent itemsets – while preserving privacy at transaction level? Evfimievski, R. Srikant, R. Agrawal, J. Gehrke. Mining Association Rules Over Privacy Preserving Data. 8th Int'l Conf. on Knowledge Discovery in Databases and Data Mining, July 2002. Alice Randomization Overview J.S. Bach, painting, nasa.gov, … J.S. Bach, painting, nasa.gov, … Bob B. Spears, baseball, cnn.com, … B. Spears, baseball, cnn.com, … Chris B. Marley, camping, linux.org, … B. Marley, camping, linux.org, … Recommendation Service Alice Randomization Overview J.S. Bach, painting, nasa.gov, … J.S. Bach, painting, nasa.gov, … Bob B. Spears, baseball, cnn.com, … B. Spears, baseball, cnn.com, … Chris B. Marley, camping, linux.org, … Recommendation Service Associations B. Marley, camping, linux.org, … Recommendations Alice Randomization Overview Metallica, painting, nasa.gov, … J.S. Bach, painting, nasa.gov, … Recommendation Service Support Recovery Bob B. Spears, baseball, cnn.com, … B. Spears, soccer, bbc.co.uk, … Chris B. Marley, camping, linux.org, … Associations B. Marley, camping, ibm.com … Recommendations Uniform Randomization Given a transaction, – keep item with, say 20% probability, – replace with a new random item with 80% probability. Example: {x, y, z} 10 M transactions of size 10 with 10 K items: 1% 5% have have {x, y}, {x, z}, {x, y, z} or {y, z} only • 0.23 0.008% 800 ts. 97.8% 0.22 • • 8/10,000 0.00016% 16 trans. 1.9% 94% have one or zero items of {x, y, z} at most • 0.2 • (9/10,000)2 less than 0.00002% 2 transactions 0.3% Privacy Breach: Given {x, y, z} in the randomized transaction, we have about 98% certainty of {x, y, z} in the original one Privacy Breach Suppose: – t is an original transaction; – t’ is the corresponding randomized transaction; – A is a (frequent) itemset. Definition: Itemset A causes a privacy breach of level if, for some item z A, Prz t | A t Our Solution “Where does a wise man hide a leaf? In the forest. But what does he do if there is no forest?” “He grows a forest to hide it in.” G.K. Chesterton Insert many false items into each transaction Hide true itemsets among false ones Can we still find frequent itemsets while having sufficient privacy? Cut and Paste Randomization Given transaction t of size m, construct t’: t = t’ = a, b, c, u, v, w, x, y, z Cut and Paste Randomization Given transaction t of size m, construct t’: – Choose a number j between 0 and Km (cutoff); t = a, b, c, u, v, w, x, y, z t’ = j=4 Cut and Paste Randomization Given transaction t of size m, construct t’: – Choose a number j between 0 and Km (cutoff); – Include j items of t into t’; t = t’ = a, b, c, u, v, w, x, y, z b, v, x, z j=4 Cut and Paste Randomization Given transaction t of size m, construct t’: – Choose a number j between 0 and Km (cutoff); – Include j items of t into t’; – Each other item is included into t’ with probability pm . The choice of Km and pm is based on the desired level of privacy. t = t’ = a, b, c, u, v, w, x, y, z b, v, x, z j=4 œ, å, ß, ξ, ψ, €, א, ъ, ђ, … Partial Supports To recover original support of an itemset, we need randomized supports of its subsets. Given an itemset A of size k and transaction size m, A vector of partial supports of A is s s0 , s1 ,..., sk , where sl 1 # t T | # t A l T – Here sk is the same as the support of A. – Randomized partial supports are denoted by s . Transition Matrix Let k = |A|, m = |t|. Transition matrix P = P (k, m) connects randomized partial supports with original ones: E s P s , where Pl , l Pr # t A l | # t A l Randomized supports are distributed as a sum of multinomial distributions. The Unbiased Estimators Given randomized partial supports, we can estimate original partial supports: sest Q s , where Q P 1 Covariance matrix for this estimator: 1 k T Cov sest s Q D [ l ] Q , l T l 0 where D[l ]i , j Pi , l i j Pi , l Pj , l To estimate it, substitute sl with (sest)l . – Special case: estimators for support and its variance Privacy Breach Analysis How many added items are enough to protect privacy? – Have to satisfy Pr [z t | A t’] < ( no privacy breaches) – Select parameters so that it holds for all itemsets. – Use formula ( s Pr # t A l , z t , s 0 l 0 k Prz t | A t s Pk , l l 0 l ): k s P l 0 l k ,l Parameters are to be selected in advance! – Construct a privacy-challenging test: an itemset whose all subsets have maximum possible support. – Enough to know maximal support of an itemset for each size. Lowest Discoverable Support LDS is s.t., when predicted, is 4 away from zero. Roughly, LDS is proportional to 1 T |t| = 5, = 50% LDS vs. number of transactions 1.2 1-itemsets 2-itemsets 3-itemsets 1 LDS, % 0.8 0.6 0.4 0.2 0 1 10 Number of transactions, millions 100 LDS vs. Breach Level |t| = 5, |T| = 5 M 2.5 1-itemsets 2-itemsets 2 LDS, % 3-itemsets 1.5 1 0.5 0 30 40 50 60 70 80 Privacy Breach Level, % Reminder: breach level is the limit on Pr [z t | A t’] 90 Real Datasets: soccer, mailorder Soccer is the clickstream log of WorldCup’98 web site, split into sessions of HTML requests. – 11 K items (HTMLs), 6.5 M transactions Mailorder is a purchase dataset from a certain online store – Products are replaced with their categories – 96 items (categories), 2.9 M transactions Results Breach level = 50%. Itemset Size True Itemsets True Positives False Drops False Positives smin = 0.2% 1 266 254 12 31 0.07% for 2 217 195 22 45 3-itemsets 3 48 43 5 26 Itemset Size True Itemsets True Positives False Drops False Positives smin = 0.2% 1 65 65 0 0 0.05% for 2 228 212 16 28 3-itemsets 3 22 18 4 5 Soccer: Mailorder: Summary Can have our cake and mine it too! Randomization is an interesting approach for building data mining models while preserving user privacy!!! Y. Lindell, B. Pinkas. Privacy Preserving Data Mining. Crypto 2000. S. Rizvi, J. Haritsa, “Privacy-Preserving Association Rule Mining”, VLDB 2002 J. Vaidya, C.W. Clifton. Privacy Preserving Association Rule Mining in Vertically Partitioned Data. KDD 2002. The Hippocratic Oath “What I may see or hear in the course of treatment or even outside of the treatment in regard to the life of men, which on no account [ought to be] spread abroad, I will keep to myself, holding such things shameful to be spoken about.” – Hippocratic Oath, 8 (circa 400 BC) Hippocratic Databases Founding tenet: Responsibility for the privacy of data they manage. R. Agrawal, J. Kiernan, R. Srikant, Y. Xu Hippocratic Databases 28th Int'l Conf. on Very Large Databases (VLDB), August 2002.. Approach Derive founding principles from current privacy legislation. Strawman Design Ten Principles of Hippocratic Databases Collection Group – Purpose Specification, Consent, Limited Collection Use Group – Limited Use, Limited Disclosure, Limited Retention, Accuracy Security & Openness Group – Safety, Openness, Compliance Collection Group 1. Purpose Specification – For personal information stored in the database, the purposes for which the information has been collected shall be associated with that information. 2. Consent – The purposes associated with personal information shall have consent of the donor (person whose information is being stored). 3. Limited Collection – The information collected shall be limited to the minimum necessary for accomplishing the specified purposes. Use Group 4. Limited Use – The database shall run only those queries that are consistent with the purposes for which the information has been collected. 5. Limited Disclosure – Personal information shall not be communicated outside the database for purposes other than those for which there is consent from the donor of the information. Use Group (2) 6. Limited Retention – Personal information shall be retained only as long as necessary for the fulfillment of the purposes for which it has been collected. 7. Accuracy – Personal information stored in the database shall be accurate and up-to-date. Security & Openness Group 8. Safety – Personal information shall be protected by security safeguards against theft and other misappropriations. 9. Openness – A donor shall be able to access all information about the donor stored in the database. 10. Compliance – A donor shall be able to verify compliance with the above principles. Similarly, the database shall be able to address a challenge concerning compliance. Strawman Architecture Privacy Policy Data Collection Queries Store Other Architecture: Policy Privacy Policy Privacy Metadata Creator Converts privacy policy into privacy metadata tables. For each purpose & piece of information (attribute): • External recipients • Retention period • Authorized users Different designs possible. Privacy Metadata Store Limited Disclosure Limited Retention Privacy Policies Table Purpose Table Attribute External- Authorizedrecipients users Retention purchase customer name {delivery, credit-card} {shipping, charge} 1 month purchase customer email empty {shipping} 1 month register customer name empty {registration} 3 years register customer email empty {registration} 3 years book empty {mining} 10 years recommen order dations Architecture: Data Collection Data Collection Privacy policy compatible with user’s privacy preference? Privacy Constraint Validator Audit trail for compliance. Audit Info Privacy Metadata Audit Trail Store Consent Compliance Architecture: Data Collection Data Collection Privacy Constraint Validator Data Accuracy Analyzer Audit Info Privacy Metadata Audit Trail Data cleansing, e.g., errors in address. Accuracy Associate set of Purpose purposes with Specification each record. Store Record Access Control Architecture: Queries Queries Safety Limited Use 2. Query tagged “telemarketing” cannot see credit card info. Attribute Access Control 3. Telemarketing query only sees records that include “telemarketing” in set of purposes. Privacy Metadata Store Record Access Control Safety 1. Telemarketing cannot issue query tagged “charge”. Architecture: Queries Queries Safety Compliance Attribute Access Control Telemarketing query that asks for all phone numbers. Query Intrusion Detector • Compliance • Training data for query intrusion detector Privacy Metadata Audit Trail Audit Info Store Record Access Control Architecture: Other Other Analyze queries to identify unnecessary collection, retention & authorizations. Limited Collection Limited Retention Delete items in accordance with privacy policy. Safety Privacy Metadata Data Collection Analyzer Data Retention Manager Additional security for sensitive data. Store Encryption Support Strawman Architecture Privacy Policy Privacy Metadata Creator Privacy Metadata Data Collection Queries Other Privacy Constraint Validator Attribute Access Control Data Collection Analyzer Data Accuracy Analyzer Query Intrusion Detector Data Retention Manager Audit Info Audit Info Audit Trail Store Record Access Control Encryption Support Status Prototyping core functionality of the design Nibbling at some of the open problems (see VLDB-2002 paper) Privacy-Preserving Synthetic Datasets for Data Mining Research How to randomize to be able to build multiple types of models How to handle combination of data types How to handle rare events Synthetic Data Transactions Comunications Randomize Demographic Govt Records State Birth Marriage Local Credit Agencies Network is the Database What if private data never leaves a person’s data store? Jane’s Data Credit Application Decision – Computations travel to data Jane’s Data Approval Function Result Decision-Making Across Private Data Repositories Separate databases due to statutory, competitive, or security reasons. Selective, minimal sharing on need-to-know basis. Example: Among those who took a particular drug, how many had adverse reaction and their DNA contains a specific sequence? Researchers must not learn anything beyond counts. Minimal Necessary Sharing R a u v x RS R must not know that S has b & y S must not know that R has a & x RS u v S b u v y Count (R S) R & S do not learn anything except that the result is 2. Closing Thoughts The right to privacy: the most cherished of human freedoms -- Warren & Brandeis, 1890 Code is law … it is all a matter of code: the software and hardware that now rule -- L. Lessig We can architect computing systems to protect values we believe are fundamental, or we can architect them to allow those values to disappear. What do we want to do as computer scientists? References R. Agrawal, R. Srikant. Information Integration Across Autonomous Enterprises. ACM Int’l Conf. On Management of Data (SIGMOD), San Diego, California, June 2003. R. Agrawal, J. Kiernan, R. Srikant, Y. Xu. An Xpath Based Preference Language for P3P. 12th Int'l World Wide Web Conf. (WWW), Budapest, Hungary, May 2003. R. Agrawal, J. Kiernan, R. Srikant, Y. Xu. Implementing P3P Using Database Technology. 19th Int'l Conf.on Data Engineering(ICDE), Bangalore, India, March 2003. R. Agrawal, J. Kiernan, R. Srikant, Y. Xu. Server Centric P3P. W3C Workshop on the Future of P3P, Dulles, Virginia, Nov. 2002. R. Agrawal, J. Kiernan, R. Srikant, Y. Xu. Hippocratic Databases. 28th Int'l Conf. on Very Large Databases (VLDB), Hong Kong, August 2002. R. Agrawal, J. Kiernan. Watermarking Relational Databases. 28th Int'l Conf. on Very Large Databases (VLDB), Hong Kong, August 2002. A. Evfimievski, R. Srikant, R. Agrawal, J. Gehrke. Mining Association Rules Over Privacy Preserving Data. 8th Int'l Conf. on Knowledge Discovery in Databases and Data Mining (KDD), Edmonton, Canada, July 2002. R. Agrawal, R. Srikant. Privacy Preserving Data Mining. ACM Int’l Conf. On Management of Data (SIGMOD), Dallas, Texas, May 2000. New Challenges General – Language – Efficiency Use – Limited Collection – Limited Disclosure – Limited Retention Security and Openness – Safety – Openness – Compliance Language Need a language for privacy policies & user preferences. P3P can be used as starting point. – Developed primarily for web shopping. – What about richer domains? How do we balance expressibility and usability? – Arrange concepts in hierarchy or subsumption relationship. P3P recipients: Purpose: contact email phone home work Ours Same Delivery Unrelated Public Language (2) How do we accommodate user negotiation models? – User willing to disclose information only if fairly compensated. – Value of privacy as coalitional game [KPR2001] Efficiency How do we minimize the cost of privacy checking? How do we incorporate purpose into database design and query optimization? Tradeoffs between space & running time. Only tag records in customer table with purpose, not all records. But now need to do a join when scanning records in order table. How does the secure databases work on decomposition of multilevel relations into singlelevel relations [JS91] apply here? Limited Collection How do we identify attributes that are collected but not used? – Assets are only needed for mortgage when salary is below some threshold. What’s the needed granularity for numeric attributes? – Queries only ask “Salary > threshold” for rent application. How do we generate minimal queries? – Redundancy may be hidden in application code. Limited Disclosure Can the user dynamically determine the set of recipients? Example: Alice wants to add EasyCredit to set of recipients in EquiRate’s database. Digital signatures. Limited Retention Completely forgetting some information is non-trivial. How do we delete a record from the logs and checkpoints, without affecting recovery? How do we continue to support historical analysis and statistical queries without incurring privacy breaches? Safety Encryption provides additional layer of security. How do we index encrypted data? How do we run queries against encrypted data? [SWP00], [HILM02] Openness A donor shall be able to access all information about the donor stored in the database. How does the database check Alice is really Alice and not somebody else? – Princeton admissions office broke into Yale’s admissions using applicant’s social security number and birth date. How does Alice find out what databases have information about her? – Symmetrically private information retrieval [GIKM98]. Compliance Universal Logging – Can we provide each user whose data is accessed with a log of that access, along with the query reading the data? – Use intermediaries who aggregate and analyze logs for many users. Tracking Privacy Breaches – Insert “fingerprint” records with emails, telephone numbers, and credit card numbers. – Some data may be more valuable for spammers or credit card theft. How do we identify categories to do stratified fingerprinting rather than randomly inserting records?