Privacy in Data Systems
Rakesh Agrawal, IBM Almaden Research Center
Joint work with Srikant, Kiernan & Xu

Theme
- Increasing need for information systems that
  - protect the privacy and ownership of information, and
  - do not impede the flow of information.
- Resolving the apparent contradiction in the above statement is a major research challenge and opportunity.

Drivers
- Policies and legislation
  - U.S. and international regulations
  - Legal proceedings against businesses
- Consumer concerns
  - "Consumer privacy apprehensions continue to plague the Web ... these fears will hold back roughly $15 billion in e-commerce revenue." (Forrester Research, 2001)
  - Most consumers are "privacy pragmatists." (Westin surveys)
- Moral imperative
  - The right to privacy: the most cherished of human freedoms. (Warren & Brandeis, 1890)

Privacy Is Headline News
- "Privacy #1 issue in the 21st century." (Wall Street Journal, January 24, 2000)
- "Anyone today who thinks the privacy issue has peaked is greatly mistaken ... we are in the early stages of a sweeping change in attitudes that will put once-routine business practices under the microscope." (Forrester Research, March 5, 2001)

Outline
- Privacy-preserving data mining: how to have your cake and mine it too, assuming no trusted third party.
- Hippocratic databases: "What I may see or hear in the course of treatment ... I will keep to myself." (Hippocratic Oath, 8, circa 400 BC)
- Other related topics: privacy-cognizant information integration, database support for P3P, watermarking & fingerprinting.

Data Mining and Privacy
- The primary task in data mining: development of models about aggregated data.
- Can we still develop accurate models while protecting the privacy of individual records?

Approaches
- Randomization
- Cryptographic
- Statistical disclosure control

Randomization Overview
- Each user randomizes their record locally before sending it to the recommendation service, which sees only the randomized values. For example:
  - Alice: (35, 95,000, J.S. Bach, painting, nasa) is sent as (50, 65,000, Metallica, painting, nasa);
  - Bob: (45, 60,000, B. Spears, baseball, cnn) is sent as (38, 90,000, B. Spears, soccer, fox);
  - Chris: (42, 85,000, B. Marley, camping, microsoft) is sent as (32, 55,000, B. Marley, camping, linuxware).
- The service applies a recovery step to the randomized records and then runs the mining algorithm to build the data mining model; it never sees the true records.

Inducing Classifiers over Privacy-Preserved Numeric Data
- Each user's numeric values (e.g. Alice's age and salary, John's age) pass through a randomizer; for instance, Alice's age 30 becomes 65 (30 + 35).
- The server reconstructs the age and salary distributions from the randomized values and feeds them to the decision-tree algorithm to build the model.

Reconstruction Problem
- Original values x_1, x_2, ..., x_n are drawn from a probability distribution X (unknown).
- To hide these values, we add y_1, y_2, ..., y_n drawn from a probability distribution Y.
- Given the sums x_1+y_1, x_2+y_2, ..., x_n+y_n and the probability distribution of Y, estimate the probability distribution of X.

Reconstruction Algorithm (R. Agrawal & R. Srikant, SIGMOD 2000)
- Write w_i = x_i + y_i for the observed values.
- f_X^0 := uniform distribution; j := 0.
- Repeat, applying Bayes' rule at each step, until a stopping criterion is met:
    f_X^{j+1}(a) := (1/n) Σ_{i=1..n} [ f_Y(w_i − a) · f_X^j(a) ] / [ ∫ f_Y(w_i − z) · f_X^j(z) dz ];  j := j + 1.
- Converges to the maximum likelihood estimate (D. Agrawal & C.C. Aggarwal, PODS 2001).
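The randomize-then-reconstruct loop above can be sketched as a discretized iteration over a grid of candidate values. This is an illustrative version only — the bin grid, noise level, and sample data below are invented for the demo, not taken from the paper:

```python
import math
import random

def reconstruct(perturbed, noise_pdf, bins, iters=20):
    """Iterative Bayes reconstruction: re-estimate the original
    distribution f_X over discrete bins from perturbed values
    w_i = x_i + y_i, given the known noise density f_Y."""
    f = [1.0 / len(bins)] * len(bins)            # f_X^0 := uniform
    for _ in range(iters):
        new = [0.0] * len(bins)
        for w in perturbed:
            # Posterior over bins for this observation (Bayes' rule).
            post = [noise_pdf(w - a) * f[k] for k, a in enumerate(bins)]
            z = sum(post) or 1.0
            for k in range(len(bins)):
                new[k] += post[k] / z
        f = [v / len(perturbed) for v in new]    # f_X^{j+1}
    return f

# Demo with invented data: ages clustered at 30 and 60, hidden by
# additive Gaussian noise with standard deviation 10.
random.seed(0)
true_ages = [30] * 500 + [60] * 500
perturbed = [a + random.gauss(0, 10) for a in true_ages]
bins = list(range(0, 91, 5))                     # candidate age values
gauss_pdf = lambda d: math.exp(-d * d / 200.0) / math.sqrt(200.0 * math.pi)
f = reconstruct(perturbed, gauss_pdf, bins)
```

After a few iterations the reconstructed density concentrates near the true age clusters, even though no individual true age is revealed.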
Estimate of a Single Point
- Use Bayes' rule for density functions.
- [Figure: the original distribution for Age (10 to 90), together with the probabilistic estimate of the original value of a randomized point V.]

Estimate of the Distribution
- Combine the estimates of where a point came from, for all the points: this yields an estimate of the original distribution over the Age range (10 to 90).

Works Well
- [Chart: number of people (0 to 1,200) by Age (20 to 60), comparing the original, randomized, and reconstructed distributions; the reconstruction closely matches the original.]

Decision Tree Example

  Age | Salary | Repeat Visitor?
  ----|--------|----------------
  23  | 50K    | Repeat
  17  | 30K    | Repeat
  43  | 40K    | Repeat
  68  | 50K    | Single
  32  | 70K    | Single
  20  | 20K    | Repeat

- Induced tree: Age < 25? Yes: Repeat. No: Salary < 50K? Yes: Repeat; No: Single.

Algorithms
- Global: reconstruct each attribute once at the beginning.
- By class: for each attribute, first split by class, then reconstruct separately for each class.
- Local: reconstruct at each node.
- See the SIGMOD 2000 paper for details.

Experimental Methodology
- Compare accuracy against:
  - Original: unperturbed data, without randomization.
  - Randomized: perturbed data, without making any corrections for randomization.
- Test data is not randomized.
- Synthetic benchmark from [AGI+92]; training set of 100,000 records, split equally between the two classes.

Decision Tree Experiments
- [Chart: accuracy (50 to 100%) on functions Fn 1 through Fn 5 at a 100% randomization level, comparing Original, Randomized, and Reconstructed.]

Accuracy vs. Randomization Level (Fn 3)
- [Chart: accuracy (40 to 100%) as the randomization level varies from 10% to 200%, comparing Original, Randomized, and Reconstructed.]

Discovering Associations over Privacy-Preserved Categorical Data
- A transaction t is a set of items.
- The support s of an itemset A is the number of transactions in which A appears.
- Itemset A is frequent if s ≥ s_min.
- Task: find all frequent itemsets, while preserving the privacy of individual transactions.
- A. Evfimievski, R. Srikant, R. Agrawal, J. Gehrke. Mining Association Rules Over Privacy Preserving Data. KDD 2002.
- S. Rizvi, J. Haritsa. Privacy-Preserving Association Rule Mining. VLDB 2002.

Uniform Randomization
- Given a transaction:
  - keep each item with, say, 20% probability;
  - replace it with a new random item with 80% probability.

Privacy Breach
- If one has access to the result (the frequent itemsets) as well as the randomized transactions, one may make inferences about the original transactions.
- Itemset A causes a privacy breach of level ρ if, for some item z ∈ A, Pr[z ∈ t | A ⊆ t'] ≥ ρ, where t is the original transaction and t' its randomized version.
- See also: A. Evfimievski, J. Gehrke, R. Srikant. PODS 2003.

Example: {x, y, z}
- 10 million transactions of size 10 over 10,000 items, randomized with keep probability 0.2. Originally:
  - 1% have {x, y, z}: after randomization, a fraction 0.2^3 of them (0.008% of all transactions) still contain {x, y, z} — 800 transactions, i.e. 97.8% of those that do.
  - 5% have {x, y}, {x, z}, or {y, z} only: 0.2^2 · (8/10,000), i.e. 0.00016% — 16 transactions (1.9%).
  - 94% have at most one item of {x, y, z}: at most 0.2 · (9/10,000)^2, i.e. less than 0.00002% — about 2 transactions (0.3%).
- Privacy breach: given {x, y, z} in a randomized transaction, we have 98% certainty of {x, y, z} being in the original one.

Solution
- "Where does a wise man hide a leaf? In the forest. But what does he do if there is no forest?" "He grows a forest to hide it in." (G.K. Chesterton)
- Insert many false items into each transaction; hide the true itemsets among the false ones.

Cut-and-Paste Randomization
- Given a transaction t of size m, construct t' as follows:
  - Choose a number j between 0 and K_m (the cutoff); e.g. for t = {a, b, c, u, v, w, x, y, z}, j = 4.
  - Include j items of t in t', e.g. t' = {b, v, x, z}.
  - Include each other item in t' with probability p_m.
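The cut-and-paste steps can be sketched as follows. The parameter names K and p stand in for the K_m and p_m above, and the `catalog` argument (the full item universe to draw false items from) is an assumption of this sketch:

```python
import random

def cut_and_paste(t, catalog, K, p):
    """Cut-and-paste randomization sketch: keep a random number j of
    true items (at most K), then insert every other catalog item
    independently with probability p."""
    t = list(t)
    j = random.randint(0, min(K, len(t)))     # choose cutoff j in [0, K]
    kept = set(random.sample(t, j))           # include j items of t
    for item in catalog:                      # every other item enters
        if item not in t and random.random() < p:
            kept.add(item)                    # ... with probability p
    return kept

# Demo: a 10-item transaction hidden inside a 1,000-item catalog.
random.seed(1)
t = list(range(10))
t_prime = cut_and_paste(t, range(1000), K=3, p=0.01)
```

With these numbers, t' retains at most 3 true items and gains roughly ten false ones, so any single randomized transaction says little about the original.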
- The choice of K_m and p_m is based on the desired level of privacy. In the running example, t = {a, b, c, u, v, w, x, y, z} becomes t' = {b, v, x, z} (j = 4) plus inserted false items (œ, å, ß, ξ, ψ, €, א, ъ, ђ, ...).

Partial Supports
- To recover the original support of an itemset, we need the randomized supports of its subsets.
- Given an itemset A of size k and transaction size m, the vector of partial supports of A is s = (s_0, s_1, ..., s_k), where
    s_l = (1/|T|) · #{ t ∈ T : #(t ∩ A) = l }.
- Here s_k is the same as the support of A. Randomized partial supports are denoted by s'.

The Unbiased Estimators
- Given the randomized partial supports, we can estimate the original partial supports:
    s_est = Q · s', where Q = P^{-1}
  (P being the matrix of transition probabilities P_{i,l} = Pr[#(t' ∩ A) = i | #(t ∩ A) = l]).
- Covariance matrix for this estimator:
    Cov(s_est) = (1/|T|) · Σ_{l=0..k} s_l · Q D[l] Q^T, where D[l]_{i,j} = P_{i,l} · δ_{i=j} − P_{i,l} · P_{j,l}.
- To estimate it, substitute s_l with (s_est)_l.
- Special case: estimators for the support and its variance.

Can We Still Find Frequent Itemsets?
- Breach level = 50%.

Soccer: s_min = 0.2%

  Itemset Size | True Itemsets | True Positives | False Drops | False Positives
  -------------|---------------|----------------|-------------|----------------
  1            | 266           | 254            | 12          | 31
  2            | 217           | 195            | 22          | 45
  3            | 48            | 43             | 5           | 26

Mailorder: s_min = 0.2%

  Itemset Size | True Itemsets | True Positives | False Drops | False Positives
  -------------|---------------|----------------|-------------|----------------
  1            | 65            | 65             | 0           | 0
  2            | 228           | 212            | 16          | 28
  3            | 22            | 18             | 4           | 5

Cryptographic Approach
- Accuracy: +   Performance: −   Security: ?
- Semi-honest (or passive) adversary: correctly follows the protocol specification, yet attempts to learn additional information by analyzing the messages.
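As a concrete special case of the unbiased estimator s_est = Q·s', consider a single item whose presence bit is kept with probability p and flipped otherwise. Then P is the 2×2 matrix [[p, 1−p], [1−p, p]] and inverting it gives the classic randomized-response correction. A minimal sketch (the numbers in the demo are invented):

```python
import random

def randomize_bit(b, p):
    """Keep the item's presence bit with probability p, flip it otherwise."""
    return b if random.random() < p else 1 - b

def estimate_support(randomized_bits, p):
    """Invert P = [[p, 1-p], [1-p, p]]: since E[s'] = s*p + (1-s)*(1-p),
    the unbiased estimate is s_est = (s' - (1-p)) / (2p - 1)."""
    s_rand = sum(randomized_bits) / len(randomized_bits)
    return (s_rand - (1 - p)) / (2 * p - 1)

# Demo: true support 30%, keep probability 0.8.
random.seed(2)
true_bits = [1] * 30_000 + [0] * 70_000
noisy_bits = [randomize_bit(b, 0.8) for b in true_bits]
est = estimate_support(noisy_bits, 0.8)
```

The estimate is unbiased but has nonzero variance, which is exactly what the covariance formula above quantifies for general itemsets.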
Cryptographic Primitives
- Oblivious transfer [CACM85, MIT81]
  - Sender's input: (x_0, x_1); receiver's input: σ ∈ {0, 1}.
  - Receiver learns x_σ; sender learns nothing.
  - Sufficient for secure computation [STOC 88].
- Oblivious polynomial evaluation [STOC99]
  - Sender's input: a polynomial Q over a field F; receiver's input: z ∈ F.
  - Receiver obtains Q(z); sender learns nothing.
- Yao's two-party protocol [FOCS 84]
  - Party 1 has input x, Party 2 has input y; compute f(x, y) without revealing x or y.
  - Any polynomial-time function can be expressed as a combinatorial circuit of polynomial size [JACM72].

Private Distributed ID3
- Problem: two parties owning confidential databases wish to build a decision-tree classifier on the union of their databases, without revealing any unnecessary information.
- Y. Lindell, B. Pinkas. Privacy Preserving Data Mining. Crypto 2000.

Basic Idea
- Find the attribute with the highest information gain privately; we can then split on this attribute and recurse.

Information Gain
- Let T = the set of records (dataset), T(c_i) = the set of records in class i, and T(c_i, a_j) = the set of records in class i with value(A) = a_j.
- Entropy(T) = − Σ_i (|T(c_i)| / |T|) · log(|T(c_i)| / |T|)
- Gain(T, A) = Entropy(T) − Σ_j (|T(a_j)| / |T|) · Entropy(T(a_j))
- We need to compute:
  - Σ_j Σ_i |T(a_j, c_i)| · log |T(a_j, c_i)|
  - Σ_j |T(a_j)| · log |T(a_j)|

Selecting the Split Attribute
- Given v_1 known to party 1 and v_2 known to party 2, compute (v_1 + v_2) · log(v_1 + v_2) and output random shares:
  - Party 1 gets (Answer − β); Party 2 gets β, where β is a random number.
- Given random shares for each attribute, use Yao's protocol to compute the information gain.

Purdue Toolkit
- Partitioned databases.
- Secure building blocks for computing: Sum, Set Union, Size of Set Intersection, Scalar Product.
- C. Clifton et al. Tools for Privacy Preserving Data Mining. SIGKDD Explorations 2003.
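The Sum building block listed above can be sketched as a single-process simulation of the ring protocol; the modulus and the in-process loop are simplifications of this sketch, since a real deployment would pass the masked partial sums between separate parties:

```python
import random

def secure_sum(values, modulus=1_000_000):
    """Secure-sum ring protocol (semi-honest model), simulated locally.
    The initiator masks its input with a random R; each subsequent
    party adds its own input to the running total it receives; finally
    the initiator subtracts R. No party sees another's individual value,
    only uniformly-masked partial sums."""
    R = random.randrange(modulus)
    running = (R + values[0]) % modulus       # initiator sends masked value
    for v in values[1:]:                      # each party adds its input
        running = (running + v) % modulus
    return (running - R) % modulus            # initiator removes the mask
```

With inputs 5, 3, and 9 and, say, R = 7, the partial sums seen on the ring are 12, 15, and 24, and the initiator recovers 24 − 7 = 17 — matching the example in the text.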
Sum
- Example: with party inputs 5, 3, and 9 and a random mask R = 7, the masked partial sums passed around the ring are 12, 15, and 24; the initiator recovers Sum = 24 − R = 17.
- Relies on the semi-honest assumption.

Algorithms
- Association rules over horizontally partitioned data and over vertically partitioned data.
- EM clustering.

Work in Statistical Databases
- Goal: provide statistical information without compromising sensitive information about individuals (AW89, Sho82).
- Techniques: query restriction and data perturbation.
- Negative result: one cannot give high-quality statistics and simultaneously prevent partial disclosure of individual information [AW89].

Techniques
- Query restriction
  - Restrict the size of the query result (e.g. FEL72, DDS79).
  - Control overlap among successive queries (e.g. DJL79).
  - Suppress small data cells (e.g. CO82).
- Output perturbation
  - Sample the result of the query (e.g. Den80).
  - Add noise to the query result (e.g. Bec80).
- Data perturbation
  - Replace the database with a sample (e.g. LST83, LCL85, Rei84).
  - Swap values between records (e.g. Den82).
  - Add noise to values (e.g. TYW84, War65).

Summary
- Promising technical direction and results.
- Much more needs to be done, e.g.:
  - the trade-off between the amount of privacy breach and performance;
  - examination of other approaches (e.g. randomization based on swapping).

Hippocratic Databases
- Hippocratic Oath, 8 (circa 400 BC): "What I may see or hear in the course of treatment ... I will keep to myself."
- What if database systems were to embrace the Hippocratic Oath?
- R. Agrawal, J. Kiernan, R. Srikant, Y. Xu. Hippocratic Databases. VLDB 2002.

The Ten Principles
- Driven by current privacy legislation: US (FIPA, 1974), Europe (OECD, 1980), Canada (1995), Australia (2000), Japan (2003).
- Collection group: Purpose Specification, Consent, Limited Collection.
- Use group: Limited Use, Limited Disclosure, Limited Retention, Accuracy.
- Security & openness group: Safety, Openness, Compliance.

Collection Group
1. Purpose Specification: for personal information stored in the database, the purposes for which the information has been collected shall be associated with that information.
2. Consent: the purposes associated with personal information shall have the consent of the donor (the person whose information is being stored).
3. Limited Collection: the information collected shall be limited to the minimum necessary for accomplishing the specified purposes.

Use Group
4. Limited Use: the database shall run only those queries that are consistent with the purposes for which the information has been collected.
5. Limited Disclosure: personal information shall not be communicated outside the database for purposes other than those for which there is consent from the donor of the information.
6. Limited Retention: personal information shall be retained only as long as necessary for the fulfillment of the purposes for which it has been collected.
7. Accuracy: personal information stored in the database shall be accurate and up-to-date.

Security & Openness Group
8. Safety: personal information shall be protected by security safeguards against theft and other misappropriations.
9. Openness: a donor shall be able to access all information about the donor stored in the database.
10. Compliance: a donor shall be able to verify compliance with the above principles. Similarly, the database shall be able to address a challenge concerning compliance.

Architecture
- A privacy policy feeds a Privacy Metadata Creator, which maintains the privacy metadata used by the rest of the system.
- Components: a Privacy Constraint Validator and Data Collection Analyzer for data collection; attribute-level and record-level access control plus a Query Intrusion Detector for queries; a Data Accuracy Analyzer; a Data Retention Manager; an audit trail store for audit information; and encryption support.

New Challenges
- General: language, efficiency.
- Use: limited collection, limited disclosure, limited retention.
- Security and openness: safety, openness, compliance.

Language
- Need a declarative language for specifying privacy policies and user preferences.
- P3P is very limited: it was developed primarily for web shopping, and provides no enforcement.
- Some desirable features:
  - user negotiation models (least invasive site; coalitional game [KPR2001]);
  - a balance between expressibility and usability.

Efficiency
- How do we minimize the cost of privacy checking? We need cell-level access control.
- How do we incorporate purpose into database design and query optimization?
- How does the secure-databases work on decomposing multilevel relations into single-level relations [JS91] apply here? Note the difference in granularity and scale from work in secure databases.

Limited Collection
- How do we identify attributes that are collected but not used? Example: assets are only needed for a mortgage when the salary is below some threshold.
- What is the needed granularity for numeric attributes? Example: queries only ask "salary > threshold" for a rent application.
- How do we generate the minimal query for a given purpose?

Limited Disclosure
- Need a dynamically determined set of recipients? Example: Alice wants to add EasyCredit to the set of recipients in EquiRate's database.
- Digital signatures.

Limited Retention
- Completely forgetting some information is non-trivial.
- How do we delete a record from the logs and checkpoints without affecting recovery?
- How do we continue to support historical analysis and statistical queries without incurring privacy breaches?

Safety
- Encryption to avoid inadvertent disclosure of data:
  - How do we index encrypted data?
  - How do we run queries against encrypted data?
  - [SWP00], [HILM02]

Openness
- A donor shall be able to access all information about the donor stored in the database.
- How does the database check that Alice is really Alice and not somebody else? (Princeton's admissions office broke into Yale's admissions system using applicants' social security numbers and birth dates.)
- How does Alice find out which databases have information about her? See the information-discovery literature and symmetrically private information retrieval [GIKM98].
Compliance
- Universal logging: can we provide each user whose data is accessed with a log of that access, along with the query that read the data?
- Tracking privacy breaches: insert "fingerprint" records with email addresses, telephone numbers, and credit card numbers. Some data may be more valuable to spammers or credit-card thieves; how do we identify categories for stratified fingerprinting rather than inserting records randomly?

Summary
- A gold mine of challenging research problems (besides being useful)!

Related Topics
- Privacy-cognizant information integration
- Database support for P3P
- Watermarking & fingerprinting

Decision-Making Across Private Data Repositories
- Databases are kept separate for statutory, competitive, or security reasons; we want selective, minimal sharing on a need-to-know basis.
- Example: among those who took a particular drug, how many had an adverse reaction and DNA containing a specific sequence? Researchers must not learn anything beyond the counts.
- Algorithms compute joins and join counts while revealing minimal additional information.

Minimal Necessary Sharing
- Example: R = {a, u, v, x} and S = {b, u, v, y}. When computing R ∩ S = {u, v}, R must not learn that S has b and y, and S must not learn that R has a and x.
- When computing Count(R ∩ S), R and S learn nothing except that the result is 2.

Database Support for P3P
- P3P is a new W3C standard for encoding company privacy policies and user privacy preferences in XML, so that preferences can be matched programmatically against policies. This solves the problem that current policies are written by lawyers, for lawyers.
- Current implementations do the matching in the client (browser). Proposal: server-centric preference matching using relational databases, with an APPEL-to-SQL converter producing the matching result.
- Advantages:
  - Server-side matching is necessary for thin clients, e.g. mobile devices.
  - It sets up the necessary infrastructure for policy enforcement.
  - It provides companies with extra information for refining privacy policies.
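The server-centric matching proposal can be illustrated with a toy relational encoding. The schema, column names, and hand-written violation query below are hypothetical stand-ins for what an APPEL-to-SQL converter would generate, not the paper's actual design:

```python
import sqlite3

# Hypothetical schema: one row per P3P policy statement.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE policy (data_category TEXT, purpose TEXT, recipient TEXT)")
conn.executemany("INSERT INTO policy VALUES (?, ?, ?)", [
    ("contact_info", "order_fulfillment", "ours"),
    ("contact_info", "telemarketing", "ours"),
    ("clickstream", "web_analytics", "ours"),
])

# A preference such as "never use my contact info for telemarketing"
# becomes a query for violating policy statements; the preference is
# satisfied only when the query returns no rows.
violations = conn.execute(
    "SELECT data_category, purpose FROM policy "
    "WHERE data_category = 'contact_info' AND purpose = 'telemarketing'"
).fetchall()
accept = (len(violations) == 0)   # here: False, the policy is rejected
```

Doing this on the server reuses ordinary query processing for matching, which is the point of the proposal.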
- A prototype enables DB2 with P3P support:
  - It stores P3P policies in relational tables: a policy shredder populates the policy metadata.
  - It reuses database query technology for policy-preference matching: an APPEL preference is converted into a SQL query, which is run against the stored policy to produce the matching result.
  - It includes an algorithm for converting APPEL preferences into SQL queries.

Watermarking Relational Databases
- Goal: deter data theft and assert ownership of pirated copies. Examples: life sciences, electronic parts.
- A watermark is an intentionally introduced pattern in the data: very unlikely to occur by chance, and hard to find, hence hard to destroy (robust against malicious attacks).
- Existing watermarking techniques developed for multimedia (images, sound, text, ...) are not applicable to database tables:
  - rows in a table are unordered;
  - rows can be inserted, updated, and deleted;
  - attributes can be added and dropped.
- New algorithm for watermarking database tables:
  - Watermark insertion: (1) choose a secret key; (2) specify the table/attributes to be marked; (3) pseudo-randomly select a subset of the rows for marking, as a function of the secret key and attribute values.
  - Watermark detection: (1) specify the secret key; (2) specify the table/attributes which should contain marks; (3) identify the marked rows/attributes and compare the mark values with the expected marks; (4) confirm the presence or absence of the watermark.
- Detection requires neither the original unmarked data nor the original watermark, only the suspicious database.
- The watermark can be detected using only a subset of the rows and attributes of a table; it is robust against updates and incrementally updatable.

Closing Thoughts
- "The right to privacy: the most cherished of human freedoms." (Warren & Brandeis, 1890)
- "Code is law ... it is all a matter of code: the software and hardware that now rule." (L. Lessig)
- We can architect computing systems to protect values we believe are fundamental, or we can architect them to allow those values to disappear. What do we want to do as database researchers?

References
- R. Agrawal, A. Evfimievski, R. Srikant. Information Sharing Across Private Databases. ACM Int'l Conf. on Management of Data (SIGMOD), San Diego, California, June 2003.
- R. Agrawal, J. Kiernan, R. Srikant, Y. Xu. An XPath-Based Preference Language for P3P. 12th Int'l World Wide Web Conf. (WWW), Budapest, Hungary, May 2003.
- R. Agrawal, J. Kiernan, R. Srikant, Y. Xu. Implementing P3P Using Database Technology. 19th Int'l Conf. on Data Engineering (ICDE), Bangalore, India, March 2003.
- R. Agrawal, J. Kiernan, R. Srikant, Y. Xu. Server Centric P3P. W3C Workshop on the Future of P3P, Dulles, Virginia, November 2002.
- R. Agrawal, J. Kiernan, R. Srikant, Y. Xu. Hippocratic Databases. 28th Int'l Conf. on Very Large Databases (VLDB), Hong Kong, August 2002.
- R. Agrawal, J. Kiernan. Watermarking Relational Databases. 28th Int'l Conf. on Very Large Databases (VLDB), Hong Kong, August 2002.
- A. Evfimievski, R. Srikant, R. Agrawal, J. Gehrke. Mining Association Rules Over Privacy Preserving Data. 8th Int'l Conf. on Knowledge Discovery in Databases and Data Mining (KDD), Edmonton, Canada, July 2002.
- R. Agrawal, R. Srikant. Privacy Preserving Data Mining. ACM Int'l Conf. on Management of Data (SIGMOD), Dallas, Texas, May 2000.