Privacy Preserving Data Mining
Lecture 3: Non-Cryptographic Approaches for Preserving Privacy
(Based on slides by Kobbi Nissim)
Benny Pinkas, HP Labs, Israel
10th Estonian Winter School in Computer Science, March 3, 2005

Why not use cryptographic methods?
• Many users contribute data; we cannot require them to participate in a cryptographic protocol.
– In particular, we cannot require peer-to-peer communication between users.
• Cryptographic protocols incur considerable overhead.

Data Privacy
(Diagram: users access the database only through a data access mechanism; privacy is breached when that mechanism lets users learn private data.)

An Easy, Tempting (but Bad) Solution
• Idea:
a. Remove identifying information (name, SSN, …).
b. Publish the data.
• But 'harmless' attributes uniquely identify many patients (gender, age, approximate weight, ethnicity, marital status, …).
• Recall: DOB + gender + zip code identify people with high probability.
• Worse: 'rare' attributes (e.g., a disease with probability 1/3000).

What is Privacy?
• Intuition: privacy is breached if it is possible to compute someone's private information from his identity.
– E.g., a mapping Joe → {Joe's private data} should not be computable from query answers.
– The definition should take into account the adversary's power (computational power, number of queries, prior knowledge, …).
• Quite often it is much easier to say what is surely non-private:
– E.g., strong breaking of privacy: the adversary is able to retrieve (almost) everybody's private data.

The Data Privacy Game: an Information-Privacy Tradeoff
• Private functions: for each row x we want to hide π_x(DB) = d_x.
• Information functions: we want to reveal f(q, DB) for queries q.
• Here: an explicit definition of the private functions.
– The question: which information functions may be allowed?
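The "bad solution" above fails to a linkage attack, which takes only a few lines to sketch. The tables below are hypothetical (the names echo the slide's examples); the point is that the quasi-identifier (zip, DOB, gender) joins the "anonymized" table back to named records.

```python
# Toy linkage attack: an 'anonymized' medical table still joins with a
# public roster on the quasi-identifier (zip, dob, sex).
medical = [  # names removed, 'harmless' attributes kept
    {"zip": "10001", "dob": "1970-01-01", "sex": "F", "diagnosis": "flu"},
    {"zip": "10002", "dob": "1980-05-05", "sex": "M", "diagnosis": "HIV"},
]
roster = [  # public records with names (e.g., a voter list)
    {"name": "Ms. John", "zip": "10001", "dob": "1970-01-01", "sex": "F"},
    {"name": "Mr. Doe",  "zip": "10002", "dob": "1980-05-05", "sex": "M"},
]

qid = lambda r: (r["zip"], r["dob"], r["sex"])  # the quasi-identifier

# Re-identification: match each medical row to the unique roster row
# sharing its quasi-identifier.
reidentified = {p["name"]: m["diagnosis"]
                for m in medical for p in roster if qid(p) == qid(m)}
```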
• Different from crypto (secure function evaluation):
– There, we want to reveal f() (an explicit definition of the information function)
– and to hide every function π() not computable from f().
– The private functions are defined implicitly.
– The question whether f() itself should be revealed is not asked.

A Simplistic Model: the Statistical Database (SDB)
• Data: n private bits, d ∈ {0,1}^n (one bit per person: Mr. Fox, Ms. John, Mr. Doe, …).
• Query: a subset q ⊆ [n].
• Answer: the subset sum a_q = Σ_{i∈q} d_i.

Approaches to SDB Privacy
• Studied extensively since the 70s.
• Perturbation:
– Add randomness; give 'noisy' or 'approximate' answers.
– Techniques:
• Data perturbation (perturb the data, then answer queries as usual) [Reiss 84; Liew, Choi, Liew 85; Traub, Yemini, Wozniakowski 84], …
• Output perturbation (perturb the answers to queries) [Denning 80; Beck 80; Achugbue, Chin 79; Fellegi, Phillips 74], …
– Recent interest: [Agrawal, Srikant 00], [Agrawal, Aggarwal 01], …
• Query restriction:
– Answer queries accurately, but sometimes disallow them.
– Require queries to obey some structure [Dobkin, Jones, Lipton 79]; this restricts the number of queries.
– Auditing [Chin, Ozsoyoglu 82; Kleinberg, Papadimitriou, Raghavan 01].

Some Recent Privacy Definitions
X denotes the data, Y a (noisy) observation of X.
• [Agrawal, Srikant '00]: interval of confidence.
– Let Y = X + noise (e.g., noise uniform in [-100, 100]).
– Perturb the input data; the underlying distribution can still be estimated.
– Tradeoff: more noise means less accuracy but more privacy.
– Intuition: a large possible interval means privacy is preserved.
• Given Y, we know that with c% confidence X lies in [a1, a2]. For example, for Y = 200, with 50% confidence X is in [150, 250].
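The SDB model above is tiny enough to state directly in code; this is a minimal sketch of the interface, with made-up data:

```python
# Minimal statistical database: d in {0,1}^n, a query is a subset
# q of [n], and the unperturbed answer is the subset sum.
d = [0, 1, 1, 0, 1, 0, 1, 1]          # one private bit per person

def sdb_answer(d, q):
    """Answer a subset-sum query: a_q = sum of d_i over i in q."""
    return sum(d[i] for i in q)
```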
• a2 - a1 defines the amount of privacy at c% confidence.
– Problem: there might be a-priori information about X.
• E.g., X = someone's age and Y = -97.

The [AS] Scheme Can Be Turned Against Itself
• Assume the number of data points is large.
– Even if the data miner has no a-priori information about X, it can estimate the distribution of X from the randomized data Y.
• Suppose the perturbation is uniform in [-1, 1].
• [AS]: a privacy interval of width 2 with 100% confidence.
• But let f_X put probability 50% on x ∈ [0, 1] and 50% on x ∈ [4, 5].
• After learning f_X, the value of X can be localized within an interval of size at most 1.
– Problem: aggregate information provides information that can be used to attack individual data.

Some Recent Privacy Definitions (cont.)
X denotes the data, Y a (noisy) observation of X.
• [Agrawal, Aggarwal '01]: mutual information.
– Intuition: high conditional entropy is good.
• I(X;Y) = H(X) - H(X|Y) (the mutual information).
• Small I(X;Y) means privacy is preserved (Y provides little information about X).
• Problem [EGS]:
– This is an average notion: a privacy loss can occur with small but significant probability without noticeably affecting I(X;Y).
– So I(X;Y) may look good while privacy is breached.

Output Perturbation (the Randomization Approach)
• Exact answer to query q: a_q = Σ_{i∈q} d_i.
• Actual SDB answer: â_q.
• Perturbation E: for all q, |â_q - a_q| ≤ E.
• Questions:
– Does perturbation give any privacy?
– How much perturbation is needed for privacy?
– Usability?
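The attack on the [AS] scheme can be simulated directly. A sketch under the slide's assumptions: X is drawn from the bimodal prior (uniform on [0,1] or [4,5]), Y adds uniform noise in [-1,1], and an attacker who has estimated the prior intersects the confidence interval [Y-1, Y+1] with the prior's support:

```python
import random
random.seed(2)

# Bimodal prior from the slide: X uniform on [0,1] w.p. 1/2,
# else uniform on [4,5].
x = random.uniform(0, 1) if random.random() < 0.5 else random.uniform(4, 5)
y = x + random.uniform(-1, 1)            # [AS]-style noise, interval width 2

# Attacker: intersect [y-1, y+1] with the (learned) support of X.
support = [(0.0, 1.0), (4.0, 5.0)]
feasible = [(max(y - 1, a), min(y + 1, b)) for a, b in support]
feasible = [(a, b) for a, b in feasible if a < b]

# Because the two modes are more than 2 apart, at most one mode is
# consistent with y, so the feasible width is at most 1, not 2.
width = sum(b - a for a, b in feasible)
```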
Otherwise return â_q = a_q.

Privacy Preserved by √n·(log n)^2 Perturbation
• Database: d ∈_R {0,1}^n (a uniform input distribution!).
• Algorithm: on query q,
1. Let a_q = Σ_{i∈q} d_i.
2. If |a_q - |q|/2| < E, return â_q = |q|/2.
3. Otherwise return â_q = a_q.
• With E ≈ √n·(log n)^2, privacy is preserved:
– Assume poly(n) queries.
– If E ≈ √n·(log n)^2, then w.h.p. rule 2 is always used.
– No information about d is given! (But the database is completely useless…)
• This shows that perturbation of magnitude ≈ √n is sometimes enough for privacy. Can we do better?

Perturbation << √n Implies No Privacy
• The useless database above achieves the best possible perturbation:
• Theorem [Dinur, Nissim 03]: for any database and any response algorithm with perturbation E = o(√n), there is a poly-time reconstruction algorithm that outputs a database d' with dist(d, d') = o(n) — a strong breaking of privacy.

The Adversary as a Decoding Algorithm
(Diagram: the true partial sums a_{q1}, …, a_{qt} are 'encoded' by the perturbation into â_{q1}, …, â_{qt}; the adversary 'decodes' the perturbed sums back into a database d'.)

Proof of the Theorem [DN03]: the Reconstruction Algorithm
• Query phase: get â_{qj} for t random subsets q1, …, qt.
• Weeding phase: solve the linear program (over the reals):
– 0 ≤ x_i ≤ 1,
– |Σ_{i∈qj} x_i - â_{qj}| ≤ E for every j.
• Rounding: let c_i = round(x_i); output c.
• Observation: a solution always exists, e.g., x = d.
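The weeding idea can be demonstrated at toy scale. This sketch replaces the LP relaxation with exhaustive search over {0,1}^n (feasible only for tiny n; the real [DN03] algorithm solves the LP and runs in polynomial time), and the sizes here are illustrative rather than in the theorem's asymptotic regime:

```python
import itertools
import random

random.seed(0)
n, t, E = 10, 200, 1          # toy sizes; the theorem needs E = o(sqrt(n))

d = [random.randint(0, 1) for _ in range(n)]
queries = [[i for i in range(n) if random.random() < 0.5] for _ in range(t)]

def answer(x, q):
    return sum(x[i] for i in q)

# Perturbed answers: each within E of the true subset sum.
noisy = [answer(d, q) + random.choice([-E, 0, E]) for q in queries]

# 'Weeding': keep every candidate database consistent with all noisy
# answers. The true d always survives, since its answers are within E.
survivors = [x for x in itertools.product([0, 1], repeat=n)
             if all(abs(answer(x, q) - a) <= E for q, a in zip(queries, noisy))]

# Every survivor is close to d: far candidates are disqualified by some
# query, which is the strong breaking of privacy in miniature.
worst = max(sum(a != b for a, b in zip(x, d)) for x in survivors)
```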
Summary of Results (statistical databases)
• Small DB, unlimited adversary:
– Perturbation of magnitude Ω(n) is required.
• Medium DB, polynomial-time adversary [Dinur, Nissim 03]:
– Perturbation of magnitude Ω(√n) is required (shown above).
– In both cases the adversary may reconstruct a good approximation of the database, which rules out even very weak notions of privacy.
• Large DB, bounded adversary restricted to T << n queries (SuLQ):
– There is a privacy-preserving access mechanism with perturbation << √T.
– A chance for usability.
– A reasonable model as databases grow larger and larger.

SuLQ for Multi-Attribute Statistical Databases (SDB)
• Database {d_{i,j}}: n persons (rows), k attributes (columns); row distribution D = (D1, D2, …, Dn).
• Query (q, f): q ⊆ [n] and f : {0,1}^k → {0,1}.
• Answer: a_{q,f} = Σ_{i∈q} f(d_i).

Privacy and Usability Concerns for the Multi-Attribute Model [DN]
• A rich set of queries: subset sums over any property of the k attributes.
– This obviously increases usability, but how is privacy affected?
• More to protect: functions of the k attributes.
• Relevant factors:
– What is the adversary's goal?
– Row dependency.
• Vertically split data (between k or fewer databases):
– Can privacy still be maintained with independently operating databases?
Privacy Definition: Intuition
• A 3-phase adversary:
– Phase 0: defines a target set G of poly(n) functions g : {0,1}^k → {0,1}.
• It will try to learn some of this information about someone.
– Phase 1: adaptively queries the database T = o(n) times.
– Phase 2: using all the information gained, chooses an index i of a row it intends to attack and a function g ∈ G.
• The attack: given d_{-i} (all rows but the i-th), try to guess g(d_{i,1}, …, d_{i,k}).

The Privacy Definition
• p0_{i,g}: the a-priori probability that g(d_{i,1}, …, d_{i,k}) = 1.
• pT_{i,g}: the a-posteriori probability that g(d_{i,1}, …, d_{i,k}) = 1,
– given the answers to the T queries and d_{-i}.
• Define conf(p) = log(p / (1-p)):
– a 1-1 relationship between p and conf(p);
– conf(1/2) = 0; conf(2/3) = 1; conf(1) = ∞.
• Δconf_{i,g} = conf(pT_{i,g}) - conf(p0_{i,g}).
• (ε, T)-privacy ("relative privacy"):
– For all distributions D1, …, Dn, every row i, every function g, and any adversary making at most T queries: Pr[Δconf_{i,g} > ε] = neg(n).

The SuLQ* Database
• The adversary is restricted to T << n queries.
• On a query (q, f), with q ⊆ [n] and f : {0,1}^k → {0,1} a binary function:
– Let a_{q,f} = Σ_{i∈q} f(d_{i,1}, …, d_{i,k}).
– Let N be zero-mean binomial noise of magnitude ≈ √T.
– Return a_{q,f} + N.
• (*SuLQ: Sub-Linear Queries.)

Privacy Analysis of the SuLQ Database
• pm_{i,g}: the a-posteriori probability that g(d_{i,1}, …, d_{i,k}) = 1,
– given d_{-i} and the answers to the first m queries.
• conf(pm_{i,g}) describes a random walk on the line with:
– starting point conf(p0_{i,g});
– compromise when conf(pm_{i,g}) > conf(p0_{i,g}) + ε.
• W.h.p. more than T steps are needed to reach compromise.

Usability: One Multi-Attribute SuLQ DB
• Statistics of any property f of the k attributes:
– I.e., for what fraction of the (sub)population does f(d_1, …, d_k) hold?
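A minimal sketch of the SuLQ answer mechanism, under the assumption (the slide's noise expression is garbled) that the noise is a centered binomial with standard deviation on the order of √T; the rows, the query, and the predicate are made up for illustration:

```python
import random

random.seed(3)
T = 10_000                 # query budget; noise magnitude ~ sqrt(T)
k, n = 4, 20_000

# Hypothetical database: n persons, k Boolean attributes each.
rows = [tuple(random.randint(0, 1) for _ in range(k)) for _ in range(n)]

def sulq_answer(q, f):
    """Answer a SuLQ query (q, f): the true count of rows in q
    satisfying f, plus zero-mean binomial noise of std ~ sqrt(T)/2.
    A sketch of the mechanism, not the exact scheme from the paper."""
    true = sum(f(rows[i]) for i in q)
    noise = sum(random.randint(0, 1) for _ in range(T)) - T // 2
    return true + noise

# Any Boolean property f of the k attributes can be queried:
q = range(10_000)
f = lambda row: row[0] & row[2]
noisy = sulq_answer(q, f)
true = sum(f(rows[i]) for i in q)
```

For a subpopulation of 10,000 rows the true count is in the thousands, while the noise is on the order of √T = 100, so aggregate statistics stay usable.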
– Easy: just put f in the query.
– Other applications:
• k independent multi-attribute SuLQ DBs.
• Vertically partitioned SuLQ DBs.
• Testing whether Pr[β | α] ≥ Pr[β] + ε.
– Caveat: we hide g() about a specific row (not about multiple rows).

Overview of Methods
• Input perturbation: the data itself is perturbed (SDB → SDB'), and queries are answered from the perturbed database.
• Output perturbation: the (possibly restricted) query is evaluated on the real database, but the response is perturbed.
• Query restriction: the (restricted) query receives an exact response or a denial.

Query Restriction
• The decision whether to answer or deny a query:
– can be based on the content of the query and on the answers to previous queries,
– or can be based on the above together with the content of the database.

Auditing
• [AW89] classify auditing as a query restriction method:
– "Auditing of an SDB involves keeping up-to-date logs of all queries made by each user (not the data involved) and constantly checking for possible compromise whenever a new query is issued."
• Partial motivation: auditing may allow more queries to be posed, as long as no privacy threat arises.
• Early work: Hofmann 1977; Schlorer 1976; Chin, Ozsoyoglu 1981, 1986.
• Recent interest: Kleinberg, Papadimitriou, Raghavan 2000; Li, Wang, Wang, Jajodia 2002; Jonsson, Krokhin 2003.

How Auditors May Inadvertently Compromise Privacy

The Setting
• Statistical database: a dataset d = {d1, …, dn}.
– Entries d_i: real, integer, or Boolean.
• Query: q = (f, i1, …, ik), answered by f(d_{i1}, …, d_{ik}),
– where f is Min, Max, Median, Sum, Average, Count, …
• Bad users will try to breach the privacy of individuals.
• Compromise: uniquely determining some d_i (a very weak definition).

Auditing
• The auditor keeps a log of the queries q1, …, qi posed to the statistical database. On each new query q_{i+1} it either returns the answer or denies the query (when the answer would cause a privacy loss).

Example 1: Sum/Max Auditing
• d_i real; sum/max queries; privacy is breached if some d_i is learned.
• q1 = sum(d1, d2, d3). Answered: 15.
• q2 = max(d1, d2, d3). Denied (the answer would cause a privacy loss).
• But q2 is denied iff d1 = d2 = d3 = 5, so the denial itself is revealing: "There must be a reason for the denial… Oh well, d1 = d2 = d3 = 5. I win!"

Sounds Familiar?
David Duncan, former auditor for Enron and partner in Andersen: "Mr. Chairman, I would like to answer the committee's questions, but on the advice of my counsel I respectfully decline to answer the question based on the protection afforded me under the Constitution of the United States."
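The sum/max example can be replayed in a few lines; the data values are the slide's:

```python
# Toy version of the sum/max example: the auditor's denial rule depends
# on the data, so the denial itself leaks.
d = [5, 5, 5]                  # secret real entries
s = sum(d)                     # q1 = sum(d1,d2,d3) is answered: 15

# q2 = max(d1,d2,d3): a content-aware auditor denies exactly when the
# answer would pin an entry down, i.e. when all entries are equal
# (then max = s/3 equals every d_i).
denied = d[0] == d[1] == d[2]

# Attacker's reasoning: 'denied' happens iff d1 = d2 = d3 = s/3.
leaked = s / 3 if denied else None
```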
Max Auditing
• d_i real; database d1, d2, …, dn.
• q1 = max(d1, d2, d3, d4): answered, M1234.
• q2 = max(d1, d2, d3): answered (M123) or denied.
– If denied: d4 = M1234.
• Next, q3 = max(d1, d2): answered (M12) or denied.
– If denied: d3 = M123.
• In this way the attacker learns an item with probability 1/2.

Boolean Auditing?
• d_i Boolean.
• q1 = sum(d1, d2): answered (1) or denied; q2 = sum(d2, d3): answered (1) or denied; …
• q_i is denied iff d_i = d_{i+1}, so the denial pattern reveals the whole database (up to complement).

The Problem
• Query denials leak (potentially sensitive) information.
• Users cannot decide denials by themselves.
• (Picture: among all possible assignments to {d1, …, dn}, the denial of q_{i+1} further narrows the set of assignments consistent with (q1, …, qi, a1, …, ai).)

Solution: Simulatable Auditing
• An auditor is simulatable if there exists a simulator that, given only q1, …, qi, a1, …, ai and q_{i+1} (but not the database), makes the same deny/answer decision as the auditor.
• Simulation means that denials do not leak information.

Why Do Simulatable Auditors Not Leak Information?
• Whether q_{i+1} is denied or allowed does not further shrink the set of assignments to {d1, …, dn} consistent with (q1, …, qi, a1, …, ai).

Query Restriction for Sum Queries
• Given: a dataset D = {x1, …, xn}, x_i real. A query specifies a subset S; the answer is Σ_{xi∈S} x_i.
• Is it possible to compromise D?
– Here compromise means uniquely determining some x_i from the queries.
• Compromise is trivial if subsets may be arbitrarily small: sum(x9) = x9.

Query Set Size Control
• Do not permit queries that involve a small subset of the database.
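The Boolean auditing attack is fully mechanical, so it makes a good code sketch. A naive, content-aware auditor denies sum(d_i, d_{i+1}) exactly when the answer (0 or 2) would pin down both bits; the attacker then reads the database off the denial pattern (the example database is made up):

```python
def audit_sum2(d, i):
    """Naive auditor for q = sum(d[i], d[i+1]): denies when the sum
    (0 or 2) would reveal both bits, i.e. exactly when d[i] == d[i+1].
    Returns None for a denial, the true sum otherwise."""
    return None if d[i] == d[i+1] else d[i] + d[i+1]

d = [1, 0, 0, 1, 1, 1, 0]      # secret Boolean database

# Attacker: the denial pattern alone gives the equality pattern of
# neighbours, hence the whole database up to complement.
equal = [audit_sum2(d, i) is None for i in range(len(d) - 1)]
guess = [0]                    # guess d1 = 0; the complement is the other case
for e in equal:
    guess.append(guess[-1] if e else 1 - guess[-1])
```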
• Compromise is still possible:
– To discover x: sum(x, y1, …, yk) - sum(y1, …, yk) = x.
• Issue: overlap between queries.
• In general, restricting overlap alone is not enough:
– We also need to restrict the number of queries.
– Note that the overlap restriction itself sometimes restricts the number of queries (e.g., if the query size is cn and the overlap is constant, only about 1/c queries are possible).

Restricting Set-Sum Queries
• Restrict the sum queries based on:
– the number of database elements in the sum,
– the overlap with previous sum queries,
– the total number of queries.
• Note that these criteria are known to the user:
– They do not depend on the contents of the database.
• Therefore the user can simulate the deny/answer decision of the DB:
– This is simulatable auditing.

Restricting Overlap and Number of Queries
• Assume:
– each query satisfies |Q_i| ≥ k,
– any two queries satisfy |Q_i ∩ Q_j| ≤ r,
– the adversary knows at most L values a-priori, with L + 1 < k.
• Claim: the data cannot be compromised with fewer than 1 + (2k - L)/r sum queries.
• (Picture: the queries Q1, …, Qt as rows of a 0/1 incidence matrix over x1, …, xn, each row with at least k ones and pairwise overlap at most r.)
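The tracker attack above (sum(x, y1, …, yk) - sum(y1, …, yk) = x) defeats query-set-size control on its own; the data and the padding set here are hypothetical:

```python
def sum_query(data, idxs, k=3):
    """Sum query with query-set-size control: refuse subsets smaller than k."""
    if len(idxs) < k:
        raise ValueError("query set too small")
    return sum(data[i] for i in idxs)

# Both queries respect the size limit, yet their difference isolates
# data[0], exactly as on the slide.
data = [7, 3, 5, 2, 9, 4]
pad = [1, 2, 3]                          # the y's: any k known companions
x = sum_query(data, [0] + pad) - sum_query(data, pad)
```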
Overlap + Number of Queries
• Claim [Dobkin, Jones, Lipton; Reiss]: the data cannot be compromised with fewer than 1 + (2k - L)/r sum queries
– (k a lower bound on query size, r an upper bound on overlap, L the number of a-priori known items).
• Suppose x_c is compromised after t queries, where query i is represented by Q_i = x_{i1} + x_{i2} + … + x_{ik}.
• This implies that for some coefficients α_i:
– x_c = Σ_{i=1..t} α_i Q_i = Σ_{i=1..t} α_i Σ_{j=1..k} x_{ij}.
– Let γ_{il} = 1 if x_l appears in query i, and 0 otherwise; then
– x_c = Σ_{i=1..t} α_i Σ_{l=1..n} γ_{il} x_l = Σ_{l=1..n} (Σ_{i=1..t} α_i γ_{il}) x_l.

Overlap + Number of Queries (cont.)
• We have x_c = Σ_{l=1..n} (Σ_{i=1..t} α_i γ_{il}) x_l.
• In this sum, Σ_{i=1..t} α_i γ_{il} must be 0 for every x_l except x_c (in order for x_c to be compromised).
• For a given l this happens iff γ_{il} = 0 for all i, or there are queries i ≠ j with γ_{il} = γ_{jl} = 1 and α_i, α_j of opposite signs
– (or α_i = 0, in which case the i-th query did not matter).

Overlap + Number of Queries (cont.)
• W.l.o.g. the first query contains x_c, and the second query has the opposite sign.
• The first query probes k elements.
• The second query adds at least k - r new elements.
• Elements from the first and second queries cannot be canceled within the same additional query (that would require opposite signs in one query).
• Therefore each new query cancels items from the first query or from the second, but not from both.
• In total 2k - r - L elements must be canceled, and each query cancels at most r of them:
– so at least 2 + (2k - r - L)/r = 1 + (2k - L)/r queries are needed.

Notes
• The number of queries satisfying |Q_i| ≥ k and |Q_i ∩ Q_j| ≤ r is small:
– If k = n/c for some constant c and r is a constant, then there are only about c queries in which no two overlap by more than a constant; hence the query sequence may be uncomfortably short.
– If instead r = k/c (the overlap is a constant fraction of the query size), then the number of allowed queries, 1 + (2k - L)/r, is O(c).
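The counting argument above condenses to two lines; this restates the slide's derivation in display form:

```latex
x_c \;=\; \sum_{l=1}^{n}\Bigl(\sum_{i=1}^{t}\alpha_i\,\gamma_{il}\Bigr)x_l ,
\qquad\text{so every } x_l \neq x_c \text{ must cancel or be known a priori,}
\]
\[
t \;\ge\; 2 + \frac{2k - r - L}{r} \;=\; 1 + \frac{2k - L}{r}.
```

The first two queries introduce at least 2k - r distinct elements, L of which may be known to the adversary; each subsequent query cancels at most r of the remaining 2k - r - L, giving the bound.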
Conclusions
• Privacy should be defined and analyzed rigorously:
– In particular, assuming that randomization implies privacy is dangerous.
• High perturbation is needed for privacy against polynomial adversaries:
– A threshold phenomenon: above √n, total privacy; below √n, no privacy (for a poly-time adversary).
– The main tool: a reconstruction algorithm.
• Careless auditing might leak private information.
• Self-auditing (simulatable auditors) is safe:
– The decision whether to allow a query is based on previous 'good' queries and their answers,
– without access to the DB contents,
– so users may apply the decision procedure by themselves.

ToDo
• Come up with a good model and requirements for database privacy:
– Learn from crypto.
– Protect against more general losses of privacy.
• Simulatable auditors are a starting point for designing more reasonable audit mechanisms.

References
• Course web pages:
– A Study of Perturbation Techniques for Data Privacy, Cynthia Dwork, Nina Mishra, and Kobbi Nissim, http://theory.stanford.edu/~nmishra/cs369-2004.html
– Privacy and Databases, http://theory.stanford.edu/~rajeev/privacy.html

Foundations of CS at the Weizmann Institute
• Faculty: Uri Feige, Oded Goldreich, Shafi Goldwasser, David Harel, Moni Naor, David Peleg, Amir Pnueli, Ran Raz, Omer Reingold, Adi Shamir (crypto researchers highlighted).
• All students receive a fellowship.
• The language of instruction is English.