Foundations of Privacy
Lecture 5
Lecturer: Moni Naor

Desirable Properties from a Sanitization Mechanism
• Composability
  – Applying the sanitization several times yields a graceful degradation
  – Will see: t releases, each ε-DP, are tε-DP
  – Next class: roughly (√t·ε + tε², δ)-DP
• Robustness to side information
  – No need to specify exactly what the adversary knows:
    the adversary may know everything except one row
Differential privacy satisfies both.

Differential Privacy [Dwork, McSherry, Nissim & Smith 2006]
Protect individual participants: the curator/sanitizer M is applied to the database with or without my row.
• The probability of every bad event (indeed, of any event) increases only by a small multiplicative factor when I enter the DB, so I may as well participate in the DB.
• Adjacency: D+I and D−I (databases differing in one user).
• ε-differentially private sanitizer M: for all DBs D, all individuals I, and all events T,
  e^{−ε} ≤ Pr[M(D+I) ∈ T] / Pr[M(D−I) ∈ T] ≤ e^ε ≈ 1+ε
• Handles auxiliary input.

Differential Privacy
Sanitizer M gives ε-differential privacy if for all adjacent D1 and D2, and all A ⊆ range(M):
  Pr[M(D1) ∈ A] ≤ e^ε · Pr[M(D2) ∈ A]
The ratio of Pr[response] is bounded for every (bad) set of responses: participation in the data set poses no additional risk.

Example of Differential Privacy
• X is a set of (name, tag ∈ {0,1}) tuples
• One query: # of participants with tag = 1
• Sanitizer: output # of 1's + noise
  – noise drawn from the Laplace distribution with parameter 1/ε
  – Pr[noise = k−1] ≈ e^ε · Pr[noise = k]
  – (a short code sketch of this sanitizer appears below)

(ε, δ)-Differential Privacy
Sanitizer M gives (ε, δ)-differential privacy if for all adjacent D1 and D2, and all A ⊆ range(M):
  Pr[M(D1) ∈ A] ≤ e^ε · Pr[M(D2) ∈ A] + δ
Typical setting: ε = 1/10 and δ negligible. This course: δ negligible.

Example: NO Differential Privacy
• U set of (name, tag ∈ {0,1}) tuples
• One counting query: # of participants with tag = 1
• Sanitizer A: choose and release a few random tags
• Bad event T: only my tag is 1, and my tag is released
  – Pr[A(D+Me) ∈ T] ≥ 1/n, while Pr[A(D−Me) ∈ T] = 0
  – For ε-DP we would need Pr[A(D+Me) ∈ T] ≤ e^ε · Pr[A(D−Me) ∈ T], impossible when the right-hand side is 0
  – Not ε-differentially private for any ε!
  – It is (0, 1/n)-differentially private

Counting Queries
• Database x of size n: n individuals, each contributing a single point in U
• Counting queries: Q is a set of predicates q: U → {0,1}
  – Query: how many participants in x satisfy q? (Sometimes we talk about the fraction.)
• Relaxed accuracy: answer each query within α additive error w.h.p.
  – Not so bad: some error is anyway inherent in statistical analysis

Bound on Achievable Privacy
Want to get bounds on:
• Accuracy α
  – The responses from the mechanism to all queries are within α, except with probability β
• The number of queries t for which we can receive accurate answers
• The privacy parameter ε for which ε-differential privacy is achievable
  – Or for which (ε, δ)-differential privacy is achievable

Blatant Non-Privacy
Mechanism M is blatantly non-private if there is an adversary A that, on any database D of size n, can select queries and use the responses M(D) to reconstruct D' such that ||D − D'||₁ ∈ o(n), i.e., D' agrees with D in all but o(n) of the entries.
Claim: blatant non-privacy implies that M is not (ε, δ)-DP for any constant ε (with negligible δ).

Sanitization Can't Be Too Accurate
• Usual counting queries
  – Query: q ⊆ [n]; true answer: Σ_{i∈q} d_i
  – Response = answer + noise
• Blatant non-privacy: the adversary guesses 99% of the bits
Theorem: If all responses are within o(n) of the true answer, then the algorithm is blatantly non-private.
But: this requires an exponential number of queries.
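Before moving on to the reconstruction attacks, here is a minimal sketch, in Python with NumPy, of the Laplace counting-query sanitizer from the example above, together with an empirical check of the e^ε bound in the definition. The function name `laplace_count`, the toy databases, and the event T are my own illustrative choices, not part of the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)
eps = 0.1
n_samples = 200_000

def laplace_count(tags, size):
    """Noisy counting query: (# of 1-tags) + Lap(1/eps), vectorized over `size` runs."""
    return tags.sum() + rng.laplace(scale=1.0 / eps, size=size)

# Two adjacent databases: they differ only in my row (my tag is 1).
D_plus_me  = np.array([1, 0, 1, 1, 0, 1])
D_minus_me = np.array([0, 0, 1, 1, 0, 1])

# Empirical check of the definition: for the event T = {output > 4},
# Pr[M(D+Me) in T] should be within a factor e^eps of Pr[M(D-Me) in T].
p_plus  = np.mean(laplace_count(D_plus_me,  n_samples) > 4.0)
p_minus = np.mean(laplace_count(D_minus_me, n_samples) > 4.0)
print("empirical ratio:", p_plus / p_minus, "  e^eps:", np.exp(eps))
```

With these toy numbers the true counts are 4 and 3, so the analytic ratio for this particular event is exactly e^ε ≈ 1.105; the empirical ratio should land near that value, which is the worst case the definition allows.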
Proof: Exponential Adversary
• Focus on the column containing the super-private bit: "the database" is a vector d ∈ {0,1}^n.
• Assume all answers are within error bound E. We will show that E cannot be o(n).

Proof: Exponential Adversary for Blatant Non-Privacy
• Estimate the # of 1's in all possible sets
  – ∀ S ⊆ [n]: |M(S) − Σ_{i∈S} d_i| ≤ E, where M(S) is the answer on S
• Weed out "distant" DBs
  – For each candidate database c ∈ {0,1}^n: if for any S ⊆ [n] we have |Σ_{i∈S} c_i − M(S)| > E, then rule out c.
  – If c is not ruled out, halt and output c.
• Claim: the real database d won't be ruled out.

Proof: Exponential Adversary
• Assume ∀ S ⊆ [n]: |M(S) − Σ_{i∈S} d_i| ≤ E.
• Claim: for any c that has not been ruled out, the Hamming distance between c and d is at most 4E.
  – Let S0 = {i : d_i = 0} and S1 = {i : d_i = 1}.
  – On S0: |M(S0) − Σ_{i∈S0} d_i| ≤ E and |M(S0) − Σ_{i∈S0} c_i| ≤ E (c not ruled out), so Σ_{i∈S0} c_i ≤ 2E; c disagrees with d on at most 2E positions of S0.
  – On S1: symmetrically, c disagrees with d on at most 2E positions of S1.
  – Total: at most 4E disagreements. Since E ∈ o(n), the output is o(n)-close to d, i.e., blatant non-privacy.

Impossibility with Exponential Queries
The result means that we cannot sanitize the data and publish a data structure so that, for all queries, the answer can be deduced correctly to within o(n).
On the other hand: we will see that we can get accuracy that grows only with log |Q|.
[Figure: the database and sanitizer answering query 1, query 2, ...]

What Can We Do Efficiently?
We allowed "too" much power to the adversary:
• Number of queries: exponential
• Computation: exponential
• On the other hand: no wild errors in the responses were assumed
Theorem: For any sanitization algorithm, if all responses are within o(√n) of the true answer, then it is blatantly non-private, even against a polynomial-time adversary making O(n log² n) random queries.

The Model
• As before: the database d is a bit string of length n.
• Counting queries:
  – A query is a subset q ⊆ {1, …, n}
  – The (exact) answer is a_q = Σ_{i∈q} d_i
• E-perturbation: for each answer, respond with a_q ± E.

What If We Had Exact Answers?
• Consider a mechanism with 0-perturbation: we receive the exact answer a_q = Σ_{i∈q} d_i.
• Then with n linearly independent queries (over the reals) we could reconstruct d precisely:
  – Obtain n linear equations a_q = Σ_{i∈q} c_i and solve uniquely.
  – A solution must exist: d itself.
• With E-perturbations we only get inequalities: a_q − E ≤ Σ_{i∈q} c_i ≤ a_q + E.
• Idea: use linear programming.

Privacy Requires Ω(√n) Perturbation
Suppose the perturbation E is o(√n).
• The adversary makes t = n log² n random queries q_j, getting noisy answers a_j.
• Privacy-violating algorithm: construct a database c = {c_i}, 1 ≤ i ≤ n, by solving the linear program
  – 0 ≤ c_i ≤ 1 for 1 ≤ i ≤ n
  – a_j − E ≤ Σ_{i∈q_j} c_i ≤ a_j + E for 1 ≤ j ≤ t
  – A solution must exist: d itself.
• Round the solution: if c_i > 1/2, set it to 1, and to 0 otherwise.
• For every query q_j, the answer according to c is at most 2E far from the (real) answer in d.
(A code sketch of this attack appears right after this part.)

Bad Solutions to the LP Do Not Survive
A query q disqualifies a potential database c ∈ [0,1]^n if its answer on q is more than 2E far from the answer in d: |Σ_{i∈q} c_i − Σ_{i∈q} d_i| > 2E.
• Idea: show that for a database c that is far away from d, a random query disqualifies c with some constant probability.
• Want to use the union bound: all far-away solutions are disqualified w.p. at least 1 − n^n (1 − γ)^t = 1 − neg(n).
• How do we limit the solution space? Round each value to the closest multiple of 1/n (details on the next slides).
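As a concrete, entirely illustrative companion to the attack above, the following Python sketch assumes NumPy and SciPy and uses made-up parameters (n = 200, E ≈ √n/3, t ≈ n log² n): it issues random subset queries, receives E-perturbed answers, solves the feasibility LP, and rounds.

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(1)

n = 200                                    # database size
d = rng.integers(0, 2, size=n)             # the secret bit vector ("the database")
E = int(np.sqrt(n) / 3)                    # perturbation magnitude, o(sqrt(n))
t = n * int(np.log(n) ** 2)                # roughly n log^2 n random queries

# Random subset queries (row j is the indicator vector of q_j) and E-perturbed answers.
Q = rng.integers(0, 2, size=(t, n))
noise = rng.integers(-E, E + 1, size=t)    # any perturbation with |noise| <= E
a = Q @ d + noise

# Feasibility LP: 0 <= c_i <= 1 and a_j - E <= <q_j, c> <= a_j + E for every j.
A_ub = np.vstack([Q, -Q])
b_ub = np.concatenate([a + E, -(a - E)])
res = linprog(c=np.zeros(n), A_ub=A_ub, b_ub=b_ub,
              bounds=[(0, 1)] * n, method="highs")
assert res.success                         # d itself is feasible, so the LP is solvable

# Round each coordinate to the nearer of {0, 1} and compare with the real database.
c_hat = (res.x > 0.5).astype(int)
print("fraction of entries recovered:", np.mean(c_hat == d))
```

On typical runs the rounded solution agrees with d on almost every entry, which is exactly the blatant non-privacy promised by the theorem; shrinking E or increasing t only makes the reconstruction better.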
Privacy Requires Ω(√n) Perturbation
A query q disqualifies a potential database c ∈ [0,1]^n if its answer on q is more than 2E far from the answer in d.
Lemma: if c is far away from d, then a random query disqualifies c with some constant probability.
• If Pr_{i∈[n]}[|d_i − c_i| ≥ 1/3] > β, then there is a γ > 0 such that Pr_{q∈{0,1}^n}[|Σ_{i∈q}(c_i − d_i)| ≥ 2E+1] > γ.
• The proof uses Azuma's inequality.

Privacy Requires Ω(√n) Perturbation
We can discretize all potential databases c ∈ [0,1]^n:
• Round each entry c_i to the closest fraction with denominator n: |c_i − w_i/n| ≤ 1/n.
• The response on any q then changes by at most 1.
• If we disqualify all "discrete" databases, we also effectively eliminate all c ∈ [0,1]^n.
• There are n^n "discrete" databases.

Privacy Requires Ω(√n) Perturbation
A query q disqualifies a potential database c ∈ [0,1]^n if its answer on q is more than 2E far from the answer in d.
Claim: if c is far away from d (measured by counting the entries far from the corresponding entries of d), then a random query disqualifies c with some constant probability.
• Therefore t = n log² n queries leave only a negligible survival probability for each far-away reconstruction.
• Union bound, applicable thanks to the discretization: all far-away suggestions are disqualified w.p. at least 1 − n^n (1 − γ)^t = 1 − neg(n).

Review and Conclusion
• When the perturbation is o(√n), choosing Õ(n) random queries gives enough information to efficiently reconstruct an o(n)-close database.
• The database is reconstructed using linear programming, in polynomial time.
• Mechanisms with o(√n) perturbation are blatantly non-private, and the reconstruction runs in poly(n) time.

Composition
Suppose we are going to apply a DP mechanism t times, perhaps on different databases. We want to argue that the result is differentially private.
• A value b ∈ {0,1} is chosen.
• In each of the t rounds, adversary A picks two adjacent databases D_i^0 and D_i^1 and receives the result z_i of an ε-DP mechanism M_i on D_i^b.
• Want to argue that A's view is distributed nearly identically for both values of b.
• A's view: (z_1, z_2, …, z_t) plus the randomness used.

Differential Privacy: Composition
Composes naturally, and handles auxiliary information.
• Suppose A_1(D) is ε_1-DP and, for all z_1, A_2(D, z_1) is ε_2-DP. Then A_2(D, A_1(D)) is (ε_1+ε_2)-DP.
• Notation: P[z_1] = Pr_{z∼A_1(D)}[z = z_1], P'[z_1] = Pr_{z∼A_1(D')}[z = z_1], P[z_2] = Pr_{z∼A_2(D,z_1)}[z = z_2], P'[z_2] = Pr_{z∼A_2(D',z_1)}[z = z_2].
• Proof: for all adjacent D, D' and all (z_1, z_2):
  – e^{−ε_1} ≤ P[z_1] / P'[z_1] ≤ e^{ε_1}
  – e^{−ε_2} ≤ P[z_2] / P'[z_2] ≤ e^{ε_2}
  – Hence e^{−(ε_1+ε_2)} ≤ P[(z_1, z_2)] / P'[(z_1, z_2)] ≤ e^{ε_1+ε_2}

Differential Privacy: Composition
• If all mechanisms M_i are ε-DP, then for any view, the probabilities that A gets that view when b = 0 and when b = 1 are within a factor of e^{tε}.
• Therefore results for a single query translate to results on several queries.
A small numerical illustration of this composition bound follows below.
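Here is a small numerical illustration of the composition bound above, in Python with NumPy; the databases, ε, and t are made-up values for demonstration. Each of the t releases is an ε-DP Laplace count, so the log-likelihood ratio of the joint view under two adjacent databases is at most tε.

```python
import numpy as np

rng = np.random.default_rng(2)
eps, t = 0.1, 20
b = 1.0 / eps                                 # Laplace scale

D  = np.array([1, 0, 1, 1, 0, 1, 0, 0])       # adjacent databases:
D_ = np.array([1, 0, 1, 1, 0, 1, 0, 1])       # they differ in the last row only

def count(db):                                # counting query, sensitivity 1
    return db.sum()

def log_density(z, true_answer):              # log of the Lap(1/eps) density at z
    return -np.abs(z - true_answer) * eps - np.log(2 * b)

# Run the mechanism t times on D, then compare the joint densities under D and D'.
outputs = count(D) + rng.laplace(scale=b, size=t)
log_ratio = np.sum(log_density(outputs, count(D)) - log_density(outputs, count(D_)))
print(log_ratio, "<=", t * eps)               # bounded by t*eps, as composition promises
```

The printed log-ratio is deterministically at most tε = 2 in this setting (each release contributes at most ε), and on typical runs it is much smaller.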
Answering a Single Counting Query
• U set of (name, tag ∈ {0,1}) tuples
• One counting query: # of participants with tag = 1
• Sanitizer A: output # of 1's + noise
• Differentially private, if we choose the noise properly: choose the noise from the Laplace distribution.

Laplacian Noise
• The Laplace distribution Y = Lap(b) has density Pr[Y = y] = (1/2b) · e^{−|y|/b}.
• Standard deviation: O(b).
• Take b = 1/ε; then Pr[Y = y] ∝ e^{−ε|y|}.
[Figure: the Laplace density]

Laplacian Noise: ε-Privacy
• Take b = 1/ε, so Pr[Y = y] ∝ e^{−ε|y|}.
• Release: q(D) + Lap(1/ε).
• For adjacent D, D': |q(D) − q(D')| ≤ 1.
• For every output a: e^{−ε} ≤ Pr_D[a] / Pr_{D'}[a] ≤ e^ε.
Theorem: the Laplace mechanism with parameter b = 1/ε is ε-differentially private.

Laplacian Noise: Õ(1/ε)-Error
• Take b = 1/ε, so Pr[Y = y] ∝ e^{−ε|y|}.
• Concentration of the Laplace distribution: Pr_{y∼Y}[|y| > k · 1/ε] = O(e^{−k}).
• Setting k = O(log n): the expected error is 1/ε, and w.h.p. the error is Õ(1/ε).
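Finally, a quick empirical check, again a sketch of my own in Python with NumPy, of the concentration fact quoted above: for Y = Lap(1/ε), the tail Pr[|Y| > k/ε] decays like e^{−k}, so taking k = O(log n) keeps the error at Õ(1/ε) with high probability.

```python
import numpy as np

rng = np.random.default_rng(3)
eps = 0.5
noise = rng.laplace(scale=1.0 / eps, size=1_000_000)   # samples of Lap(1/eps)

# Compare the empirical tail Pr[|Y| > k/eps] with e^{-k} for a few values of k.
for k in (1, 2, 5, 10):
    empirical = np.mean(np.abs(noise) > k / eps)
    print(f"k={k:2d}  empirical Pr[|Y| > k/eps] = {empirical:.6f}   e^-k = {np.exp(-k):.6f}")
```

The empirical frequencies should track e^{−k} closely for each k, matching the tail bound used to argue the Õ(1/ε) error.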