Foundations of Privacy
Lecture 5
Lecturer: Moni Naor
Desirable Properties from a Sanitization Mechanism
• Composability
– Applying the sanitization several times yields a graceful degradation
– Will see: t releases, each ε-DP, are t·ε-DP
– Next class: (√t·ε + tε², δ)-DP (roughly)
• Robustness to side information
– No need to specify exactly what the adversary knows:
– knows everything except one row
Differential Privacy: satisfies both…
Differential Privacy
Protect individual participants:
[Figure: a curator/sanitizer M applied to two databases D1 and D2 that differ in one row]
Dwork, McSherry, Nissim & Smith 2006
Differential Privacy
Protect individual participants:
Probability of every bad event (or any event) increases only by a small multiplicative factor when I enter the DB.
May as well participate in the DB…
Adjacency: D+I and D−I (differing in one user)
ε-differentially private sanitizer M (handles auxiliary input):
For all DBs D, all individuals I, and all events T:
e^{-ε} ≤ Pr[M(D+I) ∈ T] / Pr[M(D−I) ∈ T] ≤ e^{ε} ≈ 1+ε
Differential Privacy
Sanitizer M gives ε-differential privacy if:
for all adjacent D1 and D2, and all A ⊆ range(M):
Pr[M(D1) ∈ A] ≤ e^{ε} · Pr[M(D2) ∈ A]
[Figure: the two response distributions Pr[response] have a bounded ratio; bad responses Z are essentially as likely with or without me]
Participation in the data set poses no additional risk
Example of Differential Privacy
X is a set of (name, tag ∈ {0,1}) tuples
One query: # of participants with tag = 1
Sanitizer: output # of 1's + noise
• noise from the Laplace distribution with parameter 1/ε
• Pr[noise = k−1] ≈ e^{ε} · Pr[noise = k]
[Figure: the distribution of the noisy count, peaked at the true count]
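A minimal sketch of this sanitizer, assuming Python with numpy (the database and names are illustrative, not from the slides):

```python
import numpy as np

def laplace_count(tags, epsilon):
    """Counting query released with Laplace noise of parameter b = 1/epsilon."""
    true_count = int(np.sum(tags))                    # exact # of 1-tags
    noise = np.random.laplace(loc=0.0, scale=1.0 / epsilon)
    return true_count + noise

# Example: 1000 participants with random tags, epsilon = 0.1
tags = np.random.randint(0, 2, size=1000)
print(laplace_count(tags, epsilon=0.1))               # true count ± O(1/epsilon)
```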
(ε, δ)-Differential Privacy
Sanitizer M gives (ε, δ)-differential privacy if:
for all adjacent D1 and D2, and all A ⊆ range(M):
Pr[M(D1) ∈ A] ≤ e^{ε} · Pr[M(D2) ∈ A] + δ
[Figure: the ratio of the response distributions Pr[response] is bounded, except on bad responses Z]
Typical setting: ε = 1/10 and δ negligible
This course: δ negligible
Example: NO Differential Privacy
U set of (name, tag ∈ {0,1}) tuples
One counting query: # of participants with tag = 1
Sanitizer A: choose and release a few random tags
Bad event T: only my tag is 1, and my tag is released
Pr[A(D+Me) ∈ T] ≥ 1/n, while Pr[A(D−Me) ∈ T] = 0
• Not ε-differentially private for any ε: no e^{ε} can bound the ratio Pr[A(D+Me) ∈ T] / Pr[A(D−Me) ∈ T]
• It is (0, 1/n)-differentially private
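A toy simulation of this counterexample, assuming Python (simplified to release a single random tag; names are illustrative):

```python
import random

def release_random_tag(tags):
    """Non-private 'sanitizer': publish one uniformly random tag."""
    return tags[random.randrange(len(tags))]

n = 100
with_me = [1] + [0] * (n - 1)      # D + Me: my tag is the only 1
without_me = [0] * (n - 1)         # D - Me: my row is absent

# Bad event T: the released tag is my 1
trials = 100_000
hits = sum(release_random_tag(with_me) == 1 for _ in range(trials))
print(hits / trials)               # ~ 1/n, while on D - Me it is exactly 0
```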
Counting Queries
Database x of size n: n individuals, each contributing a single point in a universe U
Counting queries: Q is a set of predicates q: U → {0,1}
Query: how many participants of x satisfy q? (sometimes we talk about the fraction)
Relaxed accuracy: answer each query within α additive error w.h.p.
Not so bad: some error is anyway inherent in statistical analysis
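A minimal sketch of counting queries as predicates over a universe U, assuming Python (the rows and the predicate are illustrative):

```python
from typing import Callable, Iterable

def counting_query(db: Iterable, q: Callable[[object], int]) -> int:
    """How many participants of the database satisfy the predicate q: U -> {0,1}?"""
    return sum(q(x) for x in db)

# Illustrative universe: (name, age) rows; predicate 'age >= 30'
db = [("alice", 34), ("bob", 27), ("carol", 41)]
print(counting_query(db, lambda row: int(row[1] >= 30)))   # -> 2
```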
Bound on Achievable Privacy
Want to get bounds on:
• The accuracy α
– the responses from the mechanism to all queries are assured to be within α, except with probability β
• The number of queries t for which we can receive accurate answers
• The privacy parameter ε for which ε-differential privacy is achievable
– or (ε, δ)-differential privacy is achievable
Blatant Non-Privacy
Mechanism M is blatantly non-private if there is an adversary A that:
• on any database D of size n, can select queries and use the responses M(D) to reconstruct a D' such that ||D − D'||_1 ∈ o(n)
– i.e., D' agrees with D in all but o(n) of the entries
Claim: blatant non-privacy implies that M is not (ε, δ)-DP for any constant ε
Sanitization Can’t be Too Accurate
Usual counting queries:
– Query: q ⊆ [n]
– Answer: Σ_{i ∈ q} d_i; Response = Answer + noise
Blatant non-privacy: the adversary guesses 99% of the bits
Theorem: If all responses are within o(n) of the true answer, then the algorithm is blatantly non-private.
But: this requires an exponential number of queries.
Proof: Exponential Adversary
• Focus on the column containing the super-private bit
[Figure: “the database” is a bit vector d ∈ {0,1}^n]
• Assume all answers are within some error bound E.
We will show that E cannot be o(n).
Proof: Exponential Adversary for Blatant Non-Privacy
• Estimate # of 1's in all possible sets:
– ∀ S ⊆ [n]: |M(S) − Σ_{i ∈ S} d_i| ≤ E  (M(S): the answer on S)
• Weed out “distant” DBs:
– For each possible candidate database c ∈ {0,1}^n:
if for any S ⊆ [n]: |Σ_{i ∈ S} c_i − M(S)| > E, then rule out c
– If c is not ruled out, halt and output c
Claim: the real database d won't be ruled out
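A minimal sketch of this exponential adversary, assuming Python (M is any oracle for the noisy subset sums; the toy run uses exact answers, i.e. E = 0):

```python
import itertools

def exponential_adversary(M, n, E):
    """Ask M about every subset S of [n]; output a candidate c in {0,1}^n
    whose subset sums all lie within E of M's answers."""
    subsets = [S for r in range(n + 1)
               for S in itertools.combinations(range(n), r)]
    answers = {S: M(S) for S in subsets}          # 2^n queries
    for c in itertools.product([0, 1], repeat=n): # 2^n candidates
        if all(abs(sum(c[i] for i in S) - answers[S]) <= E for S in subsets):
            return c                              # within Hamming distance 4E of d

# Toy run on a 5-bit 'database' with exact answers
d = (1, 0, 1, 1, 0)
print(exponential_adversary(lambda S: sum(d[i] for i in S), n=5, E=0))
```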
Proof: Exponential Adversary
• Assume: ∀ S ⊆ [n]: |M(S) − Σ_{i ∈ S} d_i| ≤ E
Claim: for any c that has not been ruled out, Hamming distance(c, d) ≤ 4E
Proof sketch: let S0 = {i : d_i = 0} and S1 = {i : d_i = 1}.
|M(S0) − Σ_{i ∈ S0} c_i| ≤ E (c not ruled out) and |M(S0) − Σ_{i ∈ S0} d_i| ≤ E, so c and d disagree on at most 2E entries of S0.
Similarly for S1 (using |M(S1) − Σ_{i ∈ S1} c_i| ≤ E): at most 2E disagreements there, so at most 4E in total.
Impossibility of Exponential Queries
The result means that we cannot sanitize the data and publish a data structure such that:
• for all queries, the answer can be deduced correctly to within E ∈ o(n)
On the other hand: we will see that we can get accuracy up to log |Q|
[Figure: database → sanitizer → published answers to query 1, query 2, …]
What can we do efficiently?
We allowed “too” much power to the adversary:
• Number of queries: exponential
• Computation: exponential
• On the other hand: we assumed there are no wild errors in the responses
Theorem: For any sanitization algorithm: if all responses are within o(√n) of the true answer, then it is blatantly non-private, even against a polynomial-time adversary making O(n log² n) random queries.
The Model
• As before: the database d is a bit string of length n
• Counting queries:
– a query is a subset q ⊆ {1, …, n}
– the (exact) answer is a_q = Σ_{i ∈ q} d_i
• E-perturbation:
– for each answer, respond with some value in a_q ± E
What If We Had Exact Answers?
• Consider a mechanism with 0-perturbation:
– we receive the exact answer a_q = Σ_{i ∈ q} d_i
• Then with n linearly independent queries (over the reals) we could reconstruct d precisely:
obtain n linear equations a_q = Σ_{i ∈ q} c_i and solve them uniquely
(a solution must exist: d itself)
• With E-perturbations we only get inequalities:
a_j − E ≤ Σ_{i ∈ q_j} c_i ≤ a_j + E
Idea: use linear programming
Privacy requires Ω(√n) perturbation
Consider a database where the perturbation E is o(√n).
• The adversary makes t = n log² n random queries q_j, getting noisy answers a_j
• Privacy-violating algorithm: construct a database c = {c_i}_{1 ≤ i ≤ n} by solving the linear program:
0 ≤ c_i ≤ 1  for 1 ≤ i ≤ n
a_j − E ≤ Σ_{i ∈ q_j} c_i ≤ a_j + E  for 1 ≤ j ≤ t
– A solution must exist: d itself
– For every query q_j: the answer according to c is at most 2E far from the (real) answer in d
• Round the solution: if c_i > 1/2 set it to 1, and to 0 otherwise
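A sketch of this LP-based attack, assuming Python with numpy and scipy (the sizes, the noise model, and all names are illustrative):

```python
import numpy as np
from scipy.optimize import linprog

def lp_reconstruct(Q, a, E):
    """Find a feasible c with a_j - E <= <Q_j, c> <= a_j + E and 0 <= c_i <= 1,
    then round each coordinate to the nearest bit."""
    t, n = Q.shape
    A_ub = np.vstack([Q, -Q])                      # encode both inequalities
    b_ub = np.concatenate([a + E, -(a - E)])
    res = linprog(c=np.zeros(n), A_ub=A_ub, b_ub=b_ub,
                  bounds=[(0, 1)] * n, method="highs")
    return (res.x > 0.5).astype(int)

# Toy run: secret db d, t = n log^2 n random queries, perturbation E = 1
rng = np.random.default_rng(0)
n = 64
d = rng.integers(0, 2, size=n)
t = int(n * np.log(n) ** 2)
Q = rng.integers(0, 2, size=(t, n)).astype(float)
a = Q @ d + rng.uniform(-1, 1, size=t)             # answers within E of the truth
print(int((lp_reconstruct(Q, a, E=1) != d).sum()), "entries wrong")
```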
Bad solutions to LP do not survive
A query q disqualifies a potential database c ∈ [0,1]^n if its answer on q is more than 2E far from the answer in d:
|Σ_{i ∈ q} c_i − Σ_{i ∈ q} d_i| > 2E
• Idea: show that for a database c that is far away from d, a random query disqualifies c with some constant probability β
• Want to use the union bound: all far-away solutions are disqualified w.p. at least 1 − n^n (1 − β)^t = 1 − neg(n)
How do we limit the solution space? Round each value to the closest multiple of 1/n
Privacy requires Ω(√n) perturbation
A query q disqualifies a potential database c ∈ [0,1]^n if its answer on q is more than 2E far from the answer in d.
Lemma: if c is far away from d, then a random query disqualifies c with some constant probability:
• if Prob_{i ∈ [n]}[|d_i − c_i| ≥ 1/3] > β, then there is a β' > 0 such that
Prob_{q ∈ {0,1}^n}[|Σ_{i ∈ q} (c_i − d_i)| ≥ 2E+1] > β'
The proof uses Azuma's inequality
Privacy requires Ω(√n) perturbation
We can discretize all potential databases c ∈ [0,1]^n.
Suppose we round each entry c_i to the closest fraction with denominator n:
|c_i − w_i/n| ≤ 1/n
Then the response on any q changes by at most 1.
• If we disqualify all ‘discrete’ databases, then we also effectively eliminate all c ∈ [0,1]^n
• There are n^n ‘discrete’ databases
Privacy requires Ω(√n) perturbation
A query q disqualifies a potential database c ∈ [0,1]^n if its answer on q is more than 2E far from the answer in d (counting the number of entries far from d).
Claim: if c is far away from d, then a random query disqualifies c with some constant probability β.
• Therefore: t = n log² n queries leave only a negligible probability for each far-away reconstruction
• Union bound (applicable thanks to the discretization): all far-away candidates are disqualified w.p. at least 1 − n^n (1 − β)^t = 1 − neg(n)
Review and Conclusion
• When the perturbation is o(√n), choosing Õ(n) random queries gives enough information to efficiently reconstruct an o(n)-close db
• The database is reconstructed using linear programming, in polynomial time
Mechanisms with o(√n) perturbation are blatantly non-private: they are poly(n)-time reconstructable
Composition
Suppose we are going to apply a DP mechanism t times,
– perhaps on different databases.
We want to argue that the result is differentially private.
• A value b ∈ {0,1} is chosen
• In each of the t rounds, adversary A picks two adjacent databases D_i^0 and D_i^1 and receives the result z_i of an ε-DP mechanism M_i on D_i^b
• Want to argue: A's view is within e^{tε} for both values of b
• A's view: (z_1, z_2, …, z_t) plus the randomness used
Differential Privacy: Composition
Notation: P[z_1] = Pr_{z ∼ A_1(D)}[z = z_1],  P'[z_1] = Pr_{z ∼ A_1(D')}[z = z_1]
P[z_2] = Pr_{z ∼ A_2(D, z_1)}[z = z_2],  P'[z_2] = Pr_{z ∼ A_2(D', z_1)}[z = z_2]
Handles auxiliary information; composes naturally.
• A_1(D) is ε_1-DP
• for all z_1, A_2(D, z_1) is ε_2-DP
Then A_2(D, A_1(D)) is (ε_1 + ε_2)-DP
Proof: for all adjacent D, D' and (z_1, z_2):
e^{-ε_1} ≤ P[z_1] / P'[z_1] ≤ e^{ε_1}
e^{-ε_2} ≤ P[z_2] / P'[z_2] ≤ e^{ε_2}
Multiplying: e^{-(ε_1+ε_2)} ≤ P[(z_1, z_2)] / P'[(z_1, z_2)] ≤ e^{ε_1+ε_2}
Differential Privacy: Composition
• If all mechanisms M_i are ε-DP, then for any view, the probabilities that A gets that view when b = 0 and when b = 1 are within a factor of e^{tε}.
Therefore, results for a single query translate to results on several queries.
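A minimal sketch of this accounting for sequential composition, assuming Python with numpy (each round uses the Laplace mechanism from earlier; names are illustrative):

```python
import numpy as np

def composed_release(db, queries, eps_per_query):
    """Answer t counting queries, each via an eps_per_query-DP Laplace mechanism;
    by sequential composition the whole release is (t * eps_per_query)-DP."""
    answers = [sum(q(x) for x in db) + np.random.laplace(scale=1.0 / eps_per_query)
               for q in queries]
    return answers, len(queries) * eps_per_query

db = [0, 1, 1, 0, 1]
queries = [lambda x: x, lambda x: 1 - x]           # two counting queries
answers, total_eps = composed_release(db, queries, eps_per_query=0.1)
print(answers, "total epsilon:", total_eps)
```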
Answering a single counting query
U set of (name, tag ∈ {0,1}) tuples
One counting query: # of participants with tag = 1
Sanitizer A: output # of 1's + noise
Differentially private, if the noise is chosen properly:
choose the noise from the Laplace distribution
Laplacian Noise
The Laplace distribution Y = Lap(b) has density function
Pr[Y = y] = (1/2b) · e^{-|y|/b}
Standard deviation: O(b)
Take b = 1/ε; then Pr[Y = y] ∝ e^{-ε|y|}
[Figure: the double-exponential Laplace density centered at 0]
Laplacian Noise: ε-Privacy
Take b = 1/ε; then Pr[Y = y] ∝ e^{-ε|y|}
Release: q(D) + Lap(1/ε)
For adjacent D, D': |q(D) − q(D')| ≤ 1
For every output a:
e^{-ε} ≤ Pr_D[a] / Pr_{D'}[a] ≤ e^{ε}
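Spelling out the ratio bound above as a short derivation (using the Lap(1/ε) density and the triangle inequality; sensitivity |q(D) − q(D')| ≤ 1 as on the slide):

```latex
\frac{\Pr_D[a]}{\Pr_{D'}[a]}
  = \frac{e^{-\varepsilon |a - q(D)|}}{e^{-\varepsilon |a - q(D')|}}
  = e^{\varepsilon \left( |a - q(D')| - |a - q(D)| \right)}
  \le e^{\varepsilon |q(D) - q(D')|}
  \le e^{\varepsilon}
```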
Laplacian Noise: ε-Privacy
Theorem: the Laplace mechanism with parameter b = 1/ε is ε-differentially private.
Laplacian Noise: Õ(1/ε)-Error
Take b = 1/ε; then Pr[Y = y] ∝ e^{-ε|y|}
Concentration of the Laplace distribution:
Pr_{y ∼ Y}[|y| > k · 1/ε] = O(e^{-k})
Setting k = O(log n):
the expected error is 1/ε, and w.h.p. the error is Õ(1/ε)
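A quick empirical check of this concentration, assuming Python with numpy (parameters illustrative):

```python
import numpy as np

eps, trials = 0.1, 100_000
noise = np.random.laplace(scale=1.0 / eps, size=trials)
print(np.mean(np.abs(noise)))              # ~ 1/eps expected error
print(np.mean(np.abs(noise) > 10 / eps))   # tail beyond k/eps for k = 10: ~ e^-10
```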