Foundations of Privacy
Lecture 5
Lecturer: Moni Naor
Desirable Properties from a Sanitization Mechanism
• Composability
– Applying the sanitization several times yields a graceful degradation
– Will see: t releases, each ε-DP, are tε-DP
– Next class: (ε√t + tε², δ)-DP (roughly)
• Robustness to side information
– No need to specify exactly what the adversary knows: may know everything except one row
Differential Privacy: satisfies both…
Differential Privacy
Protect individual participants:
[Figure: two adjacent databases, D1 and D2 = D1 plus one row, each fed to the curator/sanitizer M; the output distributions should be close.]
Dwork, McSherry, Nissim & Smith, 2006
Differential Privacy
Protect individual participants:
Probability of every bad event, or any event, increases only by a small multiplicative factor when I enter the DB.
May as well participate in the DB…
Adjacency: D+I and D−I (differing in one user)
ε-differentially private sanitizer M: for all DBs D, all individuals I and all events T,
e^−ε ≤ Pr[M(D+I) ∈ T] / Pr[M(D−I) ∈ T] ≤ e^ε ≈ 1+ε
Handles auxiliary input.
Differential Privacy
Sanitizer M gives ε-differential privacy if:
for all adjacent D1 and D2, and all A ⊆ range(M):
Pr[M(D1) ∈ A] ≤ e^ε · Pr[M(D2) ∈ A]
(ratio bounded)
[Figure: two response distributions Pr[response], with the bad-response regions Z almost equally likely under both.]
Participation in the data set poses no additional risk.
Example of Differential Privacy
X is a set of (name, tag ∈ {0,1}) tuples
One query: # of participants with tag = 1
Sanitizer: output # of 1's + noise
• noise from Laplace distribution with parameter 1/ε
• Pr[noise = k−1] ≈ e^ε · Pr[noise = k]
[Figure: Laplace density over outputs −4, …, 5, shifted by one between adjacent databases.]
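A minimal sketch of this sanitizer in Python (the function name, toy data, and parameter choices are mine, not from the slides):

```python
import numpy as np

def laplace_count(db_tags, epsilon, rng):
    """Release the number of 1-tags plus Lap(1/epsilon) noise."""
    true_count = int(np.sum(db_tags))
    noise = rng.laplace(loc=0.0, scale=1.0 / epsilon)
    return true_count + noise

rng = np.random.default_rng(0)
tags = np.array([0, 1, 1, 0, 1])        # toy database of 5 tags
print(laplace_count(tags, epsilon=0.1, rng=rng))
```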
(, ) - Differential Privacy
Sanitizer M gives (, ) -differential privacy if:
for all adjacent D1 and D2, and all A µ range(M):

Pr[M(D1) 2 A] ≤ e Pr[M(D2) 2 A] + 
ratio bounded
Pr [response]
Bad Responses:
Typical setting 𝜖 =
Z
1
10
Z
and δ negligible
Z
This course:  negligible
Example: NO Differential Privacy
U set of (name, tag ∈ {0,1}) tuples
One counting query: # of participants with tag = 1
Sanitizer A: choose and release a few random tags
Bad event T: only my tag is 1, and my tag is released
Pr[A(D+Me) ∈ T] ≥ 1/n, while Pr[A(D−Me) ∈ T] = 0
• Not ε-differentially private for any ε! The ratio Pr[A(D+Me) ∈ T] / Pr[A(D−Me) ∈ T] is unbounded, so no e^ε can bound it.
• It is (0, 1/n)-differentially private.
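To see the violation numerically, here is a small simulation of a sanitizer releasing one random tag (a simplification of "a few random tags"; all names and parameters are mine):

```python
import numpy as np

rng = np.random.default_rng(1)
n, trials = 100, 100_000

# D + Me: my tag is the only 1; D - Me: I am absent entirely.
d_with_me = np.zeros(n, dtype=int); d_with_me[0] = 1
d_without_me = np.zeros(n - 1, dtype=int)

def release_random_tag(db, rng):
    return db[rng.integers(len(db))]   # sanitizer: output one random tag

hits_with = sum(release_random_tag(d_with_me, rng) == 1 for _ in range(trials))
hits_without = sum(release_random_tag(d_without_me, rng) == 1 for _ in range(trials))
print(hits_with / trials)     # approx. 1/n = 0.01
print(hits_without / trials)  # exactly 0, so the ratio is unbounded
```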
Counting Queries
• Database x of size n: n individuals, each contributing a single point in U
• Q is a set of predicates q: U → {0,1}
• Query: how many participants of x satisfy q? (Sometimes talk about the fraction.)
Relaxed accuracy: answer each query within α additive error w.h.p.
Not so bad: some error is anyway inherent in statistical analysis.
[Figure: database x as n points in the universe U, with a query q selecting a subset.]
Bound on Achievable Privacy
Want to get bounds on the
• Accuracy
– The responses from the mechanism to all queries are assured to be within α, except with probability β
• Number of queries t for which we can receive accurate answers
• The privacy parameter ε for which ε-differential privacy is achievable
– Or (ε, δ)-differential privacy is achievable
Blatant Non-Privacy
Mechanism M is blatantly non-private if there is an adversary A that, on any database D of size n, can select queries and use the responses M(D) to reconstruct D′ such that ||D − D′||₁ ∈ o(n), i.e., D′ agrees with D in all but o(n) of the entries.
Claim: Blatant non-privacy implies that M is not (ε, δ)-DP for any constant ε (with δ negligible).
Sanitization Can't Be Too Accurate
Usual counting queries:
– Query: q ⊆ [n]
– Response = Answer + noise, where Answer = Σ_{i∈q} d_i
Blatant non-privacy: adversary guesses 99% of the bits.
Theorem: If all responses are within o(n) of the true answer, then the algorithm is blatantly non-private.
But: requires an exponential # of queries.
Proof: Exponential Adversary
• Focus on the column containing the super-private bit.
["The database": a vector d ∈ {0,1}^n, e.g. 1 0 0 1 0 1 1]
• Assume all answers are within error bound E.
Will show that E cannot be o(n).
Proof: Exponential Adversary for Blatant Non-Privacy
• Estimate # of 1's in all possible sets
– ∀ S ⊆ [n]: |M(S) − Σ_{i∈S} d_i| ≤ E, where M(S) is the answer on S
• Weed out "distant" DBs
– For each possible candidate database c ∈ {0,1}^n:
if for any S ⊆ [n]: |Σ_{i∈S} c_i − M(S)| > E, then rule out c.
– If c is not ruled out, halt and output c.
Claim: The real database d won't be ruled out.
Proof: Exponential Adversary
• Assume: ∀ S ⊆ [n]: |M(S) − Σ_{i∈S} d_i| ≤ E
Claim: For any c that has not been ruled out, Hamming distance(c, d) ≤ 4E.
Proof sketch: let S₀ = {i : d_i = 0} and S₁ = {i : d_i = 1}.
|M(S₀) − Σ_{i∈S₀} c_i| ≤ E (c not ruled out), so c has at most 2E ones on S₀.
|M(S₁) − Σ_{i∈S₁} c_i| ≤ E (c not ruled out), so c has at most 2E zeros on S₁.
Total: c and d disagree on at most 4E entries.
[Figure: d and c as bit vectors partitioned into S₀ and S₁, with at most 2E disagreements on each part.]
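A toy instantiation of this exponential adversary in Python, for illustration only (the oracle M, the error bound E, and all names are my assumptions; querying all 2^n subsets is feasible only for tiny n):

```python
import itertools
import numpy as np

rng = np.random.default_rng(2)
n, E = 8, 1                                 # tiny database, error bound E
d = rng.integers(0, 2, size=n)              # the secret database

def M(S):
    """Noisy subset-sum oracle: true answer perturbed by at most E."""
    return d[list(S)].sum() + rng.integers(-E, E + 1)

# Query every subset once, then keep a candidate consistent with all answers.
subsets = [S for r in range(n + 1) for S in itertools.combinations(range(n), r)]
answers = {S: M(S) for S in subsets}

for c in itertools.product([0, 1], repeat=n):
    c = np.array(c)
    if all(abs(c[list(S)].sum() - answers[S]) <= E for S in subsets):
        break                               # first surviving candidate

print("Hamming distance:", int(np.sum(c != d)), "<= 4E =", 4 * E)
```

The loop always terminates: d itself is never ruled out, since every answer is within E of d's subset sums.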
Impossibility of Exponential Queries
The result means that we cannot sanitize the data and publish a data structure so that, for all queries, the answer can be deduced correctly to within E ∈ o(n).
On the other hand: we will see that we can get accuracy up to log |Q|.
[Figure: a database feeding a sanitizer, which receives query 1, query 2, … and returns answer 1, answer 2, answer 3, …]
What can we do efficiently?
We allowed "too" much power to the adversary:
• Number of queries: exponential
• Computation: exponential
• On the other hand: no wild errors in the responses
Theorem: For any sanitization algorithm: if all responses are within o(√n) of the true answer, then it is blatantly non-private, even against a polynomial-time adversary making O(n log² n) random queries.
The Model
• As before: database d is a bit string of length n.
• Counting queries:
– A query is a subset q ⊆ {1, …, n}
– The (exact) answer is a_q = Σ_{i∈q} d_i
• E-perturbation:
– for each answer, the response is in a_q ± E
What If We Had Exact Answers?
• Consider a mechanism with 0-perturbation:
– we receive the exact answer a_q = Σ_{i∈q} d_i
Then with n linearly independent queries (over the reals) we could reconstruct d precisely:
• Obtain n linear equations a_q = Σ_{i∈q} c_i and solve uniquely. A solution must exist: d itself.
When we have E-perturbations we only get inequalities:
• a_j − E ≤ Σ_{i∈q_j} c_i ≤ a_j + E
Idea: use linear programming.
Privacy Requires Ω(√n) Perturbation
Consider a database mechanism whose perturbation E is o(√n).
• Adversary makes t = n log² n random queries q_j, getting noisy answers a_j.
– For every query q_j, the answer according to any surviving c is at most 2E from the real answer in d (both are within E of a_j).
– A solution to the program below must exist: d itself.
• Privacy-violating algorithm: construct a database c = {c_i}_{1≤i≤n} by solving the linear program:
0 ≤ c_i ≤ 1 for 1 ≤ i ≤ n
a_j − E ≤ Σ_{i∈q_j} c_i ≤ a_j + E for 1 ≤ j ≤ t
• Round the solution: if c_i > 1/2 set it to 1, and to 0 otherwise.
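A sketch of this reconstruction attack, assuming scipy's linprog as the LP solver (the parameter choices and variable names are mine):

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(3)
n = 64
t = n * int(np.log(n) ** 2)                # ~ n log^2 n random queries
E = int(np.sqrt(n) / 4)                    # perturbation in o(sqrt(n))

d = rng.integers(0, 2, size=n)             # the secret database
Q = rng.integers(0, 2, size=(t, n))        # each row: a random subset of [n]
answers = Q @ d + rng.integers(-E, E + 1, size=t)

# Feasibility LP:  a_j - E <= sum_{i in q_j} c_i <= a_j + E,  0 <= c_i <= 1.
A_ub = np.vstack([Q, -Q])
b_ub = np.concatenate([answers + E, -(answers - E)])
res = linprog(c=np.zeros(n), A_ub=A_ub, b_ub=b_ub, bounds=[(0, 1)] * n)

c_rounded = (res.x > 0.5).astype(int)      # round each coordinate to a bit
print("fraction of wrong bits:", np.mean(c_rounded != d))  # typically o(1)
```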
Bad Solutions to the LP Do Not Survive
A query q disqualifies a potential database c ∈ [0,1]^n if its answer on q is more than 2E far from the answer in d:
|Σ_{i∈q} c_i − Σ_{i∈q} d_i| > 2E
• Idea: show that for a database c that is far away from d, a random query disqualifies c with some constant probability β.
• Want to use the union bound: all far-away solutions are disqualified w.p. at least 1 − n^n (1−β)^t = 1 − neg(n).
How do we limit the solution space? Round each value to the closest multiple of 1/n.
Privacy Requires Ω(√n) Perturbation
A query q disqualifies a potential database c ∈ [0,1]^n if its answer on q is more than 2E far from the answer in d.
Lemma: if c is far away from d, then a random query disqualifies c with some constant probability β:
• If Prob_{i∈[n]} [|d_i − c_i| ≥ 1/3] is at least some constant, then there is a β > 0 such that
Prob over a uniformly random q ⊆ [n] of [|Σ_{i∈q} (c_i − d_i)| ≥ 2E+1] > β
Proof uses Azuma's inequality.
Privacy Requires Ω(√n) Perturbation
Can discretize all potential databases c ∈ [0,1]^n:
Suppose we round each entry c_i to the closest fraction with denominator n:
|c_i − w_i/n| ≤ 1/n
Then the response on any q changes by at most 1 (n coordinates, each moved by at most 1/n).
• If we disqualify all 'discrete' databases, then we also effectively eliminate all c ∈ [0,1]^n.
• There are n^n 'discrete' databases.
Privacy Requires Ω(√n) Perturbation
A query q disqualifies a potential database c ∈ [0,1]^n if its answer on q is more than 2E far from the answer in d.
Claim: if c is far away from d (counting the number of entries far from d), then a random query disqualifies c with some constant probability β.
• Therefore t = n log² n queries leave only a negligible probability for each far-away reconstruction.
• Union bound, applicable thanks to the discretization: all far-away candidates are disqualified w.p. at least 1 − n^n (1−β)^t = 1 − neg(n).
Review and Conclusion
• When the perturbation is o(√n), choosing Õ(n) random queries gives enough information to efficiently reconstruct an o(n)-close db.
• The database is reconstructed using linear programming
– in polynomial time.
Mechanisms with o(√n) perturbation are blatantly non-private: the database is poly(n)-time reconstructable.
Composition
Suppose we are going to apply a DP mechanism t times, perhaps on different databases.
Want to argue that the combined result is differentially private.
• A value b ∈ {0,1} is chosen.
• In each of the t rounds, the adversary A picks two adjacent databases D_i^0 and D_i^1 and receives the result z_i of an ε-DP mechanism M_i on D_i^b.
• Want to argue: A's views for the two values of b are within a factor e^{tε}.
• A's view: (z_1, z_2, …, z_t) plus the randomness used.
Differential Privacy: Composition
Composes naturally; handles auxiliary information.
Notation, for adjacent D, D′:
P[z_1] = Pr_{z∼A_1(D)}[z = z_1],  P′[z_1] = Pr_{z∼A_1(D′)}[z = z_1]
P[z_2] = Pr_{z∼A_2(D,z_1)}[z = z_2],  P′[z_2] = Pr_{z∼A_2(D′,z_1)}[z = z_2]
Claim: if A_1(D) is ε_1-diffP and, for all z_1, A_2(D, z_1) is ε_2-diffP, then A_2(D, A_1(D)) is (ε_1+ε_2)-diffP.
Proof: for all adjacent D, D′ and all (z_1, z_2):
e^{−ε_1} ≤ P[z_1] / P′[z_1] ≤ e^{ε_1}
e^{−ε_2} ≤ P[z_2] / P′[z_2] ≤ e^{ε_2}
Since P[(z_1,z_2)] = P[z_1] · P[z_2], the two ratios multiply:
e^{−(ε_1+ε_2)} ≤ P[(z_1,z_2)] / P′[(z_1,z_2)] ≤ e^{ε_1+ε_2}
Differential Privacy: Composition
• If all mechanisms M_i are ε-DP, then for any view, the probabilities that A gets that view when b=0 and when b=1 are within a factor of e^{tε}.
Therefore, results for a single query translate to results on several queries.
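For instance, basic composition says that answering t counting queries, each with Lap(t/ε) noise, is ε-DP overall; a minimal sketch (names and parameters are mine):

```python
import numpy as np

def answer_queries(db_tags, predicates, epsilon, rng):
    """Answer t counting queries; each uses budget epsilon/t,
    so by basic composition the whole release is epsilon-DP."""
    per_query_eps = epsilon / len(predicates)
    return [sum(p(x) for x in db_tags) + rng.laplace(0.0, 1.0 / per_query_eps)
            for p in predicates]

rng = np.random.default_rng(4)
db = [0, 1, 1, 0, 1, 1]
queries = [lambda x: x == 1, lambda x: x == 0]   # two counting predicates
print(answer_queries(db, queries, epsilon=0.2, rng=rng))
```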
Answering a Single Counting Query
U set of (name, tag ∈ {0,1}) tuples
One counting query: # of participants with tag = 1
Sanitizer A: output # of 1's + noise
Differentially private, if the noise is chosen properly:
choose the noise from the Laplace distribution.
Laplacian Noise
The Laplace distribution Y = Lap(b) has density function
Pr[Y = y] = (1/2b) · e^{−|y|/b}
Standard deviation: O(b).
Take b = 1/ε; then Pr[Y = y] ∝ e^{−ε|y|}.
[Figure: the Lap(1/ε) density over the range −4, …, 5.]
Laplacian Noise: ε-Privacy
Take b = 1/ε, so Pr[Y = y] ∝ e^{−ε|y|}.
Release: q(D) + Lap(1/ε).
For adjacent D, D′: |q(D) − q(D′)| ≤ 1.
For every output a:
e^{−ε} ≤ Pr_{by D}[a] / Pr_{by D′}[a] ≤ e^{ε}
[Figure: two Lap(1/ε) densities shifted by 1.]
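Spelling out the density ratio makes this bound explicit; with b = 1/ε and the reverse triangle inequality:

```latex
\frac{\Pr[q(D) + Y = a]}{\Pr[q(D') + Y = a]}
  = \frac{e^{-\varepsilon\,|a - q(D)|}}{e^{-\varepsilon\,|a - q(D')|}}
  = e^{\varepsilon\,(|a - q(D')| - |a - q(D)|)}
  \le e^{\varepsilon\,|q(D) - q(D')|}
  \le e^{\varepsilon}.
```

The lower bound e^{−ε} follows by the symmetric argument with D and D′ exchanged.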
Laplacian Noise: ε-Privacy
Theorem: the Laplace mechanism with parameter b = 1/ε is ε-differentially private.
[Figure: the Lap(1/ε) density.]
Laplacian Noise: Õ(1/ε) Error
Take b = 1/ε, so Pr[Y = y] ∝ e^{−ε|y|}.
Concentration of the Laplace distribution:
Pr_{y∼Y}[|y| > k · (1/ε)] = O(e^{−k})
Setting k = O(log n): the expected error is 1/ε, and w.h.p. the error is Õ(1/ε).
[Figure: the tails of the Lap(1/ε) density.]
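A quick empirical check of this tail bound (parameters are mine; for Lap(b), Pr[|y| > k·b] = e^{−k} exactly):

```python
import numpy as np

rng = np.random.default_rng(5)
eps = 0.1
samples = np.abs(rng.laplace(0.0, 1.0 / eps, size=1_000_000))

print("mean |noise|:", samples.mean())     # ~ 1/eps = 10
for k in [1, 3, 5]:
    # Empirical tail mass beyond k/eps, which should be close to e^{-k}.
    print(f"Pr[|y| > {k}/eps] =", np.mean(samples > k / eps))
```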