Computational Complexity &
Differential Privacy
Salil Vadhan
Harvard University
Joint works with Cynthia Dwork, Kunal Talwar, Andrew McGregor, Ilya Mironov,
Moni Naor, Omkant Pandey, Toni Pitassi, Omer Reingold, Guy Rothblum, Jon Ullman
Data Privacy: The Problem
Given a dataset with sensitive information, such as:
• Health records
• Census data
• Search engine logs
How can we:
• Compute and release useful functions of the dataset (utility)
• Without compromising info about individuals (privacy)?
Data Privacy: The Challenge
• Traditional approach (e.g. in HIPAA): “anonymize” by
removing “personally identifying information (PII)”
• Many supposedly anonymized datasets have been subject to
reidentification:
– Gov. Weld’s medical record re-identified by linkage with voter records
[Swe97].
– Netflix Challenge database reidentified by linkage with IMDb [NS08]
– AOL search users reidentified by contents of their queries [BZ06]
– …
Differential Privacy
[DN03,DN04,BDMN05,DMNS06]
A strong, new notion of privacy that:
• Is robust to auxiliary information possessed by an adversary
• Degrades gracefully under repetition/composition
• Allows for many useful computations
Differential Privacy: Definition
[Diagram: database D ∈ X^n → curator C → output C(D)]

Def: A randomized algorithm C : X^n → Y is (ε,δ)-differentially private iff
∀ databases D1, D2 that differ on one row,
and ∀ sets T ⊆ Y,
Pr[C(D1) ∈ T] ≤ e^ε · Pr[C(D2) ∈ T] + δ
≈ (1+ε) · Pr[C(D2) ∈ T] + δ
Think of ε as a small constant, e.g. ε = .01,
δ as cryptographically small, e.g. δ = 2^{-60}
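To make the definition concrete, here is a small numerical check on a finite output space, using one-bit randomized response; the checker and all parameter choices are illustrative, not from the talk:

```python
import math

def is_eps_dp(dist1, dist2, eps):
    """For delta = 0, (eps,0)-DP over a finite output set reduces to a
    pointwise check: Pr[C(D1)=y] <= e^eps * Pr[C(D2)=y] for every y,
    in both directions."""
    return all(
        p <= math.exp(eps) * q and q <= math.exp(eps) * p
        for p, q in zip(dist1, dist2)
    )

# Randomized response on a single bit: report it faithfully with
# probability e^eps0 / (1 + e^eps0), else flip it.
eps0 = 0.5
keep = math.exp(eps0) / (1 + math.exp(eps0))
dist_if_bit0 = [keep, 1 - keep]   # Pr[output=0], Pr[output=1] when D = 0
dist_if_bit1 = [1 - keep, keep]   # ... when D = 1
print(is_eps_dp(dist_if_bit0, dist_if_bit1, eps0 + 1e-9))  # True (slack for float error)
print(is_eps_dp(dist_if_bit0, dist_if_bit1, eps0 / 2))     # False: claim too strong
```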
Differential Privacy: Interpretations
Def: A randomized algorithm C : X^n → Y is (ε,δ)-differentially private iff
∀ databases D1, D2 that differ on one row,
and ∀ T ⊆ Y,
Pr[C(D1) ∈ T] ≲ (1+ε) · Pr[C(D2) ∈ T]
• My data affects what an adversary sees by at most ε.
• Whatever an adversary learns about me, it could have learned
from everyone else’s data.
• Above interpretations hold regardless of adversary’s auxiliary
information.
• Composes gracefully (k repetitions ⇒ kε-differentially private)
• But no protection for information that is not localized to a few
rows.
Differential Privacy: Example
• D = (x1,…,xn) ∈ X^n
• Goal: given π : X → {0,1}, estimate counting query
  π(D) := Σi π(xi)
  within error ±α
• Example:
  X = {0,1}^d
  π = conjunction on ≤ k variables
  Counting query = k-way marginal
  e.g. How many people in D smoke and have cancer?

  >35 | Smoker? | Cancer?
   0  |    1    |    1
   1  |    1    |    0
   1  |    0    |    1
   1  |    1    |    1
   0  |    1    |    0
   1  |    1    |    1
Differential Privacy: Example
• D = (x1,…,xn) ∈ X^n
• Goal: estimate counting query
  π(D) := Σi π(xi)
  within error ±α
• Solution: C(D) = π(D) + Lap(1/ε),
  where Lap(σ) has density ∝ exp(−|x|/σ)
• Can answer k queries with error ≈ k/ε or even ≈ k^{1/2}/ε
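A minimal Python sketch of this Laplace mechanism (the dataset and predicate below are invented for illustration):

```python
import numpy as np

def laplace_counting_query(data, predicate, eps):
    """Release sum_i predicate(x_i) with Lap(1/eps) noise.

    Changing one row shifts the true count by at most 1 (sensitivity 1),
    so adding Lap(1/eps) noise gives eps-differential privacy.
    """
    true_count = sum(predicate(x) for x in data)
    return true_count + np.random.laplace(loc=0.0, scale=1.0 / eps)

# Illustrative rows: (over_35, smoker, cancer) bits, as in the table above.
db = [(0, 1, 1), (1, 1, 0), (1, 0, 1), (1, 1, 1), (0, 1, 0), (1, 1, 1)]
smoke_and_cancer = lambda row: row[1] & row[2]   # a 2-way marginal
print(laplace_counting_query(db, smoke_and_cancer, eps=0.1))
```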
Other Differentially Private Algorithms
• histograms [DMNS06]
• contingency tables [BCDKMT07, GHRU11]
• machine learning [BDMN05,KLNRS08]
• logistic regression [CM08]
• clustering [BDMN05]
• social network analysis [HLMJ09]
• approximation algorithms [GLMRT10]
• …
This talk: Computational Complexity
in Differential Privacy
Q: Do computational resource constraints change what is possible?
Computationally bounded curator
– Makes differential privacy harder
– Differentially private & accurate synthetic data infeasible to construct,
even preserving 2-way marginals to within o(n) error
(proof uses PCPs & Digital Signatures)
– Open: release other types of summaries/models?
Computationally bounded adversary
– Makes differential privacy easier
– Provable gain in accuracy for multi-party protocols
– Connections to randomness extractors, communication complexity, secure
multiparty computation
Computational Differential Privacy:
Definition
Def [MPRV09]: A randomized algorithm Ck : X^n → Y is
ε-computationally differentially private iff
∀ databases D1, D2 that differ on one row, and
∀ nonuniform poly(k)-time algorithms T,
Pr[T(Ck(D1))=1] ≤ e^ε · Pr[T(Ck(D2))=1] + negl(k)
– k = security parameter.
– Allow Ck running time poly(n, log|X|, k).
– negl(k): function vanishing faster than 1/poly(k).
– Implicit in works on distributed implementations of differential privacy [DKMMN06,BKO08].
Computational Differential Privacy:
Example
Def [MPRV09]: A randomized algorithm Ck : X^n → Y is
ε-computationally differentially private iff
∀ databases D1, D2 that differ on one row, and
∀ nonuniform poly(k)-time algorithms T,
Pr[T(Ck(D1))=1] ≤ e^ε · Pr[T(Ck(D2))=1] + negl(k)
Example:
• Ck = [differentially private C + a PRG to generate randomness]
• Ck(D) computationally indistinguishable from C(D)
⇒ Ck computationally differentially private
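A sketch of that example, assuming ChaCha20 (via the third-party `cryptography` package) as the PRG; the wiring below is one plausible instantiation, not the construction from [MPRV09]:

```python
import math
import os
import struct
from cryptography.hazmat.primitives.ciphers import Cipher, algorithms

def prg_uniform(key: bytes, nonce: bytes) -> float:
    """Derive a pseudorandom uniform in (0,1) from a ChaCha20 keystream."""
    stream = Cipher(algorithms.ChaCha20(key, nonce), mode=None).encryptor()
    x = struct.unpack("<Q", stream.update(b"\x00" * 8))[0]
    return (x + 1) / (2**64 + 2)          # strictly inside (0,1)

def cdp_counting_query(data, predicate, eps, key: bytes) -> float:
    """Laplace mechanism with PRG-derived randomness: if ChaCha20 is a
    secure PRG, the output is computationally indistinguishable from the
    truly-random mechanism, hence eps-computationally DP."""
    u = prg_uniform(key, os.urandom(16))  # nonce can be public
    # Inverse-CDF sampling of Lap(1/eps) noise from the uniform u.
    noise = -(1 / eps) * math.copysign(math.log(1 - 2 * abs(u - 0.5)), u - 0.5)
    return sum(predicate(x) for x in data) + noise

key = os.urandom(32)                      # the curator's PRG seed
print(cdp_counting_query([0, 1, 1, 0, 1], lambda x: x, eps=0.1, key=key))
```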
Computational Differential Privacy:
Definitional Questions
• Stronger Def: Ck is ε-computationally differentially private′ if
∃ an (ε, negl(k))-differentially private C : X^n → Y s.t.
∀ databases D, C(D) & Ck(D) are computationally indistinguishable.
• Implies accuracy of Ck(D) ≈ accuracy of C(D)
⇒ not much gain over information-theoretic d.p.
• Q [MPRV09]: are the two definitions equivalent?
– Related to “Dense Model Theorems” in additive combinatorics [GT04]
and pseudorandomness [RTTV08]
– Still open!
Computational Differential Privacy:
When can it Help?
• [GKY11]: in many cases, can convert a computationally
differentially private Ck into an information-theoretically
differentially private C with negligible loss in accuracy.
– Does not cover all functionalities & utility measures of interest
– Is restricted to case of single curator computing on entire dataset
• Today [BNO08,MPRV09,MMPRTV10]: Computational
differential privacy can help when database distributed
among two or more parties.
2-Party Computational Differential Privacy
• Each party has a sensitive dataset; they want to do a joint
computation f(DA,DB)

[Diagram: A holds DA = (x1,…,xn), B holds DB = (y1,…,ym); they exchange messages m1, m2, …, mk, and out(m1,…,mk) ≈ f(DA,DB)]
2-Party Computational Differential Privacy
[Diagram: adversary A* exchanges messages m1,…,mk with B(DB) and outputs 0/1]

• ε-computational differential privacy (for B):
∀ nonuniform poly(k)-time A*,
∀ databases DB, D’B that differ on one row,
Pr[outA*(A*,B(DB))=1] ≤ e^ε · Pr[outA*(A*,B(D’B))=1] + negl(k)
• and similarly for A
Multiparty Computational Differential Privacy
[Diagram: parties P1(D1),…,P5(D5); an adversary P−5* controlling all parties except P5 outputs 0/1]

Require: ∀ nonuniform poly(k)-time P−i*,
∀ databases Di, D’i that differ on one row,
Pr[outP−i*(P−i*,Pi(Di))=1] ≤ e^ε · Pr[outP−i*(P−i*,Pi(D’i))=1] + negl(k)
Constructing CDP Protocols
Given a function f(D1,…,Dn) we wish to compute
– Example: each Di ∈ {0,1} and f(D1,…,Dn) = Σi Di
• Step 1: Design a centralized mechanism C
– Example: C(D1,…,Dn) = Σi Di + Lap(1/ε)
• Step 2: Use secure multiparty computation [Yao86,GMW86]
to implement C with a distributed protocol (P1,…,Pn); see the toy sketch after this slide.
– Adversary’s view can be simulated (up to computational
indistinguishability) given only access to “ideal functionality” C
⇒ computational differential privacy
– Can be done more efficiently in specific cases [DKMMN06,BKO08].
• Q: Did we really need computational differential privacy?
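A toy sketch of the pattern behind Step 2 for the SUM example: additive secret sharing so that only the total is revealed, with the Laplace noise split across parties via infinite divisibility. Honest-but-curious parties are assumed, and real protocols (e.g. [DKMMN06]) sample the noise jointly inside the MPC rather than per party as here:

```python
import random
import numpy as np

Q = 2**61 - 1     # modulus for additive secret sharing
SCALE = 10**6     # fixed-point encoding of real-valued contributions

def share(value: int, n: int):
    """Split `value` into n additive shares mod Q."""
    parts = [random.randrange(Q) for _ in range(n - 1)]
    parts.append((value - sum(parts)) % Q)
    return parts

def distributed_noisy_sum(bits, eps):
    """Each party contributes its bit plus a slice of Laplace noise.

    Lap(1/eps) is infinitely divisible: summing n i.i.d. differences of
    Gamma(1/n, 1/eps) variables yields exactly Lap(1/eps) noise, so the
    total matches the centralized mechanism C(D) = sum(D_i) + Lap(1/eps).
    """
    n = len(bits)
    contributions = []
    for b in bits:
        noise = np.random.gamma(1 / n, 1 / eps) - np.random.gamma(1 / n, 1 / eps)
        contributions.append(round((b + noise) * SCALE) % Q)
    # Every party secret-shares its contribution; summing all shares
    # reveals only the grand total, mimicking the ideal functionality C.
    all_shares = [share(c, n) for c in contributions]
    total = sum(sum(col) for col in zip(*all_shares)) % Q
    if total > Q // 2:                     # map mod-Q result back to signed
        total -= Q
    return total / SCALE

print(distributed_noisy_sum([1, 0, 1, 1, 0, 1, 1, 0], eps=0.5))
```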
Differentially Private Protocol for SUM:
“Randomized Response” [W65]
P1,…,Pn have bits D1,…,Dn ∈ {0,1}, want to estimate Σi Di
1. Each Pi broadcasts Ni = Di w.p. (1+ε)/2, ¬Di w.p. (1−ε)/2
2. Everyone computes Z = (1/ε)·Σi (Ni − (1−ε)/2)
• Differential Privacy: ((1+ε)/2)/((1−ε)/2) = 1+O(ε).
• Accuracy: E[Ni] = (1−ε)/2 + ε·Di ⇒ whp error is O(n^{1/2}/ε)
– Nontrivial, but worse than the O(1/ε) achievable with computational d.p.
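A quick simulation of this protocol’s estimator (parameters arbitrary):

```python
import random

def randomized_response(bits, eps):
    """Each party keeps its bit w.p. (1+eps)/2, flips it w.p. (1-eps)/2.

    The unbiased estimator Z = (1/eps) * sum(N_i - (1-eps)/2) recovers
    sum(D_i) up to O(sqrt(n)/eps) error with high probability.
    """
    noisy = [b if random.random() < (1 + eps) / 2 else 1 - b for b in bits]
    return sum(n - (1 - eps) / 2 for n in noisy) / eps

random.seed(0)
data = [random.randint(0, 1) for _ in range(10_000)]
estimate = randomized_response(data, eps=0.1)
print(f"true sum = {sum(data)}, estimate = {estimate:.0f}")
```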
Lower Bound for Computing SUM
Thm: Every n-party differentially private protocol for SUM incurs
error Ω(n^{1/2}) whp
– Assuming ε = O(1), δ = o(1/n)
– Improves the Ω(n^{1/2}/(# rounds)) lower bound of [BNO08].
Proof:
• Assume δ = 0 for simplicity.
• Let (D1,…,Dn) be uniform, independent bits,
T = transcript(P1(D1),…,Pn(Dn))
• Claim: conditioned on T = t, D1,…,Dn are still independent bits,
each with bias O(ε)
Lower Bound for Computing SUM
Thm: Every n-party differentially private protocol for SUM incurs
error Ω(n^{1/2}) whp
Proof:
• (D1,…,Dn) = uniformly random, T = trans(P1(D1),…,Pn(Dn))
• Claim: conditioned on T = t, D1,…,Dn are still independent bits,
each with bias O(ε)
– Independence true for any communication protocol
– Bias by Bayes’ Rule & Differential Privacy:
Pr[Di = 1 | T = t] / Pr[Di = 0 | T = t]
  = Pr[T = t | Di = 1] / Pr[T = t | Di = 0] ≤ 1 + O(ε)
Lower Bound for Computing SUM
Thm: Every n-party differentially private protocol for SUM incurs
error Ω(n^{1/2}) whp
Proof:
• (D1,…,Dn) = uniformly random, T = trans(P1(D1),…,Pn(Dn))
• Claim: conditioned on T = t, D1,…,Dn are still independent bits,
each with bias O(ε)
• Claim: the sum of n independent bits, each with constant bias,
falls outside any fixed interval of size o(n^{1/2}) whp.
⇒ Whp Σi Di ∉ [output(T) − o(n^{1/2}), output(T) + o(n^{1/2})]
Separation for Two-Party Protocols
A’s input: x = (x1,…,xn) ∈ {0,1}^n, B’s input: y = (y1,…,yn) ∈ {0,1}^n
Goal [MPRV09]: estimate ⟨x,y⟩ (set intersection)
• Can be computed by a computational differentially private
protocol with error O(1/ε)
• Can be computed by an information-theoretically differentially
private protocol with error O(n^{1/2}/ε)
• Thm [MMPTRV10]: Every 2-party differentially private
protocol for ⟨x,y⟩ incurs error Ω(n^{1/2}/log n) whp.
Lower Bound for Inner Product
A’s input: x = (x1,…,xn) ∈ {0,1}^n, B’s input: y = (y1,…,yn) ∈ {0,1}^n
Thm: Every 2-party differentially private protocol for ⟨x,y⟩ has
error Ω(n^{1/2}/log n) whp
Proof:
• X, Y = uniformly random, T = trans(A(X),B(Y))
• Claim: conditioned on T = t, X, Y are independent unpredictable
(Santha-Vazirani) sources:
Pr[Xi = 1 | T = t, X−i = x−i] / Pr[Xi = 0 | T = t, X−i = x−i]
  = Pr[T = t | Xi = 1, X−i = x−i] / Pr[T = t | Xi = 0, X−i = x−i] ≤ 1 + O(ε)
Lower Bound for Inner Product
Thm: Every 2-party differentially private protocol for ⟨x,y⟩ has
error Ω(n^{1/2}/log n) whp
Proof:
• X, Y = uniformly random, T = trans(A(X),B(Y))
• Claim: conditioned on T = t, X, Y are independent unpredictable
(Santha-Vazirani) sources.
• Claim: If X, Y are independent, unpredictable sources on {0,1}^n, then
⟨X,Y⟩ mod m is almost-uniform in Zm for some m = Ω(n^{1/2}/log n)
– Randomness extractor!
– Generalizes [Vaz86] result for m = 2.
– Fourier analysis over Zm
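An empirical sanity check of the extractor claim, with a crude stand-in for Santha-Vazirani sources (independent bits with a fixed bias; real SV sources allow adversarial, history-dependent bias):

```python
import random
from collections import Counter

def sv_bits(n, delta=0.2):
    """Independent bits with bias delta: a simple special case of a
    Santha-Vazirani source."""
    return [random.random() < 0.5 + delta for _ in range(n)]

def ip_mod_m_histogram(n=1_000, m=5, trials=5_000):
    """<X,Y> mod m spreads out over Z_m even though X, Y are biased."""
    counts = Counter(
        sum(x & y for x, y in zip(sv_bits(n), sv_bits(n))) % m
        for _ in range(trials)
    )
    return {r: round(counts[r] / trials, 3) for r in range(m)}

print(ip_mod_m_histogram())   # each residue should be near 1/m = 0.2
```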
Lower Bound for Inner Product
Thm: Every 2-party differentially private protocol for ⟨x,y⟩ has
error Ω(n^{1/2}/log n) whp
Proof:
• X, Y = uniformly random, T = trans(A(X),B(Y))
• Claim: conditioned on T = t, X, Y are independent unpredictable
(Santha-Vazirani) sources.
• Claim: If X, Y are independent, unpredictable sources on {0,1}^n, then
⟨X,Y⟩ mod m is almost-uniform in Zm for some m = Ω(n^{1/2}/log n)
⇒ Whp ⟨X,Y⟩ ∉ [output(T) − o(m), output(T) + o(m)]
Connections with Communication Complexity
[MMPRTV10]
• Let (A(x),B(y)) be a 2-party protocol, x ∈ X^n, y ∈ Y^n
• CC(A,B) = maximum number of bits communicated
• (X,Y) distributed in X^n × Y^n according to some distribution μ
• T = transcript(A(X),B(Y))
Thm: (A,B) ε-differentially private ⇒ information cost I(XY;T) = O(εn)
⇒ can simulate w/ CC(A’,B’) = O(εn · polylog(CC(A,B))) [BBCR10]
– Also holds for (ε, o(1/n))-differential privacy if symbols of X are independent
given Y and vice-versa.
– Can preserve differential privacy if # rounds is bounded.
Thm: (A,B) deterministic protocol for approximating f(x,y) to within ±α
⇒ ∃ ε-differentially private protocol for approximating f(x,y) to within
±α + O(CC(A,B) · Sensitivity(f) · (# rounds)/ε)
Conclusions
• Computational complexity is relevant to differential privacy.
• Bad news: producing synthetic data is intractable
• Good news: better protocols against bounded adversaries
Interaction with differential privacy is likely to benefit complexity
theory too.
A More Ambitious Goal:
Noninteractive Data Release
[Diagram: original database D → curator C → sanitization C(D)]

Goal: From C(D), can answer many questions about D,
e.g. all counting queries associated with a large family
of predicates P = { π : X → {0,1} }
Noninteractive Data Release: Desiderata
• (ε,δ)-differential privacy:
for every D1, D2 that differ in one row and every set T,
Pr[C(D1) ∈ T] ≤ exp(ε) · Pr[C(D2) ∈ T] + δ,
with δ negligible
• Utility: C(D) allows answering many questions about D
• Computational efficiency: C is polynomial-time computable.
Utility: Counting Queries
• D = (x1,…,xn) ∈ X^n
• P = { π : X → {0,1} }
• For any π ∈ P, want to estimate (from C(D)) counting query
  π(D) := (Σi π(xi))/n
  within accuracy error ±α
• Example:
  X = {0,1}^d
  P = {conjunctions on ≤ k variables}
  Counting query = k-way marginal
  e.g. What fraction of people in D smoke and have cancer?

  >35 | Smoker? | Cancer?
   0  |    1    |    1
   1  |    1    |    0
   1  |    0    |    1
   1  |    1    |    1
   0  |    1    |    0
   1  |    1    |    1
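A small sketch of evaluating such k-way marginals as normalized counting queries (no privacy added yet; the dataset matches the toy table above):

```python
from itertools import combinations

def marginal(data, cols):
    """Fraction of rows whose attributes in `cols` are all 1: the
    counting query for the conjunction of those variables."""
    return sum(all(row[c] for c in cols) for row in data) / len(data)

def all_k_way_marginals(data, d, k):
    """All conjunction queries on exactly k of the d attributes."""
    return {cols: marginal(data, cols) for cols in combinations(range(d), k)}

# Columns: (>35, Smoker, Cancer), matching the table above.
db = [(0, 1, 1), (1, 1, 0), (1, 0, 1), (1, 1, 1), (0, 1, 0), (1, 1, 1)]
print(marginal(db, (1, 2)))            # fraction who smoke AND have cancer
print(all_k_way_marginals(db, d=3, k=2))
```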
Form of Output
• Ideal: C(D) is a synthetic dataset
– ∀π ∈ P, |π(C(D)) − π(D)| ≤ α
– Values consistent
– Use existing software

  >35 | Smoker? | Cancer?
   1  |    0    |    0
   0  |    1    |    1
   0  |    1    |    0
   0  |    1    |    1
   0  |    1    |    0

• Alternatives?
– Explicit list of |P| answers (e.g. contingency table)
– Median of several synthetic datasets [RR10]
– Program M s.t. ∀π ∈ P, |M(π) − π(D)| ≤ α
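A sketch of testing the ideal condition: the maximum error of a candidate synthetic dataset over all 2-way marginals (both toy tables from the slides above):

```python
from itertools import combinations

def marginal(data, cols):
    # Fraction of rows whose attributes in `cols` are all 1.
    return sum(all(row[c] for c in cols) for row in data) / len(data)

def max_marginal_error(real, synthetic, d, k=2):
    """Max over all k-way conjunctions pi of |pi(synthetic) - pi(real)|;
    the synthetic dataset is alpha-accurate iff this is at most alpha."""
    return max(
        abs(marginal(synthetic, cols) - marginal(real, cols))
        for cols in combinations(range(d), k)
    )

real = [(0, 1, 1), (1, 1, 0), (1, 0, 1), (1, 1, 1), (0, 1, 0), (1, 1, 1)]
synthetic = [(1, 0, 0), (0, 1, 1), (0, 1, 0), (0, 1, 1), (0, 1, 0)]
print(max_marginal_error(real, synthetic, d=3))
```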
Positive Results

reference            | min db size (general P) | min db size (k-way marginals) | complexity (general P) | complexity (k-way marginals) | synthetic?
---------------------|--------------------------|-------------------------------|------------------------|------------------------------|-----------
[DN03,DN04,BDMN05]   | O(|P|^{1/2}/αε)          | O(d^{k/2}/αε)                 | poly(n,|P|)            | poly(n,d^k)                  | N
[BDCKMT07]           | —                        | Õ((2d)^k/ε)                   | —                      | poly(n,2^d)                  | Y
[BLR08]              | O(d·log|P|/α³ε)          | Õ(dk/α³ε)                     | qpoly(n,|P|,2^d)       | qpoly(n,2^d)                 | Y
[DNRRV09,DRV10,HR10] | O(d·log²|P|/α²ε²)        | Õ(dk²/α²ε²)                   | poly(n,|P|,2^d)        | poly(n,|P|,2^d)              | Y

• D = (x1,…,xn) ∈ ({0,1}^d)^n
• P = { π : {0,1}^d → {0,1} }
• π(D) := (1/n) Σi π(xi)
• α = accuracy error
• ε = privacy

Summary: Can construct synthetic databases
accurate on huge families of counting queries,
but complexity may be exponential in the dimensions
of the data and query set P.
Question: is the complexity inherent?
Our Result: Intractability of Synthetic Data
Informally:
• Producing accurate & differentially private synthetic data is as
hard as breaking cryptography (e.g. factoring large integers).
• Inherently exponential in dimensionality of data (and in
dimensionality of queries).
Our Result: Intractability of Synthetic Data
Thm [UV11, building on DNRRV10]:
Under standard crypto assumptions (OWF),
there is no n = poly(d) and curator that:
– Produces synthetic databases.
– Is differentially private.
– Runs in time poly(n,d).
– Achieves error α = .01 for 2-way marginals.
Proof overview:
1. Use digital signatures to show hardness for a complicated
(but efficiently computable) family of counting queries
[DNRRV10]
2. Use PCPs to reduce complicated queries to simple ones
[UV11].
Tool 1: Digital Signature Schemes
A digital signature scheme consists of 3 poly-time algorithms
(Gen, Sign, Ver):
• On security parameter d, Gen(d) = (SK,PK) ∈ {0,1}^d × {0,1}^d
• On m ∈ {0,1}^d, can compute σ = SignSK(m) ∈ {0,1}^d s.t. VerPK(m,σ) = 1
• Given many (m,σ) pairs, infeasible to generate a new (m’,σ’)
satisfying VerPK
Thm [NY89,Rom90]:
Digital signature schemes exist iff one-way functions exist.
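For flavor, a minimal hash-based signature in Python, in the spirit of building signatures from one-way functions [NY89,Rom90]; this Lamport scheme is one-time only, far weaker than the many-message security the reduction needs:

```python
import hashlib
import secrets

def H(x: bytes) -> bytes:
    return hashlib.sha256(x).digest()

def gen(d: int = 256):
    """Gen: secret key = 2d random strings, public key = their hashes."""
    sk = [(secrets.token_bytes(32), secrets.token_bytes(32)) for _ in range(d)]
    pk = [(H(a), H(b)) for a, b in sk]
    return sk, pk

def msg_bits(m: bytes, d: int = 256):
    digest = hashlib.sha256(m).digest()
    return [(digest[i // 8] >> (i % 8)) & 1 for i in range(d)]

def sign(sk, m: bytes):
    """Sign: reveal one hash preimage per message-digest bit."""
    return [sk[i][b] for i, b in enumerate(msg_bits(m))]

def ver(pk, m: bytes, sig) -> bool:
    """Ver: check each revealed preimage against the public hash."""
    return all(H(s) == pk[i][b] for (i, b), s in zip(enumerate(msg_bits(m)), sig))

sk, pk = gen()
sig = sign(sk, b"some row m")
assert ver(pk, b"some row m", sig) and not ver(pk, b"another row", sig)
```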
Hard-to-Sanitize Databases I [DNRRV10]
Generate random (PK,SK) ← Gen(d), m1, m2, …, mn ← {0,1}^d

D: rows (m1, SignSK(m1)), (m2, SignSK(m2)), …, (mn, SignSK(mn));
by construction, VerPK(D) = 1 on every row.

curator: C(D) = rows (m’1, σ1), (m’2, σ2), …, (m’k, σk)

If error ≤ α wrt the counting query VerPK:
• VerPK(C(D)) ≥ 1 − α > 0
• ⇒ ∃i VerPK(m’i, σi) = 1
Case 1: m’i ∉ D ⇒ Forgery!
Case 2: m’i ∈ D ⇒ Reidentification!
Tool 2: Probabilistically Checkable Proofs
The PCP Theorem: ∃ efficient algorithms (Red, Enc, Dec) s.t.
• Red: circuit V of size poly(d) ↦ set of 3-clauses on d’ = poly(d) vars,
e.g. V = {x1 ∨ x5 ∨ ¬x7, ¬x1 ∨ x5 ∨ xd’, …}
• Enc: w s.t. V(w) = 1 ↦ z ∈ {0,1}^{d’} satisfying all of V
• Dec: z’ ∈ {0,1}^{d’} satisfying a .99 fraction of V ↦ w’ s.t. V(w’) = 1
Hard-to-Sanitize Databases II [UV11]
D: rows z1, z2, …, zn, where zi = Enc(mi, SignSK(mi))
• Let V_PK = Red(VerPK)
• Each clause in V_PK is satisfied by all zi

curator: C(D) = rows z’1, z’2, …, z’k

If error ≤ .01 wrt 3-way marginals:
• Each clause in V_PK is satisfied by a ≥ .99 fraction of the z’i
• ⇒ ∃i s.t. z’i satisfies a ≥ 1 − α fraction of the clauses
• Dec(z’i) = valid (m’i, σi)
Case 1: m’i ∉ D ⇒ Forgery!
Case 2: m’i ∈ D ⇒ Reidentification!
Conclusions
• Producing private, synthetic databases that preserve simple
statistics requires computation exponential in the dimension of the
data.
How to bypass?
• Average-case accuracy: Heuristics that don’t give good accuracy on
all databases, only those from some class of models.
• Non-synthetic data:
– Hardness extends to some generalizations of synthetic data (e.g. medians
of synthetic data).
– Thm [DNRRV09]: For unnatural but efficiently computable P,
∃ efficient curators “iff” ∃ efficient “traitor-tracing” schemes
– But for natural P (e.g. P = {all marginals}), wide open!