Towards Privacy in Public Databases
Shuchi Chawla, Cynthia Dwork,
Frank McSherry, Adam Smith,
Larry Stockmeyer, Hoeteck Wee
Work Done at Microsoft Research
Database Privacy
Think “Census”
Individuals provide information
Census Bureau publishes sanitized records
Privacy is legally mandated; what utility can we achieve?
Inherent Privacy vs Utility tension
One extreme – complete privacy; no information
Other extreme – complete information; no privacy
Goals:
Find a middle path
preserve macroscopic properties
“disguise” individual identifying information
Change the nature of discourse
Establish framework for meaningful comparison of techniques
Outline
Definitions
privacy, defined in the breach
sanitization requirements
utility goals
Example: Recursive Histogram Sanitizations
description of technique
a robust proof of privacy
Example: “Round” Sanitizations
nice learning properties
privacy via cross-training
Setting the Real World Context
dealing with auxiliary information
What do WE mean by privacy?
[Ruth Gavison] Protection from being brought to the attention of others
inherently valuable
attention invites further privacy loss
Privacy is assured to the extent that one blends in with the crowd
Appealing definition; can be converted into a precise mathematical statement…
A geometric view
Abstraction:
Database consists of points in high-dimensional space R^d
Points are unlabeled
you are your collection of attributes
Distance is everything
points are more similar if and only if they are closer
Real Database (RDB), private
n unlabeled points in d-dimensional space
think of d as number of sensitive attributes
Sanitized Database (SDB), public
n’ new points, possibly in a different space
The adversary or Isolator - Intuition
On input SDB and auxiliary information, the adversary outputs a point q ∈ R^d
q “isolates” a real DB point x if it is much closer to x than to x’s near neighbors
q fails to isolate x if q looks roughly as much like everyone in x’s neighborhood as it looks like x itself
Tightly clustered points have a smaller radius of isolation
(c,T)-Isolation – the definition
I(SDB, aux) = q
Let δ = ‖q − x‖ for a point x in the RDB; q (c,T)-isolates x if B(q, cδ) contains fewer than T other points from RDB
[figure: q at distance δ from x; the ball B(q, cδ) also captures a nearby point p]
c – privacy parameter; eg, c = 4
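A minimal reading of the definition in code; a NumPy sketch with our own names (`isolates`, `rdb`, the default thresholds), not part of the talk:

```python
import numpy as np

def isolates(q, rdb, c=4.0, t=20):
    """Does q (c,t)-isolate its nearest RDB point x?

    delta = ||q - x|| for the nearest point x; q isolates x when the
    ball B(q, c*delta) holds fewer than t RDB points other than x.
    """
    dists = np.linalg.norm(rdb - q, axis=1)  # distance from q to every RDB point
    delta = dists.min()                      # delta: distance to the nearest point x
    inside = np.sum(dists <= c * delta)      # RDB points inside B(q, c*delta)
    return inside - 1 < t                    # exclude x itself from the count
```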
Requirements for the sanitizer
No way of obtaining privacy if AUX already reveals too much!
Sanitization procedure compromises privacy if giving the adversary access to the SDB considerably increases its probability of success
Definition of “considerably” can be forgiving
Formally, quantify over distributions, adversaries, choice of database, auxiliary information:
∀D ∀I ∃I′ such that w.h.p. over the choice of RDB from D and of aux:
∀x: |Pr[I(SDB, aux) isolates x] − Pr[I′(aux) isolates x]| is small
probabilities over choices made by the sanitizer and by I, I′
Provides a framework for describing the power of a sanitization method, and hence for comparisons
Aux is going to cause trouble. Ignore it for now.
Utility Goals
Pointwise proofs of specific utilities
averages, medians, clusters, regressions,…
Prove there is a large class of interesting utilities for which there are good approximation procedures using sanitized data
Recursive Histogram Sanitization
U = d-dim cube, side = 2
Cut into 2^d subcubes
split along each axis
subcube has side = 1
For each subcube
if number of RDB points > 2T
then recurse
Output: list of cells and counts
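A sketch of the recursion in Python/NumPy, assuming the data have been rescaled into the cube [0,2]^d with one corner at the origin; function and parameter names are ours:

```python
import numpy as np

def histogram_sanitize(points, corner, side, t):
    """Recursive histogram: split a cell into its 2^d subcubes while it
    holds more than 2t points; publish (corner, side, count) per cell.

    Only non-empty cells are listed; empty subcubes implicitly have count 0.
    """
    n, d = points.shape
    if n <= 2 * t or side < 1e-9:     # sparse enough (or degenerate): publish
        return [(corner, side, n)]
    half = side / 2.0
    bits = (points >= corner + half).astype(int)  # upper or lower half per axis
    codes = bits @ (1 << np.arange(d))            # integer code of each subcube
    cells = []
    for code in np.unique(codes):
        offs = ((code >> np.arange(d)) & 1) * half  # corner offset of the subcube
        cells += histogram_sanitize(points[codes == code], corner + offs, half, t)
    return cells
```

For example, `histogram_sanitize(rdb, np.zeros(d), 2.0, t=20)` returns the published list of cells and counts.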
Recursive Histogram Sanitization
Theorem: ∃c s.t. if n points are drawn uniformly from U, then recursive histogram sanitizations are safe with respect to c-isolation: Pr[I(SDB) succeeds] ≤ 2^(−Ω(d)).
Safety of Recursive Histogram Sanitization
Rough Intuition
Expected distance ‖q−x‖ is ≈ the diameter of the cell
Distances are tightly concentrated around the mean
Multiplying the radius by c captures almost all of the parent cell, which contains at least 2T points
For Very Large Values of n
Wlog can switch to ball adversaries: (q,r)
I wins if B(q,r) contains at least one RDB point and B(q,cr) contains fewer than T RDB points
Define a probability density f(x) that captures the adversary’s view of the RDB
To win with probability γ, I needs:
Pr_f[B(q,r)] ≥ γ/n
Pr_f[B(q,cr)] ≤ (2T + O(log γ⁻¹))/n
⇒ Pr_f[B(q,r)] / Pr_f[B(q,cr)] ≥ γ / (2T + O(log γ⁻¹))
Bound γ by bounding the ratio: it is ≤ 2^(−Ω(d)), forcing γ ≪ 1
Pr_f[B(q,r)] / Pr_f[B(q,cr)]
f(x) = (n_C / n) · (1 / Vol(C)) for x in cell C
fraction of RDB points landing in cell C, spread uniformly within C
If r is sufficiently small, the bigger ball captures 2^(Ω(d)) times more mass in each subcube than does the smaller ball
yields ratio < 2^(−Ω(d))
If r is large, the small ball captures nothing, or the bigger ball captures the parent cube
Either way, isolation cannot occur (c = 16)
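Chaining the two bounds makes the conclusion explicit (our rendering, writing γ for the adversary's success probability):

```latex
\frac{\gamma}{2T + O(\log \gamma^{-1})}
\;\le\;
\frac{\Pr_f[B(q,r)]}{\Pr_f[B(q,cr)]}
\;\le\; 2^{-\Omega(d)}
\quad\Longrightarrow\quad
\gamma \;\le\; 2^{-\Omega(d)} \bigl(2T + O(\log \gamma^{-1})\bigr)
```

so for large d the adversary succeeds with only negligible probability.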
Proof is Very Robust
Extends to many interesting cases
non-uniform but bounded-ratio density fns
isolator knows constant fraction of attribute vals
isolator knows lots of RDB points
isolation in few attributes
very weak bounds
Can be adapted to “round” distributions
balls, spheres, mixtures of Gaussians; with effort [work in progress w/ K. Talwar]
More General Distributions
“good” islands in a sea of zero probability
Round Sanitizations
The privacy of x is linked to its T-radius
Randomly perturb it in proportion to its T-radius
x′ = San(x) chosen uniformly at random from the ball B(x, T-rad(x))
alternatively: from the sphere S(x, T-rad(x)), or add d-dimensional Gaussian noise
Intuition:
We are blending x in with its crowd
We are adding to x random noise with mean zero, so several macroscopic properties should be preserved
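A sketch of the perturbation, assuming uniform sampling from the ball (the slide's alternatives would swap in the sphere or a d-dimensional Gaussian); names are ours:

```python
import numpy as np

def round_sanitize(rdb, t, rng=np.random.default_rng()):
    """Perturb each point uniformly within the ball of its t-radius.

    The t-radius of x is its distance to its t-th nearest neighbor
    (requires t < n), so dense regions get little noise, sparse ones more.
    """
    n, d = rdb.shape
    out = np.empty_like(rdb, dtype=float)
    for i, x in enumerate(rdb):
        dists = np.sort(np.linalg.norm(rdb - x, axis=1))
        t_rad = dists[t]                         # dists[0] == 0 is x itself
        u = rng.normal(size=d)
        u /= np.linalg.norm(u)                   # uniform direction on the sphere
        r = t_rad * rng.random() ** (1.0 / d)    # radius giving uniform mass in the ball
        out[i] = x + r * u
    return out
```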
Nice Learning Properties
Known algorithm for learning mixtures of Gaussians works for clustering sanitized Gaussian data
Original distribution (mixture of Gaussians) is recovered
Technical issue: added noise is a function of the data
Subject of another talk
Diameter increases by at most a factor of 3 when finding k clusters minimizing the largest diameter
Privacy for n Sanitized Points?
Given n−1 points in the clear, the probability of isolating the nth is O(exp(−d))
Intuition for extension to n points is wrong!
Privacy of x_n given x_n′ and all the other points in the clear does not imply privacy of x_n given x_n′ and sanitizations of the others!
Sanitization of other points reveals information about x_n
The worry is for the safety of the reference point (the neighbor defining the T-radius), not the principal
Combining the Two Sanitizations
Partition RDB into two sets A and B
Cross-training
Compute histogram sanitization for B
∀v ∈ A: σ_v = f(side length of the cell C containing v)
Output GSan(v, σ_v)
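A sketch of the pipeline, reusing `histogram_sanitize` from the earlier sketch; the slide leaves f unspecified, so we assume GSan adds spherical Gaussian noise with standard deviation proportional to the cell's side length:

```python
import numpy as np

def cross_train_sanitize(rdb, t, rng=np.random.default_rng()):
    """Cross-training: a histogram built on B sets the noise scale for A."""
    perm = rng.permutation(len(rdb))
    a, b = rdb[perm[::2]], rdb[perm[1::2]]              # partition RDB into A and B
    d = rdb.shape[1]
    cells = histogram_sanitize(b, np.zeros(d), 2.0, t)  # histogram computed on B only
    out = []
    for v in a:
        # side length of the histogram cell containing v; fall back to the
        # full cube's side when v lands in a region left empty by B
        side = next((s for (corner, s, _) in cells
                     if np.all(v >= corner) and np.all(v < corner + s)), 2.0)
        out.append(v + rng.normal(scale=side, size=d))  # GSan(v, sigma_v); sigma_v = side is our assumption
    return out, cells
```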
Cross-Training Privacy
Privacy for B: only histogram information about B is used
Privacy for A: enough variance in enough coordinates of v, even given the cell C containing v and the sanitization v′ of v
current proof works only for |A| = 2^o(d)
Additional Results*
Impossibility Results
∃ interesting utilities that have no sanitization protecting against isolation (cf. SFE)
Impossibility of all-purpose sanitizers
There is always a choice of aux that defeats a certain natural version of privacy
Contrived, but places a limit on what can be proved
Poly-time bounded adversary? Connection to obfuscation.
Utility
Exploit literature on the power of randomized histograms for data-stream algorithms (eg, Indyk)
* with assorted collaborators, eg, N, N, S, T
A Standard Technique: Cell Suppression
Gestalt: Tabular Data (many, possibly linked, tables)
entries are cells
frequency (count) data
magnitude data (income, sales, etc.)
    16   8   5   2 | 31
     1   5  20   3 | 29
    ---------------+---
    17  13  25   5 | 60
Disclosure = small counts
Provides key for population unique, or almost-unique
Can be used as a key into a different database
Enormous literature on suppressing “safely”
Connection to Our Definitions
Protection against isolation yields protection against learning a key for a population unique
isolation on a subspace does not imply isolation in the full-dimensional space …
… but aux may contain other DBs that can be queried to learn remaining attributes
definition mandates protection against all possible aux
satisfy def ⇒ can’t learn key
Connection to Our Definitions
Seems very hard to provide good sanitization in the presence of arbitrary aux
Provably impossible in general
Anyway, one can probably already isolate people based solely on aux
Suggests we need to control aux
How should we redesign the world?
Two Tools
Secure Function Evaluation [Yao, GMW]
Technique permitting Alice, Bob, Carol, and their friends to collaboratively compute a function f of their private inputs, obtaining y = f(a, b, c, …)
eg, y = sum(a, b, c, …)
Each player learns only what can be deduced from y and her own input to f (a toy sketch of the sum example follows this slide)
SuLQ databases [Dwork, Nissim]
Provably preserves privacy of attributes when the rows of the database are mutually independent
Powerful [DwNi; Blum, Dwork, McSherry, Nissim]
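The sum example can be realized with additive secret sharing, a standard building block for SFE; a toy sketch under that assumption (not the Yao or GMW protocols themselves):

```python
import random

P = 2**61 - 1  # work modulo a large prime

def share(x, n):
    """Split secret x into n additive shares summing to x mod P."""
    shares = [random.randrange(P) for _ in range(n - 1)]
    shares.append((x - sum(shares)) % P)
    return shares

def secure_sum(inputs):
    """Each party shares its input; any n-1 shares of a secret are
    uniformly random, so each party learns only the sum and its own input."""
    n = len(inputs)
    all_shares = [share(x, n) for x in inputs]
    # party j locally adds the j-th share of every input...
    partials = [sum(s[j] for s in all_shares) % P for j in range(n)]
    # ...and publishing the partial sums reveals only the total
    return sum(partials) % P

assert secure_sum([3, 14, 15]) == 32
```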
Statistical Database
Database DB: n persons (rows) × d attributes (columns), entries in {0,1}
Row distribution D = (D1, D2, …, Dn)
[figure: n × d table of 0/1 entries]
Query (S, f): S ⊆ [n], f : {0,1}^d → {0,1}
Exact answer: Σ_{r∈S} f(row r)
Sub-Linear Query (SuLQ) Databases
If the number of queries is << n, then privacy can be protected with little noise (per query):
E(noise) = 0; standard dev << √n
Much less than sampling error!
Noisy answer: Σ_{r∈S} f(row r) + noise
[figure: the same n × d table of 0/1 entries; each query answer is returned with noise added]
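A minimal sketch of a noisy SuLQ answer; the slides fix only mean zero and standard deviation ≪ √n, so the Gaussian here and all names are our assumptions:

```python
import numpy as np

def sulq_answer(db, subset, f, sigma, rng=np.random.default_rng()):
    """Answer the query (S, f) on an n x d 0/1 database with additive noise.

    sigma should be much smaller than sqrt(n), yet large enough for the
    number of queries the database is willing to answer.
    """
    exact = sum(f(db[r]) for r in subset)   # exact answer: sum over r in S of f(row r)
    return exact + rng.normal(scale=sigma)  # zero-mean noisy answer

# eg, count rows in S whose first two attributes are both set:
# sulq_answer(db, range(100), lambda row: int(row[0] and row[1]), sigma=5.0)
```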
Our Data, Ourselves
Individuals maintain their own data records
join a DB by setting an appropriate attribute
[figure: an individual’s data record, e.g., 0 4 6 3 … 1 0 …]
Statistical queries via SFE(SuLQ)
privacy of SuLQ query ⇒ this SFE is “safe”
Individuals ensure
data take part in sufficiently few queries
sufficient random noise is added
Summary
Definitions
defined isolation and sanitization
Recursive Histogram Sanitizations
described approach and sketched a robust proof of privacy for a special distribution
proof exploits high dimensionality (# columns)
Sanitization via perturbations
utility and privacy via cross-training
Setting the Real World Context
discussed a radical view of how data might be organized to prevent a powerful class of attacks based on auxiliary data
SuLQ tool exploits large membership (# rows)
Larry Joseph Stockmeyer
November 13, 1948 - July 31, 2004
Larry Stockmeyer Commemoration
May 21-22, 2005
Baltimore, Maryland
(in conjunction with STOC 2005)
May 21:
Tutorial by Nick Pippenger (Princeton) on some of Stockmeyer's fundamental results in complexity theory
Lectures by Miki Ajtai (IBM), Anne Condon (UBC), Cynthia Dwork (Microsoft), Richard Karp (UC Berkeley), Albert Meyer (MIT), and Chris Umans (CalTech)
Some time will be reserved for personal remarks. Contact Cynthia Dwork if you want to participate in this part of the commemoration.
May 22: Lance Fortnow gives first keynote address to STOC.