Toward Privacy in Public Databases
Shuchi Chawla, Cynthia Dwork,
Frank McSherry, Adam Smith,
Larry Stockmeyer, Hoeteck Wee
Work Done at Microsoft Research
Database Privacy
- Think “Census”
  - Individuals provide information
  - Census Bureau publishes sanitized records
  - Privacy is legally mandated; what utility can we achieve?
- Inherent privacy vs. utility trade-off
  - One extreme: complete privacy, no information
  - Other extreme: complete information, no privacy
- Goals:
  - Find a middle path
    - preserve macroscopic properties
    - “disguise” individual identifying information
  - Change the nature of the discourse
  - Establish a framework for meaningful comparison of techniques
Current solutions
- Statistical approaches
  - Alter the frequency (PRAN/DS/PERT) of particular features, while preserving means
  - Additionally, erase values that reveal too much
- Query-based approaches
  - Disallow queries that reveal too much
  - Output perturbation (add noise to the true answer)
- Unsatisfying
  - Ad hoc definitions of privacy and of what constitutes a breach
  - Erasure can itself disclose information
  - Noise can cancel out (although, see the work of Nissim et al.)
  - Combinations of several seemingly innocuous queries can reveal information; refusal to answer can be revelatory
Everybody’s First Suggestion
- Learn the distribution, then output
  - a description of the distribution, or
  - samples from the learned distribution
- We want to reflect the facts on the ground
  - Statistically insignificant clusters can be important for allocating resources
Our Approach
- Crypto-flavored definitions
  - Mathematical characterization of the adversary’s goal
  - Precise definition of when a sanitization procedure fails
  - Intuition: seeing the sanitized DB gives the adversary an “advantage”
- Statistical techniques
  - Perturbation of attribute values
  - Differs from previous work: perturbation amounts depend on local densities of points
- Highly abstracted version of the problem
  - If we can’t understand this, we can’t understand real life (and we can’t…)
  - If we get negative results here, the world is in trouble.
What do WE mean by privacy?
- [Ruth Gavison] Protection from being brought to the attention of others
  - inherently valuable
  - attention invites further privacy loss
- Privacy is assured to the extent that one blends in with the crowd
- An appealing definition, and one that can be converted into a precise mathematical statement…
A geometric view
- Abstraction:
  - The database consists of points in high-dimensional space R^d, drawn as independent samples from some underlying distribution
  - Points are unlabeled: you are your collection of attributes
  - Distance is everything: points are similar if and only if they are close (L2 norm)
- Real Database (RDB), private: n unlabeled points in d-dimensional space
- Sanitized Database (SDB), public: n′ new points, possibly in a different space
The Adversary, or Isolator – Intuition
- On input the SDB and auxiliary information, the adversary outputs a point q ∈ R^d
- q “isolates” a real DB point x if it is much closer to x than to x’s near neighbors
  - q fails to isolate x if q looks roughly as much like everyone in x’s neighborhood as it looks like x itself
  - Tightly clustered points have a smaller radius of isolation
Isolation – the definition
- I(SDB, aux) = q
- Let d_x = |q − x|. The point x is isolated if B(q, c·d_x) contains fewer than T other points from the RDB
  - c is the privacy parameter; e.g., c = 4
- T-radius of x: the distance to its T-th nearest neighbor
- x is “safe” if d_x > (T-radius of x)/(c − 1), since then B(q, c·d_x) contains x’s entire T-neighborhood: for any neighbor p, if |x − p| ≤ T-rad_x < (c − 1)·d_x, then |q − p| ≤ |q − x| + |x − p| < d_x + T-rad_x < c·d_x
[Figure: query point q at distance d_x from x, a neighbor p within x’s T-radius, and the ball B(q, c·d_x).]
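To make the definition concrete, here is a minimal sketch of the isolation check in Python (my own illustration, not code from the paper), assuming Euclidean (L2) distance and a numpy array `rdb` whose rows are the real database points:

```python
import numpy as np

def t_radius(x, rdb, T):
    """Distance from x to its T-th nearest neighbor in rdb.
    Assumes x is itself a row of rdb, so dists[0] == 0 is x itself."""
    dists = np.sort(np.linalg.norm(rdb - x, axis=1))
    return dists[T]

def isolates(q, x, rdb, c=4.0, T=20):
    """True if q c-isolates x: the ball B(q, c*d_x), with d_x = |q - x|,
    contains fewer than T points of rdb other than x."""
    d_x = np.linalg.norm(q - x)
    in_ball = np.linalg.norm(rdb - q, axis=1) < c * d_x
    return in_ball.sum() - 1 < T   # subtract 1: x itself lies in the ball
```

By the safety condition above, this can only return True when d_x ≤ (T-radius of x)/(c − 1); otherwise the ball already contains x’s entire T-neighborhood.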
Requirements for the sanitizer
- There is no way to obtain privacy if AUX already reveals too much!
- A sanitization procedure compromises privacy if giving the adversary access to the SDB considerably increases its probability of success
  - The definition of “considerably” can be forgiving, say, n^(−2)
- Made rigorous by quantification over adversaries, distributions, auxiliary information, sanitizations, and samples:
  - ∀ I ∃ I′ such that, with overwhelming probability over D, ∀ aux z, ∀ x ∈ D:
    |Pr[I(SDB, z) isolates x] − Pr[I′(z) isolates x]| is small/n
- This provides a framework for describing the power of a sanitization method, and hence for comparisons
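Spelled out in display form (my transcription of the slide’s quantifiers; the bound “small/n” is written here as ε/n):

```latex
\forall I \;\; \exists I' \text{ such that, w.o.p.\ over } D,\;
\forall \text{ aux } z,\; \forall x \in D:\quad
\bigl|\, \Pr[\, I(\mathrm{SDB}, z) \text{ isolates } x \,]
      - \Pr[\, I'(z) \text{ isolates } x \,] \,\bigr| \;\le\; \frac{\varepsilon}{n}
```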
The Sanitizer
- The privacy of x is linked to its T-radius
- Randomly perturb x in proportion to its T-radius:
  x′ = San(x), chosen uniformly at random from the ball B(x, T-rad(x))
- Intuition:
  - We are blending x in with its crowd
  - We are adding random noise with mean zero to x, so several macroscopic properties should be preserved.
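A sketch of this sanitizer (again my own illustration), reusing `t_radius` from the earlier snippet; the standard trick for sampling uniformly from a d-ball is a uniform direction times a radius drawn as u^(1/d):

```python
import numpy as np

def sample_ball(center, radius, rng):
    """Uniform sample from the ball B(center, radius) in R^d."""
    d = center.shape[0]
    u = rng.standard_normal(d)
    u /= np.linalg.norm(u)                   # uniform direction on the sphere
    r = radius * rng.uniform() ** (1.0 / d)  # CDF ~ r^d gives uniform mass in the ball
    return center + r * u

def sanitize(rdb, T, rng=np.random.default_rng()):
    """Perturb each point uniformly within its own T-radius ball."""
    return np.array([sample_ball(x, t_radius(x, rdb, T), rng) for x in rdb])
```

The noise is symmetric about x, hence mean zero, which is what the intuition about preserving macroscopic properties relies on.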
Flavor of Results (Preliminary)
- Assumptions
  - Data arise from a mixture of Gaussians
  - Dimension d and number of points n are large; d = ω(log n)
- Results
  - Privacy: An adversary who knows the Gaussians and some auxiliary information cannot isolate any point with probability more than 2^(−Ω(d))
    - several special cases; the general result is not yet proved
    - very different proof techniques from anything in the statistics or crypto literatures!
  - Utility: A user who does not know the Gaussians can compute the means with high probability.
The “simplest” interesting case
- Two points, x and y, generated uniformly from the surface of a ball B(o, r)
- The adversary knows x′, y′, r, and d = |x − y|
- We prove there are 2^(Ω(d)) “decoy” pairs (x_i, y_i) such that |x_i − y_i| = d and Pr[x_i, y_i | x′, y′] = Pr[x, y | x′, y′]
- Furthermore, the adversary can only isolate one point x_i or y_i at a time: they are “far apart” with respect to d
- Proof based on symmetry arguments and coding theory; high dimensionality is crucial.
Finding Decoy Pairs
- Consider a hyperplane H through x′, y′, and o
- Let x_H, y_H be the mirror reflections of x, y through H
  - Note: reflections preserve distances!
- The world of x_H, y_H looks identical to the world of x, y:
  Pr[x_H, y_H | x′, y′] = Pr[x, y | x′, y′]
[Figure: x, y and their reflections x_H, y_H on either side of the hyperplane H, which passes through x′ and y′.]
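The distance-preservation step is easy to check numerically. Below is a small illustration (mine, not the paper’s construction) using a Householder reflection through a hyperplane with a given normal through the origin:

```python
import numpy as np

def reflect(points, normal):
    """Householder reflection of each row of `points` through the
    hyperplane through the origin with the given normal vector."""
    n = normal / np.linalg.norm(normal)
    return points - 2.0 * np.outer(points @ n, n)

rng = np.random.default_rng(0)
d = 50
x, y = rng.standard_normal(d), rng.standard_normal(d)
xH, yH = reflect(np.stack([x, y]), rng.standard_normal(d))
# reflections preserve pairwise distances:
assert np.isclose(np.linalg.norm(xH - yH), np.linalg.norm(x - y))
```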
Lots of choices for H
- x_H, y_H: reflections of x, y through H(x′, y′, o)
  - Note: reflections preserve distances!
- The world of x_H, y_H looks identical to the world of x, y
- How many different H are there such that the corresponding x_H are pairwise distant (and distant from x)?
[Figure: two reflections x_1, x_2 at angle 2θ apart lie at distance 2r sin θ > (2/3)d; it suffices to pick r > (2/3)d and θ = 30°.]
- Fact: there are 2^(Ω(d)) vectors in d dimensions at angle 60° from each other.
- Hence the probability that the adversary wins is ≤ 2^(−Ω(d))
Towards the general case… n points
- The adversary is given n − 1 real points x_2, …, x_n and one sanitized point x′_1
- Symmetry does not work: too many constraints
- A more direct argument:
  - Let Z = { p ∈ R^d | p is a legal pre-image for x′_1 }
  - Let Q = { p | if x_1 = p, then x_1 is isolated by q }
  - Show that Pr[x_1 ∈ Q∩Z | x′_1] ≤ 2^(−Ω(d)):
    Pr[x_1 ∈ Q∩Z | x′_1] = (probability mass contributed by Q∩Z) / (mass contributed by Z) ≤ 2^(1−d) / (1/4)
Why does Q∩Z contribute so little mass?
- Z = { p | p is a legal pre-image for x′_1 }
- Q = { p | if x_1 = p, then x_1 is isolated by q }
- Here T = 1, and we perturb to the 1-radius, so |x′_1 − x_1| = 1-rad(x_1)
- Key observation: as |q − x′_1| increases, Q becomes larger. But a larger distance from x′_1 implies a smaller probability mass, as x_1 is randomized over a larger area.
[Figure: real points x_2, …, x_6 around x′_1, the regions Z and Q∩Z, and the adversary’s point q.]
The general case… n sanitized points
- Initial intuition is wrong:
  - Privacy of x_1 given x′_1 and all the other points in the clear does not imply privacy of x_1 given x′_1 and sanitizations of the others!
  - Sanitization of the other points reveals information about x_1
Digression: Histogram Sanitization
- U = d-dimensional cube with side 2
- Cut it into 2^d subcubes
  - split in half along each axis
  - each subcube has side 1
- For each subcube: if the number of RDB points in it is > 2T, then recurse
- Output: a list of cells and their counts
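A minimal recursive sketch of this procedure (my own simplification: only non-empty cells are listed, and `max_depth` is an added safety cap), assuming numpy and a small dimension d:

```python
import numpy as np

def histogram_sanitize(points, lower, side, two_T, depth=0, max_depth=20):
    """Report (cell corner, side, count); if a cell holds more than 2T
    points, split it into 2^d half-side subcubes and recurse."""
    if len(points) <= two_T or depth >= max_depth:
        return [(tuple(lower), side, len(points))]
    d = len(lower)
    mid = lower + side / 2.0
    codes = (points >= mid).astype(int) @ (1 << np.arange(d))  # subcube index
    cells = []
    for code in np.unique(codes):
        bits = np.array([(int(code) >> j) & 1 for j in range(d)])
        corner = lower + bits * side / 2.0
        cells += histogram_sanitize(points[codes == code], corner,
                                    side / 2.0, two_T, depth + 1, max_depth)
    return cells
```

For the cube U of side 2 centered at the origin, the top-level call would be `histogram_sanitize(rdb, np.full(d, -1.0), 2.0, two_T)`.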
Digression: Histogram Sanitization
- Theorem: If n = 2^(o(d)) and the points are drawn uniformly from U, then histogram sanitizations are safe with respect to 8-isolation: Pr[I(SDB) succeeds] ≤ 2^(−Ω(d)).
- Rough intuition: For q ∈ C, the expected distance to any x ∈ C is relatively large (and even larger for x ∈ C′); distances are tightly concentrated. Increasing the radius by a factor of 8 captures almost all of the parent cell, which contains at least 2T points.
Combining the Two Sanitizations
- Partition the RDB into two sets, A and B
- Cross-training:
  - Compute the histogram sanitization for B
  - For each v ∈ A, let r_v be the side length of the cell C containing v
  - Output GSan(v, r_v)
[Figure: the database partitioned into halves A and B.]
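GSan is not spelled out on this slide; in the sketch below I take it to be a Gaussian perturbation with per-coordinate scale r_v, which is an assumption on my part. The cell lookup reuses `histogram_sanitize` from above:

```python
def cross_train(rdb, two_T, rng=np.random.default_rng()):
    """Histogram-sanitize half B; perturb each point of half A with
    Gaussian noise scaled to the side of its containing cell."""
    half = len(rdb) // 2
    A, B = rdb[:half], rdb[half:]
    d = rdb.shape[1]
    cells = histogram_sanitize(B, np.full(d, -1.0), 2.0, two_T)

    def side_of_cell(v):
        sides = [s for corner, s, _ in cells
                 if np.all(v >= np.array(corner)) and np.all(v < np.array(corner) + s)]
        return min(sides) if sides else 2.0   # fall back to the whole cube

    sanitized_A = np.array([v + side_of_cell(v) * rng.standard_normal(d) for v in A])
    return sanitized_A, cells
```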
Cross-Training Privacy
- Privacy for B: only histogram information about B is used
- Privacy for A: there is enough variance in enough coordinates of v, even given the cell C containing v and the sanitization v′ of v.
Results on privacy… the special cases

Distribution                                 | Num. of points | Revealed to adversary                      | Auxiliary information
Uniform on surface of sphere                 | 2              | Both sanitized points                      | Distribution, 1-radius
Uniform over bounding box or sphere surface  | n              | One sanitized point, all other real points | Distribution
Uniform over a hypercube                     | 2^(Ω(d))       | n/2 sanitized points                       | Distribution
Gaussian                                     | 2^(o(d))       | n sanitized points                         | Distribution
Learning Mixtures of Gaussians – Spectral Techniques
- Observation: an optimal low-rank approximation to a matrix of complex data yields the underlying structure, e.g., the means [M01, VW02].
- We show that McSherry’s algorithm works for clustering sanitized Gaussian data:
  - the original distribution (mixture of Gaussians) is recovered
Spectral techniques for perturbed data
- A sanitized point is the sum of two Gaussian variables: sample + noise
- w.h.p. the T-radius of a point is less than the “radius” of its Gaussian
- So the variance of the noise is small
- Hence the previous techniques work
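A generic spectral-clustering sketch in the spirit of [M01, VW02] (not necessarily McSherry’s exact algorithm): project the sanitized points onto the best rank-k subspace and cluster there. Assumes numpy and scikit-learn:

```python
import numpy as np
from sklearn.cluster import KMeans

def spectral_cluster(sdb, k):
    """Project onto the top-k right singular vectors (the optimal
    rank-k approximation), then cluster the low-dimensional points."""
    centered = sdb - sdb.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    proj = centered @ vt[:k].T                     # n x k coordinates
    labels = KMeans(n_clusters=k, n_init=10).fit_predict(proj)
    means = np.array([sdb[labels == j].mean(axis=0) for j in range(k)])
    return labels, means                           # estimated cluster means
```

Because the perturbation noise has small variance relative to the cluster separation, the top-k subspace, and hence the recovered means, are essentially unchanged by sanitization.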
Results on utility… an overview

Distributional / worst-case | Objective                                        | Assumptions            | Result
Worst-case                  | Find k clusters minimizing the largest diameter  | –                      | Diameter increases by a factor of 3
Distributional              | Find k maximum-likelihood clusters               | Mixture of k Gaussians | Correct clustering with high probability, as long as the means are pairwise sufficiently far apart
What about the real world?
- Lessons from the abstract model
  - High dimensionality is our friend
  - Gaussian perturbations seem to be the right thing to do
  - Need to scale different attributes appropriately, so that the data is well rounded (see the sketch below)
- Moving towards real data
  - Outliers
    - Our notion of c-isolation deals with them
    - The existence of an outlier may still be disclosed
  - Discrete attributes
    - Convert them into real-valued attributes
    - e.g., convert a binary variable into a probability
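The slide does not say how to do the scaling; one standard choice, shown here as my own illustration, is to standardize each attribute so that no coordinate dominates the L2 distance (the binary-to-probability conversion would be a separate, model-dependent preprocessing step):

```python
import numpy as np

def scale_attributes(X):
    """Standardize each column to mean 0 and variance 1, so every
    attribute contributes comparably to L2 distances ('well rounded')."""
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    sigma[sigma == 0] = 1.0      # leave constant attributes unscaled
    return (X - mu) / sigma
```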