Data Publishing against Realistic Adversaries
Ashwin Machanavajjhala
Yahoo! Research
Santa Clara, CA
Johannes Gehrke
Cornell University
Ithaca, NY
Michaela Götz
Cornell University
Ithaca, NY
Amedeo D’Ascanio, University Of Bologna, Italy
Data Publishing against Realistic Adversaries

Outline
- Introduction
- ε-privacy
- Adversary knowledge
- Adversary classes
- Apply ε-privacy to generalization
- Experimental evaluation
- Conclusion

Introduction

Many reasons to publish data. Requirements:
- Preserve aggregate information about the population
- Preserve the privacy of sensitive information

Privacy:
- How much information can an adversary deduce from the released data?

Example
- Alice knows that Rachel is 35 and that she lives in 13058
- Alice knows that Rachel is 20 and that she has a very low probability of heart disease


Previous Definitions

l-diversity
  - The adversary knows l-2 pieces of information about the sensitive attribute
  - All pieces of information are equally likely
t-closeness
  - Alice knows the distribution of sensitive values
  - Rachel's chances of having a disease follow the same odds
Differential privacy
  - Alice knows the exact disease of every patient except Rachel

"It's flu season, a lot of elderly people will be in the hospital with flu symptoms"
  - How do we model such background knowledge with l-diversity or t-closeness?
  - Does Alice know everything about 1 billion patients?

Unrealistic assumptions!

ε-privacy
- A flexible language to define information about each individual
- Privacy as the difference in the adversary's belief between the published table with and without the "victim"
- Different classes of adversaries (either realistic or unrealistic), modeled based on their knowledge


Modeling sensitive information

Positive disclosure
  - Alice knows that Rachel has the flu
Negative disclosure
  - Alice knows that Rachel does not have the flu
Sensitive information is modeled using positive disclosures on a set of sensitive predicates $\Phi$
  - $\text{Rachel}[\text{Disease}] = \text{Flu}$
  - $\text{Rachel}[\text{Disease}] \in \{\text{Ulcer}, \text{Dyspepsia}\}$

Modeling sensitive information: example

Each $\phi(u) \in \Phi$ takes the form $t_u[S] \in S_\phi$, $S_\phi \subseteq \mathrm{dom}(S)$, where $\mathrm{dom}(S)$ is the domain of the sensitive attribute
  - $\phi_1(\text{Rachel})$: $t_{\text{Rachel}}[S] \in \{\text{Flu}\}$ (False: negative disclosure)
  - $\phi_2(\text{Rachel})$: $t_{\text{Rachel}}[S] \in \{\text{Cancer}\}$
  - $\phi_3(\text{Rachel})$: $t_{\text{Rachel}}[S] \in \{\text{Ulcer}, \text{Dyspepsia}\}$ (True: positive disclosure)
Rachel can protect against any kind of disclosure for flu, cancer and any stomach disease if $\Phi = \{\phi_1, \phi_2, \phi_3\}$, with one predicate for each subset $S_\phi$

Adversary Knowledge

Knowledge from other sources
  - Usually modeled as the joint distribution $P$ over the $N$ and $S$ attributes:
    $P = \vec{p} = (\ldots, p_i, \ldots),\ i \in N \times S$, such that $\sum_{i \in N \times S} p_i = 1$
  - If the adversary has no preference for any value of $i$:
    $\forall i \in N \times S,\ p_i = \frac{1}{|N \times S|}$

Adversary Knowledge

Two problems:
- Where does the adversary learn their knowledge?
  - If 10% of the population has cancer ($s_i = s/10$), then for each $i$, $p_i = s_i/s = 0.1$
  - What if $T_{pub}$ has only 10 entries?
- Can the adversary change their prior?
  - The adversary believes that the probability that a woman has cancer is $p_i = 0.5$, based on a sample of 100 women
  - The adversary then reads another table with 20k tuples where $s_i$ is 2k (so that $p_i = 0.1$)
  - If the prior is not strong, $p_i$ will change accordingly

Adversary Knowledge

To model adversaries we assume that
  - The adversary knows more than one prior
  - The tuples are not independent of each other
Exchangeability: a sequence of random variables $X_1, X_2, \ldots, X_n$ is exchangeable if every finite permutation of these random variables has the same joint probability distribution
  - If H is healthy and S is sick, the probability of seeing the sequence SSHSH is the same as the probability of HHSSS
According to de Finetti's representation theorem, an exchangeable sequence of random variables is mathematically equivalent to
  - Choosing a data-generating distribution θ at random
  - Creating the data by independently sampling from this chosen distribution θ

Adversary Knowledge Example

Assume two populations of equal size, Ω1 with only healthy people and Ω2 with only sick people. Table T is drawn entirely from Ω1 or entirely from Ω2.
  - If the adversary doesn't know which population has been chosen: $\Pr[t = H] = 0.5$
  - If the adversary knows that just one tuple $t'$ is healthy, then $\Pr[t = H \mid t' = H] = 1$ for every other tuple $t$
  - If the tuples were independent of each other, we would still have $\Pr[t = H] = 0.5$
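
The two-population example can be checked with a few lines of exact arithmetic. This is a minimal sketch, assuming the adversary's uncertainty is a 50/50 mixture over the two populations; the names `PRIOR` and `prob_healthy` are made up for the illustration and do not come from the slides.

```python
from fractions import Fraction

# Hypothetical mixture: theta is the chance that a tuple is healthy (H).
# Omega_1 (theta = 1) and Omega_2 (theta = 0) are equally likely a priori.
PRIOR = {Fraction(1): Fraction(1, 2), Fraction(0): Fraction(1, 2)}


def prob_healthy(prior, observed_healthy=0):
    """Pr[t = H] after seeing `observed_healthy` other healthy tuples of the same table."""
    # Bayes' rule: reweight each population by the likelihood of the observations.
    weights = {theta: w * theta ** observed_healthy for theta, w in prior.items()}
    total = sum(weights.values())
    # Predictive probability: average theta under the updated weights.
    return sum(theta * w for theta, w in weights.items()) / total


print(prob_healthy(PRIOR))                      # 1/2
print(prob_healthy(PRIOR, observed_healthy=1))  # 1
```

A single healthy tuple rules out Ω2, so the belief jumps from 1/2 to 1. Under an independence assumption the observation would carry no information about the other tuples.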

Dirichlet Distribution

More generally, T (of size n) is generated in two steps:
  - A probability vector $\vec{p}$ is drawn from a distribution D
  - Then n elements are drawn i.i.d. from the probability vector $\vec{p}$
D encodes the adversary's knowledge
  - If the adversary has no prior, every $\vec{p}$ is equally likely to be drawn from D
  - If an adversary knows that 999 people out of 1k have cancer, D should be modeled so that it draws $p_{no}(\text{cancer}) = 0.001$ and $p_{yes}(\text{cancer}) = 0.999$
A Dirichlet distribution is used to model the prior over $\vec{p}$
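
The two-step generation can be sketched with the Python standard library. This is an illustration under stated assumptions: D is taken to be a Dirichlet distribution sampled through normalized Gamma draws, and the function names, the seed and the parameters `[1, 999]` are chosen here only to mimic the 999-out-of-1k example.

```python
import random


def sample_dirichlet(sigmas, rng):
    """Draw a probability vector p from D(sigma_1, ..., sigma_k) via normalized Gamma draws."""
    draws = [rng.gammavariate(s, 1.0) for s in sigmas]
    total = sum(draws)
    return [d / total for d in draws]


def generate_table(sigmas, values, n, rng):
    """Step 1: draw p from D. Step 2: draw n sensitive values i.i.d. from p."""
    p = sample_dirichlet(sigmas, rng)
    table = rng.choices(values, weights=p, k=n)
    return p, table


rng = random.Random(0)  # fixed seed so the sketch is repeatable
p, table = generate_table([1, 999], ["no cancer", "cancer"], n=10, rng=rng)
print(p, table)
```

With parameters this lopsided, almost every drawn vector puts nearly all of its mass on "cancer", which is the behaviour the slide asks of D.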
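
Dirichlet Distribution

$D(\vec{p}; \sigma_1, \ldots, \sigma_k) = \frac{\Gamma(\sigma)}{\prod_i \Gamma(\sigma_i)} \prod_i p_i^{\sigma_i - 1}$

where $\sigma = \sum_i \sigma_i$ and $\Gamma(t) = \int_0^\infty x^{t-1} e^{-x}\,dx$
  - $\sigma$ is the stubbornness and $\sigma_i / \sigma$ is the shape
  - The density expresses the belief that the probabilities of k rival events are $p_i$, given that each event has been observed $\sigma_i - 1$ times
  - Adversary without knowledge: $D(\sigma_1, \ldots, \sigma_k) = D(1, \ldots, 1)$
  - After reading a dataset with counts $(\sigma_1 - 1, \ldots, \sigma_k - 1)$, the adversary may update the prior to $D(\sigma_1, \ldots, \sigma_k)$
  - In this case not all $\vec{p}$ are equally likely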
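
The update rule amounts to adding observed counts to the Dirichlet parameters. A minimal sketch follows, reusing the numbers from the earlier cancer example and assuming that the sample of 100 women split 50/50 (the slides only give $p_i = 0.5$); the function names are illustrative.

```python
def update_prior(sigmas, counts):
    """Add observed counts to the Dirichlet parameters (conjugate update)."""
    return [s + c for s, c in zip(sigmas, counts)]


def expected_probabilities(sigmas):
    """Mean of D(sigma_1, ..., sigma_k): sigma_i / sigma."""
    sigma = sum(sigmas)
    return [s / sigma for s in sigmas]


no_knowledge = [1, 1]                           # D(1, 1) over (cancer, no cancer)
weak = update_prior(no_knowledge, [50, 50])     # after a sample of 100 women (assumed 50/50)
revised = update_prior(weak, [2000, 18000])     # after a table with 20k tuples, 2k with cancer

print(expected_probabilities(weak))     # [0.5, 0.5]
print(expected_probabilities(revised))  # cancer share moves close to 0.1
```

Because the first prior rests on only 100 observations, the 20k-tuple table dominates it. A prior with a much larger σ would move far less.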

Dirichlet Distribution

- The vector with the maximum likelihood is $p^*$, with $p^*_i = \sigma_i / \sigma$
- As we increase $\sigma$, $p^*$ becomes more likely
- If $\sigma \to \infty$, $p^*$ is the only possible probability distribution
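
One way to see the concentration effect numerically is through the variance of a single component. The sketch below uses two standard facts about the Dirichlet distribution: its mean is $\sigma_i / \sigma$, and the variance of $p_i$ is $m(1 - m)/(\sigma + 1)$ with $m = \sigma_i / \sigma$. The shape (0.4, 0.6) and the function name are chosen here only for illustration.

```python
def component_variance(sigmas, i):
    """Variance of p_i under D(sigma_1, ..., sigma_k): m * (1 - m) / (sigma + 1)."""
    sigma = sum(sigmas)
    m = sigmas[i] / sigma  # mean of component i, i.e. the shape sigma_i / sigma
    return m * (1 - m) / (sigma + 1)


shape = (0.4, 0.6)  # fixed shape sigma_i / sigma
for sigma in (10, 1_000, 100_000):
    sigmas = [sigma * s for s in shape]
    print(sigma, component_variance(sigmas, 0))
```

The variance shrinks roughly as 1/σ, so for very large σ the drawn vectors are all close to the shape itself.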


Other Adversary Knowledge

Knowledge from individuals inside the published table
  - Full knowledge about a subset B of tuples in T

Definition

After $T_{pub}$ is published, the adversary's belief in a sensitive predicate $\phi(u)$ about an individual $u$ in $T$ is $p_{in}$

If the individual $u$ is removed from $T$, the belief becomes $p_{out}$

Definition

$p_{in}$ should not be much greater than $p_{out}$
  - The greater it is, the more information about an individual's sensitive predicate the adversary learns
A table does not respect ε-privacy if $\frac{p_{in}}{p_{out}} > \epsilon$ or $\frac{1 - p_{out}}{1 - p_{in}} > \epsilon$

Adversary Classes

Defined based on their prior built over the distribution of sensitive values
  - Class I: fixed $\sigma$ and shape $\sigma_i / \sigma$
  - Class II: fixed $\sigma$, arbitrary $D(\vec{\sigma})$ such that $\sum_{s \in S} \sigma(s) = \sigma$
  - Class III: fixed shape $\sigma_i / \sigma$ and arbitrarily large $\sigma$
  - Class IV: arbitrary $\sigma$ and $\sigma_i / \sigma$

Adversary classes - Examples

Suppose we have another dataset with 30000 tuples: 12000 with flu and 18000 with cancer
  - Class I: σ = 30k, D(12k, 18k)
  - Class II: σ = 30k, arbitrary shape
  - Class III: arbitrary σ, distribution (.4, .6)
  - Class IV: arbitrary prior
Rachel is in the table. $p_{in}(\text{flu}) = .9$ for all adversaries (it depends only on the published table)
$p_{out}(\text{flu})$ changes for each adversary

Adversary classes - Examples

- Class I: $p_{out}(\text{flu}) = (18k + 12k)/(20k + 30k) = .6$
- Class II: $p_{out}(\text{flu}) = (18k + 1)/(20k + 30k) = .36002$
- Class III: $p_{out}(\text{flu}) = .4$
- Class IV: every value
So Rachel is granted 4-, 6.4- and 6-privacy against class I, II and III adversaries respectively, and no privacy against a class IV adversary
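
The ε values can be reproduced with exact fractions. This sketch assumes that the smallest ε a table achieves is the larger of the two ratios $p_{in}/p_{out}$ and $(1 - p_{out})/(1 - p_{in})$; that reading is an assumption taken from the ε-privacy definition, not something spelled out on this slide. The counts come from the example, and the function and variable names are illustrative.

```python
from fractions import Fraction


def smallest_epsilon(p_in, p_out):
    """Smallest eps with p_in/p_out <= eps and (1 - p_out)/(1 - p_in) <= eps (assumed definition)."""
    return max(p_in / p_out, (1 - p_out) / (1 - p_in))


table_flu, table_size = 18_000, 20_000   # published table, so that p_in(flu) = .9
prior_flu, prior_size = 12_000, 30_000   # the other dataset: 12k flu out of 30k

p_in = Fraction(table_flu, table_size)
p_out = {
    "I": Fraction(table_flu + prior_flu, table_size + prior_size),
    "II": Fraction(table_flu + 1, table_size + prior_size),
    "III": Fraction(prior_flu, prior_size),
}

for cls, value in p_out.items():
    print(cls, float(value), float(smallest_epsilon(p_in, value)))
```

Under that assumption the three classes come out at 4, about 6.4 and 6. In each case the binding ratio is $(1 - p_{out})/(1 - p_{in})$.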

Generalization and ε-privacy

We can define a set of constraints that have to be checked during the generalization process

The set of sensitive predicates for each individual $u$ is $\Phi(u) = \{\{s\} \mid s \in S\}$

Check for Class I

R1 and R2 have to be respected

Combination of
  - Anonymity
  - Closeness

Check for Class II

R1 and R2 have to be respected

Combination of
  - Anonymity
  - Diversity

Check for Class III

R1 and R2 have to be respected
  - Only closeness
ε-privacy doesn't guarantee privacy against a class IV adversary

Monotonicity

Let T1 and T2 be generalizations of T such that T2 generalizes T1: if T1 satisfies ε-privacy, then T2 also satisfies ε-privacy
  - Useful for algorithms such as Incognito, Mondrian and the PET algorithm
  - All checks shown before have a time complexity of O(N)
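
Monotonicity is what lets a search skip tables. The sketch below is not Incognito, Mondrian or PET. It only shows the idea on a single chain of tables ordered from least to most generalized, where a binary search finds the first table that passes. The names `least_general_private` and `is_private` are hypothetical, and the toy check on group size stands in for the real ε-privacy checks.

```python
def least_general_private(tables, is_private):
    """Binary search over a chain of tables ordered from least to most generalized.

    Relies on monotonicity: once a table passes `is_private`, every more
    generalized table later in the chain passes too.
    Returns the index of the first passing table, or None if none passes.
    """
    lo, hi = 0, len(tables)
    while lo < hi:
        mid = (lo + hi) // 2
        if is_private(tables[mid]):
            hi = mid          # a passing table: look for a less generalized one
        else:
            lo = mid + 1      # a failing table: everything before it fails too
    return lo if lo < len(tables) else None


# Toy stand-in: each "table" is represented by its smallest group size, and the
# hypothetical check passes when every group holds at least 5 tuples.
chain = [1, 2, 4, 8, 16]
print(least_general_private(chain, lambda min_group: min_group >= 5))  # 3
```

With an O(N) check per table, a chain of m tables needs about log2(m) checks instead of m.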

Choosing Parameters

The choice is application dependent. Example: US Census
  - Stubbornness: number of individuals
  - Shape: distribution of sensitive values
  - Epsilon: between 10 and 100

WHY?

Experimental results

- Data from the Minnesota Population Center with nearly 3M tuples
- The greater the stubbornness, the greater the ε needed to achieve privacy
- With small values of σ the cost function is better
- The average group size increases with σ

Embedding prior work

ε-privacy can cover some instantiations of
  - Recursive (c,2)-diversity
  - Differential privacy
  - t-closeness

Conclusions

- Definition of ε-privacy
- Definition of realistic adversaries
- How to cover scenarios not taken into account in previous work
- ε-privacy in the generalization process

Future work:
- Considering correlation between sensitive and non-sensitive values
- Applying ε-privacy to other algorithms