Data Publishing against Realistic Adversaries
Ashwin Machanavajjhala
Yahoo! Research
Santa Clara, CA
Johannes Gehrke
Cornell University
Ithaca, NY
Michaela Götz
Cornell University
Ithaca, NY
Amedeo D’Ascanio, University Of Bologna, Italy
Outline
Introduction
ε-privacy
Adversary knowledge
Adversary classes
Applying ε-privacy to generalization
Experimental evaluation
Conclusion
Introduction
Many reasons to publish data
Requirements:
Preserve aggregate information about the population
Preserve privacy of sensitive information
Privacy: how much information can an adversary deduce from released data?
Example
Alice knows that Rachel is 35 and that she lives in 13058
Alice knows that Rachel is 20 and that she has a very low probability of heart disease
Previous Definitions
ℓ-diversity
The adversary knows ℓ-2 pieces of information about the sensitive attribute
All pieces of information are equally likely
t-closeness
Alice knows the distribution of sensitive values
Rachel’s chances of having a disease follow the same odds
Differential privacy
Alice knows the exact disease of every patient except Rachel
“It’s flu season, a lot of elderly people will be in the hospital with flu symptoms”
How do we model such background knowledge with ℓ-diversity or t-closeness?
Does Alice know everything about 1 billion patients?
Unrealistic assumptions!
ε-privacy
Flexible language to define information about each individual
Privacy as the difference in the adversary’s belief between the published table with and without the “victim”
Different classes of adversaries (either realistic or unrealistic), modeled based on their knowledge
Modeling sensitive information
Positive disclosure
Alice knows that Rachel has flu
Negative disclosure
Alice knows that Rachel does not have flu
Sensitive information is modeled using positive disclosure on a set of sensitive predicates $\Phi$
$Rachel[Disease] = Flu$
$Rachel[Disease] \in \{Ulcer, Dyspepsia\}$
Modeling sensitive information
Example
Each $\phi(u)$ takes the form $t_u.S \in S_\phi$, $S_\phi \subseteq dom(S)$,
where dom(S) is the domain of the sensitive attribute
$\phi_1(Rachel)$: $t_{Rachel}[S] \in \{Flu\}$
$\phi_2(Rachel)$: $t_{Rachel}[S] \in \{Cancer\}$
$\phi_3(Rachel)$: $t_{Rachel}[S] \in \{Ulcer, Dyspepsia\}$
Learning that a predicate is false is a negative disclosure; learning that it is true is a positive disclosure
Rachel can protect against any kind of disclosure for flu, cancer and any stomach disease if $\Phi = \{\phi_1, \phi_2, \phi_3\}$, with one predicate for each subset $S_\phi$
Adversary Knowledge
Knowledge from other sources
Usually modeled as the joint distribution P over the N and S attributes:
$\vec{p} = (\ldots, p_i, \ldots),\ i \in N \times S$ such that $\sum_{i \in N \times S} p_i = 1$
If the adversary has no preference for any value of i:
$\forall i \in N \times S,\ p_i = \frac{1}{|N \times S|}$
Adversary Knowledge
Two problems
Where does the adversary learn their knowledge?
If 10% of the population has cancer ($s_i = s/10$)
For each i, $p_i = s_i/s = 0.1$
What if Tpub has only 10 entries?
Can the adversary change her prior?
The probability that a woman has cancer is $p_i = 0.5$, based on a sample of 100 women
The adversary reads another table with 20k tuples where $s_i$ is 2k (so that $p_i = 0.1$)
If her prior is not strong, $p_i$ will change accordingly
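The prior-update question can be made concrete with a short calculation. The sketch below assumes that the adversary simply pools the counts of her earlier sample with those of the newly read table; the pooling rule and the name `pooled_estimate` are illustrative assumptions, not taken from the slides.

```python
from fractions import Fraction

def pooled_estimate(prior_pos, prior_n, new_pos, new_n):
    """Estimate p_i after pooling a prior sample with a newly read table."""
    return Fraction(prior_pos + new_pos, prior_n + new_n)

# Weak prior: 50 of 100 women have cancer (p_i = 0.5).
# New table: 2,000 of 20,000 tuples have cancer (p_i = 0.1).
weak = pooled_estimate(50, 100, 2_000, 20_000)

# Hypothetical stubborn prior backed by 1,000,000 observations.
strong = pooled_estimate(500_000, 1_000_000, 2_000, 20_000)

print(float(weak), float(strong))
```

Under this assumption the weak prior moves almost all the way to 0.1, while the stubborn prior stays close to 0.5.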
Adversary Knowledge
To model adversaries we assume that
The adversary may consider more than one prior
The tuples are not independent of each other
Exchangeability: a sequence of random variables X1, X2, ..., Xn is exchangeable if every finite permutation of these random variables has the same joint probability distribution
If H is healthy and S is sick, the probability of seeing the sequence SSHSH is the same as the probability of HHSSS
According to de Finetti’s representation theorem, an exchangeable sequence of random variables is mathematically equivalent to
Choosing a data-generating distribution θ at random
Creating the data by independently sampling from this chosen distribution θ
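As a small illustration of the two-step view, the sketch below computes exact sequence probabilities under a mixture of two candidate values of θ. The two values of θ and their weights are hypothetical, chosen only to make the arithmetic visible.

```python
from fractions import Fraction

# Hypothetical prior over theta = Pr[Sick]: two equally likely values.
THETAS = {Fraction(1, 4): Fraction(1, 2), Fraction(3, 4): Fraction(1, 2)}

def sequence_probability(seq):
    """Pr[seq] when theta is drawn first and tuples are then sampled i.i.d."""
    total = Fraction(0)
    for theta, weight in THETAS.items():
        p = Fraction(1)
        for x in seq:
            p *= theta if x == "S" else 1 - theta
        total += weight * p
    return total

# Exchangeable: only the counts of S and H matter, not their order.
print(sequence_probability("SSHSH") == sequence_probability("HHSSS"))  # True

# Not independent: Pr[SS] differs from Pr[S] * Pr[S].
print(sequence_probability("SS"), sequence_probability("S") ** 2)
```

The second line of output shows why exchangeable tuples can still be informative about each other: seeing one sick tuple raises the probability that the next one is sick.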
Adversary Knowledge Example
Assume two populations of equal size: Ω1 with only healthy people and Ω2 with only sick people. Table T is drawn entirely from either Ω1 or Ω2
If the adversary doesn’t know which population has been chosen: $\Pr[t = H] = 0.5$
If the adversary knows that just one other tuple $t'$ in T is healthy, then: $\Pr[t = H \mid t' = H] = 1$
What if tuples are assumed independent of each other? Then still $\Pr[t = H \mid t' = H] = \Pr[t = H] = 0.5$
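The two-population example can be checked with a direct Bayesian computation. This is a minimal sketch; the dictionary names and the function `pr_healthy` are illustrative, and the uniform prior over the two populations reflects the "equal size" assumption above.

```python
from fractions import Fraction

# Pr[healthy] inside each population, and a uniform prior over the populations.
HEALTHY_RATE = {"omega1": Fraction(1), "omega2": Fraction(0)}
PRIOR = {"omega1": Fraction(1, 2), "omega2": Fraction(1, 2)}

def pr_healthy(observed_healthy=0):
    """Pr[t = H] after seeing `observed_healthy` other healthy tuples from T."""
    # Posterior weight of each population given the observed tuples.
    weights = {k: PRIOR[k] * HEALTHY_RATE[k] ** observed_healthy for k in PRIOR}
    norm = sum(weights.values())
    return sum(weights[k] / norm * HEALTHY_RATE[k] for k in PRIOR)

print(pr_healthy(0))  # 1/2
print(pr_healthy(1))  # 1
```

A single healthy tuple rules out Ω2 entirely, which an adversary who treats tuples as independent would never conclude.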
Dirichlet Distribution
More generally, T (of size n) is generated in two steps:
A probability vector $\vec{p}$ is drawn from a distribution D
Then n elements are drawn i.i.d. from the probability vector $\vec{p}$
D encodes the adversary’s knowledge
If the adversary has no prior, all $\vec{p}$ are equally likely to be drawn from D
If an adversary knows that 999 people out of 1,000 have cancer, she should model D so that it draws $p_{no}(cancer) = 0.001$ and $p_{yes}(cancer) = 0.999$
The Dirichlet distribution is used to model the prior over $\vec{p}$
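The two-step generation can be sketched with the Python standard library alone. The sketch below draws a Dirichlet sample by normalizing Gamma draws, a standard construction; the function names and the parameters (999, 1) are illustrative assumptions.

```python
import random

def sample_dirichlet(sigmas, rng):
    """Draw a probability vector p from D(sigma_1, ..., sigma_k)."""
    draws = [rng.gammavariate(s, 1.0) for s in sigmas]
    total = sum(draws)
    return [d / total for d in draws]

def generate_table(n, sigmas, values, rng):
    """Step 1: draw p from D. Step 2: draw n sensitive values i.i.d. from p."""
    p = sample_dirichlet(sigmas, rng)
    return p, rng.choices(values, weights=p, k=n)

rng = random.Random(0)
# Illustrative strong prior: almost all mass on "cancer".
p, table = generate_table(10, [999, 1], ["cancer", "no cancer"], rng)
print(p, table)
```

With parameters (1, 1) instead, every probability vector would be equally likely, which corresponds to the adversary with no prior.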
Dirichlet Distribution
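$D(\vec{p}; \sigma_1, \ldots, \sigma_k) = \frac{\Gamma(\sigma)}{\prod_i \Gamma(\sigma_i)} \prod_i p_i^{\sigma_i - 1}$
where $\sigma = \sum_i \sigma_i$ and $\Gamma(t) = \int_0^\infty x^{t-1} e^{-x}\,dx$
$\sigma$ is the stubbornness and $\sigma_i / \sigma$ is the shape
The density expresses the belief that the probabilities of k rival events are $p_i$, given that each event has been observed $\sigma_i - 1$ times
Adversary without knowledge: $D(\sigma_1, \ldots, \sigma_k) = D(1, \ldots, 1)$
After reading a dataset with counts $(\sigma_1 - 1, \ldots, \sigma_k - 1)$, the adversary may update her prior to $D(\sigma_1, \ldots, \sigma_k)$
In this case not all $\vec{p}$ are equally likely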
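The update from an uninformed prior can be written as simple count addition. This is a minimal sketch; the helper names `update_prior`, `stubbornness` and `shape` are hypothetical, and the counts are chosen only for illustration.

```python
def update_prior(sigmas, counts):
    """Add observed counts to the Dirichlet parameters."""
    return [s + c for s, c in zip(sigmas, counts)]

def stubbornness(sigmas):
    """sigma = sum of all sigma_i."""
    return sum(sigmas)

def shape(sigmas):
    """Vector of sigma_i / sigma."""
    total = stubbornness(sigmas)
    return [s / total for s in sigmas]

uninformed = [1, 1]                       # D(1, 1): all p equally likely
posterior = update_prior(uninformed, [12_000, 18_000])
print(posterior, stubbornness(posterior), shape(posterior))
```

The more tuples the adversary has read, the larger the stubbornness, and the less a newly published table can shift the prior.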
Dirichlet Distribution
The vector with the maximum likelihood is $\vec{p}^*$ with $p^*_i = \sigma_i / \sigma$
As we increase $\sigma$, $\vec{p}^*$ becomes more likely
If $\sigma \to \infty$, $\vec{p}^*$ is the only possible probability distribution
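One way to see the concentration effect is through the marginal variance of a Dirichlet component, $\mathrm{Var}(p_i) = \frac{\sigma_i (\sigma - \sigma_i)}{\sigma^2 (\sigma + 1)}$, which is a standard property of the distribution. The sketch below evaluates it for a fixed shape and growing stubbornness; the shape (0.4, 0.6) and the values of σ are illustrative.

```python
def marginal_variance(sigmas, i):
    """Variance of component p_i under D(sigma_1, ..., sigma_k)."""
    sigma = sum(sigmas)
    return sigmas[i] * (sigma - sigmas[i]) / (sigma ** 2 * (sigma + 1))

shape = (0.4, 0.6)
for sigma in (10, 1_000, 100_000):
    params = [sigma * w for w in shape]
    # The variance shrinks roughly like 1 / sigma for a fixed shape.
    print(sigma, marginal_variance(params, 0))
```

As σ grows the variance tends to zero, so the prior collapses onto the vector given by the shape.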
Other Adversary Knowledge
Knowledge from individuals inside the published table
Full knowledge about a subset B of tuples in T
Definition
After Tpub is published, the adversary’s belief in a sensitive predicate $\phi(u)$ about an individual u in T is $p_{in}$
If the individual u is removed from T, the belief becomes $p_{out}$
Definition
$p_{in}$ should not be much greater than $p_{out}$
The greater the gap, the more information about an individual’s sensitive predicate the adversary learns
A table does not respect ε-privacy if
$\frac{p_{in}}{p_{out}} > \epsilon$ or $\frac{1 - p_{out}}{1 - p_{in}} > \epsilon$
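A privacy check of this kind can be sketched as a pair of ratio tests. The sketch assumes the two-ratio form of the condition, that is, a table is ε-private when both $p_{in}/p_{out} \le \epsilon$ and $(1 - p_{out})/(1 - p_{in}) \le \epsilon$; the function names and the handling of zero denominators are illustrative assumptions.

```python
import math

def required_epsilon(p_in, p_out):
    """Smallest epsilon for which both ratio conditions hold."""
    if p_out <= 0 or p_in >= 1:
        # Assumption: a ratio with a zero denominator counts as unbounded loss.
        return math.inf
    return max(p_in / p_out, (1 - p_out) / (1 - p_in))

def is_epsilon_private(p_in, p_out, epsilon):
    """True when neither ratio exceeds epsilon."""
    return required_epsilon(p_in, p_out) <= epsilon

print(is_epsilon_private(0.9, 0.6, epsilon=5))  # True
print(is_epsilon_private(0.9, 0.6, epsilon=3))  # False
```

The second ratio matters when $p_{in}$ is close to 1: even a modest change in belief then removes most of the adversary's remaining doubt.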
Adversary Classes
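Defined based on their prior built over the distribution of sensitive values
Class I: fixed $\sigma$ and fixed shape $\sigma_i / \sigma$
Class II: fixed $\sigma$, arbitrary $D(\vec{\sigma})$ such that $\sum_{s \in S} \sigma(s) = \sigma$
Class III: fixed shape $\sigma_i / \sigma$ and arbitrarily large $\sigma$
Class IV: arbitrary $\sigma$ and arbitrary shape $\sigma_i / \sigma$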
Adversary classes - Examples
Suppose we have another dataset with 30,000 tuples: 12,000 with flu and 18,000 with cancer
Class I: σ = 30k, D(12k, 18k)
Class II: σ = 30k, arbitrary shape
Class III: arbitrary σ, shape (.4, .6)
Class IV: arbitrary prior
Rachel is in the table. $p_{in}(flu) = .9$ for all adversaries (it depends only on the published table)
$p_{out}(flu)$ changes for each adversary
Adversary classes - Examples
Class I: $p_{out}(flu)$ = (18k + 12k)/(20k + 30k) = .6
Class II: $p_{out}(flu)$ = (18k + 1)/(20k + 30k) = .36002
Class III: $p_{out}(flu)$ = .4
Class IV: $p_{out}(flu)$ can take any value
So Rachel is granted ε-privacy with ε = 4, 6.4, and 6 against class I, II, and III adversaries respectively, and no privacy against a class IV adversary
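The per-class values can be reproduced with exact arithmetic. The sketch below assumes that the privacy level is the larger of $p_{in}/p_{out}$ and $(1 - p_{out})/(1 - p_{in})$, and that the published table has 20k tuples of which 18k have flu, as implied by $p_{in}(flu) = .9$ and the denominators in this example; the variable names are illustrative.

```python
from fractions import Fraction

FLU_PUB, N_PUB = 18_000, 20_000   # published table implied by p_in(flu) = 0.9
SIGMA = 30_000                    # stubbornness taken from the external dataset
P_IN = Fraction(FLU_PUB, N_PUB)

def epsilon_needed(p_out):
    """Smallest epsilon that covers both ratio conditions."""
    return max(P_IN / p_out, (1 - p_out) / (1 - P_IN))

p_out = {
    "I": Fraction(FLU_PUB + 12_000, N_PUB + SIGMA),  # prior D(12k, 18k)
    "II": Fraction(FLU_PUB + 1, N_PUB + SIGMA),      # least favourable shape
    "III": Fraction(2, 5),                           # sigma -> infinity, shape (.4, .6)
}

for cls, p in p_out.items():
    print(cls, float(p), float(epsilon_needed(p)))
```

For class II the exact value is 6.3998, which rounds to 6.4.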
Generalization and ε-privacy
We can define a set of constraints that have to be checked during the generalization process
The set of sensitive predicates for each individual u is
$\Phi(u) = \{\{s\} \mid s \in S\}$
Check for Class I
R1 and R2 have to be respected
Combination of anonymity and closeness
Check for Class II
R1 and R2 have to be respected
Combination of anonymity and diversity
Check for Class III
R1 and R2 have to be respected
Only closeness
ε-privacy doesn’t guarantee privacy against class IV adversaries
Monotonicity
Let T1 and T2 be generalizations of T such that T2 is at least as general as T1
If T1 satisfies ε-privacy, then T2 also satisfies ε-privacy
Useful for algorithms such as Incognito, Mondrian, and the PET algorithm
All checks shown before can be performed in O(N) time
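Monotonicity is what lets a search skip candidates. The sketch below is a simplified illustration, not the procedure used by Incognito or Mondrian: it assumes a single chain of generalizations ordered from least to most general and a caller-supplied privacy check, both hypothetical, and uses binary search to find the least general level that passes.

```python
def least_general_private_level(levels, is_private):
    """Binary search over a chain ordered from least to most general.

    Relies on monotonicity: once a level is private, every more general
    level is private too.
    """
    lo, hi = 0, len(levels)
    while lo < hi:
        mid = (lo + hi) // 2
        if is_private(levels[mid]):
            hi = mid          # a private level: try less general ones
        else:
            lo = mid + 1      # not private: need more generalization
    return levels[lo] if lo < len(levels) else None

# Hypothetical chain described by minimum group size, with a toy check.
chain = [1, 5, 20, 100, 500]
print(least_general_private_level(chain, lambda size: size >= 50))  # 100
```

Without monotonicity every level would have to be checked, since a failure at one level would say nothing about the others.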
Choosing Parameters
The choice is application dependent, e.g. the US Census
Stubbornness: number of individuals
Shape: distribution of sensitive values
ε: between 10 and 100
Why?
Experimental results
Data from Minnesota Population Center with nearly 3M tuples
The greater the stubbornness, the greater the ε we need to achieve privacy
With small values of σ the cost function is better
The average group size increases with σ
Embedding prior work
ε-privacy can cover some instantiations of
Recursive (c,2)-diversity
Differential privacy
t-closeness
Conclusions
Definition of ε-privacy
Definition of realistic adversaries
How to cover scenarios not taken into account in previous works
ε-privacy in the generalization process
Future work:
Considering correlation between sensitive and non-sensitive values
Applying ε-privacy to other algorithms