Data Publishing against Realistic Adversaries

Ashwin Machanavajjhala, Yahoo! Research, Santa Clara, CA
Johannes Gehrke, Cornell University, Ithaca, NY
Michaela Götz, Cornell University, Ithaca, NY

Amedeo D’Ascanio, University of Bologna, Italy

Outline
- Introduction
- ε-privacy
- Adversary knowledge
- Adversary classes
- Applying ε-privacy to generalization
- Experimental evaluation
- Conclusion

Introduction
- There are many reasons to publish data. Requirements:
  - Preserve aggregate information about the population
  - Preserve the privacy of sensitive information
- Privacy: how much information can an adversary deduce from the released data?

Example
- Alice knows that Rachel is 35 and that she lives in 13058
- Alice knows that Rachel is 20 and that she has a very low probability of heart disease

Previous Definitions
- l-diversity
  - The adversary knows l-2 pieces of information about the sensitive attribute
  - All pieces of information are equally likely
- t-closeness
  - Alice knows the distribution of the sensitive values
  - Rachel’s chances of having a disease follow the same odds
- Differential privacy
  - Alice knows the exact disease of every patient except Rachel
- “It’s flu season, a lot of elderly people will be in the hospital with flu symptoms”
  - How do we model such background knowledge with l-diversity or t-closeness?
- Does Alice know everything about 1 billion patients? Unrealistic assumptions!
ε-privacy
- A flexible language to define the information known about each individual
- Privacy as the difference in the adversary’s belief between the published table with and without the “victim”
- Different classes of adversaries (realistic or unrealistic), modeled by their knowledge

Modeling sensitive information
- Positive disclosure: Alice knows that Rachel has flu
- Negative disclosure: Alice knows that Rachel does not have flu
- Sensitive information is modeled as positive disclosure on a set of sensitive predicates $\Phi$, for example:
  - Rachel[Disease] = Flu
  - Rachel[Disease] $\in$ {Ulcer, Dyspepsia}

Modeling sensitive information: Example
- Each $\phi(u) \in \Phi$ takes the form $t_u.S \in S_\phi$ with $S_\phi \subseteq \mathrm{dom}(S)$, where $\mathrm{dom}(S)$ is the domain of the sensitive attribute
  - $\phi_1(\text{Rachel}): t_{\text{Rachel}}[S] \in \{\text{Flu}\}$
  - $\phi_2(\text{Rachel}): t_{\text{Rachel}}[S] \in \{\text{Cancer}\}$
  - $\phi_3(\text{Rachel}): t_{\text{Rachel}}[S] \in \{\text{Ulcer}, \text{Dyspepsia}\}$
- For each subset $S_\phi$, learning that the predicate is false is a negative disclosure and learning that it is true is a positive disclosure
- Rachel can protect herself against any kind of disclosure for flu, cancer and any stomach disease if $\Phi = \{\phi_1, \phi_2, \phi_3\}$

Adversary Knowledge
- Knowledge from other sources, usually modeled as the joint distribution P over the N and S attributes
- $P = \vec{p} = (\dots, p_i, \dots),\ i \in N \times S$, such that $\sum_{i \in N \times S} p_i = 1$
- If the adversary has no preference for any value of i: $\forall i \in N \times S,\ p_i = 1 / |N \times S|$

Adversary Knowledge
- Two problems:
  - Where does the adversary learn their knowledge?
    - If 10% of the population has cancer ($s_i = s/10$), then for each i, $p_i = s_i/s = 0.1$
    - What if $T_{pub}$ has only 10 entries?
  - Can the adversary change their prior?
    - The probability that a woman has cancer is $p_i = 0.5$, based on a sample of 100 women
    - The adversary reads another table with 20k tuples in which $s_i$ is 2k (so that $p_i = 0.1$)
    - If the adversary's prior is not strong, $p_i$ will change accordingly

Adversary Knowledge
- To model adversaries we assume that:
  - The adversary knows more than one prior
  - The tuples are not independent of each other
- Exchangeability: a sequence of random variables $X_1, X_2, \dots, X_n$ is exchangeable if every finite permutation of these random variables has the same joint probability distribution
  - If H is healthy and S is sick, the probability of seeing the sequence SSHSH is the same as the probability of seeing HHSSS
- According to de Finetti’s representation theorem, an exchangeable sequence of random variables is mathematically equivalent to:
  1. Choosing a data-generating distribution θ at random
  2. Creating the data by independently sampling from this chosen distribution θ

Adversary Knowledge: Example
- Assume two populations of equal size: Ω1 with only healthy people and Ω2 with only sick people. Table T is drawn only from Ω1 or only from Ω2
- If the adversary does not know which population has been chosen: $\Pr[t = H] = 0.5$
- If the adversary knows that just one other tuple is healthy, then T must have been drawn from Ω1, so: $\Pr[t = H] = 1$
- And if the tuples were independent of each other? Still $\Pr[t = H] = 0.5$

Dirichlet Distribution
- More generally, T (of size n) is generated in two steps:
  1. A probability vector $\vec{p}$ is drawn from a distribution D
  2. Then n elements are drawn i.i.d.
     from the probability vector $\vec{p}$
- D encodes the adversary's knowledge:
  - If the adversary has no prior, every $\vec{p}$ is equally likely to be drawn from D
  - If the adversary knows that 999 people out of 1k have cancer, D should be modeled so as to draw $p_{no}(\text{cancer}) = 0.001$ and $p_{yes}(\text{cancer}) = 0.999$
- A Dirichlet distribution is used to model the prior over $\vec{p}$

Dirichlet Distribution
- $D(\vec{p}; \vec{\sigma}) = \frac{\Gamma(\sigma)}{\prod_i \Gamma(\sigma_i)} \prod_i p_i^{\sigma_i - 1}$, where $\sigma = \sum_i \sigma_i$ and $\Gamma(t) = \int_0^\infty x^{t-1} e^{-x}\,dx$
- $\sigma$ is the stubbornness and $\vec{\sigma}/\sigma$ is the shape
- It expresses the belief that the probabilities of k rival events are $p_i$, given that each event has been observed $\sigma_i - 1$ times
- Adversary without knowledge: $D(\sigma_1, \dots, \sigma_k) = D(1, \dots, 1)$
- After reading a dataset with counts $(\sigma_1 - 1, \dots, \sigma_k - 1)$, the adversary may update the prior to $D(\sigma_1, \dots, \sigma_k)$. In this case not all $\vec{p}$ are equally likely

Dirichlet Distribution
- The vector with the maximum likelihood is $\vec{p}^*$ with $p^*_i = \sigma_i / \sigma$
- As $\sigma$ increases, $\vec{p}^*$ becomes more likely
- If $\sigma \to \infty$, $\vec{p}^*$ is the only possible probability distribution

Other Adversary Knowledge
- Knowledge from individuals inside the published table
- Full knowledge about a subset B of the tuples in T

Definition
- After $T_{pub}$ is published, the adversary's belief in a sensitive predicate $\phi(u)$ about an individual u in T is $p_{in}$
- If the individual u is removed from T, the belief becomes $p_{out}$

Definition
- $p_{in}$ should not be much greater than $p_{out}$
  - The greater it is, the more information about an individual’s sensitive predicate the adversary learns
- A table does not respect ε-privacy if [condition missing in source]

Adversary Classes
- Defined based on their prior over the distribution of sensitive values:
  - Class I: fixed $\sigma$ and shape $\sigma_i / \sigma$
  - Class II: fixed $\sigma$, arbitrary $D(\vec{\sigma})$ such that
sS  ( s)    Class III: fixed shape  i /  and arbitrarily large   Class IV: arbitrary  and  i /   Class Amedeo D’Ascanio Data Publishing against Realistic Adversaries Adversary classes - Examples      Suppose to have another dataset with 30000 tuples: 12000 with flu and 18000 cancer Class I: σ= 30k, D(12k,18k) Class II: σ= 30k arbitrary shape Class III: arbitrary σ, distribution (.4,.6) Class IV: arbitrary prior Rachel is in the table. pin(flu) = .9 for all adversaries (depends only from published table) pout(flu) changes for each adversary Amedeo D’Ascanio Data Publishing against Realistic Adversaries Adversary classes - Examples  Class I : pout(flu) = (18k+12k)/(20k+30k) =.6 Class II : pout(flu) = (18k+1)/(20k+30k)=.36002 Class III: pout(flu) = .4  Class IV = every value    So that Rachel is granted .4, 6.4, 6 and no privacy against respectively class I,II,III,IV adversary Amedeo D’Ascanio Data Publishing against Realistic Adversaries Generalization and epsilon privacy We can define a set of constraint that have to be checked during the generalization process Set of sensitive predicates for each individual u is (u)  {{s} s  S} Amedeo D’Ascanio Data Publishing against Realistic Adversaries Check for Class I  R1 and R2 has to be respected  Combination of   Anonymity closeness Amedeo D’Ascanio Data Publishing against Realistic Adversaries Check for Class II  R1 and R2 has to be respected  Combination of   anonymity diversity Amedeo D’Ascanio Data Publishing against Realistic Adversaries Check for Class III  R1 and R2 has to be respected  Only closeness doesn’t guarantee privacy against class IV adversary  epsilon-privacy Amedeo D’Ascanio Data Publishing against Realistic Adversaries Montonicity  T1 and T2 generalization of T such that T2 T1 if T1 satisfies epsilon-privacy, then T2 also satisfies epsilon-privacy  Useful for algorithms such as Incognito, Mondrian, PET algorithm  All checks shown before can has a time complexity 
  O(N)

Choosing Parameters
- The choice is application dependent. Example: US Census
  - Stubbornness: number of individuals
  - Shape: distribution of the sensitive values
  - ε: between 10 and 100. Why?

Experimental results
- Data from the Minnesota Population Center with nearly 3M tuples
- The greater the stubbornness, the greater the ε needed to achieve privacy
- With small values of σ the cost function is better
- The average group size increases with σ

Embedding prior work
- ε-privacy can cover some instantiations of:
  - Recursive (c,2)-diversity
  - Differential privacy
  - t-closeness

Conclusions
- Definition of ε-privacy
- Definition of realistic adversaries
- How to cover scenarios not taken into account in previous work
- ε-privacy in the generalization process
- Future work:
  - Considering correlation between sensitive and non-sensitive values
  - Applying ε-privacy to other algorithms
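
The arithmetic in the adversary-class example can be checked with a short Python sketch. It rests on assumptions that the slides do not state outright: Rachel's group is taken to hold 20,000 tuples, 18,000 of them with flu (read off the fractions on the slide), and the ε granted is taken to be the larger of p_in/p_out and (1 - p_out)/(1 - p_in), which is the reading that reproduces the quoted values. The function names `posterior_belief` and `epsilon_needed` are hypothetical.

```python
from fractions import Fraction


def posterior_belief(count, group_size, sigma_s, sigma):
    """p_out for one sensitive value under a Dirichlet prior.

    Mirrors the slide's fractions: (count in group + prior count)
    divided by (group size + stubbornness).
    """
    return Fraction(count + sigma_s, group_size + sigma)


def epsilon_needed(p_in, p_out):
    """Smallest epsilon bounding both ratios.

    ASSUMPTION: the two-sided ratio test is inferred from the slide's
    numbers; it is not spelled out in the transcript.
    """
    return max(p_in / p_out, (1 - p_out) / (1 - p_in))


# ASSUMPTION: group size and flu count are read off the slide's fractions.
GROUP_SIZE, FLU_IN_GROUP = 20_000, 18_000
P_IN = Fraction(FLU_IN_GROUP, GROUP_SIZE)  # 0.9, same for every adversary

p_out_by_class = {
    # Class I: fixed stubbornness 30k with prior D(12k, 18k)
    "I": posterior_belief(FLU_IN_GROUP, GROUP_SIZE, 12_000, 30_000),
    # Class II: stubbornness 30k, least favourable shape (prior flu count 1)
    "II": posterior_belief(FLU_IN_GROUP, GROUP_SIZE, 1, 30_000),
    # Class III: arbitrarily large stubbornness, so p_out equals the shape .4
    "III": Fraction(2, 5),
}
eps_by_class = {
    name: epsilon_needed(P_IN, p_out) for name, p_out in p_out_by_class.items()
}

for name, eps in eps_by_class.items():
    p_out = float(p_out_by_class[name])
    print(f"Class {name}: p_out = {p_out:.5f}, epsilon = {float(eps):.4f}")
```

Under these assumptions the sketch gives ε = 4 for Class I, 6.3998 (about 6.4) for Class II and 6 for Class III. Exact fractions are used instead of floats so that the comparison with the slide's figures is not blurred by rounding.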
