Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
To Do or Not To Do: The Dilemma of Disclosing Anonymized Data Lakshmanan L, Ng R, Ramesh G Univ. of British Columbia Oren Fine Nov. 2008 CS Seminar in Databases (236826) Once Upon a Time… • The police is after Edgar, a drug lord suspect. – Intel. has gathered calls & meetings data records as a transactional database – In order to positively frame Edgar, the police must find hard evidence, and wishes to outsource data mining tasks to “We Mind your Data Ltd.” – But, the police is subject to the law, and is obligated to keep the privacy of the people in the database – including Edgar, which is innocent until proven otherwise. – Furthermore, Edgar is seeking for the smallest hint to disappear… I have the pleasure to introduce Edgar vs. The Police VS. Motivation • The Classic Dilemma: – Keep your data close to your chest and never risk privacy or confidentiality or… – Disclose the data and gain potential valuable knowledge and benefits • In order to decide, we need to answer a major question – “Just how safe is the anonymized data?” – Safe = protecting the identities of the of the objects. Agenda • Anonymization • Model the Attacker’s Knowledge • Determine the risk to our data Anonymization or De-Identification • Transform sensitive data into generated unique content (strings, numbers) • Example TID Names TID Transaction 1 {Hussein, Hassan, Dimitri} 1 {1,2,3} 2 {Hussein, Edgar, Anglea} 2 {1,4,5} 3 {Angela, Edgar} 3 {5,4} 4 {Raz, Adi, Yishai} 4 {6,7,8} 5 {Hassan, Yishai, Dimitri, Raz} 5 {2,8,3,6} 6 {Raz, Anglea, Nithai} 6 {6, 5, 9} Anonymization or De-Identification • Advantages – Very simple – Does not affect final outcome or perturb data characteristics • We do not suggest that anonymization is the “right” way, but it is probably the most common Frequent Set Mining Crash Course • Transactional database • Each transaction has TID and a set of items • An association rule of the form XY has – Support s if s% of the transactions include (X,Y) – Confidence c if c% of the transactions that include X also include Y • Support = frequent sets • Confidence = association rules • A k-itemset is a set of k items Example TID Names 1 Angela, Ariel, Edgar, Steve, Benny 2 Edgar, Hassan, Steve, Tommy 3 Joe, Sara, Israel 4 Steve, Angela, Edgar 5 Benny, Mahhmud, Tommy 6 Angela, Sara, Edgar 7 Hassan, Angela, Joe, Edgar, Noa 8 Edgar, Benny, Steve, Tommy Example (Cont.) • First, we look for frequent sets, according to a support threshold • 2-itemsets: {Angela, Edgar}, {Edgar, Steve} have 50% support (4 out of 8 transactions). • 3-itemsets: {Angela, Edgar, Steve}, {Benny, Edgar, Steve} and {Tommy, Edgar, Steve} have only 25% support (2 out of 8 transactions) • The rule {Edgar, Steve} {Angela} has 50% confidence (2 out 4 transactions) and the rule {Tommy} {Edgar, Steve} has 66.6% confidence. Frequent Set Mining Crash Course (You’re Qualified!) • Widely used in market basket analysis, intrusion detection, Web usage mining and bioinformatics • Aimed at discovering non trivial or not necessarily intuitive relation between items/variables of large databases “Extracting wisdom out of data” • Who knows what is the most famous frequent set? Big Mart’s Database Modeling the Attacker’s Knowledge • We believe that the attacker has prior knowledge about the items in the original domain • The prior information regards the frequencies of items in the original domain • We capture the attacker’s knowledge with “Belief Functions” Examples of Belief Functions Consistent Mapping • Mapping anonymized entities to original entities only according to the belief function Ignorant Belief Function (Q) • How does the graph look like? • What is the expected number of cracks? • Suppose n items. Further suppose that we are only interested in a partial group, of size n1 • What is the expected number of cracks now? • Don’t you underestimate Edgar… Ignorant Belief Function (A) Compliant Point-Valued Belief Function (Q) • How does the graph look like? • What is the expected number of cracks? • Suppose n items. Further suppose that we are only interested in a partial group, of size n1 • What is the expected number of cracks now? • Unless he has inner source, we shouldn’t overestimate Edgar either… Compliant Point-Valued Belief Function (A) Compliant Interval Belief Functions • Direct Computation Method – Build a graph G and adjacency matrix AG – The probability of cracking k out of n items: • Computing the permanent is know to be #P-complete problem, state of the art approximation running time O(n22) !! • What the !#$!% is a permanent or #P-complete? Permanent • A permanent of an n*n matrix is perm( A) n a S n i 1 i, (i ) • The sum is over all permutations of 1,2,… • Calculating the permanent is #P-complete • Which brings us to… #P-Complete • Unlike well known complexity classes which are of decision problems, this is a class of function problems • "compute f(x)," where f is the number of accepting paths of an NP machine • Example – NP: Are there any subsets of a list of integers that add up to zero? – #P: How many subsets of a list of integers add up to zero? Chain Belief Functions Chain Belief Functions Unfortunately… • General Belief Function does not always produce a chain… • We seek for way to estimate the number of cracks. The O-estimate Heuristic • Suppose Graph G, interval belief function β. • For each x, let Ox denote the outdegree of x in G. • The probability of cracking x is simply 1 Ox • The expected number of cracks is Properties of O-estimate • Inexact (hence “estimate”) • Monotonic -Compliant Belief Function • Suppose we “somehow” know which items are guessed wrong • We sum the O-estimates only over the compliant frequency groups Risk Assessment • Worst case \ Best case – unrealistic • Determine the intervals width – Twice the median gap of all successive frequency groups – Why? • Determine the degree of compliancy – Perform binary search on , subject to a “degree of tolerance” – . End to End Example • These Intel. Calls & Meeting DR are classified “Top Secret” TID Names 1 Angela, Ariel, Edgar, Steve, Benny 2 Edgar, Hassan, Steve, Tommy 3 Joe, Sara, Israel 4 Steve, Angela, Edgar 5 Benny, Mahhmud, Tommy 6 Angela, Sara, Edgar 7 Hassan, Angela, Joe, Edgar, Noa 8 Edgar, Benny, Steve, Tommy We Anonymize the Database I J freq TID Items 1 1, 2, 3, 4, 5 2 3, 6, 4, 7 Angela 1 4/8 Ariel 2 1/8 Edgar 3 6/8 Steve 4 4/8 Benny 5 3/8 3 8, 9, 10 Hassan 6 2/8 4 4, 1, 3 Tommy 7 3/8 Joe 8 2/8 5 5, 7, 12 Sara 9 2/8 6 1, 9, 3 Israel 10 1/8 7 6, 1, 8, 3, 11 Noa 11 1/8 8 3, 5, 4, 7 Mahhmud 12 1/8 Frequency Groups Frequency Items 1/8 2, 10, 11, 12 2/8 6, 8, 9 3/8 5, 7 4/8 1, 4 6/8 3 • The gaps between the frequency groups: 1/8, 1/8, 1/8, 1/8, 2/8 • The median gap = 1/8 The Attacker’s Prior Knowledge I Frequency Group Angela 3/8 – 5/8 Ariel 0 – 2/8 Edgar 5/8 – 7/8 Steve 3/8 – 5/8 Benny 2/8 – 4/8 Hassan 1/8 – 3/8 Tommy 2/8 – 4/8 Joe 1/8 – 3/8 Sara 1/8 – 3/8 Israel 0 – 2/8 Noa 0 – 2/8 Mahhmud 0 – 2/8 The Graph, By the Way… Angela 1 Ariel 2 Edgar Steve 3 4 Benny 5 Hassan 6 Tommy 7 Joe 8 Sara 9 Israel 10 Noa 11 Mahhmud 12 Calculating the Risk • Oest=1/4+1/7+1/3+1/4+1/7+1/9+1/7+ 1/9+1/9+1/7+1/7+1/7 = 2.023 • Now, it’s a question of how much would you tolerate... • Note, that this is the expected number of cracks. However, if we are interested in Edgar, as we’ve seen in previous lemmas, the probability of crack – 1/3. Experiments Open Problems • The attacker’s prior knowledge remains a largely unsolved issue • This article does not really deal with frequent sets but rather frequent items – Frequent sets can add more information and differentiate objects from one frequency group Modeling the Attacker’s Knowledge in the Real World • In a report for the Canadian Privacy Commissioner appears a broad mapping of adversary knowledge – Mapping phone directories – CV’s – Inferring gender, year of birth and postal code from different details – Data remnants on 2nd hand hard disks – Etc. סוף טוב ,הכל טוב Bibliography • Lakshmanan L., Ng R., Ramesh G. To Do or Not To Do: The Dilemma of Disclosing Anonymized Data. ACM SIGMOD Conference, 2005. • Agrawal, R. and Srikant, R. 1994. Fast algorithms for mining association rules. In Proc. 1994 Int. Conf. Very Large Data Bases (VLDB’94), Santiago, Chile, pp. 487–499. • Pan-Canadian De-Identification Guidelines for Personal Health Information, Khaled El-Emam et al., April 2007. • Wikipedia – Association rule – #P – Permanent Questions ?