Data Transformation for Privacy-Preserving Data Mining
Stanley R. M. Oliveira
Database Systems Laboratory, Computing Science Department, University of Alberta, Canada
Graduate Seminar – November 26th, 2004

Motivation
• Changes in technology are making privacy harder to preserve.
• Statistical offices face a new challenge.
• Data mining plays an outstanding role in business collaboration.
• The traditional "all or nothing" solution has been too rigid.
• Techniques are needed to enforce privacy concerns when data are shared for mining.

PPDM: Increasing Number of Papers
[Chart: the number of PPDM papers per year, growing from 1 in 1991 to 30 in 2004, across three periods: the Conceptive Landmark, the Deployment Landmark, and the Prospective Landmark.]

PPDM: Privacy Violation
• Privacy violation in data mining: misuse of data.
• Defining privacy preservation in data mining:
  - Individual privacy preservation: protection of personally identifiable information.
  - Collective privacy preservation: protection of users' collective activity.

A Few Examples of Scenarios in PPDM
• Scenario 1: A hospital shares some data for research purposes.
• Scenario 2: Outsourcing the data mining process.
• Scenario 3: A collaboration between an Internet marketing company and an on-line retail company.
A Taxonomy of the Existing Solutions
Fig. 1: A taxonomy of PPDM techniques: Data Partitioning, Data Modification, Data Restriction, Data Ownership, and our Contributions.

Problem Definition
• To transform a database into a new one that conceals sensitive information while preserving the general patterns and trends of the original database.
[Diagram: the original database passes through the transformation process into a transformed database; data mining on the transformed database yields only non-sensitive patterns and trends.]

Problem Definition (cont.)
• Problem 1: Privacy-Preserving Association Rule Mining
  I do not address the privacy of individuals, but the problem of protecting sensitive knowledge.
• Assumptions:
  - The data owners must know in advance some knowledge (rules) that they want to protect.
  - Individual data values (e.g., a specific item) are not restricted; the relationships between items are.

Problem Definition (cont.)
• Problem 2: Privacy-Preserving Clustering
  I protect the underlying attribute values of objects subjected to clustering analysis.
• Assumptions: given a data matrix D (m objects, n attributes), the goal is to transform D into D' so that the following restrictions hold:
  - The transformation T: D → D' must preserve the privacy of individual records.
  - The similarity between objects in D and D' must be the same, or only slightly altered, by the transformation process.
A Framework for Privacy-Preserving Data Mining
[Schematic view of the framework: on the server side, the original data undergo a collective or individual transformation into the transformed database; the framework also comprises PPDT methods, a library of algorithms, sanitization, retrieval facilities, and metrics available to the client.]

Privacy-Preserving Association Rule Mining
The sanitization process transforms a transactional database D into a sanitized database D':
  Step 1: Identify the discovered patterns.
  Step 2: Classify the discovered patterns.
  Step 3: Select the sensitive transactions.
  Step 4: Modify some sensitive transactions.

A sample transactional database and the corresponding inverted file (sensitive rules → sensitive transaction IDs):

  TID  List of Items     Sensitive Rule   Sensitive TIDs
  T1   A B C D           A,B → D          T1, T3
  T2   A B C             A,C → D          T1, T4
  T3   A B D
  T4   A C D
  T5   A B C
  T6   B D

A taxonomy of sanitizing algorithms distinguishes three heuristics: Heuristic 1, Heuristic 2, and Heuristic 3.

Heuristic 1: Degree of Sensitive Transactions
• Definition: Let D be a transactional database and ST the set of all sensitive transactions in D. The degree of a sensitive transaction t, with t ∈ ST, is the number of sensitive association rules that can be found in t.
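A minimal sketch of this definition, using the sample database and the two sensitive rules from the slides; the data structures and function names are illustrative, not the author's implementation.

```python
# Sketch of Heuristic 1: the degree of a sensitive transaction is the
# number of sensitive rules it fully contains. Each rule is represented
# by the set of all its items, e.g. A,B -> D becomes {A, B, D}.

database = {
    "T1": {"A", "B", "C", "D"},
    "T2": {"A", "B", "C"},
    "T3": {"A", "B", "D"},
    "T4": {"A", "C", "D"},
    "T5": {"A", "B", "C"},
    "T6": {"B", "D"},
}
sensitive_rules = [{"A", "B", "D"}, {"A", "C", "D"}]

def inverted_file(db, rules):
    """Map each sensitive rule to the IDs of the transactions containing it."""
    return {frozenset(r): sorted(t for t, items in db.items() if r <= items)
            for r in rules}

def degrees(db, rules):
    """Degree of each sensitive transaction (only sensitive ones are kept)."""
    deg = {}
    for tid, items in db.items():
        d = sum(1 for r in rules if r <= items)
        if d > 0:
            deg[tid] = d
    return deg

print(degrees(database, sensitive_rules))  # {'T1': 2, 'T3': 1, 'T4': 1}
```

The inverted file makes the sanitization phase cheap: each rule maps directly to the transactions that support it, so the database need not be rescanned per rule.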
For the sample database, Degree(T1) = 2, Degree(T3) = 1, and Degree(T4) = 1.

Data Sharing-Based Algorithms
1. Scan the database and identify the sensitive transactions for each restrictive pattern.
2. Based on the disclosure threshold ψ, compute the number of sensitive transactions to be sanitized.
3. For each restrictive pattern, identify a candidate item that should be eliminated (the victim item).
4. Based on the number computed in step 2, remove the victim items from the sensitive transactions.

The Round Robin Algorithm (RRA)
Sensitive rules (SR): Rule 1: A,B → D; Rule 2: A,C → D.
• Step 1: Identify the sensitive transactions: A,B → D: {T1, T3}; A,C → D: {T1, T4}.
• Step 2: Select the number of sensitive transactions: (a) ψ = 50%; (b) ψ = 0%.
• Step 3: Identify the victim items (taking turns among the items of each rule):
  - ψ = 50%: Victim(T1) = A for both rules (partial sanitization).
  - ψ = 0%: Victim(T1) = A, Victim(T3) = B, Victim(T4) = C (full sanitization).
• Step 4: Sanitize the marked sensitive transactions.

  TID  Original DB   Partial Sanitization (ψ = 50%)   Full Sanitization (ψ = 0%)
  T1   A B C D       B C D                            B C D
  T2   A B C         A B C                            A B C
  T3   A B D         A B D                            A D
  T4   A C D         A C D                            A D
  T5   A B C         A B C                            A B C
  T6   B D           B D                              B D
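A hedged sketch of the round-robin idea on the running example. It assumes lexicographic turn order among a rule's items and sanitizes the earliest sensitive transactions first; the published algorithm's ordering criteria are simplified here.

```python
from itertools import cycle

# Sketch of the Round Robin Algorithm (RRA): victim items are picked by
# taking turns among each rule's items; psi is the disclosure threshold
# (psi = 0 sanitizes every sensitive transaction).

def round_robin_sanitize(db, rules, psi):
    sanitized = {tid: set(items) for tid, items in db.items()}
    for rule in rules:
        turn = cycle(sorted(rule))  # take turns among the rule's items
        sensitive = [t for t, items in db.items() if rule <= items]
        n_sanitize = len(sensitive) - int(round(psi * len(sensitive)))
        for tid in sensitive[:n_sanitize]:
            sanitized[tid].discard(next(turn))
    return sanitized

db = {t: set(i) for t, i in {"T1": "ABCD", "T2": "ABC", "T3": "ABD",
                             "T4": "ACD", "T5": "ABC", "T6": "BD"}.items()}
rules = [{"A", "B", "D"}, {"A", "C", "D"}]

partial = round_robin_sanitize(db, rules, psi=0.5)  # only T1 is touched
full = round_robin_sanitize(db, rules, psi=0.0)     # T1, T3, T4 sanitized
print(sorted(full["T1"]), sorted(full["T3"]), sorted(full["T4"]))
# ['B', 'C', 'D'] ['A', 'D'] ['A', 'D']
```

With these assumptions the sketch reproduces both tables above: ψ = 50% removes A from T1 only, while ψ = 0% also removes B from T3 and C from T4.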
The Item Grouping Algorithm (IGA)
Sensitive rules (SR): Rule 1: A,B → D; Rule 2: A,C → D.
• Step 1: Identify the sensitive transactions: A,B → D: {T1, T3}; A,C → D: {T1, T4}.
• Step 2: Select the number of sensitive transactions: ψ = 0%.
• Step 3: Identify the victim items by grouping sensitive rules that share items; since Sup(D) < Sup(A), D is chosen: Victim(T1) = Victim(T3) = Victim(T4) = D (full sanitization).
• Step 4: Sanitize the marked sensitive transactions.

  TID  Original DB   Sanitized DB
  T1   A B C D       A B C
  T2   A B C         A B C
  T3   A B D         A B
  T4   A C D         A C
  T5   A B C         A B C
  T6   B D           B D

Heuristic 2: Size of Sensitive Transactions
For every group of K transactions:
  Step 1: Distinguish the sensitive transactions from the non-sensitive ones.
  Step 2: Select the victim item for each sensitive rule.
  Step 3: Compute the number of sensitive transactions to be sanitized.
  Step 4: Sort the sensitive transactions by size.
  Step 5: Sanitize the sensitive transactions.

Novelties of this Approach
• The notion of a disclosure threshold for every single pattern: Mining Permissions (MP).
• Each mining permission is a pair mp = <sri, ψi>, where sri belongs to the set of sensitive rules SR and ψi ∈ [0, 1].
• Mining permissions allow a DBA to assign different weights to the different rules to hide.
• All thresholds ψi can also be set to the same value, if needed.
The Sliding Window Algorithm (SWA)
Sensitive rules (SR): Rule 1: A,B → D; Rule 2: A,C → D.
• Step 1: Identify the sensitive transactions: A,B → D: {T1, T3}; A,C → D: {T1, T4}.
• Step 2: Identify the victim items (based on the frequencies of the items in SR): Victim(T3) = B; Victim(T4) = A; Victim(T1) = D.
• Step 3: Select the number of sensitive transactions: ψ = 0%.
• Step 4: Sort the sensitive transactions by size: A,B → D: {T3, T1}; A,C → D: {T4, T1}.
• Step 5: Sanitize the marked sensitive transactions.

  TID  Original DB   Sanitized DB
  T1   A B C D       A B C
  T2   A B C         A B C
  T3   A B D         A D
  T4   A C D         C D
  T5   A B C         A B C
  T6   B D           B D

Data Sharing-Based Metrics
1. Hiding Failure: HF = #SR(D') / #SR(D), the fraction of sensitive rules still discoverable in the sanitized database D'.
2. Misses Cost: MC = (#~SR(D) − #~SR(D')) / #~SR(D), the fraction of non-sensitive rules accidentally hidden, where ~SR denotes the non-sensitive rules.
3. Artifactual Patterns: AP = (|R'| − |R ∩ R'|) / |R'|, the fraction of rules in D' that did not exist in D, where R and R' are the rules mined from D and D'.
4. Difference between D and D': Dif(D, D') = (Σ_{i=1..n} |f_D(i) − f_D'(i)|) / (Σ_{i=1..n} f_D(i)), where f_X(i) is the frequency of item i in database X.

Pattern Sharing-Based Algorithm
• Data sharing-based approach: sanitize D into D', share D', and let the receiver mine association rules from D'.
• Pattern sharing-based approach: mine the association rules from D first, sanitize the discovered rules, and share only the sanitized rules.

Possible Inference Channels
• Inferences: based on non-restrictive rules, someone tries to deduce one or more restrictive rules that are not supposed to be discovered.
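The four data sharing-based metrics defined earlier (HF, MC, AP, and Dif) reduce to small set and frequency computations. A sketch, with toy rule sets and item frequencies that are illustrative rather than drawn from the datasets:

```python
# Rule sets are modeled as Python sets; item frequencies f_D(i) as a
# dict mapping item -> support count.

def hiding_failure(sens_in_d, sens_in_dprime):
    """HF: fraction of sensitive rules still discoverable in D'."""
    return len(sens_in_dprime) / len(sens_in_d)

def misses_cost(nonsens_in_d, nonsens_in_dprime):
    """MC: fraction of legitimate (non-sensitive) rules lost in D'."""
    return (len(nonsens_in_d) - len(nonsens_in_dprime)) / len(nonsens_in_d)

def artifactual_patterns(rules_d, rules_dprime):
    """AP: fraction of rules in D' that never existed in D."""
    return len(rules_dprime - rules_d) / len(rules_dprime)

def difference(freq_d, freq_dprime):
    """Dif(D, D'): normalized sum of per-item frequency changes."""
    num = sum(abs(freq_d[i] - freq_dprime.get(i, 0)) for i in freq_d)
    return num / sum(freq_d.values())

# Toy example: both sensitive rules hidden, one legitimate rule lost.
rules_d = {"r1", "r2", "a", "b", "c", "d"}   # all rules mined from D
sens = {"r1", "r2"}                          # sensitive rules
rules_dp = {"a", "b", "c"}                   # rules mined from D'
print(hiding_failure(sens, sens & rules_dp))         # 0.0
print(misses_cost(rules_d - sens, rules_dp - sens))  # 0.25
```

An ideal sanitization drives HF, MC, AP, and Dif to zero simultaneously, which is generally impossible; the metrics quantify the trade-off each algorithm strikes.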
  TID  List of Items
  T1   A B C D
  T2   A B C
  T3   A C D
  T4   A B C
  T5   A B

A transactional database illustrating (a) forward-inference and (b) backward-inference attacks.

Pattern Sharing-Based Metrics
Let R be the set of all rules, R' the rules to share, and SR the sensitive rules (the rules to hide). Two problems arise: Problem 1, the side effect on non-sensitive rules (~SR) that end up hidden; and Problem 2, the inference (recovery) of hidden rules.
1. Side Effect Factor: SEF = (|R| − (|R'| + |SR|)) / (|R| − |SR|).
2. Recovery Factor: RF ∈ [0, 1], measuring the possibility of recovering a hidden rule from the shared ones.

Heuristic 3: Rule Sanitization — The Downright Sanitizing Algorithm (DSA)
• Step 1: Identify the sensitive itemsets (ABC in the example).
• Step 2: Select the subsets to sanitize (the pair AC is marked at level 1).
• Step 3: Sanitize the set of supersets of the marked pairs in level 1 (here ABC and ACD).
[Frequent pattern lattice for the example: level 0: A, B, C, D; level 1: AB, AC*, AD, BC, CD; level 2: ABC*, ACD; asterisks mark the sanitized itemsets.]

Privacy-Preserving Clustering (PPC)
• PPC over Centralized Data: the attribute values subjected to clustering are available in a central repository.
• PPC over Vertically Partitioned Data: there are k parties sharing data for clustering, where k ≥ 2.
  - The attribute values of the objects are split across the k parties.
  - Object IDs are revealed for join purposes only; the values of the associated attributes are private.
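The superset-removal step of the DSA (Step 3) can be sketched as a filter over the frequent pattern lattice. Which level-1 pair gets marked is taken as given here (AC, as in the example); the algorithm's own marking criterion is not reproduced.

```python
# Sketch of Heuristic 3 (rule sanitization): once a level-1 pair has
# been marked, every frequent itemset containing the marked pair is
# withheld from the shared patterns. Itemsets are frozensets.

def dsa_sanitize(frequent_itemsets, marked_pairs):
    """Drop each marked pair and all of its supersets from the lattice."""
    marked = [frozenset(p) for p in marked_pairs]
    return {s for s in frequent_itemsets if not any(m <= s for m in marked)}

# The frequent pattern lattice from the example (levels 0, 1, and 2).
lattice = {frozenset(s) for s in
           ["A", "B", "C", "D", "AB", "AC", "AD", "BC", "CD", "ABC", "ACD"]}
shared = dsa_sanitize(lattice, marked_pairs=["AC"])
# AC, ABC, and ACD are withheld; the remaining 8 itemsets can be shared.
```

Removing supersets rather than the sensitive itemset alone is what blocks forward inference: no shared pattern implies the hidden one by monotonicity.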
Object Similarity-Based Representation (OSBR)
Example 1: sharing data for research purposes.

Original data — a sample of the cardiac arrhythmia database (UCI Machine Learning Repository):

  ID   age  weight  heart rate  Int_def  QRS  PR_int
  123  75   80      63          32       91   193
  342  56   64      53          24       81   174
  254  40   52      70          24       77   129
  446  28   58      76          40       83   251
  286  44   90      68          44       109  128

Transformed data — the corresponding dissimilarity matrix (lower triangle):

  DM = | 0                                 |
       | 2.243  0                          |
       | 3.348  2.477  0                   |
       | 3.690  3.884  3.176  0            |
       | 3.020  4.082  4.130  3.995  0     |

The Security of the OSBR
• Lemma 1: Let DM be an m × m dissimilarity matrix, where m is the number of objects. It is impossible to determine the coordinates of two objects by knowing only the distance between them.
The Complexity of the OSBR
• The communication cost is of order O(m²), where m is the number of objects under analysis.

Limitations of the OSBR
• Lemma 2: Knowing the coordinates of a particular object i and the distance r between i and any other object j, it is possible to estimate the attribute values of j.
• Hence the OSBR is vulnerable to attacks in the case of vertically partitioned data (Lemma 2).
• Conclusion: the OSBR is effective for PPC over centralized data only, and it is expensive.
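The transformation itself is simply the computation of all pairwise distances. A sketch, assuming Euclidean distance; the figures in the matrix above were presumably computed on normalized attributes, so this sketch is not expected to reproduce them from the raw values shown.

```python
import math

# Sketch of the OSBR: the owner shares only the dissimilarity matrix of
# pairwise distances, never the attribute values themselves.

def dissimilarity_matrix(rows):
    """Symmetric m x m matrix of pairwise Euclidean distances."""
    m = len(rows)
    dm = [[0.0] * m for _ in range(m)]
    for i in range(m):
        for j in range(i):
            dm[i][j] = dm[j][i] = math.dist(rows[i], rows[j])
    return dm

dm = dissimilarity_matrix([[0.0, 0.0], [3.0, 4.0], [0.0, 8.0]])
# dm[1][0] == 5.0 and dm[2][1] == 5.0: a 3-4-5 triangle, twice.
```

The O(m²) communication cost cited above is visible directly: the shared object is the full m × m matrix.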
Dimensionality Reduction-Based Transformation (DRBT)
General Assumptions:
• The attribute values subjected to clustering are numerical only.
• In PPC over centralized data, object IDs should be replaced by fictitious identifiers.
• In PPC over vertically partitioned data, object IDs are used for join purposes between the parties involved in the solution.
• The transformation (random projection) applied to the data might slightly modify the distances between data points.

Random projection from d to k dimensions: D'(n×k) = D(n×d) · R(d×k), a linear transformation, where D is the original data, D' is the reduced data, and R is a random matrix. R is generated by setting each entry r_ij in one of two ways:
(R1): r_ij is drawn from an i.i.d. N(0, 1), and the columns are then normalized to unit length;
(R2): r_ij = √3 × (+1 with probability 1/6; 0 with probability 2/3; −1 with probability 1/6).

PPC over Centralized Data (General Approach):
  Step 1: Suppress the identifiers.
  Step 2: Normalize the attribute values subjected to clustering.
  Step 3: Reduce the dimension of the original dataset by using random projection.
  Step 4: Compute the error that the distances suffer in the k-dimensional space:
    Error² = Σ_{i,j} (d̂_ij − d_ij)² / Σ_{i,j} d_ij²
  where d_ij is the distance between objects i and j in the original space and d̂_ij the corresponding distance in the reduced space.

PPC over Vertically Partitioned Data: a generalization of the solution for PPC over centralized data.
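Steps 3 and 4 can be sketched directly from the definitions. The dataset and seed below are illustrative; only the (R2) entry distribution, the matrix product, and the error metric follow the formulas above.

```python
import math
import random

def random_matrix_r2(d, k, rng):
    """d x k matrix with entries sqrt(3) * {+1: 1/6, 0: 2/3, -1: 1/6}."""
    s = math.sqrt(3)
    return [[s * rng.choices([1, 0, -1], weights=[1, 4, 1])[0]
             for _ in range(k)] for _ in range(d)]

def project(data, r):
    """D'(n x k) = D(n x d) . R(d x k)."""
    k = len(r[0])
    return [[sum(x * r[a][b] for a, x in enumerate(row)) for b in range(k)]
            for row in data]

def stress_error(original, reduced):
    """Error^2 = sum_(i,j) (d_hat_ij - d_ij)^2 / sum_(i,j) d_ij^2."""
    num = den = 0.0
    for i in range(len(original)):
        for j in range(i):
            d = math.dist(original[i], original[j])
            d_hat = math.dist(reduced[i], reduced[j])
            num += (d_hat - d) ** 2
            den += d ** 2
    return math.sqrt(num / den)

rng = random.Random(0)
data = [[rng.uniform(0, 1) for _ in range(6)] for _ in range(5)]  # 5 objects, d = 6
r = random_matrix_r2(6, 3, rng)                                   # reduce to k = 3
reduced = project(data, r)
err = stress_error(data, reduced)  # grows, on average, as k shrinks
```

The weights 1:4:1 in `random.choices` encode the 1/6, 2/3, 1/6 probabilities of (R2); its sparsity (two thirds of the entries are zero) is what makes it cheaper than the Gaussian scheme (R1).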
A sample of the cardiac arrhythmia database and its transformed counterparts under two random projections (dr = 3):

  Original Data                                    RP1                       RP2
  ID   age weight hr  Int_def QRS PR_int    Att1    Att2    Att3      Att1    Att2     Att3
  123  75  80     63  32      91  193       -50.40  17.33   12.31     -55.50  -95.26   -107.93
  342  56  64     53  24      81  174       -37.08  6.27    12.22     -51.00  -84.29   -83.13
  254  40  52     70  24      77  129       -55.86  20.69   -0.66     -65.50  -70.43   -66.97
  446  28  58     76  40      83  251       -37.61  -31.66  -17.58    -85.50  -140.87  -72.74
  286  44  90     68  44      109 128       -62.72  37.64   18.16     -88.50  -50.22   -102.76

RP1: the random matrix is based on the Normal distribution. RP2: the random matrix is based on the much simpler distribution (R2).

The Security of the DRBT
• Lemma 3: A random projection from d to k dimensions, where k ≤ d, is a non-invertible linear transformation.
The Complexity of the DRBT
• The space requirement is of order O(m), where m is the number of objects.
• The communication cost is of order O(m·l·k), where l is the size (in bits) required to transmit a dataset from one party to a central or third party.
The Accuracy of the DRBT
Given a confusion matrix freq_{i,j} between the original clusters C_1 … C_k and the clusters C'_1 … C'_k found in the transformed data:
• Precision: P = freq_{i,j} / |c'_j|
• Recall: R = freq_{i,j} / |c_i|
• F-measure: F = 2PR / (P + R)
• Overall F-measure: F = (Σ_{i=1..k} |c_i| × F(c_i)) / (Σ_{i=1..k} |c_i|)

Results and Evaluation
Datasets used in our performance evaluation:

  Task               Dataset    # records  # items  Avg. length  Shortest record  Longest record
  Association rules  BMS        59,602     497      2.51         1                145
                     Retail     88,162     16,470   10.30        1                76
                     Kosarak    990,573    41,270   8.10         1                1065
                     Reuters    7,774      26,639   46.81        1                427
  Clustering         Accidents  340,183    468      33.81        18               51
                     Mushroom   8,124      119      23.00        23               23
                     Chess      3,196      75       37.00        37               37
                     Connect    67,557     129      43.00        43               43
                     Pumsb      49,046     2,113    74.00        74               74

Data Sharing-Based Algorithms
• Item Grouping Algorithm (IGA) [Oliveira & Zaïane, PSDM 2002].
• Sliding Window Algorithm (SWA) [Oliveira & Zaïane, ICDM 2003].
• Round Robin Algorithm (RRA) [Oliveira & Zaïane, IDEAS 2003].
• Random Algorithm (RA) [Oliveira & Zaïane, IDEAS 2003].
• Algo2a [E. Dasseni et al., IHW 2001].

Methodology
• The sensitive rules were selected based on four scenarios:
  - S1: Rules with mutually exclusive items.
  - S2: Rules selected randomly.
  - S3: Rules with very high support.
  - S4: Rules with low support.
• The effectiveness of the algorithms was measured under three conditions:
  - C1: ψ = 0%, with fixed minimum support and minimum confidence.
  - C2: the same as C1, but varying the number of sensitive rules.
  - C3: ψ = 0%, with fixed minimum confidence and number of sensitive rules, varying the minimum support.

Measuring Effectiveness
[Tables: the best algorithm per dataset (Kosarak, Retail, Reuters, BMS-1) and scenario (S1–S4). For misses cost under conditions C1, C2, and C3, IGA is the best in most dataset/scenario combinations, with SWA, RA, and Algo2a best in a few cells. For Dif(D, D') under conditions C1 and C3, SWA is the best across all datasets and scenarios.]

Special Cases of Data Sanitization
SWA supports a different disclosure threshold per sensitive rule, e.g.: {[rule1, 30%], [rule2, 25%], [rule3, 15%], [rule4, 45%], [rule5, 15%], [rule6, 20%]}.

Results with window size K = 100,000 for these per-rule thresholds in scenario S3:

  Metric       Kosarak  Retail  Reuters  BMS-1
  MC           37.22    31.07   46.48    8.68
  HF           5.57     7.45    0.01     21.84
  Dif(D, D')   1.68     1.24    0.63     0.70
Effect of ψ on misses cost (MC) and hiding failure (HF) in the dataset Retail:

  Algorithm  ψ = 0%        ψ = 5%        ψ = 10%       ψ = 15%       ψ = 25%
             MC     HF     MC     HF     MC     HF     MC     HF     MC     HF
  IGA        66.31  0.00   64.77  0.66   63.23  0.83   60.94  1.32   56.26  1.99
  RRA        64.02  0.00   61.18  7.28   58.15  6.46   55.12  7.62   46.46  15.73
  RA         63.86  0.00   60.12  7.12   56.72  7.62   54.29  7.95   46.48  16.39
  SWA        65.29  0.00   55.58  1.16   48.31  1.82   42.67  3.31   27.74  15.89

CPU Time
[Figure: results of CPU time for the sanitization process.]

Pattern Sharing-Based Algorithm
• Downright Sanitizing Algorithm (DSA) [Oliveira & Zaïane, PAKDD 2004].
• We used the data-sharing algorithm IGA in our comparison study.
• Methodology:
  - IGA: (1) sanitize the datasets with IGA; (2) use Apriori to extract the rules to share (all the datasets).
  - DSA: (1) use Apriori to extract the rules from the datasets; (2) use DSA to sanitize the rules mined in the previous step.
Measuring Effectiveness

The best algorithm in terms of misses cost (ψ = 0%, 6 sensitive rules):

  Dataset  S1   S2   S3   S4
  Kosarak  IGA  DSA  DSA  DSA
  Retail   DSA  DSA  DSA  IGA
  Reuters  DSA  DSA  DSA  IGA
  BMS-1    DSA  DSA  DSA  DSA

The best algorithm in terms of misses cost, varying the number of rules to sanitize (ψ = 0%):

  Dataset  S1         S2   S3         S4
  Kosarak  IGA        DSA  DSA        IGA / DSA
  Retail   DSA        DSA  IGA / DSA  IGA
  Reuters  IGA / DSA  DSA  DSA        IGA
  BMS-1    DSA        DSA  DSA        DSA

The best algorithm in terms of side effect factor (ψ = 0%, 6 sensitive rules):

  Dataset  S1   S2   S3   S4
  Kosarak  IGA  DSA  DSA  DSA
  Retail   DSA  DSA  DSA  IGA
  Reuters  DSA  DSA  DSA  IGA
  BMS-1    DSA  DSA  DSA  DSA

Lessons Learned
• Large datasets are our friends.
• The benefit of an index: at most two scans to sanitize a dataset.
• The data sanitization paradox.
• The outstanding performance of IGA and DSA.
• Rule sanitization reduces inference channels and does not change the support and confidence of the shared rules.
• DSA reduces the flexibility of information sharing.

Evaluation: DRBT
Methodology:
  Step 1: Attribute normalization.
  Step 2: Dimensionality reduction (two approaches).
  Step 3: Computation of the error produced on the reduced datasets.
  Step 4: Run K-means to find the clusters in the original and reduced datasets.
  Step 5: Computation of the F-measure (experiments repeated 10 times).
  Step 6: Comparison of the clusters generated from the original and the reduced datasets.
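Step 5, the F-measure comparison defined in the accuracy slides, can be sketched as follows. Pairing each original cluster with its best-scoring found cluster is an assumption of this sketch, as is representing clusters as sets of object IDs.

```python
# Sketch of the F-measure evaluation between the clusters found in the
# original data and those found in the transformed (reduced) data.

def f_measure(c, c_prime):
    """F-measure of one original cluster c against one found cluster c'."""
    freq = len(c & c_prime)
    if freq == 0:
        return 0.0
    p = freq / len(c_prime)  # precision
    r = freq / len(c)        # recall
    return 2 * p * r / (p + r)

def overall_f_measure(original, found):
    """Size-weighted average of the best F-measure per original cluster."""
    total = sum(len(c) for c in original)
    return sum(len(c) * max(f_measure(c, cp) for cp in found)
               for c in original) / total

original = [{1, 2, 3}, {4, 5}]
found = [{1, 2}, {3, 4, 5}]
print(overall_f_measure(original, found))  # approximately 0.8
```

An overall F-measure close to 1 means the clusters survive the transformation nearly intact, which is what the Accidents, Iris, and Pumsb tables below report.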
DRBT: PPC over Centralized Data

The error produced on the dataset Chess (do = 37):

  Transformation  dr=37  dr=34  dr=31  dr=28  dr=25  dr=22  dr=16
  RP1             0.00   0.015  0.024  0.033  0.045  0.072  0.141
  RP2             0.00   0.014  0.019  0.032  0.041  0.067  0.131

Average F-measure (10 trials) for the dataset Accidents (do = 18, dr = 12):

  Transformation  K=2 Avg/Std    K=3 Avg/Std    K=4 Avg/Std    K=5 Avg/Std
  RP2             0.941 / 0.014  0.912 / 0.009  0.881 / 0.010  0.885 / 0.006

Average F-measure (10 trials) for the dataset Iris (do = 5, dr = 3):

  Transformation  K=2 Avg/Std    K=3 Avg/Std    K=4 Avg/Std    K=5 Avg/Std
  RP2             1.000 / 0.000  0.948 / 0.010  0.858 / 0.089  0.833 / 0.072

DRBT: PPC over Vertically Partitioned Data

The error produced on the dataset Pumsb (do = 74):

  No. Parties  RP1     RP2
  1            0.0762  0.0558
  2            0.0798  0.0591
  3            0.0870  0.0720
  4            0.0923  0.0733

Average F-measure (10 trials) for the dataset Pumsb (do = 74, dr = 38):

  No. Parties  K=2 Avg/Std    K=3 Avg/Std    K=4 Avg/Std    K=5 Avg/Std
  1            0.909 / 0.140  0.965 / 0.081  0.891 / 0.028  0.838 / 0.041
  2            0.904 / 0.117  0.931 / 0.101  0.894 / 0.059  0.840 / 0.047
  3            0.874 / 0.168  0.887 / 0.095  0.873 / 0.081  0.801 / 0.073
  4            0.802 / 0.155  0.812 / 0.117  0.866 / 0.088  0.831 / 0.078

Contributions of this Research
• Foundations for further research in PPDM.
• A taxonomy of PPDM techniques.
• A family of privacy-preserving methods.
• A library of sanitizing algorithms.
• Retrieval facilities.
• A set of metrics.
Future Research
• Privacy definition in data mining.
• Combining sanitization and randomization.
• A new method for PPC (k-anonymity + isometries + data distortion).
• Sanitization of document repositories.

Thank You!