Data Transformation for Privacy-Preserving Data Mining
Stanley R. M. Oliveira
Database Systems Laboratory, Computing Science Department, University of Alberta, Canada
PhD Thesis - Final Examination - November 29th, 2004

Motivation
• Scenario 1: A collaboration between an Internet marketing company and an on-line retail company. Objective: find optimal customer targets.
• Scenario 2: Companies sharing their transactions to build a recommender system. Objective: provide recommendations to their customers.

A Taxonomy of the Existing Solutions
• Data Partitioning
• Data Modification
• Data Restriction
• Data Ownership
Fig. 1: A taxonomy of PPDM techniques.

Problem Definition
• To transform a database into a new one that conceals sensitive information while preserving the general patterns and trends of the original database.
• The transformation process (data sanitization or dimensionality reduction) maps the original database to a transformed database; data mining on the transformed database yields only non-sensitive patterns and trends.

Problem Definition (cont.)
• Sub-Problem 1: Privacy-Preserving Association Rule Mining. I do not address privacy of individuals, but the problem of protecting sensitive knowledge.
• Sub-Problem 2: Privacy-Preserving Clustering. I protect the underlying attribute values of objects subjected to clustering analysis.
A Framework for Privacy-Preserving Data Mining
A schematic view of the framework for privacy preservation: on the client side, the original data may undergo an individual transformation; on the server side, a collective transformation produces the transformed database. The PPDT methods comprise a library of sanitizing algorithms, retrieval facilities, and a set of metrics.

Privacy-Preserving Association Rule Mining
A taxonomy of sanitizing algorithms (Heuristic 1, Heuristic 2, Heuristic 3).

Data Sharing-Based Algorithms: Problems
Given the original database D and the sanitized database D', with S_R the sensitive rules, ~S_R the non-sensitive rules, R the rules mined from D, R' the rules mined from D', and f_X(i) the frequency of the i-th item in database X:
1. Hiding Failure: HF = #S_R(D') / #S_R(D)
2. Misses Cost: MC = (#~S_R(D) - #~S_R(D')) / #~S_R(D)
3. Artifactual Patterns: AP = (|R'| - |R ∩ R'|) / |R'|
4. Difference between D and D': Dif(D, D') = (1 / Σ_{i=1..n} f_D(i)) × Σ_{i=1..n} [f_D(i) - f_D'(i)]

Data Sharing-Based Algorithms
1. Scan the database and identify the sensitive transactions for each sensitive rule;
2. Based on the disclosure threshold, compute the number of sensitive transactions to be sanitized;
3. For each sensitive rule, identify a candidate item that should be eliminated (the victim item);
4. Based on the number found in step 2, remove the victim items from the sensitive transactions.
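The four steps above can be sketched as follows. This is a toy illustration, not the thesis algorithms (IGA/SWA/RA): the transaction data are invented, and the victim-item choice (the rule item most frequent in the sensitive transactions) is one plausible heuristic, assumed here for concreteness.

```python
# Toy sketch of a data sharing-based sanitization pass plus the
# hiding-failure metric HF = #S_R(D') / #S_R(D).

def sanitize(db, sensitive_rules, psi):
    """db: list of transactions (sets of items); sensitive_rules: list of
    itemsets (sets); psi: disclosure threshold in [0, 1] (fraction of
    sensitive transactions allowed to keep supporting each rule)."""
    db = [set(t) for t in db]  # work on a copy
    for rule in sensitive_rules:
        # Step 1: identify the sensitive transactions for this rule.
        idx = [i for i, t in enumerate(db) if rule <= t]
        # Step 2: number of sensitive transactions to be sanitized.
        n_sanitize = len(idx) - int(psi * len(idx))
        # Step 3: pick a victim item (illustrative choice: the rule item
        # most frequent in the sensitive transactions).
        victim = max(rule, key=lambda item: sum(1 for i in idx if item in db[i]))
        # Step 4: remove the victim item from n_sanitize sensitive transactions.
        for i in idx[:n_sanitize]:
            db[i].discard(victim)
    return db

def hiding_failure(db_orig, db_san, sensitive_rules):
    """Fraction of the sensitive rules' support that survives sanitization."""
    supp = lambda db, r: sum(1 for t in db if r <= set(t))
    before = sum(supp(db_orig, r) for r in sensitive_rules)
    after = sum(supp(db_san, r) for r in sensitive_rules)
    return after / before if before else 0.0

D = [{"a", "b", "c"}, {"a", "b"}, {"a", "b", "d"}, {"c", "d"}]
D2 = sanitize(D, [{"a", "b"}], psi=0.0)
print(hiding_failure(D, D2, [{"a", "b"}]))  # psi = 0 -> rule fully hidden -> 0.0
```

With psi = 0.5 the same call leaves one of the three sensitive transactions intact, so HF = 1/3: the threshold trades privacy against how much of the original support survives.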
Measuring Effectiveness
Result tables (disclosure threshold = 0%): the best data sharing-based algorithm per dataset (Kosarak, Retail, Reuters, BMS-1) and per scenario (S1-S4), in terms of Misses Cost under condition C1 (6 sensitive rules), condition C2 (varying the number of rules), and condition C3 (varying the values of the disclosure threshold), and in terms of Dif(D, D') under conditions C1 and C3. IGA is the best algorithm in most of the Misses Cost comparisons, while SWA dominates the Dif(D, D') comparisons.

Pattern Sharing-Based Algorithm
• Data sharing-based approach: D → sanitization (IGA) → D' → share D'; the recipient mines association rules from D'.
• Pattern sharing-based approach: D → AR generation → Rules(D) → sanitization (DSA) → Rules(D') → share Rules(D').

Pattern Sharing-Based Algorithms: Problems
Let R be all rules mined from D, R' the rules to share, S_R the sensitive rules (rules to hide), and ~S_R the non-sensitive rules.
• Problem 1: RSE (side effect) - non-sensitive rules removed along with the sensitive ones.
• Problem 2: inference - sensitive rules recovered from the shared rules (measured by the recovery factor).
1. Side Effect Factor: SEF = (|R| - (|R'| + |S_R|)) / (|R| - |S_R|)
2. Recovery Factor: RF ∈ {0, 1} (1 when a hidden rule can be recovered from the shared rules, 0 otherwise).
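The side effect factor can be computed directly from the three rule sets. A minimal sketch with toy rule sets (the concrete rule names are invented for illustration):

```python
def side_effect_factor(R, R_shared, S_R):
    """SEF = (|R| - (|R'| + |S_R|)) / (|R| - |S_R|): the fraction of the
    non-sensitive rules accidentally lost by the sanitization."""
    return (len(R) - (len(R_shared) + len(S_R))) / (len(R) - len(S_R))

# Toy example: 10 rules mined from D, 2 of them sensitive.  A perfect
# sanitization would share the 8 non-sensitive rules; here only 6 survive.
R = {f"r{i}" for i in range(10)}
S_R = {"r0", "r1"}
R_shared = {"r2", "r3", "r4", "r5", "r6", "r7"}  # r8, r9 lost as a side effect
print(side_effect_factor(R, R_shared, S_R))  # (10 - (6 + 2)) / (10 - 2) = 0.25
```

When every non-sensitive rule is shared, the numerator is zero and SEF = 0, which is the ideal outcome.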
Measuring Effectiveness

The best algorithm in terms of misses cost (disclosure threshold = 0%, 6 sensitive rules):

Dataset   S1         S2    S3         S4
Kosarak   IGA        DSA   DSA        DSA
Retail    DSA        DSA   DSA        IGA
Reuters   DSA        DSA   DSA        IGA
BMS-1     DSA        DSA   DSA        DSA

The best algorithm in terms of misses cost, varying the number of rules to sanitize (disclosure threshold = 0%):

Dataset   S1         S2    S3         S4
Kosarak   IGA        DSA   DSA        IGA / DSA
Retail    DSA        DSA   IGA / DSA  IGA
Reuters   IGA / DSA  DSA   DSA        IGA
BMS-1     DSA        DSA   DSA        DSA

The best algorithm in terms of side effect factor (disclosure threshold = 0%, 6 sensitive rules):

Dataset   S1         S2    S3         S4
Kosarak   IGA        DSA   DSA        DSA
Retail    DSA        DSA   DSA        IGA
Reuters   DSA        DSA   DSA        IGA
BMS-1     DSA        DSA   DSA        DSA

Lessons Learned
• Large datasets are our friends.
• The benefit of an index: at most two scans to sanitize a dataset.
• The data sanitization paradox.
• The outstanding performance of IGA and DSA.
• Rule sanitization does not change the support and confidence of the shared rules.
• DSA reduces the flexibility of information sharing.

Privacy-Preserving Clustering (PPC)
• PPC over Centralized Data: the attribute values subjected to clustering are available in a central repository.
• PPC over Vertically Partitioned Data: there are k parties sharing data for clustering, where k ≥ 2; the attribute values of the objects are split across the k parties. Object IDs are revealed for join purposes only; the values of the associated attributes are private.
Object Similarity-Based Representation (OSBR)
Example 1: sharing data for research purposes (OSBR).

Original data (a sample of the cardiac arrhythmia database, UCI Machine Learning Repository):

ID    age  weight  heart rate  Int_def  QRS  PR_int
123   75   80      63          32       91   193
342   56   64      53          24       81   174
254   40   52      70          24       77   129
446   28   58      76          40       83   251
286   44   90      68          44       109  128

The corresponding dissimilarity matrix (lower triangle):

DM =  0
      2.243  0
      3.348  2.477  0
      3.690  3.884  3.176  0
      3.020  4.082  4.130  3.995  0

Limitations of the OSBR:
• Expensive in terms of communication cost: O(m^2), where m is the number of objects under analysis.
• Vulnerable to attacks in the case of vertically partitioned data.
Conclusion: the OSBR is effective for PPC over centralized data only.

Dimensionality Reduction-Based Transformation (DRBT)
• Random projection from d to k dimensions: D'_{n×k} = D_{n×d} · R_{d×k} (a linear transformation), where D is the original data, D' is the reduced data, and R is a random matrix.
• R is generated by setting each entry r_ij as follows:
  (R1): r_ij is drawn i.i.d. from N(0, 1), and the columns of R are then normalized to unit length;
  (R2): r_ij = sqrt(3) × (+1 with probability 1/6; 0 with probability 2/3; -1 with probability 1/6).
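A minimal numpy sketch of the two ways of generating R described above, following the D'_{n×k} = D_{n×d} · R_{d×k} formulation. The data here are random rather than the arrhythmia sample, and the post-projection rescaling is an illustrative addition (it makes squared distances match in expectation; clustering itself is insensitive to a global rescaling):

```python
import numpy as np

rng = np.random.default_rng(0)

def rp1(d, k):
    """(R1): entries drawn i.i.d. from N(0, 1); columns normalized to unit length."""
    R = rng.standard_normal((d, k))
    return R / np.linalg.norm(R, axis=0)

def rp2(d, k):
    """(R2): sqrt(3) * (+1 w.p. 1/6, 0 w.p. 2/3, -1 w.p. 1/6)."""
    return np.sqrt(3) * rng.choice([1.0, 0.0, -1.0], size=(d, k), p=[1/6, 2/3, 1/6])

# Project n = 100 objects from d = 50 down to k = 25 dimensions.
n, d, k = 100, 50, 25
D = rng.standard_normal((n, d))

# Scale factors: sqrt(d/k) for unit-length columns (R1), 1/sqrt(k) for
# unit-variance entries (R2), so distances are preserved in expectation.
proj = {
    "RP1": D @ rp1(d, k) * np.sqrt(d / k),
    "RP2": D @ rp2(d, k) / np.sqrt(k),
}
for name, Dp in proj.items():
    orig = np.linalg.norm(D[0] - D[1])
    red = np.linalg.norm(Dp[0] - Dp[1])
    print(f"{name}: |x0 - x1| = {orig:.2f} -> {red:.2f}")  # approximately equal
```

This is exactly why random projection suits PPC: pairwise distances, the only input clustering needs, survive the transformation approximately, while the original attribute values do not.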
Dimensionality Reduction-Based Transformation (DRBT) (cont.)
The arrhythmia sample above (UCI Machine Learning Repository), reduced from six attributes to three. RP1: the random matrix is based on the Normal distribution; RP2: the random matrix is based on the much simpler distribution.

Transformed data:

         RP1                        RP2
ID    Att1    Att2    Att3      Att1    Att2     Att3
123   -50.40  17.33   12.31     -55.50  -95.26   -107.93
342   -37.08  6.27    12.22     -51.00  -84.29   -83.13
254   -55.86  20.69   -0.66     -65.50  -70.43   -66.97
446   -37.61  -31.66  -17.58    -85.50  -140.87  -72.74
286   -62.72  37.64   18.16     -88.50  -50.22   -102.76

Dimensionality Reduction-Based Transformation (DRBT)
• Security: a random projection from d to k dimensions, where k < d, is a non-invertible linear transformation.
• Space requirement: O(m), where m is the number of objects.
• Communication cost: O(m·k·l), where l represents the size (in bits) required to transmit a dataset from one party to a central or third party.
• Conclusion: the DRBT is effective for PPC over both centralized data and vertically partitioned data.
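The experimental results that follow report F-measure values comparing the clusters found in the transformed data against those found in the original data. A minimal sketch of one common set-matching variant of the clustering F-measure; whether this matches the thesis's exact definition is an assumption, and the toy partitions are invented:

```python
def clustering_f_measure(reference, candidate):
    """Set-matching F-measure: for each reference cluster take the best F1
    against any candidate cluster, then average weighted by cluster size."""
    n = sum(len(c) for c in reference)
    total = 0.0
    for ref in reference:
        best = 0.0
        for cand in candidate:
            overlap = len(set(ref) & set(cand))
            if overlap == 0:
                continue
            p = overlap / len(cand)   # precision
            r = overlap / len(ref)    # recall
            best = max(best, 2 * p * r / (p + r))
        total += len(ref) / n * best
    return total

# Toy example over objects 0..5: identical partitions score 1.0,
# a shifted partition scores lower.
A = [[0, 1, 2], [3, 4, 5]]
B = [[0, 1, 2], [3, 4, 5]]
C = [[0, 1], [2, 3, 4, 5]]
print(clustering_f_measure(A, B))  # 1.0
print(clustering_f_measure(A, C))  # < 1.0
```

An F-measure near 1.0 in the tables below means the clusters mined from the reduced data are nearly identical to those mined from the original data.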
DRBT: PPC over Centralized Data

The error produced on the dataset Chess (do = 37):

Transformation  dr=37  dr=34  dr=31  dr=28  dr=25  dr=22  dr=16
RP1             0.000  0.015  0.024  0.033  0.045  0.072  0.141
RP2             0.000  0.014  0.019  0.032  0.041  0.067  0.131

Average F-measure (10 trials) for the dataset Accidents (do = 18, dr = 12):

Transformation  K=2 (Avg/Std)  K=3 (Avg/Std)  K=4 (Avg/Std)  K=5 (Avg/Std)
RP2             0.941 / 0.014  0.912 / 0.009  0.881 / 0.010  0.885 / 0.006

Average F-measure (10 trials) for the dataset Iris (do = 5, dr = 3):

Transformation  K=2 (Avg/Std)  K=3 (Avg/Std)  K=4 (Avg/Std)  K=5 (Avg/Std)
RP2             1.000 / 0.000  0.948 / 0.010  0.858 / 0.089  0.833 / 0.072

DRBT: PPC over Vertically Partitioned Data

The error produced on the dataset Pumsb (do = 74):

No. Parties  RP1     RP2
1            0.0762  0.0558
2            0.0798  0.0591
3            0.0870  0.0720
4            0.0923  0.0733

Average F-measure (10 trials) for the dataset Pumsb (do = 74, dr = 38):

No. Parties  K=2 (Avg/Std)  K=3 (Avg/Std)  K=4 (Avg/Std)  K=5 (Avg/Std)
1            0.909 / 0.140  0.965 / 0.081  0.891 / 0.028  0.838 / 0.041
2            0.904 / 0.117  0.931 / 0.101  0.894 / 0.059  0.840 / 0.047
3            0.874 / 0.168  0.887 / 0.095  0.873 / 0.081  0.801 / 0.073
4            0.802 / 0.155  0.812 / 0.117  0.866 / 0.088  0.831 / 0.078

Conclusions: Contributions of this Research
• Foundations for further research in PPDM.
• A taxonomy of PPDM techniques.
• A family of privacy-preserving methods.
• A library of sanitizing algorithms.
• Retrieval facilities.
• A set of metrics.
Conclusions: Future Research
• Privacy definition in data mining.
• Combining sanitization and randomization for PPARM.
• Transforming data using one-way functions and learning from the distorted data.
• Privacy preservation in spoken-language databases.
• Sanitization of document repositories on the Web.