1. PRIVACY AND SECURITY ISSUES IN DATA MINING
Ph.D. Candidate: Anna Monreale
Supervisors: Prof. Dino Pedreschi, Dott.ssa Fosca Giannotti
University of Pisa, Department of Computer Science

2. Privacy-Preserving Data Mining
New privacy-preserving data mining techniques
* For individual privacy: personal data are private
* For corporate privacy: the knowledge extracted is private
Goal: to develop algorithms for modifying the original data, so that
* private data are protected
* private knowledge remains private even after the mining tasks
* analysis results are still useful
Natural trade-off between privacy quantification and data utility

3. Secure Outsourcing of Data Mining
The server has access to the data of the owner.
The data owner owns
* the data
* the knowledge extracted from the data
* All encrypted transactions in D* and the items contained in them are secure
* Given any mining query, the server can compute the encrypted result
* Encrypted mining and analysis results are secure
* The owner can decrypt the results and so reconstruct the exact result
* The space and time incurred by the owner in the process have to be minimal

4. A Solution for Pattern Mining: K-anonymity
Attack model: the attacker knows the set of plain items and their true supports in D exactly, and has access to the encrypted database D*
* Item-based attack: guessing the plain item corresponding to the cipher item e with probability prob(e)
* Itemset-based attack: guessing the plain itemset corresponding to the cipher itemset E with probability prob(E)
Encryption:
* Replacing each plain item in D by a 1-1 substitution cipher
* Adding fake transactions
K-anonymity: for each item e there are at least k-1 other cipher items
Decryption: a synopsis allows computing the actual support of every pattern

5. Privacy-Preserving DT Framework
GOAL: publishing and sharing various forms of data without disclosing sensitive personal information, while preserving mining results
* Sequence data
* Query-log data
* ...
Problem: anonymizing sequence data while preserving sequential pattern mining results
Attack model: Sequence Linking Attack. The attacker knows part of a sequence and wants to guess the whole correct sequence
Idea: combining k-anonymity and sequence hiding methods, and reformulating the problem as that of hiding k-infrequent sequences

6. Running example: k = 2
Dataset D: BC, ABCD, ABCD, BCE, BCD
Prefix tree construction:
* Root, B:3, C:3, with children E:1 and D:1
* Root, A:2, B:2, C:2, D:2
Tree pruning: Lcut = {BCE:1, BCD:1}
* Root, B:1, C:1
* Root, A:2, B:2, C:2, D:2
Tree reconstruction, using the LCS of each pruned sequence: 1. B C  2. B C D
* Root, B:2, C:2
* Root, A:3, B:3, C:3, D:3
Generation of D': BC, ABCD, ABCD, BC, ABCD
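
The running example with k = 2 can be traced with a short Python sketch. It is only an illustration under stated assumptions, not the authors' implementation: the function names (prefix_supports, lcs_length, anonymize) are hypothetical, the prefix tree is represented as a map from each prefix to its support rather than as linked nodes, and the rule for breaking LCS ties (prefer the shortest surviving sequence) is an assumption chosen so that the sketch reproduces the D' shown on the slide.

```python
from collections import Counter


def prefix_supports(dataset):
    """Support of every prefix-tree node, keyed by the prefix that reaches it."""
    supports = Counter()
    for seq in dataset:
        for i in range(1, len(seq) + 1):
            supports[tuple(seq[:i])] += 1
    return supports


def lcs_length(a, b):
    """Length of the longest common subsequence of a and b (standard DP)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def anonymize(dataset, k):
    """Prune sequences that pass through a node with support < k, then
    replace each pruned sequence by the surviving sequence sharing the
    longest common subsequence with it.

    Assumption (not from the slides): ties on LCS length are broken in
    favour of the shortest surviving sequence; a pruned sequence with no
    common item with any survivor is dropped.
    """
    supports = prefix_supports(dataset)

    def is_kept(seq):
        return all(supports[tuple(seq[:i])] >= k for i in range(1, len(seq) + 1))

    # Distinct surviving sequences, sorted for a deterministic choice.
    hosts = sorted({seq for seq in dataset if is_kept(seq)})

    result = []
    for seq in dataset:
        if is_kept(seq):
            result.append(seq)
            continue
        best = max(hosts, key=lambda h: (lcs_length(seq, h), -len(h)), default=None)
        if best is not None and lcs_length(seq, best) > 0:
            result.append(best)
    return result


D = ["BC", "ABCD", "ABCD", "BCE", "BCD"]
print(anonymize(D, k=2))  # ['BC', 'ABCD', 'ABCD', 'BC', 'ABCD']
```

In this sketch BCE and BCD are the two pruned sequences (Lcut); BCE is mapped to BC and BCD to ABCD, which matches the LCS values "B C" and "B C D" listed in the example. The sketch does not re-check whether the surviving sequences still satisfy the support threshold after pruning, which a full implementation would need to handle.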