Tools for Privacy Preserving Distributed Data Mining
By Michael Holmes

Why Private Data Mining
❖ The CDC may want to use data mining techniques to identify trends in disease outbreaks.
❖ Insurance companies have useful data but can’t disclose it because of privacy concerns.
❖ Is there a way to obtain this data without revealing the identity of the patients?

Private Data Mining Techniques
❖ Secure Sum
❖ Secure Set Union
❖ Secure Size of Set Intersection
❖ Scalar Product

Private Data Mining Toolkit
❖ Association Rules in horizontally partitioned data
❖ Association Rules in vertically partitioned data
❖ EM Clustering

Secure Sum
❖ Securely compute the sum of values held in individual databases.
❖ Have Site 1 randomly generate a number R.
❖ Site 1 adds R to each of its values and sends the masked values to Site 2.
❖ Site 2 can then add each of its values to the values sent from Site 1 and return a single number back to Site 1.
❖ Site 1 can then remove the random number N times (once for each of the N values it masked) and find the correct sum.

Secure Set Union

Secure Size of Set Intersection
❖ Only possible with commutative encryption.
❖ Every party encrypts its data and then sends it to another party.
❖ The next party also encrypts the already encrypted data.
❖ After all parties have encrypted all the data from every other party, only items held in common end up as duplicates under the encryption.
❖ Count the duplicates and you know the size of the intersection.

Scalar Product
❖ Want to compute the sum of the products x_i * y_i between two databases.
❖ Use linear combinations of random numbers to disguise the elements, then computationally remove these once you get the result.
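The Secure Size of Set Intersection steps can be sketched in Python as follows. This is a minimal illustration, not the method from the talk: it assumes a Pohlig-Hellman style commutative cipher (modular exponentiation with a secret exponent), and the names `P`, `make_key`, `hash_to_group`, `encrypt`, and `intersection_size` are hypothetical. The prime is toy-sized and all sites are simulated in one process, so it shows the counting idea only and is not secure as written.

```python
import hashlib
import math
import secrets

P = 2**127 - 1  # a Mersenne prime; toy-sized, for illustration only


def make_key():
    """Pick a secret exponent coprime to P - 1 so encryption is a bijection."""
    while True:
        k = secrets.randbelow(P - 3) + 2  # value in the range 2..P-2
        if math.gcd(k, P - 1) == 1:
            return k


def hash_to_group(item):
    """Map a string item to an integer in 1..P-1."""
    digest = hashlib.sha256(item.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % (P - 1) + 1


def encrypt(values, key):
    """Commutative encryption: (v^a)^b == (v^b)^a modulo P."""
    return {pow(v, key, P) for v in values}


def intersection_size(site_items):
    """Count items common to all sites by comparing fully encrypted values."""
    keys = [make_key() for _ in site_items]
    fully_encrypted = []
    for i, items in enumerate(site_items):
        values = {hash_to_group(item) for item in items}
        # Each site encrypts its own data first, then the other sites encrypt
        # it in turn; the order differs per site but the result does not.
        for key in keys[i:] + keys[:i]:
            values = encrypt(values, key)
        fully_encrypted.append(values)
    # Only items held by every site yield the same ciphertext everywhere.
    return len(set.intersection(*fully_encrypted))


sites = [{"a", "b", "c"}, {"b", "c", "d"}, {"c", "b", "e"}]
shared = intersection_size(sites)
```

Because every item ends up encrypted under all keys, no single site can decrypt another site's items, yet equal items still collide, which is what makes counting possible.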
Association Rules in Horizontally Partitioned Data
❖ Candidate Set Generation
❖ Local Pruning
❖ Itemset Exchange (Secure Union step here)
❖ Support Count Exchange

Association Rules in Vertically Partitioned Data
❖ Uses scalar product to determine if the count of an item set is greater than a threshold.
❖ If the count is above the threshold you’ve determined that the database is worth querying.
❖ Can also use Secure Size of Set Intersection to see how much is in common.
❖ Useful when using an algorithm such as the Apriori algorithm.

EM Clustering
❖ Uses secure sum to get a global number associated with all sites involved.
❖ Once the global sum is computed, it can be used in the Expectation-Maximization method to generate statistical models.

Things to Note
❖ These algorithms are not fully private; some information is learned in the process.
❖ For example, in the set intersection, sites can potentially learn the sizes of each database.
❖ Make sure to pick the appropriate algorithms for what you need to accomplish.
❖ Watch out for intermediate information being leaked!

Thank you
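As a companion to the EM Clustering slide, here is a minimal Python sketch of the secure sum it relies on. It follows the commonly described ring formulation (one random mask added by the first site, a running total passed from site to site) rather than the exact steps on the Secure Sum slide. It assumes non-negative integer values whose true total is below a hypothetical `modulus`; the function name `secure_sum` is also an assumption, and all sites are simulated in one process.

```python
import secrets


def secure_sum(local_values, modulus=2**32):
    """Simulate a ring-based secure sum over one local value per site.

    Assumes non-negative integers whose true total is below `modulus`.
    """
    if not local_values:
        raise ValueError("need at least one site")
    # Site 1 picks a random mask R and passes (R + v1) mod m to the next site.
    mask = secrets.randbelow(modulus)
    running = (mask + local_values[0]) % modulus
    # Each later site sees only a masked running total and adds its own value.
    for value in local_values[1:]:
        running = (running + value) % modulus
    # The total returns to Site 1, which removes the mask.
    return (running - mask) % modulus


site_values = [12, 7, 30]
total = secure_sum(site_values)
```

In an EM setting, each site's local value would be one of its partial statistics, and the recovered global total would feed the next update step. As the notes above warn, a sketch like this does not stop sites from colluding to recover a neighbour's value.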