Finding Personally Identifying Information
Mark Shaneck
CSCI 5707
May 6, 2004

Data, Data, Everywhere…
* Lots of data
* Let’s mine it and find interesting things!
* But also maintain people’s privacy…

Privacy Preserving Data Mining
* Randomization
* Cryptographic protocols
* K-anonymization

Randomization Techniques
* Insert random ‘noise’ into the data to mask actual values
* Compute a function to recover the original data distribution from the randomized values
* Not always secure: random noise can be filtered in certain circumstances to accurately estimate original data values

Cryptographic Techniques
* Utilize cryptographic techniques, such as oblivious transfer, to execute data mining algorithms between two parties without revealing data to each other
* E.g. Amazon.com and Barnes & Noble

K-anonymity
* For a given set of attribute values in a record, these same values must also occur in at least k-1 other records for the data to be considered k-anonymous
* Idea developed by Latanya Sweeney, who has also developed a tool, Datafly, which will perform generalizations and suppressions to achieve k-anonymity

Generalization and Suppression
* Generalization
  * Usually done for an entire attribute
  * Basically make the data less precise
  * Taxonomy tree for categorical data
  * Discretization of continuous, numerical values
* Suppression: remove data (either a record or a single value)

What Should We Generalize?
* Most approaches assume that you know what columns to generalize from domain knowledge
* What if you choose too little? Re-identification
* What if you choose too much? Data loss
* Is there a way to get some more insight into the data and what makes it not k-anonymous?
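
To make the k-anonymity definition and the generalization idea concrete, here is a minimal Python sketch. It is not Datafly and not the author's code; the function names (`violating_rows`, `is_k_anonymous`, `generalize_age`), the 10-year bin width, and the toy records are all assumptions made for illustration.

```python
from collections import Counter


def violating_rows(records, attributes, k):
    """Indices of records whose value combination on `attributes`
    occurs in fewer than k records (hypothetical helper)."""
    keys = [tuple(r[a] for a in attributes) for r in records]
    counts = Counter(keys)
    return [i for i, key in enumerate(keys) if counts[key] < k]


def is_k_anonymous(records, attributes, k):
    """True if every value combination on `attributes` occurs at least k times."""
    return not violating_rows(records, attributes, k)


def generalize_age(record, width=10):
    """Discretize the numeric age into a range of `width` years,
    making the value less precise (assumed bin width)."""
    low = (record["age"] // width) * width
    return {**record, "age": f"{low}-{low + width - 1}"}


# Toy data, invented for this example.
records = [
    {"age": 23, "sex": "Female"},
    {"age": 27, "sex": "Female"},
    {"age": 25, "sex": "Female"},
    {"age": 31, "sex": "Male"},
    {"age": 38, "sex": "Male"},
    {"age": 35, "sex": "Male"},
]

# Every exact age is unique, so the raw data is not 3-anonymous.
raw_ok = is_k_anonymous(records, ["age", "sex"], 3)

# After generalizing age for the entire attribute, each (age range, sex)
# pair occurs three times.
generalized = [generalize_age(r) for r in records]
generalized_ok = is_k_anonymous(generalized, ["age", "sex"], 3)
```

Generalization here is applied to the whole attribute, as in the slide; suppression would instead drop the rows returned by `violating_rows` or blank out individual values.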

A Priori Algorithm
* General data mining definition: for sets X and Y, if X ⊆ Y, then Support(X) >= Support(Y)
* For k-anonymity: for sets of attributes X and Y, if X ⊆ Y and X is not k-anonymous, then Y is not k-anonymous
* Use this approach to see which rows cause k-anonymity to fail for which combinations of columns
* This info can be used to make more informed decisions when using generalization and suppression techniques

Experiment
* Implemented the method and tested it on the “Adult Database” from the UCI Machine Learning Repository
* Contains 32,561 records with 16 attributes, including: age, workclass, fnlwgt, education, education-num, marital-status, occupation, relationship, race, sex, capital-gain, capital-loss, hours-per-week, native-country
* K value was 3

Results
* Discovered some not-immediately-obvious single attributes that fail k-anonymity on their own:
  * age (a few old people)
  * hours-per-week (a few workaholics)
  * native-country (one person from the Netherlands)
* Many combinations of 2 attributes:
  * workclass & gender identified 2 records with values (“Never-worked”, “Female”)
  * occupation & race identified 1 record with value (“Armed-Forces”, “Black”)
* Due to A Priori pruning, did not find any combinations of more than 2 attributes that failed ({Sex, Education, Education-Num} was the largest combination that remained k-anonymous)

Conclusion and Future Work
* Conclusions
  * Applying A Priori can provide some insight into the data to be anonymized
* Future work
  * Further testing on various data sets
  * Deeper analysis of the results of the algorithm and how best to apply them in anonymization
  * Development of a data anonymization tool that makes use of this algorithm
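
The A Priori pruning rule in this deck (if a set of attributes is not k-anonymous, no superset of it is either) suggests a level-wise search over attribute combinations. The following Python sketch is one way that search could look; it is an assumption about the approach, not the implementation used in the experiment. The names `violating_rows` and `minimal_failing_sets` and the seven toy records are hypothetical.

```python
from collections import Counter
from itertools import combinations


def violating_rows(records, attributes, k):
    """Indices of records whose value combination on `attributes`
    occurs in fewer than k records (hypothetical helper)."""
    keys = [tuple(r[a] for a in attributes) for r in records]
    counts = Counter(keys)
    return [i for i, key in enumerate(keys) if counts[key] < k]


def minimal_failing_sets(records, attributes, k):
    """Level-wise search: test a combination only if all of its
    one-smaller subsets were k-anonymous. Returns a dict mapping each
    minimal failing attribute tuple to the rows that make it fail."""
    failing = {}
    passed_prev = {frozenset()}  # the empty set is treated as passing
    for size in range(1, len(attributes) + 1):
        passed_now = set()
        for combo in combinations(attributes, size):
            # Prune: a combination with a failing subset must fail too,
            # so it adds no new information and is skipped.
            if any(frozenset(sub) not in passed_prev
                   for sub in combinations(combo, size - 1)):
                continue
            rows = violating_rows(records, combo, k)
            if rows:
                failing[combo] = rows
            else:
                passed_now.add(frozenset(combo))
        if not passed_now:
            break  # nothing left that could be extended
        passed_prev = passed_now
    return failing


# Toy data, invented for this example (not the Adult Database).
records = [
    {"workclass": "Private", "sex": "Female", "race": "White", "country": "United-States"},
    {"workclass": "Private", "sex": "Female", "race": "White", "country": "United-States"},
    {"workclass": "Private", "sex": "Male", "race": "White", "country": "United-States"},
    {"workclass": "Private", "sex": "Male", "race": "White", "country": "United-States"},
    {"workclass": "Never-worked", "sex": "Female", "race": "White", "country": "United-States"},
    {"workclass": "Never-worked", "sex": "Male", "race": "White", "country": "United-States"},
    {"workclass": "Never-worked", "sex": "Male", "race": "White", "country": "Netherlands"},
]

result = minimal_failing_sets(records, ["workclass", "sex", "race", "country"], k=2)
```

With k = 2 on this toy data, `country` fails on its own because of the single Netherlands record, and the pair (`workclass`, `sex`) fails because of the single ("Never-worked", "Female") record. Every larger combination contains one of those failing sets and is pruned without being tested, which mirrors why a search of this kind reports only small failing combinations.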