Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Sovereign Information Sharing, Searching and Mining Rakesh Agrawal IBM Almaden Research Center Thesis Organizational boundaries are blurring in the emerging networked economy – Compete and co-operate simultaneously – Int’l value chain Need to rethink information sharing, searching, and mining in the new brave world of virtual organizations Sovereign Information Sharing Separate databases due to statutory, competitive, or security reasons. Selective, minimal sharing on need-to-know basis. Minimal Necessary Sharing R a u Example: Among those who took a particular drug, how many had adverse reaction and their DNA contains a specific sequence? Researchers must not learn anything beyond counts. v Commutative Encryption: u E1(E2(T)) = E2(E1(T)) x RS R must not know that S has b & y S must not know that R has a & x RS u v S b v Count (R S) R & S do not learn anything except that the result is 2. y SIGMOD 00 Privacy Preserving Data Mining Alice’s age Alice’s salary Bob’s age 30 | 70K | ... 50 | 40K | ... Randomizer Randomizer 30+35 65 | 20K | ... Reconstruct distribution of Age 25 | 60K | ... Insight: Preserve privacy at the individual level, while still building accurate data mining models at the aggregate level. Add random noise to individual values to protect privacy. EM algorithm to estimate original distribution of values given randomized values + randomization function. Algorithms for building classification models and discovering association rules on top of privacypreserved data with only small loss of accuracy. Reconstruct distribution of Salary Data Mining Algorithms Data Mining Model 1200 120 1000 100 800 80 600 60 400 40 20 200 0 0 Original Randomized 20 40 82 74 66 58 50 42 34 26 18 2 SIGMOD 00 10 10 Reconstructed 60 80 100 150 Randomization Level Original Randomized Reconstructed 200 Finessing Schema Chaos 50 50 Use a simple regular expression extractor to get numbers 40 40 30 30 20 20 Do simple data extraction to get hints 10 10 0 0 Hint for unit: the word following the number. Hint for attribute name: k following numbers. 256 MB SDRAM memory 10 20 30 40 0 50 0 10 20 30 40 100 80 60 40 20 Unit Hint: MB Attribute Hint: SDRAM, Memory Use only numbers in the queries Treat any attribute name in the query also as hint W W W 03 0 1 2 3 4 5 7 Query Size Accuracy Non-Reflectivity Randomized NonReflectivity Reflectivity estimates accuracy 50 Privacy Preserving Indexing A public mapping function that maps a query to a set of providers P that may contain the desired document P contains false negatives Providers return a document only if the searcher is authorized to access the document VLDB 03 Some Interesting Topics Current integration approaches do not scale – Information integration per se is not interesting – Static vs. dynamic plumbing Incentive compatibility Auditing interactions