Top-k Queries Across Multiple Private Databases (2005)
Li Xiong (Emory University), Subramanyam Chitti (Georgia Tech), Ling Liu (Georgia Tech)
Presented by: Cesar Gutierrez

About Me
- ISyE senior with a CS minor, graduating December 2008
- Interested in humanitarian logistics and/or supply chain
- Originally from Lima, Peru
- Enjoys travel, paintball, and politics

Outline
- Introduction & Motivation
- Problem Definition
- Important Concepts & Examples
- Private Algorithm
- Conclusion

Introduction
- Technology has lowered the barriers to information sharing
- This raises the need for distributed data-mining tools that preserve privacy
- The goal is a three-way balance: privacy, accuracy, and efficiency

Motivating Scenarios
- The CDC needs to study insurance data to detect disease outbreaks: disease incidents, disease seriousness, and patient background
- Legal and commercial constraints prevent insurers from releasing policyholder information

Motivating Scenarios (cont'd)
- Collaboration within an industrial trade group
- Useful pattern: "manufacturing using chemical supplies from supplier X has high failure rates"
- Trade secret: "manufacturing process Y gives a low failure rate"

Problem & Assumptions
- Model: n nodes with horizontally partitioned data
- Semi-honest assumption: nodes follow the specified protocol, but may attempt to learn additional information about other nodes

Challenges
- Why not use a trusted third party (TTP)?
  - Difficult to find one that all parties trust
  - Increased danger from a single point of compromise
- Why not use secure multi-party computation techniques?
  - High communication overhead; feasible only for small inputs

Recall Our 3-D Goal
- Privacy, accuracy, and efficiency

Private Max: Static Starting Point
- Naive protocol: actual data is sent on the first pass, and the starting node is known
- [Slide figure: four nodes in a ring forwarding the running maximum; the diagram's numeric labels are omitted]

Multi-Round Max
- Randomly perturbed data is passed to each node's successor over multiple passes
- The starting point is randomized
- No successor can determine the actual data of its predecessor
- [Slide figure: the same ring, carrying perturbed values in the early rounds]

Evaluation Parameters

  Parameter   Description
  n           number of nodes in the system
  k           kNN parameter
  P0          initial randomization probability in neighbor selection
  d           dampening factor in neighbor selection
  r           number of rounds in neighbor selection

- Large k: "avoid information leaks"
- Large d: more randomization, hence more privacy
- Small d: more deterministic, hence more accurate
- Large r: "as accurate as ordinary classifier"

Accuracy Results
- [Slide chart; not reproduced]

Varying Rounds
- [Slide chart; not reproduced]

Privacy Results
- [Slide chart; not reproduced]

Conclusion
- Problems tackled:
  - Preserving efficiency and accuracy while introducing provable privacy to the system
  - Improving a naive protocol
  - Reducing privacy risk in an efficient manner

Critique
- A full understanding depends on reading other research papers
- Few or no illustrations
- A real-life example would have made the charts easier to understand
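The multi-round max protocol described in the slides can be sketched in code. This is a minimal sketch, not the paper's exact algorithm: the function name `multi_round_max`, the perturbation schedule p = P0 * d^(r-1), and the choice to draw the injected value uniformly between the current global value and the node's true local maximum are all assumptions made for illustration, modeled on the parameters P0, d, and r from the evaluation table.

```python
import random

def multi_round_max(local_values, p0=0.5, d=0.5, rounds=4, seed=None):
    """Sketch of a multi-round randomized max protocol (illustrative only).

    local_values holds one private value per node (horizontal partitioning).
    Each round, every node passes max(global, local) to its successor, but
    with probability p0 * d**(r-1) it injects a random value below its true
    local max, so a successor cannot tell whether what it sees is real.
    The starting node is chosen at random, and the final round is
    deterministic, so the true maximum is still returned.
    """
    rng = random.Random(seed)
    n = len(local_values)
    g = float("-inf")          # running "global max" passed around the ring
    start = rng.randrange(n)   # randomized starting point
    for r in range(1, rounds + 1):
        p = p0 * d ** (r - 1)  # injection probability decays each round
        for i in range(n):
            node = (start + i) % n
            local = local_values[node]
            if local > g:
                if r < rounds and g != float("-inf") and rng.random() < p:
                    # Perturb: pass a plausible value below the true local max.
                    g = rng.uniform(g, local)
                else:
                    g = local
    return g

print(multi_round_max([10, 40, 30, 20], seed=0))  # prints 40.0 or 40
```

Because the last round never perturbs, the result is exact regardless of the random choices; the earlier noisy rounds are what prevent a successor from learning its predecessor's actual data.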