Top-k Queries Across Multiple Private Databases (2005)
Li Xiong (Emory University), Subramanyam Chitti (Georgia Tech), Ling Liu (Georgia Tech)
Presented by: Cesar Gutierrez

About Me
- ISyE senior with a CS minor, graduating December 2008
- Interested in humanitarian logistics and/or supply chain
- Originally from Lima, Peru
- Enjoys travel, paintball, and politics

Outline
- Introduction & Motivation
- Problem Definition
- Important Concepts & Examples
- Private Algorithm
- Conclusion

Introduction
- Technology has lowered the barriers to information sharing
- This raises the need for distributed data-mining tools that preserve privacy
- The goal is a three-way balance: privacy, accuracy, and efficiency

Motivating Scenarios
- The CDC needs to study insurance data to detect disease outbreaks: disease incidents, disease seriousness, and patient background
- Legal and commercial constraints prevent insurers from releasing policyholder information

Motivating Scenarios (cont'd)
- Collaboration within an industrial trade group
- Useful pattern: "manufacturing using chemical supplies from supplier X has high failure rates"
- Trade secret: "manufacturing process Y gives a low failure rate"

Problem & Assumptions
- Model: n nodes with horizontally partitioned data
- Semi-honest assumption: nodes follow the specified protocol, but may attempt to learn additional information about other nodes

Challenges
- Why not use a trusted third party (TTP)?
  - Difficult to find one that all parties trust
  - Increased danger from a single point of compromise
- Why not use secure multi-party computation techniques?
  - High communication overhead; feasible only for small inputs

Recall Our 3-D Goal
- Privacy, accuracy, and efficiency

Private Max: Static Starting Point
- Naive protocol: actual data is sent on the first pass, and the starting node is known
- [Slide figure: four nodes in a ring forwarding the running maximum; the diagram's numeric labels are omitted]

Multi-Round Max
- Randomly perturbed data is passed to each node's successor over multiple passes
- The starting point is randomized
- No successor can determine the actual data of its predecessor
- [Slide figure: the same ring, carrying perturbed values in the early rounds]

Evaluation Parameters

  Parameter   Description
  n           number of nodes in the system
  k           kNN parameter
  P0          initial randomization probability in neighbor selection
  d           dampening factor in neighbor selection
  r           number of rounds in neighbor selection

- Large k: "avoid information leaks"
- Large d: more randomization, hence more privacy
- Small d: more deterministic, hence more accurate
- Large r: "as accurate as ordinary classifier"

Accuracy Results
- [Slide chart; not reproduced]

Varying Rounds
- [Slide chart; not reproduced]

Privacy Results
- [Slide chart; not reproduced]

Conclusion
- Problems tackled:
  - Preserving efficiency and accuracy while introducing provable privacy to the system
  - Improving a naive protocol
  - Reducing privacy risk in an efficient manner

Critique
- A full understanding depends on reading other research papers
- Few or no illustrations
- A real-life example would have made the charts easier to understand
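The multi-round max protocol described in the slides can be sketched in code. This is a minimal sketch, not the paper's exact algorithm: the function name `multi_round_max`, the perturbation schedule p = P0 * d^(r-1), and the choice to draw the injected value uniformly between the current global value and the node's true local maximum are all assumptions made for illustration, modeled on the parameters P0, d, and r from the evaluation table.

```python
import random

def multi_round_max(local_values, p0=0.5, d=0.5, rounds=4, seed=None):
    """Sketch of a multi-round randomized max protocol (illustrative only).

    local_values holds one private value per node (horizontal partitioning).
    Each round, every node passes max(global, local) to its successor, but
    with probability p0 * d**(r-1) it injects a random value below its true
    local max, so a successor cannot tell whether what it sees is real.
    The starting node is chosen at random, and the final round is
    deterministic, so the true maximum is still returned.
    """
    rng = random.Random(seed)
    n = len(local_values)
    g = float("-inf")          # running "global max" passed around the ring
    start = rng.randrange(n)   # randomized starting point
    for r in range(1, rounds + 1):
        p = p0 * d ** (r - 1)  # injection probability decays each round
        for i in range(n):
            node = (start + i) % n
            local = local_values[node]
            if local > g:
                if r < rounds and g != float("-inf") and rng.random() < p:
                    # Perturb: pass a plausible value below the true local max.
                    g = rng.uniform(g, local)
                else:
                    g = local
    return g

print(multi_round_max([10, 40, 30, 20], seed=0))  # prints 40.0 or 40
```

Because the last round never perturbs, the result is exact regardless of the random choices; the earlier noisy rounds are what prevent a successor from learning its predecessor's actual data.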