Download Mining Multiple Private Databases

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts
no text concepts found
Transcript
Mining Multiple Private
Databases
Topk Queries Across Multiple Private Databases (2005)
Li Xiong (Emory University)
Subramanyam Chitti (GA Tech)
Ling Liu (GA Tech)
Presented by: Cesar Gutierrez
About Me
2

ISYE Senior and CS minor

Graduating December, 2008

Humanitarian Logistics and/or Supply Chain

Originally from Lima, Peru

Travel, paintball and politics
Outline
3

Intro. & Motivation

Problem Definition

Important Concepts & Examples

Private Algorithm

Conclusion
Introduction



↓ of information-sharing restrictions due
to technology
↑ need for distributed data-mining tools
that preserve privacy
Accuracy
Trade-off
Efficiency
4
Privacy
Motivating Scenarios


5
CDC needs to study insurance data to detect
disease outbreaks

Disease incidents

Disease seriousness

Patient Background
Legal/Commercial Problems prevent release
of policy holder's information
Motivating Scenarios (cont'd)

6
Industrial trade group collaboration

Useful pattern: "manufacturing using chemical
supplies from supplier X have high failure rates"

Trade secret: "manufacturing process Y gives low
failure rate"
Problem & Assumptions

Model: n nodes, horizontal partitioning
...

7
Assume Semi-honesty:

Nodes follow specified protocol

Nodes attempt to learn additional information
about other nodes
Challenges


8
Why not use a Trusted Third Party (TTP)?

Difficult to find one that is trusted

Increased danger from single point of
compromise
Why not use secure multi-party computation
techniques?

High communication overhead

Feasible for small inputs only
Recall Our 3-D Goal
Accuracy
Efficiency
9
Privacy
Private Max


Actual Data sent on
first pass
start
Static Starting Point
Known
30
2
1
30
10
40
30
40
20
4
3
40
10
Multi-Round Max

Randomly perturbed
data passed to
successor during
multiple passes
Start
18
0
32
35
D2
D2


11
No successor can
determine actual data
from it's predecessor
Randomized Starting
Point
30
32
35
10
40
18
20
40
D4
D3
32
35
40
32
35
Evaluation Parameters
Parameter Description
n
# of nodes in the system
k
KNN parameter
Po
Initial randomization probability in neighbor selection
d
Dampening factor in neighbor selection
r
# of rounds in neighbor selection
12

Large k = "avoid information leaks"

Large d = more randomization = more privacy

Small d = more accurate (deterministic)

Large r = "as accurate as ordinary classifier"
Accuracy Results
13
Varying Rounds
14
Privacy Results
15
Conclusion

16
Problems Tackled

Preserving efficiency and accuracy while
introducing provable privacy to the system

Improving a naive protocol

Reducing privacy risk in an efficient manner
Critique



17
Dependency on other research papers in
order to obtain a full understanding
Few/No Illustrations
A real life example would have created a
better understanding of the charts