Privacy-preserving Data Mining:
current research and trends
Stan Matwin
School of Information Technology and Engineering
University of Ottawa, Canada
[email protected]
A few words about our research
The Université d'Ottawa / University of Ottawa is the
largest bilingual university in Canada
• Text Analysis and Machine
Learning group
• Learning with previous
knowledge
• Applied research and tech transfer
– Profiles of digital game players / active learning, co-training, instance selection
– Classification of medical abstracts
• Privacy-aware Data Mining
2
Plan
• What is [data] privacy?
• Privacy and Data Mining
• Privacy-preserving Data mining: main
approaches
– Anonymization
– Obfuscation
– Cryptographic hiding
• Challenges
– Definition of privacy
– Mining mobile data
– Mining medical data
3
Privacy
• Freedom to be left alone
• …ability to control who knows what about
us [Westin; Moor] [=database views]
• Jeopardized with the Internet – greased
data
• Moral obligation for the community to find
solutions
4
Privacy and data mining
• Individual Privacy
– Nobody should know more about any entity after the
data mining than they did before
– Approaches: Data Obfuscation, Value swapping
• Organization Privacy
– Protect knowledge about a collection of entities
– Individual entity values are not disclosed to all parties
• The results alone need not violate privacy
5
Basic ideas
• Camouflage: obfuscation
• Hiding: k-anonymity
6
Why naïve anonymization does not work
• The [Sweeney 01] experiment: purchased the voter registration list for Cambridge, MA
  – 54,805 people
• 69% were unique on postal code and birth date; 87% US-wide with all three (postal code, birth date, and gender)
• Also, we do not know with what other data sources we may do joins in the future
• Solution: k-anonymization
7
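Before moving on, a minimal sketch of the property that k-anonymization enforces may help; the toy records, choice of quasi-identifiers, and helper name below are hypothetical illustrations, not taken from the talk. The idea: every combination of quasi-identifier values must be shared by at least k individuals, so a join on those attributes can never single out one person.

```python
from collections import Counter

# Hypothetical toy table; the quasi-identifiers are (postal-code prefix, birth year).
records = [
    ("K1N", 1970), ("K1N", 1970),
    ("K1S", 1981), ("K1S", 1981), ("K1S", 1981),
]

def is_k_anonymous(rows, k):
    """True if every quasi-identifier combination occurs in at least k rows."""
    return min(Counter(rows).values()) >= k

print(is_k_anonymous(records, 2))  # True: each (prefix, year) group has >= 2 members
print(is_k_anonymous(records, 3))  # False: the ("K1N", 1970) group has only 2 members
```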
Randomization Approach: Overview
[Diagram] Each record is randomized before disclosure: Alice's age 30 becomes 65 (30 + 35) and her salary 70K becomes 20K, e.g. 30 | 70K | ... → 65 | 20K | ... and 50 | 40K | ... → 25 | 60K | .... The distributions of Age and Salary are then reconstructed from the randomized records and passed to the classification algorithm, which produces the model.
8
Reconstruction Problem
• Original values x1, x2, ..., xn from
probability distribution X (unknown)
• To hide these values, we use y1, y2, ..., yn
from known distribution Y
• Given
– x1+y1, x2+y2, ..., xn+yn
– the probability distribution of Y
Estimate the probability distribution of X.
9
Intuition (Reconstruct single
point)
• Use Bayes' rule for density functions
10
[Figure: the original distribution for Age and the probabilistic estimate of the original value of V]
Works well
[Figure: histogram of Number of People vs. Age comparing the Original, Randomized, and Reconstructed distributions]
11
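To make the randomize-then-reconstruct idea concrete, here is a rough, self-contained Python sketch; the sample size, noise range, bin width, and iteration count are arbitrary illustrative choices rather than anything prescribed in the talk. It perturbs ages with uniform noise whose distribution is public and then re-estimates the Age distribution from the perturbed values with the Bayes-rule iteration, in the spirit of the slides above.

```python
import numpy as np

rng = np.random.default_rng(0)

# "True" ages, unknown to the miner, and uniform noise with a publicly known distribution.
ages = rng.normal(40, 10, size=5000).clip(15, 80)   # original values x_1..x_n
noise = rng.uniform(-20, 20, size=ages.size)        # y_1..y_n; the distribution of Y is public
w = ages + noise                                    # only w_i = x_i + y_i is disclosed

bins = np.arange(15, 81, 5.0)                       # discretize Age into 5-year bins
centers = (bins[:-1] + bins[1:]) / 2

def f_Y(d):
    """Known density of the uniform(-20, 20) noise."""
    return np.where(np.abs(d) <= 20, 1 / 40.0, 0.0)

# Iterative Bayes-rule reconstruction of the Age distribution from w and f_Y.
fx = np.full(centers.size, 1.0 / centers.size)      # start from a uniform guess
for _ in range(50):
    post = f_Y(w[:, None] - centers[None, :]) * fx  # unnormalized posterior, one row per w_i
    post /= post.sum(axis=1, keepdims=True) + 1e-12
    fx = post.mean(axis=0)
    fx /= fx.sum()

true_hist = np.histogram(ages, bins=bins)[0] / ages.size
print(np.round(fx, 3))          # reconstructed bin probabilities...
print(np.round(true_hist, 3))   # ...track the true distribution, not any individual age
```

The estimate recovers the shape of the Age histogram reasonably well, while each disclosed w_i stays far from its x_i, which is the point of the recap on slide 12.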
Recap: Why is privacy
preserved?
• Cannot reconstruct individual values
accurately.
• Can only reconstruct distributions.
12
Classification
• Naïve Bayes
– Assumes independence between attributes.
• Decision Tree
– Correlations are weakened by randomization,
not destroyed.
13
Decision Tree Example

Age | Salary | Repeat Visitor?
 23 |  50K   | Repeat
 17 |  30K   | Repeat
 43 |  40K   | Repeat
 68 |  50K   | Single
 32 |  70K   | Single
 20 |  20K   | Repeat

Resulting tree:
Age < 25?
  Yes → Repeat
  No  → Salary < 50K?
          Yes → Repeat
          No  → Single
14
Decision Tree Experiments
[Figure: accuracy (50–100%) of the Original, Randomized, and Reconstructed models on functions Fn 1–Fn 5 at a 100% randomization level]
100% privacy: the attribute cannot be estimated (with 95% confidence) any closer than the entire range of the attribute
15
Issues
• For very high privacy, discretization will lead to a poor model
• Gaussian noise provides more privacy at higher confidence levels
• In fact, it can be de-randomized using an advanced control-theory approach [Kargupta 2003]
16
Association Rule Mining Algorithm
[Agrawal et al. 1993]
1. L1 = {large 1-itemsets}
2. for (k = 2; Lk−1 ≠ ∅; k++) do begin
3.     Ck = apriori-gen(Lk−1)
4.     for all candidates c ∈ Ck do begin
5.         compute c.count
6.     end
7.     Lk = {c ∈ Ck | c.count ≥ min-sup}
8. end
9. Return L = ∪k Lk
c.count is the frequency count for a given itemset.
Key issue: to compute the frequency count, we need to
access attributes that belong to different parties.
17
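For reference, a compact single-machine sketch of the loop above (the toy transactions and min-sup value are made up for illustration); the privacy question addressed next is precisely how to compute the c.count step when the items of a transaction are split between parties.

```python
from itertools import combinations

# Hypothetical transactions over items a, b, c, d.
transactions = [{"a", "b", "c"}, {"a", "c"}, {"a", "d"}, {"b", "c", "d"}, {"a", "b", "c"}]
min_sup = 2

def apriori(transactions, min_sup):
    # L1 = large 1-itemsets
    items = {frozenset([i]) for t in transactions for i in t}
    L = [{c for c in items if sum(c <= t for t in transactions) >= min_sup}]
    while L[-1]:
        prev = L[-1]
        k = len(next(iter(prev))) + 1
        # apriori-gen: join L_{k-1} with itself, keep candidates whose (k-1)-subsets are all large
        cand = {a | b for a in prev for b in prev if len(a | b) == k}
        cand = {c for c in cand if all(frozenset(s) in prev for s in combinations(c, k - 1))}
        # compute c.count over the data (this is the step that becomes hard across parties)
        L.append({c for c in cand if sum(c <= t for t in transactions) >= min_sup})
    return set().union(*L)

print(sorted(tuple(sorted(s)) for s in apriori(transactions, min_sup)))
```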
An Example
• c.count is the vector product.
• Let’s use A to denote Alice’s
attribute vector and B to denote
Bob’s attribute vector.
• If AB is a candidate frequent itemset,
then c.count = A • B = 3.
• How to conduct this computation
across parties without
compromising each party’s data
privacy?
Alice: A = (1, 0, 1, 1, 1)
Bob:   B = (1, 1, 1, 1, 0)
18
Homomorphic Encryption
[Paillier 1999]
• Privacy-preserving protocols are based
on Homomorphic Encryption.
• Specifically, we use the following additive
homomorphism property:
e(m1) × e(m2) × … × e(mn) = e(m1 + m2 + … + mn)
• where e is an encryption function, mi is the data to be
encrypted, and e(mi) ≠ 0.
19
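For concreteness, in Paillier's scheme this property follows from the multiplicative shape of the ciphertexts; a sketch in the standard notation (modulus n = pq, generator g, fresh randomness r1 and r2, none of which is spelled out on the slide):

```latex
e(m_1)\cdot e(m_2)
  = \bigl(g^{m_1} r_1^{\,n}\bigr)\bigl(g^{m_2} r_2^{\,n}\bigr) \bmod n^2
  = g^{\,m_1+m_2}(r_1 r_2)^{n} \bmod n^2
  = e(m_1+m_2)
```

So multiplying ciphertexts yields a valid encryption of the sum of the plaintexts (with combined randomness r1·r2).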
Digital Envelope
[Chaum85]
• A digital envelope is a random number (or a set of random
numbers) known only to the owner of the private data.
[Figure: the private value V is hidden as V + R]
20
The Objective
Requirements: Privacy, Correctness, Efficiency
Solution: Homomorphic Encryption + Digital Envelopes
21
Frequency Count Protocol
• Assume Alice’s attribute vector is A and Bob’s
attribute vector is B.
• Each vector contains N elements.
• Ai : the ith element of A.
• Bi : the ith element of B.
• One of the parties is randomly chosen as the key
generator, e.g., Alice, who generates a key pair (e, d) and an
integer X > N. e and X are shared with Bob.
• Let’s use e(.) to denote encryption and d(.) to
denote decryption.
22
Alice builds digital envelopes:
A1 + R1 × X
A2 + R2 × X
……
AN + RN × X
where R1, R2, …, RN is a set of random integers generated by Alice
23
Alice encrypts each envelope:
A1 + R1 × X → e(A1 + R1 × X)
A2 + R2 × X → e(A2 + R2 × X)
……
AN + RN × X → e(AN + RN × X)
24
Alice sends to Bob:
e(A1 + R1 × X)
e(A2 + R2 × X)
……
e(AN + RN × X)
25
Bob computes:
W1 = e(A1 + R1 × X) × B1
W2 = e(A2 + R2 × X) × B2
……
WN = e(AN + RN × X) × BN
Bi = 0 ⇒ Wi = 0
Bi = 1 ⇒ Wi = e(Ai + Ri × X) × Bi = e(Ai + Ri × X)
26
• Bob multiplies all the Wi's for those Bi's that are
not equal to 0. In other words, Bob computes
the product of all non-zero Wi's, i.e., W = ∏ Wi
where Wi ≠ 0:
W = W1 × W2 × … × Wj
27
W = W1 × W2 × … × Wj
  = [e(A1 + R1 × X) × B1] × [e(A2 + R2 × X) × B2] × … × [e(Aj + Rj × X) × Bj]
where each of B1, B2, …, Bj equals 1
28
W = W1 × W2 × … × Wj
  = [e(A1 + R1 × X) × 1] × [e(A2 + R2 × X) × 1] × … × [e(Aj + Rj × X) × 1]
29
W = W1 × W2 × … × Wj
  = e(A1 + R1 × X) × e(A2 + R2 × X) × … × e(Aj + Rj × X)
According to the property of homomorphic encryption,
  = e(A1 + A2 + … + Aj + (R1 + R2 + … + Rj) × X)
30
• Bob generates a random integer R'.
• Bob then computes
W' = W × e(R' × X)
According to the property of homomorphic encryption,
   = e(A1 + A2 + … + Aj + (R1 + R2 + … + Rj + R') × X)
and sends W' to Alice.
31
The Final Step
• Alice decrypts W' and reduces the result modulo X:
c.count
= d(e(A1 + A2 + … + Aj + (R1 + R2 + … + Rj + R') × X)) mod X
Since (A1 + A2 + … + Aj) ≤ N < X and
((R1 + R2 + … + Rj + R') × X) mod X = 0,
• she obtains A1 + A2 + … + Aj, the sum of the Ai whose
corresponding Bi equal 1, which is exactly c.count
32
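Putting slides 22–32 together, here is a rough end-to-end simulation of the two-party frequency count in Python. The toy Paillier parameters (tiny primes), the random ranges, and the helper names are illustrative assumptions, not anything prescribed by the protocol; a real deployment needs full-size keys.

```python
import math
import random

# --- Toy Paillier cryptosystem (tiny primes, for illustration only) ---
def paillier_keygen(p=2957, q=3089):
    n = p * q
    lam = math.lcm(p - 1, q - 1)
    mu = pow(lam, -1, n)              # valid because we take g = n + 1
    return (n, n + 1), (lam, mu)      # public key (n, g), private key (lam, mu)

def paillier_encrypt(pub, m):
    n, g = pub
    r = random.randrange(1, n)
    while math.gcd(r, n) != 1:
        r = random.randrange(1, n)
    return (pow(g, m, n * n) * pow(r, n, n * n)) % (n * n)

def paillier_decrypt(pub, priv, c):
    n, _ = pub
    lam, mu = priv
    u = pow(c, lam, n * n)
    return ((u - 1) // n) * mu % n

# --- Two-party frequency count: Alice is the key generator ---
A = [1, 0, 1, 1, 1]                   # Alice's binary attribute column (toy values, A·B = 3)
B = [1, 1, 1, 1, 0]                   # Bob's binary attribute column
N = len(A)

pub, priv = paillier_keygen()
X = N + 1                             # public integer X > N, shared with Bob

# Alice: digital envelopes A_i + R_i*X, encrypted and sent to Bob.
R = [random.randrange(1, 50) for _ in range(N)]
to_bob = [paillier_encrypt(pub, A[i] + R[i] * X) for i in range(N)]

# Bob: multiply the ciphertexts with B_i = 1, then blind the product with a fresh R'.
n_sq = pub[0] ** 2
W = 1
for i in range(N):
    if B[i] == 1:
        W = W * to_bob[i] % n_sq
R_prime = random.randrange(1, 50)
W_prime = W * paillier_encrypt(pub, R_prime * X) % n_sq    # sent back to Alice

# Alice: decrypt and reduce mod X; the R terms vanish and only sum(A_i for B_i = 1) remains.
c_count = paillier_decrypt(pub, priv, W_prime) % X
print(c_count, "==", sum(a * b for a, b in zip(A, B)))      # 3 == 3
```

Decryption yields the blinded sum A1 + … + Aj + (R1 + … + Rj + R')·X, and reducing modulo X strips the random part, leaving c.count without either party seeing the other's column.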
Privacy Analysis
Goal: Bob never sees Alice’s data values.
• All the information that Bob obtains from Alice is
e(A1 + R1 × X), e(A2 + R2 × X), …, e(AN + RN × X).
• Since Bob doesn’t know the decryption key d,
he cannot get Alice’s original data values.
33
Privacy Analysis
Goal: Alice never sees Bob’s data values.
The information that Alice obtains from Bob is
W' = e(A1 + A2 + … + Aj + (R1 + R2 + … + Rj + R') × X), built from the terms with Bi = 1.
Alice computes d(W') mod X. She only obtains the frequency
count and cannot learn Bob's original data values.
34
Complexity Analysis
Linear in the number of transactions
The communication cost is α(N + 1), where N is the total number of
transactions (the number of elements in each attribute vector) and α is
the number of bits of each encrypted element.
35
Complexity Analysis
Linear in the number of transactions
The computational cost is (10N + 20 + g), where N is the total number of
transactions and g is the computational cost of generating a key pair.
36
Other Privacy-Oriented Protocols
Multi-Party Frequency Count Protocol
[Zhan et al. 2005 (a)]
Multi-Party Summation Protocol
[Zhan et al. 2005 (f)]
Multi-Party Comparison Protocol
[Zhan et al. 2006 (a)]
Multi-Party Sorting Protocol
[Zhan et al. 2006 (a)]
37
What about the results of DM?
• Can DM results reveal personal information?
• In some cases, yes [Atzori et al. 05]:
Suppose an association rule is found:
a1 ∧ a2 ∧ a3 ⇒ a4 [sup = 80, conf = 98.7%]
This means sup({a1, a2, a3, a4}) = 80,
then sup({a1, a2, a3}) = sup({a1, a2, a3, a4}) / conf = 80 / 0.987 ≈ 81.05,
and since supports are integer counts, sup({a1, a2, a3}) = 81;
therefore a1 ∧ a2 ∧ a3 ∧ ¬a4 has support = 81 − 80 = 1, and
identifies one person!!
38
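The arithmetic of the inference can be checked directly (values taken from the rule above):

```python
sup_full, conf = 80, 0.987
sup_abc = round(sup_full / conf)   # 80 / 0.987 ≈ 81.05 → 81, since supports are integer counts
sup_neg = sup_abc - sup_full       # support of a1 ∧ a2 ∧ a3 ∧ ¬a4
print(sup_abc, sup_neg)            # 81 1
```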
• They propose an approach called k-anonymous
patterns and an algorithm (inference channels)
which detects violations of k-anonymity
• The algorithm is computationally expensive
• We have a new approach which embeds k-anonymity into the concept-lattice
association rule algorithm [Zaki, Ogihara 98]
39
Conclusion
• Important problem, challenge for the field
• Lots of creative work, but lack of systematic
approach
• Medical data is particularly sensitive, but it also
makes re-identification easier: genotype-phenotype inferences, location-visit patterns,
family structures, etc. [Malin 2005]
• Lack of an operational, agreed upon definition of
privacy: inspiration in economics?
40