On the Use of Spectral Filtering for
Privacy Preserving Data Mining
Songtao Guo
Xintao Wu
UNC Charlotte
UNC Charlotte
SAC’06 April 23-27, 2006, Dijon, France
Source: http://www.privacyinternational.org/issues/foia/foialaws.jpg
PIPEDA 2000
European Union (Directive 95/46/EC)
HIPAA for health care
California State Bill 1386
Gramm-Leach-Bliley Act for financial institutions
COPPA for children’s online privacy
Source: http://www.privacyinternational.org/survey/dpmap.jpg
Mining vs. Privacy
Data mining
The goal of data mining is to obtain summary results (e.g., classification, clustering, association rules) from the data (distribution)
Individual Privacy
Individual values in the database must not be disclosed, or at least no close estimate of them can be derived by attackers
Privacy Preserving Data Mining (PPDM)
How to “perturb” data such that
we can build a good data mining model (data utility)
while preserving individual’s privacy at the record
level (privacy)?
Outline
Additive Randomization
Distribution Reconstruction
  Bayesian Method, Agrawal & Srikant SIGMOD00
  EM Method, Agrawal & Aggarwal PODS01
Individual Value Reconstruction
  Spectral Filtering, H. Kargupta ICDM03
  PCA Technique, Du et al. SIGMOD05
Error Bound Analysis for Spectral Filtering
  Upper Bound
Conclusion and Future Work
Additive Randomization
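To hide the sensitive data by randomly modifying the data values using some additive noise
$\tilde{U} = U + V$
where $U$ is the original data, $V$ is the noise, and $\tilde{U}$ is the perturbed data
Privacy preserving aims at keeping $\|\tilde{U} - U\|$ and $\|\hat{U} - U\|$ large, where $\hat{U}$ is an attacker's estimate of $U$
Utility preserving aims at
The aggregate characteristics remain unchanged or can be recovered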
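As a rough illustration of the two goals, the following minimal sketch perturbs a made-up data matrix with zero-mean Gaussian noise and then looks at both sides: the record-level distortion and the recovery of simple aggregates. The data, the noise level `sigma`, and all variable names are assumptions chosen for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical original data U: m records, n attributes (values are made up).
m, n = 10_000, 3
U = rng.normal(loc=[30.0, 50.0, 70.0], scale=[5.0, 8.0, 3.0], size=(m, n))

# Additive noise V with a published distribution: zero mean, known sigma.
sigma = 10.0
V = rng.normal(0.0, sigma, size=(m, n))

U_tilde = U + V  # the perturbed data that would be released

# Privacy side: record-level distortion, here the Frobenius norm of U_tilde - U.
distortion = np.linalg.norm(U_tilde - U)

# Utility side: attribute means are nearly unchanged, and attribute variances
# can be approximately recovered by subtracting the known noise variance.
mean_gap = np.abs(U_tilde.mean(axis=0) - U.mean(axis=0))
var_recovered = U_tilde.var(axis=0) - sigma**2
```

The variance correction works only because the noise is independent of the data and its variance is known; that is the same published information the later reconstruction attacks rely on.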
Distribution Reconstruction
The original density distribution can be reconstructed effectively given the perturbed data and the noise's distribution (Agrawal & Srikant, SIGMOD 2000)
Independent random noises with any distribution
$f_X^0$ := Uniform distribution
j := 0 // Iteration number
repeat
    $f_X^{j+1}(a) := \frac{1}{n} \sum_{i=1}^{n} \frac{f_Y((x_i + y_i) - a)\, f_X^j(a)}{\int_{-\infty}^{\infty} f_Y((x_i + y_i) - z)\, f_X^j(z)\, dz}$
    j := j+1
until (stopping criterion met)
[Figure: Number of People vs. Age for the Original, Randomized, and Reconstructed distributions]
It cannot reconstruct individual values
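The iterative update can be sketched on a discretized grid as follows. This is a minimal illustration, not the authors' implementation: the integral is replaced by a sum over a uniform grid, the stopping criterion is simply a fixed number of iterations, and the ages, noise level, and function names are hypothetical.

```python
import numpy as np

def reconstruct_density(w, noise_pdf, grid, n_iter=200):
    """Estimate the density of X on a uniform grid from w_i = x_i + y_i."""
    da = grid[1] - grid[0]                           # grid spacing
    f = np.full(grid.size, 1.0 / (grid.size * da))   # f_X^0: uniform
    lik = noise_pdf(w[:, None] - grid[None, :])      # lik[i, a] = f_Y(w_i - a)
    for _ in range(n_iter):                          # fixed iteration budget
        num = lik * f                                # f_Y(w_i - a) * f_X^j(a)
        den = num.sum(axis=1, keepdims=True) * da    # integral approximated on the grid
        f = (num / np.maximum(den, 1e-300)).mean(axis=0)  # average over records
    return f

rng = np.random.default_rng(0)
sigma = 10.0                                   # published noise std (made up)
x = rng.normal(40.0, 8.0, size=2000)           # hypothetical original ages
w = x + rng.normal(0.0, sigma, size=x.size)    # randomized values seen by the miner

def gaussian_pdf(t):
    return np.exp(-0.5 * (t / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

grid = np.linspace(-20.0, 100.0, 121)
f_hat = reconstruct_density(w, gaussian_pdf, grid)
da = grid[1] - grid[0]
total_mass = f_hat.sum() * da                  # should stay at 1
mean_est = (grid * f_hat).sum() * da           # mean of the reconstructed density
```

Only the randomized values `w` and the noise density enter the reconstruction, which is why the output is a distribution over ages and not an estimate of any individual record.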
Individual Value Reconstruction
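Spectral Filtering, Kargupta et al. ICDM 2003
1. Apply EVD: $\tilde{U}^T \tilde{U} = \tilde{Q} \tilde{\Lambda} \tilde{Q}^T$
2. Using some published information about V, extract the first k components of $\tilde{\Lambda}$ as the principal components.
   – $\tilde{\lambda}_1 \ge \tilde{\lambda}_2 \ge \cdots \ge \tilde{\lambda}_k$ and $\tilde{e}_1, \tilde{e}_2, \ldots, \tilde{e}_k$ are the corresponding eigenvectors.
   – $\tilde{X} = [\tilde{e}_1\ \tilde{e}_2\ \cdots\ \tilde{e}_k]$ forms an orthonormal basis of a subspace $\tilde{\mathcal{X}}$.
3. Find the orthogonal projection onto $\tilde{\mathcal{X}}$: $\tilde{P} = \tilde{X} \tilde{X}^T$
4. Get estimate data set: $\hat{U} = \tilde{U} \tilde{P}$
PCA Technique, Huang, Du and Chen, SIGMOD 05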
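The four steps can be sketched in a few lines of NumPy. This is a minimal sketch under stated assumptions: `k` is passed in directly instead of being derived from published information about the noise, and the rank-3 test data, the noise level, and the function name `spectral_filter` are made up for illustration.

```python
import numpy as np

def spectral_filter(U_tilde, k):
    """Project perturbed data onto the top-k eigenvectors of U_tilde^T U_tilde."""
    A_tilde = U_tilde.T @ U_tilde                # n x n matrix to decompose
    eigvals, eigvecs = np.linalg.eigh(A_tilde)   # step 1: EVD (ascending order)
    top = np.argsort(eigvals)[::-1][:k]          # step 2: indices of the k largest
    X = eigvecs[:, top]                          # orthonormal basis, n x k
    P = X @ X.T                                  # step 3: orthogonal projection
    return U_tilde @ P, P                        # step 4: estimated data set

rng = np.random.default_rng(0)
m, n, k = 2000, 8, 3
basis = np.linalg.qr(rng.normal(size=(n, k)))[0].T        # k x n, orthonormal rows
U = (rng.normal(size=(m, k)) * [3.0, 2.0, 1.0]) @ basis   # correlated, rank-k data
V = rng.normal(0.0, 0.5, size=(m, n))                     # i.i.d. Gaussian noise

U_hat, P = spectral_filter(U + V, k)
err_perturbed = np.linalg.norm(V)          # ||U_tilde - U||_F
err_filtered = np.linalg.norm(U_hat - U)   # ||U_hat - U||_F
```

On strongly correlated data like this, the filtered estimate is closer to the original than the released data is, which is the privacy concern the rest of the talk quantifies.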
Motivation
Previous work on individual value reconstruction is only empirical
The relationship between the estimation accuracy and the noise was not clear
Two questions
Attacker question: How close is the data estimated using SF to the original data?
Data owner question: How much noise should be added to preserve privacy at a given tolerated level?
Our Work
Investigate the explicit relationship between the estimation
accuracy and the noise
Derive one upper bound of $\|\hat{U} - U\|_F$ in terms of V
The upper bound determines how close the data estimated by attackers is to the original data
It poses a serious threat of privacy breaches
Preliminary
F-norm and 2-norm
$\|A\|_F = \sqrt{\sum_{i=1}^{m} \sum_{j=1}^{n} a_{ij}^2}$
$\|A\|_2 = \max_{x \ne 0} \frac{\|Ax\|_2}{\|x\|_2}$
Some properties
$\|AB\|_F \le \|A\|_F \|B\|_F$ and $\|AB\|_2 \le \|A\|_2 \|B\|_2$
$\|A\|_2 \le \|A\|_F \le \sqrt{n}\, \|A\|_2$
$\|A\|_2 = \sqrt{\lambda_{\max}(A^T A)}$, the square root of the largest eigenvalue of $A^T A$
If $A$ is symmetric, then $\|A\|_2 = \max_i |\lambda_i(A)|$, the largest eigenvalue of $A$ in absolute value
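These properties are easy to check numerically. The sketch below does so on small random matrices; the matrix sizes, the seed, and the helper names `fro` and `two` are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(6, 4))          # m = 6 rows, n = 4 columns
B = rng.normal(size=(4, 5))
S = A.T @ A - 2.0 * np.eye(4)        # a symmetric matrix, not necessarily PSD

def fro(M):
    return np.linalg.norm(M, "fro")  # Frobenius norm

def two(M):
    return np.linalg.norm(M, 2)      # spectral norm (largest singular value)

tol = 1e-9
checks = {
    "F-norm submultiplicative": fro(A @ B) <= fro(A) * fro(B) + tol,
    "2-norm submultiplicative": two(A @ B) <= two(A) * two(B) + tol,
    "2-norm <= F-norm": two(A) <= fro(A) + tol,
    "F-norm <= sqrt(n) * 2-norm": fro(A) <= np.sqrt(A.shape[1]) * two(A) + tol,
    "2-norm via eigenvalues of A^T A": np.isclose(
        two(A), np.sqrt(np.linalg.eigvalsh(A.T @ A).max())
    ),
    "symmetric case": np.isclose(two(S), np.abs(np.linalg.eigvalsh(S)).max()),
}
```

For a symmetric matrix that is not positive semidefinite, the 2-norm is the eigenvalue of largest magnitude, which is why the last check takes absolute values.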
Matrix Perturbation
Traditional matrix perturbation theory
How the derived perturbation E affects the covariance matrix A
Our scenario
How the primary perturbation V affects the data matrix U
$\tilde{A} = \tilde{U}^T \tilde{U} = (U + V)^T (U + V) = U^T U + V^T U + U^T V + V^T V = A + E$
where $A = U^T U$ and $E = V^T U + U^T V + V^T V$
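A quick numerical check of this expansion, with $A = U^T U$ and $E = V^T U + U^T V + V^T V$, on made-up matrices (sizes and noise level are arbitrary assumptions):

```python
import numpy as np

rng = np.random.default_rng(3)
U = rng.normal(size=(500, 4))            # hypothetical data matrix
V = rng.normal(0.0, 0.3, size=(500, 4))  # primary perturbation

A = U.T @ U                              # original covariance-type matrix
E = V.T @ U + U.T @ V + V.T @ V          # derived perturbation
A_tilde = (U + V).T @ (U + V)            # matrix seen after perturbation

identity_holds = np.allclose(A_tilde, A + E)
E_is_symmetric = np.allclose(E, E.T)
```

Because E is symmetric, the eigenvalue perturbation results for symmetric matrices apply to the pair A and A + E.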
Error Bound Analysis
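$\hat{U} - U = \tilde{U}\tilde{P} - UP = U(\tilde{P} - P) + V\tilde{P}$
$\|\hat{U} - U\|_F \le \|U\|_F \, \|\tilde{P} - P\|_F + \|V\tilde{P}\|_F$
where $P$ and $\tilde{P}$ are the projections onto the principal subspaces of $A$ and $\tilde{A}$
Prop 1. Let the covariance matrix of the perturbed data be $\tilde{A} = A + E$. Given $\|E\|_F$ and $\delta = \lambda_k - \lambda_{k+1}$ (eigengap),
$\|\tilde{P} - P\|_F \le \frac{2\|E\|_F}{\delta - 2\|E\|_F}$
Prop 2. Let $\epsilon_1 \le \epsilon_2 \le \cdots \le \epsilon_n$ be the eigenvalues of $E$. Then
$\tilde{\lambda}_i \in [\lambda_i + \epsilon_1,\ \lambda_i + \epsilon_n]$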
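The decomposition of the error into a subspace term and a noise term, and the eigenvalue interval of Prop 2, can be illustrated numerically. The sketch assumes the original data has exactly rank k, so that projecting U onto its own principal subspace leaves it unchanged; the data, noise level, and helper name `top_k_projector` are hypothetical.

```python
import numpy as np

def top_k_projector(A, k):
    """Orthogonal projector onto the span of the k leading eigenvectors of A."""
    vals, vecs = np.linalg.eigh(A)
    X = vecs[:, np.argsort(vals)[::-1][:k]]
    return X @ X.T

rng = np.random.default_rng(2)
m, n, k = 2000, 8, 3
basis = np.linalg.qr(rng.normal(size=(n, k)))[0].T        # k x n, orthonormal rows
U = (rng.normal(size=(m, k)) * [3.0, 2.0, 1.0]) @ basis   # exactly rank k (assumption)
V = rng.normal(0.0, 0.1, size=(m, n))
U_tilde = U + V

A, A_tilde = U.T @ U, U_tilde.T @ U_tilde
E = A_tilde - A
P, P_tilde = top_k_projector(A, k), top_k_projector(A_tilde, k)
U_hat = U_tilde @ P_tilde

# ||U_hat - U||_F  versus  ||U||_F ||P_tilde - P||_F + ||V P_tilde||_F
lhs = np.linalg.norm(U_hat - U)
rhs = np.linalg.norm(U) * np.linalg.norm(P_tilde - P) + np.linalg.norm(V @ P_tilde)

# Eigenvalue interval: each perturbed eigenvalue lies within
# [lambda_i + min eig(E), lambda_i + max eig(E)], matching eigenvalues by rank.
lam = np.linalg.eigvalsh(A)            # ascending
lam_t = np.linalg.eigvalsh(A_tilde)    # ascending
eps = np.linalg.eigvalsh(E)            # ascending
tol = 1e-6
interval_ok = bool(np.all(lam_t >= lam + eps[0] - tol)
                   and np.all(lam_t <= lam + eps[-1] + tol))
```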
Theorem
Given a data set $U \in R^{m \times n}$ and a noise set $V \in R^{m \times n}$, we have the perturbed data set $\tilde{U} = U + V$. Let $\hat{U}$ be the estimation obtained from the Spectral Filtering, then
$\|\hat{U} - U\|_F \le \|U\|_F \, \frac{2\|E\|_F}{(\tilde{\lambda}_k - \|E\|_2) - 2\|E\|_F} + \|V\tilde{P}\|_F$
where $E = V^T U + U^T V + V^T V$ is the derived perturbation on the original covariance matrix $A = U^T U$
Proof is skipped
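To get a feel for the bound, the sketch below evaluates a right-hand side of the form $\|U\|_F \cdot 2\|E\|_F / ((\tilde{\lambda}_k - \|E\|_2) - 2\|E\|_F) + \|V\tilde{P}\|_F$ and compares it with the actual spectral filtering error. The exact form of the expression is a reading of the slide and should be treated as an assumption, as are the rank-k test data, the small noise level, and the function name `sf_upper_bound`. The function returns infinity when the denominator is not positive, since the expression is then uninformative.

```python
import numpy as np

def sf_upper_bound(U, V, k):
    """Return (bound, U_hat) for spectral filtering with k components."""
    U_tilde = U + V
    A_tilde = U_tilde.T @ U_tilde
    E = A_tilde - U.T @ U                       # derived perturbation
    vals, vecs = np.linalg.eigh(A_tilde)        # ascending eigenvalues
    order = np.argsort(vals)[::-1]
    lam_k = vals[order[k - 1]]                  # k-th largest perturbed eigenvalue
    X = vecs[:, order[:k]]
    P_tilde = X @ X.T
    U_hat = U_tilde @ P_tilde
    E_F = np.linalg.norm(E, "fro")
    E_2 = np.linalg.norm(E, 2)
    denom = (lam_k - E_2) - 2.0 * E_F
    if denom <= 0:                              # noise too large for this expression
        return np.inf, U_hat
    bound = np.linalg.norm(U) * 2.0 * E_F / denom + np.linalg.norm(V @ P_tilde)
    return bound, U_hat

rng = np.random.default_rng(6)
m, n, k = 2000, 8, 3
basis = np.linalg.qr(rng.normal(size=(n, k)))[0].T
U = (rng.normal(size=(m, k)) * [3.0, 2.0, 1.0]) @ basis   # rank-k data (assumption)
V = rng.normal(0.0, 0.05, size=(m, n))                    # small i.i.d. noise

bound, U_hat = sf_upper_bound(U, V, k)
actual = np.linalg.norm(U_hat - U)
```

Note that evaluating this expression needs U and V themselves, so it is a tool for the data owner's analysis; the special cases that follow replace those quantities with ones based on the noise alone.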
Special Cases
When the noise matrix is generated by i.i.d. Gaussian distribution with zero mean and known variance
$\|\hat{U} - U\|_F \le \|U\|_F \, \frac{2\|V\|_F^2}{(\tilde{\lambda}_k - \|E\|_2) - 2\|V\|_F^2} + \sqrt{k/n}\, \|V\|_F$
When the noise is completely correlated with data
$\|\hat{U} - U\|_F \le \|U\|_F \, \frac{2\|V\|_F^2}{(\tilde{\lambda}_k - \|E\|_2) - 2\|V\|_F^2} + \|V\|_F$
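The difference between the two cases comes from how much of the noise survives the projection onto the k-dimensional principal subspace. A minimal sketch, assuming rank-k data and made-up noise levels: for i.i.d. noise roughly a fraction $\sqrt{k/n}$ of $\|V\|_F$ survives, while noise of the form $V = cU$ (used here as a stand-in for completely correlated noise, which is an assumption) survives in full.

```python
import numpy as np

def top_k_projector(A, k):
    vals, vecs = np.linalg.eigh(A)
    X = vecs[:, np.argsort(vals)[::-1][:k]]
    return X @ X.T

rng = np.random.default_rng(7)
m, n, k = 2000, 8, 3
basis = np.linalg.qr(rng.normal(size=(n, k)))[0].T
U = (rng.normal(size=(m, k)) * [3.0, 2.0, 1.0]) @ basis   # rank-k data (assumption)

# Case: i.i.d. zero-mean Gaussian noise.
V_iid = rng.normal(0.0, 0.3, size=(m, n))
P_iid = top_k_projector((U + V_iid).T @ (U + V_iid), k)
ratio_iid = np.linalg.norm(V_iid @ P_iid) / (np.sqrt(k / n) * np.linalg.norm(V_iid))

# Case: noise completely correlated with the data, modelled as V = c * U.
c = 0.5
V_corr = c * U
P_corr = top_k_projector((U + V_corr).T @ (U + V_corr), k)
ratio_corr = np.linalg.norm(V_corr @ P_corr) / np.linalg.norm(V_corr)
```

`ratio_iid` close to 1 means the i.i.d. noise term is well approximated by $\sqrt{k/n}\,\|V\|_F$; `ratio_corr` equal to 1 means the projection removes none of the correlated noise.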
Experimental Results
Artificial Dataset
35 correlated variables
30,000 tuples
[Figure: the 35 variables, labeled F1 through F35]
Experimental Results
Scenarios of noise addition
Case 1: i.i.d. Gaussian noise
N(0,COV), where COV = $\mathrm{diag}(\sigma^2, \ldots, \sigma^2)$
Case 2: Independent Gaussian noise
N(0,COV), where COV = $c \cdot \mathrm{diag}(\sigma_1^2, \ldots, \sigma_n^2)$
Case 3: Correlated Gaussian noise
N(0,COV), where COV = $c \cdot \Sigma_U$ (or c * A……)
Measure
Absolute error
$ae(U, \hat{U}) = \|U - \hat{U}\|_F$
Relative error
$re(U, \hat{U}) = \frac{\|U - \hat{U}\|_F}{\|U\|_F}$
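The three noise scenarios and the two error measures can be sketched as follows. The data set here is a small made-up stand-in, not the 35-variable data set of the experiments, and the sketch assumes that the $\sigma_i^2$ in Case 2 are the per-attribute variances of the data and that $\Sigma_U$ is the sample covariance of U; the function names and the values of `sigma2` and `c` are illustrative.

```python
import numpy as np

def absolute_error(U, U_hat):
    return np.linalg.norm(U - U_hat)             # Frobenius norm of the difference

def relative_error(U, U_hat):
    return absolute_error(U, U_hat) / np.linalg.norm(U)

def gaussian_noise(rng, m, cov):
    """Draw m noise records from N(0, cov)."""
    return rng.multivariate_normal(np.zeros(cov.shape[0]), cov, size=m)

rng = np.random.default_rng(4)
m, n = 1000, 5
U = rng.normal(size=(m, n)) @ rng.normal(size=(n, n))   # hypothetical correlated data

sigma2, c = 0.5, 0.1
covs = {
    "case 1": sigma2 * np.eye(n),              # i.i.d. noise
    "case 2": c * np.diag(U.var(axis=0)),      # independent, attribute-scaled (assumption)
    "case 3": c * np.cov(U, rowvar=False),     # correlated like the data (assumption)
}
noises = {name: gaussian_noise(rng, m, cov) for name, cov in covs.items()}

# Relative error of the released data itself, before any filtering attack.
released_error = {name: relative_error(U, U + V) for name, V in noises.items()}
```

An attack such as spectral filtering would then be scored by calling `relative_error(U, U_hat)` on its estimate and comparing against `released_error`.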
Determining k
Determine k in Spectral Filtering
According to Matrix Perturbation Theory
$\max_i \{ |\tilde{\lambda}_i - \lambda_i| \} \le \|E\|_2$
Our heuristic approach:
check $\tilde{\lambda}_i < \|E\|_2$
$K = \min \{\, i \mid \tilde{\lambda}_i < \|E\|_2 \,\}$
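A small numerical illustration of the perturbation bound used here: no eigenvalue moves by more than $\|E\|_2$. The sketch also counts the perturbed eigenvalues that exceed $\|E\|_2$, since by the bound such eigenvalues cannot come from noise alone. Using that count as the number of retained components is an assumption of this sketch and not necessarily the exact rule on the slide; the rank-3 data and the noise level are made up.

```python
import numpy as np

rng = np.random.default_rng(5)
m, n, k_true = 2000, 8, 3
basis = np.linalg.qr(rng.normal(size=(n, k_true)))[0].T
U = (rng.normal(size=(m, k_true)) * [3.0, 2.0, 1.0]) @ basis   # rank-3 data (assumption)
V = rng.normal(0.0, 0.1, size=(m, n))

A = U.T @ U
A_tilde = (U + V).T @ (U + V)
E = A_tilde - A

lam = np.sort(np.linalg.eigvalsh(A))[::-1]              # original, descending
lam_tilde = np.sort(np.linalg.eigvalsh(A_tilde))[::-1]  # perturbed, descending
E_2 = np.linalg.norm(E, 2)

max_shift = np.max(np.abs(lam_tilde - lam))     # largest eigenvalue movement
num_above = int(np.sum(lam_tilde > E_2))        # eigenvalues not explainable by noise
```

In practice an attacker does not know E and would have to estimate $\|E\|_2$ from the published noise distribution; the sketch computes it from U and V only to illustrate the bound.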
Effect of varying k (case 1)
N(0,COV), where COV = $\mathrm{diag}(\sigma^2, \ldots, \sigma^2)$
Relative error
||V||F     229     323     561     725     1025
σ²         0.05    0.10    0.3     0.5     1.0
K=1        0.43    0.44    0.45    0.46    0.48
K=2        0.22    0.23    0.26    0.29    0.36
K=3        0.16    0.18    0.24    0.29    *0.31
K=4        *0.09   *0.12   *0.22   *0.28   0.40
K=5        0.10    0.14    0.25    0.32    0.45
Effect of varying k (case 2)
N(0,COV), where COV = $c \cdot \mathrm{diag}(\sigma_1^2, \sigma_2^2, \ldots, \sigma_n^2)$
Relative error
||V||F     229     323     561     725     1025
c          0.07    0.15    0.44    0.74    1.45
K=1        0.44    0.44    0.45    0.46    0.49
K=2        0.22    0.23    0.27    *0.30   *0.36
K=3        0.16    0.18    0.24    0.33    0.44
K=4        *0.07   *0.11   *0.23   0.37    0.50
K=5        0.09    0.13    0.26    0.40    0.56
Effect of varying k (case 3)
N(0,COV), where COV = $c \cdot \Sigma_U$
Relative error
||V||F     229     323     561     725     1025
c          0.07    0.15    0.44    0.74    1.45
K=1        0.50    0.55    0.73    0.88    *1.17
K=2        0.34    0.43    0.68    0.86    1.19
K=3        0.30    0.41    0.67    0.86    1.20
K=4        *0.27   *0.38   *0.65   *0.85   1.20
K=5        0.27    0.38    0.65    0.85    1.20
Effect of varying noise
[Figure: panels for σ² = 0.1, σ² = 0.5, and σ² = 1.0; ||V||F/||U||F = 87.8%]
Effect of covariance matrix
[Figure: panels for Case 1, Case 2, and Case 3; ||V||F/||U||F = 39.1%]
Conclusion
Spectral filtering based techniques have been investigated as a major means of point-wise data reconstruction.
We present the upper bound, which enables attackers to determine how close their estimated data is to the original data
Future Work
We are working on the lower bound
which represents the best estimate the attacker can
achieve using SF
which can be used by data owners to determine how
much noise should be added to preserve privacy
Bound analysis at point-wise level
Acknowledgement
NSF Grant: CCR-0310974, IIS-0546027
Personnel: Xintao Wu, Songtao Guo, Ling Guo
More Info: http://www.cs.uncc.edu/~xwu/, [email protected]
Questions?
Thank you!