On the Use of Spectral Filtering for Privacy Preserving Data Mining

Songtao Guo and Xintao Wu
UNC Charlotte

SAC'06, April 23-27, 2006, Dijon, France
Source: http://www.privacyinternational.org/issues/foia/foialaws.jpg
- PIPEDA 2000
- European Union (Directive 95/46/EC)
- HIPAA for health care
- California State Bill 1386
- Gramm-Leach-Bliley Act for financial data
- COPPA for children's online privacy

Source: http://www.privacyinternational.org/survey/dpmap.jpg

Mining vs. Privacy

- Data mining: the goal of data mining is summary results (e.g., classification, clusters, association rules, etc.) derived from the data distribution.
- Individual privacy: individual values in the database must not be disclosed, or at least no close estimate of them should be derivable by attackers.

Privacy Preserving Data Mining (PPDM)

- How to "perturb" the data such that
  - we can still build a good data mining model (data utility),
  - while preserving individuals' privacy at the record level (privacy)?
Outline

- Additive Randomization
- Distribution Reconstruction
  - Bayesian Method (Agrawal & Srikant, SIGMOD 2000)
  - EM Method (Agrawal & Aggarwal, PODS 2001)
- Individual Value Reconstruction
  - Spectral Filtering (Kargupta et al., ICDM 2003)
  - PCA Technique (Du et al., SIGMOD 2005)
- Error Bound Analysis for Spectral Filtering
  - Upper Bound
- Conclusion and Future Work
Additive Randomization

- Hide the sensitive data by randomly modifying the data values with some additive noise: $\tilde{U} = U + V$
- Privacy preserving aims at keeping both $\|\tilde{U} - U\|$ and $\|\hat{U} - U\|$ large
- Utility preserving aims at keeping the aggregate characteristics unchanged, or at least recoverable
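For concreteness, here is a minimal numpy sketch of the additive scheme; the toy data matrix U, the noise level sigma, and the sizes are made-up values for illustration only:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: m records x n attributes (made-up values, for illustration only)
m, n = 1000, 5
U = rng.multivariate_normal(np.zeros(n), np.eye(n) + 0.8, size=m)

# V: i.i.d. zero-mean Gaussian noise with variance sigma^2
sigma = 0.5
V = rng.normal(0.0, sigma, size=(m, n))

# Perturbed data released for mining: U_tilde = U + V
U_tilde = U + V
```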
Distribution Reconstruction

- The original density distribution can be reconstructed effectively given the perturbed data and the noise's distribution -- Agrawal & Srikant, SIGMOD 2000
- Independent random noise with any known distribution

  f_X^0 := uniform distribution
  j := 0   // iteration number
  repeat
      $f_X^{j+1}(a) := \frac{1}{n}\sum_{i=1}^{n}\frac{f_Y((x_i+y_i)-a)\, f_X^{j}(a)}{\int f_Y((x_i+y_i)-z)\, f_X^{j}(z)\, dz}$
      j := j + 1
  until (stopping criterion met)

[Figure: histogram of Number of People vs. Age comparing the Original, Randomized, and Reconstructed distributions]

- It cannot reconstruct individual values
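A rough numpy/scipy sketch of this iterative reconstruction on a discretized grid; the age-like toy data, the Gaussian noise, the grid, and the fixed iteration count are all assumptions made for illustration:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Toy setup: hidden originals x_i and released perturbed values w_i = x_i + y_i
n = 2000
x = rng.uniform(20, 60, size=n)            # original (hidden) values
sigma = 5.0                                # noise std, known to the reconstructor
w = x + rng.normal(0.0, sigma, size=n)     # perturbed values

# Discretize the support of X and iterate the update rule above
grid = np.linspace(0, 80, 161)
da = grid[1] - grid[0]
f_y = norm.pdf(w[:, None] - grid[None, :], scale=sigma)   # f_Y((x_i + y_i) - a)
f_x = np.full_like(grid, 1.0 / (grid[-1] - grid[0]))      # uniform starting density

for _ in range(100):                       # fixed iteration count as the stopping criterion
    numer = f_y * f_x[None, :]
    denom = numer.sum(axis=1, keepdims=True) * da         # integral in the denominator
    f_x = (numer / denom).mean(axis=0)

# f_x now approximates the density of X; the individual x_i remain unknown
```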
Individual Value Reconstruction

- Spectral Filtering (Kargupta et al., ICDM 2003)
  1. Apply EVD to the covariance matrix of the perturbed data: $\tilde{A} = \tilde{U}^T\tilde{U} = \tilde{Q}\tilde{\Lambda}\tilde{Q}^T$
  2. Using some published information about V, extract the first k components of $\tilde{\Lambda}$ as the principal components:
     - $\tilde{\lambda}_1 \ge \tilde{\lambda}_2 \ge \cdots \ge \tilde{\lambda}_k$, and $\tilde{e}_1, \tilde{e}_2, \ldots, \tilde{e}_k$ are the corresponding eigenvectors
     - $\tilde{X} = [\tilde{e}_1\ \tilde{e}_2\ \cdots\ \tilde{e}_k]$ forms an orthonormal basis of a subspace $\tilde{\mathcal{X}}$
  3. Find the orthogonal projection onto $\tilde{\mathcal{X}}$: $P_{\tilde{X}} = \tilde{X}\tilde{X}^T$
  4. Get the estimated data set: $\hat{U} = \tilde{U} P_{\tilde{X}}$
- PCA Technique (Huang, Du and Chen, SIGMOD 2005)
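A compact numpy sketch of these four steps (my reading of the procedure, not the authors' code); the function name and signature are my own:

```python
import numpy as np

def spectral_filter(U_tilde: np.ndarray, k: int) -> np.ndarray:
    """Estimate the original data from perturbed data U_tilde by projecting
    onto the span of the top-k eigenvectors of U_tilde^T U_tilde."""
    A_tilde = U_tilde.T @ U_tilde                  # step 1: covariance-like matrix, EVD below
    eigvals, eigvecs = np.linalg.eigh(A_tilde)     # eigh returns ascending eigenvalues
    X_tilde = eigvecs[:, ::-1][:, :k]              # step 2: top-k eigenvectors
    P_tilde = X_tilde @ X_tilde.T                  # step 3: orthogonal projector onto their span
    return U_tilde @ P_tilde                       # step 4: estimated data U_hat
```

In practice k is chosen using published information about the noise (see "Determining k" later in the talk).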
Motivation

- Previous work on individual value reconstruction was only empirical
  - The relationship between the estimation accuracy and the noise was not clear
- Two questions
  - Attacker question: How close is the estimated data obtained by SF to the original data?
  - Data owner question: How much noise should be added to preserve privacy at a given tolerated level?
Our Work

- Investigate the explicit relationship between the estimation accuracy and the noise
- Derive an upper bound on $\|\hat{U} - U\|_F$ in terms of V
- The upper bound determines how close the estimated data achieved by attackers can be to the original data
- It poses a serious threat of privacy breaches
Preliminary

- F-norm and 2-norm
  - $\|A\|_F = \sqrt{\sum_{i=1}^{m}\sum_{j=1}^{n} a_{ij}^2}$
  - $\|A\|_2 = \max_{x \neq 0} \frac{\|Ax\|_2}{\|x\|_2}$
- Some properties
  - $\|AB\|_F \le \|A\|_F \|B\|_F$ and $\|AB\|_2 \le \|A\|_2 \|B\|_2$
  - $\|A\|_2 \le \|A\|_F \le \sqrt{n}\,\|A\|_2$
  - $\|A\|_2 = \sqrt{\lambda_{\max}(A^T A)}$, the square root of the largest eigenvalue of $A^T A$
  - If A is symmetric, then $\|A\|_2 = \lambda_{\max}(A)$, the largest eigenvalue of A
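These definitions and properties are easy to sanity-check numerically; a small numpy sketch with arbitrary random matrices (made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(6, 4))
B = rng.normal(size=(4, 5))

fro = lambda M: np.linalg.norm(M, 'fro')   # Frobenius norm
two = lambda M: np.linalg.norm(M, 2)       # spectral norm (largest singular value)

assert fro(A @ B) <= fro(A) * fro(B)                       # ||AB||_F <= ||A||_F ||B||_F
assert two(A @ B) <= two(A) * two(B)                       # ||AB||_2 <= ||A||_2 ||B||_2
assert two(A) <= fro(A) <= np.sqrt(A.shape[1]) * two(A)    # ||A||_2 <= ||A||_F <= sqrt(n) ||A||_2
# ||A||_2 equals the square root of the largest eigenvalue of A^T A
assert np.isclose(two(A), np.sqrt(np.linalg.eigvalsh(A.T @ A).max()))
```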
Matrix Perturbation

- Traditional matrix perturbation theory
  - How the derived perturbation E affects the covariance matrix A
- Our scenario
  - How the primary perturbation V affects the data matrix U

$\tilde{A} = \tilde{U}^T\tilde{U} = (U+V)^T(U+V) = U^T U + V^T U + U^T V + V^T V = A + E$

where $A = U^T U$ and $E = V^T U + U^T V + V^T V$.
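This decomposition of the perturbed covariance matrix can be verified numerically; a small sketch with made-up U and V:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 500, 6
U = rng.normal(size=(m, n))                    # original data (toy)
V = rng.normal(scale=0.3, size=(m, n))         # primary perturbation
U_tilde = U + V

A = U.T @ U                                    # original covariance matrix
E = V.T @ U + U.T @ V + V.T @ V                # derived perturbation on A
A_tilde = U_tilde.T @ U_tilde

assert np.allclose(A_tilde, A + E)             # A_tilde = A + E (up to round-off)
```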
Error Bound Analysis

$\hat{U} - U = \tilde{U} P_{\tilde{X}} - U P_X = U(P_{\tilde{X}} - P_X) + V P_{\tilde{X}}$

$\|\hat{U} - U\|_F \le \|U\|_F \, \|P_{\tilde{X}} - P_X\|_F + \|V P_{\tilde{X}}\|_F$

- Prop 1. Let the covariance matrix of the perturbed data be $\tilde{A} = A + E$. Given $\varepsilon = \|E\|_F$ and the eigengap $\delta = \lambda_k - \lambda_{k+1}$, then
  $\|P_{\tilde{X}} - P_X\|_F \le \frac{2\varepsilon}{\delta - 2\varepsilon}$
- Prop 2. Let $\varepsilon_1 \le \varepsilon_2 \le \cdots \le \varepsilon_n$ be the eigenvalues of E; then
  $\tilde{\lambda}_i \in [\lambda_i + \varepsilon_1,\ \lambda_i + \varepsilon_n]$
Theorem

Given a data set $U \in \mathbb{R}^{m \times n}$ and a noise set $V \in \mathbb{R}^{m \times n}$, we have the perturbed data set $\tilde{U} = U + V$. Let $\hat{U}$ be the estimate obtained from Spectral Filtering; then

$\|\hat{U} - U\|_F \le \|U\|_F \, \frac{2\|E\|_F}{(\tilde{\lambda}_k - \|E\|_2) - 2\|E\|_F} + \|V P_{\tilde{X}}\|_F$

where $E = V^T U + U^T V + V^T V$ is the derived perturbation on the original covariance matrix $A = U^T U$.

- Proof is skipped
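A numerical sanity check of the theorem on made-up synthetic data; it computes both sides of the bound as stated above and simply prints them:

```python
import numpy as np

rng = np.random.default_rng(2)
m, n, k = 2000, 6, 2

# Toy data of exact rank k (strong principal directions), plus small noise V
U = rng.normal(size=(m, k)) @ rng.normal(size=(k, n)) * 3.0
V = rng.normal(scale=0.05, size=(m, n))
U_tilde = U + V

# Spectral filtering estimate U_hat
vals, vecs = np.linalg.eigh(U_tilde.T @ U_tilde)
X_tilde = vecs[:, ::-1][:, :k]
P_tilde = X_tilde @ X_tilde.T
U_hat = U_tilde @ P_tilde

# Quantities appearing in the bound
E = V.T @ U + U.T @ V + V.T @ V
E_f, E_2 = np.linalg.norm(E, 'fro'), np.linalg.norm(E, 2)
lam_tilde_k = np.sort(vals)[::-1][k - 1]                   # k-th largest eigenvalue of A_tilde

lhs = np.linalg.norm(U_hat - U, 'fro')
rhs = (np.linalg.norm(U, 'fro') * 2 * E_f / ((lam_tilde_k - E_2) - 2 * E_f)
       + np.linalg.norm(V @ P_tilde, 'fro'))
print(f"actual error ||U_hat - U||_F = {lhs:.2f}   upper bound = {rhs:.2f}")
```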
Special Cases

- When the noise matrix is generated by an i.i.d. Gaussian distribution with zero mean and known variance:
  $\|\hat{U} - U\|_F \le \|U P_{\tilde{X}}\|_F \, \frac{2\|V\|_F^2}{(\tilde{\lambda}_k - \|E\|_2) - 2\|V\|_F^2} + \sqrt{k/n}\,\|V\|_F$
- When the noise is completely correlated with the data:
  $\|\hat{U} - U\|_F \le \|U P_{\tilde{X}}\|_F \, \frac{2\|V\|_F^2}{(\tilde{\lambda}_k - \|E\|_2) - 2\|V\|_F^2} + \|V\|_F$
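The $\sqrt{k/n}$ factor in the i.i.d. case reflects how much of an isotropic noise matrix survives projection onto a k-dimensional subspace; a quick numerical illustration (the subspace and the sizes are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(3)
m, n, k = 5000, 10, 3

V = rng.normal(size=(m, n))                    # i.i.d. Gaussian noise
Q, _ = np.linalg.qr(rng.normal(size=(n, k)))   # orthonormal basis of an arbitrary k-dim subspace
P = Q @ Q.T                                    # rank-k orthogonal projector

ratio = np.linalg.norm(V @ P, 'fro') / np.linalg.norm(V, 'fro')
print(f"||V P||_F / ||V||_F = {ratio:.3f}   sqrt(k/n) = {np.sqrt(k / n):.3f}")
```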
Experimental Results

- Artificial dataset
  - 35 correlated variables
  - 30,000 tuples

[Figure: plot of the 35 variables F1-F35 in the artificial dataset]
Experimental Results

- Scenarios of noise addition
  - Case 1: i.i.d. Gaussian noise: N(0, COV), where COV = $\mathrm{diag}(\sigma^2, \ldots, \sigma^2)$
  - Case 2: independent Gaussian noise: N(0, COV), where COV = $c \cdot \mathrm{diag}(\sigma_1^2, \ldots, \sigma_n^2)$
  - Case 3: correlated Gaussian noise: N(0, COV), where COV = $c \cdot \Sigma_U$ (or $c \cdot A$)
- Measures
  - Absolute error: $ae(U, \hat{U}) = \|U - \hat{U}\|_F$
  - Relative error: $re(U, \hat{U}) = \frac{\|U - \hat{U}\|_F}{\|U\|_F}$
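A sketch of how the three noise scenarios and the error measures can be generated and evaluated; the stand-in data matrix, the noise levels sigma2 and c, k, and the smaller number of tuples are all assumptions, not the paper's actual dataset:

```python
import numpy as np

rng = np.random.default_rng(4)
m, n, k = 3000, 35, 4                               # fewer tuples than the paper, for speed

# Stand-in correlated data playing the role of the 35-variable artificial dataset
C = rng.normal(size=(n, n))
Sigma_U = C @ C.T / n + np.eye(n)
U = rng.multivariate_normal(np.zeros(n), Sigma_U, size=m)

sigma2, c = 0.5, 0.5                                # illustrative noise levels
cases = {
    "case 1 (i.i.d.)":      np.diag(np.full(n, sigma2)),
    "case 2 (independent)": c * np.diag(U.var(axis=0)),
    "case 3 (correlated)":  c * np.cov(U, rowvar=False),
}

def spectral_filter(U_tilde, k):
    _, vecs = np.linalg.eigh(U_tilde.T @ U_tilde)
    X = vecs[:, ::-1][:, :k]
    return U_tilde @ X @ X.T

ae = lambda U, U_hat: np.linalg.norm(U - U_hat, 'fro')            # absolute error
re = lambda U, U_hat: ae(U, U_hat) / np.linalg.norm(U, 'fro')     # relative error

for name, cov in cases.items():
    V = rng.multivariate_normal(np.zeros(n), cov, size=m)
    U_hat = spectral_filter(U + V, k)
    print(f"{name}:  ||V||_F = {np.linalg.norm(V, 'fro'):8.1f}   re = {re(U, U_hat):.3f}")
```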
Determining k

- Determine k in Spectral Filtering
- According to matrix perturbation theory:
  $\max_i |\tilde{\lambda}_i - \lambda_i| \le \|E\|_2$
- Our heuristic approach:
  - check whether $\tilde{\lambda}_i \ge \|E\|_2$
  - $k = \max\{\, i : \tilde{\lambda}_i \ge \|E\|_2 \,\}$, i.e., the index of the smallest perturbed eigenvalue that still exceeds $\|E\|_2$
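A sketch of this heuristic in numpy; the helper name choose_k and the synthetic example are my own, and in practice $\|E\|_2$ would be estimated from the published noise characteristics rather than computed from V and U:

```python
import numpy as np

def choose_k(U_tilde: np.ndarray, E_2: float) -> int:
    """Keep the components whose perturbed eigenvalue still exceeds ||E||_2."""
    vals = np.sort(np.linalg.eigvalsh(U_tilde.T @ U_tilde))[::-1]   # descending eigenvalues
    return int((vals >= E_2).sum())

# Synthetic example: data of true rank 3 plus mild noise
rng = np.random.default_rng(5)
U = rng.normal(size=(1000, 3)) @ rng.normal(size=(3, 8)) * 2.0
V = rng.normal(scale=0.2, size=U.shape)
E = V.T @ U + U.T @ V + V.T @ V
print(choose_k(U + V, np.linalg.norm(E, 2)))        # expected to report about 3
```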
Effect of varying k (case 1)

N(0, COV), where COV = $\mathrm{diag}(\sigma^2, \ldots, \sigma^2)$; entries are relative errors, * marks the lowest value in each column.

||V||_F   229     323     561     725     1025
σ²        0.05    0.10    0.3     0.5     1.0
K=1       0.43    0.44    0.45    0.46    0.48
K=2       0.22    0.23    0.26    0.29    0.36
K=3       0.16    0.18    0.24    0.29   *0.31
K=4      *0.09   *0.12   *0.22   *0.28    0.40
K=5       0.10    0.14    0.25    0.32    0.45
Effect of varying k (case 2)

N(0, COV), where COV = $c \cdot \mathrm{diag}(\sigma_1^2, \sigma_2^2, \ldots, \sigma_n^2)$; entries are relative errors, * marks the lowest value in each column.

||V||_F   229     323     561     725     1025
c         0.07    0.15    0.44    0.74    1.45
K=1       0.44    0.44    0.45    0.46    0.49
K=2       0.22    0.23    0.27   *0.30   *0.36
K=3       0.16    0.18    0.24    0.33    0.44
K=4      *0.07   *0.11   *0.23    0.37    0.50
K=5       0.09    0.13    0.26    0.40    0.56
Effect of varying k (case 3)

N(0, COV), where COV = $c \cdot \Sigma_U$; entries are relative errors, * marks the lowest value in each column.

||V||_F   229     323     561     725     1025
c         0.07    0.15    0.44    0.74    1.45
K=1       0.50    0.55    0.73    0.88   *1.17
K=2       0.34    0.43    0.68    0.86    1.19
K=3       0.30    0.41    0.67    0.86    1.20
K=4      *0.27   *0.38   *0.65   *0.85    1.20
K=5       0.27    0.38    0.65    0.85    1.20
Effect of varying noise

[Figure: reconstruction results for σ² = 0.1, σ² = 0.5, and σ² = 1.0; ||V||_F/||U||_F = 87.8%]
Effect of covariance matrix

[Figure: reconstruction results for Case 1, Case 2, and Case 3; ||V||_F/||U||_F = 39.1%]
Conclusion

- Spectral filtering based techniques have been investigated as a major means of point-wise data reconstruction.
- We present an upper bound which enables attackers to determine how close their estimated data is to the original data.
Future Work

- We are working on the lower bound
  - which represents the best estimate the attacker can achieve using SF
  - which can be used by data owners to determine how much noise should be added to preserve privacy
- Bound analysis at the point-wise level
Acknowledgement

- NSF Grants
  - CCR-0310974
  - IIS-0546027
- Personnel
  - Xintao Wu
  - Songtao Guo
  - Ling Guo
- More Info
  - http://www.cs.uncc.edu/~xwu/
  - [email protected]
Questions?
Thank you!