Data Transformation for
Privacy-Preserving Data Mining
Stanley R. M. Oliveira
Database Systems Laboratory
Computing Science Department
University of Alberta, Canada
PhD Thesis - Final Examination
November 29th, 2004
Introduction
Framework
PP-Assoc. Rules
PP-Clustering
Conclusions
Motivation
• Scenario 1: A collaboration between an Internet marketing company and an on-line retail company.
  – Objective: find optimal customer targets.
• Scenario 2: Companies sharing their transactions to build a recommender system.
  – Objective: provide recommendations to their customers.
PhD – Final Examination – Nov. 29, 2004
Data Transformation for Privacy-Preserving Data Mining
Stanley Oliveira
A Taxonomy of the Existing Solutions
[Fig. 1: A Taxonomy of PPDM Techniques, with branches Data Partitioning, Data Modification, Data Restriction, and Data Ownership; the thesis contributions are situated within this taxonomy.]
Problem Definition
• To transform a database into a new one that conceals sensitive information while preserving general patterns and trends from the original database.

[Diagram: the Original Database goes through the transformation process (data sanitization or dimensionality reduction), yielding a Transformed Database from which data mining extracts only non-sensitive patterns and trends.]
Problem Definition (cont.)
• Sub-Problem 1: Privacy-Preserving Association Rule Mining
  I do not address the privacy of individuals but the problem of protecting sensitive knowledge.
• Sub-Problem 2: Privacy-Preserving Clustering
  I protect the underlying attribute values of objects subjected to cluster analysis.
A Framework for Privacy-Preserving Data Mining
[Diagram: a schematic view of the framework for privacy preservation. A client's original data undergoes either collective or individual transformation on the server side, producing the transformed database; the PPDT methods draw on a library of sanitizing algorithms, retrieval facilities, and metrics.]
Privacy-Preserving Association Rule Mining
[Diagram: a taxonomy of sanitizing algorithms, organized under three heuristics (Heuristic 1, Heuristic 2, and Heuristic 3).]
Data Sharing-Based Algorithms: Problems
[Diagram: the sanitization process transforms Database D into Database D'.]

1. Hiding Failure: HF = #S_R(D') / #S_R(D)
2. Misses Cost: MC = (#~S_R(D) − #~S_R(D')) / #~S_R(D)
3. Artifactual Patterns: AP = (|R'| − |R ∩ R'|) / |R'|
4. Difference between D and D':
   Dif(D, D') = (1 / Σ_{i=1..n} f_D(i)) × Σ_{i=1..n} |f_D(i) − f_D'(i)|

where S_R denotes the sensitive rules, ~S_R the non-sensitive rules, R and R' the sets of rules mined from D and D', and f_D(i) the frequency of item i in D.
Data Sharing-Based Algorithms
1. Scan a database and identify the sensitive transactions
for each sensitive rule;
2. Based on the disclosure threshold ψ, compute the
number of sensitive transactions to be sanitized;
3. For each sensitive rule, identify a candidate item that
should be eliminated (victim item);
4. Based on the number found in step 3, remove the
victim items from the sensitive transactions.
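A minimal sketch of these four steps. The set-based transaction representation, the rounding of ψ (the disclosure threshold), and the frequency-based victim choice are illustrative assumptions; the thesis algorithms (IGA, SWA, etc.) differ in their details:

```python
def sanitize(transactions, sensitive_rules, psi):
    """Hide each sensitive rule by removing a victim item from enough of
    its sensitive transactions (a sketch of the 4-step scheme; transactions
    are sets of items, rules are frozensets of the items they involve)."""
    db = [set(t) for t in transactions]
    for rule in sensitive_rules:
        # Step 1: identify the transactions that fully support the rule.
        sens_idx = [i for i, t in enumerate(db) if rule <= t]
        # Step 2: how many to sanitize, controlled by disclosure threshold psi.
        n_sanitize = round(len(sens_idx) * (1 - psi))
        # Step 3: pick a victim item (here: the rule item most frequent in db).
        victim = max(rule, key=lambda it: sum(it in t for t in db))
        # Step 4: remove the victim item from the chosen transactions.
        for i in sens_idx[:n_sanitize]:
            db[i].discard(victim)
    return db
```

With ψ = 0 every sensitive transaction is sanitized, so the rule can no longer be mined; with ψ = 1 the database is left untouched.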
Measuring Effectiveness
[Four result tables (ψ = 0%), each listing the best-performing sanitizing algorithm per dataset (Kosarak, Retail, Reuters, BMS-1) and scenario (S1–S4):
• Misses cost under condition C1 (6 sensitive rules): IGA is best in most cells, with SWA and RA winning a few Retail cases.
• Misses cost under condition C2 (varying the number of rules and varying values of ψ): IGA again dominates, with isolated wins for SWA, RA, RRA, and Algo2a.
• Misses cost under condition C3: IGA is best in most cells.
• Dif(D, D') under conditions C1 and C3: SWA is best in nearly all cells.]
Pattern Sharing-Based Algorithm
[Diagram comparing the two approaches:
• Data Sharing-Based Approach: D → Sanitization (IGA) → D'; the database D' is shared, and AR generation on D' yields Association Rules(D').
• Pattern Sharing-Based Approach: D → AR generation → Association Rules(D) → Sanitization (DSA) → Rules(D'); only the sanitized rules, Rules(D'), are shared.]
Pattern Sharing-Based Algorithms: Problems
[Diagram: R is the set of all rules and R' the set of rules to share; the sensitive rules S_R are hidden, while the non-sensitive rules ~S_R may be hidden accidentally.]

• Problem 1: Side effect (non-sensitive rules accidentally hidden)
  Side Effect Factor: SEF = (|R| − (|R'| + |S_R|)) / (|R| − |S_R|)
• Problem 2: Inference (recovery of hidden rules)
  Recovery Factor: RF ∈ [0, 1]
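As a minimal sketch (rule collections modeled as any sized containers; an illustrative assumption):

```python
def side_effect_factor(R, Rp, SR):
    """SEF = (|R| - (|R'| + |S_R|)) / (|R| - |S_R|): the fraction of the
    non-sensitive rules that were accidentally hidden by sanitization."""
    return (len(R) - (len(Rp) + len(SR))) / (len(R) - len(SR))
```

For example, with |R| = 10 rules, |S_R| = 2 sensitive rules, and |R'| = 7 shared rules, one non-sensitive rule was lost: SEF = (10 − 9) / 8 = 0.125.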
Measuring Effectiveness
The best algorithm in terms of misses cost (ψ = 0%, 6 sensitive rules):

Dataset   S1        S2    S3        S4
Kosarak   IGA       DSA   DSA       DSA
Retail    DSA       DSA   DSA       IGA
Reuters   DSA       DSA   DSA       IGA
BMS-1     DSA       DSA   DSA       DSA

The best algorithm in terms of misses cost, varying the number of rules to sanitize (ψ = 0%):

Dataset   S1        S2    S3        S4
Kosarak   IGA       DSA   DSA       IGA/DSA
Retail    DSA       DSA   IGA/DSA   IGA
Reuters   IGA/DSA   DSA   DSA       IGA
BMS-1     DSA       DSA   DSA       DSA

The best algorithm in terms of side effect factor (ψ = 0%, 6 sensitive rules):

Dataset   S1        S2    S3        S4
Kosarak   IGA       DSA   DSA       DSA
Retail    DSA       DSA   DSA       IGA
Reuters   DSA       DSA   DSA       IGA
BMS-1     DSA       DSA   DSA       DSA
Lessons Learned
• Large datasets are our friends.
• The benefit of an index: at most two scans to sanitize a dataset.
• The data sanitization paradox.
• The outstanding performance of IGA and DSA.
• Rule sanitization does not change the support and
confidence of the shared rules.
• DSA reduces the flexibility of information sharing.
Privacy-Preserving Clustering (PPC)
• PPC over Centralized Data:
  – The attribute values subjected to clustering are available in a central repository.
• PPC over Vertically Partitioned Data:
  – There are k parties sharing data for clustering, where k ≥ 2;
  – The attribute values of the objects are split across the k parties;
  – Object IDs are revealed for join purposes only; the values of the associated attributes are private.
Object Similarity-Based Representation (OSBR)
Example 1: Sharing data for research purposes (OSBR).
Original Data: a sample of the cardiac arrhythmia database (UCI Machine Learning Repository):

ID    age   weight   heart rate   Int_def   QRS   PR_int
123   75    80       63           32        91    193
342   56    64       53           24        81    174
254   40    52       70           24        77    129
446   28    58       76           40        83    251
286   44    90       68           44        109   128

Transformed Data: the corresponding dissimilarity matrix:

       | 0                                   |
       | 2.243   0                           |
DM =   | 3.348   2.477   0                   |
       | 3.690   3.884   3.176   0           |
       | 3.020   4.082   4.130   3.995   0   |
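A dissimilarity matrix of this kind can be computed as pairwise Euclidean distances over normalized attributes. The z-score normalization below is an assumption for illustration; the slide does not state the scaling that produced the values shown, so the numbers will not match exactly:

```python
import numpy as np

def dissimilarity_matrix(data):
    """Pairwise Euclidean distances between objects (rows), computed on
    z-score-normalized attributes (normalization is an assumption here)."""
    X = np.asarray(data, dtype=float)
    X = (X - X.mean(axis=0)) / X.std(axis=0)      # normalize each attribute
    diff = X[:, None, :] - X[None, :, :]          # all pairwise differences
    return np.sqrt((diff ** 2).sum(axis=-1))      # m x m distance matrix

# The five objects from the arrhythmia sample (attribute values only).
sample = [[75, 80, 63, 32, 91, 193],
          [56, 64, 53, 24, 81, 174],
          [40, 52, 70, 24, 77, 129],
          [28, 58, 76, 40, 83, 251],
          [44, 90, 68, 44, 109, 128]]
DM = dissimilarity_matrix(sample)
```

Since DM is symmetric with a zero diagonal, only the lower triangle needs to be shared.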
Object Similarity-Based Representation (OSBR)
• Limitations of the OSBR:
  – Expensive in terms of communication cost, O(m²), where m is the number of objects under analysis;
  – Vulnerable to attacks in the case of vertically partitioned data.
• Conclusion → The OSBR is effective for PPC over centralized data only.
Dimensionality Reduction Transformation (DRBT)
• Random projection from d to k dimensions:
  – D'_{n×k} = D_{n×d} · R_{d×k} (linear transformation), where D is the original data, D' is the reduced data, and R is a random matrix.
• R is generated by first setting each entry r_ij, as follows:
  – (R1): r_ij is drawn i.i.d. from N(0, 1), and the columns of R are then normalized to unit length;
  – (R2): r_ij = √3 × { +1 with probability 1/6; 0 with probability 2/3; −1 with probability 1/6 }
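Both ways of generating R can be sketched with NumPy; the function name and the seed parameter are illustrative conveniences, not part of the thesis:

```python
import numpy as np

def random_projection(D, k, method="R2", seed=0):
    """Project n x d data D to n x k via D' = D @ R (a sketch).
    R1: i.i.d. N(0,1) entries, columns then normalized to unit length.
    R2: sqrt(3) * {+1 w.p. 1/6, 0 w.p. 2/3, -1 w.p. 1/6}."""
    rng = np.random.default_rng(seed)
    d = D.shape[1]
    if method == "R1":
        R = rng.standard_normal((d, k))
        R /= np.linalg.norm(R, axis=0)            # unit-length columns
    else:
        R = np.sqrt(3) * rng.choice([1, 0, -1], size=(d, k),
                                    p=[1 / 6, 2 / 3, 1 / 6])
    return D @ R
```

R2 is attractive in practice because most of its entries are zero, so computing D · R is cheap while pairwise distances are still approximately preserved.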
Dimensionality Reduction Transformation (DRBT)
Original Data: a sample of the cardiac arrhythmia database (UCI Machine Learning Repository):

ID    age   weight   heart rate   Int_def   QRS   PR_int
123   75    80       63           32        91    193
342   56    64       53           24        81    174
254   40    52       70           24        77    129
446   28    58       76           40        83    251
286   44    90       68           44        109   128

Transformed Data (RP1: the random matrix is based on the Normal distribution; RP2: the random matrix is based on a much simpler distribution):

        RP1                            RP2
ID      Att1     Att2     Att3        Att1     Att2      Att3
123     -50.40   17.33    12.31       -55.50   -95.26    -107.93
342     -37.08   6.27     12.22       -51.00   -84.29    -83.13
254     -55.86   20.69    -0.66       -65.50   -70.43    -66.97
446     -37.61   -31.66   -17.58      -85.50   -140.87   -72.74
286     -62.72   37.64    18.16       -88.50   -50.22    -102.76
Dimensionality Reduction Transformation (DRBT)
• Security: A random projection from d to k dimensions,
where k < d, is a non-invertible linear transformation.
• Space requirement is of the order O(m), where m is the
number of objects.
• Communication cost is of the order O(mkl), where l
represents the size (in bits) required to transmit a
dataset from one party to a central or third party.
• Conclusion  The DRBT is effective for PPC over
centralized data and vertically partitioned data.
DRBT: PPC over Centralized Data
The error produced on the dataset Chess (do = 37):

Transformation   dr=37   dr=34   dr=31   dr=28   dr=25   dr=22   dr=16
RP1              0.00    0.015   0.024   0.033   0.045   0.072   0.141
RP2              0.00    0.014   0.019   0.032   0.041   0.067   0.131

Average of F-measure (10 trials) for the dataset Accidents (do = 18, dr = 12):

Transformation   K=2 (Avg/Std)   K=3 (Avg/Std)   K=4 (Avg/Std)   K=5 (Avg/Std)
RP2              0.941 / 0.014   0.912 / 0.009   0.881 / 0.010   0.885 / 0.006

Average of F-measure (10 trials) for the dataset Iris (do = 5, dr = 3):

Transformation   K=2 (Avg/Std)   K=3 (Avg/Std)   K=4 (Avg/Std)   K=5 (Avg/Std)
RP2              1.000 / 0.000   0.948 / 0.010   0.858 / 0.089   0.833 / 0.072
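The F-measure in these tables compares the clusters found in the transformed data against those found in the original data. A sketch of one common set-matching formulation, assumed here for illustration (the thesis's exact definition may differ in details):

```python
def f_measure(original, transformed):
    """Set-matching F-measure between two partitions (lists of sets of
    object IDs): each original cluster is matched to the transformed
    cluster maximizing F = 2|A & B| / (|A| + |B|), and the per-cluster
    scores are averaged weighted by cluster size."""
    n = sum(len(c) for c in original)
    total = 0.0
    for a in original:
        best = max(2 * len(a & b) / (len(a) + len(b)) for b in transformed)
        total += (len(a) / n) * best
    return total
```

Identical partitions score 1.0; the score drops as clusters in the transformed data split or merge relative to the original ones.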
DRBT: PPC over Vertically Partitioned Data
The error produced on the dataset Pumsb (do = 74):

No. Parties   RP1      RP2
1             0.0762   0.0558
2             0.0798   0.0591
3             0.0870   0.0720
4             0.0923   0.0733

Average of F-measure (10 trials) for the dataset Pumsb (do = 74, dr = 38):

No. Parties   K=2 (Avg/Std)   K=3 (Avg/Std)   K=4 (Avg/Std)   K=5 (Avg/Std)
1             0.909 / 0.140   0.965 / 0.081   0.891 / 0.028   0.838 / 0.041
2             0.904 / 0.117   0.931 / 0.101   0.894 / 0.059   0.840 / 0.047
3             0.874 / 0.168   0.887 / 0.095   0.873 / 0.081   0.801 / 0.073
4             0.802 / 0.155   0.812 / 0.117   0.866 / 0.088   0.831 / 0.078
Conclusions
Contributions of this Research
• Foundations for further research in PPDM.
• A taxonomy of PPDM techniques.
• A family of privacy-preserving methods.
• A library of sanitizing algorithms.
• Retrieval facilities.
• A set of metrics.
Conclusions
Future Research
• Privacy definition in data mining.
• Combining sanitization and randomization for PPARM.
• Transforming data using one-way functions and learning
from the distorted data.
• Privacy preservation in spoken language databases.
• Sanitization of document repositories on the Web.