Overview of Statistical Disclosure
Control and Privacy-Preserving
Data Mining
Traian Marius Truta
http://www.nku.edu/~trutat1/
Content of the Talk
Introduction
Statistical Disclosure Control (SDC)
Privacy-Preserving Data Mining (PPDM)
De-identification Techniques
Disclosure Risk & Information Loss
Conclusions
April 30, 2009
Traian Marius Truta – DIMACS Tutorial
SDC/PPDM Problem
[Figure: Individuals submit data, which the Data Owner collects. The Data Owner applies a Masking Process and releases the Masked Data. A Researcher receives the masked data and uses it for statistical analysis or data mining. An Intruder receives the masked data and uses it together with External Data to disclose confidential information.]
The masking process has two goals:
Confidentiality of individuals: disclosure risk / anonymity properties.
Preserve data utility: information loss.
Types of Disclosure
Initial Microdata (Data Owner)
Name      SSN         Age   Zip     Diagnosis   Income
Alice     123456789   44    48202   AIDS        17,000
Bob       323232323   44    48202   AIDS        68,000
Charley   232345656   44    48201   Asthma      80,000
Dave      333333333   55    48310   Asthma      55,000
Eva       666666666   55    48310   Diabetes    23,000

Masked Microdata (Data Owner)
Age   Zip     Diagnosis   Income
44    48202   AIDS        17,000
44    48202   AIDS        68,000
44    48201   Asthma      80,000
55    48310   Asthma      55,000
55    48310   Diabetes    23,000

External Information (Intruder)
Name      SSN         Age   Zip
Alice     123456789   44    48202
Charley   232345656   44    48201
Dave      333333333   55    48310

Identity Disclosure: Charley is the third record.
Attribute Disclosure: Alice has AIDS.

Masked Microdata with Zip generalized to its first three digits
Age   Zip   Diagnosis   Income
44    482   AIDS        17,000
44    482   AIDS        68,000
44    482   Asthma      80,000
55    483   Asthma      55,000
55    483   Diabetes    23,000
With the generalized Zip, Alice and Charley both match all three records with Age 44 and Zip 482, so neither inference above is certain any longer.
Types of disclosure
Identity disclosure - identification of an entity
(person, institution).
Attribute disclosure - the intruder finds
something new about the target person.
[Lambert 1993]
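The linking step behind both kinds of disclosure can be sketched as a join on Age and Zip. This is a minimal illustration using the values from the Types of Disclosure example; the function name `link` and the dictionary layout are assumptions made for this sketch, not part of the tutorial.

```python
# Masked microdata released by the data owner (identifiers removed).
masked = [
    {"Age": 44, "Zip": "48202", "Diagnosis": "AIDS", "Income": 17000},
    {"Age": 44, "Zip": "48202", "Diagnosis": "AIDS", "Income": 68000},
    {"Age": 44, "Zip": "48201", "Diagnosis": "Asthma", "Income": 80000},
    {"Age": 55, "Zip": "48310", "Diagnosis": "Asthma", "Income": 55000},
    {"Age": 55, "Zip": "48310", "Diagnosis": "Diabetes", "Income": 23000},
]

# External information known to the intruder.
external = [
    {"Name": "Alice", "Age": 44, "Zip": "48202"},
    {"Name": "Charley", "Age": 44, "Zip": "48201"},
    {"Name": "Dave", "Age": 55, "Zip": "48310"},
]


def link(person, records, keys=("Age", "Zip")):
    """Return the positions of masked records that match the person on the key attributes."""
    return [i for i, r in enumerate(records) if all(r[k] == person[k] for k in keys)]


findings = {}
for person in external:
    matches = link(person, masked)
    diagnoses = {masked[i]["Diagnosis"] for i in matches}
    findings[person["Name"]] = {
        "matches": matches,
        "identity": len(matches) == 1,     # exactly one candidate record
        "attribute": len(diagnoses) == 1,  # all candidates share one diagnosis
    }
```

Alice matches two records that share a diagnosis (attribute disclosure without identity disclosure), Charley matches a single record (identity disclosure), and Dave matches two records with different diagnoses (neither).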
Statistical Disclosure Control
Statistical Disclosure Control is the discipline
concerned with modifying data that contain
confidential information about individual entities
(persons, households, businesses, etc.) in order to
prevent third parties working with these data from
recognizing individuals in the data [Willemborg 1996,
Willemborg 2001].
Also called:
Computational Disclosure Control [Sweeney 2001]
Disclosure Control [Truta 2004]
Microdata and External Information
Microdata: a series of records, each record
containing information on an individual unit such as a
person, a firm, an institution, etc.
Masked Microdata: microdata from which names and other
identifying information have been removed.
External Information: any information known to a
presumptive intruder about some individuals in the
initial microdata.
Disclosure Risk and Information Loss
Disclosure risk - the risk that a given form of
disclosure will arise if the masked microdata is
released [Chen 1998].
Information loss - the amount of information that
exists in the initial microdata but, because of the
disclosure control methods applied, is no longer
present in the masked microdata [Willemborg 2001].
Disclosure Control For Microdata
Initial Microdata
Name    Age   Diagnosis      Income
Wayne   44    AIDS           45,500
Gore    44    Asthma         37,900
Banks   55    AIDS           67,000
Casey   44    Asthma         21,000
Stone   55    Asthma         90,000
Kopi    45    Diabetes       48,000
Simms   25    Diabetes       49,000
Wood    35    AIDS           66,000
Aaron   55    AIDS           69,000
Pall    45    Tuberculosis   34,000

Masked Microdata (- marks a suppressed value)
Age   Diagnosis   Income
44    AIDS        50,000
44    Asthma      40,000
55    AIDS        70,000
44    Asthma      20,000
55    Asthma      90,000
45    Diabetes    50,000
-     Diabetes    50,000
-     AIDS        70,000
55    AIDS        70,000
45    -           30,000
Disclosure Control for Tables
Tables (computed from the initial microdata)

Table 1 - Count Diagnosis
Diagnosis      Count
AIDS           4
Asthma         3
Diabetes       2
Tuberculosis   1

Table 2 - Total Income
Age       Count   Income
<= 30     1       49,000
31 - 40   1       66,000
41 - 50   5       188,200
51 - 60   3       226,000
> 60      0       0

Masked Tables from Tables

Masked Table 1
Diagnosis   Count
AIDS        4
Asthma      3

Masked Table 2
Age       Count   Income
41 - 50   5       188,200
51 - 60   3       226,000

Masked Tables from Masked Microdata

Masked Table 3
Diagnosis   Count   Income
AIDS        4       260,000
Asthma      3       150,000
Diabetes    2       100,000

Masked Table 4
Age   Count   Income
44    3       110,000
45    2       80,000
55    3       230,000

Masked Table 5
Diagnosis   Count
AIDS        4
Asthma      3
Diabetes    2
What is Data Mining?
Many Definitions
Non-trivial extraction of implicit, previously unknown and
potentially useful information from data.
Exploration & analysis, by automatic or
semi-automatic means, of
large quantities of data
in order to discover
meaningful patterns
[Tan 2006].
Origins of Data Mining
Draws ideas from machine learning/AI, pattern
recognition, statistics, and database systems.
Traditional techniques may be unsuitable due to:
Enormity of data
High dimensionality of data
Heterogeneous, distributed nature of data
[Figure: Data Mining at the intersection of Statistics, Machine Learning, AI, Pattern Recognition, and Database systems]
Data Mining Tasks
Prediction Methods
Use some variables to predict unknown or future
values of other variables.
Classification, regression.
Description Methods
Find human-interpretable patterns that describe the
data.
Clustering, association rule discovery.
Privacy-Preserving Data Mining
Privacy preserving data mining is a research
direction in data mining and statistical databases,
where data mining algorithms are analyzed for the
side-effects they incur in data privacy [Verykios 2004].
Two objectives:
Identifiers and quasi-identifiers, such as names and addresses, should be
modified or removed from the original database so that the
recipient of the data cannot compromise another person’s
privacy.
Sensitive knowledge that can be mined from a database with
data mining algorithms should also be excluded, because such
knowledge can equally well compromise data privacy.
Other Names for the Same Problem
The field of data privacy emerged in recent years at the
confluence of well established research areas: data
mining, databases, computer security, health informatics,
statistics, etc. As a result, there are different terminologies
that define the same or very similar concepts.
Statistical Disclosure Control [Willemborg 1996]
Privacy Preserving Data Mining [Clifton 1996]
Data Anonymity [Sweeney 2002]
Privacy Preserving Data Publishing [Fung 2007]
PPDM Current Directions
Transform data (usually microdata) to satisfy a
privacy guarantee
k-anonymity [Samarati 2001, Sweeney 2002]
p-sensitive k-anonymity [Truta 2006]
l-diversity [Machanavajjhala 2006]
t-closeness [Li 2007]
randomization [Evfimievski 2003]
… and many more
Algorithms, theoretical results, data mining under
privacy constraints, information loss measures
PPDM Current Directions
Cryptographic methods for data sharing and
privacy
Horizontal partitioning [Kantarcioglu 2004]
Vertical partitioning [Vaidya 2002]
Privacy-preservation for other data models
Data Streams [Xu 2008]
Location Based Systems [Kalnis 2007]
Social Networks [Hay 2007, Campan 2008]
Attribute Classification
I1, I2,..., Im – identifier attributes
Ex: Name and SSN
Found in the initial microdata (IM) only
Information that leads to a specific entity
K1, K2,..., Kp – key or quasi-identifier attributes
Ex: Zip Code and Age
Found in IM and in the masked microdata (MM)
May be known by an intruder
S1, S2,..., Sq – confidential or sensitive attributes
Ex: Principal Diagnosis and Annual Income
Found in IM and MM
Assumed to be unknown to an intruder
K-Anonymity Definitions
QI-cluster – all the tuples with identical combination
of quasi-identifier attribute values in that microdata.
K-anonymity property for a masked microdata (MM)
is satisfied if every QI-cluster in MM contains k or
more tuples.
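The two definitions above translate directly into a grouping step followed by a size check. The following is a minimal sketch; the function names `qi_clusters` and `is_k_anonymous` and the sample records are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def qi_clusters(records, quasi_identifiers):
    """Group record positions by their combination of quasi-identifier values."""
    clusters = defaultdict(list)
    for idx, rec in enumerate(records):
        key = tuple(rec[a] for a in quasi_identifiers)
        clusters[key].append(idx)
    return dict(clusters)


def is_k_anonymous(records, quasi_identifiers, k):
    """True if every QI-cluster contains k or more tuples."""
    return all(len(members) >= k
               for members in qi_clusters(records, quasi_identifiers).values())


# Illustrative masked microdata (hypothetical values).
records = [
    {"Age": 50, "Zip": "41076", "Sex": "Female"},
    {"Age": 50, "Zip": "41076", "Sex": "Female"},
    {"Age": 30, "Zip": "41099", "Sex": "Male"},
    {"Age": 30, "Zip": "41099", "Sex": "Male"},
    {"Age": 30, "Zip": "41099", "Sex": "Male"},
]
clusters = qi_clusters(records, ["Age", "Zip", "Sex"])
```

Here the smallest cluster has two tuples, so the sample is 2-anonymous but not 3-anonymous.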
K-Anonymity Example
RecID   Age   Zip     Sex      Illness
1       50    41076   Female   AIDS
2       30    41099   Male     Diabetes
3       30    41099   Male     AIDS
4       20    41076   Male     Asthma
5       20    41076   Male     Asthma
6       50    41076   Female   Diabetes
7       60    41076   Female   Tuberculosis
KA = { Age, Zip, Sex }
cl1 = {1, 6, 7}; cl2 = {2, 3}; cl3 = {4, 5}
Domain and Value Generalization Hierarchies
Zip domain hierarchy:
Z0 = {48201, 41075, 41076, 41088, 41099}
Z1 = {482**, 410**}
Z2 = {*****}
Zip value hierarchy:
*****
  482**
    48201
  410**
    41075
    41076
    41088
    41099
Sex domain hierarchy:
S0 = {male, female}
S1 = {*}
Sex value hierarchy:
*
  male
  female
[Samarati 2001, Sweeney 2002]
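Moving a value up its generalization hierarchy can be sketched as a small function per attribute. The level numbering follows the Z0/Z1/Z2 and S0/S1 domains above; the function names `generalize_zip` and `generalize_sex` are assumptions made for this sketch.

```python
def generalize_zip(zip_code, level):
    """Generalize a 5-digit zip code: level 0 = Z0, level 1 = Z1, level 2 = Z2."""
    kept_digits = {0: 5, 1: 3, 2: 0}[level]
    return zip_code[:kept_digits] + "*" * (len(zip_code) - kept_digits)


def generalize_sex(value, level):
    """Generalize sex: level 0 = S0 (unchanged), level 1 = S1 (suppressed)."""
    return value if level == 0 else "*"


z0 = ["48201", "41075", "41076", "41088", "41099"]
z1 = sorted({generalize_zip(z, 1) for z in z0})
z2 = sorted({generalize_zip(z, 2) for z in z0})
```

Applying level 1 to every value in Z0 yields the two values of Z1, and level 2 collapses them into the single value of Z2.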
Generalization Types
All Attributes:
Full domain generalization [Samarati 2001, LeFevre 2006]
Iyengar generalization [Iyengar 2002]
Cell-level generalization [Lunacek 2006]
Numerical Attributes
Predefined hierarchy [Iyengar 2002]
Computed hierarchy [LeFevre 2006]
Generalization Types
Initial microdata
Tuple   Age   ZipCode   Sex
r1      50    41076     Male
r2      30    41075     Female
r3      30    41099     Female
r4      20    48201     Male
r5      20    41075     Male

2-Anonymity

Full domain generalization (Iyengar g. is identical in this case)
Tuple   Age     ZipCode   Sex
r1      20-30   *****     Male
r2      20-30   *****     Male
r3      30-40   *****     Female
r4      30-40   *****     Female
r5      30-40   *****     Female

Cell-level generalization
Tuple   Age     ZipCode   Sex
r1      20-30   410**     Male
r2      20-30   410**     Male
r3      30-40   *****     Female
r4      30-40   *****     Female
r5      30-40   *****     Female
Attacks Against K-Anonymity
Unsorted Matching Attack
This attack is based on the order in which tuples
appear in the released table.
Solution:
Randomly sort the tuples before releasing.
[Sweeney 2002]
Attacks Against K-Anonymity
Complementary Release Attack
Different releases can be linked together to
compromise k-anonymity [Sweeney 2002].
Solution:
Consider all previously released tables before releasing a new
one, and try to avoid linking.
Other data holders may release data that can be used
in this kind of attack. In general, this kind of attack is hard to
prevent completely.
Temporal Attack
Adding or removing tuples may compromise k-anonymity protection [Sweeney 2002].
Attacks Against K-Anonymity
k-Anonymity does not provide privacy if:
Sensitive values in an equivalence class lack diversity [Truta 2006, Machanavajjhala 2006].
The attacker has background knowledge [Machanavajjhala 2006].

A 3-anonymous patient table
Zipcode   Age   Disease
476**     2*    Heart Disease
476**     2*    Heart Disease
476**     2*    Heart Disease
4790*     ≥40   Flu
4790*     ≥40   Heart Disease
4790*     ≥40   Cancer
476**     3*    Heart Disease
476**     3*    Cancer
476**     3*    Cancer

Homogeneity Attack
Bob: Zipcode 47678, Age 27. Every record in Bob's equivalence class (476**, 2*) has Heart Disease.
Background Knowledge Attack
Carl: Zipcode 47673, Age 36. Carl's equivalence class (476**, 3*) contains only Heart Disease and Cancer, so background knowledge that rules out one of them reveals the other.
P-Sensitive K-Anonymity Definition
P-sensitive K-anonymity property – A MM
satisfies the p-sensitive k-anonymity property if it
satisfies k-anonymity and the number of
distinct values of each confidential
attribute is at least p within each QI-cluster
of the MM [Truta 2006].
P-Sensitive K-Anonymity Example
RecID   Age   Zip     Sex      Illness
1       50    41076   Female   AIDS
2       30    41076   Male     Diabetes
3       30    41076   Male     AIDS
4       20    41076   Male     Asthma
5       20    41076   Male     Asthma
6       50    41076   Female   Diabetes
KA = { Age, Zip, Sex }
cl1 = {1, 6}; cl2 = {2, 3}; cl3 = {4, 5}
This microdata is NOT 2-sensitive 2-anonymous
P-Sensitive K-Anonymity Example
RecID   Age     Zip     Sex      Illness
1       50      41076   Female   AIDS
2       20-30   41076   Male     Diabetes
3       20-30   41076   Male     AIDS
4       20-30   41076   Male     Asthma
5       20-30   41076   Male     Asthma
6       50      41076   Female   Diabetes
KA = { Age, Zip, Sex }
cl1 = {1, 6}; cl2 = {2, 3, 4, 5}
This microdata is 2-sensitive 2-anonymous
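The two P-Sensitive K-Anonymity examples can be checked mechanically. The sketch below groups records by the key attributes Age, Zip, and Sex, then verifies the cluster size and the number of distinct Illness values; the function name and the helper `make` are hypothetical names introduced for this illustration.

```python
from collections import defaultdict


def is_p_sensitive_k_anonymous(records, quasi_identifiers, confidential, p, k):
    """Check k-anonymity plus at least p distinct values per confidential attribute per QI-cluster."""
    clusters = defaultdict(list)
    for rec in records:
        clusters[tuple(rec[a] for a in quasi_identifiers)].append(rec)
    for members in clusters.values():
        if len(members) < k:
            return False
        for attr in confidential:
            if len({m[attr] for m in members}) < p:
                return False
    return True


def make(ages):
    """Build the six example records with the given Age column."""
    sexes = ["Female", "Male", "Male", "Male", "Male", "Female"]
    illnesses = ["AIDS", "Diabetes", "AIDS", "Asthma", "Asthma", "Diabetes"]
    return [{"Age": a, "Zip": "41076", "Sex": s, "Illness": i}
            for a, s, i in zip(ages, sexes, illnesses)]


before = make(["50", "30", "30", "20", "20", "50"])
after = make(["50", "20-30", "20-30", "20-30", "20-30", "50"])
qi = ["Age", "Zip", "Sex"]
```

In the first table the cluster {4, 5} contains only Asthma, so the check fails for p = 2; after generalizing Age to 20-30 the merged cluster {2, 3, 4, 5} has three distinct illnesses and the check passes.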
l-Diversity
Distinct l-diversity
Each equivalence class has at least l well-represented
sensitive values.
Limitation:
Doesn’t prevent the probabilistic inference attacks
Ex: One equivalence class contains ten tuples. For the
“Disease” attribute, one of them is “Cancer”, one is “Heart Disease”
and the remaining eight are “Flu”. This satisfies 3-diversity, but
the attacker can still affirm that the target person’s disease is
“Flu” with an accuracy of 80%.
[Machanavajjhala 2006]
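The limitation can be made concrete by counting values in the example class. This is a small sketch; the function names `distinct_diversity` and `best_guess` are assumptions made for the illustration.

```python
from collections import Counter


def distinct_diversity(values):
    """Number of distinct sensitive values in an equivalence class."""
    return len(set(values))


def best_guess(values):
    """Most frequent sensitive value and the fraction of the class that has it."""
    value, count = Counter(values).most_common(1)[0]
    return value, count / len(values)


# The ten-tuple equivalence class from the example.
diseases = ["Cancer", "Heart Disease"] + ["Flu"] * 8
```

The class has three distinct values, yet guessing the most frequent one is right for eight of its ten tuples.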
l-Diversity
Entropy l-diversity
Each equivalence class not only must have enough different
sensitive values, but also the different sensitive values must be
distributed evenly enough.
The entropy of the distribution of sensitive values in each
equivalence class is at least log(l).
Sometimes this may be too restrictive. When some values are very
common, the entropy of the entire table may be very low. This
leads to a less conservative notion of l-diversity.
Recursive (c,l)-diversity
The most frequent value does not appear too frequently.
r_1 < c (r_l + r_{l+1} + … + r_m), where r_i is the number of times the i-th most frequent sensitive value appears in the equivalence class.
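Both conditions can be evaluated per equivalence class from the counts of its sensitive values. The sketch below uses the natural logarithm and a small tolerance for floating-point comparison; the function names and the two sample classes are assumptions made for the illustration.

```python
import math
from collections import Counter


def entropy_l_diverse(values, l):
    """Entropy of the sensitive-value distribution is at least log(l)."""
    n = len(values)
    entropy = -sum((c / n) * math.log(c / n) for c in Counter(values).values())
    return entropy >= math.log(l) - 1e-12  # tolerance for rounding error


def recursive_cl_diverse(values, c, l):
    """r_1 < c * (r_l + r_{l+1} + ... + r_m), with counts sorted in decreasing order."""
    counts = sorted(Counter(values).values(), reverse=True)
    # If fewer than l distinct values exist, the tail sum is 0 and the test fails.
    return counts[0] < c * sum(counts[l - 1:])


skewed = ["Flu"] * 8 + ["Cancer", "Heart Disease"]
balanced = ["Flu"] * 4 + ["Cancer"] * 3 + ["Heart Disease"] * 3
```

The skewed class has three distinct values but fails entropy 2-diversity, while the balanced class passes it; recursive (c,l)-diversity lets the data owner tune how dominant the most frequent value may be through c.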
Limitations of l-Diversity & p-sensitive
k-anonymity
Attribute disclosure is not completely prevented.
Skewness Attack [Li 2007]
Two sensitive values: HIV positive (1%) and HIV negative (99%).
Serious privacy risk: consider an equivalence class that contains a large number of positive records
compared to negative records.
l-diversity and p-sensitive k-anonymity do not differentiate between:
Equivalence class 1: 49 positive + 1 negative.
Equivalence class 2: 1 positive + 49 negative.
The overall distribution of sensitive values is not considered.
Limitations of l-Diversity & p-sensitive
k-anonymity
Attribute disclosure is not completely prevented.
Similarity Attack [Li 2007]

A 3-diverse patient table
Zipcode   Age   Salary   Disease
476**     2*    20K      Gastric Ulcer
476**     2*    30K      Gastritis
476**     2*    40K      Stomach Cancer
4790*     ≥40   50K      Gastritis
4790*     ≥40   100K     Flu
4790*     ≥40   70K      Bronchitis
476**     3*    60K      Bronchitis
476**     3*    80K      Pneumonia
476**     3*    90K      Stomach Cancer

Bob: Zip 47678, Age 27.
Conclusion
1. Bob’s salary is in [20k,40k], which is relatively low.
2. Bob has some stomach-related disease.
Semantic meanings of sensitive values are not considered.
t-Closeness: A New Privacy Measure
Rationale
An observer's belief about an individual's sensitive value changes as knowledge is gained:
B0: belief based on external knowledge.
B1: belief after seeing a completely generalized microdata, which reveals the overall distribution Q of sensitive values.
B2: belief after seeing the released microdata, which reveals the distribution Pi of sensitive values in each equi-class.

A completely generalized microdata
Age   Zipcode   ……   Gender   Disease
*     *         ……   *        Flu
*     *         ……   *        Heart Disease
*     *         ……   *        Cancer
.     .         ……   .        .
*     *         ……   *        Gastritis

A released microdata
Age   Zipcode   ……   Gender   Disease
2*    479**     ……   Male     Flu
2*    479**     ……   Male     Heart Disease
2*    479**     ……   Male     Cancer
.     .         ……   .        .
≥50   4766*     ……   *        Gastritis

Observations
Q should be public.
Knowledge gain in two parts:
Whole population (from B0 to B1).
Specific individuals (from B1 to B2).
We bound knowledge gain between B1 and B2 instead.
Principle
The distance between Q and Pi should be bounded by a threshold t.
[Li 2007]
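The principle can be sketched as a per-class distance check against the overall distribution Q. The slides do not fix the distance here; [Li 2007] uses the Earth Mover's Distance, whereas this sketch substitutes the simpler total variation distance as a stand-in. The function names and the choice of sample classes (the Disease column of the 3-anonymous patient table) are assumptions made for the illustration.

```python
from collections import Counter


def distribution(values, domain):
    """Relative frequency of each domain value in the given list."""
    counts = Counter(values)
    return [counts[v] / len(values) for v in domain]


def total_variation(p, q):
    """Total variation distance between two distributions over the same domain."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def max_distance(classes):
    """Largest distance between any class distribution Pi and the overall distribution Q."""
    all_values = [v for members in classes for v in members]
    domain = sorted(set(all_values))
    q = distribution(all_values, domain)
    return max(total_variation(distribution(m, domain), q) for m in classes)


def satisfies_t_closeness(classes, t):
    """True if every class distribution is within distance t of Q (stand-in distance)."""
    return max_distance(classes) <= t


classes = [
    ["Heart Disease", "Heart Disease", "Heart Disease"],
    ["Flu", "Heart Disease", "Cancer"],
    ["Heart Disease", "Cancer", "Cancer"],
]
```

The homogeneous first class is the farthest from Q (distance 4/9 under this stand-in measure), so it is the one that a small threshold t rejects.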
Extended P-Sensitive K-Anonymity
Value generalization hierarchy for a diagnosis attribute:
*****
  001-139 Infectious and parasitic diseases
    001-009 Intestinal infectious diseases
    …
    042 HIV Infection
      042 HIV Disease
    …
  140-239 Neoplasms
    140-149 Malignant neoplasm of lip, oral cavity, and pharynx
      140 Malignant neoplasm of lip
        140.0 Upper lip, vermilion border
        …
    …
  240-279 Endocrine, nutritional and metabolic diseases, and immunity disorders
  280-289 Diseases of the blood and blood-forming organs
  …
  800-999 Injury and poisoning
[Campan 2006]
Extended P-Sensitive K-Anonymity
Requirements: Let S be a confidential
attribute and HVS its value generalization
hierarchy. The following two requirements
must be met by the protected values in HVS:
All ground values in HVS are protected.
All the descendants of a protected internal value
in HVS are protected.
Extended P-Sensitive K-Anonymity
A protected value in the value generalization
hierarchy HVS of a confidential attribute S is
called strong if none of its ascendants
(including the root) is protected.
We call protected subtree of a hierarchy
HVS a subtree in HVS that has as root a
strong protected value.
Extended P-Sensitive K-Anonymity
The masked microdata (MM) satisfies the extended p-sensitive k-anonymity property if it satisfies k-anonymity and, for each group of tuples with an identical
combination of key attribute values that exists in MM, the
values of each confidential attribute S within that group
belong to at least p different protected subtrees in HVS.
Disclosure Control Techniques
Remove Identifiers [Truta 2003]
Global and Local Recoding [Willemborg 2001]
Local Suppression [Willemborg 2001]
Sampling [Skinner 1994]
Microaggregation [Domingo-Ferrer 2002]
Simulation [Willemborg 2001]
Adding Noise [Kim 1986]
Rounding [Willemborg 2001]
Data Swapping [Anderson 2004]
Etc.
Disclosure Control Techniques
Different disclosure control techniques are applied to the
following initial microdata:
RecID   Name            SSN         Age   State   Diagnosis      Income   Billing
1       John Wayne      123456789   44    MI      AIDS           45,500   1,200
2       Mary Gore       323232323   44    MI      Asthma         37,900   2,500
3       John Banks      232345656   55    MI      AIDS           67,000   3,000
4       Jesse Casey     333333333   44    MI      Asthma         21,000   1,000
5       Jack Stone      444444444   55    MI      Asthma         90,000   900
6       Mike Kopi       666666666   45    MI      Diabetes       48,000   750
7       Angela Simms    777777777   25    IN      Diabetes       49,000   1,200
8       Nike Wood       888888888   35    MI      AIDS           66,000   2,200
9       Mikhail Aaron   999999999   55    MI      AIDS           69,000   4,200
10      Sam Pall        100000000   45    MI      Tuberculosis   34,000   3,100
Remove Identifiers
Identifiers such as Name, SSN, etc. are removed.
RecID   Age   State   Diagnosis      Income   Billing
1       44    MI      AIDS           45,500   1,200
2       44    MI      Asthma         37,900   2,500
3       55    MI      AIDS           67,000   3,000
4       44    MI      Asthma         21,000   1,000
5       55    MI      Asthma         90,000   900
6       45    MI      Diabetes       48,000   750
7       25    IN      Diabetes       49,000   1,200
8       35    MI      AIDS           66,000   2,200
9       55    MI      AIDS           69,000   4,200
10      45    MI      Tuberculosis   34,000   3,100
Sampling
Sampling is the disclosure control method in which only a subset of
records is released.
If n is the number of records in the initial microdata and t is the number
of released records, sf = t / n is called the sampling factor.
Simple random sampling is the more frequently used technique: each
record is chosen entirely by chance, and each member of
the population has an equal chance of being included in the sample.
RecID   Age   State   Diagnosis   Income   Billing
5       55    MI      Asthma      90,000   900
4       44    MI      Asthma      21,000   1,000
8       35    MI      AIDS        66,000   2,200
9       55    MI      AIDS        69,000   4,200
7       25    IN      Diabetes    49,000   1,200
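Simple random sampling with a given sampling factor can be sketched with the standard library. The function name `simple_random_sample`, the rounding of sf * n to an integer, and the `seed` parameter are assumptions made for this illustration.

```python
import random


def simple_random_sample(records, sf, seed=None):
    """Release t = round(sf * n) records chosen uniformly at random without replacement."""
    rng = random.Random(seed)
    t = round(sf * len(records))
    return rng.sample(records, t)


# RecIDs 1..10 of the initial microdata; sf = 0.5 releases five of them.
rec_ids = list(range(1, 11))
released = simple_random_sample(rec_ids, 0.5, seed=1)
```

With n = 10 and sf = 0.5 the released sample has five distinct records, as in the table above; which five are chosen depends on the random seed.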
Microaggregation
Order the records of the initial microdata by an attribute, create groups
of consecutive values, and replace those values with the group average.
Example: microaggregation for attribute Income with minimum group size 3.
The total sum of all Income values remains the same.
RecID   Age   State   Diagnosis      Income   Billing
2       44    MI      Asthma         30,967   2,500
4       44    MI      Asthma         30,967   1,000
10      45    MI      Tuberculosis   30,967   3,100
1       44    MI      AIDS           47,500   1,200
6       45    MI      Diabetes       47,500   750
7       25    IN      Diabetes       47,500   1,200
3       55    MI      AIDS           73,000   3,000
5       55    MI      Asthma         73,000   900
8       35    MI      AIDS           73,000   2,200
9       55    MI      AIDS           73,000   4,200
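The Income example can be reproduced with a short univariate sketch. It assumes fixed groups of the minimum size taken in sorted order, with any leftover records joining the last group; this grouping rule and the function name `microaggregate` are assumptions, since the slide does not state how group boundaries are chosen.

```python
def microaggregate(values, min_size):
    """Replace each value by the mean of its group of consecutive sorted values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    n_groups = max(len(values) // min_size, 1)
    result = [0.0] * len(values)
    for g in range(n_groups):
        start = g * min_size
        # The last group absorbs the leftover records.
        end = (g + 1) * min_size if g < n_groups - 1 else len(values)
        group = order[start:end]
        mean = sum(values[i] for i in group) / len(group)
        for i in group:
            result[i] = mean
    return result


# Income for RecID 1..10 of the initial microdata.
income = [45500, 37900, 67000, 21000, 90000, 48000, 49000, 66000, 69000, 34000]
masked_income = microaggregate(income, 3)
```

Rounded to whole numbers, the result matches the masked Income column (30,967 for records 2, 4, 10; 47,500 for records 1, 6, 7; 73,000 for records 3, 5, 8, 9), and the column total is unchanged.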
Data Swapping
In this disclosure control method a sequence of so-called elementary swaps
is applied to the microdata.
An elementary swap consists of two actions:
A random selection of two records i and j from the microdata.
A swap (interchange) of the values of the attribute being swapped for records i
and j.
Example: the Income values of records 1 and 6 are swapped.
RecID   Age   State   Diagnosis      Income   Billing
1       44    MI      AIDS           48,000   1,200
2       44    MI      Asthma         37,900   2,500
3       55    MI      AIDS           67,000   3,000
4       44    MI      Asthma         21,000   1,000
5       55    MI      Asthma         90,000   900
6       45    MI      Diabetes       45,500   750
7       25    IN      Diabetes       49,000   1,200
8       35    MI      AIDS           66,000   2,200
9       55    MI      AIDS           69,000   4,200
10      45    MI      Tuberculosis   34,000   3,100
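A sequence of elementary swaps can be sketched as follows. The function name `swap_attribute`, the `n_swaps` and `seed` parameters, and the shortened record layout are assumptions made for this illustration.

```python
import random


def swap_attribute(records, attribute, n_swaps, seed=None):
    """Apply n_swaps elementary swaps of one attribute and return a masked copy."""
    rng = random.Random(seed)
    swapped = [dict(r) for r in records]  # leave the initial microdata untouched
    for _ in range(n_swaps):
        # Random selection of two distinct records i and j.
        i, j = rng.sample(range(len(swapped)), 2)
        # Interchange the attribute values of records i and j.
        swapped[i][attribute], swapped[j][attribute] = (
            swapped[j][attribute],
            swapped[i][attribute],
        )
    return swapped


records = [
    {"RecID": 1, "Diagnosis": "AIDS", "Income": 45500},
    {"RecID": 2, "Diagnosis": "Asthma", "Income": 37900},
    {"RecID": 6, "Diagnosis": "Diabetes", "Income": 48000},
    {"RecID": 7, "Diagnosis": "Diabetes", "Income": 49000},
]
masked_records = swap_attribute(records, "Income", n_swaps=2, seed=0)
```

Swapping keeps the set of Income values, and therefore every univariate statistic of Income, unchanged; only the association between Income and the other attributes is perturbed.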
Disclosure Risk and Information Loss
Disclosure risk - the risk that a given form of
disclosure will arise if the masked microdata is
released [Chen 1998].
Value/Attribute disclosure
Identity disclosure
Information loss - the amount of information that
exists in the initial microdata but, because of the
disclosure control methods applied, is no longer
present in the masked microdata [Willemborg 2001].
Disclosure Risk
Individual measures: measure the risk per
record. Usually, it is expressed by means of the
probability of correctly re-identifying a unit, or by
means of the uniqueness and rareness in the
sample or population [Willemborg 2001].
Global measures: measure the risk for the entire
dataset. Usually, it is expressed by means of the
expected number of correct re-identifications
[Domingo-Ferrer 2003].
Frequency Count
Masked microdata
Age   State   Diagnosis
44    MI      Asthma
44    MI      Asthma
45    MI      Tuberculosis
44    MI      AIDS
45    MI      Diabetes
25    IN      Diabetes
55    MI      AIDS
55    MI      Asthma
55    MI      AIDS
55    MI      AIDS

Frequency counts of each combination of values
Age   State   Diagnosis      Freq. Count
44    MI      Asthma         2
45    MI      Tuberculosis   1
44    MI      AIDS           1
45    MI      Diabetes       1
25    IN      Diabetes       1
55    MI      AIDS           3
55    MI      Asthma         1
Sample Unique and Population Unique
Sample unique. A record is a sample
unique if fc_k = 1, i.e., it is the only record in the
sample microdata with the combination k
of values of the key variables.
Population unique. A record is a
population unique if FC_k = 1. For census data, or
when an administrative register covering the
whole population is available, FC_k is known for
each k and the risk measure can be computed.
[Elliot 2002]
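Frequency counts and sample uniques can be computed with a counter over the key-variable combinations. The sketch below uses the ten records of the Frequency Count example; the function names `frequency_counts` and `sample_uniques` are assumptions made for the illustration.

```python
from collections import Counter


def frequency_counts(records, keys):
    """fc_k: number of records showing each combination k of key-variable values."""
    return Counter(tuple(r[k] for k in keys) for r in records)


def sample_uniques(records, keys):
    """Positions of the records whose combination occurs exactly once in the sample."""
    fc = frequency_counts(records, keys)
    return [i for i, r in enumerate(records) if fc[tuple(r[k] for k in keys)] == 1]


rows = [
    (44, "MI", "Asthma"), (44, "MI", "Asthma"), (45, "MI", "Tuberculosis"),
    (44, "MI", "AIDS"), (45, "MI", "Diabetes"), (25, "IN", "Diabetes"),
    (55, "MI", "AIDS"), (55, "MI", "Asthma"), (55, "MI", "AIDS"), (55, "MI", "AIDS"),
]
keys = ["Age", "State", "Diagnosis"]
records = [dict(zip(keys, row)) for row in rows]
fc = frequency_counts(records, keys)
uniques = sample_uniques(records, keys)
```

Five of the ten records are sample uniques here; whether they are also population uniques cannot be decided from the sample alone.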
Information Loss Measures
Compare IM and MM directly: greater similarity
between the values of the key attributes
indicates less information loss.
Compare statistics (covariance, correlation,
means) between IM and MM.
Average the two approaches.
[Domingo-Ferrer 2001]
Information Loss Measures
n – the number of records in the initial microdata X and the masked microdata X’
p – the number of attributes in X and X’
V and R – the covariance and correlation matrices of X
V’ and R’ – the covariance and correlation matrices of X’
S and S’ – the variance vectors of X and X’ (the values on the
main diagonal of the corresponding covariance matrix)
X̄ and X̄’ – the vectors of attribute averages of X and X’
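Information Loss Measures
For each pair of objects, the mean absolute error and the mean variation are:

X – X’
Mean abs. error: \frac{\sum_{j=1}^{p} \sum_{i=1}^{n} |x_{ij} - x'_{ij}|}{n p}
Mean variation: \frac{\sum_{j=1}^{p} \sum_{i=1}^{n} \frac{|x_{ij} - x'_{ij}|}{|x_{ij}|}}{n p}

X̄ – X̄’
Mean abs. error: \frac{\sum_{j=1}^{p} |\bar{x}_j - \bar{x}'_j|}{p}
Mean variation: \frac{\sum_{j=1}^{p} \frac{|\bar{x}_j - \bar{x}'_j|}{|\bar{x}_j|}}{p}

V – V’
Mean abs. error: \frac{\sum_{j=1}^{p} \sum_{1 \le i \le j} |v_{ij} - v'_{ij}|}{p(p+1)/2}
Mean variation: \frac{\sum_{j=1}^{p} \sum_{1 \le i \le j} \frac{|v_{ij} - v'_{ij}|}{|v_{ij}|}}{p(p+1)/2}

S – S’
Mean abs. error: \frac{\sum_{j=1}^{p} |v_{jj} - v'_{jj}|}{p}
Mean variation: \frac{\sum_{j=1}^{p} \frac{|v_{jj} - v'_{jj}|}{|v_{jj}|}}{p}

R – R’
Mean abs. error: \frac{\sum_{j=1}^{p} \sum_{1 \le i < j} |r_{ij} - r'_{ij}|}{p(p-1)/2}
Mean variation: \frac{\sum_{j=1}^{p} \sum_{1 \le i < j} \frac{|r_{ij} - r'_{ij}|}{|r_{ij}|}}{p(p-1)/2}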
Conclusions
Data privacy is investigated from various perspectives.
Theoretical results need to be better applied to real
data.
More collaboration is needed between various groups.
More collaboration is needed between practitioners and
researchers.
More collaboration is needed to advance data privacy
research and practical data privacy applications.
References
Anderson M., Fienberg S.E. (2004), U.S. Census Confidentiality: Perception and Reality,
Bulletin of the International Statistical Institute.
Campan A., Truta T.M. (2006), Extended P-Sensitive K-Anonymity, Studia Universitatis
Babes-Bolyai, Informatica, Vol. 51, No. 2, pp. 19 – 30.
Campan A., Truta T.M. (2008), A Clustering Approach for Data and Structural Anonymity in
Social Networks, 2nd ACM SIGKDD International Workshop on Privacy, Security, and Trust
in KDD (PinKDD2008), Las Vegas, Nevada.
Chen G., Keller-McNulty S. (1998), Estimation of Deidentification Disclosure Risk in
Microdata, Journal of Official Statistics, Vol. 14, No. 1, 79-95
Clifton C., Marks D. (1996), Security and Privacy Implication of Data Mining, Proceedings of
the ACM SIGMOD Workshop on Research Issues on Data Mining and Knowledge
Discovery, 15 – 19
Domingo-Ferrer J., Mateo-Sanz J., Torra V. (2001), Comparing SDC Methods for Microdata
on the Basis of Information Loss and Disclosure Risk, Pre-proceedings of ETK-NTTS'2001
(vol. 2), Luxembourg: Eurostat, 807-826.
Domingo-Ferrer J., Torra V. (2003), Disclosure risk assessment in statistical microdata
protection via advanced record linkage, Statistics and Computing, vol. 13, no. 4, pp. 343-354.
Elliot M. (2002) Integrating File and Record Level Disclosure Risk Assessment, In J.
Domingo-Ferrer (Ed). Inference Control in Statistical Databases, Springer Verlag.
Evfimievski A., Gehrke J., Srikant R. (2003), Limiting Privacy Breaches in Privacy
Preserving Data Mining, Proceedings of the PODS, 2003.
Fung B. (2007), Privacy Preserving Data Publishing, Ph.D. Thesis, Simon Fraser University.
Hay M., Miklau G., Jensen D., Weiss P., Srivastava S. (2007), Anonymizing Social
Networks, University of Massachusetts Amherst, Technical Report No. 07-19, Available
online at: http://kdl.cs.umass.edu/papers/hay-et-al-tr0719.pdf.
Kalnis P., Ghinita G., Mouratidis K., Papadias D. (2007), Preventing Location-Based Identity
Inference in Anonymous Spatial Queries, IEEE Transactions on Knowledge and Data
Engineering (IEEE TKDE), 19(12), 1719-1733.
Kantarcioglu M. and Clifton C. (2004), Privacy-preserving distributed mining of association
rules on horizontally partitioned data. IEEE Transactions on Knowledge and Data
Engineering, 16(9): 1026-1037.
Kim J.J. (1986), A Method for Limiting Disclosure in Microdata Based on Random Noise and
Transformation, American Statistical Association, Proceedings of the Section on Survey
Research Methods, 303-308.
Iyengar V. (2002), Transforming Data to Satisfy Privacy Constraints, Proceedings of the
Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining,
279 – 288.
Lambert D. (1993), Measures of Disclosure Risk and Harm. Journal of Official Statistics, Vol.
9, 313-331
LeFevre K., DeWitt D., Ramakrishnan R. (2005), Incognito: Efficient Full-Domain K-Anonymity,
Proceedings of the ACM SIGMOD, Baltimore, Maryland, 49-60.
Lunacek M., Whitley D, Ray I. (2006), A Crossover Operator for the k-Anonymity Problem,
Proceedings of the GECCO Conference, 1713 – 1720.
Machanavajjhala A., Gehrke J., Kifer D. (2006), L-Diversity: Privacy beyond K-Anonymity, In
IEEE International Conference on Data Engineering (ICDE), 24.
Samarati P. (2001), Protecting Respondents Identities in Microdata Release, IEEE
Transactions on Knowledge and Data Engineering, Vol. 13, No. 6, 1010-1027.
Skinner C.J., Marsh C., Openshaw S., Wymer C. (1994), Disclosure Control for Census
Microdata, Journal of Official Statistics, 31-51.
Sweeney L. (2001), Computational Disclosure Control: A Primer on Data Privacy Protection,
Ph.D. Thesis, MIT.
Sweeney, L., (2002) k-Anonymity: A Model for Protecting Privacy, International Journal on
Uncertainty, Fuzziness, and Knowledge-based Systems, Vol. 10, No. 5, pp. 557 – 570.
Sweeney L., (2002), Comments on Standards of Privacy of Individually Identifiable Health
Information, Addressed to the Department of Health and Human Services, April 4.
Tan P.N., Steinbach M., Kumar V. (ed) (2006), Introduction to Data Mining. Addison Wesley.
Truta T.M. (2004), Adaptive Disclosure Control for Healthcare Microdata, Ph.D. Thesis,
Wayne State University
Truta T.M., Vinay B. (2006), Privacy Protection: p-Sensitive k-Anonymity Property,
International Workshop of Privacy Data Management (PDM2006), In Conjunction with 22nd
International Conference of Data Engineering (ICDE), Atlanta, Georgia.
Truta T.M., Fotouhi F., Barth-Jones D. (2003), Disclosure Risk Measures for Microdata,
Proceedings of the International Conference on Scientific and Statistical Database
Management, Cambridge, Ma, 15 – 22.
Vaidya J., Clifton C. (2002), Privacy Preserving Association Rule Mining in Vertically
Partitioned Data, Proceedings of the 8th ACM SIGKDD International Conference on
Knowledge Discovery and Data Mining, 639-644.
Verykios V. S., Bertino E., Fovino I. N., Provenza L. P., Saygin Y., Theodoridis Y. (2004),
State-of-the-Art in Privacy Preserving Data Mining, SIGMOD Record, Vol. 33, No. 1, 50-57.
Xu Y., Wang K., Fu A.W.C., She R., Pei J. (2008) Privacy-Preserving Data Stream
Classification, in Privacy-Preserving Data Mining Models and Algorithms, Springer.
Willemborg L., Waal T. (ed) (1996), Statistical Disclosure Control in Practice. Springer
Verlag.
Willemborg L., Waal T. (ed) (2001), Elements of Statistical Disclosure Control, Springer
Verlag.
Questions