Anonymizing Sequential Releases
Ke Wang, Simon Fraser University
Benjamin C. M. Fung, Simon Fraser University
ACM SIGKDD 2006
Motivation: Sequential Releases
• Previous works address single release only.
• Data are released in multiple shots.
• An organization makes a new release:
– New information becomes available.
– A tailored view for each data sharing purpose.
– Separate release for sensitive information and
identifying information.
• Related releases sharpen the identification
of individuals by a global quasi-identifier.
T1: Current Release
Pid  Name   Job       Class
1    Alice  Banker    c1
2    Alice  Banker    c1
3    Bob    Clerk     c2
4    Bob    Driver    c3
5    Cathy  Engineer  c4

T2: Previous Release
Pid  Job       Disease
1    Banker    Cancer
2    Banker    Cancer
3    Clerk     HIV
4    Driver    Cancer
5    Engineer  HIV

The join on T1.Job = T2.Job
Pid  Name   Job       Disease  Class
1    Alice  Banker    Cancer   c1
2    Alice  Banker    Cancer   c1
3    Bob    Clerk     HIV      c2
4    Bob    Driver    Cancer   c3
5    Cathy  Engineer  HIV      c4
-    Alice  Banker    Cancer   c1
-    Alice  Banker    Cancer   c1

Do not want Name to be linked to Disease in the join of the two releases.
The join sharpens identification: {Bob, HIV} has group size 1.
The join weakens identification: {Alice, Cancer} has group size 4.
Lossy join: combats the join attack.
The join enables inferences across tables: Alice → Cancer has 100% confidence.
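The three observations about the join can be checked with a short Python sketch. The tuples mirror T1 and T2 from the slides; the variable names (`joined`, `group_size`, `confidence`) are illustrative and not part of the original deck.

```python
from collections import Counter

# T1: (Pid, Name, Job, Class); T2: (Pid, Job, Disease), copied from the slides.
T1 = [(1, "Alice", "Banker", "c1"), (2, "Alice", "Banker", "c1"),
      (3, "Bob", "Clerk", "c2"), (4, "Bob", "Driver", "c3"),
      (5, "Cathy", "Engineer", "c4")]
T2 = [(1, "Banker", "Cancer"), (2, "Banker", "Cancer"), (3, "Clerk", "HIV"),
      (4, "Driver", "Cancer"), (5, "Engineer", "HIV")]

# The join uses Job only, as on the slide (T1.Job = T2.Job).
joined = [(name, job, disease, cls)
          for _, name, job, cls in T1
          for _, job2, disease in T2
          if job == job2]

# Group size of each {Name, Disease} pair in the join.
group_size = Counter((name, disease) for name, _, disease, _ in joined)

# Confidence of the inference Name -> Disease in the join.
per_name = Counter(name for name, _, _, _ in joined)
confidence = {(n, d): c / per_name[n] for (n, d), c in group_size.items()}

print(len(joined))                        # 7 joined records
print(group_size[("Bob", "HIV")])         # 1: the join sharpens identification
print(group_size[("Alice", "Cancer")])    # 4: the lossy join weakens it
print(confidence[("Alice", "Cancer")])    # 1.0: inference across tables
```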
Related Work
• k-anonymity [SS98, FWY05, BA05, LDR05, WYC04, WLFW06]
– Quasi-identifier (QID): a set of identifying
attributes in the table. If some record is linked to
an external source by a QID value, so are at
least k-1 other records.
– The database is made anonymous to itself.
– In sequential releases, the database must be
made anonymous to the combination of all
releases thus far.
Related Work
• l-diversity [MGK06]
– Ensures that sensitive values are “well-represented” in each QID group,
measured by entropy.
• Confidence limiting [WFY05, WFY06]:
qid → s, confidence < h
where qid is a value on QID, s is a sensitive
value.
Related Work
• View releases
– e.g., T1 and T2 are two views, both can be
modified before the release: more room for
satisfying privacy and information
requirements.
– [MW04, DP05] measure information disclosure
of a view set wrt a secret view.
– [YWJ05, KG06] detect privacy violation by a
view set over a base table.
– They measure or detect violations, but do not
remove them.
Sequential Release
• Sequential release:
– Current release T1. Previous release T2.
– T1 was unknown when T2 was released.
– T2, once released, cannot be modified when T1 is
released.
• Solution #1: k-anonymize all attributes in T1.
– Excessive distortion.
• Solution #2: generalize T1 based on T2.
– Monotonically distort the later release.
• Solution #3: release a “complete” cohort of all
potential releases anonymized at one time.
– Requires predicting all future releases.
Intuition of Our Approach
• A lossy join hides the true join relationship
to cripple a global QID.
• Generalize the current release T1 so that
the join with the previous release T2
becomes lossy enough to disorient the
attacker.
• Two general notions of privacy: (X,Y)-anonymity and (X,Y)-linkability, where X
and Y are sets of attributes.
(X,Y)-Privacy
• k-anonymity: # of distinct records for each
QID value ≥ k.
• (X,Y)-anonymity: # of distinct Y values for
each X value ≥ k.
• (X,Y)-linkability: the maximum confidence
that a record contains y given that it
contains x ≤ k, where (x,y) are values on X
and Y.
• These notions generalize k-anonymity [SS98] and
confidence limiting [WFY05, WFY06].
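The two notions can be written compactly as follows. The symbols A_Y(X), L_Y(X) and l_y(x) are the ones used on the later slides; T[x] (the records containing x), T[x, y] and a_Y(x) are shorthand introduced here for illustration.

```latex
\begin{align*}
a_Y(x) &= \bigl|\{\, t[Y] : t \in T,\ t[X] = x \,\}\bigr| , &
A_Y(X) &= \min_{x} \, a_Y(x), \\
l_y(x) &= \frac{|T[x, y]|}{|T[x]|} , &
L_Y(X) &= \max_{x,\,y} \, l_y(x).
\end{align*}
```

Under this notation, (X,Y)-anonymity requires A_Y(X) ≥ k and (X,Y)-linkability requires L_Y(X) ≤ k.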
Example: (X,Y)-Anonymity
Pid  Job     Zip  PoB     Test
1    Banker  123  Canada  HIV
1    Banker  123  Canada  Diabetes
1    Banker  123  Canada  Eye
2    Clerk   456  Japan   HIV
2    Clerk   456  Japan   Diabetes
2    Clerk   456  Japan   Eye
2    Clerk   456  Japan   Heart
• QID = {Job, Zip, PoB} is not a key.
• k-anonymity fails to ensure that each value
on QID is linked to at least k distinct
patients.
Example: (X,Y)-Anonymity
• With (X,Y)-anonymity,
– specify the anonymity wrt patients by letting
X = {Job, Zip, PoB} and
Y = Pid
– Each X group must be linked to at least k
distinct values on Pid.
• If X = {Job, Zip, PoB} and Y = Test, each X
group is required to be linked to at least k
distinct tests.
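A minimal sketch of the difference, using the seven records from the example table. The helper names (`project`, `min_records_per_group`, `min_distinct_y_per_group`) are hypothetical and chosen only for this illustration.

```python
from collections import Counter, defaultdict

COLUMNS = ("Pid", "Job", "Zip", "PoB", "Test")
ROWS = [dict(zip(COLUMNS, r)) for r in [
    (1, "Banker", "123", "Canada", "HIV"),
    (1, "Banker", "123", "Canada", "Diabetes"),
    (1, "Banker", "123", "Canada", "Eye"),
    (2, "Clerk", "456", "Japan", "HIV"),
    (2, "Clerk", "456", "Japan", "Diabetes"),
    (2, "Clerk", "456", "Japan", "Eye"),
    (2, "Clerk", "456", "Japan", "Heart"),
]]

def project(row, attrs):
    """Return the row's values on the given attributes as a tuple."""
    return tuple(row[a] for a in attrs)

def min_records_per_group(rows, x):
    """k-anonymity view: smallest number of records sharing one X value."""
    return min(Counter(project(r, x) for r in rows).values())

def min_distinct_y_per_group(rows, x, y):
    """(X,Y)-anonymity view: smallest number of distinct Y values per X value."""
    groups = defaultdict(set)
    for r in rows:
        groups[project(r, x)].add(project(r, y))
    return min(len(ys) for ys in groups.values())

QID = ("Job", "Zip", "PoB")
records_k = min_records_per_group(ROWS, QID)                 # counts records
patients_k = min_distinct_y_per_group(ROWS, QID, ("Pid",))   # counts patients
tests_k = min_distinct_y_per_group(ROWS, QID, ("Test",))     # counts tests
print(records_k, patients_k, tests_k)  # 3 1 3
```

The table looks 3-anonymous when records are counted, yet each QID group belongs to a single patient, which is what choosing Y = Pid exposes.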
Example: (X,Y)-Linkability
Pid  Job     Zip  PoB     Test
1    Banker  123  Canada  HIV
2    Banker  123  Canada  HIV
3    Banker  123  Canada  HIV
4    Banker  123  Canada  Diabetes
5    Clerk   456  Japan   Diabetes
6    Clerk   456  Japan   Diabetes
• {Banker,123,Canada} → HIV (75% confidence).
• With Y = Test, the (X,Y)-linkability states that no
test can be inferred from a value on X with a
confidence higher than a given threshold.
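The confidence values for this table can be computed directly. The sketch below assumes X = {Job, Zip, PoB} and Y = Test as on the slide; the variable names are illustrative.

```python
from collections import Counter

# (Job, Zip, PoB, Test) for the six patients in the example table.
ROWS = ([("Banker", "123", "Canada", "HIV")] * 3
        + [("Banker", "123", "Canada", "Diabetes")]
        + [("Clerk", "456", "Japan", "Diabetes")] * 2)

x_count = Counter(r[:3] for r in ROWS)   # records per X value
xy_count = Counter(ROWS)                 # records per (X value, Test) pair

# Confidence of inferring the test from the X value.
confidence = {xy: n / x_count[xy[:3]] for xy, n in xy_count.items()}
max_confidence = max(confidence.values())

print(confidence[("Banker", "123", "Canada", "HIV")])  # 0.75
print(max_confidence)                                  # 1.0
```

Note that the maximum over the whole table is 100%, reached by {Clerk,456,Japan} → Diabetes, so any threshold below 100% would be violated by this table as it stands.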
Problem Statement
• The data holder has previously released
T2 and wants to release T1, where T2 and
T1 are projections of the same underlying
table.
• The goal is to ensure (X,Y)-privacy on the join of
T1 and T2.
• Sequential anonymization generalizes
T1 on X ∩ att(T1) so that the join of T1 and
T2 preserves (X,Y)-privacy and T1
remains as useful as possible.
Generalization / Specialization
• Each generalization replaces all child
values with the parent value.
– A cut contains exactly one value on every root-to-leaf path.
• Each specialization v → {v1,…,vc}
replaces the value v in every record
containing v with the child value vi that is
consistent with the original domain value
in the record.
Taxonomy tree for Job:
ANY
  Professional
    Engineer
    Lawyer
  Admin
    Banker
    Clerk
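A small sketch of cuts, generalization and specialization on the Job taxonomy from this slide. The function names (`is_cut`, `generalize`, `specialize`) are hypothetical, chosen for illustration only.

```python
# Job taxonomy from the slide: parent -> children.
CHILDREN = {"ANY": ["Professional", "Admin"],
            "Professional": ["Engineer", "Lawyer"],
            "Admin": ["Banker", "Clerk"]}
PARENT = {c: p for p, cs in CHILDREN.items() for c in cs}

def path_to_root(v):
    """Return v followed by all of its ancestors up to the root."""
    path = [v]
    while v in PARENT:
        v = PARENT[v]
        path.append(v)
    return path

def is_cut(values):
    """A cut has exactly one value on every root-to-leaf path."""
    leaves = [v for v in PARENT if v not in CHILDREN]
    return all(sum(p in values for p in path_to_root(leaf)) == 1
               for leaf in leaves)

def generalize(value, cut):
    """Replace a domain value by the cut value on its root-to-leaf path."""
    while value not in cut:
        value = PARENT[value]
    return value

def specialize(cut, v):
    """Replace v in the cut by its child values."""
    return (cut - {v}) | set(CHILDREN[v])

cut = specialize({"ANY"}, "ANY")     # {"Professional", "Admin"}
jobs = ["Engineer", "Lawyer", "Banker", "Clerk"]
released = [generalize(j, cut) for j in jobs]
finer_cut = specialize(cut, "Admin")  # {"Professional", "Banker", "Clerk"}
```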
Generalization / Specialization
• An interval of a continuous attribute is split
on-the-fly to maximize information utility.
– e.g., age [30-40) → [30-37), [37-40)
– The split at 37 maximizes the information
gain.
• A taxonomy tree is dynamically grown for
each continuous (non-join) attribute.
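One way to pick such a split point is to try every candidate and keep the one with the largest reduction in class entropy. The sketch below assumes entropy-based information gain and uses made-up ages and class labels; it is not the authors' implementation.

```python
import math
from collections import Counter

def entropy(labels):
    """Class entropy (in bits) of a list of labels."""
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def best_split(records):
    """records: list of (value, class_label). Returns (split, gain) for the
    split s that sends values < s to the left interval and values >= s to the
    right interval with the highest information gain."""
    base = entropy([c for _, c in records])
    best = (None, -1.0)
    # Candidate split points: every distinct value except the smallest,
    # so both intervals are non-empty.
    for s in sorted({v for v, _ in records})[1:]:
        left = [c for v, c in records if v < s]
        right = [c for v, c in records if v >= s]
        remainder = (len(left) * entropy(left)
                     + len(right) * entropy(right)) / len(records)
        gain = base - remainder
        if gain > best[1]:
            best = (s, gain)
    return best

# Hypothetical records with age in [30-40) and a binary class label.
AGES = [(30, "N"), (32, "N"), (34, "N"), (36, "N"),
        (37, "Y"), (38, "Y"), (39, "Y")]
split, gain = best_split(AGES)
print(split)  # 37, i.e. [30-40) -> [30-37), [37-40)
```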
Match Function
• Given T1 and T2, the attacker may apply
prior knowledge to match the records in T1
and T2.
• So, the data holder applies such prior
knowledge for matching:
– schema information of T1 and T2.
– taxonomies for attributes.
– the inclusion-exclusion principle described next.
Match Function
• Let t1 ∈ T1 and t2 ∈ T2.
• Consistency Predicate: t1.A matches t2.A
if they are on the same generalization path
for attribute A.
– e.g., Male matches Single Male.
• Inconsistency Predicate: t1.A matches
t2.B only if t1.A and t2.B are not
semantically inconsistent.
– Excludes impossible matches.
– e.g., Male and Pregnant are semantically
inconsistent, so are Married Male and 6
Month Pregnant.
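The two predicates can be sketched as follows. The taxonomy (`PARENT`) and the inconsistency list (`INCONSISTENT`) are assumptions made up for this illustration: "Single Male" and "Married Male" are treated as children of "Male", and "6 Month Pregnant" as a child of "Pregnant".

```python
# Hypothetical taxonomies: child -> parent.
PARENT = {
    "Male": "ANY_Sex", "Female": "ANY_Sex",
    "Single Male": "Male", "Married Male": "Male",
    "Pregnant": "ANY_Condition", "6 Month Pregnant": "Pregnant",
}
# Hypothetical prior knowledge: pairs of values that cannot co-occur.
INCONSISTENT = {frozenset({"Male", "Pregnant"})}

def path_to_root(v):
    """Return v followed by all of its ancestors."""
    path = [v]
    while v in PARENT:
        v = PARENT[v]
        path.append(v)
    return path

def consistent(a, b):
    """Consistency predicate: a and b lie on the same generalization path."""
    return a in path_to_root(b) or b in path_to_root(a)

def possibly_matching(a, b):
    """Inconsistency predicate: reject the match if a (or any ancestor of a)
    is known to be inconsistent with b (or any ancestor of b)."""
    return not any(frozenset({x, y}) in INCONSISTENT
                   for x in path_to_root(a) for y in path_to_root(b))

print(consistent("Male", "Single Male"))                    # True
print(possibly_matching("Married Male", "6 Month Pregnant"))  # False
```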
Algorithm Overview
Top-Down Specialization for Sequential Anonymization
Input: T1, T2, an (X,Y)-privacy requirement, a taxonomy tree
for each attribute in X1 where X1=X ∩ att(T1).
Output: a generalized T1 satisfying the privacy requirement.
1. generalize every value of Aj to ANYj where Aj ∈ X1;
2. while there is a valid candidate in ∪Cutj do
3.   find the winner w of highest Score(w) from ∪Cutj;
4.   specialize w on T1 and remove w from ∪Cutj;
5.   update Score(v) and the valid status for all v in ∪Cutj;
6. end while
7. output the generalized T1 and ∪Cutj;
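The loop can be sketched in Python as below. This is a deliberately naive illustration, not the authors' algorithm: the privacy check runs on a single table rather than on the join with T2, validity is recomputed from scratch instead of being updated incrementally, and the `score` callback is a placeholder. All function names are hypothetical.

```python
from collections import defaultdict

TAXONOMY = {"Job": {"ANY": ["Professional", "Admin"],
                    "Professional": ["Engineer", "Lawyer"],
                    "Admin": ["Banker", "Clerk"]}}
PARENT = {a: {c: p for p, cs in tree.items() for c in cs}
          for a, tree in TAXONOMY.items()}

def generalize(t1, cut):
    """Replace each raw value by the cut value on its root-to-leaf path."""
    out = []
    for rec in t1:
        g = dict(rec)
        for a, values in cut.items():
            v = rec[a]
            while v not in values:
                v = PARENT[a][v]
            g[a] = v
        out.append(g)
    return out

def specialize(cut, a, v):
    """Return a copy of the cut in which v is replaced by its children."""
    new = {b: set(vals) for b, vals in cut.items()}
    new[a] = (new[a] - {v}) | set(TAXONOMY[a][v])
    return new

def top_down_specialize(t1, is_private, score):
    cut = {a: {"ANY"} for a in TAXONOMY}                    # step 1
    while True:
        # A candidate is valid if specializing it keeps the table private.
        valid = [(a, v) for a in cut for v in cut[a]
                 if v in TAXONOMY[a]
                 and is_private(generalize(t1, specialize(cut, a, v)))]
        if not valid:                                       # step 2
            break
        a, w = max(valid, key=lambda c: score(*c))          # step 3
        cut = specialize(cut, a, w)                         # step 4
        # Step 5 is implicit: validity is recomputed in the next iteration.
    return generalize(t1, cut), cut                         # step 7

def two_anonymous(table):
    """(X,Y)-anonymity with X = {Job}, Y = {Pid}, k = 2, on a single table."""
    groups = defaultdict(set)
    for r in table:
        groups[r["Job"]].add(r["Pid"])
    return min(len(p) for p in groups.values()) >= 2

T1 = [{"Pid": 1, "Job": "Engineer"}, {"Pid": 2, "Job": "Lawyer"},
      {"Pid": 3, "Job": "Banker"}, {"Pid": 4, "Job": "Banker"}]
result, final_cut = top_down_specialize(T1, two_anonymous,
                                        score=lambda a, v: 1)
print([r["Job"] for r in result])
```

On this toy input, ANY and then Admin are specialized; Professional stays generalized because splitting it would leave Engineer and Lawyer with one patient each.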
Monotonic Privacy
• Theorem 1: On a single table, the (X,Y)-privacy
is anti-monotone wrt specialization on X.
– If violated, remains violated after a specialization.
• AY(X) is non-increasing wrt specialization on X.
– A specialization on X always reduces the set of records that contain an X
value and therefore reduces the set of Y values that co-occur with an X value.
• LY(X) is non-decreasing wrt specialization on X.
– A specialization v → {v1,…,vc} transforms a value x on
X to the specialized values x1,…,xc on X.
– If ly(xi) < ly(x) for some xi, there must exist some xj
such that ly(xj) > ly(x) (otherwise ly(x), which is a weighted average of the
ly(xi), would be smaller than itself).
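The weighted-average argument behind the second claim can be spelled out. Here T[x] denotes the set of records containing x, a shorthand introduced for this note, and the sum ranges over the specialized values x_i that occur in at least one record.

```latex
\begin{align*}
l_y(x) \;=\; \frac{|T[x,y]|}{|T[x]|}
       \;=\; \sum_{i=1}^{c} \frac{|T[x_i]|}{|T[x]|} \cdot \frac{|T[x_i,y]|}{|T[x_i]|}
       \;=\; \sum_{i=1}^{c} w_i \, l_y(x_i),
\qquad w_i \ge 0,\quad \sum_{i=1}^{c} w_i = 1 .
\end{align*}
```

Since l_y(x) is a convex combination of the l_y(x_i), it follows that max_i l_y(x_i) ≥ l_y(x), so the maximum confidence L_Y(X) cannot decrease after the specialization.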
Monotonic Privacy
• On the join of T1 and T2, in general, (X,Y)-anonymity is not anti-monotone wrt a
specialization on X ∩ att(T1).
– Specializing T1 may create dangling records.
• Two tables are population-related if every record
in each table has at least one matching record in
the other table ⇒ no dangling record.
• Lemma 1: If T1 and T2 are population-related,
AY(X) is non-increasing wrt specialization on X ∩
att(T1).
Monotonic Privacy
• Lemma 2: If Y contains attributes from T1
or T2, but not from both, LY(X) does not
decrease after specialization of T1 on the
attributes X ∩ att(T1).
• Theorem 2: Assume that T1 and T2 are
projections of the same underlying table.
Then (X,Y)-anonymity and (X,Y)-linkability on
the join of T1 and T2 are anti-monotone
wrt specialization of T1 on X ∩ att(T1).
Score Metric
• Score(v) evaluates the “goodness” of a
specialization v for preserving privacy and
information.
• Each specialization v gains some information and
loses some privacy. We maximize the information gained per unit of privacy lost:
Score(v) = InfoGain(v) / (PrivLoss(v) + 1)
• InfoGain(v) is measured on T1.
• PrivLoss(v) is measured on the join of T1 and T2.
Information Gain
• If T1 is released for classification on a specified
class column, InfoGain(v) could be the reduction
of the class entropy:
InfoGain(v) = E(T1[v]) − Σi (|T1[vi]| / |T1[v]|) · E(T1[vi])
where E(·) is the class entropy of a set of records.
• T1[v] denotes the set of generalized records in
T1 that contain v before the specialization.
• T1[vi] denotes the set of records in T1 that
contain vi after the specialization.
• Alternatively, InfoGain(v) could be based on the notion of distortion.
Privacy Loss
• PrivLoss(v) is measured by the decrease
of AY(X) or the increase of LY(X) due to the
specialization of v:
AY(X) - AY(Xv) for (X,Y)-anonymity
LY(Xv) - LY(X) for (X,Y)-linkability
where X and Xv represent the attributes
before and after specializing v
respectively.
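A worked example of scoring one candidate specialization, ANY → {Professional, Admin}. Several things here are assumptions: the four records and their class labels are made up, a single table stands in for both T1 and the join, and the ratio InfoGain / (PrivLoss + 1) follows the form used in top-down specialization [FWY05] rather than a formula preserved in this transcript.

```python
import math
from collections import Counter, defaultdict

def entropy(labels):
    """Class entropy (in bits) of a list of labels."""
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def info_gain(rows, after, cls):
    """Entropy reduction when the records are split by their 'after' value."""
    groups = defaultdict(list)
    for r in rows:
        groups[r[after]].append(r[cls])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy([r[cls] for r in rows]) - remainder

def a_y(rows, x, y):
    """A_Y(X): minimum number of distinct Y values co-occurring with an X value."""
    groups = defaultdict(set)
    for r in rows:
        groups[r[x]].add(r[y])
    return min(len(s) for s in groups.values())

# Hypothetical records: Job before and after specializing ANY.
ROWS = [
    {"Pid": 1, "Before": "ANY", "After": "Professional", "Class": "Y"},
    {"Pid": 2, "Before": "ANY", "After": "Professional", "Class": "Y"},
    {"Pid": 3, "Before": "ANY", "After": "Admin", "Class": "N"},
    {"Pid": 4, "Before": "ANY", "After": "Admin", "Class": "N"},
]

gain = info_gain(ROWS, "After", "Class")                        # 1 bit
loss = a_y(ROWS, "Before", "Pid") - a_y(ROWS, "After", "Pid")   # 4 - 2
score = gain / (loss + 1)                                       # assumed form
print(gain, loss, score)
```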
Challenges
1. Each specialization on w affects the
matching of the join and thus the privacy check.
• It is too expensive to rejoin the two tables for
each specialization.
2. Materializing the join is impractical.
• A lossy join can be very large.
Our solution: incrementally maintain some
count statistics to update Score(v)
without executing the join.
Data Structure
• Expensive operations on specializing w
– accessing the records in T1 containing w
– matching the records in T1 with the records in
T2.
• X1 = X ∩ att(T1) and X2 = X ∩ att(T2).
• J1 and J2 denote the join attributes in T1
and T2.
Data Structure
• Tree1: partition T1 records by the
attributes X1 and J1-X1 in that order, one
level per attribute.
– Link[v] links up all nodes for v at the attribute
level of v.
• Tree2: partition T2 records by the
attributes J2 and X2-J2 in that order.
– Tree2 is static.
• Probe the matching partitions in Tree2.
– Match the last |J1| attributes in a partition in
Tree1 with the first |J2| attributes in Tree2.
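The idea of matching partitions rather than records can be illustrated with plain dictionaries in place of Tree1 and Tree2. This is a simplified sketch under stated assumptions: it reuses the T1/T2 example from the motivation slides with Job as the only join attribute, and it does not implement the Link[v] structure.

```python
from collections import Counter, defaultdict

# (Name, Job) from T1 and (Job, Disease) from T2, as in the motivating example.
T1 = [("Alice", "Banker"), ("Alice", "Banker"), ("Bob", "Clerk"),
      ("Bob", "Driver"), ("Cathy", "Engineer")]
T2 = [("Banker", "Cancer"), ("Banker", "Cancer"), ("Clerk", "HIV"),
      ("Driver", "Cancer"), ("Engineer", "HIV")]

# Stand-in for Tree1: T1 partitions keyed by (Name, Job), with record counts.
part1 = Counter(T1)

# Stand-in for Tree2: T2 partitions keyed first by the join attribute (Job),
# then by the remaining attribute (Disease).
part2 = defaultdict(Counter)
for job, disease in T2:
    part2[job][disease] += 1

# Probe Tree2 with each T1 partition's join value. The size of a join group
# is the product of the partition counts, so the join is never materialized.
group_size = Counter()
for (name, job), n1 in part1.items():
    for disease, n2 in part2.get(job, {}).items():
        group_size[(name, disease)] += n1 * n2

print(group_size[("Alice", "Cancer")])  # 4
print(sum(group_size.values()))         # 7, the size of the join
```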
Analysis
• On specializing w, Link[w] provides direct access
to the T1 records involved.
• Tree2 provides direct access to the matching
partitions in T2.
• Matching is performed at the partition level, not at
the record level.
• The cost of each iteration has two parts.
1. Specialize the affected partitions on Link[w].
2. Update the score and status of candidates using count
statistics.
• Each record in T1 is accessed at most
|X ∩ att(T1)| × h times, where h is the maximum
height of the taxonomies.
Empirical Study
• The Adult data set. 45222 records.
• Two versions of (T1,T2)
• Set A (categorical attributes only)
– T1 contains the Class attribute, the 3
categorical attributes and the 3 join attributes.
– T2 contains the 2 categorical attributes and
the 3 join attributes.
• Set B (both categorical and continuous)
– T1 contains the additional 6 continuous
attributes from Taxation Department.
Schema for Set A
• T1 contains the Class attribute
Department         Attribute            # of Leaves  # of Levels
Taxation (T1)      Education (E)        16           5
Taxation (T1)      Occupation (O)       14           3
Taxation (T1)      Work-class (W)       8            5
Common (T1 & T2)   Marital-status (M)   7            4
Common (T1 & T2)   Relationship (Ra)    6            3
Common (T1 & T2)   Sex (S)              2            2
Immigration (T2)   Native-country (Nc)  40           5
Immigration (T2)   Race (Ra)            5            3
Empirical Study
• Classification metric
– Classification error on the generalized testing set
of T1.
• Distortion metric [SS98]
– Categorical: 1 unit of distortion for each
generalization.
– Continuous: Suppose v is generalized to interval
[a-b). Unit of distortion = (b-a)/(f2-f1), where
[f1,f2) is the full range of the attribute.
– Normalize total distortion by the number of
records.
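A small sketch of the distortion metric. It assumes that "each generalization" means one step up the taxonomy, and the two records, the Job taxonomy and the age range [20-70) are made up for illustration.

```python
# Hypothetical Job taxonomy: child -> parent.
PARENT = {"Engineer": "Professional", "Lawyer": "Professional",
          "Banker": "Admin", "Clerk": "Admin",
          "Professional": "ANY", "Admin": "ANY"}

def categorical_distortion(raw, released):
    """One unit per generalization step from the raw value to the released one."""
    steps = 0
    while raw != released:
        raw = PARENT[raw]
        steps += 1
    return steps

def continuous_distortion(a, b, f1, f2):
    """Distortion of releasing interval [a-b) when the full range is [f1-f2)."""
    return (b - a) / (f2 - f1)

AGE_RANGE = (20, 70)
# (raw job, released job, released age interval)
records = [
    ("Engineer", "Professional", (30, 40)),  # 1 unit + 10/50
    ("Banker", "ANY", (20, 70)),             # 2 units + 50/50
]

total = sum(categorical_distortion(raw, gen)
            + continuous_distortion(a, b, *AGE_RANGE)
            for raw, gen, (a, b) in records)
normalized = total / len(records)  # distortion per record
print(normalized)
```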
(X,Y)-Anonymity
• TopN attributes: most important for classification.
– Chosen by successively removing the top
attribute in a decision tree.
• Join attributes are the Top3 attributes.
– If not important, simply remove them.
• X contains
– TopN attributes in T1 for a specified N (to ensure
that the generalization is performed on important
attributes),
– all join attributes,
– all attributes in T2 (to ensure X is global).
Distortion of (X,Y)-anonymity
• Ki is a key in Ti.
• XYD: produced by our method with Y = K1.
• KAD: produced by k-anonymity on T1 with
QID=att(T1).
[Figure: distortion of (X,Y)-anonymity; panels: Set A, Set B]
Classification error of (X,Y)-anonymity
• XYE: produced by our method with Y = K1.
• XYE(row): produced by our method with Y={K1,K2}.
• BLE: produced by the unmodified data.
• KAE: produced by k-anonymity on T1 with QID=att(T1).
• RJE: produced by removing all join attributes from T1.
[Figure: classification error of (X,Y)-anonymity; panels: Set A, Set B]
(X,Y)-Linkability
• Y contains the TopN attributes.
– If not important, simply remove them.
• X contains the rest of the attributes in T1 and T2,
except T2.Ra and T2.Nc because otherwise no
privacy requirement can be satisfied.
• Focus on the classification error because the
distortion due to (X,Y)-linkability is not comparable
with the distortion due to k-anonymity.
Classification error of (X,Y)-linkability
• XYE: produced by our method with Y = TopN.
• BLE: produced by the unmodified data.
• RJE: produced by removing all join attributes from T1.
• RSE: produced by removing all attributes in Y from T1.
[Figure: classification error of (X,Y)-linkability; panels: Set A, Set B]
Scalability
[Figure: scalability; panels: (X,Y)-anonymity (k=40), (X,Y)-linkability (k=90%)]
Conclusion
• Previous works on k-anonymization
focused on a single release of data.
• Studied the sequential anonymization
problem.
• Extended the privacy notion to this model.
• Introduced lossy join as a way to hide the
join relationship among releases.
• Addressed computational challenges due
to the large size of the lossy join.
• Extendable to more than one previously
released table, T2,…,Tp.
References
[BA05] R. Bayardo and R. Agrawal. Data privacy through optimal k-anonymization. In IEEE ICDE, pages 217–228, 2005.
[DP05] A. Deutsch and Y. Papakonstantinou. Privacy in database publishing. In ICDT, 2005.
[FWY05] B. C. M. Fung, K. Wang, and P. S. Yu. Top-down specialization for information and privacy preservation. In IEEE ICDE, pages 205–216, April 2005.
[KG06] D. Kifer and J. Gehrke. Injecting utility into anonymized datasets. In ACM SIGMOD, Chicago, IL, June 2006.
[LDR05] K. LeFevre, D. J. DeWitt, and R. Ramakrishnan. Incognito: Efficient full-domain k-anonymity. In ACM SIGMOD, 2005.
[MGK06] A. Machanavajjhala, J. Gehrke, and D. Kifer. l-diversity: Privacy beyond k-anonymity. In IEEE ICDE, 2006.
[MW04] A. Meyerson and R. Williams. On the complexity of optimal k-anonymity. In PODS, 2004.
[SS98] P. Samarati and L. Sweeney. Protecting privacy when disclosing information: k-anonymity and its enforcement through generalization and suppression. In IEEE Symposium on Research in Security and Privacy, May 1998.
[WFY05] K. Wang, B. C. M. Fung, and P. S. Yu. Template-based privacy preservation in classification problems. In IEEE ICDM, pages 466–473, November 2005.
[WFY06] K. Wang, B. C. M. Fung, and P. S. Yu. Handicapping attacker's confidence: An alternative to k-anonymization. Knowledge and Information Systems: An International Journal, 2006.
[WYC04] K. Wang, P. S. Yu, and S. Chakraborty. Bottom-up generalization: A data mining solution to privacy protection. In IEEE ICDM, November 2004.
[WLFW06] R. C. W. Wong, J. Li, A. W. C. Fu, and K. Wang. (α,k)-anonymity: An enhanced k-anonymity model for privacy preserving data publishing. In ACM SIGKDD, 2006.
[YWJ05] C. Yao, X. S. Wang, and S. Jajodia. Checking for k-anonymity violation by views. In VLDB, 2005.