Publishing Sensitive Transactions for Itemset Utility
(IEEE ICDM 2008)
Presenter:
Yabo Xu
Authors:
Yabo Xu, Benjamin C.M. Fung,
Ke Wang, Ada W.C. Fu, Jian Pei
Affiliation
Simon Fraser University, Canada
Concordia University, Canada
Chinese University of Hong Kong
Outline
Motivation: Real privacy outcry on transactions
The problem
Privacy attacks on Transactions
Research Challenges
Our Approach: (h,k,p)-Coherence
Attack Model
Privacy, Utility and Anonymization Method
A border-based algorithm
Related works
Real Privacy Outcry on Transactions
Fact 1: AOL search scandal, Aug 2006
Transactions of query terms
20 million queries from 650k users within three months
The famous searcher with ID 4417749, Thelma Arnold, was identified soon after.
Fact 2: Netflix movie rating dataset for a movie recommendation contest, 2006
Transactions of movie ratings
100 million movie ratings made by 500,000 subscribers
A researcher de-anonymized the data two weeks after the data release
Fact 3: Google was ordered by a federal judge to hand over YouTube user data (all the user view logs) to Viacom for copyright issues, July 2008
Transactions of user viewing logs
Google claimed they would anonymize the data before giving it to Viacom
Challenges on Anonymizing Transactions
High-dimensional and sparse data
Relational data: only tens, or at most hundreds, of attributes
Transaction data: 10,000 distinct items or more, and each
transaction contains only a small portion of the items
Hard to model the attacker's prior knowledge
Relational data: a small set of public (identifying) attributes
Transaction data: a large number of items are potentially
identifying, and treating all of them as identifying would render
the data useless.
Re-identification Attack: An example
TID  Name    Activities    Medical History
T1   Jane    a c d f g     Diabetes
T2   Sam     a b c f       Hepatitis
T3   Albert  b f x         Hepatitis
T4   Grace   b c g y z     HIV
T5   Tim     b d c f g     HIV
Activities: public items (identifying), possibly known by an attacker
Medical History: private items (sensitive), which the attacker wants to find out
Examples: financial information, health information, sexual orientation, religion and political beliefs.
In specialized industries, well-defined guidelines for public/private items often exist, e.g. HIPAA
Attack 1: {a,b} -> T2 -> Hepatitis
Attack 2: {b,g} -> {T4,T5} -> HIV
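The two attacks can be replayed with a few lines of Python. This is only an illustrative sketch: it assumes the example table reads T1 (Jane) to T5 (Tim) in slide order, with one private value per transaction, and the helper name `attack` is made up for this example.

```python
# Toy data from the example slide; the TIDs T1..T5 are assumed to follow
# the slide's row order (Jane, Sam, Albert, Grace, Tim).
transactions = {
    "T1": ({"a", "c", "d", "f", "g"}, "Diabetes"),
    "T2": ({"a", "b", "c", "f"}, "Hepatitis"),
    "T3": ({"b", "f", "x"}, "Hepatitis"),
    "T4": ({"b", "c", "g", "y", "z"}, "HIV"),
    "T5": ({"b", "d", "c", "f", "g"}, "HIV"),
}


def attack(known_items):
    """Return the TIDs that contain all known public items,
    together with the set of private values found in those TIDs."""
    known = set(known_items)
    matches = sorted(
        tid for tid, (public, _) in transactions.items() if known <= public
    )
    private_values = {transactions[tid][1] for tid in matches}
    return matches, private_values


attack1 = attack({"a", "b"})  # attacker knows the victim did a and b
attack2 = attack({"b", "g"})  # attacker knows the victim did b and g
print(attack1)
print(attack2)
```

In Attack 1 the victim is pinned down to a single transaction. In Attack 2 two transactions match, but both carry the same private value, so the attacker still learns it.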
Attack Model
Attacker's knowledge: a subset of public items, β
Attacker's goal: infer a private item e
Attacker's power p
max number of public items he can obtain, i.e. |β| ≤ p
p=2: <a, b, c, d, e, f, g, e1, e2, e3>
Attack succeeds when either
fewer than k transactions contain β, i.e. support(β) < k, or
most of the transactions containing β have some private
item e, i.e. P(β → e) > h,
where k and h are two privacy parameters.
We call such β with |β| ≤ p moles.
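A brute-force mole finder makes the definition concrete. The sketch below is not the authors' algorithm. It assumes that β is a mole when either condition holds (which matches the example where {b,g} hits two transactions yet still reveals HIV), that only combinations occurring in at least one transaction are considered, and that P(β → e) is the fraction of transactions containing β that also contain e. The name `find_moles` and the parameter values are illustrative.

```python
from itertools import combinations


def find_moles(db, h, k, p):
    """db is a list of (public_items, private_items) pairs of sets.
    Returns every combination beta of at most p public items that
    occurs in the data and has support < k or max_e P(beta -> e) > h."""
    public = sorted(set().union(*(pub for pub, _ in db)))
    moles = []
    for size in range(1, p + 1):
        for beta in combinations(public, size):
            covering = [priv for pub, priv in db if set(beta) <= pub]
            sup = len(covering)
            if sup == 0:
                continue  # assumption: unseen combinations are not moles
            private = set().union(*covering)
            breach = max(
                (sum(e in priv for priv in covering) / sup for e in private),
                default=0.0,
            )
            if sup < k or breach > h:
                moles.append(beta)
    return moles


# Toy database from the re-identification example.
db = [
    ({"a", "c", "d", "f", "g"}, {"Diabetes"}),
    ({"a", "b", "c", "f"}, {"Hepatitis"}),
    ({"b", "f", "x"}, {"Hepatitis"}),
    ({"b", "c", "g", "y", "z"}, {"HIV"}),
    ({"b", "d", "c", "f", "g"}, {"HIV"}),
]
moles = find_moles(db, h=0.8, k=2, p=2)
print(("a", "b") in moles, ("b", "g") in moles, ("f",) in moles)
```

This enumeration is exponential in p and only practical on toy data, which is exactly the problem the border-based algorithm is meant to avoid.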
Privacy, Utility and Anonymization Method
Privacy notion: (h,k,p)-Coherence
A transaction database is coherent if there are
no moles, i.e.
for every β with |β| ≤ p, support(β) ≥ k and P(β → e) ≤ h
Utility measure: loss of nuggets
A high-dimensional utility measure, preserving associations
rather than items.
Frequent itemsets are information nuggets for a transaction
database, and important for many data mining applications
The choice of anonymization method
perturbation - loses the truthfulness of the data, NO
item generalization - requires an item hierarchy, which may not
exist in many applications, NO.
item suppression - preserves itemset support, critical for
many data mining applications, YES
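The claim that item suppression preserves itemset support can be checked on the toy data. The sketch below treats a nugget as an itemset of public items with support at least `min_sup`, and suppression as removing an item from every transaction. The threshold `min_sup=3`, the length cap `max_len=2`, and both function names are assumptions made for this illustration only.

```python
from itertools import combinations


def frequent_itemsets(db, min_sup, max_len=2):
    """Brute-force frequent itemsets (nuggets) up to max_len items."""
    items = sorted(set().union(*db))
    result = {}
    for size in range(1, max_len + 1):
        for cand in combinations(items, size):
            sup = sum(set(cand) <= t for t in db)
            if sup >= min_sup:
                result[cand] = sup
    return result


def suppress(db, item):
    """Global suppression: delete the item from every transaction."""
    return [t - {item} for t in db]


# Public items of the toy transactions T1..T5.
db = [
    {"a", "c", "d", "f", "g"},
    {"a", "b", "c", "f"},
    {"b", "f", "x"},
    {"b", "c", "g", "y", "z"},
    {"b", "d", "c", "f", "g"},
]
before = frequent_itemsets(db, min_sup=3)
after = frequent_itemsets(suppress(db, "b"), min_sup=3)
lost = set(before) - set(after)
print(sorted(lost))
```

Only the nuggets containing the suppressed item disappear. Every surviving nugget keeps exactly the support it had in the original data, which is why suppressed data stays truthful for frequent itemset mining.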
The problem and a greedy framework
Optimal Coherence (NP-hard problem):
Suppress a set of items so that all the moles are
eliminated while preserving as many nuggets as possible.
A greedy approach:
In each round, suppress one item v with maximal
Score(v) = |M(v)| / |N(v)|
until coherence is achieved.
|M(v)|: the moles eliminated by suppressing v (maximize the moles suppressed)
|N(v)|: the nuggets lost by suppressing v (minimize the loss of nuggets)
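The greedy loop can be sketched as follows, under simplifying assumptions that are not taken from the slides: moles and nuggets are given explicitly as itemsets, M(v) and N(v) are the moles and nuggets that contain v, suppressing v removes exactly those, and an item that costs no nuggets gets an infinite score. The function name and the sample inputs are hypothetical.

```python
def greedy_suppress(moles, nuggets):
    """Repeatedly suppress the item with the largest |M(v)| / |N(v)|
    until no mole is left. Returns the suppressed items in order and
    the nuggets that survive."""
    moles = [frozenset(m) for m in moles]
    nuggets = [frozenset(n) for n in nuggets]
    suppressed = []
    while moles:
        candidates = sorted(set().union(*moles))

        def score(v):
            m = sum(v in mole for mole in moles)   # |M(v)|
            n = sum(v in nug for nug in nuggets)   # |N(v)|
            # Assumption: an item that costs no nuggets is always preferred.
            return m / n if n else float("inf")

        best = max(candidates, key=score)  # ties: first item in sorted order
        suppressed.append(best)
        moles = [m for m in moles if best not in m]
        nuggets = [n for n in nuggets if best not in n]
    return suppressed, nuggets


# Hypothetical inputs for illustration.
moles = [("a", "b"), ("b", "g"), ("x",)]
nuggets = [("b",), ("c",), ("f",), ("g",), ("b", "c"), ("c", "g")]
suppressed, remaining = greedy_suppress(moles, nuggets)
print(suppressed, len(remaining))
```

Keeping explicit lists of moles and nuggets is what does not scale in practice. The sketch only shows the selection rule.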
Challenges
The numbers of moles and nuggets are both exponential
10,000 distinct query terms, p=3: potentially 10^12 moles!!
Suppressing an item will affect other moles/nuggets, so the
moles/nuggets have to be maintained and updated
efficiently.
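A quick calculation shows the scale. The variable names are illustrative. The figure of 10^12 corresponds to the loose bound 10,000^3, while the number of unordered combinations of at most three items is somewhat smaller but of a similar order of magnitude.

```python
from math import comb

n_items, p = 10_000, 3

# Unordered combinations of 1..p distinct items: candidate moles.
candidates = sum(comb(n_items, size) for size in range(1, p + 1))

# Loose upper bound n^p, matching the order of magnitude quoted above.
upper_bound = n_items ** p

print(f"{candidates:,} candidate combinations, bound {upper_bound:,}")
```

Either way, checking each candidate against the database one by one is out of reach, which motivates a compact representation.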
Solution – Border Approach
A highly compact structure is needed to deal with the
exponential growth of both moles and nuggets
Moles/Nuggets:
ae, af, ag, be,
abe, aef, abf, afg, abg, bef,
abef, abfg
Mole/Nugget Border:
U = { ae, af, ag, be } (minimal itemsets)
L = { abef, abfg } (maximal itemsets)
edge <ae, abef>
edge <af, abfg>
Key contribution: an efficient border-based counting
algorithm that never enumerates all the moles and nuggets
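To see what the border stands for, the sketch below expands the pair <U, L> from the example back into the full collection: every itemset X with u ⊆ X ⊆ l for some u in U and l in L. The expansion is for illustration only, since the whole point of the border is that the algorithm never needs to do it. The function name `expand_border` is an assumption.

```python
from itertools import combinations


def expand_border(U, L):
    """Return all itemsets lying between some minimal itemset in U
    and some maximal itemset in L (inclusive)."""
    covered = set()
    for lower in map(frozenset, U):
        for upper in map(frozenset, L):
            if not lower <= upper:
                continue  # no edge between this pair
            extra = sorted(upper - lower)
            for r in range(len(extra) + 1):
                for add in combinations(extra, r):
                    covered.add(lower | frozenset(add))
    return covered


U = ["ae", "af", "ag", "be"]   # minimal itemsets
L = ["abef", "abfg"]           # maximal itemsets
itemsets = sorted("".join(sorted(s)) for s in expand_border(U, L))
print(len(itemsets), itemsets)
```

Six border itemsets are enough to represent all twelve itemsets of the example, and the gap widens quickly as itemsets get longer.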
Related Works
G. Ghinita et al., ICDE 2008 paper
l-diversity, and public/private items
bucketization approach, vulnerable to background knowledge attacks
due to its property of preserving original items without generalization
M. Terrovitis et al., VLDB 2008 paper
Also assumes the attacker's prior knowledge is a subset of items
k-anonymity, no protection against the homogeneity attack
Generalization
Y. Xu et al., KDD 2008 paper
Assumes the attacker's prior knowledge is a subset of items
k-anonymity and l-diversity
Suppression
Single-dimension item utility.
Questions?
Please write to [email protected]