Maintaining Data Privacy in Association Rule Mining
VLDB 2002
Authors: Shariq J. Rizvi, Jayant R. Haritsa
Speaker: Minghua ZHANG
Oct. 11, 2002

Content

- Background
- Problem framework
- MASK --- distortion part
- MASK --- mining part
- Performance
- Conclusion

Background

- In data mining, the accuracy of the input data is very important for obtaining valuable mining results.
- However, in real life there are many reasons that lead to inaccurate data.
- One example: users deliberately provide wrong information to protect their privacy.
  – e.g., age, income, illness, etc.
- Problem: how can we protect user privacy while obtaining accurate mining results at the same time?

Background (cont'd)

- Privacy and accuracy are contradictory in nature.
- A compromise is more feasible:
  – satisfactory (not 100%) privacy and satisfactory (not 100%) accuracy.
- This paper studies the problem in the context of mining association rules.

Overview of the Paper

- The authors propose a scheme --- MASK (Mining Associations with Secrecy Konstraints).
- Major idea of MASK:
  – Apply a simple probabilistic distortion to the original data.
    The distortion can be done at the user's machine.
  – The miner tries to find accurate mining results, given the following inputs:
    - the distorted data;
    - a description of the distortion procedure.

Problem Framework

- Database model
  – Each customer transaction is a record in the database.
  – A record is a fixed-length sequence of 1's and 0's.
    E.g., for market-basket data:
    - length of the record: the total number of items sold by the market;
    - 1: the corresponding item was bought in the transaction;
    - 0: the item was not bought.
  – The database can be regarded as a two-dimensional boolean matrix.

Problem Framework (cont'd)

- The matrix is very sparse. Why not use itemlists?
  – The data will be distorted.
  – After the distortion, it will not be as sparse as the original (true) data.
- Mining objective: find frequent itemsets
  – itemsets whose appearance (support) in the database is larger than a threshold.

MASK --- Distortion Part

- Distortion procedure
  – Represent a customer record as a random vector.
  – Original record: X = {Xi}, where Xi = 0 or 1.
  – Distorted record: Y = {Yi}, where Yi = 0 or 1:
    Yi = Xi       (with probability p)
    Yi = 1 - Xi   (with probability 1 - p)
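
The per-bit coin flip above is easy to sketch in code. This is my own illustration of the distortion rule, not the authors' implementation; the function name `distort` and the list-of-bits record format are assumptions:

```python
import random

def distort(record, p):
    """MASK-style distortion: keep each bit with probability p, flip it with 1 - p."""
    return [xi if random.random() < p else 1 - xi for xi in record]

# A boolean purchase record over five items, distorted at the user's machine.
original = [1, 0, 0, 1, 0]
distorted = distort(original, p=0.9)
```

With p = 1 the record passes through unchanged, and with p = 0 every bit is flipped; values near 0.5 give the most privacy but, as the later slides show, the least accuracy.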
Quantifying Privacy

- Privacy metric
  – The probability of reconstructing the true data:
    with what probability can a given 1 or 0 in the true matrix be reconstructed?
  – Consider each individual item.
- Calculating the reconstruction probability
  – Let si = Prob(a random customer C bought the ith item)
           = the true support of item i.
  – The probability of correctly reconstructing a '1' in a random item i is:
    R1(p,si) = si*p^2 / (si*p + (1-si)*(1-p))
             + si*(1-p)^2 / (si*(1-p) + (1-si)*p)

Reconstruction Probability

- Reconstruction probability of a '1' across all items:
  R1(p) = (Σi si*R1(p,si)) / (Σi si)
- Suppose s0 = the average support of an item.
- Replacing si by s0, we get:
  R1(p) = s0*p^2 / (s0*p + (1-s0)*(1-p))
        + s0*(1-p)^2 / (s0*(1-p) + (1-s0)*p)
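
The closed form above is easy to evaluate numerically. A minimal sketch (the function name `r1` is my own) that confirms the behaviour discussed on the next slide:

```python
def r1(p, s0):
    """Average probability of correctly reconstructing a '1' (slide formula)."""
    return (s0 * p**2 / (s0 * p + (1 - s0) * (1 - p))
            + s0 * (1 - p)**2 / (s0 * (1 - p) + (1 - s0) * p))

# Reconstruction is hardest (most private) at p = 0.5 and easiest near 0 or 1.
vals = {p: r1(p, 0.01) for p in (0.1, 0.5, 0.9)}
assert vals[0.5] < vals[0.1] and vals[0.5] < vals[0.9]
```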
Reconstruction Probability (cont'd)

- Relationship between R1(p) and p, s0
  [chart omitted: R1(p) versus p for several values of s0]
- Observations:
  – R1(p) is high when p is near 0 or 1, and lowest when p = 0.5.
  – The curves become flatter as s0 decreases.

Privacy Measure

- The reconstruction probability of a '0'
  – R0(p) is derived similarly, as a function of p and s0.
- The total reconstruction probability
  – R(p) = a*R1(p) + (1-a)*R0(p)
  – a is the weight parameter.
- Privacy
  – P(p) = (1 - R(p)) x 100
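
Putting the pieces together, P(p) can be sketched as follows. The formula for R0 is not spelled out in the transcript; this sketch assumes it mirrors R1 with the roles of 1's and 0's (i.e. s0 and 1 - s0) swapped, and the function names are my own:

```python
def r1(p, s0):
    # Reconstruction probability of a '1' (formula from the earlier slide).
    return (s0 * p**2 / (s0 * p + (1 - s0) * (1 - p))
            + s0 * (1 - p)**2 / (s0 * (1 - p) + (1 - s0) * p))

def r0(p, s0):
    # Assumption: reconstructing a '0' is symmetric to a '1' with s0 <-> 1 - s0.
    return r1(p, 1 - s0)

def privacy(p, s0, a):
    """P(p) = (1 - R(p)) * 100, where R(p) = a*R1(p) + (1-a)*R0(p)."""
    return (1 - (a * r1(p, s0) + (1 - a) * r0(p, s0))) * 100
```

With a = 1 (the setting used in the experiments later) only the privacy of 1's is counted, and privacy is highest at p = 0.5, dropping as p approaches 0 or 1.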
Privacy Measure (cont'd)

- Privacy vs. p
  [chart omitted: P(p) for s0 = 0.01]
- Observations:
  – For a given value of s0, the curve shape is fixed;
    the value of a determines the absolute value of privacy.
  – The privacy is nearly constant over a large range of p,
    which provides flexibility to choose a p that minimizes the error
    in the later mining part.

MASK --- Mining Part

- How can the miner estimate the true supports of itemsets from a distorted database?
  – Remember that the miner knows the value of p.
- Steps:
  – Estimating 1-itemset supports
  – Estimating n-itemset supports
  – The whole mining process

Estimating 1-itemset Supports

- Symbols:
  – T: the original (true) matrix;
  – D: the distorted matrix;
  – i: a random item;
  – C1T and C0T: the number of 1's and 0's in column i of T;
  – C1D and C0D: the number of 1's and 0's in column i of D.
- From the distortion method, we have (in expectation):
  C1D = C1T*p + C0T*(1-p)
  C0D = C0T*p + C1T*(1-p)
- Let CT = (C1T, C0T), CD = (C1D, C0D), and let M be the 2x2 matrix
  with p on the diagonal and 1-p off the diagonal.
  Then CD = M*CT, so CT = M^-1 * CD.
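
Inverting the 2x2 system by hand gives a direct estimator for the true column counts. A sketch (the names are mine; it assumes p != 0.5, so that M is invertible):

```python
def estimate_counts(c1d, c0d, p):
    """Solve CD = M*CT for CT, with M = [[p, 1-p], [1-p, p]]."""
    det = 2 * p - 1                      # det(M); M is singular at p = 0.5
    c1t = (p * c1d - (1 - p) * c0d) / det
    c0t = (p * c0d - (1 - p) * c1d) / det
    return c1t, c0t

# If 100 of 1000 customers truly bought the item and p = 0.9, the distorted
# column is expected to hold 0.9*100 + 0.1*900 = 180 ones; inversion gives 100 back.
c1t, c0t = estimate_counts(180, 820, 0.9)
```

On an actual distorted column the observed counts only approximate their expectations, so the estimate carries sampling error; the error-bound slide at the end quantifies this.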
Estimating n-itemset Supports

- Still use CT = M^-1 * CD to estimate supports, now over all 2^n bit combinations.
- Definitions:
  – CkT: the number of records in T that have the binary form k for the given itemset.
    E.g., for a 3-itemset consisting of the first 3 items:
    - CT has 2^3 = 8 rows;
    - C3T is the number of records in T of the form {0,1,1,...} (3 = 011 in binary).
  – Mi,j = Prob(a record counted in CjT appears in CiD after distortion).
    E.g., M7,3 = p^2 * (1-p)   (C3T -> C7D, i.e. C011T -> C111D: the two 1's are
    kept with probability p each, and the 0 is flipped with probability 1-p).
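
The entries Mi,j factor over the n bits, since each bit is distorted independently. A sketch building the full 2^n x 2^n matrix (function name mine), checked against the slide's M7,3 example:

```python
def transition_matrix(n, p):
    """M[i][j] = Prob(a record of binary form j is distorted into form i)."""
    size = 2 ** n
    m = [[1.0] * size for _ in range(size)]
    for i in range(size):
        for j in range(size):
            for bit in range(n):
                same = ((i >> bit) & 1) == ((j >> bit) & 1)
                m[i][j] *= p if same else (1 - p)   # bit kept vs flipped
    return m

p = 0.9
m = transition_matrix(3, p)
assert abs(m[7][3] - p * p * (1 - p)) < 1e-12   # 011 -> 111, as on the slide
```

Each column of M sums to 1, since a record of form j must end up in exactly one distorted form.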
Mining Process

- Similar to the Apriori algorithm.
- Difference, e.g., when counting supports of 2-itemsets:
  – Apriori only needs to count the number of records that have value '1'
    for both items, i.e. of the form "11".
  – MASK has to keep track of all 4 combinations (00, 01, 10 and 11)
    for the corresponding items:
    C(2^n-1)T is estimated from C0D, C1D, ..., C(2^n-1)D.
  – So MASK requires more time and space than Apriori.
- Some optimizations (omitted)
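
The estimation step described above can be sketched end to end. Because M is a Kronecker power of the 2x2 block, its inverse is the Kronecker power of the inverted block; everything below (helper names, toy counts) is my own illustration, not the authors' code:

```python
def kron_power(block, n):
    """n-fold Kronecker power of a 2x2 matrix, indexed by n-bit integers."""
    size = 2 ** n
    out = [[1.0] * size for _ in range(size)]
    for i in range(size):
        for j in range(size):
            for bit in range(n):
                out[i][j] *= block[(i >> bit) & 1][(j >> bit) & 1]
    return out

def matvec(m, v):
    return [sum(mij * vj for mij, vj in zip(row, v)) for row in m]

p = 0.9
det = 2 * p - 1
fwd = kron_power([[p, 1 - p], [1 - p, p]], 2)            # CD = M * CT
inv = kron_power([[p / det, -(1 - p) / det],
                  [-(1 - p) / det, p / det]], 2)         # CT = M^-1 * CD

true_counts = [700, 100, 150, 50]        # records of form 00, 01, 10, 11
distorted = matvec(fwd, true_counts)     # expected counts after distortion
recovered = matvec(inv, distorted)
```

Here `distorted` holds exact expected counts, so the recovery is exact; with real distorted data the observed counts fluctuate, and MASK's estimates are only approximate, which is where the support and identity errors in the performance section come from.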
Performance

- Data sets
  – Synthetic database
    - 1,000,000 records; 1000 items
    - s0 = 0.01
  – Real dataset
    - click-stream data of a retailer web site
    - 600,000 records; about 500 items
    - s0 = 0.005

Performance (cont'd)

- Error metrics
  – Right class, wrong support
    - infrequent itemsets: the error doesn't matter;
    - frequent itemsets: support error (error in the estimated support).
  – Wrong class: identity error
    - false positives: infrequent itemsets reported as frequent;
    - false negatives: frequent itemsets missed.

Performance (cont'd)

- Parameters
  – sup = 0.25%, 0.5%
  – p = 0.9, 0.7
  – a = 1: only concerned with the privacy of 1's
  – r = 0%, 10%
    - Coverage may be more important than precision, so a smaller support
      threshold is used to mine the distorted database:
      support used to mine D = sup x (1 - r)

Performance (cont'd)

- Synthetic dataset
  – Experiment 1: p = 0.9 (privacy 85%), sup = 0.25%

    r = 0%:
    Level  |F|    supp.err%  f.neg%  f.pos%
    1      689    3.31       1.16    1.16
    2      2648   3.58       4.49    5.14
    3      1990   1.71       4.57    2.16
    4      1418   1.28       3.67    0.22
    5      730    1.27       5.89    0
    6      212    1.36       4.25    5.19
    7      35     1.40       0       0
    8      3      0.99       0       0

    r = 10%:
    Level  |F|    supp.err%  f.neg%  f.pos%
    1      689    3.37       0.73    3.19
    2      2648   3.73       0.19    19.68
    3      1990   1.76       0       28.09
    4      1418   1.29       0       25.81
    5      730    1.32       0       16.44
    6      212    1.37       0       25.47
    7      35     1.40       0       51.43
    8      3      0.99       0       66.67

Performance (cont'd)

- Synthetic dataset
  – Experiment 2: p = 0.9 (privacy 85%), sup = 0.5%

    r = 0%:
    Level  |F|    supp.err%  f.neg%  f.pos%
    1      560    2.60       1.25    0.89
    2      470    2.13       5.53    4.89
    3      326    1.22       3.07    0.31
    4      208    1.34       1.44    0.48
    5      125    1.81       0       0
    6      43     2.62       0       0
    7      10     3.44       10      0
    8      1      4.50       0       0

    r = 10%:
    Level  |F|    supp.err%  f.neg%  f.pos%
    1      560    2.66       0.18    4.29
    2      470    2.21       0       44.89
    3      326    1.26       0       42.64
    4      208    1.35       0       51.44
    5      125    1.81       0       22.4
    6      43     2.62       0       18.60
    7      10     3.47       0       10
    8      1      4.50       0       0

Performance (cont'd)

- Synthetic dataset
  – Experiment 3: p = 0.7 (privacy 96%), sup = 0.25%, r = 10%

    Level  |F|    supp.err%  f.neg%  f.pos%
    1      689    10.16      2.61    7.84
    2      2648   25.23      19.52   630.93
    3      1990   26.93      42.86   172.71
    4      1418   29.14      65.94   0.35
    5      730    28.47      79.32   0
    6      212    36.25      84.91   0
    7      35     51.37      85.71   0
    8      3      -          100     0

Performance (cont'd)

- Real database
  – Experiment 1: p = 0.9 (privacy 89%), sup = 0.25%

    r = 0%:
    Level  |F|    supp.err%  f.neg%  f.pos%
    1      249    5.89       4.02    2.81
    2      239    3.87       6.69    7.11
    3      73     2.60       10.96   9.59
    4      4      1.41       0       25.0

    r = 10%:
    Level  |F|    supp.err%  f.neg%  f.pos%
    1      249    6.12       1.2     0.40
    2      239    4.04       1.26    23.43
    3      73     2.93       0       45.21
    4      4      1.41       0       75

Performance (cont'd)

- Real database
  – Experiment 2: p = 0.9 (privacy 89%), sup = 0.5%

    r = 0%:
    Level  |F|    supp.err%  f.neg%  f.pos%
    1      150    4.23       0.67    4.67
    2      45     2.42       2.22    4.44
    3      6      1.07       0       16.66

    r = 10%:
    Level  |F|    supp.err%  f.neg%  f.pos%
    1      150    4.27       0       8
    2      45     2.56       0       37.77
    3      6      1.07       0       66.66

Performance (cont'd)

- Real database
  – Experiment 3: p = 0.7 (privacy 97%), sup = 0.25%, r = 10%

    Level  |F|    supp.err%  f.neg%  f.pos%
    1      249    18.96      7.23    15.66
    2      239    33.59      20.08   1907.53
    3      73     32.87      30.14   2308.22
    4      4      7.55       50      400

Performance (cont'd)

- Summary
  – Good privacy and good accuracy can be achieved at the same time
    by careful selection of p.
  – In the experiments, p around 0.9 is the best choice.
  – A smaller p leads to much larger errors in the mining results.
  – A larger p greatly reduces the privacy.

Conclusion

- This paper studies the problem of achieving satisfactory privacy and
  accuracy simultaneously for association rule mining.
- A probabilistic distortion of the true data is proposed.
- Privacy is measured by a formula, which is a function of p and s0.

Conclusion (cont'd)

- A mining process is put forward to estimate the real supports from the
  distorted database.
- Experimental results show that there is a small window of p (near 0.9)
  that achieves good accuracy (90%+) and privacy (80%+) at the same time.

Related Works

- On preventing sensitive rules from being inferred by the miner (output privacy)
  – Y. Saygin, V. Verykios and C. Clifton, "Using Unknowns to Prevent Discovery
    of Association Rules", ACM SIGMOD Record, vol. 30, no. 4, 2001.
  – M. Atallah, E. Bertino, A. Elmagarmid, M. Ibrahim and V. Verykios,
    "Disclosure Limitation of Sensitive Rules", Proc. of IEEE Knowledge and
    Data Engineering Exchange Workshop, Nov. 1999.

Related Works

- On input data privacy in distributed databases
  – J. Vaidya and C. Clifton, "Privacy Preserving Association Rule Mining in
    Vertically Partitioned Data", KDD 2002.
  – M. Kantarcioglu and C. Clifton, "Privacy-preserving Distributed Mining of
    Association Rules on Horizontally Partitioned Data", Proc. of ACM SIGMOD
    Workshop on Research Issues in Data Mining and Knowledge Discovery, 2002.

Related Works

- Privacy-preserving mining in the context of classification rules
  – D. Agrawal and C. Aggarwal, "On the Design and Quantification of Privacy
    Preserving Data Mining Algorithms", PODS, 2001.
- A recent paper also appearing in 2002
  – A. Evfimievski, R. Srikant, R. Agrawal and J. Gehrke, "Privacy Preserving
    Mining of Association Rules", KDD 2002.

More Information

- Distortion procedure
  – Yi = Xi XOR ri', where ri' is the complement of ri, and ri is a random
    variable with density function f(r) = bernoulli(p) (0 <= p <= 1).

More Information

- Reconstruction error bounds (1-itemsets)
  – With probability PE(m, p, (2p-1)ε/2) x PE(n, p, (2p-1)ε/2), the
    estimation error is less than ε, where:
    - n: the real support count of the item;
    - m: dbsize - n;
    - PE(n, p, ε) = Σ C(n,r) * p^r * (1-p)^(n-r),
      summed over r from n*p - n*ε to n*p + n*ε.
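
The summation limits are partly garbled in the transcript; read naturally, PE(n, p, ε) is the probability that a Binomial(n, p) count lands within n*ε of its mean n*p. A sketch under that assumption (the function name is mine):

```python
from math import comb, ceil, floor

def pe(n, p, eps):
    """PE(n, p, eps) = sum of C(n, r) * p^r * (1-p)^(n-r)
    for r from n*p - n*eps up to n*p + n*eps (clamped to [0, n])."""
    lo = max(0, ceil(n * (p - eps)))
    hi = min(n, floor(n * (p + eps)))
    return sum(comb(n, r) * p**r * (1 - p)**(n - r) for r in range(lo, hi + 1))

# For n = 1000 and p = 0.9, a +/-3% window already captures nearly all the mass.
assert 0.99 < pe(1000, 0.9, 0.03) <= 1.0
```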
- Reconstruction probability of a '1' in a random item i
  – si = the true support of item i
       = Pr(a random customer C bought the ith item);
  – Xi = the original entry for item i; Yi = the distorted entry for item i.
  – The probability of correct reconstruction of a '1' in a random item i is:
    R1(p,si) = Pr{Yi=1 | Xi=1} x Pr{Xi=1 | Yi=1}
             + Pr{Yi=0 | Xi=1} x Pr{Xi=1 | Yi=0}
             = si*p^2 / (si*p + (1-si)*(1-p))
             + si*(1-p)^2 / (si*(1-p) + (1-si)*p)