Privacy Preserving Market Basket Data Analysis
Ling Guo, Songtao Guo, Xintao Wu
University of North Carolina at Charlotte
Market Basket Data

TID | milk | sugar | bread | … | cereals
 1  |  1   |  0    |  1    | … |  1
 2  |  0   |  1    |  1    | … |  1
 3  |  1   |  0    |  0    | … |  1
 4  |  1   |  1    |  1    | … |  0
 …  |  …   |  …    |  …    | … |  …
 N  |  0   |  1    |  1    | … |  0
1: presence, 0: absence

Association rule X => Y (R. Agrawal, SIGMOD 1993) with support s = P(XY) and confidence c = P(XY) / P(X)
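To make the definitions concrete, here is a minimal Python sketch (the column layout and the toy data are made up for illustration, not taken from the slides) that computes the support and confidence of milk => cereals from a 0/1 transaction matrix:

```python
import numpy as np

# Toy 0/1 transaction matrix: rows are transactions, columns are items.
# Column order (hypothetical): milk, sugar, bread, cereals.
data = np.array([
    [1, 0, 1, 1],
    [0, 1, 1, 1],
    [1, 0, 0, 1],
    [1, 1, 1, 0],
    [0, 1, 1, 0],
])

milk, cereals = data[:, 0], data[:, 3]

support = np.mean(milk & cereals)        # s = P(XY): fraction containing both items
confidence = support / np.mean(milk)     # c = P(XY) / P(X)

print(f"s(milk => cereals) = {support:.2f}, c = {confidence:.2f}")
```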
Other measures

• 2 x 2 contingency table
• Objective measures for A => B
Related Work

• Privacy preserving association rule mining
  – Data swapping
  – Frequent itemset or rule hiding
  – Inverse frequent itemset mining
  – Item randomization
Item Randomization

[Slide shows the original transaction data (TID; milk, sugar, bread, …, cereals as 0/1 entries) side by side with its randomized counterpart of the same form.]

• To what extent does randomization affect the mining results? (Focus)
• To what extent does it protect privacy?
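A minimal sketch of item randomization, assuming the simple per-item scheme used later in the talk: each 0/1 entry is kept with probability p and flipped with probability 1 - p, independently. The value p = 0.8 and the toy data are illustration choices, not values from this slide.

```python
import numpy as np

rng = np.random.default_rng(0)

def randomize_items(data: np.ndarray, p: float) -> np.ndarray:
    """Keep each 0/1 entry with probability p; flip it with probability 1 - p."""
    flip = rng.random(data.shape) > p
    return np.where(flip, 1 - data, data)

original = rng.integers(0, 2, size=(1000, 4))     # toy transactions over 4 items
randomized = randomize_items(original, p=0.8)
print("fraction of entries flipped:", np.mean(original != randomized))
```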
Randomized Response (Stanley Warner, JASA 1965)

A: cheated in the exam;  Ā: didn't cheat in the exam

• Purpose: estimate the proportion π_A of population members that cheated in the exam.
• Procedure: each respondent answers through a randomization device:
  – with probability p: "Do you belong to A?"
  – with probability 1 − p: "Do you belong to Ā?"
  The probability of a "Yes" answer is λ = π_A p + (1 − π_A)(1 − p).
• The unbiased estimate of π_A is
  π̂_A = λ̂ / (2p − 1) + (p − 1) / (2p − 1),
  where λ̂ is the observed proportion of "Yes" answers.
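A small simulation sketch of Warner's device and the unbiased estimator above; the true proportion 0.3, p = 0.7, and the sample size are made-up illustration values:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, true_pi = 100_000, 0.7, 0.3            # illustration values, not from the slide

member = rng.random(n) < true_pi             # does the respondent belong to A?
ask_A = rng.random(n) < p                    # device picks "Do you belong to A?"
answer_yes = np.where(ask_A, member, ~member)

lam_hat = answer_yes.mean()                              # observed proportion of "Yes"
pi_hat = lam_hat / (2 * p - 1) + (p - 1) / (2 * p - 1)   # Warner's unbiased estimator
print(f"estimated pi_A = {pi_hat:.3f} (true value {true_pi})")
```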
Application of RR in MBD

• RR can be expressed in matrix form (0: No, 1: Yes) as λ = P π, with

  P = | p      1 − p |
      | 1 − p  p     |

• Extension to multiple variables:
  P = P_1 ⊗ P_2 ⊗ … ⊗ P_m   (⊗ stands for the Kronecker product)
  e.g., for 2 variables: π = (π_00, π_01, π_10, π_11)',  λ = (λ_00, λ_01, λ_10, λ_11)'
• The unbiased estimate of π is π̂ = P^{-1} λ̂, with
  disp(λ̂) = n^{-1} (λ_δ − λλ'),   disp(π̂) = n^{-1} P^{-1} (λ_δ − λλ') (P^{-1})',
  where λ_δ is the diagonal matrix with the elements of λ on its diagonal.
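A minimal numpy sketch of this estimator for two items (all probabilities and the sample size are made-up illustration values): it builds P = P_1 ⊗ P_2, simulates randomized responses from a known π, and recovers π̂ = P^{-1} λ̂.

```python
import numpy as np

def warner_matrix(p: float) -> np.ndarray:
    """Per-item distortion matrix: report the true value w.p. p, flip it w.p. 1 - p."""
    return np.array([[p, 1 - p],
                     [1 - p, p]])

P = np.kron(warner_matrix(0.8), warner_matrix(0.9))   # joint distortion for 2 items

pi = np.array([0.40, 0.10, 0.20, 0.30])    # true cell probabilities (00, 01, 10, 11)
lam = P @ pi                               # cell probabilities of the randomized data

rng = np.random.default_rng(2)
lam_hat = rng.multinomial(50_000, lam) / 50_000   # observed randomized frequencies

pi_hat = np.linalg.solve(P, lam_hat)       # unbiased estimate pi_hat = P^{-1} lam_hat
print(np.round(pi_hat, 3))
```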
Randomization example

Data owners hold the original data (A: milk, B: cereals) and randomize it with the distortion matrices

  P_A = | 0.8  0.2 |      P_B = | 0.9  0.1 |
        | 0.2  0.8 |            | 0.1  0.9 |

The true contingency table over (A, B) is
  π = (π_00, π_01, π_10, π_11)' = (0.415, 0.043, 0.183, 0.359)'

Data miners observe only the randomized data, with
  λ̂ = (λ̂_00, λ̂_01, λ̂_10, λ̂_11)' = (0.368, 0.097, 0.218, 0.316)'

and recover the unbiased estimate
  π̂ = (P_A^{-1} ⊗ P_B^{-1}) λ̂ = (π̂_00, π̂_01, π̂_10, π̂_11)' = (0.427, 0.031, 0.181, 0.362)'

From π̂ the measures are estimated as ŝ_AB = π̂_11 and ĉ_AB = π̂_11 / (π̂_10 + π̂_11); e.g., c_AB = 0.662 for the original data vs. ĉ_AB = 0.671 from the estimates.

We can get the estimate; how accurate can it be?
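As a quick check in code, the support and confidence estimates follow directly from the π̂ vector above (the rounded values give ĉ_AB ≈ 0.667; small differences from the slide's 0.671 are expected since the slide presumably used unrounded estimates):

```python
import numpy as np

# Estimated cell probabilities from this slide: (pi_00, pi_01, pi_10, pi_11).
pi_hat = np.array([0.427, 0.031, 0.181, 0.362])

s_hat = pi_hat[3]                              # estimated support of {milk, cereals}
c_hat = pi_hat[3] / (pi_hat[2] + pi_hat[3])    # estimated confidence of milk => cereals
print(f"s_hat = {s_hat:.3f}, c_hat = {c_hat:.3f}")
```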
Motivation

• Estimated values: ŝ_2 = 36.3 and ŝ_6 = 31.5; with s_min = 23%, both satisfy ŝ ≥ s_min, so both itemsets are reported as frequent.
• Original values: s_2 = 35.9 ≥ s_min is a frequent set, but s_6 = 22.1 < s_min is not a frequent set: rule 6 is falsely recognized as frequent from the estimated value!
• Lower & upper bounds: s_2^l = 23.8 ≥ s_min, a frequent set with high confidence; s_6^l = 12.3 < s_min, a frequent set without confidence.
Accuracy on Support S

• Estimate of support:
  π̂ = P^{-1} λ̂ = (P_1^{-1} ⊗ … ⊗ P_k^{-1}) λ̂
  e.g., π̂ = (π̂_00, π̂_01, π̂_10, π̂_11)' = (P_1^{-1} ⊗ P_2^{-1}) λ̂ = (0.427, 0.031, 0.181, 0.362)'
• Variance of support:
  côv(π̂) = (n − 1)^{-1} P^{-1} (λ̂_δ − λ̂λ̂') (P^{-1})'
  e.g. (entries × 10^{-5}; diagonal entries are v̂ar(π̂_ij), off-diagonal entries are covariances such as côv(π̂_10, π̂_11)):

  côv(π̂) = |  7.113  −1.668  −3.134  −2.311 |
            | −1.668   2.902   0.244  −1.478 |
            | −3.134   0.244   5.667  −2.777 |
            | −2.311  −1.478  −2.777   6.566 |

• Interquantile range (normal dist.):
  [π̂_{i_1…i_k} − z_{α/2} √v̂ar(π̂_{i_1…i_k}),  π̂_{i_1…i_k} + z_{α/2} √v̂ar(π̂_{i_1…i_k})]
  e.g., π̂_11 = 0.362 has range [0.346, 0.378]
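A quick check of the normal-approximation range for π̂_11 using the variance above (z_{α/2} ≈ 1.96 for a 95% range):

```python
import math

pi_11 = 0.362              # estimated support of {milk, cereals}
var_11 = 6.566e-5          # var(pi_11) from the covariance matrix above
z = 1.96                   # z_{alpha/2} for a 95% interquantile range

half_width = z * math.sqrt(var_11)
print(f"[{pi_11 - half_width:.3f}, {pi_11 + half_width:.3f}]")   # ~ [0.346, 0.378]
```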
Accuracy on Confidence C

• Estimate of confidence of A => B:
  ĉ = ŝ_AB / ŝ_A = π̂_11 / (π̂_10 + π̂_11)
• Variance of confidence:
  v̂ar(ĉ) = (π̂_10² / π̂_1•⁴) v̂ar(π̂_11) + (π̂_11² / π̂_1•⁴) v̂ar(π̂_10) − 2 (π̂_10 π̂_11 / π̂_1•⁴) côv(π̂_11, π̂_10),
  where π̂_1• = π̂_10 + π̂_11.
• Interquantile range (the ratio distribution F(w) is unknown, so only a loose range can be derived from Chebyshev's theorem):
  [ĉ − k √v̂ar(ĉ),  ĉ + k √v̂ar(ĉ)],  where ε = 1/k²
  Chebyshev's theorem: let X be a random variable with expected value μ and finite variance σ². Then for any real k > 0, Pr(|X − μ| ≥ kσ) ≤ 1/k².
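A sketch of these two formulas in code, plugging in the cell estimates and covariance entries from the support slide; ε = 0.05 is an illustration choice, not a value from the talk:

```python
import math

# Cell estimates and covariance entries (x 1e-5) from the "Accuracy on Support" slide.
pi_10, pi_11 = 0.181, 0.362
var_10, var_11, cov_10_11 = 5.667e-5, 6.566e-5, -2.777e-5

pi_1dot = pi_10 + pi_11
c_hat = pi_11 / pi_1dot

# Delta-method (Taylor) approximation of var(c_hat).
var_c = (pi_10**2 * var_11 + pi_11**2 * var_10
         - 2 * pi_10 * pi_11 * cov_10_11) / pi_1dot**4

# Loose range from Chebyshev: Pr(|X - mu| >= k*sigma) <= 1/k**2 = eps.
eps = 0.05                                  # illustration choice
k = 1 / math.sqrt(eps)
half = k * math.sqrt(var_c)
print(f"c_hat = {c_hat:.3f}, Chebyshev range = [{c_hat - half:.3f}, {c_hat + half:.3f}]")
```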
Bounds of other measures

[Table: accuracy bounds for the other measures.]
General Framework

• Step 1: Estimation
  – Express the measure as a function of the observed variables (the λ_ij or their marginal totals λ_i•, λ_•j).
  – Compute the estimated measure value.
• Step 2: Variance of the estimated measure
  – Obtain the variance of the estimated measure (a function of several known variables) through the Taylor approximation:
    v̂ar{g(x)} ≈ Σ_{i=1}^{k} [g'_i(θ)]² v̂ar(x_i) + Σ_{i≠j} g'_i(θ) g'_j(θ) côv(x_i, x_j)
• Step 3: Derive the interquantile range through Chebyshev's theorem.
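A generic sketch of steps 1 to 3 (not code from the paper): given any measure expressed as a function g of the cell estimates, approximate its variance with a numerical gradient and the covariance matrix, then take a Chebyshev range. The π̂ and côv values are the ones from the earlier example; the helper names are made up.

```python
import numpy as np

def taylor_variance(g, x, cov, h=1e-6):
    """First-order Taylor (delta-method) variance of g(x): grad(g)' cov grad(g)."""
    grad = np.zeros_like(x)
    for i in range(len(x)):
        dx = np.zeros_like(x)
        dx[i] = h
        grad[i] = (g(x + dx) - g(x - dx)) / (2 * h)   # central finite difference
    return grad @ cov @ grad

def chebyshev_range(value, var, eps=0.05):
    """Loose interquantile range: Pr(|X - mu| >= k*sigma) <= 1/k**2 = eps."""
    half = np.sqrt(var / eps)
    return value - half, value + half

# Step 1: the measure as a function of the cell estimates, e.g. confidence of A => B.
confidence = lambda pi: pi[3] / (pi[2] + pi[3])        # cells ordered (00, 01, 10, 11)

pi_hat = np.array([0.427, 0.031, 0.181, 0.362])        # estimates from the example slide
cov_hat = 1e-5 * np.array([[ 7.113, -1.668, -3.134, -2.311],
                           [-1.668,  2.902,  0.244, -1.478],
                           [-3.134,  0.244,  5.667, -2.777],
                           [-2.311, -1.478, -2.777,  6.566]])

c_hat = confidence(pi_hat)                             # Step 1: estimate
var_c = taylor_variance(confidence, pi_hat, cov_hat)   # Step 2: Taylor variance
print(c_hat, chebyshev_range(c_hat, var_c))            # Step 3: Chebyshev range
```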
Example for χ² with two variables

• Step 1: Get the estimate of the measure:
  χ̂² = n { (π̂_00 − π̂_0•π̂_•0)² / (π̂_0•π̂_•0) + (π̂_01 − π̂_0•π̂_•1)² / (π̂_0•π̂_•1) + (π̂_10 − π̂_1•π̂_•0)² / (π̂_1•π̂_•0) + (π̂_11 − π̂_1•π̂_•1)² / (π̂_1•π̂_•1) }
• Step 2: Get the variance of the estimated measure:
  v̂ar(χ̂²) ≈ Σ_{i=1}^{4} (∂χ̂²/∂x_i)² v̂ar(x_i) + Σ_{i≠j} (∂χ̂²/∂x_i)(∂χ̂²/∂x_j) côv(x_i, x_j),
  where x_1 = π̂_00, x_2 = π̂_01, x_3 = π̂_10, x_4 = π̂_11.
• Step 3: Derive the interquantile range through Chebyshev's theorem.
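Step 1 of this example in code, a minimal sketch: the χ² statistic computed from the estimated cell probabilities of the earlier example; the sample size n = 10,000 is an assumption for illustration, the slides do not state it.

```python
import numpy as np

def chi_square(cells: np.ndarray, n: int) -> float:
    """Pearson chi-square for a 2x2 table given cell probabilities (00, 01, 10, 11)."""
    p = cells.reshape(2, 2)
    expected = np.outer(p.sum(axis=1), p.sum(axis=0))   # pi_i. * pi_.j
    return n * np.sum((p - expected) ** 2 / expected)

pi_hat = np.array([0.427, 0.031, 0.181, 0.362])   # estimates from the example slide
print(chi_square(pi_hat, n=10_000))               # n is an assumed sample size
```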
Accuracy Bounds

• With an unknown distribution, Chebyshev's theorem only gives loose bounds.

[Figure: bounds of the support vs. varying p]
Distortion

• All the discussion above assumes the distortion matrices P are known to data miners.
  – P could be exploited by attackers to improve the posterior probability of their predictions of sensitive items.
• How about not releasing P?
  – Disclosure risk is decreased.
  – What about the data mining result?
Unknown distortion P

• Some measures have monotonic properties:

  Measure                  | Expression                                                   | Property
  Correlation (φ)          | (π_11 π_00 − π_01 π_10) / √(π_1• π_0• π_•1 π_•0)             | φ_ran ≤ φ_ori
  Mutual information (M)   | Σ_i Σ_j π_ij log[π_ij / (π_i• π_•j)] / (−Σ_i π_i• log π_i•)  | M_ran ≤ M_ori
  Likelihood ratio (G²)    | 2 Σ_i Σ_j π_ij log[π_ij / (π_i• π_•j)]                       | G²_ran ≤ G²_ori
  Pearson statistic (χ²)   | Σ_i Σ_j (π_ij − π_i• π_•j)² / (π_i• π_•j)                    | χ²_ran ≤ χ²_ori

• Other measures don't have such properties.
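As an empirical check of the χ² monotonic property (a single numeric illustration, not a proof), the sketch below compares χ² computed from the original cell probabilities of the earlier example with χ² computed from their randomized counterparts λ = (P_A ⊗ P_B) π, using the distortion values 0.8 and 0.9 from that example:

```python
import numpy as np

def chi_square(cells: np.ndarray, n: int = 10_000) -> float:
    """Pearson chi-square for 2x2 cell probabilities ordered (00, 01, 10, 11)."""
    p = cells.reshape(2, 2)
    expected = np.outer(p.sum(axis=1), p.sum(axis=0))
    return n * np.sum((p - expected) ** 2 / expected)

def warner(p: float) -> np.ndarray:
    return np.array([[p, 1 - p], [1 - p, p]])

pi = np.array([0.415, 0.043, 0.183, 0.359])        # original cells from the example
lam = np.kron(warner(0.8), warner(0.9)) @ pi       # randomized cell probabilities

print(chi_square(lam) <= chi_square(pi))           # True: chi2_ran <= chi2_ori
```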
Applications: hypothesis test

• From the randomized data, if we discover an itemset that satisfies χ²_ran ≥ χ²_α, we can guarantee that dependence exists among the original itemset, since χ²_ori ≥ χ²_ran.
• We can still derive the strongly dependent itemsets from the randomized data.
• No false positives.
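A sketch of this test on a randomized table, using scipy's χ² quantile as the critical value χ²_α; the counts are hypothetical illustration values, not data from the talk:

```python
import numpy as np
from scipy.stats import chi2

# Hypothetical 2x2 contingency table counts observed in the *randomized* data.
observed = np.array([[3680, 970],
                     [2180, 3170]])

n = observed.sum()
p = observed / n
expected = np.outer(p.sum(axis=1), p.sum(axis=0))
chi2_ran = n * np.sum((p - expected) ** 2 / expected)

alpha = 0.05
critical = chi2.ppf(1 - alpha, df=1)          # chi^2_alpha for a 2x2 table (1 d.o.f.)

# Since chi2_ori >= chi2_ran, rejecting independence on the randomized data
# also guarantees dependence in the original data (no false positive).
print(chi2_ran, critical, chi2_ran >= critical)
```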
Conclusion

• We propose a general approach to deriving accuracy bounds for various measures adopted in MBD analysis.
• We prove that some measures have monotonic properties, so some data mining tasks can be conducted directly on randomized data (without knowing the distortion); no false positive pattern exists in the mining result.
Future Work

• Which measures are more sensitive to randomization?
• The tradeoff between the privacy of individual data and the accuracy of data mining results
• Accuracy vs. disclosure analysis for general categorical data
Acknowledgement
• NSF IIS-0546027
• Ph.D. students
Ling Guo
Songtao Guo
Q&A