Privacy Preserving Market Basket Data Analysis
Ling Guo, Songtao Guo, Xintao Wu
University of North Carolina at Charlotte
Market Basket Data

  TID | milk | sugar | bread | … | cereals
  ----+------+-------+-------+---+--------
   1  |  1   |  0    |  1    | … |   1
   2  |  0   |  1    |  1    | … |   1
   3  |  1   |  0    |  0    | … |   1
   4  |  1   |  1    |  1    | … |   0
   …  |  …   |  …    |  …    | … |   …
   N  |  0   |  1    |  1    | … |   0

1: presence, 0: absence

Association rule X => Y (R. Agrawal, SIGMOD 1993):
  support $s = P(XY)$
  confidence $c = \frac{P(XY)}{P(X)}$
Other Measures

• 2 x 2 contingency table
• Objective measures for A => B
Related Work
• Privacy preserving association rule mining
Data swapping
Frequent itemset or rule hiding
Inverse frequent itemset mining
Item randomization
4
Item Randomization

[Figure: an original market basket table (TID × milk, sugar, bread, …,
cereals) is transformed into a randomized table of the same shape.]

• To what extent does randomization affect mining results? (Focus)
• To what extent does it protect privacy?
Randomized Response (Stanley Warner, JASA 1965)

A: cheated in the exam;  Ā: didn't cheat in the exam

Purpose: estimate the proportion $\pi_A$ of population members that cheated
in the exam.

Procedure: each respondent uses a randomization device that asks "Do you
belong to A?" with probability $p$ and "Do you belong to Ā?" with
probability $1-p$; only the "Yes"/"No" answer is reported.

The probability of a "Yes" answer is

  $\lambda = \pi_A p + (1-\pi_A)(1-p)$

An unbiased estimate of $\pi_A$ is

  $\hat{\pi}_{A,W} = \frac{p-1}{2p-1} + \frac{\hat{\lambda}}{2p-1}$
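Warner's procedure can be sketched as a small simulation. The helper names (`warner_survey`, `warner_estimate`) and the population parameters below are illustrative, not from the talk:

```python
import random

def warner_survey(true_members, p, rng):
    """Each respondent answers 'Do you belong to A?' with probability p,
    and 'Do you belong to not-A?' with probability 1 - p; only the
    yes/no answer is reported."""
    answers = []
    for is_member in true_members:
        asked_direct = rng.random() < p
        answers.append(is_member if asked_direct else not is_member)
    return answers

def warner_estimate(answers, p):
    """Warner's unbiased estimator: (p - 1)/(2p - 1) + lambda_hat/(2p - 1)."""
    lam_hat = sum(answers) / len(answers)
    return (p - 1) / (2 * p - 1) + lam_hat / (2 * p - 1)

rng = random.Random(7)
n, true_pi, p = 200_000, 0.30, 0.8
population = [rng.random() < true_pi for _ in range(n)]
answers = warner_survey(population, p, rng)
est = warner_estimate(answers, p)
print(round(est, 3))  # close to the true proportion 0.30
```

Note that the estimator undoes the bias of the "Yes" rate: if the observed rate equals $p$ exactly, the estimate is exactly 1.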
Application of RR in MBD

• RR can be expressed in matrix form (0: No, 1: Yes):

  $\lambda = P\pi$, where
  $P = \begin{pmatrix} p & 1-p \\ 1-p & p \end{pmatrix}$

• Extension to multiple variables:

  $P = P_1 \otimes P_2 \otimes \cdots \otimes P_m$  ($\otimes$: Kronecker product)

  e.g., for 2 variables:
  $\lambda = (\lambda_{00}, \lambda_{01}, \lambda_{10}, \lambda_{11})'$,
  $\pi = (\pi_{00}, \pi_{01}, \pi_{10}, \pi_{11})'$

• An unbiased estimate of $\pi$ is $\hat{\pi} = P^{-1}\hat{\lambda}$, with

  $disp(\hat{\lambda}) = n^{-1}(\lambda_\delta - \lambda\lambda')$
  $disp(\hat{\pi}) = n^{-1} P^{-1}(\lambda_\delta - \lambda\lambda')(P^{-1})'$

  where $\lambda_\delta$ is the diagonal matrix with elements $\lambda_i$.
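Applying $P_j$ to a transaction simply means keeping each item's bit with probability $p_j$ and flipping it otherwise. A minimal sketch (the `randomize_transaction` helper and the example probabilities are illustrative):

```python
import random

def randomize_transaction(row, retain_probs, rng):
    """Apply P_j = [[p_j, 1-p_j], [1-p_j, p_j]] to each column:
    keep bit j with probability p_j, flip it with probability 1 - p_j."""
    return [bit if rng.random() < p else 1 - bit
            for bit, p in zip(row, retain_probs)]

rng = random.Random(1)
transaction = [1, 0, 1, 1]       # e.g. milk, sugar, bread, cereals
ps = [0.8, 0.8, 0.9, 0.9]        # per-item retention probabilities
print(randomize_transaction(transaction, ps, rng))
```

With retention probability 1 the row is unchanged; with 0 every bit is flipped.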
Randomization Example

[Figure: the original table (TID × milk, sugar, bread, …, cereals) is
randomized with RR into the table seen by data miners.]

A: milk,  B: cereals

$P_A = \begin{pmatrix} 0.8 & 0.2 \\ 0.2 & 0.8 \end{pmatrix}$,
$P_B = \begin{pmatrix} 0.9 & 0.1 \\ 0.1 & 0.9 \end{pmatrix}$

Data owners (original data):

         B̄       B
  Ā    0.415   0.043  | 0.458
  A    0.183   0.359  | 0.542
       0.598   0.402

  $\pi = (\pi_{00}, \pi_{01}, \pi_{10}, \pi_{11})' = (0.415, 0.043, 0.183, 0.359)'$
  $s_{AB} = \pi_{11} = 0.359$
  $c_{AB} = \frac{\pi_{11}}{\pi_{10} + \pi_{11}} = 0.662$

Data miners (randomized data):

         B̄       B
  Ā    0.368   0.097  | 0.465
  A    0.218   0.317  | 0.535
       0.586   0.414

  $\hat{\lambda} = (\hat{\lambda}_{00}, \hat{\lambda}_{01}, \hat{\lambda}_{10}, \hat{\lambda}_{11})' = (0.368, 0.097, 0.218, 0.317)'$
  $\hat{\pi} = (P_A^{-1} \otimes P_B^{-1})\hat{\lambda} = (\hat{\pi}_{00}, \hat{\pi}_{01}, \hat{\pi}_{10}, \hat{\pi}_{11})' = (0.427, 0.031, 0.181, 0.362)'$
  $\hat{s}_{AB} = \hat{\pi}_{11} = 0.362$
  $\hat{c}_{AB} = \frac{\hat{\pi}_{11}}{\hat{\pi}_{10} + \hat{\pi}_{11}} = 0.671$

We can get the estimates, but how accurate are they?
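The estimation step can be checked numerically. The sketch below uses the slide's $P_A$, $P_B$, and original $\pi$; it forward-computes the exact randomized distribution $\lambda = (P_A \otimes P_B)\pi$ and inverts it, recovering $\pi$ exactly (on the slide, $\hat{\lambda}$ is instead estimated from a finite randomized sample, so $\hat{\pi}$ only approximates $\pi$):

```python
import numpy as np

P_A = np.array([[0.8, 0.2], [0.2, 0.8]])   # distortion for A = milk
P_B = np.array([[0.9, 0.1], [0.1, 0.9]])   # distortion for B = cereals

# Original joint distribution pi = (pi_00, pi_01, pi_10, pi_11)
pi = np.array([0.415, 0.043, 0.183, 0.359])

lam = np.kron(P_A, P_B) @ pi               # distribution after randomization
pi_hat = np.kron(np.linalg.inv(P_A), np.linalg.inv(P_B)) @ lam

s_AB = pi_hat[3]                            # support of {milk, cereals}
c_AB = pi_hat[3] / (pi_hat[2] + pi_hat[3])  # confidence of milk => cereals
print(np.round(pi_hat, 3), round(s_AB, 3), round(c_AB, 3))
```

The inversion works because $(P_A \otimes P_B)^{-1} = P_A^{-1} \otimes P_B^{-1}$.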
Motivation

[Chart: estimated vs. original supports of two itemsets against the
threshold s_min = 23%, together with derived lower and upper bounds.]

• Itemset 2 is frequent under both the original and the estimated support.
• Itemset 6 is falsely recognized as frequent from the estimated value:
  its estimated support exceeds s_min, but its original support does not.
• The lower/upper bounds separate "frequent set with high confidence" from
  "frequent set without confidence".
Accuracy on Support s

• Estimate of support:

  $\hat{\pi} = P^{-1}\hat{\lambda} = (P_1^{-1} \otimes \cdots \otimes P_k^{-1})\hat{\lambda}$

  e.g. $\hat{\pi} = (\hat{\pi}_{00}, \hat{\pi}_{01}, \hat{\pi}_{10}, \hat{\pi}_{11})' = (P_1^{-1} \otimes P_2^{-1})\hat{\lambda} = (0.427, 0.031, 0.181, 0.362)'$

• Variance of support:

  $\widehat{cov}(\hat{\pi}) = (n-1)^{-1} P^{-1}(\hat{\lambda}_\delta - \hat{\lambda}\hat{\lambda}')(P^{-1})'$

  For the example ($\times 10^{-5}$):

            π̂00     π̂01     π̂10     π̂11
    π̂00   7.113   1.668   3.134   2.311
    π̂01   1.668   2.902   0.244   1.478
    π̂10   3.134   0.244   5.667   2.777
    π̂11   2.311   1.478   2.777   6.566

  e.g. $\widehat{var}(\hat{\pi}_{11}) = 6.566 \times 10^{-5}$,
  $\widehat{cov}(\hat{\pi}_{10}, \hat{\pi}_{11}) = 2.777 \times 10^{-5}$

• Interquantile range (normal dist.):

  $[\hat{\pi}_{i_1 \cdots i_k} - z_{\alpha/2}\sqrt{\widehat{var}(\hat{\pi}_{i_1 \cdots i_k})},\ \hat{\pi}_{i_1 \cdots i_k} + z_{\alpha/2}\sqrt{\widehat{var}(\hat{\pi}_{i_1 \cdots i_k})}]$

  e.g. $\hat{\pi}_{11} = 0.362$ with range $[0.346, 0.378]$.
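Plugging the slide's numbers into the normal-approximation interval reproduces the quoted range:

```python
import math

pi_11 = 0.362          # estimated support
var_11 = 6.566e-5      # vhat(pi_hat_11) from the covariance matrix
z = 1.959964           # z_{alpha/2} for alpha = 0.05

half = z * math.sqrt(var_11)
lo, hi = pi_11 - half, pi_11 + half
print(round(lo, 3), round(hi, 3))  # [0.346, 0.378], as on the slide
```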
Accuracy on Confidence c

• Estimate of confidence for A => B:

  $\hat{c} = \frac{\hat{s}_{AB}}{\hat{s}_A} = \frac{\hat{\pi}_{11}}{\hat{\pi}_{10} + \hat{\pi}_{11}}$

• Variance of confidence:

  $\widehat{var}(\hat{c}) = \frac{\hat{\pi}_{10}^2}{\hat{\pi}_{1\cdot}^4}\widehat{var}(\hat{\pi}_{11}) + \frac{\hat{\pi}_{11}^2}{\hat{\pi}_{1\cdot}^4}\widehat{var}(\hat{\pi}_{10}) - 2\frac{\hat{\pi}_{10}\hat{\pi}_{11}}{\hat{\pi}_{1\cdot}^4}\widehat{cov}(\hat{\pi}_{11}, \hat{\pi}_{10})$

• Interquantile range (the ratio dist. is F(w)). A loose range is derived
  from Chebyshev's theorem:

  $[\hat{c} - k\sqrt{\widehat{var}(\hat{c})},\ \hat{c} + k\sqrt{\widehat{var}(\hat{c})}]$, where $k = 1/\sqrt{\alpha}$

  Chebyshev's theorem: let X be a random variable with expected value $\mu$
  and finite variance $\sigma^2$. Then for any real $k > 0$,
  $\Pr(|X - \mu| \geq k\sigma) \leq 1/k^2$.
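The variance formula can be evaluated with the covariance entries from the support slide. This sketch yields $\hat{c} \approx 0.667$, slightly different from the deck's printed 0.671, presumably because the deck used unrounded $\hat{\pi}$ values:

```python
import math

pi10, pi11 = 0.181, 0.362
pi1 = pi10 + pi11                 # pi_hat_{1.} = 0.543
var11, var10, cov1011 = 6.566e-5, 5.667e-5, 2.777e-5

c_hat = pi11 / pi1
var_c = (pi10**2 * var11 + pi11**2 * var10
         - 2 * pi10 * pi11 * cov1011) / pi1**4

# Loose Chebyshev interval with alpha = 0.05, i.e. k = 1/sqrt(alpha)
k = 1 / math.sqrt(0.05)
half = k * math.sqrt(var_c)
print(round(c_hat, 3), round(c_hat - half, 3), round(c_hat + half, 3))
```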
Bounds of Other Measures

[Table: accuracy bounds of the remaining measures, not captured here.]
General Framework

Step 1: Estimation
  Express the measure as a derived function of the observed variables
  ($\pi_{ij}$ or their marginal totals $\pi_{i\cdot}, \pi_{\cdot j}$), and
  compute the estimated measure value.

Step 2: Variance of the estimated measure
  Get the variance of the estimated measure (a function of several known
  variables) through Taylor approximation:

  $\widehat{var}\{g(x)\} \approx \sum_{i=1}^{k} [g_i'(\theta)]^2\,\widehat{var}(x_i) + \sum_{i \neq j} g_i'(\theta)\,g_j'(\theta)\,\widehat{cov}(x_i, x_j)$

Step 3: Derive the interquantile range through Chebyshev's theorem.
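Step 2 is the standard delta method: in matrix form the two sums collapse to the quadratic form grad' Cov grad. A minimal sketch (`taylor_var` is an illustrative name, not from the talk):

```python
import numpy as np

def taylor_var(grad, cov):
    """First-order Taylor variance of g(x): grad(g)' Cov(x) grad(g).
    Expanding the quadratic form gives exactly the two sums on the slide:
    squared-gradient variance terms plus cross covariance terms."""
    grad = np.asarray(grad, dtype=float)
    return float(grad @ np.asarray(cov, dtype=float) @ grad)

# Sanity check with g(x) = x1 + x2, whose gradient is (1, 1):
cov = np.array([[2.0, 0.5], [0.5, 1.0]])
print(taylor_var([1.0, 1.0], cov))  # var(x1) + var(x2) + 2*cov(x1,x2) = 4.0
```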
Example: $\chi^2$ with Two Variables

Step 1: Get the estimate of the measure:

  $\hat{\chi}^2 = n\left\{\frac{(\hat{\pi}_{00} - \hat{\pi}_{0\cdot}\hat{\pi}_{\cdot 0})^2}{\hat{\pi}_{0\cdot}\hat{\pi}_{\cdot 0}} + \frac{(\hat{\pi}_{01} - \hat{\pi}_{0\cdot}\hat{\pi}_{\cdot 1})^2}{\hat{\pi}_{0\cdot}\hat{\pi}_{\cdot 1}} + \frac{(\hat{\pi}_{10} - \hat{\pi}_{1\cdot}\hat{\pi}_{\cdot 0})^2}{\hat{\pi}_{1\cdot}\hat{\pi}_{\cdot 0}} + \frac{(\hat{\pi}_{11} - \hat{\pi}_{1\cdot}\hat{\pi}_{\cdot 1})^2}{\hat{\pi}_{1\cdot}\hat{\pi}_{\cdot 1}}\right\}$

Step 2: Get the variance of the estimated measure:

  $\widehat{var}\{\hat{\chi}^2\} \approx \sum_{i=1}^{4}\left(\frac{\partial \chi^2}{\partial x_i}\right)^2 \widehat{var}(x_i) + \sum_{i \neq j}\left(\frac{\partial \chi^2}{\partial x_i}\right)\left(\frac{\partial \chi^2}{\partial x_j}\right)\widehat{cov}(x_i, x_j)$

  where $x_1 = \hat{\pi}_{00},\ x_2 = \hat{\pi}_{01},\ x_3 = \hat{\pi}_{10},\ x_4 = \hat{\pi}_{11}$.

Step 3: Derive the interquantile range through Chebyshev's theorem.
Accuracy Bounds

• With an unknown distribution, Chebyshev's theorem only gives loose
  bounds.

[Chart: bounds of the support vs. varying p.]
Distortion

• All the above discussion assumes the distortion matrices P are known to
  data miners.
  – P could be exploited by attackers to improve the posterior probability
    of their predictions on sensitive items.
• What if P is not released?
  – Disclosure risk is decreased.
  – But what happens to the data mining results?
Unknown Distortion P

Some measures have monotonic properties:

  Measure                  Expression                                                                         Property
  Correlation (ρ)          $\rho = \frac{\pi_{11}\pi_{00} - \pi_{01}\pi_{10}}{\sqrt{\pi_{1\cdot}\pi_{\cdot 1}\pi_{0\cdot}\pi_{\cdot 0}}}$   $\rho_{ran} \leq \rho_{ori}$
  Mutual information (M)   $M = \sum_{i,j} \pi_{ij}\log\frac{\pi_{ij}}{\pi_{i\cdot}\pi_{\cdot j}}$                                          $M_{ran} \leq M_{ori}$
  Likelihood ratio (G²)    $G^2 = 2n\sum_{i,j} \pi_{ij}\log\frac{\pi_{ij}}{\pi_{i\cdot}\pi_{\cdot j}}$                                      $G^2_{ran} \leq G^2_{ori}$
  Pearson statistic (χ²)   $\chi^2 = n\sum_{i,j}\frac{(\pi_{ij} - \pi_{i\cdot}\pi_{\cdot j})^2}{\pi_{i\cdot}\pi_{\cdot j}}$                 $\chi^2_{ran} \leq \chi^2_{ori}$

Other measures don't have such properties.
Applications: hypothesis test

From the randomized data, if we discover an itemset which satisfies
$\chi^2_{ran} \geq \chi^2_\alpha$, we can guarantee that dependence exists
among the original itemset, since $\chi^2_{ori} \geq \chi^2_{ran}$.

• We can still derive the strongly dependent itemsets from the randomized
  data.
• No false positives.
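The χ² monotonicity can be spot-checked on the talk's running example. The sketch below builds the randomized-data distribution $\lambda = (P_A \otimes P_B)\pi$ and confirms that $\chi^2_{ran} \leq \chi^2_{ori}$ for these numbers (the full proof is in the paper):

```python
import numpy as np

def chi2_stat(pi_vec, n):
    """Pearson chi-square for a 2x2 cell-probability vector
    (pi_00, pi_01, pi_10, pi_11) and sample size n."""
    t = np.asarray(pi_vec).reshape(2, 2)
    expected = np.outer(t.sum(axis=1), t.sum(axis=0))
    return n * float(((t - expected) ** 2 / expected).sum())

P_A = np.array([[0.8, 0.2], [0.2, 0.8]])
P_B = np.array([[0.9, 0.1], [0.1, 0.9]])
pi_ori = np.array([0.415, 0.043, 0.183, 0.359])
pi_ran = np.kron(P_A, P_B) @ pi_ori     # distribution of the randomized data

n = 10_000
print(chi2_stat(pi_ran, n) <= chi2_stat(pi_ori, n))  # True: randomization weakens dependence
```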
Conclusion

• Proposed a general approach to deriving accuracy bounds for various
  measures adopted in MBD analysis.
• Proved that some measures have a monotonic property, so some data mining
  tasks can be conducted directly on randomized data (without knowing the
  distortion), with no false positive pattern in the mining result.
Future Work

• Which measures are more sensitive to randomization?
• The tradeoff between the privacy of individual data and the accuracy of
  data mining results.
• Accuracy vs. disclosure analysis for general categorical data.
Acknowledgement
• NSF IIS-0546027
• Ph.D. students
Ling Guo
Songtao Guo
Q&A