Privacy-Preserving Data Mining

Jaideep Vaidya ([email protected])
Joint work with Chris Clifton (Purdue University)

Outline
• Introduction
– Privacy-Preserving Data Mining
– Horizontal / Vertical Partitioning of Data
– Secure Multi-party Computation
• Privacy-Preserving Outlier Detection
• Privacy-Preserving Association Rule Mining
• Conclusion

Back in the good ol' days
(Figure: a local grocer then vs. grocery chains now — Dominick's, Safeway, Jewel.)

A "real" example
• Ford / Firestone
– Individual databases
– Possible to join both databases (find corresponding transactions)
– Commercial reasons not to share data
– Valuable corporate information: cost structures / business structures
• Ford Explorers with Firestone tires → tread separation problems (accidents!)
• Might have been able to figure it out a bit earlier (tires from the Decatur, Ill. plant, certain situations)

Public (mis)Perception of Data Mining: Attack on Privacy
• Fears of loss of privacy constrain data mining
– Protests over a National Registry in Japan
– Data Mining Moratorium Act
  • Would stop all data mining R&D by the DoD
• Terrorism Information Awareness ended
– Data mining could be a key technology

Is Data Mining a Threat?
• Data mining summarizes data
– (Possible?) exception: anomaly / outlier detection
• Summaries aren't private
– Or are they? Does generating them raise issues?
• Data mining can be a privacy solution
– Data mining enables safe use of private data

Privacy-Preserving Data Mining
• How can we mine data if we cannot see it?
• Perturbation (Agrawal & Srikant; Evfimievski et al.)
– Extremely scalable, approximate results
– Debate about security properties
• Cryptographic (Lindell & Pinkas; Vaidya & Clifton)
– Completely accurate, completely secure (tight bound on disclosure)
– Appropriate for a small number of parties
• Condensation / hybrid approaches

Assumptions
• Data is distributed
– Each data set is held by a source authorized to see it
– Nobody is allowed to see the aggregate data
  • Knowing all data about an individual violates privacy
• Data holders don't want to disclose data
– Won't collude to violate privacy

Gold Standard: Trusted Third Party

Horizontal Partitioning of Data

Bank of America:
  CC#   Active?   Delinquent?   Amount
  123   Yes       Yes           <$300
  324   No        No            $300-500
  919   Yes       No            >$1000

Chase Manhattan:
  CC#   Active?   Delinquent?   Amount
  3450  Yes       Yes           <$300
  4127  No        No            $300-500
  8772  Yes       No            >$1000

Vertical Partitioning of Data (Global Database View)

Medical Records:
  TID   Brain Tumor?   Diabetes?
  RPJ   Yes            Diabetic
  CAC   No Tumor       No
  PTR   No Tumor       Diabetic

Cell Phone Data:
  TID   Model   Battery
  RPJ   5210    Li/Ion
  CAC   none    none
  PTR   3650    NiCd

Secure Multi-Party Computation (SMC)
• Given a function f and n inputs x_1, x_2, ..., x_n distributed at n sites, compute the result y = f(x_1, x_2, ..., x_n) while revealing nothing to any site except its own input(s) and the result.

Secure Multi-Party Computation: It Can Be Done!
• Yao's Millionaires' problem (Yao '86)
– Secure computation is possible if the function can be represented as a circuit
– Idea: securely compute each gate, then continue evaluating the circuit
• Extended to multiple parties (BGW/GMW '87)
• Biggest problem: efficiency
– Will not work for many parties / large quantities of data

SMC – Models of Computation
• Semi-honest model
– Parties follow the protocol faithfully
• Malicious model
– Anything goes!
• Provably secure
• In either case, input can always be modified

What is an Outlier?
• An object O in a dataset T is a DB(p, dt)-outlier if at least fraction p of the objects in T lie at distance greater than dt from O
• Centralized solution from Knorr and Ng
– Nested-loop comparison
– Maintain a count of objects inside the threshold
– If the count exceeds the threshold, declare the object a non-outlier and move to the next
– Clever processing order minimizes I/O cost

Privacy-Preserving Solution
• Key idea: share splitting
– Computations leave results (randomly) split between parties
– The only outcome is whether the count of points within the distance threshold exceeds the outlier threshold
• Requires pairwise comparison of all points
– Failure to compare all points reveals information about non-outliers
  • That alone makes it possible to cluster points
  • This is a privacy violation
• Asymptotically equivalent to Knorr & Ng

Solution: Horizontal Partition
• Compare locally with your own points
• For remote points, get a random share of the distance
– Calculate a random share of "exceeds threshold or doesn't"
• Sum the shares and test if there are enough "close" points
(Figure: example points and the random shares exchanged between the parties.)

Random Share of Distance
• Distance²(X, Y) = Σ_{r=1}^{m} (x_r − y_r)²
                 = Σ_{r=1}^{m} x_r² + Σ_{r=1}^{m} y_r² − 2 Σ_{r=1}^{m} x_r·y_r
• The x² and y² sums are local; the sum of the x_r·y_r terms is a scalar product
– Several protocols exist for share-splitting the scalar product (Du & Atallah '01; Vaidya & Clifton '02; Ioannidis, Grama & Atallah '02)

Shares of "Within Threshold"
• Goal: is x + y ≤ dt?
• Essentially Yao's Millionaires' problem (Yao '86)
– Represent the function to be computed as a circuit
– A cryptographic protocol gives random shares of each wire
• Solves "sum of shares from within dt exceeds minimum" as well

Vertically Partitioned Data
• Each party computes its part of the distance:
  Distance²(X, Y) = Σ_{r=1}^{m} (x_r − y_r)² = Σ_{r=1}^{a} (x_r − y_r)² + Σ_{r=a+1}^{m} (x_r − y_r)²
• Secure comparison (circuit evaluation) gives each party shares of 1/0 (close / not close)
• Sum and compare as with horizontal partitioning

Why is this Secure?
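A minimal Python sketch of the share splitting described above (hypothetical helper names; the secure scalar-product sub-protocol is replaced by a local stand-in, and arithmetic is done modulo 2^32 so that each individual share is a uniformly random value):

```python
import random

def split_scalar_product(x, y, modulus=2**32):
    """Stand-in for a secure scalar-product protocol: returns random
    additive shares (s1, s2) with s1 + s2 = <x, y> (mod modulus)."""
    dot = sum(a * b for a, b in zip(x, y)) % modulus
    s1 = random.randrange(modulus)   # uniformly random: reveals nothing in isolation
    s2 = (dot - s1) % modulus
    return s1, s2

def distance_shares(x, y, modulus=2**32):
    """Party 1 (holding x) gets share1, party 2 (holding y) gets share2, with
    share1 + share2 = Distance^2(x, y) (mod modulus), using the identity
    Distance^2 = sum(x_r^2) + sum(y_r^2) - 2 * sum(x_r * y_r)."""
    s1, s2 = split_scalar_product(x, y, modulus)
    share1 = (sum(a * a for a in x) - 2 * s1) % modulus  # local x^2 terms minus xy share
    share2 = (sum(b * b for b in y) - 2 * s2) % modulus  # local y^2 terms minus xy share
    return share1, share2

x, y = [1, 2, 3], [4, 6, 8]
share1, share2 = distance_shares(x, y)
true_dist2 = sum((a - b) ** 2 for a, b in zip(x, y))  # (1-4)^2 + (2-6)^2 + (3-8)^2 = 50
assert (share1 + share2) % 2**32 == true_dist2
```

The parties would then feed their shares into a secure comparison (a Yao-style circuit) to learn only whether the distance is within dt, never the distance itself.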
• Random shares are indistinguishable from random values
– They contain no knowledge in isolation
– Assuming no collusion, shares are only ever viewed in isolation
• The number of values (= number of shares) is known
– Nothing new is revealed
• "Too few close points" is the definition of an outlier
– This is the desired result
• Nothing is learned that can't be deduced from one's own input and the result!

Conclusion (Outlier Detection)
• Outlier detection is feasible without revealing anything but the outliers
– Possibly expensive (quadratic)
– But a more efficient solution for this definition of outlier inherently reveals potentially privacy-violating information
• Key: the privacy of non-outliers is preserved
– The reason why outliers are outliers is also hidden
• Allows searching for "unusual" entities without disclosing private information about them

Association Rules
• Association rules are a common data mining task
– Find A, B, C such that AB → C holds frequently (e.g., Diapers → Beer)
• Fast algorithms exist for centralized and distributed computation
– Basic idea: for ABC to be frequent, AB, AC, and BC must all be frequent
– They require sharing data
• Generic Secure Multiparty Computation is too expensive

Association Rule Mining
• Find out if the itemset {A1, B1} is frequent (i.e., if support of {A1, B1} ≥ k)

  Party A:           Party B:
  Key   A1           Key   B1
  k1    1            k1    0
  k2    0            k2    1
  k3    0            k3    0
  k4    1            k4    1
  k5    1            k5    1

• The support of an itemset is the number of transactions in which all attributes of the itemset are present
• For binary data, support = |A_i ∧ B_i|
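To make the support computation concrete, here is a plaintext Python sketch of what the two parties jointly compute for the table above (illustration only: in the actual protocol the TIDs are commutatively encrypted, so neither party sees the other's column):

```python
# Party A's column and party B's column, keyed by the same TIDs
A1 = {"k1": 1, "k2": 0, "k3": 0, "k4": 1, "k5": 1}
B1 = {"k1": 0, "k2": 1, "k3": 0, "k4": 1, "k5": 1}

# TID-list view: the support of {A1, B1} is |A_tid ∩ B_tid|,
# i.e. the number of transactions where both attributes are 1
A_tid = {tid for tid, v in A1.items() if v == 1}
B_tid = {tid for tid, v in B1.items() if v == 1}
support = len(A_tid & B_tid)

k = 2  # hypothetical minimum support threshold
print(support, support >= k)  # 2 True: {A1, B1} is frequent (transactions k4 and k5)
```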
Association Rule Mining
• Idea based on the TID-list representation of data
– Represent attribute A as a TID-list A_tid
– Support of ABC is |A_tid ∩ B_tid ∩ C_tid|
• Use a secure protocol for the size of the set intersection to evaluate candidate sets

Cardinality of Set Intersection
• Use a secure commutative hash function: Pohlig-Hellman encryption
• Each party generates its own encryption key
• All parties encrypt all the input sets
– Commutativity: E_1(E_2(...E_k(X)...)) = E_l(E_i(...E_j(X)...)) for any ordering of the keys
• Result: the number of objects common to all sets

Cardinality of Set Intersection
• Hashing
– All parties hash all sets with their key
• Initial intersection
– Each party finds the intersection of all sets (except its own)
• Final intersection
– Parties exchange the final intersection set and compute the intersection of all sets

Computing Size of Intersection
(Figure: three parties holding X = {α, λ, σ, β}, Y = {λ, σ, φ, υ, β}, and Z = {α, β, κ, λ, γ} pass the sets around a ring, each adding its own encryption layer; comparing the fully encrypted sets yields X ∩ Y ∩ Z = {λ, β}.)

Proof of Security
• Proof by simulation
• What is known:
– The size of the intersection set
– Site i also learns the sizes |⋂_{p∈C} S_p| of intersections of subsets C of the sets S_0, ..., S_{k−1}, for |C| ≥ 2
• How it can be simulated:
– The protocol is symmetric, so simulating the view of one party is sufficient

Proof of Security
• Hashing
– Party i receives an encrypted set from party i−1
– This can be simulated with random numbers
• Intersection
– Party i receives the fully hashed sets of all parties

Simulating Fully Encrypted Sets
• Example: |A∩B∩C| = 2, |A∩B| = 3, |A∩C| = 4, |B∩C| = 2, |A| = 6, |B| = 7, |C| = 8
– In all three: 2; A∩B only: 3 − 2 = 1; A∩C only: 4 − 2 = 2; B∩C only: 2 − 2 = 0
– A only: 6 − 2 − 1 − 2 = 1; B only: 7 − 2 − 1 − 0 = 4; C only: 8 − 2 − 2 − 0 = 4
– Fill each region with fresh random values R1 ... R14:
  A: R1 R2 R3 R4 R5 R6
  B: R1 R2 R3 R7 R8 R9 R10
  C: R1 R2 R4 R5 R11 R12 R13 R14

Optimized Version: Association Rule Mining (Revisited)
• Naïve algorithm: simply use APRIORI
– A single set intersection determines the frequency of a single candidate itemset
– But there are thousands of itemsets
• Key intuition
– The set intersection algorithm also allows computation of intermediate sets
– All parties get the fully encrypted sets for all attributes
– Local computation then allows efficient discovery of all association rules

Communication Cost
• k parties, set size m, p frequent attributes
– k·(2k − 2) = O(k²) messages
– p·(2p − 2)·m·(encrypted message size) = O(p²m) bits
– k rounds
• Independent of the number of itemsets found

Other Results
• ID3 decision tree learning
– Horizontal partitioning: Lindell & Pinkas '00
– Also vertical partitioning (Du; Vaidya)
• Association rules
– Horizontal partitioning: Kantarcıoğlu
• K-Means / EM clustering
• K-Nearest Neighbor
• Naïve Bayes, Bayes network structure
• And many more

Challenges
• What do the results reveal?
• A general approach (instead of one per data mining technique)
• Experimental results
• Incentive compatibility
• Note: upcoming book in the Advances in Information Security series by Springer-Verlag

Questions