Privacy preserving data mining
2016 Japan-America Frontiers of Engineering Symposium
Hiromi Arai, The University of Tokyo

Privacy threats in big data
Companies learn your secret!

Big data and privacy risk
[Figure: collection of personal data for many purposes. Services, knowledge, …: epidemic research using databases, personalized medical care. Privacy threats: personalized advertisement.]

Privacy preserving data mining
• Privacy and utility are in a trade-off
• Privacy preserving data mining (PPDM) deals with data utilization while technically preserving the privacy of the input
[Figure: axes Utility and Privacy; labels: "Utility OR privacy", "Utility AND privacy" (PPDM)]

Reduce privacy risks
[Figure: data collection and sharing for epidemic research using databases and personalized medical care; labels: Control, Data Sharing, Prohibition of Abuse]

Controlling personal data
• Regulation by laws or guidelines
• Agreement-based data collection
• Technical approach

Privacy risk?
Driver license without photo and name? Records/logs without identifiers? Statistics?

Example of privacy risks
• Link attack on anonymized microrecords
• The AOL search data leak
• Privacy breach from research papers

Link attack
When direct identifiers (id, names, etc.) are removed…
[Figure: the data owner turns raw data <private> into sanitized data <public> ("I deleted identifiers. OK!"); an inference attack targets the sanitized data]
Re-identification using anonymized data

Link attack [Sweeney02]
Re-identification through common attribute linkage
[Figure: a voter registration list is linked with anonymized medical records: "Candy suffers from obesity.."]

The AOL search data leak
A certain user was identified from their anonymized queries

Privacy breach from research papers
Participants' diseases can be inferred from a cohort study report [Homer+2006]
[Figure: a GWAS study is published; the attacker infers the state of his or her health]

Privacy risk?
Privacy that is implicitly contained in certain data might be inferred by the attacker!
Driver license without photo and name? Records/logs without identifiers? Statistics?

Deal with privacy risks
New technology is a double-edged sword
Without safety devices, signals, traffic rules, …
With these, we enjoy driving with very small risk!
We need safety devices for big data
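The link attack from the slides above can be sketched in a few lines. This is only an illustration under assumptions: the records are invented, and the choice of ZIP code, birth date and sex as the common attributes is hypothetical (the slides only say "common attribute linkage"). The function name `link_attack` is also made up for this sketch.

```python
def link_attack(anonymized, public, keys):
    """Re-identify anonymized records whose common attributes
    match exactly one record in a public list."""
    index = {}
    for person in public:
        index.setdefault(tuple(person[k] for k in keys), []).append(person["name"])
    matches = []
    for record in anonymized:
        names = index.get(tuple(record[k] for k in keys), [])
        if len(names) == 1:  # a unique match re-identifies the record
            matches.append((names[0], record["diagnosis"]))
    return matches


# Hypothetical data: direct identifiers were removed from the medical records,
# but the common attributes are still present.
KEYS = ("zip", "birth", "sex")
medical_records = [
    {"zip": "100-0001", "birth": "1980-04-02", "sex": "F", "diagnosis": "obesity"},
    {"zip": "100-0002", "birth": "1975-11-30", "sex": "M", "diagnosis": "asthma"},
]
voter_list = [
    {"name": "Candy", "zip": "100-0001", "birth": "1980-04-02", "sex": "F"},
    {"name": "Bob", "zip": "100-0003", "birth": "1975-11-30", "sex": "M"},
]
reidentified = link_attack(medical_records, voter_list, KEYS)
```

Only the first medical record has a unique counterpart in the voter list, so only that record is linked back to a name. Removing names and ids alone does not prevent this.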
Controlling your data
[Figure: data auditing tells whether data are "Highly private!" or "Not harmful!"; with secure multiparty computation, the partner and the third party learn nothing]

Data auditing
• Technique for quantifying the privacy of certain data so that the data can be handled appropriately
– Modeling of privacy threats
– Development of quantification methods
[Figure: "I posted my photo on SNS!"; the insurance company infers "He suffers from diabetes…!" and raises his insurance fee]

Privacy in genetic test results
"Can I publish my disease risks?"
[Figure: the client's genome x is fed to the genetic test algorithm f(x), which outputs disease risks r; the adversary infers the value of x from r]
• Consider genetic tests based on personal SNP genotypes
• Can an adversary infer the client's SNP genotypes from his/her disease risks?

Example of disease risk disclosure
• Clients carelessly disclose their disease risks
– The risks do not contain genetic information explicitly
– The diseases are not sensitive
"I posted my genetic test results on Facebook! I have a high risk of freckles, so I should take care of my skin…! haha!"

SNP genotype
[Figure: DNA sequences with two alleles at each SNP locus: T G T T T A A C. Extract risk factors:]
SNP genotype:       TT  GA  TA  TC
Risk allele:        T   A   C   T
# of risk alleles:  2   1   0   1
• Single nucleotide polymorphisms (SNPs) are the most relevant genetic variants among individuals
• Variants that are positively associated with a certain disease (disease factors) are called "risk alleles"

Disease risks
• The risk of having a certain disease, relative to a certain ethnic group
• The relative risk of the disease is multiplied over all associated SNPs
[Figure: personal genome information i: T G T T T A A C]
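The risk-factor extraction and the multiplicative risk on the slides above can be written as a short sketch. The genotypes and risk alleles are the ones from the SNP genotype slide; the per-allele odds values and the function names are hypothetical, chosen only for illustration.

```python
from math import prod


def count_risk_alleles(genotypes, risk_alleles):
    """Count how many copies (0, 1 or 2) of the risk allele each SNP genotype carries."""
    return [genotype.count(allele) for genotype, allele in zip(genotypes, risk_alleles)]


def disease_risk(counts, odds):
    """Multiply the per-SNP risks, assuming each risk allele copy multiplies the risk by its odds."""
    return prod(a ** x for a, x in zip(odds, counts))


genotypes = ["TT", "GA", "TA", "TC"]   # from the SNP genotype slide
risk_alleles = ["T", "A", "C", "T"]    # from the SNP genotype slide
counts = count_risk_alleles(genotypes, risk_alleles)

odds = [2.0, 1.5, 3.0, 1.25]           # hypothetical odds per risk allele
risk = disease_risk(counts, odds)
```

With these inputs `counts` reproduces the "# of risk alleles" row of the slide. The third SNP carries no risk allele, so it contributes a factor of 1 to the product.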
# of risk alleles x:                    2      1      0      1
Odds of disease with each risk allele:  a_1    a_2    a_3    a_4
Risk of disease:                        r_1 = a_1^2, r_2 = a_2, r_3 = 1, r_4 = a_4 (used for risk calculation)
$$r = \prod_{i \in L} r_i$$

Adversary model
• The adversary infers the risk allele numbers at n SNPs from the risks of k diseases and the parameters
– the risks are rounded by parameter b
– if the risk is 1.5 and b = 0.5, the rounded risk is [1.5, 2.0]
Reverse-engineering: the adversary knows the disease risks and parameters, and infers the genetic parameter
$$r_k = \mathrm{round}\left(\prod_{i \in L} a_i^{x_i} / a,\ b\right)$$
where b is the rounding parameter.

Privacy and utility are in a trade-off
Privacy in a certain set of clients and diseases (174, 96 and 176 clients for the CEU, JPT and YRI datasets; 56 diseases and 119 SNPs in total)
[Figure: success rate and full disclosure rate versus the rounding parameter b (0 to 1) for CEU, JPT and YRI. Privacy breach by the reverse engineering method changes with the rounding parameter. For small b, most of the risk alleles are inferred correctly and the disease risk is useful. For large b, privacy is preserved but utility is small and the disease risk is meaningless.]

Sensitive disease risk
The risk of a certain cancer was inferred from the risk of freckles using properties of the human genome
[Figure: cancer; SNP rs1805007-T; background knowledge]

Secure multiparty computation
• Technique to obtain the value of a function while hiding private information in the inputs
• Uses cryptographic techniques
– Alice and Bob exchange random values (ciphers) with each other
• Joint analysis / DB querying can be done without sharing their data!
Input: x_A, x_B
Output: f(x_A, x_B) = (y_A, y_B)
[Figure: Alice sends x_A and Bob sends x_B into the MPC; Alice receives y_A and Bob receives y_B]

Privacy preserving database search
• The researcher wants to query the DB
• Both the researcher and the DB want to hide their inputs
[Figure: the researcher asks "Similar compound?"; the DB answers Yes or No]
Kana Shimizu, Koji Nuida, Hiromi Arai, et al.: Privacy-Preserving Search for Chemical Compound Databases, BMC Bioinformatics, to appear.

Privacy preserving database search
The researcher and the DB jointly execute a cryptographic protocol based on an additive homomorphic cryptosystem
[Figure: the researcher encrypts the input data (01001010) into a cipher (…3248905...) and sends it to the DB; the DB performs cryptographic matching and returns cipher results; the researcher decrypts them to learn Yes or No for "Similar compound?"]
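Going back to the adversary model in this part, the reverse-engineering step can be illustrated by brute force. This is a sketch under assumptions, not the method used in the talk: it handles a single disease, drops the normalizer a from the risk formula, treats rounding as reporting the interval of width b that contains the risk (as in the example with risk 1.5 and b = 0.5), and uses hypothetical odds values.

```python
from fractions import Fraction
from itertools import product
from math import floor, prod


def rounded_risk(x, odds, b):
    """Return the interval [lo, lo + b) that contains the risk prod(a_i ** x_i)."""
    risk = prod(a ** xi for a, xi in zip(odds, x))
    lo = floor(risk / b) * b
    return (lo, lo + b)


def consistent_genotypes(observed, odds, b):
    """Enumerate every risk allele vector x in {0, 1, 2}^n whose rounded risk equals the observed one."""
    return [x for x in product((0, 1, 2), repeat=len(odds))
            if rounded_risk(x, odds, b) == observed]


# Hypothetical odds per risk allele; exact fractions avoid floating-point ties at interval borders.
ODDS = [Fraction(3, 2), Fraction(5, 4), Fraction(2), Fraction(11, 10)]
TRUE_X = (2, 1, 0, 1)  # the client's risk allele counts, unknown to the adversary

fine_b, coarse_b = Fraction(1, 100), Fraction(1)
fine = consistent_genotypes(rounded_risk(TRUE_X, ODDS, fine_b), ODDS, fine_b)
coarse = consistent_genotypes(rounded_risk(TRUE_X, ODDS, coarse_b), ODDS, coarse_b)
```

Comparing `len(fine)` with `len(coarse)` shows the trade-off: with a coarser rounding parameter the set of genotypes consistent with the published risk can only grow, so the adversary is less certain, while the published risk is less informative for the client.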
Similarity metric for compounds
• Fingerprint
– Bit vector representing the features of a compound. ex) p = (0,1,1,1,0), q = (1,1,0,1,0)
• Tversky index
– Similarity measure of two bit vectors. ex) Tanimoto index (α = β = 1), Dice index (α = β = 0.5)
$$S_{\mathrm{Tversky}} = \frac{|p \cap q|}{|p \cap q| + \alpha |p \setminus q| + \beta |q \setminus p|}$$
$$S_{\mathrm{Tanimoto}} = \frac{|p \cap q|}{|p| + |q| - |p \cap q|}$$
[Figure: two chemical structures with their fingerprints 0 1 1 1 0 and 1 1 0 1 0]

Similarity search protocol
Novel protocol for comparing two bit vectors.
• The protocol can detect whether the Tversky index of two given bit vectors (p, q) is larger than a threshold θ or not, without leaking p and q to each other.
[Figure: the DB holds p = (1,1,1,0,0,0,1) and the client holds q = (1,0,1,0,0,0,1). The client sends the encrypted q; the DB searches with the encrypted q and returns an encrypted binary result (similar or not). The client learns whether Tversky(p, q) > θ or not.]

Additive homomorphic cryptosystem
An additive operation on the plain text is equivalent to another operation on the cipher text
Lifted ElGamal [Elgamal84], Paillier [Paillier99]
$$\mathrm{Enc}(m_1 + m_2) = \mathrm{Enc}(m_1) \oplus \mathrm{Enc}(m_2)$$
where ⊕ stands for the corresponding operation on cipher texts.

Paillier encryption [Paillier99]
Secret key: $sk = (p, q)$ ← key for Dec.
Public key: $pk = (n, g)$, $n = pq$ ← key for Enc.
Cipher text of m: $\mathrm{Enc}_{pk}(m) = g^{m} r^{n} \bmod n^2$, where $r \in \mathbb{Z}^{*}_{n^2}$ is a randomized value.
$$\mathrm{Enc}_{pk}(m_1) \cdot \mathrm{Enc}_{pk}(m_2) = g^{m_1 + m_2} (r_1 r_2)^{n} \bmod n^2$$
$$\mathrm{Dec}_{sk}\left(\mathrm{Enc}_{pk}(m_1) \cdot \mathrm{Enc}_{pk}(m_2)\right) = m_1 + m_2$$

Main idea
Problem: division in a cryptographic manner comes at a high cost. With the threshold written as $\theta = \theta_n / \theta_d$, the comparison is rewritten without division:
$$\mathrm{Tversky} = \frac{|p \cap q|}{|p \cap q| + \alpha |p \setminus q| + \beta |q \setminus p|} \ge \frac{\theta_n}{\theta_d}$$
$$\Leftrightarrow\ \theta_d |p \cap q| - \theta_n \left(|p \cap q| + \alpha |p \setminus q| + \beta |q \setminus p|\right) \ge 0$$
$$\Leftrightarrow\ \left(\theta_d + \theta_n(\alpha + \beta - 1)\right) |p \cap q| + \theta_n(-\alpha)\,|p| + \theta_n(-\beta)\,|q| \ge 0$$
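The division-free threshold test can be tried out with a toy Paillier implementation. This is a heavily simplified sketch, not the protocol of Shimizu et al.: the key uses small Mersenne primes and g = n + 1, all function names are invented, both parties run in one process, and the client decrypts the whole score, which leaks more than the single bit the real protocol reveals. To clear denominators of α and β, the coefficients are additionally scaled by a common positive integer.

```python
import math
import secrets
from fractions import Fraction

# Toy Paillier key (assumption: g = n + 1; real keys use far larger primes).
PRIME_P, PRIME_Q = 2**17 - 1, 2**19 - 1
N = PRIME_P * PRIME_Q
N2 = N * N
LAM = (PRIME_P - 1) * (PRIME_Q - 1) // math.gcd(PRIME_P - 1, PRIME_Q - 1)
MU = pow(LAM, -1, N)


def encrypt(m):
    """Enc(m) = g^m * r^n mod n^2 with g = n + 1, so g^m = 1 + m*n mod n^2."""
    while True:
        r = secrets.randbelow(N - 1) + 1
        if math.gcd(r, N) == 1:
            break
    return (1 + (m % N) * N) * pow(r, N, N2) % N2


def decrypt(c):
    """Recover m and map values above n/2 back to negative integers."""
    m = (pow(c, LAM, N2) - 1) // N * MU % N
    return m - N if m > N // 2 else m


def add(c1, c2):
    """Multiplying cipher texts adds the plain texts."""
    return c1 * c2 % N2


def scale(c, k):
    """Raising a cipher text to k multiplies the plain text by k."""
    return pow(c, k % N, N2)


def coefficients(theta, alpha, beta):
    """Integer coefficients of |p∩q|, |p| and |q| in the division-free inequality."""
    tn, td = theta.numerator, theta.denominator
    common = alpha.denominator * beta.denominator
    lams = (td + tn * (alpha + beta - 1), -tn * alpha, -tn * beta)
    return tuple(int(lam * common) for lam in lams)


def db_score(p_bits, enc_q, lams):
    """DB side: combine the encrypted query bits without decrypting them."""
    l1, l2, l3 = lams
    enc_inter = enc_qsize = 1  # 1 is a trivial encryption of 0
    for p_j, c_j in zip(p_bits, enc_q):
        enc_qsize = add(enc_qsize, c_j)
        if p_j:
            enc_inter = add(enc_inter, c_j)
    return add(add(scale(enc_inter, l1), scale(enc_qsize, l3)),
               encrypt(l2 * sum(p_bits)))


def is_similar(p_bits, q_bits, theta, alpha=Fraction(1), beta=Fraction(1)):
    """Client encrypts q, DB computes the encrypted score, client checks its sign."""
    lams = coefficients(theta, alpha, beta)
    enc_q = [encrypt(bit) for bit in q_bits]
    return decrypt(db_score(p_bits, enc_q, lams)) >= 0


p_bits = (1, 1, 1, 0, 0, 0, 1)
q_bits = (1, 0, 1, 0, 0, 0, 1)
```

With the bit vectors from the slide, the Tanimoto index is 3/4, so `is_similar(p_bits, q_bits, Fraction(7, 10))` holds while a threshold of 4/5 does not. The DB only ever handles cipher texts of q, and only integer additions and scalar multiplications are needed, which is the point of removing the division.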
[Figure labels under the three terms of the final inequality: "Both DB and query", "DB", "query"]

Computational efficiency of PPDS
Process time for 1 query with Intel Xeon 2.9GHz

Method        Time required   Executable fingerprint length   # of communications
PPDS          short           enough                          twice
SFE (Yao86)   long            insufficient                    >>2

Notes from the slide:
• SFE > 10 min
• SFE can't be executed with fingerprint > 200
• DB 10 sec, User > 10 sec

Conclusion
PPDM encourages the use of private data!
[Figure: personal data; controlling the sharing of our data; benefits with small privacy risks]