Privacy preserving data mining
2016 Japan-America Frontiers of Engineering Symposium
Hiromi Arai
The University of Tokyo
Privacy threats in big data
Companies learn your secret!
Big data and privacy risk
Collection of personal data for many purposes: epidemic research using databases, personalized medicare, personalized advertisement, …
Services, Knowledge, …
Privacy Threats
Privacy preserving data mining
•  Privacy and utility are in a trade-off
•  Privacy preserving data mining (PPDM) deals with data utilization while technically preserving the privacy of the input
[Figure: Utility versus Privacy. Without PPDM: utility OR privacy. With PPDM: utility AND privacy.]
Reduce privacy risks
[Diagram: collection and sharing of personal data for epidemic research using databases and personalized medicare; safeguards: Control, Data Sharing, Prohibition of Abuse]
Controlling personal data
[Diagram: epidemic research using databases, personalized medicare; Control, Data Sharing, Prohibition of Abuse]
•  Regulation by laws or guidelines
•  Agreement-based data collection
•  Technical approach
Privacy risk?
Driver license without photo and name?
Records/logs without identifiers?
Statistics?
Example of privacy risks
•  Link attack on anonymized microrecords
•  The AOL search data leak
•  Privacy breach from research papers
Link attack
When direct identifiers (id, names, etc.) are removed…
Data owner: "I deleted identifiers. OK!"
<private> raw data → sanitized data <public>
Inference attack
Re-identification using anonymized data
Link attack [Sweeney02]: re-identification through common attribute linkage
voter registration list + anonymized medical records → "Candy suffers from obesity.."
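A minimal sketch of the linkage step, assuming the two tables share quasi-identifiers such as ZIP code, birth year and sex. All records, field names and the helper `link` are made up for illustration; they are not data from the talk.

```python
# Toy link attack: join a public list with "anonymized" records on
# attributes that both tables share. All records below are hypothetical.
voter_list = [
    {"name": "Candy", "zip": "02138", "birth_year": 1985, "sex": "F"},
    {"name": "Bob", "zip": "02139", "birth_year": 1985, "sex": "M"},
]
medical_records = [  # direct identifiers (id, name) already removed
    {"zip": "02138", "birth_year": 1985, "sex": "F", "diagnosis": "obesity"},
    {"zip": "02139", "birth_year": 1990, "sex": "M", "diagnosis": "asthma"},
]
QUASI_IDENTIFIERS = ("zip", "birth_year", "sex")  # assumed common attributes


def link(public, anonymized, keys=QUASI_IDENTIFIERS):
    """Return (name, diagnosis) for records whose quasi-identifiers match exactly one person."""
    results = []
    for rec in anonymized:
        matches = [p for p in public if all(p[k] == rec[k] for k in keys)]
        if len(matches) == 1:  # a unique match re-identifies the record
            results.append((matches[0]["name"], rec["diagnosis"]))
    return results


reidentified = link(voter_list, medical_records)
print(reidentified)
```

Removing names and ids does not help here, because the combination of the remaining attributes is still unique for one person.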
The AOL search data leak
A certain user was identified from their anonymized queries
Privacy breach from research papers
Participants' diseases can be inferred from a cohort study report [Homer+2006]
GWAS study → publish → infer the state of his or her health
Privacy risk?
Private information that is implicitly contained in certain data might be inferred by the attacker!
Driver license without photo and name?
Records/logs without identifiers?
Statistics?
Deal with privacy risks
New technology is a double-edged sword
Without safety devices, signals, traffic rules, …
With these, we enjoy driving with very small risk!
We need safety devices for big data
Controlling your data
Data auditing: "Highly private!" / "Not harmful!"
Secure multiparty computation: Partner and Third Party learn nothing
Data Auditing
•  Technique for quantifying the privacy of certain data so that the data can be handled appropriately
–  Modeling of privacy threats
–  Development of quantification methods
Example: "I posted my photo on SNS!" → Insurance company: "He suffers from diabetes…! Raise his insurance fee."
Privacy in genetic test results
"Can I publish my disease risks?"
Client's genome x → genetic test algorithm f(x) → disease risks r
Inference: infer the value of x from r
•  Consider genetic tests based on personal SNP genotypes
•  Can an adversary infer the client's SNP genotypes from his/her disease risks?
Example of disease risk disclosure
•  Clients carelessly disclose their disease risks
–  They do not contain genetic information explicitly
–  They are not sensitive diseases
"I posted my genetic test results on Facebook! I have a high risk of freckles, so I should take care of my skin…! haha!"
SNP genotype
DNA sequences (alleles at SNP loci; two alleles at one SNP):
T G T T
T A A C
Extract risk factors:
SNP genotype        TT  GA  TA  TC
Risk allele         T   A   C   T
# of risk alleles   2   1   0   1
•  Single nucleotide polymorphisms (SNPs) are the most relevant of genetic variants among individuals
•  Variants that are positively associated with a certain disease (disease factors) are called "risk alleles"
Disease risks
•  The relative risk of having a certain disease within a certain ethnic group
•  The relative risk of the disease is multiplied over all associated SNPs
Personal genome information at SNP i:
T G T T
T A A C
# of risk alleles x_i:                     2    1    0    1
Odds of disease with each risk allele:     a_1  a_2  a_3  a_4
Risk of disease r_i = a_i^{x_i}:           r_1 = a_1^2, r_2 = a_2, r_3 = 1, r_4 = a_4
Over the set L of SNPs used for risk calculation:
$r = \prod_{i \in L} r_i$
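The risk calculation above can be sketched in a few lines. The genotypes and risk alleles are the ones shown on the slide; the odds values `a_i` are hypothetical numbers chosen only to make the example concrete, and the function names are mine.

```python
from math import prod


def count_risk_alleles(genotypes, risk_alleles):
    """x_i = number of copies (0, 1 or 2) of the risk allele in each SNP genotype."""
    return [g.count(a) for g, a in zip(genotypes, risk_alleles)]


def disease_risk(x, odds):
    """r = product over SNPs of a_i ** x_i."""
    return prod(a ** xi for a, xi in zip(odds, x))


genotypes = ["TT", "GA", "TA", "TC"]      # from the slide
risk_alleles = ["T", "A", "C", "T"]       # from the slide
x = count_risk_alleles(genotypes, risk_alleles)

odds = [1.2, 1.5, 2.0, 1.1]               # hypothetical a_1..a_4
r = disease_risk(x, odds)
print(x, r)
```

With these genotypes the counts come out as 2, 1, 0, 1, so the third SNP contributes a factor of 1 regardless of its odds.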
Adversary model
•  Adversary infers risk allele numbers at n SNPs from risks of k diseases and parameters
–  the risks are rounded by parameter b
–  if the risk is 1.5 and b = 0.5, the rounded risk is [1.5, 2.0]
Reverse-engineering: the adversary knows the disease risks and parameters, and infers the genetic parameter x_i
$r_k = \mathrm{round}\left( \prod_{i \in L} a_i^{x_i} / a,\ b \right)$, where b is the rounding parameter
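A brute-force sketch of this reverse-engineering, under simplifying assumptions that are mine rather than the talk's: a single disease instead of k, no normalizing divisor, hypothetical odds, and rounding modeled as reporting which width-b interval contains the risk. The adversary enumerates every genotype vector in {0, 1, 2}^n and keeps those whose rounded risk matches the published one.

```python
from itertools import product
from math import floor, prod


def risk(x, odds):
    """Unrounded risk: product of a_i ** x_i."""
    return prod(a ** xi for a, xi in zip(odds, x))


def bucket(r, b):
    """Index k of the width-b interval [k*b, (k+1)*b) that contains r."""
    return floor(r / b)


def consistent_genotypes(published_bucket, odds, b):
    """All x in {0,1,2}^n whose rounded risk equals the published one."""
    return [
        x
        for x in product(range(3), repeat=len(odds))
        if bucket(risk(x, odds), b) == published_bucket
    ]


odds = [1.2, 1.5, 2.0, 1.1]   # hypothetical public parameters
true_x = (2, 1, 0, 1)         # the client's secret risk allele counts

# Fine rounding leaves few candidates; coarse rounding leaves many.
fine = consistent_genotypes(bucket(risk(true_x, odds), 0.001), odds, 0.001)
coarse = consistent_genotypes(bucket(risk(true_x, odds), 1.0), odds, 1.0)
print(len(fine), len(coarse))
```

The fewer candidates survive, the more the published risk reveals about the genotype; enlarging b widens the candidate set at the price of a less informative risk value.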
Privacy and utility are in a trade-off
Privacy in a certain set of the clients and diseases (174, 96 and 176 for the CEU, JPT and YRI datasets, 56 diseases and 119 SNPs in total)
[Figure: full disclosure rate and success rate of the reverse-engineering method as functions of the rounding parameter b (0 to 1), for the CEU, JPT and YRI datasets]
•  Small b: most of the risk alleles are inferred correctly (privacy breach), while the disease risk is useful
•  Large b: privacy preservation, but the disease risk is meaningless (small utility)
Sensitive disease risk
The risk of a certain cancer was inferred from the risk of freckles using properties of the human genome
freckles → SNP rs1805007-T (background knowledge) → cancer
Secure multiparty computation
•  Technique to obtain function values while hiding the private information in the inputs
•  Using cryptographic techniques
–  Alice and Bob exchange random values (ciphers) with each other
•  Joint analysis / DB querying can be done without sharing their data!
Input: x_A, x_B
Output: f(x_A, x_B) = (y_A, y_B)
Alice: x_A → MPC → y_A
Bob: x_B → MPC → y_B
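To make the "exchange random values" idea concrete, here is a minimal sketch using additive secret sharing, one standard building block of multiparty computation. It is not necessarily the technique used in the talk. The function f here is a sum whose result is itself split into shares (y_A, y_B); the modulus and function names are assumptions for illustration.

```python
import secrets

MODULUS = 2**61 - 1  # hypothetical share modulus


def share(value, modulus=MODULUS):
    """Split value into two random-looking shares that add up to value mod modulus."""
    r = secrets.randbelow(modulus)
    return r, (value - r) % modulus


def shared_sum(x_a, x_b):
    """Each party keeps one share of its input and sends the other share away."""
    a_keep, a_send = share(x_a)          # Alice keeps a_keep, sends a_send to Bob
    b_keep, b_send = share(x_b)          # Bob keeps b_keep, sends b_send to Alice
    y_a = (a_keep + b_send) % MODULUS    # computed locally by Alice
    y_b = (b_keep + a_send) % MODULUS    # computed locally by Bob
    return y_a, y_b


y_a, y_b = shared_sum(42, 100)
total = (y_a + y_b) % MODULUS  # only combining both outputs reveals the sum
print(total)
```

Each message a party receives is a uniformly random value on its own, so neither y_A nor y_B alone says anything about the other party's input.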
Privacy preserving database search
•  The researcher wants to query the DB
•  Both the researcher and the DB want to hide their inputs
researcher: "Similar compound?" → DB: "Yes or No"
Kana Shimizu, Koji Nuida, Hiromi Arai, et al.: Privacy-Preserving Search for Chemical Compound Databases, BMC Bioinformatics, to appear.
Privacy preserving database search
Both the researcher and the DB jointly execute a cryptographic protocol based on an additive homomorphic cryptosystem
researcher: input data 01001010 → encryption → cipher …3248905... → DB: cryptographic matching ("Similar compound?") → cipher (results) → researcher: decryption → Yes or No
Similarity metric for compounds
• Fingerprint
– Bit vector representing features of a chemical compound.
ex) p = (0,1,1,1,0), q = (1,1,0,1,0)
• Tversky index
– Similarity measure of two bit vectors.
ex: Tanimoto index (α=β=1), Dice index (α=β=0.5)
$S_{\mathrm{Tversky}}(p, q) = \frac{|p \cap q|}{|p \cap q| + \alpha |p \setminus q| + \beta |q \setminus p|}$
$S_{\mathrm{Tanimoto}}(p, q) = \frac{|p \cap q|}{|p| + |q| - |p \cap q|}$
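As a plaintext reference point, the Tversky index |p ∩ q| / (|p ∩ q| + α|p \ q| + β|q \ p|) can be computed directly on the example fingerprints p = (0,1,1,1,0) and q = (1,1,0,1,0). This sketch involves no cryptography; the function name is mine.

```python
def tversky(p, q, alpha, beta):
    """Tversky index of two equal-length bit vectors."""
    both = sum(1 for a, b in zip(p, q) if a and b)        # |p ∩ q|
    only_p = sum(1 for a, b in zip(p, q) if a and not b)  # |p \ q|
    only_q = sum(1 for a, b in zip(p, q) if b and not a)  # |q \ p|
    return both / (both + alpha * only_p + beta * only_q)


p = (0, 1, 1, 1, 0)
q = (1, 1, 0, 1, 0)

tanimoto = tversky(p, q, 1, 1)      # 2 / (2 + 1 + 1)
dice = tversky(p, q, 0.5, 0.5)      # 2 / (2 + 0.5 + 0.5)
print(tanimoto, dice)
```

The function divides by zero when both vectors are all zeros; a real implementation would need to decide how to treat that case.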
Similarity search protocol
Novel protocol for comparing two bit vectors.
• The protocol can detect whether the Tversky index of two given bit vectors (p, q) is larger than a threshold θ or not, without leaking p and q to each other.
Client: q = (1,0,1,0,0,0,1) → encrypted q → DB: p = (1,1,1,0,0,0,1), search with encrypted q → encrypted binary result (similar or not) → the client knows whether Tversky(p, q) > θ or not.
Additive homomorphic cryptosystem
Additive op. on the plaintext is equivalent to another op. on the ciphertext:
$\mathrm{Enc}(m_1 + m_2) = \mathrm{Enc}(m_1) \otimes \mathrm{Enc}(m_2)$
Lifted ElGamal [Elgamal84], Paillier [Paillier99]
Paillier encryption [Paillier99]
Secret key: $sk = (p, q)$ ← key for Dec.
Public key: $pk = (n, g)$, $n = pq$ ← key for Enc.
Ciphertext of m: $\mathrm{Enc}_{pk}(m) = g^{m} r^{n} \bmod n^{2}$, where $r \in \mathbb{Z}^{*}_{n^{2}}$ is a randomized value.
$\mathrm{Enc}_{pk}(m_1) \cdot \mathrm{Enc}_{pk}(m_2) = g^{m_1 + m_2} (r_1 r_2)^{n} \bmod n^{2}$
$\mathrm{Dec}_{sk}(\mathrm{Enc}_{pk}(m_1) \cdot \mathrm{Enc}_{pk}(m_2)) = m_1 + m_2$
Main Idea
$\mathrm{Tversky} = \frac{|p \cap q|}{|p \cap q| + \alpha |p \setminus q| + \beta |q \setminus p|} \ge \frac{\theta_n}{\theta_d}$
Problem: division calculation in a cryptographic manner is at high cost.
$\Leftrightarrow \theta_d |p \cap q| - \theta_n (|p \cap q| + \alpha |p \setminus q| + \beta |q \setminus p|) \ge 0$
$\Leftrightarrow (\theta_d + \theta_n(\alpha + \beta - 1)) |p \cap q| + \theta_n(-\alpha) |p| + \theta_n(-\beta) |q| \ge 0$
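The following toy sketch puts the two ideas together: Paillier-style additive homomorphism and a division-free threshold test, here for the Tanimoto case (α = β = 1), where the test becomes (θ_d + θ_n)|p ∩ q| − θ_n|p| − θ_n|q| ≥ 0 for bit vectors p (DB) and q (query). Several things are assumptions of this sketch and not the published protocol: the tiny hard-coded primes (insecure), the choice g = n + 1, the function names, and the fact that the client decrypts the full score, whereas the protocol described in the talk returns only a binary result.

```python
import math
import secrets


def keygen(p=1009, q=1013):
    """Toy key generation; these primes are far too small for real use."""
    n = p * q
    phi = (p - 1) * (q - 1)
    g = n + 1                    # a common simple choice of g
    mu = pow(phi, -1, n)         # modular inverse (Python 3.8+)
    return (n, g), (phi, mu)


def encrypt(pk, m):
    """Enc(m) = g^m * r^n mod n^2 with a fresh random r coprime to n."""
    n, g = pk
    n2 = n * n
    r = secrets.randbelow(n - 1) + 1
    while math.gcd(r, n) != 1:
        r = secrets.randbelow(n - 1) + 1
    return pow(g, m % n, n2) * pow(r, n, n2) % n2


def decrypt(pk, sk, c):
    """Recover m and map it to a signed value in (-n/2, n/2]."""
    n, _ = pk
    phi, mu = sk
    m = (pow(c, phi, n * n) - 1) // n * mu % n
    return m - n if m > n // 2 else m


def encrypted_score(pk, enc_q, p_bits, theta_n, theta_d):
    """DB side: compute Enc((θd+θn)|p∩q| − θn|p| − θn|q|) from encrypted query bits."""
    n, _ = pk
    n2 = n * n
    enc_inter, enc_qsize = 1, 1          # 1 is a valid encryption of 0
    for c, bit in zip(enc_q, p_bits):
        enc_qsize = enc_qsize * c % n2   # adds q_j to |q|
        if bit:
            enc_inter = enc_inter * c % n2   # adds q_j to |p ∩ q|
    score = pow(enc_inter, theta_d + theta_n, n2)
    score = score * pow(enc_qsize, (-theta_n) % n, n2) % n2
    return score * encrypt(pk, -theta_n * sum(p_bits)) % n2


pk, sk = keygen()
p_bits = (1, 1, 1, 0, 0, 0, 1)           # DB fingerprint
q_bits = (1, 0, 1, 0, 0, 0, 1)           # client's query fingerprint
enc_q = [encrypt(pk, b) for b in q_bits]  # client sends only ciphertexts

s = decrypt(pk, sk, encrypted_score(pk, enc_q, p_bits, 7, 10))  # θ = 7/10
print("similar" if s >= 0 else "not similar")
```

For these vectors |p ∩ q| = 3, |p| = 4 and |q| = 3, so the Tanimoto index is 3/4 and the score for θ = 7/10 is 17·3 − 7·4 − 7·3 = 2, which is non-negative. The DB only ever multiplies and exponentiates ciphertexts, so it never sees q.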
Terms of the inequality: |p ∩ q| depends on both DB and query; |p| depends on the DB; |q| depends on the query.
Computational efficiency of PPDS
Process time for 1 query with Intel Xeon 2.9 GHz

              Time required                        Executable fingerprint length                                 # of communications
PPDS          short (DB 10 sec, User >10 sec)      enough                                                        twice
SFE (Yao86)   long (SFE > 10 min)                  insufficient (SFE can't be executed with fingerprint > 200)   >> 2
Conclusion
PPDM encourages the use of private data!
Personal data → controlling sharing of our data → benefits with small privacy risks