Privacy Preserving Data Mining
Lecture 3
Non-Cryptographic Approaches for
Preserving Privacy
(Based on Slides of Kobbi Nissim)
Benny Pinkas
HP Labs, Israel
March 3, 2005
10th Estonian Winter School in Computer Science
Why not use cryptographic methods?
• Many users contribute data. Cannot require them to participate in a cryptographic protocol.
  – In particular, cannot require p2p communication between users.
• Cryptographic protocols incur considerable overhead.
Data Privacy
[Diagram: a data access mechanism sits between the database d and the users, who may try to breach privacy.]
An Easy, Tempting, but Bad Solution
• Idea: (a) remove identifying information (name, SSN, …); (b) publish the data
[Figure: a table of records for Mr. Brown, Ms. John, Mr. Doe with the names removed.]
• But 'harmless' attributes uniquely identify many patients (gender, age, approx. weight, ethnicity, marital status…)
• Recall: DOB + gender + zip code identify people whp.
• Worse: 'rare' attributes (e.g. a disease with prob. ≈ 1/3000)
What is Privacy?
• Something should not be computable from query answers
  – E.g. π_Joe = {Joe's private data}
  – The definition should take into account the adversary's power (computational, # of queries, prior knowledge, …)
• Quite often it is much easier to say what is surely non-private
  – Intuition: privacy is breached if it is possible to compute someone's private information from his identity
  – E.g. strong breaking of privacy: the adversary is able to retrieve (almost) everybody's private data
The Data Privacy Game: an Information-Privacy Tradeoff
• Private functions: π
  – want to hide π_x(DB) = d_x
• Information functions: f
  – want to reveal f(q, DB) for queries q
• Here: explicit definition of private functions.
  – The question: which information functions may be allowed?
• Different from crypto (secure function evaluation):
  – There, want to reveal f() (explicit definition of the information function)
  – Want to hide all functions π() not computable from f()
  – Implicit definition of private functions
  – The question whether f() should be revealed is not asked
A Simplistic Model: Statistical Database (SDB)
• Database: d ∈ {0,1}^n, one bit per person (Mr. Fox 0/1, Ms. John 0/1, …, Mr. Doe 0/1)
• Query: a subset q ⊆ [n]
• Answer: a_q = Σ_{i∈q} d_i
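To make the model concrete, here is a minimal Python sketch of such a statistical database (class and variable names are ours, not from the lecture):

    import numpy as np

    # Toy statistical database: n private bits, subset-sum queries.
    class SDB:
        def __init__(self, d):
            self.d = np.asarray(d)           # d in {0,1}^n

        def answer(self, q):
            """Exact answer a_q = sum_{i in q} d_i for a subset q of [n]."""
            return int(self.d[list(q)].sum())

    db = SDB([0, 1, 1, 0, 1])
    print(db.answer({1, 2, 4}))              # -> 3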
Approaches to SDB Privacy
• Studied extensively since the 70s
• Perturbation
  – Add randomness. Give 'noisy' or 'approximate' answers
  – Techniques:
    • Data perturbation (perturb the data, then answer queries as usual) [Reiss 84, Liew Choi Liew 85, Traub Yemini Wozniakowski 84] …
    • Output perturbation (perturb the answers to queries) [Denning 80, Beck 80, Achugbue Chin 79, Fellegi Phillips 74] …
  – Recent interest: [Agrawal, Srikant 00], [Agrawal, Aggarwal 01], …
• Query Restriction
  – Answer queries accurately but sometimes disallow queries
  – Require queries to obey some structure [Dobkin Jones Lipton 79]
    • Restricts the number of queries
  – Auditing [Chin Ozsoyoglu 82, Kleinberg Papadimitriou Raghavan 01]
Some Recent Privacy Definitions
X – data, Y – (noisy) observation of X
• [Agrawal, Srikant '00] Interval of confidence
  – Let Y = X + noise (e.g. uniform noise in [-100,100])
  – Perturb the input data. One can still estimate the underlying distribution.
  – Tradeoff: more noise → less accuracy but more privacy.
  – Intuition: a large possible interval → privacy preserved
    • Given Y, we know that with c% confidence X is in [a1,a2]. For example, for Y=200, with 50% confidence X is in [150,250].
    • a2-a1 defines the amount of privacy at c% confidence
  – Problem: there might be some a-priori information about X
    • E.g. X = someone's age & Y = -97
The [AS] scheme can be turned against itself
• Assume that N (the amount of data) is large
  – Even if the data-miner doesn't have a-priori information about X, it can estimate it given the randomized data Y.
    • Suppose the perturbation is uniform in [-1,1]
    • [AS]: privacy interval of 2 with confidence 100%
    • Let f_X put 50% of its mass on x ∈ [0,1] and 50% on x ∈ [4,5].
    • But after learning f_X, the value of X can be easily localized within an interval of size at most 1.
  – Problem: aggregate information provides information that can be used to attack individual data
Some Recent Privacy Definitions
X – data, Y – (noisy) observation of X
• [Agrawal, Aggarwal '01] Mutual information
  – Intuition:
    • High entropy is good. I(X;Y) = H(X) - H(X|Y) (mutual information)
    • Small I(X;Y) → privacy preserved (Y provides little information about X).
  – Problem [EGS]:
    • An average notion. Privacy loss can happen with low but significant probability, without affecting I(X;Y).
    • Sometimes I(X;Y) seems good but privacy is breached
Output Perturbation (Randomization Approach)
• Exact answer to query q:
  – a_q = Σ_{i∈q} d_i
• Actual SDB answer: â_q
• Perturbation ε:
  – For all q: |â_q – a_q| ≤ ε
• Questions:
  – Does perturbation give any privacy?
  – How much perturbation is needed for privacy?
  – Usability
Privacy Preserved by Perturbation ε ≈ √n
• Database: d ∈_R {0,1}^n (uniform input distribution!)
• Algorithm: on query q,
  1. Let a_q = Σ_{i∈q} d_i
  2. If |a_q - |q|/2| < ε, return â_q = |q|/2
  3. Otherwise return â_q = a_q
• ε ≥ √n·(lg n)² ⇒ privacy is preserved
  – Assume poly(n) queries
  – If ε ≥ √n·(lg n)², whp rule 2 is always used
    • No information about d is given!
    • (but the database is completely useless…)
• Shows that sometimes perturbation ≈ √n is enough for privacy. Can we do better?
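A minimal sketch of this "private but useless" mechanism (parameter choices are ours; asymptotically, ε ≥ √n·(lg n)² makes rule 2 fire whp on every query):

    import numpy as np

    rng = np.random.default_rng(0)
    n = 10_000
    d = rng.integers(0, 2, size=n)              # uniform random database
    eps = np.sqrt(n) * np.log2(n) ** 2          # perturbation magnitude

    def perturbed_answer(q):
        """Answer subset-sum query q (array of indices) with perturbation eps."""
        a_q = d[q].sum()
        if abs(a_q - len(q) / 2) < eps:         # rule 2: hide a_q
            return len(q) / 2
        return a_q                              # rule 3: whp never reached

    q = rng.choice(n, size=5000, replace=False)
    print(perturbed_answer(q))                  # almost surely 2500.0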
Perturbation ε << √n Implies no Privacy
• The previous useless database achieves the best possible perturbation.
• Theorem [Dinur-Nissim]: Given any DB and any DB response algorithm with perturbation ε = o(√n), there is a poly-time reconstruction algorithm that outputs a database d', s.t. dist(d,d') = o(n).
  – I.e., a strong breaking of privacy.
The Adversary as a Decoding Algorithm
[Diagram: the database d is "encoded" as partial sums a_q1,…,a_qt; perturbation yields â_q1,…,â_qt; the adversary "decodes" the perturbed sums into d'.]
Proof of Theorem [DN03]
The Adversary Reconstruction Algorithm
• Query phase: Get â_qj for t random subsets q1,…,qt
• Weeding phase: Solve the Linear Program (over ℝ):
  – 0 ≤ x_i ≤ 1
  – |Σ_{i∈qj} x_i - â_qj| ≤ ε for all j
• Rounding: Let c_i = round(x_i); output c
• Observation: A solution always exists, e.g. x = d.
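A runnable sketch of this reconstruction attack on a toy instance (parameters and names are ours), using an off-the-shelf LP solver:

    import numpy as np
    from scipy.optimize import linprog

    rng = np.random.default_rng(1)
    n, t, eps = 40, 400, 1                  # small DB, t random queries, eps = o(sqrt(n))

    d = rng.integers(0, 2, size=n)          # the secret database
    Q = rng.integers(0, 2, size=(t, n))     # row j = indicator vector of subset q_j
    noisy = Q @ d + rng.integers(-eps, eps + 1, size=t)   # |a_hat - a| <= eps

    # Weeding phase: find x in [0,1]^n with |<Q_j, x> - noisy_j| <= eps for all j.
    A_ub = np.vstack([Q, -Q])
    b_ub = np.concatenate([noisy + eps, -(noisy - eps)])
    res = linprog(np.zeros(n), A_ub=A_ub, b_ub=b_ub, bounds=[(0, 1)] * n)

    # Rounding phase: round each coordinate and compare with the true d.
    c = np.round(res.x).astype(int)
    print("wrong coordinates:", int((c != d).sum()), "out of", n)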
Why does the Reconstruction Algorithm Work?
• Consider x ∈ {0,1}^n s.t. dist(x,d) = c·n = Ω(n)
• Observation:
  – A random q contains c'·n coordinates in which x ≠ d
  – The difference in the sums over these coordinates is, with constant probability, at least Ω(√n) ( > ε = o(√n) ).
  – Such a q disqualifies x as a solution for the LP
• Since the total number of queries q is polynomial, all such vectors x are disqualified with overwhelming probability.
Summary of Results (statistical database)
• [Dinur, Nissim 03]:
  – Unlimited adversary (small DB):
    • Perturbation of magnitude Ω(n) is required
  – Polynomial-time adversary (medium DB):
    • Perturbation of magnitude Ω(√n) is required (shown above)
  – In both cases, the adversary may reconstruct a good approximation of the database
    • This disallows even very weak notions of privacy
• Bounded adversary, restricted to T << n queries (SuLQ; large DB):
  – There is a privacy-preserving access mechanism with perturbation ≈ √T (<< √n)
  – A chance for usability
  – A reasonable model as databases grow larger and larger
SuLQ for Multi-Attribute Statistical Database (SDB)
• Database {d_{i,j}}: n persons × k attributes (0/1 entries)
• Row distribution D = (D1, D2, …, Dn)
• Query (q, f): q ⊆ [n], f : {0,1}^k → {0,1}
• Answer: a_{q,f} = Σ_{i∈q} f(d_i)
Privacy and Usability Concerns for the Multi-Attribute Model [DN]
• Rich set of queries: subset sums over any property of the k attributes
  – Obviously increases usability, but how is privacy affected?
• More to protect: functions of the k attributes
• Relevant factors:
  – What is the adversary's goal?
  – Row dependency
• Vertically split data (between k or fewer databases):
  – Can privacy still be maintained with independently operating databases?
Privacy Definition - Intuition
• 3-phase adversary
  – Phase 0: defines a target set G of poly(n) functions g: {0,1}^k → {0,1}
    • Will try to learn some of this information about someone
  – Phase 1: adaptively queries the database T = o(n) times
  – Phase 2: uses all the gained information to choose an index i of a row it intends to attack and a function g ∈ G
    • Attack: given d_{-i}, try to guess g(d_{i,1}…d_{i,k})
The Privacy Definition
• p^0_{i,g} – the a-priori probability that g(d_{i,1}…d_{i,k}) = 1
• p^T_{i,g} – the a-posteriori probability that g(d_{i,1}…d_{i,k}) = 1
  – Given the answers to the T queries, and d_{-i}
• Define conf(p) = log₂(p/(1-p))
  – 1-1 relationship between p and conf(p)
  – conf(1/2) = 0; conf(2/3) = 1; conf(1) = ∞
• Δconf_{i,g} = conf(p^T_{i,g}) – conf(p^0_{i,g})
• (δ,T)-privacy ("relative privacy"):
  – For all distributions D1…Dn, every row i, every function g, and any adversary making at most T queries: Pr[Δconf_{i,g} > δ] = neg(n)
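As a quick sanity check on the definition, a two-line sketch of conf (base 2 is inferred from conf(2/3) = 1):

    import math

    def conf(p):
        """Log-odds: conf(1/2) = 0, conf(2/3) = 1, conf(1) = inf."""
        return math.log2(p / (1 - p))

    print(conf(0.5), conf(2 / 3), conf(0.99))   # ≈ 0.0, 1.0, 6.63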
The SuLQ* Database
• Adversary restricted to T << n queries
• On query (q, f):
  – q ⊆ [n]; f : {0,1}^k → {0,1} (a binary function)
  – Let a_{q,f} = Σ_{i∈q} f(d_{i,1}…d_{i,k})
  – Let N ~ Binomial(0, √T) (zero-mean binomial noise with standard deviation ≈ √T)
  – Return a_{q,f} + N
*SuLQ – Sub-Linear Queries
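A sketch of the SuLQ answering step under this reading (zero-mean binomial noise with standard deviation on the order of √T; function and variable names are ours):

    import numpy as np

    rng = np.random.default_rng(2)

    def sulq_answer(rows, f, T):
        """rows: the (|q|, k) 0/1 sub-matrix selected by q.
        f: binary predicate on a length-k row. T: total query budget."""
        a_qf = sum(f(r) for r in rows)
        m = int(T)                             # Binomial(m, 1/2) - m/2 has std sqrt(m)/2
        noise = rng.binomial(m, 0.5) - m / 2   # mean 0, std on the order of sqrt(T)
        return a_qf + noise

    rows = rng.integers(0, 2, size=(1000, 3))
    print(sulq_answer(rows, lambda r: r[0] & r[2], T=100))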
Privacy Analysis of the SuLQ Database
• p^m_{i,g} – the a-posteriori probability that g(d_{i,1}…d_{i,k}) = 1
  – Given d_{-i} and the answers to the first m queries
• conf(p^m_{i,g}) describes a random walk on the line with:
  – Starting point: conf(p^0_{i,g})
  – Compromise: conf(p^m_{i,g}) – conf(p^0_{i,g}) > δ
• W.h.p. more than T steps are needed to reach compromise
[Figure: a random walk on the line from conf(p^0_{i,g}) toward conf(p^0_{i,g}) + δ.]
Usability: One Multi-Attribute SuLQ DB
• Statistics of any property f of the k attributes
  – I.e., for what fraction of the (sub)population does f(d1…dk) hold?
  – Easy: just put f in the query
  – Other applications:
    • k independent multi-attribute SuLQ DBs
    • Vertically partitioned SuLQ DBs
    • Testing whether Pr[β|α] ≥ Pr[β] + Δ
  – Caveat: we hide g() about a specific row (not about multiple rows)
Overview of Methods
• Input Perturbation: the data is perturbed (SDB → SDB'); the user's queries are then answered from SDB' as usual.
• Output Perturbation: the SDB answers a (restricted) query and returns a perturbed response to the user.
• Query Restriction: the SDB answers a (restricted) query exactly, or denies it.
Query Restriction
• The user poses a (restricted) query; the SDB returns an exact response or a denial.
• The decision whether to answer or deny the query:
  – Can be based on the content of the query and on the answers to previous queries
  – Or, can be based on the above and on the content of the database
Auditing
• [AW89] classify auditing as a query restriction method:
  – "Auditing of an SDB involves keeping up-to-date logs of all queries made by each user (not the data involved) and constantly checking for possible compromise whenever a new query is issued"
• Partial motivation: may allow more queries to be posed, if no privacy threat occurs.
• Early work: Hofmann 1977, Schlorer 1976, Chin, Ozsoyoglu 1981, 1986
• Recent interest: Kleinberg, Papadimitriou, Raghavan 2000, Li, Wang, Wang, Jajodia 2002, Jonsson, Krokhin 2003
How Auditors may Inadvertently Compromise Privacy
The Setting
• Dataset: d = {d1,…,dn}
  – Entries di: real, integer, or Boolean
• Query: q = (f, i1,…,ik); the statistical database returns f(d_{i1},…,d_{ik})
  – f: Min, Max, Median, Sum, Average, Count…
• Bad users will try to breach the privacy of individuals
• Compromise ⇔ uniquely determining some di (a very weak definition)
Auditing
• The auditor keeps a query log q1,…,qi against the statistical database.
• On a new query q_{i+1}, it either returns the answer, or denies the query (when the answer would cause privacy loss).
Example 1: Sum/Max Auditing
• di real; sum/max queries; privacy is breached if some di is learned
• q1 = sum(d1,d2,d3). Answer: 15
• q2 = max(d1,d2,d3). Denied (the answer would cause privacy loss)
• The attacker reasons: "There must be a reason for the denial… q2 is denied iff d1=d2=d3=5. I win!"
Sounds Familiar?
David Duncan, former auditor for Enron and partner in Andersen: "Mr. Chairman, I would like to answer the committee's questions, but on the advice of my counsel I respectfully decline to answer the question based on the protection afforded me under the Constitution of the United States."
Max Auditing
• di real; database d1, d2, …, dn
• q1 = max(d1,d2,d3,d4). Answer: M1234
• q2 = max(d1,d2,d3). Answer: M123, or denied
  – If denied: d4 = M1234
• q3 = max(d1,d2). Answer: M12, or denied
  – If denied: d3 = M123
• The attacker learns an item with probability ½
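This attack can be written as a loop (ask is a hypothetical auditor interface, returning the max of the indexed entries or None on denial; it is not from the lecture):

    def max_attack(ask, n):
        """Shrink a prefix-max query; a denial pins down one entry exactly."""
        prev = ask(range(1, n + 1))        # max(d_1..d_n), assumed answered
        for m in range(n - 1, 0, -1):
            ans = ask(range(1, m + 1))     # max(d_1..d_m)
            if ans is None:                # denial => d_{m+1} = previous max
                return m + 1, prev
            prev = ans
        return None                        # no denial, nothing pinned down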
Boolean Auditing?
• di Boolean; database d1, d2, …, dn
• q1 = sum(d1,d2). Answer: 1, or denied
• q2 = sum(d2,d3). Answer: 1, or denied
• …
• qi is denied iff di = di+1, so the denial pattern reveals the database up to complementation
The Problem
• Query denials leak (potentially sensitive) information
  – Users cannot decide denials by themselves
[Diagram: within the space of possible assignments to {d1,…,dn}, the denial of q_{i+1} shrinks the set of assignments consistent with (q1,…,qi, a1,…,ai).]
Solution to the Problem: Simulatable Auditing
An auditor is simulatable if there exists a simulator s.t.: given only the queries q1,…,qi, the answers a1,…,ai, and the new query q_{i+1} (but no access to the statistical database), the simulator reaches the same deny/answer decision as the auditor.
Simulation ⇒ denials do not leak information.
Why do Simulatable Auditors not Leak Information?
[Diagram: the set of assignments to {d1,…,dn} consistent with (q1,…,qi, a1,…,ai) is the same whether q_{i+1} is denied or allowed, since the decision does not depend on the data.]
Simulatable auditing
Query Restriction for Sum Queries
• Given:
  – A dataset D = {x1,…,xn}, xi ∈ ℝ
  – S a subset of X. Query: Σ_{xi∈S} xi
• Is it possible to compromise D?
  – Here compromise means: uniquely determine some xi from the queries
• Can compromise if subsets may be arbitrarily small:
  – sum(x9) = x9
Query Set Size Control
• Do not permit queries that involve a small subset of the database.
• Compromise is still possible
  – To discover x: sum(x,y1,…,yk) - sum(y1,…,yk) = x
• Issue: overlap
• In general, limiting overlap alone is not enough.
  – Need to also restrict the number of queries
  – Note that overlap itself sometimes restricts the number of queries (e.g. if the size of queries is cn and the overlap is constant, only about 1/c queries are possible)
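The differencing attack above in a few lines of code (the sum_query helper stands in for the SDB interface and is ours):

    data = {"x": 42.0, "y1": 10.0, "y2": 7.0, "y3": 3.0}

    def sum_query(names):                       # stands in for the SDB
        return sum(data[n] for n in names)

    big = sum_query(["x", "y1", "y2", "y3"])    # size 4: passes a size >= 3 check
    small = sum_query(["y1", "y2", "y3"])       # size 3: also passes
    print(big - small)                          # 42.0 = x, privacy breached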
Restricting Set-Sum Queries
• Restrict the sum queries based on:
  – The number of database elements in the sum
  – The overlap with previous sum queries
  – The total number of queries
• Note that the criteria are known to the user
  – They do not depend on the contents of the database
• Therefore, the user can simulate the deny/no-deny answer given by the DB
  – Simulatable auditing (a sketch follows below)
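A minimal sketch of such a content-independent (hence simulatable) auditor; the thresholds and class name are ours:

    class SimulatableAuditor:
        def __init__(self, min_size, max_overlap, max_queries):
            self.min_size = min_size
            self.max_overlap = max_overlap
            self.max_queries = max_queries
            self.log = []                     # query log q1,...,qi (index sets)

        def allow(self, q):
            """Deny/answer decided from the queries alone, never the data."""
            q = set(q)
            ok = (len(q) >= self.min_size
                  and all(len(q & p) <= self.max_overlap for p in self.log)
                  and len(self.log) < self.max_queries)
            if ok:
                self.log.append(q)
            return ok

    auditor = SimulatableAuditor(min_size=3, max_overlap=1, max_queries=10)
    print(auditor.allow({1, 2, 3}), auditor.allow({2, 3, 4}))   # True False

Since allow never reads the database, a user can run the same rule and predict every denial, which is exactly why the denials leak nothing.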
Restricting Overlap and Number of Queries
• Assume:
  – |Qi| ≥ k for every query Qi
  – |Qi ∩ Qj| ≤ r for every pair of queries
  – The adversary knows a-priori at most L values, L+1 < k
• Claim: Data cannot be compromised with fewer than 1+(2k-L)/r sum queries.
[Figure: a 0/1 query-incidence matrix applied to (x1,…,xn); each row Qi contains ≥ k ones, and any two rows share ≤ r ones.]
Overlap + Number of Queries
• Claim: Data cannot be compromised with fewer than 1+(2k-L)/r sum queries [Dobkin, Jones, Lipton] [Reiss]
  – k ≤ query size, r ≥ overlap, L = number of a-priori known items
• Suppose xc is compromised after t queries, where query i is Qi = x_{i1} + x_{i2} + … + x_{ik}, for i = 1,…,t
• This implies that:
  – xc = Σ_{i=1..t} αi·Qi = Σ_{i=1..t} αi Σ_{j=1..k} x_{ij}
  – Let η_{iℓ} = 1 if x_ℓ is in query i, and 0 otherwise
  – xc = Σ_{i=1..t} αi Σ_{ℓ=1..n} η_{iℓ}·x_ℓ = Σ_{ℓ=1..n} (Σ_{i=1..t} αi·η_{iℓ})·x_ℓ
Overlap + Number of Queries
• We have: xc = Σ_{ℓ=1..n} (Σ_{i=1..t} αi·η_{iℓ})·x_ℓ
• In the above sum, (Σ_{i=1..t} αi·η_{iℓ}) must be 0 for every x_ℓ except xc (in order for xc to be compromised)
• This happens iff η_{iℓ} = 0 for all i, or if η_{iℓ} = η_{jℓ} = 1 and αi, αj have opposite signs
  – or αi = 0, in which case the i-th query didn't matter
Overlap + Number of Queries
• Wlog, the first query contains xc and the second query has the opposite sign.
• In the first query, k elements are probed.
• The second query adds at least k-r new elements.
• Elements from the first and second queries cannot be canceled within the same (additional) query (that would require opposite signs).
• Therefore each new query cancels items from the first or from the second query, but not from both.
• 2k-r-L elements need to be canceled, and each additional query cancels at most r of them.
  – Hence at least 2+(2k-r-L)/r queries are needed, i.e. 1+(2k-L)/r.
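As a worked instance of the bound (numbers ours): with query size k = 100, pairwise overlap r = 10, and L = 20 a-priori known values, compromising any xc takes at least 1 + (2·100 - 20)/10 = 19 sum queries.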
Notes
• The number of queries satisfying |Qi| ≥ k and |Qi ∩ Qj| ≤ r is small
  – If k = n/c for some constant c and r = const, then there are only ~c queries with pairwise overlap at most r.
  – Hence, the query sequence length may be uncomfortably short.
  – Or, if r = k/c (the overlap is a constant fraction of the query size), then the number of queries, 1+(2k-L)/r, is O(c).
Conclusions
• Privacy should be defined and analyzed rigorously
  – In particular, assuming that randomization ⇒ privacy is dangerous
• High perturbation is needed for privacy against polynomial adversaries
  – Threshold phenomenon: above √n, total privacy; below √n, no privacy (for a poly-time adversary)
  – Main tool: a reconstruction algorithm
• Careless auditing might leak private information
• Self auditing (simulatable auditors) is safe
  – The decision whether to allow a query is based on previous 'good' queries and their answers
    • Without access to the DB contents
    • Users may apply the decision procedure by themselves
ToDo
• Come up with a good model and requirements for database privacy
  – Learn from crypto
  – Protect against more general loss of privacy
• Simulatable auditors are a starting point for designing more reasonable audit mechanisms
References
• Course web pages:
  – A Study of Perturbation Techniques for Data Privacy, Cynthia Dwork, Nina Mishra, and Kobbi Nissim, http://theory.stanford.edu/~nmishra/cs369-2004.html
  – Privacy and Databases, http://theory.stanford.edu/~rajeev/privacy.html
Foundations of CS at the Weizmann Institute
• Uri Feige
• Oded Goldreich
• Shafi Goldwasser
• David Harel
• Moni Naor
• David Peleg
• Amir Pnueli
• Ran Raz
• Omer Reingold
• Adi Shamir
(Yellow in the original slides → crypto)
• All students receive a fellowship
• Language of instruction: English