Privacy Preserving Data Mining
Lecture 1
Motivating privacy research,
Introducing Crypto
Benny Pinkas
HP Labs, Israel
February 28, 2005
10th Estonian Winter School in Computer Science
Course structure
• Lecture 1:
  – Introduction to privacy
  – Introduction to cryptography, in particular to rigorous cryptographic analysis:
    • Definitions
    • Proofs of security
• Lecture 2:
  – Cryptographic tools for privacy preserving data mining
• Lecture 3:
  – Non-cryptographic tools for privacy preserving data mining, in particular answer perturbation
Privacy-Preserving Data Mining
• Allow multiple data holders to collaborate in order to compute important information while protecting the privacy of other information:
  – Security-related information
  – Public health information
  – Marketing information
• Advantages of privacy protection:
  – Protection of personal information
  – Protection of proprietary or sensitive information
  – Enables collaboration between different data owners (since they may be more willing or able to collaborate if they need not reveal their information)
  – Compliance with the law
Privacy Preserving Data Mining
• Two papers appeared in 2000:
  – “Privacy preserving data mining”, Agrawal and Srikant, SIGMOD 2000 (statistical approach)
  – “Privacy preserving data mining”, Lindell and Pinkas, Crypto 2000 (cryptographic approach)
• Why privacy now?
  – Technological changes erode privacy: ubiquitous computing, cheap storage.
  – Public awareness: health coverage, employment, personal relationships.
  – Historical changes: small towns vs. cities vs. the connected society.
  – Privacy is a real problem that needs to be solved.
Some data privacy cases: hospital data
• Hospital data contains:
  – Identifying information: name, ID, address
  – General information: age, marital status
  – Medical information
  – Billing information
• Database access issues:
  – Your doctor should get all the information that is required to take care of you.
  – Emergency rooms should get all the medical information that is required to take care of whoever comes in.
  – The billing department should only get information relevant to billing.
• Problem: how do we stop employees from looking up information about family, neighbors, or celebrities?
Some data privacy cases: Medical Research
• Medical research tries to learn patterns in the data, in “aggregate” form.
  – Problem: how to enable learning aggregate data without revealing personal medical information?
  – Hiding names is not enough, since there are many other ways to uniquely identify a person.
• A single hospital or medical researcher might not have enough data.
• How can different organizations share research data without revealing personal data?
Public Data
• Many public records are available in electronic form: birth records, property records, voter registration.
• “Your information serves as an error-correcting code for your identity.”
• Latanya Sweeney:
  – Date of birth uniquely identifies 12% of the population of Cambridge, MA.
  – Date of birth + gender: 29%
  – Date of birth + gender + (9-digit) ZIP code: 95%
  – Sweeney was therefore able to retrieve her medical information from an “anonymized” database.
Census data
• A trusted party (the census bureau) collects information about individuals.
• Collected data:
  – Explicitly identifying data (names, addresses, ...)
  – Implicitly identifying data (a combination of several attributes)
  – Private data
• The data is collected to help decision making.
  – Partial or aggregate data should therefore be made public.
Total Information Awareness (TIA)
• Collects information about transactions (credit card purchases, magazine subscriptions, bank deposits, flights):
  – Early detection of terrorist activity.
  – Example: check out a chemistry book from the library, buy something at a hardware store and something at a pharmacy…
• Early detection of epidemic outbreaks:
  – Early symptoms of anthrax are similar to the flu.
  – Check non-traditional data sources: grocery and pharmacy data, school attendance records, etc.
• Such systems are being developed and used.
• Could the collection of data be done in a privacy preserving manner (without learning about individuals)?
Basic Scenarios
• Single (centralized) database, e.g., census data:
  – This is often a simple abstraction of a more complicated scenario, so we had better solve this one.
  – Need to collect data and present it in a privacy preserving way.
• Published data (e.g., on a CD):
  – A “trusted” party collects data and then publishes a “sanitized” version.
  – Users can do any computation they wish with the sanitized data.
  – For example, statistical tabulations.
Basic Scenarios
• Multi-database scenarios: two or more parties with private data want to cooperate.
  – Horizontally split: each party has a large database. The databases have the same attributes but are about different subjects. For example, the parties are banks, each with information about its own customers.
    [Figure: bank 1 holds records for users u1…un; bank 2 holds records for users u'1…u'n]
  – Vertically split: each party has some information about the same set of subjects. For example, the participating parties are government agencies (taxes, banking, housing), each with some data about every citizen.
    [Figure: the taxes, bank, and houses databases each hold one attribute for the same users u1…un]
Issues and Tools
• The best privacy can be achieved by not giving out any data at all, but then the data cannot be used.
• Privacy tools: cryptography [LP00]
  – Encryption: data is hidden unless you have the decryption key. However, we also want to use the data.
  – Secure function evaluation: two or more parties with private inputs can compute any function they wish without revealing anything else.
  – Strong theory, which is starting to be relevant to real applications.
• Non-cryptographic tools [AS00]
  – Query restriction: prevent certain queries from being answered.
  – Data/input/output perturbation: add errors to the inputs to hide personal data while keeping aggregates accurate (randomization, rounding, data swapping; see the sketch below).
  – Can these be understood as well as we understand crypto? Do they provide the same level of security as crypto?
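To make “randomization” concrete, here is a minimal Python sketch of one classic input-perturbation technique, randomized response; the slides do not commit to a specific scheme, so treat this purely as an illustration. Each record's sensitive bit is flipped with a known probability, and the analyst inverts the known bias to recover an accurate aggregate:

```python
import random

def perturb(bit, p_truth=0.75):
    """Report the true bit with probability p_truth, otherwise flip it."""
    return bit if random.random() < p_truth else 1 - bit

def estimate_rate(reports, p_truth=0.75):
    """Invert the known bias: E[report] = (1 - p_truth) + rate * (2*p_truth - 1)."""
    observed = sum(reports) / len(reports)
    return (observed - (1 - p_truth)) / (2 * p_truth - 1)

# Toy population: 30% hold the sensitive attribute.
truth = [1 if random.random() < 0.3 else 0 for _ in range(100_000)]
reports = [perturb(b) for b in truth]
print(estimate_rate(reports))  # close to 0.30, yet no single report is reliable
```

The aggregate stays accurate while any individual report is deniable; whether such schemes admit guarantees as crisp as cryptographic ones is exactly the question raised above.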
Introduction to Cryptography
Why learn/use crypto to solve privacy issues?
• Why are we referring to crypto?
  – Cryptography is one of the tools we can use for preserving privacy.
  – It is a mature research area, with many useful results and tools.
  – It can inform our thinking: how is “security” defined in cryptography? How should we define “privacy”?
What is Cryptography?
Traditionally: how to maintain secrecy in communication
Alice and Bob talk while Eve tries to listen
[Figure: Alice and Bob communicate while Eve eavesdrops on the channel]
History of Cryptography
• A very ancient occupation.
• Up to the mid 70’s: mostly classified military work.
  – Exceptions: Shannon, Turing.
• Since then: explosive growth.
  – Commercial applications.
  – Scientific work: a tight relationship with Computational Complexity Theory.
  – Major works: Diffie-Hellman; Rivest, Shamir and Adleman (RSA).
• Recently: more involved models for more diverse tasks.
• Scope: how to maintain secrecy, integrity and functionality in computer and communication systems.
Relation to computational hardness
• Cryptography uses problems that are infeasible to solve.
• It uses the intractability of some problems in order to construct secure systems.
  – Feasible: computable in probabilistic polynomial time (PPT).
  – Infeasible: no probabilistic polynomial time algorithm.
  – Usually average-case hardness is needed.
• For example, the discrete log problem.
The Discrete Log Problem
• Let G be a group and g an element of G.
• Given y ∈ G, let x be the minimal non-negative integer satisfying the equation y = g^x. This x is called the discrete log of y to the base g.
• Example: y = g^x mod p in the multiplicative group Zp* (p prime). For instance, p=7, g=3, y=4 ⇒ x=4, since 3^4 = 81 ≡ 4 (mod 7).
• In general, it is easy to exponentiate (using repeated squaring and the binary representation of x; see the sketch below).
• Computing the discrete log is believed to be hard in Zp* if p is large (e.g., p is a prime, |p| > 768 bits, p = 2q+1, and q is also prime).
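To illustrate the asymmetry, here is a small Python sketch (toy modulus only, far below the key sizes mentioned above): exponentiation takes O(log x) multiplications via repeated squaring, while the naive discrete-log search must scan the whole group.

```python
def power_mod(g, x, p):
    """Repeated squaring: walk the bits of x, O(log x) multiplications."""
    result = 1
    base = g % p
    while x > 0:
        if x & 1:                    # current bit of x is set
            result = (result * base) % p
        base = (base * base) % p
        x >>= 1
    return result

def discrete_log(y, g, p):
    """Naive search: time exponential in |p|, which is why the problem is hard."""
    for x in range(p - 1):
        if power_mod(g, x, p) == y:
            return x
    return None

# The slide's toy example: p=7, g=3, y=4  =>  x=4  (3^4 = 81 = 4 mod 7)
assert power_mod(3, 4, 7) == 4
print(discrete_log(4, 3, 7))         # 4
```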
Encryption
• Alice wants to send a message m ∈ {0,1}^n to Bob.
  – The set-up phase is secret.
  – Symmetric encryption: Alice and Bob share a secret key k.
• They want to prevent Eve from learning anything about the message.
[Figure: Alice sends Ek(m) to Bob; both hold the key k; Eve observes the channel]
Public key encryption
• Alice generates a private/public key pair (SK, PK).
• Only Alice knows the secret key SK.
• Everyone (even Eve) knows the public key PK, and can encrypt messages to Alice.
• Only Alice can decrypt (using SK).
[Figure: Bob and Charlie each send EPK(m) to Alice, who decrypts with SK; Eve knows PK and sees the ciphertexts]
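The slides treat public-key encryption abstractly; as one concrete instantiation built directly on the discrete log problem from the previous slide, here is a toy ElGamal-style sketch in Python (tiny prime, no padding or other hardening; purely illustrative, not the scheme the course prescribes):

```python
import random

# Toy parameters: a safe prime p = 2q+1 and a generator g of Zp*.
# Real systems use |p| in the hundreds or thousands of bits; 23 is for show.
p, g = 23, 5

def keygen():
    sk = random.randrange(1, p - 1)   # secret exponent
    pk = pow(g, sk, p)                # public value g^sk mod p
    return sk, pk

def encrypt(pk, m):
    """ElGamal: ciphertext (g^r, m * pk^r) for fresh random r; m in 1..p-1."""
    r = random.randrange(1, p - 1)
    return pow(g, r, p), (m * pow(pk, r, p)) % p

def decrypt(sk, c):
    c1, c2 = c
    # Divide by c1^sk via Fermat: c1^(p-1-sk) is the inverse of c1^sk mod p.
    return (c2 * pow(c1, p - 1 - sk, p)) % p

sk, pk = keygen()
assert decrypt(sk, encrypt(pk, 17)) == 17
```

Recovering SK from PK = g^SK mod p is exactly the discrete log problem, which ties the scheme's security to the hardness assumption above.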
Rigorous Specification of Security
To define the security of a system we must specify:
1. What constitutes a failure of the system, i.e., what it means to break the system.
2. The power of the adversary:
   – computational power
   – access to the system
What does `learn’ mean?
• Even if Eve has some prior knowledge of m, she should not gain any advantage in:
  – the probability of guessing m, or of guessing whether m is m0 or m1, or of computing any other function f of m, or even of computing |m|.
• Ideally: the message sent is independent of the message m.
  – This implies all of the above.
• Achievable: the one-time pad (symmetric encryption; see the sketch below).
  – Let r ∈R {0,1}^n be the shared key.
  – Let m ∈ {0,1}^n.
  – To encrypt m, send z = r ⊕ m.
  – To decrypt z, compute m = z ⊕ r.
• Shannon: this is achievable only if the entropy of the shared secret is at least as large as that of m. Therefore a long key must be used.
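A minimal Python sketch of the one-time pad exactly as defined above, with a fresh uniformly random key of the same length as the message:

```python
import secrets

def otp_encrypt(m: bytes, r: bytes) -> bytes:
    """One-time pad: z = r XOR m. The key r must be as long as m
    and must never be reused for a second message."""
    assert len(r) == len(m)
    return bytes(a ^ b for a, b in zip(r, m))

otp_decrypt = otp_encrypt            # decryption is the same XOR: m = z XOR r

m = b"attack at dawn"
r = secrets.token_bytes(len(m))      # fresh uniformly random key
z = otp_encrypt(m, r)
assert otp_decrypt(z, r) == m        # z XOR r recovers m
```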
Defining security
The power of the adversary:
  – Computational: a probabilistic polynomial time machine (PPTM).
  – Access to the system: e.g., can it change messages?
  – Passive adversary, (adaptive) chosen plaintext attack, chosen ciphertext attack…
What constitutes a failure of the system?
  – Recovering the plaintext from the ciphertext is not a strong enough requirement:
    • It allows the leakage of partial information.
    • In general, it is hard to say which partial information may or should not be leaked; this is application dependent.
    • How would partial information the adversary already holds combine with what he learns to affect privacy?
  – Better: prevent learning anything about an encrypted message.
    • There are two common, equivalent definitions…
Security of Encryption: Definition 1
Indistinguishability of Encryptions
• The adversary A chooses any X0, X1 ∈ {0,1}^n.
• A receives an encryption of Xb for b ∈R {0,1}.
• A has to decide whether b = 0 or b = 1.
For every PPTM A choosing a pair X0, X1 ∈ {0,1}^n:
    | Pr[A(E(X0)) = 1] − Pr[A(E(X1)) = 1] | = neg(n)
(The probability is over the choice of keys, the randomization in the encryption, and A’s coins.)
• Note that a proof of security must be rigorous.
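For concreteness, here is a small Python harness for the indistinguishability game itself (an empirical illustration, not a proof); the one-time pad from the previous slide is plugged in as E, and for it any adversary's measured advantage stays at the sampling-noise level:

```python
import secrets

def ind_experiment(encrypt, adversary, n=16, trials=10_000):
    """Run the game repeatedly and return the empirical advantage
    |Pr[A outputs 1 | b=1] - Pr[A outputs 1 | b=0]|."""
    x0, x1 = b"\x00" * n, b"\xff" * n        # the adversary's chosen pair
    ones, count = {0: 0, 1: 0}, {0: 0, 1: 0}
    for _ in range(trials):
        b = secrets.randbelow(2)             # challenge bit, uniform
        guess = adversary(encrypt([x0, x1][b]))
        count[b] += 1
        ones[b] += guess
    return abs(ones[1] / count[1] - ones[0] / count[0])

def otp_encrypt(m):
    """One-time pad with a fresh key per call; ciphertexts are uniform."""
    r = secrets.token_bytes(len(m))
    return bytes(a ^ b for a, b in zip(r, m))

adversary = lambda c: c[0] >> 7              # guesses from the top bit
print(ind_experiment(otp_encrypt, adversary))  # ~0 up to sampling noise
```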
Security of Encryption: Definition 2
Semantic Security
Simulation: whatever the adversary A can compute given an encryption of X ∈ {0,1}^n, a “simulator” S that does not get to see the encryption of X can also compute.
• A selects a distribution Dn on {0,1}^n and a relation R(X,Y) computable in PPT (e.g., R(X,Y) = 1 iff Y is the last bit of X).
• X ∈R Dn is sampled.
• Given E(X), A outputs Y, trying to satisfy R(X,Y).
• The simulator S does the same without access to E(X).
• The simulation is successful if A and S have the same success probability.
• Successful simulation ⇒ semantic security.
Security of Encryption (2)
Semantic Security
More formally: for every PPTM A there is a PPTM S such that for every PPT relation R and for X ∈R Dn,
    | Pr[R(X, A(E(X)))] − Pr[R(X, S())] |
is negligible.
In other words: the outputs of A and S are indistinguishable even for a test that is aware of X.
Which is the Right Definition?
• Semantic security seems to convey that the message is protected.
• But it is usually easier to prove indistinguishability of encryptions.
• We would like to argue that the two definitions are equivalent.
• We must define the attack: a chosen plaintext attack.
  – The adversary can obtain the encryption of any message it chooses, in an adaptive manner.
  – More severe attacks exist, e.g., chosen ciphertext.
• The Equivalence Theorem: a cryptosystem is semantically secure if and only if it has the indistinguishability of encryptions property.
Equivalence Proof (informal)
Semantic security ⇒ indistinguishability of encryptions (by the contrapositive):
• Suppose there is no indistinguishability:
  – A chooses a pair X0, X1 ∈ {0,1}^n for which it can distinguish encryptions with non-negligible advantage ε.
• Choose:
  – the distribution Dn = {X0, X1} (uniform over the pair)
  – the relation R which is “equality with X”
• For any simulator S that doesn’t get E(X) and outputs Y′, we have Pr[R(X, Y′)] = ½.
• Given E(Xb), run A(E(Xb)), get output b′ ∈ {0,1}, and set Y = Xb′.
• Now | Pr[A(E(Xb)) = 1 | b = 1] − Pr[A(E(Xb)) = 1 | b = 0] | > ε.
• Therefore | Pr[R(X, Y)] − Pr[R(X, Y′)] | > ε/2, so A beats every simulator and semantic security fails.
Equivalence Proof (informal)
Indistinguishability of encryptions ⇒ semantic security (again by the contrapositive):
• Suppose there is no semantic security: A chooses some distribution Dn and some relation R on which it beats every simulator.
• Choose X0, X1 ∈R Dn, choose b ∈R {0,1}, and compute E(Xb).
  – Give E(Xb) to A, and ask A to compute Yb = A(E(Xb)).
• For X0, X1 ∈R Dn let
  – p0 = Pr[R(X0, Yb)] and p1 = Pr[R(X1, Yb)].
• With noticeable probability |p0 − p1| is non-negligible, since otherwise Yb could be computed without the encryption.
• If |p0 − p1| is non-negligible, then we can distinguish between an encryption of X0 and an encryption of X1.
Lessons learned?
• A rigorous approach to cryptography:
  – Defining security
  – Proving security
References
Books:
• O. Goldreich, Foundations of Cryptography, Vol. 1: Basic Tools, Cambridge, 2001.
  – Covers pseudo-randomness and zero-knowledge.
  – Vol. 2: Basic Applications (to be available May 2004) covers encryption and secure function evaluation.
  – Other volumes at www.wisdom.weizmann.ac.il/~oded/books.html
Web material/courses:
• S. Goldwasser and M. Bellare, Lecture Notes on Cryptography, http://www-cse.ucsd.edu/~mihir/papers/gb.html
• M. Naor, 9th EWSCS, http://www.cs.ioc.ee/yik/schools/win2004/naor.php
Secure Function Evaluation
• A major topic of cryptographic research.
• How to let n parties P1, …, Pn compute a function f(x1, …, xn):
  – where input xi is known to party Pi,
  – and the parties learn the final output and nothing else (a toy example follows below).
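For a first taste of what such a computation can look like, here is a toy Python sketch of one very simple special case: a secure sum via additive secret sharing. This is an illustration of the “learn the output and nothing else” idea under the assumption of honest, non-colluding parties; it is not a general SFE construction.

```python
import secrets

Q = 2**64                       # all arithmetic is modulo Q

def share(x, n):
    """Split x into n additive shares that sum to x mod Q.
    Any n-1 of the shares look uniformly random and reveal nothing about x."""
    shares = [secrets.randbelow(Q) for _ in range(n - 1)]
    shares.append((x - sum(shares)) % Q)
    return shares

# Three parties with private inputs; each sends one share to every party.
inputs = [12, 30, 7]
all_shares = [share(x, 3) for x in inputs]

# Party j locally adds the shares it received (column j); the per-party
# partial sums are then published and combined.
partials = [sum(col) % Q for col in zip(*all_shares)]
assert sum(partials) % Q == sum(inputs) % Q   # only the sum is revealed
print(sum(partials) % Q)                      # 49
```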
The Millionaires Problem [Yao]
[Figure: Alice holds a value x, Bob holds a value y]
Whose value is greater? Leak no other information!
Comparing Information without Leaking it
[Figure: Alice holds x, Bob holds y]
• Output: is x = y?
• The following solution is insecure (a sketch of the attack follows below):
  – Use a one-way hash function H().
  – Alice publishes H(x), Bob publishes H(y).
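Why is this insecure? One-wayness only helps when the input has high entropy; if x and y come from a small or guessable set, anyone can enumerate candidates and compare against the published hash. A tiny Python sketch (SHA-256 stands in for H; the slides do not fix a concrete hash):

```python
import hashlib

def H(s: str) -> str:
    return hashlib.sha256(s.encode()).hexdigest()

# Alice's secret is, say, a salary rounded to the nearest thousand.
published = H("73000")

# Eve simply tries every plausible value against the published hash.
recovered = next(
    str(v) for v in range(0, 1_000_000, 1000) if H(str(v)) == published
)
print(recovered)   # "73000" -- the hash leaked Alice's input entirely
```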
Secure two-party computation - definition
Input: Alice has x, Bob has y.
Output: each party learns F(x,y) and nothing else.
As if… a trusted third party received x and y and returned F(x,y) to both parties.
[Figure: the real protocol between Alice and Bob emulates the trusted third party]
Leak no other information
• A protocol is secure if it emulates the ideal solution.
• Alice learns F(x,y), and can therefore compute everything that is implied by x, her prior knowledge of y, and F(x,y).
• Alice should not be able to compute anything else.
• Simulation: a protocol is considered secure if:
  – for every adversary in the real world,
  – there exists a simulator in the ideal world which, given access to the information the adversary is allowed to learn, outputs an indistinguishable “transcript”.
More tomorrow…