Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Privacy Preserving Data Mining Lecture 1 Motivating privacy research, Introducing Crypto Benny Pinkas HP Labs, Israel February 28, 2005 10th Estonian Winter School in Computer Science page 1 Course structure • Lecture 1: – – • Lecture 2 – • Introduction to privacy Introduction to cryptography, in particular, to rigorous cryptographic analysis. • Definitions • Proofs of security Cryptographic tools for privacy preserving data mining. Lecture 3 – – Non-cryptographic tools for privacy preserving data mining In particular, answer perturbation. February 28, 2005 10th Estonian Winter School in Computer Science page 2 Privacy-Preserving Data Mining • Allow multiple data holders to collaborate in order to compute important information while protecting the privacy of other information. – – – • Security-related information Public health information Marketing information Advantages of privacy protection – – – – protection of personal information protection of proprietary or sensitive information enables collaboration between different data owners (since they may be more willing or able to collaborate if they need not reveal their information) compliance with the law February 28, 2005 10th Estonian Winter School in Computer Science page 3 Privacy Preserving Data Mining • Two papers appeared in 2000 – – • “Privacy preserving data mining”, Agrawal and Srikant, SIGMOD 2000. (statistical approach) “Privacy preserving data mining”, Lindell and Pinkas, Crypto 2000. (cryptographic approach) Why privacy now? – – – – Technological changes erode privacy: ubiquitous computing, cheap storage. Public awareness: health coverage, employment, personal relationships. Historical changes: Small towns vs. Cities vs. Connected society. Privacy is a real problem that needs to be solved February 28, 2005 10th Estonian Winter School in Computer Science page 4 Some data privacy cases: hospital data • Hospital data contains – – – – • Identifying information: name, id, address General information: age, marital status Medical information Billing information Database access issues: – Your doctor should get every information that is required to take care of you – Emergency rooms should get all medical information that is required to take care of whoever comes there – Billing department should only get information relevant to billing • Problem: how to stop employees from getting information about family, neighbors, celebrities? February 28, 2005 10th Estonian Winter School in Computer Science page 5 Some data privacy cases: Medical Research • Medical research: Trying to learn patterns in the data, in “aggregate” form. – Problem: how to enable learning aggregate data without revealing personal medical information? – Hiding names is not enough, since there are many ways to uniquely identify a person – • • A single hospitals/medical researcher might not have enough data How can different organizations share research data without revealing personal data? February 28, 2005 10th Estonian Winter School in Computer Science page 6 Public Data • • • Many public records are available in electronic form: birth records, property records, voter registration “Your information serves as an error correcting code of your identity” Latanya Sweeney: – Date of birth uniquely identifies 12% of the population of Cambridge, MA. – Date of birth + gender: 29% – Date of birth + gender + (9 digit) zip code: 95% – Sweeney was therefore able to get her medical information from an “annonymized” database February 28, 2005 10th Estonian Winter School in Computer Science page 7 Census data • A trusted party (the census bureau) collects information about individuals • Collected data: – – – • Explicitly identifying data (names, address..) Implicitly identifying data (combination of several attributes) Private data The data should is collected to help decision making – Partial or aggregate data should therefore made public February 28, 2005 10th Estonian Winter School in Computer Science page 8 Total Information Awareness (TIA) • Collects information about transactions (credit card purchases, magazine subscriptions, bank deposits, flights) – – • Early collection of epidemic bursts – – – • Early detection of terrorist activity Check a chemistry book in the library, buy something at a hardware store and something in a pharmacy… Early symptoms of Anthrax are similar to the flu Check non-traditional data sources: grocery and pharmacy data, school attendance records, etc.. Such systems are developed and used Could the collection of data be done in a privacy preserving manner? (without learning about individuals?) February 28, 2005 10th Estonian Winter School in Computer Science page 9 Basic Scenarios • Single (centralized) database, e.g., census data: – – • This is often a simple abstraction of a more complicated scenario, so we better solve this one Need to collect data and present it in a privacy preserving way Published data (e.g., on a CD) A “trusted” party collects data and then publishes a “sanitized” version – Users can do any computation they wish with the sanitized data – For example, statistical tabulations. – February 28, 2005 10th Estonian Winter School in Computer Science page 10 Basic Scenarios • Multi database scenarios: – – Two or more parties with private data want to cooperate. Horizontally split: Each party has a large database. Databases have same attributes but are about different subjects. For example, the parties are banks which each have information about their customers. u1 – u1 bank 2 un taxes bank u’n bank 1 houses un u’1 Vertically split: Each party has some information about the same set of subjects. For example, the participating parties are government agencies; each with some data about every citizen. February 28, 2005 10th Estonian Winter School in Computer Science page 11 Issues and Tools • • Best privacy can be achieved by not giving any data, but.. Privacy tools: cryptography [LP00] – Encryption: data is hidden unless you have the decryption key. However, we also want to use the data. – Secure function evaluation: two or more parties with private inputs. Can compute any function they wish without revealing anything else. – • Strong theory. Starts to be relevant to real applications. Non-cryptographic tools [AS00] – Query restriction: prevent certain queries from being answered. – Data/Input/output perturbation: add errors to inputs – hide personal data while keeping aggregates accurate. (randomization, rounding, data swapping.) – Can these be understood as well as we understand Crypto? Provide the same level of security as Crypto? February 28, 2005 10th Estonian Winter School in Computer Science page 12 Introduction to Cryptography February 28, 2005 10th Estonian Winter School in Computer Science page 13 Why learn/use crypto to solve privacy issues? • Why are we referring to crypto? – Cryptography is one of the tools we can use for preserving privacy – A mature research area: – many useful results/tools – Can reflect on our thinking – how is “security” defined in cryptography? How should we define “privacy”? February 28, 2005 10th Estonian Winter School in Computer Science page 14 What is Cryptography? Traditionally: how to maintain secrecy in communication Alice and Bob talk while Eve tries to listen Bob Alice Eve February 28, 2005 10th Estonian Winter School in Computer Science page 15 History of Cryptography • Very ancient occupation Up to the mid 70’s - mostly classified military work – Exception: Shannon, Turing* • Since then - explosive growth – Commercial applications – Scientific work: tight relationship with Computational Complexity Theory – Major works: Diffie-Hellman, Rivest, Shamir and Adleman (RSA) • Recently - more involved models for more diverse tasks. • • Scope: How to maintain the secrecy, integrity and functionality in computer and communication system. February 28, 2005 10th Estonian Winter School in Computer Science page 16 Relation to computational hardness • • Cryptography uses problems that are infeasible to solve. Uses the intractability of some problems in order to construct secure systems. Feasible – computable in probabilistic polynomial time (PPT) – Infeasible – no probabilistic polynomial time algorithm – Usually average case hardness is needed – • For example, the discrete log problem February 28, 2005 10th Estonian Winter School in Computer Science page 17 The Discrete Log Problem • • • • Let G be a group and g an element in G. Given yG let x be minimal non-negative integer satisfying the equation y=gx. x is called the discrete log of y to base g. Example: y=gx mod p in the multiplicative group of Zp* (p is prime). (For example, p=7, g=3, y=4 x=4.) In general, it is easy to exponentiate – • (using repeated squaring and the binary representation of x) Computing the discrete log is believed to be hard in Zp* if p is large. (E.g., p is a prime, |p|>768 bits, p=2q+1 and q is also a prime.) February 28, 2005 10th Estonian Winter School in Computer Science page 18 Encryption • Alice wants to send a message m {0,1}n to Bob – – • Set-up phase is secret Symmetric encryption: Alice and Bob share a secret key k They want to prevent Eve from learning anything about the message Alice Ek(m) k k Bob Eve February 28, 2005 10th Estonian Winter School in Computer Science page 19 Public key encryption • • • • Alice generates a private/public key pair (SK,PK) Only Alice knows the secret key SK Everyone (even Eve) knows the public key PK, and can encrypt messages to Alice Only Alice can decrypt (using SK) EPK(m) Alice SK EPK(m) Charlie PK Eve February 28, 2005 Bob PK 10th Estonian Winter School in Computer Science page 20 Rigorous Specification of Security To define the security of a system we must specify: 1. What constitute a failure of the system 2. The power of the adversary – – – February 28, 2005 computational access to the system what it means to break the system. 10th Estonian Winter School in Computer Science page 21 What does `learn’ mean? • Even if Eve has some prior knowledge of m, she should not have any advantage in – • Ideally: the message sent is a independent of the message m – • Implies all the above Achievable: one-time pad (symmetric encryption) – – – – • Probability of guessing m, or probability of guessing whether m is m0 or m1, or prob. of computing any other function f of m ,or even computing |m| Let rR {0,1} n be the shared key. Let m {0,1} n To encrypt m send r m To decrypt z send m = z r Shannon: achievable only if the entropy of the shared secret is at least as large as that of m. Therefore must use long key . February 28, 2005 10th Estonian Winter School in Computer Science page 22 Defining security The power of the adversary – – – • Computational: Probabilistic polynomial time machine (PPTM) Access to the system: e.g. can it change messages? Passive adversary, (adaptive) chosen plaintext attack, chosen ciphertext attack… What constitutes a failure of the system? – – Recovering plaintext from ciphertext – not enough • Allows for the leakage of partial information • In general, hard to answer which partial information may/should not be leaked. Application dependent. • How would partial information the adversary already holds be combined with what he learns to affect privacy? Better: Prevent learning anything about an encrypted message • There are two common, equivalent, definitions… February 28, 2005 10th Estonian Winter School in Computer Science page 23 Security of Encryption: Definition 1 Indistinguishability of Encryptions Adversary A chooses any X0 , X1 0,1n • Receives encryption of Xb for bR0,1 • Has to decide whether b 0 or b 1. • For every PPTM A, choosing a pair X0 , X1 0,1n : | Pr A(E(X0))= ‘1’ - Pr A(E(Xb1)) ‘1’ | = neg(n) – • (Probability is over the choice of keys, randomization in the encryption and A‘s coins) Note that a proof of security must be rigorous February 28, 2005 10th Estonian Winter School in Computer Science page 24 Security of Encryption: Definition 2 Semantic Security Simulation: Whatever Adversary A can compute given an encryption of X 0,1n so can a `simulator’ S that does not get to see the encryption of X. A selects a distribution Dn on 0,1n and a relation R(X,Y) computable in PPT (e.g. R(X,Y)=1 iff Y is last bit of X). • XR Dn is sampled • Given E(X), A outputs Y trying to satisfy R(X,Y) • The simulator S does the same without access to E(X) • Simulation is successful if A and S have the same success probability • Successful simulation semantic security • February 28, 2005 10th Estonian Winter School in Computer Science page 26 Security of Encryption (2) Semantic Security More formally: For every PPTM A there is a PPTM S so that for all PPTM relations R for XR Dn Pr R(X,A(E(X)) - Pr R(X,S()) is negligible. In other words: The outputs of A and S are indistinguishable even for a test that is aware of X. February 28, 2005 10th Estonian Winter School in Computer Science page 27 Which is the Right Definition? • Semantic security seems to convey that the message is protected • But it is usually easier to prove indistinguishability of encryptions • Would like to argue that the two definitions are equivalent • Must define the attack: chosen plaintext attack – – • Adversary can obtain the encryption for any message it chooses, in an adaptive manner More severe attacks: chosen ciphertext The Equivalence Theorem: A cryptosystem is semantically secure if and only if it has the indistinguishability of encryptions property February 28, 2005 10th Estonian Winter School in Computer Science page 28 Equivalence Proof (informal) Semantic security Indistinguishability of encryptions • Suppose no indistinguishability: – A chooses a pair X0 , X10,1n for which it can distinguish encryptions with non-negligible advantage • Choose – – • • • • Distribution Dn = {X0 , X1 } Relation R which is “equality with X ” S that doesn’t get E(X), and outputs Y’ we have Prob[ R( X, Y’ ) ]= ½ Given E(Xb ), run A(E(Xb )), get output b{0,1}, set Y=Xb Now, | PrA(E(Xb))= ‘1’ b 1 - PrA(E(Xb)) ‘1’ b 0 | > Therefore, | PrR(X,Y) - PrR(E(X,Y’) | > / 2 February 28, 2005 10th Estonian Winter School in Computer Science page 29 Equivalence Proof (informal) Indistinguishability of encryptions Semantic security • • Suppose no semantic security: A chooses some distribution Dn and some relation R Choose X0, X1 R Dn , choose bR {0,1}, compute E(Xb). – Give E(Xb) to A, ask A to compute Yb = A(E(Xb)) • For X0 , X1 R Dn let – 0 = Prob[R(X0, Yb)], 1 = Prob[R(X1, Yb)] • With noticeable probability |0 - 1 | is non-negligible, since otherwise Yb can be computed without the encryption. If |0 - 1 | is non-negligible, then we can distinguish between an encryption of X0 and X1 • February 28, 2005 10th Estonian Winter School in Computer Science page 30 Lessons learned? • Rigorous approach to cryptography – – Defining security Proving security February 28, 2005 10th Estonian Winter School in Computer Science page 31 References Books: • O. Goldreich, Foundations of Cryptography Vol 1, Basic Tools, Cambridge, 2001 • Pseudo-randomness, zero-knowledge – Vol 2, Basic Applications (to be available May 2004) • Encryption, Secure Function Evaluation) – Other volumes in www.wisdom.weizmann.ac.il/~oded/books.html Web material/courses: • • S. Goldwasser and M. Bellare, Lecture Notes on Cryptography, http://www-cse.ucsd.edu/~mihir/papers/gb.html M. Naor, 9th EWSCS, http://www.cs.ioc.ee/yik/schools/win2004/naor.php February 28, 2005 10th Estonian Winter School in Computer Science page 32 Secure Function Evaluation • • A major topic of cryptographic research How to let n parties, P1,..,Pn compute a function f(x1,..,xn) – – Where input xi is known to party Pi Parties learn the final input and nothing else February 28, 2005 10th Estonian Winter School in Computer Science page 33 The Millionaires Problem [Yao] x Alice Bob y Whose value is greater? Leak no other information! February 28, 2005 10th Estonian Winter School in Computer Science page 34 Comparing Information without Leaking it x Alice • • Bob y Output: Is x=y? The following solution is insecure: – – Use a one-way hash function H() Alice publishes H(x), Bob publishes H(y) February 28, 2005 10th Estonian Winter School in Computer Science page 35 Secure two-party computation - definition Input: Output: As if… y x F(x,y) and nothing else y x Trusted third party F(x,y) February 28, 2005 10th Estonian Winter School in Computer Science F(x,y) page 36 Leak no other information • • A protocol is secure if it emulates the ideal solution Alice learns F(x,y), and therefore can compute everything that is implied by x, her prior knowledge of y, and F(x,y). Alice should not be able to compute anything else • Simulation: • – A protocol is considered secure if: For every adversary in the real world There exists a simulator in the ideal world, which outputs an indistinguishable ``transcript” , given access to the information that the adversary is allowed to learn February 28, 2005 10th Estonian Winter School in Computer Science page 37 More tomorrow… February 28, 2005 10th Estonian Winter School in Computer Science page 38