Differential Privacy
Xintao Wu
Oct 31, 2012

Sanitization approaches
• Input perturbation
  – Add noise to data
  – Generalize data
• Summary statistics
  – Means, variances
  – Marginal totals
  – Model parameters
• Output perturbation
  – Add noise to summary statistics

Blending/hiding into a crowd
• k-anonymity based approaches
• The adversary may have various background knowledge with which to breach privacy
• Privacy models often assume "the adversary's background knowledge is given"

Classic intuition for privacy
• Privacy means that anything that can be learned about a respondent from the statistical database can be learned without access to the database.
• Security of encryption
  – Anything about the plaintext that can be learned from a ciphertext can be learned without the ciphertext.
• Prior and posterior views about an individual should not change much

Motivation
• Publicly release statistical information about a dataset without compromising the privacy of any individual

Requirement
• Anything that can be learned about a respondent from a statistical database should be learnable without access to the database
• Reduce the knowledge gained by joining the database
• Require that the probability distribution on the public results is essentially the same whether any individual opts in to, or opts out of, the dataset

Definition
• A randomized mechanism K gives \epsilon-differential privacy if, for all datasets D1 and D2 differing in at most one record and for all sets of outputs S, Pr[K(D1) \in S] <= exp(\epsilon) * Pr[K(D2) \in S]

Sensitivity function
• Captures how great a difference must be hidden by the additive noise
• Global sensitivity: \Delta f = max over neighboring datasets D1, D2 of ||f(D1) - f(D2)||_1

Laplace distribution noise
• Lap(b) has density p(z) = (1/(2b)) exp(-|z|/b)

Gaussian noise

Adding Laplace noise
• Answering f(D) + Lap(\Delta f / \epsilon) satisfies \epsilon-differential privacy

Proof sketch

[Plots: Laplace noise densities for \Delta f = 1, 2, 3, and 10000, with \epsilon varying over 0.01, 0.1, 1, 2, and 10]

Composition
• Sequential composition – the privacy costs of the individual analyses add up
• Parallel composition – for disjoint sets, the ultimate privacy guarantee depends only on the worst of the guarantees of each analysis, not the sum
• (See the composition sketch at the end of this section)

Example
• Assume a table with 1000 customers; each record has the attributes name, gender, city, cancer, and salary.
  – For attribute city, we assume the domain size is 10.
  – For attribute cancer, we only record Yes or No for each customer.
  – For attribute salary, the domain range is 0-10k.
  – The privacy threshold \epsilon is a constant 0.1 set by the data owner.
• Consider the single query "How many customers got cancer?" (see the Laplace-mechanism sketch at the end of this section)
• The adversary is allowed to ask the above query three times.

Example (continued)
• "How many customers got cancer in each city?"
• Consider the single query "What is the sum of salaries across all customers?"

Type of computation (query)
• Some are very sensitive, others are not
• Single query vs. query sequence
• Query on disjoint sets or not
• Outcome expected: number vs. arbitrary
• Interactive vs. non-interactive

Sensitivity
• Global sensitivity
• Local sensitivity
• Smooth sensitivity

Different areas of DP
• PINQ
• Data mining with DP
• Optimizing linear counting queries under differential privacy
  – Matrix mechanism for answering a workload of predicate counting queries

PPDM interface – PINQ
• A programmable privacy-preserving layer
• Adds calibrated noise to each query
• Needs an assigned privacy cost budget

Data Mining with DP
• Previous study – a privacy-preserving interface ensures everything about DP
• Problem – inferior results if the interface is used naively during data mining
• Solution – consider both together
• DP ID3
  – Noisy counts
  – Evaluate all attributes in one exponential-mechanism query using the entire budget, instead of splitting the budget among multiple queries

DP in Social Networks
• Pages 97-120 of the pakdd11 tutorial
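
A minimal sketch of the Laplace mechanism applied to the example queries above. The helper name laplace_mechanism and the "true" answers (42 cancer patients, a 5,000,000 salary total) are hypothetical illustrations, not values from the slides; only the sensitivities (1 for the count, 10000 for the sum over salaries in [0, 10k]) and the budget \epsilon = 0.1 come from the example.

```python
import numpy as np

def laplace_mechanism(true_value, sensitivity, epsilon, rng=None):
    """Return true_value perturbed with Lap(sensitivity / epsilon) noise."""
    rng = rng or np.random.default_rng()
    scale = sensitivity / epsilon
    return true_value + rng.laplace(loc=0.0, scale=scale)

# Running example: 1000 customers, privacy threshold epsilon = 0.1.
# "How many customers got cancer?" is a counting query, so adding or
# removing one customer changes the answer by at most 1: Delta_f = 1.
true_cancer_count = 42            # hypothetical true answer
noisy_count = laplace_mechanism(true_cancer_count, sensitivity=1, epsilon=0.1)

# "What is the sum of salaries across all customers?" Salaries lie in
# [0, 10k], so one customer can shift the sum by at most 10000:
# Delta_f = 10000, which forces much larger noise at the same epsilon.
true_salary_sum = 5_000_000       # hypothetical true answer
noisy_sum = laplace_mechanism(true_salary_sum, sensitivity=10_000, epsilon=0.1)

print(noisy_count, noisy_sum)
```

The contrast between the two calls is the point of the example: at a fixed \epsilon, the counting query gets noise of scale 10 while the salary sum gets noise of scale 100000, so high-sensitivity queries pay for privacy with much lower accuracy.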
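
A sketch of how sequential and parallel composition play out on the example queries, under the same Laplace-mechanism assumptions as above. Splitting the budget evenly as epsilon_total / 3 for the three repetitions, and the per-city counts, are illustrative assumptions rather than prescriptions from the slides.

```python
import numpy as np

rng = np.random.default_rng()

def noisy_count(true_count, epsilon, sensitivity=1):
    # Counting query answered with the Laplace mechanism.
    return true_count + rng.laplace(scale=sensitivity / epsilon)

epsilon_total = 0.1   # overall budget set by the data owner in the example

# Sequential composition: the adversary asks "How many customers got
# cancer?" three times over the SAME data. The epsilons add up, so each
# repetition may only spend a share of the total budget.
per_query_eps = epsilon_total / 3
repeated_answers = [noisy_count(42, per_query_eps) for _ in range(3)]  # 42 is hypothetical

# Parallel composition: "How many customers got cancer in each city?"
# The 10 city subsets are disjoint, so each per-city count can spend the
# full epsilon_total and the overall guarantee is still epsilon_total
# (the worst of the per-subset guarantees, not their sum).
city_counts = [3, 7, 2, 5, 4, 6, 1, 8, 3, 3]   # hypothetical per-city truths
noisy_city_counts = [noisy_count(c, epsilon_total) for c in city_counts]

print(repeated_answers, noisy_city_counts)
```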