Differential Privacy
Xintao Wu
Oct 31, 2012
Sanitization approaches
• Input perturbation
– Add noise to data
– Generalize data
• Summary statistics
– Means, variances
– Marginal totals
– Model parameters
• Output perturbation
– Add noise to summary statistics
Blending/hiding into a crowd
• K-anonymity based approaches
• Adversary may have various background
knowledge to breach privacy
• Privacy models often assume “the
adversary’s background knowledge is
given”
Classic intuition for privacy
• Privacy means that anything that can be
learned about a respondent from the
statistical database can be learned without
access to the database.
• Security of encryption
– Anything about the plaintext that can be
learned from a ciphertext can be learned
without the ciphertext.
• Prior and posterior views about an
individual should not change much
Motivation
• Publicly release statistical information
about a dataset without compromising the
privacy of any individual
Requirement
• Anything that can be learned about a
respondent from a statistical database should be
learnable without access to the database
• Reduce the knowledge gain of joining the
database
• Require that the probability distribution on the
public results is essentially the same,
independent of whether any individual opts in to,
or opts out of, the dataset
Definition
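For reference, the standard epsilon-differential-privacy definition that this slide refers to (the formula itself is not part of the transcript): a randomized mechanism M satisfies epsilon-differential privacy if, for every pair of datasets D_1 and D_2 differing in at most one record and for every set S of outputs,

Pr[M(D_1) in S] <= exp(epsilon) * Pr[M(D_2) in S]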
Sensitivity function
• Captures how great a difference must be
hidden by the additive noise
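Formally (standard notation, matching the Delta_f used in the slides that follow), the global sensitivity of a query f is

Delta_f = max_{D_1, D_2 differing in one record} || f(D_1) - f(D_2) ||_1

and it is this quantity that the additive noise must be scaled to.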
Laplace (LAP) distribution noise
Gaussian noise
Adding Laplace noise
Proof sketch
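A minimal Python sketch of the Laplace mechanism these slides describe; the function name and the example values are illustrative assumptions, not taken from the slides.

import numpy as np

def laplace_mechanism(true_answer, sensitivity, epsilon):
    # Calibrate the noise to the query: drawing from Lap(b) with
    # b = sensitivity / epsilon gives epsilon-differential privacy
    # for a query whose L1 sensitivity is `sensitivity`.
    scale = sensitivity / epsilon
    return true_answer + np.random.laplace(loc=0.0, scale=scale)

# Example: a counting query with hypothetical true answer 42 (sensitivity 1),
# answered under epsilon = 0.1, i.e. Laplace noise with scale 10 is added.
noisy_count = laplace_mechanism(true_answer=42, sensitivity=1.0, epsilon=0.1)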
Laplace noise distributions (plots): Delta_f = 1 with epsilon = 0.01, 0.1, 1, 2, 10; Delta_f = 2, 3, and 10000 with epsilon varying
Composition
• Sequential composition
• Parallel composition
--for disjoint sets, the ultimate privacy
guarantee depends only on the worst of
the guarantees of each analysis, not the
sum.
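As a worked statement of both properties (standard formulations; the slide spells out only the parallel case): if mechanisms M_1, ..., M_k are epsilon_1-, ..., epsilon_k-differentially private, then
– running all of them on the same dataset D satisfies (epsilon_1 + ... + epsilon_k)-differential privacy (sequential composition);
– running each M_i on a disjoint subset D_i of the data satisfies (max_i epsilon_i)-differential privacy (parallel composition).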
Example
• Let us assume a table with 1000 customers, where each
record has attributes: name, gender, city, cancer, salary.
– For attribute city, we assume the domain size is 10;
– for attribute cancer, we only record Yes or No for each customer;
– for attribute salary, the domain range is 0-10k;
– the privacy threshold epsilon is a constant 0.1 set by the data owner.
• For one single query: “How many customers got cancer?”
• The adversary is allowed to ask the query shown above three times.
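A worked sketch of this example in Python; the equal three-way budget split and the true count of 42 are illustrative assumptions, not stated on the slide.

import numpy as np

delta_f = 1.0   # a count changes by at most 1 when one customer is added or removed
epsilon = 0.1   # total privacy budget set by the data owner

# Single query: add Laplace noise with scale delta_f / epsilon = 10.
single_answer = 42 + np.random.laplace(scale=delta_f / epsilon)

# Three repetitions of the same query: by sequential composition the costs
# add up, so each answer may spend only epsilon / 3, i.e. noise scale 30.
repeated_answers = 42 + np.random.laplace(scale=delta_f / (epsilon / 3), size=3)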
Example (continued)
• “How many customers got cancer in each
city?”
• For one single query “What is the sum of
salaries across all customers?”
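A sketch continuing the example; the true counts and the salary sum are hypothetical values, while the noise scales follow from the sensitivities implied by the setup above.

import numpy as np

epsilon = 0.1

# "How many customers got cancer in each city?" is a histogram over the 10
# disjoint city groups, so parallel composition lets every cell use the full
# budget; each count has sensitivity 1 (Laplace scale 10).
true_city_counts = np.array([3, 7, 5, 2, 9, 4, 6, 1, 8, 5])   # hypothetical
noisy_city_counts = true_city_counts + np.random.laplace(scale=1.0 / epsilon, size=10)

# "What is the sum of salaries across all customers?": one customer changes
# the sum by at most the salary range 10k, so the Laplace scale is 10k / 0.1 = 100k.
true_salary_sum = 4_800_000                                    # hypothetical
noisy_salary_sum = true_salary_sum + np.random.laplace(scale=10_000 / epsilon)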
Type of computing (query)
• some are very sensitive, others are not
• single query vs. query sequence
• query on disjoint sets or not
• outcome expected: number vs. arbitrary
• interactive vs. not interactive
Sensitivity
• Global sensitivity
• Local sensitivity
• Smooth sensitivity
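For reference (standard definitions, not transcribed from the slides): global sensitivity is the Delta_f used above, a worst case over all pairs of neighboring datasets, while local sensitivity fixes the actual dataset D:

LS_f(D) = max_{D' differing from D in one record} || f(D) - f(D') ||_1

Calibrating noise directly to LS_f(D) can itself leak information about D; smooth sensitivity replaces it with a smoothed upper bound that still permits much less noise than the global worst case for many queries.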
Different areas of DP
• PINQ
• DM with DP
• Optimizing linear counting queries under
differential privacy.
– Matrix mechanism for answering a
workload of predicate counting queries
PPDM interface--PINQ
• A programmable privacy preserving layer
• Add calibrated noise to each query
• Need to assign privacy cost budget
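A minimal Python mock-up of the kind of budgeted query layer PINQ provides (PINQ itself is a C#/LINQ library; the class and method names here are illustrative, not its real API).

import numpy as np

class PrivateDataset:
    # Toy budgeted query layer in the spirit of PINQ: every query spends part
    # of a fixed privacy budget and is answered with calibrated Laplace noise.
    def __init__(self, data, total_budget):
        self.data = data
        self.remaining_budget = total_budget

    def noisy_count(self, predicate, epsilon):
        if epsilon > self.remaining_budget:
            raise ValueError("privacy budget exhausted")
        self.remaining_budget -= epsilon
        true_count = sum(1 for row in self.data if predicate(row))
        # counting queries have sensitivity 1, so the noise scale is 1 / epsilon
        return true_count + np.random.laplace(scale=1.0 / epsilon)

# Usage: spend 0.05 out of a total budget of 0.1 on one counting query.
db = PrivateDataset(data=[{"cancer": "Yes"}, {"cancer": "No"}], total_budget=0.1)
answer = db.noisy_count(lambda row: row["cancer"] == "Yes", epsilon=0.05)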
Data Mining with DP
• Previous study – the privacy-preserving interface
ensures everything about DP
• Problem – inferior results if the interface is
simply invoked as-is during data mining
• Solution – design the data-mining algorithm and
the privacy mechanism together
• DP ID3
– noisy counts
– evaluate all attributes in one exponential-mechanism
query using the entire budget, instead of splitting
the budget among multiple queries
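A minimal sketch of the exponential-mechanism attribute selection that this DP ID3 variant relies on; the attribute names, quality scores, and sensitivity value are illustrative assumptions.

import math
import random

def exponential_mechanism(candidates, quality, epsilon, sensitivity):
    # Exponential mechanism: pick a candidate with probability proportional
    # to exp(epsilon * quality(candidate) / (2 * sensitivity)).
    weights = [math.exp(epsilon * quality(c) / (2.0 * sensitivity)) for c in candidates]
    return random.choices(candidates, weights=weights, k=1)[0]

# DP ID3 style split selection: one exponential-mechanism query over all
# attributes at a node, spending that node's entire budget share at once,
# instead of a separate noisy count (and budget slice) per attribute.
attributes = ["gender", "city", "salary_bracket"]                    # hypothetical
info_gain = {"gender": 0.03, "city": 0.12, "salary_bracket": 0.07}   # hypothetical
chosen = exponential_mechanism(attributes, lambda a: info_gain[a],
                               epsilon=0.05, sensitivity=1.0)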
DP in Social Networks
• Pages 97-120 of the pakdd11 tutorial