Download 0214_statisticalDB - Emory Math/CS Department

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

Concurrency control wikipedia , lookup

Entity–attribute–value model wikipedia , lookup

Microsoft SQL Server wikipedia , lookup

Database wikipedia , lookup

Extensible Storage Engine wikipedia , lookup

Microsoft Jet Database Engine wikipedia , lookup

Functional Database Model wikipedia , lookup

Clusterpoint wikipedia , lookup

Relational model wikipedia , lookup

Versant Object Database wikipedia , lookup

Database model wikipedia , lookup

Transcript
CS573 Data Privacy and Security
Statistical Databases
Li Xiong
Today
• Statistical databases
– Definitions
– Early query restriction methods
– Output perturbation and differential privacy
Statistical Data Release
Population count
city
Age City
Diagnosis
25
Lilburn
35
Decatur adult T-cell lymphoma
35
Lilburn
20
mantle cell lymphoma
adult T-cell lymphoma
Age
30
50
40
50
Diagnosis
• Release statistical summary of the data (vs. individual records)
• Useful for analysis and learning
• Medical statistics
• Query log statistics – frequent search terms
• Still need rigorous inference control
Statistical Database
• A statistical database is a database which
provides statistics on subsets of records
• Statistics may be performed to compute SUM,
MEAN, MEDIAN, COUNT, MAX AND MIN of
records
• Inference control to prevent inference from
statistics to individual records
Methods
 Data perturbation/anonymization
 Query restriction
 Output perturbation
Data Perturbation
User 1
Re
su
lts
Qu
e
ry
Noise Added
Original
Database
Perturbed
Database
Re
ue
Q
su
lts
ry
User 2
Query Resitrction
Query 2
Query 1
Original
Database
Query 1
Results
K
Query
Results
Query 2
Results
K
Query
Results
Output Perturbation
ry
e
u
Q
User 1
Query
Results
su
Re
lts
Noise Added
to Results
Original
Database
Re
Query
Q
ue
sul
ts
Results
ry
User 2
Methods
 Data perturbation/anonymization
 Query restriction
 Query set size control
 Query set overlap control
 Query auditing
 Output perturbation
Query Set Size Control
 A query-set size control limit the number of
records that must be in the result set
 Allows the query results to be displayed only if
the size of the query set |C| satisfies the
condition
K <= |C| <= L – K
where L is the size of the database and K is a
parameter that satisfies 0 <= K <= L/2
Query Set Size Control
Query 2
Query 1
Original
Database
Query 1
Results
K
Query
Results
Query 2
Results
K
Query
Results
Tracker
• Q1: Count ( Sex = Female ) = A
• Q2: Count ( Sex = Female OR
(Age = 42 & Sex = Male & Employer = ABC) ) = B
What if B = A+1?
Tracker
• Q1: Count ( Sex = Female ) = A
• Q2: Count ( Sex = Female OR
(Age = 42 & Sex = Male & Employer = ABC) ) = B
If B = A+1
• Q3: Count ( Sex = Female OR
(Age = 42 & Sex = Male & Employer = ABC) &
Diagnosis = Schizophrenia)
Positively or negatively compromised!
Query set size control
• If the threshold value k is large, then it will
restrict too many queries
– And still does not guarantee protection from
compromise
• The database can be easily compromised
within a frame of 4-5 queries
Query Set Overlap Control
• Basic idea: successive queries must be
checked against the number of common
records.
• If the number of common records in any
query exceeds a given threshold, the
requested statistic is not released.
• A query q(C) is only allowed if:
| q (C ) ^ q (D) | ≤ r, r > 0
Where r is set by the administrator
Query-set-overlap control
• Statistics for a set and its subset cannot be
released – limiting usefulness
• High processing overhead – every new query
compared with all previous ones
• Multiple users - need to keep user profile,
need to consider collusion between users
• Still no formal privacy guarantee
Auditing
• Keeping up-to-date logs of all queries made by
each user and check for possible compromise
when a new query is issued
• Excessive computation and storage
requirements
• Only “efficient” methods for special types of
queries
Audit Expert (Chin 1982)
• Query auditing method for SUM queries
• A SUM query can be considered as a linear equation
where
is whether record i belongs to the query set, xi is the
sensitive value, and q is the query result
• A set of SUM queries can be thought of as a system of linear
equations
• Maintains the binary matrix representing linearly independent
queries and update it when a new query is issued
• A row with all 0s except for ith column indicates disclosure
Audit Expert
• Only stores linearly independent queries
• Not all queries are linearly independent
Q1: Sum(Sex=M)
Q2: Sum(Sex=M AND Age>20)
Q3: Sum(Sex=M AND Age<=20)
Audit Expert
• O(L2) time complexity
• Further work reduced to O(L) time and space
when number of queries < L
• Only for SUM queries
Auditing – recent developments
• Online auditing
– “Detect and deny” queries that violate privacy
requirement
– Denial themselves may implicitly disclose sensitive
information
• Offline auditing
– Check if a privacy requirement has been violated
after the queries have been executed
– Not to prevent
Methods
 Data perturbation/anonymization
 Query restriction
 Output perturbation
 Differential privacy
Differential Privacy
• Differential privacy requires the outcome to be
formally indistinguishable when run with and
without any particular record in the data set
– E.g.: Q = select count() where Age = [20,30] and Diagnosis
=B
D1
Bob in
D2
Bob out
Output
Perturbation
A(D1)
Q
User
A(D2)
Differential Privacy
•
Differential privacy
• Laplace mechanism
Q(D) + Y where Y is drawn from
• Query sensitivity
D1
Bob in
D2
Bob out
A(D1) = Q(D1) + Y1
Differentially
Private
Interface
Q
User
A(D2) = Q(D2) + Y2
Composition of Differential Privacy
• Sequential composition [McSherry SIGMOD 09]
– Let Mi each provides
differential privacy. The sequence of
Mi provides
differential privacy
• Parallel composition
– If Di are disjoint subsets of the original database and Mi
provides differential privacy for each Di, then the sequence of
Mi provides differential privacy.
D1
Bob in
D2
Bob out
Differentially
Private
Interface
A1(D1), A2(D1), …
Q1,Q2, …
User
A1(D2), A2(D2), …
Differential Privacy
• Is unfettered access to raw data truly essential?
• Is released data sufficient (provide sufficient utility
guarantee)?
Privacy
mechanism
Raw
Data
Released
Data
User
count
Age
City
Diagnosis
25
Lilburn
mantle cell
lymphoma
35
Decatur
adult T-cell
lymphoma
35
Lilburn
adult T-cell
lymphoma
city
Age
Diagnosis
Challenges
• Differential privacy cost accumulates quickly
with number of queries
– Typical tasks require multiple queries or multiple
steps
– Need to support multiple users
• Impossible to guarantee utility for all (any)
data or all (any) applications
Possible Middle Ground
• Guaranteed utility for certain applications
– Counting queries, classification, logistic regression
• Guaranteed utility for certain kinds of data
– Use prior or domain knowledge about data
– Use intermediate results (differentially private)
Target
Applications
Prior or domain
knowledge
Raw
Data
Intermediate
Result
Privacy
mechanism
Released
Data
User
•
•
Our Research: Adaptive Differentially Private
Data Release
Data knowledge
• Dense and “smooth” data
• High dimensional and sparse data
• Dynamic data
Application knowledge
• Query workload
• Specific tasks
Histogram Example
?
Strategy I: Baseline Cell Partitioning
Q1: count() where Age = 20, Diagnosis = A
Q2: count() where Age = 20, Diagnosis = B
…
diagnosis
A
Age
B
20 50
10
30 50
90
DP
alpha
Diagnosis
A
Age
B
20 50’
10’
30 50’
90’
• Goal: to release a differentially private histogram to support random
predicate queries
• Q: select count() where Age = [20,30] and Income = 40K
• If a query predicate consists of multiple cells or partitions, it will have
aggregated perturbation error
Q
Strategy II: Hierarchical Partitioning
A
20
diagnosis
A
Age
B
200’
alpha/3
20 50
10
30 50
90
B
30
A
alpha/3
B
20
60’
30
140’
alpha/3
A
B
20 50’
10’
30 50’
90’
• Large perturbation error due to small divided privacy budget at each level
DPCube Strategy: Two phase
partitioning
diagnosis
A
Age
B
A
20 50
10
30 50
90
10’
20
Age
B
100’
30
90’
• If a query predicate is contained in a published partition, the answer has to
be estimated typically based on a uniform distribution assumption. This
introduces an approximation error.
DPCube Strategy: Two phase
partitioning
1. Cell Partitioning
diagnosis
A
20 50
Age
30 50
A
B
20 50’
10’
30 50’
90’
Cell
histogram
B
10
90
2. Multi-dimensional
Partitioning
A
B
20 50’
10’
30 50’
90’
A
B
10’
20
100’
30
90’
partition
histogram
Partitioning Algorithm
• Define a uniformity (randomness) measure for a partition
H(Dt)
– information gain, variance
• Recursive algorithm Partition(Dt) for a given partition Dt
• Find the best splitting point (e.g. largest information gain) and Partition
the data into Dt1 and Dt2
• Partition(Dt1) and Partition(Dt2)
Privacy and Utility of the Released
Histogram
• The released data satisfies -differential privacy
• Support for count queries and other OLAP queries and
learning tasks
• Formal utility results
– (epsilon,delta) - usefulness
• Experimental results for partition histogram
– CENSUS dataset, 1M tuples, 4 attributes: Age (79),
Education (14), Occupation (23), and Income (100)
– Report absolute error and relative error for
random count queries
DPCube Result Example
Original histogram
Diff. private partition histogram
Diff. Private Cell histogram
Diff. Private Estimated Cell histogram
Experimental Results: Comparison
with other partitioning strategies
• Higher alpha (lower privacy) results in lower error (higher utility)
• Kd tree based approach outperforms others
• Cell partitioning is comparable in absolute error but suffers in relative error
due to the sparsity of the data
High dimensional sparse data
• Many real-world data are high dimensional
and sparse
• Web search log data, web transactions, etc.
• A direct application of the 2-phase approach
– Cell histogram highly inaccurate
– Computationally not scalable
39
Top-down recursive partitioning
• Recursively partition the spaces that have sufficient density
• Use a context free taxonomy tree
• Dynamically allocate and keep track of the budget
Adaptive Hierarchical Strategy
Data is sparse and
Highly dimensional
1a. Overall count
1b. Partitioning of
non-sparse regions
2a. Partition count
2b. Partitioning of
non-sparse regions
n. Partition count
Today
• Statistical databases
– Definitions
– Early query restriction methods
– Output perturbation and differential privacy