Hippocratic
Data Management
Rakesh Agrawal
IBM Almaden Research Center
Thesis

We need information systems that
– respect the privacy of data they manage
AND
– do not impede the useful flow of information.

It is feasible to reconcile the apparent
contradiction
Outline

Why Privacy in Data Systems
 Some Technology Directions
 Some Challenging Problems
Drivers for Privacy

Privacy Surveys:
– 17% privacy fundamentalists, 56% pragmatic majority, 27%
marginally concerned (Understanding net users' attitude about
online privacy, April 99)
– 83% would stop doing business with a company if it misused
customer information (Privacy on and off the Internet: What
consumers want, Nov. 2001)

Govt. legislation & guidelines:
– Fair Information Practices Act (US, 1974)
– OECD Guidelines (Europe, 1980)
– Canadian Standards Association’s Model Code (1995)
– Australian Privacy Amendment (2000)
– Japan: proposed legislation (2003)
– HIPAA, GLB, Recent U.S. Federal & State Initiatives
Privacy Violations

Accidents:
– Kaiser, GlobalHealthtrax

Lax security:
– Massachusetts govt.

Ethically questionable behavior:
– Lotus & Equifax, Lexis-Nexis, Medical Marketing
Service, Boston University, CVS & Giant Food

Illegal:
– Toysmart
Assertion

Enterprises lack tools and technologies for
managing private data and enforcing
privacy policies.
Founding Tenets of Current
Database Systems
Ullman, “Principles of Database and
Knowledgebase Systems”
 Fundamental:

– Manage persistent data.
– Access a large amount of data efficiently.

Desirable:
– Support for data model, high-level languages,
transaction management, access control, and
resiliency.

Similar list in other database textbooks.
Statistical & Secure Databases

Statistical Databases
– Provide statistical information (sum, count, etc.)
without compromising sensitive information about
individuals, [AW89]

Multilevel Secure Databases
– Multilevel relations, e.g., records tagged “secret”,
“confidential”, or “unclassified”, e.g. [JS91]

Need to protect privacy in transactional databases
that support daily operations.
– Cannot restrict queries to statistical queries.
– Cannot tag all the records “top secret”.
Our Research Directions

Privacy Preserving Data Mining
 Hippocratic Databases
Data Mining and Privacy
The primary task in data mining:
development of models about
aggregated data.
 Can we develop accurate models
without access to precise information
in individual data records?
R. Agrawal, R. Srikant. Privacy Preserving Data Mining.
ACM Int’l Conf. On Management of Data (SIGMOD), May 2000.
Privacy Preserving Data Mining
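[Figure: original records (30 | 25K | …) and (50 | 40K | …) each pass through a Randomizer, giving randomized records (65 | 50K | …) and (35 | 60K | …). The Age and Salary distributions are reconstructed from the randomized records and passed to the Data Mining Algorithm, which produces the Model.]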
Reconstruction Problem

Original values x1, x2, ..., xn
– from probability distribution X

To hide these values, we use y1, y2, ..., yn
– from probability distribution Y

Given
– x1+y1, x2+y2, ..., xn+yn
– the probability distribution of Y
Estimate the probability distribution of X.
Intuition (Reconstruct single
point)

Use Bayes' rule for density functions
[Figure: Age axis from 10 to 90 with a randomized value V marked. Curves: "Original distribution for Age" and "Probabilistic estimate of original value of V".]
Reconstruction: Intuition

Combine estimates of where a point came
from for all the points:
– yields estimate of original distribution.
[Figure: estimated distribution over Age, axis from 10 to 90.]
Reconstruction Algorithm
$f_X^0$ := Uniform distribution
j := 0
repeat
    $f_X^{j+1}(a) := \frac{1}{n} \sum_{i=1}^{n} \frac{f_Y((x_i + y_i) - a)\, f_X^j(a)}{\int_{-\infty}^{\infty} f_Y((x_i + y_i) - u)\, f_X^j(u)\, du}$    (Bayes’ Rule)
    j := j+1
until (stopping criterion met)

Converges to maximum likelihood estimate.
– D. Agrawal & C.C. Aggarwal, PODS 2001.
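A minimal sketch of the iteration above, under several assumptions that are not on the slide: the density is discretized on an equally spaced grid, the integral is replaced by a Riemann sum, and a fixed iteration count stands in for the unspecified stopping criterion. The names `reconstruct_density`, `gaussian_pdf`, and the synthetic two-cluster age data are illustrative only.

```python
import numpy as np


def reconstruct_density(w, noise_pdf, grid, iterations=50):
    """Estimate f_X on an equally spaced grid from observations w_i = x_i + y_i."""
    w = np.asarray(w, dtype=float)
    grid = np.asarray(grid, dtype=float)
    step = grid[1] - grid[0]
    # f_X^0 := uniform density over the range covered by the grid
    f_x = np.full(grid.size, 1.0 / (grid.size * step))
    # likelihood[i, k] = f_Y(w_i - a_k); it stays fixed across iterations
    likelihood = noise_pdf(w[:, None] - grid[None, :])
    for _ in range(iterations):  # assumption: fixed count as stopping criterion
        numer = likelihood * f_x[None, :]                # f_Y(w_i - a) * f_X^j(a)
        denom = numer.sum(axis=1, keepdims=True) * step  # Riemann sum for the integral
        f_x = (numer / denom).mean(axis=0)               # average of the n posteriors
    return f_x


def gaussian_pdf(sigma):
    """Density of zero-mean Gaussian noise Y with standard deviation sigma."""
    return lambda v: np.exp(-0.5 * (v / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))


# Synthetic illustration: two age clusters, hidden by Gaussian noise.
rng = np.random.default_rng(0)
ages = np.concatenate([rng.normal(30, 3, 2000), rng.normal(70, 3, 2000)])
noisy = ages + rng.normal(0, 10, ages.size)
grid = np.linspace(0.0, 100.0, 101)
density = reconstruct_density(noisy, gaussian_pdf(10.0), grid)
```

Only `noisy` and the noise density enter the reconstruction; the original `ages` are never passed to `reconstruct_density`.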
Works Well
[Figure: Number of People (0 to 1200) vs. Age (20 to 60) for the Original, Randomized, and Reconstructed distributions.]
Classification

Naïve Bayes
– Assumes independence between attributes.

Decision Tree
– Correlations are weakened by randomization.
Experimental Methodology

Compare accuracy against
– Original: unperturbed data without randomization.
– Randomized: perturbed data but without making any
corrections for randomization.

Test data not randomized.
 Synthetic data benchmark from [AGI+92].
 Training set of 100,000 records, split equally
between the two classes.
Decision Tree Experiments
[Figure: 100% Randomization Level. Accuracy (50 to 100) on Fn 1 through Fn 5 for Original, Randomized, and Reconstructed.]
Accuracy vs. Randomization
[Figure: Fn 3. Accuracy (40 to 100) vs. Randomization Level (10 to 200) for Original, Randomized, and Reconstructed.]
So far…

Question: Can we develop accurate models
without access to precise information in individual
data records?

Answer: yes, by randomization.
– for numerical attributes, classification

How about Association Rules?
Associations Recap

A transaction t is a set of items (e.g. books)
All transactions form a set T of transactions
Any itemset A has support s in T if
    $s = \mathrm{supp}(A) = \frac{\#\{t \in T \mid A \subseteq t\}}{|T|}$

Itemset A is frequent if $s \ge s_{min}$

Task: Find all frequent itemsets
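The support definition can be illustrated with a short sketch. The function names and the toy transactions are assumptions for illustration; real miners avoid rescanning T for every itemset.

```python
def support(itemset, transactions):
    """Fraction of transactions t in T with A a subset of t."""
    itemset = frozenset(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)


def is_frequent(itemset, transactions, s_min):
    """Itemset A is frequent if its support is at least s_min."""
    return support(itemset, transactions) >= s_min


# Toy data, purely illustrative.
T = [{"a", "b"}, {"a"}, {"b", "c"}, {"a", "b", "c"}]
```

For example, `support({"a", "b"}, T)` counts the two transactions that contain both items out of four.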
The Problem

How to randomize transactions so that
– we can find frequent itemsets
– while preserving privacy at transaction level?
A. Evfimievski, R. Srikant, R. Agrawal, J. Gehrke.
Mining Association Rules Over Privacy Preserving Data.
8th Int'l Conf. on Knowledge Discovery in Databases and Data Mining (KDD), July 2002.
Randomization Overview
[Figure, without randomization: Alice (J.S. Bach, painting, nasa.gov, …), Bob (B. Spears, baseball, cnn.com, …), and Chris (B. Marley, camping, linux.org, …) send their transactions to the Recommendation Service, which mines Associations and returns Recommendations.]
[Figure, with randomization: each transaction is randomized before it reaches the Recommendation Service. Alice's arrives as (Metallica, painting, nasa.gov, …), Bob's as (B. Spears, soccer, bbc.co.uk, …), Chris's as (B. Marley, camping, ibm.com, …). The service performs Support Recovery, then mines Associations and returns Recommendations.]
Uniform Randomization

Given a transaction,
– keep item with, say 20% probability,
– replace with a new random item with 80% probability.
Example: {x, y, z}
10 M transactions of size 10 with 10 K items:
– 1% have {x, y, z}: all three kept with probability 0.2^3, i.e. 0.008% of all transactions (800 ts.), or 97.8% of the randomized transactions containing {x, y, z}
– 5% have {x, y}, {x, z}, or {y, z} only: 0.2^2 • 8/10,000, i.e. 0.00016% (16 trans.), or 1.9%
– 94% have one or zero items of {x, y, z}: at most 0.2 • (9/10,000)^2, i.e. less than 0.00002% (2 transactions), or 0.3%
Privacy Breach: Given {x, y, z} in the randomized transaction,
we have about 98% certainty of {x, y, z} in the original one
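The arithmetic behind this example can be checked directly. The sketch below applies Bayes' rule to the three groups of original transactions using the figures from the slide; the function name and parameter names are illustrative assumptions.

```python
def breach_probability(keep=0.2, p_all=0.01, p_two=0.05, p_rest=0.94):
    """Pr[{x, y, z} in original | {x, y, z} in randomized] for the slide's example."""
    # Original has all three items: each must survive randomization.
    from_all = p_all * keep**3
    # Original has exactly two: both survive, the third arrives as a random item.
    from_two = p_two * keep**2 * (8 / 10_000)
    # Original has one or none: upper bound used on the slide.
    from_rest = p_rest * keep * (9 / 10_000) ** 2
    return from_all / (from_all + from_two + from_rest)


posterior = breach_probability()
```

With the slide's numbers the posterior comes out at roughly 0.98, which is the "about 98% certainty" quoted above.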
Privacy Breach

Suppose:
– t is an original transaction;
– t’ is the corresponding randomized transaction;
– A is a (frequent) itemset.

Definition: Itemset A causes a privacy
breach of level $\rho$ if, for some item $z \in A$,
    $\Pr[z \in t \mid A \subseteq t'] \ge \rho$
Our Solution
“Where does a wise man hide a leaf? In the forest.
But what does he do if there is no forest?”
“He grows a forest to hide it in.”
G.K. Chesterton


Insert many false items into each transaction
Hide true itemsets among false ones
Can we still find frequent itemsets while having sufficient
privacy?
Cut and Paste Randomization

Given transaction t of size m, construct t’:
– Choose a number j between 0 and K_m (cutoff);
– Include j items of t into t’;
– Each other item is included into t’ with probability p_m.
The choice of K_m and p_m is based on the desired level of privacy.
Example (j = 4):
t = a, b, c, u, v, w, x, y, z
t’ = b, v, x, z, œ, å, ß, ξ, ψ, €, א, ъ, ђ, …
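The steps above can be sketched as follows. Two details are assumptions, since the slide does not state them: j is drawn uniformly at random and capped at the transaction size, and "each other item" is read as every item of the universe that was not among the j chosen ones. The function and parameter names are illustrative.

```python
import random


def cut_and_paste(t, universe, k_m, p_m, rng=None):
    """Randomize transaction t over the given item universe."""
    if rng is None:
        rng = random.Random()
    items = list(t)
    # Choose the cutoff j between 0 and K_m (assumed uniform, capped at |t|).
    j = min(rng.randint(0, k_m), len(items))
    # Include j items of t, picked at random without replacement.
    kept = set(rng.sample(items, j))
    randomized = set(kept)
    # Every other item enters t' independently with probability p_m.
    for item in universe:
        if item not in kept and rng.random() < p_m:
            randomized.add(item)
    return randomized


# Illustrative universe and transaction.
universe = list("abcdefghijklmnopqrstuvwxyz")
t = set("abcuvwxyz")
t_prime = cut_and_paste(t, universe, k_m=4, p_m=0.3, rng=random.Random(1))
```

A larger `p_m` hides the true items among more false ones, at the cost of noisier support estimates.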
Partial Supports
To recover original support of an itemset, we need randomized
supports of its subsets.
Given an itemset A of size k and transaction size m,
a vector of partial supports of A is
    $\vec{s} = (s_0, s_1, \ldots, s_k)^T$, where $s_l = \frac{1}{|T|} \cdot \#\{t \in T \mid \#(t \cap A) = l\}$
– Here $s_k$ is the same as the support of A.
– Randomized partial supports are denoted by $\vec{s}'$.
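A small sketch of the partial-support vector defined above; the function name and the toy transactions are assumptions for illustration.

```python
from collections import Counter


def partial_supports(itemset, transactions):
    """Return [s_0, ..., s_k], where s_l is the fraction of t with #(t ∩ A) = l."""
    itemset = set(itemset)
    k = len(itemset)
    overlap_counts = Counter(len(itemset & set(t)) for t in transactions)
    return [overlap_counts[l] / len(transactions) for l in range(k + 1)]


# Toy data, purely illustrative.
T = [{"a", "b"}, {"a"}, {"b", "c"}, {"a", "b", "c"}]
s = partial_supports({"a", "b"}, T)
```

The last entry of the returned list is the ordinary support of the itemset, and the entries sum to 1.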
Transition Matrix



Let k = |A|, m = |t|.
Transition matrix P = P(k, m) connects randomized
partial supports with original ones:
    $E[\vec{s}'] = P \cdot \vec{s}$, where $P_{l', l} = \Pr[\#(t' \cap A) = l' \mid \#(t \cap A) = l]$
Randomized supports are distributed as a sum of
multinomial distributions.
The Unbiased Estimators

Given randomized partial supports, we can estimate original
partial supports:
    $\vec{s}_{est} = Q \cdot \vec{s}'$, where $Q = P^{-1}$

Covariance matrix for this estimator:
    $\mathrm{Cov}\, \vec{s}_{est} = \frac{1}{|T|} \sum_{l=0}^{k} s_l \cdot Q\, D[l]\, Q^T$, where $D[l]_{i,j} = P_{i,l} \cdot \delta_{i=j} - P_{i,l} \cdot P_{j,l}$

To estimate it, substitute $s_l$ with $(s_{est})_l$.
– Special case: estimators for support and its variance
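To illustrate the estimator, the sketch below builds a transition matrix for a deliberately simplified randomization and inverts it. This is an assumption made for brevity and is not the cut-and-paste operator: here each item of A present in t is kept with probability `keep`, and each item of A absent from t is inserted with probability `insert`, independently. All names are illustrative.

```python
from math import comb

import numpy as np


def binomial_pmf(n, j, p):
    """Probability of j successes in n independent trials with success rate p."""
    if j < 0 or j > n:
        return 0.0
    return comb(n, j) * p**j * (1.0 - p) ** (n - j)


def transition_matrix(k, keep, insert):
    """P[l_new, l] = Pr[#(t' ∩ A) = l_new | #(t ∩ A) = l] for the toy scheme."""
    P = np.zeros((k + 1, k + 1))
    for l in range(k + 1):
        for l_new in range(k + 1):
            # j of the l present items survive; l_new - j of the k - l absent ones appear.
            P[l_new, l] = sum(
                binomial_pmf(l, j, keep) * binomial_pmf(k - l, l_new - j, insert)
                for j in range(l + 1)
            )
    return P


def estimate_partial_supports(P, randomized_supports):
    """s_est = Q s' with Q = P^{-1}, computed by solving P s_est = s'."""
    return np.linalg.solve(P, np.asarray(randomized_supports, dtype=float))


P = transition_matrix(k=2, keep=0.8, insert=0.1)
original = np.array([0.90, 0.08, 0.02])
expected_randomized = P @ original            # E[s'] = P s
recovered = estimate_partial_supports(P, expected_randomized)
```

On observed data, s' fluctuates around its expectation, so the recovered vector is only an estimate; the covariance formula above quantifies that spread.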
Privacy Breach Analysis

How many added items are enough to protect privacy?
– Have to satisfy $\Pr[z \in t \mid A \subseteq t'] < \rho$ (i.e., no privacy breaches)
– Select parameters so that it holds for all itemsets.
– Use formula ($s_l^+ = \Pr[\#(t \cap A) = l,\ z \in t]$, $s_0^+ = 0$):
    $\Pr[z \in t \mid A \subseteq t'] = \sum_{l=0}^{k} s_l^+ \cdot P_{k,l} \Big/ \sum_{l=0}^{k} s_l \cdot P_{k,l}$
Parameters are to be selected in advance!
– Construct a privacy-challenging test: an itemset all of whose subsets
have maximum possible support.
– Enough to know maximal support of an itemset for each size.
Lowest Discoverable Support
LDS is the support s.t., when predicted, it is $4\sigma$ away from zero.
Roughly, LDS is proportional to $1 / \sqrt{|T|}$
[Figure: LDS vs. number of transactions, |t| = 5, $\rho$ = 50%. LDS, % (0 to 1.2) against number of transactions in millions (1 to 100) for 1-itemsets, 2-itemsets, and 3-itemsets.]
LDS vs. Breach Level
[Figure: |t| = 5, |T| = 5 M. LDS, % (0 to 2.5) against Privacy Breach Level, % (30 to 90) for 1-itemsets, 2-itemsets, and 3-itemsets.]

Reminder: breach level is the limit on $\Pr[z \in t \mid A \subseteq t']$
Real Datasets: soccer, mailorder

Soccer is the clickstream log of WorldCup’98 web
site, split into sessions of HTML requests.
– 11 K items (HTMLs), 6.5 M transactions

Mailorder is a purchase dataset from a certain online store
– Products are replaced with their categories
– 96 items (categories), 2.9 M transactions
Results
Breach level = 50%.

Soccer: $s_{min}$ = 0.2%, $\sigma \approx$ 0.07% for 3-itemsets
Itemset Size | True Itemsets | True Positives | False Drops | False Positives
1 | 266 | 254 | 12 | 31
2 | 217 | 195 | 22 | 45
3 | 48 | 43 | 5 | 26

Mailorder: $s_{min}$ = 0.2%, $\sigma \approx$ 0.05% for 3-itemsets
Itemset Size | True Itemsets | True Positives | False Drops | False Positives
1 | 65 | 65 | 0 | 0
2 | 228 | 212 | 16 | 28
3 | 22 | 18 | 4 | 5
Summary


Can have our cake and mine it too!
Randomization is an interesting approach for building data
mining models while preserving user privacy!!!
Y. Lindell, B. Pinkas. Privacy Preserving Data Mining. Crypto 2000.
S. Rizvi, J. Haritsa, “Privacy-Preserving Association Rule Mining”, VLDB 2002
J. Vaidya, C.W. Clifton. Privacy Preserving Association Rule Mining in
Vertically Partitioned Data. KDD 2002.
The Hippocratic Oath
“What I may see or hear in the course of treatment
or even outside of the treatment in regard to the
life of men, which on no account [ought to be]
spread abroad, I will keep to myself, holding such
things shameful to be spoken about.”
– Hippocratic Oath, 8 (circa 400 BC)
Hippocratic Databases
Founding tenet:
Responsibility for the privacy of data they
manage.
R. Agrawal, J. Kiernan, R. Srikant, Y. Xu
Hippocratic Databases
28th Int'l Conf. on Very Large Databases (VLDB), August 2002.
Approach


Derive founding principles from current
privacy legislation.
Strawman Design
Ten Principles of Hippocratic
Databases

Collection Group
– Purpose Specification, Consent, Limited
Collection

Use Group
– Limited Use, Limited Disclosure, Limited
Retention, Accuracy

Security & Openness Group
– Safety, Openness, Compliance
Collection Group
1.
Purpose Specification
– For personal information stored in the database, the
purposes for which the information has been collected
shall be associated with that information.
2.
Consent
– The purposes associated with personal information
shall have consent of the donor (person whose
information is being stored).
3.
Limited Collection
– The information collected shall be limited to the
minimum necessary for accomplishing the specified
purposes.
Use Group
4.
Limited Use
– The database shall run only those queries that
are consistent with the purposes for which the
information has been collected.
5.
Limited Disclosure
– Personal information shall not be
communicated outside the database for
purposes other than those for which there is
consent from the donor of the information.
Use Group (2)
6.
Limited Retention
– Personal information shall be retained only as
long as necessary for the fulfillment of the
purposes for which it has been collected.
7.
Accuracy
– Personal information stored in the database
shall be accurate and up-to-date.
Security & Openness Group
8.
Safety
– Personal information shall be protected by security
safeguards against theft and other misappropriations.
9.
Openness
– A donor shall be able to access all information about
the donor stored in the database.
10.
Compliance
– A donor shall be able to verify compliance with the
above principles. Similarly, the database shall be able
to address a challenge concerning compliance.
Strawman Architecture
[Figure: architecture overview with four component groups (Privacy Policy, Data Collection, Queries, Other) above the Store.]
Architecture: Policy
Privacy Metadata Creator
– Converts privacy policy into privacy metadata tables.
– For each purpose & piece of information (attribute):
  • External recipients
  • Retention period
  • Authorized users
– Different designs possible.
Principles addressed: Limited Disclosure, Limited Retention.
[Figure: components Privacy Policy, Privacy Metadata Creator, Privacy Metadata, Store.]
Privacy Policies Table
Purpose | Table | Attribute | External-recipients | Authorized-users | Retention
purchase | customer | name | {delivery, credit-card} | {shipping, charge} | 1 month
purchase | customer | email | empty | {shipping} | 1 month
register | customer | name | empty | {registration} | 3 years
register | customer | email | empty | {registration} | 3 years
recommendations | order | book | empty | {mining} | 10 years
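One way to picture the table is as a lookup keyed by (purpose, table, attribute). The sketch below encodes the rows shown on the slide; the `PolicyRule` type and the two helper functions are hypothetical names, and the slide itself notes that different designs are possible.

```python
from typing import NamedTuple


class PolicyRule(NamedTuple):
    external_recipients: frozenset
    authorized_users: frozenset
    retention: str


# Rows of the privacy-policies table, keyed by (purpose, table, attribute).
PRIVACY_POLICIES = {
    ("purchase", "customer", "name"): PolicyRule(
        frozenset({"delivery", "credit-card"}), frozenset({"shipping", "charge"}), "1 month"
    ),
    ("purchase", "customer", "email"): PolicyRule(
        frozenset(), frozenset({"shipping"}), "1 month"
    ),
    ("register", "customer", "name"): PolicyRule(
        frozenset(), frozenset({"registration"}), "3 years"
    ),
    ("register", "customer", "email"): PolicyRule(
        frozenset(), frozenset({"registration"}), "3 years"
    ),
    ("recommendations", "order", "book"): PolicyRule(
        frozenset(), frozenset({"mining"}), "10 years"
    ),
}


def is_authorized(purpose, table, attribute, user):
    """True if the policy lists this user for the purpose and attribute."""
    rule = PRIVACY_POLICIES.get((purpose, table, attribute))
    return rule is not None and user in rule.authorized_users


def may_disclose(purpose, table, attribute, recipient):
    """True if the policy lists this external recipient."""
    rule = PRIVACY_POLICIES.get((purpose, table, attribute))
    return rule is not None and recipient in rule.external_recipients
```

Anything without a matching row is denied by default, which mirrors the "empty" entries in the table.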
Architecture: Data Collection
Privacy Constraint Validator
– Privacy policy compatible with user’s privacy preference? (Consent)
Audit Info, Audit Trail
– Audit trail for compliance. (Compliance)
Data Accuracy Analyzer
– Data cleansing, e.g., errors in address. (Accuracy)
Record Access Control
– Associate set of purposes with each record. (Purpose Specification)
[Figure: components Data Collection, Privacy Constraint Validator, Data Accuracy Analyzer, Audit Info, Privacy Metadata, Audit Trail, Store, Record Access Control.]
Architecture: Queries
1. Telemarketing cannot issue query tagged “charge”.
2. Query tagged “telemarketing” cannot see credit card info.
3. Telemarketing query only sees records that include “telemarketing” in set of purposes.
Principles addressed: Safety, Limited Use.
[Figure: components Queries, Attribute Access Control, Privacy Metadata, Store, Record Access Control.]
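The three checks on this slide can be sketched as a single gate in front of query execution. Everything named here is hypothetical: the user and attribute metadata, the record layout with a `purposes` field, and the `run_query` function are assumptions for illustration, not the prototype's design.

```python
# Hypothetical privacy metadata.
USER_PURPOSES = {"telemarketer": {"telemarketing"}, "billing": {"charge"}}
ATTRIBUTE_PURPOSES = {
    "phone": {"telemarketing", "charge"},
    "credit_card": {"charge"},
}


def run_query(user, purpose, attributes, records):
    """Return the requested attributes of the records the purpose may see."""
    # 1. The user may only issue queries tagged with purposes granted to that user.
    if purpose not in USER_PURPOSES.get(user, set()):
        raise PermissionError(f"{user} may not issue queries tagged {purpose!r}")
    # 2. Attribute access control: the purpose must cover every requested attribute.
    for attribute in attributes:
        if purpose not in ATTRIBUTE_PURPOSES.get(attribute, set()):
            raise PermissionError(f"purpose {purpose!r} may not read {attribute!r}")
    # 3. Record access control: only records whose purpose set includes the tag.
    return [
        {attribute: record[attribute] for attribute in attributes}
        for record in records
        if purpose in record["purposes"]
    ]


# Illustrative records with made-up values.
RECORDS = [
    {"phone": "555-0100", "credit_card": "cc-1", "purposes": {"telemarketing", "charge"}},
    {"phone": "555-0101", "credit_card": "cc-2", "purposes": {"charge"}},
]
visible = run_query("telemarketer", "telemarketing", ["phone"], RECORDS)
```

Checks 1 and 2 reject the query outright, while check 3 silently narrows the result set.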
Architecture: Queries
Query Intrusion Detector
– Example: telemarketing query that asks for all phone numbers.
Audit Info, Audit Trail
• Compliance
• Training data for query intrusion detector
Principles addressed: Safety, Compliance.
[Figure: components Queries, Attribute Access Control, Query Intrusion Detector, Privacy Metadata, Audit Info, Audit Trail, Store, Record Access Control.]
Architecture: Other
Data Collection Analyzer
– Analyze queries to identify unnecessary collection, retention & authorizations.
Data Retention Manager
– Delete items in accordance with privacy policy.
Encryption Support
– Additional security for sensitive data.
Principles addressed: Limited Collection, Limited Retention, Safety.
[Figure: components Other, Privacy Metadata, Data Collection Analyzer, Data Retention Manager, Store, Encryption Support.]
Strawman Architecture
Privacy Policy: Privacy Metadata Creator
Data Collection: Privacy Constraint Validator, Data Accuracy Analyzer
Queries: Attribute Access Control, Query Intrusion Detector
Other: Data Collection Analyzer, Data Retention Manager
Shared: Privacy Metadata, Audit Info, Audit Trail
Store: Record Access Control, Encryption Support
Status

Prototyping core functionality of the design
 Nibbling at some of the open problems (see
VLDB-2002 paper)
Privacy-Preserving Synthetic
Datasets for Data Mining Research

How to randomize to be able to build multiple types of models
How to handle combination of data types
How to handle rare events
[Figure: data sources (Transactions, Communications, Demographic, Govt Records (State, Local, Birth, Marriage), Credit Agencies) pass through Randomize to produce Synthetic Data.]
Network is the Database

What if private data never leaves a person’s data store?
– Computations travel to data
[Figure: two panels. Labels in the first: Jane’s Data, Credit Application, Decision. Labels in the second: Jane’s Data, Approval Function, Result.]
Decision-Making Across Private Data Repositories


Separate databases due to
statutory, competitive, or security
reasons.
 Selective, minimal sharing on
need-to-know basis.
Example: Among those who took
a particular drug, how many had an
adverse reaction and have DNA that
contains a specific sequence?
 Researchers must not learn
anything beyond counts.
Minimal Necessary Sharing
R = {a, u, v, x}, S = {b, u, v, y}
$R \cap S$ = {u, v}
– R must not know that S has b & y
– S must not know that R has a & x
Count ($R \cap S$)
– R & S do not learn anything except that the result is 2.
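The slide states the requirement, not a protocol. One technique that fits it is commutative encryption, shown below as a toy single-process simulation; this choice, the group modulus, and all function names are assumptions for illustration, and the sketch is not hardened for real use (a real exchange would run on two machines and reorder each list before sending it).

```python
import hashlib
import math
import secrets

P = 2**127 - 1  # a Mersenne prime; toy-sized modulus chosen for illustration


def hash_to_group(value):
    """Map a value to an integer in [1, P - 1]."""
    digest = hashlib.sha256(str(value).encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % (P - 1) + 1


def new_key():
    """Secret exponent coprime to P - 1, so x -> x^key mod P is a bijection."""
    while True:
        key = secrets.randbelow(P - 3) + 2
        if math.gcd(key, P - 1) == 1:
            return key


def encrypt(values, key):
    """Commutative step: (x^a)^b = (x^b)^a mod P."""
    return [pow(v, key, P) for v in values]


def intersection_size(r_values, s_values):
    key_r, key_s = new_key(), new_key()      # each party keeps its own key
    r_once = encrypt([hash_to_group(v) for v in r_values], key_r)  # sent R -> S
    s_once = encrypt([hash_to_group(v) for v in s_values], key_s)  # sent S -> R
    r_twice = set(encrypt(r_once, key_s))    # S adds its layer to R's values
    s_twice = set(encrypt(s_once, key_r))    # R adds its layer to S's values
    # Doubly encrypted values match exactly when the underlying values match.
    return len(r_twice & s_twice)


count = intersection_size({"a", "u", "v", "x"}, {"b", "u", "v", "y"})
```

Each side only ever sees the other's values under a key it does not hold, so matching is possible on the doubly encrypted forms without revealing b, y, a, or x.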
Closing Thoughts

The right to privacy: the most cherished of
human freedoms
-- Warren & Brandeis, 1890
 Code is law … it is all a matter of code: the
software and hardware that now rule
-- L. Lessig
 We can architect computing systems to protect
values we believe are fundamental, or we can
architect them to allow those values to disappear.
 What do we want to do as computer scientists?
References








R. Agrawal, R. Srikant. Information Integration Across Autonomous Enterprises. ACM
Int’l Conf. On Management of Data (SIGMOD), San Diego, California, June 2003.
R. Agrawal, J. Kiernan, R. Srikant, Y. Xu. An Xpath Based Preference Language for
P3P. 12th Int'l World Wide Web Conf. (WWW), Budapest, Hungary, May 2003.
R. Agrawal, J. Kiernan, R. Srikant, Y. Xu. Implementing P3P Using Database
Technology. 19th Int'l Conf. on Data Engineering (ICDE), Bangalore, India, March 2003.
R. Agrawal, J. Kiernan, R. Srikant, Y. Xu. Server Centric P3P. W3C Workshop on the
Future of P3P, Dulles, Virginia, Nov. 2002.
R. Agrawal, J. Kiernan, R. Srikant, Y. Xu. Hippocratic Databases. 28th Int'l Conf. on
Very Large Databases (VLDB), Hong Kong, August 2002.
R. Agrawal, J. Kiernan. Watermarking Relational Databases. 28th Int'l Conf. on Very
Large Databases (VLDB), Hong Kong, August 2002.
A. Evfimievski, R. Srikant, R. Agrawal, J. Gehrke. Mining Association Rules Over
Privacy Preserving Data. 8th Int'l Conf. on Knowledge Discovery in Databases and Data
Mining (KDD), Edmonton, Canada, July 2002.
R. Agrawal, R. Srikant. Privacy Preserving Data Mining. ACM Int’l Conf. On
Management of Data (SIGMOD), Dallas, Texas, May 2000.
New Challenges

General
– Language
– Efficiency

Use
– Limited Collection
– Limited Disclosure
– Limited Retention

Security and Openness
– Safety
– Openness
– Compliance
Language



Need a language for privacy policies & user preferences.
P3P can be used as starting point.
– Developed primarily for web shopping.
– What about richer domains?
How do we balance expressibility and usability?
– Arrange concepts in hierarchy or subsumption
relationship.
– P3P recipients: Ours, Same, Delivery, Unrelated, Public
– Purpose (hierarchy levels): contact; email, phone; home, work
Language (2)

How do we accommodate user negotiation
models?
– User willing to disclose information only if
fairly compensated.
– Value of privacy as coalitional game
[KPR2001]
Efficiency

How do we minimize the cost of privacy
checking?
 How do we incorporate purpose into database
design and query optimization?
 Tradeoffs between space & running time.


Only tag records in customer table with purpose, not all
records. But now need to do a join when scanning records in
order table.
How does the secure-databases work on
decomposition of multilevel relations into single-level relations [JS91] apply here?
Limited Collection

How do we identify attributes that are collected
but not used?
– Assets are only needed for mortgage when salary is
below some threshold.

What’s the needed granularity for numeric
attributes?
– Queries only ask “Salary > threshold” for rent
application.

How do we generate minimal queries?
– Redundancy may be hidden in application code.
Limited Disclosure

Can the user dynamically determine the set
of recipients?
 Example: Alice wants to add EasyCredit to
set of recipients in EquiRate’s database.
 Digital signatures.
Limited Retention

Completely forgetting some information is
non-trivial.
 How do we delete a record from the logs
and checkpoints, without affecting
recovery?
 How do we continue to support historical
analysis and statistical queries without
incurring privacy breaches?
Safety

Encryption provides additional layer of
security.
 How do we index encrypted data?
 How do we run queries against encrypted
data?
 [SWP00], [HILM02]
Openness

A donor shall be able to access all information
about the donor stored in the database.
 How does the database check Alice is really Alice
and not somebody else?
– Princeton admissions office broke into Yale’s
admissions using applicant’s social security number and
birth date.

How does Alice find out what databases have
information about her?
– Symmetrically private information retrieval [GIKM98].
Compliance

Universal Logging
– Can we provide each user whose data is accessed with a
log of that access, along with the query reading the
data?
– Use intermediaries who aggregate and analyze logs for
many users.

Tracking Privacy Breaches
– Insert “fingerprint” records with emails, telephone
numbers, and credit card numbers.
– Some data may be more valuable for spammers or
credit card theft. How do we identify categories to do
stratified fingerprinting rather than randomly
inserting records?