Auditing Compliance with a Hippocratic Database
Rakesh Agrawal
Roberto Bayardo
Christos Faloutsos
Jerry Kiernan
Ralf Rantzau
Ramakrishnan Srikant
Intelligent Information Systems Research
IBM Almaden Research Center
Outline
Introduction and motivation
Problem statement
Foundations
System organization and algorithms
Performance
Summary
Motivation
Hippocratic databases advocate policy-directed data management for privacy-sensitive data
– Need reinforced by legislation and regulations:
    – Health Insurance Portability & Accountability Act
    – Gramm-Leach Bliley Act – Consumer Privacy Rule
Goal
– Build a system to assist with auditing compliance with the stated policy
    – Event driven: privacy complaint
    – Periodic: monitor exposure to privacy violation
Audit Scenario
Jane has not been feeling well and decides to consult her doctor
The doctor uncovers that Jane’s blood sugar level is high and suspects diabetes
Sometime later, Jane receives promotional literature from a pharmaceutical company, proposing over the counter diabetes tests
Jane complains to the department of Health and Human Services saying that she had opted out of the doctor sharing her medical information with pharmaceutical companies for marketing purposes
The doctor must now review disclosures of Jane’s information in order to understand the circumstances of the disclosure, and take appropriate action
Audit Expression
Who has accessed Jane’s disease information?
    audit T.disease
    from Customer C, Treatment T
    where C.cid = T.pcid and C.name = ‘Jane’
Problem Statement
Given
– A log of queries executed over a database
– An audit expression specifying sensitive data
Precisely identify
– Those queries that accessed the data specified by the audit expression
“Suspicious” Queries
The audit expression A specifies the data to be audited
A query Qi has accessed information contained in the Customer table
Customer table
cid | name | address | zip   | …
1   | Jane | 1234 …  | 95120 | …
…   | …    | …       | …     | …
If query Qi accesses all the cells specified by the audit expression A for any row, Qi is suspicious
Issues
Convenient language
– Audit expression (essentially SPJ query)
Fast and precise on audits
Non-disruptive
– Minimal performance impact on normal database operation
Fine-grained
Assumptions
Disclosures stemming from multiple query executions are not considered
No use of outside knowledge to deduce information without detection
Queries considered include
– Joins and aggregation, but not nested subqueries
    – Note that existential subqueries can be converted into joins [SIGMOD92]
Informal Definitions
“Candidate” query
– Logged query that accesses all columns specified by the audit expression
“Indispensable” tuple (for a query)
– A tuple whose omission makes a difference to the result of a query
“Suspicious” query
– A candidate query that shares an indispensable tuple with the audit expression
Indispensable Tuple
The SPJ query Q and the audit expression A are of the form:
$Q = \pi_{C_{OQ}}(\sigma_{P_Q}(T \times R))$
$A = \pi_{C_{OA}}(\sigma_{P_A}(T \times S))$
where
– $\pi$: duplicate-preserving projection operator
– $C_{OQ}$: output columns in Q
– $P_Q$: predicates in Q
– $T$: tables common to Q and A
– $C_Q$: columns appearing anywhere in Q
Definition 1 - A virtual tuple $v \in T$ is indispensable for an SPJ query Q if the result of Q changes when we delete v:
$ind(v, Q) \Leftrightarrow \pi_{C_Q}(\sigma_{P_Q}(T \times R)) \neq \pi_{C_Q}(\sigma_{P_Q}((T - \{v\}) \times R))$
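Definition 1 can be checked by brute force on a toy instance. The following is a minimal sketch, not the system's algorithm: it evaluates the query twice, with and without v, and compares the two multisets. The helper names (spj, indispensable), the sample rows, and the simplification that T is a single table are all assumptions made for illustration.

```python
from collections import Counter
from itertools import product


def spj(T, R, predicate, columns):
    """Evaluate pi_columns(sigma_predicate(T x R)) as a multiset (duplicates kept)."""
    result = Counter()
    for t, r in product(T, R):
        row = {**t, **r}
        if predicate(row):
            result[tuple(row[c] for c in columns)] += 1
    return result


def indispensable(v, T, R, predicate, columns):
    """ind(v, Q): True if deleting v from T changes the result of Q."""
    T_without_v = [t for t in T if t != v]
    return spj(T, R, predicate, columns) != spj(T_without_v, R, predicate, columns)


# Hypothetical sample data (not from the slides).
jane = {"cid": 1, "name": "Jane", "address": "1 Example St"}
alice = {"cid": 2, "name": "Alice", "address": "9 Sample Rd"}
customers = [jane, alice]                      # plays the role of T
treatments = [{"pcid": 1, "disease": "diabetes"},
              {"pcid": 2, "disease": "flu"}]   # plays the role of R


# Q: addresses of people with diabetes.
def p_q(row):
    return row["cid"] == row["pcid"] and row["disease"] == "diabetes"


# C_Q lists every column that Q mentions anywhere.
c_q = ["name", "address", "cid", "pcid", "disease"]

print(indispensable(jane, customers, treatments, p_q, c_q))   # True
print(indispensable(alice, customers, treatments, p_q, c_q))  # False
```

Re-running the query per tuple is only practical on tiny examples; it is meant to make the definition concrete.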
“Candidate” Query
Definition 6 - Q is a candidate query with respect to A if:
$C_Q \supseteq C_{OA}$
Only candidate queries can be suspicious queries
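The candidate test is a set-containment check on column names, so it needs no access to the data. A minimal sketch follows, using the two logged queries and the audit expression from the Jane example later in the deck; the function name is_candidate and the hand-written column sets are assumptions, since a real system would extract the columns by parsing the logged SQL.

```python
def is_candidate(query_columns, audit_columns):
    """Static check C_Q ⊇ C_OA: the query touches every audited column."""
    return set(audit_columns) <= set(query_columns)


# Column sets written out by hand for illustration (assumption).
audit_columns = {"Treatment.disease"}
q1_columns = {"Customer.name", "Customer.address", "Customer.zip",
              "Customer.cid", "Treatment.pcid", "Treatment.disease"}
q2_columns = {"Customer.name", "Customer.address", "Customer.zip"}

print(is_candidate(q1_columns, audit_columns))  # True
print(is_candidate(q2_columns, audit_columns))  # False
```

Query 2 never touches the disease column, so it can be discarded without running it.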
“Suspicious” Query
Definition 5 - Maximal virtual tuple (MVT): A tuple v is an MVT for queries Q1 and Q2 if it belongs to the cross product of the common tables in their from clauses
Definition 7 - Q is suspicious with respect to A if they share an indispensable MVT v:
$susp(Q, A) \Leftrightarrow \exists v \in T \text{ s.t. } ind(v, Q) \wedge ind(v, A)$
For example,
Query Q: Addresses of people with diabetes
Audit A: Jane’s diagnosis
Jane’s tuple is indispensable for both; hence query Q is “suspicious” with respect to A
System Overview
[Figure: system architecture. Labels recoverable from the slide:]
– Database Layer: receives queries with purpose and recipient, and updates, inserts, and deletes
– Query Log
– Data Tables and Backlog: database triggers track updates to base tables
– Audit Database Layer: audit expression, static analysis, generate audit query, audit query
– Output: IDs of log queries having accessed data specified by the audit query
Query Log
ID | Timestamp | Query    | User  | Purpose       | Recipient
1  | 2004-02…  | Select … | James | Current       | Ours
2  | 2004-02…  | Select … | John  | Telemarketing | public
Static Analysis
Inputs: the query log (same table as on the Query Log slide) and the audit expression
Filter Queries
– Eliminates queries that could not possibly have violated the audit expression
– Accomplished by examining only the queries themselves (i.e., without running the queries)
– Ensures that $C_Q \supseteq C_{OA}$
Output: candidate queries
Audit Query Generation
Goal
– Build a query which, when run, returns the IDs of suspicious queries with respect to an audit expression A
Generating the Audit Query
Combine individual candidate queries and the audit expression into a single query graph
– QGM is a graphical representation of a query
– Boxes represent operators, such as select
– Boxes with no inputs are tables
– Lines represent input/output relationships between operators
Replace each table with its backlog to restore the version of the table to the time of each individual query
Combine the audit expression with candidate queries to identify suspicious queries
[Figure: query graph with boxes labeled Audit Expression, Union, Candidate Query 1, Candidate Query 2, T1, and T2]
Suspicious SPJ Query
The candidate SPJ query Q and the audit expression A are of the form:
$Q = \pi_{C_{OQ}}(\sigma_{P_Q}(T \times R))$
$A = \pi_{C_{OA}}(\sigma_{P_A}(T \times S))$
QGM rewrites, shown in the previous slide, transform Q and A into:
$\pi_{``Q_i''}(\sigma_{P_A}(\sigma_{P_Q}(T \times R) \times S))$
Theorem 2 - A candidate SPJ query Q is suspicious with respect to an audit expression A if and only if:
$\sigma_{P_A}(\sigma_{P_Q}(T \times R \times S)) \neq \emptyset$
Proof of correctness is based upon Definition 7 (suspicious query) and is given in the paper
Suspicious Aggregate Query (Including Having)
Solution in the paper
Example: Jane’s audit
Audit Expression
Who has accessed Jane’s disease information?
    audit T.disease
    from Customer C, Treatment T
    where C.cid = T.pcid and C.name = ‘Jane’
Query Log
ID | Query | TS | User | Purpose | Recipient
1 | select name, address, zip from Customer, Treatment where disease = ‘diabetes’ and cid=pcid | T3 | james | marketing | others
2 | select name, address from Customer where zip=‘95112’ | T3 | john | contact | others
Query 1 was executed at time T3
Backlog Table (Time Stamp)
Jane’s record was inserted at time T2 and updated at time T4. The backlog table records both versions of her information
Name  | Address | … | OPR | TS
Jane  | 1234…   | … | I   | T2
Jane  | 1234…   | … | U   | T4
Alice | …       | … | I   | T1
– Name, Address, …: attributes also in the source table
– OPR, TS: attributes only in the backlog table
– OPR: operation on a tuple, among Insert, Update and Delete
– TS: timestamp of the operation
C. S. Jensen, L. Mark, and N. Roussopoulos [TKDE 1991]
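Trigger-maintained backlog tables can be tried out in SQLite. The sketch below is illustrative only: the deck's experiments use DB2, and the schema, trigger names, sample addresses, and the one-row clock table (a logical clock standing in for T1, T2, T4) are all assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Customer (cid INTEGER PRIMARY KEY, name TEXT, address TEXT, zip TEXT);
-- Backlog = source columns plus the operation (opr) and its timestamp (ts).
CREATE TABLE Customer_backlog (cid INTEGER, name TEXT, address TEXT, zip TEXT,
                               opr TEXT, ts INTEGER);
-- Logical clock (assumption) so timestamps are deterministic.
CREATE TABLE clock (now INTEGER);
INSERT INTO clock VALUES (0);

CREATE TRIGGER customer_ins AFTER INSERT ON Customer BEGIN
    INSERT INTO Customer_backlog
    VALUES (NEW.cid, NEW.name, NEW.address, NEW.zip, 'I', (SELECT now FROM clock));
END;
CREATE TRIGGER customer_upd AFTER UPDATE ON Customer BEGIN
    INSERT INTO Customer_backlog
    VALUES (NEW.cid, NEW.name, NEW.address, NEW.zip, 'U', (SELECT now FROM clock));
END;
CREATE TRIGGER customer_del AFTER DELETE ON Customer BEGIN
    INSERT INTO Customer_backlog
    VALUES (OLD.cid, OLD.name, OLD.address, OLD.zip, 'D', (SELECT now FROM clock));
END;
""")


def at(t, sql):
    """Run one statement against Customer at logical time t."""
    conn.execute("UPDATE clock SET now = ?", (t,))
    conn.execute(sql)


# Hypothetical rows mirroring the slide: Alice at T1, Jane at T2, Jane updated at T4.
at(1, "INSERT INTO Customer VALUES (2, 'Alice', '9 Sample Rd', '95112')")
at(2, "INSERT INTO Customer VALUES (1, 'Jane', '1 Example St', '95120')")
at(4, "UPDATE Customer SET address = '5 Other Ave' WHERE cid = 1")

backlog = conn.execute(
    "SELECT name, opr, ts FROM Customer_backlog ORDER BY ts").fetchall()
print(backlog)  # [('Alice', 'I', 1), ('Jane', 'I', 2), ('Jane', 'U', 4)]
```

Each modification of Customer costs one extra insert into the backlog, which is the overhead measured in the performance slides.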
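Merge Logged Queries and Audit Expression
Merge logged queries and audit expression into a single query graph
[Figure: query graph with two boxes over the same base tables]
– Logged query: outputs C.n, C.a, C.z; Select := T.s=‘diabetes’ and T.p=C.c
– Audit expression: outputs T.s; audit expression := T.p=C.c and C.n=‘Jane’
– Both range over C and T, where C is Customer (c, n, …, t) and T is Treatment (p, r, …, t)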
Transform Query Graph into an Audit Query
[Figure: transformed query graph]
– Audit query: outputs ‘Q1’; audit expression := X.n=‘Jane’, where X ranges over the logged query
– Logged query: outputs C.n; Select := T.s=‘diabetes’ and C.c=T.p, over C (Customer: c, n, …, t) and T (Treatment: p, r, ..., t)
The audit expression now ranges over the logged query. If the logged query is suspicious, the audit query will output the id of the logged query
The view of Customer (Treatment) is a temporal view at the time the query was executed
Scenario Outcome
The audit uncovers that Query 1 in the query log accessed Jane’s information
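The outcome can be reproduced on a toy instance by applying the test of Theorem 2: a candidate query is suspicious when the conjunction of its predicate and the audit predicate selects at least one row. This is a minimal sketch under stated assumptions: it runs against current tables rather than temporal views over backlogs, the sample rows are hypothetical, and suspicious_ids is an illustrative helper, not the system's generated audit query. Query 2 is omitted because static analysis already rules it out.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Customer (cid INTEGER, name TEXT, address TEXT, zip TEXT);
CREATE TABLE Treatment (pcid INTEGER, disease TEXT);
INSERT INTO Customer VALUES (1, 'Jane', '1 Example St', '95120');
INSERT INTO Customer VALUES (2, 'Alice', '9 Sample Rd', '95112');
INSERT INTO Treatment VALUES (1, 'diabetes');
INSERT INTO Treatment VALUES (2, 'flu');
""")

# Predicate of the audit expression and of each candidate query (Q1 only).
AUDIT_PREDICATE = "C.cid = T.pcid AND C.name = 'Jane'"
candidates = {"Q1": "T.disease = 'diabetes' AND C.cid = T.pcid"}


def suspicious_ids(conn, candidates, audit_predicate):
    """Return ids of candidates whose predicate AND the audit predicate select a row."""
    parts = [
        f"SELECT DISTINCT '{qid}' FROM Customer C, Treatment T "
        f"WHERE ({pred}) AND ({audit_predicate})"
        for qid, pred in candidates.items()
    ]
    if not parts:
        return []
    # One branch per candidate, combined with UNION as in the query graph.
    return [row[0] for row in conn.execute(" UNION ".join(parts))]


print(suspicious_ids(conn, candidates, AUDIT_PREDICATE))  # ['Q1']
```

Auditing Alice instead returns no ids, because Alice's row does not satisfy Query 1's diabetes predicate.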
Empirical Evaluation: Goals
Cost of maintaining backlog tables
– Understand the impact of maintaining backlog tables on ongoing database operations
Cost of running audits
– Understand whether audits can run in reasonable time
Experimental Setup
IBM M Pro 6868 Intellistation
– 800 MHz Pentium III processor
– 512 MB of memory
– 16.9 GB disk drive
Windows 2000 Version 5, SP 4
DB2 v7 with default settings
TPC-H database
– Supplier table
    – 100,000 tuples
System Structures
Indexing
– Eager indexing
    – Maintain an index over the backlog table
    – Maintained during ongoing database operations
– Lazy indexing
    – No index over the backlog table
    – Create indices at the time of audit
Choice of index
– Simple index
    – Primary key of source table
– Composite index
    – Primary key of source table
    – Time stamp
Impact on Ongoing Operations
Queries
– Additionally log the query string
    – Already performed in many application environments
Updates
– For each updated tuple,
    – Insert a tuple into the backlog table
– Inserts and deletes are handled similarly
In a majority of environments, queries are much more frequent than updates
Update Performance
100,000 tuples in Supplier table
Update statement updates all tuples
Each update statement fires triggers, which insert an additional 100,000 tuples into the backlog
Evaluate impact of multiple versions on performance
Overhead on Updates
[Figure: update time in minutes (y-axis, 0 to 250) versus number of versions of each tuple in the Supplier backlog table (x-axis, 5 to 50), for four configurations: Composite, Simple, No Index, No Triggers]
– Simple wins over Composite
– 7x if all tuples are updated
– 3x if a single tuple is updated
– Eager indexing doesn’t add much cost
Audit Query Performance
Audit query:
    select ‘Q’ from Supplier where skey = k
Experiment:
    Evaluate the impact of the number of versions of tuples in the backlog table on performance
Audit Query Execution
[Figure: audit query time in msec. (y-axis, logarithmic, 1 to 1000) versus number of versions per tuple (x-axis, 1 to 50), for four series: Simple-I, Simple-C, Composite-I, Composite-C]
– Callouts: composite wins over simple, or simple wins over composite, depending on whether the initial version or the current version is selected
Takeaways
The composite index
– Enhances the performance of audits, but
– Additionally burdens updates when using eager indexing
The system supports
– Efficient auditing
– Without substantially burdening normal query processing
Related Work
Oracle Privacy Security Auditing
– Facility for logging queries with timestamp
– Flash-back queries
    – Restores the version of the data at the time of the query
– No support for automated auditing
    – User manually selects queries from the log and runs them
    – The user must decide if the query is suspicious
G. Miklau and D. Suciu [SIGMOD 2004]
– Formal analysis of information disclosure in data exchange
    – Is information about a secret query S revealed by views V1,…,Vn?
    – Considers all possible instances of a database schema
    – Assumes tuple independence
– We’re interested in given instances (temporal versions)
– Nonetheless, it will be interesting to explore the connection between the two works
Active enforcement of policies by limiting disclosure [VLDB’04]
Literature on multi-query optimization
Summary
In light of new privacy legislation
– The problem of auditing usage of information represents an important opportunity for database research
Formalized the problem through the fundamental concepts of indispensable tuple and suspicious queries
Achieved our design goals:
Design Goals
Convenient language
Fast and precise on audits
Non-disruptive
– Minimal performance impact on normal database operation
Fine-grained
Backup
Multiple Candidate Queries
[Figure: Union of two audit query graphs, one per candidate query]
– ‘Q1’: audit expression := C.n=‘Jane’
– ‘Q2’: audit expression := C.n=‘Jane’
Aggregate Queries with Having
[Figure: query graph for a suspicious aggregate query with a having clause]
– Audit query: outputs ‘Q1’; select := q1.c1=q2.c1 and … and q1.ci=q2.ci
– Qh: outputs c1, …, ci; select := …
– Qg: outputs c1, …, ci, agg1, …, aggn; group := c1, …, ci
– Qs: outputs c1, …, ck; audit expression := …
The join on aggregate columns ensures that the group being tracked by the audit has not been eliminated by the having clause
Dynamic Temporal Views
View of Customer table at time t, where t is the time stamp of the logged query
Column abbreviations: c = id, n = name, a = address, h = phone, z = zip, o = contact, t = marketing, ts = ts, op = opr
[Figure: query graph over Customer_backlog (c, n, a, h, z, o, t, ts, op)]
– C1: outputs c, n, a, h, z, o, t; Select := ts <= t and op <> ‘delete’ and not(C5)
– C5: Exists := C4.ts <= t and C3.c = C4.c and C4.ts > C3.ts
– C3 and C4 range over Customer_backlog (*)
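The temporal view translates directly into SQL with a NOT EXISTS subquery: keep a backlog row if it was written at or before time t, is not a delete, and has no newer version for the same key that is also at or before t. The sketch below uses SQLite; the reduced schema, the sample rows, the ‘D’ code for deletes, and the function name customer_as_of are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Customer_backlog (cid INTEGER, name TEXT, address TEXT,
                               opr TEXT, ts INTEGER);
-- Hypothetical history: Alice inserted at 1, Jane inserted at 2 and updated at 4.
INSERT INTO Customer_backlog VALUES (2, 'Alice', '9 Sample Rd', 'I', 1);
INSERT INTO Customer_backlog VALUES (1, 'Jane', '1 Example St', 'I', 2);
INSERT INTO Customer_backlog VALUES (1, 'Jane', '5 Other Ave', 'U', 4);
""")


def customer_as_of(conn, t):
    """Rows of Customer as they stood at time t, rebuilt from the backlog."""
    sql = """
        SELECT C3.cid, C3.name, C3.address
        FROM Customer_backlog C3
        WHERE C3.ts <= :t
          AND C3.opr <> 'D'
          AND NOT EXISTS (            -- no later version at or before t
                SELECT 1 FROM Customer_backlog C4
                WHERE C4.ts <= :t AND C4.cid = C3.cid AND C4.ts > C3.ts)
        ORDER BY C3.cid
    """
    return conn.execute(sql, {"t": t}).fetchall()


print(customer_as_of(conn, 3))  # [(1, 'Jane', '1 Example St'), (2, 'Alice', '9 Sample Rd')]
print(customer_as_of(conn, 4))  # [(1, 'Jane', '5 Other Ave'), (2, 'Alice', '9 Sample Rd')]
```

A query logged at time 3 therefore sees Jane's original address, even though the source table has since been updated.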
Cost of Building Indices over Backlog Tables
[Figure: index build time in minutes (y-axis, 0 to 14) versus number of versions per tuple (x-axis, up to 50), for two series: TS-Composite and TS-Simple]