Auditing Compliance with a Hippocratic Database

Rakesh Agrawal, Roberto Bayardo, Christos Faloutsos, Jerry Kiernan, Ralf Rantzau, Ramakrishnan Srikant
Intelligent Information Systems Research
IBM Almaden Research Center

Outline
- Introduction and motivation
- Problem statement
- Foundations
- System organization and algorithms
- Performance
- Summary

Motivation
- Hippocratic databases advocate policy-directed data management for privacy-sensitive data
  - Need reinforced by legislation and regulations:
    - Health Insurance Portability & Accountability Act
    - Gramm-Leach-Bliley Act
      - Consumer Privacy Rule
- Goal
  - Build a system to assist with auditing compliance with the stated policy
    - Event driven - privacy complaint
    - Periodic - monitor exposure to privacy violation

Audit Scenario
1. Jane has not been feeling well and decides to consult her doctor
2. The doctor uncovers that Jane's blood sugar level is high and suspects diabetes
3. Sometime later, Jane receives promotional literature from a pharmaceutical company, proposing over-the-counter diabetes tests
4. Jane complains to the Department of Health and Human Services, saying that she had opted out of the doctor sharing her medical information with pharmaceutical companies for marketing purposes
5. The doctor must now review disclosures of Jane's information in order to understand the circumstances of the disclosure, and take appropriate action

Audit Expression
Who has accessed Jane's disease information?
audit T.disease
from Customer C, Treatment T
where C.cid = T.pcid and C.name = 'Jane'

Problem Statement
- Given
  - A log of queries executed over a database
  - An audit expression specifying sensitive data
- Precisely identify
  - Those queries that accessed the data specified by the audit expression

"Suspicious" Queries
- The audit expression A specifies the data to be audited
- A query Qi has accessed information contained in the Customer table
- If query Qi accesses all the cells specified by the audit expression A for any row, Qi is suspicious

Customer table
cid | name | address | zip   | …
1   | Jane | 1234 …  | 95120 | …
…

Issues
- Convenient language
  - Audit expression (essentially an SPJ query)
- Fast and precise on audits
- Non-disruptive
  - Minimal performance impact on normal database operation
- Fine grained

Assumptions
- Disclosures stemming from multiple query executions are not considered
- No use of outside knowledge to deduce information without detection
- Queries considered include
  - Joins and aggregation, but not nested subqueries
    - Note that existential subqueries can be converted into joins [SIGMOD92]

Informal Definitions
- "Candidate" query
  - Logged query that accesses all columns specified by the audit expression
- "Indispensable" tuple (for a query)
  - A tuple whose omission makes a difference to the result of a query
- "Suspicious" query
  - A candidate query that shares an indispensable tuple with the audit expression

Indispensable Tuple
The SPJ query Q and the audit expression A are of the form:
  $Q = \pi_{C_{OQ}}(\sigma_{P_Q}(T \times R))$
  $A = \pi_{C_{OA}}(\sigma_{P_A}(T \times S))$
where
- $\pi$: duplicate-preserving projection operator
- $C_{OQ}$: output columns in Q
- $P_Q$: predicates in Q
- $T$: tables common to Q and A
- $C_Q$: columns appearing anywhere in Q

Definition 1 - A virtual tuple $v \in T$ is
indispensable for an SPJ query Q if the result of Q changes when we delete v:
  $ind(v, Q) \Leftrightarrow \pi_{C_Q}(\sigma_{P_Q}(T \times R)) \neq \pi_{C_Q}(\sigma_{P_Q}((T - \{v\}) \times R))$

"Candidate" Query
Definition 6 - Q is a candidate query with respect to A if:
  $C_Q \supseteq C_{OA}$
Only candidate queries can be suspicious queries.

"Suspicious" Query
Definition 5 - Maximal virtual tuple (MVT): A tuple v is an MVT for queries Q1 and Q2 if it belongs to the cross product of the common tables in their from clauses
Definition 7 - Q is suspicious with respect to A if they share an indispensable MVT v:
  $susp(Q, A) \Leftrightarrow \exists v \in T \text{ s.t. } ind(v, Q) \wedge ind(v, A)$
For example,
- Query Q: Addresses of people with diabetes
- Audit A: Jane's diagnosis
Jane's tuple is indispensable for both; hence query Q is "suspicious" with respect to A.

System Overview
- Queries arrive with purpose and recipient, and the database layer records them in the query log
- Updates, inserts, deletes: database triggers track updates to base tables (data tables and their backlog)
- Audit: the audit expression goes through static analysis, an audit query is generated, and the audit query is run through the database layer
- Result: IDs of log queries having accessed data specified by the audit query

Query Log
ID | Timestamp | Query    | User  | Purpose       | Recipient
1  | 2004-02…  | Select … | James | Current       | Ours
2  | 2004-02…  | Select … | John  | Telemarketing | public

Static Analysis
- Inputs: the query log and the audit expression
- Accomplished by examining only the queries themselves (i.e., without running the queries)
- Filter queries: eliminates queries that could not possibly have violated the audit expression
- Ensures that candidate queries satisfy $C_Q \supseteq C_{OA}$

Audit Query Generation
- Goal
  - Build a query which, when run, returns the ids of suspicious queries with respect to an audit expression A

Generating the Audit Query
[Diagram labels: Audit Expression, Union, Candidate Query 1, Candidate Query 2, T1, T2]
- Replace each table with its backlog to restore the version of the table to the time of each query
- Combine the audit expression with individual candidate queries to identify suspicious queries
- Combine individual candidate queries and the audit expression into a single query graph
- QGM is a graphical representation of a query
  - Boxes represent operators, such as select
  - Lines represent input/output relationships between operators
  - Boxes with no inputs are tables

Suspicious SPJ Query
The candidate SPJ query Q and the audit expression A are of the form:
  $Q = \pi_{C_{OQ}}(\sigma_{P_Q}(T \times R))$
  $A = \pi_{C_{OA}}(\sigma_{P_A}(T \times S))$
QGM rewrites, shown in the previous slide, transform Q and A into:
  $\pi_{\text{"}Q_i\text{"}}(\sigma_{P_A}(\sigma_{P_Q}(T \times R) \times S))$
Theorem 2 - A candidate SPJ query Q is suspicious with respect to an audit expression A if and only if:
  $\sigma_{P_A}(\sigma_{P_Q}(T \times R \times S)) \neq \emptyset$
Proof of correctness is based upon Definition 7 (suspicious query) and given in the paper.

Suspicious Aggregate Query (Including Having)
- Solution in the paper

Example: Jane's audit

Audit Expression
Who has accessed Jane's disease information?
audit T.disease
from Customer C, Treatment T
where C.cid = T.pcid and C.name = 'Jane'

Query Log
ID | Query | TS | User | Purpose | Recipient
1  | select name, address, zip from Customer, Treatment where disease = 'diabetes' and cid = pcid | T3 | james | marketing | others
2  | select name, address from Customer where zip = '95112' | T3 | john | contact | others
Query 1 was executed at time T3.

Backlog Table (Time Stamp)
- Jane's record was inserted at time T2 and updated at time T4. The backlog table records both versions of her information
- OPR: operation on a tuple, one of Insert, Update and Delete
- TS: timestamp of the operation
C. S. Jensen, L. Mark, and N.
Roussopoulos [TKDE 1991]

Name  | Address | … | OPR | TS
Jane  | 1234…   | … | I   | T2
Jane  | 1234…   | … | U   | T4
Alice | …       | … | I   | T1
- Name, Address, …: attributes also in the source table
- OPR, TS: attributes only in the backlog table

Merge Logged Queries and Audit Expression
Merge logged queries and audit expression into a single query graph
- Logged query box: outputs C.n, C.a, C.z; Select := T.s = 'diabetes' and T.p = C.c; ranges over C and T
- Audit expression box: outputs T.s; audit expression := T.p = C.c and C.n = 'Jane'; ranges over T and C
- Base tables: Treatment (p, r, …, t) and Customer (c, n, …, t)

Transform Query Graph into an Audit Query
- Audit expression box: outputs 'Q1'; audit expression := X.n = 'Jane'; ranges over X, the logged query
- Logged query box (X): outputs C.n; Select := T.s = 'diabetes' and C.c = T.p; ranges over C and T
- Base tables: Treatment (p, r, ..., t) and Customer (c, n, …, t)
- The audit expression now ranges over the logged query. If the logged query is suspicious, the audit query will output the id of the logged query
- The view of Customer (Treatment) is a temporal view as of the time the query was executed

Scenario Outcome
- The audit uncovers that Query 1 in the query log accessed Jane's information

Empirical Evaluation: Goals
- Cost of maintaining backlog tables
  - Understand the impact of maintaining backlog tables on ongoing database operations
- Cost of running audits
  - Understand whether audits can run in reasonable time

Experimental Setup
- IBM M Pro 6868 Intellistation
  - 800 MHz Pentium III processor
  - 512 MB of memory
  - 16.9 GB disk drive
- Windows 2000 Version 5, SP 4
- DB2 v7 with default settings
- TPC-H database
  - Supplier table
    - 100,000 tuples

System Structures
- Indexing
  - Eager indexing
    - Maintain an index over the backlog table
    - Maintained during ongoing database operations
  - Lazy indexing
    - No index over the backlog table
    - Create indices at the time of audit
- Choice of index
  - Simple index
    - Primary key of source table
  - Composite index
    - Primary key of source table
    - Time stamp

Impact on Ongoing Operations
- Queries
  - Additionally log the query string
    - Already performed in many application
      environments
- Updates
  - For each updated tuple,
    - Insert a tuple to the backlog table
  - Inserts and deletes are handled similarly
- In a majority of environments, queries are much more frequent than updates

Update Performance
- 100,000 tuples in Supplier table
- Update statement updates all tuples
- Each update statement fires triggers which insert an additional 100,000 tuples in the backlog
- Evaluate impact of multiple versions on performance

Overhead on Updates
[Figure: update time (minutes) vs. # of versions per tuple in the Supplier backlog table; series: Composite, Simple, No Index, No Triggers]
- 7x if all tuples are updated
- 3x if a single tuple is updated
- Eager indexing doesn't add much cost
- Simple wins over Composite

Audit Query Performance
- Audit query: select 'Q' from Supplier where skey = k
- Experiment: Evaluate the impact of the number of versions of tuples in the backlog table on performance

Audit Query Execution
[Figure: audit time (msec., log scale) vs. # versions per tuple; series: Simple-I, Simple-C, Composite-I, Composite-C]
- Composite wins over simple if the initial version is selected
- Simple wins over composite if the current version is selected

Takeaways
- The composite index
  - Enhances the performance of audits, but
  - Additionally burdens updates when using eager indexing
- The system supports
  - Efficient auditing
  - Without substantially burdening normal query processing

Related Work
- Oracle Privacy Security Auditing
  - Facility for logging queries with timestamp
  - Flash-back queries
    - Restores the version of the data at the time of the query
  - No support for automated auditing
    - User manually selects queries from the log and runs them
    - The user has to decide if the query is suspicious
- G. Miklau, D.
Suciu [SIGMOD 2004]
  - Formal analysis of information disclosure in data exchange
    - Is information about a secret query S revealed by views V1,…,Vn
    - Considers all possible instances of a database schema
    - Assumes tuple independence
  - We're interested in given instances (temporal versions)
  - Nonetheless, it will be interesting to explore the connection between the two works
- Active enforcement of policies by limiting disclosure [VLDB'04]
- Literature on multi-query optimization

Summary
- In light of new privacy legislation
  - The problem of auditing usage of information represents an important opportunity for database research
- Formalized the problem through the fundamental concepts of indispensable tuple and suspicious queries
- Achieved our design goals:

Design Goals
- Convenient language
- Fast and precise on audits
- Non-disruptive
  - Minimal performance impact on normal database operation
- Fine grained

Backup

Multiple Candidate Queries
[Diagram: a Union box over two audit boxes with outputs 'Q1' and 'Q2', each with audit expression := C.n = 'Jane']

Aggregate Queries with Having
[Diagram labels, top to bottom]
- 'Q1': select := q1.c1 = q2.c1 and … and q1.ci = q2.ci
- Qh: select := …; columns c1, …, ci
- Qg: group := c1, …, ci; columns c1, …, ci, agg1, …, aggn
- Qs: audit expression := …; columns c1, …, ck
- The join on aggregate columns ensures that the group being tracked by the audit has not been eliminated by the having clause

Dynamic Temporal Views
View of Customer table at time t
- Column legend: c = id, n = name, a = address, h = phone, z = zip, o = contact, t = marketing, ts = ts, op = opr
- t: time stamp of the logged query
- C1: columns c, n, a, h, z, o, t; Select := ts <= t and op <> 'delete' and not(C5)
- C5: Exists := C4.ts <= t and C3.c = C4.c and C4.ts > C3.ts
- C3 and C4 range over Customer_backlog (c, n, a, h, z, o, t, ts, op)

Cost of Building Indices over Backlog Tables
[Figure: index build time (minutes) vs. # versions per tuple; series: TS-Composite, TS-Simple]
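
Illustrative sketch (not part of the original deck)
The Dynamic Temporal Views slide and Theorem 2 can be tried out on the Jane example with a few lines of Python and SQLite. This is only a minimal sketch under stated assumptions, not the DB2 implementation measured in the deck: the table layout, the 'D' code for deletes, Alice's zip, the use of pcid as the Treatment key, the hand-written column sets standing in for static analysis, and the helper names `snapshot` and `suspicious_ids` are all hypothetical.

```python
import sqlite3

# Hypothetical backlog tables, loosely following the deck's Jane example.
# opr is 'I', 'U' or 'D' (assumed codes); ts is an integer time (T2 -> 2, ...).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer_backlog
  (cid INTEGER, name TEXT, address TEXT, zip TEXT, opr TEXT, ts INTEGER);
CREATE TABLE treatment_backlog
  (pcid INTEGER, disease TEXT, opr TEXT, ts INTEGER);
INSERT INTO customer_backlog VALUES
  (1, 'Jane',  '1234 ...', '95120', 'I', 2),
  (1, 'Jane',  '1234 ...', '95120', 'U', 4),
  (2, 'Alice', '...',      '95112', 'I', 1);
INSERT INTO treatment_backlog VALUES (1, 'diabetes', 'I', 2);
""")


def snapshot(backlog, key, t):
    """Return SQL for the table's state at time t, rebuilt from its backlog.

    Keeps a version if ts <= t, it is not a delete, and no later version
    of the same key exists with ts <= t (the slide's Select/Exists pair).
    """
    return (
        f"(SELECT * FROM {backlog} b WHERE b.ts <= {t} AND b.opr <> 'D' "
        f"AND NOT EXISTS (SELECT 1 FROM {backlog} n "
        f"WHERE n.{key} = b.{key} AND n.ts <= {t} AND n.ts > b.ts))"
    )


# Logged queries; 'columns' lists every column the query mentions (C_Q).
query_log = [
    {"id": "Q1", "ts": 3,
     "columns": {"name", "address", "zip", "disease", "cid", "pcid"},
     "predicate": "disease = 'diabetes' AND cid = pcid"},
    {"id": "Q2", "ts": 3,
     "columns": {"name", "address", "zip"},
     "predicate": "zip = '95112'"},
]

# Audit expression: audited columns (C_OA) and its predicate (P_A).
audit = {"columns": {"disease"}, "predicate": "cid = pcid AND name = 'Jane'"}


def suspicious_ids(conn, query_log, audit):
    """Return ids of logged queries that are suspicious w.r.t. the audit."""
    hits = []
    for q in query_log:
        # Static analysis: only candidates (C_Q contains C_OA) can be suspicious.
        if not audit["columns"] <= q["columns"]:
            continue
        # Simplification: every candidate is assumed to range over both tables.
        c = snapshot("customer_backlog", "cid", q["ts"])
        t = snapshot("treatment_backlog", "pcid", q["ts"])
        # Theorem 2: suspicious iff applying P_Q and P_A together is non-empty.
        sql = (f"SELECT 1 FROM {c} AS c, {t} AS t "
               f"WHERE ({q['predicate']}) AND ({audit['predicate']}) LIMIT 1")
        if conn.execute(sql).fetchone():
            hits.append(q["id"])
    return hits


print(suspicious_ids(conn, query_log, audit))  # ['Q1']
```

Query 2 never reaches the database because it does not mention the audited disease column, which mirrors the static analysis step. Query 1 is evaluated against the tables as they stood at T3, so it sees Jane's T2 version rather than the T4 update.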
