Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Privacy-oriented Data Mining by Proof Checking Stan Matwin (joint work with Amy Felty ) SITE University of Ottawa, Canada [email protected] The TAMALE Group • 4 profs • Some 30 graduate students • Areas: machine learning, data mining, text mining, NLP, data warehousing • Research in – – – – Inductive Logic Programming Text mining Learning in the presence of knowledge Applications of ML/DM (e.g. in SE: tools for maintenance personnel) UBC 29/7/03 2 • Why did I get into this research? • what is already being done… and why it ’s not enough • the main idea • its operation • discussion – ’correctness ’ • prototype - Coq and CIC • example • some technical challenges • acceptance? UBC 29/7/03 3 Some useful concepts... • opting out vs opting in • Use Limitation Principle: data should be used only for the explicit purpose for which it has been collected UBC 29/7/03 4 …and existing technical proposals On the web: P3P Platform for Privacy Preferences • W3C standard • XML specifications - on websites and in browsers - of what can be collected and for what purpose - ULP? • Handles cookies • Data exchange protocol more than privacy protocol: no provisions for opting out after an initial opt-in • the ULP part is in NL…not verifiable UBC 29/7/03 5 Agrawal ’s data perturbation transformations • data is perturbed by random distortion: xi xi +r r uniform or gaussian • a procedure to reconstruct a PACesitimation of the original distribution (but not the values) • a procedure to build an accurate decision tree on the perturbed distribution UBC 29/7/03 6 Agrawal ’s transformations cont’d • proposes a measure to quantify privacy – estimate intervals and their size • lately extended to non-numerical attributes, and to association rules • does not address the ULP • how do we know it is applied? UBC 29/7/03 7 the main idea: towards a verifiable ULP • User sets permissions: what can and cannot be done with her data • Any claim that a software respects these permissions is a proof of a theorem about the software • Verifying the claim is then checking that proof against the software UBC 29/7/03 8 Who are the players? • • • • • UBC 29/7/03 User C Data miner Org Data mining software developer Dev Independent verifier Veri …BUT no one owns the data D 9 D: A: S: database scheme given set of database and data mining operations source code for A UBC 29/7/03 PC(D,A): C’s permissions T(PC,S): theorem that S respects PC R(PC,S): proof of T(PC,S): B: binary code of S 10 Discussion - properties • It can be proven that C ’s permissions are respected (or not): PC is in fact a verifiable ULP • PC can be negative (out) or positive (in) • proof construction needs to be done only once for a given PC ,D and A • Scheme is robust against cheating by Dev or Org UBC 29/7/03 11 Acceptance issues • No Org will give Veri access to S • Too much overhead to check R(PC,S) for each task, and each user • Too cumbersome for C • Based on all Orgs buying in UBC 29/7/03 12 Acceptance1:Veri’s operation- access • Veri needs – – – – PC from C R(S, PC) from Dev S from Dev B from Org • Veri could check R(S, PC) at Dev’s • Veri needs to verify that S (belonging normally to Dev) corresponds to B that Org runs. UBC 29/7/03 13 Acceptance2: overhead • Veri runs proof checking on a control basis • Org’s execution ovhd ? UBC 29/7/03 14 Issues • Naming the fields: XML or disclosure • restricted class of theorems for a given P-automating proof techniques for this class UBC 29/7/03 15 Acceptance3: C’s perspective • Building PCs must be easy for C, based on D and processing schema; initially a closed set? • permissions could be encoded on a credit card, smart card, in the electronic wallet • or in the CA; they can then be dynamically modified and revoked UBC 29/7/03 16 « Political » aspects: who is Veri? • generally trusted – – « consumer association »? – « Ralph Nader »? – « transparency international »? • IT expert at the level of instrumenting and running the proof checker – connection to Open Software Foundation? • theorem proving can be cast as « better testing » UBC 29/7/03 17 how to make Orgs buy in? • The first Org is needed to volunteer • a Green Data Mining logo will be granted and administered (verified) by Veri • other Orgs will have an incentive to join UBC 29/7/03 18 Future work • Build the tools • expand the prototype • extend from Weka to commercial data mining packages • Integrate with P3P? • find a willing Org UBC 29/7/03 19 UBC 29/7/03 20 Link between S and B • compilation not an option • watermarking solution: B is watermaked by a slightly modified compiler with MD5(tar(S)) =128 bytes • marks are inserted by a trusted makefile-and-compiler in locations in B given by Veri and unknown to Org UBC 29/7/03 21 Link… • Veri, given access to S, can verify that B corresponds to S • An attack by I requires hacking the compiler • An attack by Org requires knowing the locations of watermarks UBC 29/7/03 22 Example C restricts her Employee data from participating in a join with her Payroll data Record Payroll : Set := mkPay{PID : nat; JoinInd : bool; Position : string; Salary: nat}. Record Employee : Set := mkEmp{Name : string; EID : nat; …}. Record Combined : Set := mkComb{CID : nat; CName : string; Csalary: nat; …}. UBC 29/7/03 23 Fixpoint Join [Ps: list Payroll]: (list Employee) (list Combined) := [Es : list Employee] Cases Ps of nil (nil Combined) | (cons p ps) (app (check_JoinInd_and_find_employee_record p Es) (Join ps Es)) end. (check_JoinInd_and_find_employee_record p Es) if a record is found in Es whose EID matches Ps PID and JoinInd permits Join, then a list of length 1 with the result of Join is returned, otherwise empty UBC 29/7/03 24 Definition Pc:= [S:((list Payroll)(list Employee) (list Combined)) Prop] Ps:list Payroll}. Es:list Employee. (UniqueJoinInd Ps) P: Payroll.(In P Ps) ((JoinInd P)=false not C:Combined ((In C (S Ps Es)) ((CID C)=(PID P))) • PC(S) is written as (PC Join): Coq expands the definition of PC and provides the theorem • request to proof checking operator of Coq will check this proof i.e; it will check that the user permissions are encoded into the Join program given • Whole proof: 300 lines of Coq code; proof checking: 1 sec on a 600MHz machine UBC 29/7/03 25