Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
When small data is better data Paul Francis, MPI-SWS Ruichuan Chen, Ekin Akkus, Johannes Gehrke private When small data is better data Paul Francis, MPI-SWS Ruichuan Chen, Ekin Akkus, Johannes Gehrke The user data “understanding” • There is a tacit understanding among users that if you send data to a company, they are free to use it how they wish Facebook “knowing” all kinds of personal information Doubleclick monitoring your browsing behavior Google gathering WLAN traffic during drive-by Leads to data gathering services • Companies build (free) services designed to gather as much data about users as they can – And often secretly gather data about users when they can’t • Then try to monetize that data – Mainly through advertising – Though Jean Bolot had some interesting ideas Leads to “big data” (mining) • Companies gather what they can, but don’t always get what they want – Google knows your searches, but not your relationship status – Facebook knows your relationship status, but not what you buy – Amazon knows what you buy, but not what you search for • So they use big data mining to infer what they don’t know A new user data understanding • It is ok to monetize (or otherwise benefit) from user data if: – The user data is very expensive to collected in any identifiable form – Users can know what is going on, and users can opt-out Why is this interesting? • Keeping user data on the user device is the key to user privacy • Most user data is at, or has passed through, the user device – Search and browsing in browser history – Facebook user profile easily scraped – Amazon purchases easily scraped Premise of “Private by Design” • If we can monetize user data, without collecting user data, then we have legitimate access to far more user data – Less need to deal with big data – Better monetization, less overhead My group’s research agenda • “Private by Design” behavioral advertising • “Private by Design” aggregate analytics My group’s research agenda • “Private by Design” behavioral advertising • “Private by Design” aggregate analytics Aggregate Analytics • Web analytics: want to know demographics of user base, what other websites users visit, etc. • App analytics: want to know what other apps user runs (competitors) • Mobile analytics, general analytics,…. Typical database privacy settings: trusted component sees database Untrusted Analyst Analyst query query Database’ Query Module (add noise) query Database anonymize Trusted Database Our setting: nobody (except user) sees individual user data Untrusted Analyst ??? Untrusted Data Data Data Previous work in our setting • Assumed differential privacy • Poor scaling characteristics, and/or • Could not tolerate user fraud Analyst Our goal: Assume differential privacy, but fix scaling and user fraud problems. ??? Data Data Data Differential privacy • Differential privacy adds noise to the output of a computation (i.e., query). Database Query Module (add noise) DB1 DB2 (differs by one user) Analyst Components & assumptions Analyst is potentially malicious (violating user privacy) Analyst Proxy (add DP noise blindly) Data Data Data Proxy is honest but curious 1) Follows the specified protocol (does not collude) 2) Tries to exploit additional info that can be learned in so doing Clients are user devices. Clients are potentially malicious (distorting the final results) Actually, two proxies! Honest-but-Curious proxy must not see user data Analyst Blind Proxy Data Blind Proxy Data Data If one proxy, need expensive public key encryption between clients and analyst If two proxies, can use much cheaper form of encryption (one time pad) Message XOR Random_String = Result Proxy 1 Result Sender Result Random_String Random_String Receiver Proxy 2 Result XOR Random_String = Message Queries are counting queries: Analyst Blind Proxy Data Blind Proxy Data Data Ex: How many users…..are male and between ages of 10-20? Analyst Blind Proxy Data Blind Proxy Data Data Clients answer ‘yes’ or ‘no’ only Analyst Proxies adds N additional random yes/no answers (coins) Blind Proxy Blind Proxy N = 2σ2 But, must not know how many yes’s and no’s it added! Data Data Data Analyst Each proxy independently adds N random coins Blind Proxy Data Blind Proxy Data Data XOR at analyst will produce random result But neither proxy knows what the result will be Analyst Coins and answers Blind Proxy Data Blind Proxy Data Data Analyst Decrypt and tabulate Blind Proxy Data Blind Proxy Data Data Buckets • Not “is your age between 10-20?”, – but “are you 1?”, “are you 2?”, “are you 3?”…. • Query is generally a vector of yes/no questions – Answer a vector of 1’s and 0’s • Vector can be big: – List of 20K websites – 185K combinations of 10 of 20 attributes Analyst Blind Proxy b1: u4, u12, c2, …… b2: u6, c3, u19, …… b3: u12, c7, u6, …… Blind Proxy Proxies add coins and shuffle user answers (per bucket) u1: b1, b2, b3, …… u2: b1, b2, b3, …… u3: b1, b2, b3, …… ……. Data Data Data b1: u4, u12, c2, …… b2: u6, c3, u19, …… b3: u12, c7, u6, …… u1: b1, b2, b3, …… u2: b1, b2, b3, …… u3: b1, b2, b3, …… ……. Analyst Blind Proxy Data b1: u4, u12, c2, …… b2: u6, c3, u19, …… b3: u12, c7, u6, …… Blind Proxy Data Data u1: b1, b2, b3, …… u2: b1, b2, b3, …… u3: b1, b2, b3, …… ……. b1: u4, u12, c2, …… b2: u6, c3, u19, …… b3: u12, c7, u6, …… u1: b1, b2, b3, …… u2: b1, b2, b3, …… u3: b1, b2, b3, …… ……. Blind Proxy Analyst b1: u4, u12, c2, …… b2: u6, c3, u19, …… b3: u12, c7, u6, …… Blind Proxy u1: b1, b2, b3, …… u2: b1, b2, b3, …… u3: b1, b2, b3, …… ……. The shuffling at each proxy must be identical (though random) Because eachData bit must Data be paired Data with its XOR partner b1: u4, u12, c2, …… b2: u6, c3, u19, …… b3: u12, c7, u6, …… u1: b1, b2, b3, …… u2: b1, b2, b3, …… u3: b1, b2, b3, …… ……. Analyst Blind Proxy b1: u4, u12, c2, …… b2: u6, c3, u19, …… b3: u12, c7, u6, …… Blind Proxy u1: b1, b2, b3, …… u2: b1, b2, b3, …… u3: b1, b2, b3, …… ……. But the proxies may have a (slightly) different set of answers. Data Data Data Analyst u1: b1, b2, b3, …… u2: b1, b2, b3, …… u3: b1, b2, b3, …… ……. Blind Proxy Blind Proxy Synchronize the list of answers. Share a random seed for a Datagenerator, Data use Data random number to shuffle. u1: b1, b2, b3, …… u2: b1, b2, b3, …… u3: b1, b2, b3, …… ……. Time • Queries (unfortunately) take time: – There is a period of time during which a query is active • 10s of minutes, hours, or days??? Start query Synchronize and add coins TIME Clients pull in and answer queries Differential Privacy, good and bad • Good: – Adds noise – Lots of machinery being built • Bad: – Very pessimistic (measure of privacy loss is almost certainly way worse than actual privacy loss) – “Throwing away the database” not realistic From INTIMATE workshop • Jean’s mobility a good application • Collaborative filtering (Bach, Aruna) looks hard to do • Serge’s social knowledge may be centered on user devices… – Query for people’s opinions… • Real-time analytics may be possible – Streamed coin addition??? Status and future • Building an application analytics tool – Initial focus is PC platforms – Hope to get real app developers to bundle our tool • Additional privacy mechanisms (beyond differential privacy) • Work on better understanding of privacy loss in a realistic setting