Transcript of Andreas Weigend
Data Mining and E-Business: The Social Data Revolution
Stanford University, Dept. of Statistics

Andreas Weigend (www.weigend.com)
Data Mining and Electronic Business: The Social Data Revolution
STATS 252, June 10, 2009
Class 9, Privacy (Part 1 of 2)

This transcript: http://weigend.com/files/teaching/stanford/2009/recordings/audio/weigend_stanford2009_9.1_privacy_2009.06.10.doc
Corresponding audio file: http://weigend.com/files/teaching/stanford/2009/recordings/audio/weigend_stanford2009_9.1_privacy_2009.06.10.mp3
Next transcript, Mobile (Part 2 of 2): http://weigend.com/files/teaching/stanford/2009/recordings/audio/weigend_stanford2009_9.2_mobile_2009.06.10.doc
To see the whole series, containing folder: http://weigend.com/files/teaching/stanford/2009/recordings/audio/
Transcript by Tamara Bentzur, http://outsourcetranscriptionservices.com/

Andreas: Welcome to the last class of spring 2009 Data Mining and E-Business: The Social Data Revolution, here at Stanford. It is class number nine. We have a fairly tight agenda for today. We will start, in the first half, by thinking about the question I emailed out last night: where do you feel that some of your digital exhaust is being sniffed? Where do you feel that somebody is going through your digital garbage? Did anybody have time to think about that?

Student: I kept feeling Gmail is tracking my emails and Google Docs. I'm afraid of IP leaks through that.

Andreas: There are two things here. One is that email has pretty much all of our secrets in it. The first concern is that Google itself, or the spooks, could potentially look at it. The other is a security breach, which of course would never happen anywhere, including Google. Security breaches, by the way, let social security numbers get out twice, both at Stanford and at Berkeley. That is a very different story. Think beyond email: what is it that Google knows? Google knows all your queries, all your desires, all the things you really wouldn't share with your closest friend, but Google would know.

Student: With the coming of Google Voice, all your phone -

Andreas: Google Voice, yes, short messages. It understands your implicit calling pattern in addition to your implicit emailing pattern.

Student: Chat logs.

Andreas: Chat logs, that's another dimension. There is a company called Thirty-three - something; I need to look it up. I just talked to the CEO a few days ago. He claims to have access to AOL and all the big companies that do chat, and only looks at header information, but header information is interesting enough. Gary, where are you worried about someone sniffing your digital exhaust?

Gary: My search browser history, what I search on YouTube, what I search on Google.

Andreas: Yes, YouTube, that's a beautiful example. Who of you would be comfortable having all your searches on YouTube show up on the screen right now? [Laughter] You would be comfortable? Having your searches on YouTube?

Student: I try not to share truly personal content on the web.
0:02:59.4
Andreas: In the Hong Kong example, Yahoo China managed to get somebody into prison because he emailed or published - I forgot the details from four years ago - something that was supposed to be internal within the Chinese government. Despite actually having been in Hong Kong, not in mainland China, this guy was still delivered to the mainland Chinese authorities.

Think very differently now. Think about your mobile. Your mobile knows so much about you. It knows where you are, as opposed to where you say you are. What digital exhaust does your mobile create?

Student: They know all the businesses you go to and how you get to them. The same way you have a funnel leading to transactions on Amazon, it's the same thing with mobile, but in physical reality.

Andreas: That is applications that could potentially use it for advertising. Five years ago, I had a student in my class in Shanghai who ran the Shanghai subway. He took me to the headquarters. They had a special car and so on. He drove around quite a bit to make sure I sort of lost my way, and then he said, "Now we're here. Do you know where we are?" I said, "Yeah, that's the stall where I buy my CDs on the other side of the street." He was seriously disappointed that he didn't manage to fool me.

In that building, they have data that you cannot believe about the behavior of each individual. We all have what's called a [0:04:41.2 Chinese phrase], which is the railway card or subway card. As opposed to systems like San Francisco, where you put in quarters, which are anonymous, or New York, where you do have a card but it doesn't relate to your identity, the Shanghai railway card is totally tied to your name, and it fills up: you can fill it up from your bank account, or have it automatically deducted. It's very convenient for a user. I was swiping my card through their test slot and it told me where I was three years ago: at 4:53 I went through turnstile number 5 at whatever subway station it was, and 17 minutes later I left at a certain turnstile at another subway station. On the one hand, that's interesting for understanding how people move and for potential planning, but on the other hand, they do know a lot of information about us. Then there are your credit cards, your purchasing history - where else is there digital exhaust? We're talking about what digital exhaust we produce here, and I want to get at the bigger spectrum of things.

Student: … 0:06:27.5

Andreas: Shopping. Now, I'm surprised I haven't heard the F word yet, which is Facebook, just to make sure you know what I'm talking about. [Laughter] Facebook does know a lot about us. If you look at peoples' friends, I think we learn so much about them. Or the other F word, Flickr, and the pictures we post - let me tell you a story here. I went to a party about two or three years ago in San Francisco, on Market Street.
I saw a guy there, went up to him and said, "Hi, I'm Andreas." He said, "I know." I'm sorry, I forgot - it happens; I forget people. I said, "What's your name?" He said, "[0:07:17.9 unclear]." I said, "I do consulting and teach at Stanford." "I know." "What do you do?" "I work at Google." Okay. Another attempt: "I used to be at Amazon." "I know." After a few moments like this, I got sort of surprised by just how much he knew about me. He even went to the degree that he knew who was at my birthday party, because he had looked up people on Flickr - that was pre-Facebook tagging. He didn't know peoples' names, but he pretty much knew who my friends were. At the end of the day, I said, "I'm on foot here; I take Muni. Do you mind driving me home?" He said sure. He drove me home, stopped, and said, "This is where you live, right?" I said yes. It turned out he listens to my mp3s. It's very interesting how much people really can find out about us. There was nothing funny about it, but if he was not a great guy - and I know where he works, I know who he works for - just to be on the safe side, if he was spooky, I might be somewhat worried about it. That was about four years ago.

Student: You seem to be setting up this question sort of asymmetrically, in a way that alludes to the scariness of it. There are people who know more about you, but I'd like to get over this hump to the future where everybody's information is just as easily accessible, so if you're going to use it against me, I know where you are, I know where you live. When does it become symmetrical, whether that's government versus the population or population versus population? At what point in this science fiction journey is nobody afraid, because we know you can't get me because I know just as much about you?

Andreas: That's a good point. About ten years ago, I gave a keynote in Germany. Germany has very different sensibilities about privacy than here. I always put my schedule on the web, as you know; www.weigend.com/itinerary tells you the next flight I'm on. People were so worried about it being a security risk. I asked them, "What do you mean by security risk? Do you think someone will do something to the plane just because I'm on it? I don't think I'm anyone important." "No, about somebody breaking into your house." I said, "Great, if somebody wants to break into my house, it's better to break into my house when I'm not there than when I'm there. If I have the choice, please figure out the time when I'm not there."

0:10:15.0
Last year, Ryan Mason, who is one of the graders this year, actually had me carry a [0:10:20.9 unclear] device, a little orange device with unfortunately pretty flaky software, which every fifteen minutes, via satellite, told my website where I was. Cynthia would have known exactly when I was leaving home today. It's pretty convenient, because you could look at an [0:10:40.1 unclear] and, let's say, if you're consulting, you know exactly the times you arrived at a spot and exactly the times you left it, for the last half year or so you've been carrying it.

There is a lot of asymmetry here. My problem is: what if somebody actually fakes the information? Let's say somebody robs a bank.
Now, they fake my information and claim that it was me who robbed the bank. I talked to a friend of mine last year who solved the problem for me, which Cynthia will unsolve in her talk. The problem was: how can I prove that it wasn't me who robbed the bank, that I actually was in Los Angeles visiting my friend, as opposed to being in San Francisco when the bank was robbed? I was stuck with that question, and Greg Wolfe said, "Andreas, the solution of course is more data, namely by constantly collecting those data and hashing them into some space, and then being able to compare what is the probability of him being in Los Angeles versus the probability of him being in San Francisco. If you collect more data, you can say with high probability that he really was not in San Francisco that day." It is similar to our problem with information overload, where we asked how we deal with it, and the answer is by creating more data - metadata - by allowing websites to observe what we're doing and using machine learning. That is probably the only approach I can see. Plus the social graph, which is one way, in social data, of sharing stuff.

Student: I really want to go back to Robert's point. Maybe there will be symmetry in consumer-to-consumer interaction or user-to-user interaction, but what about user-to-government? At the end of the day, when you look at what happened … and China… at some point there are entities that break the symmetry, and there are a lot of reasons to do that, from disease control, to terrorism control, to economy control. I think that reaching symmetry at the level of the government would require a very radical change in government structure, and I don't think that will happen.

Andreas: We see it in some ways in banking. UBS, for instance, probably has to hand over records of people here, and whether that is the right thing or the wrong thing, I'm not one to judge, but if people break the law by not paying taxes or whatever the story is there, it's understandable that the US government uses its leverage and tells the bank, "If you want to do business here, I'm sorry, but you have to obey the rules we have in this country." Are there any pockets of data that we haven't even thought about when we're talking about digital exhaust? The data that we don't actively contribute, knowingly and willingly, as I always say, but that we implicitly contribute?

Student: GPS and Wi-Fi, do they know we're in this group…

0:14:25.2
Andreas: Of course. There is a German company called Plazes, which has a mapping of hot spots. It's very simple, and it is a smooth transition into the talk. If you have some other source of the information - because some people might have an iPhone, or might tag this explicitly - where you have GPS information and Wi-Fi information, suddenly you can build a pretty good idea of where the different base stations are. As you know from Google Latitude - who uses that, actually? One person? - this works even with phones like mine that have the GPS chip disabled for marketing and revenue reasons; down the road they hope to catch a premium by selling the new version of the phone, which is the same version of the phone, but now with the GPS chip enabled. Even there, it is typically pretty good at knowing where I am, within a few hundred yards. That is based on mapping: having auxiliary data that is both GPS location data and Wi-Fi data.
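To make the mechanics concrete, here is a minimal sketch of that kind of Wi-Fi positioning, with entirely invented scan records and MAC addresses: phones that do have GPS report which access points they see, each access point's position is estimated as the centroid of those fixes, and a GPS-less phone can then be located from access points alone.

```python
# Toy Wi-Fi positioning: all coordinates and MAC addresses are invented.
from collections import defaultdict

# (latitude, longitude, visible access-point MACs) reported by phones
# that do have GPS.
scans = [
    (37.4275, -122.1697, ["aa:01", "aa:02"]),
    (37.4278, -122.1701, ["aa:01"]),
    (37.4290, -122.1710, ["aa:02", "aa:03"]),
]

# Estimate each access point's location as the centroid of the GPS
# fixes in which it was seen.
sightings = defaultdict(list)
for lat, lon, macs in scans:
    for mac in macs:
        sightings[mac].append((lat, lon))

ap_location = {
    mac: (sum(p[0] for p in pts) / len(pts), sum(p[1] for p in pts) / len(pts))
    for mac, pts in sightings.items()
}

def locate(visible_macs):
    """Locate a GPS-less phone by averaging the positions of the APs it sees."""
    pts = [ap_location[m] for m in visible_macs if m in ap_location]
    lat = sum(p[0] for p in pts) / len(pts)
    lon = sum(p[1] for p in pts) / len(pts)
    return lat, lon

print(locate(["aa:01", "aa:02"]))  # within a few hundred yards, typically
```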
Anything else? Did I forget something?

Student: RFID tags. When you buy things …

Andreas: Yes, RFID is a good example. You know these little chips; they cost two cents, five cents - the passive ones. When we drive over the Bay Bridge, many of us have an RFID chip there that gets read out, and the toll is automatically deducted from our account. Wal-Mart required RFID chips two or three years ago on every single container that was shipped - not every single shirt, but every single container. There is a future store that Metro Group has in Germany, where every single item in the store has an RFID. The prices are also dynamic. The prices could change, so as I walk through the aisle and think about getting a yogurt, and then I go for the non-fat yogurt, and then walk by the aisle where they have dieting pills, those might be blinking in a projection down on the floor, and the price might be going up, because they know that maybe for me it's more important than for other people. It's called the Future Store, the one Metro Group had, and they actually got an award from a German consumer group: the Big Brother Award. [Laughter] On the other hand, checkout is easier, and it remembers what I've been looking at.

Now, just to close my little introduction here: day one of Amazon.com was already in that situation, because Amazon, of course, knows and has a data strategy that highly leverages all the things you have looked at. People who bought X also bought Y, and people who looked at X eventually bought Y. All these things are like RFIDs in the physical store. As we have seen in a number of cases, sometimes the physical world and the digital world - it's just that if you have a physical embodiment, you think about the world differently, but there is some convergence.
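As a toy illustration of the "people who bought X also bought Y" idea - the orders and item names below are invented, and this is only the flavor of such a system, not Amazon's actual algorithm - you can count how often pairs of items co-occur across baskets and recommend the most frequent companions.

```python
# Item-to-item co-occurrence counting; the orders below are invented.
from collections import Counter
from itertools import combinations

orders = [
    {"book_a", "book_b", "cd_x"},
    {"book_a", "cd_x"},
    {"book_b", "cd_y"},
    {"book_a", "book_b"},
]

co_counts = Counter()
for basket in orders:
    for x, y in combinations(sorted(basket), 2):
        co_counts[(x, y)] += 1
        co_counts[(y, x)] += 1

def also_bought(item, k=2):
    """Top-k items most often bought together with `item`."""
    pairs = [(y, n) for (x, y), n in co_counts.items() if x == item]
    return sorted(pairs, key=lambda p: -p[1])[:k]

print(also_bought("book_a"))  # e.g. [('book_b', 2), ('cd_x', 2)]
```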
Alright, I think I have done enough to warm us up for certainly the best female speaker of the quarter - because she's the only female speaker we had - but I have to say, my favorite speaker this quarter: Cynthia Dwork.

0:18:05.5
Cynthia and I met at a conference two years ago in San Diego. I forgot; you did cryptography or something like this. Were you still at IBM, or were you at Microsoft? We shared a cab and she gave me her mobile number in case we ended up on two different flights. If one of us actually found a seat on the plane and the other one didn't, then we could communicate, so I had her mobile number. When it came to who I wanted last year, I thought, "She isn't going to talk in my class." She said, "Andreas, since you asked so nicely, I will be coming." There is a talk Cynthia gave about a year ago at the Almaden Institute, and I have put the link to the video from that talk on the wiki. I was absolutely amazed by what I learned in that talk: even with thoughtfully anonymized data, with a bit of side information you can find out so much more about the people. In retrospect, I was very happy about something that happened when I was at Amazon. I wanted to release a couple of data sets to the academic community. I prepared them well, and then at the last moment, for some reason, somebody came up with "Maybe we shouldn't do this." In retrospect, I'm super happy that I never released anything, because otherwise, I know, I would be on Cynthia's list today. It's a pleasure having you here. Let's welcome Cynthia. [Applause]

Cynthia: Thank you for that introduction and the interesting discussion you had. I am not going to stand up here and be controversial about whether information is good or bad. I'm going to talk to you about privacy, about getting precise about what privacy could possibly mean, and about what we might be able to do about it in a very restricted setting.

Privacy, the word, means a lot of different things to different people. It comes up, as you've just heard and have been discussing, in many different contexts. A lot of times, when people talk about privacy, they're also thinking about security: the data are sitting somewhere and someone is going to get in and see them. In order to clean up the picture and distill it as much as possible, I'm going to focus on a very pure privacy problem. In this problem, some trusted and trustworthy curator has gathered a bunch of information. It might be Alfred Kinsey, who has gathered a lot of information about the sexual practices of individuals. It might be a hospital that has a lot of information about different treatments for a specific disease. It's a pure problem in the sense that, for now, let's assume the curator is really trustworthy. There is no malice going on. The goal is to release statistical information in some way that protects the privacy of individuals. We need a rigorous definition of what privacy is, because otherwise we can't really say whether they succeeded or failed. We want a mathematical definition of privacy.

We want to be able to do things like find statistical correlations, say correlating a cough outbreak with a chemical plant malfunction. If you're familiar with HIPAA regulations and how you can release sanitized data by removing certain fields from the data and releasing the results: under the Safe Harbor Provisions, they have to remove geographic information, so this correlation is something you couldn't do.

0:22:36.3
We want to be able to notice events, like detecting spikes in emergency room admissions for asthma, and to do various kinds of data mining tasks, clustering, and official statistics. You might ask, "Hasn't this been done before?" The answer is of course it has; in fact, especially in the context of official statistics - the census bureau, department of education, labor statistics, IRS information - this has been under the purview of the statisticians for several decades. I don't have time here to say what went wrong, but what I can say is that modern cryptography has developed a strong language and understanding of information leakage, so it's not very surprising that it was possible to add something to the conversation.

A brief outline: when we talk about some broken privacy cases, I'm going to say that what went wrong was that the nature of the privacy promise was wrong. I will propose one right privacy promise. I'm not saying it's the only one.
It happens to be the only one I know at the moment, but there could well be others. I'll say a couple of words about how to achieve it, some limitations, and future work.

You could think of two models. One of them is a non-interactive model, where we have the database on the left and the curator in the middle; the curator takes the database, applies some kind of sanitization process to it, and releases the sanitized database, which the data analyst, represented by the question mark, can interact with to her heart's content. We don't need the real data anymore; they can just go away. This is typically what people have done when you hear terms like "anonymized" or "de-identified" data, or scrubbed, or sanitized. There is another possibility - open your minds to another possibility - which is an interactive model. Somebody asks a question of the curator who is holding the data, and the curator computes the answer, may modify the question slightly or modify the answer slightly, and then releases the result. The data analyst, based on the question and the answer, figures out a new question, asks the new question, gets a possibly perturbed answer, and this goes on. It's interactive, and the data have to stay there.

Andreas: One interesting element is that the one who owns the data knows what question is being asked.

Cynthia: Yes, they know what questions are being asked. Ask me later about when we care and how we use that information.

0:25:46.8
I would like to point out something: the non-interactive problem is necessarily a lot harder. The reason is that the sanitized database that is released has to be able to answer all the questions the analyst might possibly come up with, even if the analyst is restricted to ten questions. Since the sanitizer, the curator, doesn't know in advance what those questions are, the released object has to have answers to all the possible questions, so that at least the first ten can get answered correctly; whereas in the interactive setting, you only need to answer the questions that are actually asked. That turns out to be an important difference.
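A minimal sketch of the two models as interfaces; the class and function names are mine, not from the talk, and `sanitize` and `perturb` are placeholders for whatever privacy mechanism is actually used.

```python
def sanitize(db):
    """Placeholder: in reality, noise or generalize every record."""
    return list(db)

def perturb(answer):
    """Placeholder: in reality, add calibrated noise (see the later sketch)."""
    return answer

class NonInteractiveCurator:
    """One-shot release: must anticipate every question in advance."""
    def release(self, db):
        sanitized = sanitize(db)
        return sanitized  # after this, the real data can "go away"

class InteractiveCurator:
    """Stays online; answers only the questions actually asked."""
    def __init__(self, db):
        self.db = db  # the data have to stay with the curator

    def answer(self, query):
        return perturb(query(self.db))

db = [(30, True), (45, False)]  # toy rows
print(NonInteractiveCurator().release(db))
print(InteractiveCurator(db).answer(lambda rows: sum(r[1] for r in rows)))
```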
The theme that I'm going to repeat is that data analysis doesn't happen in a vacuum. There is always other information around. We call this auxiliary information: information from any source at all other than the database under consideration. It could be other databases, including old releases of this one, and it could be future databases, released later. It could be newspapers, or things you've heard from insiders - people who know something about the data and will say, "That number was somewhere between 680 and 742." That could be a piece of useful auxiliary information; in fact, it could be used to attack specific schemes. It could be government reports, census website information, or inside information from a different organization - not the one that holds the database, but somebody at some other company; for example, the attacker might have Google's view of a user, as a Google employee.

A linkage attack is just a malicious use of auxiliary information. The idea is this: maybe you have a data set that is public and has some kind of innocuous information, and you have some other data set that has both innocuous and sensitive data. You can take these two datasets and link on the innocuous fields in order to learn who owns the sensitive piece of information.

Here is an example. How many people know this example? Do you know about the break on this? Great: the Netflix prize. Netflix recommends movies to its subscribers, and it is looking for an improved recommendation system. It offers a million dollar prize for a 10% improvement in its recommendation system. We are not concerned here with how they measure improvement. It has to publish training data so that you can develop a recommendation algorithm and test it on the training data. What does the Netflix prize rules page say? First of all, something about how much data there is: the training data set consists of more than a hundred million ratings from over 480 thousand randomly chosen, anonymous customers, on nearly 18 thousand movie titles. There is some discussion of what the ratings are, on a scale from 1 to 5 stars. To protect customer privacy, all personal information identifying individual customers has been removed, and all customer IDs have been replaced by randomly assigned IDs. One thing you should walk away from this talk with: if they tell you they have removed all personally identifying information, they're wrong. Don't believe it. Occasionally you'll be wrong, but most of the time you'll be right.

Here is a very relevant source of auxiliary information. People don't watch movies in a vacuum. They go to the Internet Movie Database and say things about the movies they've seen. Individuals can register for an account. How many of you have accounts on the IMDB? A couple? You don't have to be anonymous, and the visible material you leave there includes ratings, dates, and comments.

0:30:21.7
A very interesting paper by [0:30:25.9 Narayanan and Shmatikov] attacked the Netflix training data using the IMDB as auxiliary information. First, they did some analysis and made some nice observations. For example, given 8 movie ratings, of which they allowed 2 to be completely wrong, and dates that might have a 3-day error, they could still uniquely identify 96% of the Netflix subscribers. For 89%, just 2 ratings and dates are enough to reduce the set of plausible records from about 480 thousand down to 8, which can then be inspected by a human for further de-anonymization. They succeeded in identifying somebody in the Netflix data set, and they drew some conclusions about this person, and they're not flattering conclusions. I think they said he was homophobic, for example. I forget, maybe racist. This is a very important point, and it goes to what Andreas said: they may be right in their conclusions about him, and they may be wrong. Either way, this guy is harmed. I don't know what to make of that fact, but I think it's a really important statement.
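Here is a toy sketch of such a linkage attack in the spirit of the Netflix/IMDB example; all data, names, and IDs below are invented. The "anonymized" release and the public reviews are joined on the supposedly innocuous fields (movie, date, rating).

```python
# A toy linkage attack; all data below are invented for illustration.
import pandas as pd

# "Anonymized" release: customer names replaced by random IDs.
released = pd.DataFrame({
    "anon_id": [17, 17, 17, 42, 42],
    "movie":   ["Brokeback Mountain", "Fahrenheit 9/11", "The Matrix",
                "The Matrix", "Titanic"],
    "date":    ["2005-12-19", "2005-12-21", "2005-12-25",
                "2005-11-02", "2005-11-09"],
    "rating":  [5, 4, 3, 5, 2],
})

# Public auxiliary information: a few IMDB-style reviews under real names.
public = pd.DataFrame({
    "name":   ["alice", "alice", "bob"],
    "movie":  ["Brokeback Mountain", "Fahrenheit 9/11", "The Matrix"],
    "date":   ["2005-12-19", "2005-12-21", "2005-11-02"],
    "rating": [5, 4, 5],
})

# Link on the "innocuous" fields. The real attack tolerates rating and
# date errors; an exact join is enough for a toy example.
links = released.merge(public, on=["movie", "date", "rating"])

# Count how many of each person's public reviews hit each anonymous ID.
print(links.groupby(["name", "anon_id"]).size())
# alice matches anon_id 17 twice: her entire rating history is exposed.
```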
In fact, there is very interesting writing in the Philosophy of Law literature, by Ruth Gavison, who talks about privacy as protection from being brought to the attention of others. We'd like to be able to walk down the street without feeling as if people are staring at us, but beyond the manifest loss of privacy when people stare at you, this invites further compromise of your privacy. Once you have been drawn to the government's attention, for example, they may start investigating your phone records, where you've been, your Shanghai past, and so on.

Here are some other successful attacks that I'll mention briefly. One of the first attacks was done by Latanya Sweeney, and it was against anonymized HMO records. Are you familiar with this example? She cross-referenced the group medical insurance encounter data, which are public and had been "anonymized," with the records of the voter registration rolls. She identified the medical records of William Weld, who was the Governor of Massachusetts. She proposed a fix, a syntactic condition that released data sets should satisfy in order to protect privacy. That syntactic condition is called "k-anonymity." It turns out that k-anonymity doesn't protect privacy, and a few people observed this. Three people proposed an alternative, another syntactic condition called "l-diversity." Of course, that didn't quite work either. An attack was made against l-diversity by [0:33:47.9 Xiao and Tao], who proposed another variant called "m-invariance," yet another syntactic condition on data sets. And there are some techniques that go against all of these, by [0:33:58.7 Ganta, Kasiviswanathan,] and Smith.

0:34:00.3
Here is one more example: social network graphs. Suppose we have a friendship graph, where nodes correspond to users, and users can list others as friends, which creates an edge between them. The edges may be annotated with directional information: I have named Andreas as my friend, but he has not reciprocated. The question one might ask, if one is a researcher wanting to study social phenomena, is how frequently the friend designation is reciprocated. That's a perfectly nice statistical question. One idea for allowing this kind of research question is to publish an anonymization of the graph, which means we simply take all of the names and replace them with random IDs. This obviously permits analysis of the structure of the graph, and the privacy hope is that the randomization of the identifiers - the fact that the names have been obliterated - will make it hard or impossible to identify nodes with specific individuals. That way, you maintain the privacy of who is connected to whom. Those are the secrets: who is connected to whom.

Of course, this is disastrous, and it's vulnerable to both active and passive attacks. The point, again, is that this graph was not given to us transported through the vacuum of space from Mars. It is really here on Earth, and the people who want to attack it can, in fact, also create nodes and edges in it. Just to give you an idea, here is the flavor of the attack. Before release, you create a subgraph that has special structure. It's very small; it only has to be about 12 nodes. It's going to be highly interconnected internally, but very lightly connected to the rest of the graph. On the slide, the special graph is the light blue one, the thin connection to the rest of the graph is the green edge, and then there is the rest of the graph. Suppose we want to know whether some people, Steve and Jerry, are in contact with each other. We're going to take our attacking graph and pick two nodes in there, A and B. We're going to create an edge from A to S and from B to J - A to Steve and B to Jerry. Later, in the anonymized graph, if we can find the nodes A and B, then we can look at the edges that flow out of that subgraph and see whether S and J are connected. There is a magic step, which involves something called the [0:37:02.0 Gomory-Hu tree], which allows you to isolate lightly linked-in subgraphs from the rest of the graph. If we've chosen our blue graph correctly, its special structure will allow us to find A and B, and in fact figure out all the other nodes in there, and therefore figure out those connections out of the subgraph, to Steve and Jerry. Then we just check and see whether there is an edge between Steve and Jerry or not.
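A toy version of the active attack, assuming the networkx library; the sizes, seeds, and the 70% internal edge density are invented, and plain subgraph isomorphism stands in for the paper's cleverer search (it works here because the planted pattern is almost surely rigid and unique at these small sizes).

```python
# Toy active attack on an "anonymized" social graph.
import random
import networkx as nx
from networkx.algorithms import isomorphism

random.seed(0)
G = nx.gnm_random_graph(300, 900, seed=0)  # background social network
steve, jerry = 7, 11
G.add_edge(steve, jerry)  # the secret: are Steve and Jerry connected?

# Before release: plant 12 fake accounts, densely and randomly
# interconnected, lightly linked to the rest of the graph.
fakes = list(range(300, 312))
G.add_nodes_from(fakes)
for i in fakes:
    for j in fakes:
        if i < j and random.random() < 0.7:
            G.add_edge(i, j)
A, B = fakes[0], fakes[1]
G.add_edge(A, steve)  # A "friends" Steve
G.add_edge(B, jerry)  # B "friends" Jerry
pattern = G.subgraph(fakes).copy()  # attacker remembers this structure

# "Anonymization": every name is replaced by a random ID.
mapping = dict(zip(G.nodes(), random.sample(range(10**6), len(G))))
released = nx.relabel_nodes(G, mapping)

# Attack: locate the planted pattern in the released graph.
gm = isomorphism.GraphMatcher(released, pattern)
match = next(gm.subgraph_isomorphisms_iter())  # released id -> fake node
where = {fake: rid for rid, fake in match.items()}

# Follow the single edges leaving A and B to re-identify the targets,
# then read off the secret.
fake_ids = set(where.values())
steve_id = next(n for n in released.neighbors(where[A]) if n not in fake_ids)
jerry_id = next(n for n in released.neighbors(where[B]) if n not in fake_ids)
print(released.has_edge(steve_id, jerry_id))  # True: they are connected
```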
I promised you this was the last one, but there was another one I found that I really loved, which is anonymizing query logs via token-based hashing. Suppose you want to allow people to do studies of query logs - say you're Google or Microsoft and you have all of this information. Some people have proposed taking the logs, hashing each of the tokens, the big words in the log, to some random string, and releasing the results.

0:38:06.3
You can still get some kind of information about co-occurrences of the hashed terms. The search string is tokenized, tokens are hashed to identifiers, and you release it. This idea was successfully attacked by some people at Yahoo. It requires a piece of auxiliary information, such as a reference query log. Are you familiar with what happened with the AOL query logs? That anonymized log is good enough to allow an attack on almost anything else. At least, over time, the things people search for do change, so eventually the utility of the AOL published log will diminish, but it was good enough at the time this attack was done. The attack exploits co-occurrence information in the reference log to guess hash preimages. It turns out, as an interesting fact, that frequency statistics alone don't work.

What's going wrong? One thing is definitional failures. The guarantees are syntactic; they're not semantic. They don't have meaning; they're just forms: if the database satisfies these visibly checkable conditions, then blah. That was the case with k-anonymity, l-diversity, and m-invariance. Or the idea that it's enough to just remove the names and replace them with random strings - that's syntactic. It has nothing to do with understanding the relationships between the nodes in the friendship graph, for example. These notions are completely ad hoc: a privacy compromise is simply defined to be a certain set of undesirable outcomes, but no argument is made that this is an exhaustive set, that it's all you should care about, and that it actually captures privacy. The biggest flaw in all of these things is that auxiliary information is typically just not reckoned with.
It's a kind of studying the problem in vitro instead of in vivo, and it doesn't work. You might ask how we got into this situation in the first place - why settle for ad hoc notions of privacy? In the context of statistical databases, the statistician [0:40:34.7 Tore Dalenius] made a suggestion in 1977, and it's a beautiful suggestion; at least it sounds beautiful. He said: you want to have a statistical database? Fine, but anything that can be learned about an individual - he called them respondents, because the idea was that people would be responding to questions and their answers would be logged in the database - anything that can be learned about a respondent from the statistical database should be learnable without access to the statistical database. That is a beautiful definition. For example, I might be an extrovert and publish all kinds of intimate details about myself on my own website. People would learn bad things about me, or things someone might think I wouldn't want them to know, but that isn't the fault of some statistical database somewhere. We have to try to isolate what it is that the statistical database would be responsible for. Dalenius said the statistical database shouldn't give any other information about people. That sounds great - it would be great - but it's obviously silly, although we didn't understand it was silly until we actually tried to prove that things had this property. Why is it silly?

0:41:49.7
Suppose I come from Mars, and I think that everybody has two left feet. That's my prior belief: everybody has two left feet. Before I see this database, I think everybody has two left feet. Then I see this database, and I discover that almost everybody has one left foot and one right foot. I now know more than I would have known without access to this database - for example, about Andreas. Things have changed: my posterior view is different from my prior view. This is really unavoidable. Dalenius' goal, as nice as it is, is really not achievable. To understand why, I'll tell it as a story - the proof told as a parable - and I promise you this can be made very, very rigorous.

Suppose we have a database that teaches the average heights of demographic subgroups: the average American woman is this tall, the average French man is that tall. It sounds like it couldn't possibly violate privacy. There is a famous Fields Medal winner named Terence Tao. Suppose there is a piece of auxiliary information floating around which says that Terry Tao is two inches shorter than the average Swedish man, and suppose Tao is sensitive about his height and wants it kept secret. Somebody who has access to the database and can learn the height of the average Swedish man knows Terence Tao's height. Somebody who doesn't have access to the database knows a whole lot less about Tao's height. There is a real difference between what we can learn about Tao given access to the database and what we could have learned without the database. This is another silly example, but it can be made rigorous. You can always combine the information that actually is taught by the database with some bizarre piece of auxiliary information in order to cause a privacy compromise. Think about this example.
There is one remarkable thing about this example: Terry didn't have to be in the database for this to happen to him. His privacy, somehow or other, was compromised even though he wasn't in the database. By this point you must be saying this is completely absurd and obnoxious and we should stop, and basically you're right; that's what our reaction was. It doesn't really make any sense, and it doesn't work. What's going on here? The answer is that we have to change our goal, away from this old view of what the attacker knows before it interacts with the database or sees the sanitized data, versus after. Maybe we can come up with a different measure, a different goal. What about saying that joining the database shouldn't put me at any greater risk than I was at before I joined the database? If I can show that that's true, then this silly example with Terry's height will just go away.

0:45:50.8
At least we can say it's a reasonable goal for a database to have: joining the database shouldn't hurt you. The databases, we think, actually serve a social function. There is a reason why the census bureau collects information. It's not for its health; it's for the health of the country - we have to apportion resources. There is a reason why we want to understand genotype-phenotype correlations: we want to be able to design special drugs, treatments, screening programs, and things like that. If you believe the data sets have social utility, then you want some way to convince people to go ahead and give their data - at least, that the data will be handled in a way that is protective of their privacy, in the sense that you're in no more danger of anything bad happening to you by joining than if you don't join.

We call that "differential privacy." Instead of looking at what the adversary knows before versus after interacting with the dataset, we talk about the risk to an individual when the person is in versus not in the database. The intuition for differential privacy is that, whatever your data system is, its behavior should be essentially unchanged, independent of whether any individual or small group of individuals opts in or opts out of the dataset. There is a formal way of stating this guarantee. If I have some kind of database and a curator K, we say that K gives epsilon differential privacy if, no matter what the rest of the database is, for any value of me, and for any possible behavior of this database, when we look at the probability that this behavior occurs when I'm in the dataset - that's the numerator - versus when I'm not in the dataset - that's the denominator - this ratio is bounded by e to the epsilon. That is a technical, mathematical definition of privacy, and we'll get to why it's useful in a minute.
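In symbols - a reconstruction following the standard formulation, where D and D' differ only in my row (me in versus me out) and S ranges over sets of possible behaviors of the curator K:

```latex
\frac{\Pr[K(D) \in S]}{\Pr[K(D') \in S]} \;\le\; e^{\varepsilon},
\qquad\text{equivalently}\qquad
\Pr[K(D) \in S] \;\le\; e^{\varepsilon} \cdot \Pr[K(D') \in S].
```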
This was the key part of the equation. Let's just rearrange it by multiplying both sides of the inequality by the denominator - the red term on the slide - and we get the inequality on the bottom. Suppose something cannot happen if I'm not in the dataset; something bad can't happen to me if I'm not in the dataset. Then that red term is zero, which means the entire right-hand side is zero, so the left-hand side is at most zero, and because it's a probability, it can't be less than zero: it is exactly zero. For any finite epsilon, epsilon differential privacy guarantees that bad things that can't happen if I'm not in the dataset won't happen if I am in the dataset. That's already a strong guarantee. My claim is that this is an ad omnia, as opposed to ad hoc, guarantee. It captures reasonable notions of privacy.

To think about how this might work: the red curve is the probability of a response from the database when I'm not in it, and the black curve is the probability of these responses when I am in it. For any particular response, we know, by the differential privacy condition, that the ratio of those curves is bounded.

0:50:00.1
The intuition is that no perceptible risk is incurred by joining the dataset. Suppose I want to buy insurance, and there is a dataset somewhere. The price that I pay for my insurance might depend on the answers the insurance company gets when querying this database. I don't know how it makes up its mind, but it does various things, and it also queries the database. From my perspective, some of the answers the database gives might be bad; some of them aren't bad. The bad ones are the ones that will cause me to pay more for my insurance. The promise of differential privacy says that the probability of a bad response is almost the same whether I'm in the database or not. In a very strong sense, this database is not harming me. That technical condition bounds exactly the ratios of the probabilities of responses. Notice that this is also true independent of anything else the insurance company might know. If I have a database that behaves in this differentially private fashion, it behaves this way no matter what the attacker knows, so it neutralizes all linkage attacks. Linkage attacks don't hurt you if you have differential privacy. Also, for technical reasons related to the same property that neutralizes linkage attacks, it composes unconditionally and automatically. If there are other databases out there, it doesn't matter; mine is still epsilon differentially private.

Quickly, I'll give you just a flavor of how you can achieve this for certain kinds of questions. Here is a key technique; there are others, but this is a fundamental one. We're going to do something you've probably thought of, which is to add noise to the answer - but we have to do it in a very smart way, very carefully. Let's say the query is a counting query: how many people in the database are more than 4 feet tall and obese? That is just counting. You go through the list of all the people in the database, check whether they satisfy the property, and increment a counter for each one who does. There is a true answer, which is the counting function F applied to the database, and that's a real number. We're going to add noise to it. How much noise? How do we distribute the noise? What shape does the random noise generator have? We're interested in the question of how much the data of one person can affect the outcome. How much can F of the database when I'm in it differ from F of the database when I'm not in it?
In other words, what difference in the value of the function does the noise have to obscure, if we're adding noise in order to obscure my presence or absence? The sensitivity is the maximum, over all possible databases and all possible rows "me," of the difference in the value the function can take when "me" is in the database versus when "me" is not - the absolute value of that difference.

0:54:18.0
For a counting query, the sensitivity is just one. My presence or absence in the dataset can change the answer to the counting query "how many people in the database are over 4 feet tall and obese" by at most one. There is a certain distribution that we use, called the Laplace distribution, and this is what it looks like. It's centered at 0, where most of its mass is, and its probability drops exponentially with the distance from 0. There is a parameter you can vary that says how quickly it drops; you can scale it. The fundamental theorem that we use is that to achieve epsilon differential privacy, we use the Laplace distribution with a scaling parameter equal to the sensitivity of the function divided by epsilon. Small epsilon: more privacy, more noise to obscure more. Big epsilon: less privacy, noise more peaked at zero.

An important question, though, is what about multiple queries? If I take noise that is centered at zero and you get to sample it a lot, you could take the answers and average them, and you'll hone in on the true answer if I'm generating fresh noise each time. The answer is that it composes automatically. If you're epsilon differentially private and you ask two questions, the result is necessarily, at worst, 2-epsilon differentially private. Or, if you know you want to handle T counting queries and you have a privacy budget of epsilon, you can divide by T and use epsilon over T as your budget for each query. There is a way of actually handling something all of you have probably thought of intuitively, which is that there seems to be a tension between utility and privacy. That tension is embodied in epsilon. By scaling epsilon, you can have more utility and less privacy, or more privacy and less utility. That is a designer's choice, or a social choice; it's not a math choice. There is a time when you have to stop answering questions with any reasonable amount of accuracy. Not only is that intuitive, it's also rigorously demonstrable. There are various attacks in the literature saying: suppose you make the noise fairly small; how many questions does it take until I can completely break privacy? That has been studied thoroughly. This brings us back to the non-interactive versus interactive cases, where I said the non-interactive case is intrinsically harder because you have to answer everything at once. You're answering many questions, and we know from these sorts of results that this means you have to increase the noise a lot, even if the data analyst is only interested in ten of them, or square root of n of them.

0:57:48.6
It's possible to do a lot.
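A minimal sketch of the Laplace mechanism for a counting query, assuming numpy; the toy database, the epsilon value, and all names are illustrative, not from the talk.

```python
# Laplace mechanism for counting queries; data and epsilon are invented.
import numpy as np

rng = np.random.default_rng(0)

def laplace_mechanism(true_answer, sensitivity, epsilon):
    """Return true_answer plus Laplace noise scaled to sensitivity/epsilon,
    which is the fundamental theorem Cynthia states."""
    return true_answer + rng.laplace(loc=0.0, scale=sensitivity / epsilon)

# Toy database: one row per person, (height_in_feet, is_obese).
db = [(4.5, True), (3.9, False), (5.1, True), (4.2, True), (6.0, False)]

def count_query(db):
    """Counting query: how many people are over 4 feet tall and obese?"""
    return sum(1 for height, obese in db if height > 4 and obese)

# One person joining or leaving changes any count by at most 1,
# so the sensitivity of a counting query is 1.
sensitivity = 1.0
epsilon = 0.5  # smaller epsilon -> more privacy -> more noise

print("true answer:", count_query(db))
print("private answer:", laplace_mechanism(count_query(db), sensitivity, epsilon))
```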
We have a serious definition. We have a general approach to achieving it. We know how to do a lot of different data mining tasks in a differentially private fashion. There are various extensions that people have studied, which I won't go into here, and we have lower bounds on how much noise you have to add to protect privacy.

There are three things that I think are important for future work in this specific setting. I don't know exactly what questions people want to ask when analyzing social networks, so it's not clear to me whether these techniques will give good answers there. We can ensure the privacy, but I don't know if we can make things accurate enough for certain social networking questions; I just don't know. If you know what kinds of questions you want to ask, I want to hear them. I told you at the beginning that we were looking at the purest privacy problem. The notion of differential privacy - of trying to define privacy rigorously and having this differential view of it - definitely extends to other settings. Finally, we understand a lot about what it means to be differentially private. What I don't understand is what it means to fail to be differentially private. Is it always the case that this can be exploited to make something bad happen, or are there weaker notions of privacy that are still ad omnia - very general, feel-good - but not as strict as this? I don't know.

Andreas: Let's thank Cynthia for the talk. I am just as interested as I was last year. We have about ten minutes for a discussion. Please stay up here. I think we should open it up for your thoughts, rather than me trying to summarize what I learned, although I would love to do that. Are there any thoughts? Remember that notion of interactivity, when we talked about data analysis? We said it's not about making pretty graphs, which people always love to see; it's about interacting with the data. That's what the tools are for. It's not so much the real-time aspect as the interactional aspect. That was certainly reflected here as well.

Student: I have some questions; there are some things I didn't understand. Can I ask? For example, you made several statements about all these attacks happening even though Terence Tao's height is not in the database, and it could still be found. Can you say explicitly how?

Cynthia: You didn't understand the example?

Student: I figure it kind of correlated with …

Cynthia: Here, suppose somebody knows that Terence Tao is 2 inches shorter than the average Swedish man, and makes a public statement of this type. It is a toy example, but they make a public statement of this type. Now, Tao doesn't have to be in the database, but somebody who has access to the database learns the height of the average Swedish man and now learns Tao's height.

1:01:51.0
Someone who doesn't have access to the database can't learn Tao's height from that auxiliary information. It is contrived, but it can also be made rigorous; in general, it captures almost any notion of a privacy compromise.

Student: I had another question. Can you explain better what you mean by an interactive database?
Cynthia: For example, keeping the discussion to counting queries: let's say I have a medical database, and there are a couple of fields of interest, and somebody either satisfies them or doesn't. What I might do is take the data, which is now a collection of K bits in each row, add noise to those bits, and publish the results. That would be my non-interactive private database. Another possibility is that I handle questions of the form: how many people in the database have both the first and the third bits set to one? That's a number. I'd give an approximate answer to the number of rows in the database that satisfy that property, and then, based on that, you might say, "Okay, that's really interesting; now how many people have just the first bit set to one?" We would have a discussion like that, and it would be an interactive database, which means it's the way a [1:03:33.2 - 1:03:49.1 audio glitch].

Student: Leading back to differential privacy, the new concept: that doesn't include doing an attack like the one you just described, so it doesn't protect people from correlating information with outside information if they're not in the database?

Cynthia: It does.

Student: So how does it protect, leading back to her example? How does it protect it then?

Cynthia: Suppose I ensure that the transcript is differentially private. Let's say I'm going to let her ask five questions. I have some epsilon. I make sure that my noise is scaled to epsilon over 5 for each question, so the sequence is epsilon differentially private. This is true - it's a true statement - independent of what she actually knows from other sources. I don't care; my behavior satisfies the differential privacy guarantee. Let's put it this way: if there are certain unpleasant events that could happen to Andreas as a result of her questions, the chance that they'll happen satisfies the differential bound, whether he's in the database or he isn't, independent of what she knows. I've ensured this ratio of probabilities based on my behavior, my coin flips, and I don't care what she knows. That's hard to understand, isn't it? My behavior is essentially the same whether he's in the database or not.

Student: Regarding the average height of Swedish people, the height of this person who was 2 inches smaller, and see how that …

Cynthia: The question for the record is how the Terence Tao example fits in with the differential privacy guarantee. It doesn't say that no harm can come to Tao because of my database. It says that there is no additional harm that Tao suffers by joining the database. Do you understand now? This is true because my answers are more or less the same, independent of whether he's in the database.
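A brief sketch of the budget splitting Cynthia describes, reusing the hypothetical `laplace_mechanism`, `db`, and `count_query` from the earlier sketch: with a total budget epsilon and five agreed questions, each answer is noised at epsilon over 5, and sequential composition bounds the whole transcript at epsilon.

```python
# Hypothetical interactive curator that enforces a total privacy budget
# by splitting it evenly across an agreed number of questions.
class BudgetedCurator:
    def __init__(self, db, epsilon_total, num_questions):
        self.db = db
        self.per_query_epsilon = epsilon_total / num_questions
        self.remaining = num_questions

    def answer(self, counting_query):
        if self.remaining == 0:
            raise RuntimeError("privacy budget exhausted")
        self.remaining -= 1
        # Counting queries have sensitivity 1 (see the earlier sketch).
        return laplace_mechanism(counting_query(self.db),
                                 sensitivity=1.0,
                                 epsilon=self.per_query_epsilon)

curator = BudgetedCurator(db, epsilon_total=0.5, num_questions=5)
print(curator.answer(count_query))  # first of five, noisier, answers
```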
1:06:22.8
Student: So it's still possible …

Cynthia: It's still possible to do the attack, but there is no reason for Tao not to join the database. This gets back to the fact that the database has utility, and we've decided as a society that the utility is worth it. A really good example of this: suppose the database teaches us that smoking causes cancer. A prospective employer didn't know this before the database, has learned it, and therefore decides not to hire the smoker because of what the database has taught. We could also argue the other side: the database has now also taught the smoker something important, and the smoker might enter a smoking cessation program and so on. We do want to know these facts, so that people can choose their lifestyles accordingly.

Student: We talk about these things as attacks. Would we also have counterattack intelligence? For example, when Dr. Weigend leaves his home in Germany, and we know now he's gone and we're going to rob his house - don't we have all this other counter-data that says he's leaving the house, so let's send the security people or turn on the security alarm? Can we counteract, or take measures anticipating, particular data sets that are compromised?

Cynthia: You're back to your original position from earlier in the class. [Laughs] In some sense, what you're asking about has nothing to do with what I've been talking about. You're asking a completely different question, about whether it is possible to use information for good as well as for malicious ends. Of course it is.

Student: …

Cynthia: I don't know. I do want to go back to your general philosophy. I think it's a very good exercise, and not one that I have completed, to try to articulate what harms follow from loss of privacy. One kind of harm might be that somebody could be blackmailed. Another kind of harm might be simply that someone feels bad, and maybe that's harm enough. On the other side, you can ask what the benefits are: how many people do you want to kill by not allowing free access to this medical data in order to do scientific studies? That's the other side that I've heard. It's really worth trying to sit down and figure out what the losses are and what the gains are, and this has not been done by the computer scientists, it has not been done by the lawyers, and it has not been completely done by the philosophers, either. These are really important, real-life questions, and an articulation of this kind of list can help us make choices and help us figure out what we need to design protections against, or in support of. I did want to make that comment, since it's really important.

1:10:14.1
Andreas: I think that's a very important remark. It's not computer scientists, it's not algorithms, that will solve this. One of the main things I took away this time is that what privacy really means is normally not clear when people talk to each other. Prima facie good approaches just don't work. It's very interesting that when we talk about Twitter taking the private sphere to the public sphere, we don't really know what we're talking about.
Even on Facebook, when we super-simplified it and said Facebook is about C-to-C communication, where people actually know each other and are usually confirmed - not really. Take the fan page of the Social Data Revolution: everybody can see that. It is just a very complicated question, which you presented so well today: what does privacy actually mean, and what should it mean? We didn't think about this before class. On that note, let's give Cynthia a nice round of applause here.