Transcript of Andreas Weigend
Data Mining and E-Business: The Social Data Revolution
Stanford University, Dept. of Statistics
Andreas Weigend (www.weigend.com)
Data Mining and Electronic Business: The Social Data Revolution
STATS 252
June 10, 2009
Class 9 Privacy: (Part 1 of 2)
This transcript:
http://weigend.com/files/teaching/stanford/2009/recordings/audio/weigend_stanford2009_9.1_privacy_2009.06.10.doc
Corresponding audio file:
http://weigend.com/files/teaching/stanford/2009/recordings/audio/weigend_stanford2009_9.1_privacy_2009.06.10.mp3
Next Transcript - Mobile: (Part 2 of 2):
http://weigend.com/files/teaching/stanford/2009/recordings/audio/weigend_stanford2009_9.2_mobile_2009.06.10.doc
To see the whole series: Containing folder:
http://weigend.com/files/teaching/stanford/2009/recordings/audio/
Transcript by Tamara Bentzur, http://outsourcetranscriptionservices.com/
Andreas:
Welcome to the last class, spring 2009 Data Mining and E-Business: The Social Data
Revolution, here at Stanford. It is class number nine. We have a fairly tight agenda for
today. We will start, in the first half, by thinking about the question I emailed out last
night; where do you feel that some digital exhaust is being sniffed? Where do you feel
that somebody is going through your digital garbage?
Did anybody have time to think about that?
Student:
I kept feeling Gmail is tracking my emails and Google Docs. I’m afraid of IP leaks
through that.
Andreas:
There are two things here. One is that email has pretty much all of our secrets in it. One concern is that Google itself could potentially look at it, or the spooks could. The other is a security breach, which of course would never happen anywhere, including at Google. Security breaches have happened, by the way, with social security numbers both at Stanford and at Berkeley; they managed to get out twice. That is a very different story.
Think beyond email: what is it that Google knows? Google knows all your queries, all your desires, all the things you really wouldn't share with your closest friend, but Google knows them.
Student:
With the coming of Google Voice, all your phone -
Andreas:
Google Voice, yes, short messages. It understands your implicit calling pattern in
addition to your implicit emailing pattern.
Student:
Chat logs
Andreas:
Chat logs, that’s another dimension. There is a company called Thirty-three - something.
I need to look it up. I just talked to the CEO a few days ago. He claims to have access to
AOL and all the big companies that do chat, and only looks at header information, but
header information is interesting enough.
Gary, where are you worried about someone sniffing your digital exhaust?
Gary:
My search browser history, what I search on YouTube, what I search on Google.
Andreas:
Yes, YouTube, that's a beautiful example. Which of you would be comfortable having all your searches on YouTube show up on the screen right now? [Laughter] You would be comfortable? Having your searches on YouTube?
Student:
I try not to share truly personal content on the web.
0:02:59.4
Andreas:
In the Hong Kong example, Yahoo China managed to get somebody into prison
because he emailed or published - I forgot the details from four years ago, something
that was supposed to be internal within the Chinese government. Despite actually
having been in Hong Kong, not in mainland China, this guy still was delivered to
the mainland Chinese authorities.
Think very differently now. Think about your mobile. Your mobile knows so much
about you. It knows where you are as opposed to where you say you are. What
digital exhaust does your mobile create?
Student:
They know all the businesses you go to and how you get to them. It’s in the same way
you have this funnel leading to transactions on Amazon; it’s the same thing with mobile
but in a physical reality.
Andreas:
Those are applications that could potentially use it for advertising. Five years ago, in my class in Shanghai, I had a student who ran the Shanghai subway. He took me to the headquarters. They had a special car and so on. He drove around quite a bit to make sure I sort of lost my way, and he said, "Now we're here. Do you know where we are?" I said, "Yeah, that's the stall where I buy my CDs, on the other side of the street." He was seriously disappointed that he didn't manage to fool me.
In that building, they have data that you cannot believe about the behavior of each
individual. We all have what’s called a [0:04:41.2 Chinese phrase], which is the railway
card or subway card. As opposed to systems like San Francisco where you put in
quarters, which are anonymous, or New York where you do have a card but it doesn’t
relate to your identity, the Shanghai railway card is tied directly to your name. You can fill it up from your bank account, or have it deducted automatically. It's very convenient for the user.
I was swiping my card through that test slot and it told me where I was three years ago: at 4:53:60, I went through turnstile number 5 at whatever subway station it was, and 17 minutes later I left through a certain turnstile at another subway station.
On the one hand that’s interesting for understanding people movement and for
potential planning, but on the other hand they do know a lot of information about
us, such as your credit cards, your purchasing history, where else is there digital
exhaust? We’re talking about what digital exhaust we produce here and I want to get the
bigger spectrum of things here.
Student:
…
0:06:27.5
Andreas:
Shopping, now I’m surprised I haven’t heard the F word yet, which is Facebook, just to
make sure you know what I’m talking about. [Laughter] Facebook - it does know a lot
about us. If you look at peoples’ friends and see I think we learn so much about
them. On the other F word, Flickr, by the pictures we post - let me tell you a story
here.
I went to a party about two or three years ago in San Francisco, on Market Street. I saw a guy there, went up to him and said, "Hi, I'm Andreas." He said, "I know." I'm sorry, I forgot - it happens; I forget people. I said, "What's your name?" He said, "[0:07:17.9 unclear]." I said, "I do consulting and teach at Stanford." "I know." "What do you do?" "I work at Google." Okay. Another attempt - "I used to be at Amazon." "I know." After a few moments like this, I got sort of surprised just how much he knew about me. He even knew who was at my birthday party, because he had looked people up on Flickr - that was pre-Facebook tagging. He didn't know people's names, but he pretty much knew who my friends were.
At the end of the day, I said, "I'm on foot here. I take Muni; do you mind driving me home?" He said sure. He drove me home, stopped, and said, "This is where you live, right?" I said yes. It turned out he listens to my mp3s. It's very interesting how much people really can find out about us. There was nothing sinister about it, but if he were not a great guy - and I do know where he works and who he works for, just to be on the safe side - if he were spooky, I might be somewhat worried about it. That was about four years ago.
Student:
You seem to be setting up this question somewhat asymmetrically, alluding to the scariness of it. There are people who know more about you, but I'd like to get over this hump to the future where everybody's information is just as easily accessible, so if you're going to use it against me, I know where you are and I know where you live. When does it become symmetrical, whether that's government versus the population or population versus population? At what point in this science fiction journey is nobody afraid, because you can't get me since I know just as much about you?
Andreas:
That’s a good point. I know that about ten years ago I gave a keynote in Germany.
Germany has very different sensibilities about privacy than here. I always put my
schedule on the web, as you know; www.weigend.com/itinerary tells you the next flight
I’m on. People were so worried about it being a security risk. I asked them, “What do
you mean by security risk? Do you think someone will do something to the plane just
because I’m on it? I don’t think I’m anyone important.” “No, about somebody breaking
into your house.” I said, “Great, if somebody wants to break into my house, it’s better to
break into my house when I’m not there than when I’m there. If I have the choice, please
figure out the time when I’m not there.”
0:10:15.0
Last year, Ryan Mason, who is one of the graders this year, actually had me carry a [0:10:20.9 unclear] device - a little orange device with unfortunately pretty flaky software - that every fifteen minutes, via satellite, reported to my website where I was. Cynthia would have known exactly when I was leaving home today. It's pretty convenient, because you can look at an [0:10:40.1 unclear] and, let's say if you're consulting, know exactly the times you arrived at a spot and exactly the times you left it, for the last half year or so you've been carrying it.
There is a lot of asymmetry here. My problem is what if somebody actually fakes
the information. Let’s say somebody robs a bank. Now, they fake my information
and claim that it was me who robbed the bank.
I talked to a friend of mine last year who solved the problem for me, which Cynthia will
unsolve in her talk. The problem was: how can I prove that it wasn't me who robbed the bank - that I actually was in Los Angeles visiting my friend, as opposed to being in San Francisco when the bank was robbed? I was stuck with that question, and Greg Wolfe said, "Andreas, the solution of course is more data, namely constantly collecting those data and hashing them into some space, and then being able to compare the probability of him being in Los Angeles versus the probability of him being in San Francisco. If you collect more data, you can say with high probability that he really was not in San Francisco that day."
It's similar to our problem with information overload, where we asked how we deal with it, and the answer is by creating more data - metadata - by allowing websites to observe what we're doing and using machine learning; that is probably the only approach I can see. Plus the social graph, which is one way, in social data, of sharing stuff.
Student:
I really want to go back to Robert's point. Maybe there will be symmetry in consumer-to-consumer interaction or user-to-user interaction, but what about user-to-government? At the end of the day, when you look at what happened … and China… at some point there are entities that break the symmetry, and there are a lot of reasons to do that, from disease control, to terrorism control, to economic control. I think that reaching symmetry at the level of the government would require a very radical change in government structure, and I don't think that will happen.
Andreas:
We see it in some way in banking. UBS, for instance, probably has to hand over records of people here, and whether that is the right thing or the wrong thing, I'm not one to judge, but if people break the law by not paying taxes or whatever the story is there, it's understandable that the US government uses its leverage and tells the bank, "If you want to do business here, I'm sorry, but you have to obey the rules we have in this country."
Are there any pockets of data that we haven't even thought about when we're talking about digital exhaust? The data that we don't actively contribute, knowingly and willingly, as I always say, but that we implicitly contribute?
Student:
GPS and Wi-Fi, do they know we’re in this group…
0:14:25.2
Andreas:
Of course. There is a German company called Plazes, which has a mapping of Wi-Fi hot spots. It's very simple, and that is a smooth transition into the talk. If you have some other source of the information - because some people might have an iPhone or might tag this explicitly - so that you have both GPS information and Wi-Fi information, suddenly you can build a pretty good idea of where the different base stations are.
As you know from Google Latitude - who uses that, actually? One person? Google Latitude works even with phones like mine that have the GPS chip disabled for marketing and revenue reasons; down the road they hope to charge a premium by selling the new version of the phone, which is the same phone but with the GPS chip enabled. Even there, it is typically pretty good at knowing where I am to within a few hundred yards. That is based on mapping - on having auxiliary data that includes both GPS location data and Wi-Fi data.
Anything else? Did I forget something?
Student:
RFID tags. When you buy things …
Andreas:
Yes, RFID is a good example. You know these little chips - they cost two cents, five cents, the passive ones. When we drive over the Bay Bridge, many of us have an RFID chip there that gets read out, and the toll is automatically deducted from our credit card. Wal-Mart required RFID chips two or three years ago on every single container that was shipped - not on every single shirt, but on every single container.
There is a future store MetroGroup has in Germany, where every single item in the store has an RFID tag. The prices are also dynamic. The prices could change, so as I walk through the aisle thinking about getting a yogurt, then go for the non-fat yogurt, and then walk by the aisle where they have dieting pills, those might be blinking - there is a projection down on the floor - and the price might be going up, because they know that maybe for me it's more important than for other people. It's called the Future Store, and MetroGroup actually got an award from some German consumer group for it: the Big Brother Award. [Laughter]
On the other hand, checkout is easier, and it remembers what I've been looking at. Now, just to close my little introduction here: day one of Amazon.com was already in that situation, because Amazon of course knows, and has a data strategy that highly leverages, all the things you have looked at. People who bought X also bought Y, and people who looked at X eventually bought Y. All these things are like RFIDs in the physical store. As we have seen in a number of cases, the physical world and the digital world - it's just that if you have a physical embodiment you think about the world differently, but there is some convergence.
Alright, I think I have done enough to warm us up for certainly the best female speaker in
the quarter, because she’s the only female speaker we had, but I have to say my favorite
speaker in this quarter, Cynthia Dwork.
0:18:05.5
Cynthia and I met at a conference two years ago in San Diego. I forget; you did cryptography or something like this. Were you still at IBM, or were you at Microsoft? We shared a cab, and she gave me her mobile number in case we ended up on two different flights. If one of us actually found a seat on the plane and the other one didn't, then we could communicate, so I had her mobile number.
When it came to who I wanted last year, I thought, "She isn't going to talk in my class." She said, "Andreas, since you asked so nicely, I will be coming." There is a talk Cynthia gave about a year ago at the Almaden Institute, and I have put the link to the video of that talk on the wiki. I was absolutely amazed by what I learned in that talk about what you can do with thoughtfully anonymized data: with a bit of side information, you can find out so much more about the people. When I was at Amazon, I wanted to release a couple of data sets to the academic community. I prepared them well, and then at the last moment, for some reason, somebody came up with "Maybe we shouldn't do this." In retrospect, I'm super happy that I never released anything, because I know otherwise I would be on Cynthia's list today.
It’s a pleasure having you here. Let’s welcome Cynthia here. [Applause]
Cynthia:
Thank you for that introduction and the interesting discussion you had. I am not going to
stand up here and be controversial about whether information is good or bad. I’m going
to talk to you about privacy and getting precise about what privacy could possibly
mean and what we might be able to do about it in a very restricted setting.
Privacy, the word, means a lot of different things to different people. It comes up,
as you’ve just heard and have been discussing, in many different contexts. A lot of
times, when people talk about privacy, they’re also thinking about security. The
data are sitting somewhere and someone is going to get in and see them.
In order to just try to clean up the picture and to distill it as much as possible, I’m going to
focus on a very pure privacy problem. In this problem, some trusted and trustworthy
curator has gathered a bunch of information. It might be Alfred Kinsey, who has gathered
a lot of information about the sexual practices of individuals. It might be a hospital that
has a lot of information about different treatments for a specific disease. It's a pure problem in the sense that, for now, let's assume that the curator is really
trustworthy. There is no malice going on. The goal is to release statistical
information in some way that protects the privacy of individuals. We need to have
a rigorous definition of what privacy is because otherwise, we can’t really say if
they succeeded or failed. We want a mathematical definition of privacy.
We want to be able to do things like find statistical correlations - say, correlating a cough outbreak with a chemical plant malfunction. If you're familiar with HIPAA regulations and how you can release sanitized data by removing certain fields from the data and releasing the results: under the Safe Harbor Provisions, they have to remove geographic information, so this correlation is something that you couldn't do.
0:22:36.3
We want to be able to notice events, like detecting spikes in emergency room admissions for asthma, and to handle various kinds of data mining tasks, clustering, and official statistics.
You might ask, “Hasn’t this been done before?” The answer is of course it has, and in
fact, especially in the context of official statistics, such as the census bureau, department
of education, labor statistics, IRS information; this has been under the purview of the
statisticians for several decades.
I don’t have time here to say what went wrong, but what I can say is modern
cryptography has developed a strong language and understanding of information
leakage. It’s not very surprising that it was possible to add something to the
conversation.
A brief outline: when we talk about some broken privacy cases, I'm going to say that what went wrong was that the nature of the privacy promise was wrong. I will propose one right privacy promise. I'm not saying it's the only one; it happens to be the only one I know at the moment, but there could well be others. I'll say a couple of words about how to achieve it, some limitations, and future work.
You can think of two models. One of them is a non-interactive model: we have the database on the left and the curator in the middle, and the curator takes the database, applies some kind of sanitization process to it, and releases the sanitized database, which the data analyst, represented by the question mark, can interact with to her heart's content. We don't need the real data anymore; they can just go away. This is typically what people have done when you hear terms like "anonymized" or "de-identified" data, or scrubbed or sanitized.
There is another possibility; open your minds to another possibility, which is an
interactive model. Somebody asks a question to the curator who is holding the
data and the curator computes the answer, may modify the question slightly or
modify the answer slightly, and then releases the results. This data analyst, based
on the question and the answer, figures out a new question, asks the new
question, gets a possibly perturbed answer, and this goes on. It’s interactive and
the data have to stay there.
Andreas:
One interesting element is that the one who owns the data knows what questions are being asked.
Cynthia:
Yes, they know what questions are being asked. Ask me later about when we care and how we use that information.
0:25:46.8
I would like to point out something; the non-interactive problem is necessarily a lot
harder. The reason is this sanitized database that is released has to be able to
answer all the questions that the analysts might possibly come up with, even if the
analyst is restricted to ten questions. Since the sanitizer, the curator, doesn’t
know in advance what those questions are, the released object has to have
answers to all the possible questions so that the first ten at least, can get
answered correctly; whereas, in the interactive setting, you only need to answer
the questions that are actually asked. That turns out to be an important difference.
The theme that I’m going to repeat is that data analysis doesn’t happen in a
vacuum. There is always other information around. We call this auxiliary
information. It’s information from any source at all, other than the database under
consideration. It could be other databases, including old releases of this one, and it could
be future databases, later. It could be newspapers, things you’ve heard from insiders,
people who know something about the data who will say, “That number was somewhere
between 680 and 742.” That could be a piece of useful auxiliary information. In fact, it
could be used to attack specific schemes. It could be government reports, census
website information, or inside information from a different organization, not the one that
holds the database, but somebody at some other company, for example, Google’s view
of the attacker or user as a Google employee.
A linkage attack is just a malicious use of auxiliary information. The idea is this;
maybe you have a data set that is public and it has some kind of innocuous
information. You have some other data set that has both innocuous and sensitive
data. You can take these two datasets and you can link on the innocuous fields in
order to learn who owns that sensitive piece of information.
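To make that concrete, here is a minimal sketch in Python of such a linkage join. Everything in it - the field names, the records, the name - is hypothetical, invented purely for illustration; it is not data from any real release.

    # Hypothetical illustration of a linkage attack: join a "de-identified"
    # release to a public dataset on shared quasi-identifier fields.
    released = [  # names removed, but quasi-identifiers and a sensitive field kept
        {"zip": "94305", "birth": "1971-07-31", "sex": "F", "diagnosis": "asthma"},
        {"zip": "02138", "birth": "1945-07-31", "sex": "M", "diagnosis": "cancer"},
    ]
    public = [  # e.g. a voter roll: names plus the same innocuous fields
        {"name": "J. Doe", "zip": "02138", "birth": "1945-07-31", "sex": "M"},
    ]

    def link(released, public, keys=("zip", "birth", "sex")):
        """Return (name, sensitive record) pairs that agree on all key fields."""
        index = {tuple(r[k] for k in keys): r for r in released}
        return [(p["name"], index[tuple(p[k] for k in keys)])
                for p in public if tuple(p[k] for k in keys) in index]

    print(link(released, public))  # the 02138 record is re-identified

The innocuous fields carry no secrets on their own, but together they can single out one record in each dataset, which is exactly what the linkage exploits.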
Here is an example. How many people know this example? Do you know about the
break on this? Great. The Netflix prize. Netflix recommends movies to its subscribers.
It is looking for an improved recommendation system. It offers a million dollar prize for a
10% improvement in their recommendation system. We are not concerned here with
how they measure improvement.
It has to publish training data so that you can develop an algorithm for
recommending and then test it on the training data. What does the Netflix prize rules
page say? It says first of all, something about how much data there are. The training
data set consists of more than a hundred million ratings from over 480 thousand
randomly chosen, anonymous customers, on nearly 18 thousand movie titles. There is
some discussion about what the ratings are, on a scale from 1 to 5 stars. To protect
customer privacy, all personal information identifying individual customers has
been removed, and all customer IDs have been replaced by randomly assigned
IDs.
One thing you should walk away from this talk with: if they tell you they have removed all personally identifying information, they're wrong. Don't believe it. Occasionally you'll be wrong, but most of the time you'll be right.
Here is a very relevant source of auxiliary information. People don’t watch movies
in a vacuum. They go to the Internet Movie Database and they say things about the
movies they’ve seen. Individuals can register for an account. How many of you have
accounts on the IMDB? A couple? You don’t have to be anonymous and the visible
material that you leave there includes ratings, dates, and comments.
0:30:21.7
A very interesting paper by [0:30:25.9 unclear name] attacked the Netflix training
data, using the IMDB as auxiliary information. First, they did some analysis and made
some nice observations. For example, with 8 movie ratings, of which 2 are allowed to be completely wrong, and dates that may have a 3-day error, they could still uniquely identify 96% of the Netflix subscribers.
For 89%, just 2 ratings and dates are enough to reduce the set of plausible records from about 480 thousand down to 8, which can then be inspected by a human for further de-anonymization.
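As a rough illustration of the matching idea (not the authors' actual algorithm, and with entirely made-up data), one can score each anonymized record by how many of the attacker's observed (movie, rating, date) triples it approximately matches, tolerating a wrong rating and a few days of date error, and keep only the top handful of candidates for hand inspection:

    from datetime import date

    # Hypothetical sketch: score anonymized rating records against what an
    # attacker observed about a person (e.g. from public IMDB activity).
    def score(observed, record, date_slack_days=3):
        s = 0
        for movie, _rating, when in observed:   # ratings may be wrong, so we
            if movie in record:                  # only require the movie plus
                _, rec_date = record[movie]      # an approximately matching date
                if abs((rec_date - when).days) <= date_slack_days:
                    s += 1
        return s

    def shortlist(observed, records, keep=8):
        """Return the `keep` best-matching anonymized records for hand inspection."""
        return sorted(records.items(), key=lambda kv: -score(observed, kv[1]))[:keep]

    # Toy data: anonymized ID -> {movie: (rating, date)}
    records = {
        "id_17": {"Movie A": (5, date(2005, 3, 1)), "Movie B": (2, date(2005, 6, 9))},
        "id_42": {"Movie C": (4, date(2004, 1, 2))},
    }
    observed = [("Movie A", 5, date(2005, 3, 2)), ("Movie B", 1, date(2005, 6, 10))]
    print(shortlist(observed, records))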
They succeeded in identifying somebody in the Netflix data set and they drew
some conclusions about this person, and they’re not flattering conclusions. I think
they said he was homophobic, for example. I forget, maybe racist. This is a very
important point and this goes to what Andreas said; they may be right in their
conclusions about him, and they may be wrong. Either way, this guy is harmed. I
don’t know what to make of that fact, but I think it’s a really important statement.
In fact, there is very interesting writing in the Philosophy of Law literature, by Ruth
Gavison, who talks about privacy as protection from being brought to the attention
of others. We’d like to be able to walk down the street without feeling as if people
are staring at us, but in addition to the manifest loss of privacy when people stare
at you; this invites further compromise of your privacy. Once you have been
drawn to the government’s attention, for example, they may start investigating
your phone records, where you’ve been and your Shanghai past, and so on.
Here are some other successful attacks that I’ll mention briefly. One of the first attacks
was done by Latanya Sweeney and it was against anonymized HMO records. Are
you familiar with this example? She cross-referenced the group health insurance medical encounter data, which are public and had been "anonymized," with the voter registration rolls. She identified the medical records of William Weld, who was then the Governor of Massachusetts. She proposed a fix, a syntactic condition that released data sets should satisfy in order to protect privacy. That syntactic condition is called "k-anonymity."
It turns out that k-anonymity doesn't protect privacy, and a few people observed this. Three people proposed an alternative, another syntactic condition called "l-diversity." Of course, that didn't quite work. An attack was made against l-diversity by [0:33:47.9 unclear], and they proposed another variant called "m-invariance," another syntactic condition on data sets. There are some techniques that go against all of these, by [0:33:58.7 unclear] and Smith.
0:34:00.3
Here is one more example, social network graphs. Suppose we have a friendship
graph where we have nodes that correspond to users, and users can list others as
friends, which will create an edge between them. The edges may be annotated
with directional information. I have named Andreas as my friend but he has not
reciprocated. The question that one might ask, as a researcher wanting to study social phenomena, is how frequently the friend designation is reciprocated. That's a perfectly nice statistical question.
One idea for allowing this kind of research question is to publish an anonymization
of the graph, which means we simply take all of the names and replace them by
random IDs. This obviously permits analysis of the structure of the graph, and the
privacy hope is that the randomization on the identifiers, the fact that the names
have been obliterated, will make it hard or impossible to identify nodes with
specific individuals. That way, you maintain the privacy of who is connected to
whom. Those are the secrets; who is connected to whom.
Of course, this is disastrous and it’s vulnerable to both active and passive attacks.
The point again is that this graph was not given to us transported through the
vacuum of space from Mars. It is really here on Earth and the people who want to
attack it can also in fact, create nodes and edges in it.
Just to give you an idea, here is the flavor of the attack. Before release, you create a subgraph that has special structure. It's very small; it only has to be about 12 nodes. It's going to be highly interconnected internally, but very lightly connected to the rest of the graph. The special subgraph is the light blue one, the thin connection to the rest of the graph is the green edge, and then there is the rest of the graph.
Suppose we want to know whether some people, Steve and Jerry, are in contact with each other. We take our attacking subgraph and pick two nodes in it, A and B. We create an edge from A to S and from B to J - A to Steve and B to Jerry. Later, in the anonymized graph, if we can find the nodes A and B, then we can look at the edges that flow out of that subgraph and see whether S and J are connected.
There is a magic step, which involves something called the [0:37:02.0 Gomory-Hu tree], which allows you to isolate lightly linked-in subgraphs from the rest of the graph. If we've chosen our blue subgraph correctly, its special structure will allow us to find A and B, and in fact figure out all the other nodes in it, and therefore figure out those connections out of the subgraph to Steve and Jerry. Then we just check whether there is an edge between Steve and Jerry or not.
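Here is a very rough sketch in Python of just the planting half of that attack, with hypothetical names; locating the gadget again in the released graph (the "magic step") is the hard part and is omitted:

    import itertools
    import random

    # Hypothetical sketch: before release, plant a small, densely connected
    # gadget of fake accounts and attach two of its nodes to the targets.
    def plant_gadget(graph, targets, size=12, p_internal=0.9):
        gadget = [f"fake_{i}" for i in range(size)]
        for u, v in itertools.combinations(gadget, 2):
            if random.random() < p_internal:         # highly interconnected inside
                graph.setdefault(u, set()).add(v)
                graph.setdefault(v, set()).add(u)
        for anchor, target in zip(gadget, targets):   # one light edge per target
            graph.setdefault(anchor, set()).add(target)
            graph.setdefault(target, set()).add(anchor)
        return gadget                                 # remember its structure

    # After the anonymized graph is published, re-finding the gadget by its
    # distinctive structure would reveal which anonymous nodes touch Steve and
    # Jerry, and hence whether the edge Steve-Jerry exists.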
I promised you this was the last one, but there was another one I found that I really
loved, which is anonymizing query logs via token-based hashing. Suppose you
want to allow people to do studies of query logs - say you're Google or Microsoft and you have all of this information. Some people have proposed taking the logs, hashing each of the tokens - the words in the log - to some random string, and releasing the results.
0:38:06.3
You can still get some kind of information about co-occurrences of hashed terms. The search string is tokenized, the tokens are hashed to identifiers, and you release the result. This idea was successfully attacked by some people at Yahoo. It requires a piece of auxiliary information, such as a reference query log. Are you familiar with what happened with the AOL query logs? That anonymized log is good enough to allow an attack on almost anything else. Over time, the things that
people search for do change, so eventually the utility of the published AOL log will diminish. It was good enough at the time this was done.
It exploits co-occurrence information in the reference log to guess hash pre-images. It turns out, as an interesting fact, that frequency statistics alone don't work.
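To see why the hashed log still leaks, here is a minimal sketch of the token-hashing scheme itself (the salt, the queries, and the tokenization are hypothetical). The same word always maps to the same opaque identifier, so co-occurrence patterns survive, and a reference query log lets an attacker guess the pre-images:

    import hashlib

    SALT = b"hypothetical-secret"   # a real scheme would keep this key secret

    def pseudonymize(query):
        """Replace each token with a keyed hash; identical words get identical IDs."""
        return [hashlib.sha256(SALT + tok.encode()).hexdigest()[:12]
                for tok in query.lower().split()]

    log = ["cheap flights to paris", "paris hotels", "cheap hotels"]
    released = [pseudonymize(q) for q in log]
    # "paris" and "cheap" recur under the same hashed identifiers across queries,
    # so their co-occurrence statistics can be matched against a reference log.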
What’s going wrong? One thing is definitional failures. The guarantees are
syntactic. They’re not symantic. They don’t have meaning, they’re just forms. If
the database satisfies these visibly checkable conditions, then blah. That was the
case with K Anonymity, L Diversity, and M Invariance. Or the idea that it’s enough
to just remove the names and replace them with random strings, that’s syntactic.
That has nothing to do with understanding the relationships between these nodes
in the friendship graph, for example.
They’re completely ad hoc, so privacy compromised is simply defined to be a
certain set of undesirable outcomes, but there is no argument that is made that
says this is an exhaustive set. It’s all that you should care about and it actually
captures privacy.
The biggest flaw in all of these things is that auxiliary information is typically just
not reckoned with. It’s a kind of studying the problem in vitro instead of in vivo
and it doesn’t work.
You might ask how we got into this situation in the first place - why settle for ad hoc notions of privacy? In the context of statistical databases, the statistician [0:40:34.7 Tore Dalenius] made a suggestion in 1977, and it's a beautiful suggestion, at least it sounds beautiful. He said: look, you want to have a statistical database? Fine, but anything that can be learned about an individual - he called them respondents, because the idea was that people would be responding to questions and their answers would be logged in the database - anything that can be learned about an individual from the statistical database should be learnable without access to the statistical database. That is a beautiful definition.
For example, I might be an extrovert and publish all kinds of intimate details about
myself on my own website. People would learn bad things about me or things
someone might think I wouldn’t want them to know. That isn’t the fault of some
statistical database somewhere. We have to try to isolate what it is that the
statistical database would be responsible for.
Dalenius said, "The statistical database shouldn't give any other information about people." That sounds great - it would be great - but it's obviously silly, although we didn't understand it was silly until we actually tried to prove that things had this property. Why is it silly?
0:41:49.7
Suppose I come from Mars and I think that everybody has two left feet. That’s my
prior belief. Everybody has two left feet. Before I see this database I think
everybody has two left feet. Then I see this database and I discover that almost everybody has one left foot and one right foot. I now know more than I would have known without access to this database - for example, about Andreas. Things have changed. My posterior view is different from my prior view. This is really unavoidable.
Delinius’ goal, as nice as it is, is really not achievable. To understand why, I’ll tell it
as a story. The proof told as a parable, and I promise you this can be made very, very
rigorous. Suppose we have database that teaches average heights of demographic
subgroups. The average American woman is this tall; the average French man is that
tall. It sounds like it couldn’t possibly violate privacy.
There is a famous Fields Medal winner named Terence Tao. Suppose there is a piece of
auxiliary information floating around, which says that Terry Tao is two inches shorter than
the average Swedish man. Suppose Tao is sensitive about his height and he wants it
kept secret. Somebody who has access to the database and can learn the height of the
average Swedish man knows Terence Tao’s height. Somebody who doesn’t have
access to the database knows a whole lot less about Tao’s height.
There is a real difference between what we can learn about Tao, given access to
the database than what we could’ve learned without the database.
This is another silly example, but this can be made rigorous. You can always
combine the information that actually is taught by the database with some bizarre
piece of auxiliary information in order to cause a privacy compromise. Think about
this example. There is one remarkable thing about this example.
Terry didn’t have to be in the database for this to happen to him. His privacy,
somehow or other, was compromised even though he wasn’t in the database. By
this point you must say this is completely absurd and obnoxious and we should stop. Basically, you're right; that was our reaction too. It doesn't really make any sense, and it doesn't work. What's going on here? The answer is that we have to change our goal.
Instead of this old view - what the attacker knows before it interacts with the database or sees the sanitized data, versus after - maybe we can come up with a different measure, a different goal. What about saying that joining the database shouldn't put me at any greater risk than I was at before I joined? If I can show that that's true, then this silly example with Terry's height will just go away.
0:45:50.8
At least we can say it's a reasonable goal for a database to have: joining the database shouldn't hurt you. Databases, we think, actually serve a social function. There is a reason why the census bureau collects information. It's not for its own health; it's for the health of the country - we have to apportion resources. There is a reason why we want to understand genotype-phenotype correlations. We
want to be able to design special drugs or treatments or screening programs, and things
like that.
If you believe the data sets have social utility, then you want some way to convince people to go ahead and give their data: at least the data will be handled in a way that is protective of your privacy, in the sense that you're in no more danger of anything bad happening to you by joining than if you don't join.
We call that “differential privacy.” Instead of looking at what the adversary knows
before versus after interacting with the dataset, we talk about the risk to an
individual when the person is in versus not in the database.
The intuition for differential privacy is that whatever your data system is, its
behavior should be essentially unchanged, independent of whether any individual
or small group of individuals opt in or opt out of the dataset. There is a formal way
of stating this guarantee. If I have some kind of a database with a curator K, we say that it gives epsilon differential privacy if, no matter what the rest of the database is, for any value of my data, and for any possible behavior of this curator, the probability that this behavior occurs when I'm in the dataset - that's the numerator - versus when I'm not in the dataset - that's the denominator - is a ratio bounded by e to the epsilon. That is a technical, mathematical definition of privacy, and we'll get to why that's useful in a minute.
This was the key part of the equation. Let’s just rearrange it by taking the red term and
multiplying both sides of the inequality by the red term and we get the inequality on the
bottom.
Suppose something bad cannot happen to me if I'm not in the dataset. Then that red term is zero. That means the entire right-hand side is zero, so the left-hand side is at most zero because of the inequality, and since it's a probability it can't be less than zero - so it is zero.
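In symbols, the rearranged inequality reads:

    \[
    \Pr[K(\text{dataset with me}) \in S] \;\le\; e^{\varepsilon} \cdot \Pr[K(\text{dataset without me}) \in S],
    \]

so if the right-hand probability is zero for some bad set of outcomes S, the left-hand probability is forced to be zero as well.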
For any finite epsilon, epsilon differential privacy guarantees that bad things that
can’t happen if I’m not in the dataset won’t happen if I am in the dataset. That’s
already a strong guarantee.
My claim is that this is at least an ad omnia, as opposed to ad hoc, guarantee. It captures reasonable notions of privacy. To think about how this might work, the
red curve is the probability of a response from the database when I’m not in it, and
the black curve is the probability of these responses when I am in it. For any
particular response, we know, by the differential privacy condition, that the ratio of
those curves is bounded.
0:50:00.1
The intuition is that no perceptible risk is incurred by joining the dataset. Suppose
I want to buy insurance and there is a dataset somewhere. The price that I pay for my
insurance might depend on the answers the insurance company gets when querying this
database. I don’t know how it makes up its mind, but it does various things and it also
queries the database.
From my perspective, some of the answers that the database gives might be bad. Some
of them aren’t bad. The bad ones are the ones that will cause me to have to pay more
for my insurance. The promise of differential privacy says that the probability of a
bad response is almost the same whether I’m in the database or I’m not in the
database. In a very strong way, this database is not harming me. That technical
condition bounds exactly the ratios in the probability of response.
Notice that this is also true independent of anything else that the insurance
company might know. If I have a database that behaves in this differentially
private fashion, it behaves this way no matter what the attacker knows, so it
neutralizes all linkage attacks. Linkage attacks don’t hurt you if you have
differential privacy.
It also, for technical reasons related to the same ones that neutralize linkage attacks, composes unconditionally and automatically. If there are other databases out there, it doesn't matter; mine is still epsilon differentially private.
Quickly, I’ll tell you just a flavor of how you can achieve this for certain kinds of
questions. Here is a technique, a key technique. There are others but this is a
fundamental one. We’re going to do something you’ve probably thought of, which
is to add noise to the answer. We have to do it in a very smart way, very carefully.
Let’s say the query is a counting query. How many people in the database are
more than 4 feet tall and are obese. That is just counting. You go through the list of
all the people in the database and you check whether they satisfy the property or not and
you just increment a counter for each one. There is a true answer, which is the
function “F,” the counting function F applied to the database and that’s a real
number. We’re going to add noise to it.
How much noise? How do we distribute the noise? What shape does the random
noise generator have? We’re interested in the question of how much the data of
one person can affect the outcome. How much can F of the database when I’m in it
differ from F of the database when I'm not in it? In other words, what difference in the value of the function on the database does the noise we add have to obscure, if the noise is meant to hide my presence or absence?
The sensitivity is the maximum over all possible databases and all possible rows
“me” of the difference in the value that the function can have when “me” is in the
database versus when “me” is not in the database. It’s the absolute value of that
difference.
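As a formula (matching the verbal definition above), where D ranges over databases and "me" over possible rows:

    \[
    \Delta f \;=\; \max_{D,\ \text{me}} \bigl|\, f(D \cup \{\text{me}\}) - f(D) \,\bigr| .
    \]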
0:54:18.0
For a counting query the difference is just one. My presence or absence in the dataset
can change the answer to the counting query how many people in the database are over
4 feet and obese, by at most one. There is a certain distribution that we use, which is
called the Laplacian Distribution and this is what it looks like. It’s centered at 0;
that’s where most of its mass is and its probability drops exponentially as you
move away from 0. It drops exponentially with the distance from 0.
There is a parameter that you can vary that says something about how quickly it is
dropping. You can scale it. The fundamental theorem that we use is that to
achieve epsilon differential privacy, we use the Laplacian with a scaling parameter,
which is the sensitivity of the function divided by epsilon.
Small epsilon: more privacy, more noise to obscure more. Big epsilon: less privacy, noise more sharply peaked at zero.
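Here is a minimal sketch of that mechanism in Python, with hypothetical data; the standard result is that adding Laplace noise with scale (sensitivity / epsilon) to a query's true answer gives epsilon-differential privacy:

    import random

    def laplace_noise(scale):
        # The difference of two independent exponentials is a Laplace(0, scale) sample.
        return random.expovariate(1.0 / scale) - random.expovariate(1.0 / scale)

    def private_count(rows, predicate, epsilon, sensitivity=1.0):
        """Counting query (sensitivity 1) answered with Laplace(sensitivity/epsilon) noise."""
        true_count = sum(1 for r in rows if predicate(r))
        return true_count + laplace_noise(sensitivity / epsilon)

    # Hypothetical data: how many people are over 4 feet tall and obese?
    people = [{"height_in": 70, "obese": True}, {"height_in": 46, "obese": False}]
    print(private_count(people, lambda p: p["height_in"] > 48 and p["obese"], epsilon=0.1))

    # Smaller epsilon -> larger noise scale -> noisier, more private answers.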
An important question though is what about multiple queries? If I take noise that
is centered at zero and you get to sample this a lot, you could take the answers
and average them and you’ll hone in on the true answer if I’m generating fresh
noise each time.
The answer is that it composes automatically. If you're epsilon differentially private and you ask two questions, the result is necessarily at worst 2-epsilon differentially private; or, if you know you want to handle [0:56:00.4 T counting] queries and you have a privacy budget of epsilon, you can divide by T and use epsilon over T as your budget for each query.
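As a worked example of that budget arithmetic (with hypothetical numbers): take a total budget of epsilon = 1 and T = 10 counting queries, each of sensitivity 1. Then

    \[
    \varepsilon_{\text{per query}} = \frac{\varepsilon}{T} = 0.1,
    \qquad
    \text{Laplace scale per query} = \frac{\Delta f}{\varepsilon / T} = \frac{1}{0.1} = 10 .
    \]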
There is a way of actually handling something that all of you have probably
thought of intuitively, which is that there seems to be a tension between utility and
privacy. That tension is embodied in epsilon. By scaling epsilon, you can have
more utility and less privacy, or more privacy and less utility. That is a designer’s
choice, or a social choice. It’s not a math choice.
There is a time when you have to stop answering questions with any reasonable
amount of accuracy. Not only is that intuitive, but it's also rigorously demonstrable. There are various attacks in the literature asking: suppose you make the noise fairly small, how many questions does it take until I can completely break privacy? That has been studied thoroughly.
This brings us back to the non-interactive versus the interactive cases where I said
the non-interactive case was intrinsically harder because you have to answer
everything at once. You’re answering many questions and we know from these
sorts of results that it means you have to increase the noise a lot, even if the data
analyst is only interested in ten of them or square root “n” of them.
0:57:48.6
It’s possible to do a lot. We have a serious definition. We have a general approach
to achieving it. We know how to do a lot of different data mining tasks in a
differentially private fashion. There are various extensions that people have studied, which I won't go into here, and we have lower bounds on how much noise we
have to add to protect privacy.
There are three things that I think are important for future work in this specific
setting. I don’t know exactly what are the questions that people want to ask when
analyzing social networks so it’s not clear to me whether these techniques will give us
good answers. We can ensure the privacy but I don’t know if we can make things
accurate enough for certain social networking questions. I just don’t know. If you
guys know what kinds of questions you want to ask, I want to hear those.
I told you at the beginning we were looking at the purest privacy problem. The
notion of differential privacy - of trying to define privacy rigorously and taking this differential view of it - definitely extends to other settings.
Finally, we understand a lot about what it means to be differentially private. What I don't
understand is what it means to fail to be differentially private. Is it always the case
that this can be exploited to make something bad happen, or are there weaker
notions of privacy that still are ad omnia, very general, feel good, but are not as
strict as this? I don’t know.
Andreas:
Let’s thank Cynthia for the talk. I am just as interested as I was last year. We have
about ten minutes for a discussion. Please stay up here. I think we should open it up for
your thoughts rather than me trying to summarize what I learned again although I love to
do that. Are there any thoughts on that notion of interactivity we talked about in data analysis? We said it's not about making pretty graphs, which people always love to see,
but it’s about interacting with the data. That’s what the tools are for. It’s not so much
about the real time aspect, but the interactional aspect. That was certainly reflected here
as well.
Student:
I have some questions. There are some things I didn’t understand. Can I ask? For
example, you made several statements about all these attacks happening, even though
Terence Tao’s height is not in the database and could still be found. Can you say
explicitly how?
Cynthia:
You didn’t understand the example?
Student:
I figure it kind of correlated with …
Cynthia:
Here, suppose somebody knows that Terence Tao is 2 inches shorter than the average
Swedish man, and makes a public statement of this type. It is a toy example, but they
make a public statement of this type. Now, Tao doesn’t have to be in the database, but
somebody who has access to the database learns the height of the average Swedish
man and now learns Tao’s height.
1:01:51.0
Someone who doesn’t have access to the database can’t learn Tao’s height from that
auxiliary information. It is contrived, but it can also be made rigorous; in general, it captures almost any notion of a privacy compromise.
Student:
I had another question. Can you explain better what you mean by interactive
database?
Cynthia:
For example, just keeping the discussion at counting queries, let’s say I have a
medical database and there are a couple of fields of interest and either somebody
satisfies these things or doesn’t. What I might do is I might try to take the data,
which is now a collection of K bits in each row, and add noise to these bits and
publish the results. That would be my non-interactive private database.
Another possibility is I handle questions in the form of how many people in the
database have the first and the third bits set to one, both the first and the third bits
set to one? That’s a number. I’d give an approximate answer to the number of
rows in the database that satisfy that property, and then based on that, you might
say, “Okay, that’s really interesting; now how many people just have the first bit
set to one?” We would have a discussion like that and it would be an interactive
database, which means it’s the way a [1:03:33.2 - 1:03:49.1 audio glitch].
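A minimal sketch of such an interactive curator in Python, with hypothetical bit-vector rows and a per-question share of the epsilon budget (the bookkeeping details are an assumption for illustration, not something from the talk):

    import random

    def laplace_noise(scale):
        # The difference of two independent exponentials is a Laplace(0, scale) sample.
        return random.expovariate(1.0 / scale) - random.expovariate(1.0 / scale)

    class InteractiveCurator:
        """Answers up to max_questions counting queries over the rows,
        splitting a total privacy budget epsilon evenly across them."""

        def __init__(self, rows, epsilon, max_questions=5):
            self.rows = rows                        # e.g. each row is a tuple of k bits
            self.scale = max_questions / epsilon    # counting queries have sensitivity 1
            self.remaining = max_questions

        def how_many(self, predicate):
            if self.remaining == 0:
                raise RuntimeError("privacy budget exhausted; no more answers")
            self.remaining -= 1
            true_count = sum(1 for row in self.rows if predicate(row))
            return true_count + laplace_noise(self.scale)

    # Hypothetical dialogue: first and third bits both one, then just the first bit.
    curator = InteractiveCurator([(1, 0, 1), (1, 1, 1), (0, 0, 1)], epsilon=0.5)
    print(curator.how_many(lambda r: r[0] == 1 and r[2] == 1))
    print(curator.how_many(lambda r: r[0] == 1))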
Student:
I’m leading back to F differentiability, the new concept, so that doesn’t include doing an
attack like the one you just described, so it doesn’t protect people from correlating
information to outside information if they’re not in the database?
Cynthia:
It does.
Student:
So how does it protect, leading back to her example, how does it protect it then?
Cynthia:
Suppose I ensure that the transcript is differentially private. Let’s say I’m going to
let her ask five questions. I have some epsilon. I make sure that my noise is
scaled to epsilon over 5 for each question so the sequence is epsilon differentially
private. This is true. It’s a true statement independent of what she actually knows from
other sources. I don’t care. My behavior satisfies the differential privacy guarantee.
So, her ability to - let’s put it this way; if there are certain things, unpleasant events
that could happen to Andreas as a result of her questions, the chance that they’ll
happen satisfies the differential bound, whether he’s in the database or he isn’t in
the database, independent of what she knows. I’ve ensured this ratio of
probabilities based on my behavior, my coin flips, and I don’t care what she
knows. That’s hard to understand, isn’t it? My behavior is the same, essentially
the same, whether he’s in the database or not.
Student:
Regarding the average height of Swedish people, the height of this person who was 2
inches smaller, and see how that …
Cynthia:
The question for this record is how does the Terence Tao example fit in with the
differential privacy guarantee. It doesn’t say no harm can come to Tao because of
my database. It says that there is no additional harm that Tao suffers by joining
the database. Do you understand now? This is true because my answers are
more or less the same, independent of whether he's in the database.
1:06:22.8
Student:
So it’s still possible …
Cynthia:
It’s still possible to do the attack, but the database - there is no reason for Tao not
to join the database. This gets back to the fact that the database has utility and we’ve
decided as a society that the utility is useless so a really good example of this is suppose
the database teaches us that smoking causes cancer. A perspective employer didn’t
know this before the database and has learned this, and therefore decides not to hire the
smoker because of what the database has taught. We could also argue the other side,
that the database has also now taught the smoker something important, that the smoker
might now enter a smoking cessation program and so on, but we do want to know these
facts so that people can choose their lifestyles accordingly.
Student:
We talk about these things as attacks. Would we also have counterattack intelligence? For example, when Dr. Weigend leaves his home in Germany and we know he's gone and we're going to rob his house, don't we have all this other counter-data that says he's leaving the house - let's send the security people, or let's turn on the security alarm? Can we counteract, or take measures anticipating particular data sets that are compromised?
Cynthia:
You’re back to your original position, from earlier in the class. [Laughs] In some sense,
what you’re asking about has nothing to do with what I’ve been talking about. You’re
asking a completely different question about is it possible to use information for good as
well as for malicious ends. Of course it is.
Student:
…
Cynthia:
I don’t know. I do want to go back to your general philosophy. I think it’s a very good
exercise and it’s not one that I have completed, of trying to articulate what are the harms
that follow from loss of privacy. One kind of harm might be that somebody could be
blackmailed. Another kind of harm might be simply that someone feels bad, and maybe
that’s harm enough.
On the other side, you can ask what the benefits are - how many people do you want to kill by not allowing free access to this medical data for scientific studies? That's the other side I've heard. It's really worth trying to sit down and figure out what the losses and the gains are, and this has not been done by the computer scientists, it has not been done by the lawyers, and it has not been completely done by the philosophers either. These are really important, real-life questions, and an articulation of this kind of list can help us make choices and help us figure out what we need to design protections against or in support of. I did want to make that comment, since it's really important.
1:10:14.1
Andreas:
I think that’s a very important remark. It’s not computer scientists; it’s not algorithms that
will solve this. One of the main things I took away this time is that the whole notion of what privacy really means is normally not clear when people talk to each other. Prima facie good approaches just don't work. That's very interesting: when we talk about Twitter taking the private sphere to the public sphere, we don't really
know what we’re talking about. Even on Facebook, when we super simplified it
and said Facebook is about C-to-C communication where people actually know
each other and are usually confirmed - not really. Take the fan page of the Social Data Revolution: everybody can see that. It is just a very complicated question, which you presented so well today: what does privacy actually mean, and what should it mean? We didn't think about this before class.
On that note, let’s give Cynthia a nice round of applause here.