MS&E237
Spring 2010
Stanford University
Andreas S. Weigend, Ph.D.
The Social Data Revolution:
Data Mining and Electronic Business
Andreas Weigend (www.weigend.com)
The Social Data Revolution: Data Mining and Electronic Business:
MS&E 237, Stanford University
April 15, 2010
Class 6: People Discovery, Facebook PYMK
This transcript:
http://weigend.com/files/teaching/stanford/2010/recordings/audio/weigend_stanford2010_6peoplediscoveryFacebookPYMK_2010.04.15.doc
Corresponding audio file:
http://weigend.com/files/teaching/stanford/2010/recordings/audio/weigend_stanford2010_6peoplediscoveryFacebookPYMK_2010.04.15.mp3
To see the whole series: Containing folder:
http://weigend.com/files/teaching/stanford/2010/recordings/audio/
Course Wiki:
http://stanford2010.wikispaces.com
Transcript - Tamara Bentzur - Testimonials – www.tbentzur.wordpress.com, www.outsourcestranscriptionservices.com
http://weigend.com/files/teaching/stanford/2010/recordings/audio/weigend_stanford2010_6peoplediscoveryFacebookPYMK_2010.04.15.doc
Andreas:
Welcome to class 6 of the Social Data Revolution. The agenda for today is that I want to
give you some feedback on your 2-minute feedback about what was not clear and was
clear about last class. I will use that as a way to reflect on what was important.
I will briefly talk about the homework we are assigning today which is due a week from
today, which is our Bit.ly homework. I will ask you to get an early start because the Chief
Product Officer of Bit.ly will be in class next week on Tuesday. Any questions you have,
you will have the man himself here in class, ready to give answers.
The first thing we didn’t do a good job explaining was what metadata is. A number of
you asked if I could explain what the meaning of metadata is. Metadata is data about
data, such as annotations. If you have a file and you put some tags on the file, those
tags would be metadata.
Our example of RTN would be an example of metadata, not just doing a retweet but
also saying how important that retweet is for the audience. Music metadata would
be what genre the song is, or if you submit a paper to a journal, you are asked to
give topics which are examples of metadata.
What has changed there is that metadata creation can easily be crowdsourced:
people simply add metadata as they go along. Wikipedia is an example of
that which works quite well.
The second question was asking if I would explain PHAME better. When talking to
Lars, we decided his talk today would be a beautiful example of PHAME framed as
PHAME. If you are still not sure, at the end of the day, what problem, hypothesis,
action, metrics, and experiment means, then we’ll have that conversation
afterwards.
The most important point from last class was that production patterns have
shifted. It’s very easy to create data now. The distribution patterns have shifted
and, as we learned from Dan Olsen, the consumption patterns have
shifted too. What he called “information snacking” is an example: consuming small,
bite-sized pieces, snacks, as opposed to buying a book.
These are a few dimensions. Production has changed: we are all carrying devices
which make it easy for us to produce data. Distribution has changed: the cost of
communication has dropped. As a consequence of that, consumption has
changed.
The second point that was important about last class is to understand the
tradeoffs. Any product design is full of tradeoffs, and one of the questions we had
last class was whether we should surface the question of recency versus relevance to
the user. Should the user have a knob where he can say “I really want very recent
stuff, even if it’s not relevant,” or should we make the decision for him, or should
we learn from his past behavior?
0:04:03
For instance, if they always click on the most recent stuff, presumably they’re
more interested in recency than in relevance.
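A minimal sketch of the recency-versus-relevance knob discussed here. Everything in it is an assumption made for illustration: the `Item` fields, the 24-hour half-life, the linear blend, and the idea of estimating the knob as the share of past clicks that went to the newest item. It is not how any particular product does it.

```python
from dataclasses import dataclass

@dataclass
class Item:
    title: str
    relevance: float   # hypothetical relevance score between 0 and 1
    age_hours: float   # hours since the item was created

def blended_score(item, recency_weight, half_life_hours=24.0):
    """Mix recency and relevance; recency_weight=1.0 means 'only care about new'."""
    recency = 0.5 ** (item.age_hours / half_life_hours)  # 1.0 for a brand-new item
    return recency_weight * recency + (1.0 - recency_weight) * item.relevance

def rank(items, recency_weight):
    """Order items by the blended score, best first."""
    return sorted(items, key=lambda it: blended_score(it, recency_weight), reverse=True)

def learned_recency_weight(clicked_newest):
    """Learn the knob from behavior: fraction of past clicks that went to the newest item."""
    if not clicked_newest:
        return 0.5  # no history yet, so sit in the middle
    return sum(clicked_newest) / len(clicked_newest)

items = [Item("old but on-topic", 0.9, 72.0), Item("fresh but off-topic", 0.2, 1.0)]
print([it.title for it in rank(items, recency_weight=0.9)])
print([it.title for it in rank(items, recency_weight=0.1)])
print(learned_recency_weight([True, True, False, True]))
```

The same ranking function serves all three design options: the user sets `recency_weight`, the product fixes it, or it is estimated from clicks.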
For those of you with a statistics background, that is the old problem of the
bias-variance tradeoff. You can have a model that has a strong bias, which means you
have a small variance, but if it is the wrong model it’s the wrong model. Or you
can have a model which has a very low bias, which means you can model anything,
but the parameters are not determined as well.
Another version, on Wall Street, is the noise versus non-stationarity tradeoff: you might see
something and you don’t know if it’s just a fluctuation or if there is actually a
regime shift. If you go right after any little blip, then you are chasing noise, because
you assume the world is non-stationary. If on the other hand you look a long way
into the past, the problem is you miss small shifts. All of these are versions of the same
tradeoff: recency versus a larger dataset, bias versus variance, noise versus non-stationarity.
Are there any other tradeoffs which you would like to add? If you think about the
class prior to that when we talked about conversion rates and metrics at a
company like Amazon, for instance you can buy lots of hits to the site, by buying
unqualified traffic. The person who is responsible for hits to the site, for bringing
traffic to the site gets a big bonus, but the person responsible for conversion rates
is in big trouble. Those people you push to the site from random keywords… of
network, are not the ones who are going to sign up or buy.
Tradeoffs are important. Metadata is important. Understanding the economic
impact on production, dissemination, and consumption of data was another
important part.
A couple of minutes on homework three. As far as I can tell from what people told me,
you did a good job on the Delicious homework. Homework three is due one week
from today. I just talked to Bit.ly this morning and the data is being prepared as we
speak. Let me tell you what it is that you’ll be doing.
First of all, don’t worry about computational complexity. We have one very small data
set and we have one big data set. I don’t expect everybody to work on the big one. Play
with the small one. See what insights you can get, even if they’re far from statistically
significant.
To give you an example of why I say this, here is some data from an interview I gave last
year on NPR’s Marketplace. I put a Bit.ly link in there. Bit.ly tells us there were 112
clicks, and it also gives us the timeline of those clicks. To first order, we can
describe this by the area under the curve, which is the total number of clicks, and the decay
time: how fast do things age? Do people still listen to it half a year later? Hardly.
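A minimal sketch of the two summary numbers mentioned here, total clicks and decay time. It assumes the click timeline decays roughly exponentially after the link is created, in which case the mean delay between creation and click is the maximum-likelihood estimate of the decay time constant. The function name and the timestamps are made up for illustration; they are not the Bit.ly data.

```python
import math
from datetime import datetime, timedelta

def summarize_clicks(created, click_times):
    """Return (total clicks, decay time in hours, half-life in hours).

    Assumes an exponential decay of click rate after creation, so the mean
    delay is the estimated decay time constant.
    """
    delays = [(t - created).total_seconds() / 3600.0 for t in click_times]
    total = len(delays)
    decay_hours = sum(delays) / total if total else float("nan")
    return total, decay_hours, decay_hours * math.log(2)

# Hypothetical link created at 9:00 with four clicks 1, 2, 3 and 6 hours later.
created = datetime(2009, 6, 1, 9, 0)
clicks = [created + timedelta(hours=h) for h in (1, 2, 3, 6)]
total, decay_hours, half_life = summarize_clicks(created, clicks)
print(total, decay_hours, round(half_life, 2))
```

If the real timelines have a long tail or a second burst, a single exponential will fit poorly, and the mean delay is then better read as a rough scale than as a fitted parameter.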
0:08:01
I want you to think about whether you can build a model predicting the overall
number of clicks, as well as the decay time, and for that I am giving you a few
variables. The variables Jeremy is writing up are the long URL and the time stamp of
when it was encoded or created. Maybe Friday afternoon is not a good time for
press releases because they get stuck in email, and on Monday morning people are busy.
If you think about yourself as a marketer or working with a marketer, you want to help
them understand when they should send links out, and you want to get their buy-in on the
details you can measure. Bit.ly for me is a global measurement system of attention.
We know what people are clicking on. The powerful thing is that the websites cannot do
anything about it. NPR has no say: I am making a shortcut, people click on that
shortcut, I measure what is happening, and I don’t need to talk to NPR about that. That
is very interesting, and for marketing, for understanding what works and what doesn’t,
it is one of the most powerful probes we have.
Student:
To clarify, does Bit.ly make unique links or could there be 12 links that go to the same…
Andreas:
My understanding is that ultimately Bit.ly knows a unique identifier for each link.
However, if you want to have certain custom URLs, then you can have a number of
different ones. But, in the reports you see, they get all lumped together into one.
Here are some questions I think you might be interested in asking. I want to get
some insights from you. Does the day of week and time of day of the initial
posting influence the number of clicks and decay time? What is the effect of the
initial channel? You know from HTTP referrer where it initially came from. Do
things that are posted in Facebook live longer than things that are posted in
Twitter or posted in a blog? Does traditional media content such as the New York
Times, or blog posts such as TechCrunch, propagate differently from more
personal content? I don’t think anybody knows the answer but you will a week
from today. What is the influence of the topic area? Bit.ly tries to give us a
number that characterizes what the link is about. Is it about technology, finance,
food, etc? Do we understand that maybe food items are longer lived than finance
items?
Here is another one where you should pull some data from Google: does the
popularity of the content site, which you can get via Google PageRank, predict its
propagation? If you have something from the New York Times, will that propagate
further than something from Weigend.com? Finally, are the properties of the initial
poster important? Does it matter what their people rank is?
Those are questions I think you can look at in the data. The assignment is to give
me some insights from the small sample we call “Sample A”. Those of you who
actually love to write algorithms that scale can do the second one as a group:
run through the huge dataset we’re getting, in order to understand what is statistically
significant.
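A minimal sketch of how one of these questions, whether the day of week of the initial posting relates to the number of clicks, could be explored on a small CSV sample. The column names (`long_url`, `created_at`, `total_clicks`), the timestamp format and the four rows are all assumptions made for illustration; the actual homework file may look different.

```python
import csv
import io
from collections import defaultdict
from datetime import datetime
from statistics import median

DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]

# Made-up rows standing in for the small sample; not real Bit.ly data.
SAMPLE = """long_url,created_at,total_clicks
http://example.com/a,2010-04-12 09:30:00,120
http://example.com/b,2010-04-12 15:00:00,80
http://example.com/c,2010-04-16 16:45:00,10
http://example.com/d,2010-04-16 17:10:00,30
"""

def median_clicks_by_weekday(csv_file):
    """Group links by the weekday they were created and take the median click count."""
    by_day = defaultdict(list)
    for row in csv.DictReader(csv_file):
        created = datetime.strptime(row["created_at"], "%Y-%m-%d %H:%M:%S")
        by_day[DAYS[created.weekday()]].append(int(row["total_clicks"]))
    return {day: median(clicks) for day, clicks in by_day.items()}

result = median_clicks_by_weekday(io.StringIO(SAMPLE))
print(result)
```

The median is used rather than the mean because click counts tend to be dominated by a few very popular links. With a sample this small the result is only a hint, not evidence.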
Student:
I’m wondering about the list. If you found ... what is the correlation of pollination? For
example, people who send out links in the middle of the day might be really interested in
whether it’s propagating or maybe if I’m just sending a link to a friend or something, I
might not care how much traffic the link has been getting.
0:13:03
Andreas:
You can do experiments. You can have different target URLs which essentially are the
same, and run experiments all day long. Marketing people typically say 50% of all
marketing dollars are wasted, but nobody knows which 50%. We can create
transparency now by understanding where people click, what happens to them
afterwards. We don’t have to own the site. That’s the powerful thing. We can do it with
redirects.
Student:
How does Bit.ly take into account the fact that the URL is shortened so you can’t actually
see what site they go to?
Andreas:
I think most interfaces now show the site which is behind it. For instance, at Twitter it
shows that it’s the link to the New York Times.
Student:
What format is this data in?
Andreas:
CSV, so whatever you want to do to look at stuff. The small one you can do in Excel if
you want. The big one you need to have reasonable skills.
Student:
What do you think about Twitter rolling out their own URL shortener? How do you think
that will affect Bit.ly’s spread to the world?
Andreas:
I think that’s a good question for Todd on Tuesday.
Before we have Lars come up, I wanted to talk about discovery. On Tuesday we will get
the initial proposal for your project from you. On Thursday we will get the Bit.ly answers
from you. On the following Tuesday, it’s the project day and we will get the next… there.
If you want a sneak preview of the Twitter homework, look at last year’s Twitter
assignment. It’s about people discovery on the public data Twitter has.
As we’re listening to Lars today, who is going to talk to us about how Facebook tries to
have you discover people you might know, I want you to think about how you would
do that on Twitter. MrTweet is the app that I use and it’s quite interesting to
compare and contrast the two worlds.
For today, we have the best of Lars, and I asked him to establish his credibility by talking
about something on privacy and the non-existent privacy-preserving data mining
attempts: how auxiliary data helps you nail things down to an enormous precision.
Then at 4:45 we will start with “People You Might Know”. At 5:00 we will do a group
exercise where we will talk together and figure out metrics, and how to know if what this
team is building is any good. If his team is building two different versions, which one is
better? We will then turn it back to you and you will show us some results.
Do you know why we’re doing what we’re doing? Do you know how it fits in? Today is
people discovery. Last class was content discovery. The class before that was
product discovery.
0:17:22
Lars:
Thank you for having me. I’m going to talk a bit about some work I did a couple of years
ago while I was a grad student at Cornell. Then I’ll move to what I’m doing at Facebook.
The main question we asked in this work was: can these social data sets be released to
the research public? Could you release something like the Facebook social graph to
researchers and not worry you were going to end up in the New York Times the next
morning?
For the most part, the research community, people who are interested in doing social
graph research are not really interested in who the actual people are. They’re more
interested in the social network structure, how the network is evolving, and how it’s being
used, and things like that.
The social network could be anything. It could be an instant messenger or email graph,
or the AT&T phone graph. In all these cases, researchers would really like to get their
hands on it and try to run their models, and explain what’s going on. That would be great
except the users of all these systems have these strong privacy expectations. You would
be pretty upset if the whole world found out everybody you’d ever sent an email to or
made a phone call to.
A first attempt at making both of these groups of people happy is to do some sort of
anonymization where we release an anonymized version of the network, where we just
strip away all of the identifying information on the nodes. We replace them with random
IDs and we release the link structure.
This should give you a bit of pause and make you think of all the cases where people
have released data and gotten into a lot of trouble. There has been a long series of work
on de-anonymizing data, things like figuring out who wrote something based on textual
analysis. With the Netflix data, people were able to align it with public IMDb ratings
and de-anonymize it. Then there was the whole AOL debacle where people got fired
and there were lawsuits. We’d really like to avoid that.
We can start by looking at a very small network. This is the famous karate network from
a sociology study 50 years ago. During the study, the network split into two groups.
Node 1 and node 34 eventually didn’t like each other and they went off and formed
separate karate clubs. This is the social network of all the people that were in the karate
club.
0:21:10
Just looking at this very small toy example, it turns out that some information has been
leaked. When Zachary released this data, he published this graph but took away all the
people’s names. If one of the people in the study knew they had 6 friends in these 2
karate networks, and they were also friends with both of the leaders, node 1 and node
34, it turns out that person could uniquely identify himself as being this node. Similarly,
if this person also remembered he had 6 friends but he was only friends with this karate
club leader and not that one, he could also uniquely identify himself. Just using this little
amount of information, how many friends did I have and which karate leader was I friends
with, all these people could uniquely identify themselves in the graph.
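A minimal sketch of the re-identification idea just described: each member's "signature" is the number of friends they have plus whether they are friends with each of the two leaders, and anyone whose signature is unique can find themselves in the anonymized graph. The graph below is a made-up seven-person toy, not Zachary's karate club data, and the node names are hypothetical.

```python
from collections import Counter, defaultdict

# Toy friendship graph; "A" and "B" play the role of the two club leaders.
EDGES = [
    ("A", "c"), ("A", "d"), ("A", "e"),
    ("B", "e"), ("B", "f"), ("B", "g"),
    ("c", "d"), ("f", "g"), ("e", "f"),
]

def signatures(edges, leader_a, leader_b):
    """Map each non-leader to (friend count, knows leader_a, knows leader_b)."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return {
        node: (len(friends), leader_a in friends, leader_b in friends)
        for node, friends in adj.items()
        if node not in (leader_a, leader_b)
    }

def uniquely_identifiable(edges, leader_a, leader_b):
    """Members whose signature is shared with nobody else."""
    sig = signatures(edges, leader_a, leader_b)
    counts = Counter(sig.values())
    return sorted(node for node, s in sig.items() if counts[s] == 1)

print(uniquely_identifiable(EDGES, "A", "B"))
```

In this toy graph, `c` and `d` share a signature and so hide behind each other, while the remaining members do not.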
Our case is different from this. The graph is much larger. Some believe that’s going to
save us. We have this huge graph with hundreds of millions of nodes in it. You could
think, “What could go wrong in that situation?” Imagine you’re a data curator for
Facebook or AT&T and you’re going to release this huge social graph with hundreds of
millions of nodes. Whereas these karate club individuals were able to uniquely identify
themselves because there weren’t very many of them, and somebody with 6 nodes was
almost uniquely identified to begin with, in these graphs there will be hundreds of
thousands of people with degree 6.
We did this work, which actually won “Best Paper” in 2007, where we imagine a simple
scenario in which the data curator, say Facebook, is going to release their social network just
one time. They’re going to strip as much identifying information as possible. They’re
only going to include the nodes and the social connections, the friendships, and nothing else.
An attacker has the ability to add a few nodes and edges before this all happens. You
can think that in the case of Facebook that means going in and creating some accounts,
and then making some friendships between those accounts. Then maybe making
friendships to some other individuals in the graph. The goal of this attacker is to try to
discover if 2 individuals are connected. You can imagine the attacker is targeting 2
specific people. He wants to know if his wife and best buddy are friends on Facebook or
something like that.
What can that person do? Here is the structure of the attack we developed. This is how
you should do it if you want to do it. Imagine you have this huge graph with hundreds of
millions of nodes. You can target more than 2 people. Imagine you were interested in
these 3 people here and we wanted to determine of these three edges, which ones are
present in the social network.
We’ll create this structure out here which will be like our secret key. We create these 10
nodes and they’re connected in some way that we know about. Then we’re going to put
that in the graph, and the data is going to be released with the original hundred million
nodes and also these 10 we created. If we could only find this thing we
created, we could follow these links to find the people that we’re interested in, and we could
read off which of those pairs are connected.
The question is how we create this widget so that we can find it uniquely, because some
instances of this problem are very hard. This is like an instance of subgraph
isomorphism, if you’re familiar with that sort of computer science. It’s an NP-hard problem. It
seems like this won’t work.
It turns out that if you do things the way we tell you, you’ll be able to find it very efficiently.
The right way to do this is pretty simple, actually. You just pick someone [0:24:59 U,K]
which has to be sufficiently large but not that big, in the order of 10 or 20. You’re
targeting K people. You create K nodes. Then you add these links to target them.
0:25:12
In order to make my thing uniquely identifiable, and in order to find it later, I just add a
bunch of random edges between them. This is just a random graph, probability ½ for all
pairs of nodes, with this one caveat that I’ve deterministically created this path. I’m not
going to go into the details too much about why this all works, but I’ll give the 30-second
version. If you’ve created a subgraph of size K, in this entire graph the total number of
subgraphs of size K is something like N^K, where N is this hundred million term. That is
the number of things you have to search over.
The probability that any one of them matches looks like this; if you work that out in your
head it’s something like 2^(-K^2). It turns out that N^K is getting bigger and bigger
as K grows, but 2^(-K^2) is getting smaller and smaller much more rapidly. Eventually,
when K gets to be about 2 times the logarithm of the number of nodes, then with high
probability your thing is unique and you can find it. We have an algorithm that does this
all efficiently too.
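A rough sketch of the counting argument above, to show where "about 2 times the logarithm" comes from. It rests on assumptions that go beyond what was said: a plain union bound over N^K ordered choices of K nodes, a base-2 logarithm, and a match probability of 2^(-K(K-1)/2), one coin flip per pair of planted nodes, which is the more precise form of "something like 2^(-K^2)". The function names are hypothetical.

```python
import math

def log2_expected_false_matches(n, k):
    """log2 of the union bound n**k * 2**(-k*(k-1)/2).

    n**k over-counts the ways to pick k nodes in order; each choice matches
    a random pattern on k nodes with probability 2**(-k*(k-1)/2).
    """
    return k * math.log2(n) - k * (k - 1) / 2

def smallest_unique_k(n):
    """Smallest k for which the bound on expected false matches drops below 1."""
    k = 2
    while log2_expected_false_matches(n, k) >= 0:
        k += 1
    return k

n = 100_000_000  # the "hundred million" nodes from the talk
print(smallest_unique_k(n), round(2 * math.log2(n), 2))
```

Setting the bound to 1 gives K * log2(N) = K(K-1)/2, so K is roughly 2 * log2(N). The bound is conservative because it ignores degrees and the rest of the graph's structure, so it says nothing about how small K can be in practice.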
That’s all I’m going to say. Hopefully I’ve established my credibility. I’ll pause and take
questions.
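A small simulation of the attack as described in this part of the talk, under simplifying assumptions: a random graph of 300 nodes stands in for the real network, each planted account links to exactly one target, and the search is a plain backtracking walk that checks degrees and internal edges. All names, sizes and the seed are made up. It is a sketch of the idea, not the published algorithm.

```python
import random
from itertools import combinations

def random_graph(n, p, rng):
    """Stand-in for the real social graph: each pair is linked with probability p."""
    adj = {v: set() for v in range(n)}
    for u, v in combinations(range(n), 2):
        if rng.random() < p:
            adj[u].add(v)
            adj[v].add(u)
    return adj

def plant(adj, targets, rng):
    """Add one new account per target; return new ids, internal edges, degrees."""
    k = len(targets)
    first = len(adj)
    new = [first + i for i in range(k)]
    for x in new:
        adj[x] = set()
    internal = set()
    for i, j in combinations(range(k), 2):
        # The path x0-x1-...-x(k-1) is always present; other pairs by coin flip.
        if j == i + 1 or rng.random() < 0.5:
            internal.add((i, j))
            adj[new[i]].add(new[j])
            adj[new[j]].add(new[i])
    for x, t in zip(new, targets):
        adj[x].add(t)
        adj[t].add(x)
    return new, internal, [len(adj[x]) for x in new]

def anonymize(adj, rng):
    """Release the graph under random ids; also return the secret relabelling."""
    ids = list(adj)
    shuffled = ids[:]
    rng.shuffle(shuffled)
    relabel = dict(zip(ids, shuffled))
    released = {relabel[u]: {relabel[v] for v in nbrs} for u, nbrs in adj.items()}
    return released, relabel

def find_planted(released, internal, degrees):
    """Return every path whose degrees and internal edges match the planted pattern."""
    k = len(degrees)
    matches = []

    def extend(path):
        i = len(path)
        if i == k:
            matches.append(list(path))
            return
        candidates = released[path[-1]] if path else released
        for v in candidates:
            if v in path or len(released[v]) != degrees[i]:
                continue
            if all(((j, i) in internal) == (path[j] in released[v]) for j in range(i)):
                path.append(v)
                extend(path)
                path.pop()

    extend([])
    return matches

def targets_of(released, match):
    """Each planted account has exactly one neighbor outside the planted set."""
    inside = set(match)
    return [next(iter(released[x] - inside)) for x in match]

rng = random.Random(0)
graph = random_graph(300, 0.03, rng)
targets = [5, 17, 42, 99, 123, 200, 250, 260]
new, internal, degrees = plant(graph, targets, rng)
released, relabel = anonymize(graph, rng)
matches = find_planted(released, internal, degrees)
hidden = [relabel[x] for x in new]
print(len(matches), hidden in matches)
if len(matches) == 1:
    a, b = targets_of(released, matches[0])[:2]
    print("first two targets connected:", b in released[a])
```

The planted accounts are always among the matches by construction. Whether they are the only match depends on K and on the graph, which is the point of the counting argument.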
Andreas:
There are a couple of examples you might want to give, like Netflix. If you had N
movies rated, then with a probability of... I forgot the numbers, but they’re pretty amazing.
Lars:
I don’t remember the details either, but if you include some contextual information then it
gets much easier. With Netflix, they had the time stamps on all of the ratings. You
can think a lot of people rated the movie at every minute, but not that many people had
the exact same signature as you. That was a key part of the de-anonymization in that
work.
Student:
… Facebook social graph, a lot of the Facebook users… avatar. Can you map a graph
out that way?
Lars:
This doesn’t have anything to do with data that is already public. The scenario
we’re thinking about is one where it’s not public: if someone
were to release the data, how could you learn things that wouldn’t otherwise be
available?
Student:
Wouldn’t the public parts of the … get a lot harder to … Facebook graph … a lot of
contextual…?
Lars:
The question was wouldn’t the public parts make all of this easier, and yes, they probably
would. It’s already pretty easy though. The point is that if you’re worried about your
users’ privacy, you shouldn’t go and do this simple anonymization.
I’m going to talk about the “People You May Know” at Facebook now. This is what
I’ve been working on for the last few months. The first question, the overarching
problem is who should I suggest. I’m going to make some suggestions for you of
who I think you might know, and how should I figure out who those people are.
0:28:39
Our hypothesis which is backed up by the literature is the social graph is going to tell us.
I’ll get into the literature of that and talk about what we’ve done on top of that in a bit.
Given the social network, we’ll use some machine learning techniques to make the
suggestions, and we’ll talk about how we measure our performance and how much better
we’re doing now than we were a month ago.
To put it in context, you have everybody doing recommendation analysis. You have
Amazon recommending all these products. They think I need a new hard drive.
Netflix is recommending movies based on what I’ve watched before and what ratings
I’ve given. Facebook’s product is sort of people, so we’re trying to recommend
people we think you might know. In some ways our problem is a bit different from
the problems that Amazon and Netflix have. They’re doing collaborative
filtering, where they try to find people who are similar to you in the whole world and
they make suggestions based on that. They find someone who has similar tastes
and that person has rated some movies, and they figure if that person liked the
movie then you’ll like it too.
For us the social context is much more important. It would be silly for us to
find people who are like you and recommend their friends. We’re going to use
the social context and look at the social graph to find people that are nearby in that
space.
To show what the product looks like, if you go to your home page on Facebook you may
or may not see some friend suggestions here in the upper right-hand corner. It shows
you a few things. It tells you how many mutual friends you have with that person, and you
have the ability to add that person as a friend, or you can remove that person and it will
show another one. There is this other page which will show you a whole laundry list of
people. The home page is by far the most important source of friend additions for
us.
Our hypothesis is we’re going to be able to figure all this out, or at least a lot of it,
just by looking at the social graph. There is some research that backs this up. There
is work I did with Jure Leskovec when we were both interns at Yahoo. We looked at a lot
of social networks. We looked at how many friendships are created as a function of
how far in the social graph you are from that person.
To take one example, we’ll look at LinkedIn, and you see here that the highest point
is at 2, and 2 hops away means you’re friends of friends. This is on a log scale, so
going from 2 to 3, you’re dropping by not quite an order of magnitude but a factor of 5.
By far the most significant source of new friends is people who are friends of
friends. It makes a lot of sense intuitively: people you’re meeting, you’re
probably getting introduced to them through friends, and people you already know,
you probably have a common friend with them.
This is the raw number of edges that have been created. If you think about
the probability of creating a new friendship with a given person, the effect is more extreme. It
looks like this because when you go from that raw number to the probability, the
denominator gets huge. You have maybe 10,000 people that are friends of friends,
but you have a million people that are friends of friends of friends, 3 hops away.
0:32:08
If we look at LinkedIn, there are about 2 orders of magnitude, so you’re a hundred
times more likely to link to a random friend of friend, as compared to a random
person, 3 hops away. This is true on Facebook also.
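A minimal sketch of the measurement described here: for each new friendship, find how many hops apart the two people were just before they connected, and tally the results. It uses breadth-first search on a toy edge list. The names, the edges and the choice to record pairs with no path under the key `None` are assumptions for illustration.

```python
from collections import Counter, deque

def hop_distance(adj, src, dst):
    """Breadth-first search; returns the hop count, or None if there is no path."""
    if src == dst:
        return 0
    seen = {src}
    queue = deque([(src, 0)])
    while queue:
        node, dist = queue.popleft()
        for nb in adj.get(node, ()):
            if nb == dst:
                return dist + 1
            if nb not in seen:
                seen.add(nb)
                queue.append((nb, dist + 1))
    return None

def distance_histogram(initial_edges, new_edges):
    """Count new edges by the hop distance between endpoints before the edge existed."""
    adj = {}

    def add(u, v):
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)

    for u, v in initial_edges:
        add(u, v)
    hist = Counter()
    for u, v in new_edges:  # processed in order of creation
        hist[hop_distance(adj, u, v)] += 1
        add(u, v)
    return hist

initial = [("a", "b"), ("b", "c"), ("c", "d")]
created = [("a", "d"), ("a", "c"), ("e", "a"), ("b", "d")]
hist = distance_histogram(initial, created)
print(dict(hist))
```

To turn these raw counts into the probabilities discussed above, each bucket would be divided by the number of candidate pairs at that distance, which is the denominator that grows so quickly with each hop.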
I ran this yesterday. I was looking at one small country because it was quick and
easy. I was looking over the entire evolution of Facebook at where the new
links were coming from, and 92% of them were links to friends of friends. This is really
good from a practical point of view also, because if it weren’t the case we’d be in trouble.
Friends of friends are already a lot of people. For Facebook, the average friend count is
something like 130. On average, if we multiply that together, a person is going to have
about 17,000 friends of friends. If you had to go any further than that you’re getting up into the many
millions. Doing something like that for all users would be hopeless. That’s the average,
so if you think about the power users who have 5,000 friends, there are people on
Facebook who actually have over 1 million friends of friends, just 2 hops away.
Student:
What does the zero …
Lars:
This is people to whom there was no path. This line is showing how many friendships
were created as a function of how many hops away they were. I threw in this one
here: friendships to people to whom there was no path. That’s mostly new
users, or two people who were both new users and connected up.
Whenever you’re a new user there is no path to whoever it was you made your
first connection to.
Student:
… average number of … friends a lot higher….
Lars:
It is. On a subsequent slide, it’s closer to 40,000. That’s because there is some sort
of correlation. If you’re a person who has many friends, chances are your friends
also have many friends. A simple example is kids in high school and also college
students tend to have more friends, and they also tend to be friends with people in
the same demographic who also have more friends. When we do this, it turns out the
average is about 40,000.
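A minimal sketch of why "average friends squared" can undercount friends of friends. It compares that back-of-envelope estimate with a direct count on a made-up toy graph of two hubs with five contacts each. The graph and names are assumptions for illustration and are not meant to reproduce Facebook's numbers.

```python
def build(edges):
    """Build an undirected adjacency map from an edge list."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

def friends_of_friends(adj, user):
    """People exactly two hops away: friends' friends, minus friends and self."""
    fof = set()
    for friend in adj[user]:
        fof |= adj[friend]
    return fof - adj[user] - {user}

# Two connected hubs, each with five contacts who know nobody else.
edges = [("h1", "h2")]
edges += [("h1", f"l{i}") for i in range(5)]
edges += [("h2", f"m{i}") for i in range(5)]
adj = build(edges)

mean_degree = sum(len(friends) for friends in adj.values()) / len(adj)
naive = mean_degree ** 2  # the "130 times 130" style estimate
actual = sum(len(friends_of_friends(adj, u)) for u in adj) / len(adj)
print(round(naive, 2), actual)
```

Most people in this toy graph reach their friends of friends through a hub, so the friends they go through have far more friends than the average person does, and the direct count comes out above the naive estimate.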
Student:
Do you have an idea of what percentage of your users are people whose networks don’t
grow? It’s analogous to the guest user on a website, they never want to register for an
account but want to be there to use the site totally anonymously?
Lars:
I don’t have a really good sense of what that percentage would look like. Facebook is all
about social content, so unless you’re a big FarmVille fan or something,
and even FarmVille is about social content, there is much less for you on Facebook as
an anonymous guest than there would be on a lot of sites where you would do that, I
would think. But I don’t have any concrete numbers to back up that intuition.
This gives us some idea of what we should do. It allows us to narrow the scope of
our problem. Now we’re going to go from who should we suggest, to which
friends of friends should we suggest. You can imagine me, here are all my friends,
and I want to find out; of all these people, all these friends of friends, who is the
best person to suggest.
0:35:48
We wanted to figure out what features are going to help us pick from these 40,000.
We wound up looking at different network features and also some of the
demographic features. A first thing that’s obvious is the number of friends you
have in common with that person, how many mutual friends. If I have 40,000
friends of friends, most of those people I only have 1 friend in common. The ones
where I have more friends in common, those are likely to be better suggestions.
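A minimal sketch of the simplest version of this idea: take all friends of friends as candidates and rank them by how many mutual friends they share with you. The function name, the toy graph and the `top_n` cutoff are assumptions for illustration. This is the baseline described here, not Facebook's production ranking.

```python
from collections import Counter

def suggest(adj, user, top_n=3):
    """Rank friends of friends by number of mutual friends, highest first."""
    mutual = Counter()
    for friend in adj[user]:
        for candidate in adj[friend]:
            # Skip the user and anyone who is already a friend.
            if candidate != user and candidate not in adj[user]:
                mutual[candidate] += 1
    return mutual.most_common(top_n)

def build(edges):
    """Build an undirected adjacency map from an edge list."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

# Toy graph: "me" has friends a, b, c; x, y, z are friends of friends.
adj = build([
    ("me", "a"), ("me", "b"), ("me", "c"),
    ("a", "x"), ("b", "x"), ("c", "x"),
    ("a", "y"), ("c", "y"),
    ("c", "z"),
])
print(suggest(adj, "me"))
```

Each candidate is counted once per shared friend, so the counter value is exactly the number of mutual friends.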
When we look at the data, this is borne out. Here on the X axis is the number of
friends we have in common, and the Y axis is the relative probability on a log
scale. This isn’t the exact probability, so it doesn’t go to 1. It’s the relative
probability, so there is some constant you would have to multiply by. The
upshot is if we look at a couple of data points, if we look at the difference between
1 mutual friend and 10 mutual friends, there is about a factor of 12 difference. If
you were somehow to get up to 100 mutual friends without actually friending the
person, your intuition would be you have 100 mutual friends, you should have
found that person before, but when you look at the data you’re much more likely to
become friends with those people.
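A minimal sketch of the mutual-friend baseline described here, assuming the graph is held as a plain dictionary of friend sets. The function name `mutual_friend_counts` and the tiny example graph are made up for illustration and are not from the talk.

```python
from collections import Counter


def mutual_friend_counts(graph, user):
    """Count mutual friends between `user` and each friend of a friend."""
    friends = graph[user]
    counts = Counter()
    for friend in friends:
        for candidate in graph[friend]:
            # Skip the user and anyone who is already a friend.
            if candidate != user and candidate not in friends:
                counts[candidate] += 1
    return counts


# Hypothetical toy graph: each key maps to that person's set of friends.
graph = {
    "lars": {"ann", "ben", "cat"},
    "ann": {"lars", "dan", "eve"},
    "ben": {"lars", "dan", "eve"},
    "cat": {"lars", "dan"},
    "dan": {"ann", "ben", "cat"},
    "eve": {"ann", "ben"},
}

# Candidates ordered by number of mutual friends, most first.
ranked = mutual_friend_counts(graph, "lars").most_common()
```

In this toy graph "dan" shares three friends with "lars" and "eve" shares two, so "dan" would be suggested first.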
That’s a good start, just looking at mutual friends. If you had to pack something
together and ship it next week, that would be a good place to start; suggest based
on the number of mutual friends. It turns out we can do better by incorporating
some more complicated features. I’ll put up one of the things we used. This is a bit
scary and complicated so I’ll go through it.
What are the other network features that might be useful? One of the important things
we found is the edge creation time is a very strong signal for us. You can think
that if you have a close friend and that close friend makes a new friend, that’s
probably a good suggestion. That might be a better suggestion than the person
who works at the same company as you down the hall that you have 10 mutual
friends with but you’ve always had 10 mutual friends and you always will, but you
don’t know them socially.
The summation is over all of the mutual friends that you have with the person. Say
I have 5 mutual friends with some specific person; we’ll sum over those 5 people.
We have these 2 time terms. These are the times on those 2 edges. If I have a
mutual friend with some person, there are 2 edges between me and that friend of
friend. That’s what these 2 delta terms are. We multiply them together and we take
them to this negative power, which says new edges are more important. We divide
by the square root of friends because that helps to diffuse the influence. If I had a
friend with 5,000 friends, all those 5,000 people should probably get less influence
than if I have a friend with 20 friends.
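The slide itself is not part of this transcript, so the following is only a reconstruction of the feature from the spoken description. The symbols are assumptions chosen for illustration: u is the user, w the friend of a friend, N(x) the set of friends of x, and delta the age of an edge (time since it was created).

```latex
f(u, w) \;=\; \sum_{v \,\in\, N(u) \cap N(w)}
  \frac{\left(\delta_{u,v}\,\delta_{v,w}\right)^{-\alpha}}{\sqrt{\lvert N(v) \rvert}},
  \qquad \alpha > 0
```

The negative exponent makes recently created edges count more, and the square root in the denominator reduces the contribution of mutual friends who have very many friends. Later in the session Lars says that an exponent of 0.3 was one that worked well.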
I wouldn’t want you to come away with the message that this is how we do it. This is an
example of one of the many kinds of features we’re able to use and glean from the
social network and from the metadata we have associated with it. The time stamps
on the edges, the number of friends that everybody has, and things like that.
0:39:09
Student:
I don’t understand the graph. Relative probability….
Lars:
That’s why it’s relative and not the actual probability. How do I do this? Think
about one exposition – 10. I counted for all the friendships that ever occurred, I
have a weight of 1 that I’m going to put into the denominator of everything but I’m
going to distribute that amongst all of the friends of friends. You may have 10,000
people with 1 mutual friend and you have 1,000 people with 2 mutual friends. The
amount of weight you put into the denominator of this fraction is that. You put 1
into the numerator. I’d have to go through the math but we can talk about that
online.
The normalization is a bit complicated, but it says that people in this bucket are 12
times as likely and you can think that the probability can’t be greater than 1, so
there is some constant that you should multiply everything by for it to make sense.
It would have been simpler if I had left off the labels on the Y axis.
Student:
Where do you get the negative….?
Lars:
We tried a few things and some of them worked better than others. There is no real
principled reason for why it should be that. We tried 0.1, 0.2, 0.3, and 0.3 was one that
worked well. There are other features that are of a similar flavor to this that go into the
model at the end of the day.
Student:
I think the time of creation of those edges might … friends and he might not join
Facebook at the very beginning….
Lars:
Absolutely, we have to make the most of the data we have available to us. At the end of
the day you do what you can. The time stamps we have are not necessarily
representative of reality but they’re still a good signal for us.
We first talked about the features that we use, so how do we use them. The
answer is we do some fairly straightforward machine learning. For every user, like
me, we consider all the friends of friends, these “W’s”, all pairs, so me and one of
my friends of friends and we generate a bunch of features, things like the mutual
number of friends, this super complicated time decaying number of mutual friends,
and a few other things in that same genre. We also add some other features that
are just for you or just for this W, like age, gender, country, or whatever
demographic information we happen to have about the person. Sometimes those
things are missing and we don’t use everything we have, but you get the idea.
We take that and generate a bunch of training data and we use [bagged] decision
trees. If you’re familiar with what that is, we just do the simplest thing. We
take our training data, train a bunch of different decision trees using different
subsets of the data and then we average their results together. It looks like this;
we have the features that come in, like there are 5 mutual friends, the source node
is a guy, he and the target is 23, etc. Then it goes into all these different trees that
spit out numbers and we average them together to get the final output.
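A minimal sketch of the bagging idea described here: train several trees on different resamples of the training data and average their outputs. To stay short it uses one-split trees (stumps) in place of full decision trees, and the two features (mutual friends, age gap) and the eight training rows are invented for illustration. It is not Facebook's implementation.

```python
import random


def train_stump(rows):
    """Fit a one-split tree: choose the (feature, threshold) with lowest squared error."""
    best = None
    for f in range(len(rows[0][0])):
        for threshold in sorted({x[f] for x, _ in rows}):
            left = [y for x, y in rows if x[f] <= threshold]
            right = [y for x, y in rows if x[f] > threshold]
            if not left or not right:
                continue
            left_mean = sum(left) / len(left)
            right_mean = sum(right) / len(right)
            error = (sum((y - left_mean) ** 2 for y in left)
                     + sum((y - right_mean) ** 2 for y in right))
            if best is None or error < best[0]:
                best = (error, f, threshold, left_mean, right_mean)
    if best is None:
        # No usable split (all sampled rows identical): predict the mean label.
        mean = sum(y for _, y in rows) / len(rows)
        return (0, float("inf"), mean, mean)
    return best[1:]


def predict_stump(stump, x):
    f, threshold, left_mean, right_mean = stump
    return left_mean if x[f] <= threshold else right_mean


def train_bagged(rows, n_trees=25, seed=0):
    """Train each stump on a bootstrap sample (drawn with replacement)."""
    rng = random.Random(seed)
    return [train_stump([rng.choice(rows) for _ in rows]) for _ in range(n_trees)]


def predict_bagged(trees, x):
    """Average the outputs of all trees to get the final score."""
    return sum(predict_stump(t, x) for t in trees) / len(trees)


# Hypothetical rows: ((mutual_friends, age_gap), 1 if the friendship was created).
rows = [((1, 5), 0), ((1, 2), 0), ((2, 10), 0), ((3, 1), 0),
        ((8, 3), 1), ((10, 1), 1), ((12, 7), 1), ((15, 2), 1)]
trees = train_bagged(rows)
score_many_mutual = predict_bagged(trees, (10, 1))
score_few_mutual = predict_bagged(trees, (1, 5))
```

With this toy data, a pair with 10 mutual friends scores higher than a pair with 1, because the resampled stumps mostly split on the mutual-friend count.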
0:43:02
One thing you have to think about whenever you are taking this approach is where
does your training data come from. Whenever you’re doing this classification in
machine learning, you need to have some labeled training data. Some of the cases
are negative and some of them are positive.
In our case, the positive cases are friendships that were created. For the negative
cases there are a few different possible choices you could make. Naively all the
negative examples are friendships in all the world that were not created, but that’s a bit
too much to deal with. We actually use the data from PYMK to sort of train subsequent
models.
We show you a bunch of suggestions and you add some of them and don’t add some
other ones. The ones you add, those are the positive examples, and the ones you don’t
add are the negative examples.
What we’re grappling with in trying to understand all the influences is what that
sort of feedback loop does in the long run. We’re training on the previous iteration
of the models and how does that impact the long-term solutions?
Of all the features, the mutual friends are the most important, so the things related
to the social network are by far the most significant in making all these
predictions. Some of the other network related features, like how many friends
people have are also important, new users for instance are potentially good
suggestions because they’re people you might know outside of Facebook but
they’ve just joined the site. There is demographic stuff like age and gender which
helps a bit but it’s secondary to the network things.
Student:
What do you mean by [time scale] of friends?
Lars:
That’s the complicated formula that I put up, things like that.
We have this model and it gives us this number at the end of the day. Doing all that is all
very expensive and if you start multiplying, you get some pretty astronomical numbers.
We said the average person has 40,000 friends of friends. Facebook now has 400
million active users, so we multiply those together, and we get up to 16 trillion. We have
to generate all these features and do some sort of evaluation for 16 trillion pairs.
To do this we have these dedicated racks of machines and they’re all running full tilt all
the time, going through all the users and generating new suggestions. Each of the
machines holds a fraction of the social graph because it’s way too big to fit in memory on any
one machine. Even with all these machines, we’re only able to generate these new
suggestions every few days.
For more established users, we generate new suggestions every 3.5 days. For new
users their social network is changing more often so we generate new suggestions for
them more often.
0:46:14
Student:
Is there anything related to the population, like a barcode basis, like as things change you
update things marginally?
Lars:
No, it’s way too much data to store in a practical way. 16 trillion, even if it was only 1 byte
per pair, that’s 16 TB and in reality it would probably take at least a few hundred bytes
per pair so you’re getting up into Petabytes.
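The arithmetic behind the 16 trillion pairs and the storage estimate can be checked in a few lines. The figure of 200 bytes per pair is an assumption standing in for "a few hundred bytes"; the other numbers are the ones quoted in the talk.

```python
friends_of_friends = 40_000      # average friends of friends per user (from the talk)
active_users = 400_000_000       # active users (from the talk)

# Number of (user, friend-of-friend) pairs to evaluate.
pairs = friends_of_friends * active_users

TB = 10 ** 12
PB = 10 ** 15

# Storage if each pair took a single byte.
storage_tb_at_one_byte = pairs * 1 / TB

# Storage at an assumed 200 bytes per pair ("a few hundred bytes").
assumed_bytes_per_pair = 200
storage_pb_at_assumed_size = pairs * assumed_bytes_per_pair / PB

print(pairs, storage_tb_at_one_byte, storage_pb_at_assumed_size)
```

Under that assumption, caching every pair would take on the order of 3 petabytes, which is why the pairs are recomputed in batch rather than stored and updated incrementally.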
Student:
Are these supercomputers?
Lars:
No, they’re regular computers. Everything is regular computers. Supercomputers are
too expensive, I think. They just gave them to me. They’re run of the mill machines.
They have 70GB of memory and 8 core CPUs in them, but other than that they’re not
special.
The important thing to take away from this is we’re only able to do this every few
days. That means for each user, if they come to the site all the time, they’ll go
through them. One of the things we build on top of this is we want to do a bit of
work on every impression to try to show the best suggestion we have available
every single time. To do this, we keep track of all the suggestions we’ve already
made to you.
This curve is showing you how the click through rate decays as the number of
impressions increases. You can see the first time we make a suggestion is when
we get the highest click through rate. The second time it has dropped some
amount and it continues to decay. By the time we’ve shown it 10 times, the click
through rate is very low. This curve is approximately 1 over the number of
impressions.
Student:
When you say impression, do you mean …
Lars:
Number of times you’ve seen that suggestion. By impression I mean a pair, a user, and
a suggestion.
We can’t afford to – they keep the response time in the website up so we can’t do a lot of
hardcore computation for every impression, but we can do a bit. We actually combine
the output of this expensive offline model which gave us these original suggestions and
original scores with a bit of information. The number of impressions is the big thing
but also we have a few other features. Then we do some simple logistic regression
to re-rank every time.
Just to drive that home, you can imagine that we have these 4 suggestions for a
person and they all have some CTR prediction associated with them. At first,
you’ve never seen any of them. You come to the website and you see these 2 and
consequently, their CTR predictions drop. Now the next time you come and you
see those 2 people. You come back again and things change. You go back to
seeing Alice and Bob who you saw the first time.
0:49:23
The third time you come to the site, Bob decayed a lot. We think Alice is a good
suggestion and we think you should be friends with her. She’s really nice. We’re
going to show her to you a couple more times. That is what our model tells us to
do. The point is every time you come to the site, we’re going through the list of
suggestions this offline model made, and re-ranking them and showing the ones
we think are the best.
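A minimal simulation of the re-ranking walk-through above. It assumes the predicted click-through rate of a suggestion is simply its starting estimate divided by the number of times it will have been shown, which follows the "1 over the number of impressions" curve. The real system uses a logistic regression instead. The starting values, and the names Carol and Dave, are made up.

```python
def simulate_visits(base_ctr, visits, slots=2):
    """Show the top `slots` suggestions on each visit, decaying CTR with impressions."""
    impressions = {name: 0 for name in base_ctr}
    shown_per_visit = []
    for _ in range(visits):
        # Assumed decay: estimate for the (k+1)-th showing is base / (k + 1).
        predicted = {name: ctr / (impressions[name] + 1)
                     for name, ctr in base_ctr.items()}
        shown = sorted(predicted, key=predicted.get, reverse=True)[:slots]
        for name in shown:
            impressions[name] += 1
        shown_per_visit.append(shown)
    return shown_per_visit


# Hypothetical starting CTR estimates for four suggestions.
base_ctr = {"Alice": 0.05, "Bob": 0.045, "Carol": 0.04, "Dave": 0.035}
history = simulate_visits(base_ctr, visits=3)
```

With these numbers the first visit shows Alice and Bob, the second shows Carol and Dave because the first two have decayed, and the third goes back to Alice and Bob.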
Let’s put it all together and summarize how the whole system works. At first, we
want to generate some suggestions for some user so here is me. I come in and we
have this big offline system that goes and finds all the friends of friends and it
does all this feature creation: how many mutual friends do you have with all these
people, what are all these other features associated with that pair of users.
That goes through these decision trees and they spit out some scores. They’ll say
the score for Lars and Greg is .045 or something like that. Now, I come to the
website and we go and look to see how many times have I seen Greg before.
Maybe I’ve seen him 3 times before and I’ve seen Shelly a couple of times. We
combine that with the scores from these bagged decision trees, and that gives us
some sort of final CTR prediction. We take the top 2 and show them on your page
if they exceed some threshold.
We try not to spam you too much so that there is some threshold so if this number
doesn’t exceed that we won’t show you any. That’s why you might not see them.
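A rough sketch of the second tier described here: combine the offline score with the impression count in a small logistic regression, then show at most two suggestions that clear a threshold. The weights, the threshold value, and the candidate "Pat" are all hypothetical, and the talk mentions five features where this sketch uses only two.

```python
import math

# Hypothetical weights and threshold; none of these values come from the talk.
WEIGHTS = {"bias": -1.0, "offline_score": 20.0, "log_impressions": -1.2}
SHOW_THRESHOLD = 0.12


def predict_ctr(offline_score, impressions):
    """Logistic regression over the offline score and the log of past impressions."""
    z = (WEIGHTS["bias"]
         + WEIGHTS["offline_score"] * offline_score
         + WEIGHTS["log_impressions"] * math.log1p(impressions))
    return 1.0 / (1.0 + math.exp(-z))


def pick_suggestions(candidates, impressions, slots=2):
    """Return up to `slots` names whose predicted CTR clears the threshold."""
    scored = [(predict_ctr(score, impressions.get(name, 0)), name)
              for name, score in candidates.items()]
    scored = [(p, name) for p, name in scored if p >= SHOW_THRESHOLD]
    scored.sort(reverse=True)
    return [name for _, name in scored[:slots]]


# Offline scores from the batch model, and how often each was already shown.
candidates = {"Greg": 0.045, "Shelly": 0.03, "Pat": 0.01}
impressions = {"Greg": 3, "Shelly": 2}
to_show = pick_suggestions(candidates, impressions)
```

Here Greg has the best offline score but has already been shown three times, so the never-shown candidate ranks first. A candidate with a weak score and many past impressions falls below the threshold and nothing is shown for it.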
Every so often the results feed back into both of these models to continue training
and improving them. This model here is very dependent upon this one. If we
retrain these trees so the types of scores they output change, then it’s very
important for us to retrain this also because they need to work in concert.
Student:
How do you … the modeling you use for people you know… for example in the news
feed, rather than … populated according to time, if you did some sort of sorting on who
your mutual friends are and have you tried it?
Lars:
Our system that we built is only about a month old so some of the things we’re doing are
definitely not happening anywhere on the site. For news feed, I know they have some
concept of how important the person is to you that they try to learn. I don’t know the
details of how that system works. It’s definitely – this sort of thing we’ve talked to other
teams about and we’re exploring how this sort of thing can be applied in other places. As
far as I know, the people you may know is the most sophisticated in terms of finding that
sort of ranking.
Student:
When you do the CTR, if I go to the homepage and the “People You Might Know”
suggestion is mutual but I don’t click X, I just go on, does that count as a click through?
Lars:
It counts as a negative example. If you don’t do anything, it just – when we’re predicting
the click through, we’re predicting the probability that you’ll add that person on that page
impression.
Student:
So there is a difference in that and when I click yes.
0:53:06
Lars:
Yeah, that’s a bit complicated and I don’t want to get into the details of how we deal with
those two different signals.
Student:
I’m curious if you actually create a model of the person’s popularity – the model you were
talking about is tree based so it’s much more… are features generated that create
general compatibility or popularity metrics?
Lars:
We have some of these demographic features, like your friend count and how long
you’ve been on the site. I guess in principle it’s possible that the trees are learning
something like what you’re suggesting, but we don’t have anything explicit like that in
there.
Andreas:
Let’s do some pair exercise. [Break]
… with all kinds of ideas. How do we know whether something is working? What I want
you to do is to spend a couple of minutes talking to your neighbor and come up with
metrics that could be metrics that measure how well the system actually performs. I just
will throw up a few examples to get you started.
What is the CTR on a friend? What is the conversion using ebusiness language,
that the person gets added as a friend? What is the second order effect, that the
other person accepts the add? Returns in business are very expensive and in our
case that means unfriending someone. You tried them and he spams, so forget it.
Stockpiling people you would have added anyway. What is the value of the
friendship? Is there some notion of social capital? These are things that are metrics
and I want you to talk to your neighbor for a couple of minutes. Then we’ll collect a few
ideas from you about what might be metrics that help him evaluate ideas and
experiments. Talk to your neighbor. If you don’t have one, move up. How would you
measure the quality of the system? That’s the question right now.
Let’s see what interesting ideas we have for metrics. How do we know the system is
doing better than random, or should we from a Facebook perspective just show ads?
That is why we want to talk about metrics. Let’s hear some examples.
Student:
The point about the perspective of Facebook – it could be interesting to see how much
more page use someone generates. Whether that leads to more activity. Another thing
we thought about is if that person has no friends whatsoever, probably it’s about the
confidence that this person would befriend, but also if you have a slot of only 2 people, if
we give him someone we feel would stimulate his activities, someone who is super hot
and posts a lot and this guy – it might kick him into action as well.
Andreas:
What you’re suggesting are features to build the algorithm with. If you have few friends,
you might make different recommendations than if you have many friends. What I want
to focus on here is the metrics. How do we know we’re doing well?
0:56:48
Student:
We were thinking about metrics of the expected comment number someone has and then
conversely the expected number of people who then hide them from their feeds or their
comments. Rank users based on they get 3 responses every time which is better than
someone who gets none. Someone who gets hidden could also …
Andreas:
Facebook knows whether people tend to get x’d out of the feed, so you might not want to
recommend those people. It’s a feature more than a metric of knowing how we’re doing.
Student:
My neighbor has a friend who friended her and she’s a big pain in the ass now. It was a
high school friend she doesn’t want to talk to anymore. We are suggesting if there is a
really big imbalance between the messages, like someone is sending out versus
receiving then maybe the friendship isn’t actually … user experience on the site. Maybe
it’s a negative in their life. Maybe you could track whether the friendship was a good
addition in their life by high feeds. You could also do the balance of messages flowing
between and whether one person was using a lot more words in the message than the
other.
Andreas:
Good one, so I heard two things. One is people unsubscribing and the other is symmetry
is a good thing.
Student:
We had an idea where after you friend this person, how much value do you get; how
many other friends do you get from this friendship, recursively looking at the suggestions.
Do you get a lot of good suggestions after friending this person? That would be hard to
do but it would be interesting to see what the results are.
Andreas:
Also, how many of the articles in the news feed do you click on that come from this friend.
Student:
… value of the friendships and number of page views the user had for the new friend
because if you add certain people and never revisit them or view their page, then what is
the value?
Student:
We were thinking more about some guys get into Facebook just to have as many friends
as possible so they can advertise a product or something. Why would you unfriend
someone? If you already accepted them, why did you unfriend them? If there could be a
metric on if you unfriend the person, how is it related to how many messages or wall
posts or things like that which they sent because that could be someone spamming…
that person was my friend and my other friend gets … third friend of mine, he might be
suggested to add that friend… 150 friends in common but I would….
Student:
Isn’t there a feature on Facebook where you can suggest to other friends? Compare
those two.
Lars:
Compare in what sense?
Student:
Conversion rates, like if I suggest to Noah, that he should also know Joe, but you also do
it then it gets a better result.
1:00:22
Lars:
I hope that you would get better results, but that doesn’t really help us evaluate how
we’re doing. It just tells us that we’re losing because we’re never going to do as well.
Student:
But you could be improving.
Lars:
I would say one common thing here is although you all have good ideas, the things we
look at are much simpler. I would say that with photo tagging or the number of
comments and things, you need to be careful. If this is what your goal is going
to be, it’s going to heavily influence your product. If my goal is to maximize the
number of co-tagged photos that these people are interested in, then I’m going to
do well by suggesting prolific photographers, which is not necessarily the right
thing.
With that caveat, I’ll talk about what we look at. At a high level this gives you an idea of
how well we’re doing in terms of how well are we finding new users their friends? This
pie graph is just looking at new users, so people that have only been on the site for a little
while. Of all of the friends they make how many of them come through PYMK? It’s a
pretty big slice. It’s much smaller for more established users who are more familiar with
the product, but if you’re a new user on the site and you don’t really know what’s going
on, having that person’s face appear in the “add as a friend” link is very powerful. It’s the
number one single source of new friends in the product.
Things we look at – one thing that’s sort of obvious is the CTR. Of all the
suggestions we show, how many of them get clicked on, what is our conversion?
Another thing is what is the total number of friendships we’ve created so we have
this on the homepage for everybody; how many total number of friendships are we
creating? Then you could look at the add/remove ratio. We talked about this
briefly but there are three things you can do with a suggestion. You could either
click “add” or completely ignore it and do something else. Or you could click on
the X to remove it. Maybe the X’s are a stronger signal than the ignores.
All these metrics have some sort of pitfall. If you were just to try to optimize the
CTR or the add/remove ratio, a good way to optimize that is to show fewer
impressions. Only show the crème de la crème of the impressions, the ones you
think are really good and have really high CTR estimates. You’ll end up with a
really high CTR but at the end of the day you’re showing few impressions so you’ll
end up creating very little total value, very few clicks overall.
The other side of that is if your metric was only total clicks, then you would go
crazy with impressions and you’d always be suggesting things, even when you
didn’t really have a good suggestion that you hadn’t shown to the user over, and
over again. That creates some sort of negative user experience, where people are
getting spammed with suggestions.
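A small sketch of the trade-off between these metrics, assuming each impression ends in exactly one of the three outcomes mentioned above (add, remove, ignore). The two sets of counts are invented to illustrate the pitfall and are not Facebook data.

```python
def summarize(adds, removes, ignores):
    """Compute the metrics discussed above from outcome counts."""
    impressions = adds + removes + ignores
    return {
        "impressions": impressions,
        "ctr": adds / impressions,
        "total_adds": adds,
        "add_remove_ratio": adds / removes if removes else float("inf"),
    }


# Hypothetical policy that shows only its most confident suggestions.
selective = summarize(adds=10, removes=2, ignores=88)

# Hypothetical policy that shows suggestions at every opportunity.
aggressive = summarize(adds=40, removes=60, ignores=900)
```

The selective policy wins on CTR (0.10 against 0.04) and on add/remove ratio, while the aggressive policy creates four times as many friendships at the cost of ten times as many impressions. Neither number on its own says which policy is better.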
Student:
What is the actual reason for helping people friend more people? Why would you do that
over showing ads or something else that would …
1:04:00
Lars:
That’s a good question. I guess I skipped over that on this slide. In order to create a
good experience there is a fairly significant downstream impact of having new friends.
Facebook is all about connecting with friends and family and being able to share things.
Especially for new users, but also for established users, finding that person’s friends, the
people they’re interested in on the website makes the product that much more valuable.
It’s not the case that the more friends you have the more valuable the site is. You see
these people who have thousands of friends and you think that’s silly, but for people who
are less technically savvy or having a hard time finding their friends or for all these
people, if we can help these people identify the people they care about and help connect
them, that’s creating a lot of value for the users which leads to all sorts of good things
downstream.
Student:
Why do you take into account the number of friends people have?
Student:
When you suggest a friend, does the same suggestion show up in the other person’s
homepage?
Lars:
No, it may but there is no reason. It’s not necessarily a symmetric relationship. If a
person only has 1 friend, then there is a very limited pool of people we can sample from.
We’re only suggesting friends of friends so if you have 1 friend and that person has 20
friends, there are only 20 candidates. We have to pick one of those 20, but for the
person on the other end that has hundreds of friends, that person who only has 1 friend is
maybe not a good suggestion. There is probably someone better.
Student:
Based on the networks and information you have on your profile, wouldn’t it be a good
way to suggest – instead of suggesting a specific friend, suggest “do you want to check
more people in this network?” Like more people who went to your high school, people
that worked at this place, so you could be browsing your homepage and say I want to
know who else lives around me?
Lars:
I think there are some products on Facebook like that. I think there is a “Find People At
My School” product, and there is also this thing related called “contextual PYMK” which is
you go and add a friend and we try to make recommendations based on a system sort of
like this except also taking into account the context. I add Jim as a friend and now it
suggests I’m sort of – he’s my new friend so it’s going to suggest his friends to me. It’s
sort of taking the system and adding the additional constraint that they also are my new
friends' friends. We do some things like that.
Student:
Do you also recommend friends of friends of friends?
Lars:
No, I talked about why we don’t do that. We can go back and I can answer that offline.
Let me wrap up. At a high level you should think what we really care about is
somehow creating the most value in the whole ecosystem. That’s hard to quantify.
At some level, every time we manage to create a friendship that adds some value
to that person, presumably. That person is going to the trouble of adding a friend
so they’re getting something out of it. We’ve done a good job.
1:07:36
If we make a suggestion and you ignore it or delete then we’ve done a bad job and
there are some costs associated with that. We’re potentially annoying our
customers by suggesting people they don’t know but then there is also the
opportunity costs. We could have shown something else in that space, like an ad
or there are other modules we can fit up there.
Of course there are other issues to consider. You might ask if you suggest my girlfriend
to me, I would have found her anyway. You’re really just cannibalizing other channels,
not adding anything. This is a question we look at; how many of these friendships that
we’re suggesting would have been found through some other channels, not what is the
absolute value, but what is the incremental value we’re creating?
Finally results – this new system has replaced what we had a couple months ago. I’ll
show you the improvement we’ve managed to make. This is showing 30 days. At the
top you have the total number of add clicks. When we first started we did something
good and screwed up for a while, but we’ve fixed it. You can see that on this graph too.
Compared to 30 days ago, we’re doing pretty well. Our CTR is up 30% and also the total
clicks are up 60%. Total value is up a bunch, we think, but we don’t really have a good
idea of exactly how to measure that so it’s hard to put a number on.
So far, we’ve made a lot of big improvements. I would say that compared to before, the
biggest improvement is coming from the 2-tier system where we do the expensive part in
this batch mode and then do the fine-grained adjustments in real time. Both the systems
we have are still pretty simple so there is a lot of room for improvement. The offline
system has a lot of potentially relevant features that we just don’t use because we
haven’t gotten around to them. Those are things like how many photos are you tagged in
with a person, or how many messages have you sent, what is the relationship status,
dozens of things we think might help but we haven’t tried yet.
Machine learning – we’ve only tried this one method, these bagged decision trees.
We should probably try some other things and see what works best. At a higher
level, we’re still not exactly sure what the best way to collect training data is. Right
now we have two ways we’re experimenting with. One is positive examples are
add clicks. Negative examples are remove clicks. The other way is the positive
examples are the same, the negative examples are all the impressions that you
didn’t click, either remove or ignore.
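The two labeling schemes described here can be sketched as follows, assuming each logged impression is a pair of features and the action taken. The scheme names `removes_only` and `all_non_adds`, and the log entries, are made up for illustration.

```python
def label_examples(impression_log, scheme):
    """Turn logged impressions into (features, label) training pairs.

    scheme "removes_only": positives are adds, negatives are remove clicks,
    and ignored impressions are dropped.
    scheme "all_non_adds": positives are adds, negatives are every impression
    that was not added (removed or ignored).
    """
    if scheme not in ("removes_only", "all_non_adds"):
        raise ValueError(f"unknown scheme: {scheme}")
    examples = []
    for features, action in impression_log:
        if action == "add":
            examples.append((features, 1))
        elif action == "remove" or scheme == "all_non_adds":
            examples.append((features, 0))
    return examples


# Hypothetical impression log: (features, action taken by the user).
impression_log = [
    ({"mutual_friends": 12}, "add"),
    ({"mutual_friends": 1}, "remove"),
    ({"mutual_friends": 3}, "ignore"),
]

strict = label_examples(impression_log, "removes_only")
broad = label_examples(impression_log, "all_non_adds")
```

The first scheme gives fewer but cleaner negatives. The second gives many more negatives, including suggestions the user may simply not have noticed.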
This real time system we have on top is super simple. It’s just a regression with 5
features. We’re working on adding a few more features that we think will improve that
further.
1:10:35
Three takeaways from all of the things we’ve learned over the last couple of
months, the first is figuring out exactly what your goals are is very important. If
you don’t do that you’re trying to hit a moving target and it makes everything much
more complicated. We’re still struggling with that. You want to make sure you’re
optimizing the right thing or at least if you can’t figure out exactly what to optimize,
don’t totally blow it.
Feeding into this is making sure you’re looking at more than just the positive
ratios. You want to look at misses as well as hits. We create a lot of friendships
but maybe we’re still showing too many impressions, still annoying our users too
much. People are always complaining to me.
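Looking at misses as well as hits can be sketched as a small report over per-impression outcomes. This is an illustrative sketch only: the action labels "add", "remove" and "ignore" and the metric names are assumptions, not the metrics Facebook actually tracks.

```python
from collections import Counter
from typing import Dict, Iterable


def impression_report(actions: Iterable[str]) -> Dict[str, float]:
    """Summarize hits (adds) and misses (removes, ignores) per impression.

    `actions` holds one hypothetical outcome label per impression shown.
    """
    counts = Counter(actions)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no impressions to report on")
    adds = counts["add"]
    return {
        "add_rate": adds / total,
        "remove_rate": counts["remove"] / total,
        "ignore_rate": counts["ignore"] / total,
        # How many suggestions a user sees per friendship created.
        "impressions_per_add": total / adds if adds else float("inf"),
    }
```

Two systems with the same add rate can differ a lot in remove rate or in impressions shown per friendship created, which is the kind of annoyance the positive ratio alone hides.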
Finally, in any of these systems the real world pops up and there is
only a limited amount of computational power. One of the biggest improvements
that we’ve made in terms of our performance is making the most of that and doing
the expensive parts in batch mode but then doing some sort of cheap, simple
improvements that give us a big performance boost on every impression.
I can answer some more questions if we have time.
Student:
On cannibalizing friendships: that was surprising to me, because you said before that every
friendship is valuable, every friendship helps. Why not make this friendship easier, faster
and as soon as possible?
Lars:
That’s true, and it’s our belief, but there might be other high-value things you could show
there. There are all kinds of different things you can put in that space and if you can
create a friendship that is high value then that’s great; our CTR is not that high so we
don’t always know if it’s high value. If it’s a question of whether I show 100 PYMK
impressions to get a couple of friends that the person would have gotten by typing the name
into the search box anyway, that might be less valuable than showing 100 ads. I don’t
know. It’s something to think about.
Andreas:
First, the goals or objectives matter. Someone asked whether we should have more ads or
more friends. Second, the ecosystem matters, the health of the ecosystem: fewer
interruptions and less irrelevant stuff. Third, the metrics matter. Finally, computation matters.
At that scale, you really do need to be smart about what you compute in batch
mode and what you do in real time.
Next Tuesday your project ideas are due. Bit.ly is coming to class. After class we have
the Real Time Web event, the VLab event, in the Business School. Thursday the Bit.ly
homework is due. Once we get the data from them we will tweet from @socialdata and
you know where to find the data which are to be analyzed. Finally, for the speaker on
Tuesday, the Facebook.com/socialdatarevolution page is a good way to ask some
questions ahead of time and to wrap our brains around what we want to learn. That’s it
for today.