Transcript of Andreas Weigend
Data Mining and E-Business: The Social Data Revolution
Stanford University, Dept. of Statistics
Andreas Weigend (www.weigend.com)
Data Mining and Electronic Business: The Social Data Revolution
STATS 252
May 4, 2009
Class 5 Facebook: (Part 2 of 2)
This transcript:
http://weigend.com/files/teaching/stanford/2009/recordings/audio/weigend_stanford2009_5facebook-2_2009.05.04.doc
Corresponding audio file:
http://weigend.com/files/teaching/stanford/2009/recordings/audio/weigend_stanford2009_5facebook-2_2009.05.04.mp3
Previous Transcript: (Part 1 of 2):
http://weigend.com/files/teaching/stanford/2009/recordings/audio/weigend_stanford2009_5facebook-1_2009.05.04.mp3
To see the whole series: Containing folder:
http://weigend.com/files/teaching/stanford/2009/recordings/audio/
Transcript by Tamara Bentzur, http://outsourcetranscriptionservices.com/
Andreas:
Are we ready for the second half? I already introduced Itamar. After our interesting
discussion in the first half, now we will hear about data science at Facebook. You should
know that I told them that people only call their thing science when it is not a science.
[Laughs] Physics is not called Physics Science.
Itamar:
The name is quite awkward. You can thank my former boss Jeff Hammerbacher for that.
Student:
Are you going to charge…
Itamar:
No, it’s open. You can use it, a company can use it.
Student:
…
Itamar:
Not to my knowledge, no; that’s a really interesting question, though. I’m not sure. So,
I’m Itamar and this is Eric. He’ll continue my presentation after I’m finished with most of
it. Today I want to talk with you guys about the team that I work on at Facebook, called
The Data Team, or sometimes we call ourselves the Data Science Team. First, let’s go
over some of what’s involved …
Andreas:
Would you be willing to share the slides with the students afterwards?
Itamar:
Sure, absolutely.
Andreas:
Okay, concentrate on the class and don’t worry about notes. I’ll put the slides up. Did
that change from what you sent me yesterday, or could I just put up what you sent me
yesterday?
Itamar:
This has changed. I want to start by giving you guys a little bit of a taste for what’s
involved, particularly with respect to the scale of doing data analysis and data mining at
Facebook.
We have a social graph, as everyone knows. There are two hundred million active users on
our site. More than half of them, over a hundred million users, come to the site each day.
Several hundred thousand new users join each day. Every user can be described by
hundreds of dimensions of different types, numerical, categorical, textual
dimensions. The average user has 120 friends. Friendships on Facebook span
many different types of relationship, coworkers, people you may have met several
years ago, close friends, family members, and so on.
There is also a lot of rich behavioral data that we collect. Action data: users
interact with hundreds of thousands of applications on and off the site, and users
interact with one another directly via hundreds of different, unique types of
interactions that we support on this site.
0:02:35.8
Finally, there is rich data we collect about social content: the photos, the status
updates, the platform application content, the events, the posts, the videos, the
notes, the groups, everything that users create and share with one
another. We have hundreds of millions of sharing events along these lines
every single week.
What do you do with all this data? You can’t just stick it in a MySQL database or even in an
Oracle rack server. At this kind of scale, the solution that we’ve gone with, for which
we really have Jeff Hammerbacher, the former manager of the team, to thank, is a
distributed approach with a piece of technology called Hadoop as well as a piece
of technology that we developed on top of Hadoop called Hive.
Hadoop, together with HDFS, is a distributed approach to data warehousing and computation.
It is inspired by the MapReduce programming paradigm and the distributed file system
developed at Google, but Yahoo developed it in Java in a system called Hadoop. Its
storage layer is HDFS, which is a distributed file system.
We took these technologies and on top of that, we built something called a metastore,
which enables you to do metadata management in terms of representing data
not as flat files but as actual database-like tables.
Finally, we built a [0:03:58.0 unclear] language. It’s very similar to SQL, called
HiveQL, that allows pretty much anyone with a SQL-like background to run
computations on this data warehouse.
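To make the idea concrete, here is a minimal Python sketch of the map and reduce steps behind a simple aggregate. It is an illustration only: the log rows, the field names, and the table name `action_log` mentioned below are hypothetical, and none of this is Facebook's code. In a SQL-like language, the same computation would read roughly as `SELECT action, COUNT(*) FROM action_log GROUP BY action`.

```python
from collections import defaultdict

def map_phase(rows):
    """Emit one (action, 1) pair per log row; rows are (user_id, action) tuples."""
    for _user_id, action in rows:
        yield action, 1

def reduce_phase(pairs):
    """Sum the emitted values per key, which is what GROUP BY plus COUNT(*) does."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

def count_actions(rows):
    """Run both phases in one process; a real cluster spreads them over many machines."""
    return reduce_phase(map_phase(rows))
```

On a real cluster the map step runs in parallel over chunks of the log and the reduce step runs per key, but the logic per record is the same as in this single-process sketch.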
We have this Hadoop and Hive system. It’s deployed on a couple of different clusters,
but our main cluster has over a petabyte of raw capacity, over 2 terabytes of
uncompressed data are collected into this cluster every single day, and dozens upon
dozens of terabytes of data are read and written each day via Hadoop and Hive.
Andreas:
2 terabytes, including videos and everything?
Itamar:
We don’t actually put the videos in the data warehouse.
Andreas:
That would be much more.
Itamar:
That’s right. The photos and the videos are not in the data warehouse. The fact that a
photo or a video was created or viewed is recorded in an action log which is then
imported into this Hadoop data warehouse.
You can think of all this data as dimensional data about users and raw log files
about the pages they view, the interactions they have, the applications they install,
the content that they produce, and so forth.
0:05:07.4
Now that you understand the core technology that’s involved in terms of the
infrastructure, I’d like to explain what the data science team itself does. We can really
think about what the data science team does as two different things: behavioral
analysis and data-driven systems.
Behavioral analysis – some examples – we are involved in some discussions of
formulating the key product health metrics by which we gauge our success, our
site success. With a mission like helping people share more information and to
connect to the people they care about, it’s not entirely obvious what to pose as a
product health metric. You can’t just simply talk about number of visits or number
of users or number of purchases if we sold something.
Another thing that we do is we support product launches by doing evaluations of
the effect of those launches. For example, you guys probably noticed, those of you
who use Facebook, that there was a site redesign six or eight weeks ago. We came in
before the launch and helped the product team answer some questions about which
specific decisions to make, based on past behavior. Once the product was actually
launched, we judged its effect on the overall ecosystem of the user base.
Another example is growth modeling, understanding which markets we’re growing
in, what the saturation patterns are like, what the adoption curves are like, where
there might be markets that we don’t have any success in, even if we’re sending
invitations and there are some key users there, because the market is a little bit resistant. These are
the questions involved in growth modeling.
Another interesting problem is user churn modeling. That’s the problem of
understanding for a given user, who has a certain set of characteristics and a
certain usage profile, how can we model the likelihood that that user will return to
the site, K times in the next T periods. Doing so would enable us to label high risk
users and maybe target them with additional proactive measures to keep them on
the site and we can also leverage this kind of modeling for explanatory purposes
to understand what sort of features get users to stay on versus what sort of
features push users away.
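One very simple way to frame "K times in the next T periods" is sketched below, under the strong and hypothetical assumption that a user returns in each period independently with the same probability p. The talk does not say what model the team uses; the function names and the 0.5 risk threshold are made up for illustration.

```python
from math import comb

def prob_at_least_k_returns(p: float, k: int, t: int) -> float:
    """P(user returns in at least k of t periods), assuming independent periods
    with a constant per-period return probability p (a binomial tail)."""
    return sum(comb(t, i) * p**i * (1 - p) ** (t - i) for i in range(k, t + 1))

def is_high_risk(p: float, k: int, t: int, threshold: float = 0.5) -> bool:
    """Label a user high risk when the return likelihood falls below the threshold."""
    return prob_at_least_k_returns(p, k, t) < threshold
```

In practice p would itself come from a model fitted on the user's characteristics and usage profile; here it is simply passed in.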
There are a couple of different things here as well, such as production incentives. I’ll be talking
about that later. That’s a series of studies trying to understand what gets users to
produce content. There is also content diffusion, which Eric will be talking about. That’s
sort of the first family of things that we do. It’s driven by very quantitative social science
analysis.
The second set of things we do we call data-driven systems. This really is a lot of
machine learning and recommendation systems and algorithmic optimization. As an
example, one of the projects that we do here is ad CTR prediction for optimization
of the ads that we serve. Given a user, their profile characteristics, various things they
might talk about with their friends, and ads they’ve seen recently, what’s the likelihood that
they’ll click on a particular ad? If we’re able to determine that likelihood, we can maximize
our revenue by serving the ads that have the highest product of click likelihood and
expected value.
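The selection rule described here, likelihood times value, can be sketched in a few lines of Python. This is an illustration under assumptions: the `Ad` record, its field names, and the numbers used with it are hypothetical, and the click probabilities are taken as given rather than predicted.

```python
from typing import NamedTuple, Sequence

class Ad(NamedTuple):
    name: str
    p_click: float          # predicted click-through probability for this user
    value_per_click: float  # revenue earned if the user clicks

def expected_value(ad: Ad) -> float:
    """Expected revenue of showing the ad once."""
    return ad.p_click * ad.value_per_click

def pick_ad(ads: Sequence[Ad]) -> Ad:
    """Serve the ad with the highest expected revenue."""
    return max(ads, key=expected_value)
```

Note that the ad with the highest click probability is not necessarily the one served; a rarely clicked but valuable ad can win.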
0:08:19.0
Another example is PYMK. PYMK is a feature on the site that suggests to you
people you might know, people you might want to connect with. That’s actually a
very interesting and non-trivial algorithmic problem. You have a set of friends and we
can compute maybe the second order set of friends of friends that are connected with
those friends, and pick out of those people the people that you are most likely to connect
with, and we actually show them to you.
The problem is, if the average user has 100 friends, the size of that set is 10,000 and we
can’t show you 10,000 people. We can show you one person. How do you actually
determine which of these people in this candidate set are most likely to be people
you know and want to connect with and you want to interact with?
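A minimal sketch of building that friends-of-friends candidate set follows. Ranking candidates by the number of mutual friends is an assumption made here for illustration; the talk does not describe the actual scoring, and the graph representation and function name are hypothetical.

```python
from collections import Counter

def friend_of_friend_candidates(graph, user):
    """Return (candidate, mutual_friend_count) pairs, most mutual friends first.

    graph maps each user to the set of that user's friends.
    """
    friends = graph.get(user, set())
    mutual = Counter()
    for friend in friends:
        for candidate in graph.get(friend, set()):
            # Skip the user themselves and people they are already friends with.
            if candidate != user and candidate not in friends:
                mutual[candidate] += 1
    return mutual.most_common()
```

With an average of roughly 100 friends per user, the loop touches on the order of 10,000 friend-of-friend entries per user, which is why the candidate set has to be cut down to a handful of suggestions.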
Andreas:
Let me interject here. It’s not only about you, but it’s really about the site as a
whole. The metric is not just your likelihood of clicking on them. If it were, you would just show the
cutest people and that would solve the problem. You want those people who actually will
be good for the network as a whole. It might not be whom you are most likely to click on, but who
will benefit the most. That’s the hard part about modeling. That’s why statistics
comes in.
Itamar:
Exactly, and as an example, consider the case of an existing user who has 500
friends on the site, who might not derive that much more value from making
another connection. We’re faced with a problem of two candidates to show him; one is
another user who has been on the site for a very long time, is a very close friend of his,
who for some reason he may have missed. Another person is someone he knows a
little less well, who is a new user, who just came on the site and only has a few
friends. In some cases, it might actually be more advantageous in terms of a
network effect, to recommend a new user because making that connection will
severely impact the likelihood that the new user stays on the site and derives
value. That’s an example of the point that you made.
Student:
…
Itamar:
Absolutely, that’s right.
Student:
…
Itamar:
We’re evolving the algorithm to consider these sorts of situational factors. I think
Andreas alluded to them earlier. Right now, it’s not very good; you’re right. You keep
seeing the same recommendations over and over, but we’re definitely thinking about how
to evolve that to a more optimal state in the future.
A few more examples: search ranking. When you search for someone, how do we
decide, again based on your intent and also based on what’s best for the network,
whom to show as the first entry in the search results?
0:11:10.3
Finally, a problem that we’re just starting to work on is the highlight section. I don’t
know if you guys actually know what that is. As a result of the redesign, now there are
two streams on the home page that serve you information about what your friends
are doing. There is the main stream that shows you everything your friends are
doing at that time. There is the highlight stream on the bottom right, that shows
you the best content.
That latter problem is a very robust and interesting quality recommendation
system problem. We’ve just started tackling how to actually do that in an empirical way.
The point is, we have these two sets of tasks. Unfortunately, I can’t really talk about the
specifics of the data-driven systems part because it’s pretty algorithmic. It’s a lot of the
secret sauce of what we’re working on.
With the behavioral analysis part, we can really explore a bunch of interesting problems
that reveal the nature of user behavior, the nature of social norms, and the nature
of how products affect user experience on this site.
I think a final point I want to make here is there are also some interesting problems
at the intersection of these two focuses. Most of those problems are tool-based in
nature or have to do with infrastructure and tables. Making sure you have
everything instrumented properly, making sure you have tables that express the
metrics you really want to care about, making sure you have tools that allow you to
run statistical experiments that allow you to run machine learning algorithms like
random forests, or logistic regressions or linear regressions on this huge scale of
data that you have.
Now, after that longwinded slide, I will quickly show you who is on the team. We have
ten people, and a manager. The backgrounds are pretty varied. Some people are fresh
undergrads. Some people have PhDs in artificial intelligence. Other people on the team
have PhDs in computational sociology. One person has a PhD in biomedical informatics.
We have a self-proclaimed algorithmic designer on the team, who specializes in
visualizing data.
The first thing I want to talk about is a pretty simple exercise in descriptive
statistics, that I think yielded some very interesting results. The question we were
tackling with this particular study is we wanted to know whether Facebook is
increasing the size of peoples’ personal networks, the networks of other people
that people are interacting with and keeping up with on the site.
The question is very difficult to answer. The first thing we wanted to do was a
descriptive task to explore the different types of relationships that people maintain
on this site, and the relative sizes of these different groups. You can think of at
least four different types of relationships that might exist in the world that we
might care about on Facebook.
0:14:14.3
One set is the people you just know throughout your life, and in Facebook, it’s fair
to say those are really your Facebook friends. Your Facebook friends for many users
are people they’re actually friends with, but it’s actually a superset of that. It’s a set of
people you’ve met at some point in your life that have some sort of relevance to
you.
Researchers have estimated this number to be somewhere between 300 and 3,000.
Malcolm Gladwell did an interesting experiment where he gave people the phonebook
and went through common names and asked people to tell him how many people of a
given name they knew. By using that information, compared to the distribution of names
in the actual population, he was able to sort of extrapolate a guess as to how many
people you know throughout your life.
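The extrapolation described here amounts to a simple scale-up: divide the number of people you know with the sampled names by the share of the population that carries those names. The sketch below assumes that acquaintances are spread across names the same way the general population is; the function name and the example numbers are hypothetical and are not figures from the experiment.

```python
def estimate_acquaintances(known_with_sampled_names: int,
                           population_share_of_names: float) -> float:
    """Scale the count of known people with sampled names up to all names.

    population_share_of_names is the fraction of the population (between 0 and 1)
    whose name appears in the sampled list.
    """
    if not 0 < population_share_of_names <= 1:
        raise ValueError("population_share_of_names must be in (0, 1]")
    return known_with_sampled_names / population_share_of_names
```

For example, knowing 5 people whose names cover 1% of the population would extrapolate to about 500 acquaintances.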
The second set of people who might be interesting for us is your communication
network. These are people with whom you directly communicate on a regular
basis. It probably indicates, at the very least, your core support network, the
people you really rely on for emotionally significant events and support, and social
researchers have estimated the size of this group to be as low as 3 people.
Duncan Watts and a few of his students actually dug into this point a bit and looked at
email communication patterns across various universities and found that in a month
long period, people generally communicate with between 10-20 people, over email,
if email is something that they’re using.
The final set of relationships that are interesting for us are this notion of
maintained relationships. Social technologies like the Facebook Newsfeed or RSS
readers allow you to keep up with people in your life and just know what they’re
doing without actually having to directly communicate with them.
You see your friends do something and it pops up on your newsfeed, or you have an
RSS reader that gives you information from your friends’ blogs that you are subscribing to.
You can actually consume the content that they’re producing in a passive way.
This really is a form of relationship management, we’re claiming, because you
might see your friend do something in a newsfeed like post a photo of the fact that
they’re engaged, and this leads you to reach out to that friend and congratulate
them on their engagement, even if you hadn’t talked to them in many months.
It’s also important to know that even though technologies are making these types of
relationships much easier to access, it’s something that you could have done in the past,
as well. You might have been in a party and rather than directly talking to someone you
didn’t have a direct communicative relationship with, you asked your good friends, “What
is Joe doing these days?”
We’d like to measure the size of these different types of relationships on Facebook.
What we did was we examined the relationships of a random user sample, of a
couple hundred thousand users, over thirty days, on this site. We defined the
networks of these users in four ways:
0:17:16.9
The first definition is all their friends. This is the largest representation of a
person’s network on this site. It’s the best proxy we have for how many people they
know.
The second group of people we wanted to measure was reciprocal
communications. This is the number of people that you, as a user, reach out to via
direct communication on this site, that also reach back to you, via messages, wall
posts, and comments. We think this provides a measure of your core network, the set
of people that you really are actively engaging with in the most meaningful way on this
site.
The third relationship we wanted to define was one-way communication, so this is
a little bit more of a broad notion, just the people that you’re reaching out to. Of
your friends, whose wall are you writing on, who are you sending messages to,
and who are you communicating with.
Finally, this special notion that we’re focusing on here, called maintained
relationships, that’s the number of friends that you’re tracking on this site. The
number of friends whose newsfeed stories you click on, and the number of friends you
view at least twice; at least twice to control for random people who pop up in your
newsfeed that you might just check out once out of curiosity, or people whose friendships
you accept and then you check out once and never care about again.
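The four definitions can be sketched as set operations over interaction logs. This is a simplified illustration, not the team's code: the log formats (`messages` as sender and recipient pairs, `views` as viewer and viewed pairs) and the function name are hypothetical, and all of the story clicks and profile views are collapsed into one `views` list.

```python
from collections import Counter

def network_sizes(user, friends, messages, views):
    """Sizes of the four network definitions for one user over the sample window."""
    # One-way communication: friends the user reached out to.
    reached_out = {to for frm, to in messages if frm == user and to in friends}
    # Friends who reached out to the user.
    reached_back = {frm for frm, to in messages if to == user and frm in friends}
    # Maintained relationships: friends viewed at least twice, to filter out
    # one-off curiosity clicks.
    view_counts = Counter(
        target for viewer, target in views if viewer == user and target in friends
    )
    maintained = {friend for friend, n in view_counts.items() if n >= 2}
    return {
        "all_friends": len(friends),
        "maintained": len(maintained),
        "one_way": len(reached_out),
        "reciprocal": len(reached_out & reached_back),
    }
```

Reciprocal communication is the intersection of the two directions, so it can never exceed the one-way count.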
Here are the findings that I wanted to discuss. We graph, as a function of the number
of friends a user has, the median size of these various relationships: reciprocal
communication is in red, one way communication is in green, and maintained
relationships are in blue.
What we see here is that, as a function of the number of friends that a user has, the
median user is passively engaging, in this maintained relationship sort of way,
with 2 to 2.5 times more people than the number of people directly communicated with.
Let’s see what this actually does for the structure of your network.
Here we take a user, a random user from our sample, and we graph their social graph
according to these different definitions. At the top left, you have the graph that
defines all of their friends. Next to it, you see the graph of their maintained
relationships. These are the people that they’re keeping up with. On the bottom
left, you see the one-way communication, and on the bottom right you see mutual
communication.
Just observe the stark contrast on the right, between the top right graph and the bottom
right graph. This is really visual evidence of what technologies like newsfeed
are enabling people to do. With a cell phone, or email, or SMS, you have to have this
reciprocal sort of mutual communication relationship with people to really learn
about what they’re doing. As you can see, that network is much sparser than if you look
at these maintained relationships up top.
0:20:10.2
Student:
…enabled a lot of … between the actual…
Itamar:
Is your question whether the people that you have maintained relationships with are a
disjoint set from the people that you have mutual communication with? It’s usually a super
set, but what’s interesting is if you then define this notion of maintained
relationship in a much more subtle way, as the people you actually have a
reciprocal maintained relationship with, the people whose activity you are
consuming and who are consuming yours, that set of people does not largely
overlap with the set of people you have mutual communication with. That’s fairly
interesting.
There is this group of people who are having reciprocal interactions in a direct way
and a group of people that are having reciprocal interactions in this passive way.
There is not complete overlap between the two.
Before I go to this next study, the point of that is just to show that with simple descriptive
statistics, we get very interesting initial insights about our product’s effect on users’
experiences and users’ social lives.
The next example is much more experimental in nature. This project aimed to
answer questions about content production among new users. We are really
trying to get a sense for the incentives that guide new users to produce content on
the site. The reason we’re interested in this is directly out of our mission; our mission is
to give people the power to share and make the world more open and connected. One of
our core objectives is to get people to actually share content with their friends on this site.
As a bit of background, content production among new users looks sort of like
this, a little less than half of new users upload a photo in their first two weeks. A
little bit more use a third-party app, although that’s not really content production.
Less than a third send a private message. About a quarter compose a status
update, and about a fifth write on a friend’s wall.
These numbers are kind of artificially inflated because there is a large population
of new users who sign up for the site and then never come back because they’re
just checking it out. They may have heard about it in the press and they’re just not
receptive users so after their first week, they don’t actually return to the site.
Controlling for these users, these numbers tend to go up a little bit, but the same sort of
observations hold. Uploading a photo seems to be the most prevalent content
production activity among new users.
Within this study, for that reason, we chose to focus on new users producing photos.
The goal of this study is really to model what leads new users to produce more photos
during the first three months on Facebook.
0:23:16.2
Let’s talk about some hypotheses around what the core production incentives are.
These are hypotheses that we came up with based on some social science literature,
HCI literature, and so on.
Student:
….
Itamar:
Yes, that’s a great point. When I’m citing this statistic here of 45% who upload a
photo, that doesn’t include profile pictures. If you control for profile pictures, that
number jumps down to I think about a third.
Student:
…
Itamar:
No, it’s still the most common behavior, except for using a third-party app, which we’re
not here considering as content production. This does have some of that bias there. The
one thing to note with that issue is that a lot of users tend to use their profile pictures
as a place for uploading photos and not just representing themselves on the
profile.
Let’s talk about hypotheses. The first hypothesis is one focused on feedback. The
hypothesis is that newcomers who receive more feedback on their initial content will go
on to contribute more content later on.
The second hypothesis is about distribution. Newcomers whose initial content
receives greater distribution in terms of the number of people who actually see it, will go
on to produce more content.
The third hypothesis is social learning. Newcomers whose friends share more
content and who see their friends share more content will go on to produce more content
themselves.
Finally, we have a hypothesis around singling out. New users who are singled out in
content that their friends produce, will go on to produce more of that content themselves.
Do the hypotheses make sense?
Student:
I don’t understand the…
Itamar:
Really, what we’re talking about here, with respect to photos, is if you as a new
user are tagged in a photo, you somehow become aware of this content production
mechanism on Facebook. You go on to produce more content.
Another example is someone might refer to you by name in a status update, and
because you were referred to by name, you would attend a bit more carefully to it
and you would think about what status updates are, and then you go and produce
status updates.
Student:
What about the …
0:26:01.8
Itamar:
That’s right; our hunch was that this hypothesis was the weakest hypothesis because
there is no real data that the user receives about how many people are seeing their
content. Nevertheless, I think it’s a pretty classic consideration to assume that users
somehow want to maximize the amount of distribution they have and maybe they have
indirect ways of judging that based on the buzz they hear or messages they receive from
their friends.
I think the other valuable aspect of this is if you put a sort of weak hypothesis,
dummy hypothesis in the model, you really want to make sure that your
expectations are met and that the model shows you no effect of that hypothesis.
That’s one of the reasons why it’s there.
Here is the method that we followed. We did a quantitative study that I’ll be talking
about here, and we also did a qualitative study, which for the sake of time I won’t
mention. The quantitative study involved taking two cohorts of new users, one from
November 5, 2007, and one from March 3, 2008. We observed their activity in the first two
weeks and we predicted, based on features constructed from their activity in their
first two weeks, how many photos they uploaded between their third week and
their fifteenth week on Facebook.
Here are the features we considered. For the independent variables, first let’s talk about the
variables that actually capture the hypotheses we posed.
For the first hypothesis – feedback, the variables that we looked at were whether
users received comments on the content that they produced, and how many
comments they received.
The second hypothesis – distribution, the variables we looked at were the number of
times that content the new user produced was viewed by their friends in
newsfeed, and the number of distinct friends who viewed the content in newsfeed.
The third hypothesis – social learning, this is captured by the variable, the number
of friends’ photos that the new user saw in their first two weeks.
The final hypothesis – singling out, was the number of times the new user was
tagged in photos, as well as a binary indicator of whether the new user was tagged
at all.
There were some controls. As control variables, we included the user’s age, gender
in the form of whether or not they specified it, in addition to the actual gender; the
number of friends that the user had, the total pages that the user viewed in the first
two weeks on the site, engagement with photos – the number of photos they
uploaded in their first two weeks, the number of photos they viewed, the number of
photo tags they authored, and the number of photo comments that they wrote on
other peoples’ photos.
0:28:46.1
Student:
…
Itamar:
That’s why we ran two cohorts. One is from one period of the year and the other is from
another period of the year. We took these features that I discussed in the previous slide;
again, we’re predicting the number of photos that a user produces between the third and
fifteenth week, and we put them in a linear OLS model, linear regression, and here are
the results that we found.
We really had two models here, the first model was only able to test two of the
hypotheses, singling out and social learning. The reason the first model was only able to
test those two hypotheses is because there is a certain set of people who just don’t
produce photos on the site. If they don’t produce photos in the first two weeks, they’re
not going to get any distribution, no feedback, and those hypotheses aren’t testable.
I misspoke; the first model just focuses on early uploaders, people who actually
produced photos during their first two weeks, for whom all the hypotheses apply,
and the second model tested on everyone, where, in some cases, only the first two
hypotheses apply.
If you see here, I’ve listed the coefficients of the model and the percent change from the
Y intercept. At the bottom there, you have the independent variables that are actually
testing for the hypotheses.
What I haven’t included here was an observation that we made initially, which is the
number of comments received on photos had absolutely no statistically significant effect on
the number of photos produced later on, which was a very surprising result. It basically
says that controlling for all these other factors, the number of comments you receive on
photos as a new user is not in any way leading you to produce photos later on.
When you change that to a binary variable that indicates whether or not you
received any photo comments on your photos as a new user, we do see a positive
effect. There is a positive coefficient there that leads to a 6.2% increase in the
volume of photos that you produce. That’s the feedback hypothesis.
Student:
…
Itamar:
It’s an indicator variable. We transformed that variable just to an indicator variable that
expresses whether the user received any comments on the photos that they produced.
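[Illustrative sketch, not part of the talk: one way the indicator transformation and the "percent change from the intercept" could be computed. The toy data, the function names, and the reading of "percent change" as coefficient divided by intercept are all assumptions; the numbers are synthetic and are not the study's results.]

```python
import numpy as np

def to_indicator(counts):
    """Return 1.0 where a count is positive and 0.0 otherwise."""
    return (np.asarray(counts) > 0).astype(float)

def fit_ols(features, y):
    """Ordinary least squares with an intercept; returns (intercept, coefficients)."""
    X = np.column_stack([np.ones(len(y)), features])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta[0], beta[1:]

def percent_change_from_intercept(intercept, coef):
    """Express a coefficient as a percentage of the intercept (assumed definition)."""
    return 100.0 * coef / intercept

# Hypothetical counts of comments received by six new users.
comments_received = np.array([0, 3, 0, 1, 7, 0])
got_comment = to_indicator(comments_received)

# Synthetic outcome built so that the true intercept is 10 and the true effect is 0.5.
photos_later = 10.0 + 0.5 * got_comment

intercept, coefs = fit_ols(got_comment.reshape(-1, 1), photos_later)
pct = percent_change_from_intercept(intercept, coefs[0])
```

[With real data the fit would also include the control variables as extra columns of the feature matrix.]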
The second hypothesis, distribution, is captured by that independent variable
there, photo views received. What we see here is that there is a significant effect; the
three stars mean that the coefficient is significant at the .01 level, but it’s pretty
modest. Photo views received leads to only a 2.6% increase in the volume of
photos that you produce later on.
0:32:09.7
To test the hypothesis of social learning, we have an independent variable there on
the left of photo stories seen, which is the number of times that you see your
friends actually produce photos. We see a pretty sizable effect here, an increase of
6.1% in the volume of photos that you produce later on.
The final hypothesis around singling out is captured by this binary variable of
photo tags received, and we found no significant effect there.
Turning to the second model, which is testing on everyone, and is only able to test
the two hypotheses around social learning and singling out, we see pretty similar
effects. Again, let’s look at the independent variable columns here. Photo stories
seen is the social learning variable; as an interaction with early uploaders, we see a
very significant effect, an increase of 10.7%. For photo stories seen, interacted with
non-early uploaders, we see a much milder effect. For photo tags received as an
interaction with early uploaders, we see no significant effect. For photo tags received
as an interaction with non-early uploaders, we see a positive effect.
I think the story that especially the right column is telling is that users are going to
produce more content later on, as users who are engaged with the photo app, if
they see in their first two weeks, their friends doing this activity at all, if they’re
able to discover the feature. If you see here, receiving a photo tag, if you’re not an
early uploader of photos, has a pretty large effect, which signals to me that this is the
event through which the non-early uploader discovers the photo feature.
So, to summarize the results, the feedback hypothesis – we found support among
early uploaders and the hypothesis is not applicable among non-early uploaders.
The distribution hypothesis – we found modest support, a pretty small coefficient
among early uploaders. Again, it’s not applicable among non-early uploaders. We
found significant support across both groups of users for the social learning
hypothesis and mixed results in terms of singling out; no support among early
uploaders, probably because they don’t need to discover the feature – they’ve
already discovered it, and large support for the non-early uploaders, probably
because singling them out with a photo tag is leading them to discover the
feature.
The conclusions here really seem to suggest that we learn about social utilities
from our friends. If our friends are doing something on the site and we’re able to
witness them doing that, we’re probably going to start mimicking them. Based on this
study, social learning appears to be the main lever for content production on
this site. We have very rich access for pulling this lever. Back in the old newsfeed
system, we can play with the weights and perhaps show you a story of a particular
content type that you haven’t produced before, so if you’ve never produced videos, we
show you a video story and see if you actually produce more.
0:35:18.8
Now, with highlights, we still have that ability so the area of the site that shows you the
top stories that are going on, if you are a new user, and you have never become a fan of
a page or you’ve never created an event, we might want to expose you to that sort of
feature so you can learn about it and actually mimic the behavior yourself.
For the final study that we’re discussing, we’ll be talking about modeling contagion
through newsfeed, where contagion is a theory about how content diffuses among
a social network of people who learn about trends or adopt technologies. For that,
we have Eric Sun.
Eric:
This is a research project that I started last summer. The main goal is to figure out
how ideas spread through Facebook. Obviously, on Facebook, the biggest thing that
causes the spread of ideas is newsfeed. This is important for many reasons. First, it’s
important just as an academic study, but it’s also important for making money on this site
because for example, Facebook pages is the main product where Facebook can
interface with advertisers because most advertisers will put up a page on Facebook and
you can become a fan of the page.
We wanted to figure out how these ideas can spread on Facebook. We compared
these results with existing models of diffusion and ideally we would want to show that
advertising on Facebook is way better than advertising anywhere else, so why would you
bother spending money anywhere else.
I started this last summer, so it’s based on the old Facebook, in particular, the old
newsfeed where not everything is shown but certain types of stories are picked and
shown to each user.
Just to give a little bit of background, there are two competing theories about how
ideas spread through the population. The old theory is that it’s all about the
influentials. This is what Malcolm Gladwell talks about in The Tipping Point, and the
idea is that if you reach a tiny group of very influential people then they’ll talk to
their friends and their friends will talk and eventually you’ll reach everyone for free,
after you get these very important people at the top.
Therefore, a billion dollars a year is spent on word-of-mouth campaigns, trying to find out
which of these people are actually influential, and this amount is just growing every year.
This 36% figure was from 2006. I’m not aware of how it’s changed since then. That’s the
influentials theory.
The competing theory was recently developed by Duncan Watts, who is a sociologist at
Columbia, and I think he’s still at Yahoo. His idea is that anyone can be an influencer.
For ideas to become very popular, you don’t need to get the influential people;
rather, all you need is a very susceptible population. So, I think he coined a term
called “sheeple.” They’re not people and they’re not sheep, they’re “sheeple.” If you
have a lot of people who are easily convinced, that’s even better than just getting
the people who are very influential at the top.
0:38:56.1
We like to test these things on Facebook. Probably most of you are aware, if you’ve
forgotten the old Facebook already, the old Facebook had a newsfeed where stories
were selected based on how receptive you’re likely to be to them. The way you interact with
the pages product is, for example, any retailer might have a page and Alice might
decide to fan a page. Once she does that, with some probability, her friends will see an
item on their newsfeed saying Alice fanned a page. Once you see that, you can
either ignore it or you can also click a link saying, “I want to fan this page, as well.”
If you do this, we call this a chain of length one.
If you continue this process, throughout the whole population, the result is that
you get a huge cloud of large connected trees. This is a really interesting example
that we found. This is a diffusion chain for Stripy, which is a European cartoon. We
noticed that when we visualize this graph, we had three clusters immediately pop out.
One theory was that maybe these users were differentiated in some way. It turns out that
it was indeed true, although [0:40:27.5 unclear] are from Bosnia. All the Slovenian nodes
are yellow, and the Croatian nodes are green. The Bosnian and Slovenian nodes are
connected by a few users, but the Croatian cloud has not been connected, yet.
Student:
I don’t understand what the links represent. They represent actual links, in terms of
friends’ relationships? Links being that I saw one of my friends fan this page and directly
fan … within 24 hours….
Eric:
This is just an extension of this action, right here. In order to have a link to another node,
you need to first be a friend of that person and then you have to see that person has
fanned a certain page, and you also have to fan it, as well, within 24 hours.
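[Illustrative sketch, not part of the talk: the link rule Eric describes, written out as code. The data structures and names are hypothetical, and the sketch assumes the 24-hour window is measured from the moment the friend fanned the page, which the talk does not specify.]

```python
from datetime import datetime, timedelta

def diffusion_links(fan_times, friendships, saw_story, window=timedelta(hours=24)):
    """Return sorted (actor, follower) links for one page.

    fan_times:   dict user -> datetime at which the user fanned the page
    friendships: set of frozenset({a, b}) friend pairs
    saw_story:   set of (viewer, actor) pairs where the viewer saw the actor's story
    """
    links = []
    for follower, t_follow in fan_times.items():
        for actor, t_actor in fan_times.items():
            if actor == follower:
                continue
            if frozenset((actor, follower)) not in friendships:
                continue  # must be friends
            if (follower, actor) not in saw_story:
                continue  # follower must have seen the actor's newsfeed story
            if timedelta(0) < t_follow - t_actor <= window:
                links.append((actor, follower))  # fanned within the window
    return sorted(links)

# Hypothetical example: Bob follows Alice in time; Carol fans too late.
t0 = datetime(2008, 8, 1, 12, 0)
fan_times = {
    "alice": t0,
    "bob": t0 + timedelta(hours=3),
    "carol": t0 + timedelta(hours=30),
}
friendships = {frozenset(("alice", "bob")), frozenset(("alice", "carol"))}
saw_story = {("bob", "alice"), ("carol", "alice")}
links = diffusion_links(fan_times, friendships, saw_story)
```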
Student:
In other words it’s the influence, not just the ….
Eric:
It turns out that if you just draw these links from user to user, often the vast
majority of fans can be connected into one single cluster. For a new
page that becomes very popular, if you draw these links, you can sometimes get
over 90% of the fans connected in some way. That just speaks to how
connected Facebook is and how addicted Facebook users are to Facebook.
For example, on August 21, 2008, which was right in the middle of the Beijing Olympics,
71,000 of 96,000 fans of the [0:42:16.5 unclear] page views of the American gymnast
were in one connected cluster. For pages created after July 1, 2008, we measured the data
as of August 19, 2008, and the median page had almost 70% of its fans in one connected
cluster.
Student:
…
0:42:42.1
Eric:
Right now, we’re just looking at the clusters. Right now, the yellow and blue nodes would
be connected because there is at least one link between those two clouds, but this cloud
on the bottom right would not be connected. The 70% figure we get is just by taking all
the nodes in the top two and dividing by all of them, for example. In this case, there are
also a lot of other nodes that are just random, that were not as interesting to show.
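[Illustrative sketch, not part of the talk: computing the share of fans in the largest connected cluster, treating diffusion links as undirected edges. The function name and toy data are hypothetical; union-find is one common way to do this and is not necessarily what the study used.]

```python
from collections import Counter

def largest_cluster_share(fans, links):
    """Fraction of fans that fall in the largest connected cluster."""
    parent = {u: u for u in fans}

    def find(u):
        # Walk up to the root, compressing the path along the way.
        while parent[u] != u:
            parent[u] = parent[parent[u]]
            u = parent[u]
        return u

    for a, b in links:
        parent[find(a)] = find(b)  # merge the two clusters

    sizes = Counter(find(u) for u in fans)
    return max(sizes.values()) / len(fans)

# Hypothetical page with five fans; three are linked, two are isolated.
fans = ["a", "b", "c", "d", "e"]
links = [("a", "b"), ("b", "c")]
share = largest_cluster_share(fans, links)
```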
Student:
A cluster means that every person in that cluster fanned a page as a result of seeing
someone else’s cluster?
Student:
…
Eric:
That’s something we like to test, so I’ll get to that in a few minutes. First, we would like
to figure out how these large clusters come about in the first place. Are these
large clusters started by one guy, as the influentials theory predicts, or are they
formed when long chains of diffusion are merged together?
It turns out that for all pages of any meaningful size, which we define rather arbitrarily
at 1,000, 14.8% of the fans in the biggest cluster were start points. By that, I mean
15% of the fans searched for the page and fanned it, whereas 85% of the fans
found the page by seeing on their newsfeed that someone had fanned the page, and then
clicking the link to “also fan the page” within 24 hours.
This 15% figure is actually very stable, especially as the number of fans in that
page increases. This happens because the average node in the biggest cluster
is connected to almost 3 others. It’s in both directions so if I’m somewhere in this big
cloud of users, on average, I either have 2 parents and 1 child, or 2 children and 1
parent. I think the median behavior was 2 parents and 1 child. I don’t recall.
Just to compare diffusion chains on Facebook versus real life, obviously it’s a little bit
different, but the connected nature of Facebook makes long diffusion chains very easily
possible.
Just to contrast this with a word of mouth study in real life, there was a paper in
1987 where they tried to track the propagation of piano teachers in a
neighborhood. If they drew a similar link graph, they found that 38% of the paths
involved at least 4 individuals. On Facebook, where things are very easily found and
you log onto Facebook and you see a newsfeed and a lot of items that say your
friends have fanned a page, and it’s really easy to also click that link, it turns out
that 86.4% of paths of page diffusions involve at least 4 individuals. It’s already a
huge difference.
Another thing we wanted to do is to figure out how these long diffusion chains are
created because this obviously has a lot of implications for advertisers and also it
is just interesting as a sociological experiment. We wanted to test whether the
influentials theory or the contagion theory is more applicable to Facebook. To do
this, we tried to predict the length of the diffusion chain that someone will create
when they fan a page.
0:46:50.9
Basically, the way it works is that if I fan a page, using my characteristics, maybe
we’ll be able to predict the length of the diffusion chain I create. If I fan a page,
maybe Itamar will see it and he’ll fan a page, and someone he’s friends with that I’m
not will also fan it. The process continues. If it stops there, then my chain will be
of length 3, but maybe there are some characteristics that will be more amenable
to having someone create a very long chain. If we can figure out what those
characteristics are, then it would be very great.
To do this, we used a sample of 10 pages that we found pretty randomly. For these
pages, we got the graph of every single link between the actor and the follower. To make
sure these are good pages, we made sure they’re sufficiently old and also sufficiently
large.
In our prediction model, the response variable is maximum
chain length. For each user who is a chain starter – someone who becomes a fan of
a page without ever seeing any of their friends fan the page in the last 24 hours –
the maximum chain length is a count variable for how long their chain was.
Predictors were gender; their age, which is how long they’ve been a member of
Facebook; and their feed exposure, which controls for the number of friends who saw their
newsfeed story. In the old Facebook, this was not just everyone. We also controlled for
their friend count, for how active this person was on Facebook, and there is also a
nebulous concept called popularity. This controls for newsfeed exposure through an
algorithm that puts stories on your friends’ newsfeeds. With some probability, if I’m a
friend of yours, I’m going to see your story when you fanned a page, but that algorithm is
pretty complex.
Student:
I was wondering about….
Eric:
In graph terminology, sometimes they call it “diameter,” so if you take this person
at the top and you plot their entire tree, it’s the maximum length of the tree,
just the number of levels of the tree.
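[Illustrative sketch, not part of the talk: maximum chain length computed as the number of levels below and including a chain starter. The names are hypothetical. The sketch assumes the links are acyclic, which holds if each link points forward in time, and it counts people on the longest path, so a starter with no followers has length 1.]

```python
def max_chain_length(starter, children):
    """Number of levels in the diffusion tree rooted at `starter`.

    children: dict user -> list of users who fanned after seeing that user's story
    """
    memo = {}

    def levels(user):
        # A node may be reachable along several paths, so cache its depth.
        if user not in memo:
            below = (levels(child) for child in children.get(user, ()))
            memo[user] = 1 + max(below, default=0)
        return memo[user]

    return levels(starter)

# Hypothetical chain: Eric fans, Itamar follows, then one of Itamar's friends follows.
children = {"eric": ["itamar"], "itamar": ["friend_of_itamar"]}
length = max_chain_length("eric", children)
```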
Students:
…
Eric:
Yes, but that makes it very difficult because a lot of these chains will merge, so it turns
out that you won’t get a very interesting regression because most of these people – if
90% of these users are actually in one chain, then the width will just be 90%. Everyone
will have the same number.
Student:
To me, what you’re telling me is …
Eric:
There are definitely a lot of things you can do with this data. I’m just presenting one little
analysis.
0:51:01.8
I’m going to skip over the technical details, but we’ll post the paper on the wiki later, if you
want to take a look. We ran a negative binomial regression and we found that the
only consistent significant coefficient is on this feed exposure data, which controls
for the number of friends who saw your news feed. This coefficient hovers around
one, which implies that if we tweak it such that newsfeed publishes a user’s action
to 1% more people, we can actually expect a 1% longer max chain.
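[Illustrative note, not part of the talk: why a coefficient near one reads as "1% more exposure, 1% longer chain." This assumes the usual log link for a negative binomial regression and assumes feed exposure enters the model in logs; the talk does not state either. Here Y_i is the maximum chain length, z_i stands for the other predictors, and the symbols are introduced only for this note.]

```latex
\log \mathbb{E}[Y_i] = \beta_0 + \beta_1 \log(\mathrm{exposure}_i) + \gamma^{\top} z_i
\quad\Longrightarrow\quad
\frac{\partial \log \mathbb{E}[Y_i]}{\partial \log(\mathrm{exposure}_i)} = \beta_1 \approx 1
```

[Under those assumptions the coefficient is an elasticity: a 1% increase in exposure multiplies the expected maximum chain length by about 1.01 raised to the power of the coefficient, which is roughly a 1% increase when the coefficient is close to one.]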
This implies that the friend count variable, which is just the number of Facebook
friends you have, is actually not that meaningful after the controls, which is a
surprising finding. It says that after
controlling for distribution and popularity, your demographic characteristics, such as
age and gender, and your number of Facebook friends do not play an important role in
the prediction of your maximum diffusion chain length. This is one small piece of evidence
for the contagion theory that Duncan Watts puts out.
Student:
…
Eric:
If you have no friends, then you will not diffuse. You are right, there will be an excess
number of zeros, so we did a correction. It’s called a zero-inflation correction, in the
negative binomial procedure. You can take a look at the paper if you’re interested.
In conclusion, Facebook newsfeed enables long-lasting chains of diffusion that
may reach a lot more people than real life chains. This is made possible by
Facebook, which is very connected, and ideas that have good receptiveness will
attract long, wide chains of clusters. The main influence on these long chains is
basically distribution, not anything related to the users. That’s why it may not be so
important to find the actually influential people. I think that’s it.
Student:
… clusters with the …
Eric:
We assemble clusters just by drawing links between all these users. Our data set is so
big that if you look hard enough, you can find clusters that look like anything. Am I
misunderstanding your question?
Student:
… nodes… it’s connected by…are those small clusters generate… essentially you can
draw at the edges… to find how big your cluster….
Eric:
Yeah, but that would require a certain structure of the graph that may not be present,
because there is also a time aspect where it’s really not a graph; it’s really a tree that
goes throughout time. I think a tree might be a better way to think about it.
Student:
… assigned any weight to the connections…
Eric:
That’s not something we’ve done so far. Right now, for this, all we care about is that
they’ve fanned a page, seen someone fan a page, and they also fanned the page.
0:55:07.0
Student:
I thought that… do you think there are power chains that are more… than the others?
Eric:
What do you mean by power chains?
Student:
The diffusion of information … nodes…
Eric:
What do you mean by consistently valuable?
Student:
If the information travels through one chain, consistently creating a larger effect than…
Eric:
So you’re saying if we take a lot of different kinds of content and it always propagates
through this one chain, is that meaningful?
Student:
Yes
Eric:
Maybe, we haven’t looked into that. I’d expect that something like that could happen very
easily.
Student:
… people versus
Eric:
Yes, but I didn’t cover that in the presentation but it’s in the paper if you want to take a
look.
Student:
… “sheeple,” and you used the cartoon to propagate. Does this have anything to say
about the complexity of the information and its ability to propagate through a long chain,
as something that’s easily transmittable and … pass it on?
Eric:
We think that what’s very important is that people are receptive, so for example,
with the American gymnast in the Olympics example, it happened during the
Olympics and a lot of people were very receptive of the idea. All they needed was
some stimulus, for someone to point out that this page exists. Once someone
points it out, then people are very receptive and they’ll go ahead and click the link.
Andreas:
I’ll see you next week, and let’s thank both speakers very much. Thank you. [Applause]