Transcript of Andreas Weigend
Data Mining and E-Business: The Social Data Revolution
Stanford University, Dept. of Statistics
Andreas Weigend (www.weigend.com)
Data Mining and Electronic Business: The Social Data Revolution
STATS 252
May 11, 2009
Class 6 LinkedIn: (Part 1 of 2)
This transcript:
http://weigend.com/files/teaching/stanford/2009/recordings/audio/weigend_stanford2009_6linkedin-1_2009.05.11..doc
Corresponding audio file:
http://weigend.com/files/teaching/stanford/2009/recordings/audio/weigend_stanford2009_6linkedin-1_2009.05.11.mp3
Next Transcript: (Part 2 of 2):
http://weigend.com/files/teaching/stanford/2009/recordings/audio/weigend_stanford2009_6linkedin-2_2009.05.11.doc
To see the whole series: Containing folder:
http://weigend.com/files/teaching/stanford/2009/recordings/audio/
Transcript by Tamara Bentzur, http://outsourcetranscriptionservices.com/
Andreas:
Welcome to our class today, where the main attraction is Reid Hoffman. Reid Hoffman
just reminded me that we met at the parking lot at [0:00:12.5 unclear] a couple of
decades ago, in the 1980s. I remember him running, at Symbolic Systems, the weekly
speaker series where my PhD advisor, Dave [0:00:24.3 unclear] said, “Andreas, you
should come with me and meet that guy.” Dave was actually talking about what
distinguishes humans from animals. He said it was that we have invented an external
representation of knowledge. For an AI person, having a representation of something
where we talk about the axes of the spaces versus the points in the space; that’s always
what I’m referring to.
Reid then went on to co-found PayPal. After PayPal, he started LinkedIn. I remember I
first saw you in your office in 2003 or 2004, when we talked about what one could do with
all those data in a business setting, before Facebook was around. I know most of you
can’t imagine a time before Facebook, but there was a time before Facebook.
Reid will be joined by D.J. Patil when he gets here. In the meantime, let’s start. Do you
want to have a quick conversation with the class about where they are and what they
know about LinkedIn? Who here is on LinkedIn?
Reid:
At least there is a registration. D.J., who will join us mid-flight, is head of our analytics.
He used to be a professor of concrete math at Maryland. He did work with the DOD, and
built the latest fraud and risk model for eBay. He will be running through some of those
specific things. That’s actually not what I do with my time, since I’m a CEO.
For purposes of talking about things like knowledge representation and what not, that
Andreas is mentioning, I actually went to Oxford to study philosophy from here. I was the
thirteenth person to declare Symbolic Systems as a major – I started the Symbolic
Systems forum. I think it’s still going, from what I gather, but it’s twenty years later.
I will open with a few comments about how LinkedIn looks at data and looks at the kinds
of products we can construct with data in order to give you some conceptual overview to
this. D.J., when he gets here, may rehash some of this because he’s going to go into
more depth than I will, but starting with the concepts is interesting.
One of the key things that you are probably already familiar with, given that I’ve looked at
a few of your earlier classes, is looking at networks as essentially information and people
reputation systems. Part of what you’re getting out of a network, especially with
data added to the network and depending on what the semantics of the network are, is
a reputation system for deciding which people are good at something,
which people you can trust, and which sources of information you can trust.
0:03:30.3
The professional applications of that are relatively straightforward, on a conceptual
basis. It is not just hiring, although this is important, but also making judgments
about expertise, where a judgment that this person is an expert in something can
facilitate making a transactional decision. What you may not be familiar with (I don’t
know how many of you have crossed over to having industry experience) is that the kinds of
things that LinkedIn is used for are not just recruiting or job seeking; hedge
funds also use it a lot in order to find experts in order to do trades in the market. There
have been such things as people sourcing a deal and all the reference checking on the
deal and making $300 million in eighteen months off of a $50 million investment, by using
this kind of reputational system in order to make judgments.
When you’re doing dollar sums that large, you’re not just using the fact that there
is a prediction, based on this person is likely to be interesting to you as an expert,
but you’re also doing a lot of hand checking. You’re using your social network.
Reputation through a network is not just an objective
reputation, but also a subjective reputation, which means reputation to me. I believe
this person is really good at this thing. For example, if Andreas and I are connected
on LinkedIn, if I find somebody that Andreas is connected to and I think I need to talk to
that person, I ask Andreas, “Is this person really good at this or not?” That’s where you
blend somewhat of what you might think of as pure networking data with subjective
information. We interact.
I think one of the really interesting patterns, in terms of what’s going on here, is these
human/machine combinations. You bring in human interactions along with the data
patterns as a way of building something interesting.
For example, the kinds of things – we take a very long term view of this in terms of how
we build our products at LinkedIn because we launched May 5th, 2003. For a
professional network, we are by far and away the largest. I think Facebook was in here
last week, and they’ve had the third best growth trajectory. The first best is Twitter, the
second best is YouTube, and they’re the third. It’s a very large thing, but that’s universal.
That’s not just trying to do professionals.
We launched on May 5th, 2003 and our entire goal at that point was growth; getting
people to network. One of the challenges in building a network is: person number one,
service not valuable. Person number one and two, service not valuable. Person one, two,
and three, service not valuable. How do you get to enough people in a network so
the service starts being valuable?
0:06:43.9
Despite that, in August of 2003, we launched references, which are how I can take
the equivalent of a book blurb and suggest it. I can say, “Andreas, I’d like to give
you this reference.” If he likes it, he can post it onto his profile. Part of the reason we do
it with his approval is because there are all these legal liability issues around putting
information about other people out there, especially when it has an economic context.
Hedge funds are making decisions about whom they might approach to hire or as
an expert, and that could be a very lucrative consulting contract. If I was leaving a
negative piece of data on Andreas, that may be something he would be very unhappy
with. It wouldn’t be true, but it’s always that kind of positive system.
Once you have that network with the positive system, then that begins to allow you
to add in data, overlaying the graph, in terms of who is recommending who and
what kinds of words are in those recommendations. One of the things we do that is
partially distinct from some of the other sites is that reputation in terms of
transaction is subjective. It’s by you, as a person, because the people I may judge
to be an expert at something are not the same people you may judge to be an
expert and it is not necessarily a completely objective thing.
We look at the graph as both an information and people reputation system. The
other part of it that is interesting is that a lot of when we look at data, it’s not just numbers
and graphs. It’s also textual analysis. A lot of the products that D.J.’s group builds at
LinkedIn are things like what are the career options possible to you; what are the
sequences to get there? If you wanted to be Chief Scientist at Amazon, what are the
sequences that get to that kind of job? If you plot out how to get there, what is the
likelihood and the transitions and so forth for doing that? Part of that ends up in doing a
lot of text analysis on the profiles and CVs that people construct.
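The career-sequence idea described above can be illustrated with a minimal sketch. It assumes job titles have already been extracted from profiles and normalized into ordered lists; the sample histories and the function name are hypothetical, and this is a simple first-order transition count, not a description of how LinkedIn actually computes it.

```python
from collections import Counter, defaultdict


def transition_probabilities(careers):
    """Estimate P(next title | current title) from ordered title sequences."""
    counts = defaultdict(Counter)
    for titles in careers:
        # Pair each title with the one that follows it in the same career.
        for current, nxt in zip(titles, titles[1:]):
            counts[current][nxt] += 1
    return {
        current: {nxt: n / sum(followers.values()) for nxt, n in followers.items()}
        for current, followers in counts.items()
    }


# Hypothetical, hand-made career histories (oldest title first).
careers = [
    ["Research Scientist", "Senior Scientist", "Chief Scientist"],
    ["Research Scientist", "Senior Scientist", "Director of Analytics"],
    ["Research Scientist", "Product Manager"],
]

probs = transition_probabilities(careers)
print(probs["Senior Scientist"])
```

Chaining these one-step probabilities along a path gives a rough likelihood for a whole sequence of jobs, under the assumption that each move depends only on the current title.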
D.J.:
Sorry for being a bit late. It’s like Reid says, it’s running at top speed every single
moment of the day.
Student:
Is there something that we can do now?
D.J.:
How about we show you?
Reid:
That’s actually one of the things that D.J. and I have talked about showing.
D.J.:
That’s actually a good start. One of the things we’re hoping to do is to try to make this as
interactive as possible, otherwise, it will be rather long and drawn out for both of us. I first
want to talk about what analytics is.
These days, you’re hearing a lot about analytics, everything from what Facebook calls
their data scientists, to what we at LinkedIn call research scientists and our analytic
scientists, and that different notion. This is one of my favorite graphs, which is this guy
just took the word “Khan” in the classic Kirk sense, “Khaaaaaan,” and how many a’s could
there actually be and how does it plot with Google search results. One of the amazing
things is, and this is a bit silly, but at the same time it’s extremely clever. You say, “Gee,
somebody out there has a bunch of pages with the word ‘Khan’ with 96 a’s in it.” It’s this
weird notion of this is an interesting insight.
0:11:13.2
One of the distinctions that we actually make at LinkedIn is whereas in many places you
could just be a research scientist and come up with this interesting data and make a
graph; what do you do with it, how do you actually turn that into a product? That’s
something we spend a lot of time constructing a team and a philosophy
about doing.
This is an interesting graph of how analytics has changed over time. One of the great
things about LinkedIn profiles is you have your profile history as far back as you want. In
many cases, it’s a little fuzzy out there, but this is 1970. It’s just an arbitrary cut off there.
In the Cold War, the early time, the post-Sputnik era, really getting into the space age, a lot
of people had analytics and analytic scientist in their title because you were
churning data for space missions. You kind of got this lumpiness as you got into the
1980s. The more interesting thing is by the time you get to 2000, we’re getting this
hockey stick growth of people including the word analytics. It’s the same thing if you look
at searches. What are people searching for on LinkedIn when they’re trying to do
people searches, when they’re looking for talent? A very high proportion is analytics, year-over-year.
The notion of growth of what analytics is, is steadily growing and building on each
other. When we say analytics, it’s interesting to define what that means. It means,
broadly, everything from a person that’s involved in hedge funds, who churns
through data, to a person who might be doing data mining or cut text
classification, to a person who is a business analyst and putting that in their title,
where they’re much more on the softer side of that, comparatively to data mining.
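A graph like the one described here could be produced along the following lines. This is a minimal sketch, assuming each profile position has been reduced to a (start year, title) pair; the sample data and the function name are hypothetical, and simple substring matching stands in for whatever text analysis is actually used.

```python
from collections import Counter


def keyword_share_by_year(positions, keyword="analytics"):
    """Return {year: share of that year's positions whose title contains keyword}."""
    totals, hits = Counter(), Counter()
    for year, title in positions:
        totals[year] += 1
        if keyword in title.lower():
            hits[year] += 1
    return {year: hits[year] / totals[year] for year in sorted(totals)}


# Hypothetical (start year, title) pairs taken from profile histories.
positions = [
    (1970, "Analytics Scientist"),
    (1970, "Mission Engineer"),
    (1985, "Programmer"),
    (1985, "Systems Administrator"),
    (2005, "Director of Analytics"),
    (2005, "Web Analytics Manager"),
    (2005, "Software Engineer"),
]

shares = keyword_share_by_year(positions)
print(shares)
```

Using a share rather than a raw count keeps the trend from simply mirroring the growth in the number of profiles per year.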
Reid:
By the way, you can interrupt with questions. This is a format we’re familiar with. Have
we correlated this with companies and so forth, on the analytics?
D.J.:
You know, we have not, but that’s a good point. The number of analytics startups…
Reid:
Is it Internet, is it startups, is it banks?
D.J.:
Right, and what I think it is overall is the ability to drive a throughput of data, so here is
the period of time when you actually had mainframes and you were churning through a
lot of data. Then you came through the local PC time, where you didn’t have the ability to
crunch through massive amounts of data. Beowulf started coming into the picture here
and then you get MySQL and massive MPP systems coming back in. You get the ability
to actually do something with data. I should check that. That’s one of the easy things
about LinkedIn; I’ll go do that this afternoon.
Reid:
This is what I meant by textual data in terms of the fact that we have all these things in a
structured textual data format so we could actually make a rough guess at what the
semantics of the words are and apply them in to concepts, in terms of doing the analysis.
Is this trend – it obviously says analytics is on the rise in terms of the industry, in terms of
how people represent themselves in the industry – which industries is it doing it in? Is it
iBanking, which obviously is in less trouble than the last six months, or is it other things?
0:14:27.7
For example, do you have a Lehman graph in here?
D.J.:
I didn’t put it in, but you could talk about it.
Reid:
We saw a huge surge in signups from Lehman, before the bankruptcy hit.
[Laughter] You can see that kind of pattern.
D.J.:
It’s even more interesting. I’ll just take a few minutes to talk about this. The
announcement got out to Lehman Brothers that there was a problem on Saturday. The
employees went into the system on Sunday, physically went into the office to start
downloading the files because they didn’t know what would happen on Monday. The IT
department says all their warning systems and everything went off the charts. Why is
there so much download, all this bandwidth activity on a Sunday, when there is usually
no activity? IT actually started shutting down parts of the network, thinking that there was
some type of internal attack that was happening.
Everyone then called their other friends and said, “No, it’s really happening,” which then
fostered even more people coming in and trying to get on the ground and getting worried.
The amazing thing out of it was by late Sunday, before the announcement
happened, we started to see this incredible hockey stick growth of
Lehman Brothers’ people applying.
Reid:
Curves never go straight up, but it was pretty close.
D.J.:
At the same time, it was registering on all my fraud detection systems. What’s Lehman –
somebody doing an attack from Lehman Brothers? What’s going on? We found that
when we talked to people at Lehman Brothers afterwards, one of the reasons that people
had suddenly registered, in their mind, was that when you are in a place like Lehman
Brothers and you are on heavy corporate IT systems, the BlackBerry that you are
carrying is hooked up to that system. Say I’m friends with Reid outside of work,
and we talk all the time; this got shut off. I don’t know Reid’s phone number.
People were actually reverting to going back to the phone book. People didn’t
have access to Gmail. They didn’t have access to other external clients or address
book tools. The only thing they had was LinkedIn. Suddenly, it became their
default mechanism when the company was just about to tank.
Reid:
That was very interesting from our perspective.
Andreas:
The BlackBerry, looking up phone numbers on the BlackBerry is primarily within the
company. What percentage of LinkedIn’s messaging or contacting is done for people
who are in the same organization versus what percentage actually goes elsewhere?
What you are saying is that their BlackBerries didn’t work anymore and they couldn’t
message each other anymore? They had to revert to the phone and LinkedIn, which
points to more internal communications than external communications.
0:17:22.9
D.J.:
I wouldn’t say so much just internal communication, but being able to find people and
connect, get in contact with, or establish a communication channel outside of the
corporate IT approved bands. Another way to think about it is with what we’ve seen with
both the Bush administration and the Obama administration during the transition; I was
there during part of the Bush administration. I had been using LinkedIn even before I
knew Reid. I had been telling people this was fantastic because at the DOD, your typical
term of rotation is three years. Your Rolodex, at best, is good for three years and then
you don’t know how to contact people.
Using Facebook is not a good alternative if you’re in the government because I want to
take that email address and cut and paste it into my government communication channel,
whether it’s the secured one or the unsecured one. I can’t just be doing my business
communication channel through Facebook messaging because we don’t want that. At
the same time, with the incoming Obama administration, we have seen a lot of people
trying to connect and stay in touch with where people have gone. As they’re doing
the new IT systems, you’re finding these emails are incredibly long, complex, and
they change. What your email is now, in three or four months, may not be the
same. It’s a very fluid email type of system.
Let me amplify on one part of Andreas’ question because I think this will be interesting.
Today, I don’t know if we have exact analytics on how much internal company
communication on a one-on-one basis actually goes through LinkedIn. I would suspect
it’s not very much. We haven’t really studied that data. However, there are two things
that are interesting. One is we have a company groups product, which actually
has full wiki-like moderation facilities so people can control, “This person just left
yesterday. Put them out of the group.” They’re only validated if they have the
domain, the position, and a bunch of other things. They can be controlled. There
is threaded communication within it and there is a fair amount of that.
Reid:
For example, there is a constant rotation of questions answered in the discussions on
LinkedIn. The most interesting piece is because individuals have a good incentive for
keeping a relatively full LinkedIn profile, we’ve actually had a number of companies
approach us to build directories for their company. One of the tasks the companies look
at is how do we find expertise internal to the company, not just external but internal. The
problem is that there have been a lot of directory products offered by internal software.
The problem with the enterprise directory is that the premise they go to employees
with is the following: “Please be diligent and update your expertise directly to the
internal company’s system. Your reward will be that people you don’t know within
the company will call you and ask you to do work that you will not be benefited or
compensated for, so please do update your internal company directory.”
0:20:45.9
Not surprisingly, it rarely happens. Part of what we do at LinkedIn is to try to give
people a number of incentives for having their profile relatively up-to-date. Not
only are there job opportunities, but there is also finding former colleagues,
finding other experts, hedge funds that may call you and offer you a lucrative
contract for advising, etc. You have an incentive for keeping your LinkedIn profile
up-to-date. It’s also easy. It’s a one published to multiple.
That kind of directory product is something that companies have asked us about, which is
an interesting way – it’s how incentives match up to creating data that allow new kinds of
products. I thought Andreas would be interested in that, given your question.
Andreas:
I think from the first perspective of incentive design, which we talked about, and
persistent ID, which are really two key drivers for what we call the social data
revolution, it clearly matters that people create things not for the current company
and then when they leave, it’s dead. They create it for their own future, ultimately.
D.J.:
It goes even one step further, which is if you take LinkedIn and what you see on
the profile, and everything we do for the profile for search engine optimization,
we’re helping to enable your brand. There is a big question there. What type of brand
are you trying to enable in your company? What are you incented to enable inside your
company versus what are you incented to enable more broadly? That’s really this
question of what are you going to get for it.
Student:
You are saying you try to SEO pages about [0:22:25.5 cough] name, and ideally, I’ll get
LinkedIn pages for people that work there.
D.J.:
Your name, but we do it also for company names, which is really important around the
small and medium business market. For a smaller company that is likely to have
three or four people and unlikely to have a footprint, we do have a company pages
product. We’ll take advantage and leverage that.
Student:
… internally update their profiles… managers actually do it…
Reid:
The question is what about other incentive systems and are companies doing that
in order to get employees to update, for example, review compensation, management
feedback, and that sort of thing. The most advanced one that I know about in the
world, is Google. Google was the first company to come and ask us about this, in terms
of saying, “Could you build a product like this for us?” They’re very attuned to this
internal “search the world of information,” search the knowledge base of the expertise of
the profiles in the company.
You can’t give compensation in Google without having an up-to-date internal profile.
That still doesn’t have all of the relevant information in it. The question is people have
tried it; relatively few people have done it successfully. Google is the most
successful I know of, but even that, the profile is short from a richness
perspective.
0:24:04.3
D.J.:
Something everyone in this class knows; business is definitely recognizing the
importance of analytics. This is a Business Week magazine cover. This was from
about two years ago. It really was calling out that there is a big shift; it was kind of the
first time you really saw in big common media access, some people saying, “Look, there
is all these mathematicians and what are the things they’re doing.” On the right, there
are some of the books I’m sure many of you talked about, this new area where people
are really getting into much more sophisticated analysis than what you used to see about
eight to ten years ago. Somebody would say, “We should do a regression on that,” or
just put the data into Excel. Now people are using much more sophisticated statistics
packages like R, or pick your favorite one.
How do we think about analytics? I’ll give my spiel first and I’ll let Reid give his so you
can hear two different viewpoints. I’m the one who is pitching how we should do this.
He’s the one who is funding how we should do it.
Inside the analytics team, we want to build products out of analytics. We want to
figure out how we take the data and turn it into something that is really interesting
and user facing, that is measured across two dimensions. One is engagement and
two is revenue. We are very conscious about how we are trying to figure out how
to cut that split. A great example is “Who viewed my profile?” and I’ll talk more
about that in a bit.
There is also this big part of we’re not only the front facing people. We’ll build
something like a recommendation engine. A lot of teams will be able to use it. It’s
a much more efficient model than everyone building their own one-off model, etc.
to drive their own system. We might use our recommendation engine for
everything from ads targeting to how we will rank your network status updates, the
groups you might like, and anything that is arbitrary content, or even things like
collaborative filtering, such as “People who viewed this profile also viewed this other
profile.”
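The “People who viewed this profile also viewed” idea can be sketched with plain co-occurrence counting. This is a minimal illustration, assuming view logs have been grouped into per-viewer sessions; the session data and function name are hypothetical, and a production recommendation engine would add normalization, recency, and much more.

```python
from collections import Counter, defaultdict
from itertools import permutations


def also_viewed(sessions, top_n=3):
    """Map each profile to the profiles most often viewed in the same session."""
    co_views = defaultdict(Counter)
    for viewed in sessions:
        # Count every ordered pair of distinct profiles seen together once per session.
        for a, b in permutations(set(viewed), 2):
            co_views[a][b] += 1
    return {
        profile: [other for other, _ in counts.most_common(top_n)]
        for profile, counts in co_views.items()
    }


# Hypothetical sessions: the profiles one viewer looked at in one visit.
sessions = [
    ["alice", "bob", "carol"],
    ["alice", "bob"],
    ["bob", "dave"],
]

recommendations = also_viewed(sessions)
print(recommendations["alice"])
```

Because the counting is independent of what the items are, the same function could be pointed at groups, news articles, or any other content, which matches the point about one shared engine serving many teams.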
Reid:
These are people or this is information that may be relevant to you.
Andreas:
We spent a lot of time in the beginning of the quarter in the distinction between
just signing up people and actually engaging people. The class came up with about
100 metrics of engagement. People are quite sensitive that there are a lot of different
ways you can measure engagement. Do you mind talking a bit on that? You said
revenue is rather straightforward [0:26:56.5 unclear]. Engagement is less
straightforward. What is the number of features used for you?
D.J.:
Let me just give a [0:27:06.0 unclear] of an interesting way where engagement doesn’t
work well for LinkedIn. Let me switch to actually how we do it. Time on site – for
LinkedIn, if you have your iPhone app and you pull up LinkedIn to look at
someone’s profile because you’re in a meeting around a table and everybody is
going from person to person to person; there is a great opportunity to get a bunch
of information. We obviously don’t want to hold you on the site and hold you up.
We want to facilitate how you’re doing things.
0:27:40.0
On the other side, a good model for engagement is you come to the site looking for
question and answers. You want to be an active group participant. There, time on
site does matter. The way we actually think about it is we do it in a funnel. We try
to build an ad hoc funnel. The top of the funnel is you come to the site. The second tier
is you have engaged with the product, and we treat that as very coarse grained and
binary. The next tier, the third, is you’ve done something: you’ve shared the content.
The final one we measure in this is you actually have contributed content.
Depending on the product there are a lot of nuances into that but we work very hard at
not making it so granular that we eliminate one side or the other.
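The four-tier funnel just described can be sketched as follows. This is a minimal illustration, assuming an event log of (user, action) pairs; the tier names, the sample events, and the rule that reaching a deeper tier implies the shallower ones are all assumptions made for the example.

```python
# Tier names are illustrative labels for the four levels described in the talk.
TIERS = ["visited", "engaged", "shared", "contributed"]


def funnel_counts(events):
    """Count users who reached each tier or deeper, from (user, action) events."""
    deepest = {}
    for user, action in events:
        level = TIERS.index(action)
        # Keep only the deepest tier each user has reached (coarse and binary per tier).
        deepest[user] = max(deepest.get(user, 0), level)
    return {
        tier: sum(1 for level in deepest.values() if level >= i)
        for i, tier in enumerate(TIERS)
    }


# Hypothetical event log.
events = [
    ("u1", "visited"), ("u1", "engaged"), ("u1", "shared"),
    ("u2", "visited"),
    ("u3", "visited"), ("u3", "engaged"), ("u3", "shared"), ("u3", "contributed"),
]

counts = funnel_counts(events)
print(counts)
```

Keeping each tier binary per user, rather than counting clicks, reflects the stated goal of not making the measure so granular that it favors one kind of product over another.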
It’s almost like a hierarchy of how much you’re investing if you get back to it; what is the
level of investment. If you’re going to go, “Yeah, recommend this news article within the
company group,” that’s the click. It’s a very valuable click to help create both
collaborative filtering and an ultimate Digg-like system within the company of what news
is hot today.
Reid:
If you start typing a blurb in to discuss the article, then all of a sudden you’re adding
additional competitive intelligence in there. You’re much more invested in what’s going
on. That automatically leads us to say this increases the level of importance that you
think this article is.
D.J.:
The other parts we do, and this is the part of the organization where we do very
rapid prototyping of very fast visualization. I’ll get into how we create data
visualization as well, a bit later. The other thing that we do, the other part of the team,
is the data insights team. Their function is doing everything from “State of the Union,”
which goes into the projects, all of our products, and tries to really understand the
segmentation, of who the users are, what industries are using, what countries are
penetrated, and everything including the A/B testing, as well as understanding interesting
demographic trends. That’s the team that came up with the analytics graph.
There is an interesting dichotomy here because while the first group is really
focused on engineering products, really getting the product out the door, and
shipping things; this team is coming up with the interesting funnel to support the
first group, as well as the broad functions of the team. In fact, this team rarely has a
week go by that doesn’t support every single organization in the company, with either
some type of dashboard, some type of analysis, or some type of consult to give a feeling
of some sense of where LinkedIn currently is, what the population is doing, what’s hot,
and what’s not.
0:30:51.3
Reid:
For example, someone who is interested in targeting very expensive enterprise systems
in an advertising buy might say, “We’re looking for how many IT purchase decision
makers log in every week across the U.S. Can we sell that as a demographic?”
That would end up being what exactly are they looking for if it’s not one of the
things we’ve already done, and that may go through the data insights team. That’s
an example of something that has nothing to do with the product and engineering
side, but something the analytics group would help with.
Andreas:
One of the distinctions we made was between real time data and interactive. My
general view is that real time is not as important as a fast interaction with the data.
In many cases, it’s really the dialog with the data when new insights get developed.
The question you just posed about IT managers, and trying to get high ad revenue
for that group, doesn’t seem like a very interactive thing. Do you have
examples of what came out of quick interactions with the data, how these
hypotheses actually got formulated?
D.J.:
Actually it is high interactivity. That team is very closely tethered to this team. A
hypothesis might come in saying, “I’m looking for this,” and this team will come back
and say, “Is that really the right question? What about this? We found this other
interesting area, or you could do this.” A concrete example of this is in what we
call lead generation.
We have an enterprise team and they’re out there trying to hustle, trying to make
sales, and get things going. We noticed what they were trying to do and we said,
“We could just do some quick analytics around this and identify all the people who
are heavy users of LinkedIn, and that would give you a good indication of where to
go look, which companies they’re looking for.”
That idea of this traditional funnel that the sales guys call it, where you are just kind of
cold calling, hoping somebody gets interested and you try to drive them to this long
iterative sales process before you have an enterprise sale; it’s not just a click “buy now”
type of sale. You want to try to short cut as much as possible. You want to be as
clever as you can with the data to get them there, that way. That’s how we think of
it as interactive.
Reid:
Another thing to amplify is D.J.’s actually driving a self-serve data interaction model
for various parts of the company. As opposed to calling a human, and getting it,
it’s like here is an interface by which you could ask certain, basic questions in a
more rapid basis in order to achieve the exact result you’re talking about.
D.J.:
How many of you guys saw the Google article with Twitter on design, in The New York
Times this last week? Only a couple of you?
Reid:
I take it you recommend it?
D.J.:
I recommend it. It’s an interesting one because this touches on a really interesting space
right now, which is one of the really good, creative designers at Google left for Twitter.
One of his arguments for leaving was saying, “Everything is so metrics driven that where
is the room for creativity?” Google’s answer is, “This is how we do it; we’re the big dog,
it’s working, what do you mean?” It’s a good argument.
0:34:19.0
We want to have a little bit of blend in that. Our goal is to enable and empower the
increased sophistication of every person that works at LinkedIn so they have the
ability to dial up or dial down where they need to be with respect to analytics.
Somebody that is optimizing the new user registration page, yeah, they’re trying to get
people in. They’re cranking on numbers. They A/B test. They’re talking you do a couple
of marginal points on that, percentage points, and you’re doing awesome.
Somebody else is trying to get something new going, to foster some new type of
dynamic. There is a lot said there for the design components. We want to, by enabling
everyone, foster their own decision making. We’re trying to make things as
decentralized as possible.
Student:
What kind of interface do you build for less technical end users in marketing to use?
D.J.:
There is nothing good out there. We built our own. We built a free form SQL tool that
just literally, we can build a report for you by taking the straight SQL and it builds it with a
Cron job and it goes and hits the servers and pulls back whenever you want.
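As an illustration of the report tool described here, the following is a minimal sketch and not LinkedIn's actual implementation. It assumes a saved SQL string is run on a schedule and the result is handed back as CSV. The in-memory SQLite database, the `members` table, and the script name in the cron comment are all hypothetical stand-ins.

```python
import csv
import io
import sqlite3


def run_report(conn, sql):
    """Run one saved SQL statement and return the result as CSV text."""
    cursor = conn.execute(sql)
    header = [col[0] for col in cursor.description]
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(cursor.fetchall())
    return out.getvalue()


# Stand-in data; a real deployment would point at the warehouse instead.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE members (name TEXT, school TEXT)")
conn.executemany(
    "INSERT INTO members VALUES (?, ?)",
    [("a", "Stanford University"), ("b", "Stanford GSB"), ("c", "MIT")],
)

# The "free form SQL" part: the report is nothing more than this string.
report_sql = "SELECT COUNT(*) AS n FROM members WHERE school LIKE 'Stanford%'"
report_csv = run_report(conn, report_sql)

# A crontab entry such as `0 6 * * * python run_report.py` (hypothetical
# script name) would rerun the saved query every morning at 06:00.
```

The design point is that the report definition is just SQL text, so a new report needs no schema work in a business intelligence tool.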
Student:
Who writes the SQL?
D.J.:
We actually write the SQL. My teams write the SQL. The reason we do that is
somebody might say, “We’re looking at this.” We offer classes; we’ll be going to once
every week once we get everyone racked up. The idea is you can come in and train with
anybody to learn your SQL. You might get it to the part and then we’ll optimize it for the
particular type of database solution we’re using. We use a number of database
technologies, everything from Aster Data, MySQL, Hadoop with Pig and Hive on it, as
well as traditional Oracle.
This is a really interesting part for those of you who are really into the data; when you’re
constructing your organization to be able to actually get data, the tools typically there are
business intelligence tools like MicroStrategy, Hyperion, and those types of things. It’s
very tough to get everything structured. It counts on the fact that you have invested a lot
of money on your data warehouse. If you have new types of data and you want to be
really nimble, you have to have something else. This is a fast way of shortcutting
everything. We can just lob it in there and get people the data they want, quickly.
A great example of this is we have a tool where actually – I used to pull up this data,
which is LinkedIn’s representation of Stanford. We have this tool; people always come to
us and say, “I’m going to go speak at this university,” or “I’m going to this company; I
need some stats about the LinkedIn population on there.”
0:37:14.1
I just enter “Stanford” with the wildcard, enter the tool and it sent me a note saying I’m
working on your query and will let you know when I’m done. It took me about twenty or
thirty minutes because I ran it at the highest peak of the morning. There are 74,000
members of LinkedIn who have gone to Stanford or are current members of faculty or
work for Stanford.
Reid:
Most of them are older.
D.J.:
Yeah, and there are 29 new members per day that come onto LinkedIn. You can see the
representation that 20% of them are VP or higher, which indicates the elder part. You
can see the breakdown of where they are originally. We have a lot of other data that
supplies about a couple of hundred fields that we allow people to just pull from.
It’s an example of where we said, “Yes, we are a centralized organization here, but we’re
not going to stop you from getting the data.” We put this organization in the product
organization, not the technology organization, specifically to strengthen that top one and
to make sure this does not become white castle and deviate from outside of the
organization, which has happened in other organizations, such as eBay and Yahoo.
Reid:
The idea is that the product drives the health of the overall network. Ultimately,
while this also helps with adoption rates and curves, and everything else, it’s also
how healthy is everything happening. This is the analytic fuel of what’s currently
there, from either a 50,000 foot level or a 100 foot level, on a specific webpage, button, or
product.
What’s most interesting, I think, is actually how you build products out of the data.
This is where I was mentioning the career map. How do you actually make
products that are unique products that people don’t have otherwise, that they
desperately need but have not otherwise seen? You can actually drive patterns out of
everyone having the right incentive to put certain kinds of information in their profile,
having a certain kind of network, and then using that to create aggregate products that
benefit everybody.
It’s a pretty different approach that we’ve taken. In fact, it’s one of the ones when Jeff
Hammerbacher was spinning up the data team at Facebook; we were comparing notes
very regularly on how we did it and how we were doing it at LinkedIn. He was coming
from a very tech/engineering perspective. We were coming at a very product oriented
perspective. They both have their interesting aspects.
D.J.:
One is we’re very fast and very good at iterating on the product cycle. Then we
have to build our technology layer. They’ve got this awesome engine but they haven’t
figured out how to very rapidly innovate on the data cycle.
Andreas:
Where do you see the current bottleneck? Is it that people coming from a certain
education being pigeonholed into a certain area feeling it’s not their job to do something,
or is the bottleneck that the cycle time is too slow?
D.J.:
To turn it into a product?
Andreas:
What do you wish you had more of, besides people?
0:40:35.0
D.J.:
There are a lot of long poles that are there right now. One is how do you manipulate data.
There are solutions that are these MPP solutions, like Aster Data, Greenplum; you could
argue Hadoop to some extent, those technologies, their ability to be implemented, and
their cost. That’s a huge one.
Reid:
We break them regularly.
D.J.:
Oh, and we break them. We were on Greenplum until recently. We could tip it over
instantaneously. How much analytics are we actually pushing into the data layer? We’re
doing much more sophistication with Aster Data and drive a lot more of the roadmap
currently; they’re a bunch of Stanford CS alumni.
Reid:
Aster Data built its whole company using LinkedIn to find clients, and the whole
thing. It’s one of the interesting things that I only discovered after they were
already a customer. I called D.J. and said, “I just read this business case study.”
D.J.:
It’s a little quid pro quo. The part of analysis – the biggest thing we found – the shocking
thing we see is people say, “Go get a data miner, find a data miner,” and those actually
have not turned out to be the best skills. I’ll talk about this as one of our key problems as
standardization. It’s the classic, “I have all these titles; how do I map them against each
other?” That’s a well known problem inside computer science, the classification problem.
It really has a ridiculous amount of edge cases and it becomes very complex with the
nature of how do you manipulate those edge cases without screwing up in a way that
exposes it so blatantly to the user that they say, “What the hell is this?”
What we found works better has been the scientists who have the ability to manipulate
massive amounts of data that is necessary to solve another problem. My background is
in atmospheric science so you have to manipulate your data, you have to clean it, and
you have to get it in the right form before you even get to start on your problem.
One of our really strong guys, a physicist, did a heavy amount of measurement
analysis and had to manipulate the data in all these forms. The data mining
algorithms, often, we look at these graphs that come out of any of the ACM or the
KDD conferences and I know you know this; you say, “Look, I got this fantastic
result,” and you say, “I need the graph to go two more orders of magnitude; that’s
where we start.” How do you get those types of complex algorithms to actually
run inside your environment?
A lot of times it doesn’t come down to being really, richly algorithmic. It comes
down to being much more clever. That’s not to say you don’t have to have the algorithm
strength, because as soon as you get that window of opportunity to apply the
algorithm and it’s going to work, you have to capitalize on it because that drives
the long drives, and the short tactical moves are with the cleverness.
0:43:57.2
Student:
I was wondering if you could comment about – you talk a lot about your rapid prototyping
and bringing in new information frequently. I was wondering if there was also a need in
the organization for a consistent view of the standard dashboard that people can
communicate with across the organization. Do you have something like that built in,
using the traditional [0:44:22.5 unclear] intelligence tool, plus how do you move things
from one that goes from a temporary place to a persistent place or back and forth?
D.J.:
The way LinkedIn started doing things was we just had massive scripting language that
we would pump into Excel and then fire it off. Then we built a portal, which has that more
detailed layer and does have the dashboarding because it updates via Cron. Now, we’ve
migrated to the next phase, which is a MicroStrategy dashboard. That is going to be the
consistent corporate level dashboard that everyone looks at.
How do you actually do that prioritization of what goes? It’s a rank ruling. The way
we ask it is, “If we move this,” or “if you are looking at this, how much does this have an
impact? Does this really turn the rudder if the glass changes a little bit? Is there going to
be big actions or small actions?” The way we ask those questions is we literally ask
somebody and they say, “We really need this. This is our dashboard, dashboard,
dashboard,” and we say, “If this graph dropped by 10%, what would you do in the next
two hours?” We literally have them write down what they would do in the next two hours.
What would they do in the next six hours, the next twenty-four hours? If it is this long list
that happens in two hours, of action, action, action; yes, you get a dashboard. It’s going.
That’s our kind of ad hoc model for prioritization.
Student:
Along those lines, to what extent do you then [0:46:11.8 cough] automate those
processes? If this thing drops by 10%, does it send an email to someone or does
it automatically do those kinds of things that they need to happen?
D.J.:
Yes, right now, up until now we’ve been at the place where we have enough people who
are dedicated inside the wider organization who are always watching. We actually know
because we monitor the reports and we know how often people are looking at them and
those types of things. Very often, somebody will say, “My report broke,” and we say,
“You haven’t looked at it in two weeks, so it doesn’t count anymore.” [Laughs] or it needs
to be incorporated in something else.
The alert mechanisms – we are now getting to that sophistication. The challenge
with alerts is at which point? Is it 10%, 2%, where does that alarm bell go off? We
do have some basic warnings, like the Lehman Brothers. That’s where I got
alerted saying, “This number is two standard deviations outside.” Something
either really broke or something funky is going on.
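The "two standard deviations outside" warning can be sketched as follows. This is an illustration under assumptions, not the alerting code used at LinkedIn: the function name, the threshold default, and the sample numbers are made up, and the check compares only the newest value against a short history.

```python
from statistics import mean, stdev


def is_anomalous(history, latest, threshold=2.0):
    """Return True if `latest` lies more than `threshold` sample standard
    deviations away from the mean of `history`."""
    if len(history) < 2:
        return False  # not enough data to estimate the spread
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return latest != mu
    return abs(latest - mu) / sigma > threshold


# Made-up daily counts for one metric.
daily_signups = [100, 104, 98, 101, 97, 103, 99]
```

With this history, a value of 102 stays quiet while a value of 140 trips the warning. Choosing the threshold is exactly the "is it 10%, 2%?" question raised above.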
Reid:
Something unique is happening or some unique attack is happening.
D.J.:
That’s the interesting thing; how do you keep increasing the sophistication of your
organization? If you put all that in right away, the organization just doesn’t have the
ability to absorb all of that sophistication.
0:47:32.2
Reid:
Let me add one thing because there was one other piece to your question, which is
automatic action. Generally speaking, when you get to these complex systems, you
almost want to have human operator intermediaries anyway because of judgment.
For example, part of what caused not this stock crash, but the one previous was all the
automated computer trading systems. We just want to make sure you don’t have some
artifact coming out of complexity or something odd. For example, the Lehman thing; shut
down all registrations of people trying to claim Lehman. That would have been wrong
and would have been a terrible user experience for them. You want to have some
humans go, “What’s going on? Does this make sense?”
There will be a limited amount to which it will automatically touch action.
That’s a good point. Let’s talk about some of our products. These are a couple of our
most popular products. On the left here is what you would see if you go to – this is my
profile because Reid’s on there. If you go to my profile, on the bottom right, you will see
something that says, “Viewers of this profile also viewed” and it gives you a nice list
and you can click on it to go to any of the other places.
D.J.:
It’s responsible for a huge number of page views, such a simple thing. What is it?
It’s a collaborative filter; one of the first places that it was done was at Amazon.
“People who looked at this book looked at this other book.” It was incredibly easy. You
just look at all your [0:49:05.7 pair wise] pages and you say, “This tallies more than a
certain amount, bingo.” Where is the challenge? It’s very sparse. We have 40 million
users. Of all those pairs, finding those pairs and calculating them – the way we’ve done it
in the past is we ripped through all that information with either Aster Data or Greenplum,
or previously Oracle.
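The pairwise tally described here can be sketched in a few lines. This is a toy version under assumptions: each session is taken to be the list of profiles one viewer looked at, and the names, the `min_count` cutoff, and `top_n` are hypothetical. At 40 million users the same counting would run as a batch job on the database systems named above rather than in memory.

```python
from collections import Counter, defaultdict
from itertools import combinations


def also_viewed(sessions, min_count=2, top_n=3):
    """Count how often two profiles are viewed in the same session and keep
    the pairs whose tally reaches `min_count`."""
    pair_counts = Counter()
    for viewed in sessions:
        # sorted(set(...)) gives each unordered pair one canonical key.
        for a, b in combinations(sorted(set(viewed)), 2):
            pair_counts[(a, b)] += 1

    related = defaultdict(list)
    for (a, b), n in pair_counts.items():
        if n >= min_count:
            related[a].append((n, b))
            related[b].append((n, a))

    # Highest tally first; ties broken alphabetically for stable output.
    return {
        profile: [other for _, other in
                  sorted(cands, key=lambda c: (-c[0], c[1]))[:top_n]]
        for profile, cands in related.items()
    }


sessions = [
    ["alice", "bob", "carol"],
    ["alice", "bob"],
    ["bob", "carol"],
    ["alice", "dave"],
]
recs = also_viewed(sessions)
```

Here "dave" gets no recommendations because his only pair was seen once, which is the sparsity problem in miniature.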
Nowadays, this is one of those things where iteration time is about a week. Live to
site is about a week and a half. That’s all it takes. That’s our prototype cycle to be
live to site. When I say that, I mean the entire look, the entire thing. We have a
technology component system that is the ability to deploy on arbitrary pages this
type of content. It’s very powerful.
The same thing happened with “People you may know.” Those algorithms are
quite a bit more challenging. The algorithm took a lot longer but the overall
functionality – not very long at all.
We actually have another product that is similar, called “Groups you might like,”
and that took three and a half days to build the algorithm, which is just some basic
logistic regression off of your key words, and deployment on the site was another
two days before we loaded the data into the system and pushed it. That is
responsible for the huge genesis that we’ve seen, where we see about 1,000 groups
being added per day, and a lot of the activity of people finding which groups they want.
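For readers who have not seen logistic regression on keywords, here is a minimal sketch. It is not the "Groups you might like" algorithm itself: the training signal (whether a member with those profile keywords joined a given group), the keywords, and the learning settings are all assumptions made for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(examples, epochs=200, lr=0.5):
    """Fit one weight per keyword plus a bias with stochastic gradient
    ascent on the log-likelihood. `examples` is a list of (keywords, label)."""
    weights = {w: 0.0 for words, _ in examples for w in words}
    bias = 0.0
    for _ in range(epochs):
        for words, label in examples:
            p = sigmoid(bias + sum(weights[w] for w in words))
            err = label - p  # gradient of the log-likelihood w.r.t. z
            bias += lr * err
            for w in words:
                weights[w] += lr * err
    return weights, bias


def predict(weights, bias, words):
    """Probability of joining; keywords never seen in training count as 0."""
    return sigmoid(bias + sum(weights.get(w, 0.0) for w in words))


# Hypothetical data: label 1 means the member joined a data-engineering group.
examples = [
    (("python", "hadoop"), 1),
    (("sql", "hadoop"), 1),
    (("sales", "golf"), 0),
    (("marketing", "golf"), 0),
]
weights, bias = train_logistic(examples)
```

After training, a profile mentioning only "hadoop" scores above 0.5 for this group and one mentioning only "golf" scores below it.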
0:50:49.1
I put a graph here. It’s because one of the things that we actually do with these
analytics products is we don’t build it so they actually update continuously. This is
an interesting part of why do we do this. This comes as an offline processing.
We crunch, crunch across the thing; we think we have a product. We can test it very
quickly, we can iterate on this. This isn’t the final version that went. Several iterations
happened. We said, “What if we reordered it? How do we order this? What colors do
we put?” All of that we can change very quickly. We may want to change the data
actually.
Then, we’ll crunch through it, we’ll put the data in a nice neat package, push it into
production. We wait to see how it does. If it’s getting traction we will refresh the
data. The reason we do that is from a data philosophical sense, the investment to
make the real time system that’s keeping track of the data coming in/going out;
that’s real heavyweight infrastructure. It requires SLAs, Service Level
Agreements. It requires system up time. It has a whole lot of stuff built in there.
This is an older graph of “People you may know” where when we do the regular
pushes, this is engagement, the percent of LinkedIn users engaged with “People
you may know.” That’s about the 20% line right here, so we have a very high
engagement just on this product. It’s also one of the prominent features on the
homepage. At the same time, it clearly shows incredible value.
The interesting thing is these blue lines are where we push the data. You see this spike
set happen. That’s one way we know, “We push the data,” and “bam!” you get this big
uplift. Let’s try it again. “Bam!” you get this uplift. You do it a couple more times and
you convince yourself, “We’re going to be doing this all the time; now we’re going
to invest in a product to actually make sure it’s bulletproof, iterative, updates
frequently, has a great seamless user experience that we’re not going to touch for
quite some time.”
Reid:
When you have the algorithm right and then you move the blue lines together by
increasing performance and dedicating engineering.
Student:
Do you feel if you introduce that in real time instead of [0:53:19.1 unclear], would you get
the same results?
D.J.:
No, I think it would be lower. The reason why – this is specific to this product. The
reason why that happens is you have this built up demand for it of freshness, and
then it’s like, “Wow, there is something fresh, engage, engage, engage.” Then it
kind of relaxes.
Andreas:
We have two things. One is the freshness you referred to. The other thing is
something we have seen at Amazon quite a number of times, called the “Hawthorne
Effect.” About ninety years ago, Hawthorne was in the Detroit area. They had some
factory where people were putting stuff together. It turned out that productivity was kind
of on the low side. People had the brilliant idea that the workers can’t see all that well so
they needed to crank up the light. We saw precisely that effect here, that productivity
went up but then kind of dwindled down and went to the prior level.
0:54:33.9
People thought, “If we are back at that level, we might as well save that energy.” Now
comes the interesting thing about this. They turned the lights down again. Productivity
went up. It’s one of those stories that irrespective of what you do, as long as you
have the attention, and you as a worker feel attended to, you perform better. That
is sometimes interesting; a random graph like this one can be explained in many ways.
Reid:
It probably wouldn’t work if you strobed the lights. [Laughter]
D.J.:
That’s right, there is a surf frequency. This gets into an interesting place. You have data
products that are on the site that are interactivity. You can put these into email channels.
You can do different things that have time sensitivity. Depending on how you expose it,
where is that “I’m showing you the love as a consumer that there is something new for
you.” That’s where the uptake happens.
Student:
What about the average over time? Do you think it’s going to be higher, peak out, or go
down with engagement?
D.J.:
That’s what he just asked, or are you asking something else?
Student:
We don’t have the [0:55:53.3 background noise] get engaged with someone all the time,
but maybe the average will have a higher … 25% all the time….
D.J.:
The engagement over time, I expect to be less in the peaks but not much below the
peaks.
Reid:
But the average or higher
Student:
In regards to this engagement question; I was wondering what your rationale behind
the status updates was. It seems like it would be more of an internal facing thing.
This is more external and I guess people who care about reputation a lot more … I was
wondering…
Reid:
We will get to doing status updates that work only within a group, within a
validated group and that sort of thing as a way of doing that. We have a tendency to
iterate the product and see what comes out of it because what frequently happens is you
will see patterns that you just didn’t imagine when you were first doing it. It’s better to
have an iterative strategy as a product strategy.
0:57:13.0
Part of how I use my status updates is I will actually say, “Hey, I’m thinking about
enterprise 2.0.” I will get some responses from within the company saying, “Did you see
this, or this article,” but I’ll also get responses from people that I know who are part of my
network more generally. There is a use of status updates, in terms of solving
professional problems. Essentially what you’re doing with a status update is
you’re advertising where your attention is and someone else to whom where your
attention is, is important to them whether it’s because they’re a collaborator with
you or because they’re trying to sell you something or to build a relationship with
you; that advertisement of where your attention is can actually bring back useful
information.
D.J.:
I want to talk a little about this one specifically. This is the edge case problem, which
is standardization. Just ignore all the text, but the key thing is; here are some examples
of what actually might have to get mapped. You have the misspellings, the
abbreviations, maybe some additions. You have to map that all to software engineer.
How bad does that get? What happens here, at the top you have software engineer,
software engineer, and different spellings there. Those all have to be mapped the same.
You also have the question where all companies are the same. Right now, in the system
there are over 8,000 variants of IBM that exist. People are always – this is a very nonstationary problem.
We have 46% of our population is international. You have companies that may have
similar names that are showing up in the system. If there is only one of those names,
how do you make sure your system isn’t just dropping it into this name? If it’s
“Continental” – Continental Bakery, Continental Airlines, Continental Tires – there is all of
this. At which point do you make it a new group? Is it three people, five people?
You might think, “I can just dynamically do this.” It turns out there are so many edge
cases that it’s incredibly challenging. That’s just an example of one of those
background process technologies where we spend an incredible amount of time
because it’s not only important for front-facing products – search, any of the
people finding-type mechanisms, powering part of the group recommendation
engines, but also critical for the underlying structure of how we dictate data and
how we take advantage of data in the future.
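A first pass at this kind of standardization might look like the sketch below. It is an illustration only: the canonical title list, the abbreviation table, and the similarity cutoff are assumptions, and it uses Python's standard `difflib` rather than whatever LinkedIn runs in production. Anything below the cutoff comes back as `None`, which is where the edge cases pile up.

```python
import difflib
import re

# Hypothetical target vocabulary and abbreviation table.
CANONICAL_TITLES = ["software engineer", "product manager", "data scientist"]
ABBREVIATIONS = {
    "sw": "software", "s/w": "software",
    "eng": "engineer", "engr": "engineer",
    "mgr": "manager",
}


def standardize_title(raw, cutoff=0.8):
    """Lowercase, expand known abbreviations, then fuzzy-match against the
    canonical list. Returns None when nothing is close enough."""
    tokens = re.findall(r"[a-z/]+", raw.lower())
    cleaned = " ".join(ABBREVIATIONS.get(t, t) for t in tokens)
    matches = difflib.get_close_matches(
        cleaned, CANONICAL_TITLES, n=1, cutoff=cutoff
    )
    return matches[0] if matches else None
```

For example, `standardize_title("Software Enginer")` and `standardize_title("SW Eng.")` both map to "software engineer", while `standardize_title("Chef")` returns `None` and would need another route, such as a human review queue.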
Student:
This part is done algorithmically versus using things like clever filters where people will
click through and therefore you correlate these two people might be…
Andreas:
Do you want to reflect the question to the class? I think there would be some good ideas.
How would you solve the problem of using people, creating incentives, and using
computers?
D.J.:
This is one of our first interview questions. How would you solve – you have 40
million profiles, they all have variations of pick your number of how many
companies you want to pick. We let people pick that.
Reid:
The reasoning process of getting that is interesting too.
D.J.:
How would you figure out where to put people in those buckets? Who has some ideas?
Student:
If I didn’t know, I might see who they were connected with and see if it’s the
company, their peers…
1:01:13.4
D.J.:
Similarity is a good one to start with. Any others? Do we have some algorithm
jockeys? How about the algorithm side? There is the algorithm side and the clever side.
Reid:
Input the list in the Mechanical Turk and pay people to do it. I’m trying to stimulate you
guys, throw out stuff. This is a non-techy guy saying this.
Student:
…
D.J.:
You have all these profiles and the guys from the search team say, “We need to have all
of these segmented by company. All of the people with IBM, any of these variants, they
all have to match to IBM. People with Motorola have to match with Motorola. Stanford
University – Stanford Medical? Stanford Business School? They all have to map to
Stanford.
Student:
Maybe send in correspondence to people who are in question, offering them some
sort of incentive to clarify their relationship. You could take the popular ones and just
have them click “this is what I mean” and associate it.
D.J.:
How do you think you would get – that is a very good track. How do you think you would
get them to actually take action? How am I going to convince you to take time?
Student:
Layer in the friends
Student:
Tell them the same thing; “Somebody is looking at you. This could mean a better
job. This could mean …”
D.J.:
Exactly, the value proposition is essential for getting people to do stuff. What else?
Student:
To make better profiles more professional, so that’s the campaign.
D.J.:
That’s exactly the word we use, campaign. We use it as a campaign to get you to
do something.
Student:
Put a job title with auto complete…
D.J.:
Type ahead
Reid:
We have that, I think.
Student:
Just for searching, start with regular expressions to cluster things by initial words,
for example, all the Stanford Business School etc… on top of that… matching
based on the distance…..
D.J.:
Keep going; no one has gotten out number one yet.
Student:
Do something similar to [1:03:53.2 unclear] where Ireland appears in a lot of places but
IBM is more interesting so you extract that and throw away the extra stuff.
1:03:59.5
D.J.:
Yeah, that’s a good one.
Reid:
You wouldn’t necessarily throw it away, by the way. There is a value in the data…
D.J.:
Number one…
Student:
…
D.J.:
[1:04:13.6 Chronologization] is an important part for taxonomy and ontology. The
number one is the email address. What email address did you register with? What
email address do you have on file? If it’s eBay.com, you work for eBay.
Reid:
Even though it says Kijiji in your profile, for example.
D.J.:
You can map all the emails to the same DNS. Then you do a look up and you tell where
everyone is from. That’s the number one way.
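The email-domain approach can be sketched as a simple lookup. This is a minimal illustration, assuming a hand-built domain table; the table entries, the list of personal-mail domains to skip, and the function name are hypothetical and not taken from the talk.

```python
# Hypothetical lookup tables; a real system would build these from data.
DOMAIN_TO_COMPANY = {
    "ebay.com": "eBay",
    "ibm.com": "IBM",
    "stanford.edu": "Stanford",
}
PERSONAL_DOMAINS = {"gmail.com", "yahoo.com", "hotmail.com"}


def company_from_email(email):
    """Map a registration email to a company via its domain, or None."""
    if "@" not in email:
        return None
    domain = email.strip().lower().rsplit("@", 1)[1]
    if domain in PERSONAL_DOMAINS:
        return None  # webmail says nothing about the employer
    # Walk up subdomains so that us.ibm.com resolves to ibm.com.
    parts = domain.split(".")
    for i in range(len(parts) - 1):
        candidate = ".".join(parts[i:])
        if candidate in DOMAIN_TO_COMPANY:
            return DOMAIN_TO_COMPANY[candidate]
    return None
```

Under these assumptions, `company_from_email("jane@us.ibm.com")` returns "IBM" regardless of how the company name is spelled in the profile, which is why the signal is hard to fake and cheap to compute.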
Student:
…
D.J.:
Yeah, we’re a professional site.
Student:
…
Reid:
You can associate up to 8 email addresses with the site. Which one you have direct to,
but in fact, you get invitations to all the invitations.
D.J.:
Or, your alumni site, also.
Andreas:
I thought it was a very interesting discussion. Let’s just pull out what we actually heard.
There is so much going on. One is that the one data source, which is basically the
hardest one to fake, namely, peoples’ email addresses, is interesting that none of
us thought about. I’ve been wondering; from the earliest days I heard about
LinkedIn, why don’t you have an authority that if somebody claims he was the
Chief Scientist at Amazon.com, or if somebody claims he was at Amazon.com, that
you say, “Yes, on November 17, 2002, he confirmed from an Amazon.com email
address.” Do you want to be an authority space for the emails?
The other things we heard is that I believe it is less the pattern matching, but more
the incentives of people at that moment when they’re registering, when they have
an incentive for others to find them that you can get them to do work. They don’t
do the work for LinkedIn. They do the work for themselves, and as long as you
manage to align your incentives with the end user, it’s the same with wish lists at
Amazon. People don’t do the wish lists for Amazon, they do it for themselves. But,
Amazon can recommend some books every now and then from the wish list and people
feel so understood when you recommend a book from the wish list.
1:06:34.4
D.J.:
It is the trivial things that go a long way.
Andreas:
It’s interesting that it’s not the hard algorithms. The really hard ones probably don’t work
on the sizes you have.
D.J.:
It’s the edge cases.
Andreas:
It is coming up with new ideas like the new email we didn’t have or with incentives
for people to do what they want to do at a given point of time, where they see they
get value out of it. Are there any other things we think are counter intuitive from what
we thought five years ago?
D.J.:
The other one, and Reid talked about this, it’s paying people. You have to pay them
in a smart way and this is one of the great things about Mechanical Turk. When you
have a system in your algorithm, you’re not sure. Your algorithm can’t get you there.
Your techniques and your rule-based logic don’t work. Ask somebody to do the work
for you. Give it to seven people, two of them say something ambiguous; give it to
another seven and let them vote on it. It’s a very fast way to do it.
Student:
I’m curious; you say that was your number one solution. It’s an automated solution.
You missed the chance to engage with the consumer in a very positive way; giving
them value. You may have crunched the numbers and created a better … more value,
but you haven’t given a positive experience to your user to say, “We thought you might
like to know that people are looking for…” Is there a way to measure that?
D.J.:
We don’t leave it behind. We want to use that opportunity but we have to be very
careful in how often, and how much you use that opportunity. The way we look at it
is we still have that channel to try to say, “Do this for your best interests,” but at the same
time, we want to be very cognizant of can I ask you to do something that’s even more
powerful because I got this piece done. We engage on other campaigns to actually take
care of that.
Andreas:
I think the point you make is that dialog, whenever the user is willing to enter
dialog, is one of the most powerful drivers, in being understood, and in most cases
being much more willing to give more than he initially thought he would give.
Reid:
There is one other point that we try to limit the number of times that we touch a
user per time period, whether it’s a week, a day, a month because if you inundate
them even with things that are value propositions, unless it’s free money that they
believe, it’s like, “Stop!” We may have said we could have hit you with this particular one,
but it would’ve been better to hit you with another one. We already figured this out
anyway so we hit you with a different one instead.
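A minimal sketch of the per-user touch limit Reid describes, under stated assumptions: the class `TouchLimiter`, its parameters, and the idea of scoring each candidate message with an expected value are hypothetical and not taken from LinkedIn. It shows a cap on touches per time window, with the single best candidate chosen when a touch is allowed.

```python
from datetime import datetime, timedelta


class TouchLimiter:
    """Allow at most `max_touches` messages per user within a sliding `window`."""

    def __init__(self, max_touches, window):
        self.max_touches = max_touches
        self.window = window
        self._sent = {}  # user_id -> list of send times still inside the window

    def allowed(self, user_id, now):
        """Forget sends older than the window, then check the remaining count."""
        cutoff = now - self.window
        recent = [t for t in self._sent.get(user_id, []) if t > cutoff]
        self._sent[user_id] = recent
        return len(recent) < self.max_touches

    def pick_and_send(self, user_id, candidates, now):
        """candidates is a list of (message_name, expected_value) pairs.

        Returns the chosen message name, or None if the user was touched
        too recently or there is nothing to send.
        """
        if not candidates or not self.allowed(user_id, now):
            return None
        name, _ = max(candidates, key=lambda c: c[1])
        self._sent[user_id].append(now)
        return name
```

With `TouchLimiter(1, timedelta(days=7))`, a user who received one message on Monday gets nothing more until the following week, even if another candidate message has positive value.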
1:09:24.5
Student:
…
Reid:
You can use analytics as a way of figuring out click-through rates and that sort of thing.
Frequently, the thing people get wrong about consumer Internet strategy is that they
think because there is so much data, all of the decisions are data driven. Actually,
because the time frames are so compressed, it’s fragments of data together with
intuition, with a strategic belief. You are kind of going, “This piece, this piece,
okay fine; we’ll do this.” It’s not like a logical proof. It’s a combination of the two.
Andreas:
I wanted to pick up on one thing you said. By the way, for those of you who don’t know
what Mechanical Turk refers to, Amazon, about five years ago, launched a service
decomposing problems into little well-defined questions, such as verifying the opening
hours of restaurants and things like that. You pay a few cents for such a task.
One of the open questions is when do you get people, and who do you get to do
things for you for money and what is the result of that, versus when do you have
people do things out of self interest? One of the things we found at MoodLogic was
that paying people to classify music was very expensive and not as good as letting
people classify their own music in order to actually have their iTunes library, for instance,
better organized, or in order to discover music that is similar to what they already have.
I think what’s really key is the task that you define, and you might want to share some
of your experiences so we can understand whether those tasks are amenable to doing
on MTurk, or whether they’re like music classification, where you ask the MTurk people to
do things they might not be experts in.
D.J.:
This is a really important point; how do you fact check and make sure you’re not
getting somebody who is just putting crappy answers in for you when you’re trying
to do something with Mechanical Turk? The biggest thing you do is the structure
of the test and the experiment. It turns out it’s actually not that complicated for the
things we’re trying to do because we try to make them as binary as we can. “Is this
right? Is this wrong? Which one of these is the right one? Pick one of the four.” Make it
multiple choice. We also do it against rankings. For example, this is the idea where you
say, “Let’s show it to three people. If all say yes then you know your answer. If anyone
dissents then you send it to an additional seven, ten, or whatever you want to do,” until
you believe that you have an answer.
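A minimal sketch of the escalating vote D.J. describes: show an item to three people, accept a unanimous answer, and otherwise send it to seven more. The function name, the `ask_worker` callable, and the `min_share` agreement threshold are assumptions for illustration; the talk only says to continue "until you believe that you have an answer." No Mechanical Turk API is called here.

```python
from collections import Counter


def resolve_by_vote(ask_worker, first_round=3, extra_round=7, min_share=0.7):
    """Return the agreed answer, or None if the item stays ambiguous.

    ask_worker is a hypothetical callable that returns one worker's answer
    each time it is called (for example "yes" or "no").
    """
    votes = [ask_worker() for _ in range(first_round)]
    if len(set(votes)) == 1:
        return votes[0]  # unanimous first round, no escalation needed

    # Someone dissented: collect an additional round of votes.
    votes += [ask_worker() for _ in range(extra_round)]
    answer, count = Counter(votes).most_common(1)[0]
    if count / len(votes) >= min_share:
        return answer
    return None  # still ambiguous; treat as an edge case for manual review
```

Items that come back as `None` correspond to the "too wonky" cases discussed next, which need a different approach.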
1:12:06.5
The ones that you say this is too wonky, those are the ones you say may not exist.
You get in there and see if there is another clever alternative that you could use to
solve that type of edge case. At the end of the day, most of the things you are solving
with this – we’re not trying to solve the mainstream, fundamental problem via Mechanical
Turk, like you try to do by saying, “Tell me about this music,” or “classify this into these
buckets.” We’re using this to solve our edge cases, our hard problems that we know we
can come up with algorithms to solve it, but it’s going to take us weeks or months to
implement the technological solution and at what cost versus $500 on MTurk. I’m being
literal when I say $500 to solve a problem that may take something like $15 thousand to
solve otherwise.
Reid:
The background of all this, from a business investor perspective is all this costs money.
It costs money to pay D.J.; it costs money to have computer systems, so you factor into
how do I get to a sufficiently effective answer in timeframes and costs that make sense,
as a way of providing a service.
Student:
At what point did analytics start making sense with respect to the quantity of users
on the site? When you started building the business, I guess the important thing was
having the users. At what point did you have enough data that analytics actually …
Reid:
About a million users, registered users, it was a little crude then. One of the things
about modern business strategy is there is a certain amount of stuff you virtualize
and put out in the cloud and there is a certain amount of stuff you put internally. It
was pretty clear that we would want to have analytic stuff internally and be
iterating on it because there was a whole platform and sequence of products we
wanted to build.
Once we started being able to say things like, “Here are users you should meet,” or “who
you should know,” – I think we may have started “The people you may know,” I think we
started and everyone else copied off of us. That sort of thing was a valuable thing to
have.
D.J.:
You can see this one very clearly because IBM came out with a social type network thing.
One of the things is we were testing it out and said, “Gee, who do we connect to? They
don’t have a people you may know feature.” How dependent we’ve become on this.
The other thing is many times people think, with data, that it has to solve a
problem. It’s not true. For example, if you were trying to convince people and you
didn’t have enough usage around your search functionality, you bullshit it. Put in
similar searches that you think are going to be okay. If a person tries to put in a
search term, you just jerry rig a bunch of stuff around other searches you might
see on Google to get the people to bootstrap it. You don’t have to tell them it’s an
analytics project. You’re just trying to get the spin up going and you’re doing it in a clever
way to get the fire going.
Reid:
We’ve done that too.
Andreas:
I think we’ve all done that. Actually, I was expecting a different answer to the question of
how many data points you needed. The answer is always the four-letter word, more.
What I want to do in order not to lose momentum, I know you have a lot of other stuff to
do. We do want to take a break. Before the break, I want to take a couple of minutes to
collect the questions you might have for them. During break we can organize our
thoughts and make sure that we give you what you want to do. Again, that is all with
apologies to the SCPD people who are watching it offline afterwards and can’t get their
questions in.
1:15:53.7
In the next five minutes, I want to see what is on your mind, whether it has been talked
about already, whether we talked about it in the last class. For instance, we talked about
a global function for people you might know, which might be very different from the local
optimum, which I get by clicking on somebody. That’s a very interesting problem.
What questions would you have, which could be arbitrarily difficult, no problem for these two
guys, for us to think about in this second half. Enrique will take notes. What would you
like to know?
Student:
How much of networking do you see between actual companies… companies have…
Andreas:
I will repeat so we have it for the others. The question here is how much networking
between companies is seen. The point is that LinkedIn, at its basic, is about the
[1:17:03.5 unclear] an individual. The companies get aggregated up, so the question is
how much do you see between companies?
Student:
I was wondering if there was any way to calculate misrepresentation of titles… how do
you tell if… actually… companies actually incentivized… helpful…
Andreas:
The question is about the truthfulness about the data here. What incentives can be
provided? For instance, what ways are there to comment or to show that somebody’s
title changed, like that guy I hired straight out of college to work with me at Shock Market.
When I saw a few years later that he was the CTO of Shock Market, that was quite
impressive. If you track, through time, that he slowly moved up from SD 1 to manager, to
a CTO, or the company was long dead; that would be an interesting incentive for people
to not lie. What are your incentives for people to be honest?
Student:
Do you have any plans of creating a social economy based on the … messages or
updating status or accumulation of all these … does that mean anything that is
transparent.
Reid:
For example, I will answer the question more in length afterwards, but do you mean like
the number of the invitations I send, how many of them are accepted? Is that what
you mean by reflecting data of what I’m doing?
Student:
Reflecting data… way that would contribute to …
Andreas:
What self-metrics or metrics of the user could we show, for instance, the
probability of responding within a day, within a week, so by people seeing this
about themselves, and potentially being compared to the rest of the pool or to their
peers, seeing “if I’m the lowest [1:19:32.3 unclear], maybe I’m not doing what I
should be doing.”
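A minimal sketch of the kind of self-metric Andreas asks about, assuming each user's history is available as a list of reply delays. The function names and the data layout are hypothetical. It computes the share of messages answered within a given number of days and where that rate sits among peers.

```python
def response_rate(delays_in_days, within_days):
    """Fraction of received messages answered within `within_days`.

    delays_in_days holds one entry per received message: the reply delay in
    days, or None if the message was never answered.
    """
    if not delays_in_days:
        return None
    answered = sum(1 for d in delays_in_days if d is not None and d <= within_days)
    return answered / len(delays_in_days)


def percentile_among_peers(own_rate, peer_rates):
    """Share of peers whose rate is at or below own_rate (0.0 to 1.0)."""
    if not peer_rates:
        return None
    return sum(1 for r in peer_rates if r <= own_rate) / len(peer_rates)
```

A user whose one-day response rate falls at the bottom of the peer distribution would see exactly the signal described above.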
1:19:36.4
Reid:
As an amplification example, I’ve forwarded three requests from D.J. to other people but
he hasn’t forwarded any from me, in terms of social karma economy.
Student:
I have two questions. One, I was hoping you, D.J., could talk more about this idea
you launched for recommending a person’s career path based on the data that you
collect. The second question is for you, Reid, you mentioned your vision to also
create some kind of a platform, and about things I can definitely think of a bunch of
applications in terms of predictive markets and finance. Can you give us an update about
that; is it still going on, and what kind of data are you going to be exposing to developers?
Andreas:
Let’s take two more questions.
D.J.:
By the way, the one on the career paths, I will also ask you the same thing, as in
how would you actually build a career path… you can discuss. I’ll show you how we do it
and then you guys will know how to do it yourself.
Student:
How do you balance user experience with monetization? What are the metrics you
are looking into?
Student:
With people here using LinkedIn Answers to ask questions of experts on the
network; what metrics do you use to identify the best person who might be best
suited to answer any given question?
Andreas:
That certainly leaves us with some things to think about during the break. Let’s
reconvene in 20 minutes, which will be shortly after 4:00, according to the clock in the
back of the room. Coffee, as you know, the café is open until 4:00. Do you have to dash
right afterwards? You have access to our two guests during the break right now, if
they’re willing to do that, and also after class.