Transcript of Andreas Weigend
Data Mining and E-Business: The Social Data Revolution
Stanford University, Dept. of Statistics

Andreas Weigend (www.weigend.com)
Data Mining and Electronic Business: The Social Data Revolution
STATS 252
May 11, 2009
Class 6 LinkedIn: (Part 1 of 2)

This transcript: http://weigend.com/files/teaching/stanford/2009/recordings/audio/weigend_stanford2009_6linkedin-1_2009.05.11..doc
Corresponding audio file: http://weigend.com/files/teaching/stanford/2009/recordings/audio/weigend_stanford2009_6linkedin-1_2009.05.11.mp3
Next Transcript: (Part 2 of 2): http://weigend.com/files/teaching/stanford/2009/recordings/audio/weigend_stanford2009_6linkedin-2_2009.05.11.doc
To see the whole series: Containing folder: http://weigend.com/files/teaching/stanford/2009/recordings/audio/

Transcript by Tamara Bentzur, http://outsourcetranscriptionservices.com/

Andreas: Welcome to our class today, where the main attraction is Reid Hoffman. Reid Hoffman just reminded me that we met at the parking lot at [0:00:12.5 unclear] a couple of decades ago, in the 1980’s. I remember him running, at Symbolic Systems, the weekly speaker series where my PhD advisor, Dave [0:00:24.3 unclear] said, “Andreas, you should come with me and meet that guy.” Dave was actually talking about what distinguishes humans from animals. He said it was that we have invented an external representation of knowledge. For an AI person, having a representation of something where we talk about the axes of the spaces versus the points in the space; that’s always what I’m referring to. Reid then went on to co-found PayPal. After PayPal, he started LinkedIn.
I remember I first saw you in your office in 2003 or 2004, when we talked about what one could do with all those data in a business setting, before Facebook was around. I know most of you can’t imagine a time before Facebook, but there was a time before Facebook. Reid will be joined by D.J. Patil when he gets here. In the meantime, let’s start. Do you want to have a quick conversation with the class about where they are and what they know about LinkedIn? Who here is on LinkedIn?

Reid: At least there is a registration. D.J., who will join us mid-flight, is head of our analytics. He used to be a professor of concrete math at Maryland. He did work with the DOD, and built the latest fraud and risk model for eBay. He will be running through some of those specific things. That’s actually not what I do with my time, since I’m a CEO. For purposes of talking about things like knowledge representation and whatnot, which Andreas was mentioning, I actually went to Oxford to study philosophy from here. I was the thirteenth person to declare Symbolic Systems as a major, and I started the Symbolic Systems forum. I think it’s still going, from what I gather, but it’s twenty years later.

I will open with a few comments about how LinkedIn looks at data and at the kinds of products we can construct with data, in order to give you a conceptual overview of this. D.J., when he gets here, may rehash some of this because he’s going to go into more depth than I will, but starting with the concepts is interesting. One of the key things that you are probably already familiar with, given that I’ve looked at a few of your earlier classes, is looking at networks as essentially information and people reputation systems.
Part of what you’re getting out of a network, especially with data added to the network and depending on what the semantics of the network are, is a reputation system for deciding which people are good at something, which people you can trust, and which sources of information you can trust. 0:03:30.3 The professional applications of that are relatively straightforward on a conceptual basis. It is not just hiring, although that is important, but also making judgments about expertise, where judging that a person is an expert in something can facilitate a transactional decision.

What you may not be familiar with (I don’t know how many of you have crossed over to having industry experience) is that the kinds of things LinkedIn is used for are not just recruiting or job seeking. Hedge funds also use it a lot to find experts in order to do trades in the market. There have been such things as people sourcing a deal, doing all the reference checking on the deal, and making $300 million in eighteen months off of a $50 million investment, by using this kind of reputational system to make judgments. When you’re dealing with dollar sums that large, you’re not just relying on a prediction that this person is likely to be interesting to you as an expert; you’re also doing a lot of hand checking. You’re using your social network. One point is that reputation through a network is not just an objective reputation, but also a subjective reputation, which means reputation to me: I believe this person is really good at this thing.
For example, if Andreas and I are connected on LinkedIn, and I find somebody Andreas is connected to and think I need to talk to that person, I ask Andreas, “Is this person really good at this or not?” That’s where you blend what you might think of as pure networking data with subjective information. We interact. I think one of the really interesting patterns, in terms of what’s going on here, is these human/machine combinations. You bring in human interactions along with the data patterns as a way of building something interesting.

For example, we take a very long-term view of this in terms of how we build our products at LinkedIn, because we launched on May 5th, 2003. For a professional network, we are by far and away the largest. I think Facebook was in here last week, and they’ve had the third best growth trajectory. The first best is Twitter, the second best is YouTube, and they’re the third. It’s a very large thing, but that’s universal. That’s not just trying to do professionals.

We launched on May 5th, 2003, and our entire goal at that point was growth: getting people to network. One of the challenges in building a network is this: person number one, service valuable. Person number one and two, service not valuable. Person one, two, and three, service not valuable. How do you get to enough people in a network so the service starts being valuable?

0:06:43.9 Despite that, in August of 2003, we launched references, which let me take the equivalent of a book blurb and suggest it. I can say, “Andreas, I’d like to give you this reference.” If he likes it, he can post it onto his profile. Part of the reason we do it with his approval is that there are all these legal liability issues around putting information about other people out there, especially when it has an economic context.
If hedge funds are making decisions about whom they might approach to hire or to use as an expert, and that could be a very lucrative consulting contract, then my leaving a negative piece of data on Andreas may be something he would be very unhappy with. It wouldn’t be true, but that is why it’s always that kind of positive system.

Once you have that network with the positive system, that begins to allow you to add in data, overlaying the graph, in terms of who is recommending whom and what kinds of words are in those recommendations. One of the things we do that is partially distinct from some of the other sites is that reputation in terms of transaction is subjective. It’s by you, as a person, because the people I may judge to be an expert at something are not the same people you may judge to be an expert, and it is not necessarily a completely objective thing. We look at the graph as both an information and people reputation system.

The other interesting part is that when we look at data, a lot of it is not just numbers and graphs. It’s also textual analysis. A lot of the products that D.J.’s group builds at LinkedIn are things like: what are the career options possible for you, and what are the sequences to get there? If you wanted to be Chief Scientist at Amazon, what are the sequences that get to that kind of job? If you plot out how to get there, what are the likelihoods, the transitions, and so forth? Part of that ends up in doing a lot of text analysis on the profiles and CVs that people construct.

D.J.: Sorry for being a bit late. It’s like Reid says, it’s running at top speed every single moment of the day.
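Editor's sketch: Reid's question about career sequences ("what is the likelihood and the transitions") can be illustrated with a first-order transition model over job titles. This is only one possible reading, not a description of how LinkedIn builds the product; the job titles, the sample careers, and the function names `transition_probs` and `path_likelihood` are all hypothetical.

```python
from collections import Counter, defaultdict


def transition_probs(careers):
    """Estimate P(next title | current title) from ordered title sequences."""
    counts = defaultdict(Counter)
    for titles in careers:
        # Count each consecutive pair of titles within one career.
        for current, following in zip(titles, titles[1:]):
            counts[current][following] += 1
    return {
        current: {nxt: n / sum(c.values()) for nxt, n in c.items()}
        for current, c in counts.items()
    }


def path_likelihood(probs, path):
    """Multiply transition probabilities along a path; unseen steps give 0."""
    likelihood = 1.0
    for current, following in zip(path, path[1:]):
        likelihood *= probs.get(current, {}).get(following, 0.0)
    return likelihood


# Made-up careers, each listed from earliest to latest title.
careers = [
    ["analyst", "scientist", "chief scientist"],
    ["analyst", "scientist", "director"],
    ["analyst", "manager", "director"],
    ["engineer", "scientist", "chief scientist"],
]
probs = transition_probs(careers)
likelihood = path_likelihood(probs, ["analyst", "scientist", "chief scientist"])
```

In practice the hard part is the text analysis Reid mentions: free-text titles from profiles would first have to be normalized into comparable states before any counting like this makes sense.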
Student: Is there something that we can do now?

D.J.: How about we show you?

Reid: That’s actually one of the things that D.J. and I have talked about showing.

D.J.: That’s actually a good start. One of the things we’re hoping to do is to make this as interactive as possible; otherwise, it will be rather long and drawn out for both of us. I first want to talk about what analytics is. These days, you’re hearing a lot about analytics, everything from what Facebook calls their data scientists, to what we at LinkedIn call research scientists and our analytic scientists, and that different notion.

This is one of my favorite graphs. This guy just took the word “Khan” in the classic Kirk sense, “Khaaaaaan,” and asked how many a’s there could actually be, and plotted that against Google search results. This is a bit silly, but at the same time it’s extremely clever. You say, “Gee, somebody out there has a bunch of pages with the word ‘Khan’ with 96 a’s in it.” It’s this weird notion that this is interesting insight.

0:11:13.2 One of the distinctions that we actually make at LinkedIn is this: in many places you could just be a research scientist, come up with this interesting data, and make a graph. But what do you do with it? How do you actually turn that into a product? That’s something we spend a lot of time on, constructing a team and a philosophy about doing it.

This is an interesting graph of how analytics has changed over time. One of the great things about LinkedIn profiles is you have your profile history as far back as you want. In many cases, it’s a little fuzzy out there, but this is 1970.
It’s just an arbitrary cutoff there. In the Cold War, the early post-Sputnik era, really getting into the space age, a lot of people had analytics and analytic scientist in their title because you were churning data for space missions. You kind of got this lumpiness as you got into the 1980s. The more interesting thing is that by the time you get to 2000, we’re getting this hockey stick growth of people including the word analytics. It’s the same thing if you look at searches. What are people searching for on LinkedIn when they’re trying to do people searches and looking for talent? A very high proportion is analytics, year-over-year. The notion of what analytics is has been steadily growing, each part building on the others.

When we say analytics, it’s interesting to define what that means. It means, broadly, everything from a person who is involved in hedge funds and churns through data, to a person who might be doing data mining or text classification, to a person who is a business analyst and puts that in their title, where they’re much more on the softer side of that compared to data mining.

Reid: By the way, you can interrupt with questions. This is a format we’re familiar with. Have we correlated this with companies and so forth, on the analytics?

D.J.: You know, we have not, but that’s a good point. The number of analytics startups…

Reid: Is it Internet, is it startups, is it banks?

D.J.: Right, and what I think it is overall is the ability to drive a throughput of data. Here is the period of time when you actually had mainframes and you were churning through a lot of data. Then you came through the local PC time, where you didn’t have the ability to crunch through massive amounts of data. Beowulf started coming into the picture here, and then you get MySQL and massive MPP systems coming back in. You get the ability to actually do something with data. I should check that.
That’s one of the easy things about LinkedIn; I’ll go do that this afternoon.

Reid: This is what I meant by textual data: we have all these things in a structured textual data format, so we can actually make a rough guess at what the semantics of the words are and map them to concepts when doing the analysis. This trend obviously says analytics is on the rise in terms of how people represent themselves in the industry, but which industries is it happening in? Is it iBanking, which obviously is in less trouble than the last six months, or is it other things? 0:14:27.7 For example, do you have a Lehman graph in here?

D.J.: I didn’t put it in, but you could talk about it.

Reid: We saw a huge surge in signups from Lehman, before the bankruptcy hit. [Laughter] You can see that kind of pattern.

D.J.: It’s even more interesting. I’ll just take a few minutes to talk about this. The announcement got out at Lehman Brothers on Saturday that there was a problem. The employees went into the system on Sunday; they physically went into the office to start downloading files because they didn’t know what would happen on Monday. The IT department says all their warning systems went off the charts. Why is there so much downloading, all this bandwidth activity on a Sunday, when there is usually no activity? IT actually started shutting down parts of the network, thinking that some type of internal attack was happening.
Everyone then called their other friends and said, “No, it’s really happening,” which then led to even more people coming in, trying to get on the ground, and getting worried. The amazing thing out of it was that by late Sunday, before the announcement happened, we started to see this incredible hockey stick growth of Lehman Brothers people applying.

Reid: Curves never go straight up, but it was pretty close.

D.J.: At the same time, it was registering on all my fraud detection systems. What’s this, somebody doing an attack from Lehman Brothers? What’s going on? When we talked to people at Lehman Brothers afterwards, we found that one of the reasons people had suddenly registered was this: when you are in a place like Lehman Brothers and you are on heavy corporate IT systems, the BlackBerry you are carrying is hooked up to that system. Say I’m friends with Reid outside of work and we talk all the time; that got shut off. I don’t know Reid’s phone number. People were actually reverting to the phone book. People didn’t have access to Gmail. They didn’t have access to other external clients or address book tools. The only thing they had was LinkedIn. Suddenly, it became their default mechanism when the company was just about to tank.

Reid: That was very interesting from our perspective.

Andreas: Looking up phone numbers on the BlackBerry is primarily within the company. What percentage of LinkedIn’s messaging or contacting is done between people who are in the same organization, versus what percentage actually goes elsewhere? What you are saying is that their BlackBerries didn’t work anymore and they couldn’t message each other anymore? They had to revert to the phone and LinkedIn, which points to more internal communications than external communications.
0:17:22.9 D.J.: I wouldn’t say so much just internal communication, but being able to find people and connect with them, get in contact, or establish a communication channel outside of the corporate IT-approved bands. Another way to think about it is what we’ve seen with both the Bush administration and the Obama administration during the transition; I was there during part of the Bush administration. I had been using LinkedIn even before I knew Reid. I had been telling people this was fantastic because at the DOD, your typical term of rotation is three years. Your Rolodex, at best, is good for three years and then you don’t know how to contact people. Using Facebook is not a good alternative if you’re in the government because I want to take that email address and cut and paste it into my government communication channel, whether it’s the secured one or the unsecured one. I can’t just be doing my business communication through Facebook messaging because we don’t want that.

At the same time, with the incoming Obama administration, we have seen a lot of people trying to connect and stay in touch with where people have gone. As they’re doing the new IT systems, you’re finding these email addresses are incredibly long and complex, and they change. What your email is now may not be the same in three or four months. It’s a very fluid type of email system.

Let me amplify one part of Andreas’ question because I think this will be interesting. Today, I don’t know if we have exact analytics on how much internal company communication on a one-on-one basis actually goes through LinkedIn. I would suspect it’s not very much. We haven’t really studied that data.
However, there are two things that are interesting. One is we have a company groups product, which actually has full wiki-like moderation facilities so people can control, “This person just left yesterday. Put them out of the group.” They’re only validated if they have the domain, the position, and a bunch of other things. They can be controlled. There is threaded communication within it and there is a fair amount of that.

Reid: For example, there is a constant rotation of questions answered in the discussions on LinkedIn. The most interesting piece is that, because individuals have a good incentive for keeping a relatively full LinkedIn profile, we’ve actually had a number of companies approach us to build directories for their company. One of the tasks the companies look at is how to find expertise internal to the company, not just external but internal. There have been a lot of directory products offered by internal software. The problem with the enterprise directory is the premise they go to employees with, which is the following: “Please be diligent and update your expertise directly to the internal company’s system. Your reward will be that people you don’t know within the company will call you and ask you to do work that you will not be benefited or compensated for, so please do update your internal company directory.”

0:20:45.9 Not surprisingly, it rarely happens. Part of what we do at LinkedIn is to try to give people a number of incentives for having their profile relatively up-to-date. Not only are there job opportunities, but there is also finding former colleagues, finding other experts, hedge funds that may call you and offer you a lucrative contract for advising, etc. You have an incentive for keeping your LinkedIn profile up-to-date. It’s also easy. It’s publish once, to multiple places.
That kind of directory product is something that companies have asked us about, which is an interesting way of showing how incentives match up to creating data that allow new kinds of products. I thought Andreas would be interested in that, given your question.

Andreas: I think from the first perspective of incentive design, which we talked about, and persistent ID, which are really two key drivers for what we call the social data revolution, it clearly matters that people create things not for the current company, where it’s dead when they leave. They create it for their own future, ultimately.

D.J.: It goes even one step further, which is if you take LinkedIn and what you see on the profile, and everything we do for the profile for search engine optimization, we’re helping to enable your brand. There is a big question there. What type of brand are you trying to enable in your company? What are you incented to enable inside your company versus what are you incented to enable more broadly? That’s really this question of what you are going to get for it.

Student: You are saying you try to SEO pages about [0:22:25.5 cough] name, and ideally, I’ll get LinkedIn pages for people that work there.

D.J.: Your name, but we do it also for company names, which is really important around the small and medium business market. For a smaller company that is likely to have three or four people and unlikely to have a footprint, we do have a company pages product. We’ll take advantage of that and leverage it.
Student: … internally update their profiles… managers actually do it…

Reid: The question is what about other incentive systems, and are companies using them to get employees to update: for example, review, compensation, management feedback, and that sort of thing. The most advanced one that I know about in the world is Google. Google was the first company to come and ask us about this, in terms of saying, “Could you build a product like this for us?” They’re very attuned to this internal “search the world of information,” search the knowledge base of the expertise of the profiles in the company. You can’t give compensation in Google without having an up-to-date internal profile. That still doesn’t have all of the relevant information in it. On the question, people have tried it; relatively few people have done it successfully. Google is the most successful I know of, but even there, the profile is short from a richness perspective.

0:24:04.3 D.J.: Something everyone in this class knows: business is definitely recognizing the importance of analytics. This is a Business Week magazine cover. This was from about two years ago. It really was calling out that there is a big shift; it was kind of the first time you really saw, in big mainstream media, some people saying, “Look, there are all these mathematicians, and what are the things they’re doing?” On the right, there are some of the books I’m sure many of you talked about, this new area where people are really getting into much more sophisticated analysis than what you used to see about eight to ten years ago.
Somebody would say, “We should do a regression on that,” or would just put data into Excel. Now people are using much more sophisticated statistics packages like R, or pick your favorite one.

How do we think about analytics? I’ll give my spiel first and I’ll let Reid give his so you can hear two different viewpoints. I’m the one who is pitching how we should do this. He’s the one who is funding how we should do it. Inside the analytics team, we want to build products out of analytics. We want to figure out how we take the data and turn it into something that is really interesting and user facing, that is measured across two dimensions. One is engagement and two is revenue. We are very conscious about how we cut that split. A great example is “Who viewed my profile?” and I’ll talk more about that in a bit.

There is also a big part where we’re not only the front-facing people. We’ll build something like a recommendation engine. A lot of teams will be able to use it. It’s a much more efficient model than everyone building their own one-off model to drive their own system. We might use our recommendation engine for everything from ads targeting to how we rank your network status updates, the groups you might like, and any arbitrary content, or even things like collaborative filtering, such as “People who viewed this profile also viewed this other profile.”

Reid: These are people or this is information that may be relevant to you.

Andreas: We spent a lot of time in the beginning of the quarter on the distinction between just signing up people and actually engaging people. The class came up with about 100 metrics of engagement. People are quite sensitive to the fact that there are a lot of different ways you can measure engagement. Do you mind talking a bit about that? You said revenue is rather straightforward [0:26:56.5 unclear]. Engagement is less straightforward. What is the number of features used for you?
D.J.: Let me just give a [0:27:06.0 unclear] of an interesting way where engagement doesn’t work well for LinkedIn. Then let me switch to how we actually do it. Take time on site. For LinkedIn, if you have your iPhone app and you pull up LinkedIn to look at someone’s profile because you’re in a meeting around a table and everybody is going from person to person to person, there is a great opportunity to get a bunch of information. We obviously don’t want to hold you on the site and hold you up. We want to facilitate how you’re doing things.

0:27:40.0 On the other side, a good model for engagement is when you come to the site looking for questions and answers. You want to be an active group participant. There, time on site does matter. The way we actually think about it is we do it in a funnel. We try to build an ad hoc funnel. The top of the funnel is that you come to the site. The second tier is that you have engaged with the product, and we treat that as very coarse grained and binary. The third tier is that you’ve done something: you’ve shared content. The final one we measure is that you actually have contributed content. Depending on the product there are a lot of nuances to that, but we work very hard at not making it so granular that we eliminate one side or the other.

It’s almost like a hierarchy of how much you’re investing, if you get back to it: what is the level of investment? If you’re going to go, “Yeah, recommend this news article within the company group,” that’s the click. It’s a very valuable click to help create both collaborative filtering and an ultimate Digg-like system within the company of what news is hot today.
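Editor's sketch: the four-tier funnel D.J. describes (came to the site, engaged, shared, contributed) can be counted as below. The tier order follows the talk; the event names, the sample users, and the assumption that reaching a deeper tier implies the shallower ones are ours, not LinkedIn's.

```python
from collections import Counter

# Tier order follows the talk; the string labels are hypothetical.
TIERS = ["visited", "engaged", "shared", "contributed"]


def deepest_tier(events):
    """Index of the deepest tier a user reached, or -1 if none is known."""
    return max((TIERS.index(e) for e in events if e in TIERS), default=-1)


def funnel_counts(user_events):
    """Count users who reached at least each tier.

    Each tier is binary per user (coarse grained, as in the talk), and a
    user at a deeper tier is assumed to have passed the shallower ones.
    """
    counts = Counter()
    for events in user_events.values():
        for i in range(deepest_tier(events) + 1):
            counts[TIERS[i]] += 1
    return [(tier, counts[tier]) for tier in TIERS]


# Made-up activity log: user id -> events observed in the period.
user_events = {
    "u1": ["visited"],
    "u2": ["visited", "engaged"],
    "u3": ["visited", "engaged", "shared"],
    "u4": ["visited", "engaged", "shared", "contributed"],
    "u5": ["visited", "engaged"],
}
funnel = funnel_counts(user_events)
```

Keeping each tier binary avoids the time-on-site problem raised above: a quick profile lookup from a phone and a long session in a group both count once at whatever depth they reach.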
Reid: If you start typing a blurb to discuss the article, then all of a sudden you’re adding additional competitive intelligence in there. You’re much more invested in what’s going on. That automatically leads us to say this raises the level of importance you attach to this article.

D.J.: This is the part of the organization where we do very rapid prototyping and very fast visualization. I’ll get into how we create data visualization as well, a bit later. The other part of the team is the data insights team. Their function covers everything from the “State of the Union,” which goes into all of our products and tries to really understand the segmentation (who the users are, what industries are using it, what countries are penetrated), to the A/B testing, as well as understanding interesting demographic trends. That’s the team that came up with the analytics graph.

There is an interesting dichotomy here. While the first group is really focused on engineering products, getting the product out the door, and shipping things, this team is coming up with the interesting funnel to support the first group, as well as the broad functions of the team. In fact, rarely does a week go by in which this team doesn’t support every single organization in the company with some type of dashboard, analysis, or consult to give a sense of where LinkedIn currently is, what the population is doing, what’s hot, and what’s not.

0:30:51.3 Reid: For example, someone who is interested in targeting very expensive enterprise systems in an advertising buy might say, “We’re looking for how many IT purchase decision makers log in every week across the U.S. Can we sell that as a demographic?” That would come down to what exactly they are looking for, if it’s not one of the things we’ve already done, and that may go through the data insights team.
That’s an example of something that has nothing to do with the product and engineering side, but something the analytics group would help with.

Andreas: One of the distinctions we made was between real time data and interactive. My general view is that real time is not as important as a fast interaction with the data. In many cases, it’s really in the dialog with the data that new insights get developed. The question you just posed, about IT managers and trying to get high ad revenue for that group, doesn’t seem like a very interactive thing. Do you have examples of what came out of quick interactions with the data, how these hypotheses actually got formulated?

D.J.: Actually it is high interactivity. That team is very closely tethered to this team. A hypothesis might come in saying, “I’m looking for this,” and this team will come back and say, “Is that really the right question? What about this? We found this other interesting area, or you could do this.” A concrete example of this is in what we call lead generation. We have an enterprise team and they’re out there trying to hustle, trying to make sales, and get things going.
We noticed what they were trying to do and we said, “We could just do some quick analytics around this and identify all the people who are heavy users of LinkedIn, and that would give you a good indication of where to go look, which companies they’re looking for.” That idea of this traditional funnel that the sales guys call it, where you are just kind of cold calling, hoping somebody gets interested and you try to drive them to this long iterative sales process before you have an enterprise sale; it’s not just a click “buy now” type of sale. You want to try to short cut as much as possible. You want to be as clever as you can with the data to get them there, that way. That’s how we think of it as interactive. Reid: Another thing to amplify is D.J.’s actually driving a self-serve data interaction model for various parts of the company. As opposed to calling a human, and getting it, it’s like here is an interface by which you could ask certain, basic questions in a more rapid basis in order to achieve the exact result you’re talking about. D.J.: How many of you guys saw the Google article with Twitter on design, in The New York Times this last week? Only a couple of you? Reid: I take it you recommend it? D.J.: I recommend it. It’s an interesting one because this touches on a really interesting space right now, which is one of the really good, creative designers at Google left for Twitter. One of his arguments for leaving was saying, “Everything is so metrics driven that where is the room for creativity.” Google’s answer is, “This is how we do it; we’re the big dog, it’s working, what do you mean?” It’s a good argument. 0:34:19.0 We want to have a little bit of blend in that. Our goal is to enable and empower the increased sophistication of every person that works at LinkedIn so they have the ability to dial up or dial down where they need to be with respect to analytics. 
Somebody that is optimizing the new user registration page, yeah, they’re trying to get people in. They’re cranking on numbers. They A/B test. They’re talking you do a couple of marginal points on that, percentage points, and you’re doing awesome. Somebody else is trying to get something new going, to foster some new type of dynamic. There is a lot said there for the design components. We want to, by enabling everyone, foster their own decision making. We’re trying to make things as decentralized as possible. Student: What kind of interface do you build for less technical end users in marketing to use? D.J.: There is nothing good out there. We built our own. We built a free form SQL tool that just literally, we can build a report for you by taking the straight SQL and it builds it with a Cron job and it goes and hits the servers and pulls back whenever you want. Student: Who writes the SQL? D.J.: We actually write the SQL. My teams write the SQL. The reason we do that is somebody might say, “We’re looking at this.” We offer classes; we’ll be going to once every week once we get everyone racked up. The idea is you can come in and train with anybody to learn your SQL. You might get it to the part and then we’ll optimize it for the particular type of database solution we’re using. We use a number of database technologies, everything from Aster Data, MySQL, Hadoop with Pig and Hive on it, as well as traditional Oracle. 
This is a really interesting part for those of you who are really into the data; when you’re constructing your organization to be able to actually get data, the tools typically there are business intelligence tools like MicroStrategy, Hyperion, and those type of things. It’s very tough to get everything structured. It counts on the fact that you have invested a lot of money on your data warehouse. If you have new types of data and you want to be really nimble, you have to have something else. This is a fast way of shortcutting everything. We can just lob it in there and get people the data they want, quickly. A great example of this is we have a tool where actually – I used to pull up this data, which is LinkedIn’s representation of Stanford. We have this tool; people always come to us and say, “I’m going to go speak at this university,” or “I’m going to this company; I need some stats about the LinkedIn population on there.” 0:37:14.1 I just enter “Stanford” with the wildcard, enter the tool and it sent me a note saying I’m working on your query and will let you know when I’m done. It took me about twenty or thirty minutes because I ran it at the highest peak of the morning. There are 74,000 members of LinkedIn who have gone to Stanford or are current members of faculty or work for Stanford. Reid: Most of them are older. D.J.: Yeah, and there are 29 new members per day that come onto LinkedIn. You can see the representation that 20% of them are VP or higher, which indicates the elder part. You can see the breakdown of where they are originally. 
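As an illustration of the free form SQL report idea D.J. describes, here is a minimal sketch, not LinkedIn's actual tool. It assumes a saved SQL statement with a wildcard parameter that a scheduler such as cron would invoke; the `members` table, its columns, and the sample rows are all hypothetical, and an in-memory SQLite database stands in for the real servers.

```python
import csv
import io
import sqlite3

# Hypothetical saved report: count members by seniority for a school pattern.
REPORT_SQL = """
SELECT seniority, COUNT(*) AS members
FROM members
WHERE school LIKE ?
GROUP BY seniority
ORDER BY seniority
"""


def demo_connection():
    """Build a tiny in-memory database standing in for the real data store."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE members (name TEXT, school TEXT, seniority TEXT)")
    conn.executemany(
        "INSERT INTO members VALUES (?, ?, ?)",
        [
            ("a", "Stanford University", "VP"),
            ("b", "Stanford Business School", "Engineer"),
            ("c", "MIT", "VP"),
        ],
    )
    return conn


def run_report(conn, sql, params=()):
    """Run one saved SQL statement and return the result as CSV text."""
    cursor = conn.execute(sql, params)
    header = [col[0] for col in cursor.description]
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(cursor.fetchall())
    return buffer.getvalue()


if __name__ == "__main__":
    # A cron entry would call this script on whatever schedule the requester wants.
    print(run_report(demo_connection(), REPORT_SQL, ("Stanford%",)))
```

Keeping the report as plain SQL plus parameters is what lets an analyst hand-tune the statement for a particular database later without changing how the requester asks for it.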
We have a lot of other data that supplies about a couple of hundred fields that we allow people to just pull from. It’s an example of where we said, “Yes, we are a centralized organization here, but we’re not going to stop you from getting the data.” We put this organization in the product organization, not the technology organization, specifically to strengthen that top one and to make sure this does not become white castle and deviate from outside of the organization, which has happened in other organizations, such as eBay and Yahoo. Reid: The idea is that the product drives the health of the overall network. Ultimately, while this also helps with adoption rates and curves, and everything else, it’s also how healthy is everything happening. This is the analytic fuel of what’s currently there, from either a 50,000 foot level or a 100 foot level, on a specific webpage, button, or product. What’s most interesting, I think, is actually how you build products out of the data. This is where I was mentioning the career map. How do you actually make products that are unique products that people don’t have otherwise, that they desperately need but have not otherwise seen? You can actually drive patterns out of everyone having the right incentive to put certain kinds of information in their profile, having a certain kind of network, and then using that to create aggregate products that benefit everybody. It’s a pretty different approach that we’ve taken. In fact, it’s one of the ones when Jeff Hammerbacher was spinning up the data team at Facebook; we were comparing notes very regularly on how we did it and how we were doing it at LinkedIn. He was coming from a very tech/engineering perspective. We were coming at a very product oriented perspective. They both have their interesting aspects. D.J.: One is we’re very fast and very good at iterating on the product cycle. Then we have to build our technology layer. 
They’ve got this awesome engine but they haven’t figured out how to very rapidly innovate on the data cycle. Andreas: Where do you see the current bottleneck? Is it that people coming from a certain education being pigeonholed into a certain area feeling it’s not their job to do something, or is the bottleneck that the cycle time is too slow? D.J.: To turn it into a product? Andreas: What do you wish you had more of, besides people? 0:40:35.0 D.J.: There are a lot of long poles that are there right now. One is how do you manipulate data. There are solutions that are these MPP solutions, like Aster Data, Greenplum; you could argue Hadoop to some extent, those technologies, their ability to be implemented, and their cost. That’s a huge one. Reid: We break them regularly. D.J.: Oh, and we break them. We were on Greenplum until recently. We could tip it over instantaneously. How much analytics are we actually pushing into the data layer? We’re doing much more sophistication with Aster Data and drive a lot more of the roadmap currently; they’re a bunch of Stanford CS alumni. Reid: Aster Data built its whole company using LinkedIn to find clients, and the whole thing. It’s one of the interesting things that I only discovered after they were already a customer. I called D.J. and said, “I just read this business case study.” D.J.: It’s a little quid pro quo. The part of analysis – the biggest thing we found – the shocking thing we see is people say, “Go get a data miner, find a data miner,” and those actually have not turned out to be the best skills. I’ll talk about this as one of our key problems as standardization. 
It’s the classic, “I have all these titles; how do I map them against each other?” That’s a well known problem inside computer science, the classification problem. It really has a ridiculous amount of edge cases and it becomes very complex with the nature of how do you manipulate those edge cases without screwing up in a way that exposes it so blatantly to the user that they say, “What the hell is this?” What we found works better has been the scientists who have the ability to manipulate massive amounts of data that is necessary to solve another problem. My background is in atmospheric science so you have to manipulate your data, you have to clean it, and you have to get it in the right form before you even get to start on your problem. One of our really strong guys, a physicist, did a heavy amount of measurement analysis and had to manipulate the data in all these forms. The data mining algorithms, often, we look at these graphs that come out of any of the ACM or the KDD conferences and I know you know this; you say, “Look, I got this fantastic result,” and you say, “I need the graph to go two more orders of magnitude; that’s where we start.” How do you get those types of complex algorithms to actually run inside your environment? A lot of times it doesn’t come down to being really, richly algorithmic. It comes down to being much more clever. That’s not to say you have to have the algorithm strength, because as soon as you get that window of opportunity to apply the algorithm and it’s going to work, you have to capitalize on it because that drives the long drives, and the short tactical moves are with the cleverness. 0:43:57.2 Student: I was wondering if you could comment about – you talk a lot about your rapid prototyping and bringing in new information frequently. I was wondering if there was also a need in the organization for a consistent view of the standard dashboard that people can communicate with across the organization. 
Do you have something like that built in, using the traditional [0:44:22.5 unclear] intelligence tool, plus how do you move things from one that goes from a temporary place to a persistent place or back and forth? D.J.: The way LinkedIn started doing things was we just had massive scripting language that we would pump into Excel and then fire it off. Then we built a portal, which has that more detailed layer and does have the dashboarding because it updates via Cron. Now, we’ve migrated to the next phase, which is a MicroStrategy dashboard. That is going to be the consistent corporate level dashboard that everyone looks at. How do you actually do that prioritization of what goes? It’s a rank ruling. The way we ask it is, “If we move this,” or “if you are looking at this, how much does this have an impact? Does this really turn the rudder if the glass changes a little bit? Is there going to be big actions or small actions?” The way we ask those questions is we literally ask somebody and they say, “We really need this. This is our dashboard, dashboard, dashboard,” and we say, “If this graph dropped by 10%, what would you do in the next two hours?” We literally have them write down what they would do in the next two hours. What would they do in the next six hours, the next twenty-four hours? If it is this long list that happens in two hours, of action, action, action; yes, you get a dashboard. It’s going. That’s our kind of ad hoc model for prioritization. Student: Along those lines, to what extent do you then [0:46:11.8 cough] automate those processes. 
“If this thing drops by 10%, does it send an email to someone or does it automatically do those kinds of things that they need to happen?” D.J.: Yes, right now, up until now we’ve been at the place where we have enough people who are dedicated inside the wider organization who are always watching. We actually know because we monitor the reports and we know how often people are looking at them and those types of things. Very often, somebody will say, “My report broke,” and we say, “You haven’t looked at it in two weeks, so it doesn’t count anymore.” [Laughs] or it needs to be incorporated in something else. The alert mechanisms – we are now getting to that sophistication. The challenge with alerts is at which point? Is it 10%, 2%, where does that alarm bell go off? We do have some basic warnings, like the Lehman Brothers. That’s where I got alerted saying, “This number is two standard deviations outside.” Something either really broke or something funky is going on. Reid: Something unique is happening or some unique attack is happening. D.J.: That’s the interesting thing; how do you keep increasing the sophistication of your organization? If you put all that in right away, the organization just doesn’t have the ability to absorb all of that sophistication. 0:47:32.2 Reid: Let me add one thing because there was one other piece to your question, which is automatic action. Generally speaking, when you get to these complex systems, you almost want to have human operator intermediaries anyway because of judgment. For example, part of what caused not this stock crash, but the one previous was all the automated computer trading systems. 
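The "two standard deviations outside" warning D.J. mentions can be sketched as a simple check against recent history. This is an illustrative assumption about how such a rule might look, not a description of LinkedIn's alerting; the function name, the default threshold, and the sample numbers are hypothetical. In keeping with Reid's point about human judgment, the function only flags a value and takes no automatic action.

```python
from statistics import mean, stdev


def is_anomalous(history, latest, threshold=2.0):
    """Return True when `latest` lies more than `threshold` sample standard
    deviations away from the mean of `history`."""
    if len(history) < 2:
        return False  # not enough data to estimate spread
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return latest != mu  # flat history: any change is unusual
    return abs(latest - mu) / sigma > threshold


if __name__ == "__main__":
    daily_registrations = [100, 104, 98, 102, 96]  # made-up numbers
    for value in (101, 140):
        flag = "ALERT, ask a human" if is_anomalous(daily_registrations, value) else "ok"
        print(value, flag)
```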
We just want to make sure you don’t have some artifact coming out of complexity or something odd. For example, the Lehman thing; shut down all registrations of people trying to claim Lehman. That would have been wrong and would have been a terrible user experience for them. You want to have some humans go, “What’s going on? Does this make sense?” There will be a limited amount to which it will automatically touch action. That’s a good point. Let’s talk about some of our products. These are a couple of our most popular products. On the left here is what you would see if you go to – this is my profile because Reid’s on there. If you go to my profile, on the bottom right, you will see something that says, “Viewers of this profile also viewed” and it gives you a nice list and you can click on it to go to any of the other places. D.J.: It’s responsible for a huge number of page views, such a simple thing. What is it? It’s a collaborative filter; one of the first places that it was done was at Amazon. “People who looked at this book looked at this other book.” It was incredibly easy. You just look at all your [0:49:05.7 pair wise] pages and you say, “This tallies more than a certain amount, bingo.” Where is the challenge? It’s very sparse. We have 40 million users. Of all those pairs, finding those pairs and calculating them – the way we’ve done it in the past is we ripped through all that information with either Aster Data or Greenplum, or previously Oracle. Nowadays, this is one of those things where iteration time is about a week. Live to site is about a week and a half. That’s all it takes. That’s our prototype cycle to be live to site. 
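The pairwise tallying behind "Viewers of this profile also viewed" can be sketched in a few lines. This is a toy version under stated assumptions: each input list is the set of profiles one viewer looked at, and `min_count` and `top_n` are hypothetical parameters standing in for the "more than a certain amount" cutoff. At 40 million users the real job is an offline batch over sparse data, which this in-memory sketch does not attempt.

```python
from collections import Counter, defaultdict
from itertools import combinations


def also_viewed(sessions, min_count=2, top_n=3):
    """Map each profile to the profiles most often viewed by the same viewers."""
    pair_counts = Counter()
    for viewed in sessions:
        # Count each unordered pair once per viewer.
        for a, b in combinations(sorted(set(viewed)), 2):
            pair_counts[(a, b)] += 1

    related = defaultdict(list)
    for (a, b), count in pair_counts.items():
        if count >= min_count:  # drop pairs that tally too low
            related[a].append((count, b))
            related[b].append((count, a))

    # Highest tally first; ties broken alphabetically for stable output.
    return {
        profile: [other for _, other in sorted(neighbors, key=lambda x: (-x[0], x[1]))[:top_n]]
        for profile, neighbors in related.items()
    }


if __name__ == "__main__":
    sessions = [["reid", "dj", "andreas"], ["reid", "dj"], ["dj", "andreas"], ["reid", "jeff"]]
    print(also_viewed(sessions))
```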
When I say that, I mean the entire look, the entire thing. We have a technology component system that is the ability to deploy on arbitrary pages this type of content. It’s very powerful. The same thing happened with “People you may know.” Those algorithms are quite a bit more challenging. The algorithm took a lot longer but the overall functionality – not very long at all. We actually have another product that is similar, called “Groups you might like,” and that took three and a half days to build the algorithm, which is just some basic logistic regression off of your key words, and deployment on the site was another two days before we loaded the data into the system and pushed it. That is responsible for the huge genesis that we’ve seen, where we see about 1,000 groups being added per day, and a lot of the activity of people finding which groups they want. 0:50:49.1 I put a graph here. It’s because one of the things that we actually do with these analytics products is we don’t build it so they actually update continuously. This is an interesting part of why do we do this. This comes as an offline processing. We crunch, crunch across the thing; we think we have a product. We can test it very quickly, we can iterate on this. This isn’t the final version that went. Several iterations happened. We said, “What if we reordered it? How do we order this? What colors do we put?” All of that we can change very quickly. We may want to change the data actually. Then, we’ll crunch through it, we’ll put the data in a nice neat package, push it into production. We wait to see how it does. If it’s getting traction we will refresh the data. 
The reason we do that is from a data philosophical sense, the investment to make the real time system that’s keeping track of the data coming in/going out; that’s real heavyweight infrastructure. It requires SLAs, Service Level Agreements. It requires system up time. It has a whole lot of stuff built in there. This is an older graph of “People you may know” where when we do the regular pushes, this is engagement, the percent of LinkedIn users engaged with “People you may know.” That’s about the 20% line right here, so we have a very high engagement just on this product. It’s also one of the prominent features on the homepage. At the same time, it clearly shows incredible value. The interesting thing is these blue lines are where we push the data. You see this spike set happen. That’s one way we know, “We push the data,” and “bam!” you get this big uplift. Let’s try it again. “Bam!” you get this uplift. You do it a couple more times and you convince yourself, “We’re going to be doing this all the time; now we’re going to invest in a product to actually make sure it’s bulletproof, iterative, updates frequently, has a great seamless user experience that we’re not going to touch for quite some time.” Reid: When you have the algorithm right and then you move the blue lines together by increasing performance and dedicating engineering. Student: Do you feel if you introduce that in real time instead of [0:53:19.1 unclear], would you get the same results? D.J.: No, I think it would be lower. The reason why – this is specific to this product. The reason why that happens is you have this built up demand for it of freshness, and then it’s like, “Wow, there is something fresh, engage, engage, engage.” Then it kind of relaxes. Andreas: We have two things. One is the freshness you referred to. The other thing is something we have seen at Amazon quite a number of times, called the “Hawthorne Effect.” About nine years ago, Hawthorne was in the Detroit area. 
They had some factory where people were putting stuff together. It turned out that productivity was kind of on the low side. People had the brilliant idea that the workers can’t see all that well so they needed to crank up the light. We saw precisely that effect here, that productivity went up but then kind of dwindled down and went to the prior level. 0:54:33.9 People thought, “If we are back at that level, we might as well save that energy.” Now comes the interesting thing about this. They turned the lights down again. Productivity went up. It’s one of those stories that irrespective of what you do, as long as you have the attention, and you as a worker feel attended to, you perform better. That is sometimes interesting; a random graph like this one can be explained in many ways. Reid: It probably wouldn’t work if you strobed the lights. [Laughter] D.J.: That’s right, there is a surf frequency. This gets into an interesting place. You have data products that are on the site that are interactivity. You can put these into email channels. You can do different things that have time sensitivity. Depending on how you expose it, where is that “I’m showing you the love as a consumer that there is something new for you.” That’s where the uptake happens. Student: What about the average over time? Do you think it’s going to be higher, peak out, or go down with engagement? D.J.: That’s what he just asked, or are you asking something else? Student: We don’t have the [0:55:53.3 background noise] get engaged with someone all the time, but maybe the average will have a higher … 25% all the time…. 
D.J.: The engagement over time, I expect to be less in the peaks but not much below the peaks. Reid: But the average or higher. Student: In regards to this engagement question; I was wondering what your rationale behind the status updates was. It seems like it would be more of an internal facing thing. This is more external and I guess people who care about reputation a lot more … I was wondering… Reid: We will get to doing status updates that work only within a group, within a validated group and that sort of thing as a way of doing that. We have a tendency to iterate the product and see what comes out of it because what frequently happens is you will see patterns that you just didn’t imagine when you were first doing it. It’s better to have an iterative strategy as a product strategy. 0:57:13.0 Part of how I use my status updates is I will actually say, “Hey, I’m thinking about enterprise 2.0.” I will get some responses from within the company saying, “Did you see this, or this article,” but I’ll also get responses from people that I know who are part of my network more generally. There is a use of status updates, in terms of solving professional problems. Essentially what you’re doing with a status update is you’re advertising where your attention is and someone else to whom where your attention is, is important to them whether it’s because they’re a collaborator with you or because they’re trying to sell you something or to build a relationship with you; that advertisement of where your attention is can actually bring back useful information. D.J.: I want to talk a little about this one specifically. 
This is the edge case problem, which is standardization. Just ignore all the text, but the key thing is; here are some examples of what actually might have to get mapped. You have the misspellings, the abbreviations, maybe some additions. You have to map that all to software engineer. How bad does that get? What happens here, at the top you have software engineer, software engineer, and different spellings there. Those all have to be mapped the same. You also have the question where all companies are the same. Right now, in the system there are over 8,000 variants of IBM that exist. People are always – this is a very nonstationary problem. We have 46% of our population is international. You have companies that may have similar names that are showing up in the system. If there is only one of those names, how do you make sure your system isn’t just dropping it into this name? If it’s “Continental” – Continental Bakery, Continental Airlines, Continental Tires – there is all of this. At which point do you make it a new group? Is it three people, five people? You might think, “I can just dynamically do this.” It turns out there are so many edge cases that it’s incredibly challenging. That’s just an example of one of those background process technologies where we spend an incredible amount of time because it’s not only important for front-facing products – search, any of the people finding-type mechanisms, powering part of the group recommendation engines, but also critical for the underlying structure of how we dictate data and how we take advantage of data in the future. Student: This part is done algorithmically versus using things like clever filters where people will click through and therefore you correlate these two people might be… Andreas: Do you want to reflect the question to the class? I think there would be some good ideas. How would you solve the problem of using people, creating incentives, and using computers? 
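To make the standardization problem concrete, here is a small sketch of mapping misspelled and abbreviated titles to one canonical form. It is only an illustration of the idea, not the approach LinkedIn used: the canonical list, the abbreviation table, the modifier list, and the 0.8 cutoff are all assumptions, and the fuzzy match relies on `difflib` from the Python standard library. As D.J. notes, the hard part is the edge cases, which a sketch like this does not solve.

```python
import difflib
import re

# Hypothetical reference data; a real system would need far larger tables.
CANONICAL_TITLES = ["software engineer", "product manager", "data scientist"]
ABBREVIATIONS = {
    "sw": "software",
    "s/w": "software",
    "eng": "engineer",
    "engr": "engineer",
    "sr": "senior",
}
MODIFIERS = {"senior", "junior", "lead", "principal"}


def normalize(raw):
    """Lowercase, expand known abbreviations, and drop seniority modifiers."""
    tokens = re.findall(r"[a-z/]+", raw.lower())
    tokens = [ABBREVIATIONS.get(token, token) for token in tokens]
    return " ".join(token for token in tokens if token not in MODIFIERS)


def standardize_title(raw, cutoff=0.8):
    """Return the closest canonical title, or None when nothing is close enough."""
    matches = difflib.get_close_matches(normalize(raw), CANONICAL_TITLES, n=1, cutoff=cutoff)
    return matches[0] if matches else None


if __name__ == "__main__":
    for title in ("Sofware Enginer", "Sr. SW Eng", "S/W Engineer", "Chef"):
        print(repr(title), "->", standardize_title(title))
```

Returning `None` instead of forcing a match leaves room for the question D.J. raises about when an unmatched name should become a new group.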
D.J.: This is one of our first interview questions. How would you solve – you have 40 million profiles, they all have variations of pick your number of how many companies you want to pick. We let people pick that. Reid: The reasoning process of getting that is interesting too. D.J.: How would you figure out where to put people in those buckets? Who has some ideas? Student: If I didn’t know, I might see who they were connected with and see if it’s the company, their peers… 1:01:13.4 D.J.: Similarity is a good one to start with. Any others? Do we have some algorithm jockeys? How about the algorithm side? There is the algorithm side and the clever side. Reid: Input the list in the Mechanical Turk and pay people to do it. I’m trying to stimulate you guys, throw out stuff. This is a non-techy guy saying this. Student: … D.J.: You have all these profiles and the guys from the search team say, “We need to have all of these segmented by company. All of the people with IBM, any of these variants, they all have to match to IBM. People with Motorola have to match with Motorola. Stanford University – Stanford Medical? Stanford Business School? They all have to map to Stanford.” Student: Maybe send in correspondence to people who are in question, offering them some sort of incentive to clarify their relationship. You could take the popular ones and just have them click “this is what I mean” and associate it. D.J.: How do you think you would get – that is a very good track. How do you think you would get them to actually take action? How am I going to convince you to take time? 
Student: Layer in the friends. Student: Tell them the same thing; “Somebody is looking at you. This could mean a better job. This could mean …” D.J.: Exactly, the value proposition is essential for getting people to do stuff. What else? Student: To make better profiles more professional, so that’s the campaign. D.J.: That’s exactly the word we use, campaign. We use it as a campaign to get you to do something. Student: Put a job title with auto complete… D.J.: Type ahead. Reid: We have that, I think. Student: Just for searching, start with regular expressions to cluster things by initial words, for example, all the Stanford Business School etc… on top of that… matching based on the distance….. D.J.: Keep going; no one has gotten out number one yet. Student: Do something similar to [1:03:53.2 unclear] where Ireland appears in a lot of places but IBM is more interesting so you extract that and throw away the extra stuff. 1:03:59.5 D.J.: Yeah, that’s a good one. Reid: You wouldn’t necessarily throw it away, by the way. There is a value in the data… D.J.: Number one… Student: … D.J.: [1:04:13.6 Chronologization] is an important part for taxonomy and ontology. The number one is the email address. What email address did you register with? What email address do you have on file? If it’s eBay.com, you work for eBay. Reid: Even though it says Kijiji in your profile, for example. D.J.: You can map all the emails to the same DNS. Then you do a look up and you tell where everyone is from. That’s the number one way. Student: … D.J.: Yeah, we’re a professional site. Student: … Reid: You can associate up to 8 email addresses with the site. 
Which one you have direct to, but in fact, you get invitations to all the invitations. D.J.: Or, your alumni site, also. Andreas: I thought it was a very interesting discussion. Let’s just pull out what we actually heard. There is so much going on. One is that the one data source, which is basically the hardest one to fake, namely, people’s email addresses, is interesting that none of us thought about. I’ve been wondering; from the earliest days I heard about LinkedIn, why don’t you have an authority that if somebody claims he was the Chief Scientist at Amazon.com, or if somebody claims he was at Amazon.com, that you say, “Yes, on November 17, 2002, he confirmed from an Amazon.com email address.” Do you want to be an authority space for the emails? The other thing we heard is that I believe it is less the pattern matching, but more the incentives of people at that moment when they’re registering, when they have an incentive for others to find them that you can get them to do work. They don’t do the work for LinkedIn. They do the work for themselves, and as long as you manage to align your incentives with the end user, it’s the same with wish lists at Amazon. People don’t do the wish lists for Amazon, they do it for themselves. But, Amazon can recommend some books every now and then from the wish list and people feel so understood when you recommend a book from the wish list. 1:06:34.4 D.J.: It is the trivial things that go a long way. Andreas: It’s interesting that it’s not the hard algorithms. The really hard ones probably don’t work on the sizes you have. D.J.: It’s the edge cases. 
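The "number one" approach discussed above, using the email address a member registered with, can be sketched as a domain lookup. This is a hedged illustration only: the domain table and the list of personal mail providers are hypothetical, and the function simply walks from the full domain to its parent domains until it finds a known company.

```python
# Hypothetical lookup tables for illustration.
DOMAIN_TO_COMPANY = {"ebay.com": "eBay", "ibm.com": "IBM", "stanford.edu": "Stanford"}
PERSONAL_DOMAINS = {"gmail.com", "yahoo.com", "hotmail.com"}


def company_from_emails(emails):
    """Return the first company implied by any of a member's email domains."""
    for email in emails:
        _, separator, domain = email.strip().lower().rpartition("@")
        if not separator or domain in PERSONAL_DOMAINS:
            continue  # malformed address or a personal mailbox: no signal
        parts = domain.split(".")
        # Try corp.ebay.com first, then ebay.com, and so on.
        for start in range(len(parts) - 1):
            candidate = ".".join(parts[start:])
            if candidate in DOMAIN_TO_COMPANY:
                return DOMAIN_TO_COMPANY[candidate]
    return None


if __name__ == "__main__":
    print(company_from_emails(["me@gmail.com", "dj@corp.ebay.com"]))
```

Because a member can have several addresses on file, the sketch checks all of them and ignores personal providers rather than treating them as employers.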
Andreas: It is coming up with new ideas like the new email we didn’t have, or with incentives for people to do what they want to do at a given point of time, where they see they get value out of it. Are there any other things we think are counter intuitive from what we thought five years ago?
D.J.: The other one, and Reid talked about this, it’s paying people. You have to pay them in a smart way and this is one of the great things about Mechanical Turk. When you have a system in your algorithm, you’re not sure. Your algorithm can’t get you there. Your techniques and your rule-based logic don’t work. Ask somebody to do the work for you. Give it to seven people; if two of them say something ambiguous, give it to another seven and let them vote on it. It’s a very fast way to do it.
Student: I’m curious; you say that was your number one solution. It’s an automated solution. You missed the chance to engage with the consumer in a very positive way; giving them value. You may have crunched the numbers and created a better … more value, but you haven’t given a positive experience to your user to say, “We thought you might like to know that people are looking for…” Is there a way to measure that?
D.J.: We don’t leave it behind. We want to use that opportunity but we have to be very careful in how often, and how much, we use that opportunity. The way we look at it is we still have that channel to try to say, “Do this for your best interests,” but at the same time, we want to be very cognizant of whether I can ask you to do something that’s even more powerful because I got this piece done. We engage on other campaigns to actually take care of that.
Andreas: I think the point you make is that dialog, whenever the user is willing to enter dialog, is one of the most powerful drivers, in being understood, and in most cases being much more willing to give more than he initially thought he would give.
Reid: There is one other point: we try to limit the number of times that we touch a user per time period, whether it’s a week, a day, a month, because if you inundate them even with things that are value propositions, unless it’s free money that they believe, it’s like, “Stop!” We may have said we could have hit you with this particular one, but it would’ve been better to hit you with another one. We already figured this out anyway so we hit you with a different one instead. 1:09:24.5
Student: …
Reid: You can use analytics as a way of figuring out click through rates and that sort of thing. Frequently, the thing people mistake about consumer Internet strategy is they think because there is so much data, all of the decisions are data driven. Actually, because the time frames are so compressed, it’s fragments of data together with intuition, with a strategic belief. You are kind of going, “This piece, this piece, okay fine; we’ll do this.” It’s not like a logical proof. It’s a combination of the two.
Andreas: I wanted to pick up on one thing you said. By the way, for those of you who don’t know what Mechanical Turk refers to, Amazon, about five years ago, launched a service decomposing problems into little well-defined questions, such as verifying the opening hours of restaurants and things like that. You pay a few cents for such a task. One of the open questions is when do you get people, and who do you get, to do things for you for money and what is the result of that, versus when do you have people do things out of self interest?
One of the things we found at MoodLogic was that paying people to classify music was very expensive and not as good as letting people classify their own music in order to actually have their iTunes library, for instance, better organized, or in order to discover music that is similar to what they already have. I think it’s really key in the task that you define, and you might want to give some of your experiences in order to understand whether those tasks are amenable to doing it on MTurk, or whether they’re like music classification where you ask the MTurk people to do things they might not be experts in.
D.J.: This is a really important point; how do you fact check and make sure you’re not getting somebody who is just putting crappy answers in for you when you’re trying to do something with Mechanical Turk? The biggest thing you do is the structure of the test and the experiment. It turns out it’s actually not that complicated for the things we’re trying to do because we try to make them as binary as we can. “Is this right? Is this wrong? Which one of these is the right one? Pick one of the four.” Make it multiple choice. We also do it against rankings. For example, this is the idea where you say, “Let’s show it to three people. If all say yes then you know your answer. If anyone dissents then you send it to an additional seven, ten, or whatever you want to do,” until you believe that you have an answer. 1:12:06.5 The ones that you say are too wonky, those are the ones you say may not exist. You get in there and see if there is another clever alternative that you could use to solve that type of edge case.
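The escalation scheme D.J. outlines (show the question to three people, accept a unanimous answer, otherwise send it to more people until you believe the result) might be sketched as follows. This is only an illustration under stated assumptions: `ask_worker` is a hypothetical callable standing in for posting one multiple-choice task and reading back one worker's answer; it is not a Mechanical Turk API call. The round sizes of three and seven come from the discussion, while `max_votes` and `min_share` are invented stopping rules.

```python
from collections import Counter
from typing import Callable, Optional


def resolve_by_vote(
    ask_worker: Callable[[], str],
    first_round: int = 3,     # "show it to three people"
    extra_round: int = 7,     # "send it to an additional seven"
    max_votes: int = 17,      # assumption: stop escalating at this many votes
    min_share: float = 0.7,   # assumption: majority share needed to accept
) -> Optional[str]:
    """Return the accepted answer, or None if the item stays ambiguous."""
    votes = Counter(ask_worker() for _ in range(first_round))

    # Unanimous first round: accept immediately.
    if len(votes) == 1:
        return next(iter(votes))

    # Someone dissented: escalate in batches until one answer dominates.
    while sum(votes.values()) < max_votes:
        votes.update(ask_worker() for _ in range(extra_round))
        answer, count = votes.most_common(1)[0]
        if count / sum(votes.values()) >= min_share:
            return answer

    # Still too "wonky": hand the item back for a different approach.
    return None
```

Items that come back as `None` correspond to the edge cases D.J. says need another clever alternative, so a caller would route them elsewhere instead of paying for more votes.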
At the end of the day, most of the things you are solving with this – we’re not trying to solve the mainstream, fundamental problem via Mechanical Turk, like you try to do by saying, “Tell me about this music,” or “classify this into these buckets.” We’re using this to solve our edge cases, our hard problems where we know we can come up with algorithms to solve them, but it’s going to take us weeks or months to implement the technological solution, and at what cost, versus $500 on MTurk. I’m being literal when I say $500 to solve a problem that may take something like $15 thousand to solve otherwise.
Reid: The background of all this, from a business investor perspective, is that all this costs money. It costs money to pay D.J.; it costs money to have computer systems, so you factor in how do I get to a sufficiently effective answer in timeframes and costs that make sense, as a way of providing a service.
Student: At what point did analytics start making sense with respect to the quantity of users on the site? When you started building the business, I guess the important thing was having the users. At what point did you have enough data that analytics actually …
Reid: About a million users, registered users; it was a little crude then. One of the things about modern business strategy is there is a certain amount of stuff you virtualize and put out in the cloud and there is a certain amount of stuff you put internally. It was pretty clear that we would want to have analytic stuff internally and be iterating on it because there was a whole platform and sequence of products we wanted to build.
Once we started being able to say things like, “Here are users you should meet,” or “who you should know,” – I think we may have started “The people you may know,” I think we started it and everyone else copied off of us. That sort of thing was a valuable thing to have.
D.J.: You can see this one very clearly because IBM came out with a social type network thing. One of the things is we were testing it out and said, “Gee, who do we connect to? They don’t have a people you may know feature.” How dependent we’ve become on this. The other thing is many times people think, with data, that it has to solve a problem. It’s not true. For example, if you were trying to convince people and you didn’t have enough usage around your search functionality, you bullshit it. Put in similar searches that you think are going to be okay. If a person tries to put in a search term, you just jerry rig a bunch of stuff around other searches you might see on Google to get the people to bootstrap it. You don’t have to tell them it’s an analytics project. You’re just trying to get the spin up going and you’re doing it in a clever way to get the fire going.
Reid: We’ve done that too.
Andreas: I think we’ve all done that. Actually, I was expecting a different answer to the question of how many data points you needed. The answer is always the four letter word, more. What I want to do in order not to lose momentum, I know you have a lot of other stuff to do. We do want to take a break. Before the break, I want to take a couple of minutes to collect the questions you might have for them. During break we can organize our thoughts and make sure that we give you what you want to do. Again, that is all with apologies to the SCPD people who are watching it offline afterwards and can’t get their questions in. 1:15:53.7 In the next five minutes, I want to see what is on your mind, whether it has been talked about already, whether we talked about it in the last class.
For instance, we talked about a global function for people you might know, which might be very different from the local optimum, which I get by clicking on somebody. That’s a very interesting problem. What questions would you have that could be arbitrarily difficult, no problem for these two guys, for us to think about in this second half? Enrique will take notes. What would you like to know?
Student: How much of networking do you see between actual companies… companies have…
Andreas: I will repeat so we have it for the others. The question here is how much networking between companies is seen. The point is that LinkedIn, at its basic, is about the [1:17:03.5 unclear] an individual. The companies get aggregated up, so the question is how much do you see between companies?
Student: I was wondering if there was any way to calculate misrepresentation of titles… how do you tell if… actually… companies actually incentivized… helpful…
Andreas: The question is about the truthfulness of the data here. What incentives can be provided? For instance, what ways are there to comment or to show that somebody’s title changed, like that guy I hired straight out of college to work with me at ShockMarket. When I saw a few years later that he was the CTO of ShockMarket, that was quite impressive. If you track, through time, that he slowly moved up from SD 1 to manager, to a CTO, or the company was long dead; that would be an interesting incentive for people to not lie. What are your incentives for people to be honest?
Student: Do you have any plans of creating a social economy based on the … messages or updating status or accumulation of all these … does that mean anything that is transparent.
Reid: For example, I will answer the question more at length afterwards, but do you mean like the number of the invitations I send, how many of them are accepted? Is that what you mean by reflecting data of what I’m doing?
Student: Reflecting data… way that would contribute to …
Andreas: What self-metrics or metrics of the user could we show, for instance, the probability of responding within a day, within a week, so by people seeing this about themselves, and potentially being compared to the rest of the pool or to their peers, seeing “if I’m the lowest [1:19:32.3 unclear], maybe I’m not doing what I should be doing.” 1:19:36.4
Reid: As an amplification example, I’ve forwarded three requests from D.J. to other people but he hasn’t forwarded any from me, in terms of social karma economy.
Student: I have two questions. One, I was hoping you, D.J., could talk more about this idea you launched for recommending a person’s career path based on the data that you collect. The second question is for you, Reid: you mentioned your vision to also create some kind of a platform, and I can definitely think of a bunch of applications in terms of predictive markets and finance. Can you give us an update about that; is it still going on, and what kind of data are you going to be exposing to developers?
Andreas: Let’s take two more questions.
D.J.: By the way, the one on the career paths, I will also ask you the same thing: how would you actually build a career path… you can discuss. I’ll show you how we do it and then you guys will know how to do it yourself.
Student: How do you balance user experience with monetization? What are the metrics you are looking into?
Student: With people here using LinkedIn Answers to ask questions of experts on the network, what metrics do you use to identify the person who might be best suited to answer any given question?
Andreas: That certainly leaves us with some things to think about during the break. Let’s reconvene in 20 minutes, which will be shortly after 4:00, according to the clock in the back of the room. Coffee, as you know, the café is open until 4:00. Do you have to dash right afterwards? You have access to our two guests during the break right now, if they’re willing to do that, and also after class.