MS&E237 Spring 2010, Stanford University
Andreas S. Weigend, Ph.D. (www.weigend.com)
The Social Data Revolution: Data Mining and Electronic Business
April 15, 2010
Class 6: People Discovery, Facebook PYMK
This transcript: http://weigend.com/files/teaching/stanford/2010/recordings/audio/weigend_stanford2010_6peoplediscoveryFacebookPYMK_2010.04.15.doc
Corresponding audio file: http://weigend.com/files/teaching/stanford/2010/recordings/audio/weigend_stanford2010_6peoplediscoveryFacebookPYMK_2010.04.15.mp3
To see the whole series:
Containing folder: http://weigend.com/files/teaching/stanford/2010/recordings/audio/
Course Wiki: http://stanford2010.wikispaces.com
Transcript - Tamara Bentzur - Testimonials – www.tbentzur.wordpress.com, www.outsourcestranscriptionservices.com
Andreas: Welcome to class 6 of the Social Data Revolution. The agenda for today is that I want to give you some feedback on your 2-minute feedback about what was and was not clear in the last class. I will use that as a way to reflect on what was important. I will briefly talk about the homework we are assigning today, our Bit.ly homework, which is due a week from today. I will ask you to get an early start because the Chief Product Officer of Bit.ly will be in class next week on Tuesday. Any questions you have, you will have the man himself here in class, ready to give answers.
The first thing we didn’t do a good job explaining was what metadata is. A number of you asked if I could explain the meaning of metadata. Metadata is data about data, such as annotations.
If you have a file and you put some tags on the file, that would be metadata. Our example of RTN would be an example of metadata: not just doing a retweet but also saying how important that retweet is for the audience. Music metadata would be what genre the song is, or if you submit a paper to a journal, you are asked to give topics, which are examples of metadata. What has changed is that metadata creation can easily be crowdsourced: people simply add metadata as they go along. Wikipedia is an example of that which works quite well.
The second question was asking if I would explain PHAME better. When talking to Lars, we decided his talk today would be a beautiful example, framed as PHAME. If you are still not sure, at the end of the day, what problem, hypothesis, action, metrics, and experiment mean, then we’ll have that conversation afterwards.
The most important point from last class was that production patterns have shifted. It’s very easy to create data now. The distribution patterns have shifted and, as we learned from Dan Olsen, the consumption patterns have shifted too. What he called “information snacking” is an example of consuming small bits, bite-sized pieces, snacks, as opposed to buying a book. These are a few dimensions. Production has changed: we are all carrying devices which make it easy for us to produce data. Distribution has changed: the cost of communication has dropped. As a consequence of that, consumption has changed.
The second point that was important about last class is to understand the tradeoffs. Any product design is full of tradeoffs, and one of the questions we had last class was whether we should surface the question of recency versus relevance to the user. Should the user have a knob where he can say “I really want very recent stuff, even if it’s not relevant,” or should we make the decision for him, or should we learn from their past behavior?
0:04:03 For instance, if they always click on the most recent stuff, presumably they’re more interested in recency than in relevance.
For those of you with a statistics background, that is the old problem of the bias-variance tradeoff. You can have a model that has a strong bias, which means you have a small variance, but if it is the wrong model it’s the wrong model. Or you can have a model which has a very low bias, which means you can model anything but the parameters are not determined as well. Another version, on Wall Street, is the noise versus non-stationarity tradeoff: you might see something and you don’t know if it’s just a fluctuation or if there is actually a regime shift. If you go right after any little blip, then you hunt for the noise, but you assume the world is non-stationary. If on the other hand you look a long way into the past, the problem is you miss small shifts. All of these are tradeoffs: recency versus a larger dataset, bias versus variance, noise versus non-stationarity.
Are there any other tradeoffs which you would like to add? Think about the class prior to that, when we talked about conversion rates and metrics at a company like Amazon. For instance, you can buy lots of hits to the site by buying unqualified traffic. The person who is responsible for bringing traffic to the site gets a big bonus, but the person responsible for conversion rates is in big trouble. Those people you push to the site from random keywords… of network, are not the ones who are going to sign up or buy. Tradeoffs are important.
Metadata is important, and understanding the economic impact on production, dissemination, and consumption of data was another important part.
A couple of minutes on homework three. I think you did a good job, as far as I can tell from what people told me, on the Delicious homework. Homework three is due one week from today. I just talked to Bit.ly this morning and the data is being prepared as we speak. Let me tell you what it is that you’ll be doing. First of all, don’t worry about computational complexity. We have one very small data set and we have one big data set. I don’t expect everybody to work on the big one. Play with the small one. See what insights you can get, even if they’re far from statistically significant.
To give you an example of why I said this, here is some data from an interview I gave last year on NPR’s Marketplace. I put a Bit.ly link in there. They told us there were 112 clicks, and they also give us the timeline of those clicks. To first order, we can describe this by the area under the curve, which is the total number of clicks, and the decay time: how fast do things age? Do people still listen to it half a year later? Hardly. 0:08:01 I want you to think about whether you can build a model predicting the overall number of clicks, as well as the decay time, and for that I am giving you a few variables. The variables Jeremy is writing up are the long URL and the time stamp of when it was encoded or created. Maybe Friday afternoon is not a good time for press releases because they get stuck in email, and on Monday morning people are busy.
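The two summary numbers described above, total clicks and decay time, can be sketched in a few lines of Python. This is only an illustration under a stated assumption: it treats the click timeline as an exponential decay, which the lecture does not specify, and the function names and sample timeline are hypothetical rather than part of the Bit.ly data.

```python
import math


def summarize_clicks(click_hours):
    """Summarize one link's click timeline.

    click_hours: hours elapsed between link creation and each click.
    Returns (total_clicks, decay_time_hours). Under the ASSUMED exponential
    decay model, the maximum-likelihood time constant is the mean delay.
    """
    total = len(click_hours)
    if total == 0:
        return 0, float("nan")
    decay_time = sum(click_hours) / total
    return total, decay_time


def half_life(decay_time):
    """Convert an exponential time constant into a half-life."""
    return decay_time * math.log(2)


# Hypothetical timeline: most clicks arrive within the first few hours.
example = [0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 12.0, 24.0]
total, tau = summarize_clicks(example)
print(total, tau, half_life(tau))
```

With the two numbers per link in hand, variables such as day of week or time of posting can be compared against them, for example by grouping links by posting hour and averaging `tau`.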
If you think about yourself as a marketer or working with a marketer, you want to help them understand when they should send links out, and you want to get their buy-in on the details you can measure. Bit.ly for me is a global measurement system of attention. We know what people are clicking on. The powerful thing is the websites cannot do anything about it. NPR has no say; I am making a shortcut. People click on that shortcut, I measure what is happening, and I don’t need to talk to NPR about that. That is very interesting, and for me, for marketing, for understanding what works and what doesn’t, it is one of the most powerful probes we have.
Student: To clarify, does Bit.ly make unique links or could there be 12 links that go to the same…
Andreas: My understanding is that ultimately Bit.ly knows a unique identifier for each link. However, if you want to have certain custom URLs, then you can have a number of different ones. But in the reports you see, they all get lumped together into one.
Here are some questions I think you might be interested in asking. I want to get some insights from you. Does the day of week and time of day of the initial posting influence the number of clicks and the decay time? What is the effect of the initial channel? You know from the HTTP referrer where it initially came from. Do things that are posted on Facebook live longer than things that are posted on Twitter or in a blog? Do traditional media content, such as the New York Times, and blog posts, such as TechCrunch, propagate differently from more personal content? I don’t think anybody knows the answer, but you will a week from today. What is the influence of the topic area? Bit.ly tries to give us a number that characterizes what the link is about. Is it about technology, finance, food, etc.? Do we understand that maybe food items are longer lived than finance items?
Here is another one where you should pull some data from Google: does the popularity of the content site, which you can get via Google PageRank, predict its propagation? If you have something from the New York Times, will that propagate further than something from Weigend.com? Finally, are the properties of the initial post important? Does it matter what their people rank is? Those are questions I think you can look at in the data. The assignment is to give me some insights from the small sample we call “Sample A”. Those of you who actually love to write algorithms that scale can do the second one as a group and run through the huge dataset we’re getting, in order to understand what is statistically significant.
Student: I’m wondering about the list. If you found ... what is the correlation of pollination? For example, people who send out links in the middle of the day might be really interested in whether it’s propagating or maybe if I’m just sending a link to a friend or something, I might not care how much traffic the link has been getting. 0:13:03
Andreas: You can do experiments. You can have different target URLs which essentially are the same, and run experiments all day long. Marketing people typically say 50% of all marketing dollars are wasted, but nobody knows which 50%. We can create transparency now by understanding where people click and what happens to them afterwards. We don’t have to own the site. That’s the powerful thing. We can do it with redirects.
Student: How does Bit.ly take into account the fact that the URL is shortened so you can’t actually see what site they go to?
Andreas: I think most interfaces now show the site which is behind it. For instance, Twitter shows that it’s a link to the New York Times.
Student: What format is this data in?
Andreas: CSV, so you can use whatever you want to look at it. The small one you can do in Excel if you want. For the big one you need to have reasonable skills.
Student: What do you think about Twitter rolling out their own URL shortener? How do you think that will affect Bit.ly’s spread to the world?
Andreas: I think that’s a good question for Todd on Tuesday.
Before we have Lars come up, I wanted to talk about discovery. On Tuesday we will get the initial proposal for your project from you. On Thursday we will get the Bit.ly answers from you. On the following Tuesday, it’s the project day and we will get the next… there. For the Twitter homework, if you want to get a sneak preview, look at last year’s Twitter assignment. It’s about people discovery on the public data Twitter has. As we’re listening to Lars today, who is going to talk to us about how Facebook tries to have you discover people you might know, I want you to think about how you would do that on Twitter. MrTweet is the app that I use and it’s quite interesting to compare and contrast the two worlds.
For today, we have the best of Lars, and I asked him to establish his credibility by talking about something on privacy and the non-existing privacy-preserving data mining attempts: how auxiliary data helps you to nail things down with enormous precision. Then at 4:45 we will start with “People You Might Know”. At 5:00 we will do a group exercise where we will talk together and figure out metrics, and how to know if what this team is building is any good. If his team is building two different versions, which one is better? We will then turn it back to you and you will show us some results.
Do you know why we’re doing what we’re doing? Do you know how it fits in? Today is people discovery. Last class was content discovery. The class before that was product discovery. 0:17:22
Lars: Thank you for having me. I’m going to talk a bit about some work I did a couple of years ago while I was a grad student at Cornell. Then I’ll move to what I’m doing at Facebook. The question we asked in this work was: can these social data sets be released to the research public? Could you release something like the Facebook social graph to researchers and not worry you were going to end up in the New York Times the next morning?
For the most part, the research community, people who are interested in doing social graph research, are not really interested in who the actual people are. They’re more interested in the social network structure, how the network is evolving, how it’s being used, and things like that. The social network could be anything. It could be an instant messenger or email graph, or the AT&T phone graph. In all these cases, researchers would really like to get their hands on it, try to run their models, and explain what’s going on. That would be great, except the users of all these systems have strong privacy expectations. You would be pretty upset if the whole world found out everybody you’d ever sent an email to or made a phone call to.
A first attempt at making both of these groups of people happy is to do some sort of anonymization, where we release an anonymized version of the network in which we just strip away all of the identifying information on the nodes.
We replace them with random IDs and we release the link structure. This gives you a bit of pause and will make you think of all these cases where people have released data and gotten into a lot of trouble. There has been a long series of work on de-anonymizing data, things like figuring out who wrote something based on textual analysis. With the Netflix data, people were able to align it with public IMDb ratings and de-anonymize it. Then there was the whole AOL debacle, where people got fired and there were lawsuits. We’d really like to avoid that.
We can start by looking at a very small network. This is the famous karate network from a sociology study 50 years ago. During the study, the network split into two groups. Node 1 and node 34 eventually didn’t like each other and they went off and formed separate karate clubs. This is the social network of all the people that were in the karate club. 0:21:10
Just looking at this very small toy example, it turns out that some information has been leaked. When Zachary released this data, he published this graph but took away all the people’s names. If one of the people in the study knew he had 6 friends in these 2 karate networks, and he was also friends with both of the leaders, node 1 and node 34, it turns out that person could uniquely identify himself as being this node. Similarly, if this person also remembered he had 6 friends but he was only friends with this karate club leader and not that one, he could also uniquely identify himself. Just using this small amount of information, how many friends did I have and which karate leader was I friends with, all these people could uniquely identify themselves in the graph.
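The self-identification argument above can be made concrete with a small sketch. The graph below is a made-up toy, not Zachary's actual data, and the node labels and function names are hypothetical; the point is only to show how "number of friends plus which leaders I know" can single out a node even after names are stripped.

```python
from collections import defaultdict

# Toy friendship graph (NOT Zachary's data). "A" and "B" play the role of
# the two club leaders; all labels are hypothetical.
EDGES = [
    ("A", "B"), ("A", "n1"), ("A", "n2"), ("A", "n3"), ("B", "n3"),
    ("B", "n4"), ("B", "n5"), ("n1", "n2"), ("n4", "n5"), ("n3", "n4"),
]


def build_adjacency(edges):
    """Build an undirected adjacency map from a list of edges."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return adj


def signature(adj, node, leaders):
    """What a participant remembers: friend count and which leaders they know."""
    return (len(adj[node]), tuple(leader in adj[node] for leader in leaders))


def uniquely_identifiable(adj, leaders):
    """Return the nodes whose signature is shared by no other node."""
    groups = defaultdict(list)
    for node in adj:
        groups[signature(adj, node, leaders)].append(node)
    return sorted(nodes[0] for nodes in groups.values() if len(nodes) == 1)


adj = build_adjacency(EDGES)
print(uniquely_identifiable(adj, ("A", "B")))
```

In this toy graph only `n1` and `n2` share a signature; every other node can be picked out from the anonymized structure alone.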
Our case is different from this. The graph is much larger. Some believe that’s going to save us. We have this huge graph with hundreds of millions of nodes in it. You could think, “What could go wrong in that situation?” Imagine you’re a data curator for Facebook or AT&T and you’re going to release this huge social graph with hundreds of millions of nodes. Whereas these karate club individuals were able to uniquely identify themselves because there weren’t very many of them, and somebody with 6 friends was almost uniquely identified to begin with, in these graphs there will be hundreds of thousands of people with degree 6.
We did this work, which actually won “Best Paper” in 2007, where we imagine a simple scenario in which the data curator, say Facebook, is going to release their social network just one time. They’re going to strip as much identifying information as possible. They’re only going to include the nodes and the social connections, the friendships, and nothing else. An attacker has the ability to add a few nodes and edges before this all happens. In the case of Facebook, that means going in and creating some accounts, and then making some friendships between those accounts, and then maybe making friendships to some other individuals in the graph. The goal of this attacker is to try to discover if 2 individuals are connected. You can imagine the attacker is targeting 2 specific people. He wants to know if his wife and best buddy are friends on Facebook or something like that. What can that person do? Here is the structure of the attack we developed.
This is how you should do it if you want to do it. Imagine you have this huge graph with hundreds of millions of nodes. You can target more than 2 people. Imagine we were interested in these 3 people here and we wanted to determine which of these three edges are present in the social network. We’ll create this structure out here which will be like our secret key. We create these 10 nodes and they’re connected in some way that we know about. Then we’re going to put that in the graph, and the data is going to be released with the original hundred million nodes and also these 10 we created. If we could only find this thing we created, we could follow these links to find the people that we’re interested in and we could read off which of those pairs are connected.
The question is how we create this widget so that we can find it uniquely, because some instances of this problem are very hard. This is an instance of subgraph isomorphism, if you’re familiar with that sort of computer science. It’s an NP-hard problem. It seems like this won’t work. It turns out that if you do things the way we tell you, you’ll be able to find it very efficiently. The right way to do this is pretty simple, actually. You just pick some K [0:24:59], which has to be sufficiently large but not that big, on the order of 10 or 20. You’re targeting K people. You create K nodes. Then you add these links to target them. 0:25:12
In order to make my thing uniquely identifiable, and in order to find it later, I just add a bunch of random edges between them. This is just a random graph, with probability ½ for every pair of nodes, with the one caveat that I’ve deterministically created this path.
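The construction just described (K new nodes, a deterministic path through them, a coin flip for every other pair, and links out to the targets) can be sketched as follows. This is a simplified illustration rather than the published algorithm: linking exactly one new node to each target is an assumption made here for brevity, and the function and node names are hypothetical.

```python
import random


def build_attack_subgraph(targets, k, seed=None):
    """Create k attacker-controlled nodes plus the edges the attacker adds.

    Returns (new_nodes, internal_edges, target_edges).
    Simplifying ASSUMPTION: new node i is linked to target i only.
    """
    if len(targets) > k:
        raise ValueError("need at least one new node per target")
    rng = random.Random(seed)
    new_nodes = [f"x{i}" for i in range(k)]

    # Deterministic path x0 - x1 - ... - x(k-1), used later to recover the
    # ordering of the planted nodes.
    internal = {(i, i + 1) for i in range(k - 1)}

    # Every remaining pair of new nodes gets an edge with probability 1/2.
    for i in range(k):
        for j in range(i + 2, k):
            if rng.random() < 0.5:
                internal.add((i, j))

    internal_edges = sorted((new_nodes[i], new_nodes[j]) for i, j in internal)
    target_edges = [(new_nodes[i], t) for i, t in enumerate(targets)]
    return new_nodes, internal_edges, target_edges


nodes, internal_edges, target_edges = build_attack_subgraph(
    ["alice", "bob", "carol"], k=10, seed=0
)
print(len(nodes), len(internal_edges), target_edges)
```

The attacker keeps `internal_edges` as the secret pattern to search for once the anonymized graph is released; finding it efficiently is the part this sketch does not cover.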
I’m not going to go into the details too much about why this all works, but I’ll give the 30-second version. If you’ve created a subgraph of size K, the total number of subgraphs of size K in this entire graph is something like N^K, where N is this hundred million term. That is the number of things you have to search over. The probability that any one of them matches looks like this; if you work that out in your head it’s something like 2^(-K^2). It turns out this first thing is getting bigger and bigger as K grows, but this one is getting smaller and smaller much more rapidly. Eventually, when K gets to be about 2 times the logarithm of the number of nodes, then with high probability your thing is unique and you can find it. We have an algorithm that does this all efficiently too. That’s all I’m going to say. Hopefully I’ve established my credibility. I’ll pause and take questions.
Andreas: There are a couple of examples you might want to give, like Netflix. If you had N movies rated, then with a probability of – I forgot the numbers, but they’re pretty amazing.
Lars: I don’t remember the details either, but if you include some contextual information then it gets much easier. With Netflix, they had the time stamps on all of the ratings. You can think a lot of people rated the movie at every minute, but not that many people had the exact same signature as you. That was a key part of the de-anonymization in that work.
Student: … Facebook social graph, a lot of the Facebook users… avatar. Can you map a graph out that way?
Lars: This doesn’t have anything to do with that. If the data is already public, then they can just get it. The scenario we’re thinking about is one where it’s not public. If someone were to release the data, how could you learn things that wouldn’t otherwise be available?
Student: Wouldn’t the public parts of the … get a lot harder to … Facebook graph … a lot of contextual…?
Lars: The question was wouldn’t the public parts make all of this easier, and yes, they probably would. It’s already pretty easy though. The point is that if you’re worried about your users’ privacy, you shouldn’t go and do this simple anonymization.
I’m going to talk about “People You May Know” at Facebook now. This is what I’ve been working on for the last few months. The first question, the overarching problem, is who I should suggest. I’m going to make some suggestions for you of who I think you might know, so how should I figure out who those people are? 0:28:39
Our hypothesis, which is backed up by the literature, is that the social graph is going to tell us. I’ll get into the literature on that and talk about what we’ve done on top of it in a bit. Given the social network, we’ll use some machine learning techniques to make the suggestions, and we’ll talk about how we measure our performance and how much better we’re doing now than we were a month ago.
To put it in context, you have everybody doing recommendation analysis. You have Amazon recommending all these products. They think I need a new hard drive. Netflix is recommending movies based on what I’ve watched before and what ratings I’ve given.
Facebook’s product is sort of people, so we’re trying to recommend people we think you might know. Sometimes our problem is a bit different from the problems that Amazon and Netflix have. They’re doing collaborative filtering, where they try to find people who are similar to you in the whole world and make suggestions based on that. They find someone who has similar tastes and has rated some movies, and they figure if that person liked the movie then you’ll like it too. For us the social context is much more important. It would be silly for us to find people who are like you and recommend their friends. We’re going to use the social context and look at the social graph to find people who are nearby in that space.
To show what the product looks like: if you go to your home page on Facebook, you may or may not see some friend suggestions here in the upper right-hand corner. It shows you a few things. It tells you how many mutual friends you have with that person, and you have the ability to add that person as a friend, or you can remove that person and it will show another one. There is this other page which will show you a whole laundry list of people. The home page is by far the most important source of friend additions for us.
Our hypothesis is that we’re going to be able to figure all this out, or at least a lot of it, just by looking at the social graph. There is some research that backs this up. There is work I did with Jure Leskovec when we were both interns at Yahoo. We looked at a lot of social networks. We looked at how many friendships are created as a function of how far in the social graph you are from that person. To take one example, we’ll look at LinkedIn, and you see here that the highest point is at 2, and 2 hops away means your friends of friends. This is on a log scale, so dropping from 2 to 3, you’re dropping by not quite an order of magnitude but a factor of 5.
By far the most significant source of new friends is people that are friends of friends. It makes a lot of sense intuitively: the people you’re meeting, you’re probably getting introduced to them through friends, and with people you already know, you probably have a common friend. This is the raw number of edges that have been created. If you were to think about the probability of creating a new friendship to a given person, that’s more extreme. It looks like this because when you go from that raw number to the probability, the denominator gets huge. You have maybe 10,000 people that are friends of friends, but you have a million people that are friends of friends of friends, 3 hops away. 0:32:08
If we look at LinkedIn, there are about 2 orders of magnitude, so you’re a hundred times more likely to link to a random friend of a friend than to a random person 3 hops away. This is true on Facebook also.
I ran this yesterday. I was looking at one small country because it was quick and easy. I was looking over the entire evolution of Facebook at where the new links were coming from, and 92% of them were links to friends of friends. This is really good from a practical point of view also, because if it weren’t the case we’d be in trouble. Already with friends of friends: for Facebook, the average friend count is something like 130. On average, if we multiply that together, a person is going to have 17,000 friends of friends. If you had to go any further than that, you’re getting up in the many millions. Doing something like that for all users would be hopeless.
That’s the average, so if you think about the power users who have 5,000 friends, there are people on Facebook who actually have over 1 million friends of friends, just 2 hops away.
Student: What does the zero …
Lars: This is people to whom there was no path. This line is showing how many friendships were created as a function of how many hops away they were. I threw in this one here. This is friendships to people to whom there was no path. That’s mostly new users, or just two people that were new users who connected up. Whenever you’re a new user, there is no path to wherever it was you made your first connection to.
Student: … average number of … friends a lot higher….
Lars: It is. On a subsequent slide, it’s closer to 40,000. That’s because there is some sort of correlation. If you’re a person who has many friends, chances are your friends also have many friends. A simple example is kids in high school and also college students, who tend to have more friends, and they also tend to be friends with people in the same demographic who also have more friends. When we do this, it turns out the average is about 40,000.
Student: Do you have an idea of what percentage of your users are people whose networks don’t grow? It’s analogous to the guest user on a website, who never wants to register for an account but wants to be there to use the site totally anonymously.
Lars: I don’t have a really good sense of what that percentage would look like. Facebook is all about social content, so for a user like that, unless you’re a big Farmville fan or something, and even Farmville is about social content, there is much less for you on Facebook as an anonymous guest than there would be on a lot of sites where you would do that, I would think, but I don’t have any concrete numbers to back up that intuition.
This gives us some idea of what we should do. It allows us to narrow the scope of our problem.
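Restricting candidates to friends of friends, as described above, amounts to a two-hop expansion of the graph. The sketch below uses a hypothetical toy adjacency map and function name; it also reproduces the back-of-the-envelope figure from the talk (130 friends each, so about 130 × 130 candidates before removing overlaps).

```python
def friends_of_friends(adj, user):
    """Return the users exactly two hops from `user`.

    adj maps each user to the set of their friends (undirected graph).
    Direct friends and the user themselves are excluded.
    """
    direct = adj.get(user, set())
    candidates = set()
    for friend in direct:
        candidates |= adj.get(friend, set())
    return candidates - direct - {user}


# Hypothetical toy graph.
toy = {
    "me": {"a", "b"},
    "a": {"me", "b", "c"},
    "b": {"me", "a", "d"},
    "c": {"a", "d"},
    "d": {"b", "c", "e"},
    "e": {"d"},
}
print(sorted(friends_of_friends(toy, "me")))

# Naive estimate from the talk: average friend count squared.
naive_estimate = 130 * 130  # 16,900, i.e. roughly 17,000
print(naive_estimate)
```

The naive product ignores overlap between friend lists and the correlation Lars mentions (people with many friends tend to have friends with many friends), which is why the measured average he quotes is closer to 40,000.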
Now we’re going to go from who should we suggest, to which friends of friends should we suggest. You can imagine me, here are all my friends, and I want to find out, of all these friends of friends, who is the best person to suggest. 0:35:48
We wanted to figure out what features are going to help us pick from these 40,000. We wound up looking at different network features and also some of the demographic features. A first thing that’s obvious is the number of friends you have in common with that person, how many mutual friends. If I have 40,000 friends of friends, with most of those people I only have 1 friend in common. The ones where I have more friends in common are likely to be better suggestions.
When we look at the data, this is borne out. Here on the X axis is the number of friends we have in common, and the Y axis is the relative probability on a log scale. This isn’t the exact probability, so it doesn’t go to 1. It’s the relative probability, so there is some constant you would have to multiply by. The upshot is, if we look at a couple of data points, if we look at the difference between 1 mutual friend and 10 mutual friends, there is about a factor of 12 difference. If you were somehow to get up to 100 mutual friends without actually friending the person, your intuition would be that with 100 mutual friends you should have found that person before, but when you look at the data you’re much more likely to become friends with those people. That’s a good start, just looking at mutual friends.
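The baseline described above, ranking friends of friends by the number of mutual friends, can be sketched in a few lines. The graph and names are hypothetical, and this is only the simple starting point Lars mentions, not the model Facebook actually ships.

```python
from collections import Counter


def rank_by_mutual_friends(adj, user):
    """Rank friends of friends by how many mutual friends they share with `user`.

    Returns a list of (candidate, mutual_friend_count), best first; ties are
    broken alphabetically so the output is deterministic.
    """
    direct = adj.get(user, set())
    counts = Counter()
    for friend in direct:
        for candidate in adj.get(friend, set()):
            if candidate != user and candidate not in direct:
                counts[candidate] += 1
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical toy graph: "x" shares three friends with "me", "y" shares two.
toy = {
    "me": {"a", "b", "c"},
    "a": {"me", "x", "y"},
    "b": {"me", "x"},
    "c": {"me", "x", "y"},
    "x": {"a", "b", "c"},
    "y": {"a", "c"},
}
print(rank_by_mutual_friends(toy, "me"))
```

Each candidate's count is incremented once per shared friend, so the cost is proportional to the total size of the user's friends' friend lists.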
If you had to pack something together and ship it next week, that would be a good place to start; suggest based on the number of mutual friends. It turns out we can do better by incorporating some more complicated features. I’ll put up one of the things we used. This is a bit scary and complicated so I’ll go through it. What are the other network features that might be useful? One of the important things we found is the edge creation time is a very strong signal for us. You can think that if you have a close friend and that close friend makes a new friend, that’s probably a good suggestion. That might be a better suggestion than the person who works at the same company as you down the hall that you have 10 mutual friends with but you’ve always had 10 mutual friends and you always will, but you don’t know them socially. The summation is over all of the mutual friends that you have with the person. Say I have 5 mutual friends with some specific person; we’ll sum over those 5 people. We have these 2 time terms. These are the times on those 2 edges. If I have a mutual friend with some person, there are 2 edges between me and that friend of friend. That’s what these 2 delta terms are. We multiply them together and we take them to this negative power, which says new edges are more important. We divide by the square root of friends because that helps to diffuse the influence. If I had a friend with 5,000 friends, all those 5,000 people should probably get less influence than if I have a friend with 20 friends. I wouldn’t want you to come away with the message that this is how we do it. This is an example of one of the many kinds of features we’re able to use and glean from the social network and from the metadata we have associated with it. The time stamps on the edges, the number of friends that everybody has, and things like that. 0:39:09 Student: I don’t understand the graph. Relative probability…. 
Lars: That’s why it’s relative and not the actual probability. How do I do this? Think about one exposition – 10. I counted for all the friendships that ever occurred, I have a weight of 1 that I’m going to put into the denominator of everything but I’m going to distribute that amongst all of the friends of friends. You may have 10,000 people with 1 mutual friend and you have 1,000 people with 2 mutual friends. The amount of weight you put into the denominator of this fraction is that. You put 1 into the numerator. I’d have to go through the math but we can talk about that online. The normalization is a bit complicated, but it says that people in this bucket are 12 times as likely and you can think that the probability can’t be greater than 1, so there is some constant that you should multiply everything by for it to make sense. It would have been simpler if I had left off the labels on the Y axis. Student: Where do you get the negative….? Lars: We tried a few things and some of them worked better than others. There is no real principled reason for why it should be that. We tried 0.1, 0.2, 0.3, and 0.3 was one that worked well. There are other features that are of a similar flavor to this that go into the model at the end of the day. Student: I think the time of creation of those edges might … friends and he might not join Facebook at the very beginning…. Lars: Absolutely, we have to make the most of the data we have available to us. At the end of the day you do what you can. The time stamps we have are not necessarily representative of reality but they’re still a good signal for us. 
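[Editor's sketch] The time-decayed mutual friend feature Lars describes can be written down roughly as follows. This is a minimal sketch, not Facebook's implementation: it assumes each "delta" is the age of an edge in days, that the divisor is the mutual friend's own friend count, and it uses the exponent 0.3 mentioned in the answer above. The function and argument names are hypothetical.

```python
import math

def time_decayed_mutual_friends(mutual_friends, exponent=0.3):
    """Sum over mutual friends of (dt1 * dt2) ** -exponent / sqrt(friend_count).

    mutual_friends: list of (age_of_my_edge, age_of_their_edge, friend_count)
    tuples, one per mutual friend. Ages are assumed to be in days and > 0.
    """
    total = 0.0
    for age_my_edge, age_their_edge, friend_count in mutual_friends:
        # Newer edges (small ages) give a larger term because the power is negative.
        recency = (age_my_edge * age_their_edge) ** (-exponent)
        # A mutual friend with many friends spreads less influence to each of them.
        total += recency / math.sqrt(friend_count)
    return total

# One mutual friend, both edges one day old, mutual friend has 4 friends.
fresh = time_decayed_mutual_friends([(1.0, 1.0, 4)])
# Same mutual friend, but both edges are 100 days old.
stale = time_decayed_mutual_friends([(100.0, 100.0, 4)])
```

With these made-up inputs the fresh pair scores higher than the stale one, which is the behavior the talk describes: new edges count for more, and well-connected mutual friends count for less.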
We first talked about the features that we use, so how do we use them. The answer is we do some fairly straightforward machine learning. For every user, like me, we consider all the friends of friends, these “W’s”, all pairs, so me and one of my friends of friends and we generate a bunch of features, things like the mutual number of friends, this super complicated time decaying number of mutual friends, and a few other things in that same genre. We also add some other features that are just for you or just for this W, like age, gender, country, or whatever demographic information we happen to have about the person. Sometimes those things are missing and we don’t use everything we have, but you get the idea. We take that and generate a bunch of training data and we use [bagged] decision trees; if you’re familiar with what that is, we just do the simplest thing. We take our training data, train a bunch of different decision trees using different subsets of the data and then we average their results together. It looks like this; we have the features that come in, like there are 5 mutual friends, the source node is a guy, he and the target is 23, etc. Then it goes into all these different trees that spit out numbers and we average them together to get the final output. 0:43:02 One thing you have to think about whenever you are taking this approach is where does your training data come from. Whenever you’re doing this classification in machine learning, you need to have some labeled training data. Some of the cases are negative and some of them are positive. 
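[Editor's sketch] The bagging idea described above (train several trees on different subsets of the training data, then average their outputs) can be illustrated with a toy example. This is only a sketch under stated assumptions: it uses depth-1 trees ("stumps") rather than full decision trees, bootstrap resampling as the way of choosing subsets, and invented data with two hypothetical features (mutual friend count, age). None of this is the actual PYMK model.

```python
import random

def train_stump(rows):
    """Fit a depth-1 tree: the (feature, threshold) split with lowest squared error."""
    best = None
    for f in range(len(rows[0][0])):
        for threshold in sorted({x[f] for x, _ in rows}):
            left = [y for x, y in rows if x[f] <= threshold]
            right = [y for x, y in rows if x[f] > threshold]
            if not left or not right:
                continue
            lm, rm = sum(left) / len(left), sum(right) / len(right)
            err = sum((y - lm) ** 2 for y in left) + sum((y - rm) ** 2 for y in right)
            if best is None or err < best[0]:
                best = (err, f, threshold, lm, rm)
    if best is None:  # no usable split in this sample: predict the overall mean
        mean = sum(y for _, y in rows) / len(rows)
        return (0, float("inf"), mean, mean)
    return best[1:]

def predict_stump(stump, x):
    f, threshold, left_mean, right_mean = stump
    return left_mean if x[f] <= threshold else right_mean

def train_bagged(rows, n_trees=25, seed=0):
    """Train each stump on a bootstrap sample (drawn with replacement)."""
    rng = random.Random(seed)
    return [train_stump([rng.choice(rows) for _ in rows]) for _ in range(n_trees)]

def predict_bagged(trees, x):
    """Average the outputs of all trees to get the final score."""
    return sum(predict_stump(t, x) for t in trees) / len(trees)

# Toy labeled pairs: ((mutual_friends, age), 1 if the friendship was created else 0).
rows = [((1, 23), 0), ((2, 31), 0), ((1, 45), 0), ((3, 19), 0),
        ((8, 23), 1), ((12, 30), 1), ((10, 41), 1), ((15, 22), 1)]
trees = train_bagged(rows)
many_mutual = predict_bagged(trees, (14, 25))
few_mutual = predict_bagged(trees, (1, 25))
```

On this toy data the averaged score for a pair with 14 mutual friends comes out above the score for a pair with 1 mutual friend, matching the intuition that mutual friends carry most of the signal.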
In our case, the positive cases are friendships that were created. For the negative cases there are a few different possible choices you could make. Naively all the negative examples are friendships in all the world that were not created, but that’s a bit too much to deal with. We actually use the data from PYMK to sort of train subsequent models. We show you a bunch of suggestions and you add some of them and don’t add some other ones. The ones you add, those are the positive examples, and the ones you don’t add are the negative examples. What we’re grappling with in trying to understand all the influences is what that sort of feedback loop does in the long run. We’re training on the previous iteration of the models and how does that impact the long-term solutions? Of all the features, the mutual friends are the most important, so the things related to the social network are by far the most significant in making all these predictions. Some of the other network related features, like how many friends people have are also important, new users for instance are potentially good suggestions because they’re people you might know outside of Facebook but they’ve just joined the site. There is demographic stuff like age and gender which helps a bit but it’s secondary to the network things. Student: What do you mean by [time scale] of friends? Lars: That’s the complicated formula that I put up, things like that. We have this model and it gives us this number at the end of the day. Doing all that is all very expensive and if you start multiplying, you get some pretty astronomical numbers. We said the average person has 40,000 friends of friends. Facebook now has 400 million active users, so we multiply those together, and we get up to 16 trillion. We have to generate all these features and do some sort of evaluation for 16 trillion pairs. 
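[Editor's sketch] The scale estimate above is a single multiplication, and it can be checked directly. The first two numbers come from the talk; the 200 bytes per pair is an illustrative assumption standing in for the "few hundred bytes" Lars mentions when asked about storing the results.

```python
FRIENDS_OF_FRIENDS_PER_USER = 40_000   # average quoted in the talk
ACTIVE_USERS = 400_000_000             # active users quoted in the talk

# Number of (user, friend-of-friend) pairs that need features and a score.
pairs = FRIENDS_OF_FRIENDS_PER_USER * ACTIVE_USERS

def storage_terabytes(bytes_per_pair):
    """Storage needed to keep one record per pair, in decimal terabytes."""
    return pairs * bytes_per_pair / 10**12

one_byte_tb = storage_terabytes(1)        # lower bound: 1 byte per pair
assumed_pb = storage_terabytes(200) / 1000  # assumed 200 bytes per pair, in petabytes

print(f"{pairs:,} pairs, {one_byte_tb:.0f} TB at 1 byte, {assumed_pb:.1f} PB at 200 bytes")
```

Even the 1-byte lower bound is in the tens of terabytes, which is why the suggestions are regenerated in batches rather than stored and updated incrementally.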
To do this we have these dedicated racks of machines and they’re all running full tilt all the time, going through all the users and generating new suggestions. Each of the machines holds a fraction of the social graph because it’s way too big to fit in memory on any one machine. Even with all these machines, we’re only able to generate these new suggestions every few days. For more established users, we generate new suggestions every 3.5 days. For new users their social network is changing more often so we generate new suggestions for them more often. 0:46:14 Student: Is there anything related to the population, like a barcode basis, like as things change you update things marginally? Lars: No, it’s way too much data to store in a practical way. 16 trillion, even if it was only 1 byte per pair, that’s 16 TB and in reality it would probably take at least a few hundred bytes per pair so you’re getting up into Petabytes. Student: Are these supercomputers? Lars: No, they’re regular computers. Everything is regular computers. Supercomputers are too expensive, I think. They just gave them to me. They’re run of the mill machines. They have 70GB of memory and 8 core CPUs in them, but other than that they’re not special. The important thing to take away from this is we’re only able to do this every few days. That means for each user, if they come to the site all the time, they’ll go through them. One of the things we build on top of this is we want to do a bit of work on every impression to try to show the best suggestion we have available every single time. 
To do this, we keep track of all the suggestions we’ve already made to you. This curve is showing you how the click through rate decays as the number of impressions increases. You can see the first time we make a suggestion is when we get the highest click through rate. The second time it’s dumbed down some amount and it continues to decay. By the time we’ve shown it 10 times, the click through rate is very low. This curve is approximately 1 over the number of impressions. Student: When you say impression, do you mean … Lars: Number of times you’ve seen that suggestion. By impression I mean a pair, a user, and a suggestion. We can’t afford to – they keep the response time in the website up so we can’t do a lot of hardcore computation for every impression, but we can do a bit. We actually combine the output of this expensive offline model which gave us these original suggestions and original scores with a bit of information. The number of impressions is the big thing but also we have a few other features. Then we do some simple logistic regression to re-rank every time. Just to drive that home, you can imagine that we have these 4 suggestions for a person and they all have some CTR prediction associated with them. At first, you’ve never seen any of them. You come to the website and you see these 2 and consequently, their CTR predictions drop. Now the next time you come and you see those 2 people. You come back again and things change. You go back to seeing Alice and Bob who you saw the first time. 0:49:23 The third time you come to the site, Bob decayed a lot. We think Alice is a good suggestion and we think you should be friends with her. She’s really nice. We’re going to show her to you a couple more times. That is what our model tells us to do. 
The point is every time you come to the site, we’re going through the list of suggestions this offline model made, and re-ranking them and showing the ones we think are the best. Let’s put it all together and summarize how the whole system works. At first, we want to generate some suggestions for some user so here is me. I come in and we have this big offline system that goes and finds all the friends of friends and it does all this feature creation: how many mutual friends do you have with all these people, what are all these other features associated with that pair of users. That goes through these decision trees and they spit out some scores. They’ll say the score for Lars and Greg is .045 or something like that. Now, I come to the website and we go and look to see how many times have I seen Greg before. Maybe I’ve seen him 3 times before and I’ve seen Shelly a couple of times. We combine that with the scores from these bagged decision trees, and that gives us some sort of final CTR prediction. We take the top 2 and show them on your page if they exceed some threshold. We try not to spam you too much so that there is some threshold so if this number doesn’t exceed that we won’t show you any. That’s why you might not see them. Every so often the results feed back into both of these models to continue training and improving them. This model here is very dependent upon this one. If we retrain these trees so the types of scores they output change, then it’s very important for us to retrain this also because they need to work in concert. 
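[Editor's sketch] The two-tier flow summarized above (offline score, combined with the impression count by a small logistic regression, then top 2 above a threshold) might look roughly like this. It is a sketch under assumptions: the weights, the log transform of the impression count, the threshold, and all scores except the .045 quoted for Greg are invented for illustration, and the real re-ranker is said to use about 5 features rather than 2.

```python
import math

def ctr_prediction(offline_score, impressions, weights=(-1.0, 20.0, -1.0)):
    """Logistic regression on the offline score and the number of past impressions.

    weights = (bias, weight on offline score, weight on log(1 + impressions));
    these values are made up, not learned from data.
    """
    bias, w_score, w_impressions = weights
    z = bias + w_score * offline_score + w_impressions * math.log1p(impressions)
    return 1.0 / (1.0 + math.exp(-z))

def pick_suggestions(candidates, impressions_seen, top_k=2, threshold=0.05):
    """Re-rank the offline candidates and keep the top_k that clear the threshold."""
    scored = [(ctr_prediction(score, impressions_seen.get(name, 0)), name)
              for name, score in candidates.items()]
    scored.sort(reverse=True)
    return [name for ctr, name in scored[:top_k] if ctr >= threshold]

# Offline scores from the batch model (Greg's .045 is from the talk, the rest are invented).
candidates = {"Greg": 0.045, "Shelly": 0.040, "Alice": 0.060, "Bob": 0.050}

first_visit = pick_suggestions(candidates, {})
# After Alice and Bob have each been shown 3 times, their predictions decay.
later_visit = pick_suggestions(candidates, {"Alice": 3, "Bob": 3})
```

With these invented numbers the first visit shows Alice and Bob, and once those two have been shown a few times the other two candidates take their place. Raising `threshold` high enough returns an empty list, which corresponds to showing no suggestions at all.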
Student: How do you … the modeling you use for people you know… for example in the news feed, rather than … populated according to time, if you did some sort of sorting on who your mutual friends are and have you tried it? Lars: Our system that we built is only about a month old so some of the things we’re doing are definitely not happening anywhere on the site. For news feed, I know they have some concept of how important the person is to you that they try to learn. I don’t know the details of how that system works. It’s definitely – this sort of thing we’ve talked to other teams about and we’re exploring how this sort of thing can be applied in other places. As far as I know, People You May Know is the most sophisticated in terms of finding that sort of ranking. Student: When you do the CTR, if I go to the homepage and the “People You Might Know” suggestion is mutual but I don’t click X, I just go on, does that count as a click through? Lars: It counts as a negative example. If you don’t do anything, it just – when we’re predicting the click through, we’re predicting the probability that you’ll add that person on that page impression. Student: So there is a difference in that and when I click yes. 0:53:06 Lars: Yeah, that’s a bit complicated and I don’t want to get into the details of how we deal with those two different signals. Student: I’m curious if you actually create a model of the person’s popularity – the model you were talking about is tree based so it’s much more… are features generated that create general compatibility or popularity metrics? 
Lars: We have some of these demographic features, like your friend count and how long you’ve been on the site. I guess in principle it’s possible that the trees are learning something like what you’re suggesting, but we don’t have anything explicit like that in there. Andreas: Let’s do some pair exercise. [Break] … with all kinds of ideas. How do we know whether something is working? What I want you to do is to spend a couple of minutes talking to your neighbor and come up with metrics that measure how well the system actually performs. I just will throw up a few examples to get you started. What is the CTR on a friend? What is the conversion using ebusiness language, that the person gets added as a friend? What is the second order effect, that the other person accepts the add? Returns in business are very expensive and in our case that means unfriending someone. You tried them and he spams, so forget it. Stockpiling people you would have added anyway. What is the value of the friendship? Is there some notion of social capital? These are things that are metrics and I want you to talk to your neighbor for a couple of minutes. Then we’ll collect a few ideas from you about what might be metrics that help him evaluate ideas and experiments. Talk to your neighbor. If you don’t have one, move up. How would you measure the quality of the system? That’s the question right now. Let’s see what interesting ideas we have for metrics. How do we know the system is doing better than random, or should we from a Facebook perspective just show ads? That is why we want to talk about metrics. Let’s hear some examples. Student: The point about the perspective of Facebook – it could be interesting to see how much more page use someone generates. Whether that leads to more activity. 
Another thing we thought about is if that person has no friends whatsoever, probably it’s about the confidence that this person would befriend, but also if you have a slot of only 2 people, if we give him someone we feel would stimulate his activities, someone who is super hot and posts a lot and this guy – it might kick him into action as well. Andreas: What you’re suggesting are features to build the algorithm with. If you have few friends, you might make different recommendations than if you have many friends. What I want to focus on here is the metrics. How do we know we’re doing well? 0:56:48 Student: We were thinking about metrics of the expected comment number someone has and then conversely the expected number of people who then hide them from their feeds or their comments. Rank users based on they get 3 responses every time which is better than someone who gets none. Someone who gets hidden could also … Andreas: Facebook knows whether people tend to get x’d out of the feed, so you might not want to recommend those people. It’s a feature more than a metric of knowing how we’re doing. Student: My neighbor has a friend who friended her and she’s a big pain in the ass now. It was a high school friend she doesn’t want to talk to anymore. We are suggesting if there is a really big imbalance between the messages, like someone is sending out versus receiving then maybe the friendship isn’t actually … user experience on the site. Maybe it’s a negative in their life. Maybe you could track whether the friendship was a good addition in their life by high feeds. 
You could also do the balance of messages flowing between and whether one person was using a lot more words in the message than the other. Andreas: Good one, so I heard two things. One is people unsubscribing and the other is symmetry is a good thing. Student: We had an idea where after you friend this person, how much value do you get; how many other friends do you get from this friendship, recursively looking at the suggestions. Do you get a lot of good suggestions after friending this person? That would be hard to do but it would be interesting to see what the results are. Andreas: Also, how many of the articles in the news feed do you click on that come from this friend. Student: … value of the friendships and number of page views the user had for the new friend because if you add certain people and never revisits them, views the page, then what is the value? Student: We were thinking more about some guys get into Facebook just to have as many friends as possible so they can advertise a product or something. Why would you unfriend someone? If you already accepted them, why did you unfriend them. If there could be a metric on if you unfriend the person, how is it related to how many messages or wall posts or things like that which they sent because that could be someone spamming… that person was my friend and my other friend gets … third friend of mine, he might be suggested to add that friend… 150 friends in common but I would…. Student: Isn’t there a feature on Facebook where you can suggest to other friends? Compare those two. Lars: Compare in what sense? Student: Conversion rates, like if I suggest to Noah, that he should also know Joe, but you also do it then it gets a better result. 1:00:22 Lars: I hope that you would get better results, but that doesn’t really help us evaluate how we’re doing. It just tells us that we’re losing because we’re never going to do as well. 
Student: But you could be improving. Lars: I would say one common thing here is although you all have good ideas, the things we look at are much simpler. I would say that with photo tagging or the number of comments and things, you need to be careful. If this is what your goal is going to be, it’s going to heavily influence your product. If my goal is to maximize the number of co-tagged photos that these people are interested in, then I’m going to do well by suggesting prolific photographers, which is not necessarily the right thing. With that caveat, I’ll talk about what we look at. At a high level this gives you an idea of how well we’re doing in terms of how well are we finding new users their friends? This pie graph is just looking at new users, so people that have only been on the site for a little while. Of all of the friends they make how many of them come through PYMK? It’s a pretty big slice. It’s much smaller for more established users who are more familiar with the product, but if you’re a new user on the site and you don’t really know what’s going on, having that person’s face appear in the add as a friend link is very powerful. It’s the number one single source of new friends in the product. Things we look at – one thing that’s sort of obvious is the CTR. Of all the suggestions we show, how many of them get clicked on, what is our conversion? Another thing is what is the total number of friendships we’ve created so we have this on the homepage for everybody; how many total number of friendships are we creating? Then you could look at the add/remove ratio. 
We talked about this briefly but there are three things you can do with a suggestion. You could either click “add” or completely ignore it and do something else. Or you could click on the X to remove it. Maybe the X’s are a stronger signal than the ignores. All these metrics have some sort of pitfall. If you were just to try to optimize the CTR or the add/remove ratio, a good way to optimize that is to show fewer impressions. Only show the crème de la crème of the impressions, the ones you think are really good and have really high CTR estimates. You’ll end up with a really high CTR but at the end of the day you’re showing few impressions so you’ll end up creating very little total value, very few clicks overall. The other side of that is if your metric was only total clicks, then you would go crazy with impressions and you’d always be suggesting things, even when you didn’t really have a good suggestion that you hadn’t shown to the user over, and over again. That creates some sort of negative user experience, where people are getting spammed with suggestions. Student: What is the actual reason for helping people friend more people? Why would you do that over showing ads or something else that would … 1:04:00 Lars: That’s a good question. I guess I skipped over that on this slide. In order to create a good experience there is a fairly significant downstream impact of having new friends. Facebook is all about connecting with friends and family and being able to share things. 
Especially for new users, but also for established users, finding that person’s friends, the people they’re interested in on the website makes the product that much more valuable. It’s not the case that the more friends you have the more valuable the site is. You see these people who have thousands of friends and you think that’s silly, but for people who are less technically savvy or having a hard time finding their friends or for all these people, if we can help these people identify the people they care about and help connect them, that’s creating a lot of value for the users which leads to all sorts of good things downstream. Student: Why do you take into account the number of friends people have? Student: When you suggest a friend, does the same suggestion show up in the other person’s homepage? Lars: No, it may but there is no reason. It’s not necessarily a symmetric relationship. If a person only has 1 friend, then there is a very limited pool of people we can sample from. We’re only suggesting friends of friends so if you have 1 friend and that person has 20 friends, there are only 20 candidates. We have to pick one of those 20, but for the person on the other end that has hundreds of friends, that person who only has 1 friend is maybe not a good suggestion. There is probably someone better. Student: Based on the networks and information you have on your profile, wouldn’t it be a good way to suggest – instead of suggesting a specific friend, suggest “do you want to check more people in this network?” Like more people who went to your high school, people that worked at this place, so you could be browsing your homepage and say I want to know who else lives around me? Lars: I think there are some products on Facebook like that. 
I think there is a “Find People At My School” product, and there is also this thing related called “contextual PYMK” which is you go and add a friend and we try to make recommendations based on a system sort of like this except also taking into account the context. I add Jim as a friend and now it suggests I’m sort of – he’s my new friend so it’s going to suggest his friends to me. It’s sort of taking the system and adding the additional constraint that they also are my new friend’s friends. We do some things like that. Student: Do you also recommend friends of friends of friends? Lars: No I talked about why we don’t do that. We can go back and I can answer that offline. Let me wrap up. At a high level you should think what we really care about is somehow creating the most value in the whole ecosystem. That’s hard to quantify. At some level, every time we manage to create a friendship that adds some value to that person, presumably. That person is going to the trouble of adding a friend so they’re getting something out of it. We’ve done a good job. 1:07:36 If we make a suggestion and you ignore it or delete it, then we’ve done a bad job and there are some costs associated with that. We’re potentially annoying our customers by suggesting people they don’t know but then there is also the opportunity cost. We could have shown something else in that space, like an ad or there are other modules we can fit up there. Of course there are other issues to consider. You might ask if you suggest my girlfriend to me, I would have found her anyway. 
You’re really just cannibalizing other channels, not adding anything. This is a question we look at; how many of these friendships that we’re suggesting would have been found through some other channels, not what is the absolute value, but what is the incremental value we’re creating? Finally results – this new system has replaced what we had a couple months ago. I’ll show you the improvement we’ve managed to make. This is showing 30 days. At the top you have the total number of add clicks. When we first started we did something good and screwed up for a while, but we’ve fixed it. You can see that on this graph too. Compared to 30 days ago, we’re doing pretty well. Our CTR is up 30% and also the total clicks are up 60%. Total value is up a bunch, we think, but we don’t really have a good idea of exactly how to measure that so it’s hard to put a number on. So far, we’ve made a lot of big improvements. I would say that compared to before, the biggest improvement is coming from the 2-tier system where we do the expensive part in this batch mode and then do the fine-grained adjustments in real time. Both the systems we have are still pretty simple so there is a lot of room for improvement. For the offline system there are a lot of potentially relevant features that we just don’t use because we haven’t gotten around to them. Those are things like how many photos are you tagged in with a person, or how many messages have you sent, what is the relationship status, dozens of things we think might help but we haven’t tried yet. Machine learning – we’ve only tried this one method, these bagged decision trees. We should probably try some other things and see what works best. At a higher level, we’re still not exactly sure what the best way to collect training data is. Right now we have two ways we’re experimenting with. One is positive examples are add clicks. Negative examples are remove clicks. 
The other way is the positive examples are the same, the negative examples are all the impressions that you didn’t click, either remove or ignore. This real time system we have on top is super simple. It’s just a regression with 5 features. We’re working on adding a few more features that we think will improve that further. 1:10:35 Three takeaways from all of the things we’ve learned over the last couple of months, the first is figuring out exactly what your goals are is very important. If you don’t do that you’re trying to hit a moving target and it makes everything much more complicated. We’re still struggling with that. You want to make sure you’re optimizing the right thing or at least if you can’t figure out exactly what to optimize, don’t totally blow it. Feeding into this is making sure you’re looking at more than just the positive ratios. You want to look at misses as well as hits. We create a lot of friendships but maybe we’re still showing too many impressions, still annoying our users too much. People are always complaining to me. Finally, you have in any of these systems, the real world pops up and there is some limited amount of computational power. One of the biggest improvements that we’ve made in terms of our performance is making the most of that and doing the expensive parts in batch mode but then doing some sort of cheap, simple improvements that give us a big performance boost on every impression. I can answer some more questions if we have time. 
Student: Cannibalizing friendships, it was surprising to me that you said before every friendship is valuable, every friendship helps. Why not make this friendship easier, faster and as soon as possible? Lars: That’s true and our belief, but there might be other high value things you could show there. There are all kinds of different things you can put in that space and if you can create a friendship that is high value then that’s great; our CTR is not that high so we don’t always know if it’s high value. If it’s a question of do I show 100 impressions of people PYMK to get a couple of friends that the person would have gotten by typing it into the search box anyway, that might be less valuable than showing 100 ads. I don’t know. It’s something to think about. Andreas: One is the goals or objectives matter. Someone said should we have more ads or should we have more friends. The ecosystem matters, the health of the ecosystem. Less interrupts and irrelevant stuff. The metrics matter. Finally, computation matters. On that scale, you really do need to be smart about what you compute in batch mode and what you do in real time. Next Tuesday your project ideas are due. Bit.ly is coming to class. After class we have in the Business School, the Real Time Web event, the VLab event. Thursday the Bit.ly homework is due. Once we get the data from them we will tweet from @socialdata and you know where to find the data which are to be analyzed. Finally, for the speaker on Tuesday, the Facebook.com/socialdatarevolution page is a good way to ask some questions ahead of time and to wrap our brains around what we want to learn. That’s it for today.