MS&E237 Spring 2010, Stanford University
Andreas S. Weigend, Ph.D.
The Social Data Revolution: Data Mining and Electronic Business

Andreas Weigend (www.weigend.com)
The Social Data Revolution: Data Mining and Electronic Business: MS&E 237, Stanford University
April 6, 2010
Class 3: Course Content, Wikimedia, SKOUT

This transcript: http://weigend.com/files/teaching/stanford/2010/recordings/audio/weigend_stanford2010_3coursecontentWikimediaSkout_2010.04.06.doc
Corresponding audio file: http://weigend.com/files/teaching/stanford/2010/recordings/audio/weigend_stanford2010_3coursecontentWikimediaSkout_2010.04.06.mp3
To see the whole series:
Containing folder: http://weigend.com/files/teaching/stanford/2010/recordings/audio/
Course Wiki: http://stanford2010.wikispaces.com
Transcript - Tamara Bentzur - Testimonials – www.tbentzur.wordpress.com, www.outsourcestranscriptionservices.com

Andreas: Welcome to class number three of the Social Data Revolution. I looked through all of the non-binding initial proposals for your projects, your ideas, and I identified one very serious problem: you think it’s very easy to get data. It’s actually not easy. For instance, Intuit was not able to give us data because their legal department is worried. If you think about it, if someone makes up a story that during tax season Intuit is actually sharing some data that could be personally identifiable with a class, it might make their stock drop by 1%. That is certainly not a risk they were willing to take.

I want to start today’s class with one data source we can happily use, which is the data of the Wikimedia Foundation. You all know Wikipedia. Who of you has edited an entry in Wikipedia? That’s a small number.
Why did the rest of you never change or edit anything? Was it too much work, too unclear, too hard?

Student: The level of organization needed to reformat a page would be too much, so I wouldn’t bother. Most of the time I wouldn’t find too many errors, I guess.

Andreas: We have Eugene Kim and [0:01:47 Haway Fung]. [Haway] went to Stanford, and Eugene went to Harvard. They went to good schools. I hear Harvard is somewhere on the east coast.

Eugene: The Stanford of the east coast.

Andreas: They do work with Wikimedia. I thought I’d take the first 15 minutes today to give them the platform and give us some ideas of what we could be getting in there. In the spirit of social data, Wikimedia/Wikipedia is a beautiful example of people knowingly and willingly sharing data. At the very end of class today we have a startup I work with on the board, SKOUT, which is another example of a company where you can get some ideas about what you could do with data that people are happy to share. I will turn it over to you.

Eugene: It’s a pleasure to be here and to talk with you all about Wikimedia. I want to do a few more interactive things, just to get … raise your hands. I’m going to ask some questions and if you qualify, if your answer is yes, I want you to actually stand up. The question is how many of you have ever accessed Wikipedia? If you’ve ever accessed Wikipedia, stand up. Very good. Are you not standing because of the laptop? 100% here. If you accessed Wikipedia today, remain standing. Okay, and then we have the number of people who said they had edited. If you have ever edited Wikipedia, remain standing. If you have ever edited Wikipedia more than once, remain standing. 3 people, very nice. Have you been editing Wikipedia for more than a month? Remain standing. 2 people left. How long have you been editing Wikipedia?

Student: I haven’t done it in the last month but I’ve been doing it for several years.

0:04:18 Eugene: What are some articles you’ve edited?
Student: Mainly around different types of web development and geography specific stuff that has to do with what I do for my web -

Eugene: Stuff that you’re personally interested in. How about you?

Student: The same, maybe 3 or 4 years. Not to make a point out of it, but when I’m reading an article I’m interested in and I see something missing or wrong, I’ll correct it or add it. Frequently it’s around politics, music, or web development, things I’m interested in.

Eugene: Very good, thank you. Everyone here has accessed Wikipedia before. Everyone here knows you have accessed Wikipedia before. I presume everyone knows what Wikipedia is. This is your professor’s Wikipedia page. Wikipedia is basically an encyclopedia that anyone can edit. You can see that edit tab. This is the new Wikipedia interface. Probably most of you haven’t seen this yet. It will roll out in the next couple of weeks. This will be what you see. One of the things you’ll notice about Wikipedia is that it’s edited by anyone. If you look at the revision history of this particular page, you can see that it was last edited in July of 2009, by JamesAM. Do you know who that person is? Do you recognize any of these names?

Andreas: Toby.

Eugene: Toby, who is not on it. He’s probably further down this list. Basically, if you have a Wikipedia page or if there are things about you, things you’re interested in, it’s quite possible that completely random people are contributing to pages about you. Somehow it works, and it’s amazing. On the English Wikipedia there are about 3 million articles.
All of the Wikimedia sites combined essentially make up the 5th most accessed website in the world today; approximately 400 million people access Wikipedia every month, which is pretty cool. What people don’t necessarily know is there is actually a vision underlying Wikimedia. This vision has actually been in place for a long time. The way it’s articulated best: imagine a world in which every single human being can freely share in the sum of all knowledge. That’s our commitment. It’s a lofty goal.

0:06:43 What I’ve been working on with the Wikimedia Foundation, and the question we’re grappling with, is how we are doing towards that goal. We know we have big projects now, and we know there are a lot of people who know about it, a lot of people access it; how close are we to getting to the point where everyone in the world has access to the sum of all knowledge? These are questions we’re asking ourselves. What I was tasked to do is to basically lead an open strategic planning process where we develop a 5-year strategic plan, meaning we basically decide what the priorities are, what we should focus on over the next 5 years.

If we start with this vision statement, there are a lot of things we can start pulling out of this. One of them is “every single human being” so that’s essentially a question of reach. Are we reaching every single human being right now? This is sort of a theory of change. What it’s doing is articulating all the different factors that are going to contribute to the goal and their relationships to each other.
You don’t have to worry about the level of detail right here; the main thing to point out is there is a virtuous circle between reach, quality, and participation. All three of those things are heavily related to each other. If we improve the reach of Wikipedia or the other Wikimedia projects, that is going to increase quality and increase participation; all of the things are going to reinforce each other. That’s the theory we’re operating on right now.

In terms of reach, we started by asking a couple of questions. We know that 400 million people are accessing Wikipedia every month. Based on the world population, that’s about 15%. The question is what is the actual country breakdown, where are people accessing Wikipedia from? Based on that country breakdown, should we actually be prioritizing certain regions of the world? There are questions about what kind of content people actually want. If people are not accessing Wikipedia, or any of the other Wikimedia content projects, is it because the content they’re actually looking for is not there now? Then there is the question - I mentioned the virtuous cycle before about how reach is related to participation. How do we convert readers to participants? What we saw in the room just now was everyone is a reader, and maybe 2% or so are actually contributors. Is there a way to boost that number? That number is relatively consistent with what we know about Wikipedia right now.

This is the worldwide state of Wikipedia right now, in terms of reach. You’ll notice the dark blue is where we have the best penetration. Are there any Canadians in the room here? You guys are the most active Wikipedia readers right now, 40%. If you can believe it, in the United States, only about 35% of people who are online access Wikipedia. That’s knowingly or unknowingly. When we think about reach, there is actually a significant population of people who are online in the United States who are not actually reading Wikipedia at all.
Then if you look at the developing world, look at Africa. Africa is entirely under 30% in terms of access. If you look at China, India, Brazil, all of those places are clearly possible places we can target. This next slide shows Internet growth.

0:10:15 Student: Do you have any idea who the … who don’t access…

Eugene: No, we have some clues and this is one of the reasons we’re here today. There are a lot of questions we have. There are some answers but a lot more we would like to know, so there is an opportunity to take our data, which is basically all publicly and freely available (that’s one of the ethics of the Wikimedia project), and to do really interesting analysis based on that.

This is showing the growth numbers. You can see that, as before, we’re underrepresented right now in Africa, China, and India, and yet those are sort of the fastest-growing places in terms of Internet access right now. These are actually Internet numbers. If you add mobile as well, mobile boosts those numbers but it doesn’t shift where these regions are growing. For example, Internet growth is very high in China right now; mobile growth is very high in China right now. As you can see from this picture, in terms of prioritization, there are some clear opportunity spaces we should be looking at over the next 5 years, and one of the things we’re trying to figure out is how to target those areas. I want to talk about participation and some of the questions we have.
While all of this content is really great and I think we have enough stuff to start targeting different areas, and there are a lot of opportunities there, participation is the lifeblood of Wikipedia. It is what makes everything work. At the end of the day, if we don’t have people like the people who stood up here as contributors, there is no content for everyone in the world to access. One of the priorities that has emerged over the last 5 years centers around this participation question.

Some of the questions we have are: how do we encourage more participation, and what are the different types of participation? We talked a bit about why people are editing Wikipedia or how people are editing Wikipedia, and there are actually very different levels of activity. There are people who are just correcting typos every once in a while. There are also people who are literally spending several hours a day tenderly editing content, adding content, looking at other people’s pages, fixing grammar, adding new comment, double checking sources, really involved in the community and participating on the meta level as well: talking to other Wikipedians online, forging relationships, and all those other things.

0:12:36 How do we encourage new editors to become active editors? A big question for us right now is around community health. We believe very strongly that there is a strong correlation between the social health of the community and the quality of the content and the quality of participation that emerge. The question is how we measure that. As I pointed out, there is a relationship between participation and quality, so what is that specific relationship? Here are some numbers to give you a picture of what we understand about participation right now.
The metric we’ll use to differentiate between people who have just edited Wikipedia and people who are considered active is this: if you make 5 or more edits per month, you’re considered an active Wikipedian. That’s an arbitrary number. We needed to pick something so we picked 5 and we’ve been using that number ever since. This is about a 4 or 5 year old metric.

This chart measures a couple of different things. Number one, the area of each of the circles represents the size of the different projects. One of the things I didn’t mention before is that Wikipedia actually consists of 250 different projects. Each language gets its own Wikipedia. Each of those language versions is essentially its own community. They have their own contributors, their own governance rules. Occasionally there is some translation between the Wikipedias but for the most part it’s all original content. English Wikipedia, which was the first one that started, that big circle to the left, is the biggest Wikipedia right now. There are over 3 million articles. After that, you can see Germany is close behind and then France, Japan, and the Spanish Wikipedia. The y axis is measuring the number of active contributors. Not surprisingly, the biggest Wikipedias also have the highest number of active contributors. The x axis is measuring contributor growth. These larger Wikipedias are actually not growing at all, or growing slightly. Then we have Russia, which is sort of a mid-sized Wikipedia and it’s growing quite rapidly. It’s an outlier in this stuff.
Then we have the smaller Wikipedias here, and in some cases they’re growing and in some cases they’re kind of dying. They’re kind of [in stasis] right now. This is sort of a picture of where all the Wikipedias are right now.

One of the things some of you might have read in the news recently is that editors are leaving Wikipedia in droves. It’s a typical interesting media spin on things, but one of the things that’s absolutely true is that participation across the different Wikimedia projects has actually tailed off. This is a picture of the number of active contributors per month, starting in January 2001, which is when Wikipedia started, all the way to January 2009. You can see we basically peaked in January 2007, and all of a sudden we’ve got this weird behavior going on. Something happened in 2007 and we don’t know what it was that caused this kind of behavior. There is speculation that maybe there was something policy-wise or some technological change that happened, or maybe the community has just gotten too big and is starting to get unfriendly, so you’ve hit a natural limit.

0:16:12 We’ve found, and this is all research that Ed Chi’s group did at Xerox PARC, that when you look at all of the different projects, around January 2007 you have the same plateau effect, even though all of the projects are self-sustained, they’re all different sizes, and they all have their unique community. Something happened in January 2007 that was probably a worldwide phenomenon that basically affected all of Wikipedia growth. We don’t know what that is.
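The “active contributor” counts Eugene describes (editors with 5 or more edits in a month, tallied month by month) could be reproduced from an edit log along the following lines. This is a minimal sketch, not Wikimedia’s own tooling: the log format (a list of `(user, month)` pairs), the function name `active_editors_per_month`, and the sample data are all assumptions made for illustration.

```python
from collections import Counter

ACTIVE_THRESHOLD = 5  # edits per month, the cutoff mentioned in the talk


def active_editors_per_month(edits, threshold=ACTIVE_THRESHOLD):
    """Count editors with at least `threshold` edits in each month.

    `edits` is a hypothetical log: an iterable of (user, month) pairs,
    where month is a string such as "2007-01".
    """
    per_user_month = Counter(edits)  # (user, month) -> number of edits
    active = Counter()
    for (user, month), n_edits in per_user_month.items():
        if n_edits >= threshold:
            active[month] += 1
    return dict(active)


# Toy data, invented for illustration only.
sample = (
    [("alice", "2007-01")] * 6
    + [("bob", "2007-01")] * 2
    + [("alice", "2007-02")] * 5
    + [("carol", "2007-02")] * 7
)
print(active_editors_per_month(sample))
```

Plotting the resulting monthly counts over time would give the kind of curve discussed above; with real data, the threshold could be varied to see how sensitive the 2007 plateau is to the admittedly arbitrary choice of 5.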
We have some guesses and it will be interesting to explore that. That’s also a potential project. This is showing the same kinds of effects for some of the smaller Wikipedia projects.

This is a chart that Ed Chi’s research group also did, and it measured reversion rates. One of the things you can do in Wikipedia is if you see someone has edited a Wikipedia page and you think it’s a bad edit, you can revert it so it goes back to what it was previously. These different colored lines measure the activity of the user. These purple and blue lines below are people who are very experienced editors, people who have edited between 100 and 1,000 times or more. In this red line up top, you see people who have made one edit, and the line below it is people who have made less than 10 edits. On the left, we’re measuring the rate at which people’s edits are reverted. I come in and edit a Wikipedia page for the first time. If that edit sticks around then that’s great. If the edit gets reverted, that counts in terms of the reversion rate.

What we’re finding is there is actually a big class difference happening now on Wikipedia between new editors and experienced editors. New editors tend to get punished, or not, because of the reversion rate. Experienced editors seem to sort of get away with that. There is a question about whether or not reversion rate is an indication of community health. Perhaps it’s actually a sign of increasing quality on Wikipedia, the fact that you’re getting a lot of random people who are coming on board, and in fact there is a lot more noise. You have to revert more frequently. Or maybe there are other things that it’s indicating. These are questions we’re trying to figure out.

That leads to the last thing I wanted to mention, which was around quality. I don’t have any charts to show you on that right now, but I can cite different studies around this.
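The reversion-rate comparison described above (the share of edits that get reverted, broken out by how experienced the editor is) can be sketched as follows. This is an illustrative sketch only, not the method used in the research cited in the talk: the bucket boundaries, the input format, and the names `experience_bucket` and `reversion_rate_by_bucket` are assumptions.

```python
from collections import defaultdict


def experience_bucket(total_edits):
    """Map an editor's lifetime edit count to a coarse experience bucket.

    The boundaries are assumed for illustration; the talk only mentions
    "one edit", "less than 10 edits", and "100 to 1,000 or more".
    """
    if total_edits <= 1:
        return "1 edit"
    if total_edits < 10:
        return "2-9 edits"
    if total_edits < 100:
        return "10-99 edits"
    return "100+ edits"


def reversion_rate_by_bucket(edits, edit_counts):
    """Return {bucket: fraction of that bucket's edits that were reverted}.

    `edits` is an iterable of (user, was_reverted) pairs and `edit_counts`
    maps each user to a lifetime edit count. Both are hypothetical inputs.
    """
    total = defaultdict(int)
    reverted = defaultdict(int)
    for user, was_reverted in edits:
        bucket = experience_bucket(edit_counts[user])
        total[bucket] += 1
        if was_reverted:
            reverted[bucket] += 1
    return {bucket: reverted[bucket] / total[bucket] for bucket in total}


# Toy data, invented for illustration only.
edit_counts = {"newbie": 1, "casual": 4, "veteran": 2500}
edits = [
    ("newbie", True),
    ("casual", True), ("casual", False),
    ("veteran", False), ("veteran", False), ("veteran", True), ("veteran", False),
]
print(reversion_rate_by_bucket(edits, edit_counts))
```

Computing these rates per month rather than over the whole log would show whether the gap between new and experienced editors is widening, which is the community-health question raised here.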
The main questions we have right now are: how do we measure the quality of content, and how do we measure the quality of experience? Quality is not just about content, but people’s experience on the site itself. How do we indicate the quality of content? It’s all well and good for us to do a research study and to say that some percentage of Wikipedia content in this category of content is very accurate. In fact, we want to take that information and leverage it so that people who are looking up information can actually see what’s high quality or not. A lot of interesting things could come out of that.

0:19:25 Number one, it gives people more faith in terms of the accuracy of the content. Number two, if you’re a contributor, maybe one of the things you want to look at is where there are a lot of articles that are just kind of mediocre quality, because that is where I might want to go and contribute to Wikipedia.

I just want to throw out a couple of things about where we’re at right now. One of the interesting things about Wikipedia is that we collect a lot of data. As I said before, we’re making that data available to anyone and everyone, and there is actually a nice research community that has emerged and is doing a lot of interesting studies around Wikipedia. But there is huge room for more research. We have way more questions than we have answers for and I’m quite sure there are a lot of questions that we haven’t even asked yet.
There is a tremendous opportunity to actually use this as a research tool, to explore different things and I think build business opportunities … really participate in this ecosystem of information. I just wanted to open the floor and see if anyone had any questions about some of the data here, or any of the broader questions, and let’s have a little discussion.

Andreas: Let me show you the agenda for today. We had this as the first social data source. I want to spend the next 20 minutes explaining to you, class by class, what we’re going to cover. There has been a lot of confusion and I think the best way is to talk you through the quarter. Then we have another data source, SKOUT, as another example for you to get your creative juices flowing about what you might be doing. In the last 15 minutes I’m going to address what you need to do, what your deliverables are. The web page is up and pretty clean. I will walk you through the wiki primarily with the TAs. We’ll all be clear about what’s expected for the rest of the quarter.

In the next class, we’ll talk about recommender systems. Recommender systems go back to the early days of ebusiness, where rather than having experts figure out what you should be buying or merchandising, an algorithm figures out what you should be buying, based not only on your past behavior, but on the past behavior of all the company’s customers and in some cases even beyond an individual company. A lot has happened in recommender systems since the early days of Amazon. I will talk about that on Thursday.

Since I slotted in 2 data sources, we will talk only briefly on Thursday about the two Economist articles I assigned last class. Today we don’t have time for it. Some people had problems posting their insights on Facebook.com/socialdatarevolution. Have those been resolved? Remember, I want a few crisp lines about one insight or something triggered by reading those two very good articles; post it on Facebook.com/socialdatarevolution.
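The recommender idea mentioned above, suggesting items based on the past behavior of all of a company’s customers rather than on expert merchandising, can be illustrated with a simple co-occurrence count (“people who bought X also bought Y”). This is a deliberately naive sketch and not how any particular company does it; the basket data and the function names `build_cooccurrence` and `recommend` are assumptions made for the example.

```python
from collections import Counter, defaultdict
from itertools import permutations


def build_cooccurrence(baskets):
    """Count, for every item, how often each other item shares a basket with it."""
    co = defaultdict(Counter)
    for basket in baskets:
        # set() ignores duplicate items within a single basket
        for a, b in permutations(set(basket), 2):
            co[a][b] += 1
    return co


def recommend(co, owned, k=3):
    """Return up to k items that most often co-occur with the items in `owned`."""
    scores = Counter()
    for item in owned:
        for other, count in co.get(item, {}).items():
            if other not in owned:
                scores[other] += count
    return [item for item, _ in scores.most_common(k)]


# Toy purchase histories, invented for illustration only.
baskets = [
    ["book", "lamp"],
    ["book", "lamp"],
    ["book", "pen"],
    ["lamp", "desk"],
]
co = build_cooccurrence(baskets)
print(recommend(co, {"book"}, k=2))
```

Raw co-occurrence counts favor items that are popular with everyone, so practical systems usually normalize the scores in some way; that refinement is left out here to keep the sketch short.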
0:23:00 Next Tuesday, we’re going to talk about content discovery. In some way you can view recommender systems as product discovery. For content discovery, I’ll have Dan Olsen, who is the CEO of YourVersion, come to class and talk with us briefly about what his vision is of having YourVersion of the web, your discovery engine. Think about StumbleUpon v2.

If you have questions now, this is a good moment to ask if you don’t know what I’m talking about. The backgrounds are so diverse, which I realized when I looked at what you sent in, that if people don’t know what I’m talking about, ask.

We did content discovery. Then we’re going to do people discovery. The example there is MrTweet. MrTweet looks on Twitter to find who you might be interested in following. If you haven’t used it yet, I suggest you do use it. It is really pretty powerful regarding who you didn’t know was on Twitter, and now you get an interesting source of news from them.

Then we’re moving on to the real time web. I sent out the email this morning and I listed 5 different things about what happens in one minute. These are pretty big numbers. You remember 20 hours of YouTube content gets uploaded in 1 minute. We’re going to look more at the real time web in class 7. We’ll have Todd Levy come from New York, who is the Chief Product guy at Bit.ly. Bit.ly is very powerful because it measures the attention people pay to the web, independent of a specific website. As you know, Bit.ly is a URL shortener, so it’s convenient to use, but the power is that they now know where the world’s attention sits. It’s very interesting.
Class 8 will be about relevance and metrics. Relevance in search is well established. On search relevance, there are a lot of papers. What about discovery relevance? How can we measure that? What are metrics of relevance here? Metrics is something I used to spend more time on, but we just have so many new and interesting topics that I condensed it to half a class. For people who think about organizations, it’s thinking about why organizations are so unwilling to actually talk about metrics. Is it the fear of the middle manager that if there are some clear metrics out, he might be obsolete? Is it that people don’t understand basic statistics? It is something I’ve been grappling with, working with big companies. I’ll share that with you.

Class 9 is about mobile, the mobile as a data collection device. It’s very powerful. Nothing is closer to us than our mobile. There I expect some product/project ideas to really be in the mobile space. I’m not just talking about geolocation here. I’m really talking about the mobile as a data collection device for entering something explicitly or implicitly.

0:26:40 Class 10 is marketing. Last year I gave the keynote at the World Marketing Forum. You saw some of the slides in the first and second class, which I put together for that. On a high level, I do want to share that with you as well. For me, marketing has dramatically changed from creating messages and finding ways to push them to an audience, which usually doesn’t want to hear them, to actually co-creating the message with the audience, as in Wikimedia for instance.

Social data means people create and share data. What makes people create and share data? How can we design incentives such that people actually do something, primarily for themselves but also for the larger community? That’s why you talked first, because it’s a beautiful example of how we can understand that on Wikipedia and Wikimedia. Game dynamics is part of the lecture: what badges do you need to give to people, or how do you need to give them reinforcement? I think it’s a very applicable lecture because people who build something have to grapple with the question of how to create incentives. For instance, self metrics are more important than monetary incentives. What do we know from games that makes them addictive? How can we apply that?

Class 12 looks at collaboration. Incentive and design, so far, is pretty much for individuals to contribute, but how do we get people to share? How do we get to something where the whole is more than the sum of its parts? Right now I’m trying a piece of software where the transcripts for each class can be annotated by everybody worldwide. It’s not ready for launch but I think that’s what we’ll be doing for the transcripts. Just to see what’s good, bad, unclear, put a post-it note on the transcripts of each class.

Classes 13, 14, and 15 are classes I developed for Berkeley, for the School of Information. They are about the individual, about relationships, and about community, the collective, collective intelligence, and things like that. With the individual, we think we know what we’re talking about, but do you just want pseudonyms or do you want to have real name identities? What have we learned there? When do we need one versus the other? When do we actually need to authenticate ourselves? Part of the lecture on the individual is privacy. At any conference I go to, the privacy discussion always surfaces, and that’s something that is really constantly evolving. That’s a lecture I’m looking forward to.
Then relationships is the 14th class. Don’t just think about going out with somebody. The future of relationships means between people and people, like edges or arcs of the graph. It’s also relationships between people and companies, people and products, people and brands. I will have Auren Hoffman, who runs Rapleaf, come and share some stories about how they’re figuring out a risk score, which is not based on what you do as an individual but on what can be inferred from your relationships to other people and understanding whether they’re good citizens or not.

0:30:40 The 15th is about the collective community and then the 16th is group dynamics. I wrote this outline before we talked in the car today. You call it site or community health. In group dynamics, I think interesting projects could be lying there, trying to analyze how we can influence the community to actually work well as opposed to drifting into negativity.

The 17th class I learned from you, because a number of people mentioned they are interested in cities. We’ll do one class, the last real class, about smart cities. Colin Harrison runs the project at IBM Research. It is very interesting how much can be learned, and how much behavior can be influenced, by measuring things in cities. I was quite surprised a couple of people quoted Richard Florida. You know the saying “the world is flat”? Basically it has the same meaning everywhere. Florida counters that by saying the world is spiky, that the world here in Silicon Valley is probably similar to the world at Harvard, but probably very different from rural China.
Or, in China, rural China is super different from the coastal region. The spikes are always similar. Why do people meet? Because they like the interpersonal interactions. So smart cities: a lot of examples, trying to steer traffic for instance. In Singapore, ERP, having flexible rates; what’s acceptable to people and what can we measure.

The 18th class will be student presentations. Scott Johnson from Alloy Ventures is going to sponsor that and will be there. We’ll have somebody from Benchmark, from Accel, and from Founders Fund. These are friends, good people. Not all groups get to present. The TAs and I will hand pick who will be allowed to present. If we think some project is not that interesting for them to hear, we’ll read it but we won’t give you the slot.

The very last class I call Festival of Data. I got the URL a couple of days ago. That will be with Peter Hirschberg, who also has Sun foundation in San Francisco. They’re running a whole festival of data in the summer in San Francisco this year.

That’s about it. That brings us to the end of the quarter. We did think about it. It’s always a choice about what I put in. I’m very comfortable with the flow. I would like to know if you have any questions. Do you feel something you really wanted to hear is sadly left out? Do you think some things are a total waste of time? A quick discussion would be good.

Student: 0:33:49

Andreas: When I say game dynamics, we could put someone from Zynga here for 20 minutes. Who are actually gamers here? Stand up. What about the rest?

Student: Social gamers means you play the games on Facebook or MySpace. A gamer you play on Xbox. It’s less of a social aspect.

Andreas: MMORPG, massively multiplayer online role-playing games, is probably what people had in mind, versus social games you do by yourself? Does anybody do social games by themselves? Stand up. So no social gamers, so it’s MMORPGs.

Student: 0:34:59

Andreas: It’s not clear.
I think we’ll do some of it when we talk about game dynamics. That’s the logical spot. Student: Do we get to check out the business models for 0:35:19 business in this space? Andreas: Business models are pretty much underlying everything we talk about. For instance for games right now, the business model is that the game is free, coming from China or Korea, but you pay for virtual items. Online, it’s very interesting what works in mobile marketing and what doesn’t work. I think there is nothing radically new in business models. Don’t expect that in the end of the class you’ll realize the world is very different from the way you thought it was at the beginning of the class. What is new is we have more data-driven business models and we can validate things more. Business models are part of the conversation but there is no class on business models. Student: When you were talking about privacy, are you going to also talk about some of the newer applications that got inspired through all this data? I know there is owner questions about privacy, who should we … data to other data, but there is other more practical - we go out there, the general guy goes out and tries to start scraping all this data. Andreas: I believe people need to understand the tradeoffs they’re making. There is the privacy/convenience tradeoff. You give up your privacy for convenience. I think we’re pretty good at this. I will quote one paper which is the use of auxiliary information which has some striking results. I think it’s more of a discussion class.
That’s something most of us have some stake in because we’ve thought about it. Try the attempt Facebook has, in explaining the privacy settings. It shows you how difficult it is for a company to do a good job there. Student: Were you thinking of doing anything on healthcare? Andreas: We would need to throw out something. We had half of a class last year at the end on healthcare. When I talk about mobile, the perspective I have is of quantified self. Kevin Kelly’s group. But I’m not going to talk about health records and stuff like that. Two years ago we had one class on DNA sequencing, 23andMe being a good example, but I think this is really different now. It’s about people knowingly and willingly sharing data, which are not like DNA data. If people feel we need to throw out something and make space for healthcare, I have some brilliant ideas, another startup, happy for a discussion. Student: 0:38:04 Andreas: Can you come closer to the front? Student: There are 0:38:32. Andreas: This is not a computer science algorithms class. I will talk about the underlying data. For instance when we talked about recommender systems, we do talk about algorithms, but it’s not an algorithms class. If you want to learn about algorithms, this is not the place. Student: 0:38:55 Andreas: Toby Segaran’s book is a very good book. If you need a couple of chapters for the homework assignment in a week and a half, I can organize those for you, but it’s a good book. Student: 0:39:13 Andreas: Toby Segaran’s book, Programming Collective Intelligence. It’s an O’Reilly Media book.
It should be on the homework we’re giving out as the Python homework. Any other comments, questions? We will have more guest speakers and I’d like to keep it short, like this which is a very good presentation, as opposed to donating an entire class to a guest speaker. I hope that’s in your interest as well. Christian where are you? Come set up. Rest for today, we learned about another social data source and we have 10 minutes, not more than 15. At the end I will talk to you. Jeremy should be here by then, about what we need to get from you and if you haven’t seen it yet the table is on the wiki that tells you exactly which date each homework is due. In full disclosure, I’m on the board of Christian’s company. It’s SKOUT. I’m not pushing for any companies here. I’m showing this to you as a potential source of a company to get data from to do your project. The project is important. Christian: Hello and my name is Christian Wiklund. I’m the Founder and CEO of a company called SKOUT. We are in the love business so we connect guys like you as easily as possible on cell phones. Andreas is on the board and I am sure you are going to be in for a treat here in his class. I’m sure you’ll have a lot of fun. What we do is quite simple. We take the GPS coordinates from where you are based and we connect you with interesting people in the area so you can reach out to them in real time, and set up a chat, maybe a virtual gift if this person is cute. Maybe you’ll take them out for a date. It’s about the time and space continuum. That’s where we work, so it’s about when and where you are. Forget about online dating like Match.com and these more boring sites. This is for a younger, hipper demographic. Right now, we’re the largest mobile dating company in the country so it’s going quite well for us. 0:42:53 SKOUT is one of our brands. Everyone is welcome. Then we have a niche brand called [Boy Oh Boy], which is for the gay community. It’s like a rocket in San Francisco. 
It’s going really well. What are the user benefits with these new types of services? When you think about online dating versus real world dating and how it works, in online dating you have a huge pool of inventory. I can search and match make between potentially millions of girls in my case. I can narrow it down to Asian girls that I like. Maybe I like slightly older girls, a cougar hunter, so I can filter it there and quickly figure out who will enjoy meeting a Swedish guy. That will help me slightly decrease the rejection. That is the nice thing about online dating [0:43:5] experience there. It’s been around, the founder of Match.com and he’s an adviser to us. He started it in 1996. It’s almost as old as the Internet. The drawback with online dating is it’s pretty slow. I don’t know if you’ve tried it but it’s asynchronous communication back and forth with emails, and it’s not very engaging. Of course, you can’t bring it with you. If you look at the bar and how it works, you have a few drinks, go up to the cute girl/guy and say hi, and it’s instant. You get instant feedback, maybe instant rejection, maybe instant positive feelings. But, the drawback there is you have a limited inventory. You’re sort of restricted to the physical boundaries of the room you’re currently in. If you can marry these two worlds, and take the best from them, that’s SKOUT. We have a lot of interesting data. We’re a startup, small team, a lot of interesting problems and opportunities to work on.
I’m more than happy to bring on a few groups here to do some cool projects to mine our data and maybe even solve some of the issues we have. One issue is how do you identify porn posters and prostitutes. That’s one serious issue, how do you create a reputation system around the whole social engagement model we have, how do you filter out - there are a few interesting categories of people we can find. You have guys pretending to be girls, looking for girls because they want to have some interaction there virtually. You have web scammers. In the Philippines you have these Internet cafes full of affiliates to webcams.com for instance, so they’re trying to up sell webcam shows to our users. How do we identify the trash from the real content and create long-term healthy base for growing really big? That’s one interesting problem that needs to be solved. Unfair distribution of attention is another one. I would say SKOUT works as any bar on steroids. The sexiest people get enormous amounts of attention so some guys will have 500 inbound flirts in one day and some less fortunate guys may have 0-1 flirts per day. Obviously, if you get 500 people hitting on you in one day you’ll probably give up. It’s too much. The guys that don’t get any flirts are not happy either so how do we distribute attention in the network to make it more fair and balanced? 0:46:47 One thing we can do since we’re location based is we live with historical data and with real time data, figure out where you should hang out tonight. Think about heat maps. That is something we want to launch for iPad, that you had a very sexy big map that shows I’m Christian, 29 years old, looking for Asian girls in San Francisco tonight, where is the highest likelihood of me encountering someone who would like me? That’s one interesting problem set that could be looked at. How to create finding Mr. Right right now. 
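One way a project group might begin quantifying the "unfair distribution of attention" described above is with a Gini coefficient over inbound flirts per user. This is only a minimal sketch: the function name and the sample counts are hypothetical, and nothing here reflects SKOUT's actual data or methods.

```python
def gini(counts):
    """Gini coefficient of non-negative counts.

    0.0 means every user receives the same attention; values close to 1.0
    mean a few users receive almost all of it.
    """
    values = sorted(counts)
    n = len(values)
    total = sum(values)
    if n == 0 or total == 0:
        return 0.0
    # Standard formula on ascending data, with ranks i starting at 1:
    # G = sum_i (2*i - n - 1) * x_i / (n * sum(x))
    weighted = sum((2 * i - n - 1) * x for i, x in enumerate(values, start=1))
    return weighted / (n * total)


# Hypothetical daily inbound flirt counts for four users.
inbound_flirts = [500, 1, 0, 0]
print(round(gini(inbound_flirts), 3))
```

Tracking a number like this before and after a change to how profiles are surfaced would give a rough check on whether attention became more evenly spread.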
It’s about instant connection with people but there might be more, deep structure data that is also interesting to look at. We don’t match on a lot of dimensions. It does not take you half a day to fill out a big form of exactly who you are. It’s more about the conversation, there might be interesting stuff to add into our network of how do we create the now experience versus finding something that could be more long term. I would say typically our users date multiple persons. They’re not looking to get married so it’s more casual. Optimization of sales funnel - this is something we’re doing continuously but we’re not the smartest brains on Earth so I’m pretty sure some guys here would be able to bring some real value to the table to optimize how we buy the cheapest traffic, what traffic monetizes the best, and how do we monetize them and so forth. Basically, we optimize the whole sales funnel including retention rates, how to get people to come back to the service and so forth. One interesting study I think will be behavioral patterns versus demographics. We had [BoyOBoy] versus SKOUT. We have different ethnicities, different age groups. It peaks around 24 or 25, but we have a range. How does Tulsa, Oklahoma compare to New York City, as far as engagement goes? Do people meet up more there or less there, how many chat messages do they send to each other, and so forth? There is a lot of interesting stuff that can be studied in that. Local marketing campaigns - we have not focused much on going to one city and pushing.
I think there is huge opportunity for that, so that could be an interesting marketing problem to solve, how do you get as much local penetration as possible in the boundary, geographically? There is probably tons of stuff you can come up with that would add value to this class and also help SKOUT to continue its success. If you want to email me, that’s my email address. That’s about it. If you have any questions, I’m happy to take them. Student: Do you have a revenue model in mind? Christian: We’re revenue generating and it’s going quite well. How we’re monetizing is a virtual economy plus premium features. You can send virtual gifts and if you could figure out who in this room thinks you’re hot, you would pay for that, right? So stuff like that, we charge for. 0:50:25 Student: Have you tried anything for making the distribution more even? For one thing, can’t you tell by the incoming flirts someone has how hot they are? Creating people on a hot list that only allows hot people to flirt with other hot people? Christian: We have played with all of these things in our minds, but we haven’t executed on anything. I have hundreds of problems and things we want to do but we don’t have the resources now to attack all of them. I do think it could be very important for retaining users. We need to get hooked up as quickly as possible. Student: I had a question about the scammer thing. Have you actually implemented ideas on how to catch posers? I feel like that’s a big issue with instant messaging in general. I’m curious to find out how you handle that issue. Christian: There are some things we can do.
We haven’t done much. We have been growing quickly and it’s taking us by storm so all of these issues are just laying on the table and we need to quickly fix them. You won’t get the web scammers to go for your community if you’re not big. Now we’re growing large and they wouldn’t stay on if they couldn’t monetize. You could potentially block off complete countries in the IP range, but that might not be the best way of doing it. Maybe you have the community flag content, self policing, but we’re open for suggestions. Student: Wouldn’t that be kind of bad though? I’m a user - a girl that I think is hot and it turns out to be a guy? If I flag them in my phone it’s already too late. I’ve never met this - it’s not really going to help my case. Christian: It happens to me in San Francisco all the time. I have girls who are not really girls come and hit on me. I still go out. I’m not scared. Andreas: One more question and then we need to move on. Student: I was wondering on the issue of how to block users who aren’t real users or scammers, why not require everyone who signs in to use their Facebook name? Christian: That’s a great idea and Facebook Connect does know scammers and bullshit so these web scammers have accounts and hundreds of friends already. We tried that and I see a pretty great inbound spam stream on Facebook now. I get tons of girls trying to connect with me that I’ve never met. Maybe I did meet them… Andreas: The last 10 minutes we’ll talk about what we need you to do. The two TAs and I will split that presentation. First of all, the wiki, the Stanford2010.wikispaces.com is going to start with the next class so each class, two people will sign up to actually bring by midnight the day after the class - the Thursday class will be by Friday night - to bring up the basics of that class into the wiki. It’s very difficult for people to add stuff if there is nothing there but it’s very easy for them to add stuff if there is a skeleton there.
I always commit to giving you my notes, which are what I have here basically, and the end result is very powerful. 0:54:24 If you look at the wikis from the last years, I think people do a good job because it’s very easy for them to do it. It is class editable. You should have all gotten invitations to Stanford2010.wikispaces.com. If anybody did not get an invitation, then see Jeremy. That starts this Thursday so if someone wants to volunteer for Thursday now, we can take names down. From then on it will be signing up on the wiki. Jeremy: We’re just going to add the pages in the wiki and then put your name on whichever one you want to do. Andreas: Why don’t you introduce yourself and what you do in MS&E? He works with Ron Howard who is one of my absolute heroes. Muhammad: My name is Muhammad Aldawood. I’m a PhD student in 0:55:31 department in the statistical analysis group. I will be one of the TAs for this class. I wanted to introduce the homework. We have a total of 4 homework assignments in this class. All are in the next 4 weeks. The first one is today and is due on Thursday. It’s very simple. You have to build your own website and use Google Analytics for that. It shouldn’t take a long time. There are some instructions on the wiki. Andreas: I wanted to break down people’s fear with this homework, like GSB people’s fear, that it might be difficult to bring up a web page and to [0:56:15]. I don’t expect you to have lots of traffic to that page, but you read it, you want to be able to see that happening somewhere. Your time is less than half an hour. Muhammad: It shouldn’t take that long.
Andreas: It’s not hard. Don’t worry about it. Do it and send a screenshot of Google Analytics with at least one hit or something. The second homework is much harder. It’s a recommender system for Delicious. The third homework is analysis of Bit.ly data. I’m interested as an engineer in what can we do to predict how long it takes for a URL to decay. Some things are like straw fire, high and gone; others sort of meander around. Can you find a model? We’ll make it easy enough so you don’t have to work too hard on that one. The fourth homework is Twitter. That’s a great one. Come up with a harness of how you evaluate whether it’s interesting to meet someone on Twitter, to discover somebody on Twitter. Then implement something simple. We get white listed for all the students at Twitter. There are no barriers there. Ask some friends how they feel about the recommendations. It’s a pretty good real-world project, coming up with something, implementing it, and then evaluating it. Muhammad will be the one who is mainly the contact person for homework and I’m very grateful for that. Any question about homework? Student: … Andreas: Homework is individual homework. These are not groups. 0:58:18 Student: Can we work with other people? Andreas: Of course, you can help each other but you need to submit the page you made with a screenshot you created, as opposed to someone else’s page. I believe in people learning from each other but for these homework ultimately each person needs to go through them on their own. Any other questions? Jeremy and I worked very hard until the sun rose on Monday morning. Thank you, his girlfriend didn’t dump him, I hope, on dog food. It’s his idea and it’s a great idea. Jeremy: One last thing on the homework, I’d recommend you take a look at homework 2 before Thursday. 
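For the Bit.ly question of how long a URL takes to decay, one simple starting model (an assumption for illustration, not the model the homework prescribes) is exponential decay of clicks per hour, fitted by least squares on the logarithm of the counts. The function name and the sample series below are hypothetical.

```python
import math


def fit_half_life(hourly_clicks):
    """Estimate a click half-life in hours, assuming clicks(t) ~ A * exp(-lam * t).

    Fits a straight line to (t, log(clicks)) by ordinary least squares,
    skipping hours with zero clicks, and converts the slope to a half-life.
    """
    points = [(t, math.log(c)) for t, c in enumerate(hourly_clicks) if c > 0]
    if len(points) < 2:
        raise ValueError("need at least two hours with nonzero clicks")
    n = len(points)
    mean_t = sum(t for t, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((t - mean_t) ** 2 for t, _ in points)
    sxy = sum((t - mean_t) * (y - mean_y) for t, y in points)
    slope = sxy / sxx
    if slope >= 0:
        return math.inf  # no decay visible in this window
    return math.log(2) / -slope


# Hypothetical "straw fire" link: clicks halve every hour.
straw_fire = [1000 * 0.5 ** t for t in range(6)]
print(fit_half_life(straw_fire))
```

A link that "meanders" would show a slope near zero and therefore a very long or infinite half-life under this model, which is one sign that a single exponential is too simple for it.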
The dog food, the idea is for you to get firsthand knowledge of the Social Data Revolution. It’s the difference between reading an analyst’s report about Apple’s iPod and owning an iPod and playing with it. These are all supposed to be an hour or two, they’re going to be fun to do and get involved. They’re not supposed to be arduous. The first one will be for next Tuesday, YourVersion. The idea is there are a lot of different things going on with news discovery. Digg, Reddit, YourVersion, there are a lot of these different things. We chose one, YourVersion, and we’ll have you do a small dog food on that. We’ll post that on Thursday. The other ones are after the homework and while you’re doing the project. That’s collaboration focus with Google Wave. [1:00:22] is the latest QA platform that’s coming out. There are things like Hunch and things like that so there will be an interesting project there. Lastly, Groupon is one of the most successful companies going on right now so we thought it would be interesting to take a look at what they’re doing. We’ll post the dog food on YourVersion on Thursday, but these are not meant to be arduous. It’s to get you involved in everything that’s going on. You hear about Twitter and Facebook. We really need to create accounts, use the service and get data. Andreas: I look forward to doing that myself. One small remark is that in most cases we will have the key person behind the project come to class the day the project is due. They also want to learn what do smart people here think is great, are there barriers, and we’ll share the answers with the company.
If you have specific things to talk to them about, great. If you want to use it for your project, and do a project with those companies - in Dan Olsen’s case I’m sure he will love that. I haven’t talked to the others yet. Jeremy: Any questions? Andreas: Do you understand our reasoning behind it, the experiential component of doing it instead of just reading about it is something we care about. That’s the dog food.