Transcript of Andreas Weigend
Data Mining and E-Business: The Social Data Revolution
Stanford University, Dept. of Statistics

Andreas Weigend (www.weigend.com)
Data Mining and Electronic Business: The Social Data Revolution
STATS 252
April 6, 2009
Class 1 Overview: (Part 1 of 2)

This transcript: http://weigend.com/files/teaching/stanford/2009/recordings/audio/weigend_stanford2009_1overview-1_2009.04.06.doc
Corresponding audio file: http://weigend.com/files/teaching/stanford/2009/recordings/audio/weigend_stanford2009_1overview-1_2009.04.06.mp3
Next Transcript: (Part 2 of 2): http://weigend.com/files/teaching/stanford/2009/recordings/audio/weigend_stanford2009_1overview-2_2009.04.06.doc
To see the whole series: Containing folder: http://weigend.com/files/teaching/stanford/2009/recordings/audio/
Transcript by Tamara Bentzur, http://outsourcetranscriptionservices.com/

Andreas: I know it’s probably more exciting for you to talk to your neighbors than to hear me speak, but let’s get started. I want to welcome you to the first class of STATS 252, spring 2009. The official title is Data Mining and Electronic Business. Titles at universities have a long lifetime. The subtitle is actually The Social Data Revolution. What has changed is that in the 1990’s the problem was “Here is a set of data; what insights can you give to me?” The problem now is “Here is a problem; what data can you go out and get for me?” The emphasis of the class has shifted over the years, from algorithms for data mining to coming up with new ways of getting stuff from people where we can build apps for them that they find useful. I also teach at The Business School at Berkeley. There, people talk about customer value.
Most of the MBA students think it’s the value the customer has for the company. I tell them no, it’s really the value you have for the customer that ultimately matters. When we think about data mining now, I really have in mind what we can do to create interesting stuff that people like us want to use. It goes without saying that the main value is not in data that is hidden somewhere, but in data that is shared. That’s why we call them social data or shared data. This course is about finding out what kinds of new data sources we have seen in the last few years and maybe also what kinds of new data sources we might be seeing in the next few years. If you think back to the 1960’s or 1970’s, maybe in the first decade of my life, I created less data than I create in a single day now. Actually, in 2009, more data will be created by consumers than has been created in the entire world so far. That’s pretty amazing. Of course, a lot of it is video, music, and photos, but that’s a pretty amazing world. Not only do we live in the time when the world got connected, for the first and the only time, but we actually live in a world where more data is created every year than we have created so far.

0:02:40.6 Let’s think about this data. As some of you might know, I used to be the Chief Scientist at Amazon.com. Before Amazon, most of the data people had were transactional data, like you go to Safeway, buy your chips at 3:00 in the morning, and they know they sold chips to some dude at 3:00 in the morning. Maybe you have your Safeway loyalty card, which if you don’t have, you pay a premium. They know that at least the guy who had the card the last time bought beer with it. That’s how data mining comes up and they find chips and beer. How did that happen? The story goes that wives call their husbands on the way home from work and say, “Can you bring some diapers? We’re out of diapers for the baby.” The husband says, “Sure,” and picks up some beer on the way. I don’t know whether that’s true.
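As a rough illustration of how a pairing like chips and beer, or diapers and beer, might surface from transactional data, here is a minimal sketch that computes lift for item pairs. The baskets and item names are made up for illustration, and lift is just one common co-occurrence measure, not necessarily what any retailer used. A lift above 1 means two items show up together more often than they would if purchases were independent.

```python
from collections import Counter
from itertools import combinations

# Hypothetical baskets; the items are invented for this example.
baskets = [
    {"diapers", "beer", "chips"},
    {"diapers", "beer"},
    {"chips", "soda"},
    {"diapers", "beer", "soda"},
    {"chips", "beer"},
    {"milk", "soda"},
]


def pair_lift(baskets):
    """Return {(a, b): lift} for every item pair that co-occurs at least once.

    lift(a, b) = P(a and b) / (P(a) * P(b)), with probabilities estimated
    as the fraction of baskets containing the item(s).
    """
    n = len(baskets)
    item_counts = Counter(item for basket in baskets for item in basket)
    pair_counts = Counter(
        pair for basket in baskets for pair in combinations(sorted(basket), 2)
    )
    return {
        (a, b): (count / n) / ((item_counts[a] / n) * (item_counts[b] / n))
        for (a, b), count in pair_counts.items()
    }


lifts = pair_lift(baskets)
for pair, lift in sorted(lifts.items(), key=lambda kv: -kv[1]):
    print(pair, round(lift, 2))
```

Pairs are keyed in alphabetical order, so the diapers and beer pair is looked up as `lifts[("beer", "diapers")]`.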
I want to give you another data mining story that I do know is true. There are baby diapers and there are adult diapers. Adult diapers are for old people. Which area in the United States do you think has the highest ratio of adult diapers sold per capita? We hear Florida. We hear Scottsdale. I will give you one hint. Those data are about two or three years old. Something has kind of changed in the world in the last year or so, so this data might not be accurate anymore. Wall Street.

That is transactional data – not that exciting. The next data after transactional data was interaction data. We instrument the world in order to find out not only what people are doing passively, but how they are interacting with the world. The key example of that is geo-location. Last year, we had three hours on geo-location and I’m sure I’ll do the same, maybe even six hours this year. Last year, a hundred million mobile phones with GPS devices in them were sold by Nokia alone; a hundred million. The year before last, Nokia bought a company called Navteq for $8.6 billion, a lot of money. Why did they buy Navteq? Navteq had about a thousand people driving through the world with GPS devices, measuring out where the streets are. A thousand two years ago; a hundred million, ten to the five times more, sold last year alone. A hundred thousand times more devices going through the streets. Think about what that means for the quality of data: one over the square root of a hundred thousand is whatever, 0.3%.
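The back-of-the-envelope number above can be checked in a few lines. This sketch assumes the usual statistical picture, which the lecture does not spell out: measurement errors are independent and unbiased, so the error of an average shrinks like one over the square root of the number of measurements. The variable names are illustrative; the two counts are the ones quoted in the lecture.

```python
import math

professional_drivers = 1_000       # survey drivers, as quoted in the lecture
gps_phones_sold = 100_000_000      # GPS phones sold in one year, as quoted

# How many times more measuring devices there are now.
ratio = gps_phones_sold / professional_drivers

# Assumption: independent, unbiased errors, so the error of the averaged
# measurement scales like 1 / sqrt(N).
relative_error = 1 / math.sqrt(ratio)

print(f"{ratio:,.0f} times more devices")
print(f"error shrinks to about {relative_error:.1%} of what it was")
```

Under that assumption the error drops to roughly 0.3% of its earlier size, which matches the figure given in class.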
Data is much more accurate, and not only more accurate in the static sense. I live in San Francisco; I’m coming down here and it’s always a little iffy with the traffic on 101. If we actually measure all the cars or at least a certain fraction of cars, maybe just a hundred cars on 101, we know fairly well how much the traffic is flowing. Twenty years ago we used to build complex models at Xerox PARC, where I was as a postdoc, between [0:06:18.8 unclear] and turbulent flow, figuring out how the traffic would be on 101. Now, instead of building complicated models, we just measure what’s happening. We instrument the world.

0:06:28.4 Other ideas would be parking spots in the city. In San Francisco, it’s pretty hard to find parking. At Berkeley, they have these projects on smart dust. The whole city is instrumented with little detectors with low power consumption that broadcast whether there is an empty parking spot. The good thing about this is you know precisely, up to date, where you could park. The bad thing is the police know precisely if you park for more than one hour, if you are allowed to park there.

Instrumenting the world is one of the topics we have here. I’m personally most interested in instrumenting people. Obama has this campaign on medical information or medical records. I think this is a very important element. I want to make two remarks about this. One is unless it is truly consumer focused, who of us would play that game? If it is just insurance companies or hospitals or the pharma industry trying to squeeze us, I’m not going to play that. To be more specific, there was a case of a girl on the east coast who was anorexic last year.
She had to go to the hospital. Her mom got the bill of $30 thousand. Her mom sent the bill to the insurance company and the insurance company says, “There are two types of anorexia. Type one we don’t have to pay for and type two we do have to pay for. We are sure your daughter has type one.” The mother says, “I have no money. What do I do?” They said, “You can sue us.” What does the poor mom do? She goes to the insurance company and says she really wants the money. They tell her to sue and so she sues them. The first thing the insurance company does is to go into Facebook, MySpace and all the social networks we all know and love, and try to get all the data that girl has written, to make the case against her.

One question we will have in our class is not who owns the data. It doesn’t matter who owns digital codes. They can be reproduced arbitrarily. What can we do in this minus one sigma event, in this minus two sigma event, when the data are getting used against us? Who of you has any picture on Facebook that you would not be comfortable with if I projected it right now? All right, you know what I’m talking about.

Another example is an insurance company, a car insurance company, that last year actually stopped the business model I’m going to explain. Here is what it was. They said, “We’ll make a deal with you. You put that little device in the car, a GPS device, and it will record wherever you go. You have to tell it where you want to go.”

0:09:37.8 There you are on Friday night, south of Market, at 3:00 in the morning. The bars are closed, and you want to go back home. You punch in, “Going back to Stanford,” and they say, “That will be $40.” They know, statistically, that people might have had too much fun, or it might be the other way around; you might be going at 3:00 in the afternoon, and they know that traffic is heavy and that’s when more of the incidents happen. I don’t know how it is.
The fact was they tried to get people to share with the insurance company where they were going. That was going to determine the rate the insurance company charged. What do you think about that? Is it fair? Is it unfair? Does it help us be more reasonable drivers? What do you think?

Student: I value my privacy more than I value whatever discount they were going to give me.

Andreas: I think the other problem, and why they went under, was this: not that any of us would ever go faster than 65 mph, but let’s assume that would happen. They of course know that; the precision of the GPS device is amazing. You have an accident and they say, “Young man, if you are going 78, you don’t expect us to pay for your car.” It’s a little tricky.

The third example – 23andMe came to class last year. I actually staged that very well. I got that FedEx box that day and I told the class to remind me five minutes before the end and I would open it. It’s worth a thousand bucks. It’s a data collection device. I had them guess. Nobody guessed the device; it was just a little test tube I had to spit into. They thought it was a new hard drive, maybe 1 TB or 2 TB. 23andMe looked at my DNA, they sequenced snippets of it, and that was the only time I was not quite prepared to just put it up in public. I wanted to see it first. Now, we have all these people we share our DNA with. There is 23andMe. Much more exciting than learning about myself is that we can now carry out experiments where, for instance, we are asked how many glasses of wine it takes for you to turn red, how you experience bitter, and so on.
23andMe now has a pool of tens of thousands of subjects whose DNA they know reasonably well, and they can poke; they can ask certain questions. That is a very interesting way of doing collaborative science in a way that didn’t exist five years ago. We help the world create knowledge by answering simple questions and them having the background of our DNA snippets. Do you know of any other examples in this space of collecting data, maybe data about people, and then doing stuff with them: finding out things about the individual, coming up with insurance, coming up with insights in DNA, personalized medicine? Different people react differently to drugs.

0:12:44.5 For instance, there is a drug that sedates babies. If you have ever been on an airplane flight from here to Europe or from here to Asia, and sat right in front of that crying baby for the ten hours, you know what I’m talking about; you wish you had one of those things to put in the drink or something. Not in your drink, in the baby’s drink. The problem is that for 90 something percent of the babies it really sedates them. For the remaining couple of percent, it makes them very agitated. Do you want to take that risk? If you know a person’s DNA, because ultimately, the drugs interact with whatever is in our body in [0:13:22.2 unclear], then you know whether it would be the right thing for that baby or not.

Another example is a product I’ve been waiting for and it’s still not out. Every month, they move it forward by one month. It says the end of this month. It is called Fitbit. You put it on your belt. It costs $99, and then it tells you whether you are walking enough, sleeping enough – none of us sleep enough, we don’t need Fitbit for that – and if you are getting enough exercise. These are all ways I would push personalized medicine further. It’s not only the couple of times a year or every other year we go to the doctor and they measure our pulse and say, “It seems good,” but the way we are walking, the way we are talking, the way we are sleeping.

Wait, we have a device with us, which actually records. It has a microphone. Most of us have a device with a camera. It even talks to the world. Do you know what I’m talking about? The cell phone. Think about a cell phone. It even probably has our calendar on it. It knows our email. It knows who we like to talk to, or who we just push to voicemail. If you use SkyDeck, it analyzes your calling patterns. It knows you called that girl six times and she never called you back. [Laughter] If you think about our mobile phone as a data collection device, with all the richness and all the things it could do, yet it is so dumb. It either rings or it doesn’t ring.

One of the topics in class will be the topic of relevance. Relevance means figuring out what situation we are in and figuring out how important a certain message or poke is for us at that moment. The fact that United Airlines flight 886 from Beijing is an hour late today probably has no value for any of us, unless we are picking up that special friend and we’re worried about whether she is coming out or not. If we know where a person is, and if we measure, with the microphone or through their movement, what situation they are in (are you in a meeting or at home?), we can do a much better job, compared to now, of deciding whether to interrupt or not to interrupt.

0:15:52.8 One bit missing here is that, unfortunately, the communication devices we have don’t allow us to attach metadata to the message. What do I mean by that? Yes, we do have that thing about importance being high, medium, or low.
I think we all get those messages by virtue of being on stanford.edu email, like two nights ago: the women’s bathroom closure [0:16:25.7 unclear], on Friday night. This was of relatively limited importance for most of us, but it was sent with high importance. I’m sure I could picture that lady sitting in the bathroom with her handbag on the floor… It shouldn’t be how important it is for the sender; it should be how important it is for the recipient. Basically, relevance means we have a box, an algorithm, or whatever you want to call it, which has input sensors so it knows which state we are in, and which gives the sender of the message a way to express how important that message is for the recipient, not for the sender.

Now, if the women’s bathroom organization at Stanford University decides something is really important and we say nah, maybe not, the next time they will have a much harder time getting through to us. Getting people to give metadata, to give explicit data about messages, is another theme of this class. At Facebook, we have seen our ability to reveal stuff about ourselves, or to make it up, or whatever, and also, importantly, to reveal stuff about our friends. Facebook has a symmetrical relationship or pretends to have symmetrical relationships. I’m not that girl that you called six times and she didn’t call back. That is a very serious constraint. Twitter, on the other hand, doesn’t have that notion of symmetrical relationships. It says, “I do my tweets and whoever wants to follow me can follow me.” Unfortunately, and it really hurts, neither of them has any notion of relevance.
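The relevance "box" described above can be sketched in a few lines. Everything here is a hypothetical illustration rather than a system from the lecture: the class and field names, the neutral starting trust of 0.5, the 0.3 penalty when the recipient is busy, and the update rule are all assumptions. The sketch only shows the three ingredients named in class: the recipient's state, the sender's claim about importance for the recipient, and a penalty for senders whose past claims the recipient disagreed with.

```python
from dataclasses import dataclass


@dataclass
class Message:
    sender: str
    text: str
    declared_importance: float  # sender's claim of importance FOR THE RECIPIENT, 0..1


class RelevanceFilter:
    """Toy relevance box: recipient state + sender claim + sender track record."""

    def __init__(self, learning_rate=0.5):
        self.learning_rate = learning_rate
        self.trust = {}  # sender -> how well past claims matched the recipient's view

    def score(self, message, recipient_busy):
        trust = self.trust.get(message.sender, 0.5)     # unknown senders start neutral
        availability = 0.3 if recipient_busy else 1.0   # hypothetical state penalty
        return message.declared_importance * trust * availability

    def feedback(self, message, recipient_importance):
        """Move trust toward 1 if the claim was accurate, toward 0 if it was inflated."""
        accuracy = 1.0 - abs(message.declared_importance - recipient_importance)
        old = self.trust.get(message.sender, 0.5)
        self.trust[message.sender] = old + self.learning_rate * (accuracy - old)


box = RelevanceFilter()
notice = Message("facilities", "Bathroom closed Friday night", declared_importance=1.0)

before = box.score(notice, recipient_busy=False)
box.feedback(notice, recipient_importance=0.0)  # recipient: not important to me
after = box.score(notice, recipient_busy=False)

print(before, after)
```

After one piece of negative feedback the same sender's next "high importance" message scores lower, which is the "harder time getting through" effect. Sorting an inbox or a feed by such a score, instead of by time, is one way to act on it.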
Come on, how pathetic is using time as the only ordering parameter? I understand if Facebook wants to go public, it makes sense for them to say we are as good as Twitter and that’s why we show things in a Twitter stream or Facebook stream. In a bigger picture, it means declaring bankruptcy. It means declaring we don’t know how to use the data you all produce. How long does it take you to respond to that person, that person, or the other person? We don’t know how to use this data so all we do is show it to you in chronological order.

I asked you and didn’t give you the time to respond. What other data sources do you know about? What comes to mind when you hear these stories? I’m in the bright light and I actually don’t see much of you because the lights are so bright. I see a hand up there.

0:19:25.8 Student: I see two ends of the spectrum that a lot of easy to collect data that is not… shared… possibly share with your friends…

Andreas: You also had a hand up? Somebody in front?

Student: Facebook has a lot of private information…

Andreas: So, Mint, as well as Wesabe at Berkeley, are companies where you just give them your credit card information and whatever other things, your phone bills, and so on. They help you out and tell you how much you spend on your credit card. Of course, that is very interesting information that they then can give other people information about, like who might be interested in your buying habits. They can help you out in saying, “No, that credit card would be better matched to your spending habits,” and maybe collect some revenues from the company as a bounty. That’s the business model there.

What other examples do you know? If you say your name and your department it will help us to get to know each other a bit.

Student: … it’s a lot of the emotions… through webcam… adding this information to your email…

Andreas: That’s interesting; you try to get to the emotion implicitly. For instance, the quality of voice is a very strong indicator. I think we all know sometimes our voice sounds different from other times. I did my PhD in physics here. One of the classes I TA’d in 1988 was called Physics of Music. Then, as Assistant Professor, I was teaching Music Cognition. It’s very interesting how different our voice sounds in different situations. That would be the implicit way. One thing I learned at Amazon was you might even ask people explicitly how they are feeling today and they might even tell you.

When I mentioned a situation before: one of the topics in class will be recommender systems. I know most of you are thinking about “the person who bought X also bought Y,” and that’s very boring. That’s what you did in kindergarten. That’s not what we’re going to talk about. What I’m talking about is recommender systems based precisely on the situation you are in, based on the state you are in. Do you need to get something bought so your girlfriend is happy in Holland, or do you actually really care whether you’re buying that Canon 5D Mark II or whether it’s not worth the money you’re going to spend on it? How do we figure out what the information is? How do we get this explicitly?

Here is a question for you. Let’s take Mr. Tweet. As you probably know, you follow Mr. Tweet on Twitter and it suggests people to you. 0:23:13.5 Do you think if you told Mr. Tweet, “I want to get five recommendations,” Mr. Tweet should give you a different set of people to follow compared to when you tell Mr. Tweet, “I want fifty recommendations,” and you look at the top five?
Do you think those top five, in the two cases, should be identical, or different? Just to be clear, I’ll give you another example with exactly the same problem. mSpoke is a company that, with NewsGator, has an RSS discovery tool that shows you stuff you might be interested in reading; not people, but content. If I tell mSpoke, “I have ten minutes; show me stuff,” would that be the same ten minutes’ worth of reading compared to when I say, “I have two hours; show me stuff” – the top ten minutes of that? What do you think?

Student: It’s different.

Andreas: Why?

Student: You want something different…

Andreas: I did an interview with SaaS, many years ago, which had the title “We Don’t Know, but We Can Measure it”. I will ask you many questions in class where I have no idea what the answer is, but I want you to have the mindset that we can come up with metrics that allow us to actually figure out whether this method or that method is better. One thing I know for the news reading, however, is that if you give me the chance to actually tell you when I last read news, whether it was yesterday or last year, you certainly can do a better job by knowing which universe of RSS feeds or news items you should be looking through.

After break, the M here stands for metrics. Do you know what the rest stands for? PHAME? P stands for problem, H for hypothesis, A for action, M for metrics, and E for experiment. That’s what we’ll do after the break. Are there any other questions, data sets, suggestions, anything on your mind?

Student: …

0:26:09.6 Andreas: That’s what they had in the 1980’s, right? The web has come along since.
In class, about five years ago, we had Seth Goldstein, who then had a company called Majestic Research, which used data by comScore. There were some wonderful case studies he had. One was about an online used car company. I’ve forgotten the name. By seeing which cars they sold, i.e., which cars dropped out of the inventory, and by monitoring those cars that dropped out of inventory and were back a week later, which means they were top cars the managers tried out themselves, he could predict precisely what revenues they were making in that quarter. They released a press release at 7:34 in the morning saying the company will have big losses. Three minutes later, the company said, “Oh gee, we forgot yesterday to let you know that we have big losses this quarter,” because the day before they were announcing that things would be looking great.

What I’m after here is there is a traditional way of firms, with big walls, not letting anything out besides what they really want to control. There is this new world here. A friend of mine runs Swiss International Airlines. They have a very positive attitude towards people blogging. Ultimately, it helps to create transparency and it helps to create trust when somebody knows why that plane didn’t arrive, as opposed to saying, “We only have our PR department communicating with the public.”

Another example is a problem set we did two years ago in class. There will be problems for you to do, so don’t think it’s just me talking. That was with a company called Hitwise. We developed a consumer confidence index, not based on the traditional way of sending out five thousand postcards, hoping for some to come back, asking people, “How do you feel about the economy,” and tallying it up. Instead, why don’t you look at ten million people, and all the queries they do and all the websites?
Are they looking for mortgages or whatever the terms are? And then you predict consumers’ confidence based on what people actually do, as opposed to having them reflect on how they feel about the economy. Do you see the difference? In one case you just have some surveys, a small sample size, huge selection bias. In the other case, you look at what ten million are looking for. That would be another example of how much richer data than what people thought about ten years ago can be used to make predictions, in this case, about the economy.

I didn’t want to put you down. The fact that I took that, I just wanted to make a couple of points. One of them was really the openness of organizations. A company called LiveOps (the brother of last year’s TA works there and brought me there to talk to people) is launching a product tomorrow called LiveWork. That’s very interesting, in terms of the future of work. The main reason why companies are around, that the transaction costs within the company are much cheaper than the transaction costs between the company and the outside world, has disappeared.

0:29:42.7 For instance, for all my classes, I produce a transcript for you. I will talk about logistics after break. That transcript, of course, is done somewhere in the world. In this case, it’s Israel. It’s good because when we go to sleep, they wake up, do the transcript, and the next day you have the transcript. That is a very different notion of the future of work compared to going to an office. Why should I sit in an office to make a transcript if instead I can sit on the beach or something?
One of the byproducts is that we have a lot of interaction data we collect on the way. For instance, if I was running LiveOps, with the ten thousand people they have who answer the phone when you want to have your auto fixed or something like this, say whenever you get connected to some potential customer, if it’s a girl you score. If it’s a guy, he doesn’t buy. What would you do? You can quickly identify from the voice pattern that it’s a girl. She should much more likely be talking to you than a guy. Understanding those patterns of interaction and really making the world a better place by making the matches happen is one of the tenets I have here throughout class. It is one of the things we will see again and again. Are there any other examples here, from you?

Student: …

Andreas: Let’s look at some examples that Google collects. Google data sources would be in search – can you all see the screen well enough? Is it big enough? Then I don’t have to read it out for you. Just look at this for a moment, and we’ll discuss what’s on the screen.

This is one example, not yet of Google Analytics. That will come in a moment. This is what Google is known best for and that’s Google search. It’s pretty amazing how they leverage the data that you need to give to them. One of the things that is very important is how we think about data strategies. When I say data strategy, I mean how do we deal with our personal data?
I know whose hands went up about the pictures; not that kind of stuff, but what are we willing to reveal, what do we expect to get out of it as an expected value, and as the 99.9%, the 0.1%, the really happy thing that could happen or the really bad thing that could happen. In a data strategy, you don’t do a contractual relationship; that is the hard one. It’s not, “If you tell me your phone number, I’ll give you ten bucks.” That’s not how it works. It works like, “If you give me your phone number, I can call you.” That’s much more powerful than just getting ten bucks. Google, dammit, we have to tell them what we want to know, otherwise they can’t give us the answer. This is also one of the secrets at Amazon.com. At a physical retailer, when you go in, you don’t have to really tell them what you’re looking for; you can say, “Just looking around.” At Amazon, you can’t say just looking around. You have to enter certain terms or click on certain stuff.

0:33:10.8 The query sequence can be used for spelling corrections. You don’t know quite how to spell Britney Spears? People learned how to spell it and of course, Google has a product, “Do you mean ‘blah’?” The choice set is very interesting in that if you have ten links, which one is it you are clicking on? That has nothing to do with what’s behind the link. It only has to do with the two lines of abstract that you see. Query refinements: how, over time, do you actually make a query more precise? One of my favorite features at Amazon.com is not “people who bought X also bought Y,” but “people who looked at X eventually bought Y.” That’s powerful. Why? Because you have helped people in their decision-making process. Data is worth as much as it helps you make decisions. If a piece of data doesn’t influence your decision at all, you should pay zero cents for it. Has any of you taken Ron Howard’s class on Decision Analytics, or something like this? That’s great.
Does he still ask you, as he asked me twenty years ago, how much you’re selling your shirt for? Okay, good; some things in the world don’t change. I say, “I’m not selling my shirt,” and he says, “Here’s a thousand bucks,” and of course – here’s the shirt. There are variations of that: a prostitute says how much would you take… do you want to know. Of course I’m not. How about a million? Okay, now since we know that you’re doing it, what’s the price?

We have queries and their refinements. Now we have trends. By knowing that in certain areas, certain coughs or symptoms get queried quite often, Google really is in the position to predict the outbreak of diseases much faster than any individual doctor or hospital can. First, you go to Google before you go to see a doctor. Second, the sample size that Google sees compared to what the doctor sees is much larger. That’s another example for Google Trends. Then, there is the traditional bread and butter for them, associating key words with pages. This can all be learned from search.

The second point is Gmail. I still don’t understand how some of the messages from friends I’ve been talking to for five years by now, in Gmail, end up in my spam folder. Either Google knows something about them that I don’t know, or it’s a broken algorithm. In principle, we could really do a good job, if we have it on the provider level, of understanding which messages should end up in spam and which messages shouldn’t end up in spam.

0:35:48.6 But the next step has to be not to make this binary decision between spam or not spam, like friend or not friend, which is just as pathetic, but to rank my messages.
I told you the example of your five minutes and what you should see. It’s the same for email. What are ways to actually come up with “This is what you really should see now”? That notion of a relevance function, which we’ve talked through a lot of times by now, has clearly found its way into search and needs to find its way into many other forms of communication. Google Toolbar is seven years old. For seven years, Google has known what people do at Google as well as what they do wherever they go. That’s not bad. Google knows; that’s very powerful. I worked with a dating site in Singapore. People have private notes there, only private for them. They can talk about the people they were chatting with, like, “majored in chemistry in the US,” or “has beautiful eyes,” etc. Being at the company, we looked at these notes. It really tells you a lot about how innocent some people are. Google Analytics – we had the guy who actually did Google Analytics come to class a few years ago. Again, this is very powerful. The Google browser bar, the toolbar, allows Google to know where a given person is sitting. It sits like a parrot on my shoulder and goes around wherever I go. Google Analytics sits on my site, www.weigend.com, and this way, even if people don’t have a browser bar and never would think about logging into Gmail, Google still knows what people are doing on my page. That’s pretty good, to have all these different perspectives to basically sniff the digital exhaust. Google Reader, Google Docs, Google Groups, again social relationships, Google Calendar, Google Talk, Google Maps, Google… what history. There is a lot of stuff there. Design Tools is an example for Google. It’s very powerful.
If you design those new blinds for your house, using the design tool that Google bought, what are the chances that that ad afterwards, about the blinds shop around the corner, will get good attention from you? That’s a smart move. Google Checkout is similar to what I said about Google Mail. [Ben Ling’s] child from two years ago, [0:38:42.9 unclear]; basically it now goes from the cradle to the grave, or whatever it means when you spend your money. It knows the entire process of decision making, how from your first search about cameras, you end up finally buying something. The feature I love at Amazon, people who looked at X eventually bought Y, of course, it’s only restricted to Amazon.com. If you want to see the entire universe, then Google knows that - YouTube, Blogger.com, a lot of that stuff. 0:39:13.3 To summarize here, we are talking about this data revolution. We had the first one, which moved from simple transaction data, the Safeway story, to interaction data. We then had persistent history, which allowed us to move from transaction economics to relationship economics. What’s the difference? In transaction economics, if I can sell you something right now, I will push it down your throat, I will get my money, and I don’t care. Airlines do this. Do you want a ticket? Get a ticket and good luck. In contrast, in relationship economics, as in class here, if I insult too many of you today, I’ll be here by myself next week. Relationship economics is something that trades off potentially not that much profit in the short term against the long term. Your typical model is [0:40:11.6 exponential to K]. That was sort of the first thing, having persistent ID.
The second thing is what I call the social data revolution, which is people sharing data, sharing data about themselves. There, only the sky is the limit. And, they share data about their relationships to others. What we want to do in class is figure out what the implications are, what we can do, and what exciting opportunities we have. Here are some pieces of paper. Because there are fish and there is water. What these pieces of paper try to do is I think of you as the fish in the water. Most people don’t know water and don’t know fish. I have some questions here for you. I will ask you to spend fifteen minutes, now before the break, to answer those questions. These questions try to help me understand what the water is that you are in. I know it’s not easy to think about the water when you are a fish. I’m not a fish, so I don’t know, but I can imagine it’s quite hard. I ask you to concentrate for fifteen minutes, fill out those questions, and take a break. It is now 3:10. I want to be back here at 3:50, which gives us another one hour and fifteen minutes for the PHAME lecture. Are there any questions? Student: How technical do you want … Andreas: Let’s do Q&A at the end. Otherwise, if we start now… you can ask me in the class. I set the end of the class for Q&A. Can I ask you to help me hand this out? Can you go around … can you hand this? Okay, the purpose is to be quiet for fifteen minutes, think about the questions, fill them out, give them back to me, take a break, and we’ll be back at 3:50.