Transcript of Andreas Weigend
Data Mining and E-Business: The Social Data Revolution
Stanford University, Dept. of Statistics
Andreas Weigend (www.weigend.com)
Data Mining and Electronic Business: The Social Data Revolution
STATS 252
April 6, 2009
Class 1 Overview: (Part 1 of 2)
This transcript:
http://weigend.com/files/teaching/stanford/2009/recordings/audio/weigend_stanford2009_1overview-1_2009.04.06.doc
Corresponding audio file:
http://weigend.com/files/teaching/stanford/2009/recordings/audio/weigend_stanford2009_1overview-1_2009.04.06.mp3
Next Transcript: (Part 2 of 2):
http://weigend.com/files/teaching/stanford/2009/recordings/audio/weigend_stanford2009_1overview-2_2009.04.06.doc
To see the whole series: Containing folder:
http://weigend.com/files/teaching/stanford/2009/recordings/audio/
Transcript by Tamara Bentzur, http://outsourcetranscriptionservices.com/
Andreas:
I know it’s probably more exciting for you to talk to your neighbors than to hear me speak,
but let’s get started. I want to welcome you to the first class of STATS 252, spring 2009.
The official title is Data Mining and Electronic Business. Titles at universities have a long
lifetime. The subtitle is actually The Social Data Revolution.
What has changed is that in the 1990’s the problem was “Here is a set of data; what
insights can you give to me?” The problem now is “Here is a problem; what data
can you go out and get for me?”
The emphasis of the class has shifted over the years, from algorithms for data
mining to coming up with new ways of getting stuff from people where we can
build apps for them that they find useful.
I also teach at The Business School at Berkeley. There, people talk about customer
value. Most of the MBA students think it’s the value the customer has for the company. I
tell them no, it’s really the value you have for the customer that ultimately matters.
When we think about data mining, now, I really have in mind what can we do to
create interesting stuff that people like us want to use? It goes without saying that
the main value is not data that is hidden somewhere, but data that is shared. That’s
why we call it social data or shared data.
This course is about finding out what kinds of new data sources we have seen in the last
few years and maybe also what kinds of new data sources we might be seeing in the
next few years.
If you think back to the 1960’s or 1970’s, maybe in the first decade of my life, I created
less data than I create in a single day now. Actually, in 2009, more data will be created
by consumers than has been created in the entire world, so far. That’s pretty amazing.
Of course, a lot of it is video, music, and photos, but that’s a pretty amazing world.
Not only do we live in the time when the world got connected, for the first and the only
time, but we actually live in a world where more data is created every year than has been
created so far.
0:02:40.6
Let’s think about this data. As some of you might know, I used to be the Chief Scientist
at Amazon.com. Before Amazon, most of the data people had were transactional data,
like you go to Safeway, buy your chips at 3:00 in the morning, and they know they sold
chips to some dude at 3:00 in the morning. Maybe you have your Safeway loyalty card,
which if you don’t have, you pay a premium. They know that at least the guy who had the
card the last time bought beer with it. That’s how data mining comes up and they find
chips and beer. How did that happen?
The story goes that wives call their husbands on the way home from work and say, “Can
you bring some diapers? We’re out of diapers for the baby.” The husband says, “Sure,”
and picks up some beer on the way. I don’t know whether that’s true.
I want to give you another data mining story that I do know is true. There are baby
diapers and there are adult diapers. Adult diapers are for old people. Which area in the
United States do you think has the highest ratio of adult diapers sold per capita? We
hear Florida. We hear Scottsdale. I will give you one hint. Those data are about two or
three years old. Something has kind of changed in the world in the last year or so that
this data might not be accurate anymore. Wall Street.
That is transactional data – not that exciting. The next data after transactional data
was interaction data. We instrument the world in order to find out not only what
people are doing passively, but how they are interacting with the world. The key
thing about that is the geo-location as an example.
Last year, we had three hours on geo-location and I’m sure I’ll do the same, maybe even
six hours this year. Last year, a hundred million mobile phones were sold by Nokia
alone, which have GPS devices in them, a hundred million. The year before last, Nokia
bought a company called Navteq for $8.6 billion, a lot of money. Why did they buy
Navteq?
Navteq had about a thousand people driving through the world with GPS devices,
measuring out where the streets are: a thousand two years ago, a hundred million, ten to
the five more, sold last year alone. A hundred thousand times more devices going through
the streets.
Think about what that means for the quality of data: one over the square root of a hundred
thousand is whatever, 0.3%. Data is much more accurate, but not only more accurate in the
static sense, but I live in San Francisco; I’m coming down here and it’s always a little iffy
with the traffic on 101. If we actually measure all the cars or at least a certain
fraction of cars, maybe just a hundred cars on 101, we know fairly well how much
the traffic is flowing.
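The square-root argument above can be sketched in a few lines. This is a minimal illustration, assuming the measurements are independent and equally noisy (real GPS traces are only approximately so); the function name is made up for this example.

```python
import math

def error_shrink_factor(n_measurements: int) -> float:
    """Factor by which the standard error of an average shrinks
    when n independent, equally noisy measurements are averaged."""
    return 1.0 / math.sqrt(n_measurements)

# A hundred thousand times more devices than before:
factor = error_shrink_factor(100_000)
print(f"{factor:.4f}")  # 0.0032, i.e. roughly 0.3% of the original error
```

The same scaling is why even a hundred cars on 101 already help: averaging a hundred independent speed readings cuts the noise of a single reading by a factor of ten.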
Twenty years ago we used to build complex models at Xerox PARC, where I was as a
postdoc, between [0:06:18.8 unclear] and turbulent flow, figuring out how the traffic would be
on 101. Now, instead of building complicated models, we just measure what’s
happening. We instrument the world.
0:06:28.4
Other ideas would be parking spots in the city. In San Francisco, it’s pretty hard to find
parking. At Berkeley, they have these projects on smart dust. The whole city is
instrumented with little detectors, low power consumption that broadcasts whether there
is an empty parking spot.
The good thing about this is you know precisely, up to date, where you could park. The
bad thing is the police know precisely if you park for more than one hour, if you are
allowed to park there.
Instrumenting the world is one of the topics we have here. I’m personally most
interested in instrumenting people. Obama has this campaign on medical information
or medical records. I think this is a very important element.
I want to make two remarks about this. One is unless it is truly consumer focused, who
of us would play that game? If it is just insurance companies or hospitals or the pharma
industry trying to squeeze us, I’m not going to play that.
To be more specific, there was a case of a girl on the east coast who was anorexic last
year. She had to go to the hospital. Her mom got the bill of $30 thousand. Her mom
sent the bill to the insurance company and the insurance company says, “There are two
types of anorexia. Type one we don’t have to pay for and type two we do have to pay for.
We are sure your daughter has type one.”
The mother says, “I have no money. What do I do?” They said, “You can sue us.” What
does the poor mom do? She goes to the insurance company and says she really wants
the money. They tell her to sue and so she sues them. The first thing the insurance
company does is to go into Facebook, MySpace and all the social networks we all know
and love, and try to get all the data that girl has written, to make the case against her.
One question we will have in our class is not who owns the data. It doesn’t matter
who owns digital codes. They can be reproduced arbitrarily. What can we do in
this minus one sigma event, in this minus two sigma event, when the data are
getting used against us? Who of you has any picture on Facebook that you would not
be comfortable if I projected it, right now? All right, you know what I’m talking about.
Another example is an insurance company, a car insurance company, that actually
stopped the business model I’m going to explain, last year. Here is what it was. They
said, “We’ll make a deal with you. You put that little device in the car, GPS device, and it
will record wherever you go. You have to tell it where you want to go.”
0:09:37.8
There you are on Friday night, south of Market, at 3:00 in the morning. The bars are
closed, and you want to go back home. You punch in, “Going back to Stanford,” and they say,
“That will be $40.” They know, statistically, that people might have had too much fun, or
it might be the other way around; you might be going at 3:00 in the afternoon, and they
know that traffic is heavy and that’s when more of the incidents happen. I don’t know
how it is.
The fact was they tried to get people to share with the insurance company where
they were going. That was going to determine the rate the insurance company
charged. What do you think about that? Is it fair? Is it unfair? Does it help us be more
reasonable drivers? What do you think?
Student:
I value my privacy more than I value whatever discount they were going to give me.
Andreas:
I think the other problem of why they went under was, not that any of us would
ever go faster than 65 mph, but let’s assume that would happen. They of course
know that; the precision of the GPS device is amazing. You have an accident and
they say, “Young man, if you are going 78, you don’t expect us to pay for your
car.” It’s a little tricky.
The third example – 23andMe came to class last year. I actually staged that very well.
I got that FedEx box that day and I told the class to remind me five minutes before the
end and I will open it. It’s worth a thousand bucks. It’s a data collection device. I had
them guess. Nobody had a device; it was just a little test tube I had to spit into. They
thought it was a new hard drive, maybe 1 TB or 2 TB.
23andMe looked at my DNA, they sequenced snippets of it, and that was the only
time I was not quite prepared to just put it up in public. I wanted to see it first. Now, we
have all these people we share our DNA with. There is 23andMe. Much more exciting
than learning about myself is we can now carry out experiments where, for instance,
we are asked how many glasses of wine it takes for you to turn red; how do you
experience bitter, and so on.
23andMe now has a pool of tens of thousands of subjects whose DNA they know
reasonably well, and they can poke; they can ask certain questions. That is a very
interesting way of doing collaborative science in a way that didn’t exist five years ago.
We help the world create knowledge by answering simple questions and them having the
background of our DNA snippets.
Do you know of any other examples in this space here of collecting data, maybe data about
people, and then doing stuff with them; finding out things about the individual, coming up
with insurance, coming up with insights in DNA, personalized medicine? Different people
react differently to drugs.
0:12:44.5
For instance, there is a drug that sedates babies. If you have ever been on an
airplane flight from here to Europe or from here to Asia, and sat right in front of that crying
baby for the ten hours, you know what I’m talking about; you wish you had one of those
things to put in the drink or something. Not in your drink, in the baby’s drink.
The problem is that for 90 something percent of the babies it really sedates them.
For the remaining couple of percent, it makes them very agitated. Do you want to
take that risk? If you know a person’s DNA, because ultimately, the drugs interact
with whatever is in our body in [0:13:22.2 unclear], then you know whether it would
be the right thing for that baby or not.
Another example is a product I’ve been waiting for and it’s still not out. Every month,
they move it forward by one month. It says the end of this month. It is called Fitbit.
You put it in your belt. It costs $99, and then it tells you whether you are walking enough,
sleeping enough – none of us sleep enough, we don’t need Fitbit for that, and if you are
getting enough exercise.
These are all ways I would push further personalized medicine. It not only tells you
the couple of times a year or every other year we go to the doctor and they measure our
pulse and say, “It seems good,” but the way we are walking, way we are talking, way we
are sleeping. Wait, we have a device with us, which actually records. It has a
microphone. Most of us have a device with a camera. It even talks to the world. Do you
know what I’m talking about; the cell phone.
Think about a cell phone. It even probably has our calendar on it. It knows our
email. It knows who we like to talk to, or who we just push to voicemail. If you use
SkyDeck, it analyzes your calling patterns. It knows you called that girl six times and
she never called you back. [Laughter]
If you think about our mobile phone as a data collection device, with all the
richness and all the things it could do, yet it is so dumb. It either rings or it doesn’t
ring. One of the topics in class will be the topic of relevance.
Relevance means figuring out what situation we are in and figuring out how
important a certain message or poke is at that moment for us. The fact that United
Airlines flight 886 from Beijing is an hour late today probably has no value for any of us,
unless we are picking up that special friend and we’re worried about whether she is
coming out or not.
If we know where a person is, if we measure the Earth with a microphone, through
their movement, what situation they are in - are you in a meeting or at home; we
can do such a better job compared to now, in deciding whether to interrupt or not
to interrupt.
0:15:52.8
One bit missing here is that unfortunately, the communication devices we have
don’t allow us to attach metadata to the message. What do I mean by that? Yes, we
do have that thing about importance being high, medium, or low. I think we all get those
messages by virtue of being at stanford.edu email, like two nights ago; the woman’s
bathroom closure [0:16:25.7 unclear], on Friday night. This was of relatively limited
importance for most of us, but it was sent with high importance. I’m sure I could picture
that lady sitting in the bathroom with her handbag on the floor…
It shouldn’t be how important it is for the sender, but it should be how important it
is for the recipient. Basically, relevance means we have a box, an algorithm, or
whatever you want to call it, which has input sensors where we know which state we are
in, and gives the sender of the message a way to express how important that message is
for the recipient, not for the sender.
Now, if the woman’s bathroom organization at Stanford University decides something is
really important and we say nah, maybe not, the next time they will have a much harder
time of getting through to us.
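The "box" described above, with the recipient's state as input and the sender's claimed importance discounted by how often the recipient agreed in the past, might be sketched as follows. This is a toy under stated assumptions: the class name, the threshold, the 0.5 discount while busy, and the trust update rule are all invented for illustration and are not from the lecture.

```python
from dataclasses import dataclass, field

@dataclass
class RelevanceBox:
    """Toy relevance filter: sender-stated importance for the recipient,
    discounted by the sender's track record and the recipient's state."""
    threshold: float = 0.6                      # assumed cut-off for interrupting
    trust: dict = field(default_factory=dict)   # sender -> trust in [0, 1]

    def score(self, sender: str, stated_importance: float, busy: bool) -> float:
        base = stated_importance * self.trust.get(sender, 1.0)
        return base * (0.5 if busy else 1.0)    # assumed penalty while in a meeting

    def should_interrupt(self, sender: str, stated_importance: float, busy: bool) -> bool:
        return self.score(sender, stated_importance, busy) >= self.threshold

    def feedback(self, sender: str, recipient_agreed: bool) -> None:
        """Recipient says whether the message really was that important."""
        current = self.trust.get(sender, 1.0)
        self.trust[sender] = min(1.0, current * 1.1) if recipient_agreed else current * 0.5

box = RelevanceBox()
print(box.should_interrupt("facilities", 1.0, busy=False))  # True: no history yet
box.feedback("facilities", recipient_agreed=False)          # "nah, maybe not"
print(box.should_interrupt("facilities", 1.0, busy=False))  # False: harder to get through
```

The design choice worth noting is that the sender still states the importance, but the recipient's feedback decides how much that statement counts the next time.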
Getting people to give metadata, to give explicit data about messages is another
theme of this class. At Facebook, we have seen our ability to reveal stuff about
ourselves, or to make it up, or whatever, and also importantly, to reveal stuff about
our friends. Facebook has a symmetrical relationship or pretends to have
symmetrical relationships. I’m not that girl that you called six times and she didn’t call
back. That is a very serious constraint.
Twitter, on the other hand, doesn’t have that notion of symmetrical relationships.
It says, “I do my tweets and whoever wants to follow me can follow me.” Unfortunately,
and it really hurts, neither of them has any notion of relevance. Come on, how
pathetic is using time as the only ordering parameter.
I understand if Facebook wants to go public, it makes sense for them to say we are as
good as Twitter and that’s why we show things in a Twitter stream or Facebook stream.
In a bigger picture, it means declaring bankruptcy. It means declaring we don’t know
how to use the data you all produce. How long does it take you to respond to that
person, that person, or the other person? We don’t know how to use this data so
all we do is show it to you in a chronological order.
I asked you and didn’t give you the time to respond. What other data sources do you
know about? What comes to mind when you hear these stories? I’m in the bright light
and I actually don’t see much of you because the lights are so bright. I see some hand
up there.
0:19:25.8
Student:
I see two ends of the spectrum that a lot of easy to collect data that is not… shared…
possibly share with your friends…
Andreas:
You also had a hand up? Somebody in front?
Student:
Facebook has a lot of private information…
Andreas:
So, Mint, as well as Wesabe at Berkeley, are companies where you just give them your
credit card information and whatever other things, your phone bills, and so on. They
help you out and tell you how much you spend on your credit card. Of course, that
is very interesting information that they then can give other people information
about, like who might be interested in your buying habits. They can help you out in
saying, “No, that credit card would be better matched to your spending habits,” and
maybe collect some revenues from the company as a bounty. That’s the business model
there.
What other examples do you know? If you say your name and your department it will
help us to get to know each other a bit.
Student:
… it’s a lot of the emotions… through webcam… adding this information to your email…
Andreas:
That’s interesting; you try to get to the emotion implicitly. For instance, the quality of
voice is a very strong indicator. I think we all know sometimes our voice sounds different
from other times. I did my PhD in physics here. One of the classes I TA’d in 1988 was
called Physics of Music. Then, as Assistant Professor, I was teaching Music Cognition.
It’s very interesting how different our voice sounds in different situations. That
would be the implicit way.
One thing I learned at Amazon was you might even ask people explicitly how they
are feeling today and they might even tell you. When I mentioned a situation before,
one of the topics in class will be that we will talk about recommender systems. I know
most of you are thinking about the person who bought X also bought Y, and that’s very
boring. That’s what you did in kindergarten. That’s not what we’re going to talk about.
What I’m talking about is recommender systems precisely based on the situation
you are in, based on the state you are in. You need to get something bought so your
girlfriend is happy in Holland, or do you actually really care whether you’re buying that
Canon 5D Mark 2 or whether it’s not worth the money you’re going to spend on it.
How do we figure out what the information is? How do we get this explicitly? Here
is a question for you. Let’s take Mr. Tweet. Mr. Tweet, as you probably know, you follow
Mr. Tweet on Twitter and it suggests people to you.
0:23:13.5
Do you think if you told Mr. Tweet, “I want to get five recommendations,” Mr. Tweet
should give you a different set of people to follow compared to when you tell Mr.
Tweet, “I want fifty recommendations,” and you look at the top five? Do you think
those top five, in the two cases, should be identical, or different? Just to be clear,
I’ll give you another example with exactly the same problem.
mSpoke, a company working with NewsGator, has an RSS discovery tool that shows you
stuff you might be interested in reading, not people, but content. If I tell mSpoke,
“I have ten minutes; show me stuff,” would that be the same ten minutes worth of
reading compared to when I say, “I have two hours; show me stuff” – the top ten
minutes of that? What do you think?
Student:
It’s different.
Andreas:
Why?
Student:
You want something different…
Andreas:
I did an interview with SAS, many years ago, which had the title “We Don’t Know, but
We Can Measure it”. I will ask you many questions in class where I have no idea
what the answer is, but I want you to have the mindset that we can come up with
metrics that allow us to actually figure out whether that method or that method is
better.
One thing I know for the news reading, however, is that if you give me the chance to
actually tell you when I last read news, whether it was yesterday or last year, you
certainly can do a better job by knowing which universe of RSS feeds or news items you
should be looking through.
After break, the M here stands for metrics. Do you know what the rest stands for?
PHAME? P stands for problem, H for hypothesis, A for action, M for metrics, and E
for experiment. That’s what we’ll do after the break. Are there any other questions,
data sets, suggestions, anything on your mind?
Student:
…
0:26:09.6
Andreas:
That’s what they had in the 1980’s, right? The web has come along since. In class,
about five years ago, we had Seth Goldstein, who then had a company called Majestic
Research, which used data by comScore. There were some wonderful case studies he
had. One was about an online used car company. I’ve forgotten the name. By seeing
which cars they sold, i.e. which cars dropped out of the inventory, and monitoring those
cars that dropped out of inventory and were back a week later, which means they were top
cars the managers tried out themselves, he could predict precisely what revenues they
were making in this quarter. They released a press release at 7:34 in the morning saying
they will have big losses. Three minutes later, the company said, “Oh gee, we forgot
yesterday to let you know that we have big losses this quarter,” because the day before
they were announcing that things would be looking great.
What I’m after here is there is a traditional way of firms, with big walls, not letting
anything out besides what they really want to control. There is this new world
here. A friend of mine runs Swiss International Airlines. They have a very positive
attitude towards people blogging. Ultimately, it helps to create transparency and it
helps to create trust when somebody knows why that plane didn’t arrive as opposed to
saying, “We only have our PR department communicating with the public.”
Another example is a problem set we did two years ago in class. There will be problems
for you to do, so don’t think it’s just me talking. That was with a company called
Hitwise. We developed a consumer confidence index, not based on the traditional,
sending out five thousand postcards, hoping for some to come back, asking people,
“How do you feel about the economy,” and tallying it up. Instead, why don’t you look at
ten million people, and all the queries they do and all the websites? Are they
looking for mortgages or whatever the terms are, and then predict consumers’ confidence
based on what people actually do, as opposed to reflecting on how they feel about the
economy? Do you see the difference?
In one case you just have some surveys, a small sample size, huge selection bias.
In the other case, you look at what ten million are looking for. That would be
another example of how much richer data than what people thought about ten
years ago can be used to make predictions, in this case, about the economy.
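The idea of reading confidence off what people search for, rather than what they say in a survey, can be sketched like this. It is only an illustration: the list of "worry" terms, the sample queries, and the index formula are assumptions made up here, not the method used in the class problem set.

```python
# Hypothetical worry terms; a real study would have to choose and validate them.
WORRY_TERMS = ("foreclosure", "unemployment", "bankruptcy")

def worry_share(queries):
    """Fraction of queries that contain at least one worry term."""
    if not queries:
        return 0.0
    hits = sum(any(term in q.lower() for term in WORRY_TERMS) for q in queries)
    return hits / len(queries)

def confidence_index(queries_by_week):
    """Toy index: 100 when nobody searches for worry terms, 0 when everybody does."""
    return {week: 100.0 * (1.0 - worry_share(qs)) for week, qs in queries_by_week.items()}

weeks = {
    "week 1": ["cheap flights", "new car deals", "foreclosure help", "pizza"],
    "week 2": ["unemployment benefits", "bankruptcy lawyer", "pizza", "file for unemployment"],
}
print(confidence_index(weeks))  # {'week 1': 75.0, 'week 2': 25.0}
```

With ten million people instead of eight queries, the same counting gives a weekly signal without mailing a single postcard.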
I didn’t want to put you down. The fact that I took that, I just wanted to make a couple
of points. One of them was really the openness of organizations. A company called
LiveOps (the brother of last year’s TA works there and brought me there to talk
to people) is launching a product tomorrow called LiveWork.
That’s very interesting, in terms of the future of work. The main reason companies
are around is that the transaction costs within the company are much cheaper than
the transaction costs between the company and the outside world. That difference
has disappeared.
0:29:42.7
For instance, for all my classes, I produce a transcript for you. I will talk about logistics
after break. That transcript, of course, is done somewhere in the world. In this case, it’s
Israel. It’s good because when we go to sleep, they wake up, do the transcript, and the
next day you have the transcript. That is a very different notion of the future of work
compared to going to an office. Why should I sit in an office to make a transcript if
instead I can sit on the beach or something? That’s one of the byproducts that we have a
lot of interaction data we collect on the way.
For instance, if I was running LiveOps, with the ten thousand people they have who
answer the phone when you want to have your auto fixed or something like this, say
whenever you get connected to some potential customer, if it’s a girl you score. If it’s a
guy, he doesn’t buy.
What would you do? You can quickly identify from the voice pattern that it’s a girl. She
should much more likely be talking to you than a guy. Understanding those patterns of
interaction and really making the world a better place by making the matches
happen, is one of the tenets I have here throughout class. It is one of the things
we will see again, and again.
Are there any other examples here, from you?
Student:
…
Andreas:
Let’s look at some examples that Google collects. Google data sources would be in
search – can you all see the screen well enough? Is it big enough? Then I don’t have to
read it out for you. Just look at this for a moment, and we’ll discuss what’s on the screen.
This is one example, not yet of Google Analytics. That will come in a moment. This is
what Google is known best for and that’s Google search. It’s pretty amazing how they
leverage the data that you need to give to them. One of the things that is very
important is we think about data strategies.
When I say data strategy, I mean how do we deal with our personal data? I know
whose hands went up about the pictures; not that kind of stuff, but what are we willing to
reveal, what do we expect to get out of it as an expected value, and as the 99.9%, the
0.1%, the really happy thing that could happen or the really bad thing that could happen.
In a data strategy, you don’t do contractual relationship; that is the hard one. It’s not,
“If you tell me your phone number, I’ll give you ten bucks.” That’s not how it works. It
works like, “If you tell me your phone number, I can call you.” That’s much more powerful
than just getting ten bucks.
Google, dammit, we have to tell them what we want to know, otherwise they can’t
give us the answer. This is also one of the secrets at Amazon.com. Whereas, at that
physical retailer, when you go in, you don’t have to really tell them what you’re looking
for; you can say, “Just looking around.” At Amazon, you can’t say just looking
around. You have to enter certain terms or click on certain stuff.
0:33:10.8
The query sequence can be used for spelling corrections. You don’t know quite how
to spell Britney Spears? People learned how to spell it and of course, Google has a
product, “Do you mean ‘blah’?” Choice set is very interesting in that if you have ten
links, which one is it you are clicking at? That has nothing to do with what’s behind
the link. It only has to do with the two lines of abstract that you see.
Query refinements, how, over time, do you actually make a query more precise?
One of my favorite features at Amazon.com is not that people who bought X also
bought Y, but that people who looked at X eventually bought Y. That’s powerful. Why?
Because you have helped people in their decision-making process.
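The difference between "bought X also bought Y" and "looked at X, eventually bought Y" comes down to which events get counted together. A minimal sketch, assuming each session is a time-ordered list of (action, item) pairs; the function name and the sample data are invented here and say nothing about how Amazon actually computes the feature.

```python
from collections import Counter, defaultdict

def viewed_then_bought(sessions):
    """Map each viewed item to a Counter of items bought later in the same session."""
    table = defaultdict(Counter)
    for events in sessions:
        viewed_so_far = set()
        for action, item in events:
            if action == "view":
                viewed_so_far.add(item)
            elif action == "buy":
                # Credit the purchase to everything looked at before it.
                for viewed in viewed_so_far:
                    table[viewed][item] += 1
    return table

sessions = [
    [("view", "camera A"), ("view", "camera B"), ("buy", "camera B")],
    [("view", "camera A"), ("buy", "camera B")],
    [("view", "camera A"), ("buy", "camera A")],
]
table = viewed_then_bought(sessions)
print(table["camera A"].most_common(1))  # [('camera B', 2)]
```

In this made-up data, most people who looked at camera A ended up buying camera B, which is exactly the kind of hint that helps the next person decide.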
Data is worth as much as it helps you make decisions. If data doesn’t influence
your decision at all, you should pay zero cents for it. Has any of you taken Ron
Howard’s class on Decision Analytics, or something like this? That’s great. Does he still
ask you, as he asked me twenty years ago, how much you’re selling your shirt for?
Okay, good; some things in the world don’t change. I say, “I’m not selling my shirt,” and
he says, “Here’s a thousand bucks,” and of course – here’s the shirt.
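The claim that data is worth only as much as it changes a decision is what decision analysis calls the value of information. A minimal sketch, assuming perfect information, a known prior over states, and a made-up payoff table (none of these numbers come from the lecture):

```python
def value_of_perfect_information(prior, payoff):
    """prior: {state: probability}; payoff: {action: {state: value}}.
    Returns how much better you do, on average, if you learn the state
    before choosing an action instead of after."""
    best_without = max(sum(prior[s] * payoff[a][s] for s in prior) for a in payoff)
    best_with = sum(prior[s] * max(payoff[a][s] for a in payoff) for s in prior)
    return best_with - best_without

prior = {"likes": 0.5, "dislikes": 0.5}

# Here the data would change what you do, so it is worth something.
risky = {"stock": {"likes": 10, "dislikes": -10}, "skip": {"likes": 0, "dislikes": 0}}
print(value_of_perfect_information(prior, risky))   # 5.0

# Here you would stock the item either way, so the data is worth zero.
safe = {"stock": {"likes": 10, "dislikes": 2}, "skip": {"likes": 0, "dislikes": 0}}
print(value_of_perfect_information(prior, safe))    # 0.0
```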
There are variations of that, a prostitute says how much would you take… do you want to
know. Of course I’m not. How about a million? Okay, now since we know that you’re
doing it, what’s the price.
We have queries and their refinements. Now we have trends. By knowing that in
certain areas, certain coughs or symptoms get queried quite often, Google really is
in the position to predict the outbreak of diseases much faster than any individual
doctor or hospital can. First, you go to Google before you go to see a doctor.
Second, the sample size that Google sees compared to what the doctor sees is
much larger.
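A minimal sketch of spotting an unusual rise in symptom queries for one region, assuming hypothetical weekly query counts. It uses a simple rolling z-score against the previous few weeks, which is a stand-in chosen for illustration and not the method Google uses; the window size and threshold are assumptions.

```python
from statistics import mean, stdev


def flag_spikes(weekly_counts, baseline_weeks=4, threshold=3.0):
    """Return indices of weeks whose count is far above the preceding weeks.

    A week is flagged when it lies more than `threshold` standard deviations
    above the mean of the previous `baseline_weeks` weeks.
    """
    flagged = []
    for i in range(baseline_weeks, len(weekly_counts)):
        baseline = weekly_counts[i - baseline_weeks:i]
        mu, sigma = mean(baseline), stdev(baseline)
        if sigma > 0 and (weekly_counts[i] - mu) / sigma > threshold:
            flagged.append(i)
    return flagged


# Hypothetical counts of cough-related queries in one area, week by week.
counts = [100, 104, 98, 102, 101, 180, 190]
flagged = flag_spikes(counts)
print(flagged)
```

Because the baseline rolls forward, this flags the onset of the jump rather than every elevated week after it.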
That’s another example for Google Trends. Then, there is the traditional bread and
butter for them, associating keywords with pages. This can all be learned from search.
The second point is Gmail. I still don’t understand how some messages from friends
I’ve been talking to in Gmail for five years by now end up in my spam folder.
Either Google knows something about them that I don’t know, or it’s a broken algorithm.
In principle, we could really do a good job if we have it on the provider level, of
understanding what messages should end up in spam and which messages
shouldn’t end up in spam.
0:35:48.6
But, the next step has to be not to make this binary decision between spam or not
spam, like friend or not friend, just as pathetic, but rank my messages. I told you
the example of your five minutes and what should you see. It’s the same for email.
What are ways to actually come up with “This is what you really should see now”? That
notion of a relevance function, which we’ve talked through many times by now, has
clearly found its way into search and needs to find its way into many other forms
of communication.
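A minimal sketch of ranking messages with a relevance function instead of making a binary spam decision. The feature names and weights are hypothetical, hand-picked for illustration; in practice they would be learned from what people actually open and reply to.

```python
def relevance(message, weights):
    """Weighted sum of a message's features; missing features count as zero."""
    return sum(
        weight * message["features"].get(name, 0.0)
        for name, weight in weights.items()
    )


def rank_inbox(messages, weights):
    """Order messages from most to least relevant."""
    return sorted(messages, key=lambda m: relevance(m, weights), reverse=True)


# Hypothetical features: how often I reply to this sender, and whether it is bulk mail.
weights = {"sender_reply_rate": 2.0, "is_bulk": -1.0}
inbox = [
    {"id": "newsletter", "features": {"sender_reply_rate": 0.0, "is_bulk": 1.0}},
    {"id": "old_friend", "features": {"sender_reply_rate": 0.9, "is_bulk": 0.0}},
    {"id": "colleague", "features": {"sender_reply_rate": 0.5, "is_bulk": 0.0}},
]
ranked_ids = [m["id"] for m in rank_inbox(inbox, weights)]
print(ranked_ids)
```

Nothing is thrown away here: the newsletter simply lands at the bottom, and the friend of five years lands at the top.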
Google Toolbar is seven years old. For seven years, Google has known what
people do at Google as well as what they do wherever they go. That’s not bad.
Google knows; that’s very powerful. I worked with a dating site in Singapore. People
have private notes there, private only to them. They can talk about the people they were
chatting with, like, “majored in chemistry in the US,” or “has beautiful eyes,” etc. Being
at the company, we looked at these notes. It really tells you a lot about how
innocent some people are.
Google Analytics – we had the guy come to class a few years ago who actually did
Google Analytics. Again, this is very powerful. The Google browser bar, the toolbar,
allows Google to know where a given person is sitting. It sits like a parrot on my
shoulder and goes around wherever I go.
Google Analytics sits on my site, www.weigend.com, and this way, even if people
don’t have a browser bar and never would think about logging into Gmail,
Google still knows what people are doing on my page. That’s pretty good: to
have all these different perspectives to basically sniff the digital exhaust.
Google Reader, Google Docs, Google Groups, again social relationships, Google
Calendar, Google Talk, Google Maps, Google… what history. There is a lot of stuff
there. Design Tools is an example for Google. It’s very powerful. If you design those
new blinds for your house, using the design tool that Google bought, what are the
chances that the ad afterwards, about the blinds shop around the corner, will get good
attention from you? That’s a smart move.
Google Checkout is similar to what I said about Google Mail. [Ben Ling’s] child from
two years ago, [0:38:42.9 unclear]; basically it now goes from the cradle to the grave,
or whatever it means when you spend your money. It knows the entire process of
decision making, how from your first search about cameras, you end up finally
buying something.
The feature I love at Amazon, “people who looked at X eventually bought Y,” is of course
restricted to Amazon.com. If you want to see the entire universe, then Google
knows that: YouTube, Blogger.com, a lot of that stuff.
0:39:13.3
To summarize here, we are talking about this data revolution. We had the first one,
which moved from simple transaction data, the Safeway story, to interaction data.
We then had persistent history, which allowed us to move from transaction
economics to relationship economics. What’s the difference?
In transaction economics if I can sell you something right now, I will push it down
your throat, I will get my money, and I don’t care. Airlines do this. Do you want a
ticket? Get a ticket and good luck.
In contrast, in relationship economics, as in class here, if I insult too many of you
today, I’ll be here by myself next week. Relationship economics is something that
trades off potentially lower profit in the short term against the long term. Your
typical model is [0:40:11.6 exponential to K]. That was sort of the first thing: having
persistent ID.
The second thing is what I call the social data revolution, which is people sharing
data, sharing data about themselves. There, only the sky is the limit. And, they
share data about their relationships to others. What we want to do in class is
figure out what the implications are, what can we do and what exciting
opportunities do we have.
Here are some pieces of paper, because there are fish and there is water. With these
pieces of paper, I think of you as the fish in the water. Most people don’t know
water and don’t know fish. I have some questions here for you. I will ask you to spend
fifteen minutes, now before the break, to answer those questions. These questions try to
help me understand what the water is that you are in.
I know it’s not easy to think about the water when you are a fish. I’m not a fish, so I don’t
know, but I can imagine it’s quite hard. I ask you to concentrate for fifteen minutes, fill out
those questions, and take a break. It is now 3:10. I want to be back here at 3:50, which
gives us another hour and fifteen minutes for the PHAME lecture. Are there any
questions?
Student:
How technical do you want …
Andreas:
Let’s do Q&A at the end. Otherwise, if we start now… you can ask me in the class. I set
the end of the class for Q&A. Can I ask you to help me hand this out? Can you go
around … can you hand this?
Okay, the purpose is to be quiet for fifteen minutes, think about the questions, fill them
out, give them back to me, take a break, and we’ll be back at 3:50.