MS&E237
Spring 2010
Stanford University
Andreas S. Weigend, Ph.D.
The Social Data Revolution:
Data Mining and Electronic Business
Andreas Weigend (www.weigend.com)
The Social Data Revolution: Data Mining and Electronic Business:
MS&E 237, Stanford University
April 6, 2010
Class 3: Course Content, Wikimedia, SKOUT
This transcript:
http://weigend.com/files/teaching/stanford/2010/recordings/audio/weigend_stanford2010_3coursecontentWikimediaSkout_2010.04.06.doc
Corresponding audio file:
http://weigend.com/files/teaching/stanford/2010/recordings/audio/weigend_stanford2010_3coursecontentWikimediaSkout_2010.04.06.mp3
To see the whole series: Containing folder:
http://weigend.com/files/teaching/stanford/2010/recordings/audio/
Course Wiki:
http://stanford2010.wikispaces.com
Transcript - Tamara Bentzur - Testimonials – www.tbentzur.wordpress.com, www.outsourcestranscriptionservices.com
Andreas:
Welcome to class number three of the Social Data Revolution. I looked through all of the
non-binding initial proposals for your projects, your ideas, and I identified one very
serious problem: you think it’s very easy to get data. It’s actually not easy.
For instance, Intuit was not able to give data to us because their legal department is
worried. Think about it: if someone makes up a story that during tax season Intuit
is sharing data that could be personally identifiable with a class, it
might make their stock drop by 1%. That is certainly not a risk they were willing to take.
I want to start today’s class with one data source we can happily use, which is the data
by the Wikimedia Foundation. You all know Wikipedia. Who of you has edited an entry
in Wikipedia? That’s a small number. Why did the rest of you never change or edit
anything? Was it too much work, too unclear, too hard?
Student:
The level of organization needed to reformat a page would be too much, so I wouldn’t
bother. Most of the time I wouldn’t find too many errors, I guess.
Andreas:
We have Eugene Kim and [0:01:47 Haway Fung]. [Haway] went to Stanford, and
Eugene went to Harvard. They went to good schools. I hear Harvard is somewhere on
the east coast.
Eugene:
The Stanford of the east coast.
Andreas:
They do work with Wikimedia. I thought I’d take the first 15 minutes today, giving them
the platform, giving us some ideas of what we could be getting in there. In the spirit of
social data, Wikimedia/Wikipedia is a beautiful example of people knowingly and willingly
sharing data. At the very end of class today we have a startup I work with on the board,
SKOUT, which is another example of a company where you can get some ideas about
what you could do with data, which people are happy to share. I will turn it over to you.
Eugene:
It’s a pleasure to be here and to talk with you all about Wikimedia. I want to do a few
more interactive things, just to get … raise your hands. I’m going to ask some questions
and if you qualify, if your answer is yes, I want you to actually stand up. The question is
how many of you have ever accessed Wikipedia? If you’ve ever accessed Wikipedia,
stand up. Very good. Are you not standing because of the laptop?
100% here. If you accessed Wikipedia today, remain standing. Okay, and then we have
the number of people who said they had edited. If you have ever edited Wikipedia,
remain standing. If you have ever edited Wikipedia more than once, remain standing. 3
people, very nice. Have you been editing Wikipedia for more than a month? Remain
standing. 2 people left. How long have you been editing Wikipedia?
Student:
I haven’t done it in the last month but I’ve been doing it for several years.
0:04:18
Eugene:
What are some articles you’ve edited?
Student:
Mainly around different types of web development and geography specific stuff that has
to do with what I do for my web -
Eugene:
Stuff that you’re personally interested in. How about you?
Student:
The same, maybe 3 or 4 years. Not to make a point out of it, but when I’m reading an article that I
got to because I’m interested in it and I see something missing or wrong, I’ll correct
it or add it. Frequently it’s around politics, music, or web development, things I’m
interested in.
Eugene:
Very good, thank you. Everyone here has accessed Wikipedia before. Everyone here
knows you have accessed Wikipedia before. I presume everyone knows what Wikipedia
is. This is your professor’s Wikipedia page. Wikipedia is basically an encyclopedia
that anyone can edit. You can see that edit tab. This is the new Wikipedia interface.
Probably most of you haven’t seen this yet. It will roll out in the next couple of weeks.
This will be what you see.
One of the things you’ll notice about Wikipedia is that it’s edited by anyone. If you look at
the revision history of this particular page, you can see that it was last edited in July of
2009, by JamesAM. Do you know who that person is? Do you recognize any of these
names?
Andreas:
Toby
Eugene:
Toby, who is not on it. He’s probably further down this list. Basically, if you have a
Wikipedia page or if there are things about you, things you’re interested in, it’s quite
possible that completely random people are contributing to pages about you. Somehow
it works, and it’s amazing. On the English Wikipedia there are about 3 million
articles. All of the Wikimedia sites combined essentially make up the
5th most accessed website in the world today, and approximately 400 million people
access Wikipedia every month, which is pretty cool.
What people don’t necessarily know is there is actually a vision underlying
Wikimedia. This vision has actually been in place for a long time. The way it’s
articulated best is: imagine a world in which every single human being can freely
share in the sum of all knowledge. That’s our commitment. It’s a lofty goal.
0:06:43
What I’ve been working on with the Wikimedia Foundation, and the question we’re
grappling with is how are we doing towards that goal. We know we have big projects
now, and we know there are a lot of people who know about it, a lot of people access it;
how close are we to getting to the point where everyone in the world has access to the
sum of all knowledge.
These are questions we’re asking ourselves. What I was tasked to do is to basically
lead an open strategic planning process where we develop a 5-year strategic plan,
meaning we basically decide what the priorities are, what we should focus on over
the next 5 years. If we start with this vision statement, there are a lot of things we can
start pulling out of this. One of them is “every single human being” so that’s essentially a
question of reach. Are we reaching every single human being right now?
This is sort of a theory of change. What it’s doing is articulating all the different
factors that are going to contribute to the goal and their relationships to each
other. You don’t have to worry about the level of detail right here, just the main thing to
point out is there is a virtuous circle between reach, quality, and participation. All
of those three things are heavily related to each other. If we improve the reach of
Wikipedia or the other Wikimedia projects, that is going to increase quality and
increase participation; all of these things are going to reinforce each other.
That’s the theory we’re operating on right now.
In terms of reach, we started by asking a couple of questions. We know that 400
million people are accessing Wikipedia every month. Based on the world population,
that’s about 15%. The question is what is the actual country break down, where are
people accessing Wikipedia from? Based on that country break down, should we
actually be prioritizing certain regions of the world?
There are questions about what kind of content people actually want. If people are
not accessing Wikipedia, or any of the other Wikimedia content projects, is it
because the content they’re actually looking for is not there now? Then there is
the question - I mentioned the virtuous cycle before about how reach is related to
participation. How do we convert readers to participants?
What we saw in the room just now was everyone is a reader, and maybe 2% or so are
actually contributors. Is there a way to boost that number? That number is relatively
consistent to what we know about Wikipedia right now.
This is the worldwide state of Wikipedia right now, in terms of reach. You’ll notice the
dark blue is where we have the best penetration. Are there any Canadians in the room
here? You guys are the most active Wikipedia readers right now, 40%. If you can
believe it, in the United States, only about 35% of people who are online access
Wikipedia. That’s knowingly or unknowingly.
When we think about reach, there is actually a significant population of people who are
online in the United States who are not actually reading Wikipedia at all. Then if you look
at the developing world, look at Africa. Africa is entirely under 30% in terms of access. If
you look at China, India, Brazil, all of those places are clearly possible places where we
can target. This next slide shows Internet growth.
0:10:15
Student:
Do you have any idea who the … who don’t access…
Eugene:
No, we have some clues and this is one of the reasons we’re here today. There are a lot
of questions we have. There are some answers but a lot more we would like to know so
there is an opportunity to take our data, which is basically all publicly and freely available.
That’s one of the ethics of the Wikimedia project, and to do really interesting analysis
based on that.
This is showing the growth numbers. You can see that before, we’re under represented
right now in Africa, China, and India, and yet those are sort of the biggest growing
countries in terms of Internet access right now. These are actually Internet numbers. If
you add mobile as well, mobile boosts those numbers but it doesn’t shift where these
regions are growing. For example, Internet growth is very high in China right now; mobile
growth is very high in China right now.
As you can see from this picture, in terms of prioritization, there are some clear
opportunity spaces we should be looking at over the next 5 years, and one of the things
we’re trying to figure out is how to target those areas.
I want to talk about participation and some of the questions we have. While all of
this content is really great and I think we have enough stuff to start targeting
different areas, and there are a lot of opportunities there, participation is the
lifeblood of Wikipedia. It is what makes everything work. At the end of the day, if
we don’t have people like the people who stood up here as contributors, there is
no content for everyone in the world to access.
One of the priorities that have emerged over the last 5 years centers around this
participation question. Some of the questions we have are how do we encourage
more participation, what are the different types of participation? We talked a bit
about why people are editing Wikipedia or how people are editing Wikipedia, and there
are actually very different levels of activity.
There are people who are just correcting typos every once in a while. There are
also people who are literally spending several hours a day tenderly editing content,
adding content, looking at other peoples’ pages, fixing grammar, adding new
comment, double checking sources, really involved in the community and
participating on the meta level as well; talking to other Wikipedians online, forging
relationships, and all those other things.
0:12:36
How do we encourage new editors to become active editors? A big question for us
right now is around community health so we believe very strongly that there is a
strong correlation between community, the social health of the community and the
quality of the content that emerges and quality of participation that emerges. The
question is how do we measure that. As I pointed out, there is a relationship
between participation and quality, so what is that specific relationship.
Here are some numbers to give you a picture of what we understand about participation
right now. The metric we’ll use to differentiate between just people who have edited
Wikipedia and people who are considered active is this: if you make 5 or more edits
per month, you’re considered an active Wikipedian. That’s an arbitrary number.
We needed to pick something so we picked 5 and we’ve been using that number ever
since. This is about a 4 or 5 year old metric.
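As a rough illustration of the "5 or more edits per month" metric described above, here is a minimal sketch in Python. The input shape (pairs of username and timestamp) and the function name are assumptions made for this example; they are not an actual Wikimedia data format or tool.

```python
from collections import Counter, defaultdict
from datetime import datetime

ACTIVE_THRESHOLD = 5  # edits per month, the cutoff mentioned in the talk


def active_contributors_per_month(edits, threshold=ACTIVE_THRESHOLD):
    """Return {(year, month): number of users with >= threshold edits}.

    `edits` is an iterable of (username, datetime) pairs. This input
    shape is hypothetical, chosen only for illustration.
    """
    # Count edits per user within each calendar month.
    edits_per_user_month = Counter()
    for user, when in edits:
        edits_per_user_month[(when.year, when.month, user)] += 1

    # A user counts as active in a month if they reach the threshold.
    active = defaultdict(int)
    for (year, month, _user), count in edits_per_user_month.items():
        if count >= threshold:
            active[(year, month)] += 1
    return dict(active)


# Made-up sample data: alice makes 6 edits in January, bob makes 3,
# and alice makes 1 edit in February.
sample_edits = (
    [("alice", datetime(2007, 1, day)) for day in range(1, 7)]
    + [("bob", datetime(2007, 1, day)) for day in range(1, 4)]
    + [("alice", datetime(2007, 2, 1))]
)
print(active_contributors_per_month(sample_edits))
```

Changing `threshold` shows how sensitive the count is to the cutoff, which the speaker himself calls arbitrary.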
This chart measures a couple of different things. Number one, the area of each of the
circles represents the size of the different projects. One of the things I didn’t mention
before is that Wikipedia actually consists of 250 different projects. Each language
gets its own Wikipedia. Each of those language versions is essentially its own
community. They have their own contributors, their own governance rules.
Occasionally there is some translation between each of the Wikipedias but for the
most part it’s all original content.
English Wikipedia, which was the first one that started, that big circle to the left, that’s the
biggest Wikipedia right now. There are over 3 million articles. After that, you can see
Germany is close behind and then France, Japan, and the Spanish Wikipedia. You can
see on the y axis, that’s measuring the number of active contributors. Not
surprisingly, the biggest Wikipedias also have the highest number of active
contributors. The x axis is measuring contributor growth.
These larger Wikipedias are actually not growing at all, or growing slightly. Then
we have Russia which is sort of a mid-sized Wikipedia and it’s growing quite rapidly. It’s
an outlier in this stuff. Then we have the smaller Wikipedias here, and in some cases
they’re growing and in some cases they’re kind of dying. They’re kind of [in stasis] right
now. This is sort of a picture of where all the Wikipedias are right now.
One of the things some of you might have read in the news recently is editors are leaving
Wikipedia in droves. It’s a typical interesting media spin on things, but one of the things
that’s absolutely true is that participation across the different Wikimedia projects
has actually tailed off. This is a picture of active contributors, number of active
contributors per month, starting in January 2001, which is when Wikipedia started, all the
way to January 2009.
You can see we basically peaked in January 2007, and all of a sudden we’ve got this
weird behavior going on. Something happened in 2007 and we don’t know what it
was to cause this kind of behavior. There is speculation about maybe there was
something policy wise or some technological change that happened, or maybe the
community has just gotten too big and is starting to get unfriendly, so you’ve hit a
natural limit.
0:16:12
We’ve found, and this is all research that Ed Chi’s group did at Xerox PARC, that when
you look at all of the different projects, around January 2007 you have the same
plateau effect, even though all of the projects are self-sustained, they’re all
different sizes, and they all have their unique community. Something happened in
January 2007 that was probably a worldwide phenomenon that basically affected
all of Wikipedia growth. We don’t know what that is. We have some guesses and it
will be interesting to explore that. That’s also a potential project.
This is showing the same kinds of effects for some of the smaller Wikipedia projects.
This is a chart that Ed Chi’s research group also did, and it measures
reversion rates. One of the things you can do in Wikipedia is if you see someone has
edited a Wikipedia page and you think it’s a bad edit, you can revert it so it goes back to
what it was previously. These different colored lines measure the activity of the
user. These purple and blue lines below are people who are very experienced editors,
people who have edited between 100 and 1,000 times or more. In this red line up top,
you see people who have made one edit, and the line below it is people who have made
fewer than 10 edits.
On the left, we’re measuring the rate at which people’s edits are reverted. I come in and
edit a Wikipedia page for the first time. If that edit sticks around then that’s great. If the
edit gets reverted, that counts in terms of the reversion rate. What we’re finding is
there is actually a big class difference happening now on Wikipedia between new
editors and experienced editors. New editors tend to get punished, or not,
because of the reversion rate. Experienced editors seem to sort of get away with
that.
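The reversion-rate comparison described above can be sketched as a small Python computation: group edits by how experienced the editor was at the time, then divide reverted edits by total edits in each group. The bucket boundaries and the input shape (pairs of prior edit count and a reverted flag) are assumptions for illustration only, not the method used in the research the speaker cites.

```python
from collections import defaultdict


def experience_bucket(prior_edits):
    """Map an editor's number of earlier edits to a coarse bucket.

    The boundaries below are illustrative assumptions.
    """
    if prior_edits == 0:
        return "first edit"
    if prior_edits < 10:
        return "fewer than 10"
    if prior_edits < 100:
        return "10 to 99"
    return "100 or more"


def reversion_rate_by_bucket(edits):
    """Return {bucket: reverted edits / total edits}.

    `edits` is an iterable of (prior_edit_count, was_reverted) pairs,
    a hypothetical input shape chosen for this sketch.
    """
    totals = defaultdict(int)
    reverted = defaultdict(int)
    for prior_edits, was_reverted in edits:
        bucket = experience_bucket(prior_edits)
        totals[bucket] += 1
        if was_reverted:
            reverted[bucket] += 1
    return {bucket: reverted[bucket] / totals[bucket] for bucket in totals}


# Made-up sample data: newcomers are reverted more often than veterans.
sample = (
    [(0, True), (0, False), (0, True), (0, False)]
    + [(5, True), (5, False), (5, False), (5, False)]
    + [(500, False)] * 4
)
for bucket, rate in reversion_rate_by_bucket(sample).items():
    print(f"{bucket}: {rate:.0%}")
```

A gap between the buckets, as in the sample, only shows that the rates differ; it does not by itself say whether the cause is lower-quality newcomer edits or an unwelcoming community, which is the open question raised in the talk.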
There is a question about whether or not reversion rate is an indication of
community health. Perhaps it’s actually a sign of increasing quality on Wikipedia,
the fact that you’re getting a lot of random people who are coming on board, and in
fact there is a lot more noise. You have to revert more frequently. Or, maybe there
are other things that it’s indicating. These are questions we’re trying to figure out.
That leads to the last thing I wanted to mention, which was around quality. I don’t
have any charts to show you on that right now, but I can cite different studies around this.
The main questions we have right now are how do we measure quality of content,
how do we measure the quality of experience? Quality is not just about content,
but peoples’ experience on the site itself. How do we indicate the quality of
content?
It’s all well and good for us to do a research study and to say that some percentage of
Wikipedia content in this category of content is very accurate. In fact, we want to take
that content, that information and leverage it so that people who are looking up
information can actually see what’s high quality or not. A lot of interesting things
could come out of that.
0:19:25
Number one, it gives people more faith in terms of the accuracy of the content.
Number two, if you’re a contributor, maybe one of the things you want to look at is
where are there a lot of articles where they’re just kind of mediocre quality because
that is where I might want to go and contribute to Wikipedia.
I just want to throw out a couple of things about where we’re at right now. One of the
interesting things about Wikipedia is that we collect a lot of data. As I said before,
we’re making that data available to anyone and everyone and there is actually a
nice research community that’s emerged and doing a lot of interesting studies
around Wikipedia. But, there is huge room for more research. We have way more
questions than we have answers for and I’m quite sure there are a lot of questions that
we haven’t even asked yet. There is a tremendous opportunity to actually use this as a
research tool, to explore different things and I think build business opportunities … really
participate in this ecosystem of information. I just wanted to open the floor and see if
anyone had any questions about some of the data here, or any of the broader questions,
and let’s have a little discussion.
Andreas:
Let me show you the agenda for today. We had this as the first social data source. I
want to spend the next 20 minutes explaining to you, class by class, what we’re going to
cover. There has been a lot of confusion and I think the best way is to talk you through
the quarter.
Then we have another data source, SKOUT, as another example for you to get your
creative juices flowing about what you might be doing. The last 15 minutes I’m going to
address what you need to do, what are your deliverables. The web page is up and pretty
clean. I will walk you through the wiki primarily with the TAs. We’ll all be clear about
what’s expected for the rest of the quarter.
In the next class, we’ll talk about recommender systems. Recommender systems
date from early on in ebusiness: rather than having experts figure out
what you should be buying or merchandizing, an algorithm figures out what you
should be buying, based not only on your past behavior, but on the past behavior of all
the company’s customers and in some cases even beyond an individual company.
A lot has happened in recommender systems since the early days of Amazon. I will talk
about that on Thursday.
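To make the idea of recommending from "the past behavior of all the company's customers" concrete, here is a minimal sketch of one simple technique, item co-occurrence counting. It is offered only as an illustration; it is not the algorithm of any particular company, and the data and function names are made up for this example.

```python
from collections import Counter
from itertools import combinations


def build_cooccurrence(baskets):
    """Count how often each pair of items was bought by the same customer.

    `baskets` maps a customer name to the list of items they bought.
    """
    counts = {}
    for items in baskets.values():
        for a, b in combinations(sorted(set(items)), 2):
            counts.setdefault(a, Counter())[b] += 1
            counts.setdefault(b, Counter())[a] += 1
    return counts


def recommend(customer, baskets, top_n=3):
    """Suggest items the customer does not own, ranked by how often they
    co-occur with the customer's items across all customers."""
    owned = set(baskets[customer])
    cooccurrence = build_cooccurrence(baskets)
    scores = Counter()
    for item in owned:
        for other, count in cooccurrence.get(item, {}).items():
            if other not in owned:
                scores[other] += count
    return [item for item, _ in scores.most_common(top_n)]


# Made-up purchase histories.
sample_baskets = {
    "ann": ["book", "lamp"],
    "ben": ["book", "lamp", "desk"],
    "cat": ["book", "lamp", "chair"],
    "dan": ["book"],
}
print(recommend("dan", sample_baskets, top_n=1))
```

Here "dan" has bought only a book, and the lamp is suggested because it appears alongside the book most often in other customers' histories.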
Since I slotted in 2 data sources, the two Economist articles I assigned last class, we will
talk about briefly on Thursday. Today we don’t have time for it. Some people had
problems posting their insights on Facebook.com/socialdatarevolution. Have those been
resolved? Remember, I want a few crisp lines about one insight or something that
was triggered by reading those two very good articles. Post it on
Facebook.com/socialdatarevolution.
0:23:00
Next Tuesday, we’re going to talk about content discovery. In some way you can
view recommender systems as product discovery. Content discovery, and I’ll have
Dan Olsen, who is the CEO of YourVersion come to class and talk with us briefly
about what his vision is and having YourVersion of the web, your discovery
engine. Think about StumbleUpon v2.
If you have questions now, this is a good moment if you don’t know what I’m talking
about, to ask. The backgrounds are so diverse which I realized when I looked at what
you sent in, that if people don’t know what I’m talking about, ask.
We did content discovery. Then we’re going to do people discovery. The example
there is MrTweet. MrTweet looks on Twitter to find who you might be interested in
following. If you haven’t used it yet, I suggest you do use it. It is really pretty
powerful regarding who you didn’t know was on Twitter and now you get an
interesting source of news from them.
Then we’re moving on to real time web. I sent out the email this morning and I listed 5
different things about what happens in one minute. These are pretty big numbers. You
remember 20 hours of YouTube content gets uploaded in 1 minute. We’re going to
look more at the real time web in class 7. We’ll have Todd Levy come from New
York, who is the Chief Product guy at Bit.ly.
Bit.ly is very powerful because it measures the attention people pay to the web,
independent of a specific website. Do you know, Bit.ly is this URL shortener so it’s
convenient for using but the power is that they now know where the world’s
attention sits. It’s very interesting.
Class 8 will be about relevance and metrics. Relevance in search is well established.
Search relevance, there are a lot of papers. What about discovery relevance? How
can we measure that? What are metrics of relevance here? Metrics is something I
used to spend more time on, but we just have so many new and interesting topics that I
just condensed it to half a class. For people who think about organizations, it’s
thinking about why organizations are so unwilling to actually talk about metrics. Is
it the fear of the middle manager that if there are some clear metrics out, he might
be obsolete? Is it that people don’t understand basic statistics? It is something I’ve
been grappling with, working with big companies. I’ll share that with you.
Class 9 is about mobile, the mobile as a data collection device. It’s very powerful.
Nothing is closer to us than our mobile. There I expect some product/project ideas to
really be in the mobile space. I’m not just talking about geolocation here. I’m really
talking about the mobile as a data collection device for entering something
explicitly or implicitly.
0:26:40
Class 10 is marketing. Last year I gave the keynote at the World Marketing Forum.
You saw some of the slides in the first and second class, which I put together for
that. On a high level, I do want to share that with you as well. For me, marketing
has dramatically changed from creating messages and finding ways to push that to an
audience, which usually doesn’t want to hear them, to actually co-creating the message,
even in a Wikimedia for instance, co-creating the message with the audience.
Social data means people create and share data. What makes people create and
share data? How can we design incentives such that people actually do
something, primarily for themselves but also for the larger community? That’s why
you talked first, because it’s a beautiful example of how we can understand that on
Wikipedia and Wikimedia.
Game dynamics is part of the lecture on what badges do you need to give to
people or how do you need to give them reinforcement. I think it’s a very applicable
lecture because people who build something have to grapple with the question of
how to create incentives. For instance, self metrics are more important than monetary
incentives. What game dynamics, what do we know from games that make them
addictive? How can we apply that?
Class 12 looks at collaboration. Incentive and design, so far, is pretty much for
individuals to contribute, but how do we get people to share? How do we get to
something where the whole is more than the sum of its parts? Right now I’m trying
a piece of software where the transcripts for each class can be annotated by everybody
worldwide, so it’s not ready for launch but I think that’s what we’ll be doing for the
transcripts. Just to see what’s good, bad, unclear, put a post-it note on the transcripts of
each class.
Classes 13, 14, and 15 are classes I developed for Berkeley, for the School of
Information, which is about the individual. It’s about relationships, and it’s about
community, the collective, collective intelligence, and things like that. The
individual we think we know what we’re talking about but do you just want pseudonyms
or do you want to have real name identities? What have we learned there? When do
we need one versus the other? When do we actually need to authenticate
ourselves? Part of the lecture on the individual is privacy. I think any conference I
go to, the privacy discussion always surfaces and that’s something that is really
constantly evolving. That’s a lecture I’m looking forward to.
Then relationships is the 14th class. Don’t just think about going out with
somebody. The future of relationships means between people and people, like
edges or arcs of the graph. It’s also relationships between people and companies,
people and products, people and brands. I will have Auren Hoffman, who runs
Rapleaf, come and share some stories about how they’re figuring out a risk score, which
is not based on what you do as an individual but what can be inferred by your
relationships to other people and understanding whether they’re good citizens or not.
0:30:40
The 15th is about the collective community and then the 16th is group dynamics. I
wrote this outline before we talked in the car today. You call it site or community health.
Group dynamics, I think interesting projects could be lying there by trying to analyze how
we can influence the community to actually work well as opposed to drifting into
negativity.
The 17th class I learned from you because a number of people mentioned they are
interested in cities. We’ll do one class, the last real class about smart cities. Colin
Harrison runs the project at IBM Research. It is very interesting how much can be
learned, and how much behavior can be influenced, by measuring things in cities.
I was quite surprised a couple people quoted Richard Florida. You know the saying “the
world is flat”? Basically it has the same meaning everywhere. Florida counters that by
saying the world is spiky, that the world here in Silicon Valley is probably similar to the
world at Harvard, but probably very different from rural China. Or, in China, rural China is
super different from the coastal region. The spikes are always similar. Why do
people meet? Because they like the interpersonal interactions. So smart cities, a lot
of examples trying to steer traffic for instance. In Singapore ERP, having flexible rates;
what’s acceptable about people and what can we measure.
The 18th class will be student presentations. Scott Johnson from Alloy Ventures is
going to sponsor that and will be there. We’ll have somebody from Benchmark,
from Accel, and from Founders Fund. These are friends, good people. Not all groups
get to present. The TAs and I will hand pick who will be allowed to present. If we think
some project is not that interesting for them to hear, we’ll read it but we won’t give you
the slot.
The very last class I call Festival of Data. I got the URL a couple of days ago. That
will be with Peter Hirschberg, who also has Sun foundation in San Francisco. They’re
running a whole festival of data in the summer in San Francisco this year. That’s about it.
That brings us to the end of the quarter.
We did think about it. It’s always a choice about what I put in. I’m very comfortable about
the flow. I would like to know if you have any questions. Do you feel something is so
sadly left out that you really wanted to hear? Do you think some things are a total waste
of time? Quick discussion would be good.
Student:
0:33:49
Andreas:
We could put it, when I say game dynamics; we could put someone from Zynga here for
20 minutes. Who are actually gamers here? Stand up. What about the rest?
Student:
Social gamers means you play the games on Facebook or MySpace. A gamer you play
on Xbox. It’s less of a social aspect.
Andreas:
MMORPG, massively multiplayer online role-playing games, is probably what
people had in mind, versus social games you do by yourself? Does anybody do social
games by themselves? Stand up. So no social gamers, so it's MMORPGs.
Student:
0:34:59
Andreas:
It’s not clear. I think we’ll do some of it when we talk about game dynamics. That’s the
logical spot.
Student:
Do we get to check out the business models for 0:35:19 business in this space?
Andreas:
Business models are pretty much underlying everything we talk about. For instance for
games right now, the business model is that the game is free, coming from China or
Korea, but you play for virtual items. Online, it’s very interesting what works in mobile
marketing and what doesn’t work. I think there is nothing radically new in business
models. Don’t expect that in the end of the class you’ll realize the world is very different
from the way you thought it was at the beginning of the class. What is new is we have
more data-driven business models and we can validate things more. Business models
are part of the conversation but there is no class on business models.
Student:
When you were talking about privacy, are you going to also talk about some of the newer
applications that got inspired through all this data? I know there are ownership questions about
privacy, who should we … data to other data, but there are other more practical ones - we go out
there, the general guy goes out and tries to start scraping all this data.
Andreas:
I believe people need to understand the tradeoffs they’re making. There is the
privacy/convenience tradeoff. You give up your privacy for convenience. I think
we’re pretty good at this. I will quote one paper which is the use of auxiliary information
which has some striking results. I think it’s more of a discussion class. That’s something
most of us have some stake in because we’ve thought about it. Try the attempt
Facebook has, in explaining the privacy settings. It shows you how difficult it is for a
company to do a good job there.
Student:
Were you thinking of doing anything on healthcare?
Andreas:
We would need to throw out something. We had half of a class last year at the end on
healthcare. When I talk about mobile, the perspective I have is of quantified self. Kevin
Kelly’s group. But I’m not going to talk about health records and stuff like that. Two
years ago we had one class on DNA sequencing, 23andMe being a good example, but I
think this is really different now. It’s about people knowingly and willingly sharing data,
which are not like DNA data. If people feel we need to throw out something and make
space for healthcare, I have some brilliant ideas, another startup, happy for a discussion.
Student:
0:38:04
Andreas:
Can you come closer to the front?
Student:
There are 0:38:32.
Andreas:
This is not a computer science algorithms class. I will talk about the underlying data. For
instance when we talked about recommender systems, we do talk about algorithms, but
it’s not an algorithms class. If you want to learn about algorithms, this is not the place.
Student:
0:38:55
Andreas:
Toby Segaran’s book is a very good book. If you need a couple of chapters for the
homework assignment in a week and a half, I can organize those for you, but it’s a good
book.
Student:
0:39:13
Andreas:
Toby Segaran’s book, Programming Collective Intelligence. It’s an O’Reilly Media
book. It should be on the homework we’re giving out as the Python homework.
Any other comments, questions?
We will have more guest speakers and I’d like to keep it short, like this which is a very
good presentation, as opposed to devoting an entire class to a guest speaker. I hope
that’s in your interest as well.
Christian, where are you? Come set up. For the rest of today, we'll learn about another social
data source, and we have 10 minutes, not more than 15. At the end I will talk to you,
when Jeremy should be here, about what we need to get from you. If you haven't
seen it yet, the table on the wiki tells you exactly which date each homework is due.
In full disclosure, I’m on the board of Christian’s company. It’s SKOUT. I’m not pushing
for any companies here. I’m showing this to you as a potential source of a company to
get data from to do your project. The project is important.
Christian:
Hello and my name is Christian Wiklund. I’m the Founder and CEO of a company called
SKOUT. We are in the love business so we connect guys like you as easily as possible
on cell phones. Andreas is on the board and I am sure you are going to be in for a treat
here in his class. I’m sure you’ll have a lot of fun.
What we do is quite simple. We take the GPS coordinates from where you are based
and we connect you with interesting people in the area so you can reach out to
them in real time, and set up a chat, maybe a virtual gift if this person is cute.
Maybe you’ll take them out for a date. It’s about the time and space continuum.
That’s where we work, so it’s about when and where you are.
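The "who is near me" step Christian describes can be sketched as a distance filter over GPS coordinates. This is only an illustration, assuming each user has a single latitude/longitude pair; the function names, the sample users, and the 5 km radius are hypothetical and are not SKOUT's implementation.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two (lat, lon) points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def nearby_users(me, others, radius_km):
    """Names of users within radius_km of `me`, closest first.

    `me` is a (lat, lon) tuple; `others` maps a name to a (lat, lon) tuple.
    """
    hits = []
    for name, (lat, lon) in others.items():
        distance = haversine_km(me[0], me[1], lat, lon)
        if distance <= radius_km:
            hits.append((distance, name))
    return [name for _, name in sorted(hits)]


if __name__ == "__main__":
    # Hypothetical users: one a block away in San Francisco, one in New York.
    users = {"a": (37.7750, -122.4195), "b": (40.7128, -74.0060)}
    print(nearby_users((37.7749, -122.4194), users, radius_km=5))
```

A real service would also need the "when" part, for example by keeping only positions reported in the last few minutes.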
Forget about online dating like Match.com and these more boring sites. This is for a
younger, hipper demographic. Right now, we’re the largest mobile dating company in the
country so it’s going quite well for us.
0:42:53
SKOUT is one of our brands. Everyone is welcome. Then we have a niche brand called
[Boy Oh Boy], which is for the gay community. It’s like a rocket in San Francisco. It’s
going really well.
What are the user benefits with these new types of services? When you think
about online dating versus real world dating and how it works, in online dating you
have a huge pool of inventory. I can search and match make between potentially
millions of girls in my case. I can narrow it down to Asian girls that I like. Maybe I like
slightly older girls, a cougar hunter, so I can filter it there and quickly figure out who will
enjoy meeting a Swedish guy. That will help me slightly decrease the rejection. That is
the nice thing about online dating [0:43:5] experience there. It’s been around, the
founder of Match.com and he’s an adviser to us. He started it in 1996. It’s almost as old
as the Internet. The drawback with online dating is it’s pretty slow. I don’t know if
you’ve tried it but it’s asynchronous communication back and forth with emails, and
it’s not very engaging. Of course, you can’t bring it with you.
If you look at the bar and how it works, you have a few drinks, go up to the cute
girl/guy and say hi, and it’s instant. You get instant feedback, maybe instant
rejection, maybe instant positive feelings. But, the drawback there is you have a
limited inventory. You’re sort of restricted to the physical boundaries of the room
you’re currently in.
If you can marry these two worlds, and take the best from them, that’s SKOUT. We
have a lot of interesting data. We’re a startup, small team, a lot of interesting problems
and opportunities to work on. I’m more than happy to bring on a few groups here to do
some cool projects to mine our data and maybe even solve some of the issues we have.
One issue is how do you identify porn posters and prostitutes. That’s one serious
issue, how do you create a reputation system around the whole social engagement
model we have, how do you filter out - there are a few interesting categories of people we
can find. You have guys pretending to be girls, looking for girls because they want to
have some interaction there virtually. You have web scammers. In the Philippines you
have these Internet cafes full of affiliates to webcams.com for instance, so they’re trying
to up sell webcam shows to our users. How do we identify the trash from the real content
and create long-term healthy base for growing really big? That’s one interesting problem
that needs to be solved.
Unfair distribution of attention is another one. I would say SKOUT works as any bar
on steroids. The sexiest people get enormous amounts of attention so some guys will
have 500 inbound flirts in one day and some less fortunate guys may have 0-1 flirts per
day. Obviously, if you get 500 people hitting on you in one day you’ll probably give up.
It’s too much. The guys that don’t get any flirts are not happy either so how do we
distribute attention in the network to make it more fair and balanced?
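Before trying to rebalance attention, a project group would need a way to measure how uneven it is. One common choice is the Gini coefficient over flirts received per user. The sketch below assumes a plain list of daily inbound-flirt counts; the sample numbers are made up for illustration.

```python
def gini(counts):
    """Gini coefficient of non-negative counts.

    0.0 means everyone receives the same attention; values close to 1.0
    mean a few users receive almost all of it.
    """
    values = sorted(counts)
    n = len(values)
    total = sum(values)
    if n == 0 or total == 0:
        return 0.0
    # Rank-weighted sum over the sorted values, with ranks starting at 1.
    weighted = sum(rank * x for rank, x in enumerate(values, start=1))
    return 2.0 * weighted / (n * total) - (n + 1.0) / n


if __name__ == "__main__":
    # Made-up day: one user gets 500 flirts, most get 0 or 1.
    flirts_per_user = [500, 1, 0, 1, 0, 0, 1, 0]
    print(round(gini(flirts_per_user), 3))
```

Tracking this number before and after a change to how profiles are surfaced gives a simple way to check whether the change made the distribution more balanced.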
0:46:47
One thing we can do since we’re location based is, with historical data and
with real-time data, figure out where you should hang out tonight. Think about
heat maps. That is something we want to launch for iPad, so that you have a very sexy big
map that shows I’m Christian, 29 years old, looking for Asian girls in San Francisco
tonight, where is the highest likelihood of me encountering someone who would like me?
That’s one interesting problem set that could be looked at.
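The heat map idea can be sketched by binning location sightings into a grid and counting sightings per cell. This is a minimal sketch, assuming sightings arrive as (lat, lon) pairs that have already been filtered to the profiles a user is interested in; the 0.01 degree cell size and the function names are arbitrary choices for illustration.

```python
import math
from collections import Counter


def cell_of(lat, lon, cell_deg=0.01):
    """Map a coordinate to an integer grid cell of size cell_deg degrees."""
    return (math.floor(lat / cell_deg), math.floor(lon / cell_deg))


def heat_map(sightings, cell_deg=0.01):
    """Count sightings per grid cell; `sightings` is an iterable of (lat, lon)."""
    return Counter(cell_of(lat, lon, cell_deg) for lat, lon in sightings)


def hottest_cells(sightings, top=3, cell_deg=0.01):
    """The `top` cells with the most sightings, as (cell, count) pairs."""
    return heat_map(sightings, cell_deg).most_common(top)


if __name__ == "__main__":
    # Hypothetical sightings: two close together, one across the bay.
    points = [(37.7751, -122.4194), (37.7759, -122.4199), (37.8044, -122.2712)]
    print(hottest_cells(points, top=1))
```

Combining historical and real-time data, as described above, could be done by building one counter per time window and weighting recent windows more heavily.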
How to create finding Mr. Right right now. It’s about instant connection with
people but there might be more, deep structure data that is also interesting to look
at. We don’t match on a lot of dimensions. It does not take you half a day to fill out a big
form of exactly who you are. It’s more about the conversation, there might be interesting
stuff to add into our network of how do we create the now experience versus finding
something that could be more long term. I would say typically our users date multiple
persons. They’re not looking to get married so it’s more casual.
Optimization of sales funnel - this is something we’re doing continuously but we’re not
the smartest brains on Earth so I’m pretty sure some guys here would be able to bring
some real value to the table to optimize how we buy the cheapest traffic, what traffic
monetizes the best, and how do we monetize them and so forth. Basically, we optimize
the whole sales funnel including retention rates, how to get people to come back to the
service and so forth.
One interesting study I think will be behavioral patterns versus demographics. We
had [BoyOBoy] versus SKOUT. We have different ethnicities, different age groups. It
peaks around 24 or 25, but we have a range. How does Tulsa, Oklahoma compare to
New York City, as far as engagement goes? Do people meet up more there or less
there, how many chat messages do they send to each other, and so forth? There is a lot
of interesting stuff that can be studied in that.
Local marketing campaigns - we have not focused much on going to one city and
pushing. I think there is huge opportunity for that, so that could be an interesting
marketing problem to solve, how do you get as much local penetration as possible
in the boundary, geographically?
There is probably tons of stuff you can come up with that would add value to this class
and also help SKOUT to continue its success. If you want to email me, that’s my email
address. That’s about it. If you have any questions, I’m happy to take them.
Student:
Do you have a revenue model in mind?
Christian:
We’re revenue generating and it’s going quite well. How we’re monetizing is a virtual
economy plus premium features. You can send virtual gifts and if you could figure out
who in this room thinks you’re hot, you would pay for that, right? So stuff like that, we
charge for.
0:50:25
Student:
Have you tried anything for making the distribution more even? For one thing, can’t you
tell by the incoming flirts someone has how hot they are? Creating people on a hot list
that only allows hot people to flirt with other hot people?
Christian:
We have played with all of these things in our minds, but we haven’t executed on
anything. I have hundreds of problems and things we want to do but we don’t have the
resources now to attack all of them. I do think it could be very important for retaining
users. We need to get hooked up as quickly as possible.
Student:
I had a question about the scammer thing. Have you actually implemented ideas on how
to catch posers? I feel like that’s a big issue with instant messaging in general. I’m
curious to find out how you handle that issue.
Christian:
There are some things we can do. We haven’t done much. We have been growing
quickly and it’s taking us by storm so all of these issues are just lying on the table and
we need to quickly fix them. You won’t get the web scammers to go for your community
if you’re not big. Now we’re growing large and they wouldn’t stay on if they couldn’t
monetize. You could potentially block off complete countries in the IP range, but that
might not be the best way of doing it. Maybe you have the community flag content, self
policing, but we’re open for suggestions.
Student:
Wouldn’t that be kind of bad though? I’m a user - a girl that I think is hot and it turns out
to be a guy? If I flag them in my phone it’s already too late. I’ve never met this - it’s not
really going to help my case.
Christian:
It happens to me in San Francisco all the time. I have girls who are not really girls come
and hit on me. I still go out. I’m not scared.
Andreas:
One more question and then we need to move on.
Student:
I was wondering on the issue of how to block users who aren’t real users or scammers,
why not require everyone who signs in to use their Facebook name?
Christian:
That’s a great idea, but Facebook Connect doesn’t screen out scammers and bullshit, so these
web scammers have accounts and hundreds of friends already. We tried that and I see a
pretty great inbound spam stream on Facebook now. I get tons of girls trying to connect
with me that I’ve never met. Maybe I did meet them…
Andreas:
The last 10 minutes we’ll talk about what we need you to do. The two TAs and I will split
that presentation. First of all, the wiki, Stanford2010.wikispaces.com, is going to start
with the next class so each class, two people will sign up to actually bring by midnight the
day after the class - the Thursday class will be by Friday night - to bring up the basics of
that class into the wiki. It’s very difficult for people to add stuff if there is nothing there but
it’s very easy for them to add stuff if there is a skeleton there. I always commit to giving
you my notes, which are what I have here basically, and the end result is very powerful.
0:54:24
If you look at the wikis from the last years, I think people do a good job because it’s very
easy for them to do it. It is class editable. You should have all gotten invitations to
Stanford2010.wikispaces.com. If anybody did not get an invitation, then see Jeremy.
That starts this Thursday so if someone wants to volunteer for Thursday now, we can
take names down. From then on it will be signing up on the wiki.
Jeremy:
We’re just going to add the pages in the wiki and then put your name on whichever one
you want to do.
Andreas:
Why don’t you introduce yourself and what you do in MS&E? He works with Ron Howard
who is one of my absolute heroes.
Muhammad:
My name is Muhammad Aldawood. I’m a PhD student in 0:55:31 department in the
statistical analysis group. I will be one of the TAs for this class. I wanted to introduce the
homework. We have a total of 4 homework assignments in this class. All are in the next
4 weeks. The first one is today and is due on Thursday. It’s very simple. You have to
build your own website and use Google Analytics for that. It shouldn’t take a long time.
There are some instructions on the wiki.
Andreas:
I wanted to break down people’s fear with this homework, like GSB people’s fear, that it
might be difficult to bring up a web page and to [0:56:15]. I don’t expect you to have lots
of traffic to that page, but you read it, you want to be able to see that happening
somewhere. Your time is less than half an hour.
Muhammad:
It shouldn’t take that long.
Andreas:
It’s not hard. Don’t worry about it. Do it and send a screenshot of Google Analytics with
at least one hit or something. The second homework is much harder. It’s a
recommender system for Delicious. The third homework is analysis of Bit.ly data. I’m
interested as an engineer in what can we do to predict how long it takes for a URL to
decay. Some things are like straw fire, high and gone; others sort of meander around.
Can you find a model? We’ll make it easy enough so you don’t have to work too hard on
that one. The fourth homework is Twitter. That’s a great one. Come up with a harness
of how you evaluate whether it’s interesting to meet someone on Twitter, to discover
somebody on Twitter. Then implement something simple. We get white listed for all the
students at Twitter. There are no barriers there. Ask some friends how they feel about
the recommendations. It’s a pretty good real-world project, coming up with something,
implementing it, and then evaluating it.
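For the Bit.ly question of how long a URL takes to decay, one candidate model is exponential decay, clicks(t) = A * exp(-k * t), fitted by a straight line through log(clicks) against time. The sketch below assumes hourly click counts and is only one possible starting point, not the homework solution; a "meandering" URL would fit this model poorly.

```python
import math


def fit_half_life(hours, clicks):
    """Estimate the half-life (in hours) of a URL's click rate.

    Fits log(clicks) = log(A) - k * t by ordinary least squares and returns
    ln(2) / k. Returns math.inf if the fitted trend is not decreasing.
    """
    # log(0) is undefined, so hours with zero clicks are skipped.
    points = [(t, math.log(c)) for t, c in zip(hours, clicks) if c > 0]
    if len(points) < 2:
        raise ValueError("need at least two hours with non-zero clicks")
    n = len(points)
    mean_t = sum(t for t, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((t - mean_t) ** 2 for t, _ in points)
    if sxx == 0:
        raise ValueError("need at least two distinct time points")
    sxy = sum((t - mean_t) * (y - mean_y) for t, y in points)
    slope = sxy / sxx
    if slope >= 0:
        return math.inf
    return math.log(2) / -slope


if __name__ == "__main__":
    # Synthetic "straw fire" URL whose click rate halves every 3 hours.
    hours = list(range(10))
    clicks = [1000 * 0.5 ** (t / 3) for t in hours]
    print(round(fit_half_life(hours, clicks), 2))
```

Comparing the fitted half-life across URLs, and checking how well the line fits, is one way to separate the straw-fire links from the ones that meander.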
Muhammad will be the one who is mainly the contact person for homework and I’m very
grateful for that. Any question about homework?
Student:
…
Andreas:
Homework is individual homework. These are not groups.
0:58:18
Student:
Can we work with other people?
Andreas:
Of course, you can help each other but you need to submit the page you made with a
screenshot you created, as opposed to someone else’s page. I believe in people
learning from each other but for these homework assignments ultimately each person needs to go
through them on their own.
Any other questions? Jeremy and I worked very hard until the sun rose on Monday
morning. Thank you, his girlfriend didn’t dump him, I hope, on dog food. It’s his idea and
it’s a great idea.
Jeremy:
One last thing on the homework, I’d recommend you take a look at homework 2 before
Thursday. The dog food, the idea is for you to get firsthand knowledge of the Social Data
Revolution. It’s the difference between reading an analyst’s report about Apple’s iPod
and owning an iPod and playing with it. These are all supposed to be an hour or two,
they’re going to be fun to do and get involved. They’re not supposed to be arduous. The
first one will be for next Tuesday, YourVersion. The idea is there are a lot of different
things going on with news discovery. Digg, Reddit, YourVersion, there are a lot of these
different things. We chose one, YourVersion, and we’ll have you do a small dog food on
that. We’ll post that on Thursday.
The other ones are after the homework and while you’re doing the project. That’s
collaboration focus with Google Wave. [1:00:22] is the latest Q&A platform that’s coming
out. There are things like Hunch and things like that so there will be an interesting project
there. Lastly, Groupon is one of the most successful companies going on right now so
we thought it would be interesting to take a look at what they’re doing.
We’ll post the dog food on YourVersion on Thursday, but these are not meant to be
arduous. It’s to get you involved in everything that’s going on. You hear about Twitter
and Facebook. You really need to create accounts, use the service and get data.
Andreas:
I look forward to doing that myself. One small remark is that in most cases we will have
the key person behind the project come to class the day the project is due. They also
want to learn what do smart people here think is great, are there barriers, and we’ll share
the answers with the company. If you have specific things to talk to them about, great. If
you want to use it for your project, and do a project with those companies - in Dan
Olsen’s case I’m sure he will love that. I haven’t talked to the others yet.
Jeremy:
Any questions?
Andreas:
Do you understand our reasoning behind it, the experiential component of doing it instead
of just reading about it is something we care about. That’s the dog food.