CS246: Mining Massive Datasets
Jure Leskovec, Stanford University
http://cs246.stanford.edu
High dim. data: Locality sensitive hashing; Clustering; Dimensionality reduction
Graph data: PageRank, SimRank; Community Detection; Spam Detection
Infinite data: Filtering data streams; Web advertising; Queries on streams
Machine learning: SVM; Decision Trees; Perceptron, kNN
Apps: Recommender systems; Association Rules; Duplicate document detection

Customer X
 Buys Metallica CD
 Buys Megadeth CD

Customer Y
 Does search on Metallica
 Recommender system suggests Megadeth from data collected about customer X
Examples:
[Diagram: items (products, web sites, blogs, news items, …) reach users via Search and via Recommendations]
Shelf space is a scarce commodity for traditional retailers
 Also: TV networks, movie theaters, …

Web enables near-zero-cost dissemination of information about products
 From scarcity to abundance

More choice necessitates better filters
 Recommendation engines
 How Into Thin Air made Touching the Void a bestseller: http://www.wired.com/wired/archive/12.10/tail.html
Source: Chris Anderson (2004)
Read http://www.wired.com/wired/archive/12.10/tail.html to learn more!
Editorial and hand curated
 List of favorites
 Lists of “essential” items

Simple aggregates
 Top 10, Most Popular, Recent Uploads

Tailored to individual users
 Amazon, Netflix, …
X = set of Customers
S = set of Items

Utility function u: X × S → R
 R = set of ratings
 R is a totally ordered set
 e.g., 0-5 stars, real number in [0,1]
Utility matrix:

          Avatar   LOTR   Matrix   Pirates
Alice        1               0.2
Bob                  0.5                0.3
Carol       0.2               1
David                                   0.4
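Since most entries of the utility matrix are unknown, in code it is typically stored sparsely rather than as a dense array. A minimal sketch, assuming the ratings from the matrix above and a plain dictionary-of-dictionaries layout (the helper name is illustrative, not from the slides):

```python
# Sparse utility matrix: user -> {item: rating}; missing entries are simply absent.
utility = {
    "Alice": {"Avatar": 1.0, "Matrix": 0.2},
    "Bob":   {"LOTR": 0.5, "Pirates": 0.3},
    "Carol": {"Avatar": 0.2, "Matrix": 1.0},
    "David": {"Pirates": 0.4},
}

def rating(user, item):
    """Return the known rating u(user, item), or None if it is unknown."""
    return utility.get(user, {}).get(item)

print(rating("Alice", "Avatar"))   # 1.0
print(rating("Alice", "Pirates"))  # None -> exactly the kind of entry we want to predict
```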
(1) Gathering “known” ratings for the matrix
 How to collect the data in the utility matrix

(2) Extrapolate unknown ratings from the known ones
 Mainly interested in high unknown ratings
 We are not interested in knowing what you don’t like but what you like

(3) Evaluating extrapolation methods
 How to measure success/performance of recommendation methods
Explicit
 Ask people to rate items
 Doesn’t work well in practice – people can’t be bothered

Implicit
 Learn ratings from user actions
 E.g., purchase implies high rating
 What about low ratings?
Key problem: matrix U is sparse
 Most people have not rated most items
 Cold start:
  New items have no ratings
  New users have no history

Three approaches to recommender systems:
 1) Content-based (Today!)
 2) Collaborative
 3) Latent factor based
Main idea: Recommend to customer x items similar to previous items rated highly by x

Example:
 Movie recommendations
  Recommend movies with same actor(s), director, genre, …
 Websites, blogs, news
  Recommend other sites with “similar” content
[Diagram: from items the user likes, build item profiles (example features: red, circles, triangles); infer a user profile from them; match the user profile against item profiles to recommend new items]
For each item, create an item profile

Profile is a set (vector) of features
 Movies: author, title, actor, director, …
 Text: set of “important” words in document

How to pick important features?
 Usual heuristic from text mining is TF-IDF (Term Frequency × Inverse Document Frequency)
 Term … Feature
 Document … Item
Term frequency: TFij = fij / maxk fkj
 fij = frequency of term (feature) i in doc (item) j
 Note: we normalize TF to discount for “longer” documents

Inverse document frequency: IDFi = log(N / ni)
 ni = number of docs that mention term i
 N = total number of docs

TF-IDF score: wij = TFij × IDFi

Doc profile = set of words with highest TF-IDF scores, together with their scores
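As a concrete illustration of the TF-IDF profiles described above, here is a minimal sketch; the toy corpus and the function name build_profiles are made up for the example:

```python
import math
from collections import Counter

docs = {                       # hypothetical toy corpus: item -> text
    "matrix":  "sci fi movie about a simulated matrix world",
    "pirates": "adventure movie about pirates",
    "lotr":    "fantasy movie about a ring",
}

def build_profiles(docs, top_k=5):
    """Item profiles: for each doc, keep the top_k words by TF-IDF weight."""
    n = len(docs)
    tokenized = {d: text.split() for d, text in docs.items()}
    df = Counter()                       # n_i = number of docs that mention term i
    for words in tokenized.values():
        df.update(set(words))
    profiles = {}
    for d, words in tokenized.items():
        counts = Counter(words)
        max_f = max(counts.values())
        # w_ij = TF_ij * IDF_i, with TF normalized by the most frequent term in the doc
        weights = {w: (c / max_f) * math.log(n / df[w]) for w, c in counts.items()}
        profiles[d] = dict(sorted(weights.items(), key=lambda kv: -kv[1])[:top_k])
    return profiles

print(build_profiles(docs))
```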
User profile possibilities:
 Weighted average of rated item profiles
 Variation: weight by difference from average rating for item
 …

Prediction heuristic:
 Given user profile x and item profile i, estimate
 u(x, i) = cos(x, i) = (x · i) / (‖x‖ · ‖i‖)
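A minimal sketch of this prediction heuristic, assuming item profiles are sparse feature-to-weight dictionaries (for instance the TF-IDF profiles sketched above); the helper names and example numbers are illustrative:

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(f, 0.0) for f, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def user_profile(rated, item_profiles):
    """Rating-weighted average of the profiles of the items the user rated."""
    profile, total = {}, sum(rated.values())
    for item, r in rated.items():
        for f, w in item_profiles[item].items():
            profile[f] = profile.get(f, 0.0) + r * w / total
    return profile

item_profiles = {"m1": {"sci-fi": 0.9, "action": 0.4},
                 "m2": {"romance": 0.8, "drama": 0.5},
                 "m3": {"sci-fi": 0.7, "drama": 0.3}}
x = user_profile({"m1": 5, "m2": 1}, item_profiles)   # user rated m1=5, m2=1
print(cosine(x, item_profiles["m3"]))                 # u(x, m3) = cos(x, m3)
```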
+: No need for data on other users
 No cold-start or sparsity problems

+: Able to recommend to users with unique tastes

+: Able to recommend new & unpopular items
 No first-rater problem

+: Able to provide explanations
 Can provide explanations of recommended items by listing the content features that caused an item to be recommended
–: Finding the appropriate features is hard
 E.g., images, movies, music

–: Overspecialization
 Never recommends items outside user’s content profile
 People might have multiple interests
 Unable to exploit quality judgments of other users

–: Recommendations for new users
 How to build a user profile?
Consider user x

Find set N of other users whose ratings are “similar” to x’s ratings

Estimate x’s ratings based on the ratings of users in N
Example rating vectors:
 rx = [*, _, _, *, ***]
 ry = [*, _, **, **, _]

Let rx be the vector of user x’s ratings.

Jaccard similarity measure
 rx, ry as sets: rx = {1, 4, 5}, ry = {1, 3, 4}
 Problem: ignores the value of the rating

Cosine similarity measure
 sim(x, y) = cos(rx, ry) = (rx · ry) / (‖rx‖ · ‖ry‖)
 rx, ry as points: rx = [1, 0, 0, 1, 3], ry = [1, 0, 2, 2, 0]
 Problem: treats missing ratings as “negative”

Pearson correlation coefficient
 Sxy = items rated by both users x and y; r̄x, r̄y = average rating of x and y
 sim(x, y) = Σs∈Sxy (rxs − r̄x)(rys − r̄y) / ( √(Σs∈Sxy (rxs − r̄x)²) · √(Σs∈Sxy (rys − r̄y)²) )
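The three similarity measures above, sketched for ratings stored as sparse item-to-rating dictionaries; the example vectors are the rx, ry from the slide, and the function names are illustrative:

```python
import math

def jaccard(rx, ry):
    """Jaccard similarity of the sets of rated items (ignores rating values)."""
    a, b = set(rx), set(ry)
    return len(a & b) / len(a | b)

def cosine_sim(rx, ry):
    """Cosine similarity; missing ratings are implicitly treated as 0."""
    dot = sum(r * ry.get(i, 0.0) for i, r in rx.items())
    nx = math.sqrt(sum(r * r for r in rx.values()))
    ny = math.sqrt(sum(r * r for r in ry.values()))
    return dot / (nx * ny)

def pearson(rx, ry):
    """Pearson correlation over the items S_xy rated by both users."""
    s_xy = set(rx) & set(ry)
    mx = sum(rx.values()) / len(rx)          # average rating of x
    my = sum(ry.values()) / len(ry)          # average rating of y
    num = sum((rx[s] - mx) * (ry[s] - my) for s in s_xy)
    dx = math.sqrt(sum((rx[s] - mx) ** 2 for s in s_xy))
    dy = math.sqrt(sum((ry[s] - my) ** 2 for s in s_xy))
    return num / (dx * dy) if dx and dy else 0.0

rx = {1: 1, 4: 1, 5: 3}   # [*, _, _, *, ***]
ry = {1: 1, 3: 2, 4: 2}   # [*, _, **, **, _]
print(jaccard(rx, ry), cosine_sim(rx, ry), pearson(rx, ry))
```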
Cosine sim:
 sim(x, y) = Σi rxi · ryi / ( √(Σi rxi²) · √(Σi ryi²) )

Intuitively we want: sim(A, B) > sim(A, C)
 Jaccard similarity: 1/5 < 2/4
 Cosine similarity: 0.386 > 0.322
  Considers missing ratings as “negative”
 Solution: subtract the (row) mean; then sim(A, B) vs. sim(A, C) gives 0.092 > −0.559
 Notice that cosine similarity is correlation when the data is centered at 0
Let rx be the vector of user x’s ratings
Let N be the set of k users most similar to x who have rated item i

Prediction for item i of user x:
 Simple average: rxi = (1/k) · Σy∈N ryi
 Similarity-weighted average: rxi = Σy∈N sxy · ryi / Σy∈N sxy
  Shorthand: sxy = sim(x, y)
 Other options?

Many other tricks possible…
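A minimal sketch of the similarity-weighted prediction above, reusing the sparse user-to-{item: rating} layout and a similarity function such as the pearson helper sketched earlier; the function name is illustrative:

```python
def predict_user_user(ratings, x, i, sim, k=5):
    """Predict user x's rating of item i from the k most similar users who rated i."""
    # Candidate neighbours: users other than x who have rated item i.
    candidates = [y for y in ratings if y != x and i in ratings[y]]
    # N: the k candidates most similar to x.
    neighbours = sorted(candidates, key=lambda y: sim(ratings[x], ratings[y]), reverse=True)[:k]
    num = sum(sim(ratings[x], ratings[y]) * ratings[y][i] for y in neighbours)
    den = sum(sim(ratings[x], ratings[y]) for y in neighbours)
    return num / den if den else None

# usage: predict_user_user(utility, "Alice", "Pirates", pearson)
```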
So far: User-user collaborative filtering

Another view: Item-item
 For item i, find other similar items
 Estimate rating for item i based on ratings for similar items
 Can use the same similarity metrics and prediction functions as in the user-user model

 rxi = Σj∈N(i;x) sij · rxj / Σj∈N(i;x) sij

 sij … similarity of items i and j
 rxj … rating of user x on item j
 N(i;x) … set of items rated by x that are similar to i
Worked example (item-item collaborative filtering):

[Utility matrix: 6 movies (rows) × 12 users (columns); entries are ratings between 1 and 5, unknown ratings are blank.]

Goal: estimate the rating of movie 1 by user 5.

Neighbor selection: identify movies similar to movie 1 that were rated by user 5.
Here we use Pearson correlation as similarity:
 1) Subtract the mean rating mi from each movie (row) i
    Movie 1 ratings: [1, _, 3, _, _, 5, _, _, 5, _, 4, _]
    m1 = (1 + 3 + 5 + 5 + 4) / 5 = 3.6
    Centered row 1: [−2.6, 0, −0.6, 0, 0, 1.4, 0, 0, 1.4, 0, 0.4, 0]
 2) Compute cosine similarities between rows
    sim(1, m) for m = 1…6: 1.00, −0.18, 0.41, −0.10, −0.31, 0.59

Compute similarity weights of the neighbors rated by user 5:
 s13 = 0.41, s16 = 0.59

Predict by taking a weighted average (user 5 rated movie 3 with 2 and movie 6 with 3):
 rix = Σj∈N(i;x) sij · rjx / Σj∈N(i;x) sij
 r15 = (0.41·2 + 0.59·3) / (0.41 + 0.59) = 2.6
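A quick check of the final weighted-average step in code, using the similarity weights and ratings from the example (the function name is illustrative):

```python
def weighted_average(neighbours):
    """neighbours: list of (similarity s_ij, rating r_jx) pairs for j in N(i; x)."""
    num = sum(s * r for s, r in neighbours)
    den = sum(s for s, _ in neighbours)
    return num / den

# movie 3 (sim 0.41, rated 2 by user 5) and movie 6 (sim 0.59, rated 3 by user 5)
print(weighted_average([(0.41, 2), (0.59, 3)]))  # 2.59, i.e. about 2.6
```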
Before:
 rxi = Σj∈N(i;x) sij · rxj / Σj∈N(i;x) sij

Define similarity sij of items i and j
Select k nearest neighbors N(i; x)
 Items most similar to i that were rated by x

Estimate rating rxi as the weighted average:
 rxi = bxi + Σj∈N(i;x) sij · (rxj − bxj) / Σj∈N(i;x) sij

where bxi is the baseline estimate for rxi:
 bxi = μ + bx + bi
  μ = overall mean movie rating
  bx = rating deviation of user x = (avg. rating of user x) − μ
  bi = rating deviation of movie i
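A minimal sketch of this baseline-adjusted item-item prediction, assuming item-major ratings (item -> {user: rating}), precomputed user/item deviations, and an item-item similarity function; all names are illustrative:

```python
def baseline(mu, b_user, b_item, x, i):
    """b_xi = mu + b_x + b_i (unknown deviations default to 0)."""
    return mu + b_user.get(x, 0.0) + b_item.get(i, 0.0)

def predict_item_item(ratings_by_item, sim, mu, b_user, b_item, x, i, k=5):
    """r_xi = b_xi + sum_j s_ij * (r_xj - b_xj) / sum_j s_ij over N(i; x)."""
    # N(i; x): the k items most similar to i that user x has rated.
    rated = [j for j in ratings_by_item if j != i and x in ratings_by_item[j]]
    neighbours = sorted(rated, key=lambda j: sim(i, j), reverse=True)[:k]
    num = sum(sim(i, j) * (ratings_by_item[j][x] - baseline(mu, b_user, b_item, x, j))
              for j in neighbours)
    den = sum(sim(i, j) for j in neighbours)
    return baseline(mu, b_user, b_item, x, i) + (num / den if den else 0.0)
```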
[Utility matrix example with users Alice, Bob, Carol, David and movies Avatar, LOTR, Matrix, Pirates.]

In practice, it has been observed that item-item often works better than user-user.
Why? Items are simpler; users have multiple tastes.
+ Works for any kind of item
 No feature selection needed

- Cold start:
 Need enough users in the system to find a match

- Sparsity:
 The user/ratings matrix is sparse
 Hard to find users that have rated the same items

- First rater:
 Cannot recommend an item that has not been previously rated
 New items, esoteric items

- Popularity bias:
 Cannot recommend items to someone with unique taste
 Tends to recommend popular items
Implement two or more different recommenders and combine their predictions
 Perhaps using a linear model (see the sketch below)

Add content-based methods to collaborative filtering
 Item profiles for the new-item problem
 Demographics to deal with the new-user problem
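A minimal sketch of combining two recommenders with a linear model, as suggested above; the two predictor functions and the weights are hypothetical and would normally be fit on held-out ratings:

```python
def blend(predict_cf, predict_content, x, i, w_cf=0.7, w_content=0.3):
    """Linear combination of a collaborative and a content-based prediction."""
    return w_cf * predict_cf(x, i) + w_content * predict_content(x, i)

# e.g. blend(lambda x, i: 3.8, lambda x, i: 3.2, "Alice", "Pirates")  -> 3.62
```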
- Evaluation
- Error metrics
- Complexity / Speed
[Utility matrix of known ratings: movies (rows) × users (columns), ratings 1 to 5.]
[The same utility matrix with some known ratings withheld (shown as ?) to form a test data set.]
Compare predictions with known ratings
 Root-mean-square error (RMSE):
  RMSE = √( (1/N) · Σxi (rxi − r*xi)² ), where rxi is the predicted rating, r*xi is the true rating of x on i, and N is the number of test ratings
 Precision at top 10:
  % of relevant items among the top 10 recommendations
 Rank correlation:
  Spearman’s correlation between the system’s and the user’s complete rankings

Another approach: 0/1 model
 Coverage:
  Number of items/users for which the system can make predictions
 Precision:
  Accuracy of predictions
 Receiver operating characteristic (ROC):
  Tradeoff curve between false positives and false negatives
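A small sketch of RMSE over a held-out test set, with predicted and true ratings keyed by (user, item); the names are illustrative:

```python
import math

def rmse(predicted, truth):
    """Root-mean-square error over the (user, item) pairs present in both dicts."""
    keys = predicted.keys() & truth.keys()
    return math.sqrt(sum((predicted[k] - truth[k]) ** 2 for k in keys) / len(keys))

print(rmse({("u1", "m1"): 3.5, ("u1", "m2"): 2.0},
           {("u1", "m1"): 4.0, ("u1", "m2"): 2.0}))   # sqrt(0.25 / 2) ≈ 0.354
```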
Narrow focus on accuracy sometimes misses the point
 Prediction Diversity
 Prediction Context
 Order of predictions

In practice, we only care about predicting high ratings:
 RMSE might penalize a method that does well for high ratings and badly for others
Training data
 100 million ratings, 480,000 users, 17,770 movies
 6 years of data: 2000-2005

Test data
 Last few ratings of each user (2.8 million)
 Evaluation criterion: root-mean-square error (RMSE)
 Netflix Cinematch RMSE: 0.9514

Competition
 2,700+ teams
 $1 million prize for 10% improvement on Cinematch
Expensive step is finding the k most similar customers: O(|X|)
 Too expensive to do at runtime
 Could pre-compute

Naïve pre-computation takes time O(N · |C|)

We already know how to do this!
 Near-neighbor search in high dimensions (LSH)
 Clustering
 Dimensionality reduction
Leverage all the data
 Don’t try to reduce data size in an effort to make fancy algorithms work
 Simple methods on large data do best

Add more data
 e.g., add IMDB data on genres

More data beats better algorithms:
http://anand.typepad.com/datawocky/2008/03/more-data-usual.html
Next topic: Recommendations via Latent Factor models

[Figure: “Overview of Coffee Varieties”, showing products plotted by Exoticness / Price against Complexity of Flavor, with regions labeled Exotic, Flavored, and Popular Roasts and Blends. The bubbles represent products sized by sales volume; products close to each other are recommended to each other.]
[Bellkor Team]
[Figure: movies placed in a two-dimensional latent factor space, one axis running from “Geared towards females” to “Geared towards males” and the other from “serious” to “escapist”; examples include The Color Purple, Amadeus, Sense and Sensibility, The Princess Diaries, Braveheart, Ocean’s 11, Lethal Weapon, Independence Day, Dumb and Dumber, The Lion King, Dave, and Gus.]
Koren, Bell, Volinsky, IEEE Computer, 2009