Case study 2: Recommendation systems
Genoveva Vargas-Solar
http://www.vargas-solar.com/big-data-analytics
French Council of Scientific Research, LIG & LAFMIA Labs
Montevideo, 22nd November – 4th December, 2015
+ High dimensional data

[Course map: techniques by type of data]
- High-dim. data: locality-sensitive hashing, clustering, dimensionality reduction
- Graph data: PageRank, SimRank; community detection; spam detection
- Infinite data: filtering data streams, web advertising, queries on streams
- Machine learning: SVM, decision trees, perceptron, kNN
- Apps: recommender systems, association rules, duplicate document detection
Recommendations

Examples: search, recommendations
Items: products, web sites, blogs, news items, …

J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org
+ From Scarcity to Abundance

- Shelf space is a scarce commodity for traditional retailers
  - Also: TV networks, movie theaters, …
- The Web enables near-zero-cost dissemination of information about products
  - From scarcity to abundance
- More choice necessitates better filters
  - Recommendation engines
  - How Into Thin Air made Touching the Void a bestseller:
    http://www.wired.com/wired/archive/12.10/tail.html
+ Sidenote: The Long Tail

[Figure: the long-tail distribution of product popularity. Source: Chris Anderson (2004)]
+ Physical vs. Online

Read http://www.wired.com/wired/archive/12.10/tail.html to learn more!
+ Types of Recommendations

- Editorial and hand curated
  - List of favorites
  - Lists of “essential” items
- Simple aggregates
  - Top 10, Most Popular, Recent Uploads
- Tailored to individual users
  - Amazon, Netflix, …
+ Formal Model

- X = set of Customers
- S = set of Items
- Utility function u: X × S → R
  - R = set of ratings
  - R is a totally ordered set
  - e.g., 0–5 stars, real number in [0,1]
Utility Matrix

          Avatar   LOTR   Matrix   Pirates
  Alice     1               0.2
  Bob               0.5               0.3
  Carol    0.2               1        0.4
  David                               0.4
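Since most entries of the utility matrix are unknown, a sparse mapping from (user, item) pairs to ratings is a natural representation. A minimal sketch with the slide's example values (the dictionary layout and function name are illustrative, not from the slides):

```python
# Sparse utility matrix: only known ratings u(x, s) are stored.
utility = {
    ("Alice", "Avatar"): 1.0, ("Alice", "Matrix"): 0.2,
    ("Bob", "LOTR"): 0.5, ("Bob", "Pirates"): 0.3,
    ("Carol", "Avatar"): 0.2, ("Carol", "Matrix"): 1.0, ("Carol", "Pirates"): 0.4,
    ("David", "Pirates"): 0.4,
}

def rating(user, item):
    """Return the known rating u(x, s), or None if the entry is unknown."""
    return utility.get((user, item))

print(rating("Alice", "Avatar"))   # 1.0
print(rating("Alice", "Pirates"))  # None
```

The key point is that the matrix is never materialized densely; missing entries stay missing until an extrapolation method fills them in.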
+ Key Problems

- (1) Gathering “known” ratings for the matrix
  - How to collect the data in the utility matrix
- (2) Extrapolating unknown ratings from the known ones
  - Mainly interested in high unknown ratings
    - We are not interested in knowing what you don’t like, but what you like
- (3) Evaluating extrapolation methods
  - How to measure the success/performance of recommendation methods
+ (1) Gathering Ratings

- Explicit
  - Ask people to rate items
  - Doesn’t work well in practice – people can’t be bothered
- Implicit
  - Learn ratings from user actions
    - E.g., purchase implies high rating
  - What about low ratings?
+ (2) Extrapolating Utilities

- Key problem: the utility matrix U is sparse
  - Most people have not rated most items
  - Cold start:
    - New items have no ratings
    - New users have no history
- Three approaches to recommender systems:
  - 1) Content-based
  - 2) Collaborative
  - 3) Latent factor based
Content-based recommendation systems
+ Content-based Recommendations

- Main idea: recommend to customer x items similar to previous items rated highly by x
- Examples:
  - Movie recommendations
    - Recommend movies with the same actor(s), director, genre, …
  - Websites, blogs, news
    - Recommend other sites with “similar” content
+ Plan of Action

[Figure: the content-based pipeline — from the items a user likes, build item profiles; combine them into a user profile; match the user profile against item profiles to recommend new items (illustrated with red circles and triangles).]
+ Item Profiles

- For each item, create an item profile
- A profile is a set (vector) of features
  - Movies: author, title, actor, director, …
  - Text: set of “important” words in the document
- How to pick important features?
  - The usual heuristic from text mining is TF-IDF
    (Term Frequency × Inverse Document Frequency)
    - Term … Feature
    - Document … Item
Sidenote: TF-IDF

fij = frequency of term (feature) i in doc (item) j
TFij = fij / maxk fkj
(Note: we normalize TF to discount for “longer” documents)

ni = number of docs that mention term i
N = total number of docs
IDFi = log(N / ni)

TF-IDF score: wij = TFij × IDFi

Doc profile = set of words with highest TF-IDF scores, together with their scores
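These definitions can be sketched directly. The corpus below is made up for illustration, and the max-frequency TF normalization is the common choice named on the slide:

```python
import math
from collections import Counter

# Toy corpus: each document is an "item", each term a "feature"
docs = [
    "the matrix movie stars keanu",
    "the lord of the rings movie",
    "keanu stars in the john wick movie",
]

def tf_idf(docs):
    N = len(docs)
    tokenized = [d.split() for d in docs]
    # n_i = number of docs that mention term i
    n = Counter(t for doc in tokenized for t in set(doc))
    profiles = []
    for doc in tokenized:
        f = Counter(doc)            # f_ij = frequency of term i in doc j
        max_f = max(f.values())     # normalize by the doc's most frequent term
        # w_ij = TF_ij * IDF_i
        profiles.append({t: (c / max_f) * math.log(N / n[t]) for t, c in f.items()})
    return profiles

w = tf_idf(docs)
print(w[0]["the"])    # "the" occurs in every doc, so IDF = log(3/3) = 0 -> 0.0
print(w[0]["matrix"]) # distinctive term for doc 0, score > 0
```

A doc profile would then keep only the highest-scoring terms from each dictionary, together with their scores.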
+ User Profiles and Prediction

- User profile possibilities:
  - Weighted average of rated item profiles
  - Variation: weight by difference from the average rating for the item
  - …
- Prediction heuristic:
  - Given user profile x and item profile i, estimate
    u(x, i) = cos(x, i) = (x · i) / (||x|| · ||i||)
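A sketch of this heuristic: build the user profile as the rating-weighted average of rated item profiles, then score candidate items by cosine similarity. The feature vectors and names below are made up for illustration:

```python
import math

# Hypothetical item profiles: binary features (e.g., actor/genre indicators)
item_profiles = {
    "movie_a": [1, 0, 1, 0],
    "movie_b": [1, 0, 0, 1],
    "movie_c": [0, 1, 0, 1],
}
user_ratings = {"movie_a": 5, "movie_b": 4}  # items the user has rated

def user_profile(ratings, profiles):
    """Rating-weighted average of the rated item profiles."""
    dim = len(next(iter(profiles.values())))
    total = sum(ratings.values())
    return [sum(r * profiles[i][d] for i, r in ratings.items()) / total
            for d in range(dim)]

def cosine(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y)))

x = user_profile(user_ratings, item_profiles)
# u(x, i) = cos(x, i): higher means a better content match
for item, profile in item_profiles.items():
    print(item, round(cosine(x, profile), 3))
```

As expected, the unrated movie_c (which shares few features with the user's likes) scores lower than the movies the profile was built from.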
+ Pros: Content-based Approach

- +: No need for data on other users
  - No cold-start or sparsity problems
- +: Able to recommend to users with unique tastes
- +: Able to recommend new & unpopular items
  - No first-rater problem
- +: Able to provide explanations
  - Can explain recommendations by listing the content features that caused an item to be recommended
+ Cons: Content-based Approach

- –: Finding the appropriate features is hard
  - E.g., images, movies, music
- –: Recommendations for new users
  - How to build a user profile?
- –: Overspecialization
  - Never recommends items outside the user’s content profile
  - People might have multiple interests
  - Unable to exploit quality judgments of other users
Collaborative Filtering
Harnessing the quality judgments of other users
Collaborative Filtering

- Consider user x
- Find a set N of other users whose ratings are “similar” to x’s ratings
- Estimate x’s ratings based on the ratings of the users in N
Finding “Similar” Users

- Let rx be the vector of user x’s ratings
    rx = [*, _, _, *, ***]
    ry = [*, _, **, **, _]

- Jaccard similarity measure
  - rx, ry as sets:
      rx = {1, 4, 5}
      ry = {1, 3, 4}
  - sim(x, y) = |rx ∩ ry| / |rx ∪ ry|
  - Problem: ignores the value of the rating

- Cosine similarity measure
  - rx, ry as points:
      rx = [1, 0, 0, 1, 3]
      ry = [1, 0, 2, 2, 0]
  - sim(x, y) = cos(rx, ry) = (rx · ry) / (||rx|| · ||ry||)
  - Problem: treats missing ratings as “negative”

- Pearson correlation coefficient
  - Sxy = items rated by both users x and y
  - sim(x, y) = Σs∈Sxy (rxs − r̄x)(rys − r̄y) / ( √(Σs∈Sxy (rxs − r̄x)²) · √(Σs∈Sxy (rys − r̄y)²) )
    where r̄x, r̄y are the average ratings of x and y
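The three measures can be computed directly on the example vectors rx = [1, 0, 0, 1, 3] and ry = [1, 0, 2, 2, 0], treating 0 as a missing rating. A minimal sketch:

```python
import math

rx = [1, 0, 0, 1, 3]  # 0 = missing rating
ry = [1, 0, 2, 2, 0]

def jaccard(x, y):
    """Treats ratings as sets of rated items; ignores rating values."""
    sx = {i for i, v in enumerate(x) if v}
    sy = {i for i, v in enumerate(y) if v}
    return len(sx & sy) / len(sx | sy)

def cosine(x, y):
    """Treats missing ratings as 0, i.e. as 'negative'."""
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y)))

def pearson(x, y):
    """Centers each user's ratings by their mean, over co-rated items S_xy only."""
    s = [i for i in range(len(x)) if x[i] and y[i]]          # S_xy
    mx = sum(v for v in x if v) / sum(1 for v in x if v)     # avg. rating of x
    my = sum(v for v in y if v) / sum(1 for v in y if v)     # avg. rating of y
    num = sum((x[i] - mx) * (y[i] - my) for i in s)
    den = (math.sqrt(sum((x[i] - mx) ** 2 for i in s))
           * math.sqrt(sum((y[i] - my) ** 2 for i in s)))
    return num / den

print(jaccard(rx, ry))           # 0.5
print(round(cosine(rx, ry), 3))  # 0.302
print(round(pearson(rx, ry), 3)) # 0.316
```

Note how Jaccard discards magnitudes entirely, while plain cosine mixes missing entries into the computation; Pearson is the only one that both uses magnitudes and centers them.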
+ Similarity Metric

[Refers to an example utility matrix (users A, B, C) shown on the slide.]

- Intuitively we want: sim(A, B) > sim(A, C)
- Jaccard similarity: 1/5 < 2/4 — gives the wrong ordering
- Cosine similarity: 0.386 > 0.322 — works, but considers missing ratings as “negative”
- Solution: subtract the (row) mean, then take the cosine:

      sim(x, y) = Σi rxi · ryi / ( √(Σi r²xi) · √(Σi r²yi) )   (on mean-centered ratings)

  - sim(A, B) vs. sim(A, C): 0.092 > −0.559
  - Notice: cosine similarity is correlation when the data is centered at 0
+ Rating Predictions

From similarity metric to recommendations:
- Let rx be the vector of user x’s ratings
- Let N be the set of the k users most similar to x who have rated item i
- Prediction for item i of user x:
  - Option 1: simple average
      rxi = (1/k) Σy∈N ryi
  - Option 2: similarity-weighted average
      rxi = Σy∈N sxy · ryi / Σy∈N sxy
    (shorthand: sxy = sim(x, y))
  - Other options?
- Many other tricks possible…
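The two prediction options can be sketched with toy data; the neighbor similarities below are assumed, not computed:

```python
# Ratings of item i by the k most similar users (the set N),
# together with each neighbor's similarity s_xy to user x (made-up values)
neighbors = [
    {"sim": 0.9, "rating": 4},
    {"sim": 0.5, "rating": 2},
]

# Option 1: plain average over N
avg = sum(n["rating"] for n in neighbors) / len(neighbors)

# Option 2: similarity-weighted average
weighted = (sum(n["sim"] * n["rating"] for n in neighbors)
            / sum(n["sim"] for n in neighbors))

print(avg)                 # 3.0
print(round(weighted, 2))  # 3.29
```

The weighted version pulls the prediction toward the more similar neighbor's rating, which is usually the better default.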
+ Item-Item Collaborative Filtering

- So far: user-user collaborative filtering
- Another view: item-item
  - For item i, find other similar items
  - Estimate the rating for item i based on the ratings for similar items
  - Can use the same similarity metrics and prediction functions as in the user-user model:

      rxi = Σj∈N(i;x) sij · rxj / Σj∈N(i;x) sij

    sij … similarity of items i and j
    rxj … rating of user x on item j
    N(i;x) … set of items rated by x similar to i
Item-Item CF (|N|=2)

Utility matrix (rows = movies 1–6, columns = users 1–12);
“.” = unknown rating, known ratings range from 1 to 5:

        u1  u2  u3  u4  u5  u6  u7  u8  u9 u10 u11 u12
  m1     1   .   3   .   .   5   .   .   5   .   4   .
  m2     .   .   5   4   .   .   4   .   .   2   1   3
  m3     2   4   .   1   2   .   3   .   4   3   5   .
  m4     .   2   4   .   5   .   .   4   .   .   2   .
  m5     .   .   4   3   4   2   .   .   .   .   2   5
  m6     1   .   3   .   3   .   .   2   .   .   4   .
Item-Item CF (|N|=2)

        u1  u2  u3  u4  u5  u6  u7  u8  u9 u10 u11 u12
  m1     1   .   3   .   ?   5   .   .   5   .   4   .
  m2     .   .   5   4   .   .   4   .   .   2   1   3
  m3     2   4   .   1   2   .   3   .   4   3   5   .
  m4     .   2   4   .   5   .   .   4   .   .   2   .
  m5     .   .   4   3   4   2   .   .   .   .   2   5
  m6     1   .   3   .   3   .   .   2   .   .   4   .

Estimate the rating of movie 1 by user 5 (the “?” entry).
+ Item-Item CF (|N|=2)

Neighbor selection: identify movies similar to movie 1 that were rated by user 5.

Here we use Pearson correlation as similarity:
1) Subtract the mean rating mi from each movie i
     m1 = (1+3+5+5+4)/5 = 3.6
     row 1, centered: [-2.6, 0, -0.6, 0, 0, 1.4, 0, 0, 1.4, 0, 0.4, 0]
2) Compute cosine similarities between the centered rows

sim(1, m) for m = 2, …, 6:  -0.18, 0.41, -0.10, -0.31, 0.59
+ Item-Item CF (|N|=2)

Compute similarity weights for the two movies most similar to movie 1
that were rated by user 5:
    s1,3 = 0.41, s1,6 = 0.59
+ Item-Item CF (|N|=2)

Predict by taking the weighted average:
    r1,5 = (0.41·2 + 0.59·3) / (0.41 + 0.59) = 2.6

    rix = Σj∈N(i;x) sij · rjx / Σj∈N(i;x) sij
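The worked example can be reproduced end to end. A sketch assuming the 6-movie × 12-user ratings matrix from the slides, with 0 marking an unknown rating:

```python
import math

# Rows = movies 1..6, columns = users 1..12; 0 = unknown rating.
# Movie 1 / user 5 (indices 0 and 4) is the entry to predict.
R = [
    [1, 0, 3, 0, 0, 5, 0, 0, 5, 0, 4, 0],
    [0, 0, 5, 4, 0, 0, 4, 0, 0, 2, 1, 3],
    [2, 4, 0, 1, 2, 0, 3, 0, 4, 3, 5, 0],
    [0, 2, 4, 0, 5, 0, 0, 4, 0, 0, 2, 0],
    [0, 0, 4, 3, 4, 2, 0, 0, 0, 0, 2, 5],
    [1, 0, 3, 0, 3, 0, 0, 2, 0, 0, 4, 0],
]

def centered(row):
    """Subtract the movie's mean rating from its known entries; keep missing as 0."""
    rated = [r for r in row if r != 0]
    mean = sum(rated) / len(rated)
    return [r - mean if r != 0 else 0.0 for r in row]

def cos(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

C = [centered(row) for row in R]  # Pearson = cosine on centered rows

def predict(i, x, k=2):
    """Item-item prediction for movie i, user x, with |N| = k neighbors."""
    # Similarity of movie i to every other movie that user x has rated
    sims = [(cos(C[i], C[j]), j) for j in range(len(R)) if j != i and R[j][x] != 0]
    top = sorted(sims, reverse=True)[:k]
    return sum(s * R[j][x] for s, j in top) / sum(s for s, _ in top)

print(round(predict(0, 4), 1))  # movie 1, user 5 -> 2.6
```

The computed neighbor weights come out as s1,3 ≈ 0.41 and s1,6 ≈ 0.59, matching the slides, and the prediction rounds to 2.6.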
CF: Common Practice

- Define similarity sij of items i and j
- Select k nearest neighbors N(i; x)
  - Items most similar to i that were rated by x
- Estimate rating rxi as the weighted average

  Before:
      rxi = Σj∈N(i;x) sij · rxj / Σj∈N(i;x) sij

  Now, with a baseline estimate bxi for rxi:
      rxi = bxi + Σj∈N(i;x) sij · (rxj − bxj) / Σj∈N(i;x) sij

      bxi = μ + bx + bi
      μ  = overall mean movie rating
      bx = rating deviation of user x = (avg. rating of user x) − μ
      bi = rating deviation of movie i
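A sketch of the baseline-adjusted prediction on a tiny made-up example. With a single neighbor the similarity weight cancels out, so its value (0.8 here) is arbitrary:

```python
# Known ratings: user x rated item j; user y rated both items.
R = {("x", "j"): 5, ("y", "i"): 3, ("y", "j"): 4}

mu = sum(R.values()) / len(R)  # overall mean rating = 4.0

def user_dev(u):
    """b_u = (avg. rating of u) - mu"""
    rs = [r for (uu, _), r in R.items() if uu == u]
    return sum(rs) / len(rs) - mu

def item_dev(i):
    """b_i = (avg. rating of i) - mu"""
    rs = [r for (_, ii), r in R.items() if ii == i]
    return sum(rs) / len(rs) - mu

def baseline(u, i):
    """b_xi = mu + b_x + b_i"""
    return mu + user_dev(u) + item_dev(i)

def predict(u, i, neighbors, sim):
    """r_xi = b_xi + sum_j s_ij (r_xj - b_xj) / sum_j s_ij"""
    num = sum(sim[j] * (R[(u, j)] - baseline(u, j)) for j in neighbors)
    den = sum(sim[j] for j in neighbors)
    return baseline(u, i) + num / den

# Predict x's rating of i from its one neighbor j, assumed similarity 0.8
print(predict("x", "i", ["j"], {"j": 0.8}))  # 3.5
```

Here the baseline b_xi = 4 encodes that x rates generously (+1) while item i is below average (−1); the neighbor term then nudges the estimate down because x rated j half a point below its baseline.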
+ Item-Item vs. User-User

- In practice, it has been observed that item-item often works better than user-user
- Why? Items are simpler; users have multiple tastes
Pros/Cons of Collaborative Filtering

- + Works for any kind of item
  - No feature selection needed
- − Cold start:
  - Need enough users in the system to find a match
- − Sparsity:
  - The user/ratings matrix is sparse
  - Hard to find users that have rated the same items
- − First rater:
  - Cannot recommend an item that has not previously been rated
  - New items, esoteric items
- − Popularity bias:
  - Cannot recommend items to someone with unique taste
  - Tends to recommend popular items
+ Hybrid Methods

- Implement two or more different recommenders and combine their predictions
  - Perhaps using a linear model
- Add content-based methods to collaborative filtering
  - Item profiles for the new-item problem
  - Demographics to deal with the new-user problem
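The "combine predictions, perhaps using a linear model" idea can be sketched as a fixed linear blend. Both component recommenders below are stubs with made-up scores; in practice each would be a trained model and the weight would be fit on held-out ratings:

```python
# Illustrative component recommenders (stubs returning fixed scores)
def cf_score(user, item):
    return 4.0   # e.g., an item-item CF prediction

def content_score(user, item):
    return 3.0   # e.g., cosine(user profile, item profile), rescaled to 1-5

def hybrid_score(user, item, alpha=0.7):
    """Linear blend of a collaborative and a content-based score."""
    return alpha * cf_score(user, item) + (1 - alpha) * content_score(user, item)

print(round(hybrid_score("alice", "matrix"), 2))  # 3.7
```

Because the content component does not depend on other users' ratings, the blend can still produce a score for brand-new items where the CF component has nothing to say.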
+ Tip: Add Data

- Leverage all the data
  - Don’t try to reduce data size in an effort to make fancy algorithms work
  - Simple methods on large data do best
- Add more data
  - E.g., add IMDB data on genres
- More data beats better algorithms:
  http://anand.typepad.com/datawocky/2008/03/more-data-usual.html