Content-based recommendation systems
(based on Chapter 9 of Mining of Massive
Datasets, by Rajaraman, Leskovec, and Ullman)
Fernando Lobo
Data mining
Content-based Recommendation Systems
- Focus on properties of items.
- Similarity of items is determined by measuring the similarity of their properties.
Item profiles
- Need to construct a profile for each item.
- A profile is a collection of important characteristics about the item.
- Example for item = movie. The profile can be:
  - set of actors
  - director
  - year the movie was made
  - genre
Discovering features
- Features can be obvious and immediately available (as in the movie example).
- But many times they are not. Examples:
  - document collections
  - images
Discovering features of documents
- Documents can be news articles, blog posts, webpages, research papers, etc.
- Identify a set of words that characterize the topic of a document.
- Need a way to find the importance of a word in a document.
- We can pick the n most important words of that document as the set of words that characterize the document.
Finding the importance of a word in a document
Common approach:
- Remove stop words, the most common words of a language, which tend to say nothing about the topic of a document (examples from English: the, and, of, but, ...); see the sketch after this list.
- For the remaining words, compute their TF.IDF score.
- TF.IDF stands for Term Frequency times Inverse Document Frequency.
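As a rough illustration of this preprocessing step (not from the slides), here is a minimal Python sketch; the stop-word list and the sample document are placeholders.

```python
import re
from collections import Counter

# Minimal sketch: lowercase, tokenize, drop stop words, count terms.
# STOP_WORDS is a tiny placeholder, not a real stop-word lexicon.
STOP_WORDS = {"the", "and", "of", "but", "a", "is", "in", "to"}

def term_counts(document: str) -> Counter:
    words = re.findall(r"[a-z]+", document.lower())
    return Counter(w for w in words if w not in STOP_WORDS)

print(term_counts("The cast of the movie and the director of the movie"))
# Counter({'movie': 2, 'cast': 1, 'director': 1})
```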
TF.IDF score
First compute the Term Frequency (TF):
- Given a collection of N documents.
- Let f_ij = the number of times word i appears in document j.
- The term (word) frequency is f_ij normalized by dividing it by the maximum number of occurrences of any term in the same document (excluding stop words):

  TF_ij = f_ij / max_k f_kj

  (see the short code sketch below)
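A direct transcription of this definition into Python (a sketch; the dictionary of per-document term counts is assumed to exist, e.g., from the preprocessing step above):

```python
# Term frequency: the count of term i in document j, divided by the
# largest count of any term in that same document (stop words excluded).
def tf(term: str, counts: dict[str, int]) -> float:
    return counts.get(term, 0) / max(counts.values())

counts = {"movie": 20, "actor": 5, "director": 1}  # toy counts for one doc
print(tf("director", counts))  # 1/20 = 0.05
```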
TF.IDF score
Then compute the Inverse Document Frequency (IDF):
- IDF for a term (word) is defined as follows. Suppose word i appears in n_i of the N documents.
- Then IDF_i = lg(N / n_i), where lg is the base-2 logarithm.
- TF.IDF for term i in document j = TF_ij × IDF_i (a combined sketch follows below).
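Putting both pieces together, a self-contained sketch under the same assumptions; the toy corpus and the names are mine, not from the slides:

```python
import math

# Toy corpus: per-document term counts after stop-word removal.
docs = [
    {"movie": 3, "actor": 1},
    {"movie": 1, "director": 2},
    {"music": 4},
]

def tf_idf(term: str, doc: dict[str, int], corpus: list[dict]) -> float:
    n_i = sum(1 for d in corpus if term in d)  # documents containing term
    idf = math.log2(len(corpus) / n_i)         # the slides' lg(N / n_i)
    tf = doc.get(term, 0) / max(doc.values())
    return tf * idf

print(tf_idf("director", docs[1], docs))  # (2/2) * lg(3/1) ≈ 1.585
```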
TF.IDF score example
- Suppose we have 2^20 = 1,048,576 documents, and suppose word w appears in 2^10 = 1024 of them.
- Then IDF_w = lg(2^20 / 2^10) = lg(2^10) = 10.
- Suppose that in a document k, word w appears one time, and the maximum number of occurrences of any word in this document is 20. Then:
  - TF_wk = 1/20.
  - TF.IDF for word w in document k is 1/20 × 10 = 1/2 (checked in code below).
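The example's arithmetic can be verified in a few lines (log2 playing the role of lg):

```python
import math

idf_w = math.log2(2**20 / 2**10)  # = 10.0
tf_wk = 1 / 20                    # w appears once; the max count is 20
print(tf_wk * idf_w)              # 0.5, i.e., 1/2
```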
Finding similar items
- Find similar items by using a distance measure.
- For documents, two popular distance measures are:
  - Jaccard distance between sets of words
  - cosine distance between sets, treated as vectors
Jaccard Similarity and Jaccard Distance of Sets
- The Jaccard similarity (SIM) of sets S and T is SIM(S, T) = |S ∩ T| / |S ∪ T|.
- Example: if |S ∩ T| = 3 and |S ∪ T| = 8, then SIM(S, T) = 3/8.
- The Jaccard distance of S and T is 1 − SIM(S, T) (both are sketched in code below).
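A small sketch of both measures on toy word sets; the sets are placeholders chosen so the similarity comes out 3/8:

```python
# Jaccard similarity and distance of two word sets (toy data).
def jaccard_sim(s: set, t: set) -> float:
    return len(s & t) / len(s | t)

s = {"movie", "actor", "director", "drama", "award", "cast"}
t = {"movie", "actor", "director", "music", "stage"}
print(jaccard_sim(s, t))      # 3/8 = 0.375
print(1 - jaccard_sim(s, t))  # Jaccard distance = 0.625
```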
Cosine Distance of sets
- Compute the dot product of the sets (treated as vectors) and divide by the product of their Euclidean distances from the origin.
- Example: x = [1, 2, −1], y = [2, 1, 1]
  - Dot product: x · y = 1 · 2 + 2 · 1 + (−1) · 1 = 3
  - Euclidean distance of x to the origin: √(1² + 2² + (−1)²) = √6 (same for y)
  - Cosine of the angle between x and y: 3 / (√6 · √6) = 1/2, so the angle (the cosine distance) is 60° (computed below)
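The same computation in code, a sketch using the example's vectors:

```python
import math

# Cosine of the angle between two vectors.
def cosine(x: list[float], y: list[float]) -> float:
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (math.sqrt(sum(a * a for a in x)) *
                  math.sqrt(sum(b * b for b in y)))

x, y = [1, 2, -1], [2, 1, 1]
c = cosine(x, y)
print(c)                           # ≈ 0.5
print(math.degrees(math.acos(c)))  # ≈ 60.0, the angle between x and y
```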
Sets of words as bit vectors
- Think of a set of words as a bit vector, with one bit position for each possible word.
- A position holds 1 if the word is in the set, and 0 if not.
- In the dot product, only words that exist in both documents matter; the 0s don't affect the calculations (illustrated below).
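To see why the 0s can be ignored, note that the dot product of two bit vectors is just the size of the intersection of the word sets (a sketch with a toy vocabulary):

```python
vocabulary = ["actor", "director", "movie", "music", "stage"]  # toy vocab

s = {"movie", "actor"}
t = {"movie", "music"}

bits_s = [1 if w in s else 0 for w in vocabulary]  # [1, 0, 1, 0, 0]
bits_t = [1 if w in t else 0 for w in vocabulary]  # [0, 0, 1, 1, 0]

dot = sum(a * b for a, b in zip(bits_s, bits_t))
print(dot == len(s & t))  # True: only shared words contribute
```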
User profiles
- The user profile is a weighted average of rated item profiles.
- Example: items = movies represented by Boolean profiles. The utility matrix has a 1 if the user has seen a movie and is blank otherwise.
- If 20% of the movies that user U likes have Julia Roberts as one of the actors, then the user profile for U will have 0.2 in the component for Julia Roberts (sketched below).
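A sketch of the Boolean case: averaging the feature sets of the movies a user has seen reproduces the 0.2 of the example (the five toy movies are mine):

```python
from collections import Counter

# Each seen movie is represented by its set of features (toy data).
seen = [
    {"Julia Roberts", "comedy"},
    {"comedy", "drama"},
    {"drama"},
    {"thriller"},
    {"sci-fi"},
]

totals = Counter(f for movie in seen for f in movie)
profile = {f: n / len(seen) for f, n in totals.items()}
print(profile["Julia Roberts"])  # 0.2 -- she appears in 1 of 5 movies
```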
User profiles
- If the utility matrix is not Boolean, e.g., ratings 1-5, then weight the vectors by the utility value and normalize by subtracting the user's average rating (sketched below).
- This way we get negative weights for items with below-average ratings, and positive weights for items with above-average ratings.
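A sketch of the rating-weighted case, assuming NumPy and toy data (Boolean item profiles as rows, 1-5 ratings):

```python
import numpy as np

profiles = np.array([[1, 0],   # movie 0: Julia Roberts
                     [1, 1],   # movie 1: Julia Roberts + comedy
                     [0, 1]])  # movie 2: comedy only
ratings = np.array([5.0, 1.0, 3.0])  # this user's ratings

weights = ratings - ratings.mean()  # below average -> negative
user_profile = weights @ profiles / len(ratings)
print(user_profile)  # [ 0. -0.667]: comedy gets a negative weight
```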
Recommending items to users based on content
- Compute the cosine distance between the user's and each item's vector (a ranking sketch follows).
- Movie example:
  - The highest recommendations (lowest cosine distance) belong to movies with lots of actors that appear in many of the movies the user likes.
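Finally, a sketch of the recommendation step itself: score every item by the cosine between its profile and the user profile, and rank (toy vectors; the names are mine):

```python
import numpy as np

def cos(u: np.ndarray, v: np.ndarray) -> float:
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

user = np.array([0.4, 0.1, 0.0])  # user profile over three actors
items = {
    "movie A": np.array([1, 0, 0]),
    "movie B": np.array([0, 1, 1]),
    "movie C": np.array([1, 1, 0]),
}

ranked = sorted(items, key=lambda m: cos(user, items[m]), reverse=True)
print(ranked)  # ['movie A', 'movie C', 'movie B'], best match first
```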