Download Movie Rating Prediction System

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

Nearest-neighbor chain algorithm wikipedia , lookup

Expectation–maximization algorithm wikipedia , lookup

Principal component analysis wikipedia , lookup

Nonlinear dimensionality reduction wikipedia , lookup

Cluster analysis wikipedia , lookup

K-means clustering wikipedia , lookup

K-nearest neighbors algorithm wikipedia , lookup

Transcript
Movie Rating Prediction System
Group members: Shu Zhang and Yue Xu
Abstract
In this paper, we build a movie rating prediction system on selected training sets provided
by MovieLens. We applied two different dimensionality reduction algorithms: K-means
and Stochastic Gradient Descent. It can be used to predict the rating of a user based on an
unrated movie. RMSE (Root-Mean-Squared-Error) is applied as the evaluating criteria to
analyze the performance of these two algorithms. Also, different parameters of the
Stochastic Gradient Descent method are applied to analyze the effects of each parameter
on the rating results. Finally, the performance of these two different algorithms are
compared and analyzed.
Introduction
Movie is becoming a more and more important entertainment in people’s life. There are
many choice of movies, and what would a person may like is an interesting topic. In this
project, we are building a movie rating predicting system. That is, when a certain user has
reviewed some movies, this system would predict ratings of other movies that he/she has
not reviewed. The result may be used for recommendation on some video website.
We use a 100k MovieLens dataset collected through the MovieLens web site
(movielens.umn.edu) during the seven-month period from September 1997 to April 1998.
The data includes 100,000 ratings among 943 users and 1682 movies. 90% of the data is
considered as the training data to build this movie rating prediction model using the
dimensionality reduction algorithm. The other 10% dataset is used as test data set to
evaluate the model using RMSE criteria. This dataset records different users’ ratings for
different movies (rating score from 1 to 5). This data set consists of:
•
100,000 ratings (1 - 5) from 943 users on 1682 movies (90% would be 90000 ratings)
•
The genres of the 1682 movies
1
The structure of these matrices are shown below:
Denoted as R
Denoted as M
The problem of this paper here comes, as how to predict a particular user’s rating on a
movie that he/she may has not seen yet. This paper aims to find out these missing ratings
so that it can be used to recommend personalized movies to different users.
Method
Data Pre-processing and Problem Formalized
As mentioned before, the whole data has 100,000 ratings, which is split into a 90%
training set and 10% test set. The training and testing data set are pre-processed as
follows: Users are assigned as user ID from 1 to 943, while movies are assigned to have a
item ID be 1 to 1682. So the data can be transformed, as a 1682 * 943 matrix with each
corresponding rating is the value in the matrix, which is a very sparse matrix with lots of
missing ratings.
𝑅!,! =
𝑟!,! , 𝑒𝑥𝑖𝑠𝑡𝑖𝑛𝑔 𝑟𝑎𝑡𝑖𝑛𝑔𝑠
0,
𝑚𝑖𝑠𝑠𝑖𝑛𝑔 𝑟𝑎𝑡𝑖𝑛𝑔𝑠
Thus our work is to fill all the zero-entries in the matrix based on the training set data.
After
all
the
zero-entries
are
filled
out,
the
widely
used
RMSE
(Root-Mean-Squared-Error) criteria are used on the test data set to evaluate the
performance of our training algorithms. To be clear, the RMSE is calculated using the
test data set by comparing the real rating and our predicted rating using the following
formula:
2
!
RMSE =
!
!!!(𝑌!
!
- 𝑌! )!
K-means clustering Algorithm
K-means clustering algorithm can be applied to predict the missing ratings in the matrix
A in the following steps.
1. We use k-means clustering to group the movies based on their genre affiliation. All
the movies and their corresponding genre affiliation are denoted in M as mentioned
before. Based on this matrix, K-means is applied to cluster the movies. As a decision
criterion, the optimal cluster number is determined by minimizing the SSE (Sum
Squared Error) as the following formula:
!
||𝑋! − 𝜇! ||!
argmin
!
!!! !"∈!
After clustering, a matrix, denoted as D, is obtained to show the corresponding
cluster where each movie belongs.
2. This step and next step aim to calculate average rating of each user for each cluster,
which can be achieved as follows.
First, based on matrix R and D, we can obtain a matrix shown below:
Ø 1st column: user ID
Ø 2nd column: movie ID
Ø 3rd column: rating
Ø 4th column: cluster number
3. After that, we group the rows of this matrix based on user ID, then we can calculate
the user’s average rating for each cluster.
The rows are user ID and the columns are
cluster numbers.
Entries are corresponding average ratings.
3
4. Once the average rating of each user for each cluster is generated, the rating matrix R
can be filled out as follows. For a particular user with user ID u and a movie with
movie ID m, user u has not watched movie m before, the predicted rating Rum is the
average rating of u on the cluster where movie m belongs.
5. The final step is to apply the RMSE formula mentioned before to calculate the RMSE
to evaluate the result.
Stochastic Gradient Descent Algorithm
Stochastic Gradient Descent is an extension of the basic SVD algorithm. SVD is known
to give the minimum SSE (Sum of Squared Errors). SSE and RMSE are monotonically
related:
1
𝑆𝑆𝐸
𝑐
So SVD is also minimizing RMSE. But back to our problem, SVD cannot be defined as
𝑅𝑀𝑆𝐸 =
matrix R has lots of missing ratings. The aim is to minimize SSE for the unknown test
ratings. So the goal can be described as to find P and Q such that
Where R = Q * PT
Ø
Where rxi is the ith entry of the xth row in the matrix rating
Ø
Ø
Ø
qi the ith row of matrix Q
px the xth column of matrix P
k is the number of columns in matrix Q
The first idea is to apply SVD to get P and Q based on the training matrix R. This will
minimize the SSE of the training data set. But the model trained based on minimizing the
training error will be over-fitted and not to generalize well to the test data.
R = U∑VT = Q * PT , so that Q = U; PT = ∑VT
To solve the over-fitting problem, intuitively, we should not use a large number of k (k is
the number of columns of matrix Q) as the training matrix R has lots of missing ratings.
So we need to introduce regularization to the problem to consider the number of k. The
formula is shown below, it can be seen that the number of k should shrink sharply when
4
lots of data is missing in the training data matrix. 𝜆 is the regularization parameter and
μ is the learning rate.
𝑚𝑖𝑛
(𝑟!" − 𝑞! 𝑝! )2 + μ𝜆( ||𝑞! ||! +
||𝑝! ||! )
In order to achieve minimum objective, Stochastic Gradient Descent method is applied:
1.
Initialize P and Q by apply the SVD of rating matrix (with missing values)
R = U∑VT = Q * PT
Q=U
PT = ∑VT
2.
Then iterate over the training rating matrix R and update all P and Q
𝜀!" = 2 * (rxi – qi*px)
qi = qi + µ (𝜀!" *px –𝜆*qi)
pi = px + µ (𝜀!" * qi –𝜆* px)
The learning rate needs to be chosen carefully in order to get an optimal solution for this
problem. In order to choose a good learning rate, we tried the values from 0.1 to 0.01, and
calculate the SSE based on the training data for each learning rate.
After this, we find the optimal value for learning rate and then the SGD model is applied
based on the training data and get the P, Q to reconstruct the matrix R and calculate the
RMSE using the method described above.
Finally, another interesting thing is to apply different 𝜆 𝑎𝑛𝑑 𝑘 values to calculate both
the training and test data set SSE. By examining the errors, it can be shown clearly that
introducing regularization into the problem decrease over-fitting issue, which is also in
consistent with our intuition mentioned before.
Results
K-means clustering Algorithm
3 clusters are the best.
RMSE is 1.15.
Stochastic Gradient Descent Algorithm
To find the optimal learning rate μ, we tried values from μ= 0.1 to μ= 0.01 and
calculate the training set SSE for each learning rate μ. It is found that when μ = 0.03,
5
the training error has a minimum. After we get the optimal learning rate, we can apply the
stochastic gradient descent methods mentioned above to get the matrix P and Q and
reconstruct the matrix R. After that, we calculate RMSE is 0.95 after 40 iterations.
Finally the effect of regularization on over-fitting is also analyzed by applying different
values of 𝜆 𝑎𝑛𝑑 𝑘.
From the above graph, when 𝜆 = 0, which means no regularization is introduced, it is
clear that although training set error decrease a lot with number of k increases, the test set
error will increase dramatically when we have larger number of k. But if we use
𝜆 = 0.2, the training set error and test set error will both decrease if we have larger
number of k. The comparison proves that if the regularization solves the over-fitting
problem.
6
Comparison
Clearly, the stochastic gradient descent method gave a better value of RMSE, which can
be analyzed in the following two aspects. First of all, k-means clustering only considered
the movie genre to cluster all the movies. Each user has the same rating on all the movies
belonging to the same cluster. However, the Stochastic Gradient Descent method took the
relationship of both user and movie into consideration. Additionally, comparing to
number of clusters in k-means clustering method, in the Stochastic Gradient Descent
method, the parameter k, which is the number of columns of matrix Q, denoted the
number of factors considered for each movie. Therefore, in comparison with the k-means
cluster with only 3 clusters, the SGD method considered 20 factors for each movie.
Conclusions
To conclude, this project has successfully computed the missing values for a very sparse
matrix by dimensional reduction algorithms. K-means clustering and stochastic gradient
descent algorithm were applied, and the second one got the better result. Results of this
work can be used to recommend movies to users based on predicting user’s rating on a
particular movie. In future, the SGD (Stochastic Gradient Descent) can be explored more
by applying different parameters and optimize the results.
7