Case Study 2: Recommendation Systems
Genoveva Vargas-Solar, http://www.vargas-solar.com/big-data-analytics
French Council of Scientific Research, LIG & LAFMIA Labs
Montevideo, 22nd November – 4th December, 2015

Slides adapted from J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org

Where Recommenders Fit

[Course-map figure: high-dimensional data (locality-sensitive hashing, clustering, dimensionality reduction); graph data (PageRank, SimRank, community detection, spam detection); infinite data (filtering data streams, queries on streams); machine learning (SVM, decision trees, perceptron, kNN); applications (recommender systems, association rules, duplicate document detection, web advertising). This case study covers recommender systems.]

Recommendations

Examples: search results, product recommendations. Items can be products, web sites, blogs, news items, ...

From Scarcity to Abundance

- Shelf space is a scarce commodity for traditional retailers; the same holds for TV networks, movie theaters, ...
- The Web enables near-zero-cost dissemination of information about products.
- From scarcity to abundance: more choice necessitates better filters, i.e., recommendation engines.
- How Into Thin Air made Touching the Void a bestseller: http://www.wired.com/wired/archive/12.10/tail.html

Sidenote: The Long Tail

[Figure: the long-tail distribution of product popularity. Source: Chris Anderson (2004).]

Physical vs. Online

[Figure: physical retailers stock only the head of the popularity distribution; online retailers can also serve the long tail. Read http://www.wired.com/wired/archive/12.10/tail.html to learn more.]

Types of Recommendations

- Editorial and hand-curated: lists of favorites, lists of "essential" items.
- Simple aggregates: Top 10, Most Popular, Recent Uploads.
- Tailored to individual users: Amazon, Netflix, ...

Formal Model

- X = set of customers
- S = set of items
- Utility function u: X × S → R, where R is the set of ratings.
- R is a totally ordered set, e.g., 0-5 stars, or a real number in [0, 1].

Utility Matrix

         Avatar   LOTR   Matrix   Pirates
  Alice    1                0.2
  Bob               0.5               0.3
  Carol   0.2                1
  David                               0.4

Key Problems

1. Gathering "known" ratings: how to collect the data in the utility matrix.
2. Extrapolating unknown ratings from the known ones. We are mainly interested in high unknown ratings: not what you dislike, but what you like.
3. Evaluating extrapolation methods: how to measure the success/performance of recommendation methods.

(1) Gathering Ratings

- Explicit: ask people to rate items. This doesn't work well in practice; people can't be bothered.
- Implicit: learn ratings from user actions, e.g., a purchase implies a high rating. But what about low ratings?

(2) Extrapolating Utilities

- Key problem: the utility matrix U is sparse; most people have not rated most items.
- Cold start: new items have no ratings, and new users have no history.
- Three approaches to recommender systems, detailed below:
  1. Content-based
  2. Collaborative filtering
  3. Latent-factor based
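To make the formal model concrete, here is a minimal sketch of the utility matrix above in Python. Since the matrix is sparse, a nested dictionary is a more natural fit than a dense array; the `ratings` and `utility` names are ours, not from the slides.

```python
# A sparse utility matrix u: X x S -> R, stored as a dict of dicts.
# A missing entry means "rating unknown", not "rating zero".
ratings = {
    "Alice": {"Avatar": 1.0, "Matrix": 0.2},
    "Bob":   {"LOTR": 0.5, "Pirates": 0.3},
    "Carol": {"Avatar": 0.2, "Matrix": 1.0},
    "David": {"Pirates": 0.4},
}

def utility(x, s):
    """Return u(x, s), or None if customer x has not rated item s."""
    return ratings.get(x, {}).get(s)

print(utility("Alice", "Matrix"))   # 0.2
print(utility("Alice", "Pirates"))  # None -> this is what we must extrapolate
```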
Content-Based Recommendation Systems

Content-Based Recommendations

- Main idea: recommend to customer x items similar to previous items rated highly by x.
- Examples:
  - Movie recommendations: recommend movies with the same actor(s), director, genre, ...
  - Websites, blogs, news: recommend other sites with "similar" content.

Plan of Action

[Diagram: from the items a user likes (e.g., red circles and triangles), build item profiles; combine them into a user profile; match the user profile against item profiles to recommend new items.]

Item Profiles

- For each item, create an item profile: a set (vector) of features.
  - Movies: author, title, actor, director, ...
  - Text: the set of "important" words in the document.
- How to pick important features? The usual heuristic from text mining is TF-IDF (term frequency × inverse document frequency), where "term" plays the role of feature and "document" plays the role of item.

Sidenote: TF-IDF

- f_ij = frequency of term (feature) i in doc (item) j
- TF_ij = f_ij / max_k f_kj
  (Note: we normalize TF to discount for "longer" documents.)
- n_i = number of docs that mention term i; N = total number of docs
- IDF_i = log(N / n_i)
- TF-IDF score: w_ij = TF_ij × IDF_i
- Doc profile = the set of words with the highest TF-IDF scores, together with their scores.

User Profiles and Prediction

- User profile possibilities:
  - Weighted average of rated item profiles.
  - Variation: weight by difference from the item's average rating.
  - ...
- Prediction heuristic: given user profile x and item profile i, estimate
  u(x, i) = cos(x, i) = (x · i) / (||x|| · ||i||)

Pros: Content-Based Approach

- +: No need for data on other users: no cold-start or sparsity problems.
- +: Able to recommend to users with unique tastes.
- +: Able to recommend new and unpopular items: no first-rater problem.
- +: Able to provide explanations, by listing the content features that caused an item to be recommended.

Cons: Content-Based Approach

- -: Finding the appropriate features is hard, e.g., for images, movies, music.
- -: Recommendations for new users: how do we build a user profile?
- -: Overspecialization: never recommends items outside the user's content profile, even though people may have multiple interests; unable to exploit the quality judgments of other users.
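The content-based pipeline above can be sketched in a few lines of Python. This is a toy illustration assuming a bag-of-words corpus; the documents and helper names (`tf_idf_profile`, `cosine`) are invented for the example, and the user profile is simply the profile of one highly rated item.

```python
import math

# Toy corpus: each item (document) is a bag of words.
docs = {
    "d1": "the matrix sequel is a science fiction film".split(),
    "d2": "a long fantasy film about a ring".split(),
    "d3": "science fiction film with blue aliens".split(),
}

def tf_idf_profile(doc, docs):
    """w_ij = TF_ij * IDF_i, with TF normalized by the max term frequency.
    Natural log is used; the base only rescales scores, not the ranking."""
    counts = {w: doc.count(w) for w in set(doc)}
    max_f = max(counts.values())
    N = len(docs)
    profile = {}
    for w, f in counts.items():
        n_i = sum(1 for d in docs.values() if w in d)
        profile[w] = (f / max_f) * math.log(N / n_i)
    return profile

def cosine(x, i):
    """cos(x, i) = (x . i) / (||x|| ||i||) over sparse feature dicts."""
    dot = sum(x[w] * i[w] for w in x if w in i)
    nx = math.sqrt(sum(v * v for v in x.values()))
    ni = math.sqrt(sum(v * v for v in i.values()))
    return dot / (nx * ni) if nx and ni else 0.0

profiles = {d: tf_idf_profile(words, docs) for d, words in docs.items()}
# User profile: here, just the profile of the one item the user rated highly (d1).
user = profiles["d1"]
for d in ("d2", "d3"):
    print(d, round(cosine(user, profiles[d]), 3))  # d3 scores higher than d2
```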
Collaborative Filtering: Harnessing the Quality Judgments of Other Users

Collaborative Filtering

- Consider user x.
- Find a set N of other users whose ratings are "similar" to x's ratings.
- Estimate x's ratings based on the ratings of the users in N.

Finding "Similar" Users

Let r_x be the vector of user x's ratings, e.g., r_x = [*, _, _, *, ***] and r_y = [*, _, **, **, _].

- Jaccard similarity: treat r_x, r_y as sets, r_x = {1, 4, 5} and r_y = {1, 3, 4}.
  Problem: ignores the value of the rating.
- Cosine similarity: treat r_x, r_y as points, r_x = [1, 0, 0, 1, 3] and r_y = [1, 0, 2, 2, 0], and compute
  sim(x, y) = cos(r_x, r_y) = (r_x · r_y) / (||r_x|| · ||r_y||)
  Problem: treats missing ratings as "negative".
- Pearson correlation coefficient: let S_xy be the set of items rated by both users x and y, and let rbar_x, rbar_y be the average ratings of x and y:
  sim(x, y) = [ Σ_{s in S_xy} (r_xs - rbar_x)(r_ys - rbar_y) ] / [ sqrt(Σ_{s in S_xy} (r_xs - rbar_x)²) · sqrt(Σ_{s in S_xy} (r_ys - rbar_y)²) ]

Similarity Metric

Consider four users A-D and their movie ratings:

         HP1  HP2  HP3   TW  SW1  SW2  SW3
    A     4               5    1
    B     5    5    4
    C                     2    4    5
    D          3                          3

- Intuitively we want sim(A, B) > sim(A, C): A and C disagree on the movies they share, while A and B merely overlap little.
- Jaccard similarity: 1/5 < 2/4, the wrong ordering.
- Cosine similarity: 0.386 > 0.322, the right ordering, but it still treats missing ratings as "negative" (zeros).
- Solution: subtract the (row) mean from each rating, then apply cosine similarity:
  sim(x, y) = Σ_i r_xi · r_yi / ( sqrt(Σ_i r²_xi) · sqrt(Σ_i r²_yi) ), computed on mean-centered ratings.
  Then sim(A, B) vs. sim(A, C) gives 0.092 > -0.559, as desired.
- Notice that centered cosine similarity is exactly the Pearson correlation when the data is centered at 0.

Rating Predictions

From a similarity metric to recommendations:

- Let r_x be the vector of user x's ratings.
- Let N be the set of the k users most similar to x who have rated item i.
- Prediction for user x's rating of item i (writing s_xy = sim(x, y)):
  - Simple average: r_xi = (1/k) Σ_{y in N} r_yi
  - Similarity-weighted average: r_xi = Σ_{y in N} s_xy · r_yi / Σ_{y in N} s_xy
- Many other tricks are possible...

Item-Item Collaborative Filtering

- So far: user-user collaborative filtering.
- Another view: item-item.
  - For item i, find other similar items.
  - Estimate the rating for item i based on the ratings for similar items.
  - Can use the same similarity metrics and prediction functions as in the user-user model:
    r_xi = Σ_{j in N(i;x)} s_ij · r_xj / Σ_{j in N(i;x)} s_ij
    where s_ij is the similarity of items i and j, r_xj is the rating of user x on item j, and N(i;x) is the set of items rated by x that are similar to i.

Item-Item CF: Worked Example (|N| = 2)

[Table: a 12-user × 6-movie utility matrix with ratings between 1 and 5; "?" marks the unknown rating of movie 1 by user 5, which we want to estimate. The known ratings of movie 1 are by users 1, 3, 6, 9, 11, with values 1, 3, 5, 5, 4.]

Here we use Pearson correlation as the similarity:

1. Subtract the mean rating m_i from each movie (row) i:
   m_1 = (1 + 3 + 5 + 5 + 4) / 5 = 3.6
   centered row 1: [-2.6, 0, -0.6, 0, 0, 1.4, 0, 0, 1.4, 0, 0.4, 0]
2. Compute cosine similarities between the centered rows:
   sim(1, m) for m = 1..6: [1.00, -0.18, 0.41, -0.10, -0.31, 0.59]
3. Neighbor selection: the two movies most similar to movie 1 that were rated by user 5 are movies 3 and 6, with similarity weights s_1,3 = 0.41 and s_1,6 = 0.59. User 5 rated them 2 and 3, respectively.
4. Predict by taking the weighted average:
   r_1,5 = (0.41 · 2 + 0.59 · 3) / (0.41 + 0.59) = 2.6
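Below is a minimal NumPy sketch of the steps above. Only movie 1's row and the two neighbor weights and ratings from the slides are real data; the helper names `centered_cosine` and `predict` are ours.

```python
import numpy as np

def centered_cosine(a, b):
    """Pearson-style similarity as used on the slides: subtract each row's
    mean over its *rated* entries, treat missing entries (NaN) as 0, then
    take the cosine of the two centered rows."""
    def center(r):
        return np.where(np.isnan(r), 0.0, r - np.nanmean(r))
    ca, cb = center(a), center(b)
    denom = np.linalg.norm(ca) * np.linalg.norm(cb)
    return float(ca @ cb / denom) if denom else 0.0

def predict(sims_and_ratings):
    """Weighted average r_xi = sum(s_ij * r_xj) / sum(s_ij) over the k most
    similar items j that user x has rated."""
    num = sum(s * r for s, r in sims_and_ratings)
    den = sum(s for s, _ in sims_and_ratings)
    return num / den

nan = float("nan")
# Row of movie 1 from the worked example (NaN = unrated; user 5 is unknown).
movie1 = np.array([1, nan, 3, nan, nan, 5, nan, nan, 5, nan, 4, nan])
print(np.nanmean(movie1))                        # 3.6, matching m_1
print(round(centered_cosine(movie1, movie1), 2)) # 1.00, matching sim(1, 1)

# Prediction step with the slide's weights and user 5's neighbor ratings
# (s_1,3 = 0.41 -> rating 2, s_1,6 = 0.59 -> rating 3):
print(round(predict([(0.41, 2), (0.59, 3)]), 1))  # 2.6
```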
Item-Item CF: Common Practice

- Define the similarity s_ij of items i and j.
- Select the k nearest neighbors N(i; x): the items most similar to i that were rated by x.
- Estimate the rating r_xi as a baseline plus the weighted average of deviations from the baseline:
  r_xi = b_xi + Σ_{j in N(i;x)} s_ij · (r_xj - b_xj) / Σ_{j in N(i;x)} s_ij
  where the baseline estimate for r_xi is
  b_xi = μ + b_x + b_i
  with μ = overall mean movie rating, b_x = rating deviation of user x = (avg. rating of user x) - μ, and b_i = rating deviation of movie i.

Item-Item vs. User-User

- In practice, it has been observed that item-item often works better than user-user.
- Why? Items are simpler; users have multiple tastes.

Pros/Cons of Collaborative Filtering

- +: Works for any kind of item; no feature selection needed.
- -: Cold start: need enough users in the system to find a match.
- -: Sparsity: the user/ratings matrix is sparse, so it is hard to find users that have rated the same items.
- -: First rater: cannot recommend an item that has not been previously rated (new items, esoteric items).
- -: Popularity bias: cannot recommend items to someone with unique taste; tends to recommend popular items.

Hybrid Methods

- Implement two or more different recommenders and combine their predictions, perhaps using a linear model.
- Add content-based methods to collaborative filtering:
  - Item profiles to deal with the new-item problem.
  - Demographics to deal with the new-user problem.

Tip: Add Data

- Leverage all the data. Don't try to reduce the data size in an effort to make fancy algorithms work; simple methods on large data do best.
- Add more data, e.g., add IMDB data on genres.
- More data beats better algorithms: http://anand.typepad.com/datawocky/2008/03/more-data-usual.html
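As a sketch of the baseline-adjusted prediction, the following Python uses a made-up 3×4 utility matrix. Only the formulas (b_xi = μ + b_x + b_i and the deviation-weighted average) come from the slides; all indices, ratings, and similarity weights here are illustrative.

```python
import numpy as np

nan = float("nan")
# Toy utility matrix (users x movies); NaN = unrated. Values are illustrative.
R = np.array([
    [4.0, nan, 3.0, 5.0],
    [nan, 2.0, 4.0, nan],
    [5.0, 3.0, nan, 4.0],
])

mu = np.nanmean(R)                 # overall mean movie rating
b_x = np.nanmean(R, axis=1) - mu   # user deviations: avg rating of user x minus mu
b_i = np.nanmean(R, axis=0) - mu   # item deviations: avg rating of movie i minus mu

def baseline(x, i):
    """b_xi = mu + b_x + b_i, the baseline estimate for r_xi."""
    return mu + b_x[x] + b_i[i]

def predict(x, i, neighbors):
    """r_xi = b_xi + sum_j s_ij * (r_xj - b_xj) / sum_j s_ij,
    where neighbors = [(j, s_ij)] are similar items j rated by user x."""
    num = sum(s * (R[x, j] - baseline(x, j)) for j, s in neighbors)
    den = sum(s for _, s in neighbors)
    return baseline(x, i) + num / den

# Estimate user 0's rating of movie 1 from two similar movies the user has
# rated (the neighbor indices and weights are made up for illustration):
print(round(predict(0, 1, [(2, 0.4), (3, 0.6)]), 2))  # 2.6
```

Note how the baseline absorbs the global mean and the user's and item's habitual deviations, so the neighbors only need to explain what is left over.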