Applying Semantic Analyses to Content-based Recommendation and Document Clustering
Eric Rozell, MRC Intern
Rensselaer Polytechnic Institute

Bio
• Graduate Student @ Rensselaer Polytechnic Institute
• Research Assistant @ Tetherless World Constellation
• Student Fellow @ Federation of Earth Science Information Partners
• Research Advisor: Peter Fox
• Research Focus: Semantic eScience
• Contact: [email protected]

Outline
• Background
• Semantic Analysis
  – Probase Conceptualization
  – Explicit Semantic Analysis
  – Latent Dirichlet Allocation
• Recommendation Experiment
  – Recommendation Systems
  – Experiment Setup
  – Results
• Clustering Experiment
  – Problem
  – K-Means
  – Results
• Conclusions

Background
• Billions of documents on the Web
• Semi-structured data from Web 2.0 (e.g., tags, microformats)
• Most knowledge remains in unstructured text
• Many natural language techniques exist for:
  – Ontology extraction
  – Topic extraction
  – Named entity recognition/disambiguation
• Some techniques are better than others for various information retrieval tasks…

Probase
• Developed at Microsoft Research Asia
• Probabilistic knowledge base built from the Bing index and query logs (and other sources)
• Text-mining patterns
  – Namely, Hearst patterns: "… artists such as Picasso"
  – Evidence for hypernym(artists, Picasso)

Probase
[Diagram]

Probase
• Very capable at conceptualizing groups of entities:
  – "China; India; United States" yields "country"
  – "China; India; Brazil; Russia" yields "emerging market"
• Differentiates attributes and entities
  – "birthday" -> "person" as attribute
  – "birthday" -> "occasion" as entity
• Applications
  – Clustering tweets by concept [Song et al., 2011]
  – Understanding Web tables
  – Query expansion (topic search)

Research Questions
• What is the best way of extracting concepts from text?
  – Compare techniques for semantic analysis
• How are extracted concepts useful?
  – Generate data about where semantic analysis techniques are applicable
• Are user ratings affected by the concepts in media items such as movies?
  – Test semantic analysis techniques in recommender systems
• How useful is Web-scale domain knowledge in narrower domains for information retrieval?
  – Identify the need for domain-specific knowledge

Semantic Analysis
• Generating meaning (concepts) from text
• Specifically, extracting prevalent hypernyms
  – E.g., "… Apple, IBM, and Microsoft …" yields "technology companies"
• Semantic analysis using external knowledge
  – Probase Conceptualization
  – Explicit Semantic Analysis
  – WordNet Synsets
• Semantic analysis using latent features
  – Latent Dirichlet Allocation
  – Latent Semantic Analysis

Probase Conceptualization
[Diagram: for each document in the corpus, plain-text terms t1, t2, t3, … are mapped through Probase to candidate concepts c1, c2, c3, c4, …; candidates are combined by Naïve Bayes / summation and filtered by inverse document frequency to yield the document's concepts.]

Probase Conceptualization
• "Cowboy doll Woody (Tom Hanks) is coordinating a reconnaissance mission to find out what presents his owner Andy is getting for his birthday party days before they move to a new house. Unfortunately for Woody, Andy receives a new spaceman toy, Buzz Lightyear (Tim Allen), who impresses the other toys and Andy, who starts to like Buzz more than Woody.
Buzz thinks that he is an actual space ranger, not a toy, and thinks that Woody is interfering with his "mission" to return to his home planet…"
Text Source: Internet Movie Database (IMDb)

Sample Features for "Toy Story" (Probase)
• dvd encryptions (0.050): "RC"
• duty free item (0.044): "toys"
• generic word (0.043): "they, travel, it, …"
• satellite mission (0.032): "reconnaissance mission"
• creator-owned work (0.020): "Woody"
• amazing song (0.013): "fury"
• doubtful word (0.013): "overcome"
• ill-fated tool (0.013): "Buzz"
• lovable "toy story" character (0.011): "Buzz Lightyear, Woody, …"
• pleased star (0.010): "Woody"
• trail builder (0.010): "Woody"

Explicit Semantic Analysis
[Image Source: Gabrilovich et al., 2007]

Sample Features for "Toy Story" (ESA)
• #REDIRECT [[Buzz!]] 0.034
• #REDIRECT [[The Buzz]] 0.028
• #REDIRECT [[Buzz (comics)]] 0.027
• #REDIRECT [[Buzz cut]] 0.027
• #REDIRECT [[Buzz (DC Thomson)]] 0.024
• #REDIRECT [[Buzz Out Loud]] 0.024
• #REDIRECT [[The Daily Buzz]] 0.023
• #REDIRECT [[Buzz Aldrin]] 0.022
• #REDIRECT [[Buzz cut]] 0.022
• #REDIRECT [[Buzzing Tree Frog]] 0.022

Latent Dirichlet Allocation
• Blei et al., 2003
• Unsupervised learning method
• "Generates" documents from Dirichlet distributions over words and topics
• Topic distributions over documents can be inferred from the corpus
Image Source: Wikipedia

Recommendation Systems
• Collaborative filtering
  – "Customers who purchased X also purchased Y."
• Content-based
  – "Because you enjoyed 'GoldenEye', you may want to watch 'Mission: Impossible'."
• Hybrid
  – Most modern systems take a hybrid approach.
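The content-based route can be sketched as cosine similarity between item feature vectors. This is only a minimal illustration of the idea, not the Matchbox model used in the experiments; the film names and concept weights below are made up for the example.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse feature vectors (dicts)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

# Hypothetical concept vectors for three films (weights are illustrative only).
goldeneye = {"intelligence agency": 0.8, "aircraft": 0.5, "espionage": 0.9}
mission_impossible = {"intelligence agency": 0.7, "espionage": 0.8, "heist": 0.4}
toy_story = {"toy": 0.9, "space ranger": 0.6}

# Recommend the candidate most similar to an item the user liked.
candidates = {"Mission: Impossible": mission_impossible, "Toy Story": toy_story}
best = max(candidates, key=lambda name: cosine(goldeneye, candidates[name]))
```

With these toy weights, `best` is "Mission: Impossible", since it shares the espionage-related concepts with "GoldenEye" while "Toy Story" shares none.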
Content-based Recommendation
• In the GoldenEye / Mission: Impossible example…
  – Structured item content
    • Genre: Action/Adventure/Thriller
    • Tags: Action, Espionage, Adventure
  – Unstructured item content
    • Plot synopses: "helicopter, agent, infiltrate, CIA, …"
    • Concepts? "aircraft, intelligence agency, …"

Recommendation Systems
[Diagram: recommendation systems split into collaborative filtering approaches, structured content-based approaches, and unstructured content-based approaches; the semantic analysis techniques are tested in the unstructured content-based setting.]

Experiment
[Pipeline: movie ratings from MovieLens and movie synopses from IMDb feed feature generation; the features are passed to the Matchbox recommendation platform and evaluated by mean absolute error (MAE).]

Matchbox
[Diagram. Source: Matchbox API Documentation]

Experimental Data
• Data: MovieLens dataset [HetRec '11]
  – 855,598 ratings
  – 10,197 movies
  – 2,113 users
• Movie synopses from IMDb (http://www.imdb.com)
  – Collected synopses for 2,633 movies
  – With 435,043 ratings
  – From 2,113 users
• Ratings data:
  – Scored in half-point increments from 0.5 to 5
• Choose different numbers of movies (200; 1,000; all)
• Train on 90% of ratings, test on the remaining 10%

Experimental Data
• Controls
  – Baseline 1: only features are user IDs and movie IDs
  – Baseline 2: user IDs, movie IDs, movie genre
  – Baseline 3: user IDs, movie IDs, movie tags
• Feature Sets
  – Term Frequency / Inverse Document Frequency (TF-IDF)
  – Latent Dirichlet Allocation
  – Explicit Semantic Analysis
  – Probase Conceptualization

Experimental Setup
• 4 scenarios (training: white, testing: black)
[Diagram: four train/test splits over the users-by-movies ratings matrix, varying whether test users and test movies appear in training.]
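The 90/10 split and MAE evaluation can be sketched as follows. The ratings here are randomly generated stand-ins, not MovieLens data, and the global-mean predictor is only a placeholder for a real recommender such as Matchbox.

```python
import random

def mean_absolute_error(pairs):
    """MAE over (predicted, actual) rating pairs."""
    return sum(abs(p - a) for p, a in pairs) / len(pairs)

# Made-up (user, movie, rating) triples; ratings in half points from 0.5 to 5,
# mirroring the MovieLens rating scale described above.
random.seed(0)
ratings = [(u, m, random.choice([x * 0.5 for x in range(1, 11)]))
           for u in range(5) for m in range(10)]

# Train on 90% of ratings, test on the remaining 10%.
random.shuffle(ratings)
cut = int(0.9 * len(ratings))
train, test = ratings[:cut], ratings[cut:]

# Trivial baseline predictor: the global mean of training ratings.
global_mean = sum(r for _, _, r in train) / len(train)
mae = mean_absolute_error([(global_mean, r) for _, _, r in test])
```

Any of the feature-based models in the experiment would plug in where `global_mean` is used, with the same split-and-score loop around it.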
Results
[Chart: MAE over 10 iterations for Baseline #1, Baseline #2, Baseline #3, TF-IDF Normalized, Probase Sum, and ESA; MAE values range from roughly 0.56 to 0.595.]

Results: new users and new movies (MAE by # of movies)
Feature set   All (2,633)    1,000      200
Baseline 1    0.672293       0.71654    0.802044
Baseline 2    0.641556       0.683297   0.752745
Baseline 3    0.655613       0.68994    0.764369
TF-IDF        0.674764       0.706914   0.815245
Probase       0.670694       0.715456   0.797196
ESA           0.670182       0.714967   0.796787
LDA           (unfinished)   0.711307   0.790362
• Testing set contains users and movies not seen in the training set
• Recommendations are based on item features alone
• Small amounts of structured data (e.g., genre) are the most influential in this scenario

Results: new users (MAE by # of movies)
Feature set   All (2,633)    1,000      200
Baseline 1    0.580087       0.564226   0.577349
Baseline 2    0.576183       0.563028   0.576673
Baseline 3    0.575398       0.563378   0.572297
TF-IDF        0.579906       0.575932   0.588288
Probase       0.578889       0.563669   0.578089
ESA           0.579798       0.564334   0.577638
LDA           (unfinished)   0.566639   0.579633
• Testing set contains users not seen in the training set
• Lots of collaborative data available (explains the comparable performance across all feature sets)
• Given extensive collaborative data, item features are marginally beneficial (in Matchbox)

Results: new movies (MAE by # of movies)
Feature set   All (2,633)    1,000      200
Baseline 1    0.672843       0.687586   0.832491
Baseline 2    0.639683       0.651141   0.81416
Baseline 3    0.652071       0.66492    0.745593
TF-IDF        0.672362       0.665116   0.844305
Probase       0.670159       0.686235   0.823972
ESA           0.670451       0.683594   0.817306
LDA           (unfinished)   0.684689   0.852056
• Testing set contains movies not seen in the training set
• Recommendations are based on item features and extensive information on each user's "rating model"
• Small amounts of structured data (e.g., genre) are the most influential in this scenario (even for long-term users)

Results: known users and movies (MAE by # of movies)
Feature set   All (2,633)    1,000      200
Baseline 1    0.560163       0.564673   0.568706
Baseline 2    0.556011       0.556456   0.567598
Baseline 3    0.550761       0.561643   0.56445
TF-IDF        0.551909       0.558942   0.588288
Probase       0.556414       0.558113   0.567332
ESA           0.556517       0.55706    0.568174
LDA           (unfinished)   0.558105   0.568927
• Testing set contains users and movies seen in the training set
• Recommendations again are primarily collaborative
• Given a large corpus of rating data for users and items, item features are only marginally beneficial

Results: summary (all 2,633 movies; one column per scenario above)
Feature set   New users+movies   New users   New movies   Known users+movies
Baseline 1    0.672293           0.580087    0.672843     0.560163
Baseline 2    0.641556           0.576183    0.639683     0.556011
Baseline 3    0.655613           0.575398    0.652071     0.550761
TF-IDF        0.674764           0.579906    0.672362     0.551909
Probase       0.670694           0.578889    0.670159     0.556414
ESA           0.670182           0.579798    0.670451     0.556517

Document Clustering
• Divide a corpus into a specified number of groups
• Useful for information retrieval
  – Automatically generated topics for search results
  – Recommendations for similar items/pages
  – Visualization of the search space

K-Means
1. Start with initial clusters
2. Compute the means of the clusters
3. Compare the cosine distance of each item to the means
4. Assign items to clusters based on minimum distance
5. Repeat from step 2 until convergence

Experimental Setup
1. Generate features for the datasets
2. Randomly assign initial clusters
3. Run K-Means
4. Compute purity and ARI
5. Repeat steps 2-4 20 times for the mean and standard deviation

Experimental Data
• 20 Newsgroups (mini)
• 2,000 messages from Usenet newsgroups
• 100 messages per topic
• Filter messages for body text
• Source: http://kdd.ics.uci.edu/databases/20newsgroups/20newsgroups.html
From sci.electronics…
"A couple of years ago I put together a Tesla circuit which was published in an electronics magazine and could have been the circuit which is referred to here. This one used a flyback transformer from a tv onto which you wound your own primary windings..."

Results
Feature Set        Purity          ARI Scores
TF-IDF             0.379 ± 0.027   0.199 ± 0.023
Probase Only       0.265 ± 0.013   0.101 ± 0.010
Probase + TF-IDF   0.414 ± 0.034   0.241 ± 0.029
ESA Only           0.204 ± 0.010   0.040 ± 0.004
ESA + TF-IDF       0.389 ± 0.036   0.211 ± 0.032
LDA Only           N/A             N/A
LDA + TF-IDF       N/A             N/A

Results Comparison
• Song et al., tweet clustering
  – Experiment #2: subtle cluster distinctions
  – Used tweets about North America, Asia, Africa, and Europe
  – Comparable performance for ESA and Probase Conceptualization
• Hotho et al., WordNet clustering
  – Used the Reuters dataset and bisecting K-Means
  – Found the best results for TF-IDF combined with the concept feature sets
  – Overall improvement from WordNet features was comparable to that from Probase features (on the order of +10%)

Conclusions
• Semantic analysis features are marginally beneficial in recommendation
• Structured data from a limited vocabulary works best for recommending "new items"
• Explicit and latent semantic analysis are comparable in recommendation
• Knowledge bases generated at Web scale may be too noisy for narrow-domain tasks
• Confirmed the efficacy of semantic analysis in document clustering tasks

Future Directions
• Noise reduction
  – Tune the recommender platform for "concepts"
  – Further explore the parameter space for feature generators
  – Hybrid conceptualization / named entity disambiguation?
• Domain-specific knowledge sources
  – Comparison of Web-scale and domain-specific resources as external knowledge (e.g., [Aljaber et al., 2010])

Further Reading
• Short Text Conceptualization Using a Probabilistic Knowledge Base [Song et al., 2011]
• Exploiting Wikipedia as External Knowledge for Document Clustering [Hu et al., 2009]
• Hybrid Recommender Using WordNet "Bag of Synsets" [Degemmis et al., 2007]
• Hybrid Recommender Using LDA [Jin et al., 2005]
• Feature Generation for Text Categorization Using World Knowledge [Gabrilovich and Markovitch, 2005]
• WordNet Improves Text Document Clustering [Hotho et al., 2003]

Acknowledgements
• David Stern, Ulrich Paquet, Jurgen Van Gael
• Haixun Wang, Yangqiu Song, Zhongyuan Wang
• Special thanks to Evelyne Viegas!
• Microsoft Research Connections

References
• [Gabrilovich et al., 2007] Evgeniy Gabrilovich and Shaul Markovitch. 2007. Computing semantic relatedness using Wikipedia-based explicit semantic analysis.
In Proceedings of the 20th International Joint Conference on Artificial Intelligence (IJCAI'07), Rajeev Sangal, Harish Mehta, and R. K. Bagga (Eds.). Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1606-1611.
• [Blei et al., 2003] David M. Blei, Andrew Y. Ng, and Michael I. Jordan. 2003. Latent Dirichlet allocation. Journal of Machine Learning Research 3 (March 2003), 993-1022.
• [Song et al., 2011] Yangqiu Song, Haixun Wang, Zhongyuan Wang, Hongsong Li, and Weizhu Chen. 2011. Short text conceptualization using a probabilistic knowledgebase. In IJCAI 2011.
• [Stern et al., 2009] David H. Stern, Ralf Herbrich, and Thore Graepel. 2009. Matchbox: large scale online Bayesian recommendations. In Proceedings of the 18th International Conference on World Wide Web (WWW '09). ACM, New York, NY, USA, 111-120.
• [HetRec '11] Ivan Cantador, Peter Brusilovsky, and Tsvi Kuflik. 2011. 2nd Workshop on Information Heterogeneity and Fusion in Recommender Systems (HetRec 2011). In Proceedings of the 5th ACM Conference on Recommender Systems. ACM, New York, NY, USA.
• [Degemmis et al., 2007] Marco Degemmis, Pasquale Lops, and Giovanni Semeraro. 2007. A content-collaborative recommender that exploits WordNet-based user profiles for neighborhood formation. User Modeling and User-Adapted Interaction, Vol. 17, Issue 3, 217-255.

References
• [Jin et al., 2005] Xin Jin, Yanzan Zhou, and Bamshad Mobasher. 2005. A maximum entropy web recommendation system: combining collaborative and content features. In Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery in Data Mining (KDD '05). ACM, New York, NY, USA, 612-617.
• [Hu et al., 2009] Xiaohua Hu, Xiaodan Zhang, Caimei Lu, E. K. Park, and Xiaohua Zhou. 2009. Exploiting Wikipedia as external knowledge for document clustering. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '09). ACM, New York, NY, USA, 389-396.
• [Gabrilovich and Markovitch, 2005] Evgeniy Gabrilovich and Shaul Markovitch. 2005. Feature generation for text categorization using world knowledge. In Proceedings of the 19th International Joint Conference on Artificial Intelligence (IJCAI'05).
• [Hotho et al., 2003] Andreas Hotho, Steffen Staab, and Gerd Stumme. 2003. WordNet improves text document clustering. In Proceedings of the SIGIR 2003 Semantic Web Workshop, 541-544.
• [Aljaber et al., 2010] Bader Aljaber, Nicola Stokes, James Bailey, and Jian Pei. 2010. Document clustering of scientific texts using citation contexts. Information Retrieval, Vol. 13, Issue 2, 101-131.

Questions?
• Thanks for attending!

Appendix
A. Matchbox Details
B. Implementation Details
C. Probase Conceptualization Details
D. Explicit Semantic Analysis Details
E. Learnings from Probase

(Appendix A) Matchbox
• [Stern et al., 2009]
• MSR Cambridge recommendation platform
• Implements a hybrid recommender using Infer.NET
  – Uses a combination of expectation propagation (EP) and variational message passing
• Reduces user, item, and context features to a low-dimensional trait space

(Appendix A) Matchbox Setup
• Matchbox settings
  – Used 20 trait dimensions (determined experimentally)
  – 10 iterations of the EP algorithm
  – Trained on approx. 90% of ratings
  – Updated the model with 75% of the ratings per user (in the remaining 10%)
  – MAE computed for the remaining 25% per user

(Appendix B) Implementation
• ESA: https://github.com/faraday/wikiprep-esa
• LDA: Infer.NET
• Probase: Probase Package v. 0.18
• TF-IDF: http://www.codeproject.com/KB/cs/tfidf.aspx
• Matchbox: http://codebox/matchbox

(Appendix C) Probase Conceptualization
1. Identify all Probase terms in the text
2. Use a noisy-or model to combine:
   – Concepts from t_l as attribute (z_l = 1)
   – Concepts from t_l as entity/concept (z_l = 0)

(Appendix C) Probase Conceptualization
3. Weight terms based on occurrence
   a. Naïve Bayes (similar to Song et al., 2010)
      • Compute P(c|t) for individual terms and use a Naïve Bayes model to derive concepts
      • Penalizes false positives, does not reward true positives
      • Generates very small probabilities for large numbers of terms
   b. Weighted sum (similar to Gabrilovich et al., 2007)
      • Compute P(c|t) for individual terms and sum over the document for each concept
      • Rewards true positives, does not penalize false positives (accurate and inaccurate concepts, resp.)

(Appendix C) Probase Conceptualization
4. Penalize frequent concepts
   – Stop concepts, like stop words, are domain-independent
   – For films, there are also many domain-specific stop concepts, e.g., "movie", "character", "actor"
   – Inverse document frequency on concepts penalizes those that are too frequent
   – But it also rewards those that are too infrequent (appearing in only one document)
   – Solution: filter for minimum and maximum occurrence

(Appendix C) Probase Conceptualization
• Using summation (similar to Wikipedia ESA)
• Using Naïve Bayes from the Song et al. approach
  – P(c|T) = P(T|c) P(c) / P(T) ∝ ∏_{l=1..L} P(c|t_l) / P(c)^(L-1)
• Inverse document frequency for concepts
  – IDF(c_k) = log(# of documents / document frequency of c_k)
  – Minimum occurrence = 2
  – Maximum occurrence = 0.5 × # of documents

(Appendix D) Explicit Semantic Analysis
• Gabrilovich et al., 2007
• Builds an inverted index of Wikipedia content
• Input text is converted to a weight vector of concepts based on TF-IDF
• score(c_j) = Σ_{w_i ∈ T} v_i · k_j
  – v_i: TF-IDF weight of w_i
  – k_j: weight of concept c_j for w_i

(Appendix E) Learnings from Probase
• Conceptualization works wonders for small numbers of entities
• Would be extremely useful in a large-scale QA environment with many semantic analysis and ML algorithms (e.g., Watson)
• A noisy source of knowledge is best suited to noise-tolerant IR applications
• Still being developed and improving!
  – Working on recognizing verbs
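The ESA weighting in Appendix D (each concept c_j scored as the sum over input terms w_i of v_i · k_j) can be sketched as follows. The inverted index and term weights here are tiny illustrative stand-ins, not real Wikipedia data.

```python
def esa_vector(text_terms, inverted_index, tfidf):
    """Score each concept c_j as the sum over terms w_i of v_i * k_j,
    where v_i is the term's TF-IDF weight and k_j is the term's weight
    for concept c_j in the inverted index."""
    scores = {}
    for term in text_terms:
        v = tfidf.get(term, 0.0)
        for concept, k in inverted_index.get(term, {}).items():
            scores[concept] = scores.get(concept, 0.0) + v * k
    return scores

# Toy inverted index: term -> {concept: weight}. Values are made up.
index = {
    "buzz": {"Buzz Aldrin": 0.3, "Buzz (comics)": 0.2},
    "toy": {"Toy": 0.9},
}
weights = {"buzz": 0.5, "toy": 0.8}  # hypothetical TF-IDF weights
vec = esa_vector(["buzz", "toy"], index, weights)
```

Ranking the resulting concept scores gives feature lists like the "Toy Story" ESA sample shown earlier in the deck.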