Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Mining Query Subtopics from Search Log Data Hu et al., SIGIR’12 Presented by Baichuan Li Outline • • • • • • Problem Two observations Approach Results Applications Conclusion Problem • Intention != Query • Many queries are ambiguous or multifaceted Problem Problem • Xbox – Homepage? – Online game? – Where to buy? • Manchester – Manchester news? – Manchester tourism? – Manchester weather? – Manchester united? Solution • Identifying the major subtopics of a query from the search log data – Personalized search – Query suggestion – Search result presentation • Clustering • Re-ranking • Diversification Observations from The Logs • One subtopic per search (OSS) – If a user clicks multiple URLs after submitting a query, then the clicked URLs tend to represent the same subtopic Observations from The Logs • Subtopic clarification by additional keyword (SCAK) – Users often add additional keywords (in most cases, one additional keyword) to a query to expand the query in order to clarify its subtopic – The URLs clicked after searching both with the original and the expanded queries tend to represent the same subtopic – The key word tends to be indicative of the subtopic – E.g. • harry shum microsoft One Subtopic per Search • Use it as the rule for subtopic identification – 10,000 groups of multiclicks of individual queries – The rule is accurate when all the URLs within the multi-clicks are about the same sense or face Accuracy V.S. Frequency Accuracy v.s. Click Position Distribution • The queries with higher frequencies in search log data are more likely to have multi-clicks Subtopic Clarification by Additional Keyword • The keywords ‘microsoft’ and ‘jr’ can be used to represent the two groups (subtopics) respectively Query Types • • • • ‘Q’: the query is a single phrase ‘Q + W’: ‘Q’ + a keyword ‘W + Q’: a keyword + ‘Q’ ‘Others’ Subtopic Overlap and URL Overlap • Randomly selected 500 pairs of queries with the forms ‘Q’ and ‘Q + W’ • If subtopics of an expanded query are contained in subtopics of the original query, then there is ‘subtopic overlap’ • Check whether two queries share identical clicked URLs (‘URL overlap’) Distribution • The more popular (frequent) a query is, the more likely the rule is applicable Mining Subtopics Preprocessing Clustering • Similarity function – The similarity function between URLs ui and uj : – S1: based on the OSS phenomenon where and denote the vectors of multi-clicks of ui and uj respectively – S2: based on the SCAK phenomenon – S3: based on string similarities SCAK Similarity (S2) where and denote the vectors of keywords associated with ui and uj respectively Data • TREC Data: TREC search result diversification track 3 in 2009 • DataSetA and DataSetB: queries and URLs randomly sampled from the logs of the commercial search engine Results Application: Search Result Reranking • The user is first asked which subtopic she is interested in, with the subtopics shown at the top of the results page • When the user selects a subtopic, the URLs belonging to the subtopic will be moved to the top (re-ranked) • The relative order between URLs inside and outside of the subtopic will be kept Example Evaluation • Data – Collect search log data of 20, 000 randomly selected searches – Each query has at least two subtopics mined by the method • Results – The average position of last clicked URLs is 3.41 – Assume the cost for the user to check the subtopics and click one of them is 1.0 – The average position of last clicked URLs belonging to the same subtopics is 1.80 Conclusion • Study the problem of query subtopic mining • Discovered two phenomena of user search behavior – one subtopic per search – Subtopic clarification by additional keyword • Design a novel similarity function • Applications on search result reranking (and search result clustering) • Problem – Can only be employed when there is enough search log data