Context-aware Query Suggestion by Mining Click-through and Session Data
Authors: H. Cao et al., KDD '08
Presented by Shize Su

Outline
* Introduction
* Framework of the Proposed Method
* Mining Query Concepts
* Concept Sequence Suffix Tree
* Experimental Evaluation
* Summary

Introduction
* What is query suggestion in a search engine?
  * Guess the user's search intent: given a user query $q_i$, or a query sequence $q_1 q_2 \cdots q_i$, suggest queries $q_j, q_k, \ldots$
  * The suggested queries should better describe the user's information need.
* Why is query suggestion important?
  * Is it easy to issue an appropriate query? No!
  * It is a "bottleneck issue" of search engine usability (Google, Yahoo, Bing, Baidu, etc.)

Major existing approaches (with search log data):
* Approach I: cluster queries using clicked-URL data to find similar queries. Are $q_i$ and $q_j$ similar?
* Approach II: mine pairs of queries that are adjacent or co-occur in the same query session. Is $q_i q_j$ frequent?
Fig. 1: An example of search log data

Key limitations:
* None of them is context-aware: they do not consider the immediately preceding queries $q_1 q_2 \cdots q_{i-1}$ as context for $q_i$.
* The clustering algorithms do not scale well to very large data: 1.8 billion queries (151 million unique), 2.6 billion clicked URLs (114 million unique).
* An example: "apple" issued alone vs. "steve jobs" followed by "apple". What is the user's search intent?
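
The context-unaware behavior of Approach II can be illustrated with a minimal Python sketch. The sessions below are hypothetical toy data, not taken from the search log in the slides, and the function names `adjacency_counts` and `suggest_adjacent` are made up for this illustration. The sketch ranks follow-up queries by adjacency frequency only, so the preceding query "steve jobs" has no effect on the suggestion for "apple".

```python
from collections import Counter, defaultdict

# Hypothetical toy sessions (illustration only): each list is one query session.
SESSIONS = [
    ["apple", "apple pie recipe"],
    ["apple", "apple pie recipe"],
    ["steve jobs", "apple", "iphone"],
    ["apple", "fruit nutrition"],
]


def adjacency_counts(sessions):
    """Count how often query b immediately follows query a in a session."""
    counts = defaultdict(Counter)
    for session in sessions:
        for a, b in zip(session, session[1:]):
            counts[a][b] += 1
    return counts


def suggest_adjacent(counts, history, k=1):
    """Rank follow-ups using only the last query; earlier queries are ignored."""
    last = history[-1]
    return [q for q, _ in counts[last].most_common(k)]


counts = adjacency_counts(SESSIONS)
# Both calls print ['apple pie recipe']: the context "steve jobs" is lost.
print(suggest_adjacent(counts, ["apple"]))
print(suggest_adjacent(counts, ["steve jobs", "apple"]))
```

In this toy data the only session that starts with "steve jobs" continues to "iphone", yet the adjacency ranking cannot use that signal because it looks at a single preceding query.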

Proposed Method Framework
Key steps:
* Cluster queries into concepts
* Capture the context as a concept sequence
* Quickly find the queries that many users ask in that context, using a concept sequence suffix tree

Mining Query Concepts
An example of click-through bipartite data from a search log.
For each query $q_i$, build an L2-normalized vector over the URLs, where $w_{ij}$ is the weight of edge $e_{ij}$:

$$\vec{q_i}[j] = \begin{cases} \mathrm{norm}(w_{ij}) & \text{if edge } e_{ij} \text{ exists} \\ 0 & \text{otherwise} \end{cases} \qquad \text{with} \quad \mathrm{norm}(w_{ij}) = \frac{w_{ij}}{\sqrt{\sum_{\forall e_{ik}} w_{ik}^2}}$$

$$\mathrm{distance}(q_i, q_j) = \sqrt{\sum_{u_k \in U} \left(\vec{q_i}[k] - \vec{q_j}[k]\right)^2}$$

Key challenges in clustering queries:
* The search log click-through bipartite can be huge, e.g., 151 million unique queries
* The number of clusters is unknown
* The query vectors have extremely high dimensionality: 114 million unique URLs
* Search logs grow dynamically
Existing query clustering algorithms:
* Hierarchical agglomerative method
* DBSCAN method (Wen, WWW'01)
* K-means, etc.

Proposed clustering method, for each query $q$:
* Step 1: find the cluster $C$ closest to $q$ among the clusters obtained so far
* Step 2: compute the diameter of the cluster $C \cup \{q\}$
* Step 3:
  * 1) if the diameter is $\le D_{max}$, $q$ is assigned to $C$, i.e., $C = C \cup \{q\}$
  * 2) otherwise, create a new cluster containing only $q$
The method is quite efficient:
* It needs only one scan of the queries
* It can run efficiently on a PC with 2 GB of main memory

Tricks for improving the algorithm's efficiency:
* A dimension array data structure used in Step 1 (exploits the sparsity of the data)
* Prune edges of low weight

Concept Sequence Suffix Tree
Extract query session data:
* Collect each individual user's behavior (query/click) data
* Segment it into sessions (a time interval of more than 30 minutes starts a new session)
* Discard the click event data
Fig.: An example of search log data

Concept sequence suffix tree: a structure used to efficiently find the queries that many users ask in a given context (concept sequence).
Fig.: An example

Algorithm to build the concept sequence suffix tree:
* 1) Map each training session $qs = q_1 q_2 \cdots q_i$ to a concept sequence $c_1 c_2 \cdots c_j$, $j \le i$
* 2) Enumerate the subsequences $c_i c_{i+1} \cdots c_l$ of $c_1 c_2 \cdots c_j$, with $i \ge 1$, $l \le j$ (distributed, map-reduce)
* 3) Get all frequent concept subsequences $cs$
* 4) Organize these $cs$ into the concept sequence suffix tree

Algorithm for organizing $cs$ into the concept sequence suffix tree:
* 1) Start from the root node (empty) and scan through all frequent concept subsequences $cs$
* 2) For each $cs = c_1 c_2 \cdots c_l$, first find the node $cr$ corresponding to $cs' = c_1 c_2 \cdots c_{l-1}$; if $cr$ does not exist, create it
* 3) Update the list of candidate concepts of $cs'$ if $c_l$ is among the top $K$ candidates so far ($K$ is a specified threshold, e.g., $K = 5$)
* 4) The representative queries of the top $K$ candidate concepts are the candidate suggestions for the sequence $cs'$

Review an example of the concept sequence suffix tree, with $cs = c_1 c_2 \cdots c_l$ and $cs' = c_1 c_2 \cdots c_{l-1}$.

Online query suggestion algorithm, for a query sequence $q_1 q_2 \cdots q_l$:
* Map it to a concept sequence $c_1 c_2 \cdots c_l$; if $q_i$ is a new query, stop mapping and return the concept sequence corresponding to $q_{i+1} q_{i+2} \cdots q_l$
* Search the tree to find the longest matched subsequence of the form $c_j c_{j+1} \cdots c_l$, $j \ge 1$
* Use the candidate suggestions for $c_j c_{j+1} \cdots c_l$, $j \ge 1$, as the query suggestions for $q_1 q_2 \cdots q_l$

Review an example of the concept sequence suffix tree: $qs = q_1 q_2 \cdots q_i$ is matched as $cs = c_j \cdots c_i$, $1 \le j \le i$.

Experimental Evaluation
Training data:
* A commercial search engine's search log (Bing) in the US
* 1.8 billion queries (151 million unique), 2.6 billion URL clicks (115 million unique), 840 million sessions
Baseline algorithms:
* Adjacency: given $q_i$, rank $q_j$ based on the frequency of $q_i q_j$
* N-Gram: given $qs = q_1 q_2 \cdots q_i$, rank $q_j$ based on the frequency of $q_1 q_2 \cdots q_i q_j$
Test set data:
* Test-0: 1000 randomly selected single-query sessions
* Test-1: 1000 randomly selected multi-query sessions

Experimental Results
Coverage of suggestion:
Fig.: The coverage of the three methods on (a) Test-0 and (b) Test-1
Quality of suggestion (relevance gradings collected from 10 judges):
Fig.: The quality of the three methods on (a) Test-0 and (b) Test-1

Summary
Three things to know:
* Some basics about query suggestion using search logs
* The proposed efficient query clustering algorithm for search log click-through bipartite data
* The proposed efficient context-aware query suggestion method using the concept sequence suffix tree
Hint: "concept"-level N-gram with varied length N, plus a structure for efficient search

Thank You!
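
The summary's hint (a "concept"-level N-gram of varied length plus a structure for efficient search) can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the query-to-concept mapping, the representative queries, and the sessions are hypothetical toy data; nested dicts stand in for the tree nodes; consecutive queries of the same concept are assumed to collapse into one concept (one way to obtain $j \le i$); and all candidates are kept at build time, with the top $K$ taken at lookup time.

```python
from collections import Counter

# Hypothetical query -> concept mapping; in the slides it comes from clustering.
CONCEPT_OF = {
    "apple": "C_apple", "apple inc": "C_apple",
    "steve jobs": "C_jobs", "iphone": "C_iphone",
    "apple pie recipe": "C_pie",
}
# Hypothetical representative query for each concept.
REPRESENTATIVE = {"C_apple": "apple", "C_jobs": "steve jobs",
                  "C_iphone": "iphone", "C_pie": "apple pie recipe"}
# Hypothetical training sessions.
SESSIONS = [
    ["apple", "apple pie recipe"],
    ["apple", "apple pie recipe"],
    ["apple inc", "apple pie recipe"],
    ["steve jobs", "apple", "iphone"],
    ["steve jobs", "apple inc", "iphone"],
]


def to_concepts(queries):
    """Map queries to concepts, collapsing consecutive repeats (assumption)."""
    seq = []
    for q in queries:
        c = CONCEPT_OF[q]
        if not seq or seq[-1] != c:
            seq.append(c)
    return seq


def frequent_subsequences(sessions, min_support):
    """Count contiguous concept subsequences of length >= 2, keep frequent ones."""
    counts = Counter()
    for session in sessions:
        cs = to_concepts(session)
        for i in range(len(cs)):
            for l in range(i + 2, len(cs) + 1):
                counts[tuple(cs[i:l])] += 1
    return {s: n for s, n in counts.items() if n >= min_support}


def new_node():
    return {"children": {}, "candidates": Counter()}


def build_tree(frequent):
    """Store context c_1..c_{l-1} on the path c_{l-1}, ..., c_1 from the root,
    and record c_l as a candidate concept at the node reached."""
    root = new_node()
    for cs, n in frequent.items():
        node = root
        for c in reversed(cs[:-1]):
            node = node["children"].setdefault(c, new_node())
        node["candidates"][cs[-1]] += n
    return root


def suggest(root, queries, k=5):
    """Suggest queries using the longest suffix of the context found in the tree."""
    known = []
    for q in reversed(queries):  # map backwards, stop at the first new query
        if q not in CONCEPT_OF:
            break
        known.append(q)
    cs = to_concepts(reversed(known))
    node, best = root, None
    for c in reversed(cs):  # walk from the most recent concept backwards
        node = node["children"].get(c)
        if node is None:
            break
        if node["candidates"]:
            best = node
    if best is None:
        return []
    return [REPRESENTATIVE[c] for c, _ in best["candidates"].most_common(k)]


root = build_tree(frequent_subsequences(SESSIONS, min_support=2))
print(suggest(root, ["apple"], k=1))                # ['apple pie recipe']
print(suggest(root, ["steve jobs", "apple"], k=1))  # ['iphone']
```

With this toy data, the same last query "apple" yields different suggestions depending on the preceding query, which is the context-aware behavior the slides describe. A query sequence whose older queries are unknown falls back to the suggestions for the known suffix.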