
Context-aware Query Suggestion by Mining Click-through and Session Data
Authors: H. Cao et al., KDD '08
Presented by Shize Su

Outline
- Introduction
- Framework of the Proposed Method
- Mining Query Concepts
- Concept Sequence Suffix Tree
- Experimental Evaluation
- Summary

Introduction

What is query suggestion in a search engine?
- Guess the user's search intent: given the user query $q_i$, or the query sequence $q_1 q_2 \cdots q_i$, suggest queries $q_j, q_k$.
- Do the suggested queries better describe the user's information need?

Why is query suggestion important?
- Is it easy to issue an appropriate query? No!
- It is a "bottleneck issue" of search engine usability (Google, Yahoo, Bing, Baidu, etc.).

Major existing approaches (with search log data):
- Approach I: cluster queries using clicked-URL data to find similar queries. Are $q_i$ and $q_j$ similar?
- Approach II: mine pairs of queries that are adjacent or co-occur in the same query session. Is $q_i q_j$ frequent?

Fig. 1: An example of search log data

Key limitations:
- None of them is context-aware: they do not consider the immediately preceding queries $q_1 q_2 \cdots q_{i-1}$ as context for $q_i$.
- The clustering algorithms do not scale well to very large data: 1.8 billion queries (151 million unique), 2.6 billion URL clicks (114 million unique URLs).

An example: "apple" on its own versus "steve jobs" followed by "apple". What is the user's search intent?
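Approach II can be made concrete with a minimal Python sketch. It assumes each session is simply a list of query strings; the toy sessions and the function names `count_adjacent_pairs` and `suggest_by_adjacency` are illustrative and do not come from the paper.

```python
from collections import Counter, defaultdict

def count_adjacent_pairs(sessions):
    """Count how often query b immediately follows query a in a session."""
    followers = defaultdict(Counter)
    for session in sessions:
        for a, b in zip(session, session[1:]):
            followers[a][b] += 1
    return followers

def suggest_by_adjacency(followers, query, k=2):
    """Rank candidate next queries by pair frequency, ignoring earlier context."""
    return [q for q, _ in followers[query].most_common(k)]

# Toy sessions, made up for illustration.
sessions = [
    ["apple", "apple pie recipe"],
    ["apple", "apple pie recipe"],
    ["steve jobs", "apple", "iphone"],
]
followers = count_adjacent_pairs(sessions)
suggestions = suggest_by_adjacency(followers, "apple")
print(suggestions)  # ['apple pie recipe', 'iphone']
```

The same ranked list comes back whether or not "steve jobs" preceded "apple", which is the lack of context-awareness noted in the limitations.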
Proposed Method Framework

Key steps:
- Clustering queries
- Capture the context: concept sequence
- Quickly find the queries that many users ask in that context: Concept Sequence Suffix Tree

Mining Query Concepts

An example of click-through bipartite data from a search log: [figure]

Each query $q_i$ is represented as an L2-normalized vector over the URLs, where $w_{ij}$ is the weight of edge $e_{ij}$:

$$q_i[j] = \begin{cases} \mathrm{norm}(w_{ij}), & \text{if edge } e_{ij} \text{ exists} \\ 0, & \text{otherwise} \end{cases} \qquad \text{with} \quad \mathrm{norm}(w_{ij}) = \frac{w_{ij}}{\sqrt{\sum_{\forall e_{ik}} w_{ik}^2}}$$

$$\mathrm{distance}(q_i, q_j) = \sqrt{\sum_{u_k \in U} \left(q_i[k] - q_j[k]\right)^2}$$

Key challenges in clustering queries:
- The search log click-through bipartite can be huge, e.g., 151 million unique queries
- The number of clusters is unknown
- Extremely high dimensionality of the query vectors: 114 million unique URLs
- Search logs grow dynamically

Existing query clustering algorithms:
- Hierarchical agglomerative method
- DBSCAN method (Wen, WWW'01)
- K-means, etc.

Proposed clustering method, for each query $q$:
- Step 1: find the closest cluster $C$ to $q$ among the clusters obtained so far
- Step 2: compute the diameter of the cluster $C \cup \{q\}$
- Step 3: 1) if diameter $\le D_{max}$, $q$ is assigned to $C$, i.e., $C \leftarrow C \cup \{q\}$; 2) otherwise, create a new cluster containing only $q$

The method is quite efficient:
- It needs only one scan of the queries
- It can run efficiently on a PC with 2 GB of main memory

Tricks for improving algorithm efficiency:
- A dimension array data structure used in Step 1 (sparse data)
- Prune edges of low weight

Concept Sequence Suffix Tree

Extract query session data:
- Collect each individual user's behavior (query/click) data
- Segment it into sessions (time interval > 30 minutes)
- Discard the click event data

Fig: An example of search log data

Concept sequence suffix tree:
- A structure used to efficiently find the queries that many users ask in a given context (concept sequence)

Fig: An example

Algorithm to build the concept sequence suffix tree:
- 1) Map training session data $qs = q_1 q_2 \cdots q_i$ to $c_1 c_2 \cdots c_j$, $j \le i$
- 2) Enumerate the subsequences of $c_1 c_2 \cdots c_j$: $c_i c_{i+1} \cdots c_l$, $i \ge 1$, $l \le j$ (distributed, map-reduce)
- 3) Get all frequent concept subsequences $cs$
- 4) Organize these $cs$ into the concept sequence suffix tree

Algorithm for organizing $cs$ into the concept sequence suffix tree:
- 1) Start from the root node (empty), and scan through all frequent concept subsequences $cs$
- 2) For each $cs = c_1 c_2 \cdots c_l$, first find the node $c_r$ corresponding to $cs' = c_1 c_2 \cdots c_{l-1}$; if $c_r$ doesn't exist, create it
- 3) Update the list of candidate concepts of $cs'$ if $c_l$ is among the top $K$ candidates so far ($K$ is a specified threshold, e.g., $K = 5$)
- 4) The representative queries of the top $K$ candidate concepts are the candidate suggestions for sequence $cs'$

Review an example of Concept Sequence Suffix Tree: $cs = c_1 c_2 \cdots c_l$, $cs' = c_1 c_2 \cdots c_{l-1}$

Online query suggestion algorithm, for a query sequence $q_1 q_2 \cdots q_l$:
- Map it to the concept sequence $c_1 c_2 \cdots c_l$: if $q_i$ is a new query, stop mapping and return the concept sequence corresponding to $q_{i+1} q_{i+2} \cdots q_l$
- Search the tree to find the longest matched subsequence of the form $c_j c_{j+1} \cdots c_l$, $j \ge 1$
- Use the candidate suggestions for $c_j c_{j+1} \cdots c_l$, $j \ge 1$, as the query suggestions for $q_1 q_2 \cdots q_l$

Review an example of Concept Sequence Suffix Tree: $qs = q_1 q_2 \cdots q_i$, $cs = c_j \cdots c_i$, $1 \le j \le i$

Experimental Evaluation

Training data:
- A commercial search engine search log (Bing) in the US
- 1.8 billion queries (151 million unique), 2.6 billion URL clicks (114 million unique URLs), 840 million sessions

Baseline algorithms:
- Adjacency: given $q_i$, rank $q_j$ based on the frequency of $q_i q_j$
- N-Gram: given $qs = q_1 q_2 \cdots q_i$, rank $q_j$ based on the frequency of $q_1 q_2 \cdots q_i q_j$

Test set data:
- Test-0: 1000 randomly selected single-query case sessions
- Test-1: 1000 randomly selected multi-query case sessions

Experimental Results

Coverage of suggestion:
Fig: The coverage of the three methods on (a) Test-0 and (b) Test-1

Quality of suggestion (relevance grading collected from 10 judges):
Fig: The quality of the three methods on (a) Test-0 and (b) Test-1

Summary

Three things to know:
- Some basics about query suggestion using search logs
- The proposed efficient query clustering algorithm for search log click-through bipartite data
- The proposed efficient context-aware query suggestion method using the concept sequence suffix tree

Hints: "concept"-level N-gram with varied length N + a structure for efficient search

Thank You!
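The one-pass clustering procedure from the Mining Query Concepts slides can be sketched in Python as follows. This is only a sketch under stated assumptions: the slides do not define the cluster diameter or how the "closest" cluster is measured, so the code assumes a root-mean-square pairwise distance for the diameter and distance to the cluster centroid for closeness. It also scans all clusters by brute force instead of using the dimension array, and the click counts and URL names are made-up toy data.

```python
import math

def l2_normalize(clicks):
    """Turn {url: click_count} into an L2-normalized sparse vector."""
    norm = math.sqrt(sum(w * w for w in clicks.values()))
    return {u: w / norm for u, w in clicks.items()}

def distance(a, b):
    """Euclidean distance between two sparse vectors stored as dicts."""
    urls = set(a) | set(b)
    return math.sqrt(sum((a.get(u, 0.0) - b.get(u, 0.0)) ** 2 for u in urls))

def centroid(members):
    """Mean of the member vectors (assumed cluster representative)."""
    total = {}
    for v in members:
        for u, w in v.items():
            total[u] = total.get(u, 0.0) + w
    return {u: w / len(members) for u, w in total.items()}

def diameter(members):
    """Assumed diameter: root-mean-square distance over ordered member pairs."""
    n = len(members)
    if n < 2:
        return 0.0
    total = sum(distance(a, b) ** 2 for a in members for b in members)
    return math.sqrt(total / (n * (n - 1)))

def cluster_queries(click_data, d_max=1.0):
    """Scan the queries once and return clusters as lists of query strings."""
    clusters = []  # each cluster: {"queries": [...], "vectors": [...]}
    for query, clicks in click_data.items():
        vec = l2_normalize(clicks)
        # Step 1: closest existing cluster (brute force in this sketch).
        best = min(clusters,
                   key=lambda c: distance(vec, centroid(c["vectors"])),
                   default=None)
        # Steps 2-3: join it only if the enlarged cluster stays within d_max.
        if best is not None and diameter(best["vectors"] + [vec]) <= d_max:
            best["queries"].append(query)
            best["vectors"].append(vec)
        else:
            clusters.append({"queries": [query], "vectors": [vec]})
    return [c["queries"] for c in clusters]

# Toy click-through data: query -> {url: click count}.
click_data = {
    "apple store": {"url_a": 8, "url_b": 6},
    "apple shop": {"url_a": 6, "url_b": 8},
    "apple pie": {"url_c": 5},
}
concepts = cluster_queries(click_data, d_max=1.0)
print(concepts)  # [['apple store', 'apple shop'], ['apple pie']]
```

Because all vectors are L2-normalized, two queries with no clicked URL in common are at distance $\sqrt{2}$, so a threshold around 1 keeps them in separate clusters in this toy example.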
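The offline build and online lookup from the Concept Sequence Suffix Tree slides can be illustrated with the sketch below. Several things here are assumptions rather than the authors' implementation: a plain dictionary keyed by context tuples stands in for the suffix tree, consecutive queries of the same concept are merged when mapping sessions, subsequences are counted in memory rather than with map-reduce, and the names `build_index`, `suggest`, `min_count`, and the toy concepts are hypothetical.

```python
from collections import Counter, defaultdict

def to_concepts(queries, query_to_concept):
    """Map queries to concept ids, merging consecutive repeats of a concept."""
    seq = []
    for q in queries:
        c = query_to_concept[q]
        if not seq or seq[-1] != c:
            seq.append(c)
    return seq

def build_index(sessions, query_to_concept, min_count=2, top_k=5):
    """Map each frequent context cs' to its top-K next concepts."""
    counts = Counter()
    for session in sessions:
        cs = to_concepts(session, query_to_concept)
        # Enumerate contiguous subsequences with at least two concepts.
        for start in range(len(cs)):
            for end in range(start + 2, len(cs) + 1):
                counts[tuple(cs[start:end])] += 1
    candidates = defaultdict(Counter)
    for sub, n in counts.items():
        if n >= min_count:                     # keep frequent subsequences only
            candidates[sub[:-1]][sub[-1]] = n  # context cs' -> candidate c_l
    return {ctx: [c for c, _ in cand.most_common(top_k)]
            for ctx, cand in candidates.items()}

def suggest(queries, index, query_to_concept, representative):
    """Suggest queries for the longest known suffix of the user's context."""
    known = []
    for q in queries:
        if q in query_to_concept:
            known.append(q)
        else:
            known = []  # a new query cuts off everything up to and including it
    cs = to_concepts(known, query_to_concept)
    for j in range(len(cs)):                   # longest suffix first
        ctx = tuple(cs[j:])
        if ctx in index:
            return [representative[c] for c in index[ctx]]
    return []

# Toy concepts and sessions, made up for illustration.
query_to_concept = {
    "steve jobs": "C_jobs", "apple": "C_apple", "apple inc": "C_apple",
    "iphone": "C_iphone", "apple pie recipe": "C_pie",
}
representative = {
    "C_jobs": "steve jobs", "C_apple": "apple",
    "C_iphone": "iphone", "C_pie": "apple pie recipe",
}
sessions = [
    ["steve jobs", "apple", "iphone"],
    ["steve jobs", "apple inc", "iphone"],
    ["apple", "apple pie recipe"],
    ["apple", "apple pie recipe"],
    ["apple", "apple pie recipe"],
]
index = build_index(sessions, query_to_concept)
print(suggest(["apple"], index, query_to_concept, representative))
print(suggest(["steve jobs", "apple"], index, query_to_concept, representative))
```

With this toy data, "apple" alone yields the pie-related suggestion first, while "steve jobs" followed by "apple" yields "iphone", which is the context-aware behavior the deck motivates. A real suffix tree would share storage between contexts that end in the same concepts; the flat dictionary only mimics the longest-suffix lookup.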
