Context-aware Query Suggestion by
Mining Click-through and Session Data
Authors: H. Cao et al.
KDD '08
Presented by Shize Su
Framework of the Proposed Method
Mining Query Concepts
Concept Sequence Suffix Tree
Experimental Evaluation
What is query suggestion in a search engine?
Guess the user's search intent from the current query $q_i$ (or the query sequence $q_1 q_2 \ldots q_i$)
Suggest queries $q_j, q_k, \ldots$ that better describe the user's information need
Why is query suggestion important?
Is it easy to issue an appropriate query? No!
A "bottleneck issue" of search engine usability
(Google, Yahoo, Bing, Baidu, etc)
Major existing approaches (with search log data) :
Approach I: cluster queries using clicked-URL data to find similar queries: are $q_i$ and $q_j$ similar?
Approach II: mine pairs of queries that are adjacent or co-occur in the same query session: is $q_i q_j$ frequent?
Fig1: An example of search log data
Key Limitation:
None of them is context-aware: they do not consider the immediately preceding queries $q_1 q_2 \ldots q_{i-1}$ as context for $q_i$
The clustering algorithms cannot scale up to very large data, e.g.,
1.8 billion queries (151 million unique),
2.6 billion URL clicks (114 million unique)
An example:
"steve jobs" → "apple"
The preceding query "steve jobs" is context for inferring the user's search intent behind "apple"
Proposed Method Framework
Key steps:
Capture the context: concept sequence
Quickly find the queries that many users ask in that context
Concept Sequence Suffix Tree
Mining Query Concepts
An example of click-through bipartite data from a search log:
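For each query $q_i$: an L2-normalized vector over the URLs, where $w_{ij}$ is the weight (click count) of edge $e_{ij}$ between query $q_i$ and URL $u_j$:
$$\vec{q_i}[j] = \begin{cases} \mathrm{norm}(w_{ij}), & \text{if edge } e_{ij} \text{ exists} \\ 0, & \text{otherwise} \end{cases} \qquad \text{with } \mathrm{norm}(w_{ij}) = \frac{w_{ij}}{\sqrt{\sum_{\forall e_{ik}} w_{ik}^2}}$$
Distance between two queries:
$$\mathrm{distance}(q_i, q_j) = \sqrt{\sum_{u_k \in U} \left(\vec{q_i}[k] - \vec{q_j}[k]\right)^2}$$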
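The vector and distance definitions above can be sketched in a few lines. This is only a minimal illustration: the sparse-dictionary representation, the function names, and the toy click counts are assumptions for the example, not details taken from the paper.

```python
import math

def normalize(clicks):
    """Turn {url: click_count} into an L2-normalized sparse vector."""
    norm = math.sqrt(sum(w * w for w in clicks.values()))
    return {url: w / norm for url, w in clicks.items()}

def distance(qi, qj):
    """Euclidean distance between two sparse vectors (a missing URL counts as 0)."""
    urls = set(qi) | set(qj)
    return math.sqrt(sum((qi.get(u, 0.0) - qj.get(u, 0.0)) ** 2 for u in urls))

# Hypothetical toy click counts, not taken from the paper's log.
q1 = normalize({"apple.com": 3, "store.apple.com": 4})
q2 = normalize({"apple.com": 3, "store.apple.com": 4})
q3 = normalize({"fruit.example": 5})
```

Queries with the same click distribution end up at distance 0, while queries that share no clicked URL end up at the maximum distance of sqrt(2) for normalized non-negative vectors.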
Mining Query Concepts
Key challenges to cluster queries:
Search log click-through bipartite could be huge: e.g.,
151 million unique queries
Number of clusters is unknown
Extremely high dimensionality of query vector: 114
million unique URLs
Search logs increase dynamically
Existing query clustering algorithms:
Hierarchical agglomerative method
DBSCAN method (Wen, WWW’01)
K-means, etc.
Mining Query Concepts
Proposed clustering method:
Mining Query Concepts
For each query $q$:
Step 1: find the cluster $C$ closest to $q$ among the clusters obtained so far
Step 2: compute the diameter of the cluster $C \cup \{q\}$
Step 3: 1) if diameter $\le D_{max}$, $q$ is assigned to $C$, i.e., $C = C \cup \{q\}$
2) otherwise, create a new cluster containing only $q$
Quite efficient:
Only one scan of the queries is needed
Can run efficiently on a PC with 2 GB of main memory
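The three steps can be sketched as a single pass over the query vectors. The slides do not spell out how "closest" or "diameter" are measured, so this sketch assumes distance to the cluster centroid for Step 1 and root-mean-square pairwise distance for the diameter; both choices, and all names, are assumptions for illustration.

```python
import math
from itertools import combinations

def dist(a, b):
    """Euclidean distance between two sparse vectors stored as dicts."""
    keys = set(a) | set(b)
    return math.sqrt(sum((a.get(k, 0.0) - b.get(k, 0.0)) ** 2 for k in keys))

def centroid(members):
    """Component-wise mean of the member vectors."""
    total = {}
    for vec in members:
        for k, v in vec.items():
            total[k] = total.get(k, 0.0) + v
    return {k: v / len(members) for k, v in total.items()}

def diameter(members):
    """Assumed measure: root-mean-square distance over all member pairs."""
    pairs = list(combinations(members, 2))
    if not pairs:
        return 0.0
    return math.sqrt(sum(dist(a, b) ** 2 for a, b in pairs) / len(pairs))

def cluster_queries(vectors, d_max):
    """One scan over the query vectors; returns a list of clusters (lists of vectors)."""
    clusters = []
    for q in vectors:
        # Step 1: closest existing cluster (by centroid distance, an assumption).
        best = min(clusters, key=lambda c: dist(q, centroid(c)), default=None)
        # Steps 2-3: join it only if the enlarged cluster stays within d_max.
        if best is not None and diameter(best + [q]) <= d_max:
            best.append(q)
        else:
            clusters.append([q])
    return clusters

# Hypothetical normalized query vectors.
a = {"u1": 1.0}
b = {"u1": 0.8, "u2": 0.6}
c = {"u3": 1.0}
clusters = cluster_queries([a, b, c], d_max=1.0)
```

Here `a` and `b` share most of their click mass and fall into one cluster, while `c` shares no URL with them and starts a new one. The sketch recomputes centroids and diameters from scratch for readability; a version meant for large logs would maintain them incrementally.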
Mining Query Concepts
Tricks for algorithm efficiency improvement:
 A dimension array data structure used in step 1 (sparse data)
 Prune edges of low weights
$$\mathrm{distance}(q_i, q_j) = \sqrt{\sum_{u_k \in U} \left(\vec{q_i}[k] - \vec{q_j}[k]\right)^2}$$
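A rough sketch of both tricks follows. It assumes the dimension array maps each URL to the clusters that have a non-zero value in that dimension, so Step 1 only has to compare a query against clusters it shares a URL with; the class name, method names, and the pruning threshold are hypothetical.

```python
from collections import defaultdict

def prune_edges(clicks, min_weight):
    """Drop click-through edges whose weight is below min_weight (hypothetical threshold)."""
    return {url: w for url, w in clicks.items() if w >= min_weight}

class DimensionArray:
    """Maps each URL (dimension) to the ids of clusters with a non-zero value there."""

    def __init__(self):
        self.clusters_by_url = defaultdict(set)

    def add(self, cluster_id, query_vector):
        """Register every non-zero dimension of a query that joined cluster_id."""
        for url in query_vector:
            self.clusters_by_url[url].add(cluster_id)

    def candidates(self, query_vector):
        """Clusters sharing at least one URL with the query."""
        found = set()
        for url in query_vector:
            found |= self.clusters_by_url.get(url, set())
        return found

index = DimensionArray()
index.add(0, {"u1": 1.0})
index.add(1, {"u2": 0.6, "u3": 0.8})
```

Because query vectors are sparse, the candidate set is usually a small fraction of all clusters, which is what makes the closest-cluster search cheap.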
Concept Sequence Suffix Tree
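Extract query session data:
Collect each individual user's behavior (query/click) data
Segment it into sessions (a new session starts when the time interval between consecutive events exceeds 30 minutes)
Discard the click events, keeping only the query sequence of each session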
Fig: An example of search log data
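Session extraction can be sketched as below. The event tuple layout and function name are assumptions for illustration; the sketch measures the 30-minute gap between any two consecutive events of one user and only afterwards drops the clicks.

```python
from datetime import datetime, timedelta

SESSION_GAP = timedelta(minutes=30)

def segment_sessions(events):
    """events: list of (timestamp, kind, text) for ONE user, sorted by time;
    kind is "query" or "click". Returns a list of query sequences."""
    sessions, current, last_time = [], [], None
    for ts, kind, text in events:
        # A gap longer than 30 minutes closes the current session.
        if last_time is not None and ts - last_time > SESSION_GAP:
            if current:
                sessions.append(current)
            current = []
        if kind == "query":  # click events are discarded
            current.append(text)
        last_time = ts
    if current:
        sessions.append(current)
    return sessions

# Hypothetical log entries for a single user.
t0 = datetime(2008, 1, 1, 9, 0)
events = [
    (t0, "query", "steve jobs"),
    (t0 + timedelta(minutes=2), "click", "apple.com"),
    (t0 + timedelta(minutes=5), "query", "apple"),
    (t0 + timedelta(minutes=50), "query", "weather"),
]
sessions = segment_sessions(events)
```

The 45-minute pause before "weather" splits the log into two sessions.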
Concept Sequence Suffix Tree
Concept sequence suffix tree
A structure used to efficiently find (search) the queries that
many users ask in that context (concept sequence)
Fig: An example
Concept Sequence Suffix Tree
Algorithm to build the concept sequence suffix tree:
1) Map each training session $qs = q_1 q_2 \ldots q_i$ to a concept sequence $c_1 c_2 \ldots c_j$
2) Enumerate the subsequences of $c_1 c_2 \ldots c_j$: $c_i c_{i+1} \ldots c_l$, with $i \ge 1$, $l \le j$ (distributed, MapReduce)
3) Get all frequent concept subsequences $cs$
4) Organize these $cs$ into a concept sequence suffix tree
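Steps 1 to 3 can be sketched on a single machine as follows. This stands in for the distributed job and makes two assumptions that the slides do not state: consecutive queries of the same concept are merged into one concept, and a subsequence is "frequent" when its count reaches a hypothetical `min_count` threshold.

```python
from collections import Counter

def to_concepts(session, concept_of):
    """Map queries to concept ids, merging consecutive repeats of the same concept."""
    concepts = []
    for q in session:
        c = concept_of[q]
        if not concepts or concepts[-1] != c:
            concepts.append(c)
    return tuple(concepts)

def contiguous_subsequences(cs):
    """All c_i ... c_l with 1 <= i <= l <= j, as tuples."""
    n = len(cs)
    return [cs[i:l] for i in range(n) for l in range(i + 1, n + 1)]

def frequent_subsequences(sessions, concept_of, min_count):
    """Count every contiguous concept subsequence and keep the frequent ones."""
    counts = Counter()
    for session in sessions:
        counts.update(contiguous_subsequences(to_concepts(session, concept_of)))
    return {cs: n for cs, n in counts.items() if n >= min_count}

# Hypothetical query-to-concept mapping and training sessions.
concept_of = {"steve jobs": "C1", "apple": "C2", "apple store": "C2", "ipod": "C3"}
sessions = [
    ["steve jobs", "apple", "apple store"],
    ["steve jobs", "apple", "ipod"],
]
frequent = frequent_subsequences(sessions, concept_of, min_count=2)
```

In a MapReduce setting, the enumeration would be the map phase (emit each subsequence with count 1) and the counting the reduce phase.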
Concept Sequence Suffix Tree
Algorithm for organizing cs into concept sequence suffix tree:
Concept Sequence Suffix Tree
Organize $cs$ into the concept sequence suffix tree:
1) start from the root node (empty), and scan through all frequent concept subsequences $cs$
2) for each $cs = c_1 c_2 \ldots c_l$, first find the node cr corresponding to $cs' = c_1 c_2 \ldots c_{l-1}$; if cr doesn't exist, create it
3) update the list of candidate concepts of $cs'$ if $c_l$ is among the top K (a specified threshold, e.g., K=5) candidates so far
4) the representative queries of the top K candidate concepts are the candidate suggestions for sequence $cs'$
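The organizing steps can be sketched with nested dictionaries. This is an illustrative layout, not the authors' implementation: each node is assumed to hold its children and a candidate table, the context is walked from its most recent concept backwards so that suffixes share a path, and length-1 sequences are skipped because they carry no context.

```python
import heapq

def new_node():
    """A tree node: children keyed by concept, plus next-concept candidates."""
    return {"children": {}, "candidates": {}}

def build_tree(frequent, k=5):
    """frequent: {tuple_of_concepts: count}. Returns the root node."""
    root = new_node()
    for cs, count in frequent.items():
        context, nxt = cs[:-1], cs[-1]  # cs' = c_1..c_{l-1}, candidate c_l
        if not context:
            continue  # assumption: sequences without context are skipped
        node = root
        for concept in reversed(context):  # most recent concept sits nearest the root
            node = node["children"].setdefault(concept, new_node())
        node["candidates"][nxt] = count
        if len(node["candidates"]) > k:  # keep only the top-K candidates so far
            top = heapq.nlargest(k, node["candidates"].items(), key=lambda kv: kv[1])
            node["candidates"] = dict(top)
    return root

# Hypothetical frequent concept subsequences with their counts.
frequent = {
    ("C1",): 5, ("C2",): 6,
    ("C1", "C2"): 4, ("C2", "C3"): 3, ("C2", "C4"): 2,
    ("C1", "C2", "C3"): 2,
}
tree = build_tree(frequent, k=1)
```

With `k=1`, the node for context C2 keeps only C3 (count 3) and drops C4 (count 2); the node for context C1 C2 is reached by following C2 and then C1 from the root.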
Concept Sequence Suffix Tree
Review an example of Concept Sequence Suffix Tree:
$cs = c_1 c_2 \ldots c_l$,
$cs' = c_1 c_2 \ldots c_{l-1}$
Concept Sequence Suffix Tree
Online query suggestion algorithm:
Concept Sequence Suffix Tree
For a query sequence $q_1 q_2 \ldots q_l$:
Map it to a concept sequence $c_1 c_2 \ldots c_l$: if $q_i$ is a new query, stop mapping and return the concept sequence corresponding to $q_{i+1} q_{i+2} \ldots q_l$;
Search the tree to find the longest matched subsequence of the form $c_j c_{j+1} \ldots c_l$, $j \ge 1$
Use the candidate suggestions for $c_j c_{j+1} \ldots c_l$, $j \ge 1$, as the query suggestions for $q_1 q_2 \ldots q_l$
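The online lookup can be sketched as follows. The nested-dictionary tree layout, the `representative` mapping from concept to query, and the toy data are all assumptions for illustration; the tree is keyed from the most recent concept backwards.

```python
def suggest(queries, concept_of, tree, representative):
    """Return suggested queries for the sequence q_1 ... q_l, best first."""
    # Map queries to concepts from the most recent backwards; stop at the first new query.
    context = []
    for q in reversed(queries):
        if q not in concept_of:
            break
        c = concept_of[q]
        if not context or context[-1] != c:  # assumption: merge consecutive repeats
            context.append(c)                # stored most-recent-first
    # Walk down the tree as far as the context matches (longest matched suffix).
    node, matched = tree, None
    for c in context:
        node = node["children"].get(c)
        if node is None:
            break
        if node["candidates"]:
            matched = node
    if matched is None:
        return []
    ranked = sorted(matched["candidates"].items(), key=lambda kv: -kv[1])
    return [representative[c] for c, _ in ranked]

# Hypothetical tree, concept mapping, and representative queries.
tree = {
    "candidates": {},
    "children": {
        "C2": {
            "candidates": {"C4": 9, "C3": 2},
            "children": {"C1": {"candidates": {"C3": 4}, "children": {}}},
        },
    },
}
concept_of = {"steve jobs": "C1", "apple": "C2"}
representative = {"C3": "ipod", "C4": "apple pie"}
```

For example, `suggest(["apple"], ...)` ranks "apple pie" first, whereas `suggest(["steve jobs", "apple"], ...)` matches the longer context and returns only "ipod". If the most recent query is unknown, no context can be mapped and the sketch returns an empty list.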
Concept Sequence Suffix Tree
Review an example of Concept Sequence Suffix Tree:
$qs = q_1 q_2 \ldots q_i$
$cs = c_j c_{j+1} \ldots c_i$, $1 \le j \le i$
Experimental Evaluation
Training Data:
Search log of a commercial search engine (Bing) in the US
1.8 billion queries (151 million unique), 2.6 billion URL
clicks (114 million unique), 840 million sessions
Baseline algorithms:
Adjacency: given $q_i$, rank $q_j$ based on the frequency of $q_i q_j$
N-Gram: given $qs = q_1 q_2 \ldots q_i$, rank $q_j$ based on the frequency of $q_1 q_2 \ldots q_i q_j$
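Both baselines amount to counting which query follows a given context in the training sessions. The sketch below is a simplified reading of the two definitions; the function names and toy sessions are hypothetical, and the N-Gram counter is assumed to count every contiguous occurrence of the context.

```python
from collections import Counter, defaultdict

def train_adjacency(sessions):
    """Count q_j following q_i for every adjacent pair."""
    follow = defaultdict(Counter)
    for s in sessions:
        for a, b in zip(s, s[1:]):
            follow[(a,)][b] += 1
    return follow

def train_ngram(sessions):
    """Count q_j following every contiguous query sequence."""
    follow = defaultdict(Counter)
    for s in sessions:
        for start in range(len(s)):
            for end in range(start + 1, len(s)):
                follow[tuple(s[start:end])][s[end]] += 1
    return follow

def rank(follow, context, k=5):
    """Top-k next queries for the given context, most frequent first."""
    return [q for q, _ in follow.get(tuple(context), Counter()).most_common(k)]

# Hypothetical training sessions.
sessions = [
    ["a", "b", "d"], ["a", "b", "d"],
    ["x", "b", "c"], ["y", "b", "c"], ["z", "b", "c"],
]
adjacency = train_adjacency(sessions)
ngram = train_ngram(sessions)
```

In this toy data, Adjacency suggests "c" after "b" regardless of what came before, N-Gram suggests "d" after "a b", and N-Gram has nothing to offer for a context it never saw, such as "q b".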
Test set data:
Test-0: 1000 randomly selected single-query case sessions
Test-1: 1000 randomly selected multi-query case sessions
Experimental Results
Coverage of suggestion:
Fig: The coverage of the three methods
on (a) Test-0 and (b) Test-1
Experimental Results
Quality of suggestion: (collect relevance grading from 10
Fig: The quality of the three methods
on (a) Test-0 and (b) Test-1
Three things to know:
Some basics about query suggestion using search log
The proposed efficient query clustering algorithm for search-log click-through bipartite data
The proposed efficient context-aware query suggestion
method using the concept sequence suffix tree
Hints: “concept” level N-gram
with varied length N
A structure for efficient search
Thank You!