Context-aware Query Suggestion by
Mining Click-through and Session Data
Authors: H. Cao et al.
KDD 08
Presented by Shize Su
1
Outline

Introduction

Framework of the Proposed Method

Mining Query Concepts

Concept Sequence Suffix Tree

Experimental Evaluation

Summary
2
Introduction
What is query suggestion in a search engine?
 Guess the user's search intent: given the user query $q_i$ (or the query sequence $q_1 q_2 \dots q_i$), suggest queries $q_j$, $q_k$ that better describe the user's information need.
Why is query suggestion important?

Is it easy to issue an appropriate query? No!

A “bottleneck issue” of search engine usability
(Google, Yahoo, Bing, Baidu, etc.)
3
Introduction
Major existing approaches (using search log data):

Approach I: cluster queries using clicked-URL data to find similar queries (are $q_i$ and $q_j$ similar?)

Approach II: mine pairs of queries that are adjacent or co-occur in the same query session (is $q_i q_j$ frequent?)
Fig1: An example of search log data
4
Introduction
Key limitations:

None of them is context-aware: they do not consider the immediately preceding queries $q_1 q_2 \dots q_{i-1}$ as context for $q_i$.

The clustering algorithms do not scale well to very large data, e.g.,
1.8 billion queries (151 million unique),
2.6 billion URL clicks (114 million unique)
An example:

“apple”

“steve jobs” followed by “apple”
What is the user’s search intent in each case?
5
Proposed Method Framework
Key steps:

Cluster queries into concepts

Capture the context: concept sequence

Quickly find the queries that many users ask in that context: concept sequence suffix tree
6
Mining Query Concepts
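An example of click-through bipartite data from a search log:
For each query $q_i$, build an $L_2$-normalized vector:

$$
q_i[j] =
\begin{cases}
\mathrm{norm}(w_{ij}), & \text{if edge } e_{ij} \text{ exists},\\
0, & \text{otherwise},
\end{cases}
\qquad \text{with} \qquad
\mathrm{norm}(w_{ij}) = \frac{w_{ij}}{\sqrt{\sum_{\forall e_{ik}} w_{ik}^2}}
$$

$$
\mathrm{distance}(q_i, q_j) = \sqrt{\sum_{u_k \in U} \left(q_i[k] - q_j[k]\right)^2}
$$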
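The vector construction and distance above can be sketched in a few lines of Python. This is only an illustration: it assumes the edge weight $w_{ij}$ is the number of clicks on URL $u_j$ for query $q_i$, stores each vector as a sparse dictionary, and uses made-up click counts and hypothetical function names (`normalize`, `distance`).

```python
import math

def normalize(click_counts):
    """Turn {url: weight} into an L2-normalized sparse vector.

    Assumes at least one non-zero weight, which holds for any query
    that appears in the click-through bipartite.
    """
    length = math.sqrt(sum(w * w for w in click_counts.values()))
    return {url: w / length for url, w in click_counts.items()}

def distance(qi, qj):
    """Euclidean distance between two sparse vectors; a missing URL counts as 0."""
    urls = set(qi) | set(qj)
    return math.sqrt(sum((qi.get(u, 0.0) - qj.get(u, 0.0)) ** 2 for u in urls))

# Hypothetical click counts, not taken from the log described in the slides.
q1 = normalize({"apple.com": 3, "store.apple.com": 4})
q2 = normalize({"apple.com": 4, "store.apple.com": 3})
q3 = normalize({"en.wikipedia.org/wiki/Apple": 5})

print(distance(q1, q2))  # small: the two queries share clicked URLs
print(distance(q1, q3))  # large: no clicked URL in common
```

Because every vector has unit length, two queries with no clicked URL in common are always at distance $\sqrt{2}$, which is the largest value this distance can take.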
7
Mining Query Concepts
Key challenges to cluster queries:

Search log click-through bipartite could be huge: e.g.,
151 million unique queries

Number of clusters is unknown

Extremely high dimensionality of query vector: 114
million unique URLs

Search logs increase dynamically
Existing query clustering algorithms:

Hierarchical agglomerative method

DBSCAN method (Wen, WWW’01)

K-means, etc.
8
Mining Query Concepts
Proposed clustering method:
9
Mining Query Concepts
For each query $q$:
 Step 1: find the cluster $C$ closest to $q$ among the clusters obtained so far
 Step 2: compute the diameter of the cluster $C \cup \{q\}$

Step 3: 1) if diameter $\le D_{max}$, assign $q$ to $C$: $C = C \cup \{q\}$
2) otherwise, create a new cluster containing only $q$
Quite efficient:
 Only one scan of the queries is needed

Can run efficiently on a PC with 2 GB of main memory
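The three steps can be sketched as a single pass over the query vectors. The slides do not define "closest" or "diameter", so this sketch makes two assumptions: the closest cluster is the one whose centroid (plain mean of its members) is nearest to $q$, and the diameter is the square root of the mean squared pairwise distance between members. It recomputes both from scratch, so it shows the logic but not the efficiency of the proposed method.

```python
import math
from itertools import combinations

def distance(a, b):
    """Euclidean distance between two sparse vectors stored as dicts."""
    urls = set(a) | set(b)
    return math.sqrt(sum((a.get(u, 0.0) - b.get(u, 0.0)) ** 2 for u in urls))

def centroid(members):
    """Mean vector of a non-empty cluster (assumed notion of cluster center)."""
    urls = set().union(*members)
    return {u: sum(m.get(u, 0.0) for m in members) / len(members) for u in urls}

def diameter(members):
    """Assumed diameter: root of the mean squared pairwise distance."""
    if len(members) < 2:
        return 0.0
    pairs = list(combinations(members, 2))
    return math.sqrt(sum(distance(a, b) ** 2 for a, b in pairs) / len(pairs))

def cluster_queries(vectors, d_max):
    """One scan over the queries; each cluster is a list of member vectors."""
    clusters = []
    for q in vectors:
        # Step 1: closest existing cluster (None if there are no clusters yet).
        closest = min(clusters, key=lambda c: distance(q, centroid(c)), default=None)
        # Steps 2 and 3: join it only if the enlarged cluster stays tight enough.
        if closest is not None and diameter(closest + [q]) <= d_max:
            closest.append(q)
        else:
            clusters.append([q])
    return clusters

# Hypothetical unit-length query vectors.
a = {"u1": 1.0}
b = {"u1": 0.8, "u2": 0.6}
c = {"u3": 1.0}
clusters = cluster_queries([a, b, c], d_max=1.0)
print([len(members) for members in clusters])
```

Since the result depends on the order in which queries are scanned, a different ordering of the same queries may produce different clusters.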
10
Mining Query Concepts
Tricks for improving algorithm efficiency:
 A dimension array data structure is used in Step 1 (exploits the sparsity of the data)
 Prune edges with low weights

$$
\mathrm{distance}(q_i, q_j) = \sqrt{\sum_{u_k \in U} \left(q_i[k] - q_j[k]\right)^2}
$$
11
Concept Sequence Suffix Tree
Extract query session data:

take each individual user’s behavior (query/click) data

segment it into sessions (a time interval > 30 minutes starts a new session)

discard the click events
Fig: An example of search log data
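The session extraction steps can be sketched as follows. The event format (a time-sorted list of `(timestamp_seconds, kind, text)` tuples for one user) and the function name are assumptions made for illustration; the sketch follows the order on the slide, so the 30-minute gap is measured between consecutive events of any kind and clicks are dropped afterwards.

```python
SESSION_GAP_SECONDS = 30 * 60  # the 30-minute threshold from the slide

def segment_sessions(events):
    """Split one user's time-sorted events into query sessions.

    events: list of (timestamp_seconds, kind, text), kind is "query" or "click".
    Returns a list of sessions, each a list of query strings.
    """
    sessions, current, last_time = [], [], None
    for timestamp, kind, text in events:
        # A long pause between two consecutive events closes the session.
        if last_time is not None and timestamp - last_time > SESSION_GAP_SECONDS:
            if current:
                sessions.append(current)
            current = []
        last_time = timestamp
        # Click events only contribute their timestamp; they are not kept.
        if kind == "query":
            current.append(text)
    if current:
        sessions.append(current)
    return sessions

# Hypothetical event stream for a single user.
events = [
    (0, "query", "apple"),
    (60, "click", "apple.com"),
    (120, "query", "apple ipod"),
    (1921, "query", "nokia n73"),  # 1801 s after the previous event
]
print(segment_sessions(events))
```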
12
Concept Sequence Suffix Tree
Concept sequence suffix tree

A structure used to efficiently find (search) the queries that
many users ask in that context (concept sequence)
Fig: An example
13
Concept Sequence Suffix Tree
Algorithm to build the concept sequence suffix tree:
 1) Map each training session $qs = q_1 q_2 \dots q_i$ to a concept sequence $c_1 c_2 \dots c_j$, $j \le i$
 2) Enumerate the subsequences $c_i c_{i+1} \dots c_l$ of $c_1 c_2 \dots c_j$, with $i \ge 1$, $l \le j$ (distributed, MapReduce)

3) Get all frequent concept subsequences $cs$

4) Organize these $cs$ into the concept sequence suffix tree
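Steps 1 to 3 can be sketched on a single machine as below; the slides run the enumeration in a distributed setting, which this sketch does not attempt. Two assumptions are made: consecutive queries that map to the same concept are recorded once (one way a concept sequence can be shorter than its query sequence), and "frequent" means a count of at least `min_count`, a hypothetical threshold parameter.

```python
from collections import Counter

def to_concept_sequence(session, query_to_concept):
    """Step 1: map a query session to its concept sequence."""
    concepts = []
    for query in session:
        concept = query_to_concept[query]
        # Assumption: consecutive repeats of a concept are recorded once.
        if not concepts or concepts[-1] != concept:
            concepts.append(concept)
    return tuple(concepts)

def frequent_subsequences(sessions, query_to_concept, min_count):
    """Steps 2 and 3: count every contiguous subsequence, keep the frequent ones."""
    counts = Counter()
    for session in sessions:
        cs = to_concept_sequence(session, query_to_concept)
        for start in range(len(cs)):
            for end in range(start + 1, len(cs) + 1):
                counts[cs[start:end]] += 1
    return {seq: n for seq, n in counts.items() if n >= min_count}

# Hypothetical concepts and sessions.
query_to_concept = {"apple": "C1", "apple ipod": "C2", "ipod": "C2", "nokia n73": "C3"}
sessions = [["apple", "apple ipod", "nokia n73"], ["ipod", "nokia n73"], ["apple"]]
print(frequent_subsequences(sessions, query_to_concept, min_count=2))
```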
14
Concept Sequence Suffix Tree
Algorithm for organizing cs into concept sequence suffix tree:
15
Concept Sequence Suffix Tree
Organize $cs$ into the concept sequence suffix tree:
1) start from the root node (empty), and scan through all frequent concept subsequences $cs$
2) for each $cs = c_1 c_2 \dots c_l$, first find the node $cr$ corresponding to $cs' = c_1 c_2 \dots c_{l-1}$; if $cr$ doesn’t exist, create it
3) update the list of candidate concepts of $cs'$ if $c_l$ is among the top $K$ (a specified threshold, e.g., $K = 5$) candidates so far;
4) the representative queries of the top $K$ candidate concepts are the candidate suggestions for sequence $cs'$
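A minimal sketch of this tree-building procedure follows. It assumes the frequent subsequences arrive as a dictionary from concept tuples to counts, and it stores each context $cs'$ along a path that starts at the root with the most recent concept and moves backwards in time, so a node's parent represents a shorter suffix of the same context. The `Node` class and `build_tree` name are illustrative, and mapping candidate concepts to representative queries (step 4) is left out.

```python
from collections import Counter

class Node:
    """One context cs'; the root is the empty context."""
    def __init__(self):
        self.children = {}           # earlier concept -> node for a longer context
        self.candidates = Counter()  # next concept -> frequency, at most K kept

def build_tree(frequent, k=5):
    root = Node()
    for seq, count in frequent.items():
        if len(seq) < 2:
            continue  # a single concept has no preceding context
        context, next_concept = seq[:-1], seq[-1]
        # Find (or create) the node for the context, most recent concept first.
        node = root
        for concept in reversed(context):
            if concept not in node.children:
                node.children[concept] = Node()
            node = node.children[concept]
        # Record the candidate, then keep only the top K seen so far.
        node.candidates[next_concept] = count
        node.candidates = Counter(dict(node.candidates.most_common(k)))
    return root

# Hypothetical frequent concept subsequences with their counts.
frequent = {
    ("C1", "C2"): 5,
    ("C1", "C3"): 3,
    ("C1", "C4"): 1,
    ("C5", "C1", "C3"): 2,
}
tree = build_tree(frequent, k=2)
print(tree.children["C1"].candidates.most_common())
print(tree.children["C1"].children["C5"].candidates.most_common())
```

Walking from the root through the most recent concept first means a lookup can stop as soon as a longer context is missing and still return the candidates of the longest context it reached.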
16
Concept Sequence Suffix Tree
Review an example of the concept sequence suffix tree:
$cs = c_1 c_2 \dots c_l$, $cs' = c_1 c_2 \dots c_{l-1}$
17
Concept Sequence Suffix Tree
Online query suggestion algorithm:
18
Concept Sequence Suffix Tree
For a query sequence $q_1 q_2 \dots q_l$:

Map it to the concept sequence $c_1 c_2 \dots c_l$: if $q_i$ is a new query, stop mapping and return the concept sequence corresponding to $q_{i+1} q_{i+2} \dots q_l$;

Search the tree to find the longest matched subsequence of the form $c_j c_{j+1} \dots c_l$, $j \ge 1$

Use the candidate suggestions for $c_j c_{j+1} \dots c_l$, $j \ge 1$, as the query suggestions for $q_1 q_2 \dots q_l$
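The online lookup can be sketched as below. The tree here is a small hand-built nested dictionary rather than one mined from data, indexed from the most recent concept backwards; the concept names, representative queries, and function names are all hypothetical. Collapsing consecutive queries of the same concept is an assumption of this sketch.

```python
def map_to_concepts(queries, query_to_concept):
    """Map queries to concepts, keeping only those after the last new query."""
    concepts = []
    for query in reversed(queries):
        if query not in query_to_concept:
            break  # new (unseen) query: stop mapping here
        concept = query_to_concept[query]
        if not concepts or concepts[-1] != concept:  # assumed collapsing of repeats
            concepts.append(concept)
    concepts.reverse()
    return concepts

def suggest(queries, query_to_concept, tree, representative):
    """Return candidate suggestions for the longest context found in the tree."""
    node, best = tree, None
    for concept in reversed(map_to_concepts(queries, query_to_concept)):
        node = node["children"].get(concept)
        if node is None:
            break
        best = node  # longest matched suffix c_j ... c_l so far
    if best is None:
        return []
    return [representative[c] for c in best["candidates"]]

# Hand-built illustrative data.
query_to_concept = {"apple": "C1", "apple ipod": "C2"}
representative = {"C3": "ipod nano", "C4": "itunes"}
tree = {"candidates": [], "children": {
    "C2": {"candidates": ["C3"], "children": {
        "C1": {"candidates": ["C4", "C3"], "children": {}}}}}}

print(suggest(["apple", "apple ipod"], query_to_concept, tree, representative))
print(suggest(["unseen query", "apple ipod"], query_to_concept, tree, representative))
```

The second call shows the fallback: the unseen query cuts the context short, so only the last concept is used for the lookup.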
19
Concept Sequence Suffix Tree
Review an example of the concept sequence suffix tree:
$qs = q_1 q_2 \dots q_i$, $cs = c_j \dots c_i$, $1 \le j \le i$
20
Experimental Evaluation
Training Data:

A commercial search engine search log (Bing) in US

1.8 billion queries (151 million unique), 2.6 billion URL clicks (115 million unique), 840 million sessions
Baseline algorithms:


Adjacency: given $q_i$, rank $q_j$ based on the frequency of $q_i q_j$
N-Gram: given $qs = q_1 q_2 \dots q_i$, rank $q_j$ based on the frequency of $q_1 q_2 \dots q_i q_j$
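The two baselines can be sketched with simple counters. This is an illustration under one assumption: the frequency of a query sequence is counted wherever it occurs contiguously in a training session, not only at the start of a session. The function names and the toy sessions are hypothetical.

```python
from collections import Counter, defaultdict

def train_baselines(sessions):
    """Count which query follows each single query and each query sequence."""
    adjacency = defaultdict(Counter)  # q_i -> next-query counts
    ngram = defaultdict(Counter)      # (q_1, ..., q_i) -> next-query counts
    for session in sessions:
        for i in range(1, len(session)):
            adjacency[session[i - 1]][session[i]] += 1
            for start in range(i):
                ngram[tuple(session[start:i])][session[i]] += 1
    return adjacency, ngram

def rank(counts, k=5):
    """Top-k next queries by frequency."""
    return [query for query, _ in counts.most_common(k)]

# Hypothetical training sessions.
sessions = [
    ["a", "b", "c"], ["a", "b", "c"],
    ["x", "b", "d"], ["x", "b", "d"], ["y", "b", "d"],
]
adjacency, ngram = train_baselines(sessions)
print(rank(adjacency["b"]))      # uses only the current query
print(rank(ngram[("a", "b")]))   # uses the whole preceding sequence
```

In this toy data the two baselines disagree after "b": Adjacency prefers "d" overall, while N-Gram prefers "c" once the preceding "a" is taken into account.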
Test set data:

Test-0: 1000 randomly selected single-query case sessions

Test-1: 1000 randomly selected multi-query case sessions
21
Experimental Results
Coverage of suggestion:
Fig: The coverage of the three methods
on (a) Test-0 and (b) Test-1
22
Experimental Results
Quality of suggestion: (collect relevance grading from 10
judges)
Fig: The quality of the three methods
on (a) Test-0 and (b) Test-1
23
Summary
Three things to know:

Some basics about query suggestion using search log

The proposed efficient query clustering algorithm for search-log click-through bipartite data

The proposed efficient context-aware query suggestion
method using concept sequence suffix tree
Hint: a “concept”-level N-gram with varied length N, plus a structure for efficient search
24
Thank You!
25