Download Mining Query Subtopics from Search Log Data

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

Cluster analysis wikipedia , lookup

Transcript
Mining Query Subtopics from
Search Log Data
Hu et al., SIGIR’12
Presented by Baichuan Li
Outline
•
•
•
•
•
•
Problem
Two observations
Approach
Results
Applications
Conclusion
Problem
• Intention != Query
• Many queries are ambiguous or multifaceted
Problem
Problem
• Xbox
– Homepage?
– Online game?
– Where to buy?
• Manchester
– Manchester news?
– Manchester tourism?
– Manchester weather?
– Manchester united?
Solution
• Identifying the major subtopics of a query
from the search log data
– Personalized search
– Query suggestion
– Search result presentation
• Clustering
• Re-ranking
• Diversification
Observations from The Logs
• One subtopic per search (OSS)
– If a user clicks multiple URLs after submitting a
query, then the clicked URLs tend to represent the
same subtopic
Observations from The Logs
• Subtopic clarification by additional keyword
(SCAK)
– Users often add additional keywords (in most cases,
one additional keyword) to a query to expand the
query in order to clarify its subtopic
– The URLs clicked after searching both with the original
and the expanded queries tend to represent the same
subtopic
– The key word tends to be indicative of the subtopic
– E.g.
• harry shum microsoft
One Subtopic per Search
• Use it as the rule for subtopic identification
– 10,000 groups of multiclicks of individual queries
– The rule is accurate when all the URLs within the
multi-clicks are about the same sense or face
Accuracy V.S. Frequency
Accuracy v.s. Click Position
Distribution
• The queries with higher frequencies in search
log data are more likely to have multi-clicks
Subtopic Clarification by Additional
Keyword
• The keywords ‘microsoft’ and ‘jr’ can be used
to represent the two groups (subtopics)
respectively
Query Types
•
•
•
•
‘Q’: the query is a single phrase
‘Q + W’: ‘Q’ + a keyword
‘W + Q’: a keyword + ‘Q’
‘Others’
Subtopic Overlap and URL Overlap
• Randomly selected 500 pairs of queries with
the forms ‘Q’ and ‘Q + W’
• If subtopics of an expanded query are
contained in subtopics of the original query,
then there is ‘subtopic overlap’
• Check whether two queries share identical
clicked URLs (‘URL overlap’)
Distribution
• The more popular (frequent) a query is, the
more likely the rule is applicable
Mining Subtopics
Preprocessing
Clustering
• Similarity function
– The similarity function between URLs ui and uj :
– S1: based on the OSS phenomenon
where
and
denote the vectors of multi-clicks of
ui and uj respectively
– S2: based on the SCAK phenomenon
– S3: based on string similarities
SCAK Similarity (S2)
where
and
denote the
vectors of keywords associated
with ui and uj respectively
Data
• TREC Data: TREC search result diversification
track 3 in 2009
• DataSetA and DataSetB: queries and URLs
randomly sampled from the logs of the
commercial search engine
Results
Application: Search Result Reranking
• The user is first asked which subtopic she is
interested in, with the subtopics shown at the
top of the results page
• When the user selects a subtopic, the URLs
belonging to the subtopic will be moved to the
top (re-ranked)
• The relative order between URLs inside and
outside of the subtopic will be kept
Example
Evaluation
• Data
– Collect search log data of 20, 000 randomly selected
searches
– Each query has at least two subtopics mined by the
method
• Results
– The average position of last clicked URLs is 3.41
– Assume the cost for the user to check the subtopics
and click one of them is 1.0
– The average position of last clicked URLs belonging to
the same subtopics is 1.80
Conclusion
• Study the problem of query subtopic mining
• Discovered two phenomena of user search
behavior
– one subtopic per search
– Subtopic clarification by additional keyword
• Design a novel similarity function
• Applications on search result reranking (and
search result clustering)
• Problem
– Can only be employed when there is enough search
log data