What to Mine from Big Data?
Hang Li
Noah’s Ark Lab
Huawei Technologies
Big Data
Value
Two Main Issues in Big Data Mining
Agenda
• Four Principles for “What to Mine”
• Stories Regarding the Principles
– Search and Browse Log Mining as Example
• Our Work on Big Data Mining
– Mining Query Subtopics from Search Log Data
• Summary
Four Principles for “What to Mine”
1. Identifying scenarios of mining as much as possible
2. Logging as much data as possible
3. Integrating as much data as possible
4. ‘Understanding’ data as much as possible
Identifying scenarios of mining as much as possible
Immanuel Kant
The world as we know it is our interpretation of the observable facts in the light of theories that we ourselves invent.
Example of Bad Design of a Toolbar
• A toolbar developed at a search engine
• It recorded users’ search behavior data
• However, it did not record the time at which the user closed the browser
• No indication of the end of a session
Logging as much data as possible
Examples of Useful Log Information
• User moves mouse on screen (user may unconsciously put mouse on the focused area)
– may infer users’ interest in the page
• User uses mouse to scroll up and down
– may infer whether user is serious about page content (more scrolling suggests more seriousness)
• User clicks on next page
– may infer user’s current focus
• User closes browser window/tab
– may infer user’s current focus
Integrating as much data as possible
Model of User Search Behavior
• Data needs to be collected from different sources (toolbar, search engine log)
• E.g., the toolbar usually does not record search results
• Often challenging to integrate data (a minimal join sketch follows)
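As a minimal sketch of what such integration might look like, the Python below joins toolbar events with search-engine log records. The record schemas and the join rule are assumptions for illustration, not details from the talk: search-log records carry a user id, a timestamp, and a query; toolbar events carry a user id, a timestamp, and an action; each event is attached to the same user's most recent preceding search.

```python
from bisect import bisect_right
from collections import defaultdict

def join_toolbar_and_search_log(search_log, toolbar_events):
    """Attach each toolbar event to the user's most recent preceding search.

    Hypothetical schemas: search_log records have 'user_id', 'time', 'query';
    toolbar_events have 'user_id', 'time', 'action' (mouse move, scroll,
    click on next page, close browser, ...).
    """
    searches_by_user = defaultdict(list)
    for rec in sorted(search_log, key=lambda r: r["time"]):
        rec.setdefault("events", [])
        searches_by_user[rec["user_id"]].append(rec)

    for ev in toolbar_events:
        searches = searches_by_user.get(ev["user_id"], [])
        times = [s["time"] for s in searches]
        i = bisect_right(times, ev["time"]) - 1  # last search at or before the event
        if i >= 0:
            searches[i]["events"].append(ev)

    return search_log
```

In practice the join key and time window depend on what each source actually records, which is exactly where a missing field (such as the browser-close time in the toolbar example above) makes integration hard.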
Understanding Data as Much as Possible
AOL Search Data Leak (2006)
• AOL search data release (20M queries, 650K users, 3 months)
• New York Times article “A Face Is Exposed for AOL Searcher No. 4417749”
• Queries
– “landscapers in Lilburn, Ga”
– several people with the last name Arnold
– “homes sold in shadow lake subdivision gwinnett county georgia.”
– “dog that urinates on everything”
– “60 single men”
• Identified searcher is Thelma Arnold, a widow living in Georgia
Mining Query Subtopics from Search Log Data
Yunhua Hu, Yanan Qian¹, Hang Li, Daxin Jiang, Jian Pei², and Qinghua Zheng¹
Microsoft Research Asia, Beijing, China
¹ SPKLSTN Lab, Xi'an Jiaotong University, China
² Simon Fraser University, Burnaby, BC, Canada
Outline
• Introduction
• Our Method
• Experiments
• Conclusion
Demo
Mined Subtopics
Subtopics of Query
• Most queries are ambiguous or multifaceted in web search
[Figure: example queries “Harry Shum” and “XBox” with subtopics such as “Harry Shum Microsoft”, “Harry Shum Jr”, “XBox games”, “XBox homepage”, and “XBox marketplace”]
• Major senses and facets of a query (its subtopics)
Our Work = Automatically Mining Subtopics of Queries from Search Log Data
Phenomenon 1: One Subtopic per Search (OSS)
Query: "Harry Shum"
Multi-Clicked URLs (Multi-Clicks) and their frequencies:
– "http://research.microsoft.com/en-us/people/hshum, http://en.wikipedia.org/wiki/Harry_Shum, http://www.microsoft.com/presspass/exec/Shum/" : 50
– "http://en.wikipedia.org/wiki/Harry_Shum,_Jr, http://www.washingtonpost.com/.../VI2011022701183.html" : 95
Jointly clicked URLs in the same searches tend to represent the same subtopics.
Phenomenon 2: Subtopic Clarification by Additional Keyword (SCAK)
Query "Harry Shum", clicked URLs:
– "http://research.microsoft.com/en-us/people/hshum"
– "http://en.wikipedia.org/wiki/Harry_Shum,_Jr"
– "http://en.wikipedia.org/wiki/Harry_Shum"
– "http://www.washingtonpost.com/.../VI2011022701183.html"
– "http://www.microsoft.com/presspass/exec/Shum/"
Query "Microsoft Harry Shum", clicked URLs:
– "http://research.microsoft.com/en-us/people/hshum"
– "http://en.wikipedia.org/wiki/Harry_Shum"
– "http://www.microsoft.com/presspass/exec/Shum/"
Query "Harry Shum Jr", clicked URLs:
– "http://en.wikipedia.org/wiki/Harry_Shum,_Jr"
– "http://www.washingtonpost.com/.../VI2011022701183.html"
Query "Harry Shum Glee", clicked URLs:
– "http://en.wikipedia.org/wiki/Harry_Shum,_Jr"
– "http://www.washingtonpost.com/.../VI2011022701183.html"
URLs clicked in searches of the query and its expanded queries tend to represent the same subtopics.
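As a concrete illustration of how the two phenomena might be read off a click log, here is a minimal Python sketch. The log format, an iterable of query and clicked-URL pairs with one pair per search, is an assumption for illustration and not the format used in the paper.

```python
from collections import Counter, defaultdict

def extract_signals(log_records, query):
    """Collect OSS and SCAK evidence for `query` from a click log.

    `log_records` is assumed to be an iterable of (query_string, clicked_urls)
    pairs, one pair per search (a hypothetical format).
    """
    multi_clicks = Counter()             # OSS: URL sets clicked together -> frequency
    expansion_clicks = defaultdict(set)  # SCAK: expanded query -> clicked URLs

    for q, clicked_urls in log_records:
        if q == query and len(clicked_urls) > 1:
            # Phenomenon 1: URLs clicked in the same search form a multi-click
            multi_clicks[frozenset(clicked_urls)] += 1
        elif q != query and query in q:
            # Phenomenon 2: the query plus additional keyword(s)
            expansion_clicks[q].update(clicked_urls)

    return multi_clicks, expansion_clicks
```

For the "Harry Shum" example above, the first structure would hold the two multi-clicks with frequencies 50 and 95, and the second would hold the clicked URLs of "Microsoft Harry Shum", "Harry Shum Jr", and "Harry Shum Glee".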
Outline
• Introduction
• Our Method
• Experiments
• Conclusion
Our Approach
• Mining subtopics of queries by leveraging the two phenomena
• Subtopics of a query are represented by
– URLs
– Keywords in expanded queries
• Example of subtopics for “harry shum”:
Subtopic 1
– Keywords: “harry shum microsoft”, “harry shum bing”, “microsoft harry shum”
– URLs: “http://en.wikipedia.org/wiki/Harry_Shum”, “http://research.microsoft.com/en-us/people/hshum/”, “http://www.microsoft.com/presspass/exec/Shum/”
Subtopic 2
– Keywords: “harry shum jr”, “harry shum glee”, “harry shum junior”
– URLs: “http://en.wikipedia.org/wiki/Harry_Shum,_Jr.”, “http://harryshumjr.com/”, “http://www.imdb.com/name/nm1484270/”
Flow of Clustering Method
Preprocessing
• Tree structure to index queries (expansions of the form ‘Q+W’ and ‘W+Q’ for a query ‘Q’)
• Pruning: only keep expanded queries whose clicked URLs overlap with those of the original query (a minimal sketch follows)
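A minimal sketch of the expansion-matching and pruning step, under the assumption that the query index can be reduced to a map from query strings to their clicked URL sets; the actual tree index from the paper is not reproduced here.

```python
def find_expansions(query, query_to_urls):
    """Find expanded queries of the form 'Q + W' or 'W + Q' for query Q.

    `query_to_urls` maps each query string to the set of URLs clicked for it
    (a hypothetical structure standing in for the tree index on the slide).
    Expansions whose clicked URLs do not overlap with the original query's
    clicked URLs are pruned, as described above.
    """
    base_urls = query_to_urls.get(query, set())
    expansions = {}
    for q, urls in query_to_urls.items():
        if q == query:
            continue
        # 'Q + W' (trailing keyword) or 'W + Q' (leading keyword)
        if q.startswith(query + " ") or q.endswith(" " + query):
            if urls & base_urls:  # pruning: require URL overlap
                expansions[q] = urls
    return expansions
```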
Similarity Calculation between URLs
• S1: Similarity based on OSS
• S2: Similarity based on SCAK
• S3: Similarity between URL tokens
[Figure: example count matrices and the resulting similarity matrices. For S1, URLs such as "http://en.wikipedia.org/wiki/Harry_Shum" and "http://www.microsoft.com/presspass/exec/Shum/" are represented by their counts over MultiClick1, MultiClick2, and MultiClick3; for S2, URLs such as "http://en.wikipedia.org/wiki/Harry_Shum,_Jr" and "http://www.imdb.com/name/nm1484270/" are represented by their counts over the keywords “Jr”, “Glee”, and “Microsoft”. The similarity matrices contain URL-pair entries such as 0.64 (S1) and 0.96 (S2).]
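The slides show the count matrices and the resulting similarity matrices but not the exact similarity function, so the sketch below assumes cosine similarity over multi-click count vectors for S1; S2 could be computed the same way with keyword-click counts in place of multi-click counts.

```python
import math
from collections import defaultdict

def s1_similarity(multi_clicks):
    """Illustrative S1 (OSS-based) similarity between URLs.

    `multi_clicks` maps each multi-click (a frozenset of URLs clicked in one
    search) to its frequency. Each URL is represented by a vector of counts
    over multi-clicks; URL pairs are compared with cosine similarity, which
    is an assumption rather than the paper's exact formula.
    """
    vectors = defaultdict(dict)  # url -> {multi_click_id: count}
    for mc_id, (urls, freq) in enumerate(multi_clicks.items()):
        for url in urls:
            vectors[url][mc_id] = vectors[url].get(mc_id, 0) + freq

    def cosine(a, b):
        dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
        na = math.sqrt(sum(v * v for v in a.values()))
        nb = math.sqrt(sum(v * v for v in b.values()))
        return dot / (na * nb) if na and nb else 0.0

    urls = list(vectors)
    return {(u, v): cosine(vectors[u], vectors[v])
            for i, u in enumerate(urls) for v in urls[i + 1:]}
```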
Clustering Algorithm
• Agglomerative clustering algorithm
– Two URLs are regarded as similar if their similarity is larger than a threshold
– Each maximal connected subgraph (a group of URLs) represents a subtopic
• The algorithm is efficient and easy to implement (see the sketch below)
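A minimal sketch of this clustering, reading the slide as single-link grouping via connected components; how S1, S2, and S3 are combined into one score and the threshold value are left as parameters, since the slide does not specify them.

```python
from collections import defaultdict

def cluster_urls(urls, similarity, threshold):
    """Group URLs into subtopics as connected components of a similarity graph.

    `similarity` maps unordered URL pairs to a combined score (e.g., built
    from S1, S2, and S3); `threshold` is the similarity cut-off.
    """
    # Add an edge between URLs whose similarity exceeds the threshold
    graph = defaultdict(set)
    for (u, v), score in similarity.items():
        if score > threshold:
            graph[u].add(v)
            graph[v].add(u)

    # Each connected component of the graph is one subtopic
    seen, subtopics = set(), []
    for url in urls:
        if url in seen:
            continue
        component, stack = set(), [url]
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(graph[node] - component)
        seen |= component
        subtopics.append(component)
    return subtopics
```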
Outline
• Introduction
• Our Method
• Experiments
• Conclusion
Data Set and Parameter Setting
• One open dataset + two proprietary datasets
• Evaluation metric: B-cubed precision, recall, and F1 (a minimal sketch follows this list)
• Parameters manually tuned on 1/3 of DataSetA
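B-cubed scores a clustering item by item. The following minimal sketch uses the standard definition (it is not code from the paper): for each item, precision is the fraction of items in its predicted cluster that share its gold label, recall is the fraction of items with its gold label that fall in its predicted cluster, and both are averaged over all items.

```python
def b_cubed(predicted, gold):
    """B-cubed precision, recall, and F1 for a clustering.

    `predicted` and `gold` map each item to its predicted cluster label and
    its gold label, respectively.
    """
    items = list(predicted)
    precisions, recalls = [], []
    for i in items:
        same_pred = [j for j in items if predicted[j] == predicted[i]]
        same_gold = [j for j in items if gold[j] == gold[i]]
        correct = sum(1 for j in same_pred if gold[j] == gold[i])
        precisions.append(correct / len(same_pred))
        recalls.append(correct / len(same_gold))
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1
```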
Evaluation of Subtopic Mining
• Evaluation on different similarity functions
• Evaluation on different types of queries
Application in Search Result Clustering (1)
• Search result clustering approaches
– Baseline: Wang and Zhai’s work in SIGIR 07
– Our approach: “subtopics of query as seed clusters” + traditional URL clustering
• Evaluation on TREC and DataSetA
Application in Search Result Clustering (2)
• Manual evaluation on DataSetB from various perspectives
• Side-by-side evaluation on DataSetB
Application in Search Results Re-ranking (1)
Application in Search Results Re-ranking (2)
Outline
• Introduction
• Our Method
• Experiments
• Conclusion
Conclusion
• Discovered two phenomena in search log data that represent query subtopics
• Developed a clustering method for subtopic mining
• Applied the mined subtopics to two tasks: search result clustering and re-ranking
Strengths and Limitations of Big Data Mining
• Big data really creates big value
• Importance of insight
• Long tail challenges
• Mining needs knowledge
Summary
• Two Major Issues: What to Mine and How to Mine
• Four Principles for “What to Mine”
• Stories Regarding the Principles
– Search and Browse Log Mining as Example
• Our Work on Big Data Mining
– Mining Query Subtopics from Search Log Data
Thanks!
[email protected]