Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Query Classification and KDDCUP 2005 Qiang Yang, Dou Shen 1 Query Classification and Online Advertisement 2 QC as Machine Learning Inspired by the KDDCUP’05 competition Classify a query into a ranked list of categories Queries are collected from real search engines Target categories are organized in a tree with each node being a category 3 How to do it? 4 Solutions: Query Enrichment + Staged Classification Queries Target Categories Solution 1: Query/Category Enrichment Queries Target Categories Solution 2: Bridging classifier 5 Query enrichment Textual information Title Category information Snippet Category Full text 6 Classifiers Map by Word Matching Direct and Extended Matching Device E D High precision, low recall SVM: Apply synonymbased classifiers to map Web pages from ODP to target taxonomy Obtain <pages, target category> as the training data Train SVM classifiers for the target categories; Higher Recall 7 Bridging Classifier Problem with Solution 1: When target is changed, training needs to repeat! Solution: Connect the target taxonomy and queries by taking an intermediate taxonomy as a bridge 8 Bridging Classifier (Cont.) How to connect? T The relation between Ci and C Ij The relation between and C Ij q Prior prob. of C Ij The relation between and CiT q 9 Category Selection for Intermediate Taxonomy Category Selection for Reducing Complexity Total Probability (TP) Mutual Information 10 Experiment ─ Data Sets & Evaluation KDDCUP Starting at 1997, KDD Cup is the leading Data Mining and Knowledge Discovery competition in the world, organized by ACM SIGKDD KDDCUP 2005 Task: Categorize 800K search queries into 67 categories Three Awards Participation 142 registration groups; 37 solutions submitted from 32 teams Evaluation data (1) Performance Award ; (2) Precision Award; (3) Creativity Award 800 queries randomly selected from the 800K query set 3 human labelers labeled the entire evaluation query set (details) Evaluation measurements: Precision and Performance (F1) (details) 1 Overall F1 3 a 3 (F1 against human labeler i) i 1 11 / 68 11 Experiment Results ─ Compare Different Methods Comparison among our own methods Comparison with other teams in KDDCUP2005 From Different Groups 12 12 / 68 Result of Bridging Classifiers Performance of the Bridging Classifier with Different Granularity of Intermediate Taxonomy Using bridging classifier allows the target classes to change freely without the need to retrain the classifier! 13 Target-transfer Learning Classifier, once trained, stays constant When target classes change, classifier needs to be retrained with new data Bridging Classifier: Too costly Not online Allow target to change Application: advertisements come and go, but our querytarget mapping needs not be retrained! We call this the target-transfer learning problem 14 Task 2: Can computer do this? 15 Data: Web Search Queries Consider the following search queries “AAAI” “Machine Learning” “Constraint Reasoning” AAAI Machine learning Constraint Reasoning 16 AAAI 07, joint work with D. Shen, J. Sun, M. Qin, Z. Chen et al. Queries have different granularity Car v.s BMW; BMW v.s. AUDI Can we organize the queries into hierarchies? Benefits of building query hierarchies Provide online query suggestion Query classification Query clustering Difficulties of building query hierarchies Queries are short The hierarchical structure cannot be pre-defined 17 Clickthrough Data √ Person 1 “SIGIR” √ Search Engines Clickthrough Data √ 18 Intuitive Ideas Our goal: mine the query hierarchies from clickthrough data If two queries are related to each other, they should share some of the same or similar clicked Web pages; For two queries qi and qj , qi is more general if most of the clicked pages of qj have similar pages to some clicked pages of qi while not the other way around If a query is specific, the contents of its clicked pages are relatively consistent, 19