Download Web-page Classification through Summarization

Query Classification and KDDCUP 2005 Qiang Yang, Dou Shen 1 Query Classification and Online Advertisement 2 QC as Machine Learning Inspired by the KDDCUP’05 competition    Classify a query into a ranked list of categories Queries are collected from real search engines Target categories are organized in a tree with each node being a category 3 How to do it? 4 Solutions: Query Enrichment + Staged Classification Queries Target Categories Solution 1: Query/Category Enrichment Queries Target Categories Solution 2: Bridging classifier 5 Query enrichment  Textual information Title  Category information Snippet Category Full text 6 Classifiers  Map by Word Matching   Direct and Extended Matching  Device  E D High precision, low recall  SVM: Apply synonymbased classifiers to map Web pages from ODP to target taxonomy Obtain <pages, target category> as the training data Train SVM classifiers for the target categories;  Higher Recall 7 Bridging Classifier  Problem with Solution 1:   When target is changed, training needs to repeat! Solution:  Connect the target taxonomy and queries by taking an intermediate taxonomy as a bridge 8 Bridging Classifier (Cont.)  How to connect? T The relation between Ci and C Ij The relation between and C Ij q Prior prob. of C Ij The relation between and CiT q 9 Category Selection for Intermediate Taxonomy  Category Selection for Reducing Complexity  Total Probability (TP)  Mutual Information 10 Experiment ─ Data Sets & Evaluation  KDDCUP   Starting at 1997, KDD Cup is the leading Data Mining and Knowledge Discovery competition in the world, organized by ACM SIGKDD KDDCUP 2005   Task: Categorize 800K search queries into 67 categories Three Awards   Participation   142 registration groups; 37 solutions submitted from 32 teams Evaluation data    (1) Performance Award ; (2) Precision Award; (3) Creativity Award 800 queries randomly selected from the 800K query set 3 human labelers labeled the entire evaluation query set (details) Evaluation measurements: Precision and Performance (F1) (details)  1 Overall F1  3 a 3  (F1 against human labeler i) i 1 11 / 68 11 Experiment Results ─ Compare Different Methods Comparison among our own methods Comparison with other teams in KDDCUP2005 From Different Groups 12 12 / 68 Result of Bridging Classifiers   Performance of the Bridging Classifier with Different Granularity of Intermediate Taxonomy Using bridging classifier allows the target classes to change freely  without the need to retrain the classifier! 13 Target-transfer Learning   Classifier, once trained, stays constant When target classes change, classifier needs to be retrained with new data    Bridging Classifier:    Too costly Not online Allow target to change Application: advertisements come and go, but our querytarget mapping needs not be retrained! We call this the target-transfer learning problem 14 Task 2: Can computer do this? 15 Data: Web Search Queries  Consider the following search queries    “AAAI” “Machine Learning” “Constraint Reasoning” AAAI Machine learning Constraint Reasoning 16 AAAI 07, joint work with D. Shen, J. Sun, M. Qin, Z. Chen et al.  Queries have different granularity   Car v.s BMW; BMW v.s. AUDI Can we organize the queries into hierarchies?  Benefits of building query hierarchies  Provide online query suggestion Query classification  Query clustering   Difficulties of building query hierarchies   Queries are short The hierarchical structure cannot be pre-defined 17 Clickthrough Data √ Person 1 “SIGIR” √ Search Engines  Clickthrough Data √ 18 Intuitive Ideas  Our goal: mine the query hierarchies from clickthrough data   If two queries are related to each other, they should share some of the same or similar clicked Web pages; For two queries qi and qj , qi is more general if    most of the clicked pages of qj have similar pages to some clicked pages of qi while not the other way around If a query is specific,  the contents of its clicked pages are relatively consistent, 19

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Download Web-page Classification through Summarization