Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
LinkSelector: A Web Mining Approach to Hyperlink Selection for Web Portals Xiao Fang University of Arizona 10/18/2002 Agenda Introduction Related work Problem definition -- Hyperlink Selection Solution -- LinkSelector Evaluation Contributions, limitations and future work 2 Introduction Size of WWW (Lawrence and Giles,1999) 800 million web pages 1 million pages added daily How to find information on the Web Using search engines (best coverage 38.3%) (Lawrence and Giles,1999) Clicking on hyperlinks 3 Introduction Web Page 1 Web Page 2 Product Category List A B C D E F Click on A Product Category A Product List A1 A2 A3 A4 A5 Click on A2 Web Page 3 Product A2 Price: 1000 Detailed description B2 4 Introduction Portal page: is the entrance to a website. Portal page Homepage of a website Default web portal (e.g.,My Yahoo!) Most My Yahoo! users never customize their default web portals (Manber et al., 2000). 5 Introduction Hyperlinks in a portal page are selected from a hyperlink pool. A hyperlink pool is a set of hyperlinks pointing to top-level web pages, e.g., hyperlink in a site index page. 6 Portal page 7 Hyperlink pool 8 Portal page 9 Hyperlink pool 10 Introduction Number of hyperlinks in a portal page: several dozens (e.g., 32 in the Arizona Home page). Number of hyperlinks in a hyperlink pool: several hundreds (e.g., 743 in the Arizona Index page). 11 Introduction It is too computational expensive to do an exhaustive search (e.g., C 1.44E 56 ). 32 743 Current practice of hyperlink selection – expert selection Only reflect domain experts’ perspectives Subjective 12 Introduction Our approach is based on web access patterns extracted from a web log – objective and reflect web surfers’ perspectives web structural patterns extracted from an existing website -- objective 13 Related work Web mining is the process of applying data mining techniques to extract patterns from the Web. Web Data Content: texts and graphics in web pages Structure: hyperlinks Usage: web logs 14 Related work Web content mining is the process of automatically retrieving, parsing, indexing and categorizing web documents.(Chakrabrati, 2000) Web structure mining HITS (Kleinberg, 1998) PageRank (Brin and Page, 1998) 15 Related work Web usage mining is the process of applying data mining techniques to extract web access patterns from a web log. 16 Related work Web usage mining General purpose, e.g., Chen et al. 1996; Cooley et al., 1999 Website improvement, e.g., Perkowitz and Etzioni, 2000 Personalization, e.g., Yan 1996 17 Related work Limitations of previous web usage mining research Not considering web structure information, e.g., Chen et al., 1996 Web structure information are used to exclude “uninteresting” web visiting patterns, e.g., Yan et al., 1996 and Cooley et al., 1999 18 Hyperlink Selection The quality of a portal page is measured using a web log and a web log can be divided into sessions. Metrics to measure the quality of a portal page Effectiveness Efficiency Usage 19 Hyperlink Selection Effectiveness: is the percentage of usersought top-level web pages that can be easily accessed from the portal page. What are the user-sought top-level web pages? How to define the easiness to find a web page from a Portal page? 20 Hyperlink Selection User-sought top-level web pages Session j: L1, L10, L11, L2, L13, L14, L5, L9, L7, L12 L1, L2, L5, and L7 are in the hyperlink pool User-sought top-level web pages: L1, L2, L5, L7 21 Hyperlink Selection Usually, web pages that are 1-2 clicks away from a portal page can be easily found from the portal page. 22 Hyperlink Selection • Effectiveness measured at session level effectiveness ( S j ) UHL( S j ) HL UHL( S j ) UHL( S j ) {L1, L2, L5, L7} HL {L1, L2, L3, L4, L5, L8} {L1, L2, L5} effectiveness ( S j ) 75% {L1, L2, L5, L7} • Effectiveness measured at log level s effectiveness( wl ) effectiveness( S j) j 1 s 23 Hyperlink Selection • Efficiency measures the usefulness of hyperlinks placed in a portal page. • Efficiency measured at session level efficiency ( S j ) UHL( S j ) HL PHL UHL( S j ) {L1, L2, L5, L7} HL {L1, L2, L3, L4, L5, L8} PHL {L1, L2, L3} efficiency ( S j ) {L1, L2, L5} 1 {L1, L2, L3} • Efficiency measured at log level s efficiency( wl ) efficiency( S j) j 1 s 24 Hyperlink Selection Usage : how often a portal page is visited. usage( wl ) s UHL( S j ) HL j 1 25 Hyperlink Selection Definition : Given a website w, its hyperlink pool HP and the number of hyperlinks to be placed in the portal page of w – N, where N HP , the hyperlink selection problem is to construct the portal page by selecting N hyperlinks from the hyperlink pool HP to maximize the effectiveness, efficiency and usage of the resulting portal page (i.e., all metrics are measured at the web log level). 26 LinkSelector LinkSelector is based on relationships between hyperlinks in a hyperlink pool. Structure Relationship Access Relationship 27 LinkSelector Structure Relationship 28 LinkSelector Structure Relationship L2 L4 Web page 2 L6 L1 L8 L3 L5 Web page 1 L7 Web page 3 Structure relationships: L1L2 L1L4 L3L5 L3L7 L1L6 L1L8 29 LinkSelector Access Relationship k-HS is denoted as a hyperlink set with k hyperlinks. e.g., {L1,L2} is a 2-HS The support of a k-HS is the percentage of sessions that web pages pointed to by hyperlinks in the k-HS are accessed together. e.g., If web pages pointed to by L1 and L2 are accessed together in 20 sessions out of total 100 sessions, then the support of 2-HS {L1,L2} is 20%. 30 LinkSelector Access Relationship Definition : For a k-HS , where k 2 , there exists an access relationship among elements in the k-HS if and only if its support is greater than a pre-defined threshold. 31 LinkSelector 32 LinkSelector Group-I relationship L2 L4 L1 L3 Web page 1 Web page 2 L6 L8 L5 L7 Web page 3 Structure relationships: L1L2 L1L4 L3L5 L3L7 L1L6 L1L8 33 LinkSelector Group-I relationship L2 L4 L1 L3 Web page 1 L6 L8 L5 L7 L1L4 L3L5 L3L7 L1L6 Web page 3 Access relationships: Structure relationships: L1L2 Web page 2 L1L8 {L1,L2},0.1 {L1,L4},0.1 {L1,L6},0.05 {L1,L8},0.05 {L3,L5},0.4 {L3,L7},0.5 34 LinkSelector Group-I relationship provides indicators of preference for individual hyperlinks in hyperlink selection the number of structure relationships a hyperlink participate in as an initial hyperlink the quality of these structure relationships 35 LinkSelector Group-II relationship no structure relationship between L9 and L12 an access relationship between L9 and L12 36 LinkSelector Group-II relationship provides indicators of hyperlink pair preference in hyperlink selection: hyperlink pairs with Group-II relationships are preferred to hyperlink pairs without Group-II relationships within hyperlink pairs with Group-II relationships, hyperlink pairs with higher support of access relationship are preferred to those with lower support of access relationships 37 LinkSelector Group-III relationship reveals patterns that are not relevant to hyperlink selection L1 L5 Web page 2 Web page 1 Group-IV relationship does not reveal interesting patterns. 38 LinkSelector The Sketch of LinkSelector 39 LinkSelector Discover Structure Relationships 40 LinkSelector Access relationship can be discovered from a web log using association rule mining Data Preprocessing Association rule mining 41 LinkSelector Data Preprocessing Web log cleaning Error logs (e.g., status code 404) Accessory logs (e.g., .gif) Session identification Modify web server 30-min time interval 42 LinkSelector Association rule mining Given: a transaction database tid item 001 1,2,3 002 1,2,4 003 2,3,4 004 4,5,6 An itemset is a set of items, e.g., {1,2} The support of an itemset is the percentage of transactions that contain (e.g., purchase) the itemst. Objective: discover all itemsets with supports larger than a user-defined threshold. Apriori Algorithm (Agrawal and Srikant,1994 ) 43 LinkSelector Calculate Preferences for Hyperlinks HP {L1 , L2 , L3 , L4 , L5 , L6 , L7 } SR {L1 L2 , L1 L3 , L1 L4 , } AR {({L1 , L2 },0.022), ({L1 , L3},0.018), ({L1 , L4 },0.01) } PRE L1 0.022 0.018 0.01 0.05 44 LinkSelector Calculate Preferences for Hyperlinks Sets AR {({L3 , L7 , L9},0.022), ({L4 , L8},0.02), } No structure relationships between L3 and L7 ; L3 and L9 ; L7 and L9 . Preference for hyperlink set {L3 , L7 , L9}is 0.022 45 LinkSelector Clustering is a data mining algorithm to segment objects into groups based on their similarities 1 3 0.2 5 1 2 3 4 5 1 0 0.2 0.1 0.1 0.05 2 0.2 0 0.1 0.1 0.05 3 0.1 0.1 4 5 2 4 0.12 46 LinkSelector Hyperlink Clustering : Hyperlinks Objects Preferences for hyperlinks Weights of objects Preferences for hyperlinks sets Similarities among hyperlinks 47 LinkSelector Limitations of classical clustering algorithms Weights of objects are not considered. Only considers similarities between two objects 48 LinkSelector Our solution Indexes of the proposed similarity matrix are clusters while indexes of the traditional similarity matrix are objects to be clustered. Similarities involving two and more objects are considered in the proposed similarity matrix. Weights of objects are considered 49 LinkSelector 50 Experiment Data collected from UA website Hyperlink pool: 110 links Web log: collected in Sep. 2001 10 M records 4.2 M records total 344 K sessions 262 k sessions Training data (23 days) 82 k sessions Testing data (7 days) 51 Experiment Hyperlinks selected by LinkSelector, Domain experts and access frequency (N=6) 52 Experiment Average improvement: 12.7% Improvement decrease from 22.1% to 8.4% Average number of sessions per day: 11.5k 53 Experiment Average improvement: 17.0% Improvement decrease from 30.2% to 9.4% Absolute number of hyperlinks improved: 15610 to 6509 54 Experiment Average improvement: 16.9% Improvement decrease from 30.2% to 9.3% 55 Experiment Hyperlinks selected by LinkSelector, Classical Hierarchical Clustering and Association rule mining (N=6) 56 Experiment Average improvement compared with association rule mining: 25.8% Average improvement compared with classical clustering: 102.0% 57 Experiment Average improvement compared with association rule mining: 31.7% Average improvement compared with classical clustering: 124.0% 58 Experiment Average improvement compared with association rule mining: 31.6% Average improvement compared with classical clustering: 123.0% 59 Contributions 1. We proposed and formally defined a new and important research problem – hyperlink selection. 2. We proposed and showed what a web mining based hyperlink selection approach outperforms other hyperlink selection approaches. 3.We developed a new clustering algorithm for hyperlink selection. 60 Limitations and Future work User Study Adaptive LinkSelector Structure of website changes Web visiting patterns change 61