Web Mining : A Bird’s Eye View Sanjay Kumar Madria Department of Computer Science University of Missouri-Rolla, MO 65401 [email protected] May 22, 2017 Web Mining 1 Web Mining • Web mining - data mining techniques to automatically discover and extract information from Web documents/services (Etzioni, 1996). • Web mining research – integrate research from several research communities (Kosala and Blockeel, July 2000) such as: • Database (DB) • Information retrieval (IR) • The sub-areas of machine learning (ML) • Natural language processing (NLP) May 22, 2017 Web Mining 2 Mining the World-Wide Web • WWW is huge, widely distributed, global information source for – Information services: news, advertisements, consumer information, financial management, education, government, e-commerce, etc. – Hyper-link information – Access and usage information – Web Site contents and Organization May 22, 2017 Web Mining 3 Mining the World-Wide Web • Growing and changing very rapidly – Broad diversity of user communities • Only a small portion of the information on the Web is truly relevant or useful to Web users – How to find high-quality Web pages on a specified topic? • WWW provides rich sources for data mining May 22, 2017 Web Mining 4 Challenges on WWW Interactions • Finding Relevant Information • Creating knowledge from Information available • Personalization of the information • Learning about customers / individual users Web Mining can play an important Role! May 22, 2017 Web Mining 5 Web Mining: more challenging • Searches for – Web access patterns – Web structures – Regularity and dynamics of Web contents • Problems – The “abundance” problem – Limited coverage of the Web: hidden Web sources, majority of data in DBMS – Limited query interface based on keyword-oriented search – Limited customization to individual users – 2017 Dynamic and semistructured May 22, Web Mining 6 Web Mining : Subtasks • Resource Finding – Task of retrieving intended web-documents • Information Selection & Pre-processing – Automatic selection and pre-processing specific information from retrieved web resources • Generalization – Automatic Discovery of patterns in web sites • Analysis – Validation and / or interpretation of mined patterns May 22, 2017 Web Mining 7 Web Mining Taxonomy Web Mining Web Content Mining May 22, 2017 Web Structure Mining Web Mining Web Usage Mining 8 Web Content Mining • Discovery of useful information from web contents / data / documents – Web data contents: text, image, audio, video, metadata and hyperlinks. • Information Retrieval View ( Structured + Semi-Structured) – Assist / Improve information finding – Filtering Information to users on user profiles • Database View – Model Data on the web – Integrate them for more sophisticated queries May 22, 2017 Web Mining 9 Issues in Web Content Mining • Developing intelligent tools for IR - Finding keywords and key phrases - Discovering grammatical rules and collocations - Hypertext classification/categorization - Extracting key phrases from text documents - Learning extraction models/rules - Hierarchical clustering - Predicting (words) relationship May 22, 2017 Web Mining 10 Cont…. • Developing Web query systems – WebOQL, XML-QL • Mining multimedia data - Mining image from satellite (Fayyad, et al. 1996) - Mining image to identify small volcanoes on Venus (Smyth, et al 1996) . 
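Illustration (not part of the original slides): many of the content-mining issues above (finding keywords and key phrases, clustering, classification) start from a bag-of-words representation scored with TF-IDF. Below is a minimal sketch, assuming a small in-memory collection; the sample pages, the crude tokenizer, and the top-3 cutoff are illustrative choices, not a particular system from the slides.

```python
import math
import re
from collections import Counter

def tokenize(text):
    # Lowercase and split on non-letters; a real system would also stem and drop stopwords.
    return re.findall(r"[a-z]+", text.lower())

def tfidf_keywords(docs, top_k=3):
    """Return the top_k highest-scoring TF-IDF terms for each document."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for toks in tokenized for term in set(toks))
    results = []
    for toks in tokenized:
        if not toks:
            results.append([])
            continue
        tf = Counter(toks)
        scores = {t: (tf[t] / len(toks)) * math.log(n_docs / df[t]) for t in tf}
        results.append(sorted(scores, key=scores.get, reverse=True)[:top_k])
    return results

if __name__ == "__main__":
    pages = [
        "web mining applies data mining techniques to web documents",
        "hyperlink structure mining finds hubs and authorities",
        "usage mining analyzes server logs and user sessions",
    ]
    for keywords in tfidf_keywords(pages):
        print(keywords)
```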
May 22, 2017 Web Mining 11 Web Structure Mining • To discover the link structure of the hyperlinks at the inter-document level to generate structural summary about the Website and Web page. – Direction 1: based on the hyperlinks, categorizing the Web pages and generated information. – Direction 2: discovering the structure of Web document itself. – Direction 3: discovering the nature of the hierarchy or network of hyperlinks in the Website of a particular domain. May 22, 2017 Web Mining 12 Web Structure Mining • Finding authoritative Web pages – Retrieving pages that are not only relevant, but also of high quality, or authoritative on the topic • Hyperlinks can infer the notion of authority – The Web consists not only of pages, but also of hyperlinks pointing from one page to another – These hyperlinks contain an enormous amount of latent human annotation – A hyperlink pointing to another Web page, this can be considered as the author's endorsement of the page May 22, 2017 Web other Mining 13 Web Structure Mining • Web pages categorization (Chakrabarti, et al., 1998) • Discovering micro communities on the web - Example: Clever system (Chakrabarti, et al., 1999), Google (Brin and Page, 1998) • Schema Discovery in Semistructured Environment May 22, 2017 Web Mining 14 Web Usage Mining • Web usage mining also known as Web log mining – mining techniques to discover interesting usage patterns from the secondary data derived from the interactions of the users while surfing the web May 22, 2017 Web Mining 15 Web Usage Mining • Applications – Target potential customers for electronic commerce – Enhance the quality and delivery of Internet information services to the end user – Improve Web server system performance – Identify potential prime advertisement locations – Facilitates personalization/adaptive sites – Improve site design – Fraud/intrusion detection – Predict user’s actions (allows prefetching) May 22, 2017 Web Mining 16 May 22, 2017 Web Mining 17 Problems with Web Logs • Identifying users – Clients may have multiple streams – Clients may access web from multiple hosts – Proxy servers: many clients/one address – Proxy servers: one client/many addresses • Data not in log – POST data (i.e., CGI request) not recorded – Cookie data stored elsewhere May 22, 2017 Web Mining 18 Cont… • Missing data – Pages may be cached – Referring page requires client cooperation – When does a session end? 
– Use of forward and backward pointers • • • Typically a 30 minute timeout is used Web content may be dynamic – May not be able to reconstruct what the user saw Use of spiders and automated agents – automatic request we pages May 22, 2017 Web Mining 19 Cont… • Like most data mining tasks, web log mining requires preprocessing – To identify users – To match sessions to other data – To fill in missing data – Essentially, to reconstruct the click stream May 22, 2017 Web Mining 20 Log Data - Simple Analysis • Statistical analysis of users – Length of path – Viewing time – Number of page views • Statistical analysis of site – Most common pages viewed – Most common invalid URL May 22, 2017 Web Mining 21 Web Log – Data Mining Applications • Association rules – Find pages that are often viewed together • Clustering – Cluster users based on browsing patterns – Cluster pages based on content • Classification – Relate user attributes to patterns May 22, 2017 Web Mining 22 Web Logs • Web servers have the ability to log all requests • Web server log formats: – Most use the Common Log Format (CLF) – New, Extended Log Format allows configuration of log file • Generate vast amounts of data May 22, 2017 Web Mining 23 • • • • • • • Common Log Format Remotehost: browser hostname or IP # Remote log name of user (almost always "-" meaning "unknown") Authuser: authenticated username Date: Date and time of the request "request”: exact request lines from client Status: The HTTP status code returned Bytes: The content-length of response May 22, 2017 Web Mining 24 Server Logs May 22, 2017 Web Mining 25 Fields • • • • • • • • Client IP: 128.101.228.20 Authenticated User ID: - Time/Date: [10/Nov/1999:10:16:39 -0600] Request: "GET / HTTP/1.0" Status: 200 Bytes: Referrer: “-” Agent: "Mozilla/4.61 [en] (WinNT; I)" May 22, 2017 Web Mining 26 Web Usage Mining • Commonly used approaches (Borges and Levene, 1999) - Maps the log data into relational tables before an adapted data mining technique is performed. - Uses the log data directly by utilizing special pre-processing techniques. • Typical problems - Distinguishing among unique users, server sessions, episodes, etc. in the presence of caching and proxy servers (McCallum, et al., 2000; Srivastava, et al., 2000). May 22, 2017 Web Mining 27 Request • Method: GET – Other common methods are POST and HEAD • URI: / • – This is the file that is being accessed. When a directory is specified, it is up to the Server to decide what to return. Usually, it will be the file named “index.html” or “home.html” • Protocol: HTTP/1.0 May 22, 2017 Web Mining 28 Status • Status codes are defined by the HTTP protocol. • Common codes include: – 200: OK – 3xx: Some sort of Redirection – 4xx: Some sort of Client Error – 5xx: Some sort of Server Error May 22, 2017 Web Mining 29 May 22, 2017 Web Mining 30 Web Mining Taxonomy Web Mining Web Content Mining Web Page Content Mining May 22, 2017 Web Structure Mining Search Result Mining Web Usage Mining General Access Pattern Tracking Web Mining Customized Usage Tracking 31 Mining the World Wide Web Web Mining Web Content Mining Web Page Content Mining Web Page Summarization WebOQL(Mendelzon et.al. 1998) …: Web Structuring query languages; Can identify information within given web pages •(Etzioni et.al. 1997):Uses heuristics to distinguish personal home pages from other web pages •ShopBot (Etzioni et.al. 
1997): Looks for product prices within web pages May 22, 2017 Web Structure Mining Search Result Mining Web Mining Web Usage Mining General Access Pattern Tracking Customized Usage Tracking 32 Mining the World Wide Web Web Mining Web Content Mining Web Page Content Mining Web Structure Mining Search Result Mining Search Engine Result Summarization •Clustering Search Result (Leouski and Croft, 1996, Zamir and Etzioni, 1997): Categorizes documents using phrases in titles and snippets May 22, 2017 Web Usage Mining Web Mining General Access Pattern Tracking Customized Usage Tracking 33 Mining the World Wide Web Web Mining Web Content Mining Search Result Mining Web Page Content Mining May 22, 2017 Web Structure Mining Using Links •PageRank (Brin et al., 1998) •CLEVER (Chakrabarti et al., 1998) Use interconnections between web pages to give weight to pages. Using Generalization •MLDB (1994) Uses a multi-level database representation of the Web. Counters (popularity) and link lists are used for capturing structure. Web Mining Web Usage Mining General Access Pattern Tracking Customized Usage Tracking 34 Mining the World Wide Web Web Mining Web Content Mining Web Page Content Mining Search Result Mining May 22, 2017 Web Structure Mining Web Usage Mining General Access Pattern Tracking Customized Usage Tracking •Web Log Mining (Zaïane, Xin and Han, 1998) Uses KDD techniques to understand general access patterns and trends. Can shed light on better structure and grouping of resource providers. Web Mining 35 Mining the World Wide Web Web Mining Web Content Mining Web Page Content Mining Search Result Mining May 22, 2017 Web Structure Mining Web Usage Mining Customized Usage Tracking General Access Pattern Tracking •Adaptive Sites (Perkowitz and Etzioni, 1997) Analyzes access patterns of each user at a time. Web site restructures itself automatically by learning from user access patterns. Web Mining 36 Web Content Mining • Agent-based Approaches: – Intelligent Search Agents – Information Filtering/Categorization – Personalized Web Agents • Database Approaches: – Multilevel Databases – Web Query Systems May 22, 2017 Web Mining 37 Intelligent Search Agents • Locating documents and services on the Web: – WebCrawler, Alta Vista (http://www.altavista.com): scan millions of Web documents and create index of words (too many irrelevant, outdated responses) – MetaCrawler: mines robot-created indices • Retrieve product information from a variety of vendor sites using only general information about the product domain: – ShopBot May 22, 2017 Web Mining 38 Intelligent Search Agents (Cont’d) • Rely either on pre-specified domain information about particular types of documents, or on hard coded models of the information sources to retrieve and interpret documents: – – – – – Harvest FAQ-Finder Information Manifold OCCAM Parasite • Learn models of various information sources and translates these into its own concept hierarchy: – ILA (Internet Learning Agent) May 22, 2017 Web Mining 39 Information Filtering/Categorization • Using various information retrieval techniques and characteristics of open hypertext Web documents to automatically retrieve, filter, and categorize them. 
– HyPursuit: uses semantic information embedded in link structures and document content to create cluster hierarchies of hypertext documents, and structure an information space – BO (Bookmark Organizer): combines hierarchical clustering techniques and user interaction to organize a collection of Web documents based on conceptual information May 22, 2017 Web Mining 40 Personalized Web Agents • This category of Web agents learn user preferences and discover Web information sources based on these preferences, and those of other individuals with similar interests (using collaborative filtering) – – – – – – WebWatcher PAINT Syskill&Webert GroupLens Firefly others May 22, 2017 Web Mining 41 Multiple Layered Web Architecture Layern More Generalized Descriptions ... Layer1 Generalized Descriptions Layer0 May 22, 2017 Web Mining 42 Multilevel Databases • At the higher levels, meta data or generalizations are – extracted from lower levels – organized in structured collections, i.e. relational or object-oriented database. • At the lowest level, semi-structured information are – stored in various Web repositories, such as hypertext documents May 22, 2017 Web Mining 43 Multilevel Databases (Cont’d) • (Han, et. al.): – use a multi-layered database where each layer is obtained via generalization and transformation operations performed on the lower layers • (Kholsa, et. al.): – propose the creation and maintenance of metadatabases at each information providing domain and the use of a global schema for the metadatabase May 22, 2017 Web Mining 44 Multilevel Databases (Cont’d) • (King, et. al.): – propose the incremental integration of a portion of the schema from each information source, rather than relying on a global heterogeneous database schema • The ARANEUS system: – extracts relevant information from hypertext documents and integrates these into higher-level derived Web Hypertexts which are generalizations of the notion of database views May 22, 2017 Web Mining 45 Multi-Layered Database (MLDB) • A multiple layered database model – based on semi-structured data hypothesis – queried by NetQL using a syntax similar to the relational language SQL • Layer-0: – An unstructured, massive, primitive, diverse global informationbase. • Layer-1: – A relatively structured, descriptor-like, massive, distributed database by data analysis, transformation and generalization techniques. – Tools to be developed for descriptor extraction. • Higher-layers: – Further generalization to form progressively smaller, better structured, and less remote databases for efficient browsing, retrieval, and information discovery. 
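Illustration (not part of the original slides): a hypothetical sketch of the Layer-0 to Layer-1 step, reducing a raw, unstructured page to a structured, descriptor-like record that a relational layer could store and further generalize. The descriptor fields (url, title, keywords, size) and the extraction rules are assumptions for illustration, not the actual MLDB schema.

```python
import re
from collections import Counter
from dataclasses import dataclass

@dataclass
class Layer1Record:
    # A structured, descriptor-like tuple derived from an unstructured Layer-0 page.
    url: str
    title: str
    keywords: list
    size_bytes: int

def generalize_to_layer1(url, html, n_keywords=5):
    """Extract a Layer-1 descriptor record from one raw Layer-0 page."""
    title_match = re.search(r"<title>(.*?)</title>", html, re.IGNORECASE | re.DOTALL)
    title = title_match.group(1).strip() if title_match else ""
    text = re.sub(r"<[^>]+>", " ", html)             # strip markup
    terms = re.findall(r"[a-z]{4,}", text.lower())   # crude term extraction
    keywords = [t for t, _ in Counter(terms).most_common(n_keywords)]
    return Layer1Record(url, title, keywords, len(html.encode("utf-8")))

if __name__ == "__main__":
    page = "<html><title>Web Mining Survey</title><body>Mining web usage data and web structure.</body></html>"
    print(generalize_to_layer1("http://example.org/survey.html", page))
```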
May 22, 2017 Web Mining 46 Three major components in MLDB • S (a database schema): – outlines the overall database structure of the global MLDB – presents a route map for data and meta-data (i.e., schema) browsing – describes how the generalization is performed • H (a set of concept hierarchies): – provides a set of concept hierarchies which assist the system to generalize lower layer information to high layeres and map queries to appropriate concept layers for processing • D (a set of database relations): – the whole global information base at the primitive information level (i.e., layer-0) – the generalized database relations at the nonprimitive May 22,layers 2017 Web Mining 47 The General architecture of WebLogMiner (a Global MLDB) Generalized Data Higher layers Site 1 Site 2 Concept Hierarchies Resource Discovery (MLDB) Knowledge Discovery (WLM) Site 3 May 22, 2017 Characteristic Rules Discriminant Rules Association Rules Web Mining 48 Techniques for Web usage mining • Construct multidimensional view on the Weblog database – Perform multidimensional OLAP analysis to find the top N users, top N accessed Web pages, most frequently accessed time periods, etc. • Perform data mining on Weblog records – Find association patterns, sequential patterns, and trends of Web accessing – May need additional information,e.g., user browsing sequences of the Web pages in the Web server buffer • Conduct studies to – Analyze system performance, improve system design by Web caching, Web page prefetching, and Web page May 22, 2017 Web Mining 49 swapping Web Usage Mining - Phases • Three distinctive phases: preprocessing, pattern discovery, and pattern analysis • Preprocessing - process to convert the raw data into the data abstraction necessary for the further applying the data mining algorithm • Resources: server-side, client-side, proxy servers, or database. • Raw data: Web usage logs, Web page descriptions, Web site topology, user registries, and questionnaire. • Conversion: Content converting, Structure converting, Usage converting May 22, 2017 Web Mining 50 • User: The principal using a client to interactively retrieve and render resources or resource manifestations. • Page view: Visual rendering of a Web page in a specific client environment at a specific point of time • Click stream: a sequential series of page view request • User session: a delimited set of user clicks (click stream) across one or more Web servers. • Server session (visit): a collection of user clicks to a single Web server during a user session. • Episode: a subset of related user clicks that May 22, 2017 Web Mining occur within a user session. 51 • Content Preprocessing - the process of converting text, image, scripts and other files into the forms that can be used by the usage mining. • Structure Preprocessing - The structure of a Website is formed by the hyperlinks between page views, the structure preprocessing can be done by parsing and reformatting the information. • Usage Preprocessing - the most difficult task in the usage mining processes, the data cleaning techniques to eliminate the impact of the irrelevant items to the analysis result. May 22, 2017 Web Mining 52 Pattern Discovery • Pattern Discovery is the key component of the Web mining, which converges the algorithms and techniques from data mining, machine learning, statistics and pattern recognition etc research categories. • Separate subsections: statistical analysis, association rules, clustering, classification, sequential pattern, dependency Modeling. 
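Illustration (not part of the original slides): usage preprocessing has to turn raw server log entries into sessions before pattern discovery can run. A minimal sketch, assuming the standard Common Log Format fields listed on the earlier slides and the 30-minute timeout heuristic mentioned above; it groups requests by client host only, whereas the heuristics discussed later also use the agent and referrer fields.

```python
import re
from datetime import datetime, timedelta

# One Common Log Format entry: host ident authuser [date] "request" status bytes
CLF_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<authuser>\S+) '
    r'\[(?P<date>[^\]]+)\] "(?P<request>[^"]*)" (?P<status>\d{3}) (?P<bytes>\S+)'
)

def parse_clf_line(line):
    """Parse one CLF line into a dict with a real timestamp, or None if malformed."""
    m = CLF_PATTERN.match(line)
    if not m:
        return None
    rec = m.groupdict()
    rec["timestamp"] = datetime.strptime(rec["date"], "%d/%b/%Y:%H:%M:%S %z")
    return rec

def sessionize(records, timeout=timedelta(minutes=30)):
    """Group requests into sessions per host, starting a new session after 30 idle minutes."""
    sessions = []
    last_seen = {}  # host -> (timestamp of last request, index of its open session)
    for rec in sorted(records, key=lambda r: r["timestamp"]):
        host = rec["host"]
        if host in last_seen and rec["timestamp"] - last_seen[host][0] <= timeout:
            sessions[last_seen[host][1]].append(rec)
            last_seen[host] = (rec["timestamp"], last_seen[host][1])
        else:
            sessions.append([rec])
            last_seen[host] = (rec["timestamp"], len(sessions) - 1)
    return sessions

if __name__ == "__main__":
    log = [
        '128.101.228.20 - - [10/Nov/1999:10:16:39 -0600] "GET / HTTP/1.0" 200 3245',
        '128.101.228.20 - - [10/Nov/1999:10:20:02 -0600] "GET /products/ HTTP/1.0" 200 1043',
        '128.101.228.20 - - [10/Nov/1999:11:30:15 -0600] "GET / HTTP/1.0" 200 3245',
    ]
    parsed = [r for r in (parse_clf_line(l) for l in log) if r]
    print([len(s) for s in sessionize(parsed)])  # -> [2, 1]: the third request starts a new session
```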
May 22, 2017 Web Mining 53 • Statistical Analysis - the analysts may perform different kinds of descriptive statistical analyses based on different variables when analyzing the session file ; powerful tools in extracting knowledge about visitors to a Web site. May 22, 2017 Web Mining 54 • Association Rules - refers to sets of pages that are accessed together with a support value exceeding some specified threshold. • Clustering: a technique to group together users or data items (pages) with the similar characteristics. – It can facilitate the development and execution of future marketing strategies. • Classification: the technique to map a data item into one of several predefined classes, which help to establish a profile of users belonging to a particular class or category. May 22, 2017 Web Mining 55 Pattern Analysis • Pattern Analysis - final stage of the Web usage mining. • To eliminate the irrelative rules or patterns and to extract the interesting rules or patterns from the output of the pattern discovery process. • Analysis methodologies and tools: query mechanism like SQL, OLAP, visualization etc. May 22, 2017 Web Mining 56 May 22, 2017 Web Mining 57 WUM – Pre-Processing – Data Cleaning Removes log entries that are not needed for the mining process Data Integration Synchronize data from multiple server logs, metadata User Identification Associates page references with different users Session/Episode Identification Groups user’s page references into user sessions Page View Identification Path Completion Fills in page references missing due to browser and proxy caching May 22, 2017 Web Mining 58 WUM – Issues in User Session Identification A single IP address is used by many users different users Proxy server Web server Different IP addresses in a single session ISP server Single user Web server Missing cache hits in the server logs Web Mining May 22, 2017 59 User and Session Identification Issues • Distinguish among different users to a site • Reconstruct the activities of the users within the site • Proxy servers and anonymizers • Rotating IP addresses connections through ISPs • Missing references due to caching • Inability of servers to distinguish among different visits May 22, 2017 Web Mining 60 WUM – Solutions Remote Agent A remote agent is implemented in Java Applet It is loaded into the client only once when the first page is accessed The subsequent requests are captured and send back to the server Modified Browser The source code of the existing browser can be modified to gain user specific data at the client side Dynamic page rewriting When the user first submit the request, the server returns the requested page rewritten to include a session specific ID Each subsequent request will supply this ID to the server Heuristics Use a set of assumptions to identify user sessions and find the missing May 22,cache 2017 hits in the server log Web Mining 61 May 22, 2017 Web Mining 62 WUM – Heuristics The session identification heuristics Timeout: if the time between pages requests exceeds a certain limit, it is assumed that the user is starting a new session IP/Agent: Each different agent type for an IP address represents a different sessions Referring page: If the referring page file for a request is not part of an open session, it is assumed that the request is coming from a different session Same IP-Agent/different sessions (Closest): Assigns the request to the session that is closest to the referring page at the time of the request Same IP-Agent/different sessions (Recent): In the 
case where multiple sessions are same distance from a page request, assigns the request to the session with the most May 22, 2017 Web Mining 63 recent referrer access in terms of time Cont. The path completion heuristics If the referring page file of a session is not part of the previous page file of that session, the user must have accessed a cached page The “back” button method is used to refer a cached page Assigns a constant view time for each of the cached page file May 22, 2017 Web Mining 64 May 22, 2017 Web Mining 65 May 22, 2017 Web Mining 66 May 22, 2017 Web Mining 67 May 22, 2017 Web Mining 68 May 22, 2017 Web Mining 69 WUM – Association Rule Generation Discovers the correlations between pages that are most often referenced together in a single server session • Provide the information What are the set of pages frequently accessed together by Web users? What page will be fetched next? What are paths frequently accessed by Web users? Association rule A B [ Support = 60%, Confidence = 80% ] Example “50% of visitors who accessed URLs /infor-f.html and labo/infos.html also visited situation.html” May 22, 2017 Web Mining 70 Associations & Correlations • Page associations from usage data – User sessions – User transactions • Page associations from content data – similarity based on content analysis • Page associations based on structure – link connectivity between pages • ==> Obtain frequent itemsets May 22, 2017 Web Mining 71 Examples: 60% of clients who accessed /products/, also accessed /products/software/webminer.htm. 30% of clients who accessed /specialoffer.html, placed an online order in /products/software/. (Example from IBM official Olympics Site) • {Badminton, Diving} ===> {Table Tennis} (a = 69.7%, s = 0.35%) May 22, 2017 Web Mining 72 WUM – Clustering • Groups together a set of items having similar characteristics • User Clusters Discover groups of users exhibiting similar browsing patterns Page recommendation User’s partial session is classified into a single cluster The links contained in this cluster are recommended May 22, 2017 Web Mining 73 Cont.. Page clusters Discover groups of pages having related content Usage based frequent pages Page recommendation The links are presented based on how often URL references occur together across user sessions May 22, 2017 Web Mining 74 Website Usage Analysis • Why developing a Website usage / utilization analyzation tool? • Knowledge about how visitors use Website could - Prevent disorientation and help designers place important information/functions exactly where the visitors look for and in the way users need it - Build up adaptive Website server May 22, 2017 Web Mining 75 Clustering and Classification clients who often access • /products/software/webminer.html tend to be from educational institutions. clients who placed an online order for software tend to be students in the 20-25 age group and live in the United States. 75% of clients who download software from /products/software/demos/ visit between 7:00 and 11:00 pm on weekends. 
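Illustration (not part of the original slides): a small, Apriori-style sketch of the frequent-itemset step behind the association-rule examples above. The sessions and the support threshold are made up for illustration; a production system would use a full Apriori or DHP implementation.

```python
from itertools import combinations

def frequent_itemsets(sessions, min_support=0.4, max_size=3):
    """Return {itemset: support} for page sets that co-occur in at least min_support of sessions."""
    sessions = [set(s) for s in sessions]
    n = len(sessions)
    frequent = {}
    # Level-wise generation: start from single pages and grow itemsets one page at a time.
    current = {frozenset([p]) for s in sessions for p in s}
    size = 1
    while current and size <= max_size:
        counts = {c: sum(1 for s in sessions if c <= s) for c in current}
        kept = {c: cnt / n for c, cnt in counts.items() if cnt / n >= min_support}
        frequent.update(kept)
        # Simplified candidate generation: union pairs of surviving itemsets of the current size.
        current = {a | b for a, b in combinations(kept, 2) if len(a | b) == size + 1}
        size += 1
    return frequent

if __name__ == "__main__":
    sessions = [
        ["/products/", "/products/software/webminer.htm", "/specialoffer.html"],
        ["/products/", "/products/software/webminer.htm"],
        ["/products/", "/specialoffer.html"],
        ["/infor-f.html", "/labo/infos.html", "/situation.html"],
    ]
    for itemset, support in sorted(frequent_itemsets(sessions).items(), key=lambda kv: -kv[1]):
        print(sorted(itemset), round(support, 2))
```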
Website Usage Analysis
• Discover user navigation patterns in using the Website
- Establish an aggregated log structure as a preprocessor to reduce the search space before the actual log mining phase
- Introduce a model for Website usage pattern discovery by extending the classical mining model, and establish the processing framework of this model
Sequential Patterns & Clusters
30% of clients who visited /products/software/ had done a search in Yahoo using the keyword "software" before their visit
60% of clients who placed an online order for WEBMINER placed another online order for software within 15 days
Website Usage Analysis
• The Website's client-server architecture facilitates recording user behavior at every step by
- submitting client-side log files to the server when users use clear functions or exit windows/modules
• The special design of local and universal back/forward/clear functions makes the user's navigation pattern clearer to the designer by
- analyzing local back/forward history and incorporating it with the universal back/forward history
Website Usage Analysis
• What will be included in SUA:
1. Identify and collect log data
2. Transfer the data to the server side and save it in a structure suited for analysis
3. Prepare the mined data by establishing a customized aggregated log tree/frame
4. Use modifications of typical data mining methods, particularly an extension of a traditional sequence discovery algorithm, to mine user navigation patterns
Website Usage Analysis
• Problems that need to be considered:
- How to identify the log data when a user goes through an uninteresting function/module
- What marks the end of a user session?
- Clients connecting to the Website through proxy servers
• Differences between Website usage analysis and common Web usage mining:
- Client-side log files are available
- Log file format (Web log files follow the Common Log Format specified as part of the HTTP protocol)
- Log file cleaning/filtering (usually performed in the preprocessing of Web log mining) is not necessary
Web Usage Mining - Pattern Discovery Algorithms
• (Chen et al.) design algorithms for path traversal patterns, finding maximal forward references and large reference sequences.
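Illustration (not part of the original slides): step 1 of this procedure, spelled out on the next slides, converts each raw traversal sequence into maximal forward references by removing the effect of backward movements. A minimal sketch, assuming that revisiting a page already on the current path signals a backward move.

```python
def maximal_forward_references(click_stream):
    """Split one traversal sequence into maximal forward references (Chen et al. style).

    A backward movement (revisiting a page already on the current path) ends the
    current forward reference; the path is truncated back to that page and the
    traversal continues from there.
    """
    path, result = [], []
    last_move_forward = False
    for page in click_stream:
        if page in path:
            if last_move_forward:
                result.append(list(path))     # emit the forward reference built so far
            path = path[: path.index(page) + 1]
            last_move_forward = False
        else:
            path.append(page)
            last_move_forward = True
    if last_move_forward and path:
        result.append(list(path))
    return result

if __name__ == "__main__":
    # Traversal A->B->C->B->D: the user backed up from C to B, then went forward to D.
    print(maximal_forward_references(["A", "B", "C", "B", "D"]))
    # Expected maximal forward references: ['A', 'B', 'C'] and ['A', 'B', 'D']
```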
Path Traversal Patterns
• Procedure for mining traversal patterns:
– (Step 1) Determine maximal forward references from the original log data (Algorithm MF)
– (Step 2) Determine large reference sequences (i.e., Lk, k ≥ 1) from the set of maximal forward references (Algorithms FS and SS)
– (Step 3) Determine maximal reference sequences from the large reference sequences
• Focus on Steps 1 and 2, and devise algorithms for the efficient determination of large reference sequences
Determining Large Reference Sequences
• Algorithm FS:
– utilizes the key ideas of algorithm DHP
– employs hashing and pruning techniques
– DHP is very efficient for generating candidate itemsets, in particular the large 2-itemsets, which greatly relieves the performance bottleneck of the whole process
• Algorithm SS:
– employs hashing and pruning techniques to reduce both CPU and I/O costs
– by properly utilizing the information in candidate references from prior passes, it can avoid database scans in some passes, further reducing disk I/O cost
Pattern Analysis Tools
• WebViz [pitkwa94] provides tools and techniques to understand, visualize, and interpret access patterns.
• [dyreua et al] propose OLAP techniques such as data cubes to simplify the analysis of usage statistics from server access logs.
Pattern Discovery and Analysis Tools
• Emerging tools for user pattern discovery use sophisticated techniques from AI, data mining, psychology, and information theory to mine knowledge from collected data:
– (Pirolli et al.) use information foraging theory to combine path traversal patterns, Web page typing, and site topology information to categorize pages for easier access by users.
(Cont'd)
• WEBMINER:
– introduces a general architecture for Web usage mining, automatically discovering association rules and sequential patterns from server access logs
– proposes an SQL-like query mechanism for querying the discovered knowledge in the form of association rules and sequential patterns.
• WebLogMiner – Web log is filtered to generate a relational database – Data mining on web log data cube and web log database May 22, 2017 Web Mining 87 WEBMINER • SQL-like Query • A framework for Web mining, the applications of data mining and knowledge discovery techniques, association rules and sequential patterns, to Web data: – Association rules: using apriori algorithm • 40% of clients who accessed the Web page with URL /company/products/product1.html, also accessed /company/products/product2.html – Sequential patterns: using modified apriori algorithm • 60% of clients who placed an online order in /company/products/product1.html, also placed an online order in /company/products/product4.html within 15 days May 22, 2017 Web Mining 88 WebLogMiner • Database construction from server log file: – data cleaning – data transformation • Multi-dimensional web log data cube construction and manipulation • Data mining on web log data cube and web log database May 22, 2017 Web Mining 89 Mining the World-Wide Web • Design of a Web Log Miner – – – – Web log is filtered to generate a relational database A data cube is generated form database OLAP is used to drill-down and roll-up in the cube OLAM is used for mining interesting knowledge Web log Database Data Cube R(q) (q,p)G out degre (q) Knowledge Sliced and diced cube R(p) = /n (1 ) 1 Data Cleaning May 22, 2017 2 3 Data Cube OLAP Creation Web Mining 4 Data Mining 90 Construction of Data Cubes (http://db.cs.sfu.ca/sections/publication/slides/slides.html) Amount B.C. Province Prairies Ontario sum 0-20K20-40K 40-60K60K- sum All Amount Comp_Method, B.C. Comp_Metho d Database … ... Discipline sum Each dimension contains a hierarchy of values for one attribute A cube cell stores aggregate values, e.g., count, sum, max, etc. A “sum” cell stores dimension summation values. Sparse-cube technology and MOLAP/ROLAP integration. “Chunk”-based multi-way aggregation and single-pass computation. May 22, 2017 Web Mining 91 WebLogMiner Architecture • Web log is filtered to generate a relational database • A data cube is generated from database • OLAP is used to drill-down and roll-up in the cube • OLAM is used for mining interesting knowledge Web log Database Data Cube R(q) (q,p)G out degre (q) Knowledge Sliced and diced cube R(p) = /n (1 ) May 22, 2017 1 Data Cleaning 2 Data Cube Web Mining Creation 3 OLAP 4 Data Mining 92 WEBSIFT May 22, 2017 Web Mining 93 What is WebSIFT? • a Web Usage Mining framework that – performs preprocessing – performs knowledge discovery – uses the structure and content information about a Web site to automatically define a belief set. May 22, 2017 Web Mining 94 Overview of WebSIFT • Based on WEBMINER prototype • Divides the Web Usage Mining process into three main parts May 22, 2017 Web Mining 95 Overview of WebSIFT • Input: – Access – Referrer and agent – HTML files – Optional data (e.g., registration data or remote agent logs) May 22, 2017 Web Mining 96 Overview of WebSIFT • Preprocessing: – uses input data to construct a user session file – site files are used to classify pages of a site • Knowledge discovery phase – uses existing data mining techniques to generate rules and patterns. – generation of general usage stats May 22, 2017 Web Mining 97 Information Filtering • Links between pages provide evidence for supporting the belief that those pages are related. • Strength of evidence for a set pages being related is proportional to the strength of the topological connection between the set of pages. 
• Based on site content, can also look at content similarity and by calculating “distance” between pages. May 22, 2017 Web Mining 98 Information Filtering May 22, 2017 Web Mining 99 Information Filtering • Uses two different methods to identify interesting results from a list of discovered frequent itemsets May 22, 2017 Web Mining 100 Information Filtering • Method 1: – declare itemsets that contain pages not directly connected to be interesting – corresponds to a situation where a belief that a set of pages are related has no domain or existing evidence but there is mined evidence. called Beliefs with Mined Evidence algo (BME) May 22, 2017 Web Mining 101 Information Filtering • Method 2: – Absence of itemsets evidence against a belief that pages are related. – Pages that have individual support above a threshold but are not present together in larger frequent itemsets evidence against the pages being related. – domain evidence suggests that pages are related the absence of the frequent itemset can be considered interesting. This is handled by the Beliefs with Contradicting Evidence algo (BCE ) May 22, 2017 Web Mining 102 Experimental Evaluation • Performed on web server of U of MN Dept of Comp Sci & Eng’g web site • Log spanned eight days in Feb 1999 • Physical size of log: 19.3 MB • 102,838 entries • After preprocessing: 43,158 page views (divided among 10,609 user sessions) • Threshold of 0.1% for support used to generate 693 frequent itemsets with maximum set size of six pages. • 178 unique pages represented in all the rules. • BCE and BME algos run on frequent itemsets. May 22, 2017 Web Mining 103 Experimental Evaluation May 22, 2017 Web Mining 104 Experimental Evaluation May 22, 2017 Web Mining 105 Future work • Filtering frequent itemsets, sequential patterns and clusters • Incorporate probabilities and fuzzy logic into information filter • Future works include path completion verification, page usage determination, application of the pattern analysis results, etc. 
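Illustration (not part of the original slides): the BME idea from the information-filtering slides can be sketched in a few lines. Given the site's link graph and the discovered frequent itemsets, itemsets containing at least one pair of pages with no direct hyperlink between them are flagged as interesting (mined evidence without structural evidence). The toy link graph and itemsets below are illustrative assumptions.

```python
def directly_connected(a, b, links):
    """True if there is a hyperlink between pages a and b in either direction."""
    return b in links.get(a, set()) or a in links.get(b, set())

def beliefs_with_mined_evidence(frequent_itemsets, links):
    """Flag itemsets containing at least one page pair that is not directly linked."""
    interesting = []
    for itemset in frequent_itemsets:
        pages = sorted(itemset)
        if any(not directly_connected(p, q, links)
               for i, p in enumerate(pages) for q in pages[i + 1:]):
            interesting.append(itemset)
    return interesting

if __name__ == "__main__":
    # Hypothetical site structure: each key links to every page in its value set.
    links = {"/": {"/courses.html", "/people.html"}, "/courses.html": {"/cs101.html"}}
    itemsets = [{"/", "/courses.html"}, {"/people.html", "/cs101.html"}]
    print(beliefs_with_mined_evidence(itemsets, links))
    # Only the second itemset is flagged: its pages co-occur in usage but are not linked.
```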
May 22, 2017 Web Mining 106 Link Analysis May 22, 2017 Web Mining 107 Link Analysis • Finding patterns in graphs – Bibliometrics – finding patterns in citation graphs – Sociometry – finding patterns in social networks – Collaborative Filtering – finding patterns in rank(person, item) graph – Webometrics – finding patterns in web page links May 22, 2017 Web Mining 108 Web Link Analysis • Used for – ordering documents matching a user query: ranking – deciding what pages to add to a collection: crawling – page categorization – finding related pages – finding duplicated web sites May 22, 2017 Web Mining 109 Web as Graph • Link graph: – node for each page – directed edge (u,v) if page u contains a hyperlink to page v • Co-citation graph – node for each page – undirected edge (u,v) iff exists a third page w linking to both u and v • Assumption: – link from page A to page B is a recommendation of page B by A – If A and B are connected by a link, there is a higher probability that they are on the same topic May 22, 2017 Web Mining 110 Web structure mining • HITS (Topic distillation) • PageRank (Ranking web pages used by Google) • Algorithm in Cyber-community May 22, 2017 Web Mining 111 HITS Algorithm --Topic Distillation on WWW May 22, 2017 Web Mining 112 HITS Method • Hyperlink Induced Topic Search • Kleinberg, 1998 • A simple approach by finding hubs and authorities • View web as a directed graph • Assumption: if document A has hyperlink to document B, then the author of document A thinks that document B contains valuable information May 22, 2017 Web Mining 113 Main Ideas • Concerned with the identification of the most authoritative, or definitive, Web pages on a broad-topic • Focused on only one topic • Viewing the Web as a graph • A purely link structure-based computation, ignoring the textual content May 22, 2017 Web Mining 114 HITS: Hubs and Authority • Hub: web page links to a collection of prominent sites on a common topic • Authority: Pages that link to a collection of authoritative pages on a broad topic; web page pointed to by hubs • Mutual Reinforcing Relationship: a good authority is a page that is pointed to by many good hubs, while a good hub is a page that points to many good authorities May 22, 2017 Web Mining 115 Hub-Authority Relations Hubs May 22, 2017 Authorities Web Mining Unrelated page of large in-degree 116 HITS: Two Main Steps • A sampling component, which constructs a focused collection of several thousand web pages likely to be rich in relevant authorities • A weight-propagation component, which determines numerical estimates of hub and authority weights by an iterative procedure • As the result, pages with highest weights are returned as hubs and authorities for the research topic May 22, 2017 Web Mining 117 HITS: Root Set and Base Set • Using query term to collect a root set (S) of pages from index-based search engine (AltaVista) • Expand root set to base set (T) by including all pages linked to by pages in root set and all pages that link to a page in root set (up to a designated size cut-off) • Typical base set contains roughly 1000-5000 pages May 22, 2017 Web Mining 118 Step 1: Constructing Subgraph 1.1 Creating a root set (S) - Given a query string on a broad topic - Collect the t highest-ranked pages for the query from a text-based search engine 1.2 Expanding to a base set (T) - Add the page pointing to a page in root set - Add the page pointed to by a page in root set May 22, 2017 Web Mining 119 Root Set and Base Set (Cont’d) T May 22, 2017 S S Web Mining 120 
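Illustration (not part of the original slides): a minimal sketch of the weight-propagation component on a toy base set, applying the authority and hub update rules spelled out on the next slides, with normalization after each pass so the values converge. The adjacency-list representation and the fixed iteration count are illustrative choices.

```python
import math

def hits(graph, iterations=20):
    """Compute hub and authority scores for a directed graph {page: [pages it links to]}."""
    nodes = set(graph) | {v for targets in graph.values() for v in targets}
    auth = {n: 1.0 for n in nodes}   # x_p: authority weights, uniform at start
    hub = {n: 1.0 for n in nodes}    # y_p: hub weights, uniform at start
    for _ in range(iterations):
        # Authority update: x_p = sum of hub weights of pages pointing to p.
        new_auth = {n: 0.0 for n in nodes}
        for q, targets in graph.items():
            for p in targets:
                new_auth[p] += hub[q]
        # Hub update: y_p = sum of authority weights of pages that p points to.
        new_hub = {n: sum(new_auth[p] for p in graph.get(n, [])) for n in nodes}
        # Normalize so the weights stay bounded and the iteration converges.
        auth_norm = math.sqrt(sum(v * v for v in new_auth.values())) or 1.0
        hub_norm = math.sqrt(sum(v * v for v in new_hub.values())) or 1.0
        auth = {n: v / auth_norm for n, v in new_auth.items()}
        hub = {n: v / hub_norm for n, v in new_hub.items()}
    return hub, auth

if __name__ == "__main__":
    base_set = {"h1": ["a1", "a2"], "h2": ["a1", "a2", "a3"], "a1": [], "a2": ["a3"]}
    hubs, auths = hits(base_set)
    print("best hub:", max(hubs, key=hubs.get), "best authority:", max(auths, key=auths.get))
```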
Step 2: Computing Hubs and Authorities 2.1 Associating weights - Authority weight xp - Hub weight yp - Set all values to a uniform constant initially 2.2 Updating weights May 22, 2017 Web Mining 121 Updating Authority Weight xp =q suchthat qp yq q1 q2 Example xp=yq1+yq2+yq3 May 22, 2017 Web Mining P q3 122 Updating Hub Weight yp = xq q such that pq q1 Example P yp=xq1+xq2+xq3 May 22, 2017 q2 Web Mining q3 123 Flowchart Initialization Update all xvalues Set all values to c, e.g. c =1 Update all yvalues Update all xvalues 2nd time 1st time May 22, 2017 Update all yvalues Web Mining 124 Results • All x- and y-values converge rapidly so that termination of the iteration is guaranteed • It can be proved in mathematical approach • Pages with the highest x-values are viewed as the best authorities, while pages with the highest y-values are regarded as the best hubs May 22, 2017 Web Mining 125 Implementation • • • • Search engine: Root set: Base set: Converging speed: • Running time: May 22, 2017 AltaVista 200 pages 1000-5000 pages Very rapid, less than 20 times About 30 minutes Web Mining 126 HITS: Advantages • Weight computation is an intrinsic feature from collection of linked pages • Provides a densely linked community of related authorities and hubs • Pure link-based computation once the root set has been assembled, with no further regard to query terms • Provides surprisingly good search result for a wide range of queries May 22, 2017 Web Mining 127 Drawbacks • Limit On Narrow Topics – Not enough authoritative pages – Frequently returns resources for a more general topic – adding a few edges can potentially change scores considerably • Topic Drifting - Appear when hubs discuss multiple topics May 22, 2017 Web Mining 128 Improved Work • To improve precision: - Combining content with link information - Breaking large hub pages into smaller units - Computing relevance weights for pages • To improve speed: - Building a Connectivity Server that provides linkage information for all pages May 22, 2017 Web Mining 129 Web Structure Mining – Page-Rank Method – CLEVER Method – Connectivity-Server Method May 22, 2017 Web Mining 130 1. 
Page-Rank Method • Introduced by Brin and Page (1998) • Mine hyperlink structure of web to produce ‘global’ importance ranking of every web page • Used in Google Search Engine • Web search result is returned in the rank order • Treats link as like academic citation • Assumption: Highly linked pages are more ‘important’ than pages with a few links • A page has a high rank if the sum of the ranks of its back-links is high May 22, 2017 Web Mining 131 Page Rank: Computation • Assume: – – – – – – R(u) Fu Bu Nu C E(u) : : : : : : Rank of a web page u Set of pages which u points to Set of pages that points to u Number of links from u Normalization factor Vector of web pages as source of rank • Page Rank Computation: R (v ) R(u ) = c cE (u ) vBu N v May 22, 2017 Web Mining 132 Page Rank: Implementation • Stanford WebBase project Complete crawling and indexing system of with current repository 24 million web pages (old data) • Store each URL as unique integer and each hyperlink as integer IDs • Remove dangling links by iterative procedures • Make initial assignment of the ranks • Propagate page ranks in iterative manner • Upon convergence, add the dangling links back and recompute the rankings May 22, 2017 Web Mining 133 Page Rank: Results • Google utilizes a number of factors to rank the search results: – proximity, anchor text, page rank • The benefits of Page Rank are the greatest for underspecified queries, example: ‘Stanford University’ query using Page Rank lists the university home page the first May 22, 2017 Web Mining 134 Page Rank: Advantages • Global ranking of all web pages – regardless of their content, based solely on their location in web graph structure • Higher quality search results – central, important, and authoritative web pages are given preference • Help find representative pages to display for a cluster center • Other applications: traffic estimation, backlink predictor, user navigation, personalized page rank • Mining structure of web graph is very useful for various information retrieval May 22, 2017 Web Mining 135 CLEVER Method • CLient–side EigenVector-Enhanced Retrieval • Developed by a team of IBM researchers at IBM Almaden Research Centre • Continued refinements of HITS • Ranks pages primarily by measuring links between them • Basic Principles – Authorities, Hubs – Good hubs points to good authorities – Good authorities are referenced by good hubs May 22, 2017 Web Mining 136 Problems Prior to CLEVER • Textual content that is ignored leads to problems caused by some features of web: – HITS returns good resources for more general topic when query topics are narrowly-focused – HITS occasionally drifts when hubs discuss multiple topics – Usually pages from single Web site take over a topic and often use same html template therefore pointing to a single popular site irrelevant to query topic May 22, 2017 Web Mining 137 CLEVER: Solution • Replacing the sums of Equation (1) and (2) of HITS with weighted sums • Assign to each link a non-negative weight • Weight depends on the query term and end point • Extension 1: Anchor Text – using text that surrounds hyperlink definitions (href’s) in Web pages, often referred as ‘anchor text’ – boost weight enhancements of links that occur near instances of query terms May 22, 2017 Web Mining 138 CLEVER: Solution (Cont’d) • Extension 2: Mini Hub Pagelets – breaking large hub into smaller units – treat contiguous subsets of links as minihubs or ‘pagelets’ – contiguous sets of links on a hub page are more focused on single topic than the 
entire page May 22, 2017 Web Mining 139 CLEVER: The Process Starts by collecting a set of pages Gathers all pages of initial link, plus any pages linking to them Ranks result by counting links Links have noise, not clear which pages are best Recalculate scores Pages with most links are established as most important, links transmit more weigh Repeat calculation no. of times till scores are refined May 22, 2017 Web Mining 140 CLEVER: Advantages Used to populate categories of different subjects with minimal human assistance Able to leverage links to fill category with best pages on web Can be used to compile large taxonomies of topics automatically Emerging new directions: Hypertext classification, focused crawling, mining communities May 22, 2017 Web Mining 141 Connectivity Server Method Server that provides linkage information for all pages indexed by a search engine In its base operation, server accepts a query consisting of a set of one or more URLs and return a list of all pages that point to pages in (parents) and list of all pages that are pointed to from pages in (children) In its base operation, it also provides neighbourhood graph for query set Acts as underlying infrastructure, supports search engine applications May 22, 2017 Web Mining 142 What’s Connectivity Server (Cont’d) Neighborhood Graph May 22, 2017 Web Mining 143 CONSERV: Web Structure Mining Finding Authoritative Pages (Search by topic) (pages that is high in quality and relevant to the topic) Finding Related Pages (Search by URL) (pages that address same topic as the original page, not necessarily semantically identical) Algorithms include Companion, Cocitation May 22, 2017 Web Mining 144 CONSERV: Finding Related Page May 22, 2017 Web Mining 145 CONSERV: Companion Algorithm An extension to HITS algorithm Features: Exploit not only links but also their order on a page Use link weights to reduce the influence of pages that all reside on one host Merge nodes that have a large number of duplicate links The base graph is structured to exclude grandparent nodes but include nodes that share child May 22, 2017 Web Mining 146 Companion Algorithm (Cont’d) Four steps 1. Build a vicinity graph for u 2. Remove duplicates and near-duplicates in graph. 3. Compute link weights based on host to host connection 4. Compute a hub score and a authority score for each node in the graph, return the top ranked authority nodes. 
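Illustration (not part of the original slides): a sketch of the host-based link weighting used in step 3, following the rule detailed on the later slides (same-host links get weight 0; if k links run from documents on one host to a single document on another host, each carries an authority weight of 1/k). The URL-to-host helper and the sample edges are assumptions for illustration; hub weights are assigned symmetrically and are omitted here.

```python
from collections import defaultdict
from urllib.parse import urlparse

def host(url):
    # Illustrative helper: treat the network location as the host.
    return urlparse(url).netloc

def authority_link_weights(edges):
    """Assign authority weights to links: 0 within a host, 1/k when k links go from
    documents on one host to the same target document on a different host."""
    per_host_target = defaultdict(int)
    for src, dst in edges:
        if host(src) != host(dst):
            per_host_target[(host(src), dst)] += 1
    weights = {}
    for src, dst in edges:
        if host(src) == host(dst):
            weights[(src, dst)] = 0.0
        else:
            weights[(src, dst)] = 1.0 / per_host_target[(host(src), dst)]
    return weights

if __name__ == "__main__":
    edges = [
        ("http://a.com/1", "http://b.com/x"),
        ("http://a.com/2", "http://b.com/x"),   # two a.com pages cite b.com/x -> each weight 1/2
        ("http://b.com/x", "http://b.com/y"),   # same-host link -> weight 0
    ]
    print(authority_link_weights(edges))
```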
May 22, 2017 Web Mining 147 Companion Algorithm (Cont’d) Building the Vicinity Graph Set up parameters: B : no of parents of u, BF : no of children per parent, F : no of children of u, FB : no of parents per child Stoplist (pages that are unrelated to most queries and have a very high in-degree) Procedure Go Back (B) : choose parents (randomly) Back-Forward(BF) : choose siblings (nearest) Go Forward (F) : choose children (first) Forward-Back(FB) : choose siblings (highest indegree) May 22, 2017 Web Mining 148 Companion Algorithm (Cont’d) Remove duplicate Near-duplicate, if two nodes, each has more than 10 links and they have at least 95% of their links in common Replace two nodes with a node whose links are the union of the links of the two nodes (mirror sites, aliases) May 22, 2017 Web Mining 149 Companion Algorithm (Cont’d) Assign edge (link) weights Link on the same host has weight 0 If there are K links from documents on a host to a single document on diff host, each link has an authority weight of 1/k If there are k links from a single document on a host to a set of documents on diff host, give each link a hub weight of 1/k (prevent a single host from having too much influence on the computation) May 22, 2017 Web Mining 150 Companion Algorithm (Cont’d) Compute hub and authority scores Extension of the HITS algorithm with edge weights Initialize all elements of the hub vector H to 1 Initialze all elements of the authority vector A to 1 While the vectors H and A have not converged: For all nodes n in the vicinity graph N, A[n] := (n',n)edges(N) H[n'] x authority_weight(n',n) For all n in N, H[n] := (n',n)edges(N) A[n'] x hub_weight(n',n) Normalize the H and A vectors. May 22, 2017 Web Mining 151 CONSERV: Cocitation Algorithm Two nodes are co-cited if they have a common parent The number of common parents of two nodes is their degree of co-citation Determine the related pages by looking for sibling nodes with the highest degree of cocitation In some cases there is an insufficient level of cocitation to provide meaningful results, chop off elements of URL, restart algorithm. e.g. A.com/X/Y/Z A.com/X/Y May 22, 2017 Web Mining 152 Comparative Study • Page Rank (Google) • Hub/Authority (CLEVER, C-Server) – Assigns initial ranking and retains them independently from queries (fast) – In the forward direction from link to link – Qualitative result May 22, 2017 Web Mining – Assembles different root set and prioritizes pages in the context of query – Looks forward and backward direction – Qualitative result 153 Connectivity-Based Ranking • Query-independent: gives an intrinsic quality score to a page • Approach #1: larger number of hyperlinks pointing to a page, the better the page – drawback? – each link is equally important • Approach #2: weight each hyperlink proportionally to the quality of the page containing the hyperlink May 22, 2017 Web Mining 154 Query-dependent Connectivity-Based Ranking • Carrier and Kazman • For each query, build a subgraph of the link graph G limited to pages on query topic • Build the neighborhood graph 1. A start set S of documents matching query given by search engine (~200) 2. Set augmented by its neighborhood, the set of documents that either point to or are pointed to by documents in S (limit to ~50) 3. Then rank based on indegree May 22, 2017 Web Mining 155 Idea • We desire pages that are relevant (in the neighborhood graph) and authoritative • As in page rank, not only the in-degree of a page p, but the quality of the pages that point to p. 
If more important pages point to p, that means p is more authoritative • Key idea: Good hub pages have links to good authority pages • given user query, compute a hub score and an authority score for each document • high authority score relevant content • high hub score links to documents with content Web Mining May 22, relevant 2017 156 Improvements to Basic Algorithm • Put weights on edges to reflect importance of links, e.g., put higher weight if anchor text associated with the link is relevant to query • Normalize weights outgoing from a single source or coming into a single sink. This alleviates spamming of query results • Eliminate edges between same domain May 22, 2017 Web Mining 157 Discovering Web communities on the web May 22, 2017 Web Mining 158 Introduction • Introduction of the cyber-community • Methods to measure the similarity of web pages on the web graph • Methods to extract the meaningful communities through the link structure May 22, 2017 Web Mining 159 What is cyber-community • A community on the web is a group of web pages sharing a common interest – Eg. A group of web pages talking about POP Music – Eg. A group of web pages interested in data-mining • Main properties: – Pages in the same community should be similar to each other in contents – The pages in one community should differ from the pages in another community – Similar to cluster May 22, 2017 Web Mining 160 Two different types of communities • Explicitly-defined communities – They are well known ones, such as the resource listed by Yahoo! eg. Arts Music Classic • Implicitly-defined communities – They are communities unexpected or invisible to most users May 22, 2017 Web Mining Painting Pop eg. The group of web pages interested in a particular singer 161 Two different types of communities • The explicit communities are easy to identify – Eg. Yahoo!, InfoSeek, Clever System • In order to extract the implicit communities, we need analyze the web-graph objectively • In research, people are more interested in the implicit communities May 22, 2017 Web Mining 162 Similarity of web pages • Discovering web communities is similar to clustering. For clustering, we must define the similarity of two nodes • A Method I: – For page and page B, A is related to B if there is a hyper-link from A to B, or from B to A Page A Page B – Not so good. Consider the home page of IBM and Microsoft. May 22, 2017 Web Mining 163 Similarity of web pages • Method II (from Bibliometrics) – Co-citation: the similarity of A and B is measured by the number of pages cite both A and B Page A Page B – Bibliographic coupling: the similarity of A and B is measured by the number of pages cited by both A and B. Page A May 22, 2017 Page B Web Mining 164 Methods of clustering • Clustering methods based on co-citation analysis: • Methods derived from HITS (Kleinberg) – Using co-citation matrix • All of them can discover meaningful communities But their methods are very expensive to the whole World Wide Web with billions of web pages. 
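Illustration (not part of the original slides): the co-citation and bibliographic-coupling measures from Method II above, computed over an adjacency-list graph. The example pages are made up; on the real Web graph these counts would be taken over millions of parent pages.

```python
def cocitation(graph, a, b):
    """Degree of co-citation: number of pages that link to both a and b."""
    return sum(1 for targets in graph.values() if a in targets and b in targets)

def bibliographic_coupling(graph, a, b):
    """Number of pages cited (linked to) by both a and b."""
    return len(set(graph.get(a, [])) & set(graph.get(b, [])))

if __name__ == "__main__":
    graph = {
        "p1": ["ibm.com", "microsoft.com"],
        "p2": ["ibm.com", "microsoft.com"],
        "p3": ["ibm.com"],
    }
    print(cocitation(graph, "ibm.com", "microsoft.com"))   # 2 common parents
    print(bibliographic_coupling(graph, "p1", "p3"))       # 1 common child
```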
May 22, 2017 Web Mining 165 A cheaper method • The method from Ravi Kumar, Prabhakar Raghavan, Sridhar Rajagopalan, Andrew Tomkins – IBM Almaden Research Center • They call their method communities trawling (CT) • They implemented it on the graph of 200 millions pages, it worked very well May 22, 2017 Web Mining 166 Basic idea of CT • Definition of communities • dense directed bipartite sub graphs Fans – Bipartite graph: Nodes are partitioned into two sets, F and C – Every directed edge in the graph is directed from a node u in F to a node v in C – dense if many of the possible edges between F and C are present Web Mining May 22, 2017 Centers F C 167 Basic idea of CT • Bipartite cores – a complete bipartite subgraph with at least i nodes from F and at least j nodes from C – i and j are tunable parameters – A (i, j) Bipartite core • Every community have such a core with a certain i and j. May 22, 2017 Web Mining A (i=3, j=3) bipartite core 168 Basic idea of CT • A bipartite core is the identity of a community • To extract all the communities is to enumerate all the bipartite cores on the web. • Author invent an efficient algorithm to enumerate the bipartite cores. Its main idea is iterate pruning -- elimination- generation pruning May 22, 2017 Web Mining 169 • Complete bipartite graph: there is an edge between each node in F and each node in C • (i,j)-Core: a complete bipartite graph with at least i nodes in F and j nodes in C • (i,j)-Core is a good signature for finding online communities •“Trawling”: finding cores • Find all (i,j)-cores in the Web graph. – In particular: find “fans” (or “hubs”) in the graph – “centers” = “authorities” – Challenge: Web is huge. How to find cores efficiently? May 22, 2017 Web Mining 170 Main idea: pruning • Step 1: using out-degrees – Rule: each fan must point to at least 6 different websites – Pruning results: 12% of all pages (= 24M pages) are potential fans – Retain only links, and ignore page contents May 22, 2017 Web Mining 171 Step 2: Eliminate mirroring pages • Many pages are mirrors (exactly the same • • • • • page) They can produce many spurious fans Use a “shingling” method to identify and eliminate duplicates Results: – 60% of 24M potential-fan pages are removed – # of potential centers is 30 times of # of potential fans May 22, 2017 Web Mining 172 Step 4: Iterative pruning • To find (i,j)-cores – Remove all pages whose # of out-links is < i – Remove all pages whose # of in-links is < j – Do it iteratively • Step 5: inclusion-exclusion pruning • Idea: in each step, we • – Either “include” a community” • – Or we “exclude” a page from further contention May 22, 2017 Web Mining 173 • Check a page x with j out-degree. 
x is a fan of a (i,j)-core if: • – There are i-1 fans point to all the forward neighbors of x • – This step can be checked easily using the index on fans and centers • Result: for (3,3)-cores, 5M pages remained • Final step: • – Since the graph is much smaller, we can afford to “enumerate” the remaining cores May 22, 2017 Web Mining 174 • Step 3: using in-degrees of pages • Delete pages highly references, e.g., yahoo, altavista • Reason: they are referenced for many reasons, not likely forming an emerging community • Formally: remove all pages with more than k inlinks (k = 50,for instance) • Results: – 60M pages pointing to 20M pages – 2M potential fans May 22, 2017 Web Mining 175 Weakness of CT • The bipartite graph cannot suit all kinds of communities • The density of the community is hard to adjust May 22, 2017 Web Mining 176 Experiment on CT • 200 millions web pages • IBM PC with an Intel 300MHz Pentium II processor, with 512M of memory, running Linux • i from 3 to 10 and j from 3 to 20 • 200k potential communities were discovered 29% of them cannot be found in Yahoo!. May 22, 2017 Web Mining 177 Summary • Conclusion: The methods to discover communities from the web depend on how we define the communities through the link structure • Future works: – How to relate the contents to link structure May 22, 2017 Web Mining 178 Web communities based on dense bipartite graph patterns (WISE’01) By Krishna Reddy and Masaru Kitsuregawa May 22, 2017 Web Mining 179 Aim/Motivation To find all the communities within a large collection of web pages. Proposed solution: •Analyze linkage patterns •Find DBG in the given collection of webpages May 22, 2017 Web Mining 180 Definitions Bipartite graph A BG is a graph which can be partitioned into two non-empty sets T and I. Every directed edge of BG joins a node in T to a node in I Dense Bipartite graph A DBG is a BG where each node of T establishes an edge with at least alpha nodes of I and each node of I has atleast beta nodes as parents to it Community The set T contains the members of the community if there exist a DBG(T,I,alpha,beta) where alpha>= alpha_t and beta>=beta_t Where alpha_t and beta_t May 22, 2017 Web Mining > 0. 181 DBG(T,I,p,q) p q a s b t c u d May 22, 2017 Web Mining 182 Definitions Cocite: Association among pages based on the existence of common children (URL’s). Relax Cocite: we allow u,v,w to group if cocite(u,v) and cocite(v,w) are true. a b p a p c c q d e d e r b q i) May 22, 2017 f Web Mining f g 183 Algorithm 1.For a given URL find T(set of URL’s). Relax- cocite factor is 1. a)While num_iterations<=n • At a fixed relax-cocite factor value,find all w’s such that relax-cocite(w,y) =true • T= w U T 2. Community extraction – Input contains Page_set,outputDBG(T,I,alpha,beta) – Edge file has <p,q> where p is the parent of q. May 22, 2017 Web Mining 184 Algorithm(contd…) • For each P belongs to T,insert the edge<p,q> in edge_file if q belongs child(q). • Sort edge file based on source.Prepare T1 with<source,freq>.Remove <p,q> from edge_file if freq<alpha. • Sort the edge_file based on destination.Prepare I1 with<q,freq>.Remove<p,q> from edgefile if freq<beta. • The result is a DBG(T,I,alpha,beta). May 22, 2017 Web Mining 185 Advantages/Disadvantages • Extracts all DBG’s in a pageset. • Community extracted is significantly large. DISADV: • Need a URL to start with. • Community members need links to be a part of the community May 22, 2017 Web Mining 186 Efficient Identification of Web Communities Gary William Flake, Steve Lawrence & C. 
Efficient Identification of Web Communities
Gary William Flake, Steve Lawrence and C. Lee Giles

Presentation Structure
• Introduction, or why they did it
  – Motivation
  – Background
• Theory, or how they did it
  – Definition
  – Algorithm
• Experimentation, or how did they do
  – Results
  – Conclusions

Motivation
• Exploding Web: roughly 1,000,000,000 documents
• Search engine limitations
  – crawling the web
  – updating the web
  – precision vs. recall; at most about 16% coverage
• Web communities
  – balanced min cut
  – identification is NP-hard

Background
• Bibliometrics, citation analysis, social networks
• Classical clustering, e.g., CORA
• HITS: hubs and authorities

s-t Max Flow and Min Cut
• Capacity weights on the edges of G(V, E)
• Source and sink: water in, water out
• Ford and Fulkerson's Max Flow = Min Cut theorem
• Incremental shortest-augmentation algorithm runs in polynomial time

The Idea
• The ideal community C is a subset of V
• Theorem 1: a community C can be identified by calculating the s-t minimum cut using appropriately chosen source and sink nodes
• Proof by contradiction

The Algorithm
1. Choose source(s) and sink(s)
2. Generate G(V, E) using a crawler
3. Find the s-t min cut
• Virtual sources and sinks
• Choosing the source; choosing the sink
(figure: source layers and sink layers around the crawled graph)
(a toy code sketch of this cut-based extraction appears below, after Future Work)

Expectation Maximization
• Implementation issues
  – a small G(V, E) means low recall
  – results depend on the choice of the source set
• Recurse over the algorithm
  – the community obtained in one iteration is used as input to the next iteration
• Termination is not guaranteed

Experimental Results
• Test neighborhoods
  – Support Vector Machine (SVM)
  – The Internet Archive
  – Ronald Rivest
• Criteria
  – precision and recall
  – seed set size
  – running time

SVM Community
• Characterization
  – recent: not listed in any portal
  – relatively small research community
• Seed set
  – svm.first.gmd.de, svm.research.bell-labs.com, www.clrc.rhbnc.ac.uk/research/SVM, www.supportvector.net
• Performance
  – 4 iterations of EM
  – 11,000 URLs in the graph, 252 member web pages

Internet Archive Community
• Characterization: large, internal communities
• Seed set: 11 URLs
• Performance
  – 2 iterations of EM
  – 7,000 URLs, 289 web pages

Ronald Rivest Community
• Characterization: a community around an individual
• Seed set: http://theory.lcs.mit.edu/~rivest
• Performance
  – 4 iterations of EM
  – 38,000 URLs, 150 pages
  – Cormen's pages appear as the 1st and 3rd result

Summary
• Actual running time: about 1 sec on a 500 MHz Intel machine
• Max-flow framework
• EM approach
• Relevancy test

Applications
• Focused crawlers
• Increased precision and coverage
• Automated population of portal categories
• Recall addressed
• Improved filtering
• Keyword spamming
• Topical pruning, e.g., pornography

Future Work
• Generalize the notion of community
  – parameterize with a coupling factor
  – low value: weakly connected communities
  – high value: highly connected communities; the ideal community
• Co-learning and co-boosting
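To make the cut-based extraction concrete, here is a toy sketch using networkx's minimum_cut. The wiring (a virtual source attached to the seed pages with unbounded capacity, a virtual sink attached to every other page with capacity 1, and a single fixed capacity on ordinary links) is a simplified variant of the cut construction, not the paper's exact capacity assignment; the function name, the capacity value, and the example graph are assumptions for illustration only.

```python
import networkx as nx

def cut_community(links, seeds, edge_capacity):
    """Toy cut-based community extraction: a virtual source feeds the seed
    pages with unbounded capacity, every non-seed page drains into a virtual
    sink with capacity 1, ordinary hyperlinks get a fixed capacity in both
    directions, and the source side of the s-t minimum cut is returned as
    the community."""
    G = nx.DiGraph()
    for u, v in links:                                   # hyperlinks, made bidirectional
        G.add_edge(u, v, capacity=edge_capacity)
        G.add_edge(v, u, capacity=edge_capacity)
    for s in seeds:                                      # virtual source -> seeds
        G.add_edge("_SOURCE_", s, capacity=float("inf"))
    for node in list(G.nodes):                           # every other page -> virtual sink
        if node not in seeds and node != "_SOURCE_":
            G.add_edge(node, "_SINK_", capacity=1.0)
    _, (source_side, _) = nx.minimum_cut(G, "_SOURCE_", "_SINK_")
    return source_side - {"_SOURCE_"}

# Toy usage: a tight triangle {a, b, c} seeded at "a", loosely tied to a star around "x".
links = [("a", "b"), ("b", "c"), ("a", "c"),
         ("c", "x"), ("x", "y"), ("x", "z"), ("x", "w")]
print(cut_community(links, seeds={"a"}, edge_capacity=3.0))   # expected: {'a', 'b', 'c'}
```

On the toy graph the minimum cut separates the tightly linked pages a, b, c from the star around x, so the returned community is {a, b, c}; the edge capacity acts as a tunable coupling constant, and a crawl graph would replace the hand-built link list.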
References
• L. Page and S. Brin, "PageRank: Bringing Order to the Web," Stanford Digital Libraries working paper 1997-0072.
• S. Chakrabarti, B. Dom, S. R. Kumar, et al., "Mining the link structure of the World Wide Web," IEEE Computer, 32(8), August 1999.
• K. Bharat and A. Broder, "The Connectivity Server: Fast access to linkage information on the Web," Proc. 7th International World Wide Web Conference, pages 469-477, Brisbane, Australia, 1998.
• A. Borodin et al., "Finding Authorities and Hubs from Link Structures on the World Wide Web," ACM, May 2001.
• J. Dean and M. R. Henzinger, "Finding Related Pages in the World Wide Web," http://citeseer.nj.nec.com/dean99finding.html
• A. Z. Broder et al., "Graph structure in the web: experiments and models," Proc. 9th WWW Conference, 2000.
• S. R. Kumar et al., "Trawling the web for emerging cyber-communities," Proc. 8th WWW Conference, 1999.

References
• Principles of Data Mining, Hand, Mannila, Smyth. MIT Press, 2001.
• Notes from Dr. M. V. Ramakrishna, http://goanna.cs.rmit.edu.au/~rama/cs442/info.html
• Notes from CS 395T: Large-Scale Data Mining, Inderjit Dhillon, http://www.cs.utexas.edu/users/inderjit/courses/dm2000.html
• Link Analysis in Web Information Retrieval, Monika Henzinger. Bulletin of the IEEE Computer Society Technical Committee on Data Engineering, 2000. research.microsoft.com/research/db/debull/A00sept/henzinge.ps
• Slides from Data Mining: Concepts and Techniques, Han and Kamber, Morgan Kaufmann, 2001.

1. J. Srivastava, R. Cooley, M. Deshpande and Pang-Ning Tan, Web Usage Mining: Discovery and Applications of Usage Patterns from Web Data, SIGKDD Explorations, Vol. 1, Issue 2, 2000.
2. B. Mobasher, R. Cooley and J. Srivastava, Web Mining: Information and Pattern Discovery on the World Wide Web, Proceedings of the 9th IEEE International Conference on Tools with Artificial Intelligence (ICTAI'97), November 1997.
3. B. Mobasher, Namit Jain, Eui-Hong (Sam) Han and Jaideep Srivastava, Web Mining: Pattern Discovery from World Wide Web Transactions. Technical Report TR 96-060, University of Minnesota, Dept. of Computer Science, Minneapolis, 1996.
4. R. Cooley, P. N. Tan and J. Srivastava, WebSIFT: the Web Site Information Filter System. In Proceedings of the 1999 KDD Workshop on Web Mining, San Diego, CA. Springer-Verlag, in press, 1999.
5. R. W. Cooley, Web Usage Mining: Discovery and Application of Interesting Patterns from Web Data. PhD Thesis, Dept. of Computer Science, University of Minnesota, May 2000.
6. Cooley, R., Mobasher, B., and Srivastava, J. Web Mining: Information and Pattern Discovery on the World Wide Web. IEEE Computer, pages 558-566, 1997.
7. Etzioni, O. The World Wide Web: Quagmire or gold mine? Communications of the ACM, 39(11):65-68, 1996.
8. Kosala, R. and Blockeel, H. Web Mining Research: A Survey. SIGKDD Explorations, 2(1):1-15, 2000.

• Fayyad, U., Djorgovski, S., and Weir, N. Automating the analysis and cataloging of sky surveys. In Advances in Knowledge Discovery and Data Mining, pages 471-493. AAAI Press, 1996.
• Langley, P. User modeling in adaptive interfaces. In Proceedings of the Seventh International Conference on User Modeling, pages 357-370, 1999.
• Madria, S. K., Bhowmick, S. S., Ng, W. K., and Lim, E.-P. Research issues in web data mining. In Proceedings of Data Warehousing and Knowledge Discovery, First International Conference, DaWaK '99, pages 303-312, 1999.
• Masand, B. and Spiliopoulou, M. WEBKDD'99: Workshop on web usage analysis and user profiling. SIGKDD Explorations, 1(2), 2000.

• Smyth, P., Fayyad, U. M., Burl, M. C., and Perona, P. Modeling subjective uncertainty in image annotation. In Advances in Knowledge Discovery and Data Mining, pages 517-539, 1996.
• Spiliopoulou, M. Data mining for the web. In Principles of Data Mining and Knowledge Discovery, Second European Symposium, PKDD '99, pages 588-589, 1999.
• Srivastava, J., Cooley, R., Deshpande, M., and Tan, P.-N. Web usage mining: Discovery and applications of usage patterns from web data. SIGKDD Explorations, 1(2), 2000.
• Zaiane, O. R., Xin, M., and Han, J. Discovering Web access patterns and trends by applying OLAP and data mining technology on Web logs. IEEE, pages 19-29, 1998.

Page Ranking
• The PageRank Citation Ranking: Bringing Order to the Web (1998), Larry Page, Sergey Brin, R. Motwani, T. Winograd, Stanford Digital Library Technologies Project.
• Authoritative Sources in a Hyperlinked Environment (1998), Jon Kleinberg, Journal of the ACM.
• The Anatomy of a Large-Scale Hypertextual Web Search Engine (1998), Sergey Brin and Lawrence Page, Computer Networks and ISDN Systems.
• Web Search via Hub Synthesis (2001), Dimitris Achlioptas, Amos Fiat, Anna R. Karlin, Frank McSherry.
• What is this Page Known for? Computing Web Page Reputations (2000), Davood Rafiei and Alberto O. Mendelzon.
• Link Analysis in Web Information Retrieval, Monika Henzinger. Bulletin of the IEEE Computer Society Technical Committee on Data Engineering, 2000.
• Finding Authorities and Hubs From Link Structures on the World Wide Web, Allan Borodin, Gareth O. Roberts, Jeffrey S. Rosenthal, Panayiotis Tsaparas, 2002.

Web Communities and Classification
• Enhanced Hypertext Categorization Using Hyperlinks (1998), Soumen Chakrabarti, Byron Dom, and Piotr Indyk, Proceedings of SIGMOD-98, ACM International Conference on Management of Data.
• Automatic Resource List Compilation by Analyzing Hyperlink Structure and Associated Text (1998), S. Chakrabarti, B. Dom, D. Gibson, J. Kleinberg, P. Raghavan, and S. Rajagopalan, Proceedings of the 7th International World Wide Web Conference.
• Inferring Web Communities from Link Topology (1998), David Gibson, Jon Kleinberg, Prabhakar Raghavan, UK Conference on Hypertext.
• Trawling the Web for Emerging Cyber-communities (1999), Ravi Kumar, Prabhakar Raghavan, Sridhar Rajagopalan, Andrew Tomkins, WWW8 / Computer Networks.
• Finding Related Pages in the World Wide Web (1999), Jeffrey Dean and Monika R. Henzinger, WWW8 / Computer Networks.
• A System for Collaborative Web Resource Categorization and Ranking, Maxim Lifantsev.
• A Study of Approaches to Hypertext Categorization (2002), Yiming Yang, Sean Slattery, Rayid Ghani, Journal of Intelligent Information Systems.
• Hypertext Categorization Using Hyperlink Patterns and Meta Data (2001), Rayid Ghani, Sean Slattery, Yiming Yang.