Web Mining: An Overview
CSC 575 Intelligent Information Retrieval

Web Mining Today
• Overview of Web Data Mining
• Web Content Mining / Text Mining
• Web Usage Mining
• Web Personalization

What Is Data Mining
Data Mining: A Definition
The non-trivial extraction of implicit, previously unknown, and potentially useful knowledge from data in large data repositories
• Non-trivial: obvious knowledge is not useful
• Implicit: hidden, difficult-to-observe knowledge
• Previously unknown
• Potentially useful: actionable; easy to understand

The Knowledge Discovery in Data (KDD) Viewed as a Process

What Can Data Mining Do
Many Data Mining Tasks
• often inter-related
• often need to try different techniques for each task
• each task may require different types of knowledge discovery
Typical data mining tasks
• Classification
• Prediction
• Clustering
• Association Discovery
• Sequence Analysis
• Characterization
• Discrimination

What is Web Mining
From its very beginning, the potential of extracting valuable knowledge from the Web has been quite evident
• Web mining is the collection of technologies to fulfill this potential.
Web Mining Definition
The application of data mining and machine learning techniques to extract useful knowledge from the content, structure, and usage of Web resources.

Types of Web Mining
Web Mining comprises Web Content Mining, Web Usage Mining, and Web Structure Mining.

Web Content Mining
Extracting useful knowledge from the contents of Web documents or other semantic information about Web resources
Content data may consist of text, images, audio, video, structured records from lists and tables, or item attributes from backend databases.
Applications:
• document clustering or categorization
• topic identification / tracking
• concept discovery
• focused crawling
• content-based personalization
• intelligent search tools

Web Usage Mining
Extracting interesting patterns from user interactions with resources on one or more Web sites
Applications:
• user and customer behavior modeling
• Web site optimization
• e-customer relationship management
• Web marketing
• targeted advertising
• recommender systems

Web Structure Mining
Discovering useful patterns from the hyperlink structure connecting Web sites or Web resources
Data sources include the explicit hyperlink between documents, or implicit links among objects (e.g., two objects being “tagged” using the same keyword).
Applications of Web Structure Mining:
• document retrieval and ranking (e.g., Google)
• discovery of “hubs” and “authorities”
• discovery of Web communities
• social network analysis

Web Content Mining :: data preparation
Typical steps in content preprocessing
• Extract text and meta data from Web documents (generally performed automatically using a Web crawler)
• Recognize special entities (e.g., dates) and pre-defined keywords
• Remove stop words and non-relevant terms
• Perform stemming and morphological analysis
• Compute statistics based on term occurrences
  • document frequency (DF): number of documents in which the term occurs
  • term frequency (TF): frequency of occurrence within a specific document
Additional considerations
• For entities such as products, movies, songs, etc., it may be necessary to extract or obtain structured information such as item attributes from databases or available domain ontologies
• It may be desirable to identify or discover phrases, facets, collocations, etc. (in order to treat commonly occurring groups of features as a single term).
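
The TF and DF statistics described above can be sketched in a few lines of Python. This is only an illustration: the stop-word list, the three sample documents, and the choice of TF x log(N/DF) as the weighting function are assumptions made for the example, and stemming is omitted.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption, not a standard list).
STOP_WORDS = {"a", "an", "and", "the", "of", "for", "in", "on", "to", "is"}

def tokenize(text):
    """Lowercase, keep alphabetic tokens, drop stop words (no stemming)."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]

def term_statistics(docs):
    """Return per-document TF counters and a corpus-wide DF counter."""
    tf = {name: Counter(tokenize(text)) for name, text in docs.items()}
    df = Counter()
    for counts in tf.values():
        df.update(counts.keys())  # a document counts once per distinct term
    return tf, df

def tf_idf(tf, df, n_docs):
    """One common way to combine TF and DF: TF * log(N / DF)."""
    return {
        name: {term: count * math.log(n_docs / df[term]) for term, count in counts.items()}
        for name, counts in tf.items()
    }

# Hypothetical documents, for illustration only.
docs = {
    "A": "Web data mining for business intelligence",
    "B": "Mining the Web: information retrieval and data mining",
    "C": "Marketing intelligence and business data",
}
tf, df = term_statistics(docs)
weights = tf_idf(tf, df, len(docs))
```

With this weighting, a term that occurs in every document (here "data") gets weight zero, while a term unique to one document (here "marketing") gets the highest weight per occurrence.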

Web Content Mining :: data representation
Vector Representation
• Typically, each document is represented as a multi-dimensional vector over all terms extracted in the preprocessing step
• dimension values represent weights associated with terms in the document
[Table: example term-by-document matrix. Rows are the terms web, data, mining, business, intelligence, marketing, information, retrieval; columns are documents A through E; each cell holds the weight of the term in the document.]
• Term weights may be binary or may be derived as a function of term frequency (TF) and document frequency (DF)
• In some applications, only a limited number of terms (a controlled vocabulary) is maintained, and the weights may be assigned manually or based on external criteria

Web Content Mining :: common approaches and applications
Basic notion: document similarity
• Most Web content mining and information retrieval applications involve measuring similarity among two or more documents
• Vector representation facilitates similarity computations using vector-space operations (such as the cosine of the angle between two vectors)
Examples
• Search engines: measure the similarity between a query (represented as a vector) and the indexed document vectors to return a ranked list of relevant documents
• Document clustering: group documents based on similarity or dissimilarity (distance) among them
• Document categorization: measure the similarity of a new document to be classified with representations of existing categories (such as the mean vector representing a group of document vectors)
• Personalization: recommend documents or items based on their similarity to a representation of the user’s profile (may be a term vector representing concepts or terms of interest to the user)

Web Content Mining :: example – clustered search results
Can drill down within clusters to view sub-topics or to view the relevant subset of results
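
As a rough sketch of the search-engine case above, cosine similarity between sparse term vectors can be written as follows. The document vectors and the query are made-up values for illustration and are not the matrix from the slide.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two sparse term-weight vectors (dicts)."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def rank(query, docs):
    """Return (document, score) pairs, most similar to the query first."""
    scored = [(name, cosine(query, vec)) for name, vec in docs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Hypothetical term weights, for illustration only.
docs = {
    "A": {"web": 3, "data": 4, "mining": 2},
    "B": {"business": 2, "intelligence": 5, "marketing": 2},
    "C": {"information": 1, "retrieval": 6, "web": 1},
}
query = {"web": 1, "mining": 1}
ranking = rank(query, docs)
```

Here document A shares two terms with the query and ranks first, C shares one and ranks second, and B shares none and scores zero. The same `cosine` function would serve for clustering, categorization against a mean vector, or matching against a user profile.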

Web Content Mining :: example – personalized content delivery
Google's personalized news is an example of a content-based recommender system, which recommends items (in part) based on the similarity of their content to a user’s profile (gathered from search and click history)

Web Structure Mining :: graph structures on the Web
The structure of a typical Web graph
• Web pages as nodes
• hyperlinks as edges connecting two related pages
Hyperlink Analysis
• Hyperlinks can serve as a tool for pure navigation
• But often they are used to point to pages with authority on the same topic as the source page (similar to a citation in a publication)
Some interesting Web structures *

Web Structure Mining :: example – Google’s PageRank algorithm
Basic idea:
• Rank of a page depends on the ranks of pages pointing to it
• Out-degree of a page is the number of edges pointing away from it; it is used to compute the contribution of the page to those to which it points
• The final PageRank value represents the probability that a random surfer will reach the page
• d is the probability that a random surfer chooses the page directly rather than getting there via navigation
[Figure: illustration of PageRank propagation]

Web Structure Mining :: example – Hubs and Authorities
Basic idea
• Authority comes from in-edges
• Being a hub comes from out-edges
Mutually reinforcing relationship
• A good authority is a page that is pointed to by many good hubs.
• A good hub is a page that points to many good authorities.
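
Both link-analysis ideas above reduce to repeatedly propagating scores along edges. The sketch below is a minimal power-iteration version of each, not the production algorithms. Assumptions: `d` follows the slide's wording (the probability of jumping directly to a page), its default of 0.15 is an arbitrary choice, pages without out-links spread their rank evenly, and the five-page graph is invented for the example.

```python
import math

def all_pages(links):
    """Every page that appears as a link source or a link target."""
    return sorted(set(links) | {q for outs in links.values() for q in outs})

def pagerank(links, d=0.15, iterations=50):
    """links maps a page to the list of pages it points to."""
    pages = all_pages(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new = {p: d / n for p in pages}          # direct jump to any page
        for p in pages:
            outs = links.get(p, [])
            targets = outs if outs else pages    # no out-links: spread evenly (assumption)
            share = (1 - d) * rank[p] / len(targets)
            for q in targets:
                new[q] += share                  # contribution split by out-degree
        rank = new
    return rank

def normalized(scores):
    """Scale scores to unit Euclidean length (left unchanged if all zero)."""
    norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
    return {p: v / norm for p, v in scores.items()}

def hits(links, iterations=50):
    pages = all_pages(links)
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        auth = {p: 0.0 for p in pages}
        for p, outs in links.items():
            for q in outs:
                auth[q] += hub[p]                # authority comes from in-edges
        auth = normalized(auth)
        # hub score comes from out-edges to good authorities
        hub = normalized({p: sum(auth[q] for q in links.get(p, [])) for p in pages})
    return hub, auth

# Hypothetical graph: three hub-like pages pointing at two authority-like pages.
links = {"h1": ["a1", "a2"], "h2": ["a1", "a2"], "h3": ["a2"]}
pr = pagerank(links)
hub, auth = hits(links)
```

On this graph `a2` ends up with the highest authority score and the highest PageRank because all three hub pages point to it, while `h1` and `h2` are better hubs than `h3` because they point to both authorities.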
Together they tend to form a bipartite graph
This idea can be used to discover authoritative pages related to a topic
HITS algorithm: Hypertext Induced Topic Search
[Figure: bipartite graph with hubs pointing to authorities]

Web Structure Mining :: example – Web communities
Basic idea
• Web communities are collections of Web pages such that each member node has more hyperlinks (in either direction) within the community than outside the community.
Typical approach: maximal-flow model *
• Ex: separate the two subgraphs with any choice of source node (left subgraph) and sink node (right subgraph), removing the three dashed links
[Figure: Community 1 and Community 2, with a source node and a sink node]
* Source: G. Flake, et al. “Self-Organization and Identification of Web Communities”, IEEE Computer, Vol. 35, No. 3, pp. 66-71, March 2002.

Web Usage Mining
The Problem: analyze Web navigational data to
• Find how the Web site is used by Web users
• Understand the behavior of different user segments
• Predict how users will behave in the future
• Target relevant or interesting information to individuals or groups of users
• Increase sales, profit, loyalty, etc.
Challenge
• Quantitatively capture Web users’ common interests and characterize their underlying tasks

Applications of Web Usage Mining
Electronic Commerce
• design cross marketing strategies across products
• evaluate promotional campaigns
• target electronic ads and coupons at user groups based on their access patterns
• predict user behavior based on previously learned rules and users’ profiles
• present dynamic information to users based on their interests and profiles: “Web personalization”
Effective and Efficient Web Presence
• determine the best way to structure the Web site
• identify “weak links” for elimination or enhancement
• prefetch files that are most likely to be accessed
• enhance workgroup management & communication
Search Engines
• Behavior-based ranking

Behavior-based ranking
• For each query Q, keep track of which docs in the results are clicked on
• On subsequent requests for Q, re-order docs in results based on click-throughs.
• Relevance assessment based on behavior/usage vs. content

Query-doc popularity matrix B
• Rows are queries q; columns are docs j
• Bqj = number of times doc j was clicked through on query q
• When query q is issued again, order docs by Bqj values.

Vector space implementation
• Maintain a term-doc popularity matrix C (as opposed to query-doc popularity), initialized to all zeros
• Each column represents a doc j
• If doc j is clicked on for query q, update Cj ← Cj + q (here q is viewed as a vector).
• On a query q’, compute its cosine proximity to Cj for all j.
• Combine this with the regular text score.

Issues
• Normalization of Cj after updating
• Assumption of query compositionality: “white house” document popularity is derived from “white” and “house”
• Updating: live or batch?
• Basic assumption: relevance can be directly measured by the number of click-throughs. Valid?
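
The vector space implementation above can be sketched as a small class. This is an illustrative sketch only: the class name, the whitespace tokenizer, the linear blend with a `weight` parameter, and the sample clicks and text scores are all assumptions, since the slides say only to "combine this with the regular text score".

```python
import math
from collections import Counter, defaultdict

def query_vector(query):
    """View a query as a term-count vector."""
    return Counter(query.lower().split())

def cosine(u, v):
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0

class ClickPopularity:
    """Term-doc popularity matrix C, stored as one sparse column per document."""

    def __init__(self):
        self.columns = defaultdict(Counter)  # every column starts at all zeros

    def record_click(self, query, doc):
        # C_j <- C_j + q, with q viewed as a vector
        self.columns[doc].update(query_vector(query))

    def score(self, query, doc):
        # Cosine is scale-invariant, so C_j needs no explicit normalization here.
        return cosine(query_vector(query), self.columns.get(doc, Counter()))

    def rerank(self, query, text_scores, weight=0.5):
        """Blend the text score with click popularity (linear blend is an assumption)."""
        combined = {
            doc: (1 - weight) * text + weight * self.score(query, doc)
            for doc, text in text_scores.items()
        }
        return sorted(combined, key=combined.get, reverse=True)

# Hypothetical click history and text scores.
pop = ClickPopularity()
for _ in range(3):
    pop.record_click("white house", "d2")
pop.record_click("white paint", "d1")
text_scores = {"d1": 0.6, "d2": 0.5, "d3": 0.4}
order = pop.rerank("white house", text_scores)
```

In this example `d2` moves ahead of `d1` for the query "white house" because of its click history. The compositionality issue noted above is visible too: the query "white" alone would also score `d2` highly, even though no user ever clicked `d2` for that query.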

Web Usage Mining :: data sources
Typical Sources of Data:
• automatically generated Web/application server access logs
• e-commerce and product-oriented user events (e.g., shopping cart changes, product clickthroughs, etc.)
• user profiles and/or user ratings
• meta-data, page content, site structure
User Transactions
• sets or sequences of pageviews, possibly with associated weights
• a pageview is a set of page files and associated objects that contribute to a single display in a Web browser

What’s in a Typical Server Log?
1 2006-02-01 00:08:43 1.2.3.4 - GET /classes/cs589/papers.html - 200 9221 HTTP/1.1 maya.cs.depaul.edu Mozilla/4.0+(compatible;+MSIE+6.0;+Windows+NT+5.1;+SV1;+.NET+CLR+2.0.50727) http://dataminingresources.blogspot.com/
2 2006-02-01 00:08:46 1.2.3.4 - GET /classes/cs589/papers/cms-tai.pdf - 200 4096 HTTP/1.1 maya.cs.depaul.edu Mozilla/4.0+(compatible;+MSIE+6.0;+Windows+NT+5.1;+SV1;+.NET+CLR+2.0.50727) http://maya.cs.depaul.edu/~classes/cs589/papers.html
3 2006-02-01 08:01:28 2.3.4.5 - GET /classes/ds575/papers/hyperlink.pdf - 200 318814 HTTP/1.1 maya.cs.depaul.edu Mozilla/4.0+(compatible;+MSIE+6.0;+Windows+NT+5.1) http://www.google.com/search?hl=en&lr=&q=hyperlink+analysis+for+the+web+survey
4 2006-02-02 19:34:45 3.4.5.6 - GET /classes/cs480/announce.html - 200 3794 HTTP/1.1 maya.cs.depaul.edu Mozilla/4.0+(compatible;+MSIE+6.0;+Windows+NT+5.1;+SV1) http://maya.cs.depaul.edu/~classes/cs480/
5 2006-02-02 19:34:45 3.4.5.6 - GET /classes/cs480/styles2.css - 200 1636 HTTP/1.1 maya.cs.depaul.edu Mozilla/4.0+(compatible;+MSIE+6.0;+Windows+NT+5.1;+SV1) http://maya.cs.depaul.edu/~classes/cs480/announce.html
6 2006-02-02 19:34:45 3.4.5.6 - GET /classes/cs480/header.gif - 200 6027 HTTP/1.1 maya.cs.depaul.edu Mozilla/4.0+(compatible;+MSIE+6.0;+Windows+NT+5.1;+SV1) http://maya.cs.depaul.edu/~classes/cs480/announce.html

Typical Fields in a Log File Entry
Field               Example value
client IP address   1.2.3.4
base url            maya.cs.depaul.edu
date/time           2006-02-01 00:08:43
http method         GET
file accessed       /classes/cs589/papers.html
protocol version    HTTP/1.1
status code         200 (successful access)
bytes transferred   9221
referrer page       http://dataminingresources.blogspot.com/
user agent          Mozilla/4.0+(compatible;+MSIE+6.0;+Windows+NT+5.1;+SV1;+.NET+CLR+2.0.50727)
In addition, there may be fields corresponding to
• login information
• client-side cookies (unique keys, issued to clients in order to identify a repeat visitor)
• session ids issued by the Web or application servers

Usage Data Preparation Tasks
Data cleaning
• remove irrelevant references and fields in server logs
• remove references due to spider navigation
• add missing references due to caching
Data integration
• synchronize data from multiple server logs
• integrate e-commerce and application server data
• integrate meta-data
Data Transformation
• pageview identification
• identification of unique users
• sessionization: partitioning each user’s record into multiple sessions or transactions (usually representing different visits)
• mapping between user sessions and topics or categories

Conceptual Representation of User Transactions or Sessions
Rows are sessions/user transactions; columns are pageviews/objects.
         A    B    C    D    E    F
user0   15    5    0    0    0  185
user1    0    0   32    4    0    0
user2   12    0    0   56  236    0
user3    9   47    0    0    0  134
user4    0    0   23   15    0    0
user5   17    0    0  157   69    0
user6   24   89    0    0    0  354
user7    0    0   78   27    0    0
user8    7    0   45   20  127    0
user9    0   38   57    0    0   15
Raw weights may be binary or based on time spent on a page; in practice, need to normalize or standardize this data.
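
A minimal sketch of the cleaning and sessionization steps, assuming the space-separated field order seen in the sample log above. The field names, the list of ignored file suffixes, the (IP, user agent) heuristic for identifying users, and the 30-minute timeout are illustrative assumptions. The first three log lines are entries 4 to 6 from the sample; the fourth is a made-up later request added to show a session break.

```python
from datetime import datetime, timedelta

# Assumed field order, matching the sample entries shown above.
FIELDS = ["date", "time", "ip", "user", "method", "path", "query",
          "status", "bytes", "protocol", "host", "agent", "referrer"]
IGNORED_SUFFIXES = (".css", ".gif", ".jpg", ".png", ".js")  # embedded objects, not pageviews

def parse(line):
    """Split one log line into named fields and attach a timestamp."""
    record = dict(zip(FIELDS, line.split()))
    record["timestamp"] = datetime.strptime(
        record["date"] + " " + record["time"], "%Y-%m-%d %H:%M:%S")
    return record

def sessionize(lines, timeout=timedelta(minutes=30)):
    """Group successful page requests per user into time-bounded sessions."""
    sessions = {}   # (ip, agent) -> list of sessions, each a list of paths
    last_seen = {}
    for rec in sorted(map(parse, lines), key=lambda r: r["timestamp"]):
        # Data cleaning: keep successful requests for pages only.
        if rec["status"] != "200" or rec["path"].lower().endswith(IGNORED_SUFFIXES):
            continue
        user = (rec["ip"], rec["agent"])  # heuristic user identification
        if user not in last_seen or rec["timestamp"] - last_seen[user] > timeout:
            sessions.setdefault(user, []).append([])  # start a new session
        sessions[user][-1].append(rec["path"])
        last_seen[user] = rec["timestamp"]
    return sessions

AGENT = "Mozilla/4.0+(compatible;+MSIE+6.0;+Windows+NT+5.1;+SV1)"
HOST = "maya.cs.depaul.edu"
BASE = "http://maya.cs.depaul.edu/~classes/cs480/"
log = [
    f"2006-02-02 19:34:45 3.4.5.6 - GET /classes/cs480/announce.html - 200 3794 HTTP/1.1 {HOST} {AGENT} {BASE}",
    f"2006-02-02 19:34:45 3.4.5.6 - GET /classes/cs480/styles2.css - 200 1636 HTTP/1.1 {HOST} {AGENT} {BASE}announce.html",
    f"2006-02-02 19:34:45 3.4.5.6 - GET /classes/cs480/header.gif - 200 6027 HTTP/1.1 {HOST} {AGENT} {BASE}announce.html",
    # Hypothetical later request by the same client, not from the sample log.
    f"2006-02-02 21:10:00 3.4.5.6 - GET /classes/cs480/ - 200 1200 HTTP/1.1 {HOST} {AGENT} {BASE}",
]
sessions = sessionize(log)
```

Dropping the stylesheet and image requests is a crude form of pageview identification: both belong to the single display of `announce.html`. The later request falls outside the timeout, so it opens a second session for the same user.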

Web Usage Mining as a Process

Common Web Usage Mining Tasks
Clustering (unsupervised):
• Automatically group together users with similar purchasing or navigational patterns
  • User / customer segments
• Automatically group together items based on co-occurrence in user sessions
  • Automatic creation of concept or functional hierarchies for the site
Classification / Prediction (supervised)
• categorize pages or items into topics in a concept hierarchy
• classify users into behavioral groups based on their navigation or purchase histories (e.g., browser, likely to purchase, loyal customer, etc.)
• predict a user’s interest in an item based on that user’s profiles and those of other similar users
• predict the life-time value for a customer based on transaction history and navigation behavior

Common Web Usage Mining Tasks
Association Rules
• Associating the presence of a set of items with other sets of items
• X → Y, where X and Y are sets of items
• Support of the itemset X ∪ Y: Pr(X,Y); confidence of the rule: Pr(Y|X)
• Examples:
  • 30% of users who accessed the special-offers page also placed an online order in /products/software/
  • Customers who bought The Da Vinci Code and Holy Blood, Holy Grail were 65% likely to also purchase the Harry Potter and the Goblet of Fire DVD
Sequential Patterns / Path Analysis
• Finding common sequences of events/items appearing frequently in transactions
• General form: “x% of the time, when A and B appear in a transaction together, C appears within z transactions (alternatively, within t time units)”
• 15% of visitors had the following common pattern in their navigation path during a session: home → * → software → * → shopping cart → checkout

Example: Association Analysis for Ecommerce
Product                 Association              Lift   Confidence
Fully Reversible Mats   Egyptian Cotton Towels   456    41%
• Confidence: 41% of customers who purchased Fully Reversible Mats also purchased Egyptian Cotton Towels
• Lift: People who purchased Fully Reversible Mats were 456 times more likely to purchase the Egyptian Cotton Towels compared to the general population (lift is the rule's confidence divided by the overall probability of purchasing the towels, Pr(Y|X) / Pr(Y))

Example: Association Rules for Personalized Recommendations

Profile Aggregation Based on Clustering Transactions (PACT)
Input
• set of relevant pageviews in the preprocessed log: $P = \{p_1, p_2, \ldots, p_n\}$
• set of user transactions: $T = \{t_1, t_2, \ldots, t_m\}$
• each transaction is a pageview vector: $t = \langle w(p_1, t), w(p_2, t), \ldots, w(p_n, t) \rangle$
Cluster Transactions (e.g., using k-means)
• each cluster contains a set of transaction vectors
• for each cluster, compute the centroid as cluster representative: $c = \langle u_1^c, u_2^c, \ldots, u_n^c \rangle$
Aggregate Usage Profiles
• a set of pageview-weight pairs: for transaction cluster C, select each pageview $p_i$ such that $u_i^c$ (in the cluster centroid) is greater than a pre-specified threshold

Characterizing User Segments via Clustering
Original session/user data
        A.html  B.html  C.html  D.html  E.html  F.html
user0     1       1       0       0       0       1
user1     0       0       1       1       0       0
user2     1       0       0       1       1       0
user3     1       1       0       0       0       1
user4     0       0       1       1       0       0
user5     1       0       0       1       1       0
user6     1       1       0       0       0       1
user7     0       0       1       1       0       0
user8     1       0       1       1       1       0
user9     0       1       1       0       0       1
Result of clustering
                   A.html  B.html  C.html  D.html  E.html  F.html
Cluster 0  user1     0       0       1       1       0       0
           user4     0       0       1       1       0       0
           user7     0       0       1       1       0       0
Cluster 1  user0     1       1       0       0       0       1
           user3     1       1       0       0       0       1
           user6     1       1       0       0       0       1
           user9     0       1       1       0       0       1
Cluster 2  user2     1       0       0       1       1       0
           user5     1       0       0       1       1       0
           user8     1       0       1       1       1       0
Given an active session A → B, the best matching cluster is Profile 1. This may result in a recommendation for page F.html, since it appears with high weight in that cluster.
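
The profile aggregation step for this example can be reproduced with a short sketch. The session matrix and the cluster assignments are copied from the slide rather than computed with k-means; the threshold of 0.2 and the use of cosine similarity to match the active session against each profile are assumptions made for illustration.

```python
import math

PAGES = ["A.html", "B.html", "C.html", "D.html", "E.html", "F.html"]
SESSIONS = {
    "user0": [1, 1, 0, 0, 0, 1], "user1": [0, 0, 1, 1, 0, 0],
    "user2": [1, 0, 0, 1, 1, 0], "user3": [1, 1, 0, 0, 0, 1],
    "user4": [0, 0, 1, 1, 0, 0], "user5": [1, 0, 0, 1, 1, 0],
    "user6": [1, 1, 0, 0, 0, 1], "user7": [0, 0, 1, 1, 0, 0],
    "user8": [1, 0, 1, 1, 1, 0], "user9": [0, 1, 1, 0, 0, 1],
}
# Cluster assignments taken from the slide, not computed here.
CLUSTERS = {
    0: ["user1", "user4", "user7"],
    1: ["user0", "user3", "user6", "user9"],
    2: ["user2", "user5", "user8"],
}

def centroid(users):
    """Mean pageview weight over the transactions in one cluster."""
    rows = [SESSIONS[u] for u in users]
    return [sum(column) / len(rows) for column in zip(*rows)]

def usage_profile(users, threshold=0.2):
    """Keep pageviews whose centroid weight exceeds the threshold (0.2 is an assumption)."""
    return {page: w for page, w in zip(PAGES, centroid(users)) if w > threshold}

def match(active, profile):
    """Cosine similarity between a binary active session and a profile."""
    dot = sum(profile.get(page, 0.0) for page in active)
    norm = math.sqrt(len(active)) * math.sqrt(sum(w * w for w in profile.values()))
    return dot / norm if norm else 0.0

profiles = {cid: usage_profile(users) for cid, users in CLUSTERS.items()}
active = {"A.html", "B.html"}
best = max(profiles, key=lambda cid: match(active, profiles[cid]))
# Recommend pages from the best profile that the user has not visited yet.
recommendations = sorted(
    (page for page in profiles[best] if page not in active),
    key=lambda page: profiles[best][page],
    reverse=True,
)
```

Under these assumptions the active session {A.html, B.html} matches cluster 1 best, and F.html is the top recommendation because its centroid weight in that cluster is 1.00.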
Aggregate usage profiles for the three clusters
Cluster 0 (Cluster Size = 3)
  1.00  C.html
  1.00  D.html
Cluster 1 (Cluster Size = 4)
  1.00  B.html
  1.00  F.html
  0.75  A.html
  0.25  C.html
Cluster 2 (Cluster Size = 3)
  1.00  A.html
  1.00  D.html
  1.00  E.html
  0.33  C.html

Example: Clustering User Transactions
Transaction Clusters:
• Clustering similar user transactions and using the centroid of each cluster as a usage profile (representative for a user segment)
Sample cluster centroid from the CS dept. Web site (cluster size = 330)
Support  URL  Pageview Description
1.00  /courses/syllabus.asp?course=45096-303&q=3&y=2002&id=290  SE 450 Object-Oriented Development class syllabus
0.97  /people/facultyinfo.asp?id=290  Web page of a lecturer who taught the above course
0.88  /programs/  Current Degree Descriptions 2002
0.85  /programs/courses.asp?depcode=96&deptmne=se&courseid=450  SE 450 course description in SE program
0.82  /programs/2002/gradds2002.asp  M.S. in Distributed Systems program description

Example: Collaborative Filtering
Popular Recommendation Technology
• Recommend items to users by finding other users with similar tastes or interests
• Compare a target user’s profile (typically ratings on various items) to the profiles of other users in the database with ratings for some common items
• Use these “nearest neighbors” to predict a rating by the target user on an unseen item
• Collaborative recommendation is one example of using data mining for automatic personalization
Source: J. Riedl, “Why Does KDD Care About Personalization?”

Web Mining Approach to Personalization
Basic Idea
• generate aggregate user models (usage profiles) by discovering user access patterns through Web usage mining (offline process)
  • Clustering user transactions
  • Clustering items / pageviews
  • Association rule mining
  • Sequential pattern discovery
• match a user’s active session against the discovered models to provide dynamic content (online process)
Advantages
• no explicit user ratings or interaction with users
• helps preserve user privacy, by making effective use of anonymous data
• enhances the effectiveness and scalability of collaborative filtering
• more accurate and broader recommendations than content-only approaches

Web Personalization
The General Problem
• dynamically serve customized content (pages, products, etc.) to users based on their profiles, preferences, or expected interests
• as we have seen, many of the data mining approaches that allow us to learn aggregate user models can be used for personalization or recommendation

Real-Time Recommendation Engine
Keep track of users’ navigational history through the site
• a fixed-size sliding window over the active session captures the depth of the current user’s “short-term” history
Match the current user’s activity against the discovered profiles
• profiles can either be based on aggregate usage profiles, or be obtained directly from association rules or sequential patterns
Dynamically generated recommendations are added to the returned page
• each pageview can be assigned a recommendation score based on
  • its matching score to user profiles (e.g., aggregate usage profiles)
  • the “information value” of the pageview based on domain knowledge (e.g., link distance of the candidate recommendation to the active session)

Problems with Web Usage Mining
New item problem
• Patterns will not capture new items recently added
• Bad for dynamic Web sites
Poor machine interpretability
• Hard to generalize and reason about patterns
• No domain knowledge used to enhance results
• E.g., knowing a user is interested in a program, we could recommend the prerequisites, core or popular courses in this program to the user
Poor insight into the patterns themselves
• The nature of the relationships among items or users in a pattern is not directly available

Solution: Integrate Semantic Knowledge with Web Usage Mining
Information Retrieval/Extraction Approach
• Represent semantic knowledge in pageviews as keyword vectors
• Keywords extracted from text or meta-data
• Text mining can be used to capture higher-level concepts or associations among concepts
• Cannot capture deeper relationships among objects based on their inherent properties or attributes
Ontology-based approach
• Represent domain knowledge using relational model or ontology representation languages
• Process Web usage data with the structured domain knowledge
• Requires the extraction of ontology instances from Web pages
• Challenge: performing underlying mining operations on structured objects (e.g., computing similarities or performing aggregations)