FALL 2012 DSCI5240 Graduate Presentation
By Xxxxxxx

Web Usage Mining

Outline
• Definition and Goal
• Source and Type of Data
• Data Collection and Pre-processing
• Data Modeling
• Discovery and Analysis of Web Usage Patterns

Web Usage Mining
• Definition and Goal
  • Definition: automatic discovery and analysis of patterns
  • Goal: capture, model, and analyze the behavior patterns and profiles of users interacting with web sites
• Source and Type of Data
  • Server log files: web server and application access logs
  • Site files and metadata
  • Operational databases
  • Application templates
  • Domain knowledge
  • Internet Service Provider data collection

Data Collection
• Web site and application data
  • Primary source of data in Web usage mining
  • Each HTTP request generates a single entry in the server access logs
  • Log entry fields: time and date of the request; client IP address; resource requested; HTTP method; user agent (browser and operating system); referring web resource; client-side cookies
• Example log entries:
  2006-02-01 00:08:43 1.2.3.4 - GET /classes/cs589/papers.html 200 9221 HTTP/1.1 maya.cs.depaul.edu Mozilla/4.0+(compatible;+MSIE+6.0;+Windows+NT+5.1;+SV1;+.NET+CLR+2.0.50727) http://dataminingresources.blogspot.com/
  2006-02-01 00:08:46 1.2.3.4 - GET /classes/cs589/papers/cms-tai.pdf

Data Abstraction
• Pageview: a collection of web objects or resources corresponding to a single “user event”. Examples: reading an article, viewing a product page, adding a product to a shopping cart.
• Session: the sequence of pageviews by a single user during a single visit.
• Content data: objects and relationships presented to the user (text and images).
• User data: stored in operational databases (e.g., user profile information, visit histories).

Web Usage Data Pre-Processing
• Data fusion: merging log files from several web and application servers, using:
  • shared embedded session IDs
  • heuristic methods based on the “referrer” field in server logs
• Data cleaning: removing irrelevant entries, such as references to style files, graphics, or sound files

Web Usage Data Pre-Processing (Continued)
• Pageview identification attributes:
  • Pageview ID (a URL uniquely representing the pageview); static pageview type (e.g., information page, product page); metadata (keywords)
• User identification:
  • User authentication mechanisms (user activity record)
  • Client-side cookies
• Sessionization
  • The user activity record of each user is segmented into sessions, each representing a single visit to the site.
  • An episode is a subset or subsequence of a session comprising semantically or functionally related pageviews.

Web Usage Data Pre-Processing (Continued)
• Path completion
  • Recovers references that are missing from the log because of client-side or proxy-side caching. When a user returns to a previously viewed page, the cached copy is displayed, so no new request reaches the server.
• Data integration
  • Brings in user data (e.g., demographics, ratings, and purchase histories) and product attributes and categories from operational databases.
• Building content-enhanced transaction data
  • Multiply the user-pageview matrix by the transpose of the term-pageview matrix (see Bamshad Mobasher, ch. 12: Web Usage Mining, pp. 14-18).

Discovery and Analysis of Web Usage Patterns
• Session and Visitor Analysis
  • Data is aggregated by predetermined units such as days, sessions, visitors, or domains.
  • Reports cover the most frequently accessed pages, the average view time of a page, the average length of a path through a site, and common entry and exit points.
  • Useful for improving system performance and supporting marketing decisions.
  • Online Analytical Processing (OLAP) provides a more integrated framework for analysis with a higher degree of flexibility.

Discovery and Analysis of Web Usage Patterns
Cluster Analysis and Visitor Segmentation
• Recall that clustering is a data mining technique that groups together a set of items having similar characteristics.
• User clusters (most commonly used)
  • Clustering of user records (sessions or transactions)
  • Establishes groups of users exhibiting similar browsing patterns
  • Useful for providing personalized Web content to similar users
• Page (or item) clusters
  • Based on usage data (i.e., starting from user sessions or transaction data): items commonly accessed or purchased together are automatically organized into groups
  • Based on content features associated with pages or items (keywords or product attributes): collections of pages or products related to the same topic or category
  • Can also be used to provide permanent or dynamic HTML pages that suggest related hyperlinks to users according to their past navigational or purchase activity

Discovery and Analysis of Web Usage Patterns
Association and Correlation Analysis
• Recall that an association rule is an expression of the form X → Y [sup, conf], where X and Y are itemsets, sup is the support of the itemset X ∪ Y, representing the probability that X and Y occur together in a transaction, and conf is the confidence of the rule, defined by sup(X ∪ Y) / sup(X), representing the conditional probability that Y occurs in a transaction given that X has occurred in that transaction.
• Can find groups of items or pages that are commonly accessed or purchased together.
• Enables Web sites to provide effective cross-sale product recommendations.
• One problem for association-rule recommendation systems is that a system cannot give any recommendations when the dataset is sparse.

Resource: Web Usage Mining, by Bamshad Mobasher; http://maya.cs.depaul.edu/~mobasher/papers/12-webusage-mining.pdf
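
Worked example: the support and confidence definitions from the Association and Correlation Analysis slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `support` and `confidence`, the `sessions` data, and the page URLs are hypothetical and do not come from the presentation or from Mobasher's chapter. Each session is modeled as a set of pageviews.

```python
from typing import FrozenSet, Iterable, Sequence


def support(itemset: Iterable[str], transactions: Sequence[FrozenSet[str]]) -> float:
    """Fraction of transactions that contain every item in `itemset`."""
    if not transactions:
        return 0.0
    items = frozenset(itemset)
    hits = sum(1 for t in transactions if items <= t)
    return hits / len(transactions)


def confidence(x: Iterable[str], y: Iterable[str],
               transactions: Sequence[FrozenSet[str]]) -> float:
    """conf(X -> Y) = sup(X u Y) / sup(X)."""
    x_items, y_items = frozenset(x), frozenset(y)
    sup_x = support(x_items, transactions)
    if sup_x == 0:
        raise ValueError("X never occurs, so the confidence is undefined")
    return support(x_items | y_items, transactions) / sup_x


# Hypothetical sessions (illustrative data only, not from the source).
sessions = [
    frozenset({"/home", "/products", "/cart"}),
    frozenset({"/home", "/products"}),
    frozenset({"/home", "/about"}),
    frozenset({"/products", "/cart"}),
]

sup = support({"/products", "/cart"}, sessions)        # 2 of 4 sessions
conf = confidence({"/products"}, {"/cart"}, sessions)  # 0.5 / 0.75
print(f"sup = {sup:.2f}, conf = {conf:.2f}")
```

In this toy data the rule /products → /cart holds in 2 of 4 sessions (support 0.5), and 2 of the 3 sessions containing /products also contain /cart (confidence about 0.67). With sparse data, few itemsets reach a useful support threshold, which is the recommendation problem noted in the last bullet of that slide.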