Download Data Mining Lecture 1: Introduction to Data Mining

Project Presentations • Thursday next week, each student will make a 4-minute presentation on their project in class (with 1 or 2 minutes for questions) • Email me your Powerpoint or PDF slides, with your name (e.g., joesmith.ppt), before 10am next Thursday • Suggested content: – – – – – • Definition of the task/goal Description of data sets Description of algorithms Experimental results and conclusions Be visual where possible! (i.e., use figures, graphs, etc) Final project report will be due by 12 noon Tuesday of finals week – more details to come later Data Mining Lectures Analysis of Web User Data Padhraic Smyth, UC Irvine ICS 278: Data Mining Lecture 18: Analysis of Web User Data Padhraic Smyth Department of Information and Computer Science University of California, Irvine Data Mining Lectures Analysis of Web User Data Padhraic Smyth, UC Irvine Outline • Basic concepts in Web mining • Analyzing user navigation or clickstream data • Predictive modeling of Web navigation behavior – Markov modeling methods • Analyzing search engine data • Ecommerce aspects of Web log mining – Automated recommender systems Data Mining Lectures Analysis of Web User Data Padhraic Smyth, UC Irvine Further Reading • Modeling the Internet and the Web, P. Baldi, P. Frasconi, P. Smyth, Wiley, 2003. • ACM Transactions on Internet Technology (ACM TOIT) – can be accessed via ACM Digital Library (available from UCI IP addresses). • Annual WebKDD workshops at the ACM SIGKDD conferences. • Papers on Web page prediction – Selective Markov models for predicting Web page accesses, M. Deshpande, G. Karypis, ACM Transactions on Internet Technology, May 2004. – Model-based clustering and visualization of navigation patterns on a Web site, Cadez et al, Journal of Data Mining and Knowledge Discovery, 2003. Data Mining Lectures Analysis of Web User Data Padhraic Smyth, UC Irvine Introduction to Web Mining • Useful to study human digital behavior, e.g. search engine data can be used for – Exploration e.g. # of queries per session? – Modeling e.g. any time of day dependence? – Prediction e.g. which pages are relevant? • Applications – Understand social implications of Web usage – Design of better tools for information access – E-commerce applications Data Mining Lectures Analysis of Web User Data Padhraic Smyth, UC Irvine Advertising Applications • Revenue of many internet companies is driven by advertising • Key problem: – Given user data: • Pages browsed • Keywords used in search • Demographics – Determine the most relevant ads (in real-time) – Currently about 50% of keyword searches can not be matched effectively to any ads – (other aspects include bidding/pricing of ads) • Another major problem: “click fraud” • Understanding the user is key to these types of applications – Algorithms that can automatically detect when online advertisements are being manipulated (this is a major problem for Internet advertising) Data Mining Lectures Analysis of Web User Data Padhraic Smyth, UC Irvine Data Sources for Web Mining • Web content – Text and HTML content on Web pages, e.g., categorization of content • Web connectivity – Hyperlink/directed-graph structure of the Web – e.g., using PageRank to infer importance of Web pages – e.g., using links to improve accuracy in classification of Web pages • Web user data – Data on how users interact with the Web • Navigation data, aka “clickstream” data • Search query data (keywords for users) • Online transaction data (e.g., purchases at an ecommerce store) – Volume of data? • Large portals (e.g., Yahoo!, MSN) report 100’s of millions of users per month Data Mining Lectures Analysis of Web User Data Padhraic Smyth, UC Irvine Flowchart of a typical Web Mining process (From Cooley, ACM TOIT, 2003) Data Mining Lectures Analysis of Web User Data Padhraic Smyth, UC Irvine How our Web navigation is recorded… • Web logs – Record activity between client browser and a specific Web server – Easily available – Can be augmented with cookies (provide notion of “state”) • Search engine records – Text in queries, which pages were viewed, which snippets were clicked on, etc • Client-side browsing records – Automatically recorded by client-side software – Harder to obtain, but much more accurate than server-side logs • Other sources – Web site registration, purchases, email, etc – ISP recording of Web browsing Data Mining Lectures Analysis of Web User Data Padhraic Smyth, UC Irvine Web Server Log Files • Server Transfer Log: – transactions between a browser and server are logged – – – – IP address, the time of the request Method of the request (GET, HEAD, POST…) Status code, a response from the server Size in byte of the transaction • Referrer Log: – where the request originated • Agent Log: – browser software making the request (spider) • Error Log: – request resulted in errors (404) Data Mining Lectures Analysis of Web User Data Padhraic Smyth, UC Irvine W3C Extended Log File Format Field Date Description Date Time Client IP address date time c-ip User Name Servis Name Server Name Server IP Address Server Port Method URI Stem URI Query Protocol Status Win32 Status Bytes Sent Bytes Received Time Taken Protocol Version Host cs-username s-sitename s-computername s-ip s-port cs-method cs-uri-stem cs-uri-query sc-status sc-win32-status sc-bytes cs-bytes time-taken cs-version cs-host The date that the activity occurred The time that the activity occurred The IP address of the client that accessed your server The name of the autheticated user who access your server, anonymous users are represented by The Internet service and instance number that was accessed by a client The name of the server on which the log entry was generated The IP address of the server that accessed your server The port number the client is connected to The action the client was trying to perform The resource accessed The query, if any, the client was trying to perform The status of the action, in HTTP or FTP terms The status of the action, in terms used by Microsoft Windows The number of bytes sent by the server The number of bytes received by the server The duration of time, in milliseconds, that the action consumed The protocol (HTTP, FTP) version used by the client Display the content of the host header User Agent Cookie Referrer cs(User Agent) cs(Cookie) cs(Referrer) The browser used on the client The content of the cookie sent or received, if any The previous site visited by the user. This site provided a link to the current site s = server actions c = client actions cs = client-to-server actions sc = server-to-client actions Data Mining Lectures Analysis of Web User Data Padhraic Smyth, UC Irvine Example of Web Log entries Apache web log: 205.188.209.10 - - [29/Mar/2002:03:58:06 -0800] "GET /~sophal/whole5.gif HTTP/1.0" 200 9609 "http://www.csua.berkeley.edu/~sophal/whole.html" "Mozilla/4.0 (compatible; MSIE 5.0; AOL 6.0; Windows 98; DigExt)" 216.35.116.26 - - [29/Mar/2002:03:59:40 -0800] "GET /~alexlam/resume.html HTTP/1.0" 200 2674 "-" "Mozilla/5.0 (Slurp/cat; [email protected]; http://www.inktomi.com/slurp.html)“ 202.155.20.142 - - [29/Mar/2002:03:00:14 -0800] "GET /~tahir/indextop.html HTTP/1.1" 200 3510 "http://www.csua.berkeley.edu/~tahir/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)“ 202.155.20.142 - - [29/Mar/2002:03:00:14 -0800] "GET /~tahir/animate.js HTTP/1.1" 200 14261 "http://www.csua.berkeley.edu/~tahir/indextop.html" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)“ Data Mining Lectures Analysis of Web User Data Padhraic Smyth, UC Irvine Routine Server Log Analysis • Typical statistics/histograms that are computed – – – – – – Most and least visited web pages Entry and exit pages Referrals from other sites or search engines What are the searched keywords How many clicks/page views a page received Error reports, like broken links • Many software products that produce standard reports of this type of data – Very useful for Web site managers – But does not provide “deep” insights – e.g., are there clusters/groups of users that use the site in different ways? Data Mining Lectures Analysis of Web User Data Padhraic Smyth, UC Irvine Visualization of Web Log Data over Time Data Mining Lectures Analysis of Web User Data Padhraic Smyth, UC Irvine Descriptive Summary Statistics • Histograms, scatter plots, time-series plots – Very important! – Helps to understand the big picture – Provides “marginal” context for any model-building • models aggregate behavior, not individuals – Challenging for Web log data • Examples – Session lengths (e.g., power laws) – Click rates as a function of time, content Data Mining Lectures Analysis of Web User Data Padhraic Smyth, UC Irvine L = number of page requests in a single session from visitors to www.ics.uci.edu over 1 week in November 2002 (robots removed) 0 Empirical Frequency of L 10 -1 10 -2 10 -3 10 -4 10 -5 10 -6 10 0 10 1 10 2 10 Session Length L Data Mining Lectures Analysis of Web User Data Padhraic Smyth, UC Irvine Best fit of simple power law model 0 Log P(L) = -a Log L + b 10 or -1 P(L) = b L-a Probability of L 10 -2 10 -3 10 -4 10 -5 10 -6 10 0 10 1 10 2 10 3 10 Session Length L Data Mining Lectures Analysis of Web User Data Padhraic Smyth, UC Irvine 0 10 POISSON -1 10 Probability of L GEOMETRIC -2 10 INVERSE GAUSSIAN -3 10 POWER-LAW -4 10 -5 10 -6 10 0 10 1 10 2 10 Session Length L Data Mining Lectures Analysis of Web User Data Padhraic Smyth, UC Irvine Web data measurement issues • Important to understand how data is collected • Web data is collected automatically via software logging tools – Advantage: • No manual supervision required – Disadvantage: • Data can be skewed (e.g. due to the presence of robot traffic) • Important to identify robots (also known as crawlers, spiders) Data Mining Lectures Analysis of Web User Data Padhraic Smyth, UC Irvine A time-series plot of ICS Website data Number of page requests per hour as a function of time from page requests in the www.ics.uci.edu Web server logs during the first week of April 2002. Data Mining Lectures Analysis of Web User Data Padhraic Smyth, UC Irvine Example: Web Traffic from Commercial Site (slide from Ronny Kohavi, Amazon) Weekends Sept-11 Note significant drop in human traffic, not bot traffic Internal Performance bot Registration at Search Engine sites Data Mining Lectures Analysis of Web User Data Padhraic Smyth, UC Irvine Robot / human identification • • Removal of robot data is important preprocessing step before clickstream analysis Robot page-requests often identified using a variety of heuristics – e.g. some robots self-identify themselves in the server logs • All robots in principle should visit robots.txt on the Web Server • Also, robots should identify themselves via the User Agent field in page requests • But other robots actively try to disguise that they are robots – Patterns of access • • • • • Robots explore the entire website in breadth first fashion Humans access web-pages more typically in depth-first fashion Timing between page-requests can be more regular for robots (e.g., every 5 seconds) Duration of sessions, number of page-requests per day: often unusually large (e.g., 1000’s of page-requests per day) for robots. Tan and Kumar (Journal of Data Mining and Knowledge Discovery, 2002) provide a detailed description of using classification techniques to learn how to detect robots Data Mining Lectures Analysis of Web User Data Padhraic Smyth, UC Irvine Fractions of Robot Data (from Tan and Kumar, 2002) Data Mining Lectures Analysis of Web User Data Padhraic Smyth, UC Irvine From Tan and Kumar, 2002 Overall accuracies of around 90% were obtained using decision tree classifiers, trained on sessions of lengths 1, 2, 3, 4,.. Data Mining Lectures Analysis of Web User Data Padhraic Smyth, UC Irvine Page requests, caching, and proxy servers • In theory, requester browser requests a page from a Web server and the request is processed • In practice, there are – – – – Data Mining Lectures Other users Browser caching Dynamic addressing in local network Proxy Server caching Analysis of Web User Data Padhraic Smyth, UC Irvine Page requests, caching, and proxy servers A graphical summary of how page requests from an individual user can be masked at various stages between the user’s local computer and the Web server. Data Mining Lectures Analysis of Web User Data Padhraic Smyth, UC Irvine Page requests, caching, and proxy servers • Web server logs are therefore not so ideal in terms of a complete and faithful representation of individual page views • There are heuristics to try to infer the true actions of the user: – Path completion (Cooley et al. 1999) • e.g. If known B -> F and not C -> F, then session ABCF can be interpreted as ABCBF • Anderson et al. 2001 for more heuristics • In general case, it is hard to know what exactly the user viewed Data Mining Lectures Analysis of Web User Data Padhraic Smyth, UC Irvine Identifying individual users from Web server logs • Useful to associate specific page requests to specific individual users • IP address most frequently used • Disadvantages – One IP address can belong to several users – Dynamic allocation of IP address • Better to use cookies (or login ID if available) – Information in the cookie can be accessed by the Web server to identify an individual user over time – Actions by the same user during different sessions can be linked together Data Mining Lectures Analysis of Web User Data Padhraic Smyth, UC Irvine Identifying individual users from Web server logs • Commercial websites use cookies extensively • 97 % of users have cookies enabled permanently on their browsers (source: Amazon.com, 2003) • However … – There are privacy issues – need implicit user cooperation – Cookies can be deleted / disabled • Another option is to enforce user registration – High reliability – But can discourage potential visitors – Large portals (such as Yahoo!) have high fraction of logged-in users Data Mining Lectures Analysis of Web User Data Padhraic Smyth, UC Irvine Sessionizing • Time oriented (robust) – e.g., by gaps between requests • not more than 20 minutes between successive requests • this is a heuristic – but is a standard “rule” used in practice • Navigation oriented (good for short sessions and when timestamps unreliable) – Referrer is previous page in session, or – Referrer is undefined but request within 10 secs, or – Link from previous to current page in web site Data Mining Lectures Analysis of Web User Data Padhraic Smyth, UC Irvine Client-side data • Advantages of collecting data at the client side: – Direct recording of page requests (eliminates ‘masking’ due to caching) – Recording of all browser-related actions by a user (including visits to multiple websites) – More-reliable identification of individual users (e.g. by login ID for multiple users on a single computer) • Preferred mode of data collection for studies of navigation behavior on the Web • Companies like ComScore and Nielsen use client-side software to track home computer users Data Mining Lectures Analysis of Web User Data Padhraic Smyth, UC Irvine Client-side data • Statistics like ‘Time per session’ and ‘Page-view duration’ are more reliable in client-side data • Some limitations – Still some statistics like ‘Page-view duration’ cannot be totally reliable e.g. user might go to fetch coffee – Need explicit user cooperation – Typically recorded on home computers – may not reflect a complete picture of Web browsing behavior • Web surfing data can be collected at intermediate points like ISPs, proxy servers – Can be used to create user profile and target advertise Data Mining Lectures Analysis of Web User Data Padhraic Smyth, UC Irvine Modeling Clickrate Data • Data – 200k Alexa users, client-side, over 24 hours – ignore URLs requested – goal is to build a time-series model that characterizes user click rates Data Mining Lectures Analysis of Web User Data Padhraic Smyth, UC Irvine Data Mining Lectures Analysis of Web User Data Padhraic Smyth, UC Irvine 300 NUMBER OF CLICKS 250 200 150 100 50 0 0 5 10 15 20 HOURS Data Mining Lectures Analysis of Web User Data Padhraic Smyth, UC Irvine 120 NUMBER OF CLICKS 100 80 60 40 20 0 -20 -40 -60 5 5.5 6 6.5 7 HOURS Data Mining Lectures Analysis of Web User Data Padhraic Smyth, UC Irvine 300 NUMBER OF CLICKS 250 200 150 100 50 0 0 5 10 15 20 HOURS Data Mining Lectures Analysis of Web User Data Padhraic Smyth, UC Irvine Markov-Poisson Model (Scott and Smyth, 2003) • Doubly stochastic process – Locally constant Poisson rate – indexed by M Markov states • Fit a model with M = 3 states • absence of a Web session • Web session with slow click rate: 1 minute rate • Web session with rapid click rate: 10 second rate – Used hierarchical Bayes on individuals Data Mining Lectures Analysis of Web User Data Padhraic Smyth, UC Irvine Early studies from 1995 to 1997 • Earliest studies on client-side data are Catledge and Pitkow (1995) and Tauscher and Greenberg (1997) • In both studies, data was collected by logging Web browser commands • Population consisted of faculty, staff and students • Both studies found – clicking on the hypertext anchors as the most common action – using ‘back button’ was the second common action Data Mining Lectures Analysis of Web User Data Padhraic Smyth, UC Irvine Early studies from 1995 to 1997 • high probability of page revisitation (~0.58-0.61) • Lower bound because the page requests prior to the start of the studies are not accounted for • Humans are creatures of habit? • Content of the pages changed over time? • strong recency (page that is revisited is usually the page that was visited in the recent past) effect • Correlates with the ‘back button’ usage • Similar repetitive actions are found in telephone number dialing etc Data Mining Lectures Analysis of Web User Data Padhraic Smyth, UC Irvine The Cockburn and McKenzie study from 2002 • Earlier studies were outdates • Web has changed dramatically in the past few years • Cockburn and McKenzie (2002) provides a more up-to-date analysis • Study found revisitation rates higher than past 94 and 95 studies (~0.81) – Analyzed the daily history.dat files produced by the Netscape browser for 17 users for about 4 months – Population studied consisted of faculty, staff and graduate students – Time-window is three times that of past studies Data Mining Lectures Analysis of Web User Data Padhraic Smyth, UC Irvine The Cockburn and McKenzie study from 2002 • Revisitation rate less biased than the previous studies? • Human behavior changed from an exploratory mode to a utilitarian mode? – The more pages user visits, the more are the requests for new pages – The most frequently requested page for each user can account for a relatively large fraction of his/her page requests • Useful to see the scatter plot of the distinct number of pages requested per user versus the total pages requested Data Mining Lectures Analysis of Web User Data Padhraic Smyth, UC Irvine The Cockburn and McKenzie study from 2002 The number of distinct pages visited versus page vocabulary size of each of the 17 users in the Cockburn and McKenzie (2002) study (log-log plot) Data Mining Lectures Analysis of Web User Data Padhraic Smyth, UC Irvine The Cockburn and McKenzie study from 2002 Bar chart of the ratio of the number of page requests for the most frequent page divided by the total number of page requests, for 17 users in the Cockburn McKenzie (2002) study Data Mining Lectures Analysis of Web User Data Padhraic Smyth, UC Irvine Outline • Basic concepts in Web log data analysis • Predictive modeling of Web navigation behavior – Markov modeling methods • Analyzing search engine data • Ecommerce aspects of Web log mining Data Mining Lectures Analysis of Web User Data Padhraic Smyth, UC Irvine Markov models for page prediction • General approach is to use a finite-state Markov chain – Each state can be a specific Web page or a category of Web pages – If only interested in the order of visits (and not in time), each new request can be modeled as a transition of states • Issues – Self-transition – Time-independence Data Mining Lectures Analysis of Web User Data Padhraic Smyth, UC Irvine Markov models for page prediction • For simplicity, consider order-dependent, time-independent finite-state Markov chain with M states • Let s be a sequence of observed states of length L. e.g. s = ABBCAABBCCBBAA with three states A, B and C. st is state at position t (1<=t<=L). In general, L P( s )  P( s1 ) P( st | st 1 ,..., s1 ) t 2 • first-order Markov assumption • This provides a simple generative model toL produce sequential data P(st | st 1 ,..., s1 )  P(st | st 1 ) P ( s )  P ( s1 ) P ( st | st 1 ) t 2 Data Mining Lectures Analysis of Web User Data Padhraic Smyth, UC Irvine Markov models for page prediction • • If we denote Tij = P(st = j|st-1 = i), we can define a M x M transition matrix Properties – Strong first-order assumption – Simple way to capture sequential dependence • If each page is a state and if W pages, O(W2), W can be of the order 105 to 106 for a CS dept. of a university • To alleviate, we can cluster W pages into M clusters, each assigned a state in the Markov model • Clustering can be done manually, based on directory structure on the Web server, or automatic clustering using clustering techniques Data Mining Lectures Analysis of Web User Data Padhraic Smyth, UC Irvine Markov models for page prediction • • • Tij = P(st = j|st-1 = i) represents the probability that an individual user’s next request will be from category j, given they were in category i We can add E, an end-state to the model E.g. for three categories with end state:  P(1 | 1) P(2 | 1) P(3 | 1) P( E | 1)    P ( 1 | 2 ) P ( 2 | 2 ) P ( 3 | 2 ) P ( E | 2 )   T  P(1 | 3) P(2 | 3) P(3 | 3) P( E | 3)     0 0 0 1   • E denotes the end of a sequence, and start of a new sequence Data Mining Lectures Analysis of Web User Data Padhraic Smyth, UC Irvine Markov models for page prediction • • First-order Markov model assumes that the next state is based only on the current state Limitations – Doesn’t consider ‘long-term memory’ • We can try to capture more memory with kth-order Markov chain P(st | st 1 ,.., s1 )  P(st | st 1 ,.., st k ) • Limitations – Inordinate amount of training data O(Mk+1) Data Mining Lectures Analysis of Web User Data Padhraic Smyth, UC Irvine Parameter estimation for Markov model transitions • Smoothed parameter estimates of transition probabilities are T ij  nij  qij ni   • If nij = 0 for some transition (i, j) then instead of having a parameter estimate of 0 (ML), we will have qij /( ni   ) allowing prior knowledge to be incorporated • If nij > 0, we get a smooth combination of the data-driven information (nij) and the prior Data Mining Lectures Analysis of Web User Data Padhraic Smyth, UC Irvine Parameter estimation for Markov models • One simple way to set prior parameter is – Consider alpha as the effective sample size – Partition the states into two sets, set 1 containing all states directly linked to state i and the remaining in set 2 – Assign uniform probability r/K to all states in set 2 (all set 2 states are equally likely) – The remaining (1-r) can be either uniformly assigned among set 1 elements or weighted by some measure – Prior probabilities in and out of E can be set based on our prior knowledge of how likely we think a user is to exit the site from a particular state Data Mining Lectures Analysis of Web User Data Padhraic Smyth, UC Irvine Predicting page requests with Markov models • Deshpande and Karypis (2004) propose schemes to prune kth-order Markov state space – Provide systematic but modest improvements • Another way is to use empirical smoothing techniques that combine different models from order 1 to order k (Chen and Goodman 1996) Data Mining Lectures Analysis of Web User Data Padhraic Smyth, UC Irvine Mixtures of Markov Chains • Cadez et al. (2003) and Sen and Hansen (2003) replace the first-order Markov chain: P(st | st 1 ,..., s1 )  P(st | st 1 ) with a mixture of first-order Markov chains K P( st | st 1 ,..., s1 )   P( st | st 1 , c  k )P(c  k ) k 1 where c is a discrete-value hidden variable taking K values Sk P(c = k) = 1 and P(st | st-1, c = k) is the transition matrix for the kth mixture component • One interpretation of this is user behavior consists of K different navigation behaviors described by the K Markov chains Data Mining Lectures Analysis of Web User Data Padhraic Smyth, UC Irvine Modeling Web Page Requests with Markov chain mixtures • MSNBC Web logs – Order of 2 million individual users per day – different session lengths per individual – difficult visualization and clustering problem • WebCanvas – uses mixtures of Markov chains to cluster individuals based on their observed sequences – software tool: EM mixture modeling + visualization • Next few slides are based on material in: I. Cadez et al, Model-based clustering and visualization of navigation patterns on a Web site, Journal of Data Mining and Knowledge Discovery, 2003. Data Mining Lectures Analysis of Web User Data Padhraic Smyth, UC Irvine Data Mining Lectures Analysis of Web User Data Padhraic Smyth, UC Irvine From Web logs to sequences 128.195.36.195, -, 3/22/00, 10:35:11, W3SVC, SRVR1, 128.200.39.181, 781, 363, 875, 200, 0, GET, /top.html, -, 128.195.36.195, -, 3/22/00, 10:35:16, W3SVC, SRVR1, 128.200.39.181, 5288, 524, 414, 200, 0, POST, /spt/main.html, -, 128.195.36.195, -, 3/22/00, 10:35:17, W3SVC, SRVR1, 128.200.39.181, 30, 280, 111, 404, 3, GET, /spt/images/bk1.jpg, -, 128.195.36.101, -, 3/22/00, 16:18:50, W3SVC, SRVR1, 128.200.39.181, 60, 425, 72, 304, 0, GET, /top.html, -, 128.195.36.101, -, 3/22/00, 16:18:58, W3SVC, SRVR1, 128.200.39.181, 8322, 527, 414, 200, 0, POST, /spt/main.html, -, 128.195.36.101, -, 3/22/00, 16:18:59, W3SVC, SRVR1, 128.200.39.181, 0, 280, 111, 404, 3, GET, /spt/images/bk1.jpg, -, 128.200.39.17, -, 3/22/00, 20:54:37, W3SVC, SRVR1, 128.200.39.181, 140, 199, 875, 200, 0, GET, /top.html, -, 128.200.39.17, -, 3/22/00, 20:54:55, W3SVC, SRVR1, 128.200.39.181, 17766, 365, 414, 200, 0, POST, /spt/main.html, -, 128.200.39.17, -, 3/22/00, 20:54:55, W3SVC, SRVR1, 128.200.39.181, 0, 258, 111, 404, 3, GET, /spt/images/bk1.jpg, -, 128.200.39.17, -, 3/22/00, 20:55:07, W3SVC, SRVR1, 128.200.39.181, 0, 258, 111, 404, 3, GET, /spt/images/bk1.jpg, -, 128.200.39.17, -, 3/22/00, 20:55:36, W3SVC, SRVR1, 128.200.39.181, 1061, 382, 414, 200, 0, POST, /spt/main.html, -, 128.200.39.17, -, 3/22/00, 20:55:36, W3SVC, SRVR1, 128.200.39.181, 0, 258, 111, 404, 3, GET, /spt/images/bk1.jpg, -, 128.200.39.17, -, 3/22/00, 20:55:39, W3SVC, SRVR1, 128.200.39.181, 0, 258, 111, 404, 3, GET, /spt/images/bk1.jpg, -, 128.200.39.17, -, 3/22/00, 20:56:03, W3SVC, SRVR1, 128.200.39.181, 1081, 382, 414, 200, 0, POST, /spt/main.html, -, 128.200.39.17, -, 3/22/00, 20:56:04, W3SVC, SRVR1, 128.200.39.181, 0, 258, 111, 404, 3, GET, /spt/images/bk1.jpg, -, 128.200.39.17, -, 3/22/00, 20:56:33, W3SVC, SRVR1, 128.200.39.181, 0, 262, 72, 304, 0, GET, /top.html, -, 128.200.39.17, -, 3/22/00, 20:56:52, W3SVC, SRVR1, 128.200.39.181, 19598, 382, 414, 200, 0, POST, /spt/main.html, -, User 1 User 2 User 3 User 4 User 5 … Data Mining Lectures 2 3 7 1 5 3 3 7 5 1 … 2 3 7 1 1 2 1 7 1 5 3 1 7 1 3 1 7 5 3 1 1 1 3 1 3 3 7 1 7 5 1 1 1 1 1 1 Analysis of Web User Data 3 3 Padhraic Smyth, UC Irvine Clusters of Finite State Machines Cluster 1 A B Cluster 2 A B D E D E A B Cluster 3 Data Mining Lectures D E Analysis of Web User Data Padhraic Smyth, UC Irvine Learning Problem • Assumptions – data is being generated by K different groups – Each group is described by a stochastic finite state machine (SFSM) • aka, a Markov model with an end-state • Given – A set of sequences from different users of different lengths • Learn – A “mixture” of K different stochastic finite state machines • Solution – EM is very easy: fractional counts of transitions – efficient and accurate, scales as O(KN) Data Mining Lectures Analysis of Web User Data Padhraic Smyth, UC Irvine Sketch of EM Algorithm for Mixtures of Markov Chains • • Model = mixture of K Markov chains (K fixed) Input data = N categorical sequences (can be variable length) • Initialization: • E-step: • – – – Generate random initial transition matrices for each of the K groups Compute p( sequence i | model k), for i=1,..N, k = 1,…K Use Bayes rule to compute p(model k | sequence i) • Yields membership probabilities for each sequence M-step: – – Estimate the transition probabilities for each cluster, given membership probabilities Consists of “fractional counting of transitions”, • e.g., sequence with probability 0.8 in cluster k, results in transition counts weighted by 0.8 • Repeat E and M steps until convergence • Complexity of each iteration is O(K N L) where L is the average sequence length Data Mining Lectures Analysis of Web User Data Padhraic Smyth, UC Irvine Prediction with Markov mixtures P(st+1 | s[1,t] ) = Data Mining Lectures Analysis of Web User Data Padhraic Smyth, UC Irvine Prediction with Markov mixtures P(st+1 | s[1,t] ) = S P(st+1 , k | s[1,t] ) = S P(st+1 | k , s[1,t] ) P(k | s[1,t] ) Data Mining Lectures Analysis of Web User Data Padhraic Smyth, UC Irvine Prediction with Markov mixtures P(st+1 | s[1,t] ) = S P(st+1 , k | s[1,t] ) = S P(st+1 | k , s[1,t] ) P(k | s[1,t] ) = S P(st+1 | k , st ) P(k | s[1,t] ) Prediction of kth component Membership, based on sequence history => Predictions are a convex combination of K different component transition matrices, with weights based on sequence history Data Mining Lectures Analysis of Web User Data Padhraic Smyth, UC Irvine Experimental Methodology • Model Training: – fit 2 types of models • mixtures of histograms (multinomials) • mixtures of finite state machines – Train on a full day’s worth of MSNBC Web data • Model Evaluation: – “one-step-ahead” prediction on unseen test data • Test sequences from a different day of Web logs – compute log P(user’s next click | previous clicks, model) – Using equation on the previous slide – logP score: • Rewarded if next click was given high P by the model • Punished if next click was given low P by the model – negative average of logP scores ~ “predictive entropy” • Has a natural interpretation • Lower bounded by 0 bits (perfect prediction) • Upper bounded by log M bits, where M is the number of categories Data Mining Lectures Analysis of Web User Data Padhraic Smyth, UC Irvine Predictive Entropy Out-of-Sample 4 Negative log-likelihood [bits/token] 3.8 3.6 3.4 3.2 3 2.8 2.6 Mixtures of Multinomials 2.4 Mixtures of SFSMs 2.2 2 Data Mining Lectures 20 40 60 80 100 120 140 Number of mixture [K] Analysis of Web Usercomponents Data 160 180 200 Padhraic Smyth, UC Irvine log count(R) RUN LENGTH DISTRIBUTIONS WITHIN MARKOV CLUSTERS 10 5 5 0 0 log count(R) 0 10 5 10 15 20 5 0 0 10 10 15 20 5 0 0 log count(R) 10 log count(R) 10 5 10 15 20 10 15 20 Cluster 2: Category 7 -2 5 10 Cluster 3: Category 1 -2 5 10 -2 Cluster 4: Category 1 0 0 10 5 10 Cluster 5: Category 12 5 0 0 0 10 15 R = Run Length Data Mining Lectures 20 0 1 2 3 R = Run Length Analysis of Web User Data 4 5 5 20 30 40 10 15 20 Cluster 4: Category 3 0 5 5 0 10 5 0 10 Cluster 3: Category 13 10 0 Cluster 5: Category 9 40 0 0 40 30 2 0 30 0 4 5 20 20 0 5 10 10 Cluster 2: Category 8 4 5 0 0 2 10 Cluster 4: Category 2 5 0 5 0 0 10 Cluster 3: Category 12 Cluster 1: Category 8 4 2 0 5 5 Cluster 1: Category 14 10 Cluster 2: Category 1 0 log count(R) 10 Cluster 1: Category 13 5 10 Cluster 5: Category 6 0 1 2 3 4 R = Run Length Padhraic Smyth, UC Irvine Timing Results 2500 2000 N=150,000 Time [sec] 1500 N=110,000 1000 N = 70,000 500 0 -500 0 20 40 60 80 100 120 140 160 180 200 Number of mixture components [K] Data Mining Lectures Analysis of Web User Data Padhraic Smyth, UC Irvine WebCanvas • Software tool for Web log visualization – – – – • uses Markov mixtures to cluster data for display extensively used within Microsoft also applied to non-Web data (e.g., how users navigate in Word, etc) Algorithm and visualization are in latest release of SQLServer (the “sequence mining” tool) Model-based visualization – random sample of actual sequences – interactive tiled windows displayed for visualization – more effective than • planar graphs • traffic-flow movie in Microsoft Site Server v3.0 Data Mining Lectures Analysis of Web User Data Padhraic Smyth, UC Irvine Data Mining Lectures Analysis of Web User Data Padhraic Smyth, UC Irvine Insights from WebCanvas for MSNBC data • From msnbc.com site adminstrators…. – significant heterogeneity of behavior – relatively focused activity of many users • typically only 1 or 2 categories of pages – many individuals not entering via main page – detected problems with the weather page – missing transitions (e.g., tech <=> business) Data Mining Lectures Analysis of Web User Data Padhraic Smyth, UC Irvine Possible Extensions • Adding time-dependence – adding time-between clicks, time of day effects • Uncategorized Web pages – coupling page content with sequence models • Modeling “switching” behaviors – allowing users to switch between behaviors – Could use a topic-style model: users = mixtures of behaviors – e.g., Girolami M & Kaban A., Sequential Activity Profiling: Latent Dirichlet Allocation of Markov Chains, Journal of Data Mining and Knowledge Discovery, Vol 10, 175-196. Data Mining Lectures Analysis of Web User Data Padhraic Smyth, UC Irvine Related Work • Mixtures of Markov chains – special case: Poulsen (1990) – general case: Ridgeway (1997), Smyth (1997) • Clustering of Web page sequences – non-probabilistic approaches (Fu et al, 1999) • Markov models for prediction – Anderson et al (IJCAI, 2001): • mixtures of Markov outperform other sequential models for page-request prediction Data Mining Lectures Analysis of Web User Data Padhraic Smyth, UC Irvine Predicting page requests with Markov models • K can be chosen by evaluating the out-of-sample predictive performance based on – Accuracy of prediction – Log probability score – Entropy • Other variations: – Sen and Hansen 2003 – Position-dependent Markov models (Anderson et al. 2001, 2002) – Zukerman et al. 1999 Data Mining Lectures Analysis of Web User Data Padhraic Smyth, UC Irvine Outline • Basic concepts in Web log data analysis • Predictive modeling of Web navigation behavior – Markov modeling methods • Analyzing search engine data • Ecommerce aspects of Web log mining Data Mining Lectures Analysis of Web User Data Padhraic Smyth, UC Irvine Analysis of Search Engine Query Logs # of Queries Source Time Period Lau & Horvitz 4690 Excite Sep 1997 Silverstein et al 1 Billion AltaVista 6 weeks in Aug & Sep 1998 Spink et al (series of studies) 1 Million for each time period Excite Sep 1997 Dec 1999 May 2001 Xie & O’Hallaron 110,000 Vivisimo 35 days Jan & Feb 2001 1.9 Million Excite 8 hrs in a day, Dec 1999 Data Mining Lectures Analysis of Web User Data Padhraic Smyth, UC Irvine Main Results • Average number of terms in a query ranges from a low of 2.2 to a high of 2.6 • The most common number of terms in a query is 2 • The majority of users don’t refine their query – The number of users who viewed only a single page increase 29% (1997) to 51% (2001) (Excite) – 85% of users viewed only first page of search results (AltaVista) • 45% (2001) of queries are about Commerce, Travel, Economy, People (was 20% in 1997) • All four studies produced a generally consistent set of findings about user behavior in a search engine context – most users view relatively few pages per query – most users don’t use advanced search features Data Mining Lectures Analysis of Web User Data Padhraic Smyth, UC Irvine Xie and O Halloran Study (2002) - Query Length Distributions (bars) - Poisson Model (dots & lines) Data Mining Lectures Analysis of Web User Data Padhraic Smyth, UC Irvine Power-law Characteristics of Common Queries Power-Law in log-log space • Frequency f(r) of Queries with Rank r – 110000 queries from Vivisimo – 1.9 Million queries from Excite • There are strong regularities in terms of patterns of behavior in how we search the Web Data Mining Lectures Analysis of Web User Data Padhraic Smyth, UC Irvine Outline • Basic concepts in Web log data analysis • Predictive modeling of Web navigation behavior – Markov modeling methods • Analyzing search engine data • Ecommerce aspects of Web log mining Data Mining Lectures Analysis of Web User Data Padhraic Smyth, UC Irvine Ecommerce Data • Page request Web logs combined with – – – – Purchase (market-basket) information User address information (if they make a purchase) Demographics information (can be purchased) Emails to/from the customer • Main focus here is to increase revenue – Data mining widely used an online commerce companies like Amazon • This is a very rich source of problems for data mining – – – – – Data Mining Lectures What products should we advertise to this person? Can we do dynamic pricing? If a person buys X should we also suggest Y? Who are our best customers? etc Analysis of Web User Data Padhraic Smyth, UC Irvine Combining Data Sources • Comprehensive collection of US consumer and telephone data available via the internet – – – – Multi-sourced database Demographic, socioeconomic, and lifestyle information. Information on most U.S. households Contributors’ files refreshed a minimum of 3-12 times per year. – Data sources include: County Real Estate Property Records, U.S. Telephone Directories, Public Information, Motor Vehicle Registrations, Census Directories, Credit Grantors, Public Records and Consumer Data, Driver’s Licenses, Voter Registrations, Product Registration Questionnaires, Catalogers, Magazines, Specialty Retailers, Packaged Goods Manufacturers, Accounts Receivable Files, Warranty Cards • Much of this data can be accessed in real-time once a customer self-identifies • Much data mining work in this area is not published Data Mining Lectures Analysis of Web User Data Padhraic Smyth, UC Irvine To be continued….. Data Mining Lectures Analysis of Web User Data Padhraic Smyth, UC Irvine

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Download Data Mining Lecture 1: Introduction to Data Mining