ICS 278: Data Mining
Lecture 17: Web Log Mining
Padhraic Smyth
Department of Information and Computer Science
University of California, Irvine
Data Mining Lectures
Lecture 17: Web Log Mining
Padhraic Smyth, UC Irvine
Outline
• Basic concepts in Web log data analysis
• Predictive modeling of Web navigation behavior
– Markov modeling methods
• Analyzing search engine data
• Ecommerce aspects of Web log mining
Introduction
• Useful to study human digital behavior, e.g. search engine data can
be used for
– Exploration e.g. # of queries per session?
– Modeling e.g. any time of day dependence?
– Prediction e.g. which pages are relevant?
• Applications
– Understand social implications of Web usage
– Design of better tools for information access
– E-commerce applications
How our Web navigation is recorded…
• Web logs
– Record activity between client browser and a specific Web server
– Easily available
– Can be augmented with cookies (provide notion of “state”)
• Search engine records
– Text in queries, which responses were clicked on, etc.
• Client-side browsing records
– Produced for research purposes as part of a study
– Automatically recorded by client-side software
– Harder to obtain, but much more accurate than server-side logs
• Other sources
– Web site registration, purchases, email, etc.
– ISP recording of Web browsing
Web Server Log Files
• Server Transfer Log:
– transactions between a browser and server are logged
– IP address, the time of the request
– Method of the request (GET, HEAD, POST…)
– Status code, a response from the server
– Size in bytes of the transaction
• Referrer Log:
– where the request originated
• Agent Log:
– browser software making the request (or spider)
• Error Log:
– requests that resulted in errors (e.g., 404)
W3C Extended Log File Format
Field               Appears as         Description
Date                date               The date that the activity occurred
Time                time               The time that the activity occurred
Client IP address   c-ip               The IP address of the client that accessed your server
User Name           cs-username        The name of the authenticated user who accessed your server; anonymous users are represented by a hyphen (-)
Service Name        s-sitename         The Internet service and instance number that was accessed by a client
Server Name         s-computername     The name of the server on which the log entry was generated
Server IP Address   s-ip               The IP address of the server on which the log entry was generated
Server Port         s-port             The port number the client is connected to
Method              cs-method          The action the client was trying to perform
URI Stem            cs-uri-stem        The resource accessed
URI Query           cs-uri-query       The query, if any, the client was trying to perform
Protocol Status     sc-status          The status of the action, in HTTP or FTP terms
Win32 Status        sc-win32-status    The status of the action, in terms used by Microsoft Windows
Bytes Sent          sc-bytes           The number of bytes sent by the server
Bytes Received      cs-bytes           The number of bytes received by the server
Time Taken          time-taken         The duration of time, in milliseconds, that the action consumed
Protocol Version    cs-version         The protocol (HTTP, FTP) version used by the client
Host                cs-host            The content of the host header
User Agent          cs(User-Agent)     The browser used on the client
Cookie              cs(Cookie)         The content of the cookie sent or received, if any
Referrer            cs(Referer)        The previous site visited by the user. This site provided a link to the current site
Prefixes: s = server actions, c = client actions, cs = client-to-server actions, sc = server-to-client actions
Example of Web Log entries
Apache web log:
205.188.209.10 - - [29/Mar/2002:03:58:06 -0800] "GET /~sophal/whole5.gif HTTP/1.0" 200 9609 "http://www.csua.berkeley.edu/~sophal/whole.html" "Mozilla/4.0 (compatible; MSIE 5.0; AOL 6.0; Windows 98; DigExt)"
216.35.116.26 - - [29/Mar/2002:03:59:40 -0800] "GET /~alexlam/resume.html HTTP/1.0" 200 2674 "-" "Mozilla/5.0 (Slurp/cat; [email protected]; http://www.inktomi.com/slurp.html)"
202.155.20.142 - - [29/Mar/2002:03:00:14 -0800] "GET /~tahir/indextop.html HTTP/1.1" 200 3510 "http://www.csua.berkeley.edu/~tahir/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"
202.155.20.142 - - [29/Mar/2002:03:00:14 -0800] "GET /~tahir/animate.js HTTP/1.1" 200 14261 "http://www.csua.berkeley.edu/~tahir/indextop.html" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"
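Entries like these can be split into fields with a regular expression. The sketch below assumes the Apache "combined" layout (host, identity, user, time, request, status, bytes, referrer, agent); the pattern and the name `parse_log_line` are illustrative, not part of the lecture.

```python
import re
from datetime import datetime

# Assumed layout: host ident user [time] "request" status bytes "referrer" "agent"
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

def parse_log_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    entry = match.groupdict()
    entry["status"] = int(entry["status"])
    entry["size"] = 0 if entry["size"] == "-" else int(entry["size"])
    entry["time"] = datetime.strptime(entry["time"], "%d/%b/%Y:%H:%M:%S %z")
    return entry

example = ('202.155.20.142 - - [29/Mar/2002:03:00:14 -0800] '
           '"GET /~tahir/indextop.html HTTP/1.1" 200 3510 '
           '"http://www.csua.berkeley.edu/~tahir/" '
           '"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"')
entry = parse_log_line(example)
```

Lines that do not follow the assumed layout return `None` rather than raising, so malformed entries can be counted and skipped.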
Routine Server Log Analysis
• Most and least visited web pages
• Entry and exit pages
• Referrals from other sites or search engines
• Which keywords were searched
• How many clicks/page views a page received
• Error reports, like broken links
Visualization of Web Log Data over Time
Server Log Analysis
Descriptive Summary Statistics
• Histograms, scatter plots, time-series plots
– Very important!
– Helps to understand the big picture
– Provides “marginal” context for any model-building
• models aggregate behavior, not individuals
– Challenging for Web log data
• Examples
– Session lengths (e.g., power laws)
– Click rates as a function of time, content
[Figure: log-log plot of the empirical frequency of L versus session length L, where L = number of page requests in a single session from visitors to www.ics.uci.edu over 1 week in November 2002 (robots removed).]
Best fit of simple power law model
$\log P(L) = -a \log L + b$
or
$P(L) = c \, L^{-a}$, where $\log c = b$
[Figure: log-log plot of the probability of L versus session length L, with the best-fit power-law line.]
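One simple way to obtain such a fit is ordinary least squares on the log-transformed data. The lecture does not say how its fit was computed, so the sketch below is only one possible approach; `fit_power_law` and the synthetic data are illustrative assumptions.

```python
import numpy as np

def fit_power_law(lengths, freqs):
    """Least-squares fit of log10 P(L) = -a * log10 L + b; returns (a, b)."""
    log_l = np.log10(np.asarray(lengths, dtype=float))
    log_p = np.log10(np.asarray(freqs, dtype=float))
    slope, intercept = np.polyfit(log_l, log_p, 1)  # straight line in log-log space
    return -slope, intercept

# Synthetic frequencies that follow P(L) = 0.5 * L^-2 exactly (illustrative only).
lengths = np.arange(1, 101).astype(float)
freqs = 0.5 * lengths ** -2.0
a, b = fit_power_law(lengths, freqs)
```

On real session-length counts the tail is noisy, so a least-squares fit in log-log space should be read as a rough summary rather than a careful estimate.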
[Figure: log-log plot of the probability of L versus session length L, comparing fitted Poisson, geometric, inverse Gaussian, and power-law models.]
Web data measurement issues
• Important to understand how data is collected
• Web data is collected automatically via software logging tools
– Advantage:
• No manual supervision required
– Disadvantage:
• Data can be skewed (e.g. due to the presence of robot traffic)
• Important to identify robots (also known as crawlers, spiders)
A time-series plot of ICS Website data
Number of page requests per hour as a function of time from page
requests in the www.ics.uci.edu Web server logs during the first week of
April 2002.
Robot / human identification
• Robot requests identified by classifying page requests using a
variety of heuristics
– e.g. some robots identify themselves in the server logs (e.g., by requesting robots.txt)
– Robots tend to explore the entire website in breadth-first fashion
– Humans tend to access web pages in depth-first fashion
• Tan and Kumar (2002) discuss more techniques
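The self-identification heuristic can be sketched in a few lines. The keyword list and the function name `looks_like_robot` are assumptions made for illustration; a real classifier would combine several heuristics, as Tan and Kumar discuss.

```python
# Hypothetical keyword list; real robot lists are much longer.
ROBOT_AGENT_HINTS = ("bot", "crawler", "spider", "slurp")

def looks_like_robot(requests):
    """requests: list of (path, user_agent) pairs from one client.

    Flags the client if it fetched robots.txt or if its user-agent
    string contains a known robot keyword.
    """
    for path, agent in requests:
        if path == "/robots.txt":
            return True
        if any(hint in agent.lower() for hint in ROBOT_AGENT_HINTS):
            return True
    return False

crawler = [("/~alexlam/resume.html", "Mozilla/5.0 (Slurp/cat; http://www.inktomi.com/slurp.html)")]
human = [("/~tahir/indextop.html", "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)")]
```

This catches only robots that announce themselves; the breadth-first versus depth-first heuristic requires looking at the navigation pattern of the whole session.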
Page requests, caching, and proxy servers
• In theory, the requesting browser requests a page from a Web server
and the request is processed
• In practice, there are
– Other users
– Browser caching
– Dynamic addressing in local network
– Proxy Server caching
Page requests, caching, and proxy servers
A graphical summary of how page requests from an individual user can be
masked at various stages between the user’s local computer and the Web
server.
Page requests, caching, and proxy servers
• Web server logs are therefore far from ideal as a complete
and faithful representation of individual page views
• There are heuristics to try to infer the true actions of the user:
– Path completion (Cooley et al. 1999)
• e.g. if the site is known to have a link B -> F but no link C -> F, then session ABCF can be interpreted
as ABCBF
– See Anderson et al. 2001 for more heuristics
• In the general case, it is hard to know what the user viewed
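The path-completion idea can be sketched as follows. This is a simplified reading of the heuristic, not the exact procedure of Cooley et al.: it assumes the user pressed "back" to the most recent earlier page that links to the requested page, and that those cached views were not logged. The function name and the `links` structure are illustrative.

```python
def complete_path(session, links):
    """Insert inferred 'back' steps into a logged session.

    session: list of pages in logged order.
    links:   dict mapping a page to the set of pages it links to.
    """
    if not session:
        return []
    completed = [session[0]]
    stack = [session[0]]  # pages the user could return to via the back button
    for page in session[1:]:
        if page not in links.get(stack[-1], set()):
            # Search backwards for the most recent page that links to `page`.
            for depth in range(len(stack) - 2, -1, -1):
                if page in links.get(stack[depth], set()):
                    # Add the unlogged revisits, from most recent to stack[depth].
                    completed.extend(reversed(stack[depth:-1]))
                    del stack[depth + 1:]
                    break
        completed.append(page)
        stack.append(page)
    return completed

site_links = {"A": {"B"}, "B": {"C", "F"}, "C": set()}
result = "".join(complete_path(list("ABCF"), site_links))
```

If no earlier page links to the requested page, the sketch leaves the session unchanged at that step, since the request may have come from a bookmark or a typed URL.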
Identifying individual users from Web server logs
• Useful to associate specific page requests to specific individual users
• IP address most frequently used
• Disadvantages
– One IP address can belong to several users
– Dynamic allocation of IP address
• Better to use cookies
– Information in the cookie can be accessed by the Web server to identify
an individual user over time
– Actions by the same user during different sessions can be linked
together
Identifying individual users from Web server logs
• Commercial websites use cookies extensively
• 97% of users have cookies enabled permanently on their browsers
(source: Amazon.com, 2003)
• However …
– There are privacy issues – need implicit user cooperation
– Cookies can be deleted / disabled
• Another option is to enforce user registration
– High reliability
– Can discourage potential visitors
Sessionizing
• Time oriented (robust)
– E.g., by gaps between requests
• not more than 25 minutes between successive requests
• Navigation oriented (good for short sessions and when timestamps
unreliable)
– Referrer is previous page in session, or
– Referrer is undefined but request within 10 secs, or
– Link from previous to current page in web site
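The time-oriented rule can be sketched as below, using the 25-minute gap from the slide. The function name `sessionize` and the example timestamps are illustrative, and the input is assumed to be the request times of a single, already identified user.

```python
from datetime import datetime, timedelta

def sessionize(timestamps, max_gap=timedelta(minutes=25)):
    """Split one user's request times into sessions.

    A new session starts whenever the gap since the previous
    request exceeds max_gap.
    """
    sessions = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][-1] <= max_gap:
            sessions[-1].append(ts)
        else:
            sessions.append([ts])
    return sessions

start = datetime(2002, 3, 29, 10, 0)
times = [start + timedelta(minutes=m) for m in (0, 10, 50, 55)]
sessions = sessionize(times)
```

Here the 40-minute gap between the second and third request starts a new session, giving two sessions of two requests each.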
Client-side data
• Advantages of collecting data at the client side:
– Direct recording of page requests (eliminates ‘masking’ due to caching)
– Recording of all browser-related actions by a user (including visits to
multiple websites)
– More-reliable identification of individual users (e.g. by login ID for
multiple users on a single computer)
• Preferred mode of data collection for studies of navigation behavior on the
Web
• Companies like comScore and Nielsen use client-side software to track
home computer users
Client-side data
• Statistics like ‘Time per session’ and ‘Page-view duration’ are more
reliable in client-side data
• Some limitations
– Still some statistics like ‘Page-view duration’ cannot be totally reliable
e.g. user might go to fetch coffee
– Need explicit user cooperation
– Typically recorded on home computers – may not reflect a complete
picture of Web browsing behavior
• Web surfing data can be collected at intermediate points like ISPs,
proxy servers
– Can be used to create user profiles and target advertising
Early studies from 1995 to 1997
• Earliest studies on client-side data are Catledge and Pitkow (1995) and
Tauscher and Greenberg (1997)
• In both studies, data was collected by logging Web browser commands
• Population consisted of faculty, staff and students
• Both studies found
– clicking on hypertext anchors was the most common action
– using the ‘back button’ was the second most common action
Early studies from 1995 to 1997
• high probability of page revisitation (~0.58-0.61)
• Lower bound because the page requests prior to the start of the studies are
not accounted for
• Humans are creatures of habit?
• Content of the pages changed over time?
• strong recency (page that is revisited is usually the page that was
visited in the recent past) effect
• Correlates with the ‘back button’ usage
• Similar repetitive actions are found in telephone number dialing etc
The Cockburn and McKenzie study from 2002
• Previous studies are relatively old
• Web has changed dramatically in the past few years
• Cockburn and McKenzie (2002) provides a more up-to-date analysis
• Study found revisitation rates (~0.81) higher than the past 94 and 95 studies
– Analyzed the daily history.dat files produced by the Netscape browser for 17
users for about 4 months
– Population studied consisted of faculty, staff and graduate students
– Time-window is three times that of past studies
The Cockburn and McKenzie study from 2002
• Revisitation rate less biased than the previous studies?
• Human behavior changed from an exploratory mode to a utilitarian
mode?
– The more pages a user visits, the more requests there are for new pages
– The most frequently requested page for each user can account for a
relatively large fraction of his/her page requests
• Useful to see the scatter plot of the distinct number of pages
requested per user versus the total pages requested
The Cockburn and McKenzie study from 2002
The total number of pages requested versus page vocabulary size (number of distinct pages) for each
of the 17 users in the Cockburn and McKenzie (2002) study (log-log plot)
The Cockburn and McKenzie study from 2002
Bar chart of the ratio of the number of page requests for the most frequent
page divided by the total number of page requests, for 17 users in the
Cockburn and McKenzie (2002) study
Outline
• Basic concepts in Web log data analysis
• Predictive modeling of Web navigation behavior
– Markov modeling methods
• Analyzing search engine data
• Ecommerce aspects of Web log mining
Markov models for page prediction
• General approach is to use a finite-state Markov chain
– Each state can be a specific Web page or a category of Web pages
– If only interested in the order of visits (and not in time), each new
request can be modeled as a transition of states
• Issues
– Self-transition
– Time-independence
Markov models for page prediction
• For simplicity, consider order-dependent, time-independent finite-state
Markov chain with M states
• Let s be a sequence of observed states of length L, e.g. s =
ABBCAABBCCBBAA with three states A, B and C. $s_t$ is the state at position t
($1 \le t \le L$). In general,
$$P(s) = P(s_1) \prod_{t=2}^{L} P(s_t \mid s_{t-1}, \ldots, s_1)$$
• First-order Markov assumption:
$$P(s_t \mid s_{t-1}, \ldots, s_1) = P(s_t \mid s_{t-1})$$
$$P(s) = P(s_1) \prod_{t=2}^{L} P(s_t \mid s_{t-1})$$
• This provides a simple generative model to produce sequential data
Markov models for page prediction
• If we denote $T_{ij} = P(s_t = j \mid s_{t-1} = i)$, we can define an M x M transition matrix
• Properties
– Strong first-order assumption
– Simple way to capture sequential dependence
• If each page is a state and there are W pages, the transition matrix has $O(W^2)$ entries; W can be of the order $10^5$ to
$10^6$ for a CS dept. of a university
• To alleviate this, we can cluster the W pages into M clusters, each assigned a state
in the Markov model
• Clustering can be done manually, based on directory structure on the Web
server, or automatically using clustering techniques
Markov models for page prediction
• $T_{ij} = P(s_t = j \mid s_{t-1} = i)$ represents the probability that an individual user’s
next request will be from category j, given they were in category i
• We can add E, an end-state to the model
• E.g. for three categories with end state:
$$T = \begin{pmatrix}
P(1 \mid 1) & P(2 \mid 1) & P(3 \mid 1) & P(E \mid 1) \\
P(1 \mid 2) & P(2 \mid 2) & P(3 \mid 2) & P(E \mid 2) \\
P(1 \mid 3) & P(2 \mid 3) & P(3 \mid 3) & P(E \mid 3) \\
0 & 0 & 0 & 1
\end{pmatrix}$$
• E denotes the end of a sequence, and start of a new sequence
Markov models for page prediction
• First-order Markov model assumes that the next state is based only on the
current state
• Limitations
– Doesn’t consider ‘long-term memory’
• We can try to capture more memory with kth-order Markov chain
$$P(s_t \mid s_{t-1}, \ldots, s_1) = P(s_t \mid s_{t-1}, \ldots, s_{t-k})$$
• Limitations
– Inordinate amount of training data needed: $O(M^{k+1})$ parameters
Parameter estimation for Markov model transitions
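• Smoothed parameter estimates of transition probabilities are
$$T_{ij} = \frac{n_{ij} + \alpha q_{ij}}{n_i + \alpha}$$
where $n_{ij}$ is the number of observed transitions from i to j, $n_i = \sum_j n_{ij}$, $q_{ij}$ is the prior probability of the transition, and $\alpha$ is the weight given to the prior
• If $n_{ij} = 0$ for some transition (i, j) then instead of having a parameter
estimate of 0 (ML), we will have $\alpha q_{ij} / (n_i + \alpha)$, allowing prior knowledge to
be incorporated
• If $n_{ij} > 0$, we get a smooth combination of the data-driven information ($n_{ij}$)
and the prior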
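The smoothed estimate can be computed directly from transition counts. In the sketch below, states are integer indices and the prior defaults to a uniform distribution over next states; both choices, and the name `smoothed_transitions`, are assumptions for illustration.

```python
import numpy as np

def smoothed_transitions(sequences, n_states, prior=None, alpha=1.0):
    """Return T with T[i, j] = (n_ij + alpha * q_ij) / (n_i + alpha)."""
    counts = np.zeros((n_states, n_states))
    for seq in sequences:
        for i, j in zip(seq, seq[1:]):
            counts[i, j] += 1  # n_ij: observed transitions i -> j
    if prior is None:
        # Assumed default: uniform prior q_ij = 1 / M.
        prior = np.full((n_states, n_states), 1.0 / n_states)
    row_totals = counts.sum(axis=1, keepdims=True)  # n_i
    return (counts + alpha * prior) / (row_totals + alpha)

# The example sequence ABBCAABBCCBBAA with A=0, B=1, C=2.
seq = [0, 1, 1, 2, 0, 0, 1, 1, 2, 2, 1, 1, 0, 0]
T = smoothed_transitions([seq], n_states=3, alpha=1.0)
```

In this sequence A is never followed by C, so the maximum-likelihood estimate would be 0; with the uniform prior and alpha = 1 the smoothed value is (0 + 1/3) / (4 + 1) = 1/15.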
Parameter estimation for Markov models
• One simple way to set prior parameter is
– Consider alpha as the effective sample size
– Partition the states into two sets, set 1 containing all states directly
linked to state i and the remaining in set 2
– Assign uniform probability r/K to all states in set 2 (all set 2 states are
equally likely)
– The remaining (1-r) can be either uniformly assigned among set 1
elements or weighted by some measure
– Prior probabilities in and out of E can be set based on our prior
knowledge of how likely we think a user is to exit the site from a
particular state
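The two-set construction of the prior can be sketched as follows. It assumes that K in r/K is the number of states in set 2, so that set 2 receives total mass r, and that the remaining 1 - r is spread uniformly over set 1; the name `prior_row` and the fallback to a uniform prior are illustrative choices.

```python
def prior_row(n_states, linked, r=0.1):
    """Prior q_i over next states for a state i.

    linked: indices of states directly linked from state i (set 1);
            all other states form set 2.
    """
    set1 = set(linked)
    set2 = [j for j in range(n_states) if j not in set1]
    if not set1 or not set2:
        # Assumed fallback when one of the sets is empty.
        return [1.0 / n_states] * n_states
    q = [0.0] * n_states
    for j in set2:
        q[j] = r / len(set2)          # total mass r shared by unlinked states
    for j in set1:
        q[j] = (1.0 - r) / len(set1)  # remaining mass shared by linked states
    return q

q = prior_row(5, linked={1, 2}, r=0.1)
```

Rows built this way can be stacked into the prior matrix and combined with the observed counts, with alpha acting as the effective sample size.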
Predicting page requests with Markov models
• Deshpande and Karypis (2001) propose schemes to prune kth-order Markov
state space
– Provide systematic but modest improvements
• Another way is to use empirical smoothing techniques that combine
different models from order 1 to order k (Chen and Goodman 1996)
Mixtures of Markov Chains
• Cadez et al. (2003) and Sen and Hansen (2003) replace the first-order
Markov chain:
$$P(s_t \mid s_{t-1}, \ldots, s_1) = P(s_t \mid s_{t-1})$$
with a mixture of first-order Markov chains
$$P(s_t \mid s_{t-1}, \ldots, s_1) = \sum_{k=1}^{K} P(s_t \mid s_{t-1}, c = k) \, P(c = k)$$
where c is a discrete-valued hidden variable taking K values, $\sum_k P(c = k) = 1$,
and $P(s_t \mid s_{t-1}, c = k)$ is the transition matrix for the kth mixture component
• One interpretation of this is that user behavior consists of K different navigation
behaviors described by the K Markov chains
Modeling Web Page Requests with Markov chain
mixtures
• MSNBC Web logs
– 2 million individuals per day
– different session lengths per individual
– difficult visualization and clustering problem
• WebCanvas
– uses mixtures of Markov chains to cluster individuals based on their
observed sequences
– software tool: EM mixture modeling + visualization
From Web logs to sequences
128.195.36.195, -, 3/22/00, 10:35:11, W3SVC, SRVR1, 128.200.39.181, 781, 363, 875, 200, 0, GET, /top.html, -,
128.195.36.195, -, 3/22/00, 10:35:16, W3SVC, SRVR1, 128.200.39.181, 5288, 524, 414, 200, 0, POST, /spt/main.html, -,
128.195.36.195, -, 3/22/00, 10:35:17, W3SVC, SRVR1, 128.200.39.181, 30, 280, 111, 404, 3, GET, /spt/images/bk1.jpg, -,
128.195.36.101, -, 3/22/00, 16:18:50, W3SVC, SRVR1, 128.200.39.181, 60, 425, 72, 304, 0, GET, /top.html, -,
128.195.36.101, -, 3/22/00, 16:18:58, W3SVC, SRVR1, 128.200.39.181, 8322, 527, 414, 200, 0, POST, /spt/main.html, -,
128.195.36.101, -, 3/22/00, 16:18:59, W3SVC, SRVR1, 128.200.39.181, 0, 280, 111, 404, 3, GET, /spt/images/bk1.jpg, -,
128.200.39.17, -, 3/22/00, 20:54:37, W3SVC, SRVR1, 128.200.39.181, 140, 199, 875, 200, 0, GET, /top.html, -,
128.200.39.17, -, 3/22/00, 20:54:55, W3SVC, SRVR1, 128.200.39.181, 17766, 365, 414, 200, 0, POST, /spt/main.html, -,
128.200.39.17, -, 3/22/00, 20:54:55, W3SVC, SRVR1, 128.200.39.181, 0, 258, 111, 404, 3, GET, /spt/images/bk1.jpg, -,
128.200.39.17, -, 3/22/00, 20:55:07, W3SVC, SRVR1, 128.200.39.181, 0, 258, 111, 404, 3, GET, /spt/images/bk1.jpg, -,
128.200.39.17, -, 3/22/00, 20:55:36, W3SVC, SRVR1, 128.200.39.181, 1061, 382, 414, 200, 0, POST, /spt/main.html, -,
128.200.39.17, -, 3/22/00, 20:55:36, W3SVC, SRVR1, 128.200.39.181, 0, 258, 111, 404, 3, GET, /spt/images/bk1.jpg, -,
128.200.39.17, -, 3/22/00, 20:55:39, W3SVC, SRVR1, 128.200.39.181, 0, 258, 111, 404, 3, GET, /spt/images/bk1.jpg, -,
128.200.39.17, -, 3/22/00, 20:56:03, W3SVC, SRVR1, 128.200.39.181, 1081, 382, 414, 200, 0, POST, /spt/main.html, -,
128.200.39.17, -, 3/22/00, 20:56:04, W3SVC, SRVR1, 128.200.39.181, 0, 258, 111, 404, 3, GET, /spt/images/bk1.jpg, -,
128.200.39.17, -, 3/22/00, 20:56:33, W3SVC, SRVR1, 128.200.39.181, 0, 262, 72, 304, 0, GET, /top.html, -,
128.200.39.17, -, 3/22/00, 20:56:52, W3SVC, SRVR1, 128.200.39.181, 19598, 382, 414, 200, 0, POST, /spt/main.html, -,
[Figure: the log entries are converted into one sequence of page-category indices per user (User 1, User 2, User 3, User 4, User 5, …).]
Clusters of Finite State Machines
[Figure: three clusters (Cluster 1, Cluster 2, Cluster 3), each drawn as a finite state machine over the states A, B, D, E.]
Learning Problem
• Assumptions
– data is being generated by K different groups
– Each group is described by a stochastic finite state machine (SFSM)
• aka, a Markov model with an end-state
• Given
– A set of sequences from different users of different lengths
• Learn
– A “mixture” of K different stochastic finite state machines
• Solution
– EM is very easy: fractional counts of transitions
– efficient and accurate, scales as O(KN)
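The "fractional counts" idea can be sketched with a small EM loop. This is a simplified illustration, not the WebCanvas implementation: it omits the end-state, uses integer state indices, initializes randomly, and adds a small pseudo-count (an assumption) to keep probabilities away from zero.

```python
import numpy as np

def em_markov_mixture(sequences, n_states, n_components, n_iter=50, seed=0):
    """EM for a mixture of first-order Markov chains (no end-state)."""
    rng = np.random.default_rng(seed)
    K, M = n_components, n_states
    weights = np.full(K, 1.0 / K)                    # P(c = k)
    initial = rng.dirichlet(np.ones(M), size=K)      # K x M
    trans = rng.dirichlet(np.ones(M), size=(K, M))   # K x M x M
    log_liks = []
    for _ in range(n_iter):
        # E-step: log P(sequence, c = k) for every sequence and component.
        log_r = np.zeros((len(sequences), K))
        for n, seq in enumerate(sequences):
            log_r[n] = np.log(weights) + np.log(initial[:, seq[0]])
            for i, j in zip(seq, seq[1:]):
                log_r[n] += np.log(trans[:, i, j])
        log_norm = np.logaddexp.reduce(log_r, axis=1)
        log_liks.append(log_norm.sum())
        resp = np.exp(log_r - log_norm[:, None])     # membership probabilities
        # M-step: each sequence adds fractional counts to every component.
        weights = resp.mean(axis=0)
        init_counts = np.full((K, M), 1e-3)          # assumed pseudo-count
        trans_counts = np.full((K, M, M), 1e-3)
        for n, seq in enumerate(sequences):
            init_counts[:, seq[0]] += resp[n]
            for i, j in zip(seq, seq[1:]):
                trans_counts[:, i, j] += resp[n]
        initial = init_counts / init_counts.sum(axis=1, keepdims=True)
        trans = trans_counts / trans_counts.sum(axis=2, keepdims=True)
    return weights, initial, trans, log_liks

# Toy data: three sequences dominated by state 0, three by state 2.
toy = [[0, 0, 0, 0, 1], [0, 0, 0, 0], [0, 0, 1, 0, 0],
       [2, 2, 2, 2], [2, 2, 1, 2, 2], [2, 2, 2, 1, 2]]
weights, initial, trans, log_liks = em_markov_mixture(toy, n_states=3, n_components=2, n_iter=20)
```

Each iteration touches every transition once per component, which matches the O(KN) scaling quoted on the slide when N is the total number of transitions.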
Experimental Methodology
• Model Training:
– fit 2 types of models
• mixtures of histograms
• mixtures of finite state machines
– Train on a full day’s worth of MSNBC Web data
• Model Evaluation:
– “one-step-ahead” prediction on unseen test data
• Test sequences from a different day of Web logs
– negative log probability = predictive entropy
Predictive Entropy Out-of-Sample
[Figure: out-of-sample negative log-likelihood (bits/token) versus number of mixture components K, for mixtures of multinomials and mixtures of SFSMs.]
RUN LENGTH DISTRIBUTIONS WITHIN MARKOV CLUSTERS
[Figure: grid of panels plotting log count(R) against R = run length, one panel per cluster and category. Cluster 1: Categories 8, 13, 14; Cluster 2: Categories 1, 7, 8; Cluster 3: Categories 1, 12, 13; Cluster 4: Categories 1, 2, 3; Cluster 5: Categories 6, 9, 12.]
Timing Results
[Figure: time in seconds versus number of mixture components K, with one curve each for N = 70,000, N = 110,000, and N = 150,000.]
WebCanvas
• Software tool for Web log visualization
– uses Markov mixtures to cluster data for display
– in use by msnbc.com administrators at Microsoft
– also being applied to non-Web data
• Model-based visualization
– random sample of actual sequences
– interactive tiled windows displayed for visualization
– more effective than
• planar graphs
• traffic-flow movie in Microsoft Site Server v3.0
WebCanvas: Cadez, Heckerman, et al, 2003
Insights from WebCanvas
• From msnbc.com site administrators…
– significant heterogeneity of behavior
– relatively focused activity of many users
• typically only 1 or 2 categories of pages
– many individuals not entering via main page
– detected problems with the weather page
– missing transitions (e.g., tech <=> business)
Extensions
• Adding time-dependence
– adding time-between clicks, time of day effects
• Uncategorized Web pages
– coupling page content with sequence models
• Modeling “switching” behaviors
– allowing users to switch between models
• Individualized weights (hierarchical Bayes)
• Update: WebCanvas tool will be part of 2004 SQLServer release
Prediction with Markov mixtures
$$P(s_{t+1} \mid s_{[1,t]}) = \sum_{k} P(s_{t+1}, k \mid s_{[1,t]})$$
$$= \sum_{k} P(s_{t+1} \mid k, s_{[1,t]}) \, P(k \mid s_{[1,t]})$$
$$= \sum_{k} P(s_{t+1} \mid k, s_t) \, P(k \mid s_{[1,t]})$$
– $P(s_{t+1} \mid k, s_t)$: prediction of the kth component
– $P(k \mid s_{[1,t]})$: membership, based on sequence history
=> Predictions are a convex combination of
K different component transition matrices,
with weights based on sequence history
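The convex combination can be sketched directly: compute the membership weights from the sequence history, then average the component transition rows for the current state. The function name and the two-component parameters below are made-up illustrative values.

```python
import numpy as np

def predict_next(seq, weights, initial, trans):
    """Return the vector P(s_{t+1} | s_[1,t]) for a mixture of Markov chains.

    weights: (K,) mixture weights; initial: (K, M); trans: (K, M, M).
    """
    # log P(s_[1,t], k) for each component k.
    log_post = np.log(weights) + np.log(initial[:, seq[0]])
    for i, j in zip(seq, seq[1:]):
        log_post = log_post + np.log(trans[:, i, j])
    # Normalize to get the membership weights P(k | s_[1,t]).
    post = np.exp(log_post - np.logaddexp.reduce(log_post))
    # Weighted average of each component's row for the current state s_t.
    return post @ trans[:, seq[-1], :]

# Illustrative parameters: one "sticky" and one "alternating" component.
weights = np.array([0.5, 0.5])
initial = np.array([[0.5, 0.5], [0.5, 0.5]])
trans = np.array([[[0.9, 0.1], [0.1, 0.9]],
                  [[0.1, 0.9], [0.9, 0.1]]])
p_next = predict_next([0, 0, 0], weights, initial, trans)
```

After observing 0, 0, 0 the history strongly favours the sticky component, so the prediction for staying in state 0 is close to that component's 0.9.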
Related Work
• Mixtures of Markov chains
– special case: Poulsen (1990)
– general case: Ridgeway (1997), Smyth (1997)
• Clustering of Web page sequences
– non-probabilistic approaches (Fu et al, 1999)
• Markov models for prediction
– Anderson et al (IJCAI, 2001):
• mixtures of Markov chains outperform other sequential models for page-request
prediction
Predicting page requests with Markov models
• K can be chosen by evaluating the out-of-sample predictive
performance based on
– Accuracy of prediction
– Log probability score
– Entropy
• Other variations:
– Sen and Hansen 2003
– Position-dependent Markov models (Anderson et al. 2001, 2002)
– Zukerman et al. 1999
Modeling Clickrate Data
• Data
– 200k Alexa users, client-side, over 24 hours
– ignore URLs requested
– goal is to build a time-series model that characterizes user click rates
[Figure: number of clicks versus time in hours over the 24-hour period.]
[Figure: number of clicks versus time in hours, zoomed in on hours 5 to 7.]
[Figure: number of clicks versus time in hours over the 24-hour period.]
Markov-Poisson Model
• Doubly stochastic process
– Locally constant Poisson rate
– indexed by M Markov states
• Fit a model with M = 3 states
• absence of a Web session
• Web session with slow click rate: 1 minute rate
• Web session with rapid click rate: 10 second rate
– Used hierarchical Bayes on individuals
Outline
• Basic concepts in Web log data analysis
• Predictive modeling of Web navigation behavior
– Markov modeling methods
• Analyzing search engine data
• Ecommerce aspects of Web log mining
Analysis of Search Engine Query Logs
Study                              # of Sample Queries              Source SE    Time Period
Lau & Horvitz                      4690 of 1 Million                Excite       Sep 1997
Silverstein et al                  1 Billion                        AltaVista    6 weeks in Aug & Sep 1998
Spink et al (series of studies)    1 Million for each time period   Excite       Sep 1997, Dec 1999, May 2001
Xie & O’Hallaron                   110,000                          Vivisimo     35 days Jan & Feb 2001
                                   1.9 Million                      Excite       8 hrs in a day, Dec 1999
Main Results
• Average number of terms in a query ranges from a low of 2.2 to a
high of 2.6
• The most common number of terms in a query is 2
• The majority of users don’t refine their query
– The fraction of users who viewed only a single page increased from 29% (1997)
to 51% (2001) (Excite)
– 85% of users viewed only first page of search results (AltaVista)
• 45% (2001) of queries are about Commerce, Travel, Economy, People
(was 20% in 1997)
– The queries about adult or entertainment decreased from 20% (1997) to
around 7% (2001)
Xie and O’Hallaron Study (2002)
[Figure: query length distributions (bars) with fitted Poisson model (dots and lines).]
• All four studies produced a generally consistent set of findings about
user behavior in a search engine context
– most users view relatively few pages per query
– most users don’t use advanced search features
Power-law Characteristics of Common Queries
[Figure: frequency f(r) of queries with rank r, showing a power law in log-log space.]
• Frequency f(r) of Queries with Rank r
– 110,000 queries from Vivisimo
– 1.9 Million queries from Excite
• There are strong regularities in terms of patterns of behavior in how we
search the Web
Outline
• Basic concepts in Web log data analysis
• Predictive modeling of Web navigation behavior
– Markov modeling methods
• Analyzing search engine data
• Ecommerce aspects of Web log mining
The next few slides are from Ronny Kohavi, director of data
mining and personalization at Amazon.com. His full set of
slides is available online: see the PPT slides and related
papers on ecommerce and data mining at
http://robotics.stanford.edu/~ronnyk/ronnyk-bib.html
ECommerce
• Page request Web logs combined with
– Purchase (market-basket) information
– User address information (if they make a purchase)
– Demographics information (can be purchased)
– Emails to/from the customer
• Main focus here is to increase revenue
– Data mining is widely used at online commerce companies like Amazon
• This is a very rich source of problems for data mining
– What products should we advertise to this person?
– Can we do dynamic pricing?
– If a person buys X should we also suggest Y?
– Who are our best customers?
– etc.
Combining Data Sources
• Comprehensive collection of US consumer and telephone data available
via the internet
– Multi-sourced database
– Demographic, socioeconomic, and lifestyle information
– Information on most U.S. households
– Contributors’ files refreshed a minimum of 3-12 times per year
– Data sources include: County Real Estate Property Records, U.S. Telephone
Directories, Public Information, Motor Vehicle Registrations, Census Directories, Credit
Grantors, Public Records and Consumer Data, Driver’s Licenses, Voter Registrations,
Product Registration Questionnaires, Catalogers, Magazines, Specialty Retailers,
Packaged Goods Manufacturers, Accounts Receivable Files, Warranty Cards
• Much of this data can be accessed in real-time once a customer
self-identifies
Map of World Wide Revenue
• Although Debenhams online site only ships in
the UK, we see some revenue from the rest of
the world.
– UK: 98.8%
– US: 0.6%
– Australia: 0.1%
[Figure: world map shaded by revenue level (Low, Medium, High).]
NOTE: About 50% of the non-UK
orders are wedding list purchases
Online Consumer Demographics
• Results from Blue-Martini
• People who have a Travel and Entertainment credit card are
48% more likely to be online shoppers (27% for people with
premium credit card)
• People whose home was built after 1990 are 45% more likely
to be online shoppers
• Households with income over $100K are 31% more likely to be
online shoppers
• People under the age of 45 are 17% more
likely to be online shoppers
Demographics - Income
• A higher household income means you are
more likely to be an online shopper
Demographics – Credit Cards
• The more credit cards, the more likely you are to be
an online shopper
Example: Web Traffic
[Figure: Web traffic over time for human and bot traffic. Annotations: weekends; Sept-11 (note significant drop in human traffic, not bot traffic); internal performance bot; registration at search engine sites.]
Product Affinities at MEC
Product                      Association                            Lift   Confidence   Website Recommended Products
Orbit Sleeping Pad           Orbit Stuff Sack                       222    37%          Cygnet Sleeping Bag; Aladdin 2 Backpack; Primus Stove
Bambini Tights Children’s    Bambini Crewneck Sweater Children’s    195    52%          Yeti Crew Neck Pullover Children’s; Beneficial T’s Organic Long Sleeve T-Shirt Kids’
Silk Crew Women’s            Silk Long Johns Women’s                304    73%          Micro Check Vee Sweater; Volant Pants; Composite Jacket
Cascade Entrant Overmitts    Polartec 300 Double Mitts              51     48%          Volant Pants; Windstopper Alpine Hat; Tremblant 575 Vest Women’s
• Minimum support for the associations is 80 customers
• Confidence: 37% of people who purchased Orbit Sleeping Pad also purchased Orbit Stuff Sack
• Lift: People who purchased Orbit Sleeping Pad were 222 times more likely to purchase the Orbit Stuff Sack compared to the general population
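Confidence and lift as defined above can be computed from a list of purchase baskets. The baskets and the function name in this sketch are made up for illustration; they are not MEC data.

```python
def confidence_and_lift(baskets, antecedent, consequent):
    """Confidence and lift of the rule antecedent -> consequent.

    confidence = P(consequent | antecedent)
    lift       = confidence / P(consequent)
    """
    n = len(baskets)
    n_a = sum(1 for b in baskets if antecedent in b)
    n_c = sum(1 for b in baskets if consequent in b)
    n_both = sum(1 for b in baskets if antecedent in b and consequent in b)
    if n_a == 0 or n_c == 0:
        raise ValueError("both items must appear in at least one basket")
    confidence = n_both / n_a
    lift = confidence / (n_c / n)
    return confidence, lift

# Hypothetical baskets, one set of purchased items per customer.
baskets = [{"pad", "sack"}, {"pad"}, {"sack", "stove"},
           {"stove"}, {"pad", "sack", "stove"}, {"tent"}]
confidence, lift = confidence_and_lift(baskets, "pad", "sack")
```

A minimum-support threshold, like the 80 customers used on the slide, would be applied to the count of baskets containing both items before reporting a rule.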
Customer Locations Relative to Retail Stores
• Heavy purchasing areas away from retail
stores can suggest new retail store locations
• No stores in several hot areas:
MEC is building a store in
Montreal right now.
[Figure: map of Canada with store locations. Black dots show store locations.]
Building The Customer Signature
• Building a customer signature is a significant effort, but well worth it
• A signature summarizes customer or visitor behavior across hundreds of
attributes, many of which are specific to the site
• Once a signature is built, it can be used to answer many questions.
• The mining algorithms will pick the most important attributes for each
question
• Example attributes computed:
– Total Visits and Sales
– Revenue by Product Family
– Revenue by Month
– Customer State and Country
– Recency, Frequency, Monetary
– Latitude/Longitude from the Customer’s Postal Code
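As a small illustration of three of these attributes, recency, frequency, and monetary value can be computed per customer from an order list. The record layout, field names, and sample orders below are assumptions; a real signature would carry many more attributes.

```python
from datetime import date

def rfm(orders, today):
    """orders: list of (customer_id, order_date, amount) tuples.

    Returns, per customer: days since the last order (recency),
    number of orders (frequency), and total spend (monetary).
    """
    summary = {}
    for customer, when, amount in orders:
        rec = summary.setdefault(customer, {"last": when, "frequency": 0, "monetary": 0.0})
        rec["last"] = max(rec["last"], when)
        rec["frequency"] += 1
        rec["monetary"] += amount
    return {
        customer: {
            "recency_days": (today - rec["last"]).days,
            "frequency": rec["frequency"],
            "monetary": rec["monetary"],
        }
        for customer, rec in summary.items()
    }

# Hypothetical orders.
orders = [("c1", date(2002, 3, 1), 40.0),
          ("c1", date(2002, 3, 20), 60.0),
          ("c2", date(2002, 1, 15), 25.0)]
signature = rfm(orders, today=date(2002, 3, 30))
```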