Web Usage Mining: Discovery and Applications of Usage Patterns from Web Data
Srivastava J., Cooley R., Deshpande M., Tan P.N.
Appeared in SIGKDD Explorations, Vol. 1, Issue 2, 2000
Web Mining
What is it?
  Data mining efforts associated with the Web
What kinds are there?
  Content Mining
  Structure Mining
  Usage Mining
Web Data
Content
  Ex) texts and graphics
Structure
  Ex) HTML tags
Usage
  Ex) IP address, page reference, date/time
User profile
  Ex) registration data, customer profile
Web Usage Mining
The application of data mining techniques to discover usage patterns from Web data.
Three phases
  Preprocessing
  Pattern discovery
  Pattern analysis
Data Sources
Where can usage data be collected?
Server Level Collections
  The web server log records the browsing behavior of site visitors, but cached page views are not recorded.
  Packet sniffing extracts usage data directly from TCP/IP packets.
Data Sources (contd.)
<Sample Web Server Log>
#  IP Address  Userid  Time  Method/URL/Protocol  Status  Size  Referrer  Agent
1  123.456.78.9 - [25/Apr/1998:03:04:41 -0500] "GET A.html HTTP/1.0" 200 3290 - Mozilla/3.04 (Win95, I)
2  123.456.78.9 - [25/Apr/1998:03:05:34 -0500] "GET B.html HTTP/1.0" 200 2050 A.html Mozilla/3.04 (Win95, I)
3  123.456.78.9 - [25/Apr/1998:03:05:39 -0500] "GET L.html HTTP/1.0" 200 4130 - Mozilla/3.04 (Win95, I)
4  123.456.78.9 - [25/Apr/1998:03:06:02 -0500] "GET F.html HTTP/1.0" 200 5096 B.html Mozilla/3.04 (Win95, I)
5  123.456.78.9 - [25/Apr/1998:03:06:58 -0500] "GET A.html HTTP/1.0" 200 3290 - Mozilla/3.01 (X11, I, IRIX6.2, IP22)
6  123.456.78.9 - [25/Apr/1998:03:07:42 -0500] "GET B.html HTTP/1.0" 200 2050 A.html Mozilla/3.01 (X11, I, IRIX6.2, IP22)
7  123.456.78.9 - [25/Apr/1998:03:07:55 -0500] "GET R.html HTTP/1.0" 200 8140 L.html Mozilla/3.04 (Win95, I)
8  123.456.78.9 - [25/Apr/1998:03:09:50 -0500] "GET C.html HTTP/1.0" 200 1820 A.html Mozilla/3.01 (X11, I, IRIX6.2, IP22)
9  123.456.78.9 - [25/Apr/1998:03:10:02 -0500] "GET O.html HTTP/1.0" 200 2270 F.html Mozilla/3.04 (Win95, I)
10 123.456.78.9 - [25/Apr/1998:03:10:45 -0500] "GET J.html HTTP/1.0" 200 9430 C.html Mozilla/3.01 (X11, I, IRIX6.2, IP22)
11 123.456.78.9 - [25/Apr/1998:03:12:23 -0500] "GET G.html HTTP/1.0" 200 7220 B.html Mozilla/3.04 (Win95, I)
12 209.456.78.2 - [25/Apr/1998:05:05:22 -0500] "GET A.html HTTP/1.0" 200 3290 - Mozilla/3.04 (Win95, I)
13 209.456.78.3 - [25/Apr/1998:05:06:03 -0500] "GET D.html HTTP/1.0" 200 1680 A.html Mozilla/3.04 (Win95, I)
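A minimal sketch of how one line of such a log could be split into its fields. The regular expression follows the column order of the sample log's header row and assumes the referrer field is always present (either a page or `-`); the names `LOG_PATTERN` and `parse_log_line` are made up for this example.

```python
import re
from datetime import datetime

# Assumed field order, taken from the sample log's header row:
# IP, userid, [time], "method URL protocol", status, size, referrer, agent.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) (?P<userid>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+) (?P<referrer>\S+) (?P<agent>.+)'
)

def parse_log_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = LOG_PATTERN.match(line.strip())
    if match is None:
        return None
    record = match.groupdict()
    # Convert the textual fields that have an obvious richer type.
    record["time"] = datetime.strptime(record["time"], "%d/%b/%Y:%H:%M:%S %z")
    record["status"] = int(record["status"])
    record["size"] = int(record["size"])
    return record

# Row 2 of the sample log, without its row number.
sample = ('123.456.78.9 - [25/Apr/1998:03:05:34 -0500] '
          '"GET B.html HTTP/1.0" 200 2050 A.html Mozilla/3.04 (Win95, I)')
record = parse_log_line(sample)
print(record["url"], record["referrer"], record["agent"])
```

The agent is matched last and greedily because it may contain spaces; a line with no referrer field at all would be parsed incorrectly by this pattern.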
Data Sources (contd.)
Client Level Collections
  By using remote agents
    Ex) Java applets (loading overhead), JavaScript (cannot capture all user clicks)
  By modifying the source code of an existing browser
    Ex) Mosaic (hard to convince users to use the modified browser)
Data Sources (contd.)
Proxy Level Collections
  An intermediate level of caching between the web server and client browsers.
  Characterizes the browsing behavior of a group of users sharing a common proxy server.
Data Abstractions

User: a single individual accessing files from one or more Web servers through a browser
Page View: every file displayed on the user's browser at one time
Click Stream: a sequential series of page view requests
User Session: the click stream of page views for a single user across the entire Web
Server Session: the set of page views in a user session for a particular Web site
Episode: any semantically meaningful subset of a user or server session
Web Usage Mining Process
Preprocessing
Usage Preprocessing
  The most difficult task, because the available data (IP address, agent, server-side click stream) are incomplete. Typical ambiguities:
    Single IP address / multiple server sessions
    Multiple IP addresses / single server session
    Multiple IP addresses / single user
    Multiple agents / single user
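One common heuristic for these ambiguities can be sketched as follows: treat each (IP address, agent) pair as a separate user, and start a new session after a long pause. This is only an illustrative sketch. The 30-minute timeout and the names `SESSION_TIMEOUT`, `build_sessions` and `at` are assumptions, and the request list is loosely based on the sample log, with the last entry made up to show the timeout.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Hypothetical inactivity limit; the slides do not fix a value.
SESSION_TIMEOUT = timedelta(minutes=30)

def build_sessions(requests, timeout=SESSION_TIMEOUT):
    """Group (ip, agent, time, url) requests into sessions.

    Heuristic: the pair (ip, agent) stands in for a user, and a gap longer
    than `timeout` between two requests of that user starts a new session.
    """
    by_user = defaultdict(list)
    for ip, agent, time, url in requests:
        by_user[(ip, agent)].append((time, url))

    sessions = []
    for user, hits in by_user.items():
        hits.sort(key=lambda hit: hit[0])  # order each user's requests by time
        current = [hits[0][1]]
        for (prev_time, _), (time, url) in zip(hits, hits[1:]):
            if time - prev_time > timeout:
                sessions.append((user, current))
                current = []
            current.append(url)
        sessions.append((user, current))
    return sessions

def at(clock):
    """Helper for the example: a time of day on 25 Apr 1998."""
    return datetime.strptime("1998-04-25 " + clock, "%Y-%m-%d %H:%M:%S")

WIN95 = "Mozilla/3.04 (Win95, I)"
IRIX = "Mozilla/3.01 (X11, I, IRIX6.2, IP22)"
requests = [
    ("123.456.78.9", WIN95, at("03:04:41"), "A.html"),
    ("123.456.78.9", WIN95, at("03:05:34"), "B.html"),
    ("123.456.78.9", IRIX, at("03:06:58"), "A.html"),
    ("123.456.78.9", IRIX, at("03:07:42"), "B.html"),
    ("123.456.78.9", WIN95, at("05:05:22"), "A.html"),  # made-up late request
]
sessions = build_sessions(requests)
for user, pages in sessions:
    print(user[1], pages)
```

Here one IP address yields three sessions: two for the Win95 agent (split by the two-hour gap) and one for the IRIX agent. The heuristic cannot resolve the opposite cases on the slide, such as one user appearing under several IP addresses.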
Preprocessing (contd.)
Content Preprocessing
  Converting text, images, and scripts into useful forms (e.g. vectors of words)
  Classification or clustering algorithms can be used to filter discovered patterns based on topic or intended use
Structure Preprocessing
  Hyperlinks between page views
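As a rough illustration of the "vectors of words" form mentioned under content preprocessing, the sketch below turns page text into word-count vectors over a shared vocabulary. The page texts and the names `tokenize` and `word_vector` are hypothetical; a real pipeline would likely also strip markup and remove stop words.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())

def word_vector(text, vocabulary):
    """Count how often each vocabulary word occurs in the text."""
    counts = Counter(tokenize(text))
    return [counts[word] for word in vocabulary]

# Hypothetical page texts, for illustration only.
pages = {
    "A.html": "Music store: new music albums",
    "B.html": "Sports equipment store",
}
vocabulary = sorted({word for text in pages.values() for word in tokenize(text)})
vectors = {url: word_vector(text, vocabulary) for url, text in pages.items()}

print(vocabulary)
for url, vector in vectors.items():
    print(url, vector)
```

Vectors of this kind are what a classification or clustering algorithm would take as input when grouping pages by topic.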
Pattern Discovery
Statistical Analysis
  Page views, viewing time, length of navigational path
Association Rules
  Apriori algorithm: finds pages often accessed together, revealing correlations between users
Clustering
  Usage clustering: inferring user demographics
  Page clustering: pages having related content
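The association-rule step can be sketched with a small level-wise search in the style of Apriori. This is a simplified illustration, not the implementation used by the authors: the sessions are hypothetical sets of pages, and `frequent_itemsets` is a made-up name. Support is the fraction of sessions that contain every page of a set.

```python
from itertools import combinations

def frequent_itemsets(sessions, min_support):
    """Apriori-style search for page sets that co-occur in enough sessions."""
    sessions = [frozenset(s) for s in sessions]
    n = len(sessions)

    def support(itemset):
        return sum(itemset <= s for s in sessions) / n

    # Level 1: frequent single pages.
    pages = {page for s in sessions for page in s}
    level = {frozenset([p]) for p in pages if support(frozenset([p])) >= min_support}
    result = {}
    k = 1
    while level:
        for itemset in level:
            result[itemset] = support(itemset)
        k += 1
        # Join step: merge frequent (k-1)-sets into candidate k-sets.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(sub) in level for sub in combinations(c, k - 1))}
        level = {c for c in candidates if support(c) >= min_support}
    return result

# Hypothetical server sessions (sets of visited pages).
sessions = [["A", "B", "F"], ["A", "B", "C"], ["A", "C"], ["B", "F"]]
itemsets = frequent_itemsets(sessions, min_support=0.5)

# Confidence of the rule A -> B: support(A and B) / support(A).
confidence = itemsets[frozenset("AB")] / itemsets[frozenset("A")]
print(sorted("".join(sorted(s)) for s in itemsets), round(confidence, 2))
```

With a minimum support of 0.5, the frequent sets are the four single pages plus {A, B}, {A, C} and {B, F}; no three-page set survives the prune step.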
Pattern Discovery (contd.)
Classification
  Ex) 30% of users who placed an online order in /Product/Music are in the 18-25 age group and live on the West Coast.
Sequential Patterns
  Time-ordered sets of sessions: predicting future visit patterns to decide where to place advertisements
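Full sequential pattern mining works over time-ordered sets of sessions. As a much simpler stand-in, the sketch below only counts which page most often directly follows another within a visit and uses that to guess the next page. The sessions and the names `transition_counts` and `predict_next` are hypothetical.

```python
from collections import Counter, defaultdict

def transition_counts(sessions):
    """Count how often page `b` directly follows page `a` within a session."""
    counts = defaultdict(Counter)
    for pages in sessions:
        for a, b in zip(pages, pages[1:]):
            counts[a][b] += 1
    return counts

def predict_next(counts, page):
    """Most frequent successor of `page`, or None if it was never followed."""
    followers = counts.get(page)
    if not followers:
        return None
    return followers.most_common(1)[0][0]

# Hypothetical time-ordered page sequences.
sessions = [
    ["A", "B", "F"],
    ["A", "B", "G"],
    ["A", "C", "J"],
    ["A", "B", "F"],
]
counts = transition_counts(sessions)
print(predict_next(counts, "A"), predict_next(counts, "B"), predict_next(counts, "J"))
```

A site could use such a prediction to place an advertisement on the page a visitor is most likely to reach next.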
Pattern Analysis
Motivation
  Filter out uninteresting rules and patterns from the set found in the pattern discovery phase.
Application Areas
Examples
Personalization
  http://aztec.cs.depaul.edu/scripts/ACR2/
Business
  http://www.accrue.com/