FALL 2012
DSCI5240 Graduate
Presentation
By Xxxxxxx
Web Usage Mining
Outline
• Definition and Goal
• Source and Type of Data
• Data Collection and Pre-processing
• Data Modeling
• Discovery and Analysis of Web Usage Patterns
Web Usage Mining
• Definition and Goal
• Definition: automatic discovery and analysis of patterns
• Goal: capture, model, and analyze the behavioral patterns and profiles
of users interacting with web sites.
• Source and Type of Data:
• Server log files: Web server and application access
• Site files and metadata
• Operational databases
• Application templates
• Domain knowledge
• Internet Service Provider data collection
Data Collection
• Web sites and Applications data
• Primary source of data in Web Usage Mining
• Each HTTP request generates a single entry in the server access
logs
• Log entry: time and date of request; IP address; resource
requested; HTTP method; user agent (browser and operating
system); referring web resource; client-side cookies.
• 2006-02-01 00:08:43 1.2.3.4 - GET /classes/cs589/papers.html 200 9221 HTTP/1.1 maya.cs.depaul.edu Mozilla/4.0+(compatible;+MSIE+6.0;+Windows+NT+5.1;+SV1;+.NET+CLR+2.0.50727) http://dataminingresources.blogspot.com/
• 2006-02-01 00:08:46 1.2.3.4 - GET /classes/cs589/papers/cms-tai.pdf
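As a rough sketch, the first sample entry can be split into named fields as follows. The field order in FIELDS is an assumption read off that one sample (real servers declare their own field list), and parse_entry is a hypothetical helper name.

```python
from datetime import datetime

# Assumed field order, inferred from the sample entry only.
FIELDS = ["date", "time", "client_ip", "user", "method", "resource",
          "status", "bytes", "protocol", "host", "user_agent", "referrer"]


def parse_entry(line):
    """Split one space-delimited log entry into a dict of named fields."""
    parts = line.split()
    if len(parts) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(parts)}")
    entry = dict(zip(FIELDS, parts))
    # Combine date and time into a single timestamp.
    entry["timestamp"] = datetime.strptime(
        entry["date"] + " " + entry["time"], "%Y-%m-%d %H:%M:%S")
    entry["status"] = int(entry["status"])
    entry["bytes"] = int(entry["bytes"])
    # In this log style, spaces inside the user agent are written as '+'.
    entry["user_agent"] = entry["user_agent"].replace("+", " ")
    return entry


sample = (
    "2006-02-01 00:08:43 1.2.3.4 - GET /classes/cs589/papers.html "
    "200 9221 HTTP/1.1 maya.cs.depaul.edu "
    "Mozilla/4.0+(compatible;+MSIE+6.0;+Windows+NT+5.1;+SV1;"
    "+.NET+CLR+2.0.50727) "
    "http://dataminingresources.blogspot.com/"
)
entry = parse_entry(sample)
print(entry["client_ip"], entry["method"], entry["resource"], entry["status"])
```

Entries with a different number of fields, such as a truncated line, raise ValueError instead of being silently misread.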
Data Abstraction
• Pageview: collection of web objects or resources corresponding to a
single "user event". Examples: reading an article; viewing a product
page; adding a product to a shopping cart.
• Session: sequence of pageviews by a single user during a single
visit.
• Content data: objects and relationships presented to the user (text
and images).
• User data: operational database (e.g., user profile information, visit
histories…)
Web Usage Data Pre-Processing
• Data fusion: merging of log files from several web and
application servers, using:
• shared embedded session IDs
• heuristic methods based on the "referrer" field in server logs
• Data cleaning: removing irrelevant entries, such as references
to style files, graphics, or sound files
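The data cleaning step might look like the following minimal sketch. The extension list NOISE_EXTENSIONS and the helper names is_noise and clean are assumptions for illustration; which file types count as noise depends on the site.

```python
from urllib.parse import urlparse

# Assumed extensions for style, graphics, and sound files.
NOISE_EXTENSIONS = (".css", ".gif", ".jpg", ".jpeg", ".png", ".ico",
                    ".wav", ".mp3")


def is_noise(resource):
    """True if the requested resource is a style, graphics, or sound file."""
    # Drop any query string before checking the extension.
    path = urlparse(resource).path.lower()
    return path.endswith(NOISE_EXTENSIONS)


def clean(resources):
    """Keep only the requests that are not embedded-object noise."""
    return [r for r in resources if not is_noise(r)]


requests = [
    "/classes/cs589/papers.html",
    "/style/main.css",
    "/images/logo.gif?v=2",
    "/classes/cs589/papers/cms-tai.pdf",
]
kept = clean(requests)
print(kept)
```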
Web Usage Data Pre-Processing (Continued)
• Pageview identification attributes:
• Pageview ID (URL uniquely representing the page viewed); static
pageview type (e.g., information page, product page);
metadata (keywords)
• User Identification:
• User authentication mechanism (user activity record)
• Use of client-side cookies
• Sessionization
• Each user's activity record is segmented into sessions, each
representing a single visit to the site.
• An episode is a subset or subsequence of a session comprising
semantically or functionally related pageviews.
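A minimal sketch of one common sessionization heuristic follows. It assumes (this is not stated on the slide) that a gap of more than 30 minutes between consecutive requests from the same user starts a new session. The function name sessionize, the timeout value, and the sample requests are all hypothetical.

```python
from datetime import datetime, timedelta


def sessionize(requests, timeout=timedelta(minutes=30)):
    """Group (user_id, timestamp, page) tuples into per-user sessions.

    Returns a dict mapping user_id to a list of sessions, where each
    session is the list of pages requested during one visit.
    """
    sessions = {}
    last_seen = {}
    # Process each user's requests in chronological order.
    for user, ts, page in sorted(requests, key=lambda r: (r[0], r[1])):
        # Assumed rule: a long idle gap (or a new user) opens a session.
        if user not in sessions or ts - last_seen[user] > timeout:
            sessions.setdefault(user, []).append([])
        sessions[user][-1].append(page)
        last_seen[user] = ts
    return sessions


t0 = datetime(2006, 2, 1, 0, 8, 43)
log = [
    ("1.2.3.4", t0, "/a"),
    ("1.2.3.4", t0 + timedelta(seconds=3), "/b"),
    ("1.2.3.4", t0 + timedelta(hours=2), "/c"),
    ("5.6.7.8", t0, "/a"),
]
result = sessionize(log)
print(result)
```

Here the IP address stands in for the user ID; with authentication or cookies available, those identifiers would be the better key.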
Web Usage Data Pre-Processing (Continued)
• Path Completion
• Infers references that are missing from the log because of client- or
proxy-side caching. For example, when a user returns to a previously
viewed page, the cached copy is shown and no new request reaches the server.
• Data Integration
• Combines user data (e.g., demographics, ratings, and purchase histories)
with product attributes and categories from operational databases.
• Building content-enhanced transaction data
• Multiply the user-pageview matrix by the transpose of the term-pageview
matrix (see Bamshad Mobasher, ch. 12: Web Usage Mining, pp. 14-18).
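The matrix product for content-enhanced transactions can be illustrated with a tiny sketch. The two matrices, the term labels, and the helper names transpose and matmul are made-up examples, not data from the referenced chapter.

```python
def transpose(m):
    """Swap rows and columns of a list-of-lists matrix."""
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    """Multiply matrix a (m x n) by matrix b (n x r)."""
    b_columns = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in b_columns]
            for row in a]


# Rows: user transactions; columns: pageviews p1, p2, p3.
user_pageview = [[1, 0, 1],
                 [0, 1, 1]]

# Rows: terms; columns: the same pageviews p1, p2, p3.
term_pageview = [[1, 1, 0],   # hypothetical term "mining"
                 [0, 1, 1]]   # hypothetical term "web"

# Result rows: user transactions; columns: terms.
content_enhanced = matmul(user_pageview, transpose(term_pageview))
print(content_enhanced)
```

Each row of the result describes a transaction in terms of content features rather than individual pageviews.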
Discovery and Analysis of Web Usage Patterns
• Session and Visitor Analysis
• Data is aggregated by predetermined units such as days, sessions,
visitors, or domains.
• Reports on most frequently accessed pages, average view time of a
page, average length of a path through a site, common entry and exit
points.
• Useful for improving system performance and providing support for
marketing decisions.
• Online Analytical Processing (OLAP) provides a more integrated
framework for analysis with a higher degree of flexibility.
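Some of these reports can be computed from sessionized data as in the sketch below. The function name summarize and the sample sessions are hypothetical, and view-time statistics are left out because they need timestamps.

```python
from collections import Counter


def summarize(sessions):
    """Compute simple usage reports from a list of sessions.

    Each session is a non-empty list of pages in visit order.
    """
    return {
        "page_hits": Counter(p for s in sessions for p in s),
        "entry_points": Counter(s[0] for s in sessions),
        "exit_points": Counter(s[-1] for s in sessions),
        "avg_path_length": sum(len(s) for s in sessions) / len(sessions),
    }


sessions = [
    ["/home", "/products", "/cart"],
    ["/home", "/about"],
    ["/products", "/cart"],
]
report = summarize(sessions)
print(report["entry_points"].most_common(1))
print(report["exit_points"].most_common(1))
print(report["avg_path_length"])
```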
Discovery and Analysis of Web Usage Patterns
Cluster Analysis and Visitor Segmentation
• Recall that clustering is a data mining technique that groups
together a set of items having similar characteristics.
• User clusters (most commonly used)
• Clustering of user records (sessions or transactions)
• Establishes groups of users exhibiting similar browsing patterns.
• Useful for providing personalized Web content to similar users
• Page clusters (or item clusters)
• Based on usage data (i.e., starting from the user sessions or
transaction data): items commonly accessed or purchased together are
automatically organized into groups.
• Based on the content features associated with pages or items
(keywords or product attributes): collections of pages or products
related to the same topic or category.
• Can also be used to provide permanent or dynamic HTML pages that
suggest related hyperlinks to users according to their past history of
navigational or purchase activities
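To make user clustering concrete, here is a small sketch that groups sessions by the pages they share. It uses Jaccard similarity with a simple leader-style grouping, a technique swapped in purely for illustration and not one named on the slides. The threshold and the sample sessions are assumptions.

```python
def jaccard(a, b):
    """Share of pages common to two sessions (0.0 to 1.0)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def leader_clusters(sessions, threshold=0.5):
    """Assign each session to the first cluster whose leader is similar.

    Each cluster is a list of session indices; the first index is the
    leader that later sessions are compared against.
    """
    clusters = []
    for i, session in enumerate(sessions):
        for cluster in clusters:
            if jaccard(session, sessions[cluster[0]]) >= threshold:
                cluster.append(i)
                break
        else:
            # No similar leader found, so this session starts a cluster.
            clusters.append([i])
    return clusters


sessions = [
    ["/home", "/products", "/cart"],
    ["/home", "/products"],
    ["/papers", "/classes"],
    ["/papers", "/classes", "/syllabus"],
]
clusters = leader_clusters(sessions)
print(clusters)
```

The result depends on the order of the sessions and on the threshold, so a production system would more likely use an established clustering algorithm.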
Discovery and Analysis of Web Usage Patterns
Association and Correlation Analysis
• Recall that an association rule is an expression of the form
X→Y [sup, conf], where X and Y are itemsets, sup is the support of the
itemset X ∪ Y, representing the probability that X and Y occur together
in a transaction, and conf is the confidence of the rule, defined by
sup(X∪Y) / sup(X), representing the conditional probability that Y
occurs in a transaction given that X has occurred in that transaction.
• Can find groups of items or pages that are commonly accessed or
purchased together.
• Enables Web sites to provide effective cross-sale product
recommendations.
• One problem for association rule recommendation systems is that a
system cannot give any recommendations when the dataset is sparse.
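The support and confidence definitions can be checked on a toy example. The transactions below are invented for illustration, and support and confidence are hypothetical helper names.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)


def confidence(x, y, transactions):
    """conf(X -> Y) = sup(X u Y) / sup(X)."""
    sup_x = support(x, transactions)
    if sup_x == 0:
        return 0.0
    return support(set(x) | set(y), transactions) / sup_x


transactions = [
    {"/home", "/products", "/cart"},
    {"/home", "/products"},
    {"/home", "/about"},
    {"/products", "/cart"},
]

# Rule: {/products} -> {/cart}
sup = support({"/products", "/cart"}, transactions)
conf = confidence({"/products"}, {"/cart"}, transactions)
print(sup, conf)
```

In this toy data, /products and /cart occur together in 2 of 4 transactions (sup = 0.5), and /products occurs in 3 of 4, so conf = 0.5 / 0.75, or about 0.67.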
Resource: Web Usage Mining By Bamshad Mobasher ;
http://maya.cs.depaul.edu/~mobasher/papers/12-webusage-mining.pdf