A Web Usage Mining Framework
for Mining Evolving User Profiles
in Dynamic Web Sites
Nasraoui, Soliman, Saka, Badia, Germain
IEEE Transactions on Knowledge and Data Engineering, 2008
Ozer Ozdikis
Huseyin Candan

Extraction of User Profiles using
◦ Web Usage Data
◦ Web site hierarchy
◦ External data etc…

Evolution of User Profiles in time
◦ Introducing new profiles, killing invalid ones…
◦ validation of the profile evolution
1. Introduction
   ◦ Web Usage Mining
   ◦ Handling Profile Evolution
   ◦ Integrating Semantics
2. Profile Discovery based on Usage
   ◦ Preprocessing
   ◦ Clustering sessions
   ◦ Similarity measure in clustering
   ◦ Post processing and Enrichment
3. Profile Evolution
   ◦ Tracking the Evolving User Profiles
   ◦ Validating the Profile Evolution
4. Experimental Results
Key features of the paper
• Dynamic content (a portal for companies)
• Clustering of user sessions extracted from web logs into homogeneous groups of similar activities
• Session similarity is calculated using the navigated URLs and the website hierarchy (from URL and site taxonomy)
• Generate mass user profiles
• Repeat this generation periodically
• Track the changes between the previous profiles and the new profiles, and evaluate their evolution
Web Usage Mining stages:
1. Collect usage data/clickstreams
2. Preprocess (reformat, filter irrelevant data)
3. Analyze and discover interesting patterns
4. Evaluate discovered profiles
5. Track the evolution of profiles
• Web Usage Mining has been used for personalization, predicting navigation patterns, building data cubes to apply OLAP, etc.

• Previous studies related to profile evolution
◦ Machine learning based (another dimension for learning evolving concepts)
◦ Time-based forgetting approaches
◦ Separate user profiles for short-term and long-term interests

Some concepts related to profile evolution
◦ Evolutionary / Revolutionary / Hybrid Learning
regarding the adaptation to change
◦ No-memory / Partial Memory / Full Memory
◦ Supervised / Unsupervised
◦ Single user / mass user

For user modeling, web usage data can be
supported with
◦ Keywords representing web page content
◦ Website’s hierarchical structure (different pages but
semantically relevant, e.g. under the same group)
◦ Semantic enrichment of navigated URLs
(semantically enhanced web logs -> C-Logs)
◦ Taxonomy can be “defined explicitly” or “inferred
implicitly” via URL tokenization (http://a/b/c.htm)




• Preprocess the web log to identify sessions and produce their vector representations
• Produce profiles using H-UNC (Hierarchical Unsupervised Niche Clustering), a genetic algorithm (GA) approach
• Enrich profiles with additional facets (external knowledge)
• Track profile evolution, and validate the discovered profiles
Session Identification
• Sessions are extracted using just the web log files (no login data, no cookies)
• Access time, IP address, URL viewed, and referrer are used for session identification
Session Representation
• Each valid URL in the web site is given a unique number j ∈ {1, 2, …, Nu}
• Each session is represented as a binary vector of size Nu. Navigation order is not considered.
• Example (number of valid URLs = 4): 1001 -> user accessed URLs 1 and 4
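
The session representation above can be sketched in a few lines of Python. The URL names and the `url_index` mapping are hypothetical and only serve the illustration.

```python
def session_to_vector(visited_urls, url_index):
    """Build a binary session vector of size Nu.

    url_index maps each valid URL to its number j in 1..Nu.
    Repeated visits and navigation order are ignored.
    """
    vector = [0] * len(url_index)
    for url in visited_urls:
        j = url_index.get(url)
        if j is not None:          # skip URLs that are not valid site URLs
            vector[j - 1] = 1
    return vector


# Hypothetical site with Nu = 4 valid URLs
url_index = {"/a": 1, "/b": 2, "/c": 3, "/d": 4}
session = session_to_vector(["/d", "/a", "/a"], url_index)
print("".join(str(bit) for bit in session))
```

Visiting URL 4, then URL 1 twice, gives the same vector as the slide example, since only the set of accessed URLs matters.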
Unsupervised Niche Clustering (UNC)
• The goal is to find
◦ profiles pi (sets of URLs representing session clusters) and
◦ scales σi (the variance/dispersion of the sessions in a cluster around the cluster's representative profile)
• wij: robust weight of a profile pi on a session sj. If this value is large (i.e. the profile is "close" to the session), pi is a strong representative of sj, which has a positive effect on the fitness value of pi.



• Randomly select Np sessions as the initial profiles pi
• Initialize the scale σi of every profile to some small value
• Repeat:
◦ Calculate the distance dij between every profile pi and every session sj
◦ Calculate the robust weight wij for every profile pi and every session sj
◦ Calculate the scale σi for every profile pi
◦ Calculate the fitness fi for every profile pi
◦ Repeat (GA loop):
   ▪ Randomly select parent profiles
   ▪ Generate child profiles (through crossover and mutation)
   ▪ Calculate the fitness values of the child profiles
   ▪ Apply deterministic crowding as the replacement policy
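
The slides do not state the formulas for the weight, scale and fitness steps, so the following is only a minimal sketch under assumed forms: a Gaussian-style robust weight, a weighted-variance scale update, and a density-style fitness. All three formulas and the function names are assumptions for illustration, not the paper's definitions.

```python
import math


def robust_weights(sq_dists, sigma2):
    """Assumed form: w_ij = exp(-d_ij^2 / (2 * sigma_i^2))."""
    return [math.exp(-d2 / (2.0 * sigma2)) for d2 in sq_dists]


def update_scale(sq_dists, weights):
    """Assumed form: sigma_i^2 = sum_j w_ij d_ij^2 / sum_j w_ij."""
    total = sum(weights)
    if total == 0.0:
        return 0.0
    return sum(w * d2 for w, d2 in zip(weights, sq_dists)) / total


def fitness(weights, sigma2):
    """Assumed form: f_i = sum_j w_ij / sigma_i^2 (a density-like score)."""
    return sum(weights) / sigma2


# Squared distances from one profile to three sessions (made-up values)
sq_dists = [0.0, 0.04, 0.81]
sigma2 = 0.1
w = robust_weights(sq_dists, sigma2)
print(w, update_scale(sq_dists, w), fitness(w, sigma2))
```

Under these assumptions, sessions close to the profile get weights near 1 and far sessions get weights near 0, so a profile sitting in a dense, tight group of sessions scores a high fitness.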
Hierarchical Unsupervised Niche Clustering
• A divisive hierarchical version of UNC
• Repeatedly divide clusters into smaller clusters hierarchically, considering
◦ the required hierarchy level (Lmax)
◦ the maximum allowed cluster cardinality (Nsplit)
◦ the maximum allowed scale (σsplit)
• As a result, we have profile vectors and their scales. Sessions are assigned to the closest profiles.

Cosine similarity:

Web Session Similarity (web site structure):


• Su(l,k): URL-to-URL similarity*
• The distance used in UNC is then d = (1 − Sweb)²


• URL structure (tokenized URL paths P), e.g. http://a/b/c.html
• For dynamic content, relations with an externally defined taxonomy ("is-a" relation), e.g. http://products.php?id=1&category=x
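
A minimal sketch of how the two session similarities and the resulting distance could be computed. The cosine similarity is the standard one for binary vectors. The URL-to-URL similarity (shared leading path tokens, normalized by the longer path) and the choice to combine both measures with `max` are assumptions made for illustration; the example URLs are hypothetical.

```python
import math


def cosine_similarity(s_k, s_l):
    """Cosine similarity of two binary session vectors."""
    shared = sum(a * b for a, b in zip(s_k, s_l))
    norm = math.sqrt(sum(s_k) * sum(s_l))
    return shared / norm if norm else 0.0


def url_path(url):
    """Tokenize 'http://a/b/c.html' into ['a', 'b', 'c.html']."""
    return [t for t in url.split("://", 1)[-1].split("/") if t]


def url_similarity(url_i, url_j):
    """Assumed Su: shared leading tokens over (longest path length - 1)."""
    p_i, p_j = url_path(url_i), url_path(url_j)
    common = 0
    for a, b in zip(p_i, p_j):
        if a != b:
            break
        common += 1
    return min(1.0, common / max(1, max(len(p_i), len(p_j)) - 1))


def structure_similarity(s_k, s_l, urls):
    """Average URL-to-URL similarity over all pairs of accessed URLs."""
    n_k, n_l = sum(s_k), sum(s_l)
    if n_k == 0 or n_l == 0:
        return 0.0
    total = sum(url_similarity(urls[i], urls[j])
                for i, a in enumerate(s_k) if a
                for j, b in enumerate(s_l) if b)
    return total / (n_k * n_l)


def unc_distance(s_k, s_l, urls):
    """d = (1 - Sweb)^2, with Sweb assumed to be the max of both measures."""
    s_web = max(cosine_similarity(s_k, s_l),
                structure_similarity(s_k, s_l, urls))
    return (1.0 - s_web) ** 2


urls = ["http://a/b/c.html", "http://a/b/d.html",
        "http://a/e.html", "http://x/y.html"]
print(unc_distance([1, 0, 0, 0], [0, 1, 0, 0], urls))
```

The point of the structure-aware term is that two sessions with no URL in common (cosine similarity 0) can still be judged close when their URLs sit under the same branch of the site hierarchy.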


• For dynamic content (dynamic URLs), preprocess the data and map the dynamic URLs to strings separated by "/" using an ontology.
• If we have such a table (taxonomy data), we can define a hierarchical structure even for the dynamic URLs.
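
As a rough illustration of this mapping, the sketch below turns a dynamic URL into a "/"-separated path using a small lookup table. The table contents, the host `example.com`, and the parameter names `category` and `id` are hypothetical; a real site would supply its own taxonomy.

```python
from urllib.parse import urlparse, parse_qs


def dynamic_url_to_path(url, category_ancestors):
    """Map a dynamic URL to a hierarchical path string.

    category_ancestors maps a category value to its chain of "is-a"
    ancestors, ending with the category itself (hypothetical table).
    URLs whose category is unknown fall back to their static path.
    """
    parsed = urlparse(url)
    query = parse_qs(parsed.query)
    category = query.get("category", [None])[0]
    item_id = query.get("id", [None])[0]
    if category not in category_ancestors:
        return parsed.path.strip("/")
    tokens = list(category_ancestors[category])
    if item_id is not None:
        tokens.append(item_id)
    return "/".join(tokens)


# Hypothetical taxonomy table: category x is-a group-a is-a products
taxonomy = {"x": ["products", "group-a", "x"]}
print(dynamic_url_to_path(
    "http://example.com/products.php?id=1&category=x", taxonomy))
```

Once every dynamic URL has such a path, it can be tokenized and compared in the same way as a static URL.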


• After H-UNC, we have clusters of sessions. Summarize the sessions in each cluster as a profile vector pi, where pik is the frequency with which URL k was accessed in the sessions belonging to cluster i. Example:
◦ For cluster 1, let s11 = 1001, s12 = 1100, s13 = 1001
◦ Then p1 = (1, 0.33, 0, 0.67)
• Convert the pi's to binary vectors so that only URLs with some minimum weight remain. Example:
◦ let the minimum URL weight be 0.5,
◦ then p1 = 1001
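
The summarization step can be reproduced with a short sketch. The function names are illustrative, and treating the minimum weight as an inclusive threshold is an assumption.

```python
def profile_frequencies(sessions):
    """p_ik = fraction of the cluster's sessions that accessed URL k."""
    n = len(sessions)
    return [sum(column) / n for column in zip(*sessions)]


def binarize(profile, min_weight):
    """Keep only URLs whose weight reaches the minimum (assumed inclusive)."""
    return [1 if weight >= min_weight else 0 for weight in profile]


# Cluster 1 from the example: s11 = 1001, s12 = 1100, s13 = 1001
cluster_1 = [[1, 0, 0, 1], [1, 1, 0, 0], [1, 0, 0, 1]]
p1 = profile_frequencies(cluster_1)
print([round(weight, 2) for weight in p1])
print(binarize(p1, 0.5))
```

URL 2 appears in only one of the three sessions, so it falls below the 0.5 threshold and drops out of the binary profile.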
Extend for Robust Profiles
• Calculate the weights wij for all sessions in a cluster (between profile i and session j) as in UNC
• Assign sessions with high weights (robust weights) to the cluster's "core".
• So, a cluster's "core" is the group of sessions that are very similar to the representative profile.
• Thus, noisy sessions are eliminated.

• Enrich the profiles with facets (additional profile descriptors) such as:
◦ Search queries
◦ Inquiring companies
◦ Inquired companies
using IP addresses, whois.com, the registration database, etc. for the sessions in the cluster



• Profile boundaries: the pi vectors and the σi (scale/variance/dispersion) values are used to determine the boundaries
• Profile compatibility: how much the boundaries of two profiles overlap
• Algorithm to "TrackProfiles". The idea is:
◦ Divide time into periods, and generate profiles for each time period.
◦ Compare the profile vectors of consecutive time periods Ti and Ti+1 using Sweb
◦ If the distance (i.e. 1 − similarity) is < σprofile1, then the two profiles found in Ti and Ti+1 are related.


• Birth: new profile incompatible with all old profiles
• Persistence: new profile compatible with an old profile
◦ One-to-one
◦ Bifurcation (splitting)
◦ Merging
• Death: no new profile found for an old profile
• Atavism (reappearance): old profile disappears, then reappears
• Volatile: dead profiles that have never been persistent
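
A minimal sketch of how births, deaths and persistent pairs could be derived between two consecutive periods. It assumes a precomputed distance matrix between old and new profiles and tests compatibility against the old profile's scale only; both the function name and that one-sided test are simplifications for illustration. Atavism and volatility need the history of more than two periods and are left out.

```python
def classify_events(dist, old_scales):
    """Classify profile events between periods Ti and Ti+1.

    dist[i][j] is the distance between old profile i and new profile j;
    the pair is treated as compatible when dist[i][j] < old_scales[i].
    """
    n_old = len(dist)
    n_new = len(dist[0]) if dist else 0
    compatible = [[dist[i][j] < old_scales[i] for j in range(n_new)]
                  for i in range(n_old)]
    # Birth: a new profile that matches no old profile
    births = [j for j in range(n_new)
              if not any(compatible[i][j] for i in range(n_old))]
    # Death: an old profile that matches no new profile
    deaths = [i for i in range(n_old) if not any(compatible[i])]
    # Persistence: every compatible (old, new) pair; one old index paired
    # with several new ones is a split, several old with one new is a merge
    persistent = [(i, j) for i in range(n_old) for j in range(n_new)
                  if compatible[i][j]]
    return births, deaths, persistent


dist = [[0.1, 0.9],
        [0.8, 0.95]]
print(classify_events(dist, old_scales=[0.3, 0.3]))
```

In this made-up example, old profile 0 persists as new profile 0, old profile 1 dies, and new profile 1 is a birth.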

Example for profile merge

• How close are the profiles to the original input data?
◦ Precision: a profile with high precision should include "only" the true items
◦ Coverage (Recall): a profile with high coverage should include all data items
• Example: let session = 1001, then the profiles
◦ 1000, 0001: high precision, low coverage
◦ 1101, 1011: low precision, high coverage
◦ 1001: high precision, high coverage (ideal but unrealistic case -> every session must be a profile)


• So we need to balance precision and coverage with a small number of profiles to get a high quality Qij for session j and profile i.
• Define
◦ Precision: Precij = |sj ∩ pi| / |pi|
◦ Coverage: Covij = |sj ∩ pi| / |sj|
• A combined quality measure is defined as
Qij = F1,ij = 2 · Precij · Covij / (Precij + Covij)
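
These three measures can be checked on the earlier example (session 1001) with a short sketch. The function names are illustrative, and returning 0 when a denominator is empty is an assumption.

```python
def precision(session, profile):
    """|s ∩ p| / |p| for binary vectors (0.0 if the profile is empty)."""
    shared = sum(s & p for s, p in zip(session, profile))
    size = sum(profile)
    return shared / size if size else 0.0


def coverage(session, profile):
    """|s ∩ p| / |s| for binary vectors (0.0 if the session is empty)."""
    shared = sum(s & p for s, p in zip(session, profile))
    size = sum(session)
    return shared / size if size else 0.0


def quality(session, profile):
    """F1 combination of precision and coverage."""
    prec, cov = precision(session, profile), coverage(session, profile)
    return 2 * prec * cov / (prec + cov) if prec + cov else 0.0


session = [1, 0, 0, 1]
for profile in ([1, 0, 0, 0], [1, 1, 0, 1], [1, 0, 0, 1]):
    print(profile, precision(session, profile),
          coverage(session, profile), quality(session, profile))
```

Profile 1000 has full precision but covers only half of the session, profile 1101 covers the whole session but contains an extra URL, and only profile 1001 scores 1 on both.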

• So, we defined the quality measure between a profile and a session.
• Now, how do we capture concept drift?
• The idea is:
◦ Decide on a minimum quality threshold Qmin to be satisfied
◦ Discover the profiles at time period T2
◦ Take the sessions from the following time period T1, and for each session sj find the maximum quality Qij over the profiles pi from the previous time frame T2
◦ If the quality is higher than Qmin, add this session sj to the set of quality sessions, denoted by s*(T1, T2)



• As a result, we can measure the quality of the profiles as the fraction of the sessions at T1 that belong to s*(T1, T2).
• As long as most of the sessions at T1 are successfully represented by profiles found at T2, this rate will be high.
• If the minimum quality threshold Qmin is set low, the rate will be high. The best case is 1. As Qmin is increased, the number of sessions satisfying this quality decreases.
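
A minimal sketch of this validation rate, assuming binary session and profile vectors and the F1 quality defined earlier. The function names are illustrative, and the threshold is treated as inclusive (Qij ≥ Qmin), which is an assumption.

```python
def f1_quality(session, profile):
    """F1 of precision |s ∩ p| / |p| and coverage |s ∩ p| / |s|."""
    shared = sum(s & p for s, p in zip(session, profile))
    if shared == 0:
        return 0.0
    prec, cov = shared / sum(profile), shared / sum(session)
    return 2 * prec * cov / (prec + cov)


def validation_rate(sessions_t1, profiles_t2, q_min):
    """Fraction of T1 sessions whose best-matching T2 profile reaches q_min."""
    if not sessions_t1:
        return 0.0
    good = sum(
        1 for s in sessions_t1
        if max((f1_quality(s, p) for p in profiles_t2), default=0.0) >= q_min
    )
    return good / len(sessions_t1)


sessions_t1 = [[1, 0, 0, 1], [0, 1, 1, 0]]   # sessions from period T1
profiles_t2 = [[1, 0, 0, 1]]                 # profiles mined at period T2
for q_min in (0.0, 0.5, 1.0):
    print(q_min, validation_rate(sessions_t1, profiles_t2, q_min))
```

Sweeping `q_min` from 0 to 1 in this way yields one point per threshold, which is how a curve of satisfied sessions against minimal F1 quality can be traced.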

• Example graph showing
◦ the sessions satisfying the minimum quality (y-axis)
◦ the minimal F1 quality (x-axis)
◦ Dark line (lower) is the cross-period validation.
◦ Light line (upper) is the validation of profiles with the sessions in the same time frame.

They generate profiles with facets (like search
queries, inquired companies, inquiring
companies etc…)



• Profiles are generated from the first half of September
• Light lines compare the profiles with the sessions in the same time frame, i.e. the first half of September (they are identical in all graphs)
• Dark lines compare the same profiles with the sessions in the following time frames (cross-validation).


• In this experiment, profiles from T are validated against the sessions in the immediately following time period T+1
• For Figure 1:
◦ Profiles are generated using the sessions in the first half of September
◦ Light line shows the validation using the sessions in the same time period
◦ Dark line shows the validation using the sessions in the following time period, i.e. the second half of September

• There is one more experiment, but the only difference from the previous one is that a shorter time period (1 week) is used in the observations. The idea is the same.



• The work presented in this paper is an unsupervised learning approach that tries to learn mass anonymous user profiles
• The profiles are mined in a no-memory, revolutionary scheme.
• The evolving profiles are validated in a full-memory mode.

• In the paper, facets are used to support profiles with additional information, but it is not mentioned how they are used (e.g. the most searched companies, etc.).